Daily Issue
Vol. I — No. 5
15 · 05
Friday, 15 May 2026
Generated 2026-05-15 11:37
google/gemini-2.5-flash-lite
It is not the beauty of a building you should look at; it's the construction of the foundation that will stand the test of time.
David Allan Coe
38 items · 3 sections
§ 0

The Morning

Local weather 1
This morning in
London
Overcast
Today's range
13.9° / 6.7°
currently 11.7°
Feels
8.1°
Rain
51%
Wind
12 km/h
Humid
47%
Rise
05:08
Set
20:45
§ I

From the arXiv

arXiv preprints 10 of 20
cs.AI · arxiv:2605.07137 · Lead article

Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

Yash Ingle, Jaival Chauhan, Ankit Yadav, Sudhakar Mishra

This paper introduces Adaptive Negative Sample Reinforcement (A-NSR) to improve LLM reasoning. A-NSR dynamically adjusts the penalty for incorrect reasoning steps during training: it prioritizes error correction early on, then shifts toward more nuanced updates that balance correction and diversity. The adaptive approach aims to improve LLM reasoning performance beyond what fixed-penalty methods achieve.
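To make the idea of a time-varying penalty concrete, here is a minimal sketch. The summary above does not say how A-NSR actually schedules its penalty, so the linear decay, the function names `negative_weight` and `weighted_reward`, and the default values 1.0 and 0.2 are all illustrative assumptions and not the authors' method.

```python
def negative_weight(step: int, total_steps: int,
                    w_start: float = 1.0, w_end: float = 0.2) -> float:
    """Hypothetical schedule: linearly anneal the penalty weight for
    incorrect samples from w_start (strong correction early in training)
    to w_end (gentler updates later). Not taken from the paper."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w_start + frac * (w_end - w_start)


def weighted_reward(is_correct: bool, step: int, total_steps: int) -> float:
    """Verifiable reward with an adaptive negative side: +1 for a correct
    response, minus the current penalty weight for an incorrect one."""
    if is_correct:
        return 1.0
    return -negative_weight(step, total_steps)


if __name__ == "__main__":
    for s in (0, 50, 100):
        print(s, weighted_reward(False, s, 100))
```

A fixed-penalty baseline corresponds to setting `w_start == w_end`; the adaptive variant only changes how hard incorrect samples are pushed down as training progresses.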

An example on a three-response policy simplex: entropy increases along the training direction when \( D_{\mathrm{RL}}(a;s) > 0 \), i.e., when \( \theta_{\langle \operatorname{grad}^{F} \ell_{a},\, \operatorname{grad} \mathcal{H}_{\mathrm{resp}} \rangle} < 90^{\circ} \), and decreases otherwise.
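The angle condition in the caption can be illustrated numerically. The sketch below assumes a softmax policy over three responses and uses ordinary Euclidean gradients in logit space, which may differ from the paper's \( \operatorname{grad}^{F} \) operator; the function names and the example policy are hypothetical. A positive inner product between the entropy gradient and the gradient of \( \log \pi(a) \) means the angle is below 90° and entropy rises when response `a` is reinforced.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def entropy_direction(p, a):
    """Inner product of grad_z H(p) and grad_z log p[a] for p = softmax(z).
    Assumes every entry of p is strictly positive.
    Positive: entropy increases when a is reinforced; negative: decreases."""
    h = entropy(p)
    grad_h = [-q * (math.log(q) + h) for q in p]          # dH/dz_j
    grad_logp = [(1.0 if j == a else 0.0) - q             # d log p_a / dz_j
                 for j, q in enumerate(p)]
    return sum(g * d for g, d in zip(grad_h, grad_logp))


if __name__ == "__main__":
    p = [0.7, 0.2, 0.1]  # illustrative policy over three responses
    for a in range(3):
        print(a, entropy_direction(p, a))
```

For this example policy, reinforcing the already dominant response gives a negative value (entropy falls), while reinforcing either of the two less likely responses gives a positive value (entropy rises).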
cs.AI · arxiv:2605.00425

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Haotian Zhao, Songlin Zhou et al.

AEM addresses the challenge of credit assignment in multi-turn agentic reinforcement learning by adaptively modulating entropy dynamics during training. Unlike methods requiring dense intermediate supervision, AEM is supervision-free and improves the explorati…

cs.AI · arxiv:2605.03327

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

Hongbo Jin, Rongpeng Zhu et al.

DGPO addresses credit assignment challenges in reinforcement learning for large language models by reinterpreting distribution deviation as a guiding signal instead of a penalty. It uses the bounded Hellinger distance to enable safe, token-level exploration, o…

Conceptual comparison between standard GRPO and our proposed DGPO. While GRPO uniformly broadcasts a coarse-grained sequence-level advantage and imposes an unbounded Reverse KL penalty that stifles exploration, DGPO dynamically reallocates advantages to individual tokens.
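The contrast between a bounded and an unbounded divergence can be seen on toy next-token distributions. This sketch uses the standard textbook definitions of the Hellinger distance and the KL divergence; the function names and the example distributions are illustrative, and it does not reproduce how DGPO turns the distance into a token-level signal.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.
    Always lies in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)


def kl(p, q):
    """KL(p || q) in nats; grows without bound as q puts vanishing mass
    where p has mass, and is infinite if q is zero there."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total


if __name__ == "__main__":
    policy = [0.5, 0.5]           # illustrative current policy
    reference = [0.999999, 1e-6]  # reference nearly excludes token 1
    print("Hellinger:", hellinger(policy, reference))
    print("KL:", kl(policy, reference))
```

Even when the reference model assigns almost no mass to a token the policy explores, the Hellinger distance stays below 1, whereas the KL term keeps growing as that mass shrinks.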
An illustration of the ESSAM parameter update.
cs.AI · arxiv:2602.01003

ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

Zhishen Sun, Sizhe Dang et al.

This paper introduces ESSAM, a novel approach for fine-tuning LLMs using competitive Evolution Strategies combined with Sharpness-Aware Maximization. ESSAM addresses the high memory demands of traditional RL methods by leveraging zero-order parameter search an…

cs.AI · arxiv:2510.16079

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

Rong Wu, Xiaoman Wang et al.

EvolveR enables LLM agents to self-improve by creating a closed-loop experience lifecycle. It first distills interaction trajectories into reusable strategic principles (Offline Self-Distillation) and then uses these principles to guide online task interaction…

An illustration of four major paradigms for LLM agent learning. (1) Stateless Execution: Standard agents discard experiences after each task; (2) Learning by Raw Trajectories: Agents retrieve raw, un-distilled past trajectories; (3) Learning via External Scribing: Agents rely on an external teacher model to distill insights; (4) EvolveR (Ours): A complete, self-contained lifecycle where the agent autonomously distills its own experiences into principles and evolves its policy.
№06
cs.AI
9

FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards

Zhixin Han, Yanzhi Zhang et al.

FutureWorld introduces a novel reinforcement learning environment for training agents to make live future predictions. Its core method involves a delayed reward mechanism where age…

№07
cs.AI
9

InvThink: Premortem Reasoning for Safer Language Models

Yubin Kim, Taehan Kim et al.

InvThink is a novel framework that enhances language model safety by requiring a three-step process: enumerating potential harms, analyzing their consequences, and then generating …

№08
cs.AI
9

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Qianhao Yuan, Jie Lou et al.

MemSearcher trains LLM agents using end-to-end reinforcement learning to manage a compact, question-relevant memory, avoiding the costly full history concatenation of traditional m…

№09
cs.AI
9

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu, Yi Xin et al.

This paper addresses the "Modality Gap," where visual and linguistic embeddings for the same meaning are systematically offset. The authors propose the "Fixed-frame Modality Gap Th…

№10
cs.AI
9

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

Erhan Zhang, Yiqun Chen et al.

OASES trains agentic search models by generating intermediate rewards that are aligned with the final task outcome. It achieves this by evaluating how well each search step contrib…

§ II

The Town Square

Hacker News 9
compiled overnight by google/gemini-2.5-flash-lite · end of issue no. 5 · thank you for reading