2026-05-15 — Linnet Daily

It is not the beauty of a building you should look at; it's the construction of the foundation that will stand the test of time. — David Allan Coe

38 items · 3 sections

§0 Weather · §I arXiv Papers (20) · §II Hacker News (9) · §III GitHub Trending (9)

§ 0

The Morning

Local weather 1

This morning in London: Overcast

Today's range: 13.9° ↓ 6.7°, currently 11.7°
Feels: 8.1°
Rain: 51%
Wind: 12 km/h
Humid: 47%
Rise: 05:08
Set: 20:45

§ I

From the arXiv

arXiv preprints 10 of 20

cs.AI · arxiv:2605.07137 · Lead article

Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

Yash Ingle, Jaival Chauhan, Ankit Yadav, Sudhakar Mishra

This paper introduces Adaptive Negative Sample Reinforcement (A-NSR) to improve LLM reasoning. A-NSR dynamically adjusts the penalty for incorrect reasoning steps during training: early on it prioritizes error correction, and later it shifts towards more nuanced updates that balance correction and diversity. The aim is to improve reasoning performance beyond what fixed-penalty methods achieve.

Read abstract → Full PDF

Figure: An example on a three-response policy simplex: entropy increases along the training direction when \( D_{\mathrm{RL}}(a;s) > 0 \), i.e., \( \theta_{\langle \operatorname{grad}^{F} \ell_{a},\, \operatorname{grad} \mathcal{H}_{\mathrm{resp}} \rangle} < 90^{\circ} \), and decreases otherwise.

cs.AI · arxiv:2605.00425

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Haotian Zhao, Songlin Zhou et al.

AEM addresses the challenge of credit assignment in multi-turn agentic reinforcement learning by adaptively modulating entropy dynamics during training. Unlike methods requiring dense intermediate supervision, AEM is supervision-free and improves the explorati…

abstract pdf

cs.AI · arxiv:2605.03327

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

Hongbo Jin, Rongpeng Zhu et al.

DGPO addresses credit assignment challenges in reinforcement learning for large language models by reinterpreting distribution deviation as a guiding signal instead of a penalty. It uses the bounded Hellinger distance to enable safe, token-level exploration, o…

abstract pdf

Figure: Conceptual comparison between standard GRPO and our proposed DGPO. While GRPO uniformly broadcasts a coarse-grained sequence-level advantage and imposes an unbounded Reverse KL penalty that stifles exploration, DGPO dynamically reallocates advantages to individual tokens.
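To make the "bounded" versus "unbounded" contrast in the DGPO summary concrete, here is a minimal sketch of the standard textbook definitions of Hellinger distance and KL divergence for discrete distributions. It is not DGPO's implementation: the function names and the toy `policy` and `reference` distributions are assumptions made up for illustration, and how the paper turns the distance into a token-level signal is not described in the summary above.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support.

    Textbook definition: sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)).
    The result always lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2.0)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions.

    Returns math.inf when q puts zero mass on an outcome that p supports,
    which is the sense in which a KL penalty is unbounded.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


if __name__ == "__main__":
    # Hypothetical next-token distributions, chosen only for illustration.
    policy = [0.7, 0.2, 0.1]
    reference = [1.0, 0.0, 0.0]
    print("Hellinger:", hellinger(policy, reference))
    print("KL(policy || reference):", kl_divergence(policy, reference))
```

With these toy inputs the KL term is infinite because the reference gives zero probability to tokens the policy still samples, while the Hellinger distance stays within [0, 1].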

An illustration of the ESSAM parameter update.

cs.AI · arxiv:2602.01003

ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

Zhishen Sun, Sizhe Dang et al.

This paper introduces ESSAM, a novel approach for fine-tuning LLMs using competitive Evolution Strategies combined with Sharpness-Aware Maximization. ESSAM addresses the high memory demands of traditional RL methods by leveraging zero-order parameter search an…

abstract pdf

cs.AI · arxiv:2510.16079

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

Rong Wu, Xiaoman Wang et al.

EvolveR enables LLM agents to self-improve by creating a closed-loop experience lifecycle. It first distills interaction trajectories into reusable strategic principles (Offline Self-Distillation) and then uses these principles to guide online task interaction…

abstract pdf

Figure: An illustration of four major paradigms for LLM agent learning. (1) Stateless Execution: Standard agents discard experiences after each task; (2) Learning by Raw Trajectories: Agents retrieve raw, un-distilled past trajectories; (3) Learning via External Scribing: Agents rely on an external teacher model to distill insights; (4) EvolveR (Ours): A complete, self-contained lifecycle where the agent autonomously distills its own experiences into principles and evolves its policy.

FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards

Zhixin Han, Yanzhi Zhang et al.

FutureWorld introduces a novel reinforcement learning environment for training agents to make live future predictions. Its core method involves a delayed reward mechanism where age…

abstract pdf

InvThink: Premortem Reasoning for Safer Language Models

Yubin Kim, Taehan Kim et al.

InvThink is a novel framework that enhances language model safety by requiring a three-step process: enumerating potential harms, analyzing their consequences, and then generating …

abstract pdf

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Qianhao Yuan, Jie Lou et al.

MemSearcher trains LLM agents using end-to-end reinforcement learning to manage a compact, question-relevant memory, avoiding the costly full history concatenation of traditional m…

abstract pdf

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu, Yi Xin et al.

This paper addresses the "Modality Gap," where visual and linguistic embeddings for the same meaning are systematically offset. The authors propose the "Fixed-frame Modality Gap Th…

abstract pdf

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

Erhan Zhang, Yiqun Chen et al.

OASES trains agentic search models by generating intermediate rewards that are aligned with the final task outcome. It achieves this by evaluating how well each search step contrib…

abstract pdf

See all 20 papers →

§ II

The Town Square

Hacker News 9

613

pts

Top story

RTX 5090 and M4 MacBook Air: Can It Game?

This article explores whether the upcoming RTX 5090 graphics card, when paired with an M4 MacBook Air via eGPU, can deliver a viable gaming experience.

scottjg.com · 14 May · discuss on HN →

495

AI is making me dumb

jpain.io · 14 May

323

Bitcoin trader recovers wallet with help of Claude

tomshardware.com · 14 May

246

Ontario auditors find doctors' AI note takers routinely blow basic facts

theregister.com · 14 May

197

Sam Altman's Business Dealings Under GOP Scrutiny Ahead of OpenAI's IPO