№01
cs.AI arxiv:2605.07137

Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

Yash Ingle, Jaival Chauhan, Ankit Yadav et al.

This paper introduces Adaptive Negative Sample Reinforcement (A-NSR) to improve LLM reasoning. A-NSR dynamically adjusts the penalty for incorrect reasoning steps during training, initially prioritizing error correction and later shifting towards more nuanced updates to balance correction and diversity. This adaptive a…

9
№02
cs.AI arxiv:2605.00425

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Haotian Zhao, Songlin Zhou, Yuxin Zhang et al.

AEM addresses the challenge of credit assignment in multi-turn agentic reinforcement learning by adaptively modulating entropy dynamics during training. Unlike methods requiring dense intermediate supervision, AEM is supervision-free and improves the exploration-exploitation trade-off by analyzing and adjusting entropy…

9
№03
cs.AI arxiv:2605.03327

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

Hongbo Jin, Rongpeng Zhu, Zhongjing Du et al.

DGPO addresses credit assignment challenges in reinforcement learning for large language models by reinterpreting distribution deviation as a guiding signal instead of a penalty. It uses the bounded Hellinger distance to enable safe, token-level exploration, overcoming the gradient instability and conservatism of KL di…

9
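
As a rough illustration of why the DGPO summary above calls the Hellinger distance "bounded", the sketch below compares it with KL divergence on two toy next-token distributions. It is not taken from the paper: the function names `hellinger` and `kl` and the example distributions `p` and `q` are assumptions made for this example, and it says nothing about how DGPO actually uses the distance during training.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support.

    The value always lies in [0, 1], even when the supports do not overlap.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

def kl(p, q):
    """KL(p || q); infinite when q assigns zero mass where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue  # 0 * log(0 / b) is taken as 0
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total

# Hypothetical token distributions with disjoint support (illustrative only).
p = [0.5, 0.5, 0.0]
q = [0.0, 0.0, 1.0]

h_disjoint = hellinger(p, q)   # stays at the upper bound of 1
kl_disjoint = kl(p, q)         # diverges to infinity
h_same = hellinger(p, p)       # identical distributions give 0
```

On distributions with disjoint support the Hellinger distance saturates at its maximum while KL is infinite, which is the kind of behaviour the summary's contrast between a bounded distance and KL gradient instability points at.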
№04
cs.AI arxiv:2602.01003

ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

Zhishen Sun, Sizhe Dang, Guang Dai et al.

This paper introduces ESSAM, a novel approach for fine-tuning LLMs using competitive Evolution Strategies combined with Sharpness-Aware Maximization. ESSAM addresses the high memory demands of traditional RL methods by leveraging zero-order parameter search and improving generalization. Its core contribution is achievi…

9
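
The "zero-order parameter search" mentioned in the ESSAM summary above can be pictured with a generic evolution-strategies update that only evaluates the loss and never backpropagates. The sketch below is a textbook antithetic ES step on a toy quadratic, not ESSAM itself: it omits the competitive and sharpness-aware components, and `es_step`, `toy_loss`, and all hyperparameter values are assumptions chosen for this example.

```python
import random

def es_step(theta, loss_fn, rng, sigma=0.1, lr=0.05, pairs=16):
    """One antithetic evolution-strategies update using only loss evaluations."""
    grad = [0.0] * len(theta)
    for _ in range(pairs):
        # Sample a Gaussian perturbation direction.
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = loss_fn([t + sigma * e for t, e in zip(theta, eps)])
        minus = loss_fn([t - sigma * e for t, e in zip(theta, eps)])
        # Finite-difference estimate of the directional derivative along eps,
        # averaged over all perturbation pairs.
        scale = (plus - minus) / (2.0 * sigma * pairs)
        grad = [g + scale * e for g, e in zip(grad, eps)]
    # Plain gradient-descent step on the estimated gradient.
    return [t - lr * g for t, g in zip(theta, grad)]

def toy_loss(theta):
    """Hypothetical objective with its minimum at theta = (1, 1, 1)."""
    return sum((t - 1.0) ** 2 for t in theta)

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
start_loss = toy_loss(theta)
for _ in range(200):
    theta = es_step(theta, toy_loss, rng)
end_loss = toy_loss(theta)
```

Because each update needs only forward evaluations of the loss, no activations or optimizer state from a backward pass have to be stored, which is the general reason zero-order methods are attractive when memory is the constraint.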
№05
cs.AI arxiv:2510.16079

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

Rong Wu, Xiaoman Wang, Jianbiao Mei et al.

EvolveR enables LLM agents to self-improve by creating a closed-loop experience lifecycle. It first distills interaction trajectories into reusable strategic principles (Offline Self-Distillation) and then uses these principles to guide online task interactions, iteratively refining the agent's performance through rein…

9
№06
cs.AI arxiv:2604.26733

FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards

Zhixin Han, Yanzhi Zhang, Chuyang Wei et al.

FutureWorld introduces a novel reinforcement learning environment for training agents to make live future predictions. Its core method involves a delayed reward mechanism where agent predictions are evaluated and rewarded only after real-world outcomes are realized. This allows agents to learn from actual events, closi…

9
№07
cs.AI arxiv:2510.01569

InvThink: Premortem Reasoning for Safer Language Models

Yubin Kim, Taehan Kim, Eugene Park et al.

InvThink is a novel framework that enhances language model safety by requiring a three-step process: enumerating potential harms, analyzing their consequences, and then generating a response with explicit mitigation constraints. This "premortem" reasoning approach significantly improves safety scores, especially in lar…

9
№08
cs.AI arxiv:2511.02805

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Qianhao Yuan, Jie Lou, Zichao Li et al.

MemSearcher trains LLM agents using end-to-end reinforcement learning to manage a compact, question-relevant memory, avoiding the costly full history concatenation of traditional methods. Its core innovation is multi-context GRPO, which enables unified optimization across multiple turns with varying LLM contexts. This …

9
№09
cs.AI arxiv:2602.07026

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu, Yi Xin, Yuhui Zhang et al.

This paper addresses the "Modality Gap," where visual and linguistic embeddings for the same meaning are systematically offset. The authors propose the "Fixed-frame Modality Gap Theory" to precisely model this gap as stable biases and anisotropic residuals. Based on this, they introduce "ReAlign," a training-free strat…

9
№10
cs.AI arxiv:2604.03675

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

Erhan Zhang, Yiqun Chen, Zechun Niu et al.

OASES trains agentic search models by generating intermediate rewards that are aligned with the final task outcome. It achieves this by evaluating how well each search step contributes to answering the original question, providing more reliable supervision than traditional methods. This outcome-aligned process reward m…

9
№11
cs.AI arxiv:2506.00886

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

Hongru Wang, Cheng Qian, Manling Li et al.

This paper argues that tool-augmented agents should only use external tools when their internal reasoning is insufficient to reliably complete a task. It introduces the Theory of Agent (ToA) framework, which views agents as making sequential decisions about resolving uncertainty internally or delegating it externally. …

9
№12
cs.AI arxiv:2605.06230

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

Xinquan Chen, Zhenyun Yin, Shan He et al.

Safactory introduces a unified infrastructure for training trustworthy autonomous agents. It integrates parallel simulation for generating diverse agent experiences, a trustworthy data platform for managing and extracting insights from these experiences, and an autonomous evolution platform for continuous learning and …

9
№13
cs.AI arxiv:2602.10693

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Guobin Shen, Chenxiao Zhao, Xiang Cheng et al.

VESPO addresses the high variance issue in off-policy LLM training by introducing a principled, closed-form sequence-level reshaping kernel. This kernel explicitly incorporates variance reduction into a variational framework, directly operating on importance weights without heuristic token-level approximations. VESPO's…

9
№14
cs.AI arxiv:2508.15294

A Multi-Memory Segment System for Generating High-Quality Long-Term Memory Content in Agents

Gaoke Zhang, Bo Wang, Yunlong Ma et al.

This paper introduces a Multi-Memory Segment System (MMS) to generate higher-quality long-term memory content for agents. Inspired by cognitive psychology, MMS processes short-term memory into multiple distinct long-term memory segments, creating corresponding retrieval and contextual memory units. This approach aims t…

8
№15
cs.AI arxiv:2605.05703

Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems

Huchen Yang, Xinghao Dong, Dan Negrut et al.

This paper proposes an active learning method to optimize communication structures in LLM-based multi-agent systems. Instead of random task sampling, it uses an ensemble-based information-theoretic framework to identify the most informative tasks for improving communication. This approach efficiently estimates task val…

8
№16
cs.AI arxiv:2512.10371

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

Shizuo Tian, Hao Wen, Yuxuan Chen et al.

AgentProg tackles the challenge of long-horizon GUI agent context management by representing interaction history as a program. This program structure guides information retention and discarding, mitigating context overhead. The paper also introduces a global belief state for handling partial observability and environme…

8
№17
cs.AI arxiv:2604.07277

Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

Guo Gan, Yuxuan Ding, Cong Chen et al.

This paper introduces Android Coach, a framework to improve the efficiency of training Android agents with online reinforcement learning. It addresses the costly nature of emulator interactions by shifting from a "single state, single action" to a "single state, multiple actions" paradigm. This allows the agent to expl…

8
№18
cs.AI arxiv:2509.03736

Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation

James Mooney, Josef Woldense, Zheng Robert Jia et al.

This paper introduces a novel method to assess the behavioral coherence of LLM agents by first identifying their underlying latent profiles and then testing their consistency in conversational settings. The core contribution is demonstrating that LLM agents often exhibit significant behavioral inconsistencies, challeng…

8
№19
cs.AI arxiv:2605.07103

ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning

Ye Liu, Botao Yu, Xinyi Ling et al.

ARMOR is an agentic framework that addresses the challenge of reaction feasibility prediction by adaptively leveraging multiple AI tools. It models tool-specific utilities and prioritizes them hierarchically, resolving conflicts to produce more accurate predictions than single-tool or simple aggregation methods. This a…

8
№20
cs.AI arxiv:2601.18681

ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule

Yilie Huang, Wenpin Tang, Xunyu Zhou

This paper introduces Adaptive Reparameterized Time (ART), a method to optimize the timestep schedule for diffusion model sampling. ART learns a reparameterized time variable to dynamically adjust computation across the sampling trajectory, minimizing discretization error. The contribution is a reinforcement learning f…

8