From the arXiv
Friday, 15 May 2026 · 20 papers
Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR
This paper introduces Adaptive Negative Sample Reinforcement (A-NSR) to improve LLM reasoning. A-NSR dynamically adjusts the penalty for incorrect reasoning steps during training, initially prioritizing error correction and later shifting towards more nuanced updates to balance correction and diversity. This adaptive a…
AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
AEM addresses the challenge of credit assignment in multi-turn agentic reinforcement learning by adaptively modulating entropy dynamics during training. Unlike methods requiring dense intermediate supervision, AEM is supervision-free and improves the exploration-exploitation trade-off by analyzing and adjusting entropy…
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
DGPO addresses credit assignment challenges in reinforcement learning for large language models by reinterpreting distribution deviation as a guiding signal instead of a penalty. It uses the bounded Hellinger distance to enable safe, token-level exploration, overcoming the gradient instability and conservatism of KL di…
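To make the "bounded" property concrete, here is a minimal sketch of the standard Hellinger distance between two discrete distributions, next to KL divergence for contrast. It is not taken from the DGPO paper: the function names `hellinger` and `kl_divergence` are illustrative, and the inputs are assumed to be probability vectors over the same support (for example, two next-token distributions).

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (always in [0, 1])."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same support size")
    # H(p, q) = sqrt( 0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2 )
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def kl_divergence(p, q):
    """KL(p || q) in nats; unbounded, and infinite when q misses mass that p has."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total


if __name__ == "__main__":
    policy = [1.0, 0.0]
    reference = [0.0, 1.0]
    # Disjoint supports: Hellinger saturates at its upper bound, KL diverges.
    print(hellinger(policy, reference))
    print(kl_divergence(policy, reference))
```

Because the Hellinger distance cannot exceed 1, a signal built from it stays finite even when the two distributions disagree completely, whereas KL can grow without bound in the same situation.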
ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning
This paper introduces ESSAM, a novel approach for fine-tuning LLMs using competitive Evolution Strategies combined with Sharpness-Aware Maximization. ESSAM addresses the high memory demands of traditional RL methods by leveraging zero-order parameter search and improving generalization. Its core contribution is achievi…
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
EvolveR enables LLM agents to self-improve by creating a closed-loop experience lifecycle. It first distills interaction trajectories into reusable strategic principles (Offline Self-Distillation) and then uses these principles to guide online task interactions, iteratively refining the agent's performance through rein…
FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards
FutureWorld introduces a novel reinforcement learning environment for training agents to make live future predictions. Its core method involves a delayed reward mechanism where agent predictions are evaluated and rewarded only after real-world outcomes are realized. This allows agents to learn from actual events, closi…
InvThink: Premortem Reasoning for Safer Language Models
InvThink is a novel framework that enhances language model safety by requiring a three-step process: enumerating potential harms, analyzing their consequences, and then generating a response with explicit mitigation constraints. This "premortem" reasoning approach significantly improves safety scores, especially in lar…
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
MemSearcher trains LLM agents using end-to-end reinforcement learning to manage a compact, question-relevant memory, avoiding the costly full history concatenation of traditional methods. Its core innovation is multi-context GRPO, which enables unified optimization across multiple turns with varying LLM contexts. This …
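For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage used by standard GRPO: each sampled response is scored against the mean and standard deviation of its own group. This is only the baseline computation, not the multi-context variant described above, whose details are not given in this summary. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    # Population standard deviation (an assumption; implementations differ here).
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every reward in the group is identical.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one question, rewarded 1 if correct and 0 otherwise.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, and a group whose rewards are all equal contributes no learning signal.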
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
This paper addresses the "Modality Gap," where visual and linguistic embeddings for the same meaning are systematically offset. The authors propose the "Fixed-frame Modality Gap Theory" to precisely model this gap as stable biases and anisotropic residuals. Based on this, they introduce "ReAlign," a training-free strat…
OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
OASES trains agentic search models by generating intermediate rewards that are aligned with the final task outcome. It achieves this by evaluating how well each search step contributes to answering the original question, providing more reliable supervision than traditional methods. This outcome-aligned process reward m…
Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary
This paper argues that tool-augmented agents should only use external tools when their internal reasoning is insufficient to reliably complete a task. It introduces the Theory of Agent (ToA) framework, which views agents as making sequential decisions about resolving uncertainty internally or delegating it externally. …
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory introduces a unified infrastructure for training trustworthy autonomous agents. It integrates parallel simulation for generating diverse agent experiences, a trustworthy data platform for managing and extracting insights from these experiences, and an autonomous evolution platform for continuous learning and …
VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
VESPO addresses the high variance issue in off-policy LLM training by introducing a principled, closed-form sequence-level reshaping kernel. This kernel explicitly incorporates variance reduction into a variational framework, directly operating on importance weights without heuristic token-level approximations. VESPO's…
A Multi-Memory Segment System for Generating High-Quality Long-Term Memory Content in Agents
This paper introduces a Multi-Memory Segment System (MMS) to generate higher-quality long-term memory content for agents. Inspired by cognitive psychology, MMS processes short-term memory into multiple distinct long-term memory segments, creating corresponding retrieval and contextual memory units. This approach aims t…
Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems
This paper proposes an active learning method to optimize communication structures in LLM-based multi-agent systems. Instead of random task sampling, it uses an ensemble-based information-theoretic framework to identify the most informative tasks for improving communication. This approach efficiently estimates task val…
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
AgentProg tackles context management for long-horizon GUI agents by representing the interaction history as a program. This program structure guides which information is retained and which is discarded, reducing context overhead. The paper also introduces a global belief state for handling partial observability and environme…
Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
This paper introduces Android Coach, a framework to improve the efficiency of training Android agents with online reinforcement learning. It addresses the high cost of emulator interactions by shifting from a "single state, single action" paradigm to a "single state, multiple actions" paradigm. This allows the agent to expl…
Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation
This paper introduces a novel method to assess the behavioral coherence of LLM agents by first identifying their underlying latent profiles and then testing their consistency in conversational settings. The core contribution is demonstrating that LLM agents often exhibit significant behavioral inconsistencies, challeng…
ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning
ARMOR is an agentic framework that addresses the challenge of reaction feasibility prediction by adaptively leveraging multiple AI tools. It models tool-specific utilities and prioritizes them hierarchically, resolving conflicts to produce more accurate predictions than single-tool or simple aggregation methods. This a…
ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule
This paper introduces Adaptive Reparameterized Time (ART), a method to optimize the timestep schedule for diffusion model sampling. ART learns a reparameterized time variable to dynamically adjust computation across the sampling trajectory, minimizing discretization error. The contribution is a reinforcement learning f…