From the arXiv
Wednesday, 13 May 2026 · 20 papers
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
This paper introduces Agentic Reinforcement Learning (RL) for Large Language Models (LLMs), moving beyond traditional RL's fixed objectives. The core method integrates LLMs' cognitive abilities like planning and self-reflection into the RL loop, enabling autonomous agents to tackle complex, open-ended tasks. Its main c…
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
This paper introduces a novel queueing-theoretic framework to analyze LLM inference stability, explicitly considering both computational demands and KV cache memory constraints. The core contribution is deriving rigorous conditions for system stability, enabling operators to determine the necessary GPU cluster size to …
DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
DTap is a novel platform designed for the controllable and interactive red-teaming of AI agents. Its core method involves creating realistic, reproducible simulation environments across diverse domains to test agent security. The main contribution is providing a much-needed tool for large-scale risk assessment of AI ag…
Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
This paper argues that current machine learning alignment evaluations, which focus solely on model outputs, are insufficient for assessing real-world deployment. It proposes that alignment claims should be tied to the specific level of evidence collected (model, response, interaction, or deployment). Through audits, th…
EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance
EP-GRPO addresses credit assignment failures in Group Relative Policy Optimization (GRPO) for LLM reasoning. It uses entropy-gated modulation to focus on informative decision points and implicit process signals from policy divergence to provide directional, outcome-driven feedback at the token level, reducing training …
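For context, the credit assignment issue is easier to see from the baseline GRPO advantage. The sketch below is not EP-GRPO: it shows only the standard group-relative normalization, and it assumes population standard deviation and a small `eps` term (both illustrative choices). The function name is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group.

    `rewards` holds one scalar outcome reward per response sampled for
    the same prompt. Population std and `eps` are assumptions made for
    this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # One advantage per response; baseline GRPO applies the same value
    # to every token in that response.
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With two correct and two incorrect responses, the advantages come out at roughly +1 and -1. Because every token in a response inherits the same scalar, the baseline cannot tell informative decision points from filler tokens, which is the gap that token-level schemes aim to close.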
From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
This paper proposes a novel method, Sample-Level Quantification of Safety Degradation (SQSD), to identify and quantify which training samples are most responsible for degrading LLM safety during fine-tuning. By analyzing the cumulative parameter drift towards unsafe directions, SQSD assigns risk scores to individual sa…
Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction
This paper introduces a novel evaluation method for Large Language Models (LLMs) called "black-box environment interaction." LLMs interact with hidden functions, learning from input-output pairs to deduce the underlying rules. The contribution is the Oracle benchmark, which tests integrated reasoning in unknow…
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
JASTIN addresses the challenge of evaluating generative audio models by framing it as a self-instructed reasoning task. It achieves this by connecting a frozen audio encoder with a fine-tuned LLM via a trainable adapter, and uses a novel data preparation pipeline to ensure robust zero-shot generalization. This approach…
Manifold of Failure: Behavioral Attraction Basins in Language Models
This paper introduces a framework to systematically map "behavioral attraction basins," which are unsafe regions in Large Language Models (LLMs). By reframing vulnerability discovery as a quality diversity problem using MAP-Elites, the authors illuminate the continuous topology of these failure regions. Their contribut…
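As background, MAP-Elites keeps the best solution found in each cell of a discretized behavior space, so the result is a map rather than a single optimum. The sketch below is a generic toy version over two-dimensional numeric genomes, not the paper's pipeline: the `fitness` and `descriptor` callables, the grid size, and the mutation scale are all hypothetical.

```python
import random


def map_elites(fitness, descriptor, n_bins=5, iterations=2000, seed=0):
    """Toy MAP-Elites over genomes in [0, 1]^2.

    `descriptor` maps a genome to behavior coordinates in [0, 1];
    `fitness` maps a genome to a score to maximize. Returns a dict
    from grid cell to (fitness, genome) for the elite in that cell.
    """
    rng = random.Random(seed)
    archive = {}
    for i in range(iterations):
        if archive and i >= 100:
            # Mutate a randomly chosen elite, clamping to [0, 1].
            _, parent = rng.choice(list(archive.values()))
            child = [min(1.0, max(0.0, x + rng.gauss(0, 0.1))) for x in parent]
        else:
            # Bootstrap the archive with random genomes.
            child = [rng.random(), rng.random()]
        cell = tuple(min(n_bins - 1, int(v * n_bins)) for v in descriptor(child))
        score = fitness(child)
        # Keep the child only if its cell is empty or it beats the elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


archive = map_elites(
    fitness=lambda g: -((g[0] - 0.5) ** 2 + (g[1] - 0.5) ** 2),
    descriptor=lambda g: g,
)
```

In a vulnerability-mapping setting, the genome would be a prompt, the descriptor a set of behavioral features, and the fitness a measure of how unsafe the response is; those mappings are assumptions here, not details taken from the paper.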
Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent
This paper surveys meta-learning and meta-reinforcement learning by formalizing them based on tasks. It then traces the development of key algorithms that led to DeepMind's Adaptive Agent, highlighting how meta-learning enables rapid adaptation to new tasks with minimal data by leveraging transferable knowledge.
Misaligned by Reward: Socially Undesirable Preferences in LLMs
This paper introduces a new method to evaluate reward models for Large Language Models (LLMs) by focusing on socially undesirable preferences, rather than just general instruction following. It converts existing social evaluation datasets into pairwise preference data to test whether reward models favor biased, unsafe, or …
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
This paper introduces NeuroState-Bench, a novel benchmark designed to evaluate the "commitment integrity" of LLM agents, ensuring they maintain coherence throughout multi-turn tasks. Unlike previous methods, it uses human-calibrated side-query probes to directly assess this integrity, rather than relying on inferred in…
OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
This paper introduces OceanPile, a large-scale multimodal corpus designed to address the data bottleneck in ocean science AI. Its core method involves unifying diverse ocean data, including sonar, imagery, and text, into a single, aligned dataset. The main contribution is enabling the development of foundation models f…
On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
This paper proves that supervised fine-tuning (SFT) and reinforcement learning (RL) are fundamentally intertwined during large language model post-training. The core contribution is demonstrating that neither SFT nor RL can be performed independently without negatively impacting the other's objective, whether applied s…
SoK: Robustness in Large Language Models against Jailbreak Attacks
This paper addresses the critical issue of Large Language Model (LLM) vulnerability to jailbreak attacks. Its core contribution is the introduction of "Security Cube," a novel, multi-dimensional evaluation framework designed to comprehensively assess the robustness of LLMs against these adversarial prompts, moving beyo…
StoryAlign: Evaluating and Training Reward Models for Story Generation
This paper introduces StoryRMB, the first benchmark for evaluating reward models on human story preferences. The authors find that existing reward models perform poorly, reaching only 66.3% accuracy in selecting preferred stories. To improve on this, they construct a large dataset of story preference pairs to train better reward mod…
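An accuracy figure of this kind is typically a pairwise metric: the fraction of preference pairs in which the reward model scores the human-preferred item higher. A minimal sketch follows, assuming ties count as errors; the function name and the length-based toy scorer are hypothetical and stand in for a real reward model.

```python
def pairwise_accuracy(pairs, score):
    """Fraction of (chosen, rejected) pairs where `score` ranks chosen higher.

    `pairs` is a non-empty list of (chosen, rejected) texts and `score`
    is any callable returning a scalar reward. Ties count as incorrect,
    which is an assumption of this sketch.
    """
    correct = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return correct / len(pairs)


# Toy scorer that simply prefers longer text.
toy_pairs = [
    ("a long and vivid story", "short"),
    ("ok", "a rambling, unfocused draft"),
]
accuracy = pairwise_accuracy(toy_pairs, score=len)
```

A scorer that guesses at random lands near 0.5 on this metric, which is the usual reference point when reading such accuracy numbers.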
Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games
Strat-Reasoner enhances LLMs' strategic reasoning in multi-agent games by introducing a recursive framework where an agent's reasoning incorporates others'. It uses a centralized Chain-of-Thought comparison module to provide reward signals for intermediate reasoning steps, addressing challenges of non-stationarity and …
The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems
This paper introduces the 2025 AI Agent Index, a comprehensive catalog of 30 advanced AI agents. Its core method involves collecting and documenting technical and safety features from publicly available information and developer correspondence. The key contribution is to provide a structured overview of the rapidly evo…
Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
This paper introduces the first systematic approach to automatically manage failures during Reinforcement Fine-Tuning (RFT) of LLMs. It proposes RFT-FaultBench, a comprehensive benchmark to categorize and analyze RFT failures. The core contribution is developing methods to automatically detect and address these failure…
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
Uno-Orchestra is a novel orchestration policy for LLM multi-agent systems that jointly learns to decompose tasks and select appropriate agent-primitive pairs for each subtask. This selective delegation approach, trained via reinforcement learning, significantly improves accuracy (77.0% macro pass@1) and reduces per-que…