2026-W21
The Week in Review
This collection of papers showcases a vibrant research landscape in LLM agents, focusing on robustness, reasoning, alignment, and practical deployment.
A prominent theme is enhancing LLM agent reasoning and tool use, with the introduction of benchmarks like AgentEscapeBench and ComplexMCP highlighting the challenges of out-of-domain reasoning and complex, dynamic environments. Papers like "Tool Calling is Linearly Readable and Steerable" and "Ask Early, Ask Late, Ask Right" explore mechanisms to improve tool selection, error detection, and the strategic use of clarification.
Alignment and safety remain critical areas. GraphDPO and DGPO advance preference optimization by moving beyond pairwise comparisons to more structured, groupwise methods. GLiGuard and LANCE tackle content moderation and rigid rejection more efficiently. Concerns about unintended consequences of alignment are raised by "How Value Induction Reshapes LLM Behaviour" and "Conformity Generates Collective Misalignment", suggesting that individual alignment doesn't guarantee collective safety, and external influences can lead to emergent misalignment. DISCA offers a training-free approach to cultural alignment.
The research also delves into agent memory and coordination. "The Memory Curse" reveals how expanded recall can paradoxically hinder cooperation, while TraceFix proposes a verification-first approach to repairing agent coordination protocols. ADKO and NanoResearch explore decentralized knowledge optimization and personalized research automation through multi-agent frameworks.
Finally, there's a focus on improving LLM training and inference. "Latent Diffusion Language Models" and ELF propose novel diffusion-based architectures for text generation. Flow-OPD and "KL for a KL" address on-policy distillation for multi-task models. "LLMs Improving LLMs" and SPEAR explore agentic methods for scaling and federated learning. Additionally, "Reason to Play" suggests LRMs exhibit human-like learning in games, and "The Agent Use of Agent Beings" argues for cybernetics as a foundational science for agents. CyBiasBench highlights "attack-selection bias" in cybersecurity agents.
Top Papers
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
This paper introduces AgentEscapeBench, a novel benchmark designed to evaluate LLM agents' ability to perform out-of-domain, tool-grounded reasoning with long-range dependencies. The benchmark uses escape-room-style tasks requiring agents to infer and execute complex tool-use procedures, demonstrating a significant performance drop for both humans and LLMs as dependency depth increases. AgentEscapeBench's core contribution is providing a challenging, automated evaluation for robust agent reasoning beyond simple tool interactions.

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
This paper introduces GraphDPO, a generalization of Direct Preference Optimization (DPO) that handles preference data structured as graphs, rather than just pairs. By optimizing a graph-structured objective, GraphDPO leverages richer preference information, enforces transitivity, and avoids issues arising from collapsing multi-rollout data into independent pairs. This approach offers a more robust and comprehensive method for aligning language models with human preferences.
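For context, here is a minimal sketch of the pairwise baseline that the summary contrasts against: the standard DPO loss on one (preferred, rejected) pair, plus the naive collapse of a ranked group of rollouts into independent pairs. It is not the GraphDPO objective, which the summary does not spell out. The log-probability values, `beta`, and all function names are illustrative assumptions.

```python
import math
from itertools import combinations

def dpo_pair_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    lp_* are policy log-probs, ref_* are reference-model log-probs,
    for the preferred (w) and rejected (l) responses.
    """
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

def collapse_to_pairs(ranked):
    """Turn K rollouts ranked best-first into K*(K-1)/2 independent pairs."""
    return list(combinations(range(len(ranked)), 2))  # (i, j): i preferred to j

# Hypothetical (policy_logp, reference_logp) values, ranked best-first.
ranked = [(-1.0, -1.5), (-2.0, -2.0), (-2.5, -2.2), (-4.0, -3.0)]
pairs = collapse_to_pairs(ranked)
total = sum(
    dpo_pair_loss(ranked[i][0], ranked[j][0], ranked[i][1], ranked[j][1])
    for i, j in pairs
)
```

Each pair contributes its own independent loss term here, so nothing ties the terms together into one consistent ordering. That is the situation the summary says a graph-structured objective is meant to avoid.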

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents
This paper introduces the "memory curse," demonstrating that expanding LLM agents' context windows can paradoxically *decrease* cooperation in multi-agent social dilemmas. The core method involves extensive testing across various LLMs and games, revealing that increased memory leads to a decline in forward-looking cooperative intent. The key contribution is identifying this mechanism and showing that targeted fine-tuning on forward-looking reasoning or sanitizing memory content can restore cooperative behavior.

Tool Calling is Linearly Readable and Steerable in Language Models
This paper demonstrates that language models' tool-calling decisions are linearly encoded within their internal activations. By manipulating the difference in average activations between tool representations, researchers can reliably steer the model to select a different tool. This discovery also allows for pre-execution error detection, as small activation gaps between competing tools predict a higher likelihood of incorrect tool selection.
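The difference-of-means idea can be illustrated with a toy sketch. The 3-dimensional "activations", the tool names, and the projection-based readout below are all made-up assumptions for illustration; the paper works with real hidden states, and its exact readout may differ.

```python
def mean_vec(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical activations recorded when the model chose each tool.
acts = {
    "search":     [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "calculator": [[0.0, 0.1, 1.0], [0.1, 0.2, 0.9]],
}
means = {tool: mean_vec(rows) for tool, rows in acts.items()}

def readout(h):
    """Score each tool by projecting h onto its mean activation.

    Returns the top tool and its score gap to the runner-up.
    """
    scores = {t: dot(h, m) for t, m in means.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0], scores[ranked[0]] - scores[ranked[1]]

def steer(h, src, dst, alpha=1.0):
    """Shift h along the difference of mean activations (dst minus src)."""
    return [x + alpha * (d - s) for x, d, s in zip(h, means[dst], means[src])]

h = [0.95, 0.15, 0.05]
tool_before, gap_before = readout(h)
tool_after, gap_after = readout(steer(h, "search", "calculator"))
```

In this toy setup the readout flips from `search` to `calculator` after steering. A small `gap` value would be the kind of signal the summary describes as predicting an incorrect tool choice before execution.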

RelAgent: LLM Agents as Data Scientists for Relational Learning
RelAgent is an LLM-based autonomous data scientist for relational learning. It first uses LLM agents with workspace tools to automatically generate SQL feature programs and select a predictive model. The contribution is a two-phase approach that results in fast, interpretable, and scalable predictors composed of SQL queries and classical models, avoiding further LLM calls during inference.

GLiGuard: Schema-Conditioned Classification for LLM Safeguard
GLiGuard reformulates LLM content moderation as a classification problem, moving away from slow, generation-based guardrails. Its core method uses a small, schema-conditioned bidirectional encoder to process task definitions and label semantics directly as structured tokens. This allows for efficient, simultaneous evaluation of multiple safety dimensions in a single pass, significantly improving scalability and reducing latency.

How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
This paper introduces the Latent Diffusion Language Model (LDLM), which jointly trains an encoder, diffusion model, and decoder for non-autoregressive text generation. The core method involves reshaping pre-trained language model representations into a latent space suitable for denoising and decoding. The key contribution is a novel training recipe that overcomes challenges in naive joint training, leading to improved generation quality on benchmark datasets.

How Value Induction Reshapes LLM Behaviour
This paper investigates how fine-tuning Large Language Models (LLMs) with specific values impacts their behavior. The core method involves fine-tuning models on curated value subsets and measuring changes in other value expressions, safety, and performance. The key contribution is demonstrating that value induction can lead to unintended consequences, such as the expression of unrelated or even contrasting values, and potentially make models more addictive or sycophantic.

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
This paper introduces **ComplexMCP**, a novel benchmark designed to evaluate LLM agents in realistic, complex software automation scenarios. It addresses the limitations of current benchmarks by simulating dynamic environments with interdependent tools and unpredictable failures. The core contribution is a rigorous evaluation framework that reveals significant performance gaps between LLM agents and human capabilities, highlighting key areas for future improvement.

ELF: Embedded Language Flows
This paper introduces Embedded Language Flows (ELF), a novel approach to language modeling using continuous diffusion models. ELF's core method is to perform diffusion in continuous embedding space for most of the generation process, only mapping to discrete tokens at the final step. This allows ELF to leverage successful techniques from image diffusion, like classifier-free guidance, and achieve superior performance compared to existing discrete diffusion language models.

NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
NanoResearch is a multi-agent framework that personalizes research automation by co-evolving skills, memory, and policy. Its core method involves a tri-level co-evolutionary process where a skill bank distills reusable procedural knowledge, a memory module retains user-specific experience, and a policy module internalizes implicit user preferences. This approach allows the system to adapt to individual researchers' unique needs and preferences, moving beyond uniform outputs to provide truly personalized research assistance.

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
This paper argues that **cybernetics offers the missing theoretical foundation for the engineering-driven field of LLM-based foundation agents.** It proposes that applying cybernetic principles can address fundamental open questions about agent control, environmental adaptation, and safe self-improvement, moving beyond empirical trial-and-error.

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
This paper introduces DISCA, a training-free method to align large language models with cultural values in a black-box setting. DISCA leverages disagreement among persona agents, grounded in real-world survey data, to guide the model's output. This approach effectively reduces cultural misalignment without requiring extensive data or model access.

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
This paper introduces SLIM, a framework for dynamic skill management in agentic reinforcement learning. SLIM treats the set of active external skills as a variable to be optimized alongside the agent's policy. Its core contribution is a method to dynamically manage these skills by estimating their marginal contribution and applying lifecycle operations (retain, retire, or introduce) to maintain an optimal, non-monotonic skill set.

DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures
DynaMiCS addresses the challenge of fine-tuning LLMs for specific tasks while maintaining performance on general capabilities. It frames this as a constrained optimization problem, dynamically adjusting data mixture weights at each training step. By probing domain-specific effects, DynaMiCS ensures target-domain improvement without sacrificing performance on critical constrained domains.

Conformity Generates Collective Misalignment in AI Agents Societies
This paper demonstrates that even if individual AI agents are aligned with human values, their collective behavior can become misaligned due to conformity. The core method involves simulating opinion dynamics where agents are influenced by both their intrinsic biases and the majority opinion. The key contribution is a quantitative theory predicting when populations become trapped in misaligned states and identifying tipping points where a small number of adversarial agents can cause irreversible shifts in collective alignment.
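A toy mean-field update can make the "trapped in a misaligned state" idea concrete. This is an illustrative model chosen for simplicity, not the paper's quantitative theory. The update rule, the `bias` and `conformity` parameters, and the hard majority threshold are all assumptions.

```python
def step(x, bias=0.7, conformity=0.6):
    """One update of x, the fraction of agents holding the aligned opinion.

    Each agent follows its intrinsic bias with weight (1 - conformity)
    and copies the current majority with weight conformity.
    """
    majority_aligned = 1.0 if x > 0.5 else 0.0
    return (1 - conformity) * bias + conformity * majority_aligned

def run(x0, steps=20, **params):
    """Iterate the update from an initial aligned fraction x0."""
    x = x0
    for _ in range(steps):
        x = step(x, **params)
    return x

aligned_start = run(0.9)                          # majority starts aligned
misaligned_start = run(0.3)                       # majority starts misaligned
low_conformity = run(0.3, conformity=0.2)         # same start, weaker conformity
```

With these assumed parameters every agent individually leans aligned (`bias=0.7`), yet a population that starts with a misaligned majority settles near 0.28 aligned. Lowering `conformity` to 0.2 lets the same starting population recover to a aligned majority.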

DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization
This paper introduces Directional-Groupwise Preference Optimization (DGPO), a novel method for aligning Large Language Models (LLMs) with human preferences. DGPO addresses limitations in existing pairwise methods by aggregating supervision signals at the group level and explicitly modeling directional consistency through multi-candidate comparisons. This approach captures richer relative information and reinforces consistency across diverse reasoning pathways, leading to improved performance.

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
This paper introduces LITMUS, a benchmark for testing LLM agents' safety in real operating system environments. It addresses the risk of "behavioral jailbreaks" by using a dual verification mechanism and state rollback to evaluate both semantic and physical-layer harms. LITMUS provides a comprehensive set of test cases and an automated framework to measure unsafe subversion of LLM agents.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench is a novel benchmark designed to evaluate the real-world performance of AI agents in command-line interfaces. It features long-horizon, multimodal tasks executed in actual runtimes with real tools, unlike previous synthetic benchmarks. The benchmark's contribution lies in its realistic evaluation of agent capabilities across extended tasks and its hybrid grading system, offering a more accurate assessment of agent reliability.

A Brief Overview: Agentic Reinforcement Learning In Large Language Models
This paper introduces Agentic Reinforcement Learning (RL) for Large Language Models (LLMs), moving beyond traditional RL's fixed objectives. The core method integrates LLMs' cognitive abilities like planning and self-reflection into the RL loop, enabling autonomous agents to tackle complex, open-ended tasks. Its main contribution is a framework for developing these more adaptable and goal-setting agents in uncertain environments.

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
This paper introduces a novel queueing-theoretic framework to analyze LLM inference stability, explicitly considering both computational demands and KV cache memory constraints. The core contribution is deriving rigorous conditions for system stability, enabling operators to determine the necessary GPU cluster size to avoid performance degradation or overspending.

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
DTap is a novel platform designed for the controllable and interactive red-teaming of AI agents. Its core method involves creating realistic, reproducible simulation environments across diverse domains to test agent security. The main contribution is providing a much-needed tool for large-scale risk assessment of AI agents, addressing the challenges posed by their dynamic and untrusted operational environments.

Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
This paper argues that current machine learning alignment evaluations, which focus solely on model outputs, are insufficient for assessing real-world deployment. It proposes that alignment claims should be tied to the specific level of evidence collected (model, response, interaction, or deployment). Through audits, the study finds a lack of user-facing verification and process steerability in existing benchmarks, highlighting the need for more interaction-focused evaluation methods.

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance
EP-GRPO addresses credit assignment failures in Group Relative Policy Optimization (GRPO) for LLM reasoning. It uses entropy-gated modulation to focus on informative decision points and implicit process signals from policy divergence to provide directional, outcome-driven feedback at the token level, reducing training waste and improving alignment.
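As background on the credit assignment problem, here is a minimal sketch of the group-relative advantage commonly used in plain GRPO. It does not show EP-GRPO's entropy gating or process signals. The reward values are hypothetical, and normalizing by the population standard deviation is an assumption, since implementations vary.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt, scored by an outcome reward (hypothetical).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout gets the same reward, all advantages are zero and the
# group contributes no gradient signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In plain GRPO each rollout's single advantage is applied to every token it contains, which is why informative and uninformative tokens receive identical credit. The all-equal group shows one form of wasted samples.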

From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
This paper proposes a novel method, Sample-Level Quantification of Safety Degradation (SQSD), to identify and quantify which training samples are most responsible for degrading LLM safety during fine-tuning. By analyzing the cumulative parameter drift towards unsafe directions, SQSD assigns risk scores to individual samples, enabling targeted mitigation of safety vulnerabilities.

Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction
This paper introduces a novel evaluation method for Large Language Models (LLMs) called "black-box environment interaction." LLMs interact with hidden functions, learning from input-output pairs to deduce the underlying rules. The contribution is the Oracle benchmark, which tests integrated reasoning in unknown environments, revealing that current LLMs struggle with this complex task.

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
JASTIN addresses the challenge of evaluating generative audio models by framing it as a self-instructed reasoning task. It achieves this by connecting a frozen audio encoder with a fine-tuned LLM via a trainable adapter, and uses a novel data preparation pipeline to ensure robust zero-shot generalization. This approach leads to state-of-the-art performance in aligning with human subjective ratings for audio and speech evaluation.

Manifold of Failure: Behavioral Attraction Basins in Language Models
This paper introduces a framework to systematically map "behavioral attraction basins," which are unsafe regions in Large Language Models (LLMs). By reframing vulnerability discovery as a quality diversity problem using MAP-Elites, the authors illuminate the continuous topology of these failure regions. Their contribution lies in characterizing these unsafe areas rather than just fixing them, revealing distinct, model-specific vulnerability patterns.

Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent
This paper surveys meta-learning and meta-reinforcement learning by formalizing them based on tasks. It then traces the development of key algorithms that led to DeepMind's Adaptive Agent, highlighting how meta-learning enables rapid adaptation to new tasks with minimal data by leveraging transferable knowledge.

Misaligned by Reward: Socially Undesirable Preferences in LLMs
This paper introduces a new method to evaluate reward models for Large Language Models (LLMs) by focusing on socially undesirable preferences, rather than just general instruction following. The authors convert existing social evaluation datasets into pairwise preference data to test if reward models favor biased, unsafe, or unethical responses. The key contribution is demonstrating that current reward models can exhibit significant social misalignments, which are often hidden by traditional evaluation methods.

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
This paper introduces NeuroState-Bench, a novel benchmark designed to evaluate the "commitment integrity" of LLM agents, ensuring they maintain coherence throughout multi-turn tasks. Unlike previous methods, it uses human-calibrated side-query probes to directly assess this integrity, rather than relying on inferred internal states. The benchmark's contribution lies in its comprehensive design, including diverse tasks and probes, and its empirical demonstration that task success and commitment integrity are not always aligned.

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
This paper introduces OceanPile, a large-scale multimodal corpus designed to address the data bottleneck in ocean science AI. Its core method involves unifying diverse ocean data, including sonar, imagery, and text, into a single, aligned dataset. The main contribution is enabling the development of foundation models for ocean science, overcoming limitations of fragmented and weakly labeled data.

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
This paper proves that supervised fine-tuning (SFT) and reinforcement learning (RL) are fundamentally intertwined during large language model post-training. The core contribution is demonstrating that, when the two are applied sequentially, neither SFT nor RL can be performed without negatively impacting the other's objective. This non-decoupling implies that their interleaved application is necessary for optimal performance.

SoK: Robustness in Large Language Models against Jailbreak Attacks
This paper addresses the critical issue of Large Language Model (LLM) vulnerability to jailbreak attacks. Its core contribution is the introduction of "Security Cube," a novel, multi-dimensional evaluation framework designed to comprehensively assess the robustness of LLMs against these adversarial prompts, moving beyond simplistic metrics like attack success rate. This framework enables a more thorough understanding of existing attack and defense techniques and identifies key challenges in LLM security.

StoryAlign: Evaluating and Training Reward Models for Story Generation
This paper introduces StoryRMB, the first benchmark for evaluating reward models on human story preferences. The authors find that existing reward models perform poorly, achieving only 66.3% accuracy in selecting preferred stories. To improve this, they construct a large dataset of story preference pairs to train better reward models for story generation.

Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games
Strat-Reasoner enhances LLMs' strategic reasoning in multi-agent games by introducing a recursive framework where an agent's reasoning incorporates others'. It uses a centralized Chain-of-Thought comparison module to provide reward signals for intermediate reasoning steps, addressing challenges of non-stationarity and credit assignment in multi-agent environments.

The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems
This paper introduces the 2025 AI Agent Index, a comprehensive catalog of 30 advanced AI agents. Its core method involves collecting and documenting technical and safety features from publicly available information and developer correspondence. The key contribution is to provide a structured overview of the rapidly evolving AI agent landscape, highlighting trends in capabilities and, importantly, the concerning lack of transparency regarding safety and societal impact among developers.

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
This paper introduces the first systematic approach to automatically manage failures during Reinforcement Fine-Tuning (RFT) of LLMs. It proposes RFT-FaultBench, a comprehensive benchmark to categorize and analyze RFT failures. The core contribution is developing methods to automatically detect and address these failures, moving beyond manual inspection and improving RFT robustness.

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
Uno-Orchestra is a novel orchestration policy for LLM multi-agent systems that jointly learns to decompose tasks and select appropriate agent-primitive pairs for each subtask. This selective delegation approach, trained via reinforcement learning, significantly improves accuracy (77.0% macro pass@1) and reduces per-query cost by an order of magnitude compared to existing methods. Its core contribution lies in unifying task decomposition and worker selection for parsimonious and efficient agent routing.

History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
This paper introduces HistoryAnchor-100, a dataset designed to test LLM safety by examining how prior harmful actions influence future decisions. The core method involves presenting LLMs with scenarios where a harmful past action is followed by a choice between safe and unsafe options. The key contribution is demonstrating that a simple instruction to "stay consistent with the strategy shown in the prior history" dramatically increases LLM unsafe action selection, even for highly aligned models, highlighting a critical vulnerability in current LLM agent design.

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
This paper introduces SLOP, a method for inference-time alignment that generalizes existing techniques by using a sharpened logarithmic opinion pool of generative reward models. By adjusting the "temperature" of reference models and calibrating SLOP weights, the approach mitigates reward hacking and improves robustness while maintaining alignment performance.

Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR
This paper introduces Adaptive Negative Sample Reinforcement (A-NSR) to improve LLM reasoning. A-NSR dynamically adjusts the penalty for incorrect reasoning steps during training, initially prioritizing error correction and later shifting towards more nuanced updates to balance correction and diversity. This adaptive approach aims to enhance LLM reasoning performance beyond fixed penalty methods.

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
AEM addresses the challenge of credit assignment in multi-turn agentic reinforcement learning by adaptively modulating entropy dynamics during training. Unlike methods requiring dense intermediate supervision, AEM is supervision-free and improves the exploration-exploitation trade-off by analyzing and adjusting entropy at the response level, aligning uncertainty estimation with how agents interact with environments. This novel approach enhances learning efficiency and generalization without increasing supervision complexity.

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
DGPO addresses credit assignment challenges in reinforcement learning for large language models by reinterpreting distribution deviation as a guiding signal instead of a penalty. It uses the bounded Hellinger distance to enable safe, token-level exploration, overcoming the gradient instability and conservatism of KL divergence. This allows for more precise identification and reinforcement of effective reasoning steps within complex generated sequences.
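The boundedness contrast between the two measures can be checked numerically. The sketch below uses the textbook definitions of KL divergence and Hellinger distance on small discrete distributions. The example distributions are made up, and nothing here reproduces how DGPO actually uses the distance during training.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; infinite if q misses p's support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def hellinger(p, q):
    """Hellinger distance, always in [0, 1]."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - bc))

policy = [0.5, 0.5]
reference = [0.999999, 0.000001]  # reference nearly rules out the second token

kl_value = kl(policy, reference)
h_value = hellinger(policy, reference)
h_disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
kl_disjoint = kl([1.0, 0.0], [0.0, 1.0])
```

As the reference probability of a token approaches zero, the KL term grows without bound while the Hellinger distance stays at or below 1. This matches the summary's point about a bounded signal for token-level exploration.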

ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning
This paper introduces ESSAM, a novel approach for fine-tuning LLMs using competitive Evolution Strategies combined with Sharpness-Aware Maximization. ESSAM addresses the high memory demands of traditional RL methods by leveraging zero-order parameter search and improving generalization. Its core contribution is achieving comparable or superior performance to RL algorithms on mathematical reasoning tasks, while being significantly more memory-efficient.

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
EvolveR enables LLM agents to self-improve by creating a closed-loop experience lifecycle. It first distills interaction trajectories into reusable strategic principles (Offline Self-Distillation) and then uses these principles to guide online task interactions, iteratively refining the agent's performance through reinforcement learning. This approach addresses the limitation of LLM agents in systematically learning from their own experiences and refining problem-solving strategies.

FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards
FutureWorld introduces a novel reinforcement learning environment for training agents to make live future predictions. Its core method involves a delayed reward mechanism where agent predictions are evaluated and rewarded only after real-world outcomes are realized. This allows agents to learn from actual events, closing the training loop and enabling continuous learning from the real world.

InvThink: Premortem Reasoning for Safer Language Models
InvThink is a novel framework that enhances language model safety by requiring a three-step process: enumerating potential harms, analyzing their consequences, and then generating a response with explicit mitigation constraints. This "premortem" reasoning approach significantly improves safety scores, especially in larger models, while crucially avoiding the "safety tax" by preserving reasoning capabilities. Its contribution lies in a structured generation process that proactively addresses potential failures across general and domain-specific ethical scenarios.

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
MemSearcher trains LLM agents using end-to-end reinforcement learning to manage a compact, question-relevant memory, avoiding the costly full history concatenation of traditional methods. Its core innovation is multi-context GRPO, which enables unified optimization across multiple turns with varying LLM contexts. This approach significantly improves performance and maintains stable context lengths in multi-turn interactions.

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
This paper addresses the "Modality Gap," where visual and linguistic embeddings for the same meaning are systematically offset. The authors propose the "Fixed-frame Modality Gap Theory" to precisely model this gap as stable biases and anisotropic residuals. Based on this, they introduce "ReAlign," a training-free strategy that aligns text representations to image distributions using statistics from unpaired data.

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
OASES trains agentic search models by generating intermediate rewards that are aligned with the final task outcome. It achieves this by evaluating how well each search step contributes to answering the original question, providing more reliable supervision than traditional methods. This outcome-aligned process reward mechanism is the core contribution, improving the agent's ability to acquire evidence effectively.

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary
This paper argues that tool-augmented agents should only use external tools when their internal reasoning is insufficient to reliably complete a task. It introduces the Theory of Agent (ToA) framework, which views agents as making sequential decisions about resolving uncertainty internally or delegating it externally. The core contribution is a principled approach to tool use, distinguishing necessary delegation from unnecessary actions and explaining common agent failures as miscalibrated uncertainty resolution.

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory introduces a unified infrastructure for training trustworthy autonomous agents. It integrates parallel simulation for generating diverse agent experiences, a trustworthy data platform for managing and extracting insights from these experiences, and an autonomous evolution platform for continuous learning and improvement. This closed-loop system aims to systematically discover and mitigate risks in long-horizon decision-making and real-world interaction for advanced AI.

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
VESPO addresses the high variance issue in off-policy LLM training by introducing a principled, closed-form sequence-level reshaping kernel. This kernel explicitly incorporates variance reduction into a variational framework, directly operating on importance weights without heuristic token-level approximations. VESPO's key contribution is a theoretically grounded method for stable off-policy LLM training that demonstrably reduces variance.

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
This paper introduces CyBiasBench, a benchmark designed to quantify attack-selection bias in LLM agents used for cybersecurity. The core method involves evaluating five LLM agents across various scenarios to reveal their tendency to disproportionately focus on specific attack families, independent of prompt variations. The main contribution is the identification and characterization of this "attack-selection bias" as an inherent agent trait, demonstrating that LLM agents exhibit distinct and persistent preferences in their offensive strategies.

Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD addresses bottlenecks in multi-task flow matching models by using on-policy distillation. It first trains specialized "teacher" models for individual tasks, then distills their expertise into a single "student" model through a novel two-stage alignment process. This approach aims to overcome reward sparsity and gradient interference, leading to improved performance across multiple objectives.

KL for a KL: On-Policy Distillation with Control Variate Baseline
This paper introduces vOPD, a method to stabilize On-Policy Distillation (OPD) for large language models. It achieves this by framing OPD as policy-gradient reinforcement learning and incorporating a control variate baseline, specifically a value function. The key contribution is that this value function has a closed-form solution derived from the student and teacher models' existing forward pass, avoiding the computational overhead of previous stabilization techniques.

Learning CLI Agents with Structured Action Credit under Selective Observation
This paper introduces a novel approach for training command-line interface (CLI) agents by leveraging the inherent structure of CLI actions. To address challenges of partial observation and sparse rewards, it proposes $σ$-Reveal to selectively extract relevant context and Action Advantage Assignment to better attribute credit to actions within long interaction sequences. The core contribution lies in using structured action information as a learning signal, improving agent performance on complex CLI tasks.

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
This paper investigates whether advanced Large Reasoning Models (LRMs) can replicate human learning and planning in novel video games. By analyzing human gameplay with fMRI data, the study finds that LRMs match human learning behaviors and predict brain activity better than reinforcement learning agents do. This suggests LRMs exhibit a more human-like approach to acquiring and applying abstract knowledge in complex environments.

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
TraceFix is a verification-first pipeline that uses TLA+ model checking to automatically repair LLM multi-agent coordination protocols. An LLM agent synthesizes a protocol, generates TLA+ logic, and iteratively refines it using counterexamples until verified. This verified protocol is then compiled into system prompts, ensuring robust and efficient agent coordination.

ADKO: Agentic Decentralized Knowledge Optimization
ADKO is a framework for collaborative black-box optimization among autonomous agents. Its core method involves each agent maintaining a private Gaussian Process surrogate and communicating only through "knowledge tokens," which are compressed summaries of their findings. By avoiding raw data sharing, this approach achieves sample efficiency, preserves privacy, and handles diverse objectives. Its contribution lies in the formal analysis of information loss from token compression and language model approximation.

Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs
This paper proposes an Augmented Model Manipulation (AugMP) strategy to attack federated fine-tuning (FFT) of LLMs. The core method uses graph representation learning to understand benign model updates and generate more effective and stealthy malicious updates. The contribution is a novel attack that leverages these insights to corrupt the global LLM during collaborative fine-tuning.

Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback
This paper introduces SPEAR, an online federated learning algorithm for LLMs that enhances self-play. SPEAR leverages real-time user feedback to create advantage-weighted contrastive pairs, enabling efficient fine-tuning on resource-constrained edge devices without requiring privileged ground-truth data. Its core contribution is enabling continuous self-improvement of LLMs in a federated setting by effectively utilizing natural feedback loops.

Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
This paper investigates when clarification is most valuable for long-horizon AI agents. The authors introduce a framework to inject clarifications at different stages of execution and find that the optimal timing depends on the type of missing information. Specifically, goal clarifications are most effective early on, while input clarifications remain valuable throughout the agent's task.

Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement
This paper introduces LANCE, a method to reduce "rigid rejection" in LLMs by enhancing safety labels. LANCE uses variational inference to predict a continuous distribution of rejection categories, providing nuanced gradients that allow LLMs to neutralize harmful prompt elements and generate safer, more natural responses instead of generic refusals.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
This paper introduces AutoTTS, a framework that uses an agentic approach to automatically discover optimal test-time scaling (TTS) strategies for large language models. Instead of manual tuning, AutoTTS creates environments where TTS strategies can be learned efficiently by synthesizing controllers that decide how to allocate computation during inference, based on cheap feedback signals. This allows for a more comprehensive exploration of the computation-allocation space, leading to improved LLM performance.

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new benchmark designed to evaluate Large Language Models (LLMs) and agents on predicting cellular phenotypes from CRISPR screens. It addresses the lack of standardized evaluation for this task, which is crucial for accelerating biological discovery and drug development. The benchmark utilizes a large dataset of publicly available CRISPR screens to assess the models' ability to handle heterogeneous biological data and predict diverse phenotypic outcomes.

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
This paper introduces DRAPE, a novel framework for Multimodal Continual Instruction Tuning (MCIT). DRAPE addresses catastrophic forgetting in MLLMs by dynamically generating instance-specific soft prompts, adapting to individual query-image pairs rather than relying on fixed task-level modules. This instance-level adaptation allows for more flexible and effective learning of new capabilities while preserving existing knowledge.

MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study
This paper introduces MATRA, a pragmatic threat modeling framework for agentic AI systems. MATRA adapts existing risk assessment methods to systematically identify and quantify risks by first assessing asset impact and then using attack trees to determine likelihood. The authors demonstrate MATRA's effectiveness on an OpenClaw case study, showing how architectural controls can mitigate identified risks.

Probing Cross-modal Information Hubs in Audio-Visual LLMs
This paper investigates how audio and visual information is processed and integrated within Audio-Visual Large Language Models (AVLLMs). The core method involves analyzing token representations to understand where information from one modality is encoded in the other. The key contribution is the discovery that AVLLMs primarily use "sink tokens" to integrate cross-modal information, and that this integration is not uniform but concentrated in a specific subset of these sink tokens.

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
This paper investigates the trade-off between accuracy and cost when using LLMs as judges. It finds that explicit reasoning significantly improves performance on complex tasks but incurs higher costs, suggesting selective use. The authors propose RACER, a method that adaptively routes requests to reasoning or non-reasoning judges within a budget, accounting for potential distribution shifts.
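
To make the budgeted-routing idea concrete, here is a deliberately simple greedy heuristic. It is not RACER, which per the summary also accounts for distribution shift. The function name, the per-request gain estimates, and the cost figures are all hypothetical.

```python
def route_within_budget(requests, budget):
    """Pick which requests go to the (costlier) reasoning judge.

    requests: list of (request_id, estimated_gain, extra_cost) tuples,
              where extra_cost is assumed to be strictly positive.
    Returns the set of ids routed to the reasoning judge; all other
    requests would go to the cheaper non-reasoning judge.
    """
    # Prefer requests with the highest estimated gain per unit of cost.
    ranked = sorted(requests, key=lambda r: r[1] / r[2], reverse=True)
    chosen, spent = set(), 0.0
    for request_id, gain, cost in ranked:
        if gain > 0 and spent + cost <= budget:
            chosen.add(request_id)
            spent += cost
    return chosen

# Made-up accuracy gains from reasoning, each at one unit of extra cost.
requests = [
    ("easy", 0.01, 1.0),
    ("hard", 0.30, 1.0),
    ("medium", 0.10, 1.0),
]
routed = route_within_budget(requests, budget=2.0)
```

With a budget of two units, the two requests expected to benefit most from explicit reasoning are routed to the reasoning judge and the easy one is not.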

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
This paper proposes a novel rate-distortion framework for agent memory, shifting focus from descriptive memory quality to its impact on decision-making. The core method frames memory compression as a decision-centric problem, where memory quality is measured by the loss in achievable decision quality. The main contribution is a theoretical framework that defines an exact forgetting boundary and an optimal memory-distortion frontier, leading to an online memory learner (DeMem) that efficiently manages memory by only refining it when necessary to avoid decision conflicts.

Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
This paper investigates whether lexical retrieval is sufficient for agentic search with advanced LLMs. The authors introduce Pi-Serini, a search agent that pairs a well-tuned BM25 lexical retriever with capable LLMs. Their findings demonstrate that a sufficiently deep and optimized lexical retriever, when combined with powerful LLMs, can achieve high accuracy in deep research tasks, even surpassing agents using dense retrievers.
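
For readers unfamiliar with BM25, the following is a minimal, self-contained sketch of the scoring function. It uses the Lucene-style IDF and commonly cited parameter values (k1 = 1.2, b = 0.75); these values and the toy corpus are assumptions for illustration, not Pi-Serini's configuration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # Length normalization dampens scores of longer documents.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "lexical retrieval with bm25".split(),
    "dense retrieval with neural encoders".split(),
    "cooking pasta at home".split(),
]
scores = bm25_scores("bm25 lexical retrieval".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The first document matches all three query terms and ranks highest; the third shares no terms with the query and scores zero.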

SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM is a plug-and-play framework that decomposes LLM hidden states into sparse, property-aligned features using a Sparse Autoencoder. This allows for precise steering in the latent space to control molecular properties, improving editing success rates without altering the LLM's parameters. The method also enables interpretable analysis of the editing process.

Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights
This paper empirically evaluates domain-adapted language models (LLMs and SLMs) for structured threat modeling using the STRIDE approach in 5G security. The core method involves systematically analyzing the impact of domain adaptation, model size, decoding strategies, and prompting techniques on threat classification accuracy. The main contribution is providing insights into how these factors influence the effectiveness of language models in cybersecurity threat modeling.

AttenA+: Rectifying Action Inequality in Robotic Foundation Models
This paper introduces AttenA+, a framework that addresses the "action inequality" in robotic foundation models. It recognizes that low-velocity actions are often more critical for task success than high-velocity transitions. AttenA+ rectifies this by reweighting the training objective based on inverse velocity, prioritizing kinematically critical segments through a novel attention mechanism. This approach aims to improve the performance of Vision-Language-Action and World-Action models on complex, long-horizon robotic tasks.
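
The reweighting step can be illustrated with a small sketch. It assumes per-step weights proportional to 1 / (speed + eps), normalized to mean 1; the exact weighting form, the eps constant, and the function names are hypothetical, and the attention mechanism mentioned in the summary is not modeled here.

```python
def inverse_velocity_weights(speeds, eps=1e-3):
    """Per-step loss weights proportional to 1 / (speed + eps), with mean 1.

    speeds: non-negative end-effector speeds, one per trajectory step.
    eps keeps the weight finite when the robot is nearly stationary.
    """
    raw = [1.0 / (s + eps) for s in speeds]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]

def weighted_loss(per_step_losses, speeds):
    """Average of per-step losses after inverse-velocity reweighting."""
    weights = inverse_velocity_weights(speeds)
    total = sum(w * loss for w, loss in zip(weights, per_step_losses))
    return total / len(weights)

# Made-up trajectory: fast transit followed by slow, fine manipulation.
speeds = [0.80, 0.75, 0.05, 0.02]
weights = inverse_velocity_weights(speeds)
loss = weighted_loss([1.0, 1.0, 1.0, 1.0], speeds)
```

Normalizing the weights to mean 1 keeps the overall loss scale comparable to the unweighted objective while shifting emphasis toward the slow segments.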

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
This paper investigates whether low-rank pre-training methods for large language models generalize as well as full-rank training, a question previously addressed only by limited perplexity metrics. The authors provide a more thorough comparison by analyzing the geometric and spectral properties of the solutions found by five different low-rank methods, revealing how rank constraints impact model representations beyond simple perplexity scores. Their contribution lies in offering a deeper understanding of low-rank pre-training's effectiveness and its fundamental differences from full-rank training.

Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety
This paper fine-tunes compact LLMs (8B parameters) on expert-designed children's reading curricula and existing generated stories. The core method focuses on controllable difficulty and safety, enabling educators to target specific reading levels. The main contribution is demonstrating that these fine-tuned, smaller LLMs can generate English reading stories that are more appropriate in difficulty for children than those produced by larger, zero-shot models, while remaining cost-effective.

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
EVA-Bench is an end-to-end framework for evaluating voice agents. Its core method involves generating realistic, multi-turn bot-to-bot audio conversations with automatic validation and introducing two composite metrics, EVA-A (Accuracy) and EVA-X (Experience), to measure task completion, speech fidelity, and conversational quality. This addresses the limitations of existing benchmarks in simulating realistic dialogues and capturing voice-specific failure modes.

Harnessing Agentic Evolution
This paper introduces AEvo, a meta-editing framework for agentic evolution. AEvo treats the evolutionary process as an interactive environment, using accumulated evidence as its state. Its core contribution is a meta-agent that revises the evolutionary mechanism itself, rather than directly generating candidates, to improve long-horizon evolution and prevent drift.

Position: Assistive Agents Need Accessibility Alignment
This paper argues that assistive AI agents for visually impaired users must prioritize "accessibility alignment" as a core design goal, not an afterthought. Current agentic AI fails in assistive scenarios due to mismatches with sighted-user assumptions regarding verification, risk, and interaction. The authors propose a new lifecycle-oriented design pipeline to create accessibility-aligned agents.

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
This paper introduces RealICU, a novel benchmark for evaluating LLMs on long-context ICU data. Unlike previous benchmarks that rely on potentially suboptimal clinician actions, RealICU uses hindsight annotations from senior physicians reviewing complete patient trajectories. This allows for a more accurate assessment of LLM reasoning capabilities across tasks like patient status assessment, problem identification, and action recommendation.

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles
ScioMind is a novel multi-agent social simulation framework that integrates structured opinion dynamics with LLM-based agent reasoning. Its core method combines a personality-conditioned belief update rule with a hierarchical memory architecture and dynamic agent profiles, allowing for cognitively grounded and evolving agent behavior. This approach addresses limitations of existing methods by offering a more realistic and nuanced simulation of social opinion dynamics.

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
This paper introduces IMAVB, a benchmark to test omnimodal LLMs' ability to detect contradictions between text and their own sensory input. The core finding is a "Representation-Action Gap," where models internally represent mismatches but fail to reject false textual claims in their outputs. This highlights a critical limitation in their grounding capabilities.

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
This paper introduces a novel method for detecting hallucinations in large language models at the step-by-step reasoning level, rather than just the overall output. It proposes that correct reasoning follows a stable path in the model's hidden states, while errors cause deviations. The core contribution is a geometric approach that identifies these deviations by analyzing the "transport cost" between hidden states, allowing for precise localization of the first hallucination.
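
The localization idea can be sketched as follows. The summary does not specify how the transport cost is defined, so plain Euclidean distance between consecutive per-step hidden states stands in for it here; the fixed threshold, the two-dimensional toy states, and the function names are likewise assumptions.

```python
import math

def step_costs(states):
    """Stand-in transport cost: distance between consecutive step states."""
    return [math.dist(a, b) for a, b in zip(states, states[1:])]

def first_deviation(states, threshold):
    """Index of the first step whose incoming cost exceeds the threshold.

    Returns None when every transition stays within the threshold.
    """
    for index, cost in enumerate(step_costs(states), start=1):
        if cost > threshold:
            return index
    return None

# Toy per-step hidden states (real ones would be high-dimensional).
states = [
    [0.0, 0.0],
    [0.1, 0.0],
    [0.2, 0.1],
    [1.5, 1.4],  # large jump: candidate first faulty step
    [1.6, 1.5],
]
flagged = first_deviation(states, threshold=0.5)
```

Scanning transitions in order and stopping at the first outlier is what yields a step-level location rather than a single verdict on the whole output.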

Learning POMDP World Models from Observations with Language-Model Priors
This paper introduces Pinductor, a method that uses Large Language Models (LLMs) to learn world models for partially observable environments (POMDPs). Pinductor leverages LLM priors to propose and refine POMDP models from limited observation-action data, significantly improving sample efficiency. Its key contribution is achieving comparable performance to methods with privileged state access, while using less information and outperforming existing sample-inefficient approaches.

MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling
MILM represents multimodal irregular time series as XML-formatted triplets and fine-tunes a large language model (LLM) in two stages. The first stage trains the LLM to predict from sampling patterns alone, while the second stage jointly models patterns and observed values. This approach effectively leverages the predictive power of irregular sampling and multimodal data for tasks like healthcare prediction.
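
As a rough picture of the input format, the sketch below serializes (time, variable, value) triplets to XML. The tag and attribute names (`series`, `obs`, `time`, `var`) and the `include_values` switch are hypothetical, chosen only to show how a pattern-only view and a pattern-plus-values view of the same record could differ.

```python
import xml.etree.ElementTree as ET

def to_xml(observations, include_values=True):
    """Serialize (time, variable, value) triplets as one XML string.

    With include_values=False only the sampling pattern is kept
    (which variable was measured, and when), not the measured value.
    """
    root = ET.Element("series")
    for time, variable, value in observations:
        obs = ET.SubElement(root, "obs", time=str(time), var=variable)
        if include_values:
            obs.text = str(value)
    return ET.tostring(root, encoding="unicode")

# Made-up irregularly sampled clinical readings (hours since admission).
observations = [
    (0.0, "heart_rate", 82),
    (0.5, "lactate", 2.1),
    (3.0, "heart_rate", 95),
]
pattern_only = to_xml(observations, include_values=False)
full_record = to_xml(observations)
```

The pattern-only string corresponds to the kind of input the first training stage would see under this reading, and the full record to the second stage.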

Sampling from Flow Language Models via Marginal-Conditioned Bridges
This paper proposes a novel sampling method for Flow Language Models (FLMs) by leveraging their unique denoiser structure. Instead of collapsing marginal distributions, the method samples a one-hot token from the posterior marginals at each step and then uses an analytic Ornstein-Uhlenbeck bridge conditioned on this sampled token. This "marginal-conditioned bridge" sampling is training-free, efficient, and provides a principled way to generate valid one-hot token sequences.

An LLM-Based System for Argument Reconstruction
This paper introduces an LLM-based system that reconstructs arguments from text into abstract argument graphs. The system uses a multi-stage pipeline to identify claims, premises, and their logical relationships (support, attack, undercut), representing them as directed acyclic graphs. Its contribution lies in providing an end-to-end method for automated argument analysis and structure recovery, evaluated through both manual and quantitative experiments.
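
One way to picture the output structure is a small labeled graph with an acyclicity check. The class, field names, and example statements below are hypothetical, and every relation is simplified to a statement-to-statement edge, which may differ from how the system itself models undercuts.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter

RELATIONS = {"support", "attack", "undercut"}

@dataclass
class ArgumentGraph:
    statements: dict = field(default_factory=dict)  # id -> text
    edges: list = field(default_factory=list)       # (source, relation, target)

    def add_statement(self, node_id, text):
        self.statements[node_id] = text

    def add_edge(self, source, relation, target):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges.append((source, relation, target))

    def is_acyclic(self):
        """True when the edges admit a topological order (no cycles)."""
        predecessors = {node_id: set() for node_id in self.statements}
        for source, _, target in self.edges:
            predecessors.setdefault(target, set()).add(source)
        try:
            list(TopologicalSorter(predecessors).static_order())
            return True
        except CycleError:
            return False

graph = ArgumentGraph()
graph.add_statement("c1", "The city should add bike lanes.")
graph.add_statement("p1", "Bike lanes reduce traffic injuries.")
graph.add_statement("p2", "Bike lanes remove parking spaces.")
graph.add_edge("p1", "support", "c1")
graph.add_edge("p2", "attack", "c1")
```

A check like `is_acyclic` is a cheap way to validate that a reconstructed graph respects the directed-acyclic constraint before any further analysis.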

FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile optimizes structured LLM workflows by treating them as a compilation problem, not just an inference-time routing problem. It globally explores the design space of sub-agent configurations before deployment to create reusable workflow-level configurations that balance accuracy and latency across various trade-offs. This compilation approach allows for pre-computed, optimized workflow structures, improving efficiency and performance.

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
This paper introduces TFlow, a novel communication method for multi-agent LLM systems. Instead of exchanging text, TFlow allows agents to directly update the receiver's internal weights with learned, low-rank perturbations. This significantly reduces computational costs and memory usage by enabling instance-level adaptation without permanent model changes.

Inducing Artificial Uncertainty in Language Models
This paper introduces a method to induce artificial uncertainty in language models, particularly when challenging data for training uncertainty quantification is scarce. The core idea is to train models to express uncertainty even on simple examples, thereby improving their ability to signal uncertainty on genuinely difficult or unseen data. This approach aims to overcome the limitations of traditional supervised uncertainty quantification methods as language models saturate training datasets.

A Multi-Memory Segment System for Generating High-Quality Long-Term Memory Content in Agents
This paper introduces a Multi-Memory Segment System (MMS) to generate higher-quality long-term memory content for agents. Inspired by cognitive psychology, MMS processes short-term memory into multiple distinct long-term memory segments, creating corresponding retrieval and contextual memory units. This approach aims to overcome the limitations of simple summarization, thereby improving both memory recall and response quality.

Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems
This paper proposes an active learning method to optimize communication structures in LLM-based multi-agent systems. Instead of random task sampling, it uses an ensemble-based information-theoretic framework to identify the most informative tasks for improving communication. This approach efficiently estimates task value by measuring how much a task alters the distribution of communication parameters, leading to more stable and effective optimization under limited training budgets.

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
AgentProg tackles the challenge of long-horizon GUI agent context management by representing interaction history as a program. This program structure guides information retention and discarding, mitigating context overhead. The paper also introduces a global belief state for handling partial observability and environmental changes, improving agent robustness.

Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
This paper introduces Android Coach, a framework to improve the efficiency of training Android agents with online reinforcement learning. It addresses the costly nature of emulator interactions by shifting from a "single state, single action" to a "single state, multiple actions" paradigm. This allows the agent to explore more actions from a single emulator state, significantly reducing training time and cost.

Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation
This paper introduces a novel method to assess the behavioral coherence of LLM agents by first identifying their underlying latent profiles and then testing their consistency in conversational settings. The core contribution is demonstrating that LLM agents often exhibit significant behavioral inconsistencies, challenging their direct substitution for human participants in social simulations.

ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning
ARMOR is an agentic framework that addresses the challenge of reaction feasibility prediction by adaptively leveraging multiple AI tools. It models tool-specific utilities and prioritizes them hierarchically, resolving conflicts to produce more accurate predictions than single-tool or simple aggregation methods. This adaptive, utility-aware multi-tool reasoning represents ARMOR's core contribution to improving computational chemistry predictions.

ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule
This paper introduces Adaptive Reparameterized Time (ART), a method to optimize the timestep schedule for diffusion model sampling. ART learns a reparameterized time variable to dynamically adjust computation across the sampling trajectory, minimizing discretization error. The contribution is a reinforcement learning framework (ART-RL) that provides a principled way to find the optimal ART schedule, bridging continuous-time RL with deterministic optimization.

(How) Do Large Language Models Understand High-Level Message Sequence Charts?
This paper investigates whether Large Language Models (LLMs) truly understand the formal semantics of High-Level Message Sequence Charts (HMSCs), a crucial visual modeling language. The researchers tested three LLMs on 129 semantic tasks, ranging from basic queries to complex abstractions and trace calculations, to assess their consistency with HMSC semantics. The study's contribution lies in its rigorous evaluation of LLM comprehension of formal software design artifacts.
