From the arXiv
Tuesday, 12 May 2026 · 20 papers
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
This paper introduces **ComplexMCP**, a novel benchmark designed to evaluate LLM agents in realistic, complex software automation scenarios. It addresses the limitations of current benchmarks by simulating dynamic environments with interdependent tools and unpredictable failures. The core contribution is a rigorous eva…
ELF: Embedded Language Flows
This paper introduces Embedded Language Flows (ELF), a novel approach to language modeling using continuous diffusion models. ELF's core method is to perform diffusion in continuous embedding space for most of the generation process, only mapping to discrete tokens at the final step. This allows ELF to leverage success…
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
NanoResearch is a multi-agent framework that personalizes research automation by co-evolving skills, memory, and policy. Its core method involves a tri-level co-evolutionary process where a skill bank distills reusable procedural knowledge, a memory module retains user-specific experience, and a policy module internali…
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
This paper argues that **cybernetics offers the missing theoretical foundation for the engineering-driven field of LLM-based foundation agents.** It proposes that applying cybernetic principles can address fundamental open questions about agent control, environmental adaptation, and safe self-improvement, moving beyond…
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
This paper introduces DISCA, a training-free method to align large language models with cultural values in a black-box setting. DISCA leverages disagreement among persona agents, grounded in real-world survey data, to guide the model's output. This approach effectively reduces cultural misalignment without requiring ex…
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
This paper introduces SLIM, a framework for dynamic skill management in agentic reinforcement learning. SLIM treats the set of active external skills as a variable to be optimized alongside the agent's policy. Its core contribution is a method to dynamically manage these skills by estimating their marginal contribution…
DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures
DynaMiCS addresses the challenge of fine-tuning LLMs for specific tasks while maintaining performance on general capabilities. It frames this as a constrained optimization problem, dynamically adjusting data mixture weights at each training step. By probing domain-specific effects, DynaMiCS ensures target-domain improv…
Conformity Generates Collective Misalignment in AI Agents Societies
This paper demonstrates that even if individual AI agents are aligned with human values, their collective behavior can become misaligned due to conformity. The core method involves simulating opinion dynamics where agents are influenced by both their intrinsic biases and the majority opinion. The key contribution is a …
DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization
This paper introduces Directional-Groupwise Preference Optimization (DGPO), a novel method for aligning Large Language Models (LLMs) with human preferences. DGPO addresses limitations in existing pairwise methods by aggregating supervision signals at the group level and explicitly modeling directional consistency throu…
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
This paper introduces LITMUS, a benchmark for testing LLM agents' safety in real operating system environments. It addresses the risk of "behavioral jailbreaks" by using a dual verification mechanism and state rollback to evaluate both semantic and physical-layer harms. LITMUS provides a comprehensive set of test cases a…
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench is a novel benchmark designed to evaluate the real-world performance of AI agents in command-line interfaces. Unlike previous synthetic benchmarks, it features long-horizon, multimodal tasks executed in actual runtimes with real tools. The benchmark's contribution lies in its realistic evaluation of agent…
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new benchmark designed to evaluate Large Language Models (LLMs) and agents on predicting cellular phenotypes from CRISPR screens. It addresses the lack of standardized evaluation for this task, which is crucial for accelerating biological discovery and drug development. The benchmark utilizes a large da…
Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
This paper introduces DRAPE, a novel framework for Multimodal Continual Instruction Tuning (MCIT). DRAPE addresses catastrophic forgetting in MLLMs by dynamically generating instance-specific soft prompts, adapting to individual query-image pairs rather than relying on fixed task-level modules. This instance-level adap…
MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study
This paper introduces MATRA, a pragmatic threat modeling framework for agentic AI systems. MATRA adapts existing risk assessment methods to systematically identify and quantify risks by first assessing asset impact and then using attack trees to determine likelihood. The authors demonstrate MATRA's effectiveness on an …
Probing Cross-modal Information Hubs in Audio-Visual LLMs
This paper investigates how audio and visual information is processed and integrated within Audio-Visual Large Language Models (AVLLMs). The core method involves analyzing token representations to understand where information from one modality is encoded in the other. The key contribution is the discovery that AVLLMs p…
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
This paper investigates the trade-off between accuracy and cost when using LLMs as judges. It finds that explicit reasoning significantly improves performance on complex tasks but incurs higher costs, suggesting that reasoning should be applied selectively. The authors propose RACER, a method that adaptively routes requests to reasoning or non-reaso…
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
This paper proposes a novel rate-distortion framework for agent memory, shifting focus from descriptive memory quality to its impact on decision-making. The core method frames memory compression as a decision-centric problem, where memory quality is measured by the loss in achievable decision quality. The main contribu…
Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
This paper investigates whether lexical retrieval is sufficient for agentic search with advanced LLMs. The authors introduce Pi-Serini, a search agent that pairs a well-tuned BM25 lexical retriever with capable LLMs. Their findings demonstrate that a sufficiently deep and optimized lexical retriever, when combined with…
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM is a plug-and-play framework that decomposes LLM hidden states into sparse, property-aligned features using a Sparse Autoencoder. This allows for precise steering in the latent space to control molecular properties, improving editing success rates without altering the LLM's parameters. The method also enables inte…
Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights
This paper empirically evaluates domain-adapted large and small language models (LLMs and SLMs) for structured threat modeling using the STRIDE approach in 5G security. The core method involves systematically analyzing the impact of domain adaptation, model size, decoding strategies, and prompting techniques on threat classification a…