№01
cs.AI arxiv:2605.10787v1

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Yuanyang Li, Xue Yang, Longyue Wang et al.

This paper introduces **ComplexMCP**, a novel benchmark designed to evaluate LLM agents in realistic, complex software automation scenarios. It addresses the limitations of current benchmarks by simulating dynamic environments with interdependent tools and unpredictable failures. The core contribution is a rigorous eva…

9
№02
cs.AI arxiv:2605.10938v1

ELF: Embedded Language Flows

Keya Hu, Linlu Qiu, Yiyang Lu et al.

This paper introduces Embedded Language Flows (ELF), a novel approach to language modeling using continuous diffusion models. ELF's core method is to perform diffusion in continuous embedding space for most of the generation process, only mapping to discrete tokens at the final step. This allows ELF to leverage success…

9
№03
cs.AI arxiv:2605.10813v1

NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

Jinhang Xu, Qiyuan Zhu, Yujun Wu et al.

NanoResearch is a multi-agent framework that personalizes research automation by co-evolving skills, memory, and policy. Its core method involves a tri-level co-evolutionary process where a skill bank distills reusable procedural knowledge, a memory module retains user-specific experience, and a policy module internali…

9
№04
cs.AI arxiv:2605.10754v1

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

Xinrun Wang, Chang Yang, He Zhao et al.

This paper argues that **cybernetics offers the missing theoretical foundation for the engineering-driven field of LLM-based foundation agents.** It proposes that applying cybernetic principles can address fundamental open questions about agent control, environmental adaptation, and safe self-improvement, moving beyond…

9
№05
cs.AI arxiv:2605.10843v1

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

Huynh Trung Kiet, Dao Sy Duy Minh, Tuan Nguyen et al.

This paper introduces DISCA, a training-free method to align large language models with cultural values in a black-box setting. DISCA leverages disagreement among persona agents, grounded in real-world survey data, to guide the model's output. This approach effectively reduces cultural misalignment without requiring ex…

9
№06
cs.LG arxiv:2605.10923v1

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

Junhao Shen, Teng Zhang, Xiaoyan Zhao et al.

This paper introduces SLIM, a framework for dynamic skill management in agentic reinforcement learning. SLIM treats the set of active external skills as a variable to be optimized alongside the agent's policy. Its core contribution is a method to dynamically manage these skills by estimating their marginal contribution…
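
One simple way to picture "marginal contribution" is a leave-one-out estimate: score the agent with the full active skill set, then again with one skill removed, and take the difference. The sketch below is illustrative only and is not taken from the paper; `marginal_contributions`, `prune`, the `evaluate` callback, and the toy skill values are all hypothetical names and numbers.

```python
def marginal_contributions(active, evaluate):
    """Leave-one-out estimate: score(all skills) - score(all skills except s)."""
    full = frozenset(active)
    base = evaluate(full)
    return {s: base - evaluate(full - {s}) for s in full}


def prune(active, evaluate, threshold=0.0):
    """Keep only skills whose estimated contribution exceeds the threshold."""
    contrib = marginal_contributions(active, evaluate)
    return {s for s in active if contrib[s] > threshold}


# Toy, additive evaluator (an assumption): each skill adds a fixed amount of reward.
TOY_VALUES = {"search": 0.3, "calc": 0.2, "noise": -0.1}


def toy_evaluate(skills):
    return sum(TOY_VALUES[s] for s in skills)


kept = prune({"search", "calc", "noise"}, toy_evaluate)
print(sorted(kept))
```

In a real agentic RL loop the evaluator would be a noisy rollout return rather than a fixed table, so the estimate would need averaging over episodes.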

9
№07
cs.LG arxiv:2605.10770v1

DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures

Eleonora Gualdoni, Sonia Laguna, Louis Bethune et al.

DynaMiCS addresses the challenge of fine-tuning LLMs for specific tasks while maintaining performance on general capabilities. It frames this as a constrained optimization problem, dynamically adjusting data mixture weights at each training step. By probing domain-specific effects, DynaMiCS ensures target-domain improv…

9
№08
cs.CL arxiv:2605.10721v1

Conformity Generates Collective Misalignment in AI Agents Societies

Giordano De Marzo, Alessandro Bellina, Claudio Castellano et al.

This paper demonstrates that even if individual AI agents are aligned with human values, their collective behavior can become misaligned due to conformity. The core method involves simulating opinion dynamics where agents are influenced by both their intrinsic biases and the majority opinion. The key contribution is a …
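
To make the mechanism concrete, here is a toy mean-field sketch of opinion dynamics with an intrinsic bias and a pull toward the majority. It is an assumption-laden illustration, not the paper's model: `step`, `run`, the `bias` and `conformity` parameters, and the linear mixing rule are all hypothetical. `x` is the fraction of agents holding the "aligned" opinion.

```python
def step(x, bias, conformity):
    """One update of the aligned fraction x.

    With weight `conformity` agents copy the current majority opinion;
    otherwise they follow their intrinsic bias toward the aligned opinion.
    """
    majority = 1.0 if x > 0.5 else 0.0  # ties count as a misaligned majority
    return (1 - conformity) * bias + conformity * majority


def run(x0, bias, conformity, steps=50):
    """Iterate the update from an initial aligned fraction x0."""
    x = x0
    for _ in range(steps):
        x = step(x, bias, conformity)
    return x


# Agents individually prefer the aligned opinion (bias = 0.7),
# but the population starts with a misaligned majority (x0 = 0.2).
strong = run(0.2, bias=0.7, conformity=0.8)  # stays locked in the misaligned state
weak = run(0.2, bias=0.7, conformity=0.2)    # bias wins, majority flips to aligned
print(round(strong, 2), round(weak, 2))
```

In this toy setting, strong conformity keeps the population near 0.14 aligned even though every agent is biased toward alignment, while weak conformity lets the population recover to about 0.76.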

9
№09
cs.CL arxiv:2605.10863v1

DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

Mengyi Deng, Zhiwei Li, Xin Li et al.

This paper introduces Directional-Groupwise Preference Optimization (DGPO), a novel method for aligning Large Language Models (LLMs) with human preferences. DGPO addresses limitations in existing pairwise methods by aggregating supervision signals at the group level and explicitly modeling directional consistency throu…

9
№10
cs.CL arxiv:2605.10779v1

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

Chiyu Zhang, Huiqin Yang, Bendong Jiang et al.

This paper introduces LITMUS, a benchmark for testing LLM agents' safety in real operating system environments. It addresses the risk of "behavioral jailbreaks" by using a dual verification mechanism and state rollback to evaluate both semantic and physical-layer harms. LITMUS provides a comprehensive set of test cases a…

9
№11
cs.CL arxiv:2605.10912v1

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing et al.

WildClawBench is a novel benchmark designed to evaluate the real-world performance of AI agents in command-line interfaces. It features long-horizon, multimodal tasks executed in actual runtimes with real tools, unlike previous synthetic benchmarks. The benchmark's contribution lies in its realistic evaluation of agent…

9
№12
cs.AI arxiv:2605.10876v1

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

Edward De Brouwer, Carl Edwards, Alexander Wu et al.

AssayBench is a new benchmark designed to evaluate Large Language Models (LLMs) and agents on predicting cellular phenotypes from CRISPR screens. It addresses the lack of standardized evaluation for this task, which is crucial for accelerating biological discovery and drug development. The benchmark utilizes a large da…

8
№13
cs.AI arxiv:2605.10765v1

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

Tao Hu, Da-Wei Zhou

This paper introduces DRAPE, a novel framework for Multimodal Continual Instruction Tuning (MCIT). DRAPE addresses catastrophic forgetting in MLLMs by dynamically generating instance-specific soft prompts, adapting to individual query-image pairs rather than relying on fixed task-level modules. This instance-level adap…

8
№14
cs.AI arxiv:2605.10763v1

MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study

Tim Van hamme, Thomas Vissers, Javier Carnerero-Cano et al.

This paper introduces MATRA, a pragmatic threat modeling framework for agentic AI systems. MATRA adapts existing risk assessment methods to systematically identify and quantify risks by first assessing asset impact and then using attack trees to determine likelihood. The authors demonstrate MATRA's effectiveness on an …
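
The impact-then-likelihood idea can be sketched with a generic attack tree. The code below is a minimal illustration under stated assumptions and is not MATRA's actual scoring scheme: the `Node` class, the AND/OR combination rules (which assume independent attack steps), the `risk = impact * likelihood` formula, and every number in the example are hypothetical.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Node:
    name: str
    gate: str = "LEAF"  # "LEAF", "AND", or "OR"
    p: float = 0.0      # success probability, used for leaves only
    children: list = field(default_factory=list)


def likelihood(node):
    """Probability that the attack goal at this node is reached."""
    if node.gate == "LEAF":
        return node.p
    ps = [likelihood(c) for c in node.children]
    if node.gate == "AND":  # every child step must succeed
        return prod(ps)
    if node.gate == "OR":   # at least one child step succeeds
        return 1 - prod(1 - q for q in ps)
    raise ValueError(f"unknown gate: {node.gate}")


def risk(impact, root):
    """Assumed risk score: asset impact times attack likelihood."""
    return impact * likelihood(root)


# Hypothetical tree: exfiltrate data via prompt injection OR (stolen token AND open port).
tree = Node("exfiltrate data", "OR", children=[
    Node("prompt injection", p=0.5),
    Node("credential path", "AND", children=[
        Node("stolen token", p=0.5),
        Node("open port", p=0.4),
    ]),
])
print(round(likelihood(tree), 2), round(risk(5, tree), 2))
```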

8
№15
cs.AI arxiv:2605.10815v1

Probing Cross-modal Information Hubs in Audio-Visual LLMs

Jihoo Jung, Chaeyoung Jung, Ji-Hoon Kim et al.

This paper investigates how audio and visual information is processed and integrated within Audio-Visual Large Language Models (AVLLMs). The core method involves analyzing token representations to understand where information from one modality is encoded in the other. The key contribution is the discovery that AVLLMs p…

8
№16
cs.AI arxiv:2605.10805v1

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

Wenbo Zhang, Lijinghua Zhang, Liner Xiang et al.

This paper investigates the trade-off between accuracy and cost when using LLMs as judges. It finds that explicit reasoning significantly improves performance on complex tasks but incurs higher costs, which suggests using it selectively. The authors propose RACER, a method that adaptively routes requests to reasoning or non-reaso…

8
№17
cs.AI arxiv:2605.10870v1

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

Mingxi Zou, Zhihan Guo, Langzhang Liang et al.

This paper proposes a novel rate-distortion framework for agent memory, shifting focus from descriptive memory quality to its impact on decision-making. The core method frames memory compression as a decision-centric problem, where memory quality is measured by the loss in achievable decision quality. The main contribu…

8
№18
cs.AI arxiv:2605.10848v1

Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?

Tz-Huan Hsu, Jheng-Hong Yang, Jimmy Lin

This paper investigates whether lexical retrieval is sufficient for agentic search with advanced LLMs. The authors introduce Pi-Serini, a search agent that pairs a well-tuned BM25 lexical retriever with capable LLMs. Their findings demonstrate that a sufficiently deep and optimized lexical retriever, when combined with…
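
For readers unfamiliar with BM25, the scoring function behind the lexical retriever can be sketched in a few lines. This is a generic textbook-style implementation, not Pi-Serini's code; the `BM25` class, the whitespace `tokenize` helper, the default `k1` and `b` values, and the example documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (real systems use proper analyzers)."""
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            f = tf.get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def search(self, query, k=3):
        """Indices of the top-k documents, best first."""
        order = sorted(range(len(self.docs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "lexical retrieval with bm25 ranking",
]
index = BM25(corpus)
print(index.search("bm25 retrieval", k=1))
```

The `k1` parameter controls term-frequency saturation and `b` controls length normalization; tuning the two is a common way to adjust a BM25 baseline.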

8
№19
cs.AI arxiv:2605.10831v1

SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

Mingxu Zhang, Yuhan Li, Lujundong Li et al.

SLIM is a plug-and-play framework that decomposes LLM hidden states into sparse, property-aligned features using a Sparse Autoencoder. This allows for precise steering in the latent space to control molecular properties, improving editing success rates without altering the LLM's parameters. The method also enables inte…

8
№20
cs.AI arxiv:2605.10808v1

Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights

Saba Pourhanifeh, AbdulAziz AbdulGhaffar, Ashraf Matrawy

This paper empirically evaluates domain-adapted language models (LLMs and SLMs) for structured threat modeling using the STRIDE approach in 5G security. The core method involves systematically analyzing the impact of domain adaptation, model size, decoding strategies, and prompting techniques on threat classification a…

8