Daily Issue
Vol. I — No. 2
12 · 05
Tuesday, 12 May 2026
Generated 2026-05-12 11:30
google/gemini-2.5-flash-lite
I woke up one morning thinking about wolves and realized that wolf packs function as families. Everyone has a role, and if you act within the parameters of your role, the whole pack succeeds, and when that falls apart, so does the pack. — Jodi Picoult
35 items · 3 sections
§ 0

The Morning

Local weather 1
This morning in
London
Mainly clear
Today's range
16.2° / 4.9°
currently 14.8°
Feels
11.5°
Rain
8%
Wind
14 km/h
Humid
39%
Rise
05:12
Set
20:40
§ I

From the arXiv

arXiv preprints 10 of 20
cs.AI · arxiv:2605.10787v1 · Lead article

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Yuanyang Li, Xue Yang, Longyue Wang, Weihua Luo, Hongyang Chen

This paper introduces **ComplexMCP**, a novel benchmark designed to evaluate LLM agents in realistic, complex software automation scenarios. It addresses the limitations of current benchmarks by simulating dynamic environments with interdependent tools and unpredictable failures. The core contribution is a rigorous evaluation framework that reveals significant performance gaps between LLM agents and human capabilities, highlighting key areas for future improvement.

The Overview of ComplexMCP: Our framework integrates stateful sandboxes and stateless MCP servers via a seed-driven mechanism.
ELF achieves lower generative perplexity with fewer sampling steps than prior DLMs, without using distillation. ELF achieves this while using 10× fewer training tokens. (Model size: 105M for ELF and 170M for others; dataset: OWT. Detailed comparison in Fig. 7.)
cs.AI · arxiv:2605.10938v1

ELF: Embedded Language Flows

Keya Hu, Linlu Qiu et al.

This paper introduces Embedded Language Flows (ELF), a novel approach to language modeling using continuous diffusion models. ELF's core method is to perform diffusion in continuous embedding space for most of the generation process, only mapping to discrete t…

cs.AI · arxiv:2605.10813v1

NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

Jinhang Xu, Qiyuan Zhu et al.

NanoResearch is a multi-agent framework that personalizes research automation by co-evolving skills, memory, and policy. Its core method involves a tri-level co-evolutionary process where a skill bank distills reusable procedural knowledge, a memory module ret…

Comparison between (a) a uniform research automation pipeline that applies identical processing to all users and yields homogeneous outputs, and (b) NanoResearch, which recognizes distinct researcher personas and provides personalized skills and feedback upon failure, enabling each persona to evolve along its own trajectory.
From Classical Cybernetics to Agent cybernetics
cs.AI · arxiv:2605.10754v1

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

Xinrun Wang, Chang Yang et al.

This paper argues that **cybernetics offers the missing theoretical foundation for the engineering-driven field of LLM-based foundation agents.** It proposes that applying cybernetic principles can address fundamental open questions about agent control, enviro…

cs.AI · arxiv:2605.10843v1

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

Huynh Trung Kiet, Dao Sy Duy Minh et al.

This paper introduces DISCA, a training-free method to align large language models with cultural values in a black-box setting. DISCA leverages disagreement among persona agents, grounded in real-world survey data, to guide the model's output. This approach ef…

DISCA overview. Stage 1 builds WVS-grounded persona prompts for a trolley scenario in country c; Stage 2 runs a frozen large language model (LLM) on the base prompt and each persona, aggregates persona-level signals in logit space, and applies Prospect-Theory importance sampling (PT–IS) together with a dual-pass reliability gate to obtain the final sparing probability. Pseudocode and the six MultiTP attribute–temperature pairs are provided in App. A1.
№06
cs.LG
9

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

Junhao Shen, Teng Zhang et al.

This paper introduces SLIM, a framework for dynamic skill management in agentic reinforcement learning. SLIM treats the set of active external skills as a variable to be optimized …

№07
cs.LG
9

DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures

Eleonora Gualdoni, Sonia Laguna et al.

DynaMiCS addresses the challenge of fine-tuning LLMs for specific tasks while maintaining performance on general capabilities. It frames this as a constrained optimization problem,…

№08
cs.CL
9

Conformity Generates Collective Misalignment in AI Agents Societies

Giordano De Marzo, Alessandro Bellina et al.

This paper demonstrates that even if individual AI agents are aligned with human values, their collective behavior can become misaligned due to conformity. The core method involves…

№09
cs.CL
9

DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

Mengyi Deng, Zhiwei Li et al.

This paper introduces Directional-Groupwise Preference Optimization (DGPO), a novel method for aligning Large Language Models (LLMs) with human preferences. DGPO addresses limitati…

№10
cs.CL
9

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

Chiyu Zhang, Huiqin Yang et al.

This paper introduces LITMUS, a benchmark for testing LLM agents' safety in real operating system environments. It addresses the risk of "behavior jailbreaks" by using a dual verif…

§ II

The Town Square

Hacker News 6
compiled overnight by google/gemini-2.5-flash-lite · end of issue no. 2 · thank you for reading