From the arXiv
Tuesday, 19 May 2026 · 20 papers
AI for Auto-Research: Roadmap & User Guide
This paper analyzes the AI research lifecycle, from idea generation to dissemination, identifying a critical boundary between reliable AI assistance and unreliable autonomy. While AI excels at structured tasks like literature review and data generation, it falters on more nuanced work, with failure modes such as fabricating results, identi…
Code as Agent Harness
This paper introduces "code as agent harness," a new perspective on how large language models (LLMs) are used in agentic systems. The core method is to view code not just as an output, but as the fundamental infrastructure for agent reasoning, action, and environment modeling. The main contribution is a structured surv…
Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment
This paper argues that LLM agent safety requires a three-layer probabilistic architecture rather than a single layer. Each layer enforces a distinct safety dimension (intent, environment, dynamics) using independently certified probabilistic guarantees, which then serve as the assumptions for the next layer. This compositional approach…
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
This paper introduces SkillGenBench, a novel benchmark designed to evaluate the crucial ability of LLM agents to generate correct and reusable skills from raw data. Unlike previous benchmarks, SkillGenBench specifically isolates and assesses the skill generation process itself. Its core method involves a unified protoc…
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
EnvFactory addresses the challenges of scaling tool-use LLM agents by automatically synthesizing realistic, stateful execution environments from authentic resources. It then generates robust, multi-turn training data by sampling and refining trajectories to capture implicit human intents, rather than over-specified ins…
General Preference Reinforcement Learning
This paper introduces General Preference Reinforcement Learning (GPRL) to bridge the gap between online RL and preference optimization for LLMs. GPRL uses a General Preference Model (GPM) to represent quality as a multi-dimensional, intransitivity-aware comparison, rather than a single scalar reward. This structured ap…
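To see why a single scalar reward cannot capture intransitive preferences, consider the toy sketch below. It is an illustration only, not the paper's GPM: the 2-D embeddings, the skew-symmetric score, and all function names are assumptions chosen for the example. Three hypothetical responses form a cycle (A beats B, B beats C, C beats A), which a pairwise score can represent but no ranking by scalar reward can.

```python
import math
from itertools import permutations

def embed(angle_deg):
    """Hypothetical 2-D embedding: a point on the unit circle."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

def preference_score(u, v):
    """Skew-symmetric score (2-D cross product): positive means u is preferred to v."""
    return u[0] * v[1] - u[1] * v[0]

def scalar_reward_exists(names, beats):
    """Return True if some strict ranking agrees with every (winner, loser) pair."""
    for order in permutations(names):
        rank = {n: i for i, n in enumerate(order)}  # lower index = higher reward
        if all(rank[w] < rank[l] for w, l in beats):
            return True
    return False

# Three illustrative responses spaced 120 degrees apart.
responses = {"A": embed(0), "B": embed(120), "C": embed(240)}

# Collect every ordered pair where the first response is preferred.
beats = [(a, b) for a in responses for b in responses
         if a != b and preference_score(responses[a], responses[b]) > 0]

print(beats)                                         # the preference cycle
print(scalar_reward_exists(list(responses), beats))  # no scalar ranking fits it
```

Because the score is skew-symmetric, swapping the two arguments flips its sign, so every pair has a consistent winner even though no global ordering exists.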
MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion
MA$^{2}$P is a novel framework for complex persuasive dialogue generation that addresses limitations in current approaches. It employs a meta-cognitive, multi-agent architecture to autonomously infer a user's latent mental states and generate targeted, strategy-consistent responses. This framework aims to improve the e…
AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
This paper introduces Asymmetric Meta-Reflective Self-Distillation (AMR-SD) to address the credit-assignment problem in aligning LLMs for complex reasoning. Instead of directly using reference solutions, AMR-SD compresses diagnostic signals into "Socratic hints and critiques" via a reflection bottleneck. This approach …
CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark
This paper introduces CrossView Suite, a comprehensive framework to enhance multimodal large language models' (MLLMs) spatial reasoning across multiple viewpoints. It addresses data scarcity, evaluation limitations, and alignment issues by providing a large-scale dataset (CrossViewSet), a scene-disjoint benchmark (Cros…
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
DashAttention introduces a novel hierarchical attention mechanism that addresses limitations of prior methods. Its core innovation is using an adaptive sparse $α$-entmax transformation to dynamically select relevant key-value blocks based on query relevance, ensuring full differentiability throughout the hierarchy. Thi…
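As a rough illustration of sparse block selection, the sketch below implements sparsemax, which is the $α = 2$ special case of $α$-entmax. This is a simplification and not DashAttention's actual mechanism: the summary does not state which $α$ is used or how block scores are computed, and the `block_scores` values here are made up. The point is only that the transformation assigns exactly zero weight to low-scoring blocks while remaining a piecewise-linear function of the scores.

```python
def sparsemax(scores):
    """Project `scores` onto the probability simplex (entmax with alpha = 2)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k-th largest score stays in the support while this inequality holds.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    # Shift by the threshold and clip: scores below tau get exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]

# Hypothetical query-to-block relevance scores for four key-value blocks.
block_scores = [1.0, 0.8, 0.1, -0.5]
weights = sparsemax(block_scores)

# Blocks with nonzero weight are the ones the query would attend to.
selected = [i for i, w in enumerate(weights) if w > 0.0]
print(weights, selected)
```

Unlike a hard top-k cutoff, the number of selected blocks adapts to the score distribution: closely matched scores keep more blocks, while one dominant score keeps only that block.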
Distilling Tabular Foundation Models for Structured Health Data
This paper addresses the high inference cost of tabular foundation models (TFMs) in healthcare by using knowledge distillation. The core method involves a novel "stratified out-of-fold teacher labeling" technique to prevent context leakage from the TFM teacher. The contribution is demonstrating that lightweight student…
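The general idea behind out-of-fold teacher labeling can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's procedure: `toy_teacher` is a hypothetical nearest-neighbor stand-in for a TFM, the data is one-dimensional, and the fold assignment is a simple round-robin within each class. What it shows is that each row is labeled by a teacher whose context excludes that row, so the teacher cannot echo back the row's own label.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign row indices to k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

def toy_teacher(context_x, context_y, query_x):
    """Hypothetical in-context teacher: returns the label of the nearest context row."""
    nearest = min(range(len(context_x)), key=lambda j: abs(context_x[j] - query_x))
    return float(context_y[nearest])

def out_of_fold_labels(xs, ys, k, teacher):
    """Label every row using a teacher context built only from the other folds."""
    soft = [None] * len(xs)
    for fold in stratified_folds(ys, k):
        held_out = set(fold)
        context = [i for i in range(len(xs)) if i not in held_out]
        cx = [xs[i] for i in context]
        cy = [ys[i] for i in context]
        for i in fold:
            soft[i] = teacher(cx, cy, xs[i])
    return soft

# Made-up toy data: one feature, binary label.
xs = [0.1, 0.2, 0.9, 1.0, 0.5, 0.55]
ys = [0, 0, 1, 1, 0, 1]
teacher_labels = out_of_fold_labels(xs, ys, k=2, teacher=toy_teacher)
print(teacher_labels)
```

If the teacher instead saw the full dataset as context, this nearest-neighbor stand-in would return each row's own label, and a student trained on those labels would learn nothing beyond the ground truth. A student would then be fit on `teacher_labels` rather than on `ys`.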
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Lance is a lightweight unified multimodal model that achieves synergistic performance across image and video understanding, generation, and editing through collaborative multi-task training. Its core method involves a dual-stream mixture-of-experts architecture with unified context modeling and decoupled capability pat…
Latent Action Reparameterization for Efficient Agent Inference
This paper introduces Latent Action Reparameterization (LAR) to address the high inference cost of LLM agents. LAR learns a compact latent action space where each latent action represents a multi-step semantic behavior, allowing agents to make decisions over a shorter horizon. This learned abstraction, unlike hand-craf…
LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
This paper introduces LongMINT, a new benchmark designed to evaluate memory-augmented agents in realistic, long-horizon scenarios with interfering information. The core method involves creating complex, interconnected contexts with frequently updated data across diverse domains and question types. LongMINT's contributi…
Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks
This paper introduces "overeager actions," in which autonomous coding agents take unauthorized actions beyond the scope of a benign user request. To measure this, the authors developed the OverEager-Gen benchmark, which found that explicitly stating the authorized scope in prompts can paradoxically increase overeager behavior by encouraging patt…
Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees
This paper addresses the latency of large tabular foundation models (TFMs) in real-time fraud scoring. Its core method distills a TFM teacher into a CPU-ready gradient-boosted tree (XGBoost or CatBoost) student model. The key contribution is a novel stratified out-of-fold labeling technique that overcomes labe…
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency
This paper introduces a novel scaling law for factual recall in Large Language Models (LLMs), demonstrating that recall quality is predictable and improves with both model size and the frequency of a topic in the training data. The core method involves evaluating numerous LLMs on scholarly references and finding that r…
Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents
Reversa is a framework that uses a multi-agent pipeline to convert legacy software into operational specifications for AI agents. Its core method involves specialized agents analyzing code, extracting implicit rules, and synthesizing specifications, with a key contribution being its emphasis on traceability, confidence…
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
This paper introduces SCICONVBENCH, a novel benchmark designed to evaluate Large Language Models (LLMs) on their ability to refine ill-posed scientific requests through multi-turn dialogue. The benchmark focuses on two key capabilities: eliciting missing information and resolving contradictory requests, across four com…
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
This paper introduces Vision-OPD, a self-distillation method to improve MLLMs' fine-grained visual understanding. It addresses the "regional-to-global perception gap" by training the full-image model (student) to mimic the stronger crop-conditioned model (teacher), with both roles played by the same MLLM. This transfers the model…