From the arXiv
Thursday, 14 May 2026 · 20 papers
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
This paper introduces HistoryAnchor-100, a dataset designed to test LLM safety by examining how prior harmful actions influence future decisions. The core method involves presenting LLMs with scenarios where a harmful past action is followed by a choice between safe and unsafe options. The key contribution is demonstra…
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
This paper introduces SLOP, a method for inference-time alignment that generalizes existing techniques by using a sharpened logarithmic opinion pool of generative reward models. By adjusting the "temperature" of reference models and calibrating SLOP weights, the approach mitigates reward hacking and improves robustness…
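For readers unfamiliar with the term, a logarithmic opinion pool combines several distributions by multiplying them with per-model exponents and renormalizing. The sketch below shows only that generic construction over small discrete distributions. The function name `log_opinion_pool` and the `weights` argument are illustrative assumptions, and nothing here reproduces how SLOP sharpens the pool or calibrates its weights.

```python
import math

def log_opinion_pool(dists, weights):
    """Pool discrete distributions as p(x) proportional to prod_i p_i(x) ** w_i.

    Assumes every probability is strictly positive (log(0) is undefined).
    """
    n = len(dists[0])
    # Work in log space: weighted sum of log-probabilities per outcome.
    log_scores = [
        sum(w * math.log(d[k]) for d, w in zip(dists, weights))
        for k in range(n)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_scores)
    unnorm = [math.exp(s - m) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Two hypothetical models scoring the same two outcomes.
pooled = log_opinion_pool([[0.5, 0.5], [0.8, 0.2]], [1.0, 1.0])
# An exponent above 1 on a single model concentrates mass on its preferred outcome.
sharper = log_opinion_pool([[0.8, 0.2]], [2.0])
```

In this generic form, exponents that sum to more than 1 make the pooled distribution more peaked than its inputs, which is one plausible reading of "sharpened"; the summary does not confirm it.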
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
This paper introduces AttenA+, a framework that addresses "action inequality" in robotic foundation models: low-velocity actions are often more critical for task success than high-velocity transitions. AttenA+ rectifies this by reweighting the training objective based on inverse velocity, priorit…
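A minimal sketch of what inverse-velocity reweighting could look like, assuming per-step scalar speeds and a squared-error objective. The helper names, the `eps` smoothing term, and the mean-one normalization are illustrative assumptions, not details taken from the paper.

```python
def inverse_velocity_weights(speeds, eps=1e-3):
    """Weight each step by 1 / (|speed| + eps), rescaled to have mean 1."""
    raw = [1.0 / (abs(s) + eps) for s in speeds]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]

def weighted_mse(pred, target, weights):
    """Mean squared error with a per-step weight on each term."""
    terms = [w * (p - t) ** 2 for p, t, w in zip(pred, target, weights)]
    return sum(terms) / len(terms)

# Hypothetical speeds for three consecutive steps: slow, medium, fast.
weights = inverse_velocity_weights([0.1, 1.0, 2.0])
loss = weighted_mse([0.9, 0.5, 0.1], [1.0, 0.4, 0.3], weights)
```

Rescaling to mean 1 keeps the overall loss magnitude comparable to the unweighted objective, so slow steps gain influence without changing the effective learning rate.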
Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
This paper investigates whether low-rank pre-training methods for large language models generalize as well as full-rank training, a question previously examined only through limited perplexity-based comparisons. The authors provide a more thorough comparison by analyzing the geometric and spectral properties of the solutions found b…
Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety
This paper fine-tunes compact LLMs (8B parameters) on expert-designed children's reading curricula and existing generated stories. The core method focuses on controllable difficulty and safety, enabling educators to target specific reading levels. The main contribution is demonstrating that these fine-tuned, smaller LL…
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
EVA-Bench is an end-to-end framework for evaluating voice agents. Its core method involves generating realistic, multi-turn bot-to-bot audio conversations with automatic validation and introducing two composite metrics, EVA-A (Accuracy) and EVA-X (Experience), to measure task completion, speech fidelity, and conversati…
Harnessing Agentic Evolution
This paper introduces AEvo, a meta-editing framework for agentic evolution. AEvo treats the evolutionary process as an interactive environment, using accumulated evidence as its state. Its core contribution is a meta-agent that revises the evolutionary mechanism itself, rather than directly generating candidates, to im…
Position: Assistive Agents Need Accessibility Alignment
This paper argues that assistive AI agents for visually impaired users must prioritize "accessibility alignment" as a core design goal, not an afterthought. Current agentic AI fails in assistive scenarios due to mismatches with sighted-user assumptions regarding verification, risk, and interaction. The authors propose …
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
This paper introduces RealICU, a novel benchmark for evaluating LLMs on long-context ICU data. Unlike previous benchmarks that rely on potentially suboptimal clinician actions, RealICU uses hindsight annotations from senior physicians reviewing complete patient trajectories. This allows for a more accurate assessment o…
ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles
ScioMind is a novel multi-agent social simulation framework that integrates structured opinion dynamics with LLM-based agent reasoning. Its core method combines a personality-conditioned belief update rule with a hierarchical memory architecture and dynamic agent profiles, allowing for cognitively grounded and evolving…
Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
This paper introduces IMAVB, a benchmark to test omnimodal LLMs' ability to detect contradictions between text and their own sensory input. The core finding is a "Representation-Action Gap," where models internally represent mismatches but fail to reject false textual claims in their outputs. This highlights a critical…
Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
This paper introduces a method for detecting hallucinations in large language models at the level of individual reasoning steps, rather than only in the overall output. It proposes that correct reasoning follows a stable path in the model's hidden states, while errors cause deviations. The core contribution is a geometric …
Learning POMDP World Models from Observations with Language-Model Priors
This paper introduces Pinductor, a method that uses Large Language Models (LLMs) to learn world models for partially observable environments (POMDPs). Pinductor leverages LLM priors to propose and refine POMDP models from limited observation-action data, significantly improving sample efficiency. Its key contribution i…
MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling
MILM represents multimodal irregular time series as XML-formatted triplets and fine-tunes a large language model (LLM) in two stages. The first stage trains the LLM to predict from sampling patterns alone, while the second stage jointly models patterns and observed values. This approach effectively leverages the predic…
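To make the representation concrete, here is a sketch of serializing irregular observations as XML triplets. It assumes each triplet is (time, variable, value), a common convention for irregular time series; the tag names `series`, `obs`, `time`, `variable`, and `value` are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET

def to_xml_triplets(observations):
    """Serialize (time, variable, value) tuples as XML, ordered by time."""
    root = ET.Element("series")
    for time, variable, value in sorted(observations, key=lambda o: o[0]):
        obs = ET.SubElement(root, "obs")
        ET.SubElement(obs, "time").text = str(time)
        ET.SubElement(obs, "variable").text = str(variable)
        ET.SubElement(obs, "value").text = str(value)
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")

# Hypothetical vitals recorded at uneven intervals.
xml_text = to_xml_triplets([(4.0, "spo2", 97), (1.5, "hr", 72)])
```

Because every observation carries its own timestamp, no resampling onto a regular grid is needed, and the pattern of which variables were measured when stays visible in the text itself.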
Sampling from Flow Language Models via Marginal-Conditioned Bridges
This paper proposes a novel sampling method for Flow Language Models (FLMs) by leveraging their unique denoiser structure. Instead of collapsing marginal distributions, the method samples a one-hot token from the posterior marginals at each step and then uses an analytic Ornstein-Uhlenbeck bridge conditioned on this sa…
An LLM-Based System for Argument Reconstruction
This paper introduces an LLM-based system that reconstructs arguments from text into abstract argument graphs. The system uses a multi-stage pipeline to identify claims, premises, and their logical relationships (support, attack, undercut), representing them as directed acyclic graphs. Its contribution lies in providin…
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile optimizes structured LLM workflows by treating them as a compilation problem, not just an inference-time routing problem. It globally explores the design space of sub-agent configurations before deployment to create reusable workflow-level configurations that balance accuracy and latency across various trad…
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
This paper introduces TFlow, a novel communication method for multi-agent LLM systems. Instead of exchanging text, TFlow allows agents to directly update the receiver's internal weights with learned, low-rank perturbations. This significantly reduces computational costs and memory usage by enabling instance-level adapt…
Inducing Artificial Uncertainty in Language Models
This paper introduces a method to induce artificial uncertainty in language models, aimed at settings where challenging data for training uncertainty quantification is scarce. The core idea is to train models to express uncertainty even on simple examples, thereby improving their ability to signal uncertainty on genuinely di…
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
This paper investigates whether Large Language Models (LLMs) truly understand the formal semantics of High-Level Message Sequence Charts (HMSCs), a crucial visual modeling language. The researchers tested three LLMs on 129 semantic tasks, ranging from basic queries to complex abstractions and trace calculations, to ass…