№01
cs.AI arxiv:2605.13825v1

History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

Alberto G. Rodríguez Salgado

This paper introduces HistoryAnchor-100, a dataset designed to test LLM safety by examining how prior harmful actions influence future decisions. The core method involves presenting LLMs with scenarios where a harmful past action is followed by a choice between safe and unsafe options. The key contribution is demonstra…

9
№02
cs.AI arxiv:2605.13537v1

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

Ye Wang, Jing Liu, Toshiaki Koike-Akino

This paper introduces SLOP, a method for inference-time alignment that generalizes existing techniques by using a sharpened logarithmic opinion pool of generative reward models. By adjusting the "temperature" of reference models and calibrating SLOP weights, the approach mitigates reward hacking and improves robustness…
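
For orientation, a plain weighted logarithmic opinion pool combines distributions through a weighted geometric mean, p(x) ∝ ∏ p_i(x)^{w_i}, and weights that sum to more than 1 make the pooled distribution sharper. The sketch below illustrates only that generic construction and is not the paper's SLOP formulation; the function name, the toy distributions, and the weight values are illustrative assumptions.

```python
import math

def log_opinion_pool(dists, weights):
    """Weighted geometric mean of discrete distributions, renormalized.

    Assumes every probability is strictly positive (math.log needs p > 0).
    """
    n = len(dists[0])
    # Weighted sum of log-probabilities for each outcome.
    log_scores = [
        sum(w * math.log(p[i]) for p, w in zip(dists, weights))
        for i in range(n)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_scores)
    unnorm = [math.exp(s - m) for s in log_scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical toy distributions over three candidate responses.
reference = [0.5, 0.3, 0.2]
reward_model = [0.2, 0.6, 0.2]

pooled = log_opinion_pool([reference, reward_model], [1.0, 1.0])
# Doubling every weight sharpens the pool toward its mode.
sharp = log_opinion_pool([reference, reward_model], [2.0, 2.0])
```

Raising a weight plays the same role as lowering that model's temperature, which is one way to read the "temperature" adjustment mentioned in the summary.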

9
№03
cs.AI arxiv:2605.13548v1

AttenA+: Rectifying Action Inequality in Robotic Foundation Models

Daojie Peng, Fulong Ma, Jiahang Cao et al.

This paper introduces AttenA+, a framework that addresses the "action inequality" in robotic foundation models. It recognizes that low-velocity actions are often more critical for task success than high-velocity transitions. AttenA+ rectifies this by reweighting the training objective based on inverse velocity, priorit…
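
A minimal sketch of what inverse-velocity reweighting of a per-step loss could look like, assuming a squared-error objective and per-step speeds supplied as input. The helper names, the `eps` smoothing term, and the mean-1 normalization are assumptions made for illustration, not details taken from the paper.

```python
def inverse_velocity_weights(speeds, eps=1e-3):
    """Weights proportional to 1 / (|speed| + eps), rescaled to have mean 1."""
    raw = [1.0 / (abs(v) + eps) for v in speeds]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]

def weighted_mse(pred, target, weights):
    """Mean of per-step squared errors, each scaled by its weight."""
    total = sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, weights))
    return total / len(pred)

# Hypothetical trajectory: two fast transition steps, then a slow, precise step.
speeds = [1.0, 0.5, 0.01]
weights = inverse_velocity_weights(speeds)

# The same unit error costs more on the slow step than on the fast one.
loss_slow_error = weighted_mse([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], weights)
loss_fast_error = weighted_mse([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], weights)
```

Normalizing the weights to mean 1 keeps the overall loss scale comparable to an unweighted objective, so only the relative emphasis across steps changes.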

8
№04
cs.AI arxiv:2605.13652v1

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira et al.

This paper investigates whether low-rank pre-training methods for large language models generalize as well as full-rank training, a question previously addressed only by limited perplexity metrics. The authors provide a more thorough comparison by analyzing the geometric and spectral properties of the solutions found b…

8
№05
cs.AI arxiv:2605.13709v1

Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety

Qian Shen, Fanghua Cao, Min Yao et al.

This paper fine-tunes compact LLMs (8B parameters) on expert-designed children's reading curricula and existing generated stories. The core method focuses on controllable difficulty and safety, enabling educators to target specific reading levels. The main contribution is demonstrating that these fine-tuned, smaller LL…

8
№06
cs.AI arxiv:2605.13841v1

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz et al.

EVA-Bench is an end-to-end framework for evaluating voice agents. Its core method involves generating realistic, multi-turn bot-to-bot audio conversations with automatic validation and introducing two composite metrics, EVA-A (Accuracy) and EVA-X (Experience), to measure task completion, speech fidelity, and conversati…

8
№07
cs.AI arxiv:2605.13821v1

Harnessing Agentic Evolution

Jiayi Zhang, Yongfeng Gu, Jianhao Ruan et al.

This paper introduces AEvo, a meta-editing framework for agentic evolution. AEvo treats the evolutionary process as an interactive environment, using accumulated evidence as its state. Its core contribution is a meta-agent that revises the evolutionary mechanism itself, rather than directly generating candidates, to im…

8
№08
cs.AI arxiv:2605.13579v1

Position: Assistive Agents Need Accessibility Alignment

Jie Hu, Changyuan Yan, Yu Zheng et al.

This paper argues that assistive AI agents for visually impaired users must prioritize "accessibility alignment" as a core design goal, not an afterthought. Current agentic AI fails in assistive scenarios due to mismatches with sighted-user assumptions regarding verification, risk, and interaction. The authors propose …

8
№09
cs.AI arxiv:2605.13542v1

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Chengzhi Shen, Weixiang Shen, Tobias Susetzky et al.

This paper introduces RealICU, a novel benchmark for evaluating LLMs on long-context ICU data. Unlike previous benchmarks that rely on potentially suboptimal clinician actions, RealICU uses hindsight annotations from senior physicians reviewing complete patient trajectories. This allows for a more accurate assessment o…

8
№10
cs.AI arxiv:2605.13725v1

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

Yitian Yang, Yiqun Duan, Linghan Huang et al.

ScioMind is a novel multi-agent social simulation framework that integrates structured opinion dynamics with LLM-based agent reasoning. Its core method combines a personality-conditioned belief update rule with a hierarchical memory architecture and dynamic agent profiles, allowing for cognitively grounded and evolving…

8
№11
cs.AI arxiv:2605.13737v1

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Trung Nguyen Quang, Yiming Gao, Fanyi Pu et al.

This paper introduces IMAVB, a benchmark to test omnimodal LLMs' ability to detect contradictions between text and their own sensory input. The core finding is a "Representation-Action Gap," where models internally represent mismatches but fail to reject false textual claims in their outputs. This highlights a critical…

8
№12
cs.AI arxiv:2605.13772v1

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

Tyler Alvarez, Ali Baheri

This paper introduces a method for detecting hallucinations in large language models at the level of individual reasoning steps, rather than only in the final output. It proposes that correct reasoning follows a stable path in the model's hidden states, while errors cause deviations. The core contribution is a geometric …

8
№13
cs.LG arxiv:2605.13740v1

Learning POMDP World Models from Observations with Language-Model Priors

Valentin Six, Frederik Panse, Mathis Fajeau et al.

This paper introduces Pinductor, a method that uses large language models (LLMs) to learn world models for partially observable environments, formalized as POMDPs. Pinductor uses LLM priors to propose and refine POMDP models from limited observation-action data, significantly improving sample efficiency. Its key contribution i…

8
№14
cs.LG arxiv:2605.13711v1

MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling

Hsing-Huan Chung, Shijun Li, Yoav Wald et al.

MILM represents multimodal irregular time series as XML-formatted triplets and fine-tunes a large language model (LLM) in two stages. The first stage trains the LLM to predict from sampling patterns alone, while the second stage jointly models patterns and observed values. This approach effectively leverages the predic…
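
To make the representation concrete, here is a hedged sketch of serializing irregularly timed observations as XML triplets. It assumes each triplet is (time, variable, value); the tag names `series`, `obs`, `time`, `variable`, and `value` are hypothetical, and the paper's actual schema may differ.

```python
import xml.etree.ElementTree as ET

def to_xml_triplets(observations):
    """Serialize (time, variable, value) observations as XML, ordered by time."""
    root = ET.Element("series")
    for time, variable, value in sorted(observations, key=lambda o: o[0]):
        obs = ET.SubElement(root, "obs")
        ET.SubElement(obs, "time").text = str(time)
        ET.SubElement(obs, "variable").text = variable
        ET.SubElement(obs, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical mixed-modality record: a numeric vital sign and a text note,
# sampled at irregular times and given out of order.
xml_text = to_xml_triplets([
    (3.5, "heart_rate", 88),
    (0.0, "note", "admitted"),
])
```

Because only observed values are emitted, the pattern of which variables appear and when is preserved in the text itself, which is the kind of sampling information the first fine-tuning stage is described as using.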

8
№15
cs.LG arxiv:2605.13681v1

Sampling from Flow Language Models via Marginal-Conditioned Bridges

Iskander Azangulov, Leo Zhang

This paper proposes a novel sampling method for Flow Language Models (FLMs) by leveraging their unique denoiser structure. Instead of collapsing marginal distributions, the method samples a one-hot token from the posterior marginals at each step and then uses an analytic Ornstein-Uhlenbeck bridge conditioned on this sa…

8
№16
cs.CL arxiv:2605.13793v1

An LLM-Based System for Argument Reconstruction

Paulo Pirozelli, Victor Hugo Nascimento Rocha, Fabio G. Cozman et al.

This paper introduces an LLM-based system that reconstructs arguments from text into abstract argument graphs. The system uses a multi-stage pipeline to identify claims, premises, and their logical relationships (support, attack, undercut), representing them as directed acyclic graphs. Its contribution lies in providin…

8
№17
cs.CL arxiv:2605.13647v1

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

Junyan Li, Zhang-Wei Hong, Maohao Shen et al.

FlowCompile optimizes structured LLM workflows by treating them as a compilation problem, not just an inference-time routing problem. It globally explores the design space of sub-agent configurations before deployment to create reusable workflow-level configurations that balance accuracy and latency across various trad…

8
№18
cs.CL arxiv:2605.13839v1

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

Wenrui Bao, Huan Wang, Jian Wang et al.

This paper introduces TFlow, a novel communication method for multi-agent LLM systems. Instead of exchanging text, TFlow allows agents to directly update the receiver's internal weights with learned, low-rank perturbations. This significantly reduces computational costs and memory usage by enabling instance-level adapt…

8
№19
cs.CL arxiv:2605.13595v1

Inducing Artificial Uncertainty in Language Models

Sophia Hager, Simon Zeng, Nicholas Andrews

This paper introduces a method to induce artificial uncertainty in language models, particularly when challenging data for training uncertainty quantification is scarce. The core idea is to train models to express uncertainty even on simple examples, thereby improving their ability to signal uncertainty on genuinely di…

8
№20
cs.AI arxiv:2605.13773v1

(How) Do Large Language Models Understand High-Level Message Sequence Charts?

Mohammad Reza Mousavi

This paper investigates whether Large Language Models (LLMs) truly understand the formal semantics of High-Level Message Sequence Charts (HMSCs), a crucial visual modeling language. The researchers tested three LLMs on 129 semantic tasks, ranging from basic queries to complex abstractions and trace calculations, to ass…

7