The Morning
From the arXiv
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
This paper introduces AgentEscapeBench, a benchmark designed to evaluate LLM agents' ability to perform out-of-domain, tool-grounded reasoning with long-range dependencies. The benchmark uses escape-room-style tasks that require agents to infer and execute complex tool-use procedures, and it shows a significant performance drop for both humans and LLMs as dependency depth increases. Its core contribution is a challenging, automated evaluation of robust agent reasoning that goes beyond simple tool interactions.


Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
This paper introduces GraphDPO, a generalization of Direct Preference Optimization (DPO) that handles preference data structured as graphs, rather than just pairs. By optimizing a graph-structured objective, GraphDPO leverages richer preference information, en…
The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents
This paper introduces the "memory curse," demonstrating that expanding LLM agents' context windows can paradoxically *decrease* cooperation in multi-agent social dilemmas. The core method involves extensive testing across various LLMs and games, revealing that…


Tool Calling is Linearly Readable and Steerable in Language Models
This paper demonstrates that language models' tool-calling decisions are linearly encoded within their internal activations. By manipulating the difference in average activations between tool representations, researchers can reliably steer the model to select …
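As a rough illustration of what steering with a "difference in average activations" can look like, here is a minimal sketch of generic difference-of-means steering. It is not the paper's implementation: the function names (`mean_vector`, `steering_vector`, `steer`, `linear_score`), the scaling parameter `alpha`, and the toy two-dimensional activations are all assumptions made for this example. In practice the activations would be hidden states collected from a real model.

```python
from statistics import fmean

def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(col) for col in zip(*rows)]

def steering_vector(acts_tool_a, acts_tool_b):
    """Direction pointing from tool A's mean activation to tool B's (hypothetical helper)."""
    mean_a = mean_vector(acts_tool_a)
    mean_b = mean_vector(acts_tool_b)
    return [b - a for a, b in zip(mean_a, mean_b)]

def steer(activation, direction, alpha=1.0):
    """Shift one activation along the steering direction; alpha is an assumed strength knob."""
    return [x + alpha * d for x, d in zip(activation, direction)]

def linear_score(activation, direction):
    """Dot product with the direction: a simple linear readout of which tool is favoured."""
    return sum(x * d for x, d in zip(activation, direction))

# Toy activations, invented for illustration only.
acts_a = [[1.0, 0.0], [3.0, 0.0]]   # examples where the model picked tool A
acts_b = [[0.0, 2.0], [0.0, 4.0]]   # examples where the model picked tool B

direction = steering_vector(acts_a, acts_b)
steered = steer([2.0, 0.0], direction, alpha=1.0)
```

With this toy data, adding the direction to tool A's mean activation moves it onto tool B's mean, and the linear score rises accordingly. Whether a real model behaves this cleanly is exactly the empirical question the paper addresses.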
RelAgent: LLM Agents as Data Scientists for Relational Learning
RelAgent is an LLM-based autonomous data scientist for relational learning. It first uses LLM agents with workspace tools to automatically generate SQL feature programs and select a predictive model. The contribution is a two-phase approach that results in fas…

GLiGuard: Schema-Conditioned Classification for LLM Safeguard
GLiGuard reformulates LLM content moderation as a classification problem, moving away from slow, generation-based guardrails. Its core method uses a small, schema-conditioned bidir…
How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
This paper introduces the Latent Diffusion Language Model (LDLM), which jointly trains an encoder, diffusion model, and decoder for non-autoregressive text generation. The core met…
How Value Induction Reshapes LLM Behaviour
This paper investigates how fine-tuning Large Language Models (LLMs) with specific values impacts their behavior. The core method involves fine-tuning models on curated value subse…
CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
This paper introduces CyBiasBench, a benchmark designed to quantify attack-selection bias in LLM agents used for cybersecurity. The core method involves evaluating five LLM agents …
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD addresses bottlenecks in multi-task flow matching models by using on-policy distillation. It first trains specialized "teacher" models for individual tasks, then distills …
The Town Square
The article argues that local AI, running on user devices, should become the standard for privacy, security, and user control, rather than relying on cloud-based solutions.
Workshops
CloakBrowser is a stealthy Chromium browser designed to bypass bot detection, functioning as a drop-in Playwright replacement with source-level fingerprint patches that pass all tests.
The AiToEarn repository aims to use AI to create ways for users to earn money.