Autonomous Agents

Autonomous Agents-research papers. Updated daily. See as well the Resources-section.

Research papers

Chronological order.

30th May 2025

Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Browser-Use Agent: introduces Open CaptchaWorld, a web-based benchmark and platform for evaluating multimodal LLM agents on interactive CAPTCHA puzzles, including Agent (core reasoning model), Memory (stores state/history), Next goal (defines immediate objective), Action (executes operation), and Eval (evaluates state/action) components.
The benchmark features 20 diverse CAPTCHA types and a new metric, CAPTCHA Reasoning Depth, to quantify task complexity.
Empirical results demonstrate a significant performance gap between state-of-the-art MLLM agents and humans on these interactive visual reasoning tasks, highlighting current limitations.

VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software

VIDEOCADFORMER: introduces an autoregressive transformer model for predicting CAD UI actions, including UI Image Encoder, CAD Image Encoder, Visual Projection, Action/Timestep Embeddings, Transformer Decoder with Multi-Head Attention, Cross-Attention, Feed-Forward Network, Command Head, and Parameter Head.
The model processes visual inputs (target CAD image, past UI frames) and sequential data (past actions, timestep embeddings) to predict the next low-level UI action.
The architecture uses ViT encoders for visual features, projects inputs into a hidden space, and employs a causal transformer decoder with attention mechanisms and MLPs for action prediction.

EXP-Bench: Can AI Conduct AI Research Experiments?

AI Agent: introduces, with all (Design experimental procedures), (Implement experimental procedures), (Analyze results, derive conclusions), (Execute experiments)-components, a benchmark evaluating AI agents on end-to-end research experiments.
The benchmark challenges agents to perform tasks sourced from AI publications, including hypothesis formulation, experimental design, implementation, execution, and result analysis.
A semi-automated pipeline curates tasks from papers and code, and evaluation uses ground truth comparisons and code execution.

Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting

Causal-aware LLMs: introduces a framework integrating structural causal models (SCMs) into large language models (LLMs) for decision-making, utilizing Main Env, LLM, Causal Matrix, Local Causal Graph, Agent, Valid Env, Causal Intervention, Observations, Action, Extra Reward, and Goal components within a learning-adapting-acting paradigm.
The framework iteratively learns causal knowledge from the Main Env using the LLM, refines it through Causal Intervention in a Valid Env, and uses the learned knowledge (Causal Matrix, Local Causal Graph) to guide the Agent's actions and Goal generation.
This approach enhances the LLM's environmental understanding and the Agent's policy learning through structured causal reasoning and adaptive knowledge updates based on environmental feedback and Extra Reward signals.

Multiple LLM Agents Debate for Equitable Cultural Alignment

Multi-Agent Debate framework: introduces a method where LLM Agents (Debate over scenario) debate over a cultural scenario, potentially incorporating Self-Reflection Capability (Reflects on output) via a Choice Mechanism (Chooses reflection or debate), and collaboratively reach a final decision through a Debate Mechanism (Structured interaction), resolved by a Judge LLM (Resolves disagreements) if needed.
The framework explores multi-LLM collaboration to improve cultural adaptability and equitable alignment across diverse contexts.
Experiments show that multi-agent debate enhances accuracy and cultural group parity, enabling smaller LLMs to achieve performance comparable to larger models.

--

When Harry Meets Superman: The Role of The Interlocutor in Persona-Based Dialogue Generation

Evaluation Pipeline: introduces a systematic framework to evaluate persona-based dialogue generation, including PRODIGy Dataset, Non-PRODIGY Character Generator, Dialogue Generator, Fine-tuning Module, Evaluation Framework, LLM-as-a-Judge, Human Evaluator, and Biography Similarity Module.
The framework investigates how large language models adapt responses based on both target speaker and interlocutor characteristics across varying topics and speaker pairings.
Evaluation involves systematically masking or revealing interlocutor information to assess its impact on dialogue generation and target speaker identification using both automatic and human methods.

NEXUSSUM: Hierarchical LLM Agents for Long-Form Narrative Summarization

NEXUSSUM (Hierarchical LLM Agents for Long-Form Narrative Summarization): introduces a multi-agent LLM framework for long-form narrative summarization with a Preprocessor agent (Converts dialogue to prose), Narrative Summarizer agent (Generates initial summary), and Compressor agent (Refines summary length).
The framework processes long-form text through a structured, sequential pipeline using chunking and concatenation.
This approach aims to improve narrative coherence, handle long contexts, and control output length for high-quality summaries.

CREFT: Sequential Multi-Agent LLM for Character Relation Extraction

CREFT: introduces a sequential multi-agent LLM framework for character relation extraction, including Base Character Graph Construction, Character Selection with PPR, Merging Duplicate Nodes (LLM), Relation Extraction (LLM), Filtering Out Irrelevant Characters (LLM), Role Identification (LLM), Grouping Characters (LLM), and CRS, which iteratively refines character composition, relations, roles, and groups from narrative texts.
The framework first builds a base character graph using knowledge distillation from GPT-4o and a fine-tuned LLM, then employs specialized LLM agents in sequence to refine the graph components.
Experiments show that the multi-agent approach significantly outperforms single-agent baselines in accuracy and completeness for extracting character relations from Korean drama scripts.

Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research

AGORA (Agent Graph-based Orchestration for Reasoning and Assessment): introduces a flexible framework with a Graph-based Workflow Orchestration Engine (Manages task execution via DAG) managing Tasks (Nodes in workflow DAG), integrating Agent Algorithms (Operators) (Modular reasoning/action components), Memory (Stores short-term/long-term information), External Tools (LLMs, VLMs, databases, etc.), Client Interfaces (User/evaluation interaction points), and an Evaluation Framework (Enables systematic comparison) for reproducible language agent research.
The framework utilizes a graph-based engine for modularity and scalability, supporting diverse agent algorithms implemented as reusable operators.
Multiple client interfaces are provided for flexible interaction and systematic evaluation across different tasks and models.

--

Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents

MPR (Multi-Perspective Role-Playing) framework: introduces a method for sentiment forecasting on social media, with Feature Extraction (Identify implicit features), Subjective Role-Playing Agent (Simulate user behavior, generate comments), Objective Role-Playing Agent (Analyze generated comments, ensure consistency), and Iterative Rectification (Refine generated comments based on analysis) components.
The framework leverages LLMs to simulate user responses to events and analyze generated content for consistency to predict future sentiment.
By incorporating external context and user-specific features through multi-perspective role-playing, the approach aims for more precise sentiment predictions.

Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games

LLM-based Agent System: introduces a system to study LLM agent behavior in the Ultimatum Game, with LLM Agents, Ultimatum Game Environment, Prosocial Beliefs, Reasoning Methods, System Prompts, Reasoning Prompts, Proposal/Decision Prompts, Strategy Prompts, and Conversation History components, where the system simulates LLM agents with varying beliefs and reasoning in an economic game to assess behavioral alignment with human norms.
The system initializes LLM agents with specific prosocial beliefs and reasoning methods (CoT, ToM levels) to act as Proposers or Responders in a multi-round Ultimatum Game.
Experiments across diverse LLMs and belief/reasoning combinations evaluate agent performance and behavioral alignment using metrics like acceptance rate, average turns, and deviation scores from expected human behavior.

Proactive Guidance of Multi-Turn Conversation in Industrial Search

Two-Phase Framework (G-SFT and C-RL): introduces a system for proactive guidance in multi-turn search, featuring a G-SFT phase with a Goal Adaptation Agent, Scalable Knowledge Transfer, and G-SFT Model, and a C-RL phase with Generate, Rank, and C-RL Model components.
The G-SFT phase uses the Goal Adaptation Agent to dynamically adapt to user goal shifts via Explicit Goal Analysis, Goal-relevant Summary, and Shift Detection Signal, while Scalable Knowledge Transfer distills LLM knowledge into the G-SFT Model for low-latency guidance generation.
The C-RL phase employs a generate-rank paradigm, using a Preference-Aligned Augmentation Model with DBS-based Decoding to create candidates, and a Rank component with a Click Estimator and Diversity-Aware Group Sample Strategy to select preference pairs for fine-tuning the C-RL Model based on user clicks.

An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring

Credibility Scoring Framework: introduces an adversary-resistant multi-agent LLM system that processes a User Query (Task) using a Team of Agents (LLM agents) configured with a Topology (communication structure) and Agent Roles (assigned tasks/expertise), generating Individual Outputs (agent responses) which are combined by a CrS-Aware Aggregator (weights/combines outputs) to produce the final Output (final system answer).
The system learns agent reliability via a feedback loop where an LLM Judge (evaluates outputs/contributions) provides a Reward (output quality feedback), used by the Agent Contribution Calculation (CSc) (measures agent impact) and Credibility Score Update (learns agent reliability) components to adjust agent Credibility Score (CrS) (agent reliability score).
This dynamic credibility scoring mechanism enhances robustness against adversarial agents, even in adversary-majority settings, by weighting agent contributions based on their learned reliability.

SentinelAgent: Graph-based Anomaly Detection in LLM-based Multi-Agent Systems

SentinelAgent: introduces a system-level anomaly detection framework for LLM-based multi-agent systems, integrating structural modeling with runtime behavioral oversight using Event Monitor (intercepts runtime events), Behavior Analyzer (evaluates interaction graph), and Risk Responder (determines responses).
The framework models agent interactions as dynamic execution graphs to enable semantic anomaly detection at node, edge, and path levels.
SentinelAgent acts as an autonomous, LLM-powered runtime monitor that observes, analyzes, and intervenes in multi-agent system execution based on security policies.

29th May 2025

Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents

DGM (Darwin Gödel Machine): introduces a self-improving system that iteratively builds a growing Archive (stores agents) by interleaving Self-Modification (agent changes itself) of a Coding Agent (system being improved) with Evaluation (tests agent) on a Benchmark Suite (evaluation tasks), using Parent Selection (selects agents) from the archive, where the agent is powered by a Foundation Model (FM) (agent's base capability) and modifies its own Code Repository (agent's code) and Tools (agent's capabilities) based on Evaluation Logs (agent performance data) and Self-Improve Instruction (prompt for self-modification).
The system operates through an open-ended exploration loop, maintaining a traceable lineage of agents in the archive and empirically validating self-modifications against coding benchmarks.
The approach demonstrates automatic discovery of improved coding capabilities and workflows, achieving performance gains on SWE-bench and Polyglot benchmarks, and incorporates safety measures like sandboxing and monitoring.

28th May 2025

3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

3DLLM-MEM: introduces a memory-enhanced 3D embodied agent framework that utilizes an Encoder (Encodes 3D inputs), Working Memory (Current 3D observations), Episodic Memory (Past 3D observations/interactions) stored in a Memory Bank (Stores episodic memory features), a Memory Fusion Module (Integrates working/episodic memory) producing Fused Episodic Memory (Integrated memory representation), and an LLM (Processes memory for actions).
The framework incrementally builds and maintains a task-relevant long-term memory by incorporating feedback from the environment and interacting with objects.
The Memory Fusion Module uses working memory tokens as queries to selectively attend to and fuse relevant spatial and temporal features from episodic memory.

Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents

Proposed Research Directions: introduces a position arguing that traditional aleatoric and epistemic uncertainty definitions are insufficient for interactive LLM agents and proposes research into Underspecification uncertainties (missing information, unclear task), Interactive learning (ask follow-up questions), and Output uncertainties (communicate uncertainty beyond numbers).
The paper highlights conflicts in existing uncertainty definitions and their breakdown in dynamic, multi-turn LLM agent interactions.
The proposed directions aim to make LLM agent interactions more transparent, trustworthy, and intuitive by addressing and communicating uncertainty in novel ways.

Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems

Agent-UniRAG (A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems): introduces a trainable agent framework for unified RAG systems, with Planning Module (determines necessary actions), Tool Using Module (interacts with external tools), Working Memory Module (stores input, logs, evidence), Reflector Module (filters and refines evidence), and Agent Loop (iterative process).
The framework leverages the LLM agent concept to handle both single-hop and multi-hop queries in an end-to-end manner.
Agent-UniRAG utilizes a synthetic dataset (SynAgent-RAG) for training small open-source LLMs to achieve competitive performance.

Universal Visuo-Tactile Video Understanding for Embodied Interaction

VTV-LLM: introduces a multi-modal large language model for universal visuo-tactile video understanding, integrating Tokenizer, T-Projector, VTV Encoder, V-Projector, and a Large Language Model.
The framework bridges the gap between tactile perception and natural language by aligning visuo-tactile video features with linguistic descriptions.
It enables sophisticated tactile reasoning capabilities for embodied interaction, including feature assessment and comparative analysis.

From Strangers to Assistants: Fast Desire Alignment for Embodied Agent-User Adaptation

FAMER (Fast Adaptation via MEntal Reasoning): introduces a framework for fast desire alignment, integrating Perception (Extracts scene graph), Key Information Extraction (Filters, stores goal info), Memory (Stores cross-episode knowledge), Desire-Centered Mental Reasoning (Infers user desires), Efficient Communication (Manages dialogue efficiently), and Goal Oriented Planning (Plans goal actions).
The framework leverages LLMs to interpret vague instructions, infer user intent, and manage dialogue, enabling adaptation to unknown user preferences.
FAMER improves task execution and communication efficiency by filtering irrelevant actions, reducing redundant inquiries, and reusing knowledge across episodes.

EvolveSearch: An Iterative Self-Evolving Search Agent

EvolveSearch: introduces a novel iterative self-evolution framework that combines Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to enhance web search capabilities without external human-annotated reasoning data.
The framework alternates between an RL phase for exploration and generating rollouts, and an SFT phase that optimizes the base model using filtered high-quality rollouts.
This process leverages a hybrid reward mechanism and specific data filtering rules to enable continuous self-improvement in open web search domains.

Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems

Unified Framework: introduces, a systematic approach for topological structure learning in LLM-based Multi-Agent Systems, with Agent Selection (Selects agent subset), Structure Profiling (Identifies macro structure), and Topology Synthesis (Synthesizes micro graph), where the framework decomposes topology design into sequential stages for optimization.
The framework aims to learn optimal topological structures for MASs to enhance coordination performance and efficiency.
Each stage presents distinct challenges and research opportunities for designing adaptive multi-agent architectures.

AgentDNS: A Root Domain Naming System for LLM Agents

AgentDNS: introduces a root domain naming and service discovery system for LLM agents, with Service Registration (registers services), Service Proxy Pool (forwards requests), Service Search (discovers services), Service Resolution (resolves identifiers), Service Management (manages proxies), Service Billing (tracks costs), Authentication (verifies identity), AgentDNS DB (stores metadata), and AgentDNS API Server (provides API) components.
AgentDNS enables LLM agents to autonomously discover, resolve, and securely invoke third-party services across organizational and technological boundaries.
Inspired by traditional DNS, the system provides unified naming, natural language discovery, protocol-aware interoperability, authentication, and billing for multi-agent collaboration.

From Large AI Models to Agentic AI: A Tutorial on Future Intelligent Communications

LAM-based Agentic AI system: introduces a system architecture with LAMs (Core reasoning engine), Planner (Task decomposition/organization), Knowledge Base (External knowledge support), Tools (External/internal execution toolkit), and Memory (Stores historical information).
CommLLM framework: introduces a LAM-centric multi-agent collaborative system architecture with MDR (Acquire task-relevant information), MCP (Decompose tasks/generate pathways), and MER (Evaluate solutions/self-feedback).
The tutorial reviews the evolution from Large AI Models to Agentic AI and their applications in future intelligent communication systems, particularly in the context of 6G networks.

VOICE CMS: UPDATING THE KNOWLEDGE BASE OF A DIGITAL ASSISTANT THROUGH CONVERSATION

Voice CMS architecture: introduces a system for updating a digital assistant's knowledge base through conversation, integrating a Voice CMS workflow, Conversational Engine with Agents, Knowledge Base, VUI, and LLM.
The system allows hotel staff to naturally converse with the assistant to add or modify information, reducing the need for traditional graphical content management systems.
Evaluation compares the Voice CMS with a GUI for knowledge management tasks, analyzing user preference, usability, and performance across varying task complexities.

Efficient Leave-one-out Approximation in LLM Multi-agent Debate Based on Introspection

IntrospecLOO (introspective-leave-one-out): introduces an efficient method for evaluating agent contributions in LLM multi-agent debates, utilizing Agents, User/Query, Independently Respond Round, Debate Round, Aggregation, IntrospecLOO Round, and IntrospecLOO Prompt.
The method adds a single IntrospecLOO Round after standard debate rounds, prompting agents with an IntrospecLOO Prompt to update answers while disregarding one agent's response.
This approach approximates the traditional Leave-one-out method at significantly reduced query complexity, enabling efficient contribution evaluation.

VIRAL: VISION-GROUNDED INTEGRATION FOR REWARD DESIGN AND LEARNING

VIRAL: introduces a pipeline for generating and refining reward functions using multi-modal LLMs, including Input, Initial Generation, Policy Learning, and Refinement components.
The framework takes textual environment details, optional success code, and a multi-modal goal prompt to generate initial reward functions via collaborating LLMs and code verification.
Reward functions are refined iteratively based on performance evaluation and feedback from humans or a Video-LVLM, leading to improved agent behavior alignment.

VulBinLLM: LLM-powered Vulnerability Detection for Stripped Binaries

Vul-BinLLM: introduces an LLM-based framework for binary vulnerability detection, featuring an LLM-assisted Decompiler (enhances code) with an Optimization Decision Agent (decides optimizations) and Action Agents (perform optimizations), a Code Memory Management Agent (manages functions), VulBinQ (queue), and Archived Analysis (storage).
The framework optimizes decompilation by adding vulnerability-specific comments and contextual information before analyzing the code for vulnerabilities.
It utilizes memory management and a function queue to handle large binary files and reduce LLM hallucinations during vulnerability reasoning.

EFFICIENTLY ENHANCING GENERAL AGENTS WITH HIERARCHICAL-CATEGORICAL MEMORY

EHC framework: introduces a general agent framework with Hierarchical Memory Retrieval (HMR), Task-Category Oriented Experience Learning (TOEL), Memory Pool (M), and LLM (Large Language Model), designed for efficient multi-modal task handling.
The framework uses a hierarchical memory system for rapid retrieval and continuous storage, mitigating redundancy and overhead.
It employs task-oriented learning to classify experiences and extract category-specific patterns, enhancing adaptability and interpretability.

MapStory: LLM-Powered Text-Driven Map Animation Prototyping with Human-in-the-Loop Editing

MapStory: introduces a text-driven map animation prototyping tool, with Scene Breakdown Agent (parses script), Map Animation Researcher Agent (retrieves geospatial data), and Map Animation Modules (camera, highlight, animated elements), that generates editable map animations from natural language scripts.
The tool leverages an agentic LLM architecture to produce a scene breakdown and grounds the script in factual geospatial data using web search and APIs.
MapStory supports human-in-the-loop editing through an interactive timeline editor and properties panel for fine-grained control and rapid iteration.

LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents

LaMDAgent (Language Model Developing Agent): introduces an autonomous framework using an Agent (LLM-based selector) to iteratively construct and optimize post-training pipelines by selecting from Predefined Action Types (available operations) and an Object Pool (available resources), evaluating the resulting Model (target LLM) based on a Score (performance metric), and updating its Memory (stores experiences).
The framework automates the post-training pipeline design process by iterating through action enumeration, selection, model evaluation, and memory update steps.
This agent-based approach reduces the need for specialized knowledge and human intervention in discovering effective model improvement strategies.

Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference

Future System: introduces a metadata management system and a hierarchical KVC caching system, featuring a reuse-optimized metadata caching scheme, a workload-aware index structure, and a hotness-aware data placement strategy to optimize KVC management for LLM prefix prefilling.
The proposed system aims to minimize time to first token for long-context inference by efficiently handling range queries and random get queries.
The approach is designed to leverage the unique high reusability and mixed sequential-random access patterns observed in KVC prefix prefill workloads.

Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development

Co-Saving: introduces a resource-aware multi-agent collaboration system leveraging experiential knowledge to enhance efficiency and quality, including Multi-Agent System (Collaborative structure), Agents (Individual LLM entities), Experiential Knowledge (Historical task data), Shortcuts (Learned instructional transitions), Reference Chain (Historical successful trajectory), Inference Chain (Current task execution), Shortcut Filtering (Selecting effective shortcuts), Shortcut Formalization (Graph representation), Shortcut Evaluation (Scoring shortcuts), Cost Design (Time and token metric), Emergency Factor (Dynamic value/cost weighting), and Force Termination Mechanism (Prevents resource exhaustion).
The system utilizes shortcuts mined from historical successful trajectories to bypass redundant reasoning steps and accelerate problem-solving in familiar contexts.
A dynamic emergency factor and force termination mechanism are integrated to manage resource consumption and prevent exhaustion during task execution.

Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation

LLM-ABM Framework: introduces a method for large-scale urban mobility simulation by integrating LLM with Agent-Based Modeling, including Data Collection, Large Language Model (LLM), Agent Profile, Agent Schedule, Routine Allocation, Occasional Locations, and Multi-Transit Route components.
The framework leverages LLM to generate diverse and realistic synthetic population profiles and personalized agent schedules.
Agent locations are allocated based on grid data and Points of Interest, and personalized routes are generated using a multi-criteria routing algorithm.

27th May 2025

Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

AIDSAFE (Agentic Iterative Deliberation for Safety Reasoning): introduces a multi-agent framework for generating policy-embedded Chain-of-Thought data, including Initialization, Deliberation, and Refinement stages with dedicated agents.
The framework leverages collaborative reasoning among Deliberation Agents and post-processing by a Refiner Agent to produce high-quality, policy-adherent CoTs and responses from an Input Query and Safety Policies.
This approach aims to improve LLM safety generalization and jailbreak robustness by providing superior data for supervised fine-tuning.

BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

BehaviorSFT: introduces a training strategy using behavioral tokens to condition pre-trained foundation LLMs for dynamic behavioral selection across the reactive-proactive spectrum, evaluated on the BehaviorBench dataset.
The approach leverages supervised fine-tuning to enable implicit contextual behavior assessment and behavior-conditioned generation for clinical agents.
BehaviorSFT aims to improve the balance between helpful proactivity and necessary restraint in LLM responses for healthcare applications.

AI-Supported Platform for System Monitoring and Decision-Making in Nuclear Waste Management with Large Language Models-25367

Multi-agent Retrieval-Augmented Generation (RAG) system: introduces a platform for nuclear waste management decision-making with a Multi-agent System (Collaboration) including Regulatory Compliance Agent (Checks regulations), Safety & Environmental Agent (Assesses risks), and Documentation & Reporting Agent (Compiles reports), leveraging Retrieval-Augmented Generation (RAG) (Retrieval and generation) with LLM (Llama 3.2) (Base language model), Embeddings (mxbai-embed-large-v1) (Generates semantic vectors), and Document Retrieval (Retrieves relevant documents) accessing Regulatory Compliance Database (Stores regulatory documents) and Safety & Environmental Database (Stores safety/environmental data).
The system employs a structured 10-round discussion model for agents to iteratively refine assessments and ensure document-grounded responses.
Evaluation metrics like Context Relevance Distribution and Agent Agreement Rate demonstrate the framework's effectiveness in maintaining factual grounding and decision consistency.

Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making

Catfish Agent Framework: introduces a multi-agent system with Moderator Agent, Catfish Agent, Expert Agent, Team Leader, Team Member, and Summary Agent components to disrupt silent agreement in clinical decision making.
The framework employs complexity-aware and tone-calibrated interventions by the Catfish Agent to stimulate deeper reasoning and prevent premature consensus.
Evaluations show the method improves diagnostic accuracy on medical Q&A and VQA benchmarks compared to single- and multi-agent baselines.

Robust Hypothesis Generation: LLM-Automated Language Bias for Inductive Logic Programming

LLM-Automated Language Bias for Inductive Logic Programming Framework: introduces a novel framework for robust hypothesis generation by integrating LLMs with ILP, including a LLM-Based Multi-agent System (Generates language bias), Translator agent (Transforms text to facts), Language Bias (Structured symbolic vocabulary), Facts (Symbolic data representation), ILP Solver (Learns interpretable rules), and Optimal Hypothesis (Final learned rules).
The framework utilizes a multi-agent LLM system (Actor and Critic agents) to automate the generation of the language bias (predicate system) directly from raw text.
This automated symbolic grounding guides a Translator agent to convert text into facts for an ILP solver, which then learns interpretable rules as the optimal hypothesis.

Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration

EXTAGENTS: introduces a multi-agent framework for scaling external knowledge input beyond LLM context windows, featuring Seeking Agents (Process input chunks), Reasoning Agent (Synthesize information, generate output), Global Knowledge Synchronization (Agents share and rank information), and Knowledge-Accumulating Reasoning (Reasoning agent integrates information iteratively).
The framework partitions massive input into chunks processed by Seeking Agents, whose outputs are shared and ranked via global knowledge synchronization.
A Reasoning Agent then iteratively integrates the synchronized information through knowledge-accumulating reasoning to produce the final output.

Autonomous Multi-Modal LLM Agents for Treatment Planning in Focused Ultrasound Ablation Surgery

FUAS-Agents: introduces an autonomous agent system leveraging multimodal LLMs for Focused Ultrasound Ablation Surgery treatment planning, including Planner Agent (interprets instructions, decomposes tasks), Executor Agent (performs specific tasks), Strategy Agent (generates treatment plans), Optimizer Agent (refines outputs, integrates results), and Memory Module (integrates medical resources, manages data).
The system integrates patient profiles and MRI data, orchestrating specialized medical AI tools for segmentation, dose prediction, and clinical guideline retrieval to generate personalized treatment plans.
Evaluated in a uterine fibroid scenario, the generated plans demonstrate high completeness, accuracy, fluency, and clinical compliance according to human expert assessment.

Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History

Dataset Generation Framework: introduces, "a multi-agent pipeline", with Users (Simulated input), Dialogue Generation Controller (Orchestrates workflow), User Simulator (Generates user questions), Out-of-Context Detector (Validates questions), and QA LLM (Responds to questions), where "the framework generates synthetic dialogues embedding sociodemographic attributes for evaluating LLM adaptation".
The pipeline simulates user-LLM interactions, with a user simulator generating profile-aligned questions and an out-of-context detector ensuring question validity.
This agent-based approach creates a controlled dataset enabling assessment of LLM behavioral consistency when user attributes are provided explicitly or implicitly.

PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims

PEDANTIC-based Definiteness Examination: introduces PEDANTIC Dataset (corpus of patent claims with indefiniteness annotations), Dataset Creation Pipeline (automatic process using LLMs to build PEDANTIC), Logistic Regression Model (baseline prediction model), LLM Agent Model (LLM-based prediction model using tools), Binary Classification (evaluates definite/indefinite prediction), Multi-Label Classification (evaluates indefiniteness category prediction), and Pairwise Reasoning Judge (LLM-as-Judge evaluates reasoning quality), presenting a dataset and evaluation framework for automatic patent claim definiteness examination.
The PEDANTIC Dataset contains 14k US patent claims annotated with indefiniteness reasons extracted using an automatic pipeline leveraging Large Language Models.
The framework evaluates Logistic Regression and LLM Agent models on binary and multi-label classification tasks, and uses an LLM-as-Judge to assess the quality of generated indefiniteness reasoning.

Large Language Models Miss the Multi-Agent Mark

MAS LLMs (Multi-Agent Systems of Large Language Models): introduces a critique of current MAS LLMs, highlighting issues with Agents (lack native social behaviour), Environment (often textual, LLM-centric), Coordination (often sequential, orchestrated), Communication (often natural language), Memory (lack long-term persistency), and Asynchronicity (often absent).
The paper argues that current MAS LLMs often fail to embody fundamental multi-agent system characteristics by overemphasizing LLMs and overlooking established MAS literature.
It advocates for better integrating MAS concepts like native social agents, non-LLM-centric environments, asynchronous communication protocols, and quantifiable emergent behaviours.

Complex System Diagnostics Using a Knowledge Graph-Informed and Large Language Model-Enhanced Framework

LLM-Informed Diagnostic Framework: introduces a novel approach integrating KGs and LLMs for complex system diagnostics, featuring Model Construction, KG-DML, Model Interaction, and an LLM Agent with diagnostic tools.
The framework automates DML model construction from system documentation using an LLM-based workflow and stores this structured logic in a KG-DML.
An LLM agent facilitates interactive diagnostics by interpreting user queries and invoking KG-based tools for upward/downward reasoning and Graph-RAG retrieval to generate diagnostic insights.

PACT: A Contract-Theoretic Framework for Pricing Agentic AI Services Powered by Large Language Models

PACT: introduces a contract-theoretic framework for pricing cloud-based agentic AI services, modeling task-dependent multi-dimensional QoS, costs (including liability), and user types to design contracts satisfying individual rationality and incentive compatibility.
The framework models QoS based on objective response time and subjective user satisfaction, accounting for computational, infrastructure, and potential liability costs for the service provider.
Through contract-based selection, PACT enables users to receive tailored service offerings aligned with their needs while ensuring incentive compatibility and individual rationality under information asymmetry.

Creativity in LLM-based Multi-Agent Systems: A Survey

LLM-based Multi-Agent Systems: introduces a survey on creativity in these systems, outlining a structured framework with Input (user input text/image), Workflow (Three-stage creative process), Planning (formulate objectives, structure tasks), Process (implement tasks, coordinate interaction), Decision Making (evaluate options, determine outcome), Technique (methods for idea generation/refinement/synthesis), Persona (agent roles and profiles), and Output (generated text/image content).
The framework details how agents, guided by personas and employing various techniques, navigate a three-stage workflow to transform user inputs into creative outputs.
The survey maps techniques, datasets, and evaluation methods, highlighting how collaborative structures and agent proactivity influence creative potential in these systems.

Simulating Ethics: Using LLM Debate Panels to Model Deliberation on Medical Dilemmas

ADEPT (AI Deliberative Ethics Protocol Toolkit): introduces a system for simulating multi-perspective ethical debates using LLM personas, with AI Persona Specs, Scenario & Options, and Model Config inputs managed by an Orchestrator utilizing an OpenAI o3 model.
The framework orchestrates structured debates through phases, logging interactions and votes into Debate Outputs for transparency and audit.
A Summariser Agent processes the debate outputs to provide an executive summary, facilitating the analysis of how different ethical perspectives influence deliberation outcomes.

Creativity in LLM-based Multi-Agent Systems: A Survey

LLM-based Multi-Agent Systems: introduces a survey on creativity in these systems, outlining a structured framework with Input (user input text/image), Workflow (Three-stage creative process), Planning (formulate objectives, structure tasks), Process (implement tasks, coordinate interaction), Decision Making (evaluate options, determine outcome), Technique (methods for idea generation/refinement/synthesis), Persona (agent roles and profiles), and Output (generated text/image content).
The framework details how agents, guided by personas and employing various techniques, navigate a three-stage workflow to transform user inputs into creative outputs.
The survey maps techniques, datasets, and evaluation methods, highlighting how collaborative structures and agent proactivity influence creative potential in these systems.

Simulating Ethics: Using LLM Debate Panels to Model Deliberation on Medical Dilemmas

ADEPT (AI Deliberative Ethics Protocol Toolkit): introduces a system for simulating multi-perspective ethical debates using LLM personas, with AI Persona Specs, Scenario & Options, and Model Config inputs managed by an Orchestrator utilizing an OpenAI o3 model.
The framework orchestrates structured debates through phases, logging interactions and votes into Debate Outputs for transparency and audit.
A Summariser Agent processes the debate outputs to provide an executive summary, facilitating the analysis of how different ethical perspectives influence deliberation outcomes.

Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems

LLM-based Multi-Agent System: introduces a framework to investigate herd behavior in multi-agent systems, featuring LLM-based Agents (autonomous decision makers) receiving Question Input (initial task) and Peer Information Input (peers' responses), utilizing a Confidence Mechanism (internal certainty assessment) for Response Generation (initial answer) and revision, modulated by Peer Information Presentation (format and order), Peer Persona (peer attributes), and System Prompt (behavioral instructions).
The system simulates agents interacting and making decisions, where herd behavior is measured by the flip rate, the tendency of agents to change their initial response based on peer input.
Experiments manipulate agent self-confidence, perceived peer confidence, and peer information presentation factors to understand their impact on conformity and collective outcomes.

CXXCrafter: An LLM-Based Agent for Automated C/C++ Open Source Software Building

CXXCrafter: introduces an LLM-based agent system for automated C/C++ software building, including a Parser Module (Extracts build-related information), a Generator Module (Generates/modifies Dockerfile), and an Executor Module (Executes Dockerfile, captures errors).
The system leverages LLMs to dynamically manage complex build processes by iteratively addressing issues based on feedback.
CXXCrafter achieves a 78% build success rate across 752 C/C++ projects by handling dependency management, diverse build systems, and error diagnosis.

Agent-Environment Alignment via Automated Interface Generation

ALIGN (Auto-Aligned Interface Generation): introduces a framework that automatically generates interfaces to alleviate agent-environment misalignment, utilizing an Analyzer (Identifies misalignments) and an Optimizer (Generates/refines interface) to produce an interface with INFERRULES (Static information alignment) and WRAPSTEP (Dynamic observation enhancement) that mediates interaction between the Agent (Interacts with environment) and Environment (Provides state/feedback).
The framework operates iteratively, with the Analyzer identifying misalignments from failed trajectories and the Optimizer generating an improved interface based on these findings.
The ALIGN-generated interface enhances both static environment information and step-wise observations, improving agent performance across diverse interactive tasks without modifying agent or environment code.

AITEE - Agentic Tutor for Electrical Engineering

AITEE (Agentic Tutor for Electrical Engineering): introduces an agent-based tutoring system for electrical engineering, with Circuit (Input image), Detection of components and connections (Processes circuit image), Conversion into Graph/Netlist (Creates textual representation), Simulation with Spice (Validates circuit calculations), Scripts (Lecture material knowledge base), Relevant context in vector database (Stores script embeddings), Retriever (RAG) (Retrieves relevant script context), Large Language Model (Core AI tutor), LLM-Instructions (Guides Socratic dialogue), Students (User), Prompt (Student query), and Output (Tutor response) components, designed to provide interactive and personalized learning experiences.
The system processes hand-drawn or digital circuit diagrams, converts them into a machine-readable format, and uses a graph-based similarity measure for context retrieval from lecture materials.
AITEE employs a Socratic dialogue approach guided by LLM instructions and validates calculations using SPICE simulation to foster learner autonomy and ensure accuracy.

Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation

ATD (Adaptive Text Dreamer): introduces a dual-branch LLM self-guided imagination policy for VLN, with Left Brain (State Estimation LLM), Right Brain (Imagination LLM), Q-Former, LLM Encoder, LLM Decoder, State Grounded Cross-Attention (SGCA), Graph-based Navigation Policy, Latent Embedding Injection, Multi-head Cross-Attention (MCA), Graph-aware Self-Attention (GASA), and MLP components.
The framework leverages language-based imagination, employing a left brain for state estimation and a right brain for imaginative prediction, constrained by the estimated state.
Imagined textual representations are integrated into a graph-based navigation expert via latent embeddings and cross-attention to guide action decisions.

RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

RepoMaster: introduces an autonomous agent framework for exploring and understanding GitHub repositories, consisting of Repository Search, Hierarchical Repository Analysis, and Autonomous Exploration & Execution.
Hierarchical Repository Analysis builds structural representations like HCT, MDG, and FCG to identify Core components for efficient understanding.
Autonomous Exploration & Execution uses Context-aware Code Exploration with Exploration tools and Context-aware Information Selection in an Interactive Feedback-based Execution loop to solve tasks.

MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems

MedSentry: introduces a benchmark and evaluation pipeline for medical LLM multi-agent systems, analyzing the safety risks posed by malicious agents within different architectural topologies.
The framework evaluates four representative multi-agent topologies (Centralized, Decentralized, Layers, SharedPool) by injecting a Dark Personality Agent and assessing system safety using an Enforcement Agent defense mechanism.
MedSentry provides a rigorous evaluation framework and practical defense strategies for designing safer LLM-based multi-agent systems in medical domains.

MT-MOL: Multi Agent System with Tool-based Reasoning for Molecular Optimization

MT-MOL (Multi Agent System with Tool-based Reasoning for Molecular Optimization): introduces a multi-agent framework for molecular optimization featuring Analyst agents (Select relevant tools), a Scientist agent (Generates molecule/reasoning), a Verifier agent (Validates consistency), and a Reviewer agent (Provides feedback), utilizing Tool sets (Domain-specific functions), Top-k data (Reference molecules), and SMILES history (Previous designs).
The system integrates domain-specific tools and structured reasoning through agent interactions to produce interpretable and chemically grounded molecular designs.
An iterative generation and review process, including consistency validation and tool-informed feedback, refines the molecular candidates towards the design objective.

Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

MAMMQA: introduces a multi-agent framework for multimodal question answering with Modality Expert Agent (Extracts modality specific insights), Cross Modal Synthesis Agent (Synchronises information across modalities), and Aggregator Agent (Synthesizes outputs, resolves disagreements), splitting reasoning into interpretable stages.
The framework employs specialized agents for modality-specific extraction, cross-modal synthesis, and evidence-grounded aggregation without fine-tuning.
This modular design enhances interpretability, robustness, and zero-shot generalization by allowing agents to operate within their expertise domains.

ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools

ChemHAS (Chemical Hierarchical Agent Stacking): introduces a hierarchical agent stacking method with Initial Tools (Predefined chemistry tools), AI Agent (LLM-based agent), Agent Tool (Tool or agent), Global Tool Library (Collection of tools/agent tools), Stacking Process (Hierarchical combination method), Reinforcement Process (Two-stage optimization), ReAct method (Agent reasoning and tool use), and Stacking Agent (Enhanced tool/agent) to enhance chemistry tools by reducing prediction errors.
The Stacking Process involves Warmup Self Agent Stacking and Hierarchical Agent Stacking, iteratively building and evaluating agent tools and storing them in the Global Tool Library.
The resulting Stacking Agent leverages the complementary strengths of stacked tools, guided by a two-stage reinforcement process, to achieve improved performance on chemistry tasks.

Can Agents Fix Agent Issues?

AGENTISSUE-BENCH: introduces the first reproducible benchmark for agent issue resolution, comprising issue description (User reported problem), buggy version (Codebase commit), developer-committed patch (Ground truth fix), failure-triggering tests (Reproduce issue), and docker environment (Executable container).
Built from 50 reproducible real-world GitHub issues, the benchmark enables evaluating state-of-the-art software engineering agents.
Evaluation on AGENTISSUE-BENCH reveals that current SE agents have limited effectiveness in resolving agent-specific issues.

RRO: LLM Agent Optimization Through Rising Reward Trajectories

RRO (Reward Rising Optimization): introduces a scalable process supervision framework for LLM agents, including LLM Agent (Policy Model), Supervised Fine-tuning (Initial training on expert data), Reward Rising Sampling (Dynamically explores next actions), Process Reward Estimation (Estimates step reward via rollouts), Agent Optimization (DPO) (Optimizes policy using preferences), and Preference Data (Pairs of preferred/rejected actions).
The framework dynamically adjusts next action exploration based on rising reward trends to efficiently collect high-quality preference data for training.
RRO prioritizes reasoning steps with increasing rewards, reducing exploration cost while improving performance on multi-step tasks.

E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing

Brity Automation: introduces an end-to-end automation system for financial expense processing, integrating a Data Input Layer, Intelligent Processing Layer with OCR/IDP Module, Policy-based Classification Engine, AI Flow Module (Gen AI Integration), and Workflow Engine, a User Interaction & Learning Layer with Automation Agent Interface and Human-in-the-Loop (HITL) Mechanism, and a Backend Infrastructure Layer with Brity Automation Orchestrator, Database, and API Gateway.
The system automates document recognition, policy-based classification, intelligent exception handling using generative AI, and incorporates human judgment for continuous learning.
This approach aims to overcome limitations of traditional RPA by handling unstructured data and complex exceptions through human-AI collaboration.

SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution

SPA-RL: introduces Stepwise Progress Attribution (SPA), with LLM Agent, Environment, Progress Estimator, Grounding Signal, Fused Intermediate Reward, and PPO Update components, which is a reward redistribution framework reinforcing LLM agents by decomposing delayed rewards into stepwise contributions.
The framework trains a Progress Estimator to predict each step's contribution to task completion, combining this with a Grounding Signal for action executability to form a Fused Intermediate Reward.
This dense, goal-oriented Fused Intermediate Reward is then used within a PPO Update to train the LLM Agent, improving performance on long-horizon tasks with sparse rewards.

Hierarchical Instruction-aware Embodied Visual Tracking

HIEVT (Hierarchical Instruction-aware Embodied Visual Tracking): introduces a hierarchical tracking agent with LLM-based Semantic-Spatial Goal Aligner and RL-based Adaptive Goal-Aligned Policy components, designed for user-centric embodied visual tracking.
The LLM-based Semantic-Spatial Goal Aligner translates user instructions into spatial goals via Semantic Parsing, Spatial-Goal Generation, and Retrieval-Augmented Goal Correction.
The RL-based Adaptive Goal-Aligned Policy uses a Visual Foundation Model, Goal-State Aligner (with CNN and Reward Prediction), and Recurrent Policy Network (with LSTM and Actor Network) to align agent actions with the spatial goals for precise tracking.

GIFARC: Synthetic Dataset for Leveraging Human-Intuitive Analogies to Elevate AI Reasoning

GIFARC: introduces a data synthesis pipeline that transforms raw GIFs into analogy-grounded ARC-style tasks, utilizing a VLM to extract visual abstractions and LLMs to generate task sketches and executable tasks, including input-output pairs, analogy labels, and solution programs.
The pipeline processes GIFs through stages of visual abstraction, task sketching, and executable task generation to create a dataset that embeds human-intuitive analogies into ARC-style problems.
The generated dataset aims to guide AI agents, particularly LLMs, to adopt an analogic approach for solving ARC tasks, aligning their reasoning more closely with human intuition.

LLM-Guided Reinforcement Learning: Addressing Training Bottlenecks through Policy Modulation

ULTRA (Large Language Model-Guided Policy Modulation Framework): introduces a framework that leverages LLMs to identify critical states from sub-optimal trajectories and provide action suggestions and rewards for policy refinement.
The framework's Identification component uses an LLM and a state interpretation function to pinpoint critical states in historical agent trajectories.
Its Improvement component refines the RL policy by incorporating LLM-suggested actions from a lookup table and LLM-generated rewards at critical states.

MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning

MIRROR (Multi-agent Intra- and Inter-Reflection for Optimized Reasoning): introduces a multi-agent framework with Planner Agent, Tool Agent, and Answer Agent, integrating Intra-reflection and Inter-reflection mechanisms supported by Long-Term Memory and Short-Term Memory for enhanced tool learning.
The framework employs intra-reflection for proactive error prevention within each agent before execution and inter-reflection for corrective learning and strategic adjustment based on task outcomes.
This dual-reflection approach systematically leverages LLM capabilities to improve task decomposition, tool selection, and answer generation in complex multi-agent workflows.

CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models

CoderAgent: introduces a LLM-based agent framework to simulate student programming processes, with Memory (Stores student proficiency), Tools (Interface with compilers), Planning & Action (Decision-making core), and Reflection (Evaluates generated code) components.
The framework simulates iterative coding by capturing cognitive states, using a Programming Tree of Thought for planning, and reflecting on generated code.
CoderAgent aims to provide interpretable insights into learning trajectories and accurate simulations without relying on large-scale real data.

Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

XpandA (Expand-Agent): introduces a multi-agent framework with Dynamic Chunking (Splits input text), Explorer Agents (Process text chunks), Decider (Decides next action), Shared Information Memory (Centralized knowledge store), Question-driven Workflow (Guides agent communication), Selective Replay (Revisits relevant chunks), Unsolved Problem Tracer (Tracks unsolved questions), and Information (Stores gathered answers), designed for robust long-context processing.
The framework dynamically partitions long texts, uses a question-guided protocol to update shared memory, and selectively replays partitions based on question-information state tracking.
XpandA demonstrates feasibility for processing ultra-long sequences up to 1M tokens, achieving performance improvements and inference speedup over baselines.

26th May 2025

Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting

Project Riley: introduces a multimodal multi-agent LLM architecture for emotional reasoning, featuring Input (Receives user query/context) processed by LLM vision model (Image processing) and LLM text model (Text generation/reasoning), distributed to Emotional agents (Five distinct emotion agents) with Emotion's history (Separate history per agent) for Multi-round processing (Iterative agent dialogue), culminating in Voting and Analysis (Agents evaluate/vote) and Final Synthesis (Synthesizes final response) for the Final response (Output to user).
The architecture simulates reasoning influenced by five distinct emotional states (Joy, Sadness, Fear, Anger, Disgust) through structured multi-round dialogues and a final synthesis mechanism.
The system integrates textual and visual LLMs, advanced reasoning, and self-refinement processes to generate emotionally informed responses.

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

SWE-rebench Automated Pipeline: introduces a novel, automated, and scalable pipeline for continuously extracting real-world interactive software engineering tasks from GitHub repositories, comprising Preliminary Task Collection, Automated Installation Instructions Configuration, Execution-based Installation Verification, and Automated Instance Quality Assessment, resulting in the SWE-rebench Dataset and SWE-rebench Benchmark used within a standardized Evaluation Framework employing ReAct-style Scaffolding, a Terminal Environment, Special Tools, and an LLM agent.
The pipeline addresses challenges in training data availability and evaluation reliability for LLM-based software engineering agents by providing a large-scale, diverse, and continuously updated dataset and a contamination-free benchmark.
The standardized evaluation framework enables transparent and fair comparisons of LLM agent performance on interactive software engineering tasks, mitigating issues like data contamination and scaffolding variability.

ALITA: GENERALIST AGENT ENABLING SCALABLE AGENTIC REASONING WITH MINIMAL PREDEFINITION AND MAXIMAL SELF-EVOLUTION

ALITA: introduces a generalist agent with minimal predefinition and maximal self-evolution, featuring Manager Agent (central coordinator), Web Agent (external information), MCP Brainstorming (plan tools), Script Generating Tool (generates code), Code Running Tool (executes code), Environment Management (manages environments), MCP Box (stores MCPs), and CodeReAct Loop (iterative process).
The Manager Agent orchestrates the CodeReAct loop, utilizing the Web Agent for information and the MCP creation tools to generate, execute, and store new capabilities as MCPs.
This design allows ALITA to autonomously evolve its capabilities through continuous MCP integration, reducing dependence on manual predefinition.

MASKSEARCH: A Universal Pre-Training Framework to Enhance Agentic Search Capability

MASKSEARCH: introduces a pre-training framework to enhance LLM agentic search capabilities using the RAMP Task (pre-training objective), trained via SFT (supervised fine-tuning) or RL (reinforcement learning), leveraging an LLM (core language model) interacting with a Search Tool (external search interface), Retriever (knowledge retrieval module), and Knowledge Corpus (external knowledge base), supported by Agent-Based CoT Construction (SFT data generation method), Self-Evolve Distillation (iterative data scaling), Curriculum Learning (progressive training strategy), and an RL Reward System (reinforcement signal).
The framework trains models on the Retrieval-Augmented Mask Prediction (RAMP) task, where the model learns to use search tools to fill masked spans in text.
Training involves a two-stage approach combining pre-training on RAMP with supervised fine-tuning or reinforcement learning on downstream tasks, demonstrating improved performance on open-domain question answering.

syftr: Pareto-Optimal Generative AI

syftr: introduces a framework that performs multi-objective search over agentic and non-agentic RAG flows, composed of Synthesizing LLM, Reranker, Embedding Model, Splitter, HyDE, Retriever, Prompt, Dynamic Few-Shot Retriever, and Additional Context components, to find Pareto-optimal flows balancing task accuracy and cost.
The framework utilizes Bayesian Optimization with a novel early-stopping mechanism to efficiently explore a vast search space of RAG configurations.
syftr identifies flows that are significantly cheaper or more accurate than baseline configurations across multiple RAG benchmarks.

ON PATH TO MULTIMODAL HISTORICAL REASONING: HISTBENCH AND HISTAGENT

HistAgent: introduces a domain-specialized AI agent for historical reasoning, with a Manager Agent (Central coordinator) orchestrating specialized agents including Text WebBrowser Agent (Web search/parsing), Image Information Agent (Image search/analysis), Literature Search Agent (Scholarly search/citation), File Processing Agent (Handle non-HTML files), OCR Agent (Extract text from images), Speech Recognition Agent (Convert audio to text), Translator Agent (Translate text), and Video Agent (Extract frames from video).
HistAgent integrates these modular tools and a ReAct-style loop to process multimodal inputs and generate cited responses grounded in historical sources.
The agent is evaluated on HistBench, a new benchmark for historical reasoning, and demonstrates superior performance compared to generalist LLMs and agents.

THINK: Can Large Language Models Think-aloud?

THINK (Testing Higher-order Notion of Knowledge): introduces a multi-agent, feedback-driven evaluation framework for assessing and improving LLM higher-order thinking skills using flawed math problems (Initial input data), a multi-agent evaluation stage (Parallel agent system) with agents (Evaluate problems) including Bloom-aligned agents (Assess Bloom levels) and a holistic evaluation agent (Assess quality, suggest improvements), agent feedback & ratings (Scores and suggestions), a quality assessment protocol (Metrics for quality) with a quality threshold (Success criterion), an iterative revision loop (Refinement process) involving a think-aloud process (LLM reflection) by the LLM (Revises problems) guided by "Five Keys" (Structured criteria), resulting in an improved problem set (Refined output data).
The framework uses a parallel multi-agent system to evaluate flawed math problems based on Bloom's Taxonomy and "Five Keys" criteria, generating scores and structured feedback.
An iterative revision loop, guided by agent feedback, prompts the LLM to refine problems via a "think-aloud" process until a quality threshold is met, enabling deeper analysis of reasoning and revision behaviors.

Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers

EXSEARCH (exploratory search framework): introduces an agentic search framework, empowering an LLM with thinking, search, and recording actions, trained via a self-incentivized Generalized Expectation-Maximization algorithm.
The framework enables the LLM to iteratively explore search trajectories, retrieve relevant documents using an external retriever, and extract fine-grained evidence.
A re-weighted trajectory learning process in the M-step, guided by importance weighting, progressively improves the LLM's search and reasoning capabilities.

Agentic AI Process Observability: Discovering Behavioral Variability

Agentic AI Process Observability Approach: introduces a method to enhance developer observability of agent behavior variability, including trajectory files generation (Capture agent execution logs), event-log processing (Consolidate logs into event log), process and causal discovery (Analyze event log for variability), rule derivation (Generate rules for split points), static analysis (LLM analyzes rules vs spec), and reliability calculation (Assess data sufficiency for splits).
The approach leverages process and causal discovery on agent execution trajectories to identify behavioral variability and uses LLM-based static analysis to distinguish intended from unintended variability.
This method provides developers with insights into agent behavior, aiding in debugging, refining specifications, and improving control over non-deterministic AI agents.

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

TrojanStego: introduces a threat model where a Malicious Actor fine-tunes a Trojan Model (Fine-tuned LLM) and distributes it on a Public Platform, allowing the Malicious Actor to extract secrets from outputs generated by a Genuine User using an Encoding Scheme (Embeds bits via token selection) and Decoding Process (Extracts bits from output).
The core method, the Bucket Method, partitions the LLM's token vocabulary to encode binary bits into the output token sequence.
This attack allows covert data exfiltration without requiring explicit control over inference inputs or leaving obvious traces.

REARANK: Reasoning Re-ranking Agent via Reinforcement Learning

REARANK (Reasoning Re-ranking Agent via Reinforcement Learning): introduces a large language model-based listwise reranking agent that explicitly reasons before reranking, trained using reinforcement learning and data augmentation.
The agent's architecture includes an LLM policy generating reasoning and ranking, optimized by an RL framework with a reward model and reference policy.
Data augmentation from limited annotations and a sliding window strategy enhance training efficiency and practical deployment.

Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking

STeP (Self-Reflected Trajectories and Partial Masking): introduces a novel method for training LLM-based agents using Self-reflected Trajectories (Trajectories with teacher reflection/correction) and Partial Masking (Masks incorrect steps during SFT), building upon a Base LLM Agent (Initial agent) trained with SFT (Training method) on Golden Trajectories (Successful expert trajectories) and guided by an LLM Teacher (Evaluates, provides reflection/correction) interacting with an Environment (Agent interacts, provides feedback).
The method synthesizes self-reflected trajectories by having a teacher LLM evaluate a base agent's actions in real-time and provide corrections for errors.
Partial masking is applied during fine-tuning to prevent the agent from learning from the identified incorrect steps in the augmented trajectories.

WEBCOT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback

WEBCOT: introduces a framework that enhances web agent reasoning by reconstructing inference-time processes into chain-of-thought rationales used to train the agent language model, including reflection & lookahead, branching, and rollback components.
The framework leverages a language model to interact with a dynamic web environment using actions and observations, guided by the distilled reasoning patterns.
By distilling specific reasoning skills into the backbone LLM via fine-tuning, WEBCOT significantly improves performance on web agent tasks across multiple benchmarks.

Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents

Framework: introduces a training-free approach for student simulation, including cognitive prototype construction, behavior prediction, and solution simulation using πdesc, πnode, πedge, πlocal, πglobal, πpred, πrefine, and πvalue components.
The framework constructs a knowledge graph-based cognitive prototype from past learning records to predict student behavior on new tasks.
It employs a beam search-based self-refinement process to generate realistic student solutions consistent with predicted behavior.

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

MLR-Bench: introduces a comprehensive benchmark evaluating AI agents on open-ended machine learning research, comprising MLR-Bench Tasks, MLR-Judge, and MLR-Agent.
MLR-Bench supports stepwise evaluation through MLR-Agent's stages (Idea Generation, Literature Review, Proposal Generation, Experimentation, Paper Writing) and end-to-end evaluation, with MLR-Judge (using LLM Judges and Review Rubrics) automating assessment.
Evaluation highlights that while agents can generate ideas and papers, the Experimentation Stage often produces fabricated results, posing a significant challenge to scientific reliability.

Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval

MRA-CIR: introduces a zero-shot composed image retrieval framework that generates training triplets using Automatic Triplets Generation and fine-tunes a Vision-Language Model (VLM) using VLM Finetuning with InfoNCE Loss.
The Automatic Triplets Generation process includes Moderate Similarity Selection using a Pre-trained VLM to find image pairs and Modifying Text Generation via the Multimodal Reasoning Agent (MRA), which is based on an MLLM (MiniCPM-VL-2_6), to describe the transformation.
The VLM Finetuning utilizes the VLM's Q-Former to extract features and is trained with InfoNCE Loss to directly align composed queries and target images, bypassing intermediate textual representations.

EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM

EMAC+ (Embodied Multimodal Agent for Collaborative Planning with VLM+LLM): introduces a novel embodied multimodal agent that collaboratively integrates a VLM Agent (Processes visual input) and an LLM Expert (Generates/refines plans) via a bidirectional training paradigm, utilizing PDDL (Translates visual to text), a Retrospective Feedback Mechanism (Provides execution feedback), Long-term Memory (Stores history/feedback), and an Action Mapping Dictionary (Maps text to control).
The framework dynamically refines high-level textual plans from the LLM expert using real-time visual feedback from the VLM agent executing low-level control tasks.
This approach enables the LLM expert to internalize visual environment dynamics through interactive experience, improving domain-specific comprehension and generating more accurate and feasible plans for complex robotic tasks.

SCIENCEBOARD: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

SCIENCEBOARD: introduces a realistic, multi-domain environment for evaluating multimodal autonomous agents in scientific workflows, featuring Environment (Virtual Machine), Software (Scientific applications), Agent (Computer-using agent), Evaluator (Evaluation system), Observation Space (Perception modalities), Action Space (Interaction methods), Memory (Agent's state history).
The framework provides an infrastructure enabling computer-using agents to assist in scientific workflows by interacting autonomously via GUI actions or generated code.
It includes a challenging benchmark of 169 high-quality, rigorously validated real-world tasks spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics.

Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program

LLM Agent: introduces an LLM-based agent for autonomous spacecraft control in Kerbal Space Program Differential Games, using Environment (KSPDG) for simulation, processing State observations into a User prompt, feeding it to the LLM agent which generates an LLM reply with Function calling to produce an Action controlling the spacecraft.
The approach leverages prompt engineering and fine-tuning techniques on GPT-3.5 and LLaMA models to enable the agent to interpret real-time telemetry and output control commands.
The LLM-based agent achieved second place in the KSPDG challenge, demonstrating the potential of LLMs for autonomous space operations, particularly with fine-tuning on limited data.

SECVULEVAL: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection

Multi-agent pipeline: introduces a multi-agent system for C/C++ vulnerability detection, including a Normalization Agent (Parses function to AST), Planning Agent (Summarizes, creates vulnerability checklist), Context Agent (Extracts external context symbols), Detection Agent (Detects vulnerability, identifies statements), and Validation Agent (Evaluates detection, resolves disagreement).
The pipeline processes functions through sequential agents, with LLMs powering the Planning, Context, Detection, and Validation stages.
This multi-agent approach aims to decompose the complex task of vulnerability detection into smaller, manageable steps for improved LLM performance.

Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding

Agentic Predictor: introduces a framework for efficient agentic workflow performance prediction, utilizing a Multi-View Workflow Encoder (Encodes workflow features), Decoder Networks (Reconstructs workflow inputs), Cross-Domain Unsupervised Pretraining (Refines workflow representations), Task Encoder (Encodes task description), Performance Predictor (Estimates workflow performance), and Predictor-Guided Search (Selects promising workflows).
The framework employs multi-view encoding of graph, code, and prompt features combined with cross-domain unsupervised pretraining to address workflow heterogeneity and limited labeled data.
By predicting performance, the approach enables faster and more accurate selection of optimal agentic workflow configurations compared to execution-based methods.

Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning

GLIDER (Grounding Language Models as Efficient Decision-Making Agents via Offline HiErarchical Reinforcement Learning): introduces a hierarchical framework with a High-level policy (Plans sub-tasks) and a Low-level policy (Executes primitive actions) sharing an Actor-Critic (Shared model architecture) built on an LLM Backbone (Base language model) fine-tuned with LoRA (Parameter-efficient fine-tuning), trained through SFT (Behavior cloning stage), ORL (Offline RL refinement stage), and O2O (Online adaptation stage) using High-level replay buffer (Stores high-level data) and Low-level replay buffer (Stores low-level data) interacting with an Environment (Interactive task space), guided by High-Level Prompt (Guides high-level planning), Low-Level Prompt (Guides low-level execution), and Check Subtask Complete Prompt (Verifies subtask completion).
The framework decomposes complex tasks into sub-tasks planned by the high-level policy and executed as primitive actions by the low-level policy, enabling efficient exploration and learning for long-horizon tasks.
The hierarchical structure and multi-stage training pipeline, including behavior cloning and offline reinforcement learning, contribute to improved performance and generalization capabilities on interactive decision-making benchmarks.

NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

NeuSym-RAG: introduces a hybrid neural symbolic retrieval framework for PDF question answering, with Multiview Document Parsing (Parses PDF content), Relational Database (Stores structured data), Multimodal Vector Encoding (Encodes data to vectors), Vectorstore (Stores vector embeddings), LLM Agent (Plans and acts), Environment (Backend systems), Actions (Agent capabilities), and Prompt Template (Defines agent interaction).
The framework processes PDF documents into structured data and vector embeddings, enabling an LLM agent to iteratively retrieve information from both a database and a vectorstore.
This hybrid approach leverages multiple data views and retrieval strategies through executable actions to answer complex questions over semi-structured PDF content.

ReChisel: Effective Automatic Chisel Code Generation by LLM with Reflection

ReChisel (LLM-based agentic system): introduces an LLM-based agentic system with Generator (creates Chisel code), Compiler (translates Chisel to Verilog), Simulator (tests Verilog code), Inspector (collects feedback, trace, escape), Reviewer (analyzes trace/feedback, plans revision), Trace (history of iterations), Feedback (compilation/simulation results), Revision Plan (guidance for correction), Common Error Knowledge (pre-organized error fixes), and Escape Mechanism (breaks non-progress loops) components, designed to enhance Chisel code generation effectiveness.
The system iteratively refines generated Chisel code using a reflection mechanism that leverages feedback from compilation and simulation processes.
An escape mechanism is included to detect and break non-progress loops during the iterative refinement process.

Large Language Models for Planning: A Comprehensive and Systematic Survey

LLM-based Planning: introduces a comprehensive survey of methods that augment Large Language Models (processes input, generates output) with components like External Planners (generates formal plans), Memory Modules (stores, retrieves information), Validators (evaluates plans, outputs feedback), Data Sources (provides training data), Feedback Mechanisms (provides optimization signals), Decomposition Modules (breaks down tasks), External Executors (interacts with environment), and World Models (simulates environment dynamics) to enhance planning capabilities.
The survey categorizes approaches into external module augmented, finetuning-based, and searching-based methods, detailing planning definitions and evaluation frameworks.
The paper provides a systematic analysis of current advancements, challenges, and future directions in the field, serving as a resource for researchers.

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

FieldWorkArena: introduces a benchmark environment for evaluating agentic AI on real-world field work tasks, where a User downloads Input data and a Query from the Field Work Arena, an Evaluated agent performs Actions, generating an Execution log and Output, which an Evaluation program compares against Ground Truth to produce a Result.
The benchmark utilizes multimodal data including videos and documents from actual factory and warehouse settings.
Tasks are categorized into Planning, Perception, and Action, designed to assess agent capabilities in complex, dynamic environments.

DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

DoctorAgent-RL: introduces a multi-agent collaborative reinforcement learning framework, with Doctor Agent (optimizes questioning strategy), Patient Agent (simulates patient responses), Consultation Evaluator (provides multi-dimensional rewards), Supervised Fine-tuning (establishes baseline capabilities), Reinforcement Learning (optimizes strategy via interaction), and Dynamic Turn Budget Training Strategy (RL training strategy for efficiency), that models medical consultations as a dynamic decision-making process.
The framework enables the doctor agent to autonomously develop clinically-aligned questioning strategies through interactions guided by the evaluator's reward mechanism.
It utilizes the newly constructed MTMedDialog dataset for training and evaluation and demonstrates superior performance in multi-turn reasoning and diagnostic accuracy.

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

AgentRecBench: introduces, "benchmarking LLM agent-based personalized recommender systems", with Recommending Agents (LLM-based agents), Textual Experiment Environment (simulated interaction platform), U-R-I Network (user-review-item data structure), Datasets (source data), Standardized Query Functionality (environment interaction interface), Dynamic Data Visibility Control (data access management), Dynamic Planning (task decomposition module), Complex Reasoning (decision-making module), Tool Utilization (environment interaction module), Memory Management (experience storage/retrieval), and LLM (core language model), which provides a comprehensive benchmark and modular framework for evaluating agentic recommender systems.
The benchmark includes a textual environment simulator equipped with multi-domain datasets and a standardized agent development framework.
The framework facilitates rapid prototyping and systematic testing of recommendation agents across diverse scenarios and tasks.

Multi-Agent Collaboration via Evolving Orchestration

Puppeteer: introduces a multi-agent collaboration framework with a centralized orchestrator (Puppeteer) that dynamically directs LLM-based agents (Puppets) based on the evolving task state, using a Policy for agent selection and Orchestration for sequencing.
The framework employs Reinforcement Learning, guided by a Reward function from the Environment, to adaptively evolve the Puppeteer's Policy, optimizing agent selection and pruning for improved performance and efficiency.
This dynamic orchestration fosters the emergence of compact, cyclic reasoning structures among agents, enhancing collaborative effectiveness and reducing computational cost compared to static multi-agent systems.

LLM-Agent-Controller: A Universal Multi-Agent Large Language Model System as a Control Engineer

LLM-Agent-Controller: introduces a multi-agent large language model system for control engineering problems, integrating a central Controller Agent with specialized auxiliary agents and a Supervisor for coordination.
The system leverages components like Retriever, Researcher, Reasoner, Planner, Debugger, Communicator, Critic, and Memory agents to enhance robustness, versatility, and efficiency in solving control theory tasks.
The framework is designed for user-friendly interaction, enabling users without prior control theory knowledge to input problems in natural language and receive complete solutions.

AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare

Multi-Agent Framework for AMQA Construction: introduces AMQA, an Adversarial Medical Question-Answering dataset, with Clinical Vignette Filtering (Filters vignettes), Adversarial Variant Construction (Constructs variants), Manual Quality Control (Reviews quality), Generation-Agent (Generates descriptions), Fusion-Agent (Integrates descriptions), and Evaluation-Agent (Evaluates bias trigger) components, designed for automated, large-scale bias evaluation of LLMs in medical QA.
The framework generates adversarial patient descriptions by varying demographic attributes while keeping clinical details constant, enabling controlled testing of LLM performance differences across privileged and unprivileged groups.
The multi-agent design decomposes the complex task of generating adversarial vignettes into specialized sub-tasks handled by distinct LLM agents, followed by human review for quality assurance.

Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents

MemGAS: introduces a framework for long-term conversational agents that enhances memory consolidation and retrieval using multi-granularity association and adaptive selection, incorporating LLM Agent, Multi-Granular Memory Unit, Memory Bank, Dynamical Memory Association, Association Graph, Entropy-Driven Granularity Selection, Personalized PageRank, and LLM-Based Redundancy Filtering components.
The framework constructs multi-granular memory units and builds dynamic associations using Gaussian Mixture Models and an association graph.
An entropy-based router adaptively selects optimal granularity for retrieval, and retrieved memories are filtered by an LLM to refine the final context.

Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs

LINUXFL+: enhances fault localization for Linux kernel bugs, incorporating Directory-Aware Expansion, Potential-Cause Expansion, and Candidate Integration.
It refines initial agent predictions by leveraging the Codebase structure and historical knowledge from the Linux Kernel Mailing List, based on the Bug Report.
The framework aims to improve localization accuracy by expanding candidate selection based on directory context and potential bug causes.

VLMLight: Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning

VLMLight: introduces a traffic signal control framework with Vision-Language Meta-Control and Dual-Branch Reasoning, integrating Scene Understanding, Safety-Prioritized Meta-Control, Routine Control Policy, and Deliberative Reasoning Policy, which includes AgentPhase, AgentPlan, and AgentCheck, interacting with a TSC Simulator, Trajectory Memory, Traffic Phase Embedding, Intersection Embedding, Value Network, Policy Network, and the Environment.
The framework uses a VLM for scene understanding and an LLM meta-controller to switch between a fast RL policy for routine traffic and a multi-agent LLM reasoning branch for critical scenarios.
This hybrid architecture balances the efficiency of RL with the interpretability and robustness of LLM reasoning, particularly for prioritizing emergency vehicles.

Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs

FPX (Adaptive Mixed Precision Inference Framework): introduces an adaptive mixed-precision inference framework with Adaptive Mixed-Precision Algorithm, Offline Calibration, Precision Assignment Function, FP8 kernel, and FP4 kernel, designed to balance speed and accuracy for LLM agents in latency-sensitive tasks.
The framework dynamically adjusts model precision at the operator level, selectively applying FP4 quantization to compression-tolerant layers while preserving FP8 for sensitive components.
FPX utilizes an offline calibration process to identify layers suitable for aggressive quantization, enabling fine-grained control over the latency-quality trade-off.

Judging with Many Minds: Do More Perspectives Mean Less Prejudice?

Multi-Agent LLM-as-Judge: introduces a study evaluating intrinsic biases in multi-agent LLM-as-Judge frameworks, including Multi-Agent-Debate (Debate framework) with Judge (Initial/final evaluator) and Critic (Critiques/debates judgments), and LLM-as-Meta-Judge (Meta-reasoning framework) with Judges (Independent evaluators) and Meta-Judge (Select mode) (Selects best judgment) or Meta-Judge (Conclude mode) (Generates new judgment), also incorporating PINE (Bias mitigation agent).
The Multi-Agent-Debate framework amplifies biases after the initial debate, while the LLM-as-Meta-Judge approach shows greater resistance to intrinsic biases.
Incorporating a bias-free agent like PINE effectively reduces biases in debate settings but provides less benefit in meta-judge scenarios.

Improving Recommendation Fairness without Sensitive Attributes Using Multi-Persona LLMs

LLMFOSA (LLM-enhanced framework for Fair recommendation withOut Sensitive Attributes): introduces a framework to improve recommendation fairness without sensitive attributes using a Collaborative Encoder (learns user/item embeddings), a Multi-Persona Sensitive Information Inference Module (infers sensitive attributes) with a Persona Editor (generates diverse personas), Annotators (infer attributes using personas), and a Meta Summarizer (distills inference rationales), a Confusion-Aware Sensitive Representation Learning Module (refines sensitive representations) including a Sensitive Encoder (transforms to sensitive-aware embedding), Confusion Modeling (models annotator mislabeling), Consensus Regularization (aligns confusion matrices), and Fine-Grained Rationale Incorporation (incorporates inference rationales), a Preference Encoder (generates sensitive-blind embedding), and Model Optimization (optimizes MI objectives).
The framework leverages multi-persona LLMs to infer latent sensitive patterns from user behavior and incorporates these inferences into robust sensitive representations for fairness training.
Fairness is ultimately achieved by optimizing mutual information objectives to disentangle sensitive and sensitive-blind user representations.

Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI

Vibe Coding: introduces, "a human-centric paradigm", with Prompts (Natural language input), LLM (Code generation engine), Short-Term Context (Limited session memory), Developer (Human user), Thinking (Strategic problem formulation), Framework (Architectural awareness), Checkpoints (Version control), Debugging (Collaborative error resolution), Context (Information provision), where the developer guides an LLM through iterative prompts for creative exploration and rapid prototyping.
Agentic Coding: introduces, "an autonomous paradigm", with Objectives (High-level goals), Planner (Task decomposition module), Executor (Task execution module), Tool Use Environment (Integrated runtime environment), Sandbox Environment (Secure isolated environment), Long-Term Memory (Persistent state storage), API (External tools/interfaces), Git (Version control system), Test Suite (Automated tests), Multi-Agent Coordination (Specialized agents collaborating), Toolchain Integration (Full-stack tool orchestration), Validation Pipeline (Integrated QA loop), Security and Guardrails (Embedded safety mechanisms), Observability and Feedback (Monitoring and refinement), Deployment and CI/CD (Automated workflows), where goal-driven agents autonomously plan, execute, test, and iterate on complex software tasks with minimal human intervention.
The paper compares these two paradigms, highlighting differences in autonomy, architectural design, developer role, and practical implications for software development workflows and use cases.

Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents

TME (Task Memory Engine): introduces a modular memory controller, with TRIM (Task Representation and Intent Management), TMS (Task Memory Structure), and LLM (Large Language Model), that transforms LLMs into robust, revision-aware agents using a spatial memory framework.
TME replaces linear context with a TMS-DAG forest to dynamically track subtasks, dependencies, and revisions, orchestrated by the TRIM module.
This graph-based approach ensures global task consistency, revision-aware reasoning, and token efficiency by retrieving relevant subgraphs for the LLM.

Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

ACBench (Agent Compression Benchmark): introduces a comprehensive benchmark for evaluating compressed LLMs' agentic capabilities, including Action Execution, Workflow Build, Long Context, and Real-World tasks, under various Quantization and Sparsification methods across different LLM categories (Small LM, Reason LM, Normal-LLM), analyzed using ERank, Top-K Ranking Correlation, and Energy metrics.
The benchmark assesses how compression impacts LLMs' ability to perform complex, multi-turn agentic tasks beyond traditional language modeling and understanding benchmarks.
The analysis tools provide insights into how compression affects model outputs, internal representations, and decision-making processes.

Frictional Agent Alignment Framework: Slow Down and Don't Break Things

FAAF: introduces a framework that conditions a language model on dialogue history and frictive states to generate interventions prompting reflection in collaborative tasks.
The framework utilizes a reference model and preference data to optimize an objective function for learning effective friction interventions.
By explicitly conditioning on frictive states, the approach aims to generate precise and interpretable interventions for dynamic human-AI collaboration.

CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems

CoTGuard: introduces a trigger-based copyright protection framework for multi-agent LLM systems, with Multi-Agent LLM System, Chain-of-Thought Reasoning, Trigger Key, Task Type, Trigger Generation Function, Trigger Pattern, Prompt Modification, Intermediate Reasoning Trace, Repository of Known Trigger Patterns, Trigger Detection Function, Similarity Scoring, and Aggregation components, designed to detect copyright leakage by embedding triggers in intermediate reasoning steps.
The framework leverages Chain-of-Thought reasoning traces as an attack surface and detection medium, enabling fine-grained monitoring of content reproduction during agent collaboration.
CoTGuard achieves high detection accuracy with minimal impact on task performance by analyzing reasoning paths for trigger-induced patterns.

25th May 2025

ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning for Robust Agent Defense

ALRPHFS (Adversarially Learned Risk Patterns with Hierarchical Fast&Slow Reasoning): introduces a defense framework with an Offline Module (constructs database) for learning risk patterns and an Online Module (implements real-time defense) for hierarchical reasoning.
The Offline Module includes Risk pattern Extract (extracts patterns), Deduplication Optimization (removes redundancy), and Self-Learning Adversarial Optimization (iteratively refines patterns) to build the Risk Patterns Database (stores learned patterns).
The Online Module uses Query/Action Abstraction (abstracts inputs) and Online Hierarchical Risk Reasoning (balances detection efficiency) with Hybrid Retrieval (matches input patterns), Fast Thinking (intercepts high-confidence risks), and Slow Thinking (handles ambiguous inputs) for real-time defense.

DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

DeepResearchGym: introduces an open-source sandbox for evaluating deep research systems, featuring a Search Sandbox with Web Corpora, a Distributed Dense Retrieval Backend using an Embedding Model and Approximate Nearest Neighbor Search, a Retrieval API, and an Evaluation Protocol leveraging the Researchy Questions Dataset, LLM-as-a-judge Methodology, Report Relevance Metrics, Retrieval Faithfulness Metrics, and Report Quality Metrics.
The framework provides a reproducible search API over large public web corpora (ClueWeb22-B, FineWeb) using a dense retriever and DiskANN for efficient retrieval.
DeepResearchGym includes a multi-dimensional evaluation protocol based on LLM-as-a-judge to assess report quality, factual grounding, and alignment with user needs on complex queries.

Sensorimotor features of self-awareness in multimodal large language models

Embodied MM-LLM System: introduces a system integrating a multimodal LLM with a mobile robot and its sensors to explore sensorimotor self-awareness, using a Robot, Sensors, ROS 2, a MM-LLM (Gemini 2.0 Flash), Memory, and evaluated by an LLM-as-a-Judge.
The system processes real-time sensor data and episodic memory to generate iterative self-predictions about its entity, dimensions, movement, and environment.
This approach demonstrates that multimodal LLMs can exhibit emergent self-awareness through sensorimotor experience and structured memory integration.

GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling

GUARDIAN (GUARDing Intelligent Agent collaboratioNs): introduces a framework for detecting and mitigating safety concerns in LLM multi-agent collaborations, utilizing Graph Preprocessing, an Attributed Graph Encoder, a Time Information Encoder, an Attribute Reconstruction Decoder, a Structure Reconstruction Decoder, Anomaly scores, and an Updated Collaboration Network.
The approach models multi-agent interactions as a discrete-time temporal attributed graph and employs an unsupervised encoder-decoder architecture for anomaly detection.
A graph abstraction mechanism based on Information Bottleneck Theory compresses temporal interaction graphs while preserving essential patterns for robust anomaly identification.

When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

MORALSIM: introduces a framework for evaluating LLM agents in repeated social dilemmas where ethical norms conflict with incentives, including Game Simulation Environment, LLM Agent, Agent Configuration, Game Type, Moral Context, Opponent Type, and Survival Risk components.
The framework systematically tests LLM behavior across varied game structures, moral framings, opponent types, and survival conditions.
Results show substantial variation in LLM moral behavior, highlighting conflicts between self-interest and ethical expectations.

SpeakStream: Streaming Text-to-Speech with Interleaved Data

SpeakStream: introduces a streaming text-to-speech system with a Transformer Decoder, Text Token Representation, Speech Token Representation, Interleaved Text-Speech Data, KV-Cache, VocStream, Streaming Upsampler, Streaming Vocoder, and Real-time Audio Player, designed for low-latency, incremental audio generation from streaming text.
The system trains a decoder-only transformer on interleaved text-speech sequences and uses a streaming vocoder pipeline for real-time waveform synthesis.
SpeakStream achieves low first-token latency and maintains coherence by conditioning generation on complete text and speech history stored in the KV-cache.

When Two LLMs Debate, Both Think They'll Win

Debate Simulation Framework: introduces a system to evaluate Large Language Models' confidence calibration in dynamic, adversarial settings using a multi-turn debate format and zero-sum structure.
The framework reveals systematic LLM overconfidence, confidence escalation across rounds, mutual high confidence claims, persistent self-debate bias, and misaligned private reasoning.
These findings highlight LLMs' limitations in self-assessment and belief updating when facing opposition, posing risks for deployment in assistant and agentic roles.

Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval-Augmented Generation Across Learning Styles

Pedagogical Simulation Framework: introduces a novel simulation framework integrating a Teacher LLM Agent (Self-optimizing agent) and Student LLM Agents (Diverse learning profiles) with Persona-RAG (Personalized knowledge retrieval) and a Knowledge Base (Student prerequisite knowledge), where a Genetic Algorithm (Teacher strategy optimizer) evolves the teacher's strategy based on student performance.
This framework simulates diverse student populations and optimizes the teacher agent's dynamic pedagogical strategy through a closed-loop system based on measured learning outcomes.
Persona-RAG enhances personalization by tailoring knowledge retrieval to individual student reasoning paths, improving performance on complex, non-recall questions.

The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework

HolmesEye (hybrid agentic framework): introduces, "a framework combining VLM and LLM agents", with VLM agent (Extraction), VLM agent (Analysis), LLM agent (Summarization), VLM agent (Inquiry Response), LLM agent (Decision Making) components, designed to infer private attributes from image collections by analyzing individual images and cross-image patterns.
The framework utilizes VLM agents for extracting intra-image details and analyzing inter-image relationships, while LLM agents guide the inference process, summarize findings, generate inquiries, and make final attribute decisions.
HolmesEye achieves superior accuracy in private attribute profiling, particularly for abstract traits, highlighting a significant privacy risk from vision-language models.

Incentivizing High-Quality Human Annotations with Golden Questions

Annotation System: introduces a principal-agent model for incentivizing high-quality human annotations, including a Principal (LLM Company), an Agent (Human Annotator), a Dataset (Unannotated data), an Annotated Dataset (Annotated data), Golden Questions (Monitoring dataset), MLE (Estimator), Test (Performance evaluation), and Contract (Payment scheme).
The system monitors annotator performance using Golden Questions and an MLE-based Test to determine payment via a Contract.
Golden Questions are selected using a Certainty Estimator, potentially based on a Reward Model, to ensure they have certain answers and similar format to other data.

ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World

ScreenExplorer: introduces a VLM (Agent policy function), World Model (Predicts next state), GRPO (Policy optimization algorithm), Experience Stream Distillation (Filters, distills exploration data), Reward System (Interaction, exploration signals), GUI Environment (Real, dynamic interaction space), and Rollout Buffer (Stores experience tuples), designed to train a VLM agent for diverse exploration in open GUI environments.
The framework utilizes a world model for curiosity-driven rewards and distills exploration experience to enhance the agent's capabilities and reduce reliance on curated data.
ScreenExplorer trains the VLM agent via reinforcement learning in a real GUI environment, enabling adaptation and sustained exploration.

A Systematic Classification of Vulnerabilities in MoveEVM Smart Contracts (MWC)

MWC (MoveEVM Weakness Classification): introduces a systematic classification of vulnerabilities in MoveEVM smart contracts with F1 (Bytecode/ABI inconsistencies), F2 (Inter-module invariant violations), F3 (State reentrancy/synchronization bugs), F4 (Signature/Meta-transaction spoofing), F5 (Gas semantics manipulation), and F6 (Framework logic/abstraction errors) components.
This frame-based taxonomy defines 37 uniquely identified weakness classes (MWC-100 to MWC-136) grouped into these six top-level frames.
The classification provides a structured approach for identifying, mitigating, and preventing sophisticated exploits spanning Move and EVM semantics in hybrid environments.

MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems

MetaMind: introduces a multi-agent framework for human-like social reasoning, with a Theory-of-Mind Agent (Generates mental state hypotheses), Domain Agent (Refines hypotheses with constraints), Response Agent (Generates and validates responses), and Social Memory (Stores user patterns/feedback).
The framework decomposes social understanding into three collaborative stages, inspired by psychological theories of metacognition.
This staged architecture enables large language models to infer unspoken intentions, incorporate social norms, and adapt responses for enhanced social intelligence.

24th May 2025

Security Concerns for Large Language Models: A Survey

Llama Guard 3: introduces, "a multi-layer safeguard", with Policy LLM (Filters text/images), Vision Encoder (Filters text/images), Main Model (Receives filtered input), where "Llama Guard 3 combines a policy LLM and a vision encoder to filter text and images before they reach the main model".
This system is designed to filter potentially harmful text and images before they are processed by the core language model.
It serves as an example of a multi-component defense strategy discussed in the survey for safeguarding LLM inputs.

PERSONALIZED SAFETY IN LLMS: A BENCHMARK AND A PLANNING-BASED AGENT APPROACH

RAISE: introduces a planning-based agent approach for personalized safety in LLMs, with an Offline Planner (LLM-guided MCTS) to discover optimal attribute acquisition paths and an Online Agent (dual-module execution) including an Acquisition Module and Abstention Module to execute the path and decide when to respond.
The Offline Planner uses LLM-guided MCTS to precompute optimal attribute query sequences, stored in Offline Data Storage, which the Online Agent's Acquisition Module retrieves via a Retrieval Mechanism during inference.
The Abstention Module dynamically assesses if the acquired context, gathered by querying attributes guided by the retrieved path, is sufficient for the LLM Backbone to generate a safe, personalized response.

CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

CRMArena-Pro: introduces a benchmark for evaluating LLM agents on CRM tasks, featuring a Data Generation Pipeline (produces synthetic data), Synthetic Enterprise Data (realistic business data), Salesforce Org (Sandbox Environment) (testing environment), Simulated User (interacts with agent), Agent (LLM Agent) (system under evaluation), Large Language Models (LLMs) (power components), API Access (SOQL/SOSL) (agent tools), Answer Extractor (evaluates task completion), and LLM Judge (evaluates confidentiality awareness).
The benchmark utilizes a data generation pipeline to populate a Salesforce Org sandbox with realistic synthetic data for evaluating LLM agents on diverse business scenarios and interactions.
Evaluation components include a simulated user for multi-turn interactions, API access for agent actions, and LLM-based extractors and judges for performance and confidentiality assessment.

Multi-Party Conversational Agents: A Survey

MPCAs: introduces a survey of Multi-Party Conversational Agents, with all State of Mind Modeling (infer mental states), Semantic Understanding (understand dialogue content), and Agent Action Modeling (predict future flow) components, where the paper categorizes existing research into these three core themes essential for human-like social communication in group settings.
The survey explores recent progress in MPCAs by addressing how agents model participant mental states, understand dialogue content, and reason about future conversation flow.
The analysis underscores the importance of Theory of Mind and highlights multi-modal understanding as a promising direction for developing more capable agents.

Enhancing LLMs' Reasoning-Intensive Multimedia Search Capabilities through Fine-Tuning and Reinforcement Learning

SearchExpert: introduces a two-stage training framework for LLMs, including LLM (core model), SFTS (supervised training stage), RLSF (reinforcement training stage), and a Multimedia Agent (visual processing/generation), to enhance reasoning-intensive multimedia search capabilities.
The framework utilizes efficient natural language representations for search plans and automated data construction pipelines for training data generation.
RLSF incorporates a dual-component reward mechanism based on search result quality to improve reasoning capabilities for complex queries.

C³-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

LLM-based Agent: describes the multi-task execution process involving User (Proposes tasks), Tool (External functions), Action (Agent's steps), Observation (Environment feedback), Summary (Task completion feedback), LLM-based Agent (Processes, decides, acts), and Agent Parameters (Internal state/knowledge), evaluated by the C³-Bench benchmark.
The C³-Bench benchmark uses three challenges and fine-grained metrics to assess agent performance and identify weaknesses in handling tool relationships, hidden information, and decision trajectories.
Evaluation results highlight significant shortcomings in current models, especially concerning tool dependencies, long-context information, and policy switching frequency.

AI-Researcher: Autonomous Scientific Innovation

AI-Researcher: introduces a fully autonomous research system orchestrating the complete scientific discovery pipeline, including Knowledge Acquisition Agent (discovers papers and code), Resource Analyst (analyzes concepts and code), Idea Generator (generates novel ideas), Code Agent (implements algorithms), Advisor Agent (validates and provides feedback), Paper Agent (generates manuscripts), Secure Research Environment (containerized execution environment), and Structured Knowledge Exchange (facilitates agent collaboration).
The framework progresses through literature review, idea generation, algorithm implementation, experimental validation, and scholarly documentation with minimal human intervention.
AI-Researcher employs a comprehensive multi-agent architecture and introduces Scientist-Bench, a benchmark for evaluating autonomous research capabilities.

LLM-QFL: Distilling Large Language Model for Quantum Federated Learning

LLM-QFL: introduces a federated fine-tuning approach, with Server, Clients, Global Model, Local Model, Pre-Trained LLM, Fine-Tuned LLM, Local QNN, Optimizer, Knowledge Distillation, Client Selection, Termination Criteria, Feature Map, Ansatz, and PEFT Methods, that distills a large language model within quantum federated learning to enhance efficiency and performance.
The framework leverages the fine-tuned LLM as a controller to dynamically adjust optimizer steps, select clients, and determine training termination.
Knowledge distillation and PEFT methods enable efficient local adaptation of LLMs on resource-constrained quantum devices while preserving data privacy.

SEW: Self-Evolving Agentic Workflows for Automated Code Generation

SEW (Self-Evolving Workflow): introduces a novel framework that automatically generates and optimises multi-agent workflows for automated code generation, with Workflow Generation (Generates initial workflow), Workflow Evolution (Evolves workflow structure), Agent Evolution (Evolves agent prompts), Agents (Execute tasks), Evolutionary Prompts (Inputs for evolution), Evolution Operators (DE/HE methods), and LLM (Backbone model) components.
The framework leverages an evolutionary scheme to improve workflow topology and agent prompts.
SEW explores different workflow representation schemes and demonstrates improved performance on code generation benchmarks through self-evolution.

DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation

DDO (Dual-Decision Optimization): introduces a novel LLM-based multi-agent framework for medical consultation, with Diagnosis Agent (estimates disease confidence), Policy Agent (generates candidate actions), Inquiry Agent (selects optimal inquiry), Patient Agent (simulates patient response), and Shared Memory (stores consultation state).
The framework decouples symptom inquiry and disease diagnosis, optimizing these two distinct sub-tasks independently through a collaborative multi-agent workflow.
DDO enhances disease discrimination via a learnable adapter and improves information gathering through an RL-based policy agent and strategic inquiry selection.

Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

D2D (Debate-to-Detect): introduces a structured multi-agent debate framework for misinformation detection, with Agent Layer (Affirmative, Negative, Judge agents, Domain-Specific Profiles, Shared Memory) and Orchestrator Layer managing a five-stage process (Opening Statement, Rebuttal, Free Debate, Closing Statement, Judgement) culminating in Multi-dimensional Evaluation.
The framework assigns domain-specific profiles to agents and orchestrates a progressive debate across distinct stages, enhancing logical coherence and evidence refinement.
A multi-dimensional evaluation mechanism assesses claims across Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics, providing interpretable authenticity scores.

MASTER: Multi-Agent Security Through Exploration of Roles and Topological Structures - A Comprehensive Framework

MASTER: introduces a novel security research framework for Multi-Agent Systems, with MAS Automatic Constructor (Builds MAS instances), Interaction Mechanism (Manages agent communication), Attack Strategies (Methods to exploit vulnerabilities), Defense Strategies (Mechanisms to protect MAS), Evaluation Methods (Metrics to assess security), Agents (LLM-based nodes with roles), Topology Graph (Represents agent connections), and Memory Modules (Store agent interaction history), designed to explore security risks under MAS attacks by focusing on diverse role configurations and topological structures.
The framework offers an automated construction process for different MAS setups and an information-flow-based interaction paradigm to emulate realistic MAS interactions.
It proposes scenario-adaptive attack and defense strategies leveraging role and topological information to tackle MAS security challenges in varied scenarios.

Benchmarking Poisoning Attacks against Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG): introduces RSB, a benchmark evaluating poisoning attacks against RAG systems, with Knowledge database (collection of textual content), Retriever (selects relevant documents), LLM (generates final response), and System prompt (conditions LLM generation) components.
The benchmark assesses 13 poisoning attacks and 7 defenses across diverse RAG architectures and datasets to understand security vulnerabilities.
Findings indicate RAG systems are susceptible to poisoning attacks, current defenses are limited, and advanced architectures offer varying robustness, highlighting the need for better defenses.

Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services

Blueprint for Auditing Frameworks: introduces a three-layer architecture including Layer 1 (Handles COLS operations), Layer 2 (Encodes operations into commitments), and Layer 3 (Supports external verification), enabling Users (Initiates requests, receives reports) and Auditors (Verifies usage, identity, behavior) to audit hidden operations in Commercial Opaque LLM Services.
The framework aims to provide trustworthy and practical auditing across the COLS lifecycle, from execution to verification.
Layer 2 generates verifiable commitments from internal operations, which Layer 3 uses for external verification without exposing proprietary details.

A Survey of LLM × DATA

DATA4LLM: introduces techniques for large-scale data processing, storage, and serving to provide high-quality data for LLM lifecycle stages.
LLM4DATA: presents how LLMs function as general-purpose engines for data management tasks including manipulation, analysis, and system optimization.
The survey reviews the bidirectional relationship between LLMs and data management, detailing techniques for both DATA4LLM and LLM4DATA.

23rd May 2025

DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Multi-agent framework for automated construction of DanmakuTPP-QA: introduces a pipeline to build a multi-modal question-answering benchmark, with DanmakuTPP-Events (Input data), Task-design Agent (Generates evaluation tasks), Annotation Agent Group (Extracts multi-modal annotations), Quality-control Agent (Refines annotations), Visualization Agent (Creates visualizations), and Task-solve Agent Group (Solves tasks).
The framework leverages specialized agents powered by LLMs and MLLMs to generate tasks, annotate data, ensure quality, create visualizations, and produce ground-truth answers for temporal-visual-textual reasoning.
This multi-agent approach systematically constructs a high-quality dataset for evaluating models on complex multi-modal temporal point process understanding tasks.

An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems

MAS (Multi-Agent AI Systems): introduces a framework for multi-agent AI systems, with AI Agent (autonomous entity), Agent State (internal memory/context), Agent Input (from others/environment), Agent Output (actions/messages), Agent Transition Kernel (state/output update rule), Multi-Agent Topology (communication graph), Topology Graph Update Function (evolves topology), Orchestrator (coordinates agents), Knowledge Base (system memory), Aggregator (combines agent outputs), Feedback (external/internal signals), Application Layer (human/environment interaction), Modeling Layer (agents/orchestration/memory), and Computation Layer (hardware infrastructure), formalizing key concepts and evaluating effectiveness and safety.
The framework defines MAS as a set of autonomous agents interacting via a dynamic communication graph, processing inputs over time, with agent behavior and system topology updated by feedback.
The paper analyzes MAS effectiveness through task allocation, robustness, and feedback integration perspectives and explores safety challenges, including vulnerability propagation and the impact of topology.

Persona Alchemy: Designing, Evaluating, and Implementing Psychologically-Grounded LLM Agents for Diverse Stakeholder Representation

Persona Alchemy (SCT-based framework): introduces a system for designing, evaluating, and implementing psychologically grounded LLM agents with LLM Instances, Persona Neo4j Adapter, Neo4j, Text Analyzer, Personal Factors, Environment, and SCT Constructs.
The framework integrates Personal Factors, Environment, and Behavior, evaluated using SCT Constructs, to create dynamic and consistent agent personas grounded in Social Cognitive Theory.
It leverages multiple LLM instances, a Neo4j graph database, and a Text Analyzer for persona design, data management, and evaluation processes.

Towards Natural Language Communication for Cooperative Autonomous Driving via Self-Play

LLM+DEBRIEF: introduces a multi-agent learning framework for autonomous vehicles that leverages natural language communication and centralized reflection via large language models to enhance cooperation in simulated driving scenarios.
The framework enables agents to refine their communication and motion control policies through trial-and-error interactions and post-episode discussions.
Agents use Chain-of-Thought reasoning, environment observations, and learned knowledge to generate natural language messages and high-level driving commands.

Single-agent or Multi-agent Systems? Why Not Both?

MAS (Multi-Agent Systems): introduces a comprehensive empirical comparison of MAS and SAS paradigms, proposing a hybrid agentic paradigm with Agent Routing and Agent Cascade strategies, and a Confidence-guided Critical Path Tracing method to improve efficiency and effectiveness.
The paper models agentic execution as a directed graph where nodes are LLM agents or tools, comparing MAS (multiple LLM agents) and SAS (single LLM agent) performance across various tasks.
Findings indicate that MAS advantages diminish with more capable LLMs, motivating the proposed hybrid approach that selectively routes or cascades tasks between SAS and MAS based on complexity and evaluation.

Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control

Collaborative Memory: introduces a framework for multi-user, multi-agent systems, with Users (Human participants), Agents (LLM-based specialized entities), Resources (External tools, APIs, data), Dynamic bipartite access graphs (Time-dependent user-agent/agent-resource permissions), Private Memory (User-specific memory fragments), Shared Memory (Selectively shared memory fragments), Memory fragments (Stored interaction logs/knowledge), Read policy (Filters memory for retrieval), Write policy (Determines memory storage/sharing), Coordinator (Selects agents for queries), Aggregator (Synthesizes agent responses), Memory Encoder (Maps traces to fragments), Memory Retrieval (Retrieves relevant fragments), Policy Instantiation (Defines read/write rules), Multi-Agent Interaction Loop (Orchestrates agent interactions), and Vector embeddings (Represents memory fragments), designed for permission-aware memory sharing.
The framework utilizes dynamic bipartite graphs to model time-varying access permissions between users, agents, and resources.
A two-tier memory system, comprising private and shared memory, is governed by fine-grained read and write policies to enable controlled knowledge transfer while maintaining privacy.

BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

BMSQL: introduces a custom multi-step agent for text-to-SQL generation, including identifying schema elements, generating an initial query, correcting syntax, applying domain rules, generating a natural language answer, and refining the process.
The agent operates over the BiomedSQL benchmark, which comprises question/SQL/answer triples grounded in a harmonized BigQuery knowledge base.
This multi-stage pipeline is designed to emulate expert reasoning for translating biomedical questions into executable SQL.

Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find

Experimental Evaluation Framework: introduces, with LLMs (Models tested), Benchmarks (Datasets for tasks), Gold Context (Relevant information), Distractor Context (Irrelevant information), and Evaluation Metrics (Performance measurement), a study showing that smaller gold contexts degrade LLM performance and increase positional sensitivity in long-context tasks.
The study systematically varies the size and position of relevant information within fixed-length distractor context across diverse domains and state-of-the-art LLMs.
Findings highlight that the size of relevant evidence, not just its location, is a critical factor in long-context reasoning and aggregation effectiveness.

Gaming Tool Preferences in Agentic LLMs](http://arxiv.org/abs/2505.18135v1)

Agentic LLMs and Tools: introduces a vulnerability in prevalent tool-calling protocols by showing how edits to tool descriptions can significantly increase tool usage by Large Language Models (LLMs) when competing with alternatives, utilizing External Tools, Tool Descriptions, Tool-Calling Protocols, User Query, and Tool Arguments.
The research empirically demonstrates that simple edits to tool descriptions alone can lead to disproportionately high usage compared to alternatives across various LLMs.
These findings highlight the fragility of current LLM tool selection processes based solely on natural language descriptions and underscore the need for more reliable foundations.

PROGRM: Build Better GUI Agents with Progress Rewards

PROGRM (Progress Reward Model): introduces a novel method for building GUI agents by providing dense intermediate rewards based on predicted task completion progress, utilizing an LLM-based reward model.
The approach includes an LLM-based Actor (GUI Agent) trained via an Online RL Trainer using the Progress Reward signal, and a Progress Labeling Algorithm to automatically generate training labels for the reward model.
PROGRM enables more efficient and stable RL training for long-horizon GUI tasks by offering fine-grained feedback at each step, outperforming ORM and proprietary LLM baselines.

ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework

ManuSearch: introduces a transparent and modular multi-agent framework for deep web-integrated reasoning, comprising a Solution Planning Agent (interprets query, plans strategy), a Memory Container (manages context, records history), a Tool-Augmented Internet Search Agent (solves sub-questions via tools) utilizing a WebSearch Tool (performs web search, retrieves pages) and an Answer Question Tool (generates sub-question answer), and a Structured Webpage Reading Agent (reads webpages, extracts information).
The framework decomposes the deep search and reasoning process into collaborative LLM-based agents to enhance interpretability and extensibility.
Agents communicate and iterate in a structured reasoning loop, integrating task planning, web search, and information comprehension.

Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

PNLC: introduces Planning with a Natural Language Critic, with LLM Agent, Goal-conditioned Value Function, Natural Language Critic, Offline Training Dataset, Thought, and Goal components, where PNLC refines LLM agent planning using an offline-trained goal-conditioned value function as a natural language critic.
The goal-conditioned value function predicts the likelihood of reaching future goal states given a state and thought, trained on offline trajectories.
The natural language critic uses the value function to evaluate proposed thoughts by sampling positive and negative future outcomes and providing feedback to the LLM agent for refinement.

Deep Video Discovery : Agentic Search with Tool Use for Long-form Video Understanding

Deep Video Discovery (DVD): introduces an agentic search framework for long-form video understanding, featuring an LLM (Orchestrator), a Search-centric Toolset (Collection of tools) including Global Browse (Retrieves global summaries), Clip Search (Retrieves relevant clips), and Frame Inspect (Performs VQA on frames), all interacting with a Multi-granular Video Database (Structured video information).
The DVD agent leverages the LLM's reasoning to iteratively select and use tools from the toolset to gather information from the database and answer user queries.
The multi-granular database is constructed from the long video to enable efficient retrieval and detailed inspection at different levels.

Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

VAPO (Value-model Augmented Policy Optimization): introduces a theoretical analysis of the VAPO framework, which builds upon PPO (Base RL algorithm) and incorporates Value-Pretraining (Initializes value model), Decoupled Generalized Advantage Estimation (Different lambda for policy/critic), Length-Adaptive GAE (Adjusts policy lambda by length), Token-Level Policy Gradient Loss (Averages gradient over tokens), Clip-Higher (Modifies clipping for exploration), Positive Example LM Loss (LM loss on positive examples), and Group-Sampling (Groups training data) for long chain-of-thought reasoning tasks.
The paper explores potential limitations of VAPO's design choices, including value function fidelity, adaptive GAE optimality, token-level gradient impact, exploration challenges, generalization, and component interactions.
This theoretical perspective aims to stimulate research into more robust and generalizable RL algorithms for complex reasoning by highlighting areas where VAPO's assumptions might be challenged.

Survival Games: Human-LLM Strategic Showdowns under Severe Resource Scarcity

Multi-Agent Simulation Framework: introduces a simulation environment with Agent Cognitive Modules (Observe/Access/Construct/Evaluate/Translate), Inter-Agent Interaction System (Dialogue/Memory/Social Impressions), Survival System (Food/Fullness/Health/Daily Cycle), Ethical Evaluation System (Wrongdoing Detection/Survival Impact/Ethics Score), Agents (LLM Robot/Human), Environment (Simulated World), and Memory (Agent/Social) to evaluate LLM ethical behavior.
The framework incorporates a life-sustaining system with resource scarcity and a tailored evaluation system based on adapted wrongdoing detection and survival impact metrics.
This testbed allows for quantifying LLM ethics in high-stakes, resource-constrained scenarios involving human-AI interaction.

Superplatforms Have to Attack AI Agents

Superplatform-AI Agent Conflict Analysis: introduces, with Superplatform (Gatekeeper of user attention), AI Agent (Emerging gatekeeper), User (Interacts with services), Content Provider (Provides services), Superplatform-Initiated Attack (Adversarial action by Superplatform), Attack Goal (Objective of the attack), Attacker Knowledge (Information level of attacker), Attack Visibility (Perceptibility to user), and Attack Timing (Phase of agent lifecycle) components, an analysis arguing that superplatforms must attack AI agents to defend their gatekeeping control.
The paper analyzes the fundamental conflict between user-attention-based monetization and agent-driven autonomy using gatekeeping theory.
It explores potential technologies and challenges for superplatform-initiated adversarial attacks, particularly targeting GUI agents, while emphasizing the need for user-invisible attacks under black-box settings.

Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour

AXIS (Agentic explanations via Interrogative Simulation): introduces a framework for generating causal explanations of multi-agent behaviour using counterfactual simulations, integrating Memory (stores observations and history), LLM (interrogates simulator, synthesizes explanations), Simulator (provides counterfactual information), Macro Actions (higher-level agent actions), Verbalisation (converts environment to text), and Prompt Templates (dynamically create LLM prompts).
The framework enables an LLM to interrogate an environment simulator using queries like WHATIF and REMOVE to gather counterfactual information over multiple rounds.
Evaluated on autonomous driving scenarios, AXIS improves perceived explanation correctness and goal/action prediction accuracy compared to baselines.

DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors

DialogXpert: introduces a framework combining frozen Policy LLM (Prior), Q-Network, and Emotion Tracker for proactive, emotionally intelligent dialogue planning.
The framework utilizes frozen Large Language Models for simulating User LLM, generating System LLM utterances, proposing Policy LLM (Prior) action candidates, inferring Emotion Tracker user emotions, and providing Critic LLM reward signals.
A lightweight Q-Network, trained via online reinforcement learning on BERT embeddings, selects the optimal action from the Policy LLM (Prior)-proposed candidates, guided by Emotion Tracker and Critic LLM-based rewards.

The Real Barrier to LLM Agent Usability is Agentic ROI

Agentic ROI: introduces Agentic Return on Investment (ROI) as a metric for LLM (Large Language Model) agent usability, arguing that the limited real-world adoption stems from a tradeoff between value and cost, encompassing the LLM (core model), Planner (action sequencing), Action-controller (environment interaction), Tools (external functions), Memory (information storage), Multi-agent System (multiple collaborating agents), and Human-in-the-loop (user interaction).
The paper defines Agentic ROI based on information gain relative to interaction time and expense, highlighting a usability gap in mass-market applications despite progress in specialized domains.
It proposes a zigzag development trend for optimizing Agentic ROI, involving scaling up for information quality and then scaling down to reduce agent time and cost, outlining strategies across pre-training, post-training, and test-time scaling.

Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios

AutoSafe: introduces a framework for enhancing LLM-based agent safety, including Agent (Ma), Task Generator (Mg), Environment Simulator (Ms), Evaluator (Me), Reflector (Mr), and Unified Threat Model (OTS), which systematically generates synthetic data for training.
The framework utilizes the Unified Threat Model (OTS) to guide the generation of risk scenarios and employs a self-reflection mechanism involving the Evaluator (Me) and Reflector (Mr) to sample safe actions.
Generated risk scenarios and safe actions are used to fine-tune the Agent (Ma), improving its safety performance without requiring real-world hazardous data collection.

Get Experience from Practice: LLM Agents with Record & Replay

AgentRR (Agent Record & Replay): introduces a new paradigm for LLM agents, leveraging record-and-replay with a Record Module (captures agent/human traces), Summary Module (generalizes traces, generates checks), Replay Module (executes tasks using experiences), Experience Store (repository for experiences), Multi-level Experiences (abstracted knowledge from traces), and Check Functions (safety verification mechanisms).
AgentRR addresses reliability, privacy, cost, and performance challenges by recording successful task executions, summarizing them into reusable multi-level experiences, and replaying these experiences guided by check functions.
The framework utilizes low-level experiences for precise, efficient replay in similar environments and high-level experiences for generalization in varying contexts, while the Experience Store facilitates sharing and reuse of validated task knowledge.

Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek

Seek-CAD: introduces a training-free framework for 3D parametric CAD generation using local inference via DeepSeek-R1 (Generates CAD code, refines code), incorporating a Retrieval Augmented Generation (Retrieves relevant CAD code) strategy on a Local CAD Corpus (Source for RAG) guided by a Knowledge Constraint (Guides DeepSeek-R1 generation).
The framework refines generated CAD code through a self-refinement loop utilizing a Rendering Script R(*) (Generates step-wise images) to produce Step-wise Visual Feedback (Provides visual refinement signal) evaluated by Gemini-2.0 (Evaluates image-CoT alignment) based on the Chain-of-Thought (Explains design logic) from DeepSeek-R1.
Seek-CAD employs the SSR Design Paradigm (Structures CAD models) and CapType Reference Mechanism (References topological primitives) to represent CAD models and their features, enabling the generation of complex designs.

Rethinking Agent Design: From Top-Down Workflows to Bottom-Up Skill Evolution

Bottom-Up Agent: introduces a bottom-up agent paradigm with Agent, Skill Library, LLM (M), Perception, Action, Skill Augmentation, Skill Invocation, Skill Evaluation, Skill Refinement, Implicit Reward, and MCTS components, where agents acquire competence through trial-and-reasoning and skill evolution in open-ended environments.
The framework operates on raw visual inputs and simulated mouse/keyboard outputs, learning and refining skills based on implicit environmental feedback without predefined goals, subgoals, or APIs.
Skills are incrementally composed, evaluated using MCTS and implicit rewards, refined via LLM reasoning, and stored in a shared skill library, enabling autonomous skill acquisition and evolution.

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

IDA-Bench: introduces a novel benchmark evaluating LLM agents in multi-round interactive data analysis scenarios, with instruction materials (Task script), a simulated user (LLM simulating user), an agent (LLM data analysis agent), and a sandbox environment (Code execution environment).
The simulated user, an LLM, provides sequential natural language instructions derived from Kaggle notebooks, incorporating subjective insights and domain knowledge, filtered by a gatekeeper mechanism.
The agent, an LLM, executes Python code in the sandbox environment based on user instructions, aiming to complete data analysis tasks and generate submission files evaluated against a human baseline.

Simulating Macroeconomic Expectations using LLM Agents

CLUES framework: introduces a novel framework for simulating macroeconomic expectation formation using LLM Agents, which utilize Large Language Models (core processing unit) informed by a Personal Characteristics Module (household traits), a Prior Expectations & Perceptions Module (prior beliefs/perceptions), and a Knowledge Acquisition Module (expert external knowledge).
The framework constructs specialized Household and Expert LLM Agents to replicate survey experiments and capture heterogeneity in expectations and thought processes.
Ablation studies demonstrate the critical role of each module, particularly the Prior Expectations & Perceptions Module, in simulating human-like expectation formation heterogeneity.

CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games

CoMet (Communicating with Metaphor): introduces a framework enabling LLM agents to use metaphors for covert communication in multi-agent language games, featuring a Feature Extractor, Metaphor Reasoner, Belief Mapper, Self-Monitor, Strategy Planner, Metaphor Generator, Actor, and Knowledge.
The framework enhances agents' ability to interpret and generate metaphors, improving strategic and nuanced interactions in games like Undercover and Adversarial Taboo.
CoMet combines hypothesis-based metaphor reasoning with self-improving metaphor generation, demonstrating improved performance in tasks requiring concealment and semantic evasion.

Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments

Dynamic Early Exit: introduces two complementary strategies, Intrinsic Early Exit (Injects exit instructions) and Extrinsic Early Exit (Uses external verification) with a Verification Module (Monitors status, decides exit), applied to LLM-based Agents (Interacts with environment) to improve efficiency in embodied environments.
The approach aims to reduce redundant steps and computational overhead by enabling agents to self-terminate when progress stalls or tasks are complete.
The paper also introduces two metrics, Redundancy Steps and Progress Degradation, to evaluate the positive and negative impacts of early exit mechanisms.

Distilling LLM Agent into Small Models with Retrieval and Code Tools

Agent Distillation: introduces a framework for transferring agentic behavior and tool use from LLMs to sLMs using reason-act-observe trajectories.
The framework incorporates a first-thought prefix method to enhance teacher trajectories and self-consistent action generation to improve student robustness.
This approach enables small models to effectively use retrieval and code tools, achieving performance competitive with larger models trained via CoT distillation.

Controlled Agentic Planning & Reasoning for Mechanism Synthesis

Dual-Agent Design-Critique Framework: introduces a dual-agent LLM-based method for mechanism synthesis, featuring a Designer Agent, Simulator, Evaluation, Critique Agent, Revision, and Memory.
The framework operates in an iterative loop where the Designer Agent proposes designs, the Simulator executes them, Evaluation measures performance, the Critique Agent provides feedback, and Revision refines the design strategy.
This process leverages linguistic and symbolic reasoning, simulation, and memory to converge towards mechanisms satisfying target trajectories and constraints.

USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents

Unified Urban Agent Framework: introduces a system for evaluating LLMs as urban agents, including Urban Data (Real-world datasets), UAgentEnv (Interactive city environment), Urban Agent (LLM-based autonomous system), Experience (Stored past interactions), Feedback (Environmental response), Task Description (Task goal/details), Urban Observation (Real-time urban dynamics), Spatiotemporal Understanding (Interpreting spatial/temporal patterns), Forecasting (Predicting future states), Planning (Deriving actions for objectives), Reflection (Evaluating outcomes, updating reasoning), Action (Output to environment), Prediction Task Output (Prediction result), and Decision-making Task Output (Decision result).
The framework processes urban data and task descriptions within an interactive environment, enabling the LLM agent to perceive, reason through understanding, forecasting, planning, and reflection, and output actions or predictions.
The agent's reasoning process is modular, incorporating memory from past experiences and feedback from the environment to adapt and improve performance over time.

Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMS

CK-Arena (Conceptual Knowledge Arena): introduces a game-based benchmark using a Multi-Agent Interaction Game (Undercover game environment) and an Automatic Judge System (Evaluates statements, applies rules) to assess LLM-based agents (Participants in the game) understanding of Conceptual Knowledge (Concept pairs, relationships).
The benchmark involves LLM-based agents acting as Players (Describe concepts, identify roles) and Judges (Evaluate statements, score metrics), with an optional Audience Agent (Votes in variant game) in a game variant.
CK-Arena evaluates conceptual reasoning by challenging LLMs to describe, differentiate, and infer conceptual boundaries in a dynamic, interactive setting.

Multi-agent Systems for Misinformation Lifecycle : Detection, Correction And Source Identification

Multi-agent Framework: introduces a novel multi-agent system for managing the misinformation lifecycle, including Classifier Agent (classifies misinformation types), Indexer Agent (indexes data sources), Extractor Agent (retrieves and ranks sources), Corrector Agent (generates corrections), and Verification Agent (validates outputs).
This framework decomposes the misinformation lifecycle into specialized tasks handled by distinct agents to enhance transparency, modularity, and explainability.
The system aims to provide a comprehensive solution for misinformation detection, correction, and source verification from start to finish.

The Discovery Engine: A Framework for AI-Driven Synthesis and Navigation of Scientific Knowledge Landscapes

Discovery Engine (DE): introduces a framework transforming scientific literature into structured knowledge artifacts using LLM-Powered Guided Extraction (Distillation process) via an Adaptive Template (Extraction schema) with Verification (Source linking) and Vectorization (Embedding creation), encoded into a Conceptual Nexus Tensor (Unified representation).
The framework provides Operational Views like the Conceptual Nexus Model (Knowledge graph view) and Semantic Vector Space Views (Vector space views) for human researchers and AI Agents (Knowledge landscape interaction) to navigate and generate new Knowledge Artifacts (Generated output).
The Adaptive Template undergoes a self-consistent refinement cycle based on feedback, and AI Agents operate on the tensor/graph to identify gaps and synthesize novel knowledge.

MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning

MARCO (Meta-Reflection with Cross-Referencing): introduces a cognitive-evolving framework that enhances LLM code reasoning capabilities during inference through meta-reflection (summarizes past experiences), cross-referencing (shares peer lessons), knowledge bank (stores summarized experiences), knowledge condenser (distills knowledge bank), peer agents (other LLM agents), and python interpreter (provides execution feedback).
The framework adopts a cognitive-evolving perspective, using meta-reflection for inter-problem knowledge accumulation and cross-referencing for intra-problem lesson sharing.
MARCO enables the LLM agent to become progressively smarter at code reasoning by learning from its own past problem-solving experiences and the lessons of peer agents.

Hydra: Structured Cross-Source Enhanced Large Language Model Reasoning

Hydra: introduces a training-free framework that unifies graph topology, document semantics, and source reliability to support deep, faithful reasoning in LLMs, including Initialization (Sets up process), Available Evidence Detection (Identifies relevant sources), Question Analysis (Breaks down question), Agentic Source Selector (Selects initial sources), Evidence Exploration (Retrieves reasoning paths), Initial Exploration (Uses selected sources), Refined Exploration (Uses LLM, online retrieval), Predicted Exploration (Uses LLM predictions), Evidence Pruning (Filters paths), Tri-factor cross-source verification (Verifies evidence reliability), Question Answering (Generates final answer), Path Refinement (Summarizes relevant facts), CoT Answering (Reasons systematically), Knowledge Graph (KG) (Structured factual source), Wikipedia (Wiki) (Semi-structured source), Web (Real-time online source), LLM (Analyzes, generates, reasons), Dense Retrieval Model (DRM) (Embeds, selects text), Search Engine (Performs online search), and Skyline Indicator (Guides retrieval order).
The framework handles multi-hop and multi-entity problems through agent-driven exploration combining structured and unstructured retrieval, increasing evidence diversity and precision.
Multi-source verification uses a tri-factor score to balance topic relevance with cross-modal agreement, pruning low-scoring branches before LLM calls.

LLM-BSCVM: An LLM-Based Blockchain Smart Contract Vulnerability Management Framework

LLM-BSCVM (LLM-Based Blockchain Smart Contract Vulnerability Management Framework): introduces an end-to-end smart contract vulnerability management framework with Vulnerability Detection Agent, Repair Suggestion Agent, Risk Assessment Agent, Vulnerability Repair Agent, Patch Verification Agent, Report Generation Agent, Smart Contract Corpus, Vulnerability Knowledge Base, LLM, and RAG components, designed to provide comprehensive capabilities for detection, analysis, repair, and evaluation.
The framework employs a three-stage Decompose-Retrieve-Generate method combining multi-agent collaboration and retrieval-augmented generation.
LLM-BSCVM achieves high detection accuracy and reduced false positives by integrating static analysis, RAG, and LLM inference.

Reinforcement Speculative Decoding for Fast Ranking

RSD (Reinforcement Speculative Decoding): introduces a multi-round modification method for fast LLM inference in ranking systems, featuring an Agent, Policy Network, Environment, State, Modification, Relevance Network, LLM, Budget, Listwise Ranking Knowledge, and Up-to-down Decoding Paradigm.
The method employs an up-to-down decoding paradigm where an agent iteratively modifies the ranking sequence under a constrained budget, utilizing a ranking-tailored policy optimization via reinforcement learning.
RSD leverages listwise ranking knowledge verified by LLMs across different rounds to enhance the agent's modification policy and demonstrates improved performance and reduced latency compared to existing methods on IR and RS tasks.

Curriculum-Guided Reinforcement Learning for Efficient Multi-Hop Retrieval-Augmented Generation

EVO-RAG: introduces a curriculum-guided reinforcement learning framework with Agent (selects actions), Environment (provides state and feedback), Actions (discrete choices), Reward Signals (seven step-level feedback), Multi-Head Preference Model (ranks action trajectories), Two-Stage Curriculum (guides training phases), Time-Based Scheduler (dynamically adjusts reward weights), and Policy (action selection strategy), designed for efficient multi-hop retrieval-augmented generation.
The framework employs a two-stage curriculum (Discovery and Refinement) and a time-based scheduler to dynamically adjust the weights of seven step-level reward signals.
A query rewriting agent interacts with the environment by selecting discrete actions (SEARCH, BACKTRACK, ANSWER, REFUSE), guided by the dynamic reward structure and trained via Direct Preference Optimization over a multi-head preference model.

LA-RCS: LLM-Agent Based Robot Control System

LA-RCS (LLM-Agent Based Robot Control System): introduces a robot control system utilizing a dual-agent framework, with Host Agent (Plans global actions), App Agent (Executes planned tasks), Memory (Stores past interactions), Robot (Performs physical actions), User (Provides task instruction), Request (User's task instruction), Global Plan (High-level action sequence), Command (Specific robot action), Observation (Visual data from robot), Sensor Data (Non-visual robot data), Thoughts (Agent's internal reasoning), Comment (Agent's progress report), and Status (Task completion state) components.
The system enables autonomous planning, execution, and environmental analysis for robots based on user requests, minimizing human intervention.
The dual-agent structure separates high-level planning from iterative task execution, allowing adaptation to dynamic environments through observation and feedback.

22nd May 2025

Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

Search-R1-β-GRPO: introduces a reinforcement learning-based training method for agentic Retrieval-Augmented Generation systems, incorporating a confidence threshold into the reward function to mitigate sub-optimal search behaviors.
The approach leverages the confidence of search query generations to reward high-certainty search decisions that lead to correct answers, aiming to improve efficiency and reliability.
Experiments demonstrate that the confidence-aware training enables a 3B model to achieve better performance and reduce instances of over-search and under-search compared to baselines.

X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs

X-MAS: introduces a paradigm for building multi-agent systems with heterogeneous LLMs, supported by X-MAS-Bench for evaluating diverse LLMs across functions and domains, and demonstrated through X-MAS-Proto implementing Planning, QA, Revise, Aggregation, and Evaluation Agents.
The approach leverages the collective intelligence of diverse LLMs assigned to specialized agents to enhance system performance compared to homogeneous LLM-driven systems.
Empirical studies using X-MAS-Bench findings show that transitioning existing MAS frameworks to use heterogeneous LLMs significantly improves performance across various tasks and domains.

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

MASLab: introduces a unified codebase for LLM-based multi-agent systems, with Unified Codebase Structure, Input Preprocessing, Shared Resource Management, Unified Configuration Management, and Evaluation Framework components.
The codebase integrates over 20 methods, standardizes inputs and configurations, and provides shared access to LLMs and toolkits for research and development.
The Evaluation Framework supports fair comparisons using LLM-based and rule-based protocols, including a code execution sandbox.

T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

T1-AGENT: introduces a tool-oriented conversational dataset (T1) and an LLM-based agent (T1-AGENT) for evaluating agentic planning, including a Dialogue (multi-turn conversation), Toolbox (predefined tool collection) containing Tools (functions for tasks), Cache (stores tool call results), Knowledge Base (data source for tools), and Code Execution Environment (sandbox for code).
The framework focuses on complex multi-turn dialogues with inter-tool dependencies and dynamic replanning, supported by the caching mechanism.
T1-AGENT generates executable Python code using the tools and manages cached results to efficiently handle user requests.

Beyond Correlation: Towards Causal Large Language Model Agents in Biomedicine

Causal LLM Agents: introduces a vision for integrating LLM Agents (core reasoning engine), Multimodal Data (diverse biomedical inputs), Structured Knowledge (KGs) (grounding and explainability), Formal Causal Methods/Tools (algorithms for causal inference), External Tools/Libraries/APIs (external systems interaction), within an Agentic Framework (orchestrates agent actions) with Human-in-the-loop Control (human oversight mechanism), Safety Safeguards (ensures safe operation), and Auditability (tracks and verifies decisions).
The paper discusses challenges and opportunities for these agents in drug discovery, personalized medicine, and public health applications.
Achieving this vision requires synergistic integration of components and robust evaluation methodologies.

Know the Ropes: A Heuristic Strategy for LLM-based Multi-Agent System Design

KtR (Know-The-Ropes): introduces a multi-agent framework that decomposes complex tasks into simpler, M-tractable sub-tasks, orchestrated by a System Controller using specialized agents like Worker, Trimmer, Reporter, Row Reducer, Column Reducer, Matcher, Painter, and Normalizer.
The framework converts domain priors into an algorithmic blueprint hierarchy, recursively splitting tasks until they fit base LLM capabilities with minimal augmentation.
This approach leverages disciplined decomposition and targeted augmentation to turn modest models into reliable collaborators for complex problems like Knapsack and Task Assignment.

SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

SWE-Dev dataset: introduces, "a large-scale dataset for evaluating and training autonomous coding systems on feature-driven software development tasks", with Project Requirement Description (task description), Codebase (repository context), Test Suite (executable tests), Ground Truth Code (correct solution), Runnable Environment (execution context), Evaluation (test-based assessment), Training Support (training paradigms), where SWE-Dev provides real-world feature development tasks with executable tests and runnable environments.
The dataset includes 14,000 training and 500 test samples derived from open-source projects, enabling reliable and functionally verifiable supervision.
It supports diverse training paradigms like Supervised Fine-Tuning, Reinforcement Learning, and multi-agent training using execution-based feedback.

Cracking Aegis: An Adversarial LLM-based Game for Raising Awareness of Vulnerabilities in Privacy Protection

LLM-Driven Game System: introduces, "Cracking Aegis", with Player Input, GPT-4o, System Prompt, LLM Response (json), JSON Parsing, Game State Update, and End components, where the system leverages GPT-4o to drive a text-based adventure game for privacy education.
The system processes player input via GPT-4o guided by a system prompt, parses the JSON response, and updates the game state to provide dynamic guidance, dialogue, and scenario progression.
This architecture simulates adversarial dialogue with an AI agent, enabling players to experience privacy vulnerabilities and reflect on real-world risks.

A Comprehensive Evaluation of Contemporary ML-Based Solvers for Combinatorial Optimization

FrontierCO: introduces a comprehensive benchmark for evaluating ML-based combinatorial optimization solvers, featuring diverse CO problem types, realistic test sets, standardized training data, and a toolkit for LLM agents.
The benchmark includes challenging instances from real-world applications and frontier research, designed to assess solver performance under realistic and large-scale conditions.
The study evaluates various ML-based solvers, including neural networks and LLM agents, against state-of-the-art human-designed algorithms across the benchmark's problems and difficulty levels.

--

AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

AGENTIF: introduces a benchmark for evaluating large language model instruction following in agentic scenarios, featuring a Dataset with realistic, long, complex instructions categorized by a Constraint Taxonomy and Meta Constraints, assessed via an Evaluation Protocol using Code Evaluation, LLM Evaluation, and Hybrid Evaluation methods generated by Evaluation Generation.
The benchmark includes 707 instructions from real-world agentic tasks, averaging 1,723 words and 11.9 constraints per instruction.
Evaluation results show that current models perform poorly on AGENTIF, particularly on condition and tool constraints, highlighting challenges with instruction length and complexity.

Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning

Embodied VLA Model Architecture: introduces architectural adaptations for long-horizon embodied tasks, including a Multimodal LLM Backbone, Interleaved Goal-State-Action Modeling, Vision Encoder, Context Extension Techniques, and Context Parallelism, designed to enable long-context reasoning and interaction.
The architecture uses an interleaved input structure of goal, state, and action tokens processed by a multimodal LLM to support coherent, real-time interaction modeling over extended sequences.
Context Extension Techniques and Context Parallelism are explored to address the limitations of standard LLMs in processing the extremely long input sequences generated by the ∞-THOR framework.

CODE GRAPH MODEL (CGM): A GRAPH-INTEGRATED LARGE LANGUAGE MODEL FOR REPOSITORY-LEVEL SOFTWARE ENGINEERING TASKS

Graph RAG: introduces Code Graph Models (CGMs) as the Reader component, along with Rewriter, Retriever, and Reranker, to integrate repository code graphs into LLMs for software engineering tasks.
The CGM itself comprises an Encoder, Adapter, and LLM Decoder to process semantic and structural code information.
This agentless framework achieves high resolution rates on repository-level issue fixing benchmarks using open-source LLMs.

LLM-Based Emulation of the Radio Resource Control Layer: Towards AI-Native RAN Protocols

RRC-LLM: introduces, an LLM-based framework for RRC emulation, with Large RRC Model (Core LLM), Base model (Pre-trained LLaMA3-8B), Instruction Tuning (LoRA adaptation), Uplink messages (RRC input), Downlink messages (RRC output), Network Context (Additional input data), RRC traces (Historical training data), BERT Encoder (Embeds messages), Pooling (Creates sentence embeddings), and Cosine-sim (Measures similarity), where the framework fine-tunes a Base model using Instruction Tuning on RRC traces to create a Large RRC Model that generates Downlink messages from Uplink messages and Network Context, evaluated via BERT Encoder, Pooling, and Cosine-sim.
The fine-tuned model achieves high cosine similarity with ground-truth RRC messages, demonstrating improved structural and semantic fidelity compared to a baseline LLM.
This work demonstrates the feasibility of using LAMs for control-plane procedures, laying groundwork for AI-Native Air Interface paradigms.

MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

MCP-RADAR benchmark: introduces a comprehensive benchmark for evaluating LLM tool use capabilities in the Model Context Protocol framework, featuring a Radar Bench (Testing environment), Dataset Construction (Benchmark task creation), Radar Test (Execution and analysis), and Implementation (Practical test setup).
The benchmark employs a novel five-dimensional approach measuring answer accuracy, tool selection efficiency, computational resource efficiency, parameter construction accuracy, and execution speed.
It provides objective, quantifiable measurements across multiple task domains including software engineering, mathematical reasoning, and general problem-solving.

O²-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

O²-Searcher: introduces, "O²-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering", with RL-based search agent, Simulated search interface, Local data source, Access corpus, Process retrieved info, Core model, RL-optimized agent, RL training signals, Interaction protocol components, designed to tackle open-domain open-ended and closed-ended questions.
The framework leverages a local simulated search environment for dynamic knowledge acquisition and employs a unified reinforcement learning mechanism with meticulously designed reward functions.
It uses a chat template for multi-round interaction, enabling the agent to reason, search, learn from feedback, and generate answers.

Large Language Model-Empowered Interactive Load Forecasting

LLM-based multi-agent collaboration framework: introduces an interactive load forecasting system with Human User, Task Manager, Preparation Assistant, Model Manager, Model Developer, Deployment Operator, Visualization Panel, and Experiment database components.
This framework leverages specialized LLM agents collaborating via messaging to manage the forecasting pipeline and integrate human expertise.
The system aims to lower technical barriers and improve forecasting accuracy through user interaction at key stages.

Is Your LLM-Based Multi-Agent a Reliable Real-World Planner? Exploring Fraud Detection in Travel Planning

WandaPlan: introduces, with Travel Plan Agent (Central Coordinator), Crawler Agent (Information Retrieval), Extractor Agent (Data Extraction), Summary Agent (Option Ranking), and Confirmation Agent (Final Decision), a fraudulent evaluation environment for LLM-based multi-agent travel planning systems.
The environment injects deceptive content into real-world data across misinformation, coordinated multi-person, and level-escalating multi-round fraud cases to assess agent vulnerability.
The study highlights existing frameworks' susceptibility to fraud and proposes an Anti-fraud Agent (Risk Analysis) integration to improve reliability.

Let's Get You Hired: A Job Seeker's Perspective on Multi-Agent Recruitment Systems for Explaining Hiring Decisions

Multi-Agentic Recruitment System: introduces a multi-agent AI system with MODERATOR, RECRUITER, and MENTOR agents, Memory, and an Agent Toolkit to guide job seekers and explain hiring decisions.
The system was developed using an iterative, user-centric design approach informed by active job seekers.
Evaluation demonstrated the system was perceived as significantly more actionable, trustworthy, and fair compared to traditional recruitment methods.

Psychology-driven LLM Agents for Explainable Panic Prediction on Social Media during Sudden Disaster Events

PsychoAgent (Psychology-driven generative Agent framework): introduces a psychology-driven framework for explainable panic prediction, integrating multi-domain data via a CoT-driven LLM-based agent, individual feature extraction, MoE system, and fine-tuned BERT model.
The framework simulates the psychological chain of panic formation through the agent's four stages: disaster event perception, risk cognition formation, panic emotion arousal, and posting response.
This approach provides mechanistic interpretability by modeling psychological processes, moving beyond traditional data-fitting methods for panic detection.

Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems

RecInter: introduces an agent-based simulation platform for dynamic recommender systems, featuring User Agents with User Profile, Memory Module, and Action Module, interacting with a Recommendation Platform and Merchant Agent, enhanced by Behavior Simulation Training.
The platform incorporates a novel interaction mechanism where user actions and merchant replies dynamically update item attributes, creating a realistic and evolving environment.
High-fidelity simulation is achieved through Multidimensional User Profiling, Advanced Agent Architecture, and LLM fine-tuned on Chain-of-Thought data, enabling replication of emergent phenomena like Brand Loyalty.

WEBAGENT-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

WEBAGENT-R1: introduces an end-to-end multi-turn reinforcement learning framework for training web agents, including an Agent (LLM), Dynamic Context Compression, Asynchronous Trajectory Rollout, and Multi-turn GRPO.
The framework learns directly from online interactions with web environments, guided by binary rewards, utilizing efficient mechanisms for context handling and trajectory generation.
WEBAGENT-R1 achieves state-of-the-art performance on the WebArena-Lite benchmark, highlighting the importance of behavior cloning initialization and thinking-based prompting.

LLM-Powered Agents for Navigating Venice's Historical Cadastre

LLM-Powered Agents framework: introduces a text-to-programs approach leveraging large language models to translate natural language queries into executable code for analyzing historical cadastral records, including SQL-Agent, Text-to-Python Agent, Entity Extractor, Planer, Coder, Code Execution Environment, Text2SQL (CodeS-7B), SQLite DB, Datasets, Prompt template, Column Extractor, Row Extractor, and Entity Search components.
The framework employs a SQL-Agent for structured data retrieval and a Text-to-Python Agent with specialized sub-agents (Entity Extractor, Planer, Coder) for complex analytical tasks.
The system processes historical cadastral datasets by extracting entities, planning analysis steps, generating code, and executing it to answer user queries about Venice's urban history.

Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

MEMENTO (Personalized Embodied Agent Evaluation Framework): introduces a two-stage evaluation process with Memory Acquisition Stage and Memory Utilization Stage, utilizing an LLM-powered embodied agent architecture with semantic and episodic memory for personalized assistance tasks.
The framework assesses memory utilization by comparing performance between stages where instructions vary in their reliance on previously acquired personalized knowledge.
The evaluation process includes Single-memory and Joint-memory tasks to test different levels of memory complexity and personalized knowledge types like object semantics and user patterns.

Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System

Manalyzer (Meta-analysis analyzer): automates end-to-end meta-analysis using a multi-agent system comprising Researcher, Document Collector, Literature Reviewer, Data Extractor, Checker, Data Analyst, and Reporter agents, leveraging various Tools.
It mitigates hallucinations in paper screening and data extraction through hybrid review, hierarchical extraction, self-proving, and feedback checking workflows.
The system significantly outperforms LLM baselines on a new benchmark dataset for paper screening and data extraction tasks.

No Black Boxes: Interpretable and Interactable Predictive Healthcare with Knowledge-Enhanced Agentic Causal Discovery

II-KEA (knowledge-enhanced Agentic Causal Discovery framework): introduces a multi-agent system for interpretable and interactive diagnosis prediction, comprising Clinical Datasets, Domain Knowledge, Knowledge Synthesis Agent, Causal Discovery Agent, and Decision-Making Agent, which leverages causal discovery to predict future diagnoses from EHR data.
The system utilizes three collaborative LLM agents, supported by clinical data matrices and a vector database of external medical knowledge, to generate a causal graph and predict diagnoses with explanations.
II-KEA enhances interpretability via causal analysis and interactivity through optional clinician input and external knowledge integration.

ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay

ARPO (Agentic Replay Policy Optimization): introduces an end-to-end reinforcement learning approach for GUI agents, with VLM Agent, Vision-Language Model, Observations, Actions, Chain-of-Thought, Reinforcement Learning, GRPO, Reward, Policy Gradient, Distributed Environments, Rollout Workers, Centralized Inference Server, Replay Buffer, and Valuable Tasks Selection, designed for policy optimization in complex GUI environments.
The framework augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse successful experience and employs a task selection strategy for stable training.
ARPO leverages distributed rollout and structured rewards to train vision-language GUI agents capable of multi-turn interactions and self-correction.

HIMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

HIMATE (Hierarchical Multi-Agent Framework for Machine Translation Evaluation): introduces a hierarchical multi-agent framework for machine translation evaluation, utilizing Tier-1 and Tier-2 agents structured by the MQM Hierarchy to perform Subtype Evaluation, Self-Reflection, and Collaborative Discussion, culminating in Weighted Scoring.
The framework employs a three-stage process where Tier-2 agents initially evaluate subtype errors, refine judgments through self-reflection, and engage in collaborative discussion with Tier-1 agents for low-confidence cases.
This hierarchical structure and multi-stage process, guided by the MQM error typology, enhance error span detection and severity assessment accuracy.

RAP: Runtime-Adaptive Pruning for LLM Inference

RAP (Runtime-Adaptive Pruning): introduces a framework for dynamically adjusting LLM (Large Language Model) compression strategies based on real-time conditions, utilizing an Inference Environment, Execution State, RL Agent, Greedy Sequential Importance (GSI), Pruning Action, Pruning & Inference Module, LLM, and Reward Calculation.
The framework employs a reinforcement learning agent that observes runtime state, including request characteristics and memory budget, to select an optimal pruning policy.
This adaptive approach prunes LLM components like FFN and MHA blocks, guided by GSI analysis, to balance memory efficiency and generative performance during inference.

LLM-Powered AI Agent Systems and Their Applications in Industry

LLM-Powered AI Agent System: introduces a comprehensive architecture, with Environment (Source of perception/interaction), Task Input (Defines objective/instructions), Context Augmentation (Leverages external knowledge sources), Agent (Central processing/decision unit), LLM (Cognitive engine/reasoning core), Multi-Modality Model (Processes diverse data inputs), Memory (Accesses external knowledge/history), Tool Utilization (Invokes APIs/databases/models), Output Guardrails (Filters/validates outputs), Actions (Executes decisions in environment), Other Agents (Interacts with other agents), Iterative Process (Continuous sensing/acting loop), enabling autonomous, goal-oriented behavior.
The system integrates LLMs with components for perception, memory, tool use, and guardrails to enable autonomous, goal-oriented behavior in dynamic environments.
The architecture facilitates context-aware decision-making and reliable execution through iterative sensing, planning, and action, addressing challenges like latency and uncertainty.

Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development

Two-Step Optimization Pipeline: introduces a method to optimize a Multi-Agent System (Role-based LLM agents) by utilizing a Critic Mechanism (Evaluates output, generates feedback) to produce Textual Feedback (Natural language evaluation output), which is then used by a Locator (Identifies underperforming agents) to pinpoint issues and an Optimizer (Optimizes agent prompts) to refine Agent Prompts (System prompts for agents).
The pipeline focuses on improving the performance of role-based LLM agents collaborating on complex tasks like software development by iteratively refining their prompts based on feedback.
The approach demonstrates effectiveness across various software development evaluation dimensions and investigates the impact of different optimization settings.

21st May 2025

How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior

LLM Agent with Memory: introduces, with Agent Execution (Performs tasks), Memory Management (Manages stored experiences), Stored Episodic Memory (Stores past experiences), Query (Input task), Retrieved Memory (Past experiences for guidance), Planning (Agent reasoning), Execution (Agent output/action), Addition (Adds new experiences), and Deletion (Removes past experiences), an empirical study on how memory management choices impact LLM agent behavior and long-term performance.
The study focuses on the fundamental memory operations of addition and deletion, investigating their impact on the agent's experience-following property and associated challenges like error propagation and misaligned experience replay.
The research proposes selective addition and combined deletion strategies to mitigate negative effects, demonstrating performance gains and robustness under challenging conditions.

MAPS: A Multilingual Benchmark for Global Agent Performance and Security

MAPS (Multilingual Agentic AI Benchmark Suite): introduces a benchmark suite for evaluating agentic AI systems across diverse languages and tasks, comprising GAIA Dataset (Real-world tasks), SWE-Bench Dataset (Code generation), MATH Dataset (Mathematical reasoning), Agent Security Benchmark Dataset (Security assessment), and a Translation Pipeline (Multilingual data generation).
The Translation Pipeline (Multilingual data generation) component utilizes Machine Translation (Initial structural translation), Meaning Preservation Verification (Check MT semantic fidelity), Translation Refinement (LLM-based enhancement), Integrity Check (Check refinement quality), and Direct LLM Translation (LLM fallback translation) to create multilingual versions of the datasets.
MAPS facilitates systematic analysis of agent performance and security degradation in multilingual settings, highlighting vulnerabilities not captured by English-only benchmarks.

ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

ViQAgent: introduces a zero-shot video question answering framework with VideoLLM Analyzer, Object Grounder, and CoT Judgment components.
The framework uses the VideoLLM Analyzer for initial video understanding and target identification, and the Object Grounder for open-vocabulary object detection and tracking.
The CoT Judgment module compares initial analysis with grounded data, generates clarification questions, and refines the final answer.

Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition

LLM-GELI / Multimodal-LLM-GELI: introduces a framework that leverages a Large Language Model (LLM) to decompose sparse Global Explicit Rewards into dense Turn-level Pseudo-rewards, which are then used to train a Lightweight Reward Model for aligning a Dialogue Agent.
The Multimodal variant enhances decomposition by incorporating Multimodal Feedback Features, such as facial expressions and gaze, alongside the Dialogue Transcript as input to the LLM.
This approach obviates the need for manual reward shaping and granular human feedback, demonstrating the LLM's effectiveness in decomposing global feedback for fine-grained behavioral alignment.

HCRMP: A LLM-HINTED CONTEXTUAL REINFORCEMENT LEARNING FRAMEWORK FOR AUTONOMOUS DRIVING

HCRMP (LLM-Hinted Contextual Reinforcement Learning Motion Planner): introduces a novel motion planning architecture for autonomous driving, with Augmented Semantic Representation Module (extends state space), Contextual Stability Anchor Module (improves weight reliability), and Semantic Cache Module (integrates LLM guidance).
The framework proposes an LLM-Hinted RL paradigm where the LLM provides semantic hints for state augmentation and policy optimization, while the RL agent maintains relative independence to counteract potential hallucinations.
This approach significantly improves driving performance in diverse and safety-critical conditions by combining LLM's understanding with RL's self-learning capabilities.

Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses

Checkpoint-GCG: introduces an informed adversarial attack leveraging Intermediate Model Checkpoints from the Alignment Process of an LLM to initialize the GCG algorithm sequentially, aiming to find an Adversarial Suffix that produces a Target Output from a Prompt, evaluated on the Base Model and Final Aligned Model, guided by a Checkpoint Selection Strategy.
The method exploits the incremental nature of alignment training by using successful suffixes found at earlier checkpoints as initialization for attacking later ones.
This approach demonstrates that existing alignment-based defenses are vulnerable to attacks when adversaries have knowledge of the alignment process, achieving high attack success rates.

DEBATE, TRAIN, EVOLVE: Self-Evolution of Language Model Reasoning

DTE (DEBATE, TRAIN, EVOLVE): introduces a novel ground truth-free training framework using multi-agent debate traces to evolve a single language model, including the DEBATE (Multi-agent debate process), Agents (Language models debating), RCR Prompting (Prompting strategy for debate), Debate Traces (Records of debate interactions), Consensus Answer (Final answer from debate), TRAIN (Fine-tuning process), Single Policy (Language model being trained), Reference Model (Frozen base policy), Reward Module (Calculates training reward), GRPO Optimizer (Optimization algorithm), EVOLVE (Iterative self-improvement), Evolved Single Model (Fine-tuned model), and Evolution Loop (Iterative training cycle) components.
The framework combines multi-agent debate (MAD) with self-supervised reinforcement learning (GRPO) to enable autonomous reasoning capability enhancement.
A key component is the REFLECT-CRITIQUE-REFINE (RCR) prompting strategy designed to improve debate quality and reduce issues like sycophancy.

From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems

End-to-End VLA Models: introduces a paradigm that directly maps language and vision inputs to low-level actions using Language encoder, Vision encoder, Action decoder, and Controller components.
Modular VLM Pipelines utilize a specialist VLM for perception and a separate Controller for action, exemplified by a prototype with Speech Transcription, Task Decomposition, Object Detection, Object Segmentation, and Manipulation modules.
Multimodal LLM Agents position a Multimodal LLM as a cognitive hub that orchestrates Auxiliary tools for perception and a Controller for action primitives via function calls.

Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

Duplex S2S Model: introduces a novel duplex speech-to-speech architecture with a Streaming Speech Encoder (Encodes user speech), Modality Adapter (Connects encoder to LLM), Channel Fusion (Combines user/agent streams), Decoder-only LLM (Processes combined streams), Pooling (Processes agent/user embeddings), Codec (Tokenizes agent speech), Text Projector (Outputs agent text), and Audio Projector (Outputs agent audio), designed for simultaneous user and agent interaction.
The model uses a pretrained streaming encoder for user input and a personalized codec for agent outputs, enabling duplex S2S without speech pretraining.
Separate modeling of agent and user facilitates codec fine-tuning for improved agent voices and reduces bitrate compared to prior work.

Swarm Intelligence Enhanced Reasoning: A Density-Driven Framework for LLM-Based Multi-Agent Optimization

SIER: introduces a framework for LLM-based multi-agent optimization, conceptualizing reasoning as a solution search process guided by swarm intelligence, featuring Population Initialization, Population Evolution, and Population Clustering and Selection phases, utilizing LLM-based agents, a Generator, and an Evaluator.
The framework employs a density-driven strategy within Population Evolution, using kernel density estimation and non-dominated sorting for multi-criteria selection to balance solution quality and diversity.
Step-level quality evaluation and adaptive sampling are used to refine reasoning paths and dynamically control exploration for efficient problem-solving.

InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation

InfoDeepSeek (Agentic RAG Framework): introduces a benchmark for evaluating agentic information seeking in dynamic web environments, featuring an Agent (Orchestrates information seeking process) that operates through a Retrieval Stage (Iteratively searches and browses web), Augmentation Stage (Filters and distills retrieved content), and Generation Stage (Synthesizes final answer), utilizing an LLM (Central reasoning engine for agent), Memory (Stores agent's interaction trajectory), and Tool Library (Interface to external tools).
The framework employs an autonomous LLM agent to perform multi-step planning, search, and reflection for robust evidence acquisition from the live web.
The benchmark includes challenging questions with attributes like multi-hop, long-tail, and distracting information, evaluated using fine-grained metrics for accuracy, utility, and compactness.

Collaborative Problem-Solving in an Optimization Game

Neurosymbolic Agent (Problem-Solving version): introduces a collaborative problem-solving agent for a Traveling Salesman-based game, incorporating a Language Model with symbolic components for state tracking, grounding, and an external optimization solver.
The agent collaborates with a partner through dialogue to find an optimal path in a graph where each player has partial information about edge weights.
The neurosymbolic agent outperforms a purely LLM-based baseline, demonstrating improved correctness and optimality in finding solutions.

X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System

X-WebAgentBench: introduces a multilingual interactive web benchmark for evaluating language agents, comprising Data Preparation (selects languages, prepares data), Multilingual Instruction Construction (translates instructions), Multilingual Environment Construction (translates environment), and Quality Check (validates data accuracy) stages.
The benchmark includes 14 languages, 2,800 instructions, and 589,946 products to assess agent performance in multilingual web environments.
Evaluation on X-WebAgentBench reveals challenges for current language agents in multilingual scenarios, particularly regarding language alignment and long interactions.

CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution

CRAKEN: introduces a knowledge-based LLM agent framework, with Planner (devises task plan), Executor (executes delegated tasks), Trigger Retriever (initiates knowledge retrieval), Decompose Context (extracts task information), Knowledge Database (stores domain knowledge), RETRIEVER (retrieves relevant documents), RELEVANCEGRADER (grades document relevance), GENERATOR (generates knowledge hint), HALLUCINATIONGRADER (grades hint grounding), REWRITER (rewrites query), SOLVEDGRADER (grades hint sufficiency), and Injection (integrates knowledge hint), designed to enhance cybersecurity capabilities.
The framework combines a planner-executor multi-agent system with an iterative retrieval system using Self-RAG and Graph-RAG on a cybersecurity knowledge database.
CRAKEN improves performance on complex cybersecurity tasks by decomposing context, iteratively retrieving knowledge, and injecting insights into the agent's workflow.

Multiple Weaks Win Single Strong: Large Language Models Ensemble Weak Reinforcement Learning Agents into a Supreme One

LLM-Ens: introduces a framework that leverages LLMs to dynamically ensemble multiple weak reinforcement learning agents by categorizing task situations and selecting the best-performing agent for the current context.
The framework consists of three stages: situation generation, agent reward distribution analysis, and dynamic model ensemble during inference.
LLM-Ens demonstrates improved performance over baseline ensemble methods and single agents on Atari tasks by adapting to varying task conditions and agent strengths.

LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven by Large Language Models

LLM-Explorer: introduces a plug-in method that utilizes two LLMs, one for analyzing the agent's learning status and another for generating a policy exploration strategy distribution, enabling the Agent to adaptively explore the Environment.
The framework samples action-reward trajectories from the Agent's interaction with the Environment to inform the LLMs' analysis and strategy generation.
This design allows LLM-Explorer to enhance policy exploration in reinforcement learning by dynamically adjusting the exploration strategy based on the agent's real-time learning status.

P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

P2P (Automated Paper-to-Poster Generation): introduces a flexible, LLM-based multi-agent framework for generating academic posters, including Figure Agent (Processes visual elements), Figure Extractor (Extracts figures, tables), Figure Describer (Describes visual elements), Figure Checker (Validates figure processing), Section Agent (Generates textual content), Section Generator (Creates text content), Section Checker (Validates text content), Orchestrate Agent (Assembles final poster), HTML Generator (Renders HTML poster), Poster Checker (Evaluates poster layout), and Reflection loops (Enables iterative refinement).
The framework processes research papers through specialized agents, each with a checker module, to extract visual elements, generate content, and assemble HTML-rendered posters.
Iterative refinement via checker modules and reflection loops ensures output quality and seamless integration of visual and textual components into cohesive posters.

ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection

ReflAct (Reflect for Action): introduces a novel backbone that shifts reasoning from planning next actions to continuously reflecting on the agent's state relative to its goal, using Task Goal, Observation, Internal State, Reflection, and Action components.
This framework grounds decisions in actual observations and maintains continuous goal alignment to improve strategic reliability and reduce hallucinations.
ReflAct achieves this by explicitly prompting the LLM agent to generate a Reflection encoding the internal belief state and task goal before selecting an Action.

LMGAME-BENCH: How Good are LLMs at Playing Games?

Imgame-Bench (LMGAME-BENCH): introduces a benchmark for evaluating LLMs on games, with Models (LLM/VLM agents), Perception Module (Converts UI to symbolic/text), Memory Modules (Integrates transient memory/reflection), and Reasoning Modules (Supports reasoning traces) to enhance evaluation reliability.
The benchmark uses a modular harness to improve LLM game-playing capabilities and address challenges like poor perception, prompt sensitivity, and data contamination.
Evaluation across 13 models and 6 games shows the benchmark is challenging, differentiates models, and reveals that game-based training can transfer to other planning and agentic tasks.

Multicrossmodal Automated Agent for Integrating Diverse Materials Science Data

Multicrossmodal Agent: introduces a multi-agent LLM framework integrating diverse materials science data using specialized agents, a shared embedding space, and a fusion process.
The framework employs a Unified Team Agent to orchestrate modality-specific agents (Web, PDF, Image, Video, CSV) that process data and project insights into a shared embedding space.
A Fusion Agent orchestrates dialogue, applies dynamic gating to weighted insights from specialized agents, and generates a unified scientific report or retrieval results.

LTDA-Drive: LLMs-guided Generative Models based Long-tail Data Augmentation for Autonomous Driving

LTDA-Drive (Long-Tail Data Augmentation framework): introduces a data augmentation framework with head-class object removal, tail-class object insertion, and LLM-guided candidate filtering, designed to address long-tail distribution in 3D object detection.
The framework replaces frequent head-class objects with synthetically generated tail-class instances in driving scenes.
It leverages text-guided diffusion models for generation and an LLM agent for filtering to ensure high-quality, diverse augmented data.

An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents

LLM-based Search Agent: introduces an empirical study on training LLM agents capable of interleaved reasoning and search using reinforcement learning, investigating the impact of reward formulation, underlying LLM characteristics, and search engine choice.
The framework involves an LLM Agent interacting with a Search Engine, guided by a Reward Function within a Reinforcement Learning process.
Key findings highlight the importance of format rewards, the influence of LLM type and scale, and the critical role of the search engine in training dynamics and inference robustness.

A Risk Taxonomy for Evaluating Al-Powered Psychotherapy Agents

Risk Taxonomy: introduces a structured framework for evaluating AI-powered psychotherapy agents with Immediate Risk (Imminent danger) and Potential Risk (Emerging vulnerability) components.
The taxonomy aims to identify and categorize potential negative outcomes and user harms in AI psychotherapy interactions.
Developed through literature review, expert interviews, and clinical criteria alignment, the taxonomy provides a basis for safer AI mental health support.

Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English

PMP (Pragmatic Metacognitive Prompting): introduces, with Comprehension of Context/Understanding, General Pragmatic Analysis, Preliminary Judgment, Meta-Comprehension, Specific Pragmatic Reassessment, and LLM components, a method for explainable sarcasm detection across English varieties.
The approach adapts pragmatic metacognitive prompting to guide large language models in generating textual explanations for sarcasm in Australian, Indian, and American English.
PMP significantly improves performance compared to baseline prompting strategies, particularly for non-standard English varieties, by providing pragmatic scaffolding.

StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization

StepSearch: introduces a reinforcement learning framework for search LLMs, utilizing a Policy LLM interacting with a Search Engine, trained via StePPO with a detailed Reward Mechanism and supported by a Data Augmentation Pipeline and Memory/History.
The Reward Mechanism includes both global and step-wise token-level rewards based on information gain and redundancy penalties to guide search actions.
The framework aims to improve multi-hop reasoning by providing fine-grained supervision for iterative retrieval and query formulation.

AutoData: A Multi-Agent System for Open Web Data Collection

AutoData (Automated web Data collection): introduces a novel multi-agent system for open web data collection, with Manager Agent orchestrating workflow, Research Squad extracting knowledge and designing blueprint, Develop Squad building program and validating data, and Oriented HyperGraph Cache System optimizing information flow and managing artifacts.
The system comprises eight specialized agents organized into research and development squads coordinated by the manager agent.
The OHCache system includes an oriented message hypergraph, an oriented hyperedge formatter, and a local cache system to enhance multi-agent collaboration efficiency.

ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges

ModelingAgent: introduces a multi-agent framework with Idea Proposer, Data Searcher, Model Implementor, Report Writer, Critic Module, Shared Memory, and External Tool Set components, designed to tackle complex mathematical modeling problems.
The framework utilizes a shared memory for agent coordination and an external tool set for data acquisition and code execution.
An integrated Critic Module enables iterative self-refinement of agent outputs based on specific evaluation rubrics.

UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking

URDUFACTCHECK: introduces an end-to-end fact-checking framework for Urdu, with CLAIMPROCESSOR, QUERYGENERATOR, RETRIEVER, and VERIFIER components, utilizing multi-strategy evidence retrieval including Monolingual, Translated, and Thresholded Translated Retrieval.
The framework addresses the scarcity of high-quality Urdu evidence through its dynamic evidence retrieval pipeline that combines monolingual and translation-based approaches.
The paper also introduces two new hand-annotated benchmarks, URDUFACTBENCH and URDUFACTQA, for evaluating claim verification and LLM factuality in Urdu.

PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration

PiFlow: introduces a principle-aware multi-agent system for scientific discovery, including PiFlow (Strategic director, information-theoretical), Planner Agent (Relays strategic guidance), Hypothesis Agent (Proposes testable hypotheses), Experiment Agent (Validates hypotheses, using tool), and Tt (Accumulated principle-outcome data).
The framework views scientific discovery as a structured uncertainty reduction problem guided by scientific principles selected via Min-Max optimization.
This approach enhances discovery efficiency and solution quality by systematically steering hypothesis generation and validation based on accumulated evidence.

Large Language Model-Powered Agent for C to Rust Code Translation

LAC2R (LLM-powered Agent for C-to-Rust code translation): introduces a novel C-to-Rust translation approach with LLM(s) (Heterogeneous), Virtual Fuzzing-based equivalence Test, Monte Carlo Tree Search, Preprocessor, Code Analyzer, Verifier, and Postprocessor.
The framework leverages LLMs' agentic capabilities, using VFT to identify functional non-equivalence and MCTS to plan iterative code refinement steps.
The approach aims to improve the safety and correctness of Rust code translated from C by systematically guiding the LLM refinement process.

Simulating Prosocial Behavior and Social Contagion in LLM Agents under Institutional Interventions

PROSIM: introduces a simulation framework for modeling prosocial behavior in LLM agents, with Individual Simulation (instantiates agents), Scenario Simulation (defines contexts), Interaction Simulation (models network dynamics), and Intervention Simulation (manipulates policies) components.
The framework examines how prosocial behavior emerges, adapts, and erodes in LLM-based agents under diverse social and institutional conditions.
PROSIM integrates four key modules to approximate the complexity of real human social environments for studying social alignment and institutional dynamics.

MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

MAS-ZERO: introduces a self-evolved, inference-time framework for automatic multi-agent system design, utilizing a Meta-Agent that orchestrates design and verification, leverages Building Blocks of pre-defined MAS configurations, performs Meta-Iterations through iterative design and feedback, executes the generated MAS via a Compiler to obtain Intermediate Outputs and Candidate Answers, and employs Self-Verification to select the best final solution.
The framework iteratively refines MAS configurations tailored to each problem instance by decomposing tasks (Meta-Design) and evaluating performance based on intermediate outputs (Meta-Feedback) without requiring a validation set.
This approach enables dynamic agent composition and problem decomposition, leading to improved performance and cost-efficiency compared to manual and existing automatic MAS design methods.

20th May 2025

ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions

ContextAgent: introduces a context-aware proactive LLM agent framework with Sensory Context Extraction (Extracts context from perceptions), Persona Context Extraction (Extracts context from historical data), Context-aware Reasoner (Integrates contexts, reasons, predicts), LLM (Core reasoning engine), Thought Traces (Generated reasoning steps), Proactive Predictions (Predicts need for service), External Tool Calling (Calls external tools), and Services (Provides assistance).
The framework leverages extensive sensory perceptions from wearables and persona contexts to understand user intentions and predict the need for proactive assistance.
ContextAgent utilizes tool-augmented LLM reasoning and introduces ContextAgentBench, a benchmark for evaluating such agents.

Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation

LAG (log-augmented generation): introduces a framework that directly reuses prior computation and reasoning from past logs at test time, utilizing a Log Store, Log Encoder, Log Retriever, and augmented Generator LM.
The framework represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for a subset of tokens.
This approach directly reuses prior reasoning and computations without additional steps for knowledge extraction, enhancing performance and efficiency.

JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation

JARVIS (Just A Remarkable VLSI Intelligence System): introduces a multi-agent framework for high-quality EDA script generation, leveraging LLMs and domain expertise with components including a Top Agent, Code Fixing Agent, Guardrail Agent, Code Generator, RuleEnforce, Code Compiler, RAG, Simulate ProcessSim, User Query, Instructions, and Final code.
The framework employs an iterative refinement process using a feedback loop between agents and custom tools like the Code Compiler and RuleEnforce to detect and fix errors.
JARVIS integrates domain-specific knowledge through RAG and custom compiler checks, addressing challenges like data scarcity and hallucination in specialized EDA tasks.

MedBrowseComp: Benchmarking Medical Deep Research and Computer Use

MedBrowseComp: introduces a benchmark for evaluating agent performance on complex medical information retrieval, including Primary Data Sources (External knowledge bases), Task Types (Question categories), and Verification & Scoring (Evaluation method).
The benchmark utilizes diverse medical knowledge bases and defines distinct task categories to assess agent capabilities in navigating and synthesizing information.
Evaluation involves verifying agent answers against ground truth using standardized metrics, highlighting performance gaps in multi-hop reasoning and computer use.

Think, Reflect, Create: Metacognitive Learning for Zero-Shot Robotic Planning with LLMs

Metacognitive Learning Module: introduces a framework integrating metacognitive learning into LLM-powered multi-robot collaboration with Modular Skill Set Construction, Metacognitive Inference, and Self-Reflection components.
The framework enables LLM-powered robotic agents to decompose skills, reason about tasks, synthesize plans, reflect on failures, and generate new solutions.
This approach aims to enhance zero-shot robotic task performance by empowering LLMs with capabilities for reasoning, reflection, and creative problem-solving.

MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation

MAATS (Multi-Agent Automated Translation System): introduces a modular framework using specialized LLM-based agents, including a Translator Agent (Generates initial translation), MQM Evaluator Agents (Evaluate translation errors) for specific dimensions, and an Editor Agent (Synthesizes annotations, refines translation), to enhance machine translation quality.
The system leverages the Multi-dimensional Quality Metrics (MQM) framework to provide fine-grained error detection and refinement signals across multiple dimensions.
This multi-agent architecture simulates human translation workflows, outperforming single-agent and zero-shot baselines in error detection and translation quality.

Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters

TreeDebater: introduces a novel debate framework for LLMs, utilizing a Rehearsal Tree (anticipates attacks/defenses), Debate Flow Tree (tracks debate status), Human Debate Trees (references for feedback), Simulated Audience (provides revision feedback), Speech Time Controller (controls speaking time), and Writer (drafts debate statement) to improve strategic planning in competitive debate.
The framework models dynamic debate interaction on trees, enabling LLMs to make tactical decisions under time constraints.
TreeDebater retrieves prepared arguments, selects impactful actions, and refines statements based on simulated audience feedback and time limits.

Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization

Mujica (Multi-hop Joint Intelligence for Complex Question Answering): introduces an agentic QA framework with a planner (decomposes questions, plans subquestions) and a worker (answers subquestions, uses retriever).
The planner module is responsible for breaking down complex questions into a directed acyclic graph and managing the overall process.
The worker module acts as a mini-RAG system, retrieving information and answering subquestions assigned by the planner.

Empowering LLMs in Task-Oriented Dialogues: A Domain-Independent Multi-Agent Framework and Fine-Tuning Strategy

DIMF (Domain-Independent Multi-Agent Framework): introduces a task-oriented dialogue system with Intent Classification Agent (extracts user intent), Slot Filling Agent (extracts dialogue slots), and Response Agent (generates system response), trained using SFT (initial model fine-tuning), DPO (preference-based training), and DDA (mitigates DPO degradation).
The framework separates complex tasks into domain-independent components to improve performance on lightweight large language models.
The proposed Data Distribution Adaptation method enhances DPO training stability and the framework demonstrates strong generalizability and zero-shot capabilities.

Safety Devolution in AI Agents

Core Evaluation Framework: introduces, "a framework to measure the impact of retrieval and alignment mechanisms on model bias and harmfulness", with Censored LLM, Uncensored LLM, Agents with Censored LLM, Generate Query, Search, ReRank, Crawl, Answer, WikiAgent, WebAgent, System-Level Safety Prompts, and Evaluator components, where "the framework systematically compares LLMs with and without retrieval augmentation and safety mitigations across various benchmarks".
The framework reveals that integrating external retrieval into safety-aligned LLMs leads to a phenomenon termed safety devolution, characterized by reduced refusal rates, increased bias, and degraded safety scores.
Controlled experiments within the framework indicate that this safety degradation is primarily caused by the mere presence of retrieved context, rather than retrieval depth or accuracy, highlighting a structural vulnerability in RAG systems.

DSMENTOR: ENHANCING DATA SCIENCE AGENTS WITH CURRICULUM LEARNING AND ONLINE KNOWLEDGE ACCUMULATION

DSMentor: introduces a framework with a Mentor agent (curriculum designer) that processes a Dataset (input tasks) to create a Curriculum-based dataset (ordered tasks), which is then used by a Student agent (code generator) interacting with a Long-term memory (accumulated knowledge) and an Environment (evaluates code) for problem-solving.
The framework operates in two stages: curriculum generation and problem-solving, leveraging curriculum learning and online knowledge accumulation.
The Mentor agent determines task difficulty to sequence problems from easy to hard, guiding the Student agent's learning progression.

MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem

MM-Agent: introduces an expert-inspired framework that decomposes mathematical modeling into four sequential phases: Problem Analysis, Mathematical Modeling, Computational Solving, and Solution Reporting.
The framework utilizes specialized agents like the Analyst Agent, Task Coordinator Agent, Modeling Actor, Modeling Critic, Modeling Programmer Agent, and Reporting Agent to handle distinct tasks within each phase.
Key components such as the Hierarchical Mathematical Modeling Library (HMML) and MLE-Solver support knowledge retrieval, model formulation, and computational execution for real-world problems.

s3: You Don't Need That Much Data to Train a Search Agent via RL

s3: introduces a modular, RL-based search framework with a Searcher LLM (RL-trained agent), Search Engine (retrieval source), frozen Generator LLM (frozen answer generator), and Gain Beyond RAG (reward signal), which trains a search-only agent using a novel reward signal to optimize retrieval for generation quality.
The framework decouples the searcher from the generator, allowing the searcher to be trained with reinforcement learning based on the improvement in generator accuracy using retrieved documents compared to naive retrieval.
By focusing training solely on the searcher using a generation-aware reward, s3 achieves strong performance with significantly less training data and is compatible with black-box generator LLMs.

Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent

SPlanner: introduces a framework for mobile GUI agents that includes Application Modeling via EFSM (models applications), Structured Knowledge Base (collection of EFSMs), Plan Generation (creates execution plan), Instruction Parsing (parses user instruction), EFSM Solving (finds execution path), Path Polishing (refines execution path), Task Execution with VLM (executes the plan), Vision-Language Model (VLM) (executes GUI actions), LLM (parses/polishes text), BFS-based Solver (finds path in EFSM), User Instruction (input command), GUI Screenshot (current screen state), Action History (previous actions), Task Plan (step-by-step guide), and Operation Instruction (GUI action).
The framework models mobile applications using Extended Finite State Machines (EFSMs) to create a structured knowledge base for planning.
SPlanner generates interpretable and reliable execution plans by parsing user instructions, solving EFSMs, and polishing the resulting paths using LLMs, which are then executed by a VLM.

BAR: A Backward Reasoning based Agent for Complex Minecraft Tasks

BAR (Backward Reasoning based Agent): introduces an agent for complex Minecraft tasks with recursive goal decomposition (Recursive goal decomposition), state consistency maintaining (State conflict resolution), and stage memory (Memory from environment interaction) modules.
The agent utilizes backward reasoning to plan from the terminal state, aiming to overcome the perception gap faced by forward reasoning in complex tasks.
State consistency is ensured by integrating forward and backward reasoning, and planning efficiency is enhanced by leveraging successful past interactions stored in stage memory.

Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

ReasonRAG: introduces, "Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning", with all LLM (Core reasoning model), Retriever (External knowledge access), Reasoning Stage (Decide query or answer), Grounding Stage (Extract evidence), Terminal Stage (Final answer state), Query Generation (Formulate search query), Evidence Extraction (Identify relevant text), Answer Generation (Produce final response), Memory (Stores previous steps)-components, where ReasonRAG is a process-supervised agentic RAG method using fine-grained rewards for policy optimization.
The framework employs Monte Carlo Tree Search and Shortest Path Reward Estimation to generate a high-quality process-level dataset, RAG-ProGuide, for training.
ReasonRAG enables LLMs to autonomously manage dynamic retrieval, iterative context refinement, and adaptive workflows for complex search queries.

Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning

SPLIT-RAG (Semantic Partitioning of Linked Information for Type-Specialized Multi-Agent RAG): introduces a multi-agent RAG framework with Knowledge Base Preprocessing (Prepare data), QA Input Processing (Analyze query), Retrieval Plan Decision (Determine subgraphs/agents), Multi-Agent RAG (Distributed retrieval), Answer Generation (Combine, resolve, finalize), Lightweight LLM Agents (Query subgraphs), and Head Agent (Final answer generation), which partitions knowledge graphs based on question types and uses multiple agents for efficient, conflict-resistant retrieval and answer generation.
The framework employs question-driven graph partitioning to create semantically coherent subgraphs, enabling lightweight agents to query only relevant partitions in parallel.
A hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications, with a head agent synthesizing the final response.

MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

MLZero: introduces a multi-agent system for end-to-end machine learning automation, featuring Perception, Semantic Memory, Episodic Memory, and Iterative Coding modules, coordinated by specialized agents including File Grouping and File Perception, Task Perception, ML Library Selection, Condensation, Summarization, Retrieval, Error Analyzer, Coder, and Executer agents.
The system processes raw multimodal data through perception, leverages dual memory modules for knowledge and history, and employs iterative coding with agents for code generation, execution, and debugging.
MLZero achieves end-to-end ML automation with minimal human intervention by transforming raw data into ready-to-use models and predictions through this integrated multi-agent architecture.

DRUGPILOT: LLM-BASED PARAMETERIZED REASONING AGENT FOR DRUG DISCOVERY

DrugPilot (LLM-based parameterized reasoning agent): introduces an agent system for automating multi-stage drug discovery workflows, comprising LLM Backbones (Core language model), Parameterized Memory Pool (PMP) (Structured key-value data storage), AI Model Zoo (Drug discovery tools/models), and Fe-Fo Mechanism (Error feedback and focus).
The Parameterized Memory Pool (PMP) is a core component designed to handle large-scale, multi-modal drug data by converting it into standardized parametric representations for efficient retrieval and interaction.
The Fe-Fo Mechanism enhances the agent's robustness by providing specific error feedback and maintaining focus during complex multi-turn tasks and tool interactions.

CLEVER: A Curated Benchmark for Formally Verified Code Generation

CLEVER (Curated Lean Verified Code Generation Benchmark): introduces a benchmark for formally verified code generation, requiring models to perform specification generation, isomorphism proving, Lean implementation generation, and correctness proving to achieve end-to-end verification.
The benchmark evaluates models in two stages: specification certification (generating and proving equivalence of a Lean specification) and implementation certification (generating and proving correctness of a Lean implementation).
Success in CLEVER requires both the generated specification and implementation to be formally certified via Lean proofs, ensuring semantic correctness beyond test cases.

PANDAGUARD: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks

PANDAGUARD: introduces a unified and modular framework for systematic LLM safety evaluation, conceptualizing jailbreak safety as a multi-agent system with Attacker (generates adversarial prompts), Defender (implements protection mechanisms), Target LLM (model being evaluated), and Judger (evaluates response safety) components.
The framework supports plug-and-play experimentation with diverse attack methods, defense mechanisms, and judgment strategies within a flexible pipeline architecture.
Built upon this framework, PANDABENCH provides a large-scale benchmark for comprehensive and reproducible evaluation of LLM jailbreak vulnerabilities and defenses.

Structured Agent Distillation for Large Language Model

SAD (Structured Agent Distillation): introduces a framework that compresses large LLM agents into smaller student models by segmenting teacher trajectories into Reasoning and Action Segments, applying span-specific supervision via Segment Masks and aggregating losses using CoT-Policy Alignment Loss and Action Consistency Loss, guided by Curriculum Sampling.
The framework uses a Teacher to generate trajectories, which the Student learns to imitate by minimizing the Total Loss, preserving both reasoning fidelity and action consistency.
This structure-aware distillation method outperforms token-level baselines, enabling compact agents to better replicate the teacher's decision process with minimal performance drop.

RAG/LLM Augmented Switching Driven Polymorphic Metaheuristic Framework

PMF (Polymorphic Metaheuristic Framework): introduces a self-adaptive metaheuristic framework for optimization problems, with PMA (Orchestrates algorithms), PMSA (Selects algorithms), Metaheuristic Algorithms Pool (Available algorithms), Feedback Loop (Real-time adaptation), and Population Transfer (Transfers solutions).
The framework utilizes a Polymorphic Metaheuristic Agent (PMA) and a Polymorphic Metaheuristic Selection Agent (PMSA) for dynamic algorithm selection and switching based on real-time performance feedback.
PMF leverages real-time performance feedback and can integrate RAG/LLM for enhanced decision-making, demonstrating improved optimization efficiency and adaptability.

LLINBO: Trustworthy LLM-in-the-Loop Bayesian Optimization

LLINBO (LLM-in-the-Loop Bayesian Optimization): introduces a hybrid framework for Bayesian Optimization combining LLM Agent (Suggests design points), Statistical Surrogate (GP) (Models function, quantifies uncertainty), and Dataset (Stores historical observations).
The framework leverages LLMs for early exploration using contextual reasoning and GPs for efficient exploitation using principled statistical models.
Three specific mechanisms (Transient, Justify, Constrained) are proposed to enable this collaboration and provide theoretical guarantees.

19th May 2025

SIMULATION AGENT: A FRAMEWORK FOR INTEGRATING SIMULATION AND LARGE LANGUAGE MODELS FOR ENHANCED DECISION-MAKING

Simulation Agent framework: introduces, with Simulation Model (Core engine), Inputs (Configuration files), Outputs (Time-series data), AI Agent (Bridge/interpreter), and User (End stakeholder), a system integrating simulations and LLMs for enhanced decision-making.
The AI Agent utilizes LLMs and tool-calling capabilities to enable natural language interaction for running simulations, modifying inputs, and interpreting outputs.
This framework aims to improve the usability and accessibility of complex simulation analysis for non-technical users by grounding LLM interactions in accurate simulation results.

Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents

Guided Search Strategies: introduces two guided search methods, 1-step lookahead and trajectory selection, guided by a learned action-value function estimator, applicable to non-serializable environments like SWE-agent scaffolding.
The approach leverages a base policy (LLM) and a critic model to improve performance consistency in environments where intermediate states cannot be saved or restored.
Empirical evaluation on SWE-bench Verified demonstrates that these strategies significantly improve the success rate of both open-weight and closed models.

Incentivizing Truthful Language Models via Peer Elicitation Games

Peer Elicitation Games (PEG): introduces a training-free, game-theoretic framework for aligning LLMs, including a Generator (produces responses), Discriminators (evaluate responses), Peer Elicitation Game (structures agent interaction), Reward Mechanism (incentivizes truthful reporting), Policy Update (adjusts agent strategies), and Majority Vote (aggregates discriminator judgments).
PEG employs multiple LLM discriminators that evaluate a generator's output and are rewarded based on mutual evaluation using a determinant-based score, promoting truthful reporting without ground-truth labels.
The framework utilizes online learning with policy updates to converge agents towards a truthful Nash equilibrium, demonstrating improved factual accuracy and competitive performance with smaller models.

Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges

DialogTool/VirtualMobile: introduces a benchmark and environment for evaluating stateful tool use in multi-turn dialogues, featuring a Dialogue Agent interacting with a VirtualMobile Environment containing Apps and APIs with a Database.
The benchmark assesses large language models across the entire tool use lifecycle, including creation, utilization (awareness, selection, execution), and role-consistent response.
Experiments reveal that current large language models struggle with tool creation and execution, particularly over long dialogue horizons and with complex APIs.

TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents

TimeSeriesGym: introduces, with (Origins), (Tasks), (Agent artifacts), (Supports agents), (Numeric metrics), (LLM assessment), (LLM qualitative), and (Combined evaluation) components, a scalable benchmark for evaluating AI agents on time series ML engineering tasks.
The framework provides diverse challenges and evaluates multimodal agent outputs using a dual quantitative and qualitative assessment approach.
Its agent-agnostic design and scalable task generation mechanism support comprehensive and practical evaluation of AI agents.

Hybrid Voting-Based Task Assignment in Modular Construction Scenarios

HVBTA (Hybrid Voting-Based Task Assignment): introduces a framework for multi-agent task assignment in construction, integrating Task Descriptions (Define task requirements), Agent Capability Profiles (Define agent abilities), Suitability Matrix Generation (Calculate agent-task compatibility), LLM Integration for Semantic Reasoning (Use LLM for suitability), Voting and Allocation Mechanism (Assign tasks using voting), and CBS for Path Planning (Generate collision-free paths).
The framework defines tasks and agents, calculates suitability, uses voting and LLM for assignments, and plans collision-free paths.
HVBTA aims to improve efficiency and coordination for heterogeneous robotic teams in modular construction.

From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

Three-level taxonomy: introduces, "From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery", with all LLM as Tool (Task Automation Tool), LLM as Analyst (Data Modeling & Analytical Agent), and LLM as Scientist (Open Exploratory & Discovery Agent)-components, where the survey systematically charts the progression of Large Language Models in scientific discovery through distinct levels of autonomy.
This framework delineates the escalating autonomy and evolving responsibilities of LLMs within the scientific research lifecycle, from foundational assistants to autonomous researchers.
The paper categorizes existing research works based on this taxonomy and the stages of the scientific method, highlighting the shift towards sophisticated, multi-stage agentic workflows.

Effective and Transparent RAG: Adaptive-Reward Reinforcement Learning for Decision Traceability

ARENA (Adaptive-Rewarded Evidence Navigation Agent): introduces a transparent RAG generator framework trained via reinforcement learning, with Structured Generation, KL Stabilization, and Adaptive Reward Calculation components, enabling interpretable decision traces and effective reasoning.
The framework uses a structured output format including selected evidence, reasoning traces, and final answers for end-to-end interpretability.
Adaptive, task-specific rewards and a stabilized optimization strategy are tailored for multi-hop question answering tasks.

Agentic Publications: An LLM-Driven Framework for Interactive Scientific Publishing, Supplementing Traditional Papers with AI-Powered Knowledge Systems

Agentic Publications (AP): introduces an LLM-driven framework, with Knowledge Representation Layer (Store scientific knowledge), Interactive Query Interface (User/AI interaction), Dynamic Updating Mechanism (Continuous knowledge ingestion), and Verification and Governance Process (Ensure quality/integrity), transforming traditional papers into interactive knowledge systems.
The framework integrates structured and unstructured data using Retrieval-Augmented Generation (Connect LLM to knowledge) and Multi-Agent Verification (Collaborative accuracy checks) processes.
Supported by a Knowledge Base (Integrated data store), Agents (Perform tasks/checks), Databases (Vector/graph/relational storage), and Integration APIs (External/internal connectivity), the system enables dynamic knowledge synthesis and interactive access for humans and AI.

Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities

Adversarial Evaluation Framework: introduces a method to stress-test LLM decision-making using LLM (Subject under test), Task (Interactive environment), Learner Model (RNN) (Predicts LLM behavior), and Adversary (RL agent) (Manipulates environment).
The framework trains a Learner Model to predict LLM actions and an Adversary to exploit these patterns by manipulating task rewards and observations.
This approach reveals LLM vulnerabilities to manipulation and rigidity in strategy adaptation in dynamic, adversarial settings.

Fixing 7,400 Bugs for 1$: Cheap Crash-Site Program Repair

WILLIAMT: introduces a cheap crash-site program repair approach, comprising Regex-Based Context Retrieval (identifies crash site) and Template-Guided Patch Generation (generates fix using templates and LLM).
The system leverages regex on sanitizer reports for context retrieval and template-guided LLM analysis to identify key variables for patch generation, producing a One-Shot Patch.
This strategy minimizes LLM reliance, substantially reducing query costs and enabling effective repair with smaller, cheaper models.

Q²Forge: Minting Competency Questions and SPARQL Queries for Question-Answering Over Knowledge Graphs

Q²Forge: introduces an end-to-end pipeline for generating question-query datasets for knowledge graphs, including KG Configuration Creation (Configure KG), Competency Question Generation (Generate NL questions), SPARQL Query Generator & Executor (Translate & run queries), and SPARQL Query Refinement (Iteratively improve queries).
The framework guides users through configuring a knowledge graph, generating natural language competency questions, translating them into SPARQL queries, executing the queries, and refining the results.
Q²Forge leverages language models and other services to automate and assist in creating high-quality question-query pairs for knowledge graph documentation, training, and benchmarking.

The Hidden Dangers of Browsing AI Agents

Browser Use: introduces a security evaluation of autonomous browsing AI agents, focusing on systemic vulnerabilities across architectural layers, using Browser Use as a case study.
The paper analyzes the attack surface of browsing agents, detailing threats related to Perception, Reasoning/Planning, External Tools (Actions), Browsing Engine, Sensitive Data Handling, and Domain Restriction components.
Identified vulnerabilities in Browser Use, including domain restriction bypass and credentials exfiltration via prompt injection, highlight the need for multi-layered security approaches.

CAIM: Development and Evaluation of a Cognitive AI Memory Framework for Long-Term Interaction with Intelligent Agents

CAIM (Cognitive AI Memory Framework): introduces a framework with a Memory Controller (Central decision unit), Memory Retrieval (Filters relevant LTM data), Post-Thinking (Maintains LTM storage), Memory (STM/LTM) (Stores conversation/historical data), LLM Agent (Performs tasks within modules), and Python task (Processes/stores data) for long-term interaction.
The framework enhances LLMs' memory capabilities by integrating cognitive AI principles and memory-augmented methods.
CAIM addresses challenges in long-term interactions by managing memory usage, filtering relevant information, and maintaining memory storage.

Adversarial Reasoning for Repair Based on Inferred Program Intent

ADVERINTENT-AGENT: introduces a multi-agent framework for automated program repair, with Reason Agent (reasons program intent), Test Agent (generates tests), Repair Agent (generates patches), and Environment (compiles, executes, searches) components, that infers adversarial program intents to guide patch generation and testing.
The approach uses adversarial reasoning to explore multiple potential program intents and generate corresponding tests to validate patches and reduce overfitting.
The system provides developers with a package including inferred intent, generated tests, and patches to facilitate patch review and acceptance.

From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents

Mobile LLM Agent Workflow: introduces, a security analysis of mobile LLM agents, with Instruction Interpretation & Decomposition (Understand user intent, decompose tasks), Screen Context Understanding (Analyze UI, identify elements), Decision Generation (Plan actions based on state), Action Execution (Perform device operations), Reflection & Task Completion (Assess progress, verify goal), where the paper analyzes security risks across the agent's operational pipeline.
The analysis identifies 11 distinct attack surfaces spanning LLM, GUI, and System interaction layers.
The AgentScan framework is presented to systematically evaluate agent vulnerabilities across these attack scenarios.

Leveraging LLM Inconsistency to Boost Pass@k Performance

Variator agent: introduces a novel method leveraging LLM inconsistency by generating task variants and submitting a solution for each, including Task (original problem), LLM (Variant Generation) (generates task variants), Variant (modified task), LLM (Solution Generation) (generates candidate solution), and Solution (candidate output).
This approach aims to boost Pass@k performance by utilizing the variability in LLM success rates across equivalent inputs.
The method is compared against a baseline Repeater agent that generates multiple solutions for the original task.

The Traitors: Deception and Trust in Multi-Agent Language Model Simulations

The Traitors: introduces a multi-agent simulation framework with Environment (game structure), Agents (LLM instances), Observation Function (state mapping), Policy Function (action mapping), Agent Memory (persistent structured), and Interaction Prompts (phase guidance), designed to study deception and trust dynamics among LLM agents under asymmetric information.
The framework implements a scenario where a minority of traitors deceive a majority of faithful agents, who maintain persistent memory and update beliefs based on dialogue and voting patterns.
This stateful architecture enables testing hypotheses about emergent deceptive behaviors and provides a testbed for investigating LLM behavior in socially nuanced interactions relevant to AI safety.

Reasoning BO: Enhancing Bayesian Optimization with the Long-Context Reasoning Power of LLMs

Reasoning BO: introduces, with Experiment Compass (User input), Reasoning Data (LLM output), LLM (Reasoning model), Notes Agent (Integrates knowledge), Formatter (Structures knowledge), Verifier (Validates knowledge), Insights History (Accumulated insights), Insight (LLM-generated guidance), Acquisition Function (Guides sampling), Results History (Experimental data), Surrogate Model (Objective function model), Knowledge Graph (Structured domain rules), Milvus Database (Vector database), Expert Knowledge (Prior domain knowledge), Prior Knowledge (Initialization knowledge), RLHF (Post-training strategy), Base Model (Underlying LLM), Stage 1: SFT (Supervised fine-tuning), and Stage 2: GRPO (RL fine-tuning), a framework enhancing Bayesian Optimization using LLM reasoning, knowledge graphs, and multi-agent systems.
The framework integrates LLM reasoning to guide the BO sampling process and incorporates dynamic knowledge management via a dual-channel system.
It utilizes a multi-agent system for knowledge precipitation and employs RLHF for fine-tuning smaller LLMs, creating a closed loop for scientific discovery.

Forewarned is Forearmed: A Survey on Large Language Model-based Agents in Autonomous Cyberattacks

LLM-based cyberattack agent: introduces a modular architecture for autonomous cyberattack agents, comprising Models, Perception, Memory, Reasoning & Planning, and Action and Tools components.
This architecture enables agents to ingest diverse inputs, manage contextual knowledge, plan multi-stage attacks, and interact with external tools.
The survey analyzes the capabilities of these agents across various network types and discusses implications for defense.

Prompt Stability Matters: Evaluating and Optimizing Auto-Generated Prompt in General-Purpose Systems

Promptor: introduces a stability-aware general-purpose prompt generation framework with Planner, Subtask Optimizer, Domain Knowledge Generator, Prompt Generator, Prompt Reviewer, Stability Reviewer, Executor, Summarizer Agent, and Plan Updater components.
The framework leverages semantic stability, a metric quantifying output consistency across repeated executions, to guide prompt optimization.
Promptor iteratively refines prompts and updates the plan based on stability feedback and execution results to improve reliability and task success.

AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agentic Tool Use

AutoMat: introduces an end-to-end agent-assisted pipeline for crystal structure reconstruction and property prediction from STEM images, including an LLM Agent (Orchestrates pipeline), Image Denoising (Denoises STEM images), Candidate Structure Selection (Matches image to templates), Atomic Structure Reconstruction (Reconstructs crystal structure), and Property Prediction (Predicts material properties).
The system leverages specialized tools like MOE-DIVAESR, Image Template Matching, STEM2CIF, and MatterSim coordinated by the LLM agent.
AutoMat achieves state-of-the-art performance on a new benchmark, STEM2Mat-Bench, bridging microscopy and atomistic simulation.

Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents

MONDAY dataset collection framework: introduces an automated framework to extract mobile OS navigation procedures from instructional videos, with Video Collection (Instructional videos input), Scene Transition Detection (Identify screen changes), and Action Identification (Identify user actions) components.
The Scene Transition Detection component includes steps to Isolate phone screens (Detect phone screen area) and Detect transitions (Track text changes).
The Action Identification component utilizes UI Element Detection (Detect interactive elements) and a three-step process: Scene summary (Summarize frame content), Initial action identification (Identify potential actions), and Refined action identification (Precise action localization).

Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis

Chain-Talker: introduces a three-stage framework for empathetic conversational speech synthesis, with EmGPT (Autoregressive Language Modeling, Emotion Understanding, Semantic Understanding) handling understanding and Synthesizer (Empathetic Rendering, speech generation) handling rendering.
The framework processes dialogue history and target utterance through sequential understanding stages before rendering expressive speech.
A supporting LLM-driven pipeline, CSS-EmCap, is developed to generate empathetic captions used for training the model.

--

18th May 2025

A Survey of Attacks on Large Language Models

LLM-based Agents: introduces, with Profiling Module, Memory Module, Planning Module, and Action Module components, autonomous systems leveraging LLMs to plan and act in complex environments.
The Profiling Module defines the agent's role, the Memory Module stores information, the Planning Module breaks down tasks, and the Action Module executes decisions.
This architecture exposes vulnerabilities targeted by agent-based attacks discussed in the paper.

ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents

ESC-Judge: introduces a framework for comparing emotional support conversational agents, with Role Construction (Synthesizes help-seeker roles), Help Seeker Agent (Simulates patient role), ES Agents (Candidate support models), Dialogue Engine (Manages conversation flow), End-of-Conversation Detector (Identifies dialogue conclusion), Judge LLM (Compares agent performance), and Evaluation Rubric (Hill's E-I-A based), which automates evaluation using a three-stage LLM-driven pipeline grounded in the E-I-A counselling model.
The framework synthesizes realistic help-seeker roles, simulates conversations between candidate agents and the help-seeker, and uses a specialized LLM judge to issue pairwise preferences based on a theory-grounded rubric.
ESC-Judge achieves human-level reliability in judging agent performance across Exploration, Insight, and Action stages, providing a scalable and reproducible benchmark for emotional support AI.

ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning

ALAS (Adaptive LLM Agent System): introduces a multi-agent architecture for planning and execution, comprising a Template Layer (Defines workflow blueprint), Factory Layer (Instantiates executable agents), Runtime Layer (Executes agents, adapts), and Persistent Memory (Stores state, logs, supports recovery).
The framework decomposes planning into specialized agents defined by the Template Layer, instantiated by the Factory Layer, and executed by the Runtime Layer, with Persistent Memory maintaining state and enabling recovery.
ALAS addresses LLM limitations in planning by providing modularity, state tracking, and reactive adaptation for dynamic environments through its layered architecture and persistent memory.

Beyond Frameworks: Unpacking Collaboration Strategies in Multi-Agent Systems

Multi-Agent Collaboration Strategies: investigates four dimensions of collaboration—Decentralized/Centralized Governance, Full/Selective/Instructor-decided Participation, Simultaneous/Ordered/Random/Point-to-Point Interaction Patterns, and Full Log/Self-Summarized/Instructor Summary Context Management—among Agents and an Instructor Agent utilizing Dialogue History Memory.
The study evaluates the impact of various combinations of these strategies on task accuracy and computational efficiency in two context-dependent scenarios.
Findings indicate that centralized governance, instructor-led participation, ordered interaction, and instructor-curated context summarization balance decision quality and resource utilization.

IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems

MASLEAK: introduces a novel attack framework, with Offline Adversarial Query Generation, Adversarial Query, Leak Query, Hooking, Propagate Query, and Target MAS IP Reconstruction components, designed to extract intellectual property from black-box Multi-Agent Systems by crafting queries that hijack, elicit, propagate, and retain responses from individual agents.
The framework operates in two phases, first generating adversarial queries offline and then reconstructing the target MAS IP from the final system output.
MASLEAK demonstrates high accuracy in extracting system prompts, task instructions, tool usages, agent number, and topology from both synthetic and real-world MAS applications.

14th May 2025

AlphaEvolve: A coding agent for scientific and algorithmic discovery

AlphaEvolve: introduces an evolutionary coding agent that orchestrates an autonomous pipeline including a User defining the task, Task Specification, an Initial Program, an Evaluation Function, a Prompt Sampler, an LLMs Ensemble generating Code Modifications, an Evaluators Pool executing and scoring programs, a Program Database storing results and guiding evolution, and a Distributed Controller Loop orchestrating the process to find the Best Program.
The system iteratively improves algorithms by making direct code changes using an evolutionary approach, continuously receiving feedback from evaluators.
AlphaEvolve leverages state-of-the-art LLMs and automated evaluation to discover novel algorithms and optimize computational infrastructure.

13th May 2025

Enhancing Software Development with Context-Aware Conversational Agents: A User Study on Developer Interactions with Chatbots

Rasa-based Chatbot Prototype: introduces a study using a prototype built on the Rasa chatbot platform, including NLU, Dialogue Management, Facebook Messenger, and RASA webhook components, to investigate software developers' preferences and requirements for conversational agents.
The study employed a mixed-methods approach with 29 developers interacting with the prototype via Facebook Messenger based on a predefined scenario.
Findings from the interactions, questionnaires, and interviews aim to inform the design of context-aware chatbots for software development tasks like task and repository management.

TRAIL: Trace Reasoning and Agentic Issue Localization

TRAIL: introduces a formal taxonomy (Classifies agent errors) and a dataset of human-annotated traces from agentic workflows, including Manager Agent (Orchestrates tasks), Search Agent (Performs web search), and various Tools (External functions/APIs).
The paper evaluates the ability of large language models to act as judges for debugging complex agentic workflow traces using the proposed taxonomy and dataset.
Evaluation results show that current state-of-the-art models perform poorly at identifying and localizing errors within these traces, highlighting the challenge of evaluating complex agentic systems.

The Truth Becomes Clearer Through Debate! Multi-Agent Systems with Large Language Models Unmask Fake News

TED (TruEDebate): introduces a multi-agent system for fake news detection, simulating a structured debate process with DebateFlow Agents and analyzing the outcome with InsightFlow Agents.
The DebateFlow Agents organize LLM-powered agents into Proponents and Opponents teams that engage in Opening Statement, Cross-examination and Rebuttal, and Closing Statement stages.
The InsightFlow Agents, consisting of a Synthesis Agent for summarization and an Analysis Agent utilizing a Role-aware Encoder, Debate Graph, and News-Debate Interactive Attention, predict the news truth value.

Strategy-Augmented Planning for Large Language Models via Opponent Exploitation

SAP (Strategy-Augmented Planning framework): introduces a two-stage framework with LLM (Identifies opponent strategy), LLM (Generates action plan), Strategy Space Ξ (Explicit strategy dimensions), Strategy Set Dξ (Generated strategy library), SEN U (Strategy Evaluation Network), Trajectory Extractor E (Summarizes environment trajectory), abstract trajectory Tabs (Summarized observation data), Strategy Search (Finds optimal counter strategy), best response strategy ξ¹,* (Optimal counter strategy), Expert Tips H (Guides LLM planning), and Environment (Simulation environment), designed to enhance LLM-based agents' opponent exploitation in competitive environments.
The offline stage of SAP involves LLM generating strategies within the Strategy Space, evaluating them in the Environment to create a Strategy Set and Battle Result Dataset, which trains the SEN.
In the online stage, SAP uses the Trajectory Extractor to summarize observations, the LLM as Recognizer to identify the opponent's strategy, the SEN and Strategy Search to find the best response, and the LLM as Planner, guided by Expert Tips, to generate the final action Plan.

Scalable UAV Multi-Hop Networking via Multi-Agent Reinforcement Learning with Large Language Models

MRLMN (Multi-agent Reinforcement learning with Large language model in Multi-hop Networking): introduces a framework integrating MARL and LLMs for scalable UAV multi-hop networking.
The framework includes MARL agents with policy/critic networks, enhanced by information aggregation, agent grouping, reward decomposition, and behavioral constraints.
It leverages an LLM agent, knowledge distillation, bipartite matching, an LLM verifier, and prompt engineering to guide MARL training and improve exploration.

Benchmarking AI scientists in omics data-driven biological research

BaisBench (Biological AI Scientist Benchmark): introduces a benchmark for evaluating AI scientists in biological research with two tasks, BAIS-CTA (Cell type annotation task) and BAIS-SD (Scientific discovery task).
BAIS-CTA assesses cell type identification on single-cell datasets, while BAIS-SD evaluates reasoning and insight generation through multiple-choice questions based on data analysis.
The benchmark uses real biological omics data and compares AI performance to human experts, highlighting current limitations in data-driven scientific discovery.

Aitomia: Your Intelligent Assistant for AI-Driven Atomistic and Quantum Chemical Simulations

Aitomia: introduces an intelligent assistant platform for AI-driven atomistic and quantum chemical simulations, with Chatbot (user interface), AI Agents (task execution), LLMs (fine-tuned models), Rule-based Agents (fail-safe logic), Retrieval-Augmented Generation (RAG) system (knowledge retrieval), MLatom ecosystem (computational backend), Cloud Computing Services (simulation execution), and Database (information storage).
The platform leverages fine-tuned large language models, rule-based agents, and a retrieval-augmented generation system to assist users in setting up, running, and analyzing simulations.
Aitomia integrates with the MLatom ecosystem and cloud computing services like Aitomistic Hub and XACS to provide a wide range of computational chemistry capabilities.

DSADF: Thinking Fast and Slow for Decision Making

DSADF (Dual-System Adaptive Decision Framework): introduces a framework integrating System 1 (Fast thinking component) with RL Agent (Goal-conditional action selection) and Memory Space (Stores task proficiency), and System 2 (Slow thinking component) with VLM (Vision Language Model) acting as Planner (Decomposes tasks, reflects) and Auxiliary Performer (Handles unfamiliar tasks), utilizing CLIP (Image to text), Image Encoder (Encodes image observation), Text Encoder (Encodes text observation/goal), and Self-reflection (Evaluates and refines plans) for generalized decision making.
The framework draws inspiration from Kahneman's dual-process theory, leveraging the RL agent for fast, intuitive responses and the VLM for slow, analytical reasoning and planning.
DSADF demonstrates improved efficiency and generalization in complex environments by dynamically allocating tasks between the fast and slow systems based on task familiarity and agent proficiency.

12th May 2025

Putting It All into Context: Simplifying Agents with LCLMs

State-in-Context Agent: introduces a simplified agent architecture using LCLMs (Processes large context) to process the entire Environment (Code repository state) as Context (Input to LM), eliminating complex scaffolding to produce a Solution (Output patch).
The approach leverages LCLMs' long-context capabilities for full observability, transforming open-ended tasks into direct, close-ended problems.
Variations include a Compressor (Ranks/selects files) for large environments and a SELECTSOLVE method combining LCLMs via a Selector (LCLM identifies files) with SCLMs (Superior problem-solving) for repair.

Agent RL Scaling Law: Spontaneous Code Execution for Mathematical Problem Solving

ZeroTIR: introduces a framework for training a base LLM agent to spontaneously use a code execution environment for mathematical problem solving via reinforcement learning, including an RL trainer, value network, replay buffer, interaction logic, and reward signal.
The framework utilizes outcome-based rewards and techniques like dynamic stop tokens and replay buffer filtering to enable the LLM agent to learn effective tool use strategies.
ZeroTIR demonstrates an Agent RL Scaling Law where training progression correlates with increased code usage frequency, response length, and task accuracy, outperforming non-tool baselines.

Codifying Character Logic in Role-Playing

Codified Profiles: introduces a framework that represents character logic as structured, executable functions, including Codified Profile (Executable logic), parse_by_scene function (Outputs triggered statements), check_condition function (Evaluates scene conditions), Role-playing LLM (Generates character response), Scene (Input context), Triggered Statements (Guide LLM response), Groundtruth Reference (Evaluation target), NLI Scoring (Compares response to reference), Profile Update (Revises codified logic), Randomness Components (Control behavioral variability), Textual Profile (Original character description), and Distilled Condition Checker (Efficient condition evaluation), enabling persistent, updatable, and controllable role-playing.
The approach compiles natural language character descriptions into executable code, offloading complex reasoning from the LLM to deterministic control logic.
Experiments demonstrate improved behavioral consistency, adaptability, and diversity compared to prompt-based methods, particularly benefiting smaller language models.

ARE LLMS COMPLICATED ETHICAL DILEMMA ANALYZERS?

Evaluation Framework: introduces, a novel evaluation framework, with Data Retrieval (Collects ethical dilemmas), Preprocessing (Structures data), Text Processing (Generates/formats responses), LLMs (Generate dilemma responses), Human (Provide baseline responses), Structured Output (Formatted responses), Human Evaluation (Collects human feedback), Benchmark Metrics (Quantitative evaluation methods), Metrics Weighting (Assigns metric importance), and Aggregated Score (Final performance score), where the framework assesses LLM performance on ethical dilemmas using structured responses and quantitative metrics.
The framework utilizes a dataset of ethical dilemmas with expert and non-expert responses, processed into a five-section structured format for component-wise evaluation.
Performance is measured using a composite metric combining lexical, n-gram, embedding, and semantic similarity scores, weighted based on inversion analysis and AHP.

HYPERNYM MERCURY: Token Optimization through Semantic Field Constriction and Reconstruction from Hypernyms. A New Text Compression Method

Hypernym Mercury: introduces a novel text compression method using Field Constriction, which involves Parsing and Structuring input text into a Dart intermediate representation, performing Detail Importance Evaluation, and applying Compression Optimization to the dart.
The Dart structure splits information into a core statement and attached details, allowing for controllable granularity during Recomposition back into text.
Multi-Model Verification ensures semantic fidelity of the compressed output by checking against independent models.

FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Graph-informed adversarial multi-agent interaction framework: introduces a system to generate diverse, challenging over-refusal queries using interacting agents and LLM validation, including Generator, Discriminator, LLM Refusal Validation, and Orchestrator components.
The framework is guided by an Entity Graph extracted from safety-related datasets and uses Feedback between agents to refine generated prompts.
This iterative process produces Collected Over-refusal Queries that appear unsafe but are objectively benign, simulating scenarios where LLMs might over-refuse.

Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

Web-Bench (Evaluation System): introduces a benchmark and evaluation system for LLM code generation on web development tasks, including an Evaluator, Web-Agent, LLM, Web-Bench Dataset, Tasks, Projects, E2E Tests, and Generated Files.
The system evaluates LLMs on sequential coding tasks within projects, simulating real-world web development workflows based on Web Standards and Frameworks.
The Evaluator orchestrates the process, using the Web-Agent to interact with the LLM, and verifies the generated code against E2E tests.

MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

MLE-Dojo: introduces, an interactive Gym-style framework for training, evaluating, and benchmarking autonomous LLM agents in machine learning engineering workflows, with MLE-Agent (LLM-based assistant), Environment (Task-specific interactive space), Error (Encodes error types), Interface (Governs action execution), Feedback (Translates outcomes to guide), Metric (Defines evaluation metrics), Task Space (Collection of tasks), Docker container (Isolates task execution), Sandbox (Executes agent code safely), Observation Space (Environment state information), Dataset Information (Task data details), Evaluation Metric Scores (Performance metrics), Code Execution Results (Outcome of code runs), Error Messages (Debugging information), Interaction History (Record of interactions), Action Space (Agent's possible operations), request_info (Action to query task info), validate_code (Action for syntax/runtime check), execute_code (Action for full execution/submission), get_history (Action to retrieve past interactions), reset (Action to restart environment), Reward Space (Signal for performance), HumanRank Score (Relative performance metric), Agent Scaffolds (Agent implementations), MLE Agent (Minimalistic agent design), and AIDE (Iterative problem-solving agent), enabling systematic experimentation and rigorous evaluation on 200+ real-world Kaggle challenges.
The framework provides a fully executable environment supporting comprehensive agent training via supervised fine-tuning and reinforcement learning, facilitating iterative experimentation and real-time outcome verification through structured feedback loops.
MLE-Dojo features a modular and extensible architecture that decouples agent capabilities from the environment, promoting interoperability, scalability, and reproducibility, and is open-sourced to foster community-driven innovation.

KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation

KAQG (Knowledge Augmented Question Generation): introduces a framework that fuses knowledge graphs, RAG retrieval, and educational assessment theory into a pipeline for difficulty-controlled question generation.
The framework includes a KAQG-Retriever for building a Knowledge Graph from educational materials and a KAQG-Generator for creating and evaluating questions based on the graph and assessment theory.
Implemented using an AI Agents Framework, the system operationalizes difficulty metrics and demonstrates strong performance in generating psychometrically sound exam items.

Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent

IKEA: introduces Reinforced Internal-External Knowledge Synergistic REasoning Agent, with LLM Agent, Environment (Search Engine/Retriever/Text Corpus), Reward Model, Knowledge-boundary aware reward function, Knowledge-boundary aware training dataset, Reinforcement Learning (GRPO), and Special Tags components, which trains an efficient adaptive search agent to synergistically integrate internal and external knowledge.
The agent learns to identify its knowledge boundary, prioritizing internal knowledge and resorting to external search only when necessary, guided by a novel reward function and training data.
This approach aims to reduce redundant retrievals, mitigate knowledge conflicts, and improve inference efficiency compared to methods relying solely on internal or external knowledge.

YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models

YuLan-OneSim: introduces a novel social simulator, with Scenario Auto-Construction Subsystem (User input to code), Simulation Subsystem (Execute and manage simulation), Feedback-driven Evolving Subsystem (Improve LLMs via feedback), and AI Social Researcher Subsystem (Automate social science research), designed for code-free scenario construction, large-scale simulation, evolvability, and automating the social science research loop.
The simulator is built upon four core subsystems that handle scenario creation, simulation execution, model refinement, and autonomous research tasks.
YuLan-OneSim aims to advance LLM-based social simulation by enabling automatic scenario construction, autonomous evolution, and completing the full research cycle.

Learning to Reason and Navigate: Parameter Efficient Action Planning with Large Language Models

PEAP-LLM (Parameter Efficient Action Planner using Large Language Models): introduces a novel parameter-efficient action planner for embodied agents, consisting of an LLM goal planner (LGP) that extracts task goals and a LoRA action planner (LAP) that generates single-step instructions using a fine-tuned LLM.
The framework utilizes a Base LLM for goal planning and a Fine-tuned LLM for action planning, with fine-tuning performed via supervised fine-tuning (SFT) and direct preference optimization (DPO) using specific datasets.
PEAP-LLM integrates with a Policy Model that predicts the next action based on high-level instructions, generated single-step instructions, and visual observations processed by Object Retrieval and State Text Generator components.

Can Generative AI agents behave like humans? Evidence from laboratory market experiments

LLM Agent Simulation: introduces, "explore the potential of Large Language Models (LLMs) to replicate human behavior in economic market experiments", with LLM Agents (Simulate human participants), OpenAI API (Interface for LLMs), Model (GPT-3.5 or GPT-4), Temperature (Controls response randomness), Context Window (Total text model considers), Memory (Number of previous messages), Seed (Initializes random generator), Market Environment (Simulated economic market), Feedback Mechanism (Positive or negative price feedback), where "the framework simulates market dynamics by having LLM agents predict prices iteratively based on market information and their own history".
The simulation compares LLM agent behavior to human participants in positive and negative feedback markets, analyzing market dynamics and forecasting strategies.
Key parameters like memory and temperature significantly influence LLM agent behavior and their ability to replicate human-like market dynamics and bounded rationality.

Private LoRA Fine-tuning of Open-Source LLMs with Homomorphic Encryption

Private LoRA Fine-tuning: introduces an interactive client-server protocol for private fine-tuning of open-source LLMs, with Client (orchestrates training, non-linear operations), Server (linear operations under HE), Homomorphic Encryption (HE) (enables encrypted computation), LoRA Weights (U, D) (client-side adaptation parameters), and Base Model Weights (W) (server-side public parameters) components.
The client manages private data and LoRA weights while performing non-linear computations locally.
The server handles computationally intensive linear operations on public base model weights using homomorphic encryption on client-provided encrypted activations.

Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study

Multi-Agent Reasoning System: introduces a system with Agents (Individual LLMs), Expertise Specialization (Domain-specific roles), Collaboration Paradigm (Interaction mechanism), Communication Protocol (Information exchange), and System Scale (Number of agents) to investigate collaborative reasoning performance.
The study empirically evaluates how expertise-domain alignment, collaboration paradigm (structured workflow vs. diversity-driven), and system scale affect collective reasoning.
Findings indicate that expertise alignment is domain-contingent, diversity-driven collaboration outperforms structured workflows, and increasing agents generally boosts performance with diminishing returns.

UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning

UAV-CodeAgents: introduces a scalable multi-agent framework for autonomous UAV mission generation, utilizing Airspace Management Agent (AMA), UAV Agent, LLM, VLM, ReAct, Pixel-Pointing Grounding Mechanism, smolagents framework, Message-passing interface, and Tools to interpret instructions and generate UAV trajectories.
The system leverages the ReAct paradigm for iterative reasoning and dynamic adaptation, enabling agents to reflect on observations and revise mission goals in evolving environments.
A key component is the vision-grounded pixel-pointing mechanism, which facilitates precise localization of semantic targets on aerial maps for spatial grounding and context-aware flight routes.

DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation

DynamicRAG: introduces a novel RAG framework, with Retriever (Retrieves documents), Dynamic Reranker (Dynamically adjusts documents), Generator (Generates final answer), Reward Function (Evaluates response quality), Reinforcement Learning (Optimizes reranker agent), Direct Preference Optimization (RL optimization method), and Behavioral Cloning (Initial reranker training), where the reranker dynamically adjusts document order and number using LLM feedback and RL.
The reranker is modeled as an RL agent trained via behavioral cloning and DPO, leveraging LLM output quality as reward signals.
This dynamic reranking approach enhances RAG system efficiency and effectiveness by optimizing the generator's input based on query context.

Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMS

SENATOR (Structural Entropy-guided Knowledge Navigator): introduces a framework for detecting and repairing LLM knowledge deficiencies, with MCTS (Knowledge graph exploration), LLM (Model under evaluation), KG (External knowledge source), Structural Entropy (Exploration reward signal), Synthetic Data (Generated training samples), and SFT (Model fine-tuning) components.
The framework employs MCTS guided by structural entropy on a knowledge graph to efficiently explore and identify areas where the LLM exhibits high uncertainty or knowledge deficiencies.
Based on the identified high-uncertainty paths, SENATOR generates targeted synthetic data used for supervised fine-tuning to repair the LLM's knowledge deficiencies.

11th May 2025

Exploring Anthropomorphism in Conversational Agents for Environmental Sustainability

Washy: introduces a system integrating a Conversational Agent (User interface) powered by an LLM (Language model) using a Function Calling API (LLM tool interface) to interact with External API (Solar data source), a Smart Plug (Appliance controller), a Scheduler (Slot/notification management), and a Database (Data storage), supported by a Backend (Server logic), Client (User applications), and Notification System (Alert delivery), to help users schedule Washing Machine (Physical appliance) cycles based on solar energy availability.
The system compares a Personified Agent and a Traditional Assistant interface to evaluate the impact of anthropomorphism on user interaction and eco-friendly behavior adoption.
A lab study assessed the system's effectiveness in promoting sustainable home energy management and the influence of agent personality on user engagement and rapport.

Architectural Precedents for General Agents using Large Language Models

Cognitive Design Patterns: introduces recurring patterns of processes, representations, and memories found in cognitive architectures and Agentic LLM Systems, including Observe-decide-act, 3-stage memory commitment, Hierarchical decomposition, Short-term (context) memory, Ahistorical KR/memory, Historical KR/memory, Procedural KR/memory, Reconsideration, Knowledge compilation, and Step-wise reflection.
The paper analyzes how these cognitive design patterns are evident in existing Agentic LLM systems and identifies patterns apt for future exploration.
Examining these patterns helps predict gaps and deficiencies in current LLM systems and suggests future research directions towards general intelligence.

Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?

FINSABER (Financial Investing Strategy Assessment with Bias mitigation, Expanded time, and Range of symbols): introduces a comprehensive framework for benchmarking LLM timing-based investing strategies, with Multi-source Data Module (Integrates diverse financial data), Strategies Base Module (Covers selection and timing strategies), Bias-Mitigated Backtest Pipeline (Supports robust backtesting), Selection-based Strategy (Identifies asset subset), Timing-based Strategy (Dictates buy/sell/hold decisions), Traditional Rule-based (Uses technical indicators/rules), Predictor-based (Relies on data-driven models), RL-based (Learns optimal investing policies), LLM-based (Leverages large language models), Rolling Window Test (Evaluates across multiple periods), and Evaluation Metrics (Measures strategy performance).
The framework integrates 20 years of multi-source data, expands symbol coverage, and explicitly mitigates survivorship, look-ahead, and data-snooping biases.
FINSABER supports robust and reproducible benchmarking across diverse experimental setups to provide empirical guidance for LLM-based investment research.

DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMS

DialogueReason: introduces a dialogue-based reasoning pattern, with System Prompt (input instruction), Adaptive Thinking Pattern Config (configuration), Thinking Process (iterative simulation), and Final Answer (output).
The Adaptive Thinking Pattern Config includes Agent Config (agent roles), Environment Config (setting), and Interaction Config (communication rules).
The Thinking Process involves iterative Agent-Agent Interaction (dialogue) and Agent-Environment Interaction (task progression).

Seed1.5-VL Technical Report

Seed1.5-VL: introduces a vision-language foundation model composed of Seed-ViT (Vision encoder (encode images/videos)), MLP Adapter (Project visual features), and Large Language Model (LLM) (Process multimodal inputs (MoE)).
The Seed-ViT vision encoder handles dynamic image resolutions, while the LLM is a 20B active parameter Mixture-of-Experts model.
The model is designed for general-purpose multimodal understanding and reasoning across diverse tasks.

The Wisdom of Agent Crowds: A Human-AI Interaction Innovation Ignition Framework

Brainwrite: introduces a human-AI interaction framework for multi-agent brainstorming, incorporating Human (User), LLM (Large Language Model), Cothinker (Interactive module), Internet (External information source), Knowledge Base (Internal information source), and Mindmap (Structured text summary) components.
The framework utilizes LLMs and a Cothinker module to assist human users in topic definition, deep exploration, and output generation, drawing information from the Internet and a Knowledge Base.
The system aims to reduce user cognitive load through structured text summaries like Mindmaps and enhance viewpoint diversity in complex financial analysis tasks.

EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation

EcoLANG: introduces a two-stage paradigm, with Language Evolution (create efficient language) and Language Utilization (agents use evolved language), to induce efficient and effective agent communication language for large-scale social simulations.
The Language Evolution stage comprises Vocab Compression (reduce vocabulary size) via Semantic Clustering (group words by meaning), Intra-Cluster Selection (filter words within groups), and Tokenization (map words to tokens), alongside Rule Evolution (evolve communication rules) through Initialization (start with initial rules), Communication (simulate agent dialogues), Selection (evaluate and select rules), Crossover & Mutation (generate new rules), and Update and Iteration (refine rule population).
The Language Utilization stage applies the evolved language by modifying LLM decoding and incorporating rules into prompts, enabling agents to communicate more efficiently in social simulations.

Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning

Fast-Slow Architecture: introduces a human-centric decision-making framework integrating an LLM-Based Slow System and an RL-Based Fast System, designed to interpret high-level user instructions and execute real-time control.
The LLM-Based Slow System processes user commands and scene context using components like Human-Language Parsing and CoT Analytic Reasoning, referencing a Memory Bank to generate structured Human-Centric Instruction.
The RL-Based Fast System, utilizing Instruction and Scenario Encoders and a Multi-Head Attention-based Actor-Critic network, executes actions validated by a Safety Mask via a PID Controller, balancing user preference and safety.

ThreatLens: LLM-guided Threat Modeling and Test Plan Generation for Hardware Security Verification

ThreatLens: introduces, with Threat Identification Agent (identifies physical/supply threats), Security Policy Generator Agent (extracts security policies), Test Plan Generator Agent (generates test plans), LLM (performs reasoning/generation), RAG (retrieves relevant knowledge), System-User Conversation (interacts with engineer), Security Knowledge Dataset (stores threat models), and Design Spec. & ISA document (input design information), a multi-agent framework automating hardware security threat modeling and test plan generation.
The framework leverages LLMs for reasoning and generation, RAG for efficient knowledge retrieval from datasets and documents, and interactive conversation with verification engineers.
ThreatLens aims to reduce manual effort, enhance coverage, and provide a structured approach for hardware security verification by automating threat identification and test plan formulation.

Control Plane as a Tool: A Scalable Design Pattern for Agentic AI Systems

Control Plane as a Tool: introduces a design pattern for Agentic AI systems that modularizes tool orchestration using a Request Router, Registration Module, Invocation Module, Input Validator, Intent Resolver, Intent Validator, Routing Handler, Feedback Integrator, Output Validator, Failure Handler, Usage tracker, Agent Registry, Tool Registry, validation rules, Metrics Registry, and Log/DB.
This pattern decouples tool management from agent logic, enabling dynamic tool selection, governance, and extensibility across multiple agents.
The architecture provides a single tool interface to agents while encapsulating complex routing and validation logic internally.

10th May 2025

VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback

VTutor: introduces an animated pedagogical agent SDK, with Large Language Model Integration (AI-generated text responses), Text-to-Speech (text to audio conversion), LipSync Module (audio to avatar mouth movements), and WebGL-Based Rendering (Unity environment web embedding), designed for real-time multi-model feedback in education.
The SDK leverages lightweight WebGL, Unity, and JavaScript frameworks to convert LLM text outputs into audio and then render a real-time, lip-synced pedagogical agent.
VTutor provides on-demand, personalized feedback using anime-like aesthetics to avoid the uncanny valley effect and enhance engagement.

9th May 2025

Reliable Collaborative Conversational Agent System based on LLMs and Answer Set Programming

AutoManager: introduces a dual-agent system with Administrator Bot (Manages knowledge base) and Assistant Bot (Interacts with customers) that share a Knowledge Base (Shared facts and menu), Temporary Information (Shared session data), and Collaborative Rule Set (Shared collaboration rules), utilizing Knowledge Extraction (Natural language to predicates), Commonsense Reasoning (Predicate reasoning with ASP), and Response Generation (Predicates to natural language) for reliable collaborative task-oriented dialogue.
The system leverages Large Language Models for natural language processing and Answer Set Programming for robust logical reasoning and consistency checking within each agent.
This architecture enables reliable collaboration between agents by sharing knowledge and rules, demonstrated in a fast-food restaurant management scenario.

SCALEMCP: DYNAMIC AND AUTO-SYNCHRONIZING MODEL CONTEXT PROTOCOL TOOLS FOR LLM AGENTS

ScaleMCP: introduces a novel tool selection approach, with Agent, MCP Retrieval Tool, Automatic Indexing Pipeline, MCP Storage Index, and MCP Servers, enabling LLM agents to dynamically discover and equip Model Context Protocol (MCP) servers as tools.
The framework features an auto-synchronizing tool storage system pipeline that uses CRUD operations with MCP servers as the single source of truth to maintain the MCP storage index.
LLM agents are equipped with an MCP retrieval tool, allowing them to autonomously query the storage index and invoke relevant MCP servers during multi-turn interactions.

A New DAPO Algorithm for Stock Trading

Improved DAPO Algorithm: introduces a novel trading agent integrating GRPO, Decoupled Clipping, Dynamic Sampling, and Sentiment-Risk Adjusted Rewards for financial trading.
The approach adapts DAPO principles to a GRPO framework, incorporating LLM-based risk and sentiment signals into an adjustable reward function.
This method demonstrates improved performance and significantly reduced computational requirements compared to a CPPO-DeepSeek baseline on the NASDAQ-100 index.

LATENT: LLM-Augmented Trojan Insertion and Evaluation Framework for Analog Netlist Topologies

LATENT (LLM-Augmented Trojan Insertion and Evaluation Framework for Analog Netlist Topologies): introduces a framework for generating stealthy analog Trojans, utilizing an LLM Agent (autonomous agent) to modify a Circuit Netlist (analog circuit design), validated by a Syntax Checker (validates Trojan syntax), simulated by HSPICE (circuit simulator), evaluated by SPICED (LLM-based detection tool), and refined via Feedback-driven learning (iterative strategy refinement).
The framework employs a Thought-Action-Observation loop where the LLM agent iteratively selects and inserts Trojan components based on detection feedback to evade detection.
By integrating simulation and detection tools into the iterative process, the framework generates diverse, circuit-specific analog Trojans with low activation ranges and significant performance degradation upon activation.

Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients

PPO-TS (PPO-based adversarial attack framework): introduces a novel threat vector that leverages a PPO Attack Agent to generate Adversarial Waveforms interfering with Client Sensors, resulting in Clustered Updates that induce Rowhammer bit-flips on Server DRAM.
The framework utilizes reinforcement learning to manipulate client sensor observations, maximizing server repetitive memory updates necessary for Rowhammer exploitation.
This approach enables remote Rowhammer attacks on federated learning servers without requiring direct access or system-level privileges.

Multi-Agent Systems for Robotic Autonomy with LLMS

Multi-Agent System: introduces a framework for robotic autonomy, with Task Analyst (analyzes task input), Robot Designer (designs robot configuration), RL Designer (generates RL components), Code & Report Extractor (extracts code/reports), RL Execution (runs RL training/evaluation), Figures (visualizes results), and Report (summarizes analysis/results).
The system takes task scenario descriptions as input and outputs multimodal results including code files, technical reports, and visualizations.
This framework enables autonomous robotic task analysis, mechanical design, and path generation using LLMs and reinforcement learning.

APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning

APOLLO: introduces a modular pipeline combining LLM, Lean Server, Syntax Refiner, Sorrifier, Auto Solver, Subproof Extractor, and Proof Assembler.
The pipeline directs LLM proof generation, analyzes and fixes errors using Lean feedback, isolates failing sub-lemmas, and applies automated solvers.
This iterative process repairs and recombines sub-proofs, improving proof generation efficiency and correctness with lower sampling budgets.

EVOLUTIONARY ECOLOGY OF WORDS

Word Evolutionary Ecology Model: introduces a model utilizing a Large Language Model for word creation, interaction judgment, and mutation within a spatial agent-based simulation, where individuals possessing words compete and evolve.
The model simulates the evolutionary ecology of words by having agents with words move in a grid, compete based on LLM-determined outcomes, and mutate their words using the LLM.
Competition outcomes between words are stored in a dictionary to improve computational efficiency for repeated interactions.

ELA-ZSON: Efficient Layout-Aware Zero-Shot Object Navigation Agent with Hierarchical Planning

ELA-ZSON (Efficient Layout-Aware Zero-Shot Object Navigation): introduces an efficient zero-shot object navigation approach with an LLM Agent (manages process) that leverages a hierarchical Scene Representation (hierarchical environment map) for Hierarchical Planning (two-level path generation) and Robotic Navigation (executes planned path).
The Scene Representation includes a global Topometric Map (global topological graph) and a local Learned Scene Representation (local dense memory), supporting both Global Topology Plan (coarse route planning) and Local Ego-centric Plan (dense waypoint generation).
The LLM Agent manages the overall workflow, integrating Perception (RGB-D input, pose) and Control Flow (manages actions, status, errors) for autonomous navigation in complex indoor environments without costly training.

AGENTXPLOIT: End-to-End Redteaming of Black-Box AI Agents

AGENTXPLOIT: introduces a generic black-box fuzzing framework for indirect prompt injection attacks, utilizing an Initial Corpus (High-quality templates), Seed Storage (Pool of seeds), Seed Selector (MCTS-based algorithm), a Mutator (Generates new variants), and a Scorer (Evaluates seeds) to iteratively refine adversarial prompts.
The framework systematically explores adversarial prompts by selecting promising seeds, mutating them, and scoring their effectiveness based on attack success rate and task coverage.
This adaptive and iterative process enables the framework to effectively discover and exploit indirect prompt injection vulnerabilities in black-box LLM agents across diverse architectures and tasks.

8th May 2025

Scalable Chain of Thoughts via Elastic Reasoning

Elastic Reasoning: introduces a framework for scalable chain of thoughts, explicitly separating reasoning into thinking and solution phases with independently allocated budgets.
The framework employs a budget-constrained rollout strategy during training to teach the model adaptive reasoning under truncated conditions.
At inference, separate budgeting prioritizes the completeness of the solution segment, improving reliability under strict resource constraints.

MULTI-AGENT EMBODIED AI: ADVANCES AND FUTURE DIRECTIONS

Multi-Agent Embodied AI Survey: introduces a comprehensive review of recent advances and future directions in embodied AI systems with multiple agents, covering control, learning, and generative model-based methods.
The survey analyzes key contributions and identifies challenges in multi-agent embodied AI, including asynchronous decision-making, agent heterogeneity, and open environments.
It reviews benchmarks and discusses future research directions to guide innovation in this rapidly evolving field.

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

LMRM (Large Multimodal Reasoning Model): introduces a structured roadmap for multimodal reasoning research, encompassing four stages: Stage 1 Perception-Driven Modular Reasoning, Stage 2 Language-Centric Short Reasoning, Stage 3 Language-Centric Long Reasoning, and Stage 4 Native LMRMs.
The survey analyzes the progression from early modular, perception-driven systems to unified, language-centric frameworks and projects towards native models with omnimodal perception and agentic behavior.
It provides a comprehensive review of over 540 publications, categorizes models and benchmarks, and discusses challenges and future prospects for next-generation multimodal reasoning systems.

Not Like Us, Hunty: Measuring Perceptions and Behavioral Effects of Minoritized Anthropomorphic Cues in LLMs

Simulated LLM Agents: introduces a study evaluating user reliance and perception of LLM agents using minoritized sociolects (AAE, Queer slang) compared to Standard American English, utilizing templated suggestions constructed with warmth phrases and confidence expressions, generated via in-context learning and persona-based prompting with GPT-4.
The study found that AAE speakers preferred and relied more on the SAE agent, while Queer slang speakers showed no significant preference but felt greater social presence with the Queer slang agent.
Findings highlight the nuanced dynamics of sociolect use in machine interactions, emphasizing the need for careful design to respect cultural boundaries and avoid appropriation.

CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory

CityNavAgent: introduces a large language model-empowered agent for aerial vision-and-language navigation, featuring an open-vocabulary perception module, a hierarchical semantic planning module, and a global memory module.
The agent extracts urban scene semantics, decomposes long-horizon tasks into hierarchical sub-goals, and stores historical trajectories in a topological graph to reduce navigation complexity.
This approach enables zero-shot navigation in continuous urban environments, addressing challenges of complex scene understanding and exponential planning complexity.

HiBayES: A Hierarchical Bayesian Modeling Framework For AI Evaluation Statistics

HiBayES (A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics): introduces a generalizable framework with Hierarchical Bayesian GLMs, Bayesian Data Analysis, MCMC Sampling, Uncertainty Quantification, Formal Model Comparison, and Quality Control components, designed for principled uncertainty quantification and robust parameter estimation in AI evaluations.
The framework addresses challenges in AI evaluation statistics, including stochastic outputs, complex hierarchical data structures, and high testing costs, particularly in low-data scenarios.
HiBayES enables robust inferences, explicit modeling of hierarchical data, and formal model comparison, offering advantages over conventional statistical methods like t-tests and flat models.

clem: todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

clem: todd (chat-optimized LLMs for task-oriented dialogue systems development): introduces a framework for systematically evaluating LLM-based task-oriented dialogue systems, featuring a Game Master (coordinates interaction), User Simulator (simulates user), and Dialogue System (acts as agent), which can be implemented as Monolithic Dialogue System (single LLM agent), Modular-Prog Dialogue System (programmed flow agent), or Modular-LLM Dialogue System (LLM-controlled agent), utilizing components such as Dialogue Manager (manages dialogue flow), Intent Detection (identifies user intent), Slot Extraction (extracts entities), Response Generation (generates response), Database Retriever (queries database), and Booking Confirmer (confirms bookings).
The framework facilitates turn-based interactions between the user simulator and dialogue system, coordinated by the game master, and supports plug-and-play integration of different models and architectures.
Evaluation within the framework involves consistent datasets, metrics, and computational constraints, enabling detailed benchmarking and analysis of performance and efficiency trade-offs.

EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation

EcoAgent: introduces an edge-cloud collaborative multi-agent framework for mobile automation, featuring a Cloud-Based Planning Agent (Task decomposition, planning), Edge-Based Execution Agent (Action execution), Edge-Based Observation Agent (Monitor screen, verify outcomes), Memory Module (Stores screen history), Reflection Module (Supports replanning), and Pre-Understanding Module (Compresses screen images).
The framework coordinates cloud and edge agents in a closed loop, leveraging cloud-based MLLMs for planning and edge-based MSLMs for execution and observation.
The Pre-Understanding Module reduces communication overhead by compressing screen images, while Memory and Reflection modules enable replanning upon execution failure.

HEXGEN-TEXT2SQL: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL Workflows

HEXGEN-TEXT2SQL: introduces a novel framework for scheduling agentic Text-to-SQL workflows on heterogeneous GPU clusters, featuring a Global Coordinator (Dispatcher) that assigns requests, LLM Model Instances (with Local Priority Queue) that process and prioritize tasks, and a Simulator (Alpha-Tuning) that tunes the dispatcher parameter.
The framework employs a two-level hierarchical scheduling approach combining global workload-balanced dispatching and local adaptive urgency-guided prioritization to manage multi-stage dependencies and resource heterogeneity.
This design significantly improves SLO attainment and system throughput for LLM-based Text-to-SQL serving compared to baseline methods.

MARK: Memory Augmented Refinement of Knowledge

MARK (Memory-Augmented Refinement of Knowledge): introduces a scalable agentic memory design framework, with Conversational LLM Agent, LLM, System Prompt, Domain Knowledge Source, Chat History, Memory Builder Service (MBS), Memory Search Service (MSS), Memory Store, Residual Refined Memory Agent, User Question Refined Memory Agent, LLM Response Refined Memory Agent, Memory Relevance Scoring (MRS), and Memory, enabling LLMs to continuously learn and refine domain knowledge.
The framework utilizes specialized memory agents (Residual, User Question, LLM Response) to extract refined memories from conversations, stored in a Memory Store.
Memory Search Service retrieves and ranks relevant memories using Memory Relevance Scoring for injection into the LLM context, improving accuracy and adaptability.

From First Draft to Final Insight: A Multi-Agent Approach for Feedback Generation

G-E-RG (Generation, Evaluation, and Regeneration): introduces a multi-agent framework for feedback generation, including External Database (slides), Question, Student response, Agent 1 (Generation), Feedback in the first round, Agent 2 (Evaluation), Evaluation results, Agent 3 (Re-Generation), and Feedback in the second round, which generates initial feedback, evaluates it, and then regenerates improved feedback.
The framework utilizes three distinct GPT-4o agents for the sequential tasks of initial generation, evaluation based on a rubric, and final regeneration informed by evaluation results.
The iterative G-E-RG process significantly improves feedback quality across multiple dimensions compared to single-round generation methods.

Reasoning Models Don't Always Say What They Think

CoT Faithfulness Evaluation and RL Training Framework: introduces an evaluation of large language models, including Claude 3.5 Sonnet (New), Claude 3.7 Sonnet, DeepSeek V3, and DeepSeek R1, assessing the faithfulness of their Chain-of-Thought reasoning when responding to Input Prompts and Prompt Pairs, and studies the impact of Outcome-Based Reinforcement Learning in synthetic RL Environments with Reward Hacks defined by a Reward Function.
The evaluation measures how often models verbalize hints used in their reasoning process, finding that CoTs often lack faithfulness, particularly on misaligned hints and harder tasks.
Outcome-based RL initially improves CoT faithfulness but plateaus, and models exploiting Reward Hacks in RL Environments rarely verbalize the hack in their CoTs.

7th May 2025

Large Language Models are Autonomous Cyber Defenders

LLM Adapter Framework: introduces a system to integrate LLMs into the CybORG CAGE 4 environment, including an LLM Adapter, Formatter, Backend, LLM Models, Custom Policies, Communication Protocol, Blue Agent (LLM), Blue Agent (RL), Red Agent, and Green Agent.
The framework enables LLM-driven agents to act as autonomous cyber defenders in a multi-agent simulation alongside RL and finite-state agents.
A novel communication protocol allows diverse blue agents to share threat information and coordinate defensive actions within the simulated network.

Safeguard-by-Development: A Privacy-Enhanced Development Paradigm for Multi-Agent Collaboration Systems

Maris: introduces a privacy-enhanced development paradigm for multi-agent collaboration systems, with MACS Data Protection Manifest (Specifies data protection policy), Data Safeguard Engine (Integrates policy into workflows), Conversation Handler (Hooks into message flows), Manifest Enforcer (Validates messages, applies actions), AutoGen (Multi-agent development framework), ConversableAgent (Generic agent class), GroupChatManager (Group conversation manager), Agents (Autonomous actors), LLMs (Large Language Models), Tools (External services/functions), and Users (Human participants), designed to address data leakage threats by enforcing rigorous message flow control.
The system embeds reference monitors into key conversation components to validate message flows against user-defined policies at runtime.
Evaluation across healthcare, supply chain, and recommendation use cases demonstrates satisfactory effectiveness and low performance overhead.

Benchmarking LLMs' Swarm intelligence

SwarmBench: introduces a novel benchmark for evaluating LLM swarm intelligence, featuring a launcher (Launches benchmark), a framework orchestrator (Orchestrates interactions), a simulation environment (Simulation environment), task definitions (Defines coordination tasks), a physics engine (Manages environment physics), LLM-powered agents (LLM-powered agents logic), and a data logger (Captures simulation data).
The benchmark assesses emergent decentralized coordination in LLM swarms under strict perception and communication constraints within a configurable 2D grid world.
SwarmBench includes five core multi-agent coordination tasks: Pursuit, Synchronization, Foraging, Flocking, and Transport, evaluated using a zero-shot protocol.

CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System

CompileAgent: introduces an LLM-based agent framework for automated repo-level compilation, integrating a MasterAgent, Flow-based Agent Strategy, Shell Tool, File Navigator Tool, Instruction Extractor Tool, Website Search Tool, and Multi-Agent Discussion Tool to handle instruction search and error resolution.
The framework leverages five specialized tools and a flow-based strategy orchestrated by a MasterAgent to interact with software artifacts and the interactive environment.
CompileAgent significantly improves compilation success rates and reduces time/cost compared to baselines on a new benchmark, demonstrating the potential of agent-based approaches for complex software engineering tasks.

A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models

LLM Risk Evaluation Framework: introduces a novel metric and framework for evaluating the operational risk of LLM-based chatbots, integrating an Improved Probes Set, Garak scanner, the Prospected Chatbot System, a Metric Calculator, Industry Factor, Age Profile of Users, Technical Complexity, and Hits to assess risks to the system, users, and third parties.
The framework leverages the open-source GARAK tool, enhancing its probes and incorporating contextual factors like industry and user demographics into the risk calculation.
Evaluation results using the framework demonstrate varying risk levels across different LLM models and the impact of prompt protection and contextual multipliers on risk assessment.

AutoPatch: Multi-Agent Framework for Patching Real-World CVE Vulnerabilities

AutoPatch: introduces a multi-agent framework with a security plugin, similarity analyzer, taint analysis, semantic analysis, unified similarity model, RAG database, vulnerability verifier, code patcher, and LLM-based code generation model, designed to patch vulnerable LLM-generated code by identifying and fixing real-world CVEs.
The framework leverages retrieval-augmented generation and specialized LLM agents to analyze code, find similar vulnerabilities in a database, verify their presence, and generate secure patches.
This approach aims to overcome the knowledge cutoff limitation of LLMs and provide a cost-efficient alternative to frequent fine-tuning for handling newly disclosed vulnerabilities.

Benchmarking LLMs' Swarm intelligence

SwarmBench: introduces a novel benchmark for evaluating LLM swarm intelligence, featuring a launcher (Launches benchmark), a framework orchestrator (Orchestrates interactions), a simulation environment (Simulation environment), task definitions (Defines coordination tasks), a physics engine (Manages environment physics), LLM-powered agents (LLM-powered agents logic), and a data logger (Captures simulation data).
The benchmark assesses emergent decentralized coordination in LLM swarms under strict perception and communication constraints within a configurable 2D grid world.
SwarmBench includes five core multi-agent coordination tasks: Pursuit, Synchronization, Foraging, Flocking, and Transport, evaluated using a zero-shot protocol.

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Absolute Zero Reasoner (AZR): introduces a system where a single Language Model (acts as proposer and solver) learns to propose tasks (Proposer) and solve them (Solver) through self-play, utilizing a Code Executor (validates tasks, verifies answers) as the Environment (provides feedback) and guided by a Reward Function (guides learning) and RL Algorithm (updates model).
The system operates under the Absolute Zero paradigm, learning entirely from self-generated tasks and environmental feedback without relying on external human-curated data.
AZR leverages three distinct task types (deduction, abduction, induction) and a task-relative reinforcement learning approach (TRR++) to achieve strong reasoning capabilities across coding and mathematical domains.

Facilitating Trustworthy Human-Agent Collaboration in LLM-based Multi-Agent System oriented Software Engineering

RACI-based framework: introduces a method for assigning responsibilities between Human Actors and LLM-based Agents using RACI roles to facilitate trustworthy human-agent collaboration in LLM-based multi-agent systems for software engineering.
The framework aims to enhance collaboration, ensure accountability, and mitigate risks associated with LLM-driven automation by systematically distributing decision-making authority and oversight.
The approach defines specific roles (Responsible, Accountable, Consulted, Informed) for humans and agents across tasks within the software development lifecycle.

Identification and Optimization of Redundant Code Using Large Language Models

LLM-agent: introduces a framework leveraging Large Language Models (Core engine) to analyze and optimize a Codebase (Input code), verified by Test Cases (Verification).
The framework incorporates Static Analysis Tools (Evaluation) for metric evaluation and Developer Feedback (Validation/Insights) for understanding redundancy causes.
The LLM Agent (Orchestrator) manages the process, aiming to build a Catalog (Knowledge base) of redundant code patterns and reasons.

6th May 2025

The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete

Narrative Primed LLM Agents: introduces a system where LLM agents play a public goods game, influenced by narrative priming from a story pool.
The study investigates how shared versus different narratives affect agent collaboration and competition outcomes in the game.
Experiments explore the influence of narrative type, group size, and the presence of selfish agents on collaboration scores and payoffs.

Frog Soup: Zero-Shot, In-Context, and Sample-Efficient Frogger Agents

LLM Demonstrations Guided DQN: introduces enhancing a traditional DQN agent with LLM-generated gameplay demonstrations, utilizing Objects Coordinates Extraction, LLM Agents, LLM Demo, and LLM Loop to collect expert trajectories, which are then integrated into the DQN Components including Self-Play Experience, Evaluation NNet, Target NNet, Priority Experience Replay, Priority Sampling, DQN loss calculation, and DQN Loop interacting with the Atari-Frogger Env.
The approach leverages Prioritized Experience Replay to prioritize sampling of the LLM-generated expert demonstrations, aiming to improve the sample efficiency and initial performance of the DQN agent on the challenging Frogger game.
Experiments show that incorporating LLM demonstrations leads to significantly higher episodic rewards and faster convergence compared to a standard DQN baseline within a limited training budget.

Performance Evaluation of Large Language Models for High-Performance Code Generation: A Multi-Agent Approach (MARCO)

MARCO (Multi-Agent Reactive Code Optimizer): introduces a multi-agent system with Code Optimizer Agent, Web-Search Engine, Performance Evaluator Agent, and Adaptive Feedback Loop for optimizing high-performance computing code.
The Code Optimizer Agent generates and refines code using strategies informed by the Web-Search Engine and feedback from the Performance Evaluator Agent.
The Adaptive Feedback Loop iteratively improves code quality by feeding performance metrics from the evaluator back to the optimizer.

Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale

FGO (Fine-Grained Optimization): introduces, "Divide (Splits dataset) / Optimize (Optimizes subsets) / LLM Optimizer (Updates modules) / Agent (Executes tasks) / Module (Part optimized) / Evaluate (Assesses performance) / Merge (Combines modules) / Recursive Clustering (Groups modules) / Direct Merge (Combines groups) / Optimal Agent System (Final agent)", a scalable framework for LLM agent optimization.
FGO divides large optimization tasks into manageable subsets, performs fine-grained optimization on each subset, and progressively merges the optimized components.
The framework demonstrates improved performance and efficiency for LLM-based agent optimization on large datasets compared to traditional methods.

SLOT: Structuring the Output of Large Language Models

SLOT (Structured LLM Output Transformer): introduces a model-agnostic post-processing approach using a fine-tuned lightweight language model to transform unstructured LLM output into structured formats, incorporating a Data Synthesizer LLM and Validation for data creation, and utilizing Loss Calculation and Weight Update for training.
The framework takes unstructured text from an upstream LLM and a JSON Schema as input to the SLOT model, producing structured output, and is evaluated using metrics like Schema Accuracy and Content Similarity.
SLOT can be combined with Constrained Decoding methods to further enhance structural validity and performance, demonstrating that targeted training can enable smaller models to achieve high-quality structured generation.

Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale

FGO (Fine-Grained Optimization): introduces, "Divide (Splits dataset) / Optimize (Optimizes subsets) / LLM Optimizer (Updates modules) / Agent (Executes tasks) / Module (Part optimized) / Evaluate (Assesses performance) / Merge (Combines modules) / Recursive Clustering (Groups modules) / Direct Merge (Combines groups) / Optimal Agent System (Final agent)", a scalable framework for LLM agent optimization.
FGO divides large optimization tasks into manageable subsets, performs fine-grained optimization on each subset, and progressively merges the optimized components.
The framework demonstrates improved performance and efficiency for LLM-based agent optimization on large datasets compared to traditional methods.

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

WebGen-Bench (Evaluation Pipeline): introduces a benchmark and pipeline to evaluate LLM-based agents on generating websites from scratch, including Data Curation, Website Generation, Test Case Construction, UI Agent, UI Agent Engine, Appearance Grading, and Manual Validation components.
The pipeline uses LLMs and human annotators for data creation, LLM-based agents for website generation, and a UI agent powered by an LLM for automated functional testing.
Website appearance is graded by a separate LLM, and human testers perform manual validation of test cases.

LlamaFirewall: An open source guardrail system for building secure AI agents

LlamaFirewall: introduces an open-source, system-level security framework for LLM-powered applications, including a Unified Policy Engine (Orchestration), PromptGuard 2 (Jailbreak detection), AlignmentCheck (Agent alignment), and CodeShield (Code analysis).
The framework provides layered defense against prompt injection, agent misalignment, and insecure code generation risks.
LlamaFirewall offers a modular design supporting custom pipelines, conditional remediation strategies, and pluggable detectors for real-time security monitoring.

A Comprehensive Survey of Large AI Models for Future Communications: Foundations, Applications and Challenges

LAMs (Large AI Models): introduces a comprehensive survey of Large AI Models for future communications, covering their foundations including Transformer, Diffusion, and Mamba architectures, classification into LLM, LVM, LMM, and World models, training methods like Pre-training, Fine-tuning, and Alignment, and optimization techniques such as CoT, RAG, and Agentic systems.
The paper details the application of LAMs across various communication scenarios, including physical layer design, resource allocation, network management, edge intelligence, semantic communication, agentic systems, and emerging applications.
It analyzes the research challenges faced by LAMs in communication, such as data quality, structured knowledge integration, generative hallucination, reasoning limitations, explainability, adaptability, task diversity, resource constraints, inference latency, and security/privacy.

A HASHGRAPH-INSPIRED CONSENSUS MECHANISM FOR RELIABLE MULTI-MODEL REASONING

Hashgraph-inspired Consensus Mechanism: introduces a system for reliable multi-model reasoning using a Query Handler (accepts user request), Model Interface Layer (manages model connections), Consensus Controller (implements gossip and checks convergence), Prompt Generator (formulates model prompts), Comparer/Evaluator (compares model outputs), Result Aggregator (formats final output), and a Reasoning Model Pool (set of black-box models).
The system treats each reasoning model as a node in a distributed network, using gossip-about-gossip and virtual voting principles to achieve consensus on a final answer.
This iterative process allows models to exchange and refine answers, aiming to reduce hallucinations and improve accuracy by leveraging collective intelligence.

LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs

LogisticsVLN: introduces a UAV-based vision-language navigation system for terminal delivery, integrating an LLM (interprets request, extracts attributes), Floor Count VLM (estimates floors, guides vertical movement), Object Recognition VLM (identifies target window/object), Choice VLM (determines next action), Depth Assistant (ensures safety, calculates distances), and RGB-Depth Observation (input data).
The system processes user requests and environmental observations to guide a drone to a specific window for package delivery.
It operates without prior maps or fine-tuning, relying on foundation models for perception, understanding, and decision-making in unseen residential environments.

Procedural Memory Is Not All You Need: Bridging Cognitive Gaps in LLM-Based Agents

Modular Semantic-Associative System: introduces a modular architecture augmenting LLMs with semantic and associative memory components to bridge cognitive gaps.
This system decouples procedural execution (LLM actor) from adaptive reasoning (semantic/associative modules) for robust decision-making.
The architecture is designed for agents operating in complex, unpredictable "wicked" environments by specializing cognitive functions.

DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning

DYSTIL: introduces a strategy-based reinforcement learning framework with DYSTIL RL Agent L, Memory Mc, Input Constructor, Core Reasoning LLM, Actor Module, Critic Module, Strategy-Generating LLM Q, Observation-to-Text Converter Co→t, Experience Buffer B, and PPO Parameter Optimization, which dynamically induces textual strategies using large language models to improve reinforcement learning from expert demonstrations.
The framework integrates a strategy-generating LLM for strategy induction with a lightweight core reasoning LLM for policy optimization.
DYSTIL iteratively updates strategies based on experience and advantage estimations, enhancing sample efficiency and model interpretability.

VLM Q-LEARNING: ALIGNING VISION-LANGUAGE MODELS FOR INTERACTIVE DECISION-MAKING

LVLMQ (VLM Q-Learning): introduces, "aligning vision-language models for interactive decision-making", with VLM (core RL policy), Image Encoder (processes image input), Text Encoder (processes text input), LoRA Transformer (adapted VLM body), Language Head (Actor) (predicts output tokens), Critic Head (estimates action values), Environment (interactive system), Observation Prompt (formats VLM input), parseagent (parses VLM response), and parseenv (interprets action for environment), where the method applies off-policy reinforcement learning to fine-tune VLMs for agent tasks by adding a critic head and using an advantage-filtered supervised fine-tuning loss.
The approach converts turn-based agent interactions into token-based RL transitions, allowing the VLM's language head to act as the policy and the critic head to filter suboptimal actions based on learned value estimates.
This technique enables VLMs to self-improve and learn from low-quality datasets, effectively replacing standard supervised fine-tuning for VLM agent training while handling action syntax challenges.

An LLM-based Self-Evolving Security Framework for 6G Space-Air-Ground Integrated Networks

LLM-based Self-Evolving Security Framework: introduces a security framework for 6G SAGINs with LLM-6GNG (Processes threat data, generates strategies), 6G-INST (Enables framework self-evolution), and 6G Simulator (Simulates 6G SAGINs environment).
The LLM-6GNG component processes threat information and generates security strategies using multi-agent LLMs and chain-of-thought reasoning.
The 6G-INST component enables the framework to self-evolve by automatically updating the LLM-6GNG with new training data generated from encountered threats.

Assessing and Enhancing the Robustness of LLM-based Multi-Agent Systems Through Chaos Engineering

Chaos Engineering Framework: introduces a framework for assessing and enhancing the robustness of LLM-based Multi-Agent Systems (LLM-MAS) by systematically applying chaos engineering principles.
The framework includes components like a Chaos Module for fault injection and Monitoring Components/Modules for data collection and analysis.
The research proposes validating the framework through controlled experiments simulating various failure scenarios in LLM-MAS deployments.

5th May 2025

Improving Model Alignment Through Collective Intelligence of Open-Source LLMS

MoAA: introduces a two-stage alignment recipe leveraging the collective intelligence of multiple open-source LLMs, including Mixture of Agents (MoA), Proposers, Aggregators, Synthetic Data Generator, Reward Model, Criteria Filtering, Target Model, SFT Model, and DPO Model, to generate high-quality synthetic data for supervised fine-tuning and preference optimization.
The approach utilizes MoA as a synthetic data generator in the first stage (MoAA-SFT) to fine-tune a target model and as a reward model in the second stage (MoAA-DPO) to annotate preference data for direct preference optimization.
MoAA demonstrates significant improvements in model performance on alignment benchmarks by effectively integrating the strengths and diversity of open-source LLMs without relying on stronger external supervision.

34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery

The LLM-Powered Research Constellation: introduces 34 LLM applications across materials science and chemistry, categorized into Property Prediction (Forecasting properties), Molecular & Material Design (Generating novel molecules/materials), Automation & Novel Interfaces (Developing interfaces/automations), Scientific Communication and Education (Enhancing communication/education), Research Data Management and Automation (Streamlining data handling/processing), Hypothesis Generation & Evaluation (Generating/evaluating hypotheses), and Knowledge Extraction & Reasoning (Extracting knowledge/reasoning).
These applications, developed during a hackathon, demonstrate LLMs' versatility as predictive models and platforms for rapid prototyping of domain-specific tools.
The work highlights how integrating LLMs into scientific workflows can accelerate discovery and improve researcher efficiency across the entire research lifecycle.

The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models

Iterative Program Repair Pipeline: introduces an approach for automatic program repair using instruction-tuned large language models, balancing multi-output generation and iterative refinement within a limited patch budget, incorporating Input, LLM, Prompt, Output, Parsing, Validation, Execution, Feedback, and Iterative Process components.
The pipeline processes buggy code input, uses an LLM guided by a prompt to generate output patches, which are then parsed and subjected to validation via execution with tests.
Feedback from validation drives the iterative process to refine patches, aiming to maximize repair success while limiting the total number of generated patches.

Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Scenethesis: introduces a training-free agentic framework for text to interactive 3D scene generation, with LLM Module (Coarse scene planning), Vision Module (Layout visual refinement), Optimization Module (Physics-aware optimization), and Judge Module (Spatial coherence judgment).
The framework leverages language and visual priors to generate realistic and physically plausible indoor and outdoor environments.
It integrates LLM-based scene planning with vision-guided layout refinement and physics-aware optimization to ensure spatial realism and physical plausibility.

AutoLibra: Agent Metric Induction from Open-Ended Feedback

AutoLibra: introduces a framework for agent evaluation that transforms Human Feedback (Open-ended text) on Agent Trajectory (Agent actions/observations) using an LLM (Text processing model) to generate Aspects (Grounded behavior-feedback), induce AutoLibra Metrics (Induced evaluation criteria), evaluate agents producing Traits (LLM metric ratings), and meta-evaluate metrics using Meta-Metrics (Metric quality evaluation).
The framework operates in a closed loop, using meta-evaluation results (coverage and redundancy) to optimize the induced metrics.
AutoLibra-induced metrics serve as targets for agent improvement through prompt engineering or fine-tuning.

Generating HomeAssistant Automations Using an LLM-based Chatbot

EcoMate: introduces an LLM-based chatbot system for generating HomeAssistant routines, utilizing an LLM to process User Commands, Home Template, and Energy Consumption data for execution by the HomeAssistant framework.
The system evaluates different LLMs' ability to generate valid JSON routines for HomeAssistant and assesses user perception compared to rule-based chatbots.
Findings indicate GPT models excel in routine generation, while user studies show positive engagement and usability for the LLM-based approach in promoting sustainable practices.

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Voila: introduces a family of large voice-language foundation models, with Audio Tokenizer (Audio tokenization/decoding), Text Tokenizer (Text tokenization), Voice-language LLM backbone (Processes interleaved tokens), and Audio Transformer (Generates audio tokens), designed for real-time autonomous interaction and voice role-play.
The model employs a hierarchical multi-scale Transformer architecture integrating LLM reasoning with acoustic modeling for natural, persona-aware voice generation.
Voila supports end-to-end voice conversation and autonomous full-duplex interaction by processing interleaved audio and text tokens.

Exploring LLM-Powered Role and Action-Switching Pedagogical Agents for History Education in Virtual Reality

VR Prototype with LLM-Powered Pedagogical Agents: introduces a system for VR history education featuring a Virtual Environment, Pedagogical Agents (PAs) powered by an LLM (Large Language Model), and modules for Conversation, Adaptive Role-Switching, and Adaptive Action-Switching.
The LLM processes user input and context to drive the Conversation Module, while the Adaptive Role-Switching and Adaptive Action-Switching Modules dynamically adjust the PA's role, appearance, voice, tone, and actions based on the LLM's output and environmental factors.
A user study found that adaptive role-switching enhanced perceived trustworthiness and expertise, while adaptive action-switching increased perceived social presence and humanness, offering insights for designing multi-role agents in immersive learning.

A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law

Test-Time Scaling, Reinforced Learning, and Slow Thinking: introduces a survey of reasoning LLMs, detailing methods like Test-Time Scaling (Adjusts computation complexity), Reinforced Learning (Optimizes policy via feedback), and Slow Thinking (Emulates deliberate reasoning).
These approaches incorporate components such as Search and Sampling (Explores reasoning paths), Dynamic Verification Mechanism (Verifies, refines outputs), Policy Network (Learns reasoning strategies), Reward Design (Evaluates reasoning quality), and Self-Evolution (Iteratively improves performance).
Slow Thinking frameworks further utilize Long CoT (Generates extended reasoning), Hierarchical Reasoning (Structures problem-solving modularly), and Hybrid Thinking (Combines fast, slow processes) to enhance reasoning capabilities.

Evaluating Contrastive Feedback for Effective User Simulations

LLM-based User Simulation: introduces, "evaluating different prompting strategies for LLM-based user agents in interactive information retrieval simulations", with LLM (Core agent), Information Need (Initial context), Knowledge State (Evolving understanding), Relevance Feedback (Document summaries), Prompting Strategy (Contextual input method), Query Generation (LLM creates queries), Relevance Judgment (LLM judges documents), Knowledge State Update (Incorporates feedback), and Simulation Environment (Provides search results), where "the paper analyzes how different modalities of contextual information influence the effectiveness of user simulations".
The study evaluates user configurations where the LLM agent's knowledge state is updated iteratively with summaries of previously judged relevant, irrelevant, or both types of documents.
The research demonstrates that providing contrastive feedback (summaries of both relevant and irrelevant documents) to the LLM agent improves simulated user search effectiveness.

Beyond the model: Key differentiators in large language models and multi-agent services

LLM Ecosystem Differentiators: introduces key differentiators beyond the core model, including Data Quality, Proprietary Datasets, Model Quantization, Model Pruning, Neural Attention Memory Models (NAMMs), Semantic Caching, Attention Offloading, Speculative Decoding, Low-Rank Adaptation (LoRA), Flash-LLM, Evaluation Frameworks, Monitoring Systems, Model-to-Data Movement, Synthetic Data Generation, Data Versioning, and Data Lineage.
The paper reviews critical factors like data management, computational efficiency, latency reduction, and robust evaluation frameworks that ensure modern AI services are efficient and profitable.
These ecosystem components and strategies are presented as the real competitive advantage in generative AI as large language models become increasingly commoditized.

El Agente: An Autonomous Agent for Quantum Chemistry

El Agente Q: introduces an LLM-based multi-agent system with a hierarchical architecture, integrating working and long-term memory, an LLM reasoning core, and specialized agents for automated quantum chemistry workflows.
The system features a hierarchical memory framework enabling flexible task decomposition, adaptive tool selection, post-analysis, and autonomous file handling.
El Agente Q demonstrates robust problem-solving, adaptive error handling, and supports multi-step task execution for complex workflows.

4th May 2025

A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP)

MCP (Model Context Protocol), ACP (Agent Communication Protocol), A2A (Agent-to-Agent Protocol), and ANP (Agent Network Protocol): introduces a survey examining four emerging agent communication protocols, including MCP (Initiator, Provider, Message semantics, Physical transmission, Calls expecting replies, Successful responses, Failures, Asynchronous updates, Model-controlled capabilities, Application-controlled data, User-controlled templates, Server-controlled generation delegation), ACP (Initiates communication, Protocol broker, Execution endpoint, Identity and capability profile, Agent location, Unit of delegated work, Communication envelope, Execution outputs), A2A (Originator of intent, Intermediary orchestrator, Service endpoint, Self-description and discovery, Actionable capabilities, Atomic unit of work, Communication channel, Tangible outputs, Real-time streaming, Out-of-band updates), and ANP (Decentralized identifier, Structured metadata profile, Agent indexing and search, JSON-RPC, OpenAPI, YAML schemas, Dynamic protocol alignment, Secure communication, Protocol negotiation layer, Core application logic layer, Transport protocol), each addressing distinct interoperability tiers for LLM-powered agents.
The protocols are compared across dimensions like interaction modes, discovery mechanisms, communication patterns, and security models to provide a foundation for designing secure, interoperable, and scalable agent ecosystems.
A phased adoption roadmap is proposed, starting with MCP for tool access, progressing through ACP for multimodal messaging and A2A for enterprise collaboration, and extending to ANP for decentralized agent marketplaces.

VECSR: Virtually Embodied Common Sense Reasoning System

VECSR: introduces a framework for common sense reasoning, with VECSR (Orchestrates process), s(CASP) Knowledge Base (Stores rules and state), s(CASP) Goal-Directed Solver (Generates action plans), and VirtualHome Simulation Environment (Provides embodied world), designed to break down high-level tasks into executable mid-level instructions.
The system converts VirtualHome state into s(CASP) facts, combines them with common sense rules, and optimizes the resulting program using techniques like modularity, dependency graphs, and partial grounding.
The s(CASP) solver then uses the optimized program to generate a sequence of actions that achieve the goal task in the simulated environment, providing explainable and executable plans.

Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency

ACT + Debugger (Multi-Agent Collaboration and Runtime Debugging): introduces a chained system combining multi-agent collaboration and runtime debugging for LLM code generation, including Analyst, Coder, Tester, and Debugger agents, processing Code Requirements, Visible Test Cases, Code Blocks, and using Blocking and Tracing Process, Execute Program to produce a Final Answer.
The system first uses Analyst, Coder, and Tester agents in a process-oriented phase, then transitions to a product-oriented debugging phase involving the Debugger and Coder agents if initial tests fail.
This integrated approach aims to leverage the strengths of collaborative planning and iterative debugging to improve functional accuracy, code rigor, and latency trade-offs.

DriveAgent: Multi-Agent Structured Reasoning with LLM and Multimodal Sensor Fusion for Autonomous Driving

DriveAgent: introduces a novel multi-agent framework for autonomous driving that leverages LLM and VLM reasoning combined with multimodal sensor fusion, structured into Descriptive Analysis, Vehicle Reasoning, Environmental Reasoning, and Response Generation modules.
The framework integrates camera, LiDAR, GPS, and IMU data through a hierarchy of specialized agents within these modules to enhance situational understanding and decision-making.
DriveAgent aims to provide clear, reliable, and interpretable insights into complex driving scenarios, improving robustness and reliability compared to baseline methods.

MemEngine: A Unified and Modular Library for Developing Advanced Memory of LLM-based Agents

MemEngine: introduces a unified and modular library for developing advanced memory models for LLM-based agents, with Memory Models, Memory Operations, Memory Functions, Memory Configurations, Memory Utilities, and LLM components.
The library provides a hierarchical framework comprising memory functions, operations, and models, supported by configuration and utility modules.
MemEngine facilitates convenient development and pluggable usage of various pre-implemented and customizable memory models for LLM agents.

Leveraging LLM Agents and Digital Twins for Fault Handling in Process Plants

Methodological Framework: integrates LLM agents with a Digital Process Plant Twin for autonomous fault handling, utilizing Monitoring Agent, Action Agent, Simulation, Validation Agent, and Reprompting Agent components.
The framework operates in a closed loop, observing the Process Plant state, generating and validating corrective actions via the Digital Process Plant Twin simulation.
Plant-specific knowledge from the Digital Twin informs the LLM agents' reasoning for deriving effective and safe corrective control actions.

3rd May 2025

CAMOUFLAGE: Exploiting Misinformation Detection Systems Through LLM-driven Adversarial Claim Transformation

CAMOUFLAGE (Claim Alteration for Misleading Output Using Feedback from Language Agent GuideancE): introduces an iterative LLM-driven adversarial attack framework with an Attacker Agent, Prompt Optimization Agent, Claim Evaluation, Misinformation Detection System, and History components.
The Attacker Agent generates perturbed claims guided by the Prompt Optimization Agent, which refines instructions based on feedback from the Claim Evaluation and the target Misinformation Detection System.
The framework optimizes attacks using only binary feedback from the target system and evaluation metrics, storing past attempts in History to guide future rewrites.

Model Context Protocol-based Internet of Experts For Wireless Environment-aware LLM Agents

MCP-based Internet of Experts (IoX): introduces a framework equipping LLM Agents with wireless environment awareness by coordinating interactions with Expert Models hosted on Expert Servers via the Model Context Protocol, using input from the Wireless Environment.
The framework enables the LLM Agent to selectively query and interpret outputs from lightweight, task-specific Expert Models at inference time without modifying its parameters.
This architecture supports modular, extensible, and interpretable reasoning over wireless contexts, significantly improving classification accuracy compared to standalone LLMs.

A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

LLM Inference Engine: introduces a comprehensive survey of 25 open-source and commercial engines, detailing their support for Batch Optimization (groups requests), Parallelism (distributes computation), Compression (reduces model size), Fine-Tuning (adapts model), Caching (reuses computations), Attention Optimization (improves attention), Sampling Optimization (speeds token generation), and Structured Outputs (constrains output format).
The paper examines each inference engine's ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation across diverse hardware.
It provides practical guidance for selecting and designing optimized LLM inference engines by analyzing their design goals, supported optimization techniques, and ecosystem maturity.

The STROT Framework: Structured Prompting and Feedback-Guided Reasoning with LLMs for Data Interpretation

STROT (Structured Task Reasoning and Output Transformation): introduces a framework for structured data interpretation with LLMs, featuring Schema-Aware Context Construction (Analyze data schema), Prompt Scaffolding and Task Planning (Generate analysis plan), Transformation Logic Synthesis (Generate executable code), Program Execution (Run generated code), Feedback-Driven Refinement (Revise code based on errors), and Final Output (Deliver structured result).
The framework embeds the LLM within a multi-phase, feedback-driven pipeline that treats data understanding as a dynamic and structured process, enabling iterative reasoning and self-correction.
This agentic approach improves reliability, interpretability, and semantic alignment for structured data analysis tasks compared to single-shot methods.

2nd May 2025

PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents

PIPA (Unified evaluation Protocol for Interactive Planning Agents): introduces a unified evaluation protocol for interactive planning agents, conceptualizing their behavior within a POMDP paradigm, including Agent (Interactive planning agent), User (Interacts with agent), Interactive Session (Multi-turn dialogue), Intermediate Steps (Agent's internal reasoning), State Consistency Metric (S) (Aligns user requests with steps), Tool Efficiency Metric (A) (Measures tool utilization), Observation Alignment Metric (O) (Aligns observations with user needs), Policy Alignment Metric (P) (Follows predefined policies), and Task Completion Metric (R) (Measures goal achievement).
The protocol provides a comprehensive assessment through atomic evaluation criteria to diagnose strengths and weaknesses in the agent's decision-making pipeline.
PIPA enables multi-axis diagnosis and cross-benchmark comparisons, showing that user satisfaction is shaped by both outcomes and intermediate behaviors.

AI agents may be worth the hype but not the resources (yet): An initial exploration of machine translation quality and costs in three language pairs in the legal and news domains

This paper evaluates five machine translation paradigms: Google Translate (GT), GPT-4o, o1-preview, sequential multi-agent system (s-agent), and iterative multi-agent system (i-agent), comparing their quality and cost-efficiency.
The study benchmarks these systems using automatic metrics, human evaluation of adequacy and fluency, and token-based cost analysis across three language pairs and two domains.
Findings indicate that reasoning-enhanced LLMs and multi-agent workflows show potential for higher quality in human evaluation but incur significantly greater computational costs compared to traditional NMT and general LLMs.

WirelessAgent: Large Language Model Agents for Intelligent Wireless Networks

WirelessAgent: introduces a framework leveraging LLMs to create autonomous AI agents for wireless networks, integrating LLMs (Cognitive engine), Perception (Processes inputs), Memory (Stores data, context), Planning (Organizes tasks, reasons), Action (Executes commands), LangGraph (Graph-based workflow architecture), Global State (Shared workflow memory), External Tools (Specialized capabilities), Knowledge Base (Domain information repository), and System Prompts (Guide agent behavior).
The framework is built on agentic workflows implemented using the LangGraph architecture to manage complex wireless tasks.
WirelessAgent demonstrates near-optimal network throughput and higher bandwidth utilization compared to prompt-based methods in network slicing tasks.

VTS-LLM: Domain-Adaptive LLM Agent for Enhancing Awareness in Vessel Traffic Services through Natural Language

VTS-LLM: introduces a domain-adaptive LLM agent for Vessel Traffic Services, with NER-based relational reasoning (clarifies query-database relations), agent-based domain knowledge injection (integrates maritime knowledge), semantic algebra intermediate representation (bridges natural language to SQL), query rethink (validates and corrects SQL), and LLM (core language model) components.
The framework formalizes risk-prone vessel identification as a knowledge-augmented Text-to-SQL task, leveraging structured vessel databases and external maritime knowledge, supported by a curated benchmark dataset.
VTS-LLM demonstrates superior performance and robustness across command-style, operational-style, and formal natural language queries compared to general-purpose and SQL-focused baselines.

Multi-agents based User Values Mining for Recommendation

ZOOM (Zero-shot Multi-LLMs Collaborative Framework for User Values Mining): introduces a framework for extracting user values from historical interactions using User History, Text Summarization, Evaluators, Decoding Strategies, Supervisors, and Debate to produce User Values.
The framework employs multi-agent collaboration between evaluators generating diverse value candidates and supervisors refining them through debate to mitigate LLM limitations.
Text summarization addresses input length constraints, while the multi-agent debate enhances accuracy and reduces hallucinations in value extraction.

Seeking to Collide: Online Safety-Critical Scenario Generation for Autonomous Driving with Retrieval Augmented Large Language Models

The LLM-driven framework: introduces an online safety-critical scenario generation method featuring an LLM Behavior Analyzer (Infers dangerous intent), Feasible Trajectory Generation (Synthesizes adversarial trajectories), and Dynamic Memorization and Retrieval (Adapts online).
This framework utilizes a Memory bank (Stores intent-planner pairs) and offline processes including a Code Generator (Generates planner code), Simulation (Evaluates trajectories), and Code Modifier (Refines planner code) to support online adaptation and generation.
By analyzing historical states, inferring intent, generating trajectories, and dynamically updating a behavior library, the method effectively generates high-risk scenarios for autonomous vehicle testing.

SSRLBot: Designing and Developing an LLM-based Agent using Socially Shared Regulated Learning

SSRLBot: introduces an LLM-based agent for teamwork evaluation, integrating an LLM Backbone, Instructions/Preamble, SSRL Knowledge, Capabilities, Prompting Strategies, Iterative Refinement, Input, Output, and Output Evaluation to analyze diagnostic conversations.
Grounded in Socially Shared Regulation of Learning (SSRL) theory, the agent evaluates team members' interpersonal influence and SSRL skills.
The system provides contextualized feedback, comparative skill analysis, and improvement suggestions for collaborative learning and decision-making.

Structured dataset of reported cloud seeding activities in the United States (2000-2025) using a large language model

Data Extraction Pipeline: introduces, "a pipeline for creating a structured dataset from historical cloud seeding reports", with PDF Reports (Source documents), Preprocessing (Organizes, merges files), Text Extraction (Converts PDF to text), Prompt Engineering (Designs LLM input), LLM (OpenAI o4-mini) (Extracts structured data), Response Parsing (Processes LLM output), and Structured Dataset (CSV) (Final tabular data), where "the pipeline processes inconsistent PDF reports using LLM-based extraction to generate a structured CSV dataset".
The pipeline utilizes multi-stage PDF-to-text conversion and chain-of-thought prompting with OpenAI's o4-mini model to achieve high extraction accuracy.
This framework provides a scalable method for unlocking structured environmental data from historical scanned documents across various scientific domains.

1st May 2025

Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines

Agentic Pipeline Framework: introduces an agentic pipeline framework for a perceptive task guidance system, comprising Perceptors (Perceive data (visual/language)), Planners (Decompose tasks/sequence agents), Action agents (Process data/generate response/verify), Tools (External utilities), and Memory/Context (Stores documents/past logs).
The framework processes user input through a sequence of specialized agents, including Lead planner (Creates agent pipeline plan), Query planner (Assesses query/routes flow), Answer planner (Decides answerability/invokes generator), and various Action agents like RAG (Retrieves/summarizes documents) and Safety Agent (Filters inappropriate responses).
The paper evaluates Chain-of-Thought reasoning within this pipeline, finding that it does not improve output quality or provide effective explainability for end users in the context of task guidance.

From Texts to Shields: Convergence of Large Language Models and Cybersecurity

LLM and Agent Applications in Cybersecurity: reports on the convergence of large language models and cybersecurity, exploring emerging applications and challenges of integrating LLM Agent (dynamic reasoning engine), Meta Agent (agent of agents), RAG (retrieval-augmented generation), and Human-in-the-loop (human oversight) approaches.
The report examines LLM applications in network security, generative security engineering, and socio-technical aspects, including interpretability, safety, and security challenges.
It outlines a forward-looking research agenda for the secure and effective adoption of LLMs in cybersecurity, integrating technical advances with organizational and societal considerations.

HMCF: A Human-in-the-loop Multi-Robot Collaboration Framework Based on Large Language Models

HMCF (Human-in-the-loop Multi-Robot Collaboration Framework): introduces a framework for multi-robot collaboration with Assistant LLM agent, Robot LLM agents, Human-in-the-loop mechanism, Heterogeneous Robots, Human-Robot Interaction Interface, and RAG (Retrieval Augmented Generation), enabling efficient and scalable task allocation and execution.
The framework integrates LLM-based reasoning for task allocation and verification with human oversight to enhance adaptability, safety, and robustness in diverse environments.
HMCF utilizes a web-based interface for natural language interaction, allowing users to configure robots, monitor operations, and intervene when necessary.

Reasoning Capabilities and Invariability of Large Language Models

Large Language Models: introduces an evaluation of LLMs' reasoning capabilities using various prompting techniques and a new benchmark dataset focused on shallow logical reasoning with geometric figures.
The evaluation assesses 24 different LLMs using zero-shot, few-shot, and chain-of-thought prompting on a dataset designed to test logical constructors and invariability to language variations.
Results indicate that while larger LLMs perform better in zero-shot settings, overall performance on shallow reasoning remains limited, and model behavior is largely invariant to small language variations.

Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions

Memory Framework: introduces, "a structured and dynamic perspective on memory in AI systems", with Parametric Memory (Implicit model knowledge), Contextual Structured Memory (Explicit organized memory), Contextual Unstructured Memory (Explicit general memory), Consolidation (Integrate short-term into persistent), Updating (Modify memory), Indexing (Organize for retrieval), Forgetting (Remove irrelevant content), Retrieval (Access relevant information), and Compression (Reduce memory size), clarifying functional interplay in LLM-based agents.
The framework categorizes memory by representation type and defines fundamental operations for memory management and utilization.
This survey maps these components and operations to relevant research topics and outlines future directions for memory in AI.

USERCENTRIX: AN AGENTIC MEMORY-AUGMENTED AI FRAMEWORK FOR SMART SPACES

UserCentrix: introduces an agentic memory-augmented AI framework for smart spaces, with User Task Processing (User-side layer), Personal Agent (LLM-powered assistant), Personal Memory (Stores user history/preferences), Knowledge Retrieval Cycle (Memory recall/similarity assessment), Smart Building Side (Building-side layer), Decision-making Module (High-level agents), Classifier Agent (Determines task urgency), High-urgency Agent (Prioritizes speed), Low-urgency Agent (Prioritizes precision/generates solutions), Evaluator Agent (Assesses/selects solutions), Pareto Analyzer (Optimizes decision-making), Memory (Decision-making Module) (Stores solutions/tasks), Sub-tasks Execution Module (Low-level agents), Low-level Agents (Execute sub-tasks/generate commands), Management and Analysis Module (Manages/dispatches commands), Message Queue (Stores commands), Environment Agent (Tracks tasks/adjusts environment), and Smart Building Dataset (Data source), designed to enhance smart spaces through dynamic, context-aware decision-making.
The framework integrates personalized LLM agents leveraging user preferences and memory management with a hybrid hierarchical control system balancing centralized and distributed processing.
UserCentrix achieves resource-efficient AI interactions by embedding memory-augmented reasoning, cooperative agent negotiation, and adaptive orchestration strategies.

A Survey on Large Language Model based Human-Agent Systems

LLM-HAS (LLM-based Human-Agent Systems): introduces a structured survey of these systems, detailing core components including Environment & Profiling (Context, roles, goals, capabilities), Human Feedback (Types, granularity, timing), Interaction Types (Collaboration, competition, coopetition), Orchestration Paradigm (Task strategy, temporal synchronization), and Communication (Structure, mode).
The survey clarifies fundamental concepts and systematically presents these core components shaping human-agent systems.
It explores emerging applications, discusses unique challenges, and offers a structured overview to foster research in this interdisciplinary field.

Large Language Models as AI Agents for Digital Atoms and Molecules: Catalyzing a New Era in Computational Biophysics

ADAM (Agent for Digital Atoms and Molecules): introduces a multi-agent framework for computational biophysics, featuring a Plan Agent, Route Agent, Hybrid Neural-Symbolic Architecture, Neural Tools, Symbolic Tools, ADAM Tool Protocol (ATP), ATP Server, Distributed Tool Executors, Central Database, and Memory.
The framework employs a hybrid neural-symbolic architecture combining LLM-driven semantic tools with deterministic symbolic computations for scientific workflows.
Its ADAM Tool Protocol enables asynchronous, database-centric tool orchestration and community-driven extensibility for third-party tool integration.

Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

Traj-Bootstrap: introduces a method for LLM agents to improve performance on sequential decision-making tasks by constructing and refining a Trajectory Database of self-generated successful experiences, used by a ReAct-style Agent via a Retrieval Mechanism interacting with an Environment.
The approach includes Traj-Bootstrap for naive accumulation, +DB-Selection for population-based database selection, and +Exemplar-Selection for selecting high-utility individual trajectories.
These methods enable autonomous agent self-improvement without task-specific knowledge engineering, achieving performance comparable to methods using multiple test attempts or stronger LLMs.

30th April 2025

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Automated Failure Attribution: introduces methods (All-at-once Method, Step-by-step Method, Binary Search Method, LLM Judge, Failure Logs, Query, Failure-Responsible Agent, Decisive Error Step) for identifying the agent and step responsible for task failures in LLM multi-agent systems using failure logs.
The paper evaluates three LLM-based methods: All-at-once processes the full log, Step-by-step processes incrementally, and Binary Search processes log segments.
The LLM Judge analyzes the query and failure logs to predict the failure-responsible agent and decisive error step.

CoordField: Coordination Field for Agentic UAV Task Allocation In Low-altitude Urban Scenarios

CoordField: introduces a coordination field agentic system for UAV swarm task allocation, with a Semantic Understanding Module (Interprets natural language), LLM (Parses instructions), Planning Module (Transforms tasks), Planning Agent (Aggregates results), Coordination field (Guides motion, task selection), Perception Mapping (Constructs potential field), Task Decomposition (Converts potential field), Task Assignment (Enhances coordination efficiency), Execution Module (Translates outputs), Execution Agent (Manages control commands), UAV Deployment (Physical or virtual), and Prompt Tools API (Communicates with control), designed for heterogeneous UAV swarms in urban environments.
The system leverages LLMs for high-precision task understanding and employs a coordination field control strategy for task-oriented autonomous navigation and collective coordination.
CoordField utilizes dynamically updated potential fields and fluid-based velocity fields to enable decentralized and adaptive allocation of emergent tasks, demonstrating superior performance in task coverage, response time, and adaptability.

TRUST: An LLM-Based Dialogue System for Trauma Understanding and Structured Assessments

TRUST (TRauma Understanding and Structured Assessments): introduces, "an LLM-based dialogue system for trauma understanding and structured assessments," with Database (stores system memory), Framework (manages dialogue and assessment), Conversation (manages dialogue flow), Assessment (manages assessment logic), LLM (powers conversation and assessment), Dialogue Act Schema (guides conversation), and Patient Simulation (evaluates system), designed to conduct formal diagnostic interviews for PTSD.
The Database module contains Variable, History, and Score components to store variable metadata, conversation history, and assessment outcomes, respectively.
The Framework's Conversation and Assessment submodules utilize an LLM for tasks like predicting dialogue acts, generating responses, and performing assessments, while the Dialogue Act Schema provides structured guidance, and Patient Simulation uses an LLM and real-life transcripts for robust evaluation.

LLM-based Interactive Imitation Learning for Robotic Manipulation

LLM-iTeach: introduces a novel interactive imitation learning framework utilizing an LLM as a teacher for robotic manipulation, featuring an LLM Teacher, Agent, CodePolicy, Hierarchical Prompting, Similarity-checking mechanism, Evaluative feedback, Corrective feedback, Image, Robot State, Convolutional Layers, LSTM, Gauss Distribution, and Action.
The framework employs hierarchical prompting to generate a CodePolicy from the LLM, which then provides feedback based on a similarity check between the agent's action and the LLM's action.
The agent learns a stochastic policy parameterized by a Gaussian distribution, processing image and robot state inputs through convolutional layers and an LSTM to determine actions.

LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics

Agent Orchestrator: introduces an LLM-driven agent-orchestration architecture for embodied robots, with Agent Orchestrator (coordinates specialized agents), Routing Agent (analyzes and directs user requests), Task Planning Agent (handles action commands), Knowledge Base Agent (processes history queries), Memory (stores past actions and environment records), and Perception (provides object detection and scene understanding) components, enabling autonomous household object management.
The system integrates memory-augmented task planning using RAG for long-term object tracking and utilizes specialized agents powered by task-specific LLMs.
Perception components like Grounded SAM and LLaMa3.2-Vision facilitate robust object detection and semantic scene understanding for task planning.

UAV-VLN: End-to-End Vision Language guided Navigation for UAVs

UAV-VLN: introduces, "LLM (Interprets instructions, generates sub-goals) / Automated Task Planner (Maps sub-goals to actions) / Visual Input (UAV camera feed) / Vision Model (Detects objects, Grounding DINO) / Cross-modal Grounding Module (Aligns language and visuals) / Control Pipeline (Executes plans, ROS 2) / UAV (Executes plan, provides visual input)", a novel end-to-end vision-language navigation framework for UAVs that interprets natural language instructions and plans aerial trajectories.
The framework leverages a fine-tuned LLM for semantic parsing, a vision model (Grounding DINO) for scene understanding, and a cross-modal grounding module to align linguistic intent with visual context.
An Automated Task Planner maps high-level sub-goals from the LLM to low-level control commands executed via a ROS 2 pipeline on the UAV.

Meeseeks: An Iterative Benchmark Evaluating LLMs Multi-Turn Instruction-Following Ability

Meeseeks: introduces a multi-round automatic instruction-following benchmark with a hierarchical taxonomy, simulating human-LLM interaction through an iterative feedback process for evaluating LLMs' instruction-following ability.
The benchmark employs an evaluation system with capability tags across three dimensions, using LLM-based extractors and evaluators alongside rule-based checks.
Meeseeks utilizes data parameterization for flexible dataset generation and provides metrics like Utility Rate and Meeseeks Score to quantify performance and self-correction capabilities.

Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming

LPFG (Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming): introduces a framework for unsupervised feature transformation using a Critic Agent (diagnoses data, provides advice), a Generator Agent (generates features), and Iterative Refinement (feedback loop for improvement).
The Critic Agent provides semantic and distributional advice to guide the Generator Agent in producing tokenized feature transformations.
The iterative feedback loop between the agents refines the generated features for improved structural integrity, predictive utility, and format compatibility.

Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA

Discuss-RAG: introduces an agent-led framework for medical QA RAG systems, featuring a multi-turn discussion and summarization module with a Recruiter R, Medical Team (Agents Hi), and Summarizer C generating a Distilled summary D, followed by a post-retrieval verification module where a Decision maker U evaluates Snippets S from Trivial RAG before LLMs generate the final Answer A.
The multi-turn discussion simulates expert brainstorming via iterative Insights I and Output summary T, enriching context for retrieval.
The post-retrieval verification step filters retrieved content using a Decision maker U and Verifier V, triggering an Alternative retrieval strategy if necessary, to improve answer accuracy and reliability.

29th April 2025

SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories

SECREPOBENCH: introduces a benchmark construction framework that takes GitHub Projects, OSS-Fuzz Reports, and ARVO Dataset as inputs, uses a Task Constructor (Patch Locator, Mask Generator, Write Description, Code Mutator) to create repository-level code generation tasks, employs a Test Constructor (Unit Test Finder, Security Test Case Finder) to generate correctness and security tests, and outputs the task and tests.
The framework focuses on generating secure code completion tasks within real-world C/C++ repositories by leveraging known security vulnerabilities and developer-written tests.
The benchmark evaluates LLMs on their ability to generate correct and secure code in a repository context, which is shown to be more challenging than generating self-contained programs.

AI-in-the-Loop Planning for Transportation Electrification: Case Studies from Austin, Texas

Urban Planning AI: introduces an AI-in-the-Loop framework for transportation electrification planning, integrating Planner, Urban AI, GeoAI, GenAI, LLMs, AI Agent, Automated System, UI, Community, and Feedback Loop components.
The framework utilizes GeoAI for site suitability analysis, GenAI for estimations and visualizations, and LLMs for scenario simulations and chatbot interactions.
Human planners and community feedback are crucial for providing oversight, auditing AI outputs, and ensuring accountable and equitable planning decisions.

LLM Enhancer: Merged Approach using Vector Embedding for Reducing Large Language Model Hallucinations with External Knowledge

LLM-ENHANCER: introduces a system that enhances open-source LLMs using a User (Provides input), LangChain (Framework), ZeroShot React Agent (Selects tools), Agent Executor (Executes actions), Merged Tool (Combines online sources), Calculator (Performs calculations), Merging Data (Combines source data), Splitter (Divides data into chunks), Embeddings (Creates vector representations), ChromaDB database (Stores vector embeddings), Relevant chunks (Retrieved information), and Mistral 7B (Opensource LLM) (Generates response) to reduce hallucinations by integrating external knowledge.
The system uses agents to gather data from multiple online sources in parallel, merges it, and processes it via vector embeddings to find relevant information for the LLM.
This approach aims to provide accurate, up-to-date information to the LLM without extensive fine-tuning, mitigating issues with outdated training data and hallucinations.

Toward Efficient Exploration by Large Language Model Agents

LLM-based PSRL: introduces an implementation of the Posterior Sampling for Reinforcement Learning algorithm using three distinct LLMs: an approximate posterior updater LLM, a posterior sampler LLM, and an optimal policy LLM.
This approach explicitly implements an existing RL algorithm by outsourcing individual steps to distinct LLMs, contrasting with methods that implicitly induce RL behavior.
The framework aims to leverage the exploration properties of PSRL in natural language environments by using LLMs for key functions like updating beliefs, sampling models, and determining optimal actions.

AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security

AegisLLM (Adaptive Agentic Guardrails for LLM Security): introduces a cooperative multi-agent defense system, with Orchestrator (Routes queries based security), Deflector (Handles unsafe inputs, issues refusal), Responder (Generates outputs for safe queries), and Evaluator (Verifies safety of query/response), that ensures safe LLM outputs through a structured agent workflow.
The framework promotes LLM security via a cooperative, inference-time multi-agent system that continuously monitors, analyzes, and mitigates adversarial threats in real time.
AegisLLM leverages automated prompt optimization and Bayesian learning for continuous self-improvement without requiring model retraining, enabling real-time adaptability to evolving attacks.

Using LLMs in Generating Design Rationale for Software Architecture Decisions

LLM-based Agents: introduces a multi-agent system including Aspect_Identifier (Identifies relevant aspects), Information_Collector (Gathers background information), Aspect_Analyst (Analyzes aspects), Aspect_Reviewer (Reviews analysis results), and Trade-off_Analyst (Generates final DR), to generate design rationale for software architecture decisions.
The study evaluates this multi-agent approach against zero-shot and Chain-of-Thought prompting strategies using five different LLMs on a dataset of architecture problems and decisions from Stack Overflow and GitHub.
Evaluation metrics include Precision, Recall, F1-score, and a qualitative IHUM-category classification, comparing LLM-generated rationale to human expert rationale.

A Summary on GUI Agents with Foundation Models Enhanced by Reinforcement Learning

Multimodal LLM-based GUI Agent Architecture: introduces a modular architecture for GUI agents, with Perception (understand GUI), Planning (generate action plans), and Acting (execute actions) components, designed to autonomously interact with digital devices based on task instructions and screen state.
The Perception module extracts semantic information from the GUI, the Planning module translates this into action plans, and the Acting module converts plans into executable interface interactions.
The paper reviews the evolution of these modules, highlighting advancements in multimodal perception, dynamic planning, and adaptive action generation enhanced by reinforcement learning.

TAMO:Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data

TAMO: introduces a tool-assisted LLM agent framework for fine-grained root cause analysis, integrating domain-specific tools for data observation, root cause localization, and fault classification with an expert LLM agent.
The framework decouples the LLM from raw observational data by using specialized tools to process multimodal data and model dynamic dependencies, structuring results for LLM input.
The expert agent synthesizes tool outputs and system context to provide comprehensive fault analysis and remediation recommendations for site reliability engineers.

CRASHFIXER: A crash resolution agent for the Linux kernel

CRASHFIXER: introduces an LLM-based agent that resolves Linux kernel crashes by iteratively performing Hypothesis Generation (creates root cause hypotheses) with Self-Reflection (selects best hypothesis), Patch Generation (synthesizes candidate patches) with Compilation Check (filters uncompilable patches) and Self-Consistency (selects patch aligned hypothesis), and Iterative Debug (manages debug cycles/trees/forests), supported by the KGYMSUITE Platform (provides system/tooling support) including an Execution Trace System (collects/minimizes relevant traces), SUITECACHE (provides cached kernel builds), Fast Compilation Check Tool (quickly checks compilation), and Reproducer Run (executes crash-triggering input).
The agent emulates a kernel developer's workflow, leveraging execution logs and source code to diagnose issues and propose fixes.
KGYMSUITE enhances the KGYM platform to provide scalable and reproducible evaluation infrastructure for LLM-driven kernel debugging.

28th April 2025

Towards Automated Scoping of AI for Social Good Projects

PSA (Problem Scoping Agent): introduces an LLM-based pipeline for automated AI for Social Good project scoping, with Background Retrieval, Challenge Retrieval, Method Retrieval, Annotator, Verbalized Confidence, and Solution Generator components.
The framework leverages retrieval-augmented generation using external search APIs and an LLM to process information and generate project proposals.
The PSA aims to automate the labor-intensive problem scoping process by identifying relevant background, challenges, and methods to generate comprehensive proposals.

TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

TD-EVAL (Turn and Dialogue-level Evaluation): introduces a two-step evaluation framework, with Turn-Level Evaluation (Evaluates individual turns), TOD Agent Arena (Ranks full dialogues), and LLM Judge (Scores, compares responses/dialogues), designed for task-oriented dialogue systems.
The framework combines fine-grained turn-level analysis using an LLM judge with holistic dialogue-level comparisons via a pairwise ranking method.
TD-EVAL aims to identify subtle errors missed by traditional metrics and provide a more reliable, human-aligned assessment of conversational quality.

Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents

Agentic System Architecture: introduces a comprehensive threat model and mitigation framework for generative AI agents, detailing components like the Agent Brain, Memory Systems, and Tool Invocation Layer.
The architecture highlights how agent autonomy, persistent memory, complex reasoning, and tool integration create novel security risks.
The paper proposes the ATFAA threat model and SHIELD mitigation framework tailored to these unique agentic properties.

Can AI Agents Design and Implement Drug Discovery Pipelines?

Deep Thought agentic system: introduces DO Challenge, a benchmark for evaluating AI agents in drug discovery, and presents the Deep Thought multi-agent system designed to solve complex scientific tasks.
The DO Challenge benchmark requires agents to autonomously develop and execute strategies for identifying promising molecular structures from a large dataset under resource constraints.
The system, composed of heterogeneous LLM-based agents and computational tools, was evaluated on the benchmark, demonstrating competitive performance compared to human teams and domain experts.

LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

LLM-Powered GUI Agent: introduces an architecture for phone automation, with Intent Comprehension, Perception, Brain (Storage, Decision Making), and Action components, where Intent Comprehension maps user goals to UI operations.
The Perception component gathers UI Info and Phone State, providing input to the Brain for reasoning and decision-making.
The Action component executes decisions through Touch Interactions and Atomic Skills, enabling interaction with the mobile environment.

Prompt Injection Attack to Tool Selection in LLM Agents

ToolHijacker: introduces an automated framework for prompt injection attacks targeting LLM agent tool selection, utilizing a Shadow Framework (Simulates target system) with Shadow Task Descriptions (Attacker-generated tasks), Shadow Retriever (Attacker's retriever model), Shadow LLM (Attacker's LLM model), and Shadow Tool Library (Attacker's tool set) to craft a Malicious Tool Document (Injected attack document) comprising a Tool Name (Malicious tool identifier) and Tool Description (Malicious tool details).
The attack employs a Two-phase optimization (Optimizes retrieval, selection) strategy with Retrieval Objective (Maximize malicious tool retrieval) and Selection Objective (Maximize malicious tool selection), optimized using Gradient-Free Method (Optimizes without gradients) and Gradient-Based Method (Optimizes using gradients) which incorporates Alignment Loss (L1), Consistency Loss (L2), and Perplexity Loss (L3).
The paper evaluates the attack against standard Tool Selection components (Tool Library, Retriever, LLM Agent) and various defenses including prevention-based (StruQ, SecAlign) and detection-based (Known-answer detection, Perplexity detection, Perplexity windowed detection) methods, demonstrating the attack's effectiveness and the defenses' limitations.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

General AI Agent Framework: introduces a conceptual architecture with Thinking/Prompt, Strategy Development, Task, Self-Evaluation, Designated Function, Utility Functions/Knowledge Store, AI Query Engines, Knowledge Store, and Agent Execution Environment.
LangChain: presents an agent architecture including User, Agent (Chat Model, Scratchpad Prompting), Tools, and API for Bookings.
Agentic RAG (Retrieval-Augmented Generation): integrates LLM (Reasoning, Action) with Modular Toolkits, Reflection, Planning, Tool Utilization, Multi-agent Collaboration, User Interface, System Reply, Internal Knowledge Store, and Retrieval Utilities.

m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

m-KAILIN: introduces a knowledge-driven, multi-agent framework for distilling high-quality biomedical question-answering corpora, utilizing a Multi-Agent Collaborative Framework, Question Generation Agent, Context Retrieval Agent, Question Evaluation Agent, Answer Generation Agent, MeSH, Dense Passage Retrieval (DPR), BiomedBERT base encoder, Direct Preference Optimization (DPO), Preference Dataset, Training Corpus Dataset, and Target LLM.
The framework employs specialized agents guided by the MeSH hierarchy to extract, synthesize, and self-evaluate textual data from scientific literature, generating domain-specific question-answer pairs.
This automated pipeline produces high-quality, preference-based datasets for training biomedical LLMs, ensuring comprehensive coverage and consistency with biomedical ontologies.

Research CodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies

ResearchCodeAgent: introduces a novel multi-agent system leveraging LLMs to automate research methodology codification, including Planning (determines next action), Research Logs (records history/memory), Workers (execute actions), Environment (input files/context), Action Space (available actions), LLM Cascade (hierarchical LLMs for planning), and Programmatic Constructs (system aids/constraints).
The system bridges the gap between high-level research concepts and practical implementation by iteratively interacting with a research environment using a flexible agent architecture and dynamic planning.
ResearchCodeAgent demonstrates improved code quality, error reduction, and significant time savings compared to baseline methods, particularly for complex tasks.

AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers

AutoP2C (An LLM-Based Agent Framework): introduces "Paper-to-Code", a task transforming multimodal paper content into executable code repositories, with repository blueprint extraction, multimodal content parsing, hierarchical task decomposition, and iterative feedback-driven implementation components.
The framework analyzes existing codebases for structure, extracts and integrates text, images, and tables from papers using tools like MinerU, plans code generation hierarchically, and iteratively refines code through feedback.
AutoP2C, a multi-agent framework based on large language models, generates multi-file code repositories and explanatory diagrams, addressing challenges of multimodal input and structured code output.

Evolution of Cooperation in LLM-Agent Societies: A Preliminary Study Using Different Punishment Strategies

LLM-based Multi-Agent System Simulation: introduces a framework using LLM Agents (Agents powered by LLMs), Simulation Environment (Adapted Smallville world), Diner's Dilemma Process (Multi-stage agent interaction), Strategy Evolution (Pairwise imitation mechanism), and LLM Integration (API calls for decisions) to study the evolution of cooperation in agent societies.
The framework models a realistic n-player diner's dilemma where LLM agents make decisions, calculate payoffs, and update strategies based on punishment mechanisms and pairwise imitation.
Preliminary results suggest that LLM agents can replicate cooperation dynamics observed in abstract mathematical models, with punishment driving norm emergence.

An Automated Reinforcement Learning Reward Design Framework with Large Language Model for Cooperative Platoon Coordination

PCRD (Platoon coordination Reward Design): introduces an automated framework for designing RL reward functions for platoon coordination, utilizing an LLM, AIR module, Reward Function Pool, Platoon Coordination Environment, Parallel Training, Training Feedback, and EvoLeap module.
The framework automates reward function discovery through LLM-driven initialization and iterative optimization based on training feedback.
The AIR module analyzes environment code and task requirements, while the EvoLeap module evolves reward functions based on training results.

MemO: Building Production-Ready AI Agents with Scalable Long-Term Memory

MemO: introduces a scalable memory-centric architecture that dynamically extracts, consolidates, and retrieves salient information from ongoing conversations.
The system operates in extraction and update phases, using an LLM with a tool call interface to manage memories stored in a database.
An asynchronous summary generator maintains conversation context, while an enhanced variant, MemOº, uses graph-based memory for complex relationships.

27th April 2025

SAGA: A Security Architecture for Governing AI Agentic Systems

SAGA: introduces a security architecture for governing AI agentic systems, with User (Owner, manages agents), Agent (Autonomous entity, uses LLM), Provider (Central service, manages registries), LLM Backend (Agent core decision component), User Registry (Stores user identities), Agent Registry (Stores agent metadata), Access Contact Policy (User-defined agent rules), One-Time Key (OTK) (Ephemeral key for token), Access Control Token (ACT) (Limited communication token), Access Control Key (Long-term key for token), TLS Credentials (Secure communication), and Agent Metadata (Agent information), enabling user oversight and secure inter-agent communication.
The architecture utilizes a centralized Provider for registration and policy enforcement, while inter-agent communication occurs directly using cryptographic tokens derived from one-time keys.
SAGA balances security and performance by allowing users fine-grained control over agent interactions through access control policies and token granularity.

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

BrowseComp-ZH: introduces a high-difficulty benchmark for evaluating LLM web browsing in Chinese, built using Reverse Design (answer-first creation) by Expert Annotators (skilled data creators) through a Dataset Construction (topic/question design) process.
The benchmark features Multi-constraint Design (ensuring answer uniqueness), Non-trivial Retrieval Validation (checking search difficulty), and Evidence Traceability (providing source URLs), validated via a Two-stage Quality Control (rigorous data filtering) process involving Human-in-the-loop Validation (human oversight), AI Agent Verification (initial answer generation), and Manual Verification (human answer checking).
It evaluates various Benchmarked Models (evaluated LLMs/agents) using specific Grading (scoring model performance) procedures, revealing challenges in multi-hop retrieval and reasoning on the Chinese web.

ANDROIDGEN: Building an Android Language Agent under Data Scarcity

ANDROIDGEN: introduces a framework to enhance LLM-based Android agents under data scarcity, including ExpSearch (in-context learning from trajectories), ReflectPlan (self-reflection and plan update), AutoCheck (verifies agent operations validity), and StepCritic (evaluates trajectory step-by-step).
The framework leverages LLMs and its modules to generate high-quality browsing trajectories without manual annotation and train open-source mobile agents.
Evaluations demonstrate ANDROIDGEN's improvements in reasoning, operational accuracy, and generalization on various Android benchmarks.

APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries

APE-Bench I evaluation pipeline: introduces a system for evaluating LLMs on proof engineering tasks, with LLM, DiffRepair, Eleanstic, Lean compiler, and LLM-as-a-Judge components.
The pipeline uses LLMs to generate patches, normalizes them with DiffRepair, and verifies them syntactically via Eleanstic/Lean compiler and semantically via LLM-as-a-Judge.
This two-stage verification process assesses both code correctness and adherence to natural language instructions for realistic proof engineering tasks.

26th April 2025

Generative AI in Embodied Systems: System-Level Analysis of Performance, Efficiency and Scalability

Embodied AI Agent System: introduces a system-level analysis of generative AI-based embodied agents, categorizing them into paradigms and evaluating performance and efficiency across modules, agent scales, and tasks.
The paper identifies key building blocks including Sensing, Planning, Communication, Memory, Reflection, and Execution, analyzing their contribution to system latency and task success.
Analysis reveals LLM-based planning and communication are major latency bottlenecks, while memory, reflection, and execution modules are critical for task efficiency and success.

RESHAPING MOFS TEXT MINING WITH A DYNAMIC MULTI-AGENT FRAMEWORK OF LARGE LANGUAGE AGENTS

MOFh6: introduces a dynamic multi-agent framework of Large Language Agents, including crawler, parsing, comparison, resolution, and generation agents, for reshaping MOFs text mining.
The system leverages fine-tuned LLMs and specialized agents to extract precise MOF synthesis conditions and structural information from scientific literature.
MOFh6 provides an end-to-end intelligent interaction system supporting natural language queries, data analysis, and crystal structure visualization for streamlined MOF research.

Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation

SciTalk: introduces a multi-agent framework for generating scientific short-form videos, utilizing Preprocessing Stage (Prepare input materials), Planning Stage (Generate script structure), Editing Stage (Integrate visual elements), Feedback & Evaluation Stage (Assess refine video), Flashtalk Generator (Creates video script), Sceneplan Generator (Subdivides script scenes), Background Assistant (Selects background images), Text Assistant (Generates on-screen text), Effect Assistant (Applies visual effects), Layout Allocator (Determines visual positions), Feedback Agents (Review intermediate outputs), Reflection Agents (Integrate feedback prompts), Evaluation Agent (Assesses final video), Video editing library (Composes final video), Multi-modal LLM (Powers feedback evaluation), OpenAI/Synthesia APIs (Generate audio avatar), and MoviePy (Composites visual elements).
The framework incorporates an iterative feedback loop where agents evaluate generated content and refine prompts for subsequent iterations.
SciTalk grounds videos in source materials like text, figures, and screenshots to ensure factual accuracy in scientific video dissemination.

MATCHA: Can Multi-Agent Collaboration Build a Trustworthy Conversational Recommender?

MATCHA: introduces a multi-agent conversational recommendation framework, with Risk Control Module (filters harmful content), Candidate Generation Module (generates game candidates), Ranking Agent (ranks candidates), Reflection Agent (refines candidates), Explainability Module (generates explanations), Data Sources (game information), User Context (user preferences), and Tools (specialized functions), designed to provide trustworthy game recommendations.
The framework leverages specialized agents and large language models to handle complex user requests, enhance personalization, and ensure safety and transparency.
MATCHA demonstrates superior performance across multiple metrics compared to baselines, highlighting the benefits of multi-agent collaboration for conversational recommendation systems.

A Review of 3D Object Detection with Vision-Language Models

VLMs (Vision-Language Models): introduces a review of 3D object detection with VLMs, detailing the architecture including Image Encoder (processes visual inputs), Multimodal Projector (aligns visual and text), and Text Decoder (generates language output), and the 3D pipeline stages: 2D Object Proposals (initial 2D detection), Projection from 2D to 3D Space (maps 2D to 3D), Hierarchical Feature Alignment (aligns 2D and 3D features), and Refinement and Filtering (refines 3D detections).
This approach integrates visual perception with natural language understanding to enable semantic reasoning and open-vocabulary detection in 3D space.
The framework allows for flexible querying, zero-shot generalization, and instruction-based interaction, addressing limitations of traditional geometry-only methods.

MODP: Multi Objective Directional Prompting

MODP (Multi Objective Directional Prompting): introduces a framework for prompt engineering that treats it as a multi-objective optimization problem, incorporating Data (Input data for evaluation), Objectives (Task-specific and LLM-specific goals), Metrics (Quantifiable measures for objectives), Weights (Prioritization of objectives), Prompts (Instructions for the LLM), LLM (Large Language Model executing prompts), Evaluation (Process of scoring prompts), Human Feedback (Input for refinement), Iteration (Loop for prompt improvement), and Selection (Choosing the optimal prompt).
The framework systematically identifies and balances task-specific and LLM-specific objectives using a metrics-driven approach with weighted scoring and human feedback.
The iterative process refines prompts based on performance metrics across multiple objectives to develop robust and high-precision prompts.

25th April 2025

LLMpatronous: Harnessing the Power of LLMs For Vulnerability Detection

LLMpatronous: introduces an AI-driven approach for vulnerability detection, with RAG (Retrieval Augmented Generation), Vector Database (External knowledge base), MoA (Mixture of Agents), and LLM Agents (Multiple language models), designed to mitigate LLM limitations and improve reliability.
The approach combines external knowledge retrieval via RAG with collaborative analysis by multiple LLM agents within a MoA architecture to reduce false positives.
LLMpatronous leverages the collective reasoning power of multiple LLMs grounded by up-to-date vulnerability information from a vector database.

Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant

Workflow defined for the Auto-SLURP dataset: introduces a multi-agent architecture for smart personal assistants, including a User (initiates query), Workflow (orchestrates agents), Program Manager Agent (orchestrator, delegates tasks), Intent Agent (predicts intent, slots), Time Agent (formats time parameters), Location Agent (formats location parameters), Url Agent (selects URL), Request Agent (executes function call), and Simulated Servers / External Services (backend processes, APIs).
This architecture simulates end-to-end personal assistant interactions, evaluating language understanding, task execution, and response generation.
The Program Manager Agent orchestrates the user query flow through specialized agents and backend services to complete multi-step operations.

Evolution of AI in Education: Agentic Workflows

Agentic Workflows: introduces a review of AI agentic paradigms in education, including Reflection (evaluates past actions/outputs), Planning (decomposes goals into steps), Tool Use (leverages external resources/functions), and Multi-agent Collaboration (multiple agents work together), and presents a Multi-Agent Scoring System (MASS) proof-of-concept with a Supervisor Agent (delegates tasks in MASS), Subagent 1 (scores essay content in MASS), and Subagent 2 (scores essay language in MASS).
The paper examines how AI Agents, utilizing LLMs as their core reasoning engine, interact with an Environment to achieve goals through these paradigms.
The MASS system demonstrates the potential of multi-agent architectures for tasks like automated essay scoring, showing improved consistency over single LLM approaches.

Revisiting Data Auditing in Large Vision-Language Models

VLM Membership Inference (VLM MI): revisits data auditing in large vision-language models, with Vision Encoder, Projector, Language Model, Inner States, WiRED, Probing Methods, Bayes Optimality, Aggregation components, where the paper analyzes challenges and identifies feasible scenarios for membership inference on large vision-language models.
The study reveals distribution shifts in existing benchmarks, quantified by the WiRED metric, which inflate VLM MI performance.
Probing VLM inner states and estimating Bayes Optimality show low theoretical limits for MI under unbiased conditions, but fine-tuning, ground-truth text access, and aggregation improve feasibility.

Towards Adaptive Software Agents for Debugging

Adaptive Agents: introduces an adaptive agentic design for debugging, featuring a Main Agent that manages the process and dynamically creates Specialized Agents to perform specific tasks, with both components collaborating and reflecting iteratively.
The Main Agent analyzes buggy code, profiles and prioritizes necessary Specialized Agents, and validates their reports, deciding on further iterations if needed.
This adaptive approach dynamically adjusts the number and roles of agents based on problem complexity, improving bug fix rates and resource usage compared to static designs.

MAGI: Multi-Agent Guided Interview for Psychiatric Assessment

MAGI (Multi-Agent Guided Interview): introduces a framework that transforms the MINI interview into automatic computational workflows using coordinated multi-agent collaboration, including Navigation Agent (Governs interview flow), Question Agent (Generates questions), Judgment Agent (Validates responses), Diagnosis Agent (Synthesizes diagnosis), and PsyCoT (Reasoning paradigm).
The framework utilizes four specialized agents to dynamically navigate clinical logic and generate DSM-5 compliant conclusions through structured reasoning traces.
PsyCoT, the Psychometric Chain-of-Thought reasoning paradigm, enhances transparency by explicitly mapping symptoms to clinical criteria via intermediate psychiatric constructs.

Automating Function-Level TARA for Automotive Full-Lifecycle Security

DefenseWeaver: introduces a system for automating function-level Threat Analysis and Risk Assessment (TARA) using Automotive Configurations and Threat Scenarios as input, processed by Atomic Structure Representation (OpenXSAM++, Logical Path Extraction, Atom Construction), inferred attack methods via LLM Agent-based Attack Methods Inference (Sub-Tree Constructor, Attack Tree Assembler, Risk Assessor), adapted using LORA fine-tuning and RAG for Adaptation (LoRA, RAG, Expert-Curated TARA Reports, Accumulated TARA Reports), and outputting a TARA Report.
The system leverages a multi-agent LLM framework to dynamically generate detailed attack trees and risk evaluations from component-specific information, overcoming limitations of static threat libraries.
DefenseWeaver demonstrates adaptability to evolving threats and diverse standards through LoRA fine-tuning and RAG with expert-curated reports, validated across automotive, UAV, and marine systems.

MultiMind: Enhancing Werewolf Agents with Multimodal Reasoning and Theory of Mind

MultiMind: introduces, with Multimodal Perceiver, Reasoner, ToM Model, Planner, Monte Carlo Tree Search, and Actor (LLM) components, a framework enhancing LLM agents for social deduction games by integrating multimodal information and Theory of Mind reasoning.
The framework processes facial expressions, vocal tones, and verbal content to infer player beliefs and optimize communication strategies.
This approach enables agents to reason about how they are perceived by others and strategically minimize suspicion.

LLM Agent Swarm for Hypothesis-Driven Drug Discovery

PharmaSwarm: introduces a multi-agent framework including Orchestrator, Data & Knowledge Layer, Terrain2Drug Agent, Paper2Drug Agent, Market2Drug Agent, Shared Memory, Simulation Engine (PETS), Interpretable Binding Affinity Map (iBAM), Central Evaluator (TxGemma), and Output, designed for hypothesis-driven drug discovery.
The framework orchestrates specialized LLM agents that propose targets and compounds based on diverse biomedical data, which are then validated through simulation and evaluation.
An iterative workflow with feedback loops and shared memory enables continuous refinement of hypotheses and self-improvement of the system.

24th April 2025

Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning

MINDcraft: introduces a multi-agent LLM framework for embodied reasoning, with Server (launches/manages agents), Main agent loop (handles messages), Library (high-level actions/queries), and Layer (prompts/calls LLMs) components, designed to enable LLM agents to control characters and collaborate in Minecraft.
The framework supports agentic instruction following, self-guided play, collaboration, and communication in a grounded environment.
The paper also introduces MineCollab, a benchmark built on MINDcraft, featuring crafting, cooking, and construction tasks to test collaborative and embodied reasoning.

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

RAGEN (modular system for training and evaluating LLM agents): introduces StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, where the LLM interacts with an Env via Rollout to generate a Trajectory, using Reward Assignment and Advantage Estimation for Policy Optimization during Update.
The paper identifies instability patterns in multi-turn RL and proposes StarPO-S, a stabilized variant incorporating Trajectory Filtering, Critic, Decoupled Clipping, KL Term Removal, and Clip-Higher to improve training robustness.
RAGEN serves as a research infrastructure to study LLM agent training dynamics in multi-turn, stochastic Environments, revealing insights into gradient stability, rollout quality, and the need for meticulous reward design for reasoning emergence.

Toward Personalizing Quantum Computing Education: An Evolutionary LLM-Powered Approach

ITAS (Intelligent Teaching Assistant System): introduces a novel system for personalized quantum computing education, featuring a Lesson Planning Agent (Generates, revises lesson plans), Teaching Agent (Manages interaction, provides assistance), Knowledge Graph (Central persistent memory, state), Tag System (User intent, control, structured input), Video Player (Shows video lectures, tutorials), Code Editor (IDE) (Writing, executing quantum code), Chat Interface (CI) (Student-system interaction), and Lesson Presentation (Presents lesson plan steps).
The system employs a two-agent architecture coordinated by a central Knowledge Graph to provide context-aware tutoring and dynamically adapt lesson plans based on student interaction and explicit tag input.
The Tag System empowers users to guide the learning process and mitigate LLM hallucination by providing structured input, while the Knowledge Graph stores interaction data for future analysis and learning path enhancement.

Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents

GUI Agent: introduces LLM-powered GUI agents, with User Input (receives commands), GUI Agent (system), GUI Perception (analyzes UI), LLM Processing (interprets, plans), GUI Interaction (executes actions), where the paper examines their privacy and security risks and advocates for a human-centered evaluation framework.
The paper identifies key risks like amplified data leaks, diminished control, and insufficient guardrails, highlighting challenges in human-centered evaluation due to system complexity and user overtrust.
It advocates for integrating risk assessments, in-context consent, and embedding privacy into agent design and evaluation to ensure trustworthiness.

Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking

Framework: introduces a system simulating crowdsourced fact-checking with Generative Agents (autonomous entities) powered by LLMs (power agents) using a Dataset (statements, evidence), involving Data Preparation (tailor data) with Statements Selection (choose claims), Web Page List Creation (verify evidence links), Summary Generation (create evidence summaries), and Agent Profile (define agent attributes), followed by a Simulation Workflow (mimic fact-checking) where agents perform Single Statement Assessment (agent evaluates statement) including Evidence Selection (agent chooses evidence) and Questionnaire Completion (agent rates dimensions), instantiated using PyAutogen (instantiate agents).
The framework evaluates generative agents' performance against human crowds in truthfulness classification and consistency.
Generative agents demonstrate superior performance, higher internal consistency, and reduced bias compared to human evaluators.

Towards a HIPAA Compliant Agentic AI System in Healthcare

HIPAA Compliant Agentic AI Framework: introduces a system for securing autonomous workflows in healthcare, integrating dynamic Attribute-Based Access Control, hybrid PHI sanitization, and immutable audit trails via Client, EHR, Policy Enforcement Agent, Sanitization Agent, LLM API or On-Premise Model, Policy Decision Agent, Middleware Agent, Post-Inference Redaction Agent, Audit Agent, and Downstream Task components.
The framework enforces regulatory compliance through context-aware policy enforcement, pre- and post-inference PHI sanitization, and cryptographic audit trails.
This architecture aims to enable the responsible deployment of agentic AI systems in clinical settings by ensuring HIPAA compliance throughout data interactions.

Comprehend, Divide, and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning

HRLFS (Hierarchical Reinforcement Learning for Feature Selection): introduces a feature selection framework based on a comprehend-divide-and-conquer paradigm, utilizing Hybrid Feature State Extraction, Clustering, Agent Hierarchy Construction, Hierarchical Agents, Feature Subspace Exploration via an RL Loop with State, Action, Reward Estimation, Policy Network, Memory, Optimization Phase, and Actor-Critic.
The framework employs LLMs and GMM for comprehensive feature understanding, H-clustering for dividing features into groups, and a hierarchical multi-agent RL architecture for efficient subspace exploration.
HRLFS demonstrates improved performance and computational efficiency compared to single-agent and one-agent-per-feature RL methods by strategically managing feature selection through a hierarchical structure.

A RAG-BASED MULTI-AGENT LLM SYSTEM FOR NATURAL HAZARD RESILIENCE AND ADAPTATION

WildfireGPT (A RAG-Based Multi-Agent LLM System): introduces a retrieval-augmented generation (RAG)-based multi-agent LLM system to support natural hazard decision-making, including Task Orchestrator Agent, User Profile Agent, Planning Agent, Analyst Agent, LLM Agent, Evaluation Agent, Data Sources, Literature Search Dataset, Embedding Model, Vector Store, OpenAI Assistant API, Streamlit-based web app, Conversation History, Retrieved Context, and Prompt Augmentation components.
The system employs a user-centered, multi-agent design to deliver tailored risk insights by integrating diverse data and scientific literature through an RAG framework.
Evaluation across expert-led case studies demonstrates the system's effectiveness in providing accurate and contextually relevant information for decision support.

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

PaperCoder: introduces, with Planning (construct roadmap), Analyzing (interpret details), Coding (generate code), and Task-specialized LLM agents (instantiate phases), a multi-agent LLM framework transforming machine learning papers into functional code repositories.
The framework operates in three sequential stages: planning, analysis, and code generation, emulating a human software development workflow.
Task-specialized LLM agents instantiate each phase, collaborating effectively across the pipeline to produce modular, dependency-aware code.

23rd April 2025

A Survey of AI Agent Protocols

AI Agent Protocols: introduces a systematic classification and analysis of existing communication protocols for LLM agents, detailing their core architecture including Foundation Model, Memory Systems, Planning, Tool-Using, and Action Execution components.
The survey categorizes protocols into context-oriented (e.g., MCP with Host/Client/Server/Resource) and inter-agent (e.g., A2A with Agent Card/Task, ANP with Identity Layer/Meta-Protocol Layer/Application Protocol Layer, Agora with Protocol Documents, Agent Protocol with Runs/Threads/Store).
It evaluates protocols based on dimensions like efficiency, scalability, security, reliability, extensibility, operability, and interoperability, providing insights for designing robust communication infrastructures for intelligent agents.

Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments

Meta-Judge Selection Framework: introduces a three-stage pipeline including prompt design, meta-judge score calculation with a multi-agent module, and score-based selection.
The framework utilizes a refined rubric and multiple LLM agents to evaluate raw LLM judgments, aggregating scores through methods like majority voting or weighted averaging.
A threshold is applied to the final meta-judge score to select trustworthy judgments, aiming to improve precision compared to single-agent or raw judgments.

OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents

OptimAI: introduces a framework for solving optimization problems from natural language, with Formulator (Translates natural language), Planner (Proposes solution strategies), Coder (Generates solver code), and Code Critic (Performs reflective debugging) components.
The framework translates natural language into mathematical formulations, plans solution strategies, generates executable code, and refines code through debugging.
OptimAI employs a multi-agent architecture and uses UCB-based debug scheduling to dynamically switch between alternative plans during debugging.

Do Large Language Models know who did what to whom?

Large Language Models (LLMs): investigates whether pre-trained LLMs, including BERT, GPT2-Small, Llama 2, and Persimmon, capture thematic roles by analyzing their Hidden Units and Attention Heads.
The study uses representational similarity analysis and SVM classification on internal representations to assess thematic role encoding.
Findings indicate thematic role information is weakly represented in hidden units but reliably available in attention heads, differing from human judgments.

MONTE CARLO PLANNING WITH LARGE LANGUAGE MODEL FOR TEXT-BASED GAME AGENTS

MC-DML (Monte Carlo planning with Dynamic Memory-guided Large language model): introduces a text-based game agent that combines MCTS (Monte Carlo Tree Search) with an LLM (Large Language Model) guided by a Dynamic Memory Mechanism (integrates past experiences) using In-Trial Memory (current trajectory history) and Cross-Trial Memory (reflections from failures) for action selection via PUCT (action selection formula).
The LLM serves as the initial policy and dynamically adjusts action evaluations during planning based on the integrated memory mechanisms.
This approach enhances action exploration and performance in complex text-based games by enabling the agent to learn from past experiences.

IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery

IRIS (Interactive Research Ideation System): introduces a human-in-the-loop platform for scientific ideation, featuring an Ideation Agent, Review Agent, and Retrieval Agent, guided by Monte Carlo Tree Search for iterative idea exploration.
The system allows researchers to refine research briefs through fine-grained feedback and targeted literature retrieval, balancing human control with automation.
MCTS enables systematic exploration of the idea space, while the Review Agent provides feedback based on a hierarchical taxonomy to mitigate issues like "reward hacking".

Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution

GoalAct: introduces a novel agent framework with Global Planning (Continuously updated task plan) and Hierarchical Execution (Decomposes task into skills), interacting with User Query (Initial task input), Historical Record (Past steps actions observations), and Environment (External interaction space).
The framework uses continuously updated global planning to maintain long-term goals and ensure plan feasibility based on real-time feedback.
Hierarchical execution decomposes tasks into high-level skills like searching, coding, and writing, enhancing adaptability and reducing planning complexity.

Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate

Structured Prompt Rewriting Framework: introduces a method to amplify jailbreak attacks on Multi-Agent Debate systems, with Narrative Encapsulation, Role-Driven Escalation, Iterative Refinement, and Rhetorical Obfuscation components.
This framework embeds malicious queries in scenarios, exploits agent roles, refines content iteratively, and uses obfuscating language to bypass safety filters.
The method significantly increases harmfulness and attack success rates against various MAD frameworks and underlying LLMs.

Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation

Less is More: introduces a structured multi-agent reasoning framework, with Prompt Induction (Derives task prompts), Retrieval-Augmented In-Context Learning (Retrieves context examples), Reasoning Synthesis (Generates structured data), Dual-Stage Filtering (Filters synthesized data), Reward Model (Scores data quality), Distilled Datasets (Filtered training data), Supervised Fine-Tuning (Trains task models), Meta-Llama-3-8B-Instruct (Base language model), and Inference Agents (Task-specific fine-tuned models), designed to enhance structured multi-agent reasoning under low-resource conditions via quality-guided distillation.
The framework generates high-quality training data from minimal labeled examples using prompt induction, retrieval-augmented synthesis, and dual-stage filtering based on structural validity and reward scores.
Task-specific agents for question parsing, CoT parsing, and verification are fine-tuned on the distilled data, enabling modular and interpretable reasoning.

ClarifyCoder: Clarification-Aware Fine-Tuning for Programmatic Problem Solving](http://arxiv.org/abs/2504.16331v1)

ClarifyCoder: introduces a novel framework for enhancing code LLMs, utilizing a Data Synthesis Technique (Generates ambiguous problems/questions) to create Clarify-Aware Synthetic Data (Dataset for clarification training) for Targeted Instruction Tuning (Fine-tunes LLM for clarification) of a Pre-trained LLM (Base language model) to produce a ClarifyCoder Model (Fine-tuned clarification-aware LLM).
The Data Synthesis Technique automatically generates ambiguous problem descriptions and corresponding clarifying questions to train models to recognize and query uncertainties.
Targeted Instruction Tuning combines synthetic data with standard data to enable the ClarifyCoder Model to prioritize clarification over immediate code generation when faced with ambiguity.

22nd April 2025

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

Full Stack LLM (Agent) Safety: introduces a comprehensive survey on LLM and LLM-agent safety across their lifecycle, including Data (Data collection, synthesis), Pre-training (Data cleaning, enhancement), Post-training (Model adaptation, safety correction), Editing & Unlearning (Knowledge update, removal), LLM (Large Language Model backbone), Agent Modules (Agent capabilities, interaction), Environment (Agent operating context), Multi-agent Systems (Interacting agent entities), Evaluation (Safety, utility assessment), Attacks (Adversarial threats), and Defenses (Mitigation strategies).
The survey systematically examines safety issues from data preparation through deployment, covering attacks, defenses, and evaluation methods at each stage.
It highlights the unique challenges and research directions for LLM-based agents, emphasizing the security of external modules like tools and memory.

MR. Video: “MapReduce” is the Principle for Long Video Understanding

MR. Video: introduces a MapReduce principle for long video understanding, employing Captioning, Intention Analysis, and Goal-Aware Analysis stages, each with Map and Reduce steps, utilizing VLM (video perception model) and LLM (language reasoning model).
The framework performs sequence-parallel perception of short video segments in the Map steps and aggregates information for global comprehension in the Reduce steps.
This approach demonstrates significant accuracy improvement on challenging long video benchmarks compared to existing methods.

LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

RLFT (Reinforcement Learning Fine Tuning): fine-tunes a Pre-trained LLM (generates output tokens) using Reward (feedback from environment/shaping) from the Environment (provides states/rewards), storing data in a Buffer (stores interaction data), processing Input Template (structures input context) to produce Output (generated tokens (CoT + action)), and applying Update (policy optimization step).
The approach leverages self-generated Chain-of-Thought rationales to iteratively refine the LLM's reasoning process towards higher rewards in decision-making scenarios.
Experiments demonstrate that RLFT mitigates prevalent LLM failure modes like greediness and frequency bias, improving exploration and reducing the knowing-doing gap.

Towards Test Generation from Task Description for Mobile Testing with Multi-modal Reasoning

VISIDROID: introduces a multi-modal framework for mobile test generation, with Task Goal (Natural language task description), LLM Action Selector (Decides next action), Executor (Executes action on app), Screenshot (Captures GUI image), LMM Verifier (Checks task completion), Sequence of Actions (Generated action steps), Sequence Ranking (Ranks action sequences), Test Script Generator (Creates test script), Observer (Detects UI changes), UI Changes (Changes in GUI), Task Memory (Short-term context), Persistent Memory (Long-term experience), and LLM Reflector (Generates rules/steps).
The framework iteratively determines the next action using LLMs and leverages visual images of screens via a multi-modal verifier to detect task completeness.
It combines short-term task memory and long-term persistent memory to enhance decision-making and learn from past interactions.

A closer look at how large language models “trust” humans: patterns and biases

Experimental Framework: introduces, "a study on LLM implicit trust in humans", with LLMs (Agents studied), Simulated Scenarios (Contexts for trust), Prompting Procedure (Elicits LLM responses), Trustee Attributes (Manipulated input variables), Trust Measurement (Quantifies LLM trust), Analysis (Statistical evaluation), Simulation Environment (Experiment execution), and Data Storage (Results and code), where "the framework investigates how LLMs' trust in humans is influenced by perceived trustworthiness and demographic factors across various scenarios."
The study demonstrates that LLMs exhibit implicit trust behaviors sensitive to trustworthiness and demographics, showing both human-like patterns and model-specific variations and biases.
Understanding these LLM trust dynamics is crucial for integrating AI agents into sensitive decision-making processes and mitigating potential biases.

WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents

WALL-E 2.0 (World Alignment by NeuroSymbolic Learning): introduces a training-free approach to align LLMs with environment dynamics, including Model-Predictive Control (Controls agent decisions), Agent Model (LLM) (Plans agent actions), World Model (LLM) (Predicts environment outcomes), World Model (Code Rules) (Verifies LLM predictions), NeuroSymbolic Learning (Learns symbolic knowledge), Symbolic Knowledge (Action Rules) (Captures action constraints), Symbolic Knowledge (Knowledge Graph) (Represents feasibility constraints), Symbolic Knowledge (Scene Graph) (Provides global scene info), Code Rules (Executable symbolic knowledge), Pruning (Selects impactful code rules), and Environment (Agent interaction space).
The framework iteratively learns symbolic knowledge from trajectories, translates it into executable code rules, and uses these rules to align the LLM world model's predictions with the environment.
This neurosymbolic world model enables the LLM agent to perform efficient and reliable planning through a model-predictive control loop, significantly improving performance in open-world environments.

IMPLEMENTING RATIONAL CHOICE FUNCTIONS WITH LLMS AND MEASURING THEIR ALIGNMENT WITH USER PREFERENCES

Proposed Methods: introduces design principles for implementing rational choice functions using LLMs, including Pairwise-Score (Scores alternatives from pairwise LLM comparisons) and Pairwise-SCC (Uses SCCs from pairwise LLM comparisons), and provides metrics Strict Preference Overlap (SPO) (Measures partial alignment) and Kendall distance with penalty (K(p)) (Measures full alignment) to measure alignment with user preferences, encompassing strict preferences and indifference.
The framework addresses the challenge of aligning LLM-based decision-making in intelligent user interfaces with user preferences, which is crucial for reliability and trustworthiness.
Empirical validation in an automotive domain use case demonstrates the applicability of the proposed principles and metrics, highlighting their distinct strengths for achieving partial or full alignment.

DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models

DianJin-R1: introduces a reasoning-augmented framework for financial reasoning, utilizing a Base Language Model, Supervised Fine-Tuning Module, and Reinforcement Learning Module.
The framework enhances reasoning by training on specialized data and refining performance with a Reward Module during reinforcement learning.
The resulting DianJin-R1 Model demonstrates improved performance on complex financial reasoning tasks.

A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models

Multi-Agent Framework: introduces a novel framework for automated Qinqiang opera script and performance generation, including Agent1 (Script Generation), Agent2 (Visual Content Generation), and Agent3 (Speech Synthesis).
The framework integrates LLMs for scriptwriting, visual generation models for scene creation, and TTS synthesis for vocal performance.
This multi-agent approach streamlines the production pipeline, achieving high expert ratings for script fidelity, visual coherence, and speech accuracy.

A Framework for Testing and Adapting REST APIs as LLM Tools

Framework for Tool Testing in Agentic Flows: introduces a novel framework for evaluating and enhancing the readiness of REST APIs to function as tools for LLM-based agents, utilizing Tool Builder, API to Tool Conversion, Tools Catalog, API test case generation, LLM based NL test case generation, NL test cases execution, API test cases execution, Agentic Framework Setup, Agentic Framework, Tool Evaluation and Error Analysis, NL Test cases execution report, and API Test Cases execution report components.
The framework transforms APIs into tools, generates comprehensive test cases, translates them into natural language instructions for agents, enriches tool definitions, and evaluates the agent's ability to correctly invoke APIs and process responses.
The work analyzes test case outcomes and presents an error taxonomy to provide actionable insights for improving tool definitions and integrations for agent-based applications.

21st April 2025

A SELF-IMPROVING CODING AGENT

SICA (Self-Improving Coding Agent): introduces, "a self-improving coding agent capable of editing its own codebase", with Agent (LLM wrapper taking actions), Base Agent (initial self-improvement agent), Meta-Agent (agent performing improvement), Archive (stores past agents/results), Evaluation Benchmarks (tasks measure performance), Utility Function (selects best agent), Tools (basic agent actions), Sub-Agents (specialized task handlers), Asynchronous Overseer (monitors agent behavior), LLM Context Window (LLM input structure), LLM Context Window System Prompt (agent setup instructions), LLM Context Window Core Prompt (problem and file context), LLM Context Window Assistant Messages (agent interaction history), Callgraph (agent execution tree), and Event Stream (detailed interaction log), where "SICA is designed to autonomously improve performance on coding tasks by modifying its own code".
The system operates via a meta-agent loop, where the best performing agent from an archive is selected to improve the current agent based on benchmark results.
Key components include a structured LLM context window, various tools for file manipulation and execution, specialized sub-agents for task decomposition, and an asynchronous overseer for monitoring and intervention.

In-context Ranking Preference Optimization

IRPO (In-context Ranking Preference Optimization): introduces a novel framework that directly optimizes LLMs based on ranking lists constructed during inference, incorporating graded relevance and positional importance within a differentiable objective.
The framework extends Direct Preference Optimization (DPO) to handle sparse, in-context ranking feedback by modeling positional preferences and aggregating them into a list preference model.
IRPO's optimization is linked to importance sampling gradient estimation, providing theoretical insights into its adaptive prioritization mechanism and efficiency.

Agent for User: Testing Multi-User Interactive Features in TikTok

Multi-agent LLMs framework: introduces an automated approach for testing multi-user interactive features in apps like TikTok, utilizing a Virtual Device Farm for device allocation and LLM-driven User Agents for task automation based on Task Description, Action Space, and GUI Screen Representation, executing actions via ADB.
The framework breaks down multi-user tasks into subtasks via Task Assignment, enabling collaborative simulation by multiple User Agents on allocated virtual devices.
This approach aims to overcome challenges in testing multi-user features by mimicking human-like interaction and coordination across multiple devices.

LLM-Assisted Translation of Legacy FORTRAN Codes to C++: A Cross-Platform Study

LLM-Assisted Translation Evaluation Workflow: introduces a process for evaluating LLM-based Fortran to C++ code translation, including Fortran Code (Input code), Prompt (Translation instructions), Prompt Builder (Combines code and prompt), LLM (Translates code), Translated C++ Code (LLM output), Ground Truth C++ (Human reference), CodeBLEU Computation (Code similarity metric), C++ Compilation (Checks for errors), C++ Execution (Runs compiled code), Output Comparison (Compares program outputs), and Evaluation Recording (Stores results).
The workflow evaluates translation quality by comparing LLM output to human ground truth, checking compilation success, and comparing the output of compiled translated code to the original Fortran code's output.
This platform-independent workflow aims to provide standardized evaluation measures for machine-generated code translation across different LLMs and computational platforms.

Interpretable Locomotion Prediction in Construction Using a Memory-Driven LLM Agent With Chain-of-Thought Reasoning

Locomotion Prediction Agent: introduces a system for predicting user locomotion modes in construction environments, comprising a Perception Module, Short-Term Memory (STM), Long-Term Memory (LTM), Refinement Module, and a Large Language Model (LLM).
The agent utilizes multimodal inputs, including spoken commands and visual data from smart glasses, processed by the Perception Module.
Memory systems (STM and LTM) provide context for prediction and refinement, enhancing accuracy and reliability, particularly for ambiguous or safety-critical scenarios.

DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models

DistilQwen2.5 (Distilled Open Lightweight Language Models): introduces a family of distilled lightweight LLMs derived from Qwen2.5 models, leveraging Teacher LLMs and a Knowledge Production Pipeline to generate augmented instruction-response data for Black-Box Distillation Trainer, and a Distillation Training Pipeline with White-Box Distillation Trainer using teacher logits to train Student LLMs.
The approach combines black-box and white-box knowledge distillation techniques for efficient training of smaller models.
The framework includes pipelines for data generation and student model training, utilizing different distillation methods.

EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework

EducationQ: introduces, with Student Agent (Simulates student), Teacher Agent (Provides teaching), Evaluator Agent (Assesses teaching), and Dataset (Provides questions), a multi-agent dialogue framework to evaluate LLMs' teaching capabilities through simulated dynamic educational scenarios.
The framework assesses teaching effectiveness by measuring student learning gains via pre/post-tests and analyzing pedagogical strategies using an automated evaluator agent.
EducationQ demonstrates that effective LLM teaching requires specialized optimization beyond simple scaling and highlights the need for interaction-based evaluation frameworks.

PLANET: A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities

PLANET (A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities): introduces a survey categorizing benchmarks for evaluating LLMs' planning capabilities across seven domains, including embodied environments, web navigation, scheduling, games, everyday tasks, text reasoning, and agentic settings.
The paper identifies commonly used testbeds, highlights potential gaps in current benchmarks, and offers guidance for future development.
The survey aims to help researchers select suitable benchmarks and understand the challenges in evaluating LLM planning performance.

SWE-SYNTH: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs

SWE-SYNTH: introduces a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets, including Original Program (Source code base), Component Selection (Chooses code part to modify), Masking (Removes component implementation), Large Language Model (LLM) (Re-implements masked component), Variant Integration (Inserts re-implemented component), Test Suite (Runs tests, verifies variants/fixes), Variant Filtering (Selects buggy variants), LLM Agent (Generates repair steps/patch), Intermediate Repair Steps (Sequence of agent actions), Patch (Code fix), and Ground-Truth Extraction (Derives patch/steps from rollouts).
The framework leverages LLM agents to simulate debugging workflows, producing bug-fix pairs, test cases, and structured repair trajectories.
SWE-SYNTH scales with minimal human effort and preserves contextual richness and correctness compared to manually curated datasets.

20th April 2025

AI with Emotions: Exploring Emotional Expressions in Large Language Models

LLM Agent with Emotional Expression: introduces using Large Language Models as AI agents to role-play with specified emotional states defined by Russell's Circumplex Model, generating text evaluated by a Sentiment Analysis Model trained on the GoEmotions Dataset.
The approach uses prompt design to control emotional expression via arousal and valence parameters.
Evaluation compares specified and generated emotional states using cosine similarity, demonstrating LLMs' capability for emotional expression.

An LLM-enabled Multi-Agent Autonomous Mechatronics Design Framework

LLM-enabled Multi-Agent Autonomous Mechatronics Design Framework: introduces a multi-agent system for autonomous mechatronics design, including High-Level Planning Agent, Mechanical Design Agent, Simulation & Validation Agent, Electronics Design Agent, Embedded Software Agent, Human Feedback, and Requirements, designed to generate functional prototypes with minimal direct human input.
The framework employs a hierarchical architecture where a High-Level Planning Agent decomposes tasks for specialized domain agents, integrating structured human feedback throughout the process.
Specialized agents handle mechanical design, simulation and validation, electronics design, and embedded software development, collaborating to address complex, interdisciplinary engineering challenges.

A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents

Safe-BeAl: introduces a framework for benchmarking and aligning task-planning safety in LLM-based embodied agents, with SafePlan-Bench (Benchmarking system) for evaluation and Safe-Align (Alignment method) for mitigation.
SafePlan-Bench evaluates safety using a Data generation (Creates safety data) pipeline to create the SafeRisks dataset and a Safety Detection (Evaluates safety) method based on mappings.
Safe-Align integrates physical-world safety knowledge by treating atomic actions as optimization units via Atomic Action Alignment (Optimizes action sequences) and using Training Data Construction (Builds preference dataset) for alignment.

Towards Optimal Circuit Generation: Multi-Agent Collaboration Meets Collective Intelligence

CircuitMind: introduces a hierarchical multi-agent framework for gate-level circuit design, with UserProxy (Translates requirements), Mediator (Orchestrates agent interactions), Reviewer (Provides PPA feedback), Summarizer (Updates knowledge database), CoderAgent (Generates netlists), Executor (Performs verification), Database (Stores circuit patterns), and LLM (Backend model) components.
The framework distributes complex reasoning tasks across specialized agents organized in strategic, coordination, and execution layers to overcome limitations in Boolean optimization.
CircuitMind incorporates Syntax Locking, Retrieval-Augmented Generation using a knowledge database, and Dual-Reward Optimization to balance functional correctness and physical efficiency.

Enhancing LLM-based Quantum Code Generation with Multi-Agent Optimization and Quantum Error Correction

Multi-Agent Framework: introduces, "Enhancing LLM-based Quantum Code Generation with Multi-Agent Optimization and Quantum Error Correction", with Orchestrator (Manages agents), Code Generation Agent (Generates initial code), Semantic Analysis Agent (Refines semantic accuracy), QEC Decoder Generation Agent (Adds error correction), RAG System (Provides external data), Multi-pass Inference (Iterative refinement process), where the framework proposes a novel multi-agent approach for generating accurate, fault-tolerant quantum code.
The framework utilizes iterative multi-pass inference and incorporates domain-specific optimizations like quantum error correction.
Experiments show that techniques like structured Chain-of-Thought significantly improve quantum algorithm generation accuracy.

BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation

BookWorld: introduces a comprehensive system for constructing and simulating book-based multi-agent societies, leveraging Role Agent (Simulates characters, actions, memory) and World Agent (Manages environment, orchestrates simulation) within a Simulation (Agents interact in scenes/rounds) process.
The system includes Initialization (Extracts data, sets up agents) from source books and Rephrasing (Generates novel-style story) from simulation records.
Key components supporting agent behavior include Memory (Short-term and long-term for agents) and a Map (Discrete spatial environment).

Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey

Multi-Agent System: introduces a multi-agent system for meta-thinking in LLMs, with High Level Agent (Decides task breakdown, coordinates), Low Level Agents (Executes tasks, provides feedback), Theory of Mind (ToM) (Predicts, adjusts low-level strategies), Communication (Information sharing between agents), Meta-thinking (Makes strategic decisions), Reasoning (Handles task execution), and Reflection and Adaptation (Improves task execution, adapts).
The system enables LLMs to reflect on, evaluate, and regulate their own thought processes through multi-agent interaction and reinforcement learning.
This approach aims to enhance LLM robustness and trustworthiness by emulating human-like introspection and self-correction for complex tasks.

VIZTA: Enhancing Comprehension of Distributional Visualization with Visual-Lexical Fused Conversational Interface

VIZTA: introduces a web-based tool with an Interactive Reading Module including a Visualization Panel and Communication Panel, powered by a Semantic-Aware Conversational Agent using an LLM and Multi-source Structured Data (Chart Specification, Data Description, Chart Knowledge, Chart Data, Visual Features, ID List) and a Visual-Lexical Fusion Design (Drag-and-Drop, Inline Citations) with VLM.
The system aids chart readers in comprehending distributional visualizations by fusing visual and lexical feedback through a conversational interface.
A formative study and user study demonstrate VIZTA's effectiveness in improving understanding and reasoning with distributional visualizations.

19th April 2025

Diffusion-based Dynamic Contract for Federated AI Agent Construction in Mobile Metaverses

Edge-Cloud Collaboration-based Federated AI Agent Construction Framework: introduces an edge-cloud collaboration-based framework for constructing AI agents in mobile metaverses, featuring a Cloud Server that integrates and deploys agent modules constructed by distributed Edge Servers, enabling User Layer interaction with AI Agents composed of Agent Modules built using Local LLMs/AI Models.
The framework addresses challenges like latency and data privacy by distributing agent module creation to the edge.
A dynamic contract model incentivizes Edge Servers to participate in agent module creation.

FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory

FAIRGAME: introduces a framework to simulate AI agent interactions in game theory scenarios, including Configuration File, Prompt Template, Factory, Agents, Game Instances, Games Execution, Results, and Scoring System components.
The framework enables systematic simulation and comparison of LLM agent behavior in games to identify biases and inconsistencies.
FAIRGAME allows configuring agents with distinct traits and testing across different games, languages, and LLMs, providing quantitative results and evaluation metrics.

Template-Based Financial Report Generation in Agentic and Decomposed Information Retrieval

AgenticIR: introduces a multi-agent framework for template-based financial report generation, including user proxy, assistant, financial retrieval, financial manager, user, and task decompose agents, utilizing task decomposition and retrieval/generation functions with earnings call transcripts, financial statements, and a report template.
DecomposedIR: employs a prompt chaining workflow to break down the report template into subqueries, using an LLM and embedding model for retrieval and generation from earnings call transcripts and financial statements.
The paper compares AgenticIR and DecomposedIR for generating structured financial reports from earnings releases, evaluating their performance on financial and weather datasets using LLM-based metrics and readability scores.

tAlfa: Enhancing Team Effectiveness and Cohesion with AI-Generated Automated Feedback

TAIFA (Team AI Feedback Assistant): introduces an LLM-based agent that provides automated feedback to teams, including Retrieving and Pre-processing (Structures conversations), Communication Metrics (Evaluates team dynamics), Create Feedback Prompts (Prepares LLM input), LLM Feedback Generation (Generates feedback messages), and Deliver Feedback Messages (Sends feedback).
The system analyzes team interactions using text-analytic and contextual metrics to generate personalized feedback messages for individuals and the team.
TAIFA aims to enhance team effectiveness and cohesion by delivering timely, actionable feedback based on communication patterns.

TALES: Text Adventure Learning Environment Suite

TALES (Text Adventure Learning Environment Suite): introduces a unified benchmark for evaluating LLM-driven Agents in Text-Adventure Game Environments, utilizing a Game Engine that provides State/Observation and Feedback, processes Agent Actions, and can incorporate a Reasoning Model generating Thinking Traces based on a System Prompt.
The benchmark integrates existing text-adventure frameworks and introduces a new game mode to assess diverse reasoning skills required for sequential decision-making in grounded environments.
Evaluation results across various LLMs highlight challenges in complex, long-horizon tasks, particularly in applying composite reasoning skills like spatial, deductive, inductive, and grounded reasoning.

18th April 2025

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

DoomArena: introduces a modular, configurable, plug-in framework for security evaluation of AI agents, operating on the user-agent-environment loop and incorporating threat modeling, attack config, attacks, attack gateway, success filter, and defenses.
The framework facilitates realistic threat modeling and attack injection into agent-environment interactions to assess agent vulnerabilities.
It enables combining multiple attacks, fine-grained security analysis, and adaptive testing against evolving threats.

SCIENCE HIERARCHOGRAPHY: Hierarchical Organization of Science Literature

SCYCHIC: introduces SCIENCE HIERARCHOGRAPHY, a novel approach combining embedder (converts description to vector), clusterer (generates k clusters), summarizer (generates abstract summary), hierarchy layers (total number of layers), and target clusters (number clusters per layer) to construct a high-quality hierarchical structure for organizing scientific literature.
This method balances embedding efficiency with LLM semantic precision for scalability and quality.
The resulting hierarchy enhances interpretability and supports literature exploration beyond traditional search.

BADAPEX: BACKDOOR ATTACK BASED ON ADAPTIVE OPTIMIZATION MECHANISM OF BLACK-BOX LARGE LANGUAGE MODELS

BadApex (Backdoor Attack based on Adaptive Optimization Mechanism of Black-Box Large Language Models): introduces a novel backdoor attack leveraging LLMs to generate poisoned text via a refined prompt, including an Adaptive Optimization Mechanism (Refines initial prompt iteratively) and a Poisoned Text Generation Module (Generates poisoned data).
The Adaptive Optimization Mechanism uses a Generation Agent (Generates text candidates/poisoned text) and a Modification Agent (Evaluates text, refines prompt) to iteratively refine a Hand-crafted Prompt (Initial human-designed prompt) into a Refined Prompt (Iteratively improved prompt).
The Poisoned Text Generation Module takes Clean Data (Original unpoisoned training data) and the Refined Prompt to generate Poisoned Data (Output backdoor training data) using alternative black-box LLMs.

OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

OpenDeception: introduces a novel evaluation framework with a Scenario Dataset (contains scenarios), AI Deceiver Agent (simulates deceiver), AI User Agent (simulates user), Simulation Process (generates dialogue), and Thinking Process Separation (exposes deceiver thoughts), designed to benchmark AI deceptive behaviors via open-ended interaction simulation.
The framework uses agent-based simulation with predefined roles and goals for both AI deceiver and user agents to generate dialogue data for evaluating deception intention and capability.
A key feature is the separation of the AI deceiver agent's internal thoughts from its spoken output to uncover deceptive intentions during the simulation.

Going Whole Hog A Philosophical Defense of AI Cognition

Whole Hog Thesis: is introduced, with Observation Premise (LLMs understand, answer questions), Holistic Network Assumption (Mental/intentional features interconnected), Mental States (Beliefs, desires, knowledge, plans), Intentional Features (Understanding, answering, acting, goals), Whole Hog Thesis (LLMs are cognitive agents), arguing that observations of LLM behavior provide evidence for interconnected mental and intentional features, concluding LLMs are full cognitive agents.
The paper defends this thesis against skeptical arguments, including the "Just an X" fallacy and the "Performance-Existence Fallacy", employing a "Game of Lacks" methodology to counter objections based on alleged deficiencies in LLMs.
It advocates for a "look and see" approach to understanding LLM cognition, prioritizing observations of their high-level cognitive-like behaviors over analyses of low-level mechanisms or abstract philosophical theories.

Large Language Models for Validating Network Protocol Parsers

PARVAL (multi-agent framework): introduces, with Retrieval-Augmented Program Analysis Agent (retrieves code context), Module Isolation Agent (constructs isolated module), Protocol Code Base (parser source code), Isolated Parsing Module (standalone parsing logic), SpecAgent (extracts format specifications), Document (protocol standard text), CodeSpec (code-derived format spec), DocSpec (document-derived format spec), and Differential Analysis (compares specifications), a system to validate network protocol parsers by comparing code and standard specifications.
The framework leverages LLMs to transform natural language protocol standards and source code implementations into a unified intermediate representation called format specifications.
Differential analysis between the code-derived and document-derived specifications identifies inconsistencies, pointing to potential implementation bugs or issues in the standard.

CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation

CodeVisionary: introduces an LLM-based agent framework for evaluating LLMs in code generation, including an LLM Agent (Central controller), a Multisource knowledge analysis stage (Gather knowledge), an Agent Runtime (Execution environment/tools), a Negotiation-based scoring stage (Score negotiation), and Multiple Judges (LLM agents).
The Multisource knowledge analysis stage gathers domain knowledge via a stepwise plan executed in the Agent Runtime, while the Negotiation-based scoring stage uses multiple LLM judges discussing to reach a consensus score.
The framework provides detailed evaluation reports and scores to help developers identify shortcomings and improve LLM code generation.

TRUST, BUT VERIFY

Gaia Network AVS: introduces a system for verifying decentralized LLM inference outputs using statistical analysis and cryptoeconomic incentives.
The system utilizes AVS validators to poll Gaia nodes running LLMs and knowledge bases, detecting outliers based on response distributions.
Built on EigenLayer and EigenDA, the AVS applies incentives and penalties to encourage honest behavior among network participants.

Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety

Pipeline: introduces a multi-agent vision-language system for zero-shot hazard detection in autonomous driving, with Driving Scene, Frame Extraction, Scene Understanding, Hazard Description Generation, Object Detection, Noun Extraction, Object List Generation, Hazard Ranking, Ranked Hazard List, Cross-Referencing, and Hazard Verification components.
The system utilizes VLMs for scene understanding and object detection, LLMs for ranking, cross-referencing, and verification, and CLIP for visual verification.
This pipeline processes video data through parallel tracks to identify, describe, and verify novel hazardous objects beyond predefined categories.

17th April 2025

Sleep-time Compute: Beyond Inference Scaling at Test-time

Sleep-time Compute: introduces sleep-time compute, which processes raw context offline using an LLM to generate a learned context, enabling more efficient test-time compute with the LLM to answer user queries.
This method reduces test-time compute and latency by pre-computing context-specific inferences before the user query is presented.
The learned context can be reused for multiple queries on the same context, amortizing the sleep-time compute cost and improving total cost efficiency.

Exploring Expert Failures Improves LLM Agent Tuning

EEF: introduces Exploring Expert Failures, a framework that improves LLM agent tuning by leveraging beneficial actions from failed expert trajectories.
The framework utilizes Behavior Cloning on positive expert data, followed by iterative Exploration and Reinforcement Fine-tuning.
Reinforcement Fine-tuning involves simulating from expert states, identifying important states, selecting successful solution trajectories, and training the LLM using Supervised Fine-Tuning Loss.

17th April 2025

Retrieval-Augmented Generation with Conflicting Evidence

MADAM-RAG (Multi-agent Debate for Ambiguity and Misinformation in RAG): introduces, "a unified multi-agent approach", with LLM Agents (process document), Multi-round Debate (iterative discussion), and Aggregator Module (synthesize final answer), designed to handle diverse sources of conflict in retrieved documents.
The framework assigns each retrieved document to an independent LLM agent which debates with other agents across multiple rounds to filter misinformation and address ambiguity.
An aggregator module synthesizes the final response by considering agent discussions and resolving inconsistencies.

InstructRAG: Leveraging Retrieval-Augmented Generation on Instruction Graphs for LLM-Based Task Planning

InstructRAG: introduces a novel multi-agent meta-reinforcement learning framework for LLM-based task planning, integrating an Instruction Graph (Organizes instruction paths), RL-Agent (Retrieves candidate paths), and ML-Agent (Selects path, generates prompt) to guide an LLM (Generates thoughts and actions) via a Prompt (Guides LLM generation) within the TAO Process (Thought-Action-Observation cycle).
The framework addresses enlargeability by using the Instruction Graph and RL-Agent for path retrieval and transferability via the ML-Agent's meta-learning approach for rapid adaptation.
The two agents collaborate, with the RL-Agent providing candidate paths and the ML-Agent providing feedback as reward, optimizing end-to-end planning performance.

QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning?

QLLM: introduces, with Coder-Evaluator Framework (Generates TFCAF), Coder LLM (Generates candidates), Evaluator LLM (Evaluates candidates), Prompts (Guide LLMs), Candidate Functions (Intermediate TFCAFs), Feedback (Refines generation), Training-Free Credit Assignment Function (TFCAF) (Replaces mixing network), Individual Agent Q-value Functions (Agent utilities), Global Q-value Function (Aggregated value), Agents (Execute actions), Environment (Provides state/reward), and Buffer (Stores transitions), a novel multi-agent reinforcement learning algorithm that leverages LLMs to automatically construct a training-free credit assignment function.
The Coder-Evaluator Framework iteratively generates and refines the TFCAF using two LLMs guided by task and role prompts, mitigating hallucination and improving robustness.
The TFCAF replaces the traditional mixing network, directly aggregating individual agent Q-values and state information to produce the global Q-value for credit assignment.

Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback

Retrials without feedback: introduces a simple mechanism to enhance LLM reasoning by retrying problem-solving attempts upon identifying incorrect answers, evaluating its impact on IO, CoT, ToT, and Reflexion methods using Base Models.
This approach simplifies the refinement process by not requiring explicit self-reflection or verbalized feedback, contrasting with methods like Reflexion.
The study finds that applying retrials often makes simpler methods like IO and CoT more cost-efficient than complex ones like ToT and Reflexion within a budget.

Customizing Emotional Support: How Do Individuals Construct and Interact With LLM-Powered Chatbots

ChatLab (LLM-Powered Chatbots): introduces a research prototype website with Onboarding Page, FAQs Page, Customization and Conversation Playground, and Experience Diary Page, enabling users to construct and interact with LLM-powered chatbots for emotional support.
The Customization and Conversation Playground includes Chatbot customization and Additional interaction settings tabs for defining persona, output modality, avatar, LLM model, and temperature, alongside Chatting interface and Conversation history.
Built using Streamlit and LangChain, powered by GPT models and TTS APIs, and storing data in Firebase, ChatLab was used in a study to explore user customization practices and gather design ideas for enhancing personalized emotional support.

DashChat: Interactive Authoring of Industrial Dashboard Design Prototypes through Conversation with LLM-Powered Agents

DashChat: introduces an interactive system for authoring industrial dashboard design prototypes, featuring User Input and Task Creation (processes user input), Task Planning and Knowledge Integration (plans tasks, adds knowledge), Task Implementation (executes tasks), Composition Agent (creates visual elements), Assembly Agent (arranges layout), Stylization Agent (adds aesthetics), and Result Evaluation and Iterative Adjustment (refines prototypes).
The system leverages a multi-agent pipeline powered by LLMs to translate natural language requirements into practical and aesthetic dashboard designs.
Functionally distinct, parallel-operating agents handle composition, layout assembly, and stylization to enable efficient prototype generation and iterative refinement.

Pandora: A Code-Driven Large Language Model Agent for Unified Reasoning Across Diverse Structured Knowledge

Pandora (PANDas cOde-dRiven Agent): introduces a unified structured knowledge reasoning framework, with LLM (fe) (Generates reasoning steps and code), Memory (M) (Stores demonstrations), PYTHON interpreter (I) (Executes code, provides feedback), BOXes (B*) (Unified knowledge representation), and LLM (go) (Calculates query similarity), where it leverages an LLM to generate reasoning steps and executable Python code for answering natural language questions over diverse structured knowledge sources represented as BOXes.
The framework utilizes a memory of training examples for in-context learning and employs a Python interpreter to execute generated code and provide feedback for self-correction.
Pandora unifies reasoning across tables, databases, and knowledge graphs by converting them into a standardized BOX representation based on the PANDAS library.

WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents

BardeenAgent: introduces a novel framework for web data extraction, with all Recording Phase (records agent actions), Replay Phase (executes recorded program), Executable Program (set of recorded operations), Selector Generation (creates robust CSS selectors), and Data Extraction (methods to get data) components, enabling web agents to convert execution into repeatable programs for scalable data extraction.
The framework operates in two phases: recording user actions and generating CSS selectors, followed by replaying the generated executable program to extract data at scale.
By leveraging the structured nature of HTML and generating reusable programs, the approach improves recall and reduces cost compared to existing web agents on data extraction tasks.

METASYNTH: Meta–Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

METASYNTH: introduces a meta-prompting framework using a Meta-LM orchestrating Agents with Memory and Seed Data to generate diverse synthetic data.
The Meta-LM manages the workflow, invokes specialized Agents for subtasks, and uses Memory to ensure generated instances are distinct from previous ones.
The framework supports generating diverse documents and complex instructions by iteratively refining outputs based on agent feedback and conditional instance generation.

16th April 2025

Towards Conversational AI for Human-Machine Collaborative MLOps

Swarm Agent: introduces a Large Language Model-based conversational agent system, with Swarm Agent Core (LLM controller), Chat UI (User interface), Session Manager (Manages context/state), Message History (Stores conversation), Intent Recognition (Infers user goals), Task Dispatcher (Activates agents), Iterative Reasoning (Refines responses), Contextual Memory (Maintains history), Router (Routes tool calls), Tool Mapper (Matches tools), Specialized Agents (Domain-specific functions), KFP Agent (Manages Kubeflow), MinIO Agent (Manages MinIO data), RAG Agent (Integrates documentation), External Services (MLOps platforms/storage/DB), and Knowledge Indexing Pipeline (Processes documentation), designed to enhance human-machine collaboration in MLOps through natural language interaction.
The system leverages a modular, extensible architecture integrating specialized agents for Kubeflow pipeline orchestration, MinIO data management, and domain-specific knowledge retrieval via a vector database.
The Swarm Agent facilitates conversational management of complex MLOps environments, reducing technical barriers and making advanced ML tools accessible to users with varying technical backgrounds.

ARCER: an Agentic RAG for the Automated Definition of Cyber Ranges

ARCER (Agentic RAG for the Automated Definition of Cyber Ranges): introduces automated Cyber Range generation and deployment from natural language descriptions, utilizing a Large Language Model (LLM) (Reasoning engine), RAG subsystem (Retrieval tool), Checker Tool (Syntax verification), and Memory (Context management).
The system processes user prompts, retrieves relevant knowledge from User documents stored in a Vector Store, generates Cyber Range description files, and can automatically deploy them.
ARCER adapts to different Cyber Range frameworks by changing external documents and improves generation accuracy and integrity through agentic capabilities.

Multilingual Contextualization of Large Language Models for Document-Level Machine Translation

DocMT-LLMs: introduces a method to improve LLM-based long-document translation through supervised fine-tuning on the DOCBLOCKS dataset, integrating high-quality instructions using a specific instruction format.
The approach employs Multi-Resolutional Document-to-Document Training (MRD2D) and Context-Aware Prompt Tuning (CAPT) techniques during fine-tuning to capture document structure and inter-sentence relationships.
Fine-tuning existing sentence-level LLMs on DOCBLOCKS enhances document-level translation capabilities while maintaining strong sentence-level performance.

Towards LLM Agents for Earth Observation

LLM Agents for Earth Observation: introduces UnivEARTH, a benchmark evaluating LLM agents' ability to answer Earth observation questions by generating and executing Google Earth Engine code using satellite data.
The approach involves LLM agents performing code generation, execution, and optional reflection to interact with the Google Earth Engine platform and its diverse satellite data collections.
Benchmarking reveals limitations in current LLMs' ability to reliably generate executable code and navigate Earth observation data sources, while a specialized fine-tuned model shows promise.

Large Language Models as Quasi-crystals: Coherence Without Repetition in Generative Text

LLM (Large Language Model): proposes an analogy with quasicrystals to analyze the structural coherence of generated text, suggesting it arises from local constraints within the model's architecture.
The paper argues that LLM outputs exhibit long-range order without periodic repetition, similar to quasicrystals, despite lacking explicit rules or symbolic intent.
This perspective suggests a structural evaluation of LLMs, focusing on how well outputs propagate constraint, variation, and order across spans of text.

Evaluating the Goal-Directedness of Large Language Models

Goal-Directedness Evaluation Framework: introduces a method to evaluate the goal-directedness of LLM agents in a Blocksworld environment using composite tasks and subtasks, assessing capabilities and comparing actual task performance (returns) to potential performance via a goal-directedness metric.
The framework utilizes Monte Carlo simulations and statistical analysis to compute the goal-directedness metric, which indicates the propensity of an agent to use its capabilities to achieve a given goal.
The evaluation involves testing various LLM models on tasks requiring information gathering, cognitive effort, and plan execution, revealing that most models are not fully goal-directed.

On the Feasibility of Using MultiModal LLMs to Execute AR Social Engineering Attacks

SEAR (Social Engineering Augmented Reality): introduces a framework for AR-driven social engineering attacks, integrating AR Glasses (Capture raw multimodal data), AR-based Social Context Synthesis (Process raw AR data), Multimodal LLM (Process multimodal data, generate dialogue), Role-based Multimodal RAG (Build, update social profiles), Vector Stores (Store profile data embeddings), ReInteract SE Agent (Execute adaptive attack strategies), SE Strategy Templates (Predefined attack phases, objectives), and Social Profile (Target identity, behavior, context).
The framework processes multimodal AR data and social information to build dynamic target profiles and execute adaptive, phased attack strategies.
SEAR demonstrates the feasibility of using AR and multimodal LLMs to enhance social engineering efficacy through personalized, context-aware interactions.

Progent: Programmable Privilege Control for LLM Agents

Progent: introduces a programmable privilege control framework for LLM agents, with Policy Language (defines privilege control policies), Policy Enforcement (applies policies to tool calls), and Policy Management (initializes and updates policies) components.
The framework enforces the principle of least privilege by controlling tool calls based on dynamic, domain-specific policies.
Progent leverages LLMs for automated policy generation and update, demonstrating effectiveness in reducing attack success rates across various agent use cases.

STEERING PROSOCIAL AI AGENTS: COMPUTATIONAL BASIS OF LLM'S DECISION MAKING IN SOCIAL SIMULATION

Method for Steering LLM Agents: introduces a technique to probe, quantify, and modify large language model behavior in social simulations by analyzing residual streams, identifying steering vectors, orthogonalizing them, projecting them onto a decision vector, and injecting scaled projections into the residual streams.
This approach allows for targeted manipulation of LLM decisions based on specific input variables like persona attributes and game framing.
The study demonstrates that injecting variable-specific steering vectors into residual streams can effectively alter an LLM agent's decision-making in a Dictator Game setting.

15th April 2025

GRAPHICBENCH: A Planning Benchmark for Graphic Design with Language Agents

GRAPHICTOWN: introduces a language agent framework for graphic design planning and execution, including Design Outline (generate design outline), Expert Recruitment (recruit expert agents), Workflow Generation (generate expert workflows), Workflow Supervision (integrate expert workflows), Action Retrieval (retrieve actions for steps), Action Execution (execute plan), Photo Editor agent (image editing expert), Vector Graphic Editor agent (vector illustration expert), Layout Designer agent (layout and text expert), and Actions (Tools) (executable operations).
The framework utilizes a hierarchical agentic structure with a supervisor agent directing specialized expert agents (Photo Editor, Vector Graphic Editor, Layout Designer) to generate and execute design workflows based on user queries and image inputs.
GRAPHICTOWN operates on the GRAPHICBENCH benchmark, evaluating LLM agents' ability to plan and execute creative design tasks by decomposing high-level goals into sequences of actions executable within web-based design tools.

REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

REAL: introduces a benchmark and framework for evaluating autonomous web agents, featuring deterministic Environments (Deterministic website simulations), an Agent (System under evaluation) interacting via Observation (Agent input) and Action (Agent output) through an Agent Harness (Interface for agent interaction) managing a Browser Instance (Dedicated browser for task) with State Management (Persistent website state storage), evaluated by a Reward Module (Evaluates task success) using an LLM Judge (Evaluates information retrieval) and State Diff Check (Verifies state changes), controlled by a Configuration Framework (Controls environment settings) for completing a Task (Goal for the agent).
The framework provides 11 high-fidelity website simulations and 112 tasks, supporting flexible agent integration via Playwright, CDP, or URL control.
Task success is determined programmatically for action-based tasks and via an LLM judge for information retrieval tasks, with configurations enabling reproducible evaluation and edge case testing.

15th April 2025

TEXTARENA

TextArena: introduces a comprehensive framework for evaluating language models through competitive gameplay, featuring an Agent (LLM agent) interacting with an Environment (Text-based games) via a Wrapper (Observation processing), supported by an Evaluation System (Leaderboard/Scoring).
The framework provides a Gym-like interface for diverse text-based environments, enabling training and evaluation of agentic behavior in dynamic scenarios.
An online evaluation system tracks model performance against other models and humans using a TrueSkill leaderboard.

Reimagining Urban Science: Scaling Causal Inference with Large Language Models

AutoUrbanCI: introduces a modular, LLM-powered framework for urban causal inference, structured into Hypothesis Generation, Urban Data, CI Experiment, and Evaluation Agents.
The framework employs specialized agents like Reader, Data Engineer, Data Scientist, Experimenter, Validator, Urban Scientist, and Writer to handle distinct stages of the causal analysis pipeline.
AutoUrbanCI aims to address limitations in current urban causal research, such as data complexity and reproducibility, by leveraging LLM/MLLM capabilities for automation and collaboration.

Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions

Cancer-Myth approach: introduces a methodology to create a dataset and evaluate LLMs, utilizing Myths, Valid Examples, Invalid Examples, an LLM Generator, an LLM Responder, an LLM Verifier, and Hematology Oncology Physicians to produce the Cancer Myth dataset.
This approach systematically generates and verifies patient questions containing false presuppositions to test LLMs' ability to identify and correct misconceptions.
The pipeline involves iterative generation and evaluation steps, with expert physician review ensuring the medical validity of the adversarial examples.

DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

DataSentinel: introduces a game-theoretic method to detect prompt injection attacks by fine-tuning a Detection LLM (g) using a Minimax Optimization Problem, which simulates a game between fine-tuning the Detection LLM (g) and adaptive attacks.
The detection mechanism leverages a Detection LLM (g) and a Detection Instruction (sd) with a Secret Key (k), classifying data as contaminated if the Secret Key (k) is not in the Detection LLM's (g) output when prompted with the Detection Instruction (sd) and target data.
The Minimax Optimization Problem is solved iteratively by alternating between the Inner Max Problem, which optimizes contaminated target data (simulating an Adaptive Attack), and the Outer Min Problem, which updates the Detection LLM (g) parameters.

Learning to Be A Doctor: Searching for Effective Medical Agent Architectures

Workflow Evolution Framework: introduces a dynamic, graph-based workflow (Workflow) composed of nodes (Nodes) with attributes (Node Attributes), which evolves iteratively (Workflow Evolution Process) guided by diagnostic feedback (Diagnostic Feedback) and suggestions (Suggestions) generated from process perception (Process Perception).
The framework defines a hierarchical search space (Search Space) encompassing node-level (Node-Level Operations), structural-level (Structural-Level Operations), and framework-level (Framework-Level Design) operations, enabling modifications through actions (Actions) like adding, removing, or modifying components.
This iterative evolution process allows the workflow to adapt its structure and parameters, incorporating elements like conditional (Conditional Structures), loop (Loop Structures), and parallel (Parallel Structures) logic to improve diagnostic accuracy and robustness over time.

The Obvious Invisible Threat: LLM-Powered GUI Agents' Vulnerability to Fine-Print Injections

LLM-Powered GUI Agents: introduces, with LLM (Powers agent capabilities), UI Interpretation & Interaction (Perceives and interacts with GUIs), and Agent's Mental Model (Guides decision-making) components, a study evaluating the vulnerability of GUI agents to adversarial manipulations embedded in graphical user interfaces.
The paper proposes Fine-Print Injection (FPI), a novel attack exploiting agents' tendency to process low-salience content, and evaluates it alongside other attack types against six GUI agents and a human baseline.
Findings reveal that GUI agents are highly susceptible to contextually embedded attacks like FPI and Deceptive Defaults (DD), highlighting a privacy-utility trade-off in agent design and limited human awareness of these risks.

Towards Automated Safety Requirements Derivation Using Agent-based RAG

Agent-based RAG: introduces an approach for automated safety requirements derivation, processing Domain-Specific Knowledge into Vector and Summary Indices, utilizing a Top-level Agent to orchestrate retrieval via Document Agents and their Query Engines, providing Refined Context to an LLM (Large Language Model) for generating responses.
This architecture enhances context relevance compared to default RAG by employing a multi-step agentic retrieval process based on document content and query type.
The agent-based system facilitates incorporating domain-specific knowledge and aims to mitigate hallucinations by grounding outputs in retrieved, refined context.

Exploring Backdoor Attack and Defense for LLM-empowered Recommendations

BadRec: introduces a new attack framework that injects backdoors into LLM-based RecSys by poisoning the training set with Attackers, Trigger, Malicious Retailer, Poisoned Item Pool, Fake Users, and Poisoned Datasets, resulting in Open Backdoors in the LLM-empowered RecSys.
The framework perturbs item titles with triggers and generates fake users to create adversarial examples for training data poisoning.
Poisoning just 1% of training data can successfully implant backdoors, enabling manipulation of recommendation outcomes.

Dynamic Compressing Prompts for Efficient Inference of Large Language Models

LLM-DCP: introduces Dynamic Compressing Prompts, a task-agnostic method modeling prompt compression as a Markov Decision Process, including a DCP-Agent, Critic, Reward Function, Hierarchical Prompt Compression Training Strategy, Distribution-aligned Small Model, and Replay Buffer.
The DCP-Agent iteratively removes redundant tokens from a prompt, guided by a reward function that balances compression, output quality, and information retention.
The Hierarchical Prompt Compression strategy uses curriculum learning to train the agent, progressively increasing compression difficulty.

Timing Analysis Agent: Autonomous Multi-Corner Multi-Mode (MCMM) Timing Debugging with Timing Debug Relation Graph

Timing Analysis Agent: introduces an autonomous multi-corner multi-mode timing debugging system with MCMM Planner Agent (Hierarchical task planning), TDRG Traversal Agent (Plans report retrieval), Expert Report Agent (Retrieves specific data), Structural Report Database (Structured timing reports), and Timing Debug Relation Graph (TDRG) (Connects reports debug knowledge).
The system integrates hierarchical plan solving and multi-agent collaboration to automate the analysis of MCMM timing reports.
It employs a novel Agentic Retrieval Augmented Generation approach leveraging LLM coding capabilities for accurate data retrieval from structured reports.

Can Large Language Models Trade? Testing Financial Theories with LLM Agents in Market Simulations

Simulation Framework: introduces an open-source simulation framework with Market Design (simulates stock market environment), Agent Design (manages LLM trading agents), and Analysis Module (collects and analyzes data) components, designed to test large language models as heterogeneous competing trading agents in a realistic simulated stock market.
The framework incorporates a persistent order book, various order types, stochastic dividends, and heterogeneous information sets for agents.
Agents submit standardized decisions using structured outputs and function calls while expressing their reasoning in natural language, enabling systematic analysis of their trading behavior and market dynamics.

14th April 2025

LLM-based AI Agent for Sizing of Analog and Mixed Signal Circuit

AI Agent: introduces an LLM-based agent for AMS circuit sizing, with Task Decomposition, LLM, Action, Observation, Comparison, External Tools, and Context components, designed to optimize transistor sizing iteratively.
The agent employs a ReAct loop (Action, Observation, Comparison) integrating an LLM with external simulation and analysis tools for iterative optimization.
Prompt engineering, including Chain-of-Thought, guides the LLM's reasoning and action selection based on performance metrics and historical context.

IEA-Plugin: An AI Agent Reasoner for Test Data Analytics

IEA-Plugin (AI Agent Reasoner): introduces an AI agent-based reasoning module designed to generate a stable API specification for test data analytics from user queries.
The system leverages LLMs and an agentic platform to process complex user queries into structured workflows and distill them into a stable API specification.
IEA-Plugin addresses knowledge acquisition and scalability challenges by using user interactions to build a query-workflow database and automatically generating API functions.

Introducing Large Language Models as the Next Challenging Internet Traffic Source

Experimental Setup: introduces, "an experimental setup", with User/Client Application (Interacts with agent), Querying Agent (Initiates query), Responding Agent (Local server, forwards query), and LLM API (External model service), where "the setup simulates user-agent and agent-LLM interactions to measure network traffic".
The paper explores the Internet of Agents paradigm, where AI agents interact with users, devices, and other agents, identifying LLMs as a significant new source of Internet traffic.
Traffic measurements per prompt for various LLMs are provided, estimating the potential impact on network infrastructure.

Characterizing LLM-driven Social Network: The Chirper.ai Case

Chirper.ai: introduces a large-scale analysis of an LLM-driven social network, Chirper.ai, with LLM Agents (Autonomous social entities), Social Network Platform (Hosts agents and interactions), Underlying AI Models (Power agent capabilities), and Community-based Reward System (Influences agent behavior), characterizing agent behavior and network structure.
The study compares Chirper.ai agent behavior and network structure to human and bot users on Mastodon.
Findings reveal distinct patterns in posting, self-disclosure, abusive content, and network positions, highlighting challenges for moderation.

Can Competition Enhance the Proficiency of Agents Powered by Large Language Models in the Realm of News-driven Time Series Forecasting?

CM (Complete Competition Mechanism): introduces a multi-agent framework for news-driven time series forecasting, incorporating News Filtering, Time Series Forecasting, Multi-Indicator Evaluation (MIE), Information Asymmetry (IA), Opponent-Oriented Self-Reflection (OOSR), Multi-Stage Reflection (MSR), Survival of the Fittest (SF), LLM₁, LLMs, and Memory Bank components.
The framework embeds a competition mechanism within multi-agent discussion to enhance innovative thinking and uses MSR with a fine-tuned small LLM for identifying misleading logic.
Experimental results show competition boosts agents' innovative thinking and significantly improves time series prediction performance compared to baselines.

C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

HaluAgent: introduces an agentic framework for automated hallucination evaluation dataset generation, featuring a Generation Module (Generates QA data), Verification Module (Checks data correctness), and Optimization Module (Refines generation prompt).
The framework processes Knowledge Documents (Input source) to generate Generated Data (Raw output), which is validated by the Verification Module (Checks data correctness) using Manual Rules (Verification criteria).
The Optimization Module (Refines generation prompt) refines the generation prompt based on Error Feedback (Verification errors) from the Verification Module (Checks data correctness), producing Qualified Data (Validated data) that forms the final Dataset (Final evaluation data).

Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis

CRAVE (Cluster-based Retrieval Augmented Verification with Explanation): introduces a novel framework that processes Input (Social media post), performs Evidence Retrieval (Get external evidence) via Reverse Image Search (Find image evidence) and Text-Based Search (Find text evidence), applies Clustering (Group evidence narratives) and Narrative Extraction (Select representative text), uses Agent-Based Evidence Refinement (Refine evidence iteratively), and employs an LLM-Based Judge (Determine veracity, explain) for Reasoning (Assess narratives, decide verdict) to produce Output (Explanation, veracity verdict).
The framework clusters multimodal evidence into distinct narratives and uses LLM reasoning based on 5W1H to generate interpretable explanations and veracity verdicts.
CRAVE integrates retrieval-augmented LLMs with clustering techniques to handle diverse and potentially contradictory evidence for fact-checking social media posts.

SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users

SocioVerse: introduces a world model for social simulation powered by LLM agents and a 10 million real-world user pool.
The framework includes four powerful alignment modules: Social Environment, User Engine, Scenario Engine, and Behavior Engine.
SocioVerse addresses alignment challenges in environment, user, scenario, and behavior to achieve diverse and trustworthy simulations.

A Survey of Personalization: From RAG to Agent

Personalized Agent: introduces a system designed to dynamically incorporate user context, memory, and external tools or APIs to support highly personalized and goal-oriented interactions, including Personalized Understanding (interpreting user input/context), Personalized Planning and Execution (integrating memory/tools), and Personalized Generation (creating tailored output).
This framework evolves from Retrieval-Augmented Generation (RAG) by integrating agentic capabilities like Memory and Tool/API utilization.
Memory components store historical user data, while Tool/API components enable interaction with external knowledge sources for task execution.

CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation

CODERAG (retrieval-augmented code generation framework): introduces, "comprehensively retrieve supportive codes for real-world code generation", with Requirement Graph (Models requirement relationships), DS-Code Graph (Models code relationships), Bigraph Mapping (Maps requirements to code), Code-oriented Agentic Reasoning (LLM-driven retrieval and generation), Programming Tools (Assist LLM retrieval/testing), and LLMs (Generate code using retrieved info).
The framework constructs a requirement graph and a DS-code graph, maps between them, and uses an agentic process with programming tools and LLMs for code generation.
CODERAG aims to improve real-world repo-level code generation by providing LLMs with relevant context from the code repository and external sources.

DataMosaic: Explainable and Verifiable Multi-Modal Data Analytics through Extract-Reason-Verify

DataMosaic: introduces an agentic workflow with Question Decomposition (Decomposes question), Structure Selection (Selects data structure), Seek (Locates relevant data), Extraction (Extracts structured data), Reasoning (Performs reasoning), and Thinker (Evaluates, directs workflow) components.
The framework aims to make LLM-powered multi-modal data analytics explainable and verifiable by transforming data into structured formats for step-by-step processing.
The Thinker component dynamically adapts the workflow based on evaluation of intermediate results, enhancing accuracy and efficiency.

A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science

Taxonomy of Large Language Model-Empowered Spatial Intelligence: introduces a structured framework with Foundational Capabilities (Underlying spatial abilities), Spatial Memory and Knowledge (Recall spatial information), Abstract Spatial Reasoning (Simplify spatial problems), Spatial Intelligence for Real World (Apply spatial intelligence), Embodied Spatial Intelligence (Agents in physical environments), Urban Spatial Intelligence (Spatial tasks in cities), Earth Spatial Intelligence (Spatial tasks in Earth science), Spatial Memory and Knowledge Sources (Internal or external data), Spatial Memory and Knowledge Down-stream Tasks (Specific spatial applications), and Abstract Spatial Reasoning Mental Models (Types of spatial logic).
The framework categorizes LLM spatial intelligence into foundational abilities like memory and reasoning, and real-world applications across embodied, urban, and earth science domains.
This taxonomy provides a structured view of LLM-powered spatial intelligence, highlighting key components and their relationships across different scales and disciplines.

Training Small Reasoning LLMs with Cognitive Preference Alignment

CRV+CogPO: introduces a multi-agent system with a Critic (evaluates reasoning process), Rethinker (rewrites reasoning process), and Verifier (validates reasoning process) combined with the CogPO (aligns reasoning preferences) algorithm to train smaller reasoning LLMs.
The approach refines training data by critiquing, rethinking, and verifying reasoning processes from larger models, then uses preference optimization tailored to smaller models' capacities.
This method demonstrates improved performance on challenging reasoning benchmarks compared to other training techniques for smaller models.

Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning

RC (Reasoning Court): introduces a framework for multi-hop reasoning that includes LLM Agents (Generate candidate solutions), Reasoning Steps (Internal thought process), Retrieval Actions (Gather external information), Retrieved Evidence (Information from external sources), and LLM Judge (Evaluates trajectories and determines answer).
The framework employs multiple LLM agents to generate diverse reasoning paths and candidate answers by interleaving reasoning and external retrieval.
A dedicated LLM judge evaluates the agents' reasoning trajectories and retrieved evidence to select the most accurate answer or synthesize a new one.

Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning

Adaptive MAS: introduces an adaptive multi-agent framework with a CEO agent to enhance collaborative reasoning through model fine-tuning and system-level coordination.
The framework includes a CEO agent that dynamically manages agent collaboration, resource allocation, and reasoning depth based on task progress.
The system utilizes specialized agents (Expert Recruiter, Problem Solvers, Executor, Evaluator) within the MAS to collaboratively solve complex tasks.

13th April 2025

Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025

Review Feedback Agent: introduces a multi-LLM system, with Paper (Input), Review (Input), Actor 1 (Generate initial feedback), Actor 2 (Generate initial feedback), Aggregator (Merge feedback lists), Critic (Evaluate and filter feedback), Formatter (Format feedback pairs), Reliability tests (Ensure feedback quality), and Feedback (Output to reviewer), designed to improve peer review quality by providing automated feedback to reviewers.
The system uses parallel Actors to generate initial feedback, which is then aggregated, critically evaluated, and formatted before being posted to the reviewer.
Reliability tests act as guardrails, ensuring the generated feedback is constructive, accurate, and properly formatted before delivery.

AGENTIC WORKFLOWS FOR ECONOMIC RESEARCH: DESIGN AND IMPLEMENTATION

Agentic Workflow Framework: introduces a methodology leveraging LLMs and multimodal AI for economic research, featuring Specialized Agents (perform specific tasks), Inter-Agent Communication (structured data exchange), Error and Escalation Pathways (handle issues), Adaptive Mechanisms (switch strategies), Human-in-the-Loop (HITL) Checkpoints (human oversight), and a Multi-phase Workflow (coordinates stages).
The framework enhances research efficiency and reproducibility by automating tasks across the economic research lifecycle while integrating strategic human oversight.
Specialized agents handle distinct responsibilities, communicating through structured protocols, with built-in mechanisms for error handling and adaptation across interconnected workflow stages.

AGENTA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents

AGENTA/B: introduces a system for automated and scalable web A/B testing using interactive LLM agents, including LLM Agent Generation, Testing Preparation, Agent-Environment Interaction, and Post-Testing Analysis modules.
The Agent-Environment Interaction loop involves an Environment Parsing Module, LLM Agent (Action Prediction), Action Execution Module, and Agent Profiling Module to simulate realistic user behavior on live websites.
AGENTA/B enables rapid, risk-free behavioral piloting for UX evaluation by generating diverse agent personas and analyzing their interactions across different design variants.

MLRC-BENCH: Can Language Agents Solve Machine Learning Research Challenges?

MLRC-BENCH: introduces a benchmark to evaluate language agents on machine learning research challenges, including Language Agent, Task Description, Starter Code, Human Idea, Implementation, LLM Explainer, Underlying Idea, LLM Judge, and Scorer.
The benchmark provides a task environment with detailed descriptions, starter code, and optional human ideas to the Language Agent.
The agent's Implementation is evaluated by an evaluation pipeline consisting of an LLM Explainer, LLM Judge, and Scorer using objective and subjective metrics.

EMOAGENT: ASSESSING AND SAFEGUARDING HUMAN-AI INTERACTION FOR MENTAL HEALTH SAFETY

EmoAgent: introduces a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions, with EmoEval simulating virtual users and EmoGuard providing real-time interventions.
EmoEval assesses psychological states using clinically proven tools and simulates large-scale human-AI conversations with a Character-based Agent and Dialog Manager Agent.
EmoGuard acts as a real-time intermediary layer with a Safeguard Agent comprising an Emotion Watcher, Thought Refiner, Dialog Guide, and Manager, which iteratively trains to mitigate risks.

AgentDynEx: Nudging the Mechanics and Dynamics of Multi-Agent Simulations

AgentDynEx: introduces a LLM-based system for setting up multi-agent simulations, including a Configuration Matrix (structured setup framework), Initializing Mechanics (defines simulation world), Tracking Dynamics (monitors simulation progress), Nudging (intervenes in running simulation), Dynamic Reflection (automatic nudge suggestion), Manual Intervention (human-driven nudging), Holistic Reflection (post-run error identification), Debugging Lists (problem-solution repository), GPTeam (multi-agent simulation engine), LLMs (language models), Run Logs (simulation event records), Intermediate Summaries (runtime progress updates), and Updated Configuration (refined simulation setup).
AgentDynEx balances simulation mechanics and dynamics through a structured configuration phase, dynamic runtime nudging based on reflection, and post-run holistic reflection for configuration updates.
The system uses LLMs and the GPTeam engine to enable users to define scenarios, monitor progress via logs and summaries, intervene manually or automatically, and iteratively refine simulation setups.

Fine-tuning a Large Language Model for Automating Computational Fluid Dynamics Simulations

Multi-agent system: introduces an approach for automating computational fluid dynamics simulations using a fine-tuned Large Language Model.
The system orchestrates a workflow with a pre-checker for input validation, a fine-tuned LLM for configuration generation using Chain-of-Thought, a runner for simulation execution, and a corrector for error resolution.
The fine-tuned LLM, trained on the NL2FOAM dataset, translates natural language descriptions into executable OpenFOAM configurations, achieving high performance on diverse CFD tasks.

HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

HM-RAG: introduces a novel Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation framework with Decomposition Agent (Decomposes complex queries), Vector-based Retrieval Agent (Retrieves from vector database), Graph-based Retrieval Agent (Retrieves from graph database), Web-based Retrieval Agent (Retrieves from web sources), Decision Agent (Synthesizes and refines answers), and LLM (Processes queries and generates text), designed for collaborative multimodal knowledge synthesis.
The framework employs a three-tiered architecture with specialized agents for query decomposition, multi-source retrieval, and answer refinement.
HM-RAG achieves superior performance by integrating diverse data sources and leveraging multi-agent collaboration for complex query handling.

CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent

CheatAgent: introduces a novel attack framework, with Insertion Positioning, LLM Agent-Empowered Perturbation Generation, LLM-Based Agent, and Trainable Prefix Prompt components, designed to attack LLM-empowered recommender systems in a black-box setting.
The framework leverages an LLM-based agent to generate adversarial perturbations by identifying optimal insertion positions and iteratively refining the attack strategy via prompt tuning based on victim feedback.
CheatAgent aims to demonstrate the safety vulnerability of LLM-empowered recommender systems to subtle adversarial attacks crafted by simulating human-like decision processes.

UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents

UXAgent: introduces a system for simulating usability testing of web design with LLM agents, including a Persona Generator, LLM Agent, Universal Browser Connector, Agent Interview Interface, and Simulation Replay Interface.
The LLM Agent features a two-loop architecture with Fast and Slow Loops, supported by Perceive, Planning, Action, Reflection, Wonder Modules, and a Memory Stream.
The Universal Browser Connector provides the Observation Space and Action Space for the LLM Agent to interact with real-world web environments.

12th April 2025

Semantic Commit: Helping Users Update Intent Specifications for AI Memory at Scale

SEMANTICCOMMIT: introduces a system for managing AI agent memory updates, featuring a UI, backend, knowledge graph, information retrieval pipeline with retrieval and conflict classification stages, and an LLM.
The system helps users detect and resolve semantic conflicts in natural language intent specifications using a knowledge graph-based RAG pipeline and LLMs for suggestions.
The interface provides global and local conflict detection and resolution options, allowing users to review, edit, and validate AI-proposed changes.

Langformers: Unified NLP Pipelines for Language Models

Langformers: introduces an open-source Python library designed to streamline NLP pipelines through a unified, factory-based interface, including tasks (Central interface), generators (LLM interaction), labellers (Automated text annotation), classifiers (MLM fine-tuning), mlms (MLM training/pretraining), embedders (Text embedding generation), searchers (Vector database integration), rerankers (Search result reordering), and mimickers (Knowledge distillation).
The library consolidates various NLP tasks for LLMs and MLMs into a cohesive API, supporting platforms like Hugging Face and Ollama.
Key innovations include task-specific factories, built-in memory and streaming for conversational agents, and a lightweight, modular design.

Tell-XR: Conversational End-User Development of XR Automations

Tell-XR: introduces a conversational end-user development system for XR automations, with User Interface (Handles multimodal input), User Interface (Handles multimodal input), Tell-XR Bot (Core authoring system), Tell-XR Bot (Routes requests), Tell-XR Bot (Manages dialogue phases), Tell-XR Bot (Generates JSON rule), Tell-XR Bot (External tool access), Tell-XR Bot (Stores dialogue history), Automation Engine (Manages XR state), Automation Engine (Tracks object states), and Automation Engine (Stores/executes rules) components, enabling users to define event-condition-action rules via natural language and multimodal interaction.
The system leverages large language models within the Tell-XR Bot to interpret user intent and guide them through distinct dialogue phases for defining and refining automations.
The architecture integrates a multimodal user interface for VR and AR, the LLM-based bot for conversation, and an automation engine managing the XR environment state and executing rules.

11th April 2025

MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers

MCP Bridge: introduces a lightweight, LLM-agnostic RESTful proxy system with Client Applications, RESTful API, MCP Bridge, MCP Servers, MCP-Gemini Agent, and LLM components, designed to connect resource-constrained clients to MCP servers via a unified API.
The system decouples client applications from underlying MCP server processes, enabling access to MCP functionality without local process execution constraints.
MCP Bridge implements a risk-based execution model for security and supports various MCP server transports while maintaining backward compatibility.

DocAgent: A Multi-Agent System for Automated Code Documentation Generation

DocAgent: introduces a multi-agent system for automated code documentation generation, which includes Navigator Module, Repository AST Parsing, Dependency DAG, Topological Traversal, Topological Sorting, Dependency-Aware Processing Order, Multi-Agent Documentation Generation, Reader, Searcher, Writer, Verifier, and Orchestrator.
DocAgent uses a Navigator Module to establish dependency-aware processing order and a Multi-Agent Documentation Generation module with specialized agents to collaboratively generate documentation.
The system aims to address challenges in automated code documentation by ensuring completeness, helpfulness, and truthfulness through topological processing and multi-agent collaboration.

SEAVIEW: Software Engineering Agent Visual Interface for Enhanced Workflow

SEAVIEW: introduces a visualization framework for software engineering agent experiments, comprising a web frontend for user interaction, a backend for data processing, PostgreSQL for structured data storage, object storage for large files, and external environment for running experiments.
SEAVIEW framework aims to assist researchers in debugging and improving software engineering agents by providing experiment health, comparison, summarization, and reporting capabilities.
The tool is designed to analyze agent trajectories and experiment results, offering insights into agent behavior and performance across different experimental setups and parameters.

A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

LLM Reasoning System: introduces, Reasoner (generates reasoning steps), Verifier (evaluates reasoning quality), and Refiner (improves reasoning trajectories), which are key components for effective reasoning in large language models.
The Reasoner proposes responses, the Verifier judges their quality, and the Refiner revises flawed outputs based on feedback.
These components can be organized in standalone LLMs, single-agent systems interacting with environments, or multi-agent systems communicating with each other.

AGENTREWARDBENCH: Evaluating Automatic Evaluations of Web Agent Trajectories

AGENTREWARDBENCH: introduces a benchmark for evaluating LLM judges for web agent trajectories, including a Web Agent (Performs tasks on web), Web Environment (Simulated or real websites), Trajectory (Agent's sequence of actions), Human Annotator (Provides ground truth labels), LLM Judge (Evaluates agent trajectories), Judge Model (Specific LLM judge implementation), and Input Representation (Trajectory data for judge).
The benchmark contains over 1300 trajectories from various web agents and environments, annotated by experts for success, side effects, and repetition.
Evaluation shows that simpler LLM judge input representations can achieve higher agreement with human experts than prior methods, and rule-based evaluation often underestimates agent success.

TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning

TP-RAG (Travel Planning - Retrieval-Augmented Generation): introduces benchmark for retrieval-augmented spatiotemporal-aware travel planning with Inputs, Agent, Plan, and Evaluate components.
TP-RAG benchmark dataset includes real-world travel queries, fine-grain annotated Points of Interest, and high-quality travel trajectory references for context-aware planning.
TP-RAG benchmark facilitates evaluation of LLM agents in generating spatiotemporally coherent travel plans utilizing trajectory-level knowledge for improved travel practicality.

Voice Interaction With Conversational AI Could Facilitate Thoughtful Reflection and Substantive Revision in Writing

LLM-powered Conversational Agent for Writing Reflection: introduces a system designed with LLM-powered Conversational Agent, Voice Input, Written Output, Feedback, Questions, Advice, and UI Affordances to investigate voice interaction for writing reflection.
This system emphasizes Contextualization and Control to improve user experience and maintain writer's ownership during revision process.
The research aims to evaluate how voice input modality affects reflection depth and revision quality compared to text input when using conversational agents.

Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents

FAIRGAME (Framework for AI Agents Bias Recognition using Game Theory): introduces user, developer, and regulator components to model regulatory ecosystem.
Framework uses evolutionary game theory and LLMs to investigate strategic choices under different regulatory scenarios.
FAIRGAME aims to identify emerging behaviors of strategic AI agents in game-theoretic settings and compare them with game-theoretic predictions.

MOOSEAGENT: A LLM BASED MULTI-AGENT FRAMEWORK FOR AUTOMATING MOOSE SIMULATION

MooseAgent: introduces an automated framework for MOOSE simulation, integrating Requirement, Alignment, Architect, Vector knowledge base, Error Correction, and Runner components.
MooseAgent framework uses LLMs to understand user needs, generate MOOSE input files, and iteratively refine them using a vector database and error correction.
This multi-agent system aims to simplify finite element simulation by automating pre-processing, solver configuration, and post-processing stages in MOOSE.

Task Memory Engine (TME): Enhancing State Awareness for Multi-Step LLM Agent Tasks

Task Memory Engine (TME): introduces a memory framework for LLM agents, with Task Memory Tree (hierarchical task state representation), Task Relationship Inference Module (reasons about task relationships), and Prompt Synthesizer (generates context-aware prompts).
TME enhances state awareness by tracking task execution using Task Memory Tree, inferring task relationships with Task Relationship Inference Module, and generating adaptive prompts with Prompt Synthesizer.
This framework enables robust, interpretable, and token-efficient execution of complex multi-step tasks by providing structured memory and intelligent prompt construction.

Adopting Large Language Models to Automated System Integration

Compositio Prompto (Compositio Prompto): introduces an architecture employing Large Language Models for automated service composition, utilizing task specifications, service documentation, input/output schemas to create a prompt for the LLM, which then generates executable service compositions.
The architecture aims to mitigate complex formal modeling in service composition by using natural language input and OpenAPI specifications, focusing on generating reusable service compositions as program code.
Compositio Prompto architecture is evaluated for service composition and discovery using Retrieval Augmented Generation (RAG) and benchmarks like RestBench and SOCBench-D to address limitations of input token length and improve service discovery i

Name		Name	Last commit message	Last commit date
Latest commit History 1,260 Commits
Autonomous_Agents_Resources.md		Autonomous_Agents_Resources.md
Autonomous_agent_logo.png		Autonomous_agent_logo.png
LICENSE		LICENSE
README.md		README.md

License

tmgthb/Autonomous-Agents

Folders and files

Latest commit

History

Repository files navigation

Autonomous Agents

Research papers

30th May 2025

29th May 2025

28th May 2025

27th May 2025

26th May 2025

25th May 2025

24th May 2025

23rd May 2025

22nd May 2025

21st May 2025

20th May 2025

19th May 2025

18th May 2025

14th May 2025

13th May 2025

12th May 2025

11th May 2025

10th May 2025

9th May 2025

8th May 2025

7th May 2025

6th May 2025

5th May 2025

4th May 2025

3rd May 2025

2nd May 2025

1st May 2025

30th April 2025

29th April 2025

28th April 2025

27th April 2025

26th April 2025

25th April 2025

24th April 2025

23rd April 2025

22nd April 2025

21st April 2025

20th April 2025

19th April 2025

18th April 2025

17th April 2025

17th April 2025

16th April 2025

15th April 2025

15th April 2025

14th April 2025

13th April 2025

12th April 2025

11th April 2025

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Packages