TerminallyLazy
Contributor

Fix A2A Subordinate Stability and Resource Management

Problem

The A2A subordinate system had several critical issues preventing reliable multi-agent coordination:

  1. Duplicate Spawning: Multiple subordinates spawned with the same role due to poor status checking
  2. Port Conflicts: All subordinates trying to use port 8100, causing connection failures
  3. Memory Exhaustion: Subordinates killed by OOM without graceful handling
  4. UI Pollution: Crashed subordinates leaving duplicate entries in agent management UI
  5. Poor Error Reporting: Unclear error messages making debugging difficult

Solution

Subordinate Manager (a2a_subordinate_manager.py)

  • Unique Port Allocation: Incremental port assignment (8100, 8101, 8102...)
  • Duplicate Prevention: Role normalization + proper existing subordinate detection
  • Resource Cleanup: Immediate port/context release on subordinate death
  • Better Status Handling: Wait for "starting" subordinates, cleanup failed ones
  • OOM Detection: Detect exit codes 9/137 and provide clear OOM error messages
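
A minimal sketch of the three manager behaviours listed above, not the code in a2a_subordinate_manager.py. The names `PortAllocator`, `normalize_role`, and `describe_exit` are hypothetical, reusing released ports is an assumption, and so is treating -9 (how Python's subprocess module reports a SIGKILLed child) the same as the 9/137 codes named in the description.

```python
import re


class PortAllocator:
    """Hands out unique ports starting at a base port (hypothetical helper)."""

    def __init__(self, base_port=8100):
        self._base_port = base_port
        self._in_use = set()  # ports currently assigned to live subordinates

    def allocate(self):
        port = self._base_port
        while port in self._in_use:  # pick the lowest free port at or above the base
            port += 1
        self._in_use.add(port)
        return port

    def release(self, port):
        # Called when a subordinate dies; discard() makes a double release harmless.
        self._in_use.discard(port)


def normalize_role(role):
    """Collapse case, whitespace, and punctuation so 'Data Analyst' matches 'data_analyst'."""
    return re.sub(r"[^a-z0-9]+", "_", role.strip().lower()).strip("_")


def describe_exit(returncode):
    """Map a subordinate's exit code to a readable message."""
    # 9 and 137 come from the PR description; -9 is an added assumption.
    if returncode in (9, 137, -9):
        return "subordinate was killed (SIGKILL), most likely out of memory"
    if returncode == 0:
        return "subordinate exited normally"
    return f"subordinate exited with code {returncode}"
```

Keying the subordinate registry by `normalize_role(role)` rather than the raw string is what would let a lookup find an existing subordinate before a second one is spawned.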

Subordinate Runner (a2a_subordinate_runner.py)

  • Memory Limits: Configurable RLIMIT_AS (default 8GB, via SUBORDINATE_RAM_GB)
  • Graceful OOM: MemoryError handling instead of kernel SIGKILL
  • Better Logging: Detailed error reporting for agent creation failures
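
A sketch of how a runner could apply the limit, assuming a Unix host (the `resource` module is not available on Windows). `ram_limit_bytes` and `apply_memory_limit` are hypothetical names; only `SUBORDINATE_RAM_GB`, `RLIMIT_AS`, and the 8GB default come from the description above.

```python
import os
import resource


def ram_limit_bytes(default_gb=8.0):
    """Read SUBORDINATE_RAM_GB and convert it to bytes, falling back to the default."""
    raw = os.environ.get("SUBORDINATE_RAM_GB", "")
    try:
        gb = float(raw) if raw else default_gb
    except ValueError:
        gb = default_gb  # ignore values that are not numbers
    if gb <= 0:
        gb = default_gb
    return int(gb * 1024 ** 3)


def apply_memory_limit(limit_bytes):
    """Set the soft RLIMIT_AS for this process and return the value actually applied."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # An unprivileged process cannot set a soft limit above the hard limit.
    if hard != resource.RLIM_INFINITY:
        limit_bytes = min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    return limit_bytes
```

With an address-space limit in place, an allocation that would exceed it fails inside the process and surfaces in Python as `MemoryError`, which the runner can catch, instead of the kernel terminating the process with SIGKILL.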

Task Handler (a2a_handler.py)

  • Memory Monitoring: Track memory usage during task execution
  • OOM Recovery: Catch MemoryError during monologue and return error response
  • Progress Tracking: Better timeout and progress monitoring
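
The OOM recovery path could look roughly like this. It is a sketch under assumptions: `run_task_guarded` and the response dictionary shape are hypothetical, and `monologue` stands in for whatever coroutine the handler actually awaits.

```python
import asyncio
import gc


async def run_task_guarded(monologue, task_id):
    """Await the task coroutine and turn a MemoryError into an error response."""
    try:
        result = await monologue()
        return {"task_id": task_id, "state": "COMPLETED", "result": result}
    except MemoryError:
        gc.collect()  # release what we can before building the response
        return {
            "task_id": task_id,
            "state": "FAILED",
            "error": "Out of memory during task execution; "
                     "consider raising SUBORDINATE_RAM_GB",
        }
```

Returning a FAILED response keeps the subordinate process alive so the caller can retry, rather than leaving it to discover a dead connection.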

UI Context Management

  • Duplicate Prevention: Check/remove existing proxy contexts before registration
  • Crash Cleanup: Unregister contexts when subordinates crash or become unreachable
  • Status Priority: Keep best-status context when duplicates detected
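
One way to express the "keep best-status context" rule is sketched below. The status names and their ranking are assumptions (the description does not list them), and contexts are modelled as plain dictionaries for illustration.

```python
# Assumed ranking, higher is better. The PR does not specify the actual order.
STATUS_PRIORITY = {"ready": 3, "starting": 2, "unreachable": 1, "failed": 0}


def dedupe_contexts(contexts):
    """Keep one context per subordinate id, preferring the best status."""
    best = {}
    for ctx in contexts:
        current = best.get(ctx["id"])
        if current is None or (
            STATUS_PRIORITY.get(ctx["status"], -1)
            > STATUS_PRIORITY.get(current["status"], -1)
        ):
            best[ctx["id"]] = ctx  # first seen, or strictly better than the kept one
    return list(best.values())
```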

Testing

  • ✅ Simple 2-subordinate coordination works reliably
  • ✅ Retry mechanism functions when subordinates crash
  • ✅ No UI duplicates after subordinate respawn
  • ✅ Clear error messages for OOM/connection issues
  • ✅ Port allocation scales properly

Impact

  • Reliability: Subordinates now survive memory pressure and connection issues
  • Debuggability: Clear error messages for common failures
  • UI Cleanliness: No more duplicate/orphaned agent entries
  • Scalability: Multiple subordinates can run simultaneously without port conflicts
  • Resource Management: Proper cleanup prevents resource leaks

Breaking Changes

None - all changes are backwards compatible.

Configuration

Set SUBORDINATE_RAM_GB=16 (or higher) for memory-intensive workflows.

Add comprehensive Agent-to-Agent (A2A) Protocol support enabling Agent Zero
to communicate with other A2A-compliant agents while maintaining full
backward compatibility. Supports peer discovery, task delegation, and
multiple interaction patterns (polling, SSE, webhook).

Key features:
- A2A Protocol v1.1.0 compliance with JSON-RPC 2.0
- AgentCard discovery via /.well-known/agent.json
- TaskState management (SUBMITTED, WORKING, COMPLETED, FAILED)
- Multiple interaction patterns: polling, SSE, webhook
- Enterprise authentication (Bearer, API Key, OAuth2)
- Dynamic tool discovery and wrapping
- Async/await architecture with proper event coordination
- Full backward compatibility with existing Agent Zero functionality
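
For orientation, a sketch of the protocol pieces named above: the four task states, the AgentCard discovery URL, and a JSON-RPC 2.0 request envelope. The helper names are hypothetical, the lowercase enum values are an assumption, and the method name passed in the usage note is a placeholder rather than a confirmed A2A method.

```python
import json
import uuid
from enum import Enum
from urllib.parse import urljoin


class TaskState(str, Enum):
    # State names come from the feature list; the string values are assumed.
    SUBMITTED = "submitted"
    WORKING = "working"
    COMPLETED = "completed"
    FAILED = "failed"


# A poller can stop once a task reaches one of these states.
TERMINAL_STATES = {TaskState.COMPLETED, TaskState.FAILED}


def agent_card_url(base_url):
    """Build the discovery URL for an agent's AgentCard."""
    return urljoin(base_url.rstrip("/") + "/", ".well-known/agent.json")


def jsonrpc_request(method, params, request_id=None):
    """Serialize a JSON-RPC 2.0 request; a random id is generated when none is given."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id or str(uuid.uuid4()),
        "method": method,
        "params": params,
    })
```

For example, `jsonrpc_request("tasks/submit", {"text": "hi"})` produces a body that could be POSTed to a peer once its card has been fetched from `agent_card_url(...)`.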

Implementation includes:
- Core A2A communication tools and handlers
- Starlette/FastAPI ASGI server with all required endpoints
- Robust client with retry logic and authentication
- Extended AgentContext and AgentConfig for A2A capabilities
- Peer-to-peer communication layer
- Tool registry for dynamic capability discovery
- Comprehensive test suite (50+ test cases)
- Multi-agent workflow examples
- A2A-specific prompt templates

Files added:
- python/tools/a2a_communication.py - A2A communication tool
- python/helpers/a2a_handler.py - Core A2A protocol handler
- python/helpers/a2a_server.py - A2A server implementation
- python/helpers/a2a_client.py - A2A client with proper async handling
- python/helpers/a2a_agent.py - Peer-to-peer communication layer
- python/helpers/a2a_tool_wrapper.py - Dynamic tool discovery
- examples/a2a_multi_agent_workflow.py - Multi-agent demo
- tests/test_a2a_integration.py - Comprehensive test suite
- prompts/default/agent.system.a2a.*.md - A2A prompt templates

Resolves enterprise requirements for agent collaboration and scalability.
…boration

  Replace traditional hierarchical subordinates with A2A-based peer-to-peer
  communication enabling parallel processing, direct user interaction, and
  scalable multi-agent workflows while maintaining full backward compatibility.

  Key Features:
  - True parallel processing with independent subordinate processes
  - Direct user communication with any subordinate agent
  - Auto port allocation and process lifecycle management
  - Agent hierarchy visualization and management
  - Fault tolerance through process isolation
  - Scalable architecture supporting distributed agent networks

  Implementation:
  - A2ASubordinateManager: Complete subordinate lifecycle management
  - A2ASubordinate Tool: Enhanced tool replacing call_subordinate
  - A2ASubordinateRunner: Independent process runner for subordinates
  - Enhanced AgentContext: Multi-agent registry and message routing
  - Extended AgentConfig: Subordinate-specific configuration options

  Benefits over traditional subordinates:
  - Parallel execution instead of sequential processing
  - Direct user access to subordinates via A2A protocol
  - Process isolation prevents cascading failures
  - Horizontal scalability across multiple machines
  - Rich interaction patterns between all participants
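
The difference between sequential and parallel delegation can be sketched with asyncio. `delegate_all` and `send_task` are hypothetical names; `send_task` stands in for whatever coroutine sends a task to one subordinate over A2A.

```python
import asyncio


async def delegate_all(send_task, assignments):
    """Send one task per role concurrently; a failing role does not cancel the others."""
    roles = list(assignments)
    results = await asyncio.gather(
        *(send_task(role, assignments[role]) for role in roles),
        return_exceptions=True,  # exceptions are returned as results, not raised
    )
    return dict(zip(roles, results))
```

The parent only waits concurrently here. The actual parallel work happens in the subordinate processes, which is also why one crashed subordinate shows up as a single exception in the result rather than taking the others down.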

  Files added:
  - python/helpers/a2a_subordinate_manager.py - Subordinate lifecycle management
  - python/helpers/a2a_subordinate_runner.py - Independent subordinate processes
  - python/tools/a2a_subordinate.py - Enhanced subordinate communication tool
  - examples/a2a_enhanced_subordinates_demo.py - Complex workflow demonstration
  - tests/test_a2a_subordinates.py - Comprehensive test suite
  - docs/a2a_subordinates.md - Complete documentation and migration guide
  - prompts/default/agent.system.tool.a2a_subordinate.md - Tool documentation

  Enables sophisticated multi-agent workflows while preserving Agent Zero's
  tool-based simplicity and maintaining backward compatibility.
- Fix duplicate subordinate spawning via role normalization and better status checking
- Implement unique port allocation (8100, 8101, 8102...) instead of port reuse conflicts
- Add configurable memory limits (SUBORDINATE_RAM_GB) with graceful OOM handling
- Enhance memory monitoring during task execution using psutil
- Prevent UI duplicate contexts with proper cleanup on subordinate crash/failure
- Improve error reporting and logging for better debugging
- Add proper resource cleanup (ports, contexts) when subordinates exit prematurely

Fixes issues where subordinates would:
- Spawn duplicates due to poor existing subordinate detection
- Fail to connect due to port 8100 conflicts
- Crash with SIGKILL due to memory exhaustion
- Leave orphaned UI contexts after crashes
@TerminallyLazy
Contributor Author

I think I might need some help with this one... If anyone feels up to it.

@Omni-NexusAI

Would this fix the ballooning memory usage of the agents? I noticed that the container's memory usage keeps increasing steadily toward the limit, which presumably would eventually lead to a crash. I have a lot of RAM, so I haven't managed to reach the max yet, but it is obviously a big problem for long-term usage.

Also, I was wondering whether the agents can truly run in parallel, or whether they are sequential. It appears that they are sequential as of now, but it would be useful if they could run in parallel within the same instance, or if the agent could dynamically create new envs for its agents to perform operations in parallel with each other.

Either way, I've been looking into how these two things could be improved or implemented, and will see what I can do.

    - Added a2a_server_token to Settings TypedDict
    - Updated default settings to auto-generate A2A tokens
    - Added token clearing in sensitive settings removal
    - Implemented token update logic with deferred tasks
  2. DynamicA2AProxy Class (a2a_server.py:574-678):
    - Similar to DynamicMcpProxy but for A2A protocol
    - Supports token-based URL routing: /t-{token}/endpoint
    - Thread-safe reconfiguration of routes
    - ASGI-compatible proxy implementation
  3. A2A Client Updates (a2a_client.py):
    - Added url_token parameter to constructor
    - _build_token_url() method for token-based URL construction
    - Updated all client methods to use token-based URLs when available
  4. Subordinate Integration:
    - Updated subordinate manager to pass A2A tokens to clients
    - Modified subordinate runner to use DynamicA2AProxy
    - Added _start_uvicorn_with_proxy() method for token-based server startup
  5. UI Settings Panel:
    - Added complete A2A settings section with 6 configuration fields
    - Token field is hidden (like MCP) but managed automatically
    - Includes settings for port, subordinate management, and protocol options

  Token-Based URL Structure:

  Standard A2A URLs:
  - /.well-known/agent.json
  - /tasks/submit
  - /message/stream

  Token-Based URLs:
  - /t-{token}/.well-known/agent.json
  - /t-{token}/tasks/submit
  - /t-{token}/message/stream

  This implementation provides the same level of security and URL-based authentication that MCP uses, ensuring that A2A communications are properly
  authenticated via URL tokens while maintaining backward compatibility with standard A2A protocol endpoints.
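
A sketch of both sides of the `/t-{token}/endpoint` scheme, assuming nothing about the real `_build_token_url()` or `DynamicA2AProxy` beyond the URL shapes listed above. `build_token_url` and `strip_token_prefix` are hypothetical names.

```python
import hmac
from typing import Optional


def build_token_url(base_url: str, endpoint: str, url_token: Optional[str] = None) -> str:
    """Client side: prefix the endpoint with /t-{token} when a token is configured."""
    base = base_url.rstrip("/")
    path = "/" + endpoint.lstrip("/")
    if url_token:
        return f"{base}/t-{url_token}{path}"
    return f"{base}{path}"  # standard A2A URL when no token is set


def strip_token_prefix(path: str, expected_token: str) -> Optional[str]:
    """Server side: return the inner path if the token matches, else None."""
    prefix = "/t-"
    if not path.startswith(prefix):
        return None
    token, sep, rest = path[len(prefix):].partition("/")
    # compare_digest avoids leaking how many leading characters matched.
    if not sep or not hmac.compare_digest(token.encode(), expected_token.encode()):
        return None
    return "/" + rest
```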
  - Subordinates inherit complete tool system from parent agent
  - All tools (code execution, web search, file operations, etc.) are available
  - MCP tools are properly inherited and configured
  - Tool discovery and execution works identically to main agent

  ✅ Prompt Configuration Verified:

  - Profile-based prompt system works correctly for subordinates
  - Agent-specific prompts override defaults when available
  - System prompts (behavior, communication, tools) are automatically loaded
  - Template variables and placeholders are properly processed

  ✅ Architecture Integration:

  - Token-based authentication system implemented
  - Subordinate-to-subordinate communication via peer discovery
  - Agent Management UI integration with activity drawer
  - Complete flowchart workflow implemented in a2a_task_delegator.py
  - Independent operation with separate context windows
  - Results integration back to main conversation
…ystem as explicitly required by the user for Agent Zero framework development.

  2. Identified Root Cause: The issue was that the Agent Zero Docker container was missing required Python dependencies and modules for the file handling APIs.
  3. Implemented Comprehensive Solution:
    - Installed Dependencies: Added essential Python packages (attrs, Flask, nest-asyncio, aiohttp) to the container
    - Created Minimal Modules: Deployed lightweight file_info.py and get_work_dir_files.py modules at /a0/ in the container
    - Enhanced RFC Handler: Modified /a0/python/helpers/rfc.py with fallback logic that:
      - First tries to import full module paths (python.api.file_info)
      - Falls back to simplified module names (file_info) when dependencies are missing
      - Ensures /a0 is in the Python path for module discovery
  4. Verified Functionality:
    - Both minimal modules work correctly in the container
    - RFC fallback mechanism successfully routes calls to the appropriate modules
    - File operations (browsing directories, getting file info) now function properly
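
The fallback described in step 3 can be sketched as follows. This is an illustration, not the code in rfc.py; `import_with_fallback` is a hypothetical name, and only the /a0 path and the full-path-then-short-name order come from the text above.

```python
import importlib
import sys


def import_with_fallback(full_name, extra_path="/a0"):
    """Try the full dotted module path first, then the bare module name."""
    try:
        return importlib.import_module(full_name)
    except ImportError:
        if extra_path not in sys.path:
            sys.path.insert(0, extra_path)  # make top-level modules discoverable
        short_name = full_name.rsplit(".", 1)[-1]
        return importlib.import_module(short_name)
```

Calling `import_with_fallback("python.api.file_info")` would load the full module when its dependencies are present and otherwise fall back to a top-level `file_info` module under /a0.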
@Kironkeys

this looks fire bro...im going to test it right now. I added a couple of your pulls too, they are badass appreciate it brotha fireee
