I’m curious how people here are thinking about managing agentic LLM systems once they’re running in production. #10379

bryanadenhq · 2026-01-13T20:44:48Z

Beyond basic observability, things like instrumentation, runtime control, and cost management seem to get complicated quickly as soon as you have multiple agents, tools, and models involved. In particular, it feels hard to reason about cost and token usage at the agent level, apply guardrails or budgets at runtime, or debug and compare agent runs in a structured way rather than just reading logs after the fact.
I’m interested in hearing how others are approaching this today. What parts are you building yourselves, what’s working, and where are you still feeling friction? This is just for discussion and learning, not pitching anything.

KeepALifeUS · 2026-02-12T20:36:49Z

Great topic — this is exactly what I've been working on. Here's what's worked for me:

1. Shared State as the Observability Layer

Instead of parsing logs, make the coordination mechanism itself the audit trail:

state = {
    "run_id": "abc123",
    "agent_decisions": {
        "retriever": {"action": "query", "tokens": 150, "latency_ms": 230},
        "analyzer": {"action": "summarize", "tokens": 890, "latency_ms": 1200}
    },
    "total_cost": 0.0042,
    "errors": []
}

Every agent reads from and writes to this state. You get:

- Token tracking per agent: no log parsing needed
- Cost attribution: you can see which agent is expensive
- Deterministic replay: feed the same state, get the same behavior (assuming model outputs are recorded or sampling is fixed)
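
A rough sketch of how agents might write into that state so per-agent tokens and cost fall out for free. This is an illustration, not the code from the linked repo: `new_state`, `record_decision`, `most_expensive_agent`, and the prices in `PRICE_PER_1K_TOKENS` are all hypothetical names and values.

```python
# Hypothetical prices in USD per 1,000 tokens; substitute your provider's real rates.
PRICE_PER_1K_TOKENS = {"retriever": 0.001, "analyzer": 0.004}


def new_state(run_id):
    """Create an empty shared state in the same shape as the example state."""
    return {"run_id": run_id, "agent_decisions": {}, "total_cost": 0.0, "errors": []}


def record_decision(state, agent, action, tokens, latency_ms):
    """Add one agent step to the shared state and update the running cost."""
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS.get(agent, 0.0)
    entry = state["agent_decisions"].setdefault(
        agent, {"action": action, "tokens": 0, "latency_ms": 0, "cost": 0.0}
    )
    entry["action"] = action          # keep the most recent action
    entry["tokens"] += tokens         # accumulate across repeated calls
    entry["latency_ms"] += latency_ms
    entry["cost"] += cost
    state["total_cost"] += cost
    return state


def most_expensive_agent(state):
    """Return the name of the agent with the highest recorded cost, or None."""
    decisions = state["agent_decisions"]
    if not decisions:
        return None
    return max(decisions, key=lambda name: decisions[name]["cost"])


state = new_state("abc123")
record_decision(state, "retriever", "query", tokens=150, latency_ms=230)
record_decision(state, "analyzer", "summarize", tokens=890, latency_ms=1200)
print(most_expensive_agent(state), state["total_cost"])
```

Because every write goes through one function, the same place can later enforce budgets or append to a history list without touching the agents.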

2. Token Budgets with Circuit Breakers

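# Sum tokens from the per-agent entries in the shared state.
total_tokens = sum(d["tokens"] for d in state["agent_decisions"].values())
if total_tokens > BUDGET_LIMIT:  # BUDGET_LIMIT: per-run token cap you choose
    state["status"] = "paused"
    state["pause_reason"] = "Budget exceeded"
    # Alert or escalate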
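
One way to turn this check into an actual circuit breaker is to call it before every agent step and have it stay tripped once the budget is exceeded. The sketch below assumes token counts live in the per-agent entries of the shared state; `check_budget`, `total_tokens`, and the `BUDGET_LIMIT` value are hypothetical.

```python
BUDGET_LIMIT = 1000  # hypothetical per-run token cap


def total_tokens(state):
    """Sum the tokens recorded by every agent in the shared state."""
    return sum(d["tokens"] for d in state["agent_decisions"].values())


def check_budget(state, limit=BUDGET_LIMIT):
    """Return True if the run may continue; pause the run if it is over budget."""
    if state.get("status") == "paused":
        return False  # breaker already tripped; stay paused until someone resets it
    used = total_tokens(state)
    if used > limit:
        state["status"] = "paused"
        state["pause_reason"] = f"Budget exceeded: {used} > {limit} tokens"
        return False
    return True


state = {
    "run_id": "abc123",
    "agent_decisions": {
        "retriever": {"action": "query", "tokens": 150, "latency_ms": 230},
        "analyzer": {"action": "summarize", "tokens": 890, "latency_ms": 1200},
    },
}

if not check_budget(state):
    # Alert or escalate here instead of scheduling the next agent step.
    print(state["pause_reason"])
```

Keeping the paused flag in the shared state means any agent or orchestrator that reads the state sees the same stop signal.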

3. Structured Diffs for Debugging

Instead of comparing logs, compare state snapshots:

diff = compare_states(run_v1_state, run_v2_state)
# Shows exactly what changed between runs
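
`compare_states` is not defined in the post, so here is one possible sketch of it: a recursive diff over nested dicts that reports each changed field by its dotted path. The `MISSING` sentinel and the sample run states are assumptions made for illustration.

```python
MISSING = object()  # sentinel for "key not present in this snapshot"


def compare_states(old, new, path=""):
    """Return {dotted_path: (old_value, new_value)} for every differing field."""
    changes = {}
    for key in sorted(set(old) | set(new), key=str):
        key_path = f"{path}.{key}" if path else str(key)
        if key not in old:
            changes[key_path] = (MISSING, new[key])
        elif key not in new:
            changes[key_path] = (old[key], MISSING)
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Recurse so nested agent entries are compared field by field.
            changes.update(compare_states(old[key], new[key], key_path))
        elif old[key] != new[key]:
            changes[key_path] = (old[key], new[key])
    return changes


run_v1_state = {
    "agent_decisions": {"analyzer": {"action": "summarize", "tokens": 890}},
    "total_cost": 0.0042,
}
run_v2_state = {
    "agent_decisions": {"analyzer": {"action": "summarize", "tokens": 640}},
    "total_cost": 0.0031,
    "errors": ["timeout"],
}

diff = compare_states(run_v1_state, run_v2_state)
for field, (before, after) in diff.items():
    print(field, before, "->", after)
```

Unchanged fields (here `action`) do not appear in the result, so the output stays short even when the state is large.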

Key Insight

The coordination layer (shared state) becomes the observability layer. You don't add monitoring on top — it's built into how agents coordinate.

This pattern reduced our debugging time dramatically. Working implementation: https://github.com/KeepALifeUS/autonomous-agents

Would love to hear what others are building!
