This document explains the current WebSocket behavior in Forge from the point of view of a client or UI.
The goal is to make the runtime behavior understandable without requiring the client to read Go and Python internals. It focuses on:
- which WebSocket endpoints exist
- which messages flow over each socket
- how to understand status, progress, and failure
- how infrastructure lifecycle events differ from guild/business messages
- what a client should render or ignore
Forge exposes two different WebSocket channels per guild:
- usercomms
  Used for user-to-guild conversational and business messages.
- syscomms
  Used for system-facing traffic relevant to the current user session, guild status traffic, and guild infrastructure lifecycle events.
These channels are intentionally different.
usercomms is where a browser or CLI sends user requests and receives agent responses.
syscomms is where a browser or CLI should look for:
- system notifications targeted at the current user
- guild health/status traffic
- launch/progress/failure events emitted by Forge runtime infrastructure
These are the public routes registered by the main API server:
GET /ws/guilds/:id/usercomms/:user_id/:user_name
GET /ws/guilds/:id/syscomms/:user_id
These endpoints use the canonical Forge protocol.Message wire shape.
The local Rustic UI exposes proxy-compatible routes:
GET /rustic/ws/:ws_id/usercomms
GET /rustic/ws/:ws_id/syscomms
These use a compatibility shaping layer that accepts and emits a more UI-oriented JSON shape. Internally they still map onto the same backend topics and message flow.
If you are writing a new external client, prefer the canonical public endpoints unless you explicitly need parity with the local Rustic UI transport.
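As a minimal sketch, the public routes can be turned into connection URLs with a small helper. The function names and the base address used in the usage note are hypothetical, and percent-encoding each path segment is an assumption about what the server accepts, not something this document confirms.

```python
from urllib.parse import quote


def usercomms_url(base: str, guild_id: str, user_id: str, user_name: str) -> str:
    """Build /ws/guilds/:id/usercomms/:user_id/:user_name under a base address."""
    # Assumption: each segment is percent-encoded so spaces or slashes
    # in a user name cannot change the route shape.
    guild, user, name = (quote(part, safe="") for part in (guild_id, user_id, user_name))
    return f"{base.rstrip('/')}/ws/guilds/{guild}/usercomms/{user}/{name}"


def syscomms_url(base: str, guild_id: str, user_id: str) -> str:
    """Build /ws/guilds/:id/syscomms/:user_id under a base address."""
    guild, user = (quote(part, safe="") for part in (guild_id, user_id))
    return f"{base.rstrip('/')}/ws/guilds/{guild}/syscomms/{user}"
```

For example, `syscomms_url("ws://localhost:8080", "g1", "u1")` yields `ws://localhost:8080/ws/guilds/g1/syscomms/u1`; the host and port here are placeholders.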
Forge messaging is guild-scoped. A WebSocket connected to guild g1 only sees traffic within namespace g1.
The important topics for WebSocket clients are:
- user:<user_id>
  Inbound user requests from usercomms
- user_notifications:<user_id>
  Outbound user-facing responses delivered to usercomms
- user_system:<user_id>
  Inbound system requests from syscomms
- user_system_notification:<user_id>
  Outbound user-targeted system notifications delivered to syscomms
- guild_status_topic
  Guild-manager health/status traffic delivered to syscomms
- infra_events_topic
  Forge runtime lifecycle events delivered to syscomms
The split matters:
- usercomms is about guild interaction
- syscomms is about runtime/system state
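The topic layout above can be captured in a small lookup that a client might use for logging or debugging. This is an illustrative sketch only: the table contents come from the list above, while the function name and the (socket, direction) tuple shape are hypothetical.

```python
from typing import Tuple

# Topics addressed per user: "<prefix>:<user_id>".
PER_USER_TOPICS = {
    "user": ("usercomms", "inbound"),
    "user_notifications": ("usercomms", "outbound"),
    "user_system": ("syscomms", "inbound"),
    "user_system_notification": ("syscomms", "outbound"),
}

# Topics shared by the whole guild.
GUILD_TOPICS = {
    "guild_status_topic": ("syscomms", "outbound"),
    "infra_events_topic": ("syscomms", "outbound"),
}


def describe_topic(topic: str) -> Tuple[str, str]:
    """Return (socket, direction) for a topic name, or raise ValueError."""
    if topic in GUILD_TOPICS:
        return GUILD_TOPICS[topic]
    prefix, sep, user_id = topic.partition(":")
    if sep and user_id and prefix in PER_USER_TOPICS:
        return PER_USER_TOPICS[prefix]
    raise ValueError(f"unrecognized topic: {topic!r}")
```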
The public WebSocket routes send and receive the canonical protocol.Message JSON shape.
At a minimum, clients should expect fields like:
{
"id": 9650997620256485376,
"sender": {
"id": "echo-agent",
"name": "Echo Agent"
},
"topics": ["default_topic"],
"payload": {
"message": "hello"
},
"format": "some.qualified.Format",
"thread": [9650997620256485376],
"message_history": [],
"traceparent": "00-...",
"topic_published_to": "user_notifications:u1"
}
The client should treat the following fields as the primary routing metadata:
- format
  The semantic type of the message. This is the most important field for rendering.
- payload
  The actual message body.
- sender
  Who produced the message.
- topic_published_to
  Which topic the message was actually delivered on. Useful for debugging and telemetry.
- thread
  Conversation lineage.
- message_history
  Process lineage added by Forge/guild execution.
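A client can decode the fields listed above into a small typed record before routing. The sketch below is not Forge's own type: the class name, the helper name, and the choice of which fields are required (id and format) are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ForgeMessage:
    """Client-side view of the routing fields of a canonical message."""
    id: int
    format: str
    payload: Any = None
    sender: Dict[str, Any] = field(default_factory=dict)
    topics: List[str] = field(default_factory=list)
    thread: List[int] = field(default_factory=list)
    message_history: List[Any] = field(default_factory=list)
    topic_published_to: Optional[str] = None


def parse_message(raw: str) -> ForgeMessage:
    """Decode one WebSocket text frame; unknown fields are ignored."""
    data = json.loads(raw)
    return ForgeMessage(
        id=data["id"],          # assumed always present
        format=data["format"],  # assumed always present
        payload=data.get("payload"),
        sender=data.get("sender") or {},
        topics=data.get("topics") or [],
        thread=data.get("thread") or [],
        message_history=data.get("message_history") or [],
        topic_published_to=data.get("topic_published_to"),
    )
```

Python's json module keeps the 64-bit id exact, so values such as 9650997620256485376 survive decoding unchanged.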
The local UI compatibility layer accepts alternate field names such as:
- data instead of payload
- topic instead of topics
- threads instead of thread
- messageHistory instead of message_history
- recipientList instead of recipient_list
- conversationId instead of conversation_id
- inReplyTo instead of in_response_to
It also aliases some short format names like:
- healthcheck
- questionResponse
- formResponse
- participantsRequest
- chatCompletionRequest
- stopGuildRequest
For new clients, this compatibility mode should be treated as legacy/UI-specific rather than the primary contract.
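A client that has to read both shapes can rename the alternate field names to their canonical equivalents up front. The sketch below is a client-side illustration, not the compatibility layer's actual code: it only renames keys, it does not convert values (for example, whether topic carries a string or a list is not specified here), and letting the canonical key win when both are present is an assumption.

```python
from typing import Any, Dict

# Alternate field name -> canonical field name, as listed above.
FIELD_ALIASES = {
    "data": "payload",
    "topic": "topics",
    "threads": "thread",
    "messageHistory": "message_history",
    "recipientList": "recipient_list",
    "conversationId": "conversation_id",
    "inReplyTo": "in_response_to",
}


def to_canonical_fields(message: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of message with alias keys renamed to canonical keys."""
    renamed = {FIELD_ALIASES[k]: v for k, v in message.items() if k in FIELD_ALIASES}
    # Assumption: if both an alias and its canonical key exist, keep the canonical value.
    renamed.update({k: v for k, v in message.items() if k not in FIELD_ALIASES})
    return renamed
```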
usercomms subscribes to:
user_notifications:<user_id>
That means the socket only receives outbound user-visible responses and notifications for that user.
A client sends an application message. Forge wraps it into a canonical protocol.Message and publishes it to:
user:<user_id>
The wrapped message uses:
- sender.id = user_socket:<user_id>
- sender.name = <user_name>
- format = rustic_ai.core.messaging.core.message.Message
- payload = normalized user envelope
The normalized inner payload preserves user-supplied fields like:
- topics
- payload
- format
- recipient_list
- thread
- message_history
- in_response_to
- conversation_id
usercomms is not the right place to look for launch progress or runtime failures. Even if an agent eventually surfaces an error as a business message, infrastructure state belongs on syscomms.
syscomms subscribes to three topic families:
- user_system_notification:<user_id>
- guild_status_topic
- infra_events_topic
This means a single syscomms socket carries three categories of outbound messages:
- direct system notifications for the user
- guild manager health/status messages
- Forge runtime lifecycle messages
The client can send system-oriented messages with:
- format
- payload
Forge wraps them and publishes them to:
user_system:<user_id>
Important behavior:
- inbound syscomms messages missing either format or payload are dropped
- the server resets the thread to [current_message_id]
- the server injects its own trace context
- the sender becomes sys_comms_socket:<user_id>
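Because the server drops syscomms messages that lack format or payload, a client may want to check outgoing frames itself so the failure is visible locally. The helper below is a hypothetical sketch; whether an empty payload object counts as "missing" on the server is not stated, so only absent or None values are rejected here.

```python
import json
from typing import Any


def build_syscomms_frame(fmt: str, payload: Any) -> str:
    """Serialize a syscomms message, refusing ones the server would drop."""
    if not fmt:
        raise ValueError("syscomms message needs a non-empty format")
    if payload is None:
        raise ValueError("syscomms message needs a payload")
    # thread, trace context, and sender are set by the server, so they are not sent.
    return json.dumps({"format": fmt, "payload": payload})
```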
When a syscomms socket connects, Forge immediately publishes a HealthCheckRequest to guild_status_topic.
This is an internal kick to prompt guild-manager health/status reporting. A client should not interpret the connection itself as proof that the guild is healthy. Wait for actual outbound messages on guild_status_topic and infra_events_topic.
A client should not treat HTTP 201 Created from guild creation as "guild is running".
Guild launch is asynchronous.
The correct model is:
- HTTP create/relaunch says the launch request was accepted.
- syscomms shows runtime progress over time.
- infra_events_topic explains what Forge runtime is doing.
- guild_status_topic reflects guild-manager health/status once the manager is alive enough to emit it.
This gives you two complementary views:
- infrastructure lifecycle view: infra_events_topic
- guild health/application view: guild_status_topic
Forge now emits structured runtime events with:
- topic: infra_events_topic
- format: rustic_ai.forge.runtime.InfraEvent
The payload of a canonical protocol.Message on infra_events_topic looks like:
{
"schema_version": 1,
"event_id": "a1b2c3d4",
"kind": "agent.process.started",
"severity": "info",
"timestamp": "2026-03-25T23:31:08.754Z",
"guild_id": "guild-dist-docker",
"agent_id": "echo-agent",
"organization_id": "e2e-org",
"request_id": "66936059-f857-47bd-b95f-bee0896796d9",
"node_id": "local-node",
"source": {
"component": "forge-go.supervisor.process"
},
"attempt": 1,
"message": "agent process started",
"detail": {
"pid": 12345
}
}
Current severities are:
- info
- warning
- error
Guild launch events:
- guild.launch.requested
- guild.launch.persisted
- guild.launch.enqueue_requested
- guild.launch.enqueued
- guild.launch.enqueue_failed
Spawn handling events:
- agent.spawn.received
- agent.spawn.rejected
- agent.spawn.skipped_existing_remote
Process lifecycle events:
- agent.process.starting
- agent.process.started
- agent.process.start_failed
- agent.process.exited
- agent.process.restarting
- agent.process.failed
- agent.process.stopped
Treat infra events as the primary source of progress and failure for launch/runtime operations.
A good UI model is:
- show a timeline from kind, timestamp, and message
- show current phase derived from the most recent event
- highlight severity = error
- show retry count from attempt when present
- show process details like pid, exit code, or error text from detail
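The UI model above maps naturally onto one row per InfraEvent payload. The following is a sketch under assumptions: the TimelineRow type and both helper names are hypothetical, and rows are kept in arrival order rather than sorted by timestamp.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class TimelineRow:
    timestamp: str
    kind: str
    message: str
    is_error: bool            # True when severity == "error", for highlighting
    attempt: Optional[int]    # retry count, when the event carries one
    detail: Dict[str, Any] = field(default_factory=dict)


def timeline_row(event: Dict[str, Any]) -> TimelineRow:
    """Project an InfraEvent payload onto the fields a timeline renders."""
    return TimelineRow(
        timestamp=event.get("timestamp", ""),
        kind=event.get("kind", ""),
        message=event.get("message", ""),
        is_error=event.get("severity") == "error",
        attempt=event.get("attempt"),
        detail=event.get("detail") or {},
    )


def latest_kind(rows: List[TimelineRow]) -> Optional[str]:
    """Kind of the most recently received event, used to derive the current phase."""
    return rows[-1].kind if rows else None
```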
Healthy launch:
- guild.launch.requested
- guild.launch.persisted
- guild.launch.enqueue_requested
- guild.launch.enqueued
- agent.spawn.received
- agent.process.starting
- agent.process.started
Pre-launch rejection:
- guild.launch.enqueued
- agent.spawn.received
- agent.spawn.rejected
Crash with retries:
- agent.process.started
- agent.process.exited
- agent.process.restarting
- agent.process.started
- repeat
- agent.process.failed
Explicit stop:
- agent.process.started
- agent.process.stopped
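These sequences can be replayed through a small reducer that keeps only the phase implied by the most recent event. The phase labels ("launching", "running", and so on) are hypothetical names chosen for this sketch, and leaving the phase unchanged for kinds it does not recognize, such as agent.spawn.skipped_existing_remote, is an assumption.

```python
from typing import Iterable

# Event kind -> hypothetical UI phase label.
PHASE_BY_KIND = {
    "guild.launch.enqueue_failed": "failed",
    "agent.spawn.rejected": "failed",
    "agent.process.starting": "starting",
    "agent.process.started": "running",
    "agent.process.start_failed": "failed",
    "agent.process.exited": "exited",
    "agent.process.restarting": "restarting",
    "agent.process.failed": "failed",
    "agent.process.stopped": "stopped",
}


def phase_after(kinds: Iterable[str], initial: str = "unknown") -> str:
    """Fold event kinds, in arrival order, into the current phase."""
    phase = initial
    for kind in kinds:
        if kind in PHASE_BY_KIND:
            phase = PHASE_BY_KIND[kind]
        elif kind.startswith("guild.launch.") or kind == "agent.spawn.received":
            phase = "launching"
        # Any other kind leaves the phase as it was.
    return phase
```

With this mapping, the healthy launch sequence ends in "running", the crash-with-retries sequence ends in "failed", and the explicit stop ends in "stopped".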
guild_status_topic is not the same thing as infrastructure lifecycle.
It is the guild-manager health/status lane.
Typical formats seen here include:
- rustic_ai.core.guild.agent_ext.mixins.health.HealthCheckRequest
- rustic_ai.core.guild.agent_ext.mixins.health.AgentsHealthReport
Use guild_status_topic to answer questions like:
- Is the guild manager alive enough to respond?
- What does the manager say about agent health?
- Has the application-level guild reached a healthy state?
Do not use it as the only source of launch truth, because infrastructure failures can happen before the manager is alive enough to publish anything useful.
That is exactly why infra_events_topic exists.
Messages on user_system_notification:<user_id> are user-targeted system messages delivered over syscomms.
They are not necessarily lifecycle events and should not be interpreted as such unless their format or payload explicitly says so.
A client should treat them as a separate rendering lane from infra events.
A robust client should keep separate derived state for:
- conversation state from usercomms
- runtime lifecycle state from infra_events_topic
- health/state summary from guild_status_topic
- user-targeted system notifications from user_system_notification:<user_id>
One practical model is:
- conversationTimeline
  Messages from usercomms
- runtimeTimeline
  InfraEvent payloads from syscomms
- guildHealth
  Latest health/status payload from guild_status_topic
- systemNotifications
  Everything from user_system_notification:<user_id>
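One way to realize this model is a single dispatch function that files each decoded message into its lane. This is a sketch, not a prescribed client design: the ClientState type and apply_message name are hypothetical, and skipping HealthCheckRequest messages when updating guild health is an assumption that requests carry no health data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

INFRA_EVENT_FORMAT = "rustic_ai.forge.runtime.InfraEvent"


@dataclass
class ClientState:
    conversation_timeline: List[Dict[str, Any]] = field(default_factory=list)
    runtime_timeline: List[Any] = field(default_factory=list)
    guild_health: Optional[Any] = None
    system_notifications: List[Dict[str, Any]] = field(default_factory=list)


def apply_message(state: ClientState, socket: str, msg: Dict[str, Any]) -> None:
    """File one canonical message, received on the given socket, into its lane."""
    if socket == "usercomms":
        state.conversation_timeline.append(msg)
        return
    topic = msg.get("topic_published_to") or ""
    fmt = msg.get("format") or ""
    if fmt == INFRA_EVENT_FORMAT or topic == "infra_events_topic":
        state.runtime_timeline.append(msg.get("payload"))
    elif topic == "guild_status_topic":
        # Assumption: a HealthCheckRequest is a probe, not a health report.
        if not fmt.endswith(".HealthCheckRequest"):
            state.guild_health = msg.get("payload")
    elif topic.startswith("user_system_notification:"):
        state.system_notifications.append(msg)
    # Anything else on syscomms is ignored rather than guessed at.
```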
Drive progress indicators from the latest infra event:
- guild.launch.* means launch is being orchestrated
- agent.process.starting means the process launch has begun
- agent.process.restarting means the process is in retry/backoff
- agent.process.started means process startup succeeded
Show launch/runtime failure when you see:
- guild.launch.enqueue_failed
- agent.spawn.rejected
- agent.process.start_failed
- agent.process.failed
Use:
- severity
- message
- detail.error
- detail.exit_code
- attempt
to build a human-readable error summary.
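A minimal sketch of such a summary, assuming the event payload has already been decoded into a dict. The function name and the output wording are hypothetical, and every field other than kind is treated as optional.

```python
from typing import Any, Dict, Optional

FAILURE_KINDS = {
    "guild.launch.enqueue_failed",
    "agent.spawn.rejected",
    "agent.process.start_failed",
    "agent.process.failed",
}


def error_summary(event: Dict[str, Any]) -> Optional[str]:
    """Return a one-line summary for failure events, or None for other events."""
    kind = event.get("kind")
    if kind not in FAILURE_KINDS:
        return None
    detail = event.get("detail") or {}
    summary = f"{event.get('severity', 'error')}: {event.get('message') or kind}"
    extras = []
    if "error" in detail:
        extras.append(f"error={detail['error']}")
    if "exit_code" in detail:
        extras.append(f"exit_code={detail['exit_code']}")
    if "attempt" in event:
        extras.append(f"attempt={event['attempt']}")
    return f"{summary} ({', '.join(extras)})" if extras else summary
```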
Use AgentsHealthReport or other guild status messages to represent the current guild/application health, not launch orchestration.
WebSockets are live subscriptions, not durable progress replay.
After reconnect, a client should:
- reconnect usercomms and/or syscomms
- wait for fresh messages
- refresh guild metadata over HTTP if it needs a current summary
- treat new infra events as authoritative going forward
If a client needs durable historical timelines, that requires a separate history or persistence layer. The WebSocket by itself should be treated as a live stream.
Client sends on usercomms:
{
"format": "my.app.UserPrompt",
"topics": ["default_topic"],
"payload": {
"text": "hello"
}
}
Forge wraps and publishes it internally to user:<user_id>.
Client sends on syscomms:
{
"format": "my.app.ControlAction",
"payload": {
"action": "refresh"
}
}
Forge wraps and publishes it internally to user_system:<user_id>.
Client receives a canonical protocol.Message on syscomms:
{
"id": 9650997620256485376,
"sender": {
"id": "forge-go.supervisor.process"
},
"topics": ["infra_events_topic"],
"format": "rustic_ai.forge.runtime.InfraEvent",
"payload": {
"schema_version": 1,
"event_id": "abc123",
"kind": "agent.process.failed",
"severity": "error",
"timestamp": "2026-03-25T23:31:55.000Z",
"guild_id": "test-guild-bwrap",
"agent_id": "echo-agent-bwrap",
"message": "agent process failed after retry exhaustion",
"detail": {
"error": "Read-only file system"
}
},
"topic_published_to": "infra_events_topic"
}
If you are building a CLI or browser UI, the recommended baseline is:
- Open usercomms for conversational traffic.
- Open syscomms for runtime/system traffic.
- Route by format.
- Treat rustic_ai.forge.runtime.InfraEvent as the source of progress/failure.
- Treat HealthCheckRequest and AgentsHealthReport as guild health traffic, not launch orchestration.
- Do not equate HTTP create/relaunch success with runtime success.
Some important constraints in the current design:
- syscomms multiplexes three different streams on one socket.
- The client must inspect format and sometimes topic_published_to to distinguish them.
- WebSocket delivery is live-stream oriented, not durable replay.
- Launch success/failure is asynchronous; the socket view is more accurate than the immediate HTTP response.
The short version is:
- use usercomms for normal guild interaction
- use syscomms for status, progress, and failures
- use infra_events_topic for runtime lifecycle
- use guild_status_topic for guild-manager health/status
- treat the WebSocket as the live truth for launch progress after HTTP accept/queueing