fix: ensure data published to MQTT is not dropped#305
fix: ensure data published to MQTT is not dropped#305paulstuart wants to merge 40 commits intodevelopfrom
Conversation
…il rotation is required
Vulnerability Scan: PassedImage:
Commit: 683914e |
|
Go test coverage
Total coverage: 59.6% |
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent telemetry/OTLP data from being dropped during MQTT disconnects caused by frequent credential rotation, by queueing publishes and refreshing JWT credentials automatically on MQTT auto-reconnect. It also adds a debug-only signal trigger for forcing rotation/inspecting token status and wires build tags through local builds and Docker builds.
Changes:
- Switch OTLP bridge publishing to
PublishViaQueueto buffer messages across MQTT disconnects. - Add an autopaho
ConnectPacketBuilderthat refreshes the JWT on auto-reconnect, and wire it fromFleetConfigManager. - Add a debug-only SIGUSR1/SIGUSR2 trigger for credential rotation/status logging, plus build-tag plumbing (
BUILD_TAGS) for make/Docker and a small golangci-lint config tweak.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| Makefile | Adds BUILD_TAGS plumbing into go build/go test and passes build args into Docker builds. |
| agent/otlpbridge/publisher_adapter.go | Uses PublishViaQueue to buffer OTLP publishes during disconnects. |
| agent/otlpbridge/mqtt.go | Uses PublishViaQueue for the OTLP bridge MQTT publisher implementation. |
| agent/docker/Dockerfile | Accepts/passes BUILD_TAGS into the make agent_bin build step. |
| agent/configmgr/fleet/token_refresh_test.go | Adds unit tests for the reconnect-time JWT refresh ConnectPacketBuilder behavior. |
| agent/configmgr/fleet/debug.go | Introduces a small interface used by debug-only trigger code (avoids package dependency). |
| agent/configmgr/fleet/debug_trigger.go | Adds debug-tag-only OS signal trigger to rotate/log credentials. |
| agent/configmgr/fleet/debug_trigger_test.go | Adds debug-tag-only tests for signal-trigger behavior. |
| agent/configmgr/fleet/debug_trigger_off.go | Provides a no-op StartDebugTrigger when not built with -tags debug. |
| agent/configmgr/fleet/connection.go | Adds token refresher plumbing, ConnectPacketBuilder for auto-reconnect JWT refresh, and dispatch shutdown race handling changes. |
| agent/configmgr/fleet.go | Wires token refresher into the MQTT connection; starts debug triggers; implements debug credential methods. |
| .github/golangci.yaml | Enables relative-path-mode: gomod for lint output paths. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Somewhat moot as the debug code will go away once code is guaranteed solid Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…d require refresh. An 'expected' error that is recoverable and part of the process should not be treated as a real error. The goal is that the agent should never log an issue as an error unless it is a real problem and we don't get blind to 'normal errors'
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0762225091
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 33 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
agent/otlpbridge/handlers_test.go:45
newBridgeWithTopics(enc Encoder)no longer uses theencargument and hard-codesEncoding: "protobuf", which makes the parameter misleading and weakens the test helper’s intent. Either remove the unused parameter or use it to select the encoder/BridgeConfig so tests accurately reflect the requested encoding.
func newBridgeWithTopics(enc Encoder) (*BridgeServer, *fakePublisher) {
fp := &fakePublisher{}
bridge, _ := NewBridgeServer(BridgeConfig{Encoding: "protobuf"}, nil, nil)
bridge.SetPublisher(fp)
bridge.SetIngestTopic("ingest")
bridge.SetTelemetryTopic("telemetry")
return bridge, fp
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 58fbb0d17a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0652cc50c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 33 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 11f5c861af
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 170b5f3edf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if resp.StatusCode == http.StatusUnauthorized || resp.StatusCode == http.StatusForbidden { | ||
| return nil, &AuthError{StatusCode: resp.StatusCode, Body: string(body)} | ||
| } |
There was a problem hiding this comment.
Classify non-401/403 auth client errors as non-retriable
Startup now retries indefinitely for any error that is not AuthError (maxStartupRetries = 0 in fleet.go), but this code only marks 401/403 as AuthError. Permanent client failures from the token endpoint such as 400/404 therefore get treated as transient, causing the agent to loop forever in startup instead of failing fast on invalid auth configuration.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 33 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| s.draining = false | ||
| s.ready = true | ||
| s.retryBackoff = 0 | ||
| s.pendingMu.Unlock() |
There was a problem hiding this comment.
drainPending() unconditionally sets s.ready = true at the end of a successful drain. If ClearPublisher() races during the drain (or publisher/topics are swapped), this can mark the bridge ready even though publisher is now nil or no longer matches the drained connection. That can prevent future drains (drainPending returns early when ready==true) and strand messages in s.pending. Before setting ready=true, re-check under locks that the publisher/topics are still the ones you drained against (or that publisher != nil), otherwise keep ready=false and leave messages queued for the next readiness cycle.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Too many changes for one PR and refactored to 3 for better comprehension |
As the MQTT credentials rotate frequently, it encounters situations where the connection is invalid and the data is unable to be sent.
This leverages the queuing capability in the handler and ensures that normal credentials rotation does not drop data, nor pollute the logs with errors that are "part of doing business"