NETOBSERV-2376 improve FlowCollector status #2610

jpinsonneau wants to merge 11 commits into netobserv:main from
Conversation
Skipping CI for Draft Pull Request.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Add per-component and integration status types and CRD fields, pod-health aggregation, exporter tracking and event emission in the status manager, RBAC for events, health-based readiness checks, pointer-based Loki status handling, reconciler updates, and related tests/docs/CRD changes.
Sequence Diagram(s)

sequenceDiagram
participant Reconciler as Controller Reconciler
participant StatusMgr as Status Manager
participant K8s as Kubernetes API
participant Recorder as Event Recorder
participant FlowCR as FlowCollector CR
Reconciler->>StatusMgr: set component status (Ready/InProgress/Failure/Degraded/Unused)
StatusMgr->>K8s: list Pods (label selector per component)
K8s-->>StatusMgr: pod list
StatusMgr->>StatusMgr: classify pod issues, build PodHealthSummary
alt pod issues detected
StatusMgr->>StatusMgr: downgrade to Degraded, attach PodIssues/UnhealthyCount
end
StatusMgr->>Recorder: emit transition events (create/patch)
Recorder->>K8s: create Event (core/v1 Events)
StatusMgr->>FlowCR: populate fc.status.components & fc.status.integrations (exporters)
Reconciler->>K8s: patch FlowCollector CR status
K8s-->>FlowCR: status updated
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
New images: quay.io/netobserv/network-observability-operator:7b36896
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-7b36896
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-7b36896

They will expire in two weeks. To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:7b36896 make deploy
# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-7b36896

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
name: netobserv-dev
namespace: openshift-marketplace
spec:
sourceType: grpc
image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-7b36896
displayName: NetObserv development catalog
publisher: Me
updateStrategy:
registryPoll:
interval: 1m
Actionable comments posted: 15
🧹 Nitpick comments (6)
internal/controller/monitoring/monitoring_controller.go (1)
165-166: Wrap health-dashboard build failures with a specific status reason.

This is the only helper error path that bypasses `r.status.Error(...)`, so it degrades to a generic `MonitoringError` and loses phase context. As per coding guidelines, "Wrap errors with context using proper error handling patterns in the controller reconciliation logic and error handling paths".

Suggested change:

```diff
 if desiredHealthDashboardCM, del, err := buildHealthDashboard(ns, nsFlowsMetric); err != nil {
-	return err
+	return r.status.Error("CantBuildHealthDashboard", err)
 } else if !del {
 	cms = append(cms, desiredHealthDashboardCM)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/monitoring/monitoring_controller.go` around lines 165 - 166, The call to buildHealthDashboard currently returns errors that bypass r.status.Error and become generic MonitoringError; modify the error handling around buildHealthDashboard(ns, nsFlowsMetric) so that when err != nil you wrap the error with a specific status reason and call r.status.Error(...) before returning—use the same phase/context used elsewhere (e.g., include MonitoringError or a new reason like HealthDashboardBuildFailed) and attach the original err as the cause; ensure this occurs in the branch handling desiredHealthDashboardCM, del, err to preserve phase context and include the build failure details in the returned error.

internal/controller/consoleplugin/consoleplugin_test.go (1)
559-560: Test intent says nil, but non-nil pointer is passed.

Line 559 and Line 620 claim nil LokiStack-status behavior, but `&lokiStatusUnused` skips the actual nil branch. Use `nil` to validate the pointer-guard path. As per coding guidelines for `**/*_test.go`: include tests for valid cases, invalid input, and edge cases (empty, nil values).

Proposed test fix:

```diff
- cm, _, err = builder.configMap(context.Background(), nil, &lokiStatusUnused)
+ cm, _, err = builder.configMap(context.Background(), nil, nil)
```

```diff
- cm, digest, err := builder.configMap(context.Background(), nil, &lokiStatusUnused)
+ cm, digest, err := builder.configMap(context.Background(), nil, nil)
```

Also applies to: 620-621
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/consoleplugin/consoleplugin_test.go` around lines 559 - 560, The test is claiming to exercise the nil LokiStack-status branch but passes a non-nil pointer (&lokiStatusUnused) to builder.configMap; update the failing assertions to pass nil for the Loki status argument (i.e., call builder.configMap(context.Background(), nil, nil) or builder.configMap(..., nil, nil) where the third parameter is meant to be nil) so the pointer-guard branch in configMap is executed, and add/adjust the paired test at the other location (around lines referencing lokiStatusUnused) to explicitly cover the non-nil case as well; locate tests that call builder.configMap and replace &lokiStatusUnused with nil for the nil-case test and keep a separate test using &lokiStatusUnused to cover the non-nil path.

internal/pkg/manager/status/statuses.go (1)

81-86: `ptr.To(*s.DesiredReplicas)` is redundant.

You're dereferencing and re-wrapping the same value. Direct assignment suffices since both fields are `*int32`.

Simplify:

```diff
 if s.DesiredReplicas != nil {
-	cs.DesiredReplicas = ptr.To(*s.DesiredReplicas)
+	cs.DesiredReplicas = s.DesiredReplicas
 }
 if s.ReadyReplicas != nil {
-	cs.ReadyReplicas = ptr.To(*s.ReadyReplicas)
+	cs.ReadyReplicas = s.ReadyReplicas
 }
```

🤖 Prompt for AI Agents
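One nuance worth noting when applying this: `ptr.To(*p)` yields a pointer to a fresh copy, while direct assignment shares the original pointer, so later mutations through the source become visible in the destination. A minimal standalone sketch of that difference (the struct types here are simplified stand-ins, not the operator's actual `ComponentStatus`):

```go
package main

import "fmt"

// ptrTo mirrors k8s.io/utils/ptr.To: it returns a pointer to a copy of v.
func ptrTo[T any](v T) *T { return &v }

type src struct{ DesiredReplicas *int32 }
type dst struct{ DesiredReplicas *int32 }

func main() {
	n := int32(3)
	s := src{DesiredReplicas: &n}

	// Direct assignment shares the same *int32.
	aliased := dst{DesiredReplicas: s.DesiredReplicas}
	// ptrTo dereferences first, so this points at an independent copy.
	copied := dst{DesiredReplicas: ptrTo(*s.DesiredReplicas)}

	n = 5 // mutate through the original pointer
	fmt.Println(*aliased.DesiredReplicas, *copied.DesiredReplicas) // 5 3
}
```

If the source status is never mutated after conversion, the direct assignment the review suggests is safe; otherwise the dereference-and-copy is the defensive choice.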
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/statuses.go` around lines 81 - 86, The code redundantly dereferences and re-wraps pointer fields; instead of using ptr.To(*s.DesiredReplicas) and ptr.To(*s.ReadyReplicas), assign the pointers directly: set cs.DesiredReplicas = s.DesiredReplicas and cs.ReadyReplicas = s.ReadyReplicas so you avoid unnecessary dereference/re-wrap via ptr.To while preserving the *int32 types referenced in s.DesiredReplicas, s.ReadyReplicas and cs.DesiredReplicas, cs.ReadyReplicas.

internal/pkg/manager/status/status_manager.go (3)

104-115: Minor duplication in `setReady` vs `preserveReplicas`.

`setReady` manually preserves replicas with the same pattern as `preserveReplicas`. Consider using `preserveReplicas` for consistency.

Optional refactor:

```diff
 func (s *Manager) setReady(cpnt ComponentName) {
-	existing := s.getStatus(cpnt)
-	cs := ComponentStatus{
-		Name:   cpnt,
-		Status: StatusReady,
-	}
-	if existing != nil {
-		cs.DesiredReplicas = existing.DesiredReplicas
-		cs.ReadyReplicas = existing.ReadyReplicas
-	}
+	cs := s.preserveReplicas(cpnt)
+	cs.Status = StatusReady
 	s.statuses.Store(cpnt, cs)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/status_manager.go` around lines 104 - 115, The setReady method duplicates logic that copies DesiredReplicas and ReadyReplicas; replace the manual preservation with a call to preserveReplicas to keep behavior consistent. Inside Manager.setReady, remove the existing != nil block that copies replicas and after building cs := ComponentStatus{...StatusReady...} call s.preserveReplicas(cpnt, &cs) (or the appropriate signature of preserveReplicas) before storing with s.statuses.Store so replica fields are preserved consistently with other setters; keep getStatus and statuses.Store usage unchanged.
190-196: Exporter slice order is non-deterministic.
`sync.Map.Range` iteration order is undefined, so `fc.Status.Integrations.Exporters` may have different ordering between calls. This could cause unnecessary status updates or confuse users inspecting the CRD.

Sort exporters by name for stable output:

```diff
+import "sort"
+
 var exporters []flowslatest.FlowCollectorExporterStatus
 s.exporters.Range(func(_, v any) bool {
 	exp := v.(flowslatest.FlowCollectorExporterStatus)
 	exporters = append(exporters, exp)
 	return true
 })
+sort.Slice(exporters, func(i, j int) bool {
+	return exporters[i].Name < exporters[j].Name
+})
 fc.Status.Integrations.Exporters = exporters
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/status_manager.go` around lines 190 - 196, The exporter slice is gathered via s.exporters.Range which yields undefined order; after collecting into the local exporters slice (from s.exporters.Range and type flowslatest.FlowCollectorExporterStatus) sort the slice by exporter name before assigning to fc.Status.Integrations.Exporters (e.g., use sort.Slice on exporters comparing exporters[i].Name < exporters[j].Name) and add the sort import so fc.Status.Integrations.Exporters has a stable, deterministic order.
463-487: Redundant `setDeploymentReplicas` calls.

`setDeploymentReplicas` is called at line 468 (before condition check), then again at lines 473, 476, or 486 depending on the branch. The first call at line 468 is unnecessary since every branch calls it afterward.

Remove redundant call:

```diff
 func (i *Instance) CheckDeploymentProgress(d *appsv1.Deployment) {
 	if d == nil {
 		i.s.setInProgress(i.cpnt, "DeploymentNotCreated", "Deployment not created")
 		return
 	}
-	i.setDeploymentReplicas(d)
 	for _, c := range d.Status.Conditions {
 		if c.Type == appsv1.DeploymentAvailable {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/status_manager.go` around lines 463 - 487, In CheckDeploymentProgress remove the redundant early call to i.setDeploymentReplicas(d) (the one immediately after the nil check) because every control path below already calls i.setDeploymentReplicas(d); keep the calls inside the DeploymentAvailable condition branches, the final ready/not-ready branches, and the nil check behavior (i.s.setInProgress when d == nil) unchanged so replicas are still updated in all actual-deployment paths; look for the CheckDeploymentProgress function and the setDeploymentReplicas references to make this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bundle/manifests/flows.netobserv.io_flowcollectors.yaml`:
- Around line 6807-6850: The exporter status cannot be reliably matched to
spec.exporters because spec.exporters has no stable identifier; add a unique
stable key (e.g., add a required string field "id" or make "name"
required/unique) to spec.exporters entries and update the status schema so
status.integrations.exporters is keyed (or includes that same id) rather than
relying on array position; specifically modify the spec.exporters definition to
include a required unique identifier and update the FlowCollectorExporterStatus
/ integrations.exporters mapping (and any consumers) to reference that
identifier instead of array index or ambiguous type/name alone.
In `@config/crd/bases/flows.netobserv.io_flowcollectors.yaml`:
- Around line 6265-6299: status.integrations.exporters[].name is declared
required but spec.exporters[] has no stable identifier to join against; either
add an optional stable identifier field (e.g., spec.exporters[].exporterID or
spec.exporters[].name) with the kubebuilder '+optional' marker and string type
so clients can reliably map status entries to spec entries, or remove/clarify
the comment that status.name matches spec entries. Update the CRD schema for
spec.exporters (the items.properties block) to include the new optional field
and ensure any generated CRD tags use '+optional' to maintain backward
compatibility.
- Around line 18-28: The Sampling column is shown generically but points only at
.spec.agent.ebpf.sampling, which is misleading for non-eBPF agents; restore mode
specificity by re-adding an Agent column that shows .spec.agent.type (jsonPath:
.spec.agent.type) so users can see the agent mode, and make the Sampling column
explicitly eBPF-specific (rename to "eBPF Sampling" and keep jsonPath:
.spec.agent.ebpf.sampling). This ensures kubectl output reflects agent type
(Agent) and that Sampling is clearly scoped to eBPF.
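Sketched as CRD printer-column entries, this is roughly what the suggestion could look like (a hedged example based on the field paths named above, not the repository's exact column set):

```yaml
# Hypothetical additionalPrinterColumns fragment for the FlowCollector CRD
additionalPrinterColumns:
  - name: Agent
    type: string
    jsonPath: .spec.agent.type
  - name: eBPF Sampling
    type: string
    jsonPath: .spec.agent.ebpf.sampling
```

With both columns present, `kubectl get flowcollectors` shows the agent mode alongside a sampling value that is explicitly scoped to eBPF.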
In `@docs/FlowCollector.md`:
- Around line 13481-13485: The docs incorrectly state that the exporter `name`
matches `spec.exporters` entries even though items in `spec.exporters` have no
`name` field; update the documentation to either explain exactly how the status
`name` is derived (e.g., generated identifier, combination of exporter `type` +
index, or other deterministic key) or, if the implementation supports it,
add/mention the true cross-reference key (e.g., `id`, `key`, or a user-provided
`name`) that links status entries to spec entries; reference the `name` field in
status and the `spec.exporters` array in the docs so users can reliably
correlate entries.
In `@helm/crds/flows.netobserv.io_flowcollectors.yaml`:
- Around line 6266-6303: The status field FlowCollectorExporterStatus currently
uses name in status.integrations.exporters[] to "match spec.exporters entries"
but spec.exporters only contains type, kafka, ipfix, openTelemetry so multiple
spec entries can share the same type; update the CRD by adding a stable
identifier (e.g., "id" or "name") as an optional/required field to
spec.exporters[] and mirror that identifier in
status.integrations.exporters[].name (or change status to reference
spec.exporters[].<newId>) so clients can deterministically map a status entry
back to its spec entry; alternatively, enforce uniqueness on
spec.exporters[].type if you prefer to keep no identifier (update spec.exporters
schema accordingly and reflect that constraint in status validation).
In `@internal/controller/ebpf/agent_controller.go`:
- Around line 146-150: Do not unconditionally map all failures to
"AgentKafkaError" based solely on target.Spec.UseKafka(); instead keep the
default reason "AgentControllerError" and only set "AgentKafkaError" when the
error is actually Kafka-related. Update the block around reason,
target.Spec.UseKafka(), and c.Status.SetFailure in agent_controller.go: keep
reason := "AgentControllerError" by default, and replace the current boolean
check with a check that recognizes Kafka-specific failures (e.g., inspect the
operation context or use errors.Is/errors.As against known Kafka error types or
an errIsKafkaRelated(err) helper) so c.Status.SetFailure receives
"AgentKafkaError" only for genuine Kafka errors. Ensure references to reason and
c.Status.SetFailure remain intact.
In `@internal/controller/flp/flp_controller.go`:
- Around line 196-204: The updateExporterStatuses method is marking exporters as
Ready even when the flow is paused; change it to check fc.Spec.Execution.Mode
(e.g., compare to the ExecutionModeOnHold constant or "OnHold") inside
Reconciler.updateExporterStatuses and, when on-hold, either skip marking
exporters as Ready or set them to a non-ready status with a clear reason (e.g.,
"ExecutionOnHold" or "On hold") using r.mgr.Status.SetExporterStatus(name,
string(exp.Type), <non-ready status>, "<reason>", ""); updateExporterStatuses
and calls to r.mgr.Status.SetExporterStatus are the targets to modify.
- Line 141: The subnet auto-detect failure calls
r.status.SetDegraded("SubnetDetectionError", ...) but upstream Reconcile then
calls r.status.SetReady(), wiping the degraded state; modify reconcile() so that
when the subnet auto-detection step (the call that currently logs SetDegraded on
error) fails it returns an error to abort the success path, or alternatively
propagate a failure flag from that detection and skip the unconditional
r.status.SetReady() call; specifically update the function that performs subnet
auto-detect (and the reconcile/Reconcile control flow) to either return the
error from that detection to the caller or check a detected-failure boolean
before invoking r.status.SetReady(), ensuring SetDegraded's Reason/Message are
preserved when detection fails.
In `@internal/controller/monitoring/monitoring_controller.go`:
- Around line 130-134: Currently dashboard cleanup runs only inside
reconcileDashboards() when r.mgr.ClusterInfo.IsOpenShift() &&
r.mgr.ClusterInfo.HasSvcMonitor() is true, so when HasSvcMonitor() flips to
false previously created dashboard ConfigMaps remain; add an else branch after
that if which calls a new cleanup function (e.g., cleanupStaleDashboards(ctx,
clh, ns) or reuse reconcileDashboardsWithCleanup(ctx, clh, ns, removeOnly bool))
that enumerates and deletes the managed dashboard ConfigMaps (the same
identifiers/labels used by reconcileDashboards()) and returns any error;
implement the cleanup helper near reconcileDashboards() and invoke it when
HasSvcMonitor() is false so stale dashboard ConfigMaps are removed.
- Around line 126-129: The reconcile currently calls
r.status.SetDegraded("ServiceMonitorMissing", ...) when ServiceMonitor is absent
but then always calls r.status.SetReady(), which overwrites the degraded state
via setReady(); change the flow so SetReady() is not called when a degraded
condition exists — e.g., use the status manager's predicate
(r.status.IsDegraded() or equivalent) to skip calling r.status.SetReady() after
reconcile, or modify SetReady()/setReady() to preserve existing degraded
conditions instead of replacing the ComponentStatus; update the handling around
ServiceMonitor detection (the block using SetDegraded and the final SetReady
call) to ensure a degraded ServiceMonitorMissing condition remains set and the
operator is not marked ready.
In `@internal/pkg/manager/status/pod_health.go`:
- Around line 91-122: classifyPodIssue currently misses pods stuck in Pending
due to scheduling; update classifyPodIssue to detect pod.Status.Phase ==
corev1.PodPending and inspect pod.Status.Conditions for a condition with Type ==
corev1.PodScheduled and Status == corev1.ConditionFalse (or Reason ==
"Unschedulable") and return a new reason like "PendingScheduling" with the
condition.Message (or Reason + Message) as the human-readable msg; ensure you
reference pod.Status.Conditions and return this before the final healthy "" ,""
return so Pending scheduling failures appear in podIssues.
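The suggested detection can be sketched with simplified stand-ins for the corev1 types (the real code would use `corev1.PodPending`, `corev1.PodScheduled`, and `corev1.ConditionFalse`):

```go
package main

import "fmt"

// Minimal stand-ins for the k8s.io/api/core/v1 types involved.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

type podStatus struct {
	Phase      string
	Conditions []condition
}

// classifyPending returns a reason/message pair for pods stuck Pending
// because they cannot be scheduled; healthy pods return empty strings.
func classifyPending(st podStatus) (string, string) {
	if st.Phase != "Pending" {
		return "", ""
	}
	for _, c := range st.Conditions {
		if c.Type == "PodScheduled" && (c.Status == "False" || c.Reason == "Unschedulable") {
			return "PendingScheduling", c.Message
		}
	}
	return "", ""
}

func main() {
	st := podStatus{
		Phase: "Pending",
		Conditions: []condition{{
			Type: "PodScheduled", Status: "False",
			Reason: "Unschedulable", Message: "0/3 nodes are available",
		}},
	}
	reason, msg := classifyPending(st)
	fmt.Println(reason, "-", msg) // PendingScheduling - 0/3 nodes are available
}
```

Running this check before the final healthy return ensures scheduling failures surface in the pod-issue summary rather than being silently treated as healthy.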
- Around line 43-87: The Issues string is non-deterministic because the code
iterates the groups map directly; to fix, collect the map keys (the issue
reasons) from groups, sort them (e.g., using sort.Strings), and then iterate the
sorted keys when building parts so the order of parts is deterministic; update
the loop that currently ranges over "for _, g := range groups" to instead range
over sorted reason keys and lookup groups[reason] (referencing issueGroup,
groups, classifyPodIssue, maxPodNamesInSummary, truncateMessage, and
PodHealthSummary to locate the relevant code).
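The sorted-keys pattern reads like this as a minimal sketch (the group type and summary format are simplified, not the actual `PodHealthSummary` layout):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// summarize builds an Issues string from a reason->pods map in a
// deterministic order by sorting the map keys before iterating, since Go map
// iteration order is randomized.
func summarize(groups map[string][]string) string {
	reasons := make([]string, 0, len(groups))
	for r := range groups {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons)
	var parts []string
	for _, r := range reasons {
		parts = append(parts, fmt.Sprintf("%s: %s", r, strings.Join(groups[r], ",")))
	}
	return strings.Join(parts, "; ")
}

func main() {
	groups := map[string][]string{
		"OOMKilled":        {"flp-abc"},
		"CrashLoopBackOff": {"agent-xyz"},
	}
	fmt.Println(summarize(groups))
	// CrashLoopBackOff: agent-xyz; OOMKilled: flp-abc
}
```

A stable Issues string matters here for the same reason as exporter ordering: it prevents no-op status churn between reconciles.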
In `@internal/pkg/manager/status/status_manager_test.go`:
- Around line 10-11: The test imports "github.com/stretchr/testify/assert" and
"github.com/stretchr/testify/require" which must be migrated to Ginkgo/Gomega;
remove those testify imports and add "github.com/onsi/ginkgo/v2" and
"github.com/onsi/gomega" (or the project’s preferred ginkgo/gomega aliases),
replace the top-level TestX functions with Describe/It blocks (or a Ginkgo
entrypoint) and convert all assert.*/require.* calls to gomega Expect
expressions (e.g., Expect(err).To(HaveOccurred()) /
Expect(val).To(Equal(expected))), and if any setup/teardown currently in test
helpers uses testify suites, move them into BeforeEach/AfterEach within the
Describe so the status_manager_test.go aligns with the project Ginkgo/Gomega
testing style.
In `@internal/pkg/manager/status/status_manager.go`:
- Around line 362-371: The check for the transformer status should also skip
StatusUnused so we don't treat unused components as active; update the condition
in the block that calls s.getStatus(FLPTransformer) (the same block handling
StatusFailure/StatusDegraded and PodHealth.UnhealthyCount) to require ts.Status
is neither StatusUnknown nor StatusUnused before proceeding, ensuring stale pod
health for unused transformers won't set hasKafkaIssue or append transformer pod
messages.
In `@internal/pkg/manager/status/statuses.go`:
- Around line 36-38: The switch currently groups StatusUnknown and StatusUnused
and sets c.Reason = "Unused", which conflates two distinct states; split the
combined case in the switch so StatusUnknown sets c.Status =
metav1.ConditionUnknown and c.Reason = "Unknown" (or "StatusUnknown") while
StatusUnused retains c.Status = metav1.ConditionUnknown and c.Reason = "Unused";
locate the change in the same switch handling c.Status/c.Reason (symbols:
StatusUnknown, StatusUnused, c.Status, c.Reason, metav1.ConditionUnknown) and
update any related tests or callers that assert the reason string.
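A minimal sketch of the split switch; the types and reason strings here are simplified stand-ins for the status manager's constants, and both states still map to `metav1.ConditionUnknown`:

```go
package main

import "fmt"

type status string

const (
	statusUnknown status = "Unknown"
	statusUnused  status = "Unused"
)

// conditionReason splits the previously merged case so each state keeps a
// distinct reason while sharing the same ConditionUnknown status.
func conditionReason(s status) (condStatus, reason string) {
	switch s {
	case statusUnknown:
		return "Unknown", "Unknown"
	case statusUnused:
		return "Unknown", "Unused"
	default:
		return "True", "Ready"
	}
}

func main() {
	for _, s := range []status{statusUnknown, statusUnused} {
		cs, r := conditionReason(s)
		fmt.Printf("%s -> status=%s reason=%s\n", s, cs, r)
	}
}
```

Keeping the reasons distinct lets consumers tell "we have not evaluated this component" apart from "this component is intentionally not deployed".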
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dd34bc19-ad83-4575-8f2a-37c97fcf99e5
📒 Files selected for processing (23)
- api/flowcollector/v1beta2/flowcollector_types.go
- api/flowcollector/v1beta2/zz_generated.deepcopy.go
- bundle/manifests/flows.netobserv.io_flowcollectors.yaml
- bundle/manifests/netobserv-operator.clusterserviceversion.yaml
- config/crd/bases/flows.netobserv.io_flowcollectors.yaml
- config/rbac/role.yaml
- docs/FlowCollector.md
- helm/crds/flows.netobserv.io_flowcollectors.yaml
- helm/templates/clusterrole.yaml
- internal/controller/consoleplugin/consoleplugin_objects.go
- internal/controller/consoleplugin/consoleplugin_reconciler.go
- internal/controller/consoleplugin/consoleplugin_test.go
- internal/controller/ebpf/agent_controller.go
- internal/controller/flowcollector_controller.go
- internal/controller/flp/flp_controller.go
- internal/controller/monitoring/monitoring_controller.go
- internal/controller/reconcilers/reconcilers.go
- internal/pkg/manager/manager.go
- internal/pkg/manager/status/pod_health.go
- internal/pkg/manager/status/pod_health_test.go
- internal/pkg/manager/status/status_manager.go
- internal/pkg/manager/status/status_manager_test.go
- internal/pkg/manager/status/statuses.go
New images: quay.io/netobserv/network-observability-operator:4d98c0f
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-4d98c0f
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-4d98c0f

They will expire in two weeks. To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:4d98c0f make deploy
# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-4d98c0f

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
name: netobserv-dev
namespace: openshift-marketplace
spec:
sourceType: grpc
image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-4d98c0f
displayName: NetObserv development catalog
publisher: Me
updateStrategy:
registryPoll:
interval: 1m
Actionable comments posted: 1
♻️ Duplicate comments (4)
bundle/manifests/flows.netobserv.io_flowcollectors.yaml (1)
6811-6851: ⚠️ Potential issue | 🟠 Major

Use a stable exporter key instead of array position.

Line 6821 makes exporter identity depend on its position in `spec.exporters`. Reordering or inserting entries changes that identity, so consumers can no longer reliably match status back to the same exporter. Add a stable user-defined key on `spec.exporters[]` and mirror that in `status.integrations.exporters`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bundle/manifests/flows.netobserv.io_flowcollectors.yaml` around lines 6811 - 6851, The exporter identity must not be position-based; add a stable user-provided identifier field (e.g., spec.exporters[].key or spec.exporters[].id) to the spec and require it, then change the status schema FlowCollectorExporterStatus under status.integrations.exporters to include and require that same key field (instead of relying on the generated name like "kafka-export-0"), and update any code that writes status to populate this key from spec.exporters to ensure consumers can reliably match status to the correct exporter after reorders/insertions.

helm/crds/flows.netobserv.io_flowcollectors.yaml (1)
6277-6303: ⚠️ Potential issue | 🟠 Major

Exporter status identity is still order-dependent.

On Line 6277, `name` is generated from exporter type + index. Reordering `spec.exporters[]` changes identity and makes client-side mapping/history unstable. Add a stable identifier in `spec.exporters[]` (e.g., `name`/`id`) and mirror that value here.

Suggested CRD adjustment:

```diff
 # In spec.exporters[].properties
+name:
+  description: Stable exporter identifier used to correlate status entries.
+  type: string

 # In status.integrations.exporters[].properties.name
-description: '`name` is a generated identifier for this exporter (e.g., "kafka-export-0"), derived from its type and position in spec.exporters.'
+description: '`name` matches spec.exporters[].name and is used as a stable identifier for this exporter.'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@helm/crds/flows.netobserv.io_flowcollectors.yaml` around lines 6277 - 6303, The exporter identity in the CRD is currently generated from type+index and is order-dependent; update the CRD to require a stable identifier in spec.exporters (e.g., add a required string field "name" or "id" under spec.exporters) and change the status exporter schema (the object under status.exporters with fields name, state, type, reason) to mirror that stable identifier instead of relying on generated index-based names; ensure the new spec.exporters.{name|id} is marked required in the openapi schema and referenced in status.exporters so reordering spec.exporters[] does not change exporter identity.

internal/pkg/manager/status/status_manager_test.go (1)
7-11: ⚠️ Potential issue | 🟡 Minor

Migrate this test file to Ginkgo/Gomega.

Line 7 and Lines 10-11 still anchor this file on `testing` + testify; project rules for `**/*_test.go` require Ginkgo/Gomega patterns. Please convert `Test*`/`t.Run` blocks to `Describe`/`It` and `assert`/`require` calls to `Expect`.

```shell
#!/bin/bash
set -euo pipefail
target="internal/pkg/manager/status/status_manager_test.go"

echo "== Framework imports =="
rg -n 'github.com/stretchr/testify|github.com/onsi/ginkgo|github.com/onsi/gomega|^\s*"testing"' "$target"
echo
echo "== Test structure patterns =="
rg -n '^\s*func\s+Test[A-Za-z0-9_]+\s*\(|\bt\.Run\s*\(' "$target"
```

As per coding guidelines, `**/*_test.go`: Use Ginkgo/Gomega patterns for unit tests - include tests for valid cases, invalid input, and edge cases (empty, nil values).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/status_manager_test.go` around lines 7 - 11, Replace the table-driven Test* and t.Run style in status_manager_test.go with Ginkgo/Gomega: remove the "testing" import and testify imports ("github.com/stretchr/testify/assert" and "github.com/stretchr/testify/require") and add "github.com/onsi/ginkgo/v2" and "github.com/onsi/gomega"; convert the top-level func TestX(t *testing.T) into a Ginkgo entrypoint that calls gomega.RegisterFailHandler(ginkgo.Fail) and ginkgo.RunSpecs (or use Ginkgo V2 suite pattern), then convert each t.Run subtest into ginkgo.Describe/Context and ginkgo.It blocks, replacing assert/require calls with gomega.Expect(...) assertions (e.g. assert.Equal -> Expect(...).To(Equal(...)), require.NoError -> Expect(err).ToNot(HaveOccurred())), adjust any setup to BeforeEach/AfterEach, and keep references to the same helpers/types like flowslatest and any functions under test so tests exercise the same cases (valid, invalid, edge) after migration.

config/crd/bases/flows.netobserv.io_flowcollectors.yaml (1)
18-28: ⚠️ Potential issue | 🟡 Minor

Keep the sampling column eBPF-specific.

The header is now generic, but the JSONPath still targets `.spec.agent.ebpf.sampling`. Rename it back to `eBPF Sampling` or `Sampling (eBPF)` so `kubectl get` stays precise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/crd/bases/flows.netobserv.io_flowcollectors.yaml` around lines 18 - 28, The CRD list column for sampling is eBPF-specific but the column header was made generic; update the `name` for the column whose `jsonPath` is `.spec.agent.ebpf.sampling` (the entry currently named `Sampling`) to a clear eBPF-specific label such as `eBPF Sampling` or `Sampling (eBPF)` so `kubectl get` output remains precise; locate the block with `jsonPath: .spec.agent.ebpf.sampling` and change its `name` value accordingly.
🧹 Nitpick comments (1)
internal/pkg/manager/status/status_manager.go (1)
458-484: Redundant `setDeploymentReplicas` calls can be consolidated.

The call at line 465 sets replica counts, then `setInProgress`/`setReady` preserve them via `preserveReplicas`. The subsequent calls at lines 470, 473, and 483 reapply the same values.

Consider calling `setDeploymentReplicas` once at the end, after all status-setting paths converge:

♻️ Suggested consolidation

```diff
 func (i *Instance) CheckDeploymentProgress(d *appsv1.Deployment) {
 	if d == nil {
 		i.s.setInProgress(i.cpnt, "DeploymentNotCreated", "Deployment not created")
 		return
 	}
-	i.setDeploymentReplicas(d)
 	for _, c := range d.Status.Conditions {
 		if c.Type == appsv1.DeploymentAvailable {
 			if c.Status != v1.ConditionTrue {
 				i.s.setInProgress(i.cpnt, "DeploymentNotReady", fmt.Sprintf("Deployment %s not ready: %d/%d (%s)", d.Name, d.Status.ReadyReplicas, d.Status.Replicas, c.Message))
-				i.setDeploymentReplicas(d)
 			} else {
 				i.s.setReady(i.cpnt)
-				i.setDeploymentReplicas(d)
 			}
+			i.setDeploymentReplicas(d)
 			return
 		}
 	}
 	if d.Status.ReadyReplicas == d.Status.Replicas && d.Status.Replicas > 0 {
 		i.s.setReady(i.cpnt)
 	} else {
 		i.s.setInProgress(i.cpnt, "DeploymentNotReady", fmt.Sprintf("Deployment %s not ready: %d/%d (missing condition)", d.Name, d.Status.ReadyReplicas, d.Status.Replicas))
 	}
 	i.setDeploymentReplicas(d)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/status_manager.go` around lines 458 - 484, The CheckDeploymentProgress method repeatedly calls setDeploymentReplicas before and after setting status, causing redundant work; refactor CheckDeploymentProgress to call setDeploymentReplicas once after all status branches converge (i.e., remove the early calls at the start and inside the loop/branches), keep the logic that uses d.Status and conditions to decide between i.s.setInProgress(i.cpnt, ...) and i.s.setReady(i.cpnt), and ensure preserveReplicas behavior (if any) still occurs by calling setDeploymentReplicas(d) exactly once at the end of the function; key symbols: CheckDeploymentProgress, setDeploymentReplicas, i.s.setInProgress, i.s.setReady.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/pkg/manager/status/statuses.go`:
- Around line 35-48: The switch over s.Status in the status conversion can leave
c.Status empty for unknown/empty inputs, producing invalid Kubernetes
Conditions; add a default case in the switch that sets c.Status =
metav1.ConditionUnknown and c.Reason = "Unknown" (or another safe reason) so any
unrecognized s.Status value is handled; update the switch that references
s.Status and c.Status (and existing cases like StatusUnknown, StatusUnused,
StatusFailure, StatusInProgress, StatusDegraded, StatusReady) to include this
default branch to ensure conditions are always valid.
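The default-branch fix above can be sketched in isolation. This is a stdlib-only illustration with local stand-ins for the manager's status enum and `metav1.Condition` (the True/False/Unknown mapping per state is an assumption for the example, not the operator's actual table):

```go
package main

import "fmt"

// Status is a local stand-in for the manager's component status enum.
type Status string

const (
	StatusUnknown    Status = "Unknown"
	StatusUnused     Status = "Unused"
	StatusFailure    Status = "Failure"
	StatusInProgress Status = "InProgress"
	StatusDegraded   Status = "Degraded"
	StatusReady      Status = "Ready"
)

// Condition is a local stand-in for metav1.Condition.
type Condition struct {
	Status string // "True", "False", or "Unknown" - must never be empty
	Reason string
}

// toCondition maps a component status to a condition; the default branch
// guarantees Condition.Status is never left empty for unrecognized input.
func toCondition(s Status) Condition {
	var c Condition
	switch s {
	case StatusReady:
		c.Status = "True"
		c.Reason = string(StatusReady)
	case StatusFailure, StatusDegraded:
		c.Status = "False"
		c.Reason = string(s)
	case StatusInProgress, StatusUnused, StatusUnknown:
		c.Status = "Unknown"
		c.Reason = string(s)
	default:
		// Unrecognized or empty input: still emit a valid condition.
		c.Status = "Unknown"
		c.Reason = "Unknown"
	}
	return c
}

func main() {
	fmt.Println(toCondition(StatusReady).Status)
	fmt.Println(toCondition(Status("")).Status) // hits the default branch
}
```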
---
Duplicate comments:
In `@bundle/manifests/flows.netobserv.io_flowcollectors.yaml`:
- Around line 6811-6851: The exporter identity must not be position-based; add a
stable user-provided identifier field (e.g., spec.exporters[].key or
spec.exporters[].id) to the spec and require it, then change the status schema
FlowCollectorExporterStatus under status.integrations.exporters to include and
require that same key field (instead of relying on the generated name like
"kafka-export-0"), and update any code that writes status to populate this key
from spec.exporters to ensure consumers can reliably match status to the correct
exporter after reorders/insertions.
In `@config/crd/bases/flows.netobserv.io_flowcollectors.yaml`:
- Around line 18-28: The CRD list column for sampling is eBPF-specific but the
column header was made generic; update the `name` for the column whose
`jsonPath` is `.spec.agent.ebpf.sampling` (the entry currently named `Sampling`)
to a clear eBPF-specific label such as `eBPF Sampling` or `Sampling (eBPF)` so
`kubectl get` output remains precise; locate the block with `jsonPath:
.spec.agent.ebpf.sampling` and change its `name` value accordingly.
In `@helm/crds/flows.netobserv.io_flowcollectors.yaml`:
- Around line 6277-6303: The exporter identity in the CRD is currently generated
from type+index and is order-dependent; update the CRD to require a stable
identifier in spec.exporters (e.g., add a required string field "name" or "id"
under spec.exporters) and change the status exporter schema (the object under
status.exporters with fields name, state, type, reason) to mirror that stable
identifier instead of relying on generated index-based names; ensure the new
spec.exporters.{name|id} is marked required in the openapi schema and referenced
in status.exporters so reordering spec.exporters[] does not change exporter
identity.
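A hypothetical sketch of the stable-identifier proposal above; the type and field names here (`Exporter`, `ExporterStatus`, `Name`) are illustrative, not the operator's actual API types:

```go
package main

import "fmt"

// Exporter stands in for a spec.exporters[] entry with the proposed
// user-provided, stable Name that survives reordering.
type Exporter struct {
	Name string
	Type string
}

// ExporterStatus mirrors spec.exporters[].name instead of a generated
// index-based value such as "kafka-export-0".
type ExporterStatus struct {
	Name   string
	State  string
	Type   string
	Reason string
}

// findStatus matches a status entry to its spec exporter by the stable
// name, so inserting or reordering exporters cannot mismatch statuses.
func findStatus(statuses []ExporterStatus, name string) (ExporterStatus, bool) {
	for _, s := range statuses {
		if s.Name == name {
			return s, true
		}
	}
	return ExporterStatus{}, false
}

func main() {
	statuses := []ExporterStatus{
		{Name: "audit-kafka", State: "Ready", Type: "Kafka"},
		{Name: "siem-ipfix", State: "Failure", Type: "IPFIX", Reason: "ConnectionRefused"},
	}
	s, ok := findStatus(statuses, "siem-ipfix")
	fmt.Println(ok, s.State)
}
```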
In `@internal/pkg/manager/status/status_manager_test.go`:
- Around line 7-11: Replace the table-driven Test* and t.Run style in
status_manager_test.go with Ginkgo/Gomega: remove the "testing" import and
testify imports ("github.com/stretchr/testify/assert" and
"github.com/stretchr/testify/require") and add "github.com/onsi/ginkgo/v2" and
"github.com/onsi/gomega"; convert the top-level func TestX(t *testing.T) into a
Ginkgo entrypoint that calls gomega.RegisterFailHandler(ginkgo.Fail) and
ginkgo.RunSpecs (or use Ginkgo V2 suite pattern), then convert each t.Run
subtest into ginkgo.Describe/Context and ginkgo.It blocks, replacing
assert/require calls with gomega.Expect(...) assertions (e.g. assert.Equal ->
Expect(...).To(Equal(...)), require.NoError ->
Expect(err).ToNot(HaveOccurred())), adjust any setup to BeforeEach/AfterEach,
and keep references to the same helpers/types like flowslatest and any functions
under test so tests exercise the same cases (valid, invalid, edge) after
migration.
---
Nitpick comments:
In `@internal/pkg/manager/status/status_manager.go`:
- Around line 458-484: The CheckDeploymentProgress method repeatedly calls
setDeploymentReplicas before and after setting status, causing redundant work;
refactor CheckDeploymentProgress to call setDeploymentReplicas once after all
status branches converge (i.e., remove the early calls at the start and inside
the loop/branches), keep the logic that uses d.Status and conditions to decide
between i.s.setInProgress(i.cpnt, ...) and i.s.setReady(i.cpnt), and ensure
preserveReplicas behavior (if any) still occurs by calling
setDeploymentReplicas(d) exactly once at the end of the function; key symbols:
CheckDeploymentProgress, setDeploymentReplicas, i.s.setInProgress, i.s.setReady.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c3d92d76-e665-4334-85d1-7dc35bed3a83
📒 Files selected for processing (10)
- api/flowcollector/v1beta2/flowcollector_types.go
- bundle/manifests/flows.netobserv.io_flowcollectors.yaml
- config/crd/bases/flows.netobserv.io_flowcollectors.yaml
- docs/FlowCollector.md
- helm/crds/flows.netobserv.io_flowcollectors.yaml
- internal/pkg/manager/status/pod_health.go
- internal/pkg/manager/status/pod_health_test.go
- internal/pkg/manager/status/status_manager.go
- internal/pkg/manager/status/status_manager_test.go
- internal/pkg/manager/status/statuses.go
✅ Files skipped from review due to trivial changes (2)
- internal/pkg/manager/status/pod_health.go
- docs/FlowCollector.md
🚧 Files skipped from review as they are similar to previous changes (2)
- api/flowcollector/v1beta2/flowcollector_types.go
- internal/pkg/manager/status/pod_health_test.go
New images:

- quay.io/netobserv/network-observability-operator:5de85bb
- quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-5de85bb
- quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-5de85bb

They will expire in two weeks.

To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:5de85bb make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-5de85bb
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-5de85bb
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
Actionable comments posted: 3
♻️ Duplicate comments (2)
internal/controller/flp/flp_controller.go (2)
200-208: ⚠️ Potential issue | 🟠 Major

Don't mark exporters ready while execution is on hold.

This still publishes every exporter as `Ready`/`Configured` even when the processor is intentionally paused, so the integration status is a false positive.

Possible fix

```diff
 func (r *Reconciler) updateExporterStatuses(fc *flowslatest.FlowCollector) {
 	r.mgr.Status.ClearExporters()
+	if fc.Spec.OnHold() {
+		return
+	}
 	for i, exp := range fc.Spec.Exporters {
 		if exp == nil {
 			continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/flp/flp_controller.go` around lines 200 - 208, The updateExporterStatuses function is marking exporters as Ready even when the FlowCollector is paused; modify updateExporterStatuses (which iterates fc.Spec.Exporters and calls r.mgr.Status.SetExporterStatus) to first check the FlowCollector pause/hold flag on fc.Spec (e.g., fc.Spec.Paused or fc.Spec.OnHold) and if the collector is paused, either skip setting Ready or set an appropriate non-ready status/message (e.g., "OnHold"/"Paused" with a non-ready status) instead of "Configured"/StatusReady so the exporter status reflects the paused state.
145-146: ⚠️ Potential issue | 🟠 Major

Preserve the degraded status from subnet auto-detect failures.

This still calls `SetDegraded(...)` and keeps going. The outer success path then sets the parent back to `Ready`, so the degradation is lost before commit.

Possible fix

```diff
-	r.status.SetReady()
+	if current := r.status.Get(); current.Status == status.StatusUnknown {
+		r.status.SetReady()
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/flp/flp_controller.go` around lines 145 - 146, The subnet auto-detect error calls r.status.SetDegraded(...) but the outer success path later clears that by setting the parent back to Ready; change the control flow in the function that performs subnet auto-detect so that after calling r.status.SetDegraded("SubnetDetectionError", ...) you either return early (propagating the error) or set a flag and skip the later code that resets the parent status to Ready, ensuring the degraded state from r.status.SetDegraded is not overwritten before commit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/pkg/manager/status/status_manager.go`:
- Around line 289-291: The code currently skips appending a Kafka condition when
s.GetKafkaCondition() returns nil, causing KafkaReady to disappear during
rollouts and remain when Kafka is actually disabled; update the logic so
GetKafkaCondition (or the call sites around StatusInProgress) never returns/uses
nil but instead returns an explicit Condition for each state (e.g.,
KafkaPending/KafkaDisabled) with the proper Type "KafkaReady", Status, Reason
and Message, and make both places referenced (the call at GetKafkaCondition()
and the similar branch at lines 397-406) append that explicit condition instead
of skipping when nil so KafkaReady is always updated correctly.
- Around line 177-180: The current branch for LokiStack/DemoLoki only lets
Failure/Degraded replace fc.Status.Integrations.Loki; update the conditional in
status_manager.go (the switch case handling LokiStack, DemoLoki) so an active
Ready backend can override an existing inactive/unknown entry: when
fc.Status.Integrations.Loki is nil or cs.Status is StatusFailure or
StatusDegraded or cs.Status is StatusReady (or otherwise considered active),
assign fc.Status.Integrations.Loki = cs.toCRDStatus() so Ready wins over
Unknown/Unused.
- Around line 521-527: The readiness check should gate on ds.Status.NumberReady
instead of ds.Status.UpdatedNumberScheduled: in the conditional that currently
compares ds.Status.UpdatedNumberScheduled < ds.Status.DesiredNumberScheduled,
change it to compare ds.Status.NumberReady < ds.Status.DesiredNumberScheduled,
update the setInProgress message to reflect NumberReady (e.g. "DaemonSet %s not
ready: %d/%d" with ds.Status.NumberReady), and keep calling
i.setDaemonSetReplicas(ds) in both branches; leave i.s.setReady(i.cpnt) when
NumberReady meets DesiredNumberScheduled.
---
Duplicate comments:
In `@internal/controller/flp/flp_controller.go`:
- Around line 200-208: The updateExporterStatuses function is marking exporters
as Ready even when the FlowCollector is paused; modify updateExporterStatuses
(which iterates fc.Spec.Exporters and calls r.mgr.Status.SetExporterStatus) to
first check the FlowCollector pause/hold flag on fc.Spec (e.g., fc.Spec.Paused
or fc.Spec.OnHold) and if the collector is paused, either skip setting Ready or
set an appropriate non-ready status/message (e.g., "OnHold"/"Paused" with a
non-ready status) instead of "Configured"/StatusReady so the exporter status
reflects the paused state.
- Around line 145-146: The subnet auto-detect error calls
r.status.SetDegraded(...) but the outer success path later clears that by
setting the parent back to Ready; change the control flow in the function that
performs subnet auto-detect so that after calling
r.status.SetDegraded("SubnetDetectionError", ...) you either return early
(propagating the error) or set a flag and skip the later code that resets the
parent status to Ready, ensuring the degraded state from r.status.SetDegraded is
not overwritten before commit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2a6e2a1d-c89f-45de-af75-2e987e8f4a08
📒 Files selected for processing (4)
- internal/controller/flowcollector_controller.go
- internal/controller/flp/flp_controller.go
- internal/pkg/manager/status/status_manager.go
- internal/pkg/manager/status/status_manager_test.go
New images:

- quay.io/netobserv/network-observability-operator:fdc1d3b
- quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-fdc1d3b
- quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-fdc1d3b

They will expire in two weeks.

To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:fdc1d3b make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-fdc1d3b
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-fdc1d3b
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
Actionable comments posted: 1
♻️ Duplicate comments (3)
internal/pkg/manager/status/status_manager.go (3)
289-291: ⚠️ Potential issue | 🟠 Major

Stale KafkaReady condition persists when Kafka is disabled or transformer is rolling out.

When `GetKafkaCondition()` returns `nil` (Kafka disabled or transformer in `StatusInProgress`), no condition is appended and the previous `KafkaReady` condition stays on the CRD.

Fix

```diff
-if kafkaCond := s.GetKafkaCondition(); kafkaCond != nil {
-	conditions = append(conditions, *kafkaCond)
-}
+if kafkaCond := s.GetKafkaCondition(); kafkaCond != nil {
+	conditions = append(conditions, *kafkaCond)
+} else {
+	meta.RemoveStatusCondition(&fc.Status.Conditions, "KafkaReady")
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/status_manager.go` around lines 289 - 291, GetKafkaCondition() can return nil for Kafka-disabled or in-progress transformer states, but the current logic only appends when non-nil so an old KafkaReady condition remains; update the condition assembly to explicitly remove any existing KafkaReady condition from the conditions slice when kafkaCond == nil (i.e., after calling GetKafkaCondition()), so that KafkaReady is cleared when Kafka is disabled or transformer is rolling out; locate the code that builds the conditions slice (reference variables/functions: GetKafkaCondition(), the local variable conditions, and the KafkaReady condition type/name) and ensure you either filter out KafkaReady when kafkaCond is nil or replace the previous condition with a nil/removed entry before persisting the status.
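For reference, `meta.RemoveStatusCondition` from k8s.io/apimachinery essentially filters the slice by condition type. A stdlib-only sketch of the same idea (local `condition` type assumed, not the real `metav1.Condition`):

```go
package main

import "fmt"

// condition is a minimal stand-in for metav1.Condition.
type condition struct{ Type, Status string }

// removeCondition drops every condition of the given type, mirroring the
// behavior of meta.RemoveStatusCondition (which mutates the slice in place
// through a pointer).
func removeCondition(conds []condition, condType string) []condition {
	out := conds[:0]
	for _, c := range conds {
		if c.Type != condType {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	conds := []condition{{"Ready", "True"}, {"KafkaReady", "True"}}
	conds = removeCondition(conds, "KafkaReady")
	fmt.Println(len(conds), conds[0].Type)
}
```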
521-527: ⚠️ Potential issue | 🟠 Major

DaemonSet readiness gated on scheduled pods, not ready pods.

`UpdatedNumberScheduled` can reach the desired count before pods pass readiness probes. Component marked `Ready` while `NumberReady < DesiredNumberScheduled`.

Fix

```diff
-} else if ds.Status.UpdatedNumberScheduled < ds.Status.DesiredNumberScheduled {
-	i.s.setInProgress(i.cpnt, "DaemonSetNotReady", fmt.Sprintf("DaemonSet %s not ready: %d/%d", ds.Name, ds.Status.UpdatedNumberScheduled, ds.Status.DesiredNumberScheduled))
+} else if ds.Status.NumberReady < ds.Status.DesiredNumberScheduled || ds.Status.NumberUnavailable > 0 {
+	i.s.setInProgress(i.cpnt, "DaemonSetNotReady", fmt.Sprintf("DaemonSet %s not ready: %d/%d", ds.Name, ds.Status.NumberReady, ds.Status.DesiredNumberScheduled))
 	i.setDaemonSetReplicas(ds)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/status_manager.go` around lines 521 - 527, The readiness check is using ds.Status.UpdatedNumberScheduled which can be met before pods pass readiness probes; change the gating to use ds.Status.NumberReady instead: in the block referencing i.s.setInProgress(i.cpnt, "DaemonSetNotReady", ...), replace the condition ds.Status.UpdatedNumberScheduled < ds.Status.DesiredNumberScheduled with ds.Status.NumberReady < ds.Status.DesiredNumberScheduled so the component only becomes Ready via i.s.setReady(i.cpnt) when NumberReady >= DesiredNumberScheduled; keep the calls to i.setDaemonSetReplicas(ds) and preserve the existing log message but include NumberReady vs DesiredNumberScheduled in the message.
177-180: ⚠️ Potential issue | 🟠 Major

Ready Loki backend cannot override Unknown/Unused.

This branch only lets `Failure` or `Degraded` replace an existing Loki status. If `DemoLoki` sets `Unknown` or `Unused` first, a later active `LokiStack` with `Ready` state never wins.

Fix

```diff
 case LokiStack, DemoLoki:
-	if fc.Status.Integrations.Loki == nil || cs.Status == StatusFailure || cs.Status == StatusDegraded {
+	if fc.Status.Integrations.Loki == nil ||
+		fc.Status.Integrations.Loki.State == string(StatusUnknown) ||
+		fc.Status.Integrations.Loki.State == string(StatusUnused) ||
+		cs.Status == StatusFailure || cs.Status == StatusDegraded ||
+		cs.Status == StatusReady {
 		fc.Status.Integrations.Loki = cs.toCRDStatus()
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/status_manager.go` around lines 177 - 180, The branch for LokiStack/DemoLoki currently only replaces fc.Status.Integrations.Loki when cs.Status is Failure or Degraded, so a later Ready cannot override an existing Unknown/Unused; update the condition in the LokiStack/DemoLoki case to also allow cs.Status == StatusReady to replace an existing integration status when fc.Status.Integrations.Loki is nil OR when the existing fc.Status.Integrations.Loki.Status is StatusUnknown or StatusUnused (in addition to the existing checks for StatusFailure/StatusDegraded), then assign fc.Status.Integrations.Loki = cs.toCRDStatus() as before.
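The precedence rule behind that Loki fix can be isolated into a small decision function. This is a stdlib sketch with made-up names (`lokiState`, `shouldReplace`), not the operator's code: a missing entry is always filled, Failure/Degraded always win, and Ready wins only over the inactive Unknown/Unused states:

```go
package main

import "fmt"

type lokiState string

const (
	lokiUnknown  lokiState = "Unknown"
	lokiUnused   lokiState = "Unused"
	lokiReady    lokiState = "Ready"
	lokiDegraded lokiState = "Degraded"
	lokiFailure  lokiState = "Failure"
)

// shouldReplace decides whether an incoming backend state may overwrite
// the currently recorded Loki integration state.
func shouldReplace(existing *lokiState, incoming lokiState) bool {
	if existing == nil {
		return true // nothing recorded yet
	}
	switch incoming {
	case lokiFailure, lokiDegraded:
		return true // problems always surface
	case lokiReady:
		// An active Ready backend overrides inactive placeholders only.
		return *existing == lokiUnknown || *existing == lokiUnused
	}
	return false
}

func main() {
	unused := lokiUnused
	fmt.Println(shouldReplace(&unused, lokiReady)) // Ready overrides Unused
	ready := lokiReady
	fmt.Println(shouldReplace(&ready, lokiUnknown)) // Unknown never downgrades Ready
}
```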
🧹 Nitpick comments (2)
internal/pkg/manager/status/status_manager_test.go (2)
547-646: Missing test for transformer in StatusInProgress.

Tests cover Ready, Failure, and pod health scenarios but not the case where transformer is `StatusInProgress` (e.g., during rollout). This is where `GetKafkaCondition()` returns `nil`, potentially leaving a stale `KafkaReady` condition.

Suggested test case

```go
t.Run("transformer in progress returns pending condition", func(t *testing.T) {
	s := NewManager()
	transformer := s.ForComponent(FLPTransformer)
	transformer.SetNotReady("Deploying", "rolling out")
	cond := s.GetKafkaCondition()
	// Currently returns nil - verify expected behavior.
	// If nil is intentional, document why; if not, fix GetKafkaCondition.
	_ = cond
})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/status_manager_test.go` around lines 547 - 646, Add a test case in TestKafkaCondition for the transformer being in progress: create a manager, call s.ForComponent(FLPTransformer) and transformer.SetNotReady("Deploying", "rolling out"), then call s.GetKafkaCondition() and assert the expected behavior (either expect a non-nil pending/unknown or false KafkaReady condition with a Reason/Message containing "Deploying"/"rolling out", or if nil is intended, add a test assertion and an inline comment documenting that nil is expected). Ensure the test references FLPTransformer, SetNotReady, and GetKafkaCondition so it covers the StatusInProgress code path.
455-495: Add test case for Ready Loki overriding Unknown/Unused.

The Loki tests cover failure overriding ready and unused state, but miss the scenario where a Ready backend should override an Unknown/Unused one. This gap aligns with the existing implementation issue at `populateComponentStatuses` where Ready cannot override Unknown/Unused.

Add a test like:

```go
t.Run("lokistack ready overrides demo loki unused", func(t *testing.T) {
	s := NewManager()
	demo := s.ForComponent(DemoLoki)
	ls := s.ForComponent(LokiStack)
	demo.SetUnused("disabled")
	ls.SetReady()
	fc := &flowslatest.FlowCollector{}
	s.populateComponentStatuses(fc)
	require.NotNil(t, fc.Status.Integrations.Loki)
	assert.Equal(t, "Ready", fc.Status.Integrations.Loki.State)
})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/pkg/manager/status/status_manager_test.go` around lines 455 - 495, Add a new subtest in TestPopulateLokiStatus to verify Ready overrides Unknown/Unused: create a manager via NewManager(), get components with ForComponent(DemoLoki) and ForComponent(LokiStack), set demo to unused with demo.SetUnused(...) and set the stack to ready with ls.SetReady(), call s.populateComponentStatuses(fc) on a flowslatest.FlowCollector{}, then assert fc.Status.Integrations and fc.Status.Integrations.Loki are non-nil and that fc.Status.Integrations.Loki.State == "Ready"; this ensures populateComponentStatuses correctly lets LokiStack's Ready win over DemoLoki's Unused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/pkg/manager/status/status_manager.go`:
- Around line 283-284: The event emission currently happens before the status
write completes (s.emitStateTransitionEvents(ctx, &fc) runs inside the retry
loop), causing events to be emitted even if the status update ultimately fails
or is retried; move the call to emitStateTransitionEvents so it runs only after
a successful status write (i.e., after the retry loop returns success), or add
tracking/state (e.g., a boolean or annotation on the FlowCluster object) to
ensure emitStateTransitionEvents is only executed once the persisted status
reflects the transitioned state; adjust cleanupStaleConditions(&fc) placement
accordingly if it depends on the same successful write.
---
Duplicate comments:
In `@internal/pkg/manager/status/status_manager.go`:
- Around line 289-291: GetKafkaCondition() can return nil for Kafka-disabled or
in-progress transformer states, but the current logic only appends when non-nil
so an old KafkaReady condition remains; update the condition assembly to
explicitly remove any existing KafkaReady condition from the conditions slice
when kafkaCond == nil (i.e., after calling GetKafkaCondition()), so that
KafkaReady is cleared when Kafka is disabled or transformer is rolling out;
locate the code that builds the conditions slice (reference variables/functions:
GetKafkaCondition(), the local variable conditions, and the KafkaReady condition
type/name) and ensure you either filter out KafkaReady when kafkaCond is nil or
replace the previous condition with a nil/removed entry before persisting the
status.
- Around line 521-527: The readiness check is using
ds.Status.UpdatedNumberScheduled which can be met before pods pass readiness
probes; change the gating to use ds.Status.NumberReady instead: in the block
referencing i.s.setInProgress(i.cpnt, "DaemonSetNotReady", ...), replace the
condition ds.Status.UpdatedNumberScheduled < ds.Status.DesiredNumberScheduled
with ds.Status.NumberReady < ds.Status.DesiredNumberScheduled so the component
only becomes Ready via i.s.setReady(i.cpnt) when NumberReady >=
DesiredNumberScheduled; keep the calls to i.setDaemonSetReplicas(ds) and
preserve the existing log message but include NumberReady vs
DesiredNumberScheduled in the message.
- Around line 177-180: The branch for LokiStack/DemoLoki currently only replaces
fc.Status.Integrations.Loki when cs.Status is Failure or Degraded, so a later
Ready cannot override an existing Unknown/Unused; update the condition in the
LokiStack/DemoLoki case to also allow cs.Status == StatusReady to replace an
existing integration status when fc.Status.Integrations.Loki is nil OR when the
existing fc.Status.Integrations.Loki.Status is StatusUnknown or StatusUnused (in
addition to the existing checks for StatusFailure/StatusDegraded), then assign
fc.Status.Integrations.Loki = cs.toCRDStatus() as before.
---
Nitpick comments:
In `@internal/pkg/manager/status/status_manager_test.go`:
- Around line 547-646: Add a test case in TestKafkaCondition for the transformer
being in progress: create a manager, call s.ForComponent(FLPTransformer) and
transformer.SetNotReady("Deploying", "rolling out"), then call
s.GetKafkaCondition() and assert the expected behavior (either expect a non-nil
pending/unknown or false KafkaReady condition with a Reason/Message containing
"Deploying"/"rolling out", or if nil is intended, add a test assertion and an
inline comment documenting that nil is expected). Ensure the test references
FLPTransformer, SetNotReady, and GetKafkaCondition so it covers the
StatusInProgress code path.
- Around line 455-495: Add a new subtest in TestPopulateLokiStatus to verify
Ready overrides Unknown/Unused: create a manager via NewManager(), get
components with ForComponent(DemoLoki) and ForComponent(LokiStack), set demo to
unused with demo.SetUnused(...) and set the stack to ready with ls.SetReady(),
call s.populateComponentStatuses(fc) on a flowslatest.FlowCollector{}, then
assert fc.Status.Integrations and fc.Status.Integrations.Loki are non-nil and
that fc.Status.Integrations.Loki.State == "Ready"; this ensures
populateComponentStatuses correctly lets LokiStack's Ready win over DemoLoki's
Unused.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1b710142-8861-46d9-ad2f-6161ce579b39
📒 Files selected for processing (2)
- internal/pkg/manager/status/status_manager.go
- internal/pkg/manager/status/status_manager_test.go
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #2610      +/-   ##
==========================================
+ Coverage   72.23%   72.33%   +0.10%
==========================================
  Files         105      107       +2
  Lines       10903    11352     +449
==========================================
+ Hits         7876     8212     +336
- Misses       2544     2655     +111
- Partials      483      485       +2
```

Flags with carried forward coverage won't be shown.
Rebased and addressed feedback. Thanks @OlivierCazade!
New images:

- quay.io/netobserv/network-observability-operator:cfef3f9
- quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-cfef3f9
- quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-cfef3f9

They will expire in two weeks.

To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:cfef3f9 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-cfef3f9
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-cfef3f9
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
```go
// conditionType maps a ComponentName to a Kubernetes condition type.
// Uses "<Component>Ready" naming convention (positive polarity).
var conditionTypeMap = map[ComponentName]string{
```
Actually, I'm not sure I see the whole point of this map: is it useful to decouple the name from the condition type? (The name is only used for that.) It seems to add verbosity for no value.
One more thing, iirc there was a reason to have "WaitingXYZ" instead of "XYZReady": it's because, apart from the main "Ready" condition, the OpenShift console was considering all conditions as being negative when true, which drives the displayed status. So that worked when "WaitingXYZ=true" => issue => displayed as an issue by the console
versus
"XYZReady=true" => no issue => but displayed as an issue by the console
Not sure if something has changed in the console, and it's now nicely displayed?
Indeed, I forgot about that point since we override the status logic in the console plugin.
The OCP Console approach remains and forces us to keep it that way.
I'm rolling back to the original behavior and removing the map.
```go
		i.s.setInProgress(i.cpnt, "DeploymentNotCreated", "Deployment not created")
		return
	}
	i.setDeploymentReplicas(d)
```
I was wondering why `i.setDeploymentReplicas(d)` is called plenty of times in this function, but I guess that's because every time the status is modified, it's overridden and must be set again?
Could this be cleaner with a new function that does the status update, updating the replicas count only once after that?
Or just using a `defer`?
defer seems to be the right approach here
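The defer pattern suggested above could look like this. The types are simplified stand-ins for the reconciler's status helper, not the actual operator code:

```go
package main

import "fmt"

// deployment mirrors the replica counters we care about.
type deployment struct{ Ready, Desired int32 }

// checker is a simplified stand-in for the reconciler's status helper.
type checker struct {
	status   string
	replicas string
}

func (c *checker) setDeploymentReplicas(d *deployment) {
	c.replicas = fmt.Sprintf("%d/%d", d.Ready, d.Desired)
}

// checkDeployment may update the status on several branches, but the replica
// count is recorded exactly once, on the way out, thanks to defer.
func (c *checker) checkDeployment(d *deployment) {
	defer c.setDeploymentReplicas(d)
	if d.Desired == 0 {
		c.status = "InProgress"
		return
	}
	if d.Ready < d.Desired {
		c.status = "InProgress"
		return
	}
	c.status = "Ready"
}

func main() {
	c := &checker{}
	c.checkDeployment(&deployment{Ready: 2, Desired: 3})
	fmt.Println(c.status, c.replicas) // InProgress 2/3
}
```

This avoids repeating setDeploymentReplicas after every status write, since the deferred call runs after the final status is decided.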
i.s.setInProgress(i.cpnt, "DaemonSetNotCreated", "DaemonSet not created")
} else if ds.Status.UpdatedNumberScheduled < ds.Status.DesiredNumberScheduled {
i.s.setInProgress(i.cpnt, "DaemonSetNotReady", fmt.Sprintf("DaemonSet %s not ready: %d/%d", ds.Name, ds.Status.UpdatedNumberScheduled, ds.Status.DesiredNumberScheduled))
} else if ds.Status.NumberReady < ds.Status.DesiredNumberScheduled {
I will blindly trust you regarding the status fields to check :-)
I thought we had finally found something stable and consistent for checking progress, but it looks like that wasn't the case after all? I'll just cross my fingers here.
Those numbers make me crazy, as it's hard to understand all the scenarios:
- UpdatedNumberScheduled: counts nodes running the latest pod spec (tracks rollout progress)
- NumberReady: counts nodes with a pod that passes readiness probes (tracks actual health)
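The two counters can be combined as in this simplified classifier. The field names match appsv1.DaemonSetStatus, but the function is an illustration of the branching above, not the exact operator logic:

```go
package main

import "fmt"

// dsStatus mirrors the appsv1.DaemonSetStatus fields discussed above.
type dsStatus struct {
	DesiredNumberScheduled int32 // nodes that should run the pod
	UpdatedNumberScheduled int32 // nodes running the latest pod spec (rollout progress)
	NumberReady            int32 // nodes whose pod passes readiness probes (actual health)
}

// classify distinguishes an unfinished rollout (some nodes still on the old
// spec) from updated-but-unhealthy pods (new spec scheduled, readiness failing).
func classify(s dsStatus) string {
	switch {
	case s.UpdatedNumberScheduled < s.DesiredNumberScheduled:
		return "InProgress: rollout"
	case s.NumberReady < s.DesiredNumberScheduled:
		return "InProgress: readiness"
	default:
		return "Ready"
	}
}

func main() {
	fmt.Println(classify(dsStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 2, NumberReady: 2})) // InProgress: rollout
	fmt.Println(classify(dsStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 2})) // InProgress: readiness
	fmt.Println(classify(dsStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 3})) // Ready
}
```

Checking UpdatedNumberScheduled first means a rollout in progress is reported as such even when some old-spec pods are still passing readiness.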
rlog := log.FromContext(ctx)

podList := corev1.PodList{}
if err := c.List(ctx, &podList, &client.ListOptions{
We need to monitor whether that increases the memory usage of the operator. Pods are not currently covered in the narrowcache package, so the risk is to end up caching all cluster pods when listing.
Indeed. However, the call only happens when a workload reports unhealthy replicas, so that should not happen all the time.
That said, we can improve the cache to only list what's owned 👌
jotak
left a comment
I've got some comments but it's a really nice improvement!
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing
Addressed feedback. Thanks @jotak
/ok-to-test |
Description
Improve FlowCollector status
Assisted-by: claude-4.6-opus-high

Dependencies
n/a
Checklist
Summary by CodeRabbit
New Features
Documentation
Chores