Skip to content

Fix TestKubeletConfigRemediation unstable issue#1080

Open
Anna-Koudelkova wants to merge 1 commit intoComplianceAsCode:masterfrom
Anna-Koudelkova:ci_cleanup
Open

Fix TestKubeletConfigRemediation unstable issue#1080
Anna-Koudelkova wants to merge 1 commit intoComplianceAsCode:masterfrom
Anna-Koudelkova:ci_cleanup

Conversation

@Anna-Koudelkova
Copy link
Collaborator

@Anna-Koudelkova Anna-Koudelkova commented Feb 6, 2026

Note:
As disscussed on Friday with the dev team and today with Xiaojie, we do not need the specific image the test case was using because we now ship these rules as part of the ocp4 profile bundle. Also, based on Xiaojie's suggestion I have replaced the overly complicated defer logic for unapplying the remediation with an implementation of already existing functions. Thanks for the feedback!

Changes in this PR:

  1. Add cleanup to test case TestKubeletConfigRemediation - the test case itself is not failing much in our serial job run, but the subsequent test, TestSuspendScanSetting, is failing quite often. I think the root cause lies in TestKubeletConfigRemediation, that applied remediation and did not unapply it as well as it did not wait for the scan and suite to be deleted.

To fix that, I have added func WaitForComplianceSuiteDeletion() to common.go and make minor modification in the test itself.

I hope this works, TestKubeletConfigRemediation is problematic, Xiaojie is trying a different approach in PR #1055 and in some of the PRs any few of my local tries, the test case fails on the result of the suite being NOT-APPLICABLE, which is a whole different story.

  1. When running test case in my local cluster, often after PASSing runs my environment ended up with one worker node degraded. This is apparently caused by removing the e2e label from the worker node and immediately deleting the e2e machineconfigpool before the node fully reboots as part of worker node pool. I have added a func WaitForMachineConfigPoolUpdated() into common.go and call for this function into the main_entry.go file to the teardown phase.
    This is not important for our CI runs, but it greatly improves quality of my experience when I run these tests locally and saves my sanity.

At least I did not harm to the TestKubeletConfigRemediation:

2026/02/06 15:55:25 All scans in ComplianceSuite have finished (kubelet-remediation-test-suite-node)
2026/02/06 15:55:25 scan re-run has finished
--- PASS: TestKubeletConfigRemediation (373.79s)
PASS
...
ok  	github.com/ComplianceAsCode/compliance-operator/tests/e2e/serial	658.249s

@openshift-ci
Copy link

openshift-ci bot commented Feb 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Anna-Koudelkova

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Feb 6, 2026
@github-actions
Copy link

github-actions bot commented Feb 6, 2026

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:1080-00de17c89de971cd81bec1ccef51093501797691

@github-actions
Copy link

github-actions bot commented Feb 9, 2026

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:1080-3beaa2c3dc1b014fd6da7e9f74db1c28d829a6e7

Copy link
Collaborator

@rhmdnd rhmdnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment about using t.Fatal versus t.Log (primarily noting an inconsistency and wanting to make sure I understand it).

Otherwise, this looks like a much more stable approach to the serial tests.

}
}

log.Printf("Machine Config Pool %s has not finished updating yet... retrying\n", poolName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor nit: could add the c.Status here to see that in the log as it rolls through the update (kinda like we do when waiting for scan statuses).

return false, nil
})
if err != nil {
return fmt.Errorf("ComplianceSuite %s may not have been fully cleaned up: %w", name, err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this output the length of time we waited for the suite to cleanup? That might be good information to log if we're not already through the err.

// Finally clean up by removing the remediation and waiting for the nodes to reboot one more time
err = f.UnApplyRemediationAndCheck(f.OperatorNamespace, remName, "worker")
if err != nil {
t.Fatal(err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above - we're using t.Log instead of t.Fatal. Do we want those to be consistent? Or was there a reason why you wanted to use t.Fatal here and not above?

For context - I think t.Fatal is the correct usage, since it will fail the test if it cannot cleanup after itself.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no reason for this - it is caused by me touching an already existing testcase while being assisted by Cursor. I have fixed it to t.Fatal so it is consistent

}()

// Ensure that all the scans in the suite have finished and are marked as Done
err = f.WaitForSuiteScansStatus(f.OperatorNamespace, suiteName, compv1alpha1.PhaseDone, compv1alpha1.ResultNonCompliant)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually came back as Compliant which is what's failing the serial CI. We might need to either use the variadic version of this wait function, or find some other rules that will fail (which might be necessary if we're testing remediations).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not found any kubelet rules that are not passing by default now (it is possible I have not tried all the options). The only way how can I get the test case to pass now in my local cluster is by reverting back to using the brokenContentImagePath and creating a profilebundle. That is why I have uploaded this version here, so I can see what is going to happen in the CI now..
I am frankly quite confused by this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions
Copy link

github-actions bot commented Mar 9, 2026

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:1080-9ad1b1d737a8cbcb055877a2db32e094832b5fb9

@Anna-Koudelkova Anna-Koudelkova force-pushed the ci_cleanup branch 2 times, most recently from dfc0fb2 to 521e127 Compare March 10, 2026 18:32
@github-actions
Copy link

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:1080-dfc0fb2c78500c9773bcf2a2ce801d134d18a584

@github-actions
Copy link

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:1080-521e12770fc51e293ab24dd1c599c436ca543b3d

@github-actions
Copy link

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:1080-88210258ec7fe770c60de3bed358d3c650ce2b4e

@github-actions
Copy link

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:1080-c4d183458a50ef93c2a5cd43aad56fee4a27ac22

@openshift-ci
Copy link

openshift-ci bot commented Mar 12, 2026

@Anna-Koudelkova: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-rosa c4d1834 link true /test e2e-rosa
ci/prow/e2e-aws-serial c4d1834 link true /test e2e-aws-serial
ci/prow/e2e-aws-serial-arm c4d1834 link true /test e2e-aws-serial-arm

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants