Fix TestKubeletConfigRemediation unstable issue#1080
Anna-Koudelkova wants to merge 1 commit into ComplianceAsCode:master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Anna-Koudelkova.
Force-pushed 00de17c to 3beaa2c
rhmdnd left a comment:
One comment about using t.Fatal versus t.Log (primarily noting an inconsistency and wanting to make sure I understand it).
Otherwise, this looks like a much more stable approach to the serial tests.
tests/e2e/framework/common.go (outdated):

> log.Printf("Machine Config Pool %s has not finished updating yet... retrying\n", poolName)
Super minor nit: could add the c.Status here to see that in the log as it rolls through the update (kinda like we do when waiting for scan statuses).
tests/e2e/framework/common.go (outdated):

>     return false, nil
> })
> if err != nil {
>     return fmt.Errorf("ComplianceSuite %s may not have been fully cleaned up: %w", name, err)
Does this output the length of time we waited for the suite to cleanup? That might be good information to log if we're not already through the err.
> // Finally clean up by removing the remediation and waiting for the nodes to reboot one more time
> err = f.UnApplyRemediationAndCheck(f.OperatorNamespace, remName, "worker")
> if err != nil {
>     t.Fatal(err)
Above - we're using t.Log instead of t.Fatal. Do we want those to be consistent? Or was there a reason why you wanted to use t.Fatal here and not above?
For context - I think t.Fatal is the correct usage, since it will fail the test if it cannot cleanup after itself.
There is no reason for this; it was caused by me touching an already existing test case while being assisted by Cursor. I have fixed it to t.Fatal so it is consistent.
> }()
>
> // Ensure that all the scans in the suite have finished and are marked as Done
> err = f.WaitForSuiteScansStatus(f.OperatorNamespace, suiteName, compv1alpha1.PhaseDone, compv1alpha1.ResultNonCompliant)
This actually came back as Compliant which is what's failing the serial CI. We might need to either use the variadic version of this wait function, or find some other rules that will fail (which might be necessary if we're testing remediations).
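The variadic version mentioned here could be built on a small membership check, sketched below. This is a stdlib-only illustration; the real helper compares compv1alpha1 result types, so plain strings are a stand-in:

```go
package main

import "fmt"

// suiteResultIn reports whether the observed suite result matches any of
// the acceptable results. A variadic wait helper built on this lets the
// test accept, say, both COMPLIANT and NON-COMPLIANT while debugging.
func suiteResultIn(got string, want ...string) bool {
	for _, w := range want {
		if got == w {
			return true
		}
	}
	return false
}

func main() {
	// The scan unexpectedly came back COMPLIANT; a variadic check would
	// let the test tolerate either outcome until the rules are fixed.
	fmt.Println(suiteResultIn("COMPLIANT", "NON-COMPLIANT", "COMPLIANT"))
}
```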
I have not found any kubelet rules that fail by default now (it is possible I have not tried all the options). The only way I can get the test case to pass in my local cluster is by reverting to the brokenContentImagePath and creating a ProfileBundle. That is why I have uploaded this version here, to see what happens in the CI now.
I am frankly quite confused by this.
We might need to adjust this by changing a rule like https://github.com/ComplianceAsCode/content/blob/master/applications/openshift/kubelet/kubelet_eviction_thresholds_set_hard_imagefs_inodesfree/rule.yml#L77 and its remediation value variable default https://github.com/ComplianceAsCode/content/blob/master/applications/openshift/kubelet/var_kubelet_evictionhard_imagefs_inodesfree.var to a fixed value, making it fail by default and making the test work in that way.
Force-pushed 3beaa2c to 9ad1b1d
Force-pushed dfc0fb2 to 521e127
Force-pushed 521e127 to 8821025
Force-pushed 8821025 to c4d1834
@Anna-Koudelkova: The following tests failed.
Note:
As discussed on Friday with the dev team and today with Xiaojie, we do not need the specific image the test case was using, because we now ship these rules as part of the ocp4 profile bundle. Also, based on Xiaojie's suggestion, I have replaced the overly complicated defer logic for unapplying the remediation with calls to already existing functions. Thanks for the feedback!
Changes in this PR:

- To fix the unstable cleanup, I have added func WaitForComplianceSuiteDeletion() to common.go and made a minor modification to the test itself. I hope this works; TestKubeletConfigRemediation is problematic. Xiaojie is trying a different approach in PR #1055, and in some of the PRs and a few of my local tries the test case fails with the suite result being NOT-APPLICABLE, which is a whole different story.
- The teardown was removing the e2e label from the worker node and immediately deleting the e2e machineconfigpool before the node fully reboots as part of the worker node pool. I have added a func WaitForMachineConfigPoolUpdated() to common.go and call this function from main_entry.go in the teardown phase. This is not important for our CI runs, but it greatly improves the quality of my experience when I run these tests locally and saves my sanity.
At least I did no harm to TestKubeletConfigRemediation: