Problem
If I have a CR in a chart and remove its definition from the cluster, it may result in a broken operator state:
- If I have a single revision, the operator constantly prints the error:
rollback failed: release: not found: original upgrade error: unable to build kubernetes objects from current release manifest: [resource mapping not found for name: "stackrox-central" namespace: "" from "": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "stackrox-scanner" namespace: "" from "": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1" ensure CRDs are installed first]
- If I have more than 1 revision and the previous revision does NOT contain a CR, I get the endless reconcile loop described in "Failed upgrade may lead to an endless loop of rollbacks" (#224).
- If I have more than 1 revision and the previous revision also contains a CR, then the rollback fails and the release gets stuck in the pending-rollback state. The change "Allow marking releases stuck in a pending state as failed" (#116) recovers the release.
We discovered this issue on clusters that had been upgraded to Kubernetes 1.25 and had the PodSecurityPolicies (PSPs) removed. However, the same applies to any CRD that is removed from the cluster.
Root cause
getReleaseState calls actionClient.Upgrade with the DryRun flag. This function tries to infer whether the release was changed in storage based on the return value of Upgrade.Run. Judging by the comment, the code assumes that a non-nil returned release means the release was recorded in the release store, but apparently that is not the case with DryRun (at least with Helm v3.12.1):
helm-operator-plugins/pkg/client/actionclient.go
Lines 237 to 241 in a775742
// As of Helm 2.13, if Upgrade returns a non-nil release, that
// means the release was also recorded in the release store.
// Therefore, we should perform the rollback when we have a non-nil
// release. Any rollback error here would be unexpected, so always
// log both the update and rollback errors.
Thus, when the dry-run upgrade fails, the action client performs a non-dry-run rollback to the previous revision.
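To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern described above. It does not use the real Helm or helm-operator-plugins APIs: Release, upgradeFunc, rollbackFunc, failingUpgrade, upgradeOrRollback and upgradeOrGuardedRollback are all hypothetical names invented for illustration. The first function rolls back whenever a failed upgrade returns a non-nil release; the second shows one possible guard (an assumption, not necessarily the fix the project would choose) that skips the rollback for dry runs.

```go
package main

import (
	"errors"
	"fmt"
)

// Release is a hypothetical stand-in for a Helm release object.
type Release struct {
	Name    string
	Version int
}

// upgradeFunc and rollbackFunc are hypothetical stand-ins for the
// upgrade and rollback actions; they are not the real Helm signatures.
type upgradeFunc func(dryRun bool) (*Release, error)
type rollbackFunc func(name string) error

// failingUpgrade simulates the behavior reported in this issue: the
// upgrade fails, yet still returns a non-nil release, even for a dry run.
func failingUpgrade(dryRun bool) (*Release, error) {
	return &Release{Name: "demo", Version: 2},
		errors.New("unable to build kubernetes objects from current release manifest")
}

// upgradeOrRollback mirrors the described logic: a non-nil release on
// error is taken to mean "recorded in storage", so a rollback is issued.
func upgradeOrRollback(dryRun bool, upgrade upgradeFunc, rollback rollbackFunc) (*Release, error) {
	rel, err := upgrade(dryRun)
	if err == nil {
		return rel, nil
	}
	if rel != nil {
		if rbErr := rollback(rel.Name); rbErr != nil {
			return nil, fmt.Errorf("rollback failed: %v: original upgrade error: %w", rbErr, err)
		}
	}
	return nil, err
}

// upgradeOrGuardedRollback is the same, except that it never rolls back
// after a dry run, because a dry run should not have changed storage.
func upgradeOrGuardedRollback(dryRun bool, upgrade upgradeFunc, rollback rollbackFunc) (*Release, error) {
	rel, err := upgrade(dryRun)
	if err == nil {
		return rel, nil
	}
	if rel != nil && !dryRun {
		if rbErr := rollback(rel.Name); rbErr != nil {
			return nil, fmt.Errorf("rollback failed: %v: original upgrade error: %w", rbErr, err)
		}
	}
	return nil, err
}

// countRollbacks runs a failing dry-run upgrade through the given
// strategy and reports how many rollbacks it triggered.
func countRollbacks(run func(bool, upgradeFunc, rollbackFunc) (*Release, error)) int {
	calls := 0
	rollback := func(string) error { calls++; return nil }
	_, _ = run(true, failingUpgrade, rollback)
	return calls
}

func main() {
	fmt.Println("rollbacks without guard:", countRollbacks(upgradeOrRollback))
	fmt.Println("rollbacks with guard:   ", countRollbacks(upgradeOrGuardedRollback))
}
```

In this model the unguarded variant issues a real rollback for a failed dry run, which matches the symptom reported here, while the guarded variant only returns the original upgrade error.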
From the Helm upgrade source code:
https://github.com/helm/helm/blob/main/pkg/action/upgrade.go#L293-L298