getReleaseState may sometimes cause an unwanted rollback #227

@kovayur

Problem

If I have a CR in a chart and remove its definition from the cluster, it may result in a broken operator state:

  1. If I have a single revision, the operator constantly prints the error:
     rollback failed: release: not found: original upgrade error: unable to build kubernetes objects from current release manifest: [resource mapping not found for name: "stackrox-central" namespace: "" from "": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "stackrox-scanner" namespace: "" from "": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1" ensure CRDs are installed first]
  2. If I have more than one revision and the previous revision does NOT contain a CR, I get the endless reconcile loop described in "Failed upgrade may lead to an endless loop of rollbacks" (#224).
  3. If I have more than one revision and the previous revision also contains a CR, the rollback fails and the release gets stuck in the pending-rollback state. The change "Allow marking releases stuck in a pending state as failed" (#116) recovers the release.

We discovered this issue on clusters that have been upgraded to Kubernetes 1.25 and have had their PSPs (PodSecurityPolicy) removed. However, it applies to any CRD.

Root cause

getReleaseState calls actionClient.Upgrade with the DryRun flag. That function tries to infer whether the release was changed in storage from the return value of Upgrade.Run. The comment below suggests that a non-nil returned release is taken to mean the release was recorded in the release store, but with DryRun that apparently does not hold (at least with Helm v3.12.1): a failed dry run can still return a non-nil release.

// As of Helm 2.13, if Upgrade returns a non-nil release, that
// means the release was also recorded in the release store.
// Therefore, we should perform the rollback when we have a non-nil
// release. Any rollback error here would be unexpected, so always
// log both the update and rollback errors.

Thus, when the dry-run upgrade fails, the action client performs a non-dry-run rollback to the previous revision.

From the Helm upgrade source code:
https://github.com/helm/helm/blob/main/pkg/action/upgrade.go#L293-L298
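
The root cause described above can be illustrated with a small self-contained sketch. It does not use the real Helm or helm-operator-plugins APIs: `release`, `failingUpgrade`, `naiveShouldRollback`, and `guardedShouldRollback` are hypothetical names introduced only for this illustration. It assumes, as reported in this issue, that a failed dry-run upgrade can return a non-nil release together with an error. The guarded variant is one possible way to avoid the unwanted rollback, not necessarily the fix the maintainers would choose.

```go
package main

import (
	"errors"
	"fmt"
)

// release is a hypothetical stand-in for Helm's release object.
type release struct {
	Name    string
	Version int
}

// failingUpgrade is a hypothetical stand-in for Upgrade.Run. It mimics the
// behavior reported in this issue: the upgrade fails while building objects
// from the current release manifest, yet a non-nil release is still
// returned, regardless of the dryRun flag.
func failingUpgrade(dryRun bool) (*release, error) {
	return &release{Name: "example", Version: 2},
		errors.New("unable to build kubernetes objects from current release manifest")
}

// naiveShouldRollback follows the reasoning in the quoted comment:
// a non-nil release is assumed to have been recorded in the release store,
// so any failure with a non-nil release triggers a rollback.
func naiveShouldRollback(rel *release, err error) bool {
	return err != nil && rel != nil
}

// guardedShouldRollback additionally skips the rollback for dry runs,
// on the assumption that a dry run does not write to the release store.
func guardedShouldRollback(rel *release, err error, dryRun bool) bool {
	return !dryRun && naiveShouldRollback(rel, err)
}

func main() {
	const dryRun = true
	rel, err := failingUpgrade(dryRun)

	fmt.Println("naive rollback decision:  ", naiveShouldRollback(rel, err))
	fmt.Println("guarded rollback decision:", guardedShouldRollback(rel, err, dryRun))
}
```

With the dry-run flag set, the naive check decides to roll back while the guarded check does not, which matches the unwanted non-dry-run rollback described above.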
