Group snapshot not taken on conflict #1280

@jsafrane

What happened:

I do not know exactly how it happened, but I got into a situation where the snapshot controller was processing a VolumeGroupSnapshot and the corresponding VolumeSnapshot already existed. The controller reacts to 409 Already Exists here:

    createdVolumeSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(volumeSnapshotNamespace).Create(ctx, volumeSnapshot, metav1.CreateOptions{})
    if err != nil && !apierrs.IsAlreadyExists(err) {
        return groupSnapshotContent, fmt.Errorf(
            "createSnapshotsForGroupSnapshotContent: creating volumesnapshot %w", err)
    }

You can see that the code continues processing when the snapshot already exists. However, in that case the createdVolumeSnapshot variable has undefined content (in my case it points to an object with an empty namespace and name), and later, when the controller tries to update it, Patch() fails here:

_, err = utils.PatchVolumeSnapshot(createdVolumeSnapshot, []utils.PatchOp{

Error:
E0310 21:18:31.250917 1 groupsnapshot_controller_helper.go:257] could not sync group snapshot "e2e-volumegroupsnapshottable-8755/group-snapshot-z6w49": createSnapshotsForGroupSnapshotContent: binding volumesnapshot to volumesnapshotcontent resource name may not be empty
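
The failure mode can be reproduced in isolation with a minimal, stdlib-only sketch. Everything in it is a hypothetical stand-in, not the controller's real code: `fakeClient`, `errAlreadyExists`, `patch`, and `createAndBind` are invented for illustration, and the fake `Create` is assumed to return a pointer to a zero-valued object on failure, which matches what was observed in this report.

```go
package main

import (
	"errors"
	"fmt"
)

// VolumeSnapshot is a stand-in for the real API type; only the fields
// needed for the illustration are modeled.
type VolumeSnapshot struct {
	Namespace string
	Name      string
}

// errAlreadyExists is a hypothetical stand-in for an error that
// apierrs.IsAlreadyExists would recognize.
var errAlreadyExists = errors.New("already exists")

// fakeClient imitates a typed client (assumption): on failure, Create still
// returns a non-nil pointer, but to a zero-valued object.
type fakeClient struct {
	existing map[string]bool
}

func (c *fakeClient) Create(vs *VolumeSnapshot) (*VolumeSnapshot, error) {
	key := vs.Namespace + "/" + vs.Name
	if c.existing[key] {
		return &VolumeSnapshot{}, errAlreadyExists
	}
	c.existing[key] = true
	created := *vs
	return &created, nil
}

// patch mimics the name validation that rejected the request in the log above.
func patch(vs *VolumeSnapshot) error {
	if vs.Name == "" {
		return errors.New("resource name may not be empty")
	}
	return nil
}

// createAndBind mirrors the control flow quoted above: AlreadyExists is
// tolerated, and the object returned by Create is used regardless.
func createAndBind(c *fakeClient, vs *VolumeSnapshot) error {
	created, err := c.Create(vs)
	if err != nil && !errors.Is(err, errAlreadyExists) {
		return fmt.Errorf("creating volumesnapshot %w", err)
	}
	if err := patch(created); err != nil {
		return fmt.Errorf("binding volumesnapshot to volumesnapshotcontent %w", err)
	}
	return nil
}

func main() {
	// The snapshot already exists on the "server" before the controller runs.
	c := &fakeClient{existing: map[string]bool{"ns/snap-1": true}}
	err := createAndBind(c, &VolumeSnapshot{Namespace: "ns", Name: "snap-1"})
	fmt.Println(err)
}
```

With the snapshot pre-existing, the sketch ends in the same "resource name may not be empty" error, because the patch step receives the zero-valued object rather than the existing snapshot.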

What you expected to happen:
The controller should perhaps Get() the VolumeSnapshot from its informer before creating it. If Create() then still reports that the VolumeSnapshot already exists, the controller should fail the sync and expect the VolumeSnapshot to appear in the informer on the next retry.

All IsAlreadyExists checks should then be handled in a similar way, not just the one for VolumeSnapshot.
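
A rough sketch of that flow, using only the standard library, might look like the following. It is not the controller's real code: `fakeCluster`, `ensureVolumeSnapshot`, `isAlreadyExists`, and `resync` are hypothetical names, the `cache` map stands in for the informer, and the `server` map stands in for the API server.

```go
package main

import (
	"errors"
	"fmt"
)

// VolumeSnapshot is a stand-in for the real API type.
type VolumeSnapshot struct {
	Namespace string
	Name      string
}

// errAlreadyExists is a hypothetical stand-in for an AlreadyExists API error.
var errAlreadyExists = errors.New("already exists")

func isAlreadyExists(err error) bool { return errors.Is(err, errAlreadyExists) }

// fakeCluster models the API server plus an informer cache that may lag behind.
type fakeCluster struct {
	server map[string]*VolumeSnapshot
	cache  map[string]*VolumeSnapshot
}

func key(ns, name string) string { return ns + "/" + name }

// create imitates a typed client: on conflict it returns a zero-valued object.
func (f *fakeCluster) create(vs *VolumeSnapshot) (*VolumeSnapshot, error) {
	k := key(vs.Namespace, vs.Name)
	if _, ok := f.server[k]; ok {
		return &VolumeSnapshot{}, errAlreadyExists
	}
	created := *vs
	f.server[k] = &created
	return &created, nil
}

// resync copies server state into the cache, as an informer eventually would.
func (f *fakeCluster) resync() {
	for k, v := range f.server {
		f.cache[k] = v
	}
}

// ensureVolumeSnapshot checks the cache first, then creates. Any Create
// error, including AlreadyExists, fails the sync so that a later retry can
// find the object in the cache; the object returned alongside an error is
// never used.
func ensureVolumeSnapshot(f *fakeCluster, want *VolumeSnapshot) (*VolumeSnapshot, error) {
	if vs, ok := f.cache[key(want.Namespace, want.Name)]; ok {
		return vs, nil
	}
	created, err := f.create(want)
	if err != nil {
		// Naming the object makes the failure easier to trace in logs.
		return nil, fmt.Errorf("creating volumesnapshot %s/%s: %w",
			want.Namespace, want.Name, err)
	}
	return created, nil
}

func main() {
	f := &fakeCluster{
		server: map[string]*VolumeSnapshot{},
		cache:  map[string]*VolumeSnapshot{},
	}
	// The snapshot exists on the server, but the cache has not seen it yet.
	f.server["ns/snap-1"] = &VolumeSnapshot{Namespace: "ns", Name: "snap-1"}
	want := &VolumeSnapshot{Namespace: "ns", Name: "snap-1"}

	_, err := ensureVolumeSnapshot(f, want)
	fmt.Println("first attempt:", err)

	f.resync() // the informer catches up before the retry
	vs, err := ensureVolumeSnapshot(f, want)
	fmt.Println("retry:", vs.Namespace+"/"+vs.Name, err)
}
```

The design choice here is to treat AlreadyExists as a retriable error rather than a success, which trades one extra sync for never operating on an object that Create did not actually return.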

How to reproduce it:
I don't know. It happened only once in OpenShift e2e tests. I am stress-testing the test, 1024 runs so far, 0 failures.

Logs from OpenShift e2e tests:

As you can see, it would be helpful to log the names of created VolumeSnapshots, VolumeSnapshotContents, and VolumeGroupSnapshotContents. It's quite hard to map which VolumeSnapshot failed to be patched and what the corresponding VGSC is.

Anything else we need to know?:

Environment:

  • Driver version: csi-driver-hostpath + e2e tests as in Kubernetes 1.32
  • Kubernetes version (use kubectl version): 1.32-ish

Labels

lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
