Skip to content

resourcegraphdefinition: wait for CRD deletion to complete#1124

Open
a-hilaly wants to merge 1 commit intokubernetes-sigs:mainfrom
a-hilaly:crd-deletion
Open

resourcegraphdefinition: wait for CRD deletion to complete#1124
a-hilaly wants to merge 1 commit intokubernetes-sigs:mainfrom
a-hilaly:crd-deletion

Conversation

@a-hilaly
Copy link
Copy Markdown
Member

@a-hilaly a-hilaly commented Mar 9, 2026

This patch makes deletion follow the same behavioral contract that creation
already relies on. On create, the RGD controller blocks until the generated
CRD is actually ready before marking the kind ready and continuing
reconciliation. On delete, cleanup was weaker: it treated a successful
DELETE request as if the CRD were already gone.

This is intentionally a narrow behavioral correction, not a larger
lifecycle redesign. Longer term, the options we can consider include moving
this kind of waiting out of reconcile and into a requeue / event-driven
model, or revisiting whether more of this lifecycle should be expressed via
ownerReferences. If we want to go further than that, and especially if we
want explicit transitional states such as Activating and Deactivating, that
should go through KREP-level design work rather than being folded into this
fix.

That mismatch showed up in the integration suite. The test
"CRD CRD Creation [It] should delete CRD when ResourceGraphDefinition is
deleted" removes an RGD and then polls the generated CRD until it returns
NotFound. In CI we observed the controller log "Deleting CRD", proceed with
cleanup, and remove the finalizer, while the API server continued to serve
the CRD. The test eventually timed out with "Expected an error, got nil".
This change makes deletion honor the stronger contract by polling for
NotFound after CRD deletion is requested and only reporting cleanup success
once the CRD has actually disappeared. The change is covered by an RGD
controller unit test, while the existing integration spec remains the
end-to-end regression.

This patch makes deletion follow the same behavioral contract that creation
already relies on. On create, the RGD controller blocks until the generated
CRD is actually ready before marking the kind ready and continuing
reconciliation. On delete, cleanup was weaker: it treated a successful
DELETE request as if the CRD were already gone.

This is intentionally a narrow behavioral correction, not a larger
lifecycle redesign. Longer term, the options we can consider include moving
this kind of waiting out of reconcile and into a requeue / event-driven
model, or revisiting whether more of this lifecycle should be expressed via
ownerReferences. If we want to go further than that, and especially if we
want explicit transitional states such as Activating and Deactivating, that
should go through KREP-level design work rather than being folded into this
fix.

That mismatch showed up in the integration suite. The test
"CRD CRD Creation [It] should delete CRD when ResourceGraphDefinition is
deleted" removes an RGD and then polls the generated CRD until it returns
NotFound. In CI we observed the controller log "Deleting CRD", proceed with
cleanup, and remove the finalizer, while the API server continued to serve
the CRD. The test eventually timed out with "Expected an error, got nil".
This change makes deletion honor the stronger contract by polling for
NotFound after CRD deletion is requested and only reporting cleanup success
once the CRD has actually disappeared. The change is covered by an RGD
controller unit test, while the existing integration spec remains the
end-to-end regression.
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 9, 2026
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 9, 2026
}

// waitForDelete waits for a CRD to be fully removed from the API server.
func (w *CRDWrapper) waitForDelete(ctx context.Context, name string) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this block a worker from reconciling other RGDs until the CRD is deleted?
should we consider requeuing instead?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm i see it's addressed in PR description

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's just temporary hack, and i'm the first to dislike it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we ensure we delete the CRD before shutting down the controller here? https://github.com/kubernetes-sigs/kro/blob/main/pkg/controller/resourcegraphdefinition/controller_cleanup.go#L34-L50

If not, if there are any instances that have not been cleaned up, they will no longer be reconciled

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. We should, perhaps in a separate PR with proper integration tests etc...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

picking it up after this PR is merged

Copy link
Copy Markdown
Contributor

@michaelhtm michaelhtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@michaelhtm: changing LGTM is restricted to collaborators

Details

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a-hilaly, michaelhtm

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Member

@jakobmoellerdev jakobmoellerdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

please lets follow up on this with a proper fix as soon as possible.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 12, 2026
@a-hilaly
Copy link
Copy Markdown
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 12, 2026
@a-hilaly
Copy link
Copy Markdown
Member Author

/retest

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@a-hilaly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
presubmits-integration-tests 3fa0274 link true /test presubmits-integration-tests
presubmits-verify-codegen 3fa0274 link true /test presubmits-verify-codegen
presubmits-e2e-upgrade-tests 3fa0274 link true /test presubmits-e2e-upgrade-tests

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants