resourcegraphdefinition: wait for CRD deletion to complete #1124
a-hilaly wants to merge 1 commit into kubernetes-sigs:main from
Conversation
This patch makes deletion follow the same behavioral contract that creation already relies on. On create, the RGD controller blocks until the generated CRD is actually ready before marking the kind ready and continuing reconciliation. On delete, cleanup was weaker: it treated a successful DELETE request as if the CRD were already gone.

This is intentionally a narrow behavioral correction, not a larger lifecycle redesign. Longer term, the options we can consider include moving this kind of waiting out of reconcile and into a requeue / event-driven model, or revisiting whether more of this lifecycle should be expressed via ownerReferences. If we want to go further than that, and especially if we want explicit transitional states such as Activating and Deactivating, that should go through KREP-level design work rather than being folded into this fix.

That mismatch showed up in the integration suite. The test "CRD CRD Creation [It] should delete CRD when ResourceGraphDefinition is deleted" removes an RGD and then polls the generated CRD until it returns NotFound. In CI we observed the controller log "Deleting CRD", proceed with cleanup, and remove the finalizer, while the API server continued to serve the CRD. The test eventually timed out with "Expected an error, got nil".

This change makes deletion honor the stronger contract by polling for NotFound after CRD deletion is requested and only reporting cleanup success once the CRD has actually disappeared. The change is covered by an RGD controller unit test, while the existing integration spec remains the end-to-end regression.
```go
}

// waitForDelete waits for a CRD to be fully removed from the API server.
func (w *CRDWrapper) waitForDelete(ctx context.Context, name string) error {
```
would this block a worker from reconciling other RGDs until the CRD is deleted?
should we consider requeuing instead?
nvm i see it's addressed in PR description
Yeah, it's just a temporary hack, and I'm the first to dislike it.
can we ensure we delete the CRD before shutting down the controller here? https://github.com/kubernetes-sigs/kro/blob/main/pkg/controller/resourcegraphdefinition/controller_cleanup.go#L34-L50
If not, any instances that have not been cleaned up will no longer be reconciled.
Good call. We should, perhaps in a separate PR with proper integration tests etc...
picking it up after this PR is merged
@michaelhtm: changing LGTM is restricted to collaborators.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: a-hilaly, michaelhtm
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
jakobmoellerdev left a comment
/lgtm
Please let's follow up on this with a proper fix as soon as possible.
/hold

/retest
@a-hilaly: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.