@jakobmoellerdev (Contributor) commented Jul 31, 2025

Previously, the informer queue of the dynamic controller and the manager shut down immediately (the process was killed).

This is problematic because a reconcile loop could be interrupted in the middle of a run even though we nominally have a graceful shutdown.

Fortunately, controller-runtime provides a clean way to fix this: we add the dynamic controller to the manager's runnables instead of leaking a goroutine, and we fix how the manager itself is stopped (previously the manager was run in a goroutine).

Additionally, we now properly shut down the queue of the dynamic controller and wait for its workers to finish processing.

Note that if shutdown still takes longer than the graceful shutdown period, we forcefully kill the manager and the binary.

Because the shutdown timeout is now handled globally within the manager rather than individually for the dynamic controller, I have replaced

  • --dynamic-controller-default-shutdown-timeout with --graceful-shutdown-timeout, and
  • KRO_DYNAMIC_CONTROLLER_DEFAULT_SHUTDOWN_TIMEOUT with KRO_GRACEFUL_SHUTDOWN_TIMEOUT

A nice side effect is that this shutdown timeout now applies to the other controllers as well, not just the dynamic controller.

Example log output when the process receives a shutdown signal:

2025-07-31T15:34:25.977+0200	INFO	Stopping and waiting for non leader election runnables
2025-07-31T15:34:25.977+0200	INFO	Stopping and waiting for leader election runnables
2025-07-31T15:34:25.977+0200	INFO	rgd-controller	Shutdown signal received, waiting for all workers to finish	{"controller": "ResourceGraphDefinition", "controllerGroup": "kro.run", "controllerKind": "ResourceGraphDefinition"}
2025-07-31T15:34:25.977+0200	INFO	dynamic-controller	Received shutdown signal, shutting down dynamic controller queue
2025-07-31T15:34:25.977+0200	INFO	rgd-controller	All workers finished	{"controller": "ResourceGraphDefinition", "controllerGroup": "kro.run", "controllerKind": "ResourceGraphDefinition"}
2025-07-31T15:34:25.977+0200	INFO	dynamic-controller	Dynamic controller worker received shutdown signal, stopping
2025-07-31T15:34:25.977+0200	INFO	dynamic-controller	All workers have stopped
2025-07-31T15:34:25.977+0200	INFO	dynamic-controller	Starting graceful shutdown
2025-07-31T15:34:25.977+0200	DEBUG	dynamic-controller	Shutting down informer	{"gvr": "kro.run/v1alpha1, Resource=deploymentservices"}
2025-07-31T15:34:25.977+0200	INFO	dynamic-controller	All informers shut down successfully
2025-07-31T15:34:25.977+0200	INFO	dynamic-controller	Shutting down dynamic controller
2025-07-31T15:34:25.977+0200	INFO	Stopping and waiting for caches
2025-07-31T15:34:25.977+0200	INFO	Stopping and waiting for webhooks
2025-07-31T15:34:25.977+0200	INFO	Stopping and waiting for HTTP servers
2025-07-31T15:34:25.977+0200	INFO	shutting down server	{"name": "health probe", "addr": "[::]:8079"}
2025-07-31T15:34:25.977+0200	INFO	controller-runtime.metrics	Shutting down metrics server with timeout of 1 minute
2025-07-31T15:34:25.977+0200	INFO	Wait completed, proceeding to shutdown the manager

…hutdown correctly


Signed-off-by: Jakob Möller <jakob.moeller@sap.com>
@barney-s barney-s requested review from a-hilaly, barney-s and matthchr and removed request for a-hilaly August 8, 2025 18:07
```go
	"interval at which the controller will re-list resources even with no changes, in seconds")
flag.IntVar(&queueMaxRetries, "dynamic-controller-default-queue-max-retries", 20,
	"maximum number of times an item in the queue will be retried before being dropped")
flag.IntVar(&shutdownTimeout, "dynamic-controller-default-shutdown-timeout", 60,
```
Contributor:
We should probably call this out in the release notes since it's breaking, especially given we've also changed the name of the parameter in the Helm chart.

@a-hilaly do we have a good process to flag something that should be highlighted in the release notes?

Contributor (Author):

If we used labels and conventional commits, something like `fix!: myfix` could mark the breaking change with a `!` (see https://www.conventionalcommits.org/en/v1.0.0/#summary), but I don't know if that's too restrictive. Apart from that, it would be good to mark it manually with a label and then be able to generate release notes from that.

Member:

> @a-hilaly do we have a good process to flag something that should be highlighted in the release notes?

Not really; so far we've flagged things manually. We do use https://github.com/softprops/action-gh-release for GitHub releases, and at first glance it looks like it parses conventional commits.

Contributor (Author):

I've marked it now and will pay attention on the next release. If the notes get generated correctly, all good; then this should just become part of the contributor guide.

@jakobmoellerdev jakobmoellerdev changed the title fix: properly shutdown controller manager and dynamic controller fix!: properly shutdown controller manager and dynamic controller Aug 13, 2025
@jakobmoellerdev (Contributor, Author):

@a-hilaly @matthchr anything left to do for me here before merge?

@jakobmoellerdev (Contributor, Author):

@a-hilaly @matthchr friendly ping as this has been open for a few weeks now

@matthchr (Contributor) commented Sep 5, 2025

As far as I'm concerned this is good to merge. @a-hilaly, if you want to take another look that's fine by me, but otherwise my vote is to merge.

I'm out of office all next week, but when I'm back, if there are no more comments, I'll merge this. Otherwise feel free to merge while I'm out.

…fecycle

# Conflicts:
#	cmd/controller/main.go
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 12, 2025
@k8s-ci-robot:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jakobmoellerdev, matthchr
Once this PR has been reviewed and has the lgtm label, please assign barney-s for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 12, 2025