🐛 manager: fix infinite CPU spin in runnableGroup.Start on context cancellation by atharrva01 · Pull Request #3440 · kubernetes-sigs/controller-runtime

atharrva01 · 2026-01-27T11:03:46Z

Summary

This PR fixes a control-flow bug in the manager startup lifecycle where runnableGroup.Start() could enter an infinite loop when the context is cancelled before all runnables signal readiness.
The issue could cause manager startup or shutdown to hang indefinitely while consuming CPU.

Impact

Before this change, the manager could:

Spin indefinitely at 100% CPU after context cancellation
Hang forever during startup if any runnable fails its readiness check
Hang during SIGTERM / shutdown while waiting for runnables
Leak goroutines stuck inside Start()

These scenarios are common in real deployments (RBAC misconfigurations, cache sync failures, API server unavailability), making the issue both user-visible and hard to diagnose.

Steps to Reproduce

Scenario 1: Startup failure

Start a manager with a runnable whose cache sync fails (e.g. missing RBAC)
The runnable never signals readiness
Start() waits forever and the manager never becomes ready

Scenario 2: Shutdown during startup

Start the manager
Send SIGTERM before all runnables become ready
Context is cancelled, but Start() keeps looping and burns CPU

Root Cause

The startup loop correctly listens for ctx.Done(), but does not exit the loop when the context is cancelled.

case <-ctx.Done():
    if err := ctx.Err(); !errors.Is(err, context.Canceled) {
        retErr = err
    }
    // loop continues → ctx.Done() fires forever

Since ctx.Done() is a closed channel after cancellation, the select immediately re-selects it on every iteration, causing a tight busy loop.

Fix

Exit the startup loop immediately when the context is cancelled:

case <-ctx.Done():
    if err := ctx.Err(); !errors.Is(err, context.Canceled) {
        retErr = err
    }
    return

This ensures:

The startup loop terminates correctly
Errors are propagated to the caller
Shutdown and startup behave predictably

Behavior After Fix

Manager startup fails fast instead of hanging
Shutdown completes cleanly without CPU spin
No goroutine leaks from stuck Start() calls
Already-started runnables continue running on their own context and are cleaned up via StopAndWait()

Verification

✅ Code compiles successfully
✅ go vet passes
✅ Existing unit tests pass
⚠️ Integration tests require envtest binaries (not available in this environment)

Signed-off-by: atharrva01 <atharvaborade568@gmail.com>

linux-foundation-easycla · 2026-01-27T11:03:53Z

The committers listed above are authorized under a signed CLA.

✅ login: atharrva01 / name: Atharva Borade (4e1d027, 8b18f6f, 8deb807, ab2c563, b0f36fd, c00cf65)

k8s-ci-robot · 2026-01-27T11:03:53Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atharrva01
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-01-27T11:03:55Z

Welcome @atharrva01!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-01-27T11:03:56Z

Hi @atharrva01. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

atharrva01 · 2026-01-27T11:05:56Z

hi @inteon @vincepri This fixes a startup/shutdown hang where runnableGroup.Start() could spin indefinitely after context cancellation by returning immediately on ctx.Done().

pkg/manager/runnable_group.go

sbueringer · 2026-01-27T12:26:05Z

Nice catch, thx!

PTAL at the CLA: #3440 (comment)

/ok-to-test

Signed-off-by: atharrva01 <atharvaborade568@gmail.com>

atharrva01 · 2026-01-27T13:42:36Z

hi @sbueringer I’ve added a focused regression test covering the context-cancellation path in runnableGroup.Start(). I ran the unit test locally; integration tests require envtest binaries, so CI should cover those. Thanks for the suggestion!

sbueringer · 2026-01-27T15:22:01Z

@atharrva01 The 🐛 prefix in the PR title I added was intentional. We use this to generate release notes and without it the "PR title verifier" action will block merging this PR

atharrva01 · 2026-01-27T16:50:56Z

hi @sbueringer Thanks for the clarification! 👍
I’ve kept that manager: prefix as suggested.

sbueringer · 2026-01-28T17:30:26Z

Feel free to ping me for review once the CLA is done

atharrva01 · 2026-01-29T05:03:03Z

Hi @sbueringer, the Easy CLA is now completed ✅
Whenever you have time, I’ve pinged you for review as suggested. Thanks!

sbueringer

Thx, just some minor findings

sbueringer · 2026-02-02T12:35:53Z

pkg/manager/runnable_group_test.go

+		},
+	)).To(Succeed())
+
+	ctx, cancel := context.WithCancel(context.Background())


Let's use t.Context() instead of context.Background(), same below (see linter findings)

sbueringer · 2026-02-02T12:37:18Z

pkg/manager/runnable_group_test.go

+	select {
+	case <-startReturned:
+		// Success: Start() returned after context cancellation
+	case <-time.After(100 * time.Millisecond):


Maybe let's use 5s or so just to make sure we don't get a flake in CI. 100ms should be enough, but not sure how slow CI might be

hi @sbueringer I’ve updated the test to use t.Context() and relaxed the timeout to avoid potential CI flakes, as suggested. Changes are pushed, retriggering CI now.

/retest

@atharrva01 One more linter finding (please see the lint/test results below :))

hi @sbueringer Thanks for catching that , there was one remaining context.Background() in the test.
I’ve replaced it with t.Context() to satisfy lint. Changes are pushed.

sbueringer · 2026-02-03T08:59:23Z

pkg/manager/runnable_group_test.go

 }

+// TestStartReturnsWhenContextCancelledWithPendingReadinessCheck verifies that
+// runnableGroup.Start() returns promptly when the context is cancelled, even if


Looks like there's an issue, not sure if in prod or test code

WARNING: DATA RACE Write at 0x00c0006184e0 by goroutine 191: runtime.closechan() /home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/chan.go:414 +0x0 sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).Start.func1.deferwrap1() /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:170 +0x33 runtime.deferreturn() /home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/panic.go:589 +0x5d sync.(*Once).doSlow() /home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/sync/once.go:78 +0xd1 sync.(*Once).Do() /home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/sync/once.go:69 +0x44 sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).Start() /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:169 +0xb5 sigs.k8s.io/controller-runtime/pkg/manager.init.func5.3.1() /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group_test.go:181 +0xca Previous read at 0x00c0006184e0 by goroutine 238: runtime.chansend() /home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/chan.go:176 +0x0 sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1.1() /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:249 +0xc4 Goroutine 191 (running) created at: sigs.k8s.io/controller-runtime/pkg/manager.init.func5.3() /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group_test.go:178 +0xf9 github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3() /home/prow/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.27.2/internal/suite.go:942 +0x6ed Goroutine 238 (running) created at: sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1() /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:246 +0xfc sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.gowrap1() /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:284 +0x41 ================== panic: send on closed channel goroutine 281 [running]: sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1.1() /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:249 +0xc5 created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 in goroutine 279 /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:246 +0xfd FAIL sigs.k8s.io/controller-runtime/pkg/manager 5.433s

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_controller-runtime/3440/pull-controller-runtime-test/2018359601396715520 (see https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_controller-runtime/3440/pull-controller-runtime-test/2018359601396715520/build-log.txt)

hi @sbueringer Thanks for the clarification request, this turned out to be a real concurrency issue in runnableGroup, not just a test artifact.

Start() was closing startReadyCh while readiness goroutines could still send on it, which caused a data race and send on closed channel panic depending on timing. Tests just made this easier to reproduce.

Removing the channel close exposed a second shutdown issue: Start() could block waiting for readiness when StopAndWait() cancelled the group context and all senders had already exited.

The fix now handles both sides explicitly: senders stop on r.ctx.Done(), and Start() exits readiness waiting on either ctx.Done() or r.ctx.Done(). This avoids races and deadlocks while preserving normal startup behavior.

Unit tests and go test -race pass locally; envtest-based integration tests should be covered by CI.

sbueringer · 2026-02-09T10:30:41Z

pkg/manager/runnable_group.go

 					}
 				}
 				// We're done waiting if the queue is empty, return.
 				if len(r.startQueue) == 0 {


I got why we can't close the channel in the other cases above, but I think we can still close it here

sbueringer · 2026-02-09T10:36:35Z

pkg/manager/runnable_group.go

+						// either block forever or panic if the channel were closed.
+						select {
+						case r.startReadyCh <- rn:
+						case <-r.ctx.Done():


I think this has to be the first case otherwise this will still send always on startReadyCh

sbueringer · 2026-02-09T10:37:24Z

pkg/manager/runnable_group_test.go

+	// Cancel the context
+	cancel()
+
+	// Start() should return promptly (within 100ms), not hang or spin


I think the 100ms are outdated

Fix infinite CPU spin in runnableGroup.Start on context cancel

4e1d027

Signed-off-by: atharrva01 <atharvaborade568@gmail.com>

k8s-ci-robot requested review from inteon and vincepri January 27, 2026 11:03

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 27, 2026

k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 27, 2026

atharrva01 changed the title ~~Fix infinite CPU spin in runnableGroup.Start on context cancel~~ manager: fix infinite CPU spin in runnableGroup.Start on context cancellation Jan 27, 2026

sbueringer reviewed Jan 27, 2026

View reviewed changes

pkg/manager/runnable_group.go Show resolved Hide resolved

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 27, 2026

sbueringer changed the title ~~manager: fix infinite CPU spin in runnableGroup.Start on context cancellation~~ 🐛 manager: fix infinite CPU spin in runnableGroup.Start on context cancellation Jan 27, 2026

Fix runnableGroup.Start context cancel spin and add regression test

c00cf65

Signed-off-by: atharrva01 <atharvaborade568@gmail.com>

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 27, 2026

atharrva01 changed the title ~~🐛 manager: fix infinite CPU spin in runnableGroup.Start on context cancellation~~ manager: fix infinite CPU spin in runnableGroup.Start on context cancellation Jan 27, 2026

atharrva01 changed the title ~~manager: fix infinite CPU spin in runnableGroup.Start on context cancellation~~ 🐛 manager: fix infinite CPU spin in runnableGroup.Start on context cancellation Jan 27, 2026

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 29, 2026

sbueringer reviewed Feb 2, 2026

View reviewed changes

atharrva01 added 3 commits February 2, 2026 19:14

test: use t.Context and relax timeout to avoid flakes

ab2c563

test: remove time.Sleep and rely on context cancellation

8b18f6f

test: replace remaining context.Background with t.Context

b0f36fd

sbueringer reviewed Feb 3, 2026

View reviewed changes

fix: unblock readiness wait on group shutdown

8deb807

sbueringer added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 9, 2026

sbueringer reviewed Feb 9, 2026

View reviewed changes

Conversation

atharrva01 commented Jan 27, 2026

Summary

Impact

Steps to Reproduce

Scenario 1: Startup failure

Scenario 2: Shutdown during startup

Root Cause

Fix

Behavior After Fix

Verification

Uh oh!

linux-foundation-easycla bot commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Jan 27, 2026

Uh oh!

k8s-ci-robot commented Jan 27, 2026

Uh oh!

k8s-ci-robot commented Jan 27, 2026

Uh oh!

atharrva01 commented Jan 27, 2026

Uh oh!

Uh oh!

sbueringer commented Jan 27, 2026

Uh oh!

atharrva01 commented Jan 27, 2026

Uh oh!

sbueringer commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

atharrva01 commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sbueringer commented Jan 28, 2026

Uh oh!

atharrva01 commented Jan 29, 2026

Uh oh!

sbueringer left a comment

Choose a reason for hiding this comment

Uh oh!

sbueringer Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

sbueringer Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

atharrva01 Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sbueringer Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

atharrva01 Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

sbueringer Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

atharrva01 Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

sbueringer Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sbueringer Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

sbueringer Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

linux-foundation-easycla bot commented Jan 27, 2026 •

edited

Loading

sbueringer commented Jan 27, 2026 •

edited

Loading

atharrva01 commented Jan 27, 2026 •

edited

Loading

atharrva01 Feb 2, 2026 •

edited

Loading

sbueringer Feb 2, 2026 •

edited

Loading

sbueringer Feb 9, 2026 •

edited

Loading