🐛 manager: fix infinite CPU spin in runnableGroup.Start on context cancellation #3440
atharrva01 wants to merge 6 commits into kubernetes-sigs:main
Conversation
Signed-off-by: atharrva01 <atharvaborade568@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: atharrva01. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Welcome @atharrva01!
|
Hi @atharrva01. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Nice catch, thx! PTAL at the CLA: #3440 (comment)

/ok-to-test
Signed-off-by: atharrva01 <atharvaborade568@gmail.com>
hi @sbueringer I’ve added a focused regression test covering the context-cancellation path in runnableGroup.Start(). I ran the unit test locally; integration tests require envtest binaries, so CI should cover those. Thanks for the suggestion!
@atharrva01 The 🐛 prefix I added to the PR title was intentional. We use it to generate release notes, and without it the "PR title verifier" action will block merging this PR.
hi @sbueringer Thanks for the clarification! 👍
Feel free to ping me for review once the CLA is done.
Hi @sbueringer, the Easy CLA is now completed ✅ |
sbueringer left a comment
Thx, just some minor findings
pkg/manager/runnable_group_test.go (outdated)

        },
    )).To(Succeed())

    ctx, cancel := context.WithCancel(context.Background())
Let's use t.Context() instead of context.Background(), same below (see linter findings)
pkg/manager/runnable_group_test.go (outdated)

    select {
    case <-startReturned:
        // Success: Start() returned after context cancellation
    case <-time.After(100 * time.Millisecond):
Maybe let's use 5s or so just to make sure we don't get a flake in CI. 100ms should be enough, but not sure how slow CI might be
hi @sbueringer I’ve updated the test to use t.Context() and relaxed the timeout to avoid potential CI flakes, as suggested. Changes are pushed, retriggering CI now.
/retest
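For illustration, here is a hedged sketch of what the updated test shape might look like after both suggestions (t.Context() instead of context.Background(), and a generous timeout). The test name, package name, channel names, and the call under test are illustrative, not the exact code in this PR:

```go
package manager_test

import (
	"context"
	"testing"
	"time"
)

func TestStartReturnsOnContextCancel(t *testing.T) {
	// t.Context() (Go 1.24+) is cancelled automatically when the test ends,
	// so nothing leaks across tests; derive a cancellable child from it.
	ctx, cancel := context.WithCancel(t.Context())
	defer cancel()

	startReturned := make(chan struct{})
	go func() {
		defer close(startReturned)
		// The real test calls the group's Start(ctx) here; this sketch just
		// blocks until the context is cancelled.
		<-ctx.Done()
	}()

	// Cancel before anything reports readiness.
	cancel()

	select {
	case <-startReturned:
		// Success: Start() returned after context cancellation.
	case <-time.After(5 * time.Second):
		// Generous timeout so a slow CI machine does not turn into a flake.
		t.Fatal("Start() did not return after context cancellation")
	}
}
```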
@atharrva01 One more linter finding (please see the lint/test results below :))
hi @sbueringer Thanks for catching that; there was one remaining context.Background() in the test. I've replaced it with t.Context() to satisfy the linter. Changes are pushed.
    }

    // TestStartReturnsWhenContextCancelledWithPendingReadinessCheck verifies that
    // runnableGroup.Start() returns promptly when the context is cancelled, even if
Looks like there's an issue, not sure if in prod or test code
WARNING: DATA RACE
Write at 0x00c0006184e0 by goroutine 191:
runtime.closechan()
/home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/chan.go:414 +0x0
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).Start.func1.deferwrap1()
/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:170 +0x33
runtime.deferreturn()
/home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/panic.go:589 +0x5d
sync.(*Once).doSlow()
/home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/sync/once.go:78 +0xd1
sync.(*Once).Do()
/home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/sync/once.go:69 +0x44
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).Start()
/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:169 +0xb5
sigs.k8s.io/controller-runtime/pkg/manager.init.func5.3.1()
/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group_test.go:181 +0xca
Previous read at 0x00c0006184e0 by goroutine 238:
runtime.chansend()
/home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/chan.go:176 +0x0
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1.1()
/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:249 +0xc4
Goroutine 191 (running) created at:
sigs.k8s.io/controller-runtime/pkg/manager.init.func5.3()
/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group_test.go:178 +0xf9
github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3()
/home/prow/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.27.2/internal/suite.go:942 +0x6ed
Goroutine 238 (running) created at:
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1()
/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:246 +0xfc
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.gowrap1()
/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:284 +0x41
==================
panic: send on closed channel
goroutine 281 [running]:
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1.1()
/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:249 +0xc5
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 in goroutine 279
/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:246 +0xfd
FAIL sigs.k8s.io/controller-runtime/pkg/manager 5.433s
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_controller-runtime/3440/pull-controller-runtime-test/2018359601396715520 (see https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_controller-runtime/3440/pull-controller-runtime-test/2018359601396715520/build-log.txt)
hi @sbueringer Thanks for flagging this; it turned out to be a real concurrency issue in runnableGroup, not just a test artifact.
Start() was closing startReadyCh while readiness goroutines could still send on it, which caused a data race and a send-on-closed-channel panic depending on timing. The tests just made this easier to reproduce.
Removing the channel close exposed a second shutdown issue: Start() could block waiting for readiness when StopAndWait() cancelled the group context and all senders had already exited.
The fix now handles both sides explicitly: senders stop on r.ctx.Done(), and Start() exits readiness waiting on either ctx.Done() or r.ctx.Done(). This avoids races and deadlocks while preserving normal startup behavior.
Unit tests and go test -race pass locally; envtest-based integration tests should be covered by CI.
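For readers following along, here is a small standalone Go sketch of the pattern described above, using generic names (groupCtx, readyCh) rather than the actual runnableGroup fields: senders give up on the group context instead of sending, and the waiting side also exits on cancellation, so the channel never needs to be closed while senders may still be running.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	groupCtx, stop := context.WithCancel(context.Background())
	readyCh := make(chan int) // never closed; both sides watch groupCtx instead

	// Senders: report readiness, but bail out if the group is shutting down,
	// so they never race with a close or send into a channel nobody drains.
	for i := 0; i < 3; i++ {
		go func(id int) {
			time.Sleep(10 * time.Millisecond) // simulate startup work
			select {
			case <-groupCtx.Done(): // shutdown wins: skip the notification
			case readyCh <- id:
			}
		}(i)
	}

	// Cancel before everyone is ready, as in the bug report.
	stop()

	// Receiver (the Start() side): wait for readiness, but also exit on
	// cancellation instead of blocking forever or busy-looping.
	for pending := 3; pending > 0; pending-- {
		select {
		case id := <-readyCh:
			fmt.Println("ready:", id)
		case <-groupCtx.Done():
			fmt.Println("context cancelled, stop waiting")
			return
		}
	}
}
```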
        }
    }
    // We're done waiting if the queue is empty, return.
    if len(r.startQueue) == 0
I got why we can't close the channel in the other cases above, but I think we can still close it here
    // either block forever or panic if the channel were closed.
    select {
    case r.startReadyCh <- rn:
    case <-r.ctx.Done():
I think this has to be the first case, otherwise this will still always send on startReadyCh.
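One nuance worth noting here: Go's select chooses pseudo-randomly among the cases that are ready, so reordering the cases alone does not guarantee that cancellation wins. A non-blocking pre-check is the usual way to express that priority. A minimal standalone sketch with illustrative names (notifyReady, readyCh), not the controller-runtime code:

```go
package main

import (
	"context"
	"fmt"
)

// notifyReady reports readiness unless the context is already cancelled.
func notifyReady(ctx context.Context, readyCh chan<- string, name string) {
	// Non-blocking pre-check: if shutdown already started, skip the send
	// regardless of whether readyCh also happens to be ready.
	select {
	case <-ctx.Done():
		return
	default:
	}
	select {
	case readyCh <- name:
	case <-ctx.Done():
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	readyCh := make(chan string, 1) // buffered, so a send would always be "ready"
	cancel()                        // shutdown begins before readiness is reported
	notifyReady(ctx, readyCh, "runnable-a")
	fmt.Println("queued notifications:", len(readyCh)) // 0: the pre-check won
}
```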
    // Cancel the context
    cancel()

    // Start() should return promptly (within 100ms), not hang or spin
I think the 100ms in this comment is outdated.

Summary
This PR fixes a control-flow bug in the manager startup lifecycle where `runnableGroup.Start()` could enter an infinite loop when the context is cancelled before all runnables signal readiness. The issue could cause manager startup or shutdown to hang indefinitely while consuming CPU.
Impact

Before this change, the manager could hang indefinitely in `Start()` and spin the CPU when the context was cancelled before every runnable signalled readiness. These scenarios are common in real deployments (RBAC misconfigurations, cache sync failures, API server unavailability), making the issue both user-visible and hard to diagnose.
Steps to Reproduce

Scenario 1: Startup failure
`Start()` waits forever and the manager never becomes ready.

Scenario 2: Shutdown during startup
`Start()` keeps looping and burns CPU.

Root Cause
The startup loop correctly listens for `ctx.Done()`, but does not exit the loop when the context is cancelled. Since `ctx.Done()` is a closed channel after cancellation, the `select` immediately re-selects it on every iteration, causing a tight busy loop.
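A tiny standalone demo of that failure mode (illustrative only, not the controller-runtime code): once the context is cancelled, `ctx.Done()` is permanently ready, so a loop that selects on it without returning spins at full speed.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // cancellation closes ctx.Done()

	readyCh := make(chan struct{}) // never receives anything in this demo
	iterations := 0
	deadline := time.Now().Add(10 * time.Millisecond)

	for time.Now().Before(deadline) { // bounded so the demo terminates
		select {
		case <-readyCh:
			// would record readiness here
		case <-ctx.Done():
			// Bug shape: nothing breaks out of the loop, so the select
			// immediately re-fires on the already-closed Done() channel.
			iterations++
		}
	}
	fmt.Printf("spun %d times in 10ms without making progress\n", iterations)
}
```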
Fix

Exit the startup loop immediately when the context is cancelled:
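A hedged, standalone sketch of that loop shape, using generic names (waitForReady, readyCh, groupCtx) rather than the exact runnable_group.go code: return as soon as either the caller's context or the group's internal context is cancelled.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// waitForReady drains readiness signals but stops as soon as either context
// is cancelled, instead of selecting the closed Done() channel forever.
func waitForReady(ctx, groupCtx context.Context, readyCh <-chan string, pending int) error {
	for pending > 0 {
		select {
		case name := <-readyCh:
			fmt.Println("ready:", name)
			pending--
		case <-ctx.Done():
			return ctx.Err() // caller cancelled: stop waiting
		case <-groupCtx.Done():
			return groupCtx.Err() // group shutdown (e.g. StopAndWait): stop waiting
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	groupCtx := context.Background()
	readyCh := make(chan string)

	cancel() // cancel before anything reports readiness
	err := waitForReady(ctx, groupCtx, readyCh, 2)
	fmt.Println("returned immediately, cancelled:", errors.Is(err, context.Canceled))
}
```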
This ensures that `Start()` returns promptly on cancellation instead of busy-looping, while preserving normal startup behavior.
Behavior After Fix

- `Start()` returns promptly when the context is cancelled, including when shutdown goes through `StopAndWait()`.
- No more tight busy loop or CPU spin during startup or shutdown.

Verification

- `go vet` passes.
- Unit tests (including the new regression test) pass locally with `-race`.
- Integration tests require `envtest` binaries (not available in this environment), so CI covers those.