Skip to content

🐛 manager: fix infinite CPU spin in runnableGroup.Start on context cancellation#3440

Open
atharrva01 wants to merge 6 commits intokubernetes-sigs:mainfrom
atharrva01:fix/runnable-group-context-cancel-spin
Open

🐛 manager: fix infinite CPU spin in runnableGroup.Start on context cancellation#3440
atharrva01 wants to merge 6 commits intokubernetes-sigs:mainfrom
atharrva01:fix/runnable-group-context-cancel-spin

Conversation

@atharrva01
Copy link

Summary

This PR fixes a control-flow bug in the manager startup lifecycle where runnableGroup.Start() could enter an infinite loop when the context is cancelled before all runnables signal readiness.
The issue could cause manager startup or shutdown to hang indefinitely while consuming CPU.


Impact

Before this change, the manager could:

  • Spin indefinitely at 100% CPU after context cancellation
  • Hang forever during startup if any runnable fails its readiness check
  • Hang during SIGTERM / shutdown while waiting for runnables
  • Leak goroutines stuck inside Start()

These scenarios are common in real deployments (RBAC misconfigurations, cache sync failures, API server unavailability), making the issue both user-visible and hard to diagnose.


Steps to Reproduce

Scenario 1: Startup failure

  1. Start a manager with a runnable whose cache sync fails (e.g. missing RBAC)
  2. The runnable never signals readiness
  3. Start() waits forever and the manager never becomes ready

Scenario 2: Shutdown during startup

  1. Start the manager
  2. Send SIGTERM before all runnables become ready
  3. Context is cancelled, but Start() keeps looping and burns CPU

Root Cause

The startup loop correctly listens for ctx.Done(), but does not exit the loop when the context is cancelled.

case <-ctx.Done():
    if err := ctx.Err(); !errors.Is(err, context.Canceled) {
        retErr = err
    }
    // loop continues → ctx.Done() fires forever

Since ctx.Done() is a closed channel after cancellation, the select immediately re-selects it on every iteration, causing a tight busy loop.


Fix

Exit the startup loop immediately when the context is cancelled:

case <-ctx.Done():
    if err := ctx.Err(); !errors.Is(err, context.Canceled) {
        retErr = err
    }
    return

This ensures:

  • The startup loop terminates correctly
  • Errors are propagated to the caller
  • Shutdown and startup behave predictably

Behavior After Fix

  • Manager startup fails fast instead of hanging
  • Shutdown completes cleanly without CPU spin
  • No goroutine leaks from stuck Start() calls
  • Already-started runnables continue running on their own context and are cleaned up via StopAndWait()

Verification

  • ✅ Code compiles successfully
  • go vet passes
  • ✅ Existing unit tests pass
  • ⚠️ Integration tests require envtest binaries (not available in this environment)

Signed-off-by: atharrva01 <atharvaborade568@gmail.com>
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jan 27, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atharrva01
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

Welcome @atharrva01!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 27, 2026
@k8s-ci-robot
Copy link
Contributor

Hi @atharrva01. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 27, 2026
@atharrva01
Copy link
Author

hi @inteon @vincepri This fixes a startup/shutdown hang where runnableGroup.Start() could spin indefinitely after context cancellation by returning immediately on ctx.Done().

@atharrva01 atharrva01 changed the title Fix infinite CPU spin in runnableGroup.Start on context cancel manager: fix infinite CPU spin in runnableGroup.Start on context cancellation Jan 27, 2026
@sbueringer
Copy link
Member

Nice catch, thx!

PTAL at the CLA: #3440 (comment)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 27, 2026
@sbueringer sbueringer changed the title manager: fix infinite CPU spin in runnableGroup.Start on context cancellation 🐛 manager: fix infinite CPU spin in runnableGroup.Start on context cancellation Jan 27, 2026
Signed-off-by: atharrva01 <atharvaborade568@gmail.com>
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 27, 2026
@atharrva01
Copy link
Author

hi @sbueringer I’ve added a focused regression test covering the context-cancellation path in runnableGroup.Start(). I ran the unit test locally; integration tests require envtest binaries, so CI should cover those. Thanks for the suggestion!

image

@atharrva01 atharrva01 changed the title 🐛 manager: fix infinite CPU spin in runnableGroup.Start on context cancellation manager: fix infinite CPU spin in runnableGroup.Start on context cancellation Jan 27, 2026
@sbueringer
Copy link
Member

sbueringer commented Jan 27, 2026

@atharrva01 The 🐛 prefix in the PR title I added was intentional. We use this to generate release notes and without it the "PR title verifier" action will block merging this PR

@atharrva01 atharrva01 changed the title manager: fix infinite CPU spin in runnableGroup.Start on context cancellation 🐛 manager: fix infinite CPU spin in runnableGroup.Start on context cancellation Jan 27, 2026
@atharrva01
Copy link
Author

atharrva01 commented Jan 27, 2026

hi @sbueringer Thanks for the clarification! 👍
I’ve kept that manager: prefix as suggested.

@sbueringer
Copy link
Member

Feel free to ping me for review once the CLA is done

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 29, 2026
@atharrva01
Copy link
Author

Hi @sbueringer, the Easy CLA is now completed ✅
Whenever you have time, I’ve pinged you for review as suggested. Thanks!

Copy link
Member

@sbueringer sbueringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, just some minor findings

},
)).To(Succeed())

ctx, cancel := context.WithCancel(context.Background())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use t.Context() instead of context.Background(), same below (see linter findings)

select {
case <-startReturned:
// Success: Start() returned after context cancellation
case <-time.After(100 * time.Millisecond):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let's use 5s or so just to make sure we don't get a flake in CI. 100ms should be enough, but not sure how slow CI might be

Copy link
Author

@atharrva01 atharrva01 Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @sbueringer I’ve updated the test to use t.Context() and relaxed the timeout to avoid potential CI flakes, as suggested. Changes are pushed, retriggering CI now.

/retest

Copy link
Member

@sbueringer sbueringer Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atharrva01 One more linter finding (please see the lint/test results below :))

image

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @sbueringer Thanks for catching that , there was one remaining context.Background() in the test.
I’ve replaced it with t.Context() to satisfy lint. Changes are pushed.

}

// TestStartReturnsWhenContextCancelledWithPendingReadinessCheck verifies that
// runnableGroup.Start() returns promptly when the context is cancelled, even if
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's an issue, not sure if in prod or test code

WARNING: DATA RACE
Write at 0x00c0006184e0 by goroutine 191:
  runtime.closechan()
      /home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/chan.go:414 +0x0
  sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).Start.func1.deferwrap1()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:170 +0x33
  runtime.deferreturn()
      /home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/panic.go:589 +0x5d
  sync.(*Once).doSlow()
      /home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/sync/once.go:78 +0xd1
  sync.(*Once).Do()
      /home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/sync/once.go:69 +0x44
  sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).Start()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:169 +0xb5
  sigs.k8s.io/controller-runtime/pkg/manager.init.func5.3.1()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group_test.go:181 +0xca

Previous read at 0x00c0006184e0 by goroutine 238:
  runtime.chansend()
      /home/prow/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/chan.go:176 +0x0
  sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1.1()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:249 +0xc4

Goroutine 191 (running) created at:
  sigs.k8s.io/controller-runtime/pkg/manager.init.func5.3()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group_test.go:178 +0xf9
  github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3()
      /home/prow/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.27.2/internal/suite.go:942 +0x6ed

Goroutine 238 (running) created at:
  sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:246 +0xfc
  sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.gowrap1()
      /home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:284 +0x41
==================
panic: send on closed channel

goroutine 281 [running]:
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1.1()
	/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:249 +0xc5
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 in goroutine 279
	/home/prow/go/src/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:246 +0xfd
FAIL	sigs.k8s.io/controller-runtime/pkg/manager	5.433s

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_controller-runtime/3440/pull-controller-runtime-test/2018359601396715520 (see https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_controller-runtime/3440/pull-controller-runtime-test/2018359601396715520/build-log.txt)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @sbueringer Thanks for the clarification request, this turned out to be a real concurrency issue in runnableGroup, not just a test artifact.

Start() was closing startReadyCh while readiness goroutines could still send on it, which caused a data race and send on closed channel panic depending on timing. Tests just made this easier to reproduce.

Removing the channel close exposed a second shutdown issue: Start() could block waiting for readiness when StopAndWait() cancelled the group context and all senders had already exited.

The fix now handles both sides explicitly: senders stop on r.ctx.Done(), and Start() exits readiness waiting on either ctx.Done() or r.ctx.Done(). This avoids races and deadlocks while preserving normal startup behavior.

Unit tests and go test -race pass locally; envtest-based integration tests should be covered by CI.

@sbueringer sbueringer added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 9, 2026
}
}
// We're done waiting if the queue is empty, return.
if len(r.startQueue) == 0 {
Copy link
Member

@sbueringer sbueringer Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got why we can't close the channel in the other cases above, but I think we can still close it here

// either block forever or panic if the channel were closed.
select {
case r.startReadyCh <- rn:
case <-r.ctx.Done():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has to be the first case otherwise this will still send always on startReadyCh

// Cancel the context
cancel()

// Start() should return promptly (within 100ms), not hang or spin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the 100ms are outdated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants