feat: Add last chance check before orphan termination for JIT instances #4595

stuartp44 · 2025-05-22T08:25:02Z

This pull request introduces enhancements to the AWS Lambda functions responsible for scaling GitHub Actions runners. Key changes include the addition of functionality to untag runners, support for Just-In-Time (JIT) runner configurations, and improvements to orphan runner handling. The updates also include modifications to tests to validate these new features.

Enhancements to Runner Management:

Added a new untag function in lambdas/functions/control-plane/src/aws/runners.ts to remove tags from EC2 instances. This is used to handle orphan runners that are online and busy.
Updated terminateOrphan logic in lambdas/functions/control-plane/src/scale-runners/scale-down.ts to differentiate between JIT runners and regular runners. Orphan runners with a valid runnerId are now checked for their state before untagging or terminating.

Support for JIT Runner Configurations:

Added runnerId to the RunnerInfo structure to support JIT runner identification. This enables tracking and handling of ephemeral runners created via JIT configurations. [1] [2]
Updated test mocks and assertions to include JIT runner scenarios, ensuring proper handling of JIT-configured runners in scaling operations. [1] [2]

Test Suite Updates:

Added new test cases in scale-down.test.ts to verify the behavior of orphan runners under JIT and non-JIT configurations. These tests ensure that online and busy runners are untagged, while offline runners are terminated.
Introduced mock data and functions in scale-down.test.ts and scale-up.test.ts to simulate JIT runner IDs and validate their integration with scaling logic. [1] [2]

Code Refactoring and Imports:

Updated imports across multiple files to include DeleteTagsCommand and untag functionality. This ensures consistent usage of the new untagging feature. [1] [2] [3]
Refactored getGitHubRunnerBusyState in scale-down.ts to reuse the new getGitHubSelfHostedRunnerState function, improving code clarity and reducing redundancy.

These changes collectively improve the scalability, reliability, and maintainability of the control plane for GitHub Actions runners.

npalm · 2025-05-22T15:04:28Z

@stuartp44 in case the PR is in WIP, can you mark the PR as draft?

stuartp44 · 2025-05-23T06:54:53Z

> @stuartp44 in case the PR is in WIP, can you mark the PR as draft?

The PR is actually ready for review. The build failures were unrelated to my changes, but I have fixed them nevertheless. @npalm

npalm

Thx for the PR. I was only able to do a partial review so far. I have checked the Lambda code (excluding the tests). I have also not tested a deployment yet.

This solution solves the problem only for JIT-enabled runners; for non-JIT runners the problem remains. I assume the chance is lower there, since typically fewer runners will be created. Do you have thoughts about alternatives? I think we should find a way / place to document this limitation.

I have made some comments, but need more time to go over the PR. I will ping you once ready, but am already sharing the feedback so far.

Copilot

Pull Request Overview

This PR adds JIT runner support by tagging runner IDs on EC2 instances, then performing a final orphan check to untag busy runners or terminate offline ones. It also refactors runner state retrieval and updates tests to cover the new behavior.

Introduce tag/untag for JIT runner lifecycle management
Add addGhRunnerIdToEC2InstanceTag and lastChanceCheckOrphanRunner in scale-up/scale-down
Update and extend tests for JIT and non-JIT orphan scenarios

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| scale-runners/scale-up.ts | Imported `tag`, added `addGhRunnerIdToEC2InstanceTag` call |
| scale-runners/scale-up.test.ts | Mocked `tag`, fixed typo, extended JIT mocks |
| scale-runners/scale-down.ts | Imported `untag`, added `lastChanceCheckOrphanRunner`, refactored runner state calls |
| scale-runners/scale-down.test.ts | Mocked `untag`, added JIT orphan handling tests |
| aws/runners.ts | Added `DeleteTagsCommand` and `untag`, updated `runnerId` extraction |
| aws/runners.test.ts | Added JIT listing test, fixed tag/untag runner tests |

Comments suppressed due to low confidence (1)

lambdas/functions/control-plane/src/scale-runners/scale-up.ts:419

[nitpick] The function name 'addGhRunnerIdToEC2InstanceTag' is verbose and mixes naming styles. Consider renaming it to something like 'tagRunnerId' for brevity and consistency.

async function addGhRunnerIdToEC2InstanceTag(instanceId: string, runnerId: string): Promise<void> {

npalm · 2025-06-11T19:37:24Z

lambdas/functions/control-plane/src/aws/runners.ts

+export async function untag(instanceId: string, tags: Tag[]): Promise<void> {
+  logger.debug(`Untagging '${instanceId}'`, { tags });
+  const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
+  await ec2.send(new DeleteTagsCommand({ Resources: [instanceId], Tags: tags }));


Did you test untagging via an actual deployment? I do not see any update to the permissions of the Lambda role, so I assume the call will fail.

npalm · 2025-06-11T19:40:43Z

lambdas/functions/control-plane/src/scale-runners/scale-down.ts

+      // do we have a valid runnerId? then we are in a Jit Runner scenario else, use old method
+      if (runner.runnerId) {
+        logger.debug(`Runner '${runner.instanceId}' is orphan, but has a runnerId.`);
+        await lastChanceCheckOrphanRunner(runner);


Untagging can throw an exception, and it looks like the exception is not handled at all. I would suggest handling a potential exception inside the untag function and returning a boolean.
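The suggestion above could look roughly like the following. This is a minimal sketch, not the PR's code: `untagSafely` and the `DeleteTags` callback type are hypothetical names, and the callback stands in for whatever actually sends the EC2 `DeleteTagsCommand`, so the example stays self-contained.

```typescript
// Minimal tag shape, mirroring the Key/Value pairs used by EC2 tags.
interface Tag {
  Key: string;
  Value?: string;
}

// Hypothetical signature for whatever actually issues the DeleteTags call.
type DeleteTags = (instanceId: string, tags: Tag[]) => Promise<void>;

// Wraps the call so that callers receive a boolean instead of an exception.
async function untagSafely(deleteTags: DeleteTags, instanceId: string, tags: Tag[]): Promise<boolean> {
  try {
    await deleteTags(instanceId, tags);
    return true;
  } catch (e) {
    // Log and report failure; the caller decides how to proceed.
    console.error(`Failed to untag '${instanceId}'`, e);
    return false;
  }
}
```

With this shape the caller can branch on the result (for example, leave the instance alone and retry on the next scale-down cycle) rather than wrapping every call site in its own try/catch.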

npalm · 2025-06-11T19:43:42Z

lambdas/functions/control-plane/src/scale-runners/scale-down.ts

  metricGitHubAppRateLimit(state.headers);

-  return state.data.busy;
+  return {
+    id: state.data.id,


Would it not be simpler to reuse either the GitHub state object, or simply return the data object, instead of this 1-to-1 mapping?

npalm · 2025-06-11T19:47:44Z

lambdas/functions/control-plane/src/scale-runners/scale-down.ts

+async function lastChanceCheckOrphanRunner(runner: RunnerList): Promise<void> {
+  const client = await getOrCreateOctokit(runner as RunnerInfo);
+  const runnerId = parseInt(runner.runnerId || '0');
+  const ec2Instance = runner as RunnerInfo;


Why the cast here? Does the input object not contain the right information?
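One way to avoid both the cast and the `'0'` fallback in the quoted lines is a type guard plus an explicit parse. The sketch below is illustrative only: the interface fields are assumptions trimmed to what the example needs, and `isRunnerInfo` and `parseRunnerId` are hypothetical helpers, not functions from the PR.

```typescript
// Assumed shapes, trimmed to the fields used here; the real interfaces may differ.
interface RunnerList {
  instanceId: string;
  owner?: string;
  type?: string;
  runnerId?: string;
}

interface RunnerInfo {
  instanceId: string;
  owner: string;
  type: string;
}

// Type guard: narrows a RunnerList to RunnerInfo only when the required fields are present.
function isRunnerInfo(runner: RunnerList): runner is RunnerList & RunnerInfo {
  return runner.owner !== undefined && runner.type !== undefined;
}

// Returns a positive integer runner ID, or undefined when the tag is missing or malformed.
function parseRunnerId(raw: string | undefined): number | undefined {
  if (raw === undefined || !/^\d+$/.test(raw)) {
    return undefined;
  }
  const id = Number.parseInt(raw, 10);
  return id > 0 ? id : undefined;
}
```

Returning `undefined` instead of `0` makes a missing runner ID an explicit case for the caller, rather than a lookup for a runner with ID 0.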

npalm · 2025-06-11T19:54:13Z

lambdas/functions/control-plane/src/scale-runners/scale-down.ts

@@ -200,18 +221,42 @@ async function markOrphan(instanceId: string): Promise<void> {
  }
 }

+async function lastChanceCheckOrphanRunner(runner: RunnerList): Promise<void> {


For me the method is confusing: it looks like termination is now done in two places, which makes the code flow hard to follow. Also, lastChanceCheckOrphanRunner sounds like a last check, not a termination.

npalm · 2025-06-11T20:04:37Z

lambdas/functions/control-plane/src/scale-runners/scale-down.ts

-        logger.error(`Failed to terminate orphan runner '${runner.instanceId}'`, { error: e });
-      });
+      // do we have a valid runnerId? then we are in a Jit Runner scenario else, use old method
+      if (runner.runnerId) {


I think the code remains simpler if you update the code as follows.

Pseudocode:

if (runner.id) { checkRunnerStillOrphan() ? terminate() : untag() } else { terminate() }

Or something similar. This will keep the main control flow in one view and avoids termination being invoked in two different functions.
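The pseudocode above can be sketched as a single function with one termination call. Everything here is hypothetical: `handleOrphan`, `OrphanActions`, and the field names are illustrative, the GitHub lookup and EC2 operations are injected so the example is self-contained, and treating an online but idle runner as still orphaned is an assumption (the PR description only states that online and busy runners are untagged and offline runners are terminated).

```typescript
interface OrphanRunner {
  instanceId: string;
  runnerId?: string; // present only for JIT runners (assumption based on the PR description)
}

interface RunnerState {
  status: string; // 'online' or 'offline', as reported by GitHub
  busy: boolean;
}

// Injected operations; placeholders for the real GitHub and EC2 calls.
interface OrphanActions {
  getState(runnerId: number): Promise<RunnerState>;
  terminate(instanceId: string): Promise<void>;
  untagOrphan(instanceId: string): Promise<void>;
}

async function handleOrphan(runner: OrphanRunner, actions: OrphanActions): Promise<'terminated' | 'untagged'> {
  if (runner.runnerId) {
    // JIT runner: ask GitHub one last time before giving up on the instance.
    const state = await actions.getState(Number.parseInt(runner.runnerId, 10));
    const stillOrphan = !(state.status === 'online' && state.busy);
    if (!stillOrphan) {
      await actions.untagOrphan(runner.instanceId);
      return 'untagged';
    }
  }
  // Single termination point for both non-JIT runners and confirmed orphans.
  await actions.terminate(runner.instanceId);
  return 'terminated';
}
```

Error handling for a failed state lookup is deliberately left out; a real implementation would need to decide whether a lookup failure should block termination.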

npalm

@stuartp44 thx for the work. I did a partial review and made several comments. I think it would be better to keep the main control flow in a single function, especially the termination calls.

Besides this, the PR only solves the problem for JIT-enabled runners and does not target general runners. This also requires ensuring that the limitation gets documented.

I still need to check a deployment and walk through the test code. Do you know any way to reproduce the problem?

stuartp44 added 2 commits May 21, 2025 18:39

feat(runners): add runnerId to RunnerList and implement untag functionality for incorrectly tagged runners

3fa09d2

fix(tests): improve clarity of orphaned runner untagging test description

14df9fe

stuartp44 requested a review from a team as a code owner May 22, 2025 08:25

stuartp44 linked an issue May 22, 2025 that may be closed by this pull request

Pagination Data Slippage Issue causing EC2 instance to be scaled down #4584

Open

stuartp44 requested a review from Copilot May 22, 2025 08:29


fmt

9d7c89a

stuartp44 changed the title ~~Add last chance check before termination for JIT instances~~ feat: Add last chance check before termination for JIT instances May 22, 2025

stuartp44 changed the title ~~feat: Add last chance check before termination for JIT instances~~ feat: Add last chance check before orphan termination for JIT instances May 22, 2025

stuartp44 added 2 commits May 22, 2025 10:42

fix(scale-down): remove unnecessary logging of runner variable in terminateOrphan function

e09337f

fix(scale-up): remove unused import of run function

9064512

fix(scale-down): remove unused import of metricGitHubAppRateLimit

716e079

Merge branch 'main' into stu/add_tag_plus_check

e826fbe

npalm self-requested a review May 23, 2025 13:56

npalm reviewed May 24, 2025


stuartp44 and others added 9 commits May 26, 2025 11:34

Update lambdas/functions/control-plane/src/scale-runners/scale-down.ts

a5fcc88

Co-authored-by: Niek Palm <npalm@users.noreply.github.com>

Update lambdas/functions/control-plane/src/scale-runners/scale-down.ts

bb8ba2b

Co-authored-by: Niek Palm <npalm@users.noreply.github.com>

Update lambdas/functions/control-plane/src/scale-runners/scale-up.ts

97de234

Co-authored-by: Niek Palm <npalm@users.noreply.github.com>

Update lambdas/functions/control-plane/src/scale-runners/scale-up.ts

8b61d39

Co-authored-by: Niek Palm <npalm@users.noreply.github.com>

Remove warning log for orphan runners without runnerId in scale-down function

9883b0b

Remove logging of runner ID marking in addGhRunnerIdToEC2InstanceTag function

43468a7

readded metricGitHubAppRateLimit

bc995ef

Merge branch 'main' into stu/add_tag_plus_check

1182c8b

Refactor runner interfaces: remove RunnerState interface and update imports in scale-down.ts

6e1c72c

stuartp44 marked this pull request as draft June 3, 2025 15:49

stuartp44 added 3 commits June 4, 2025 10:13

Add headers to runner state return and update logging for busy state

097c14d

Remove redundant comment describing RunnerState type

971ec2d

Implement last chance check for orphan runners and refactor termination logic

5a275e5

stuartp44 added 3 commits June 4, 2025 10:32

Format return object in getGitHubSelfHostedRunnerState for improved readability

9f59abe

Refactor runner state types to use Endpoints for improved type safety and clarity

65c0b0e

Fix formatting of type definitions and adjust indentation in getGitHubSelfHostedRunnerState for consistency

8ead598

stuartp44 marked this pull request as ready for review June 4, 2025 10:46

stuartp44 requested a review from Copilot June 4, 2025 10:46

Copilot AI reviewed Jun 4, 2025


stuartp44 and others added 5 commits June 4, 2025 11:49

Update lambdas/functions/control-plane/src/aws/runners.ts

ab5b6b0

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts

0b462bb

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fix typo in key for GitHub runner ID in mock running instances

102edf0

Merge branch 'main' into stu/add_tag_plus_check

e6d2d88

Merge branch 'main' into stu/add_tag_plus_check

83610eb

npalm reviewed Jun 11, 2025
