Skip to content

feat: Add last chance check before orphan termination for JIT instances #4595

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

stuartp44
Copy link
Contributor

@stuartp44 stuartp44 commented May 22, 2025

This pull request introduces enhancements to the AWS Lambda functions responsible for scaling GitHub Actions runners. Key changes include the addition of functionality to untag runners, support for Just-In-Time (JIT) runner configurations, and improvements to orphan runner handling. The updates also include modifications to tests to validate these new features.

Enhancements to Runner Management:

  • Added a new untag function in lambdas/functions/control-plane/src/aws/runners.ts to remove tags from EC2 instances. This is used to handle orphan runners that are online and busy.
  • Updated terminateOrphan logic in lambdas/functions/control-plane/src/scale-runners/scale-down.ts to differentiate between JIT runners and regular runners. Orphan runners with a valid runnerId are now checked for their state before untagging or terminating.

Support for JIT Runner Configurations:

  • Added runnerId to the RunnerInfo structure to support JIT runner identification. This enables tracking and handling of ephemeral runners created via JIT configurations. [1] [2]
  • Updated test mocks and assertions to include JIT runner scenarios, ensuring proper handling of JIT-configured runners in scaling operations. [1] [2]

Test Suite Updates:

  • Added new test cases in scale-down.test.ts to verify the behavior of orphan runners under JIT and non-JIT configurations. These tests ensure that online and busy runners are untagged, while offline runners are terminated.
  • Introduced mock data and functions in scale-down.test.ts and scale-up.test.ts to simulate JIT runner IDs and validate their integration with scaling logic. [1] [2]

Code Refactoring and Imports:

  • Updated imports across multiple files to include DeleteTagsCommand and untag functionality. This ensures consistent usage of the new untagging feature. [1] [2] [3]
  • Refactored getGitHubRunnerBusyState in scale-down.ts to reuse the new getGitHubSelfHostedRunnerState function, improving code clarity and reducing redundancy.

These changes collectively improve the scalability, reliability, and maintainability of the control plane for GitHub Actions runners.

@stuartp44 stuartp44 requested a review from a team as a code owner May 22, 2025 08:25
@stuartp44 stuartp44 requested a review from Copilot May 22, 2025 08:29
Copilot

This comment was marked as outdated.

@stuartp44 stuartp44 changed the title Add last chance check before termination for JIT instances feat: Add last chance check before termination for JIT instances May 22, 2025
@stuartp44 stuartp44 changed the title feat: Add last chance check before termination for JIT instances feat: Add last chance check before orphan termination for JIT instances May 22, 2025
@npalm
Copy link
Member

npalm commented May 22, 2025

@stuartp44 in case the PR is in WIP, can you mark the PR as draft?

@stuartp44
Copy link
Contributor Author

@stuartp44 in case the PR is in WIP, can you mark the PR as draft?

The PR is actually ready to be review, the build failures where unrelated to my changes but I have fixed them nevertheless. @npalm

@npalm npalm self-requested a review May 23, 2025 13:56
Copy link
Member

@npalm npalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the PR. I only was able to a partial review so far. I have checked the Lambda code (excluding th tests). Also not tested a deployment.

This solution is solving the problem only for JIT enabled runners, but for non JIT the problem remains. Assuming the chances is less since typically less runners will be created. Do you have thoughts about alternatives? I think we should find a way / place to document this limitation.

I have made some comments, but need more time to go over the PR. Will ping you once ready, but laready sharing the feedback so far.

@stuartp44 stuartp44 marked this pull request as draft June 3, 2025 15:49
@stuartp44 stuartp44 marked this pull request as ready for review June 4, 2025 10:46
@stuartp44 stuartp44 requested a review from Copilot June 4, 2025 10:46
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds JIT runner support by tagging runner IDs on EC2 instances, then performing a final orphan check to untag busy runners or terminate offline ones. It also refactors runner state retrieval and updates tests to cover the new behavior.

  • Introduce tag/untag for JIT runner lifecycle management
  • Add addGhRunnerIdToEC2InstanceTag and lastChanceCheckOrphanRunner in scale-up/scale-down
  • Update and extend tests for JIT and non-JIT orphan scenarios

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
scale-runners/scale-up.ts Imported tag, added addGhRunnerIdToEC2InstanceTag call
scale-runners/scale-up.test.ts Mocked tag, fixed typo, extended JIT mocks
scale-runners/scale-down.ts Imported untag, added lastChanceCheckOrphanRunner, refactored runner state calls
scale-runners/scale-down.test.ts Mocked untag, added JIT orphan handling tests
aws/runners.ts Added DeleteTagsCommand and untag, updated runnerId extraction
aws/runners.test.ts Added JIT listing test, fixed tag/untag runner tests
Comments suppressed due to low confidence (1)

lambdas/functions/control-plane/src/scale-runners/scale-up.ts:419

  • [nitpick] The function name 'addGhRunnerIdToEC2InstanceTag' is verbose and mixes naming styles. Consider renaming it to something like 'tagRunnerId' for brevity and consistency.
async function addGhRunnerIdToEC2InstanceTag(instanceId: string, runnerId: string): Promise<void> {

export async function untag(instanceId: string, tags: Tag[]): Promise<void> {
logger.debug(`Untagging '${instanceId}'`, { tags });
const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
await ec2.send(new DeleteTagsCommand({ Resources: [instanceId], Tags: tags }));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test untagging via an actual deployment. I do not see any update to the permisison of the lmabda role. So assuming the call wil fail.

// do we have a valid runnerId? then we are in a Jit Runner scenario else, use old method
if (runner.runnerId) {
logger.debug(`Runner '${runner.instanceId}' is orphan, but has a runnerId.`);
await lastChanceCheckOrphanRunner(runner);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

untagging can lead to an exception, exception looks like not handled at all. Would suggest to handle a potential exception inside the untag function and return a boolean.

metricGitHubAppRateLimit(state.headers);

return state.data.busy;
return {
id: state.data.id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it not simpler to re-use either the github state object or simpley explode return the data object instead of this 1-1 mapping.

async function lastChanceCheckOrphanRunner(runner: RunnerList): Promise<void> {
const client = await getOrCreateOctokit(runner as RunnerInfo);
const runnerId = parseInt(runner.runnerId || '0');
const ec2Instance = runner as RunnerInfo;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why casting here? Does the input object not contain the right information?

@@ -200,18 +221,42 @@ async function markOrphan(instanceId: string): Promise<void> {
}
}

async function lastChanceCheckOrphanRunner(runner: RunnerList): Promise<void> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me the method is confusing, it looks terminationg is now down on two palces. Makes the code flow hard to follow. Also lastChanceCheckOrphanRunner sounds like a last check, not terinating.

logger.error(`Failed to terminate orphan runner '${runner.instanceId}'`, { error: e });
});
// do we have a valid runnerId? then we are in a Jit Runner scenario else, use old method
if (runner.runnerId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code remains simpler if you update fhe code as follow.

pseudeo code

if (runner.id) {
   checkRunnerStillOrphan ? : terminate() : untag()
else 
   terminate
}

or similar this wil keep the main control flow in one view. And avoiding termination is invoked in two differnet functions.

Copy link
Member

@npalm npalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stuartp44 thx for the work. did a partial review. Made several comments. I think it would be better to keep the main control flow in a single fucntion, especiaaly the termination calls.

Besides this the PR only solve the problem for JIT enabled runners, not targetting general runners. This also required to ensure the limitation get document.

I still need to check a deployment and walk through the test code. do you know any way to reproduce the problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pagination Data Slippage Issue causing EC2 instance to be scaled down
2 participants