Skip to content

Upgrade token verification to AWS SDK Go V2 #875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

gargipanatula
Copy link
Contributor

What this PR does / why we need it:
Updates the way that the server discovers valid partitions and regions during token verification. In essence, the host of a token is valid if it 1) comes from the same partition as the server and 2) contains a valid region.

Functionality changes

  • There is no functionality change - it maintains parity with V1. However, if a new partition is added, this code will need to be updated to verify tokens from that partition.

Implementation changes

  • Listing partitions through the github.com/aws/aws-sdk-go/aws/endpoints package → using a hardcoded list of partitions
  • Discovering regions through the github.com/aws/aws-sdk-go/aws/endpoints package → discovering regions through the ec2.DescribeRegions API.
    • The call yields all available regions to an account, regardless of opt-in status.

Testing

  • Tested that the verify command still a) correctly verifies a token from the same partition and b) correctly denies tokens from different partitions

Created a token using credentials from an account in the aws partition (US_TOKEN) and an account in the aws-cn partition (CHINA_TOKEN)

Ran verify on each token using accounts from within their partition, one with the same and one with a different region than the account that created the token. I was able to see that the token was correctly successfully verified when an account is in the same partition as the one that generated the token, regardless of region.

Output of this test for CHINA_TOKEN (similar for US_TOKEN):

&{ARN:arn:aws-cn:sts::<account id>:assumed-role/Admin/<user> CanonicalARN:arn:aws-cn:iam::<account id>:role/<role> AccountID:<account id> UserID:<user-id> SessionName:<user> AccessKeyID:<access-key-id> STSEndpoint:sts.cn-northwest-1.amazonaws.com.cn}

Then, I ran verify on each token with both the correct and incorrect partition (aws for CHINA_TOKEN and aws-cn for US_TOKEN) to replicate conditions where the client is in a different partition from the server. I saw that the command correctly rejected the tokens when coming from a different partition.

Output for US_TOKEN (similar for CHINA_TOKEN):

# same partition (aws)
(25-06-24 23:47:21) <0> [/aws-iam-authenticator]
dev % /aws-iam-authenticator/_output/bin/aws-iam-authenticator verify —cluster-id=<id> —token=$US_TOKEN —partition=aws
[Warn] Region not found in instance metadata, err: operation error ec2imds: GetRegion, access disabled to EC2 IMDS via client option, or "AWS_EC2_METADATA_DISABLED" environment variable
&{ARN:arn:aws:sts::<account-id>:assumed-role/admin/<role> CanonicalARN:arn:aws:iam::<account-id>:role/<role> AccountID:<account-id> UserID:<user-id> SessionName:<session-nmae> AccessKeyID:<key-id> STSEndpoint:sts.us-west-2.amazonaws.com}

# different partition (aws-cn)
(25-06-24 23:47:33) <0> [/aws-iam-authenticator]
dev % /aws-iam-authenticator/_output/bin/aws-iam-authenticator verify —cluster-id=<id> —token=$US_TOKEN —partition=aws-cn
[Warn] Region not found in instance metadata, err: operation error ec2imds: GetRegion, access disabled to EC2 IMDS via client option, or "AWS_EC2_METADATA_DISABLED" environment variable
could not verify token: input token was not properly formatted: unexpected hostname "sts.us-west-2.amazonaws.com" in pre-signed URL

Note that the warning about ec2imds comes from the fact that the AWS_EC2_METADATA_DISABLED env var is set to true, if set to false the output is the same but without the warning.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 25, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gargipanatula
Once this PR has been reviewed and has the lgtm label, please assign micahhausler for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 25, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @gargipanatula. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 25, 2025
@gargipanatula gargipanatula marked this pull request as ready for review June 25, 2025 00:34
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2025
@bryantbiggs
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 25, 2025
@gargipanatula
Copy link
Contributor Author

/retest

1 similar comment
@gargipanatula
Copy link
Contributor Author

/retest

@@ -521,73 +572,6 @@ func response(account, userID, arn string) getCallerIdentityWrapper {
return wrapper
}

func Test_getDefaultHostNameForRegion(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context: this function doesn't exist anymore, so no need to test

@@ -171,8 +194,32 @@ func TestSTSEndpoints(t *testing.T) {
{"aws-not-a-partition", "sts.amazonaws.com", false, ""},
}

regions := []ec2types.Region{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context: tests that simply verified valid sts endpoints, not necessarily region/partition matching, need a stub list of regions that have each region used in the test.

DescribeDelay = 100
)

func newMockedEC2ProviderImpl() *ec2ProviderImpl {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context: slightly changed to mocking to support extended mocking. essentially moved the functions around and directly implemented functions for MockEc2Client rather than relaying through the EC2API.

partitionKeys = append(partitionKeys, p.ID())
}
if _, ok := partitionMap[cfg.PartitionID]; !ok {
if slices.Contains(PartitionKeys, cfg.PartitionID) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't contain right?

@@ -215,9 +216,10 @@ func (c *Server) getHandler(ctx context.Context, backendMapper BackendMapper, ec
} else {
instanceRegion = instanceRegionOutput.Region
}
ec2Client := ec2.NewFromConfig(cfg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we creating new client instead of using ec2provider because we need to make call against the account where the authenticator is running instead of ServerEC2DescribeInstancesRoleARN ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I figured configuring a client directly would be more straightforward than creating ec2provider with the host account's ARN.

func getInstanceRegion(ctx context.Context) string {
cfg, err := config.LoadDefaultConfig(ctx)
if err != nil {
fmt.Printf("[Warn] Unable to create metadata client, err: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the error message is not correct right as we just generating config here and not creating client?

@@ -54,14 +56,17 @@ var verifyCmd = &cobra.Command{
os.Exit(1)
}

sess := session.Must(session.NewSession())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are expecting to have the session object right? But in getInstanceRegion function its best effort , any specific reason?

Copy link
Contributor Author

@gargipanatula gargipanatula Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake, this requirement shouldn't have been relaxed. I've added a panic to getInstanceRegion that occurs when config.LoadDefaultConfig returns an error, to replicate the panic that session.Must() would've sent if session.New() had an error. (session code).

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 26, 2025
@gargipanatula gargipanatula force-pushed the sdk-upgrade branch 3 times, most recently from bd9b879 to e802f44 Compare June 26, 2025 17:19
@gargipanatula
Copy link
Contributor Author

/retest

@k8s-ci-robot
Copy link
Contributor

@gargipanatula: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-aws-iam-authenticator-unit f80e105 link true /test pull-aws-iam-authenticator-unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

switch partitionID {
case "aws":
tlds = []string{"amazonaws.com", "api.aws"}
validSTShostnames["sts.amazonaws.com"] = true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: support global for all partitions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: it's only supported for the commercial partition, so this is actually not necessary

@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants