
Proposal to enhance FederatedResourceQuota to enforce resource limits directly on Karmada level #6181


Merged · 1 commit merged into karmada-io:master on Apr 29, 2025

Conversation

@mszacillo (Contributor) commented Mar 3, 2025:

What type of PR is this?
/kind design

What this PR does / why we need it:

This proposal enhances the FederatedResourceQuota so that it can impose namespaced resource limits directly on the Karmada control-plane level.

Which issue(s) this PR fixes:
Fixes #5179

Special notes for your reviewer:
Following our discussion during the community meeting, there were some questions regarding API server request latency as a result of adding an admission webhook to enforce resource limits. I ran some tests by measuring the time taken to apply some FlinkDeployments to the Karmada control-plane:

Without webhook: Average e2e request latency over 100 retries: 370 ms
With webhook: Average e2e request latency over 100 retries: 390 ms

The request latency increases slightly, as expected, but I can run some more comprehensive performance tests for this feature if needed.
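
For reference, a minimal sketch of how such a latency measurement could be scripted; the kubeconfig path, the FlinkDeployment GVR, and the trimmed spec are assumptions for illustration, not the actual test setup:

// Hypothetical benchmark sketch: measure end-to-end apply latency against the
// Karmada apiserver using the dynamic client.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/karmada-apiserver.config")
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{Group: "flink.apache.org", Version: "v1beta1", Resource: "flinkdeployments"}

	var total time.Duration
	const retries = 100
	for i := 0; i < retries; i++ {
		obj := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "flink.apache.org/v1beta1",
			"kind":       "FlinkDeployment",
			"metadata":   map[string]interface{}{"name": fmt.Sprintf("bench-%d", i), "namespace": "default"},
			"spec":       map[string]interface{}{}, // trimmed for brevity; a real spec is needed
		}}
		start := time.Now()
		_, err := client.Resource(gvr).Namespace("default").Create(context.TODO(), obj, metav1.CreateOptions{})
		total += time.Since(start)
		if err != nil {
			fmt.Println("create failed:", err)
		}
		_ = client.Resource(gvr).Namespace("default").Delete(context.TODO(), obj.GetName(), metav1.DeleteOptions{})
	}
	fmt.Printf("average e2e request latency over %d retries: %v\n", retries, total/retries)
}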

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/design Categorizes issue or PR as related to design. label Mar 3, 2025
@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 3, 2025
@codecov-commenter commented Mar 3, 2025:


Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 48.27%. Comparing base (3b6c0e0) to head (4cbca9c).
Report is 147 commits behind head on master.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6181      +/-   ##
==========================================
+ Coverage   47.95%   48.27%   +0.31%     
==========================================
  Files         674      677       +3     
  Lines       55841    56067     +226     
==========================================
+ Hits        26781    27068     +287     
+ Misses      27311    27229      -82     
- Partials     1749     1770      +21     
Flag: unittests | Coverage: 48.27% <ø> (+0.31%) ⬆️


@RainbowMango (Member) left a comment:

/assign
Start working on this.

@RainbowMango (Member) left a comment:

Thank you @mszacillo so much for bringing this up. I like this feature~

It would be great to add a user story section to this proposal to elaborate on the use case, because it's hard to connect the failover feature with quota management.

@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 17, 2025
@mszacillo mszacillo changed the title Proposal for federated resource quota enhancement Proposal to support ResourceQuotaEnforcer API Mar 17, 2025
@mszacillo mszacillo force-pushed the frq-proposal branch 2 times, most recently from 0a81ad7 to 1a2f8e3 on March 28, 2025 03:37
@mszacillo mszacillo changed the title Proposal to support ResourceQuotaEnforcer API Proposal to enhance FederatedResourceQuota to enforce resource limits directly on Karmada level Mar 28, 2025
@mszacillo (Contributor, Author) commented:

Hi @RainbowMango,

I have gone ahead and made the diagrams clearer about the motivation. They now show FlinkDeployments with their resource usage, and how failover does not work with the existing statically assigned ResourceQuotas.

Please let me know if there are any other concerns with regards to this proposal. Thanks!

@RainbowMango (Member) commented:

@whitewindmills Would you like to take a look at this proposal as well? It is highly likely that the scheduler should be involved to enforce the FederatedResourceQuota.

@whitewindmills (Member) commented:

Got it, I'll take a look ASAP.
/assign

@mszacillo (Contributor, Author) commented:

Hi @RainbowMango, I've updated the proposal following the discussions we had at KubeCon last week!

I've also included another user story for the Apache Spark platform use case, which was written by @CaesarTY. Hopefully we can begin working on this feature soon, before the next Karmada release.

@zhzhuang-zju (Contributor) commented:

/cc

@zhzhuang-zju (Contributor) left a comment:

Hey @mszacillo, thank you for your proposal. It is very interesting and meaningful. I will keep an eye on this proposal and hope that I can be of some help with it.


Certain types of applications, especially big data and analytics jobs (e.g., Apache Spark on Kubernetes), do not always request all required resources up front. Instead, they may dynamically request additional resources (executors) at runtime, depending on current workload size, task parallelism, and job progress.

In a multi-cluster environment managed by Karmada, we currently lack a way to enforce resource usage limits across clusters for such elastic workloads. Traditional Kubernetes ResourceQuota objects operate at the namespace level within a single cluster and are not designed to account for cross-cluster resource consumption or dynamic scaling behavior.
Contributor:

In a multi-cluster environment managed by Karmada, we currently lack a way to enforce resource usage limits across clusters for such elastic workloads.

Could this point be expanded in more detail? For example, could you give a specific example using a Spark application?

Sure. A typical example is a Spark application running in dynamic allocation mode, which adjusts the number of executor pods at runtime based on the job's needs. Here is a simple spark-submit command:

spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master k8s://https://<api-server> \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.initialExecutors=5 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.kubernetes.namespace=analytics \
  --conf spark.kubernetes.container.image=myrepo/spark:latest \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar \
  s3a://my-bucket/text-data/

In this setup:

  1. The job starts with 5 executors.
  2. Executors are dynamically scaled up to 50 as needed.
  3. Each executor is a pod that may land on any eligible node.

It would be great if we could have a way to enforce resource usage limits across clusters for such elastic workloads.

Contributor (PR author):

As long as the maximum number of executors is something that can be found in the CRD spec, you can probably configure your custom interpreter to account for it in your replica requirements, which in turn would be reflected in the FederatedResourceQuota.
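
As a rough illustration of that idea, here is a hedged sketch of a GetReplicas-style interpreter hook; the field paths (spec.dynamicAllocation.maxExecutors, spec.executor.cores/memory), the default values, and the ReplicaRequirements shape are assumptions, not the actual Spark operator schema or Karmada interface:

// Hypothetical sketch of a customized resource-interpreter GetReplicas for a
// SparkApplication-style CRD: surface maxExecutors (plus one driver) as the
// replica count so quota accounting sees the worst case.
package interpreter

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// ReplicaRequirements mirrors the shape an interpreter is expected to return:
// a replica count plus the per-replica resource request (assumed shape).
type ReplicaRequirements struct {
	ResourceRequest corev1.ResourceList
}

func GetReplicas(obj *unstructured.Unstructured) (int32, *ReplicaRequirements, error) {
	// Use maxExecutors so the quota reflects the worst-case scale-up.
	maxExecutors, found, err := unstructured.NestedInt64(obj.Object, "spec", "dynamicAllocation", "maxExecutors")
	if err != nil || !found {
		maxExecutors = 1
	}
	cores, _, _ := unstructured.NestedInt64(obj.Object, "spec", "executor", "cores")
	memory, _, _ := unstructured.NestedString(obj.Object, "spec", "executor", "memory")
	if cores == 0 {
		cores = 1
	}
	if memory == "" {
		memory = "1Gi"
	}
	req := &ReplicaRequirements{
		ResourceRequest: corev1.ResourceList{
			corev1.ResourceCPU:    *resource.NewQuantity(cores, resource.DecimalSI),
			corev1.ResourceMemory: resource.MustParse(memory),
		},
	}
	// +1 accounts for the driver pod.
	return int32(maxExecutors) + 1, req, nil
}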

Contributor:

The job starts with 5 executors.
Executors are dynamically scaled up to 50 as needed.

@CaesarTY Thanks for your detailed description. For this elastic scenario, do you want to calculate the usage based on 5, on 50, or on the number of replicas in the member cluster?

Ideally, the usage should be calculated from the real-time number of replicas (driver + executor pod resources) and enforced by the FederatedResourceQuota across the tier.

Member:

Note that once the Spark application is deployed to a member cluster, there is no way for Karmada to suspend the executor scale-up. (Sure, Karmada can calculate the real-time resource consumption, but it can only be used to check subsequent Spark applications.)

I think @zhzhuang-zju's question is whether we should enforce quota interception based on the initial or the maximum usage.

In my opinion, we should always use the maximum usage, which ensures that the total resource usage does not exceed the quota limits.

Member:

If the resource name is limits.cpu, then we should use the maximum usage, taking maxExecutors into account.

apiVersion: policy.karmada.io/v1alpha1
kind: FederatedResourceQuota
metadata:
  name: foo
spec:
  overall:
    limits.cpu: 100

If the resource name is requests.cpu, then we should use the requested usage, taking initialExecutors into account.

apiVersion: policy.karmada.io/v1alpha1
kind: FederatedResourceQuota
metadata:
  name: foo
spec:
  overall:
    requests.cpu: 100

@CaesarTY I don't know if that meets your case. It would be great if you could share a sample of your ResourceQuota configuration.
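
To make the rule above concrete, here is a tiny hypothetical helper (not part of any existing API) that picks which executor count to charge for a given quota resource name:

// Hypothetical helper expressing the rule above: resources tracked as
// "limits.*" are charged at maxExecutors, while "requests.*" are charged at
// initialExecutors. Everything here is illustrative.
package quota

import "strings"

// ExecutorCountFor returns the executor count that should be charged against
// the quota for a given quota resource name.
func ExecutorCountFor(resourceName string, initialExecutors, maxExecutors int32) int32 {
	if strings.HasPrefix(resourceName, "limits.") {
		return maxExecutors // worst case: the job may scale up to maxExecutors
	}
	if strings.HasPrefix(resourceName, "requests.") {
		return initialExecutors // what the job asks for up front
	}
	// Fall back to the conservative choice for plain names like "cpu"/"memory".
	return maxExecutors
}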


1. FederatedResourceQuota API should enforce namespaced overall resource limits.
- FederatedResourceQuota Status will be updated whenever resources are applied against the relevant namespace
- We consider including a resource selector scope for the quota
Contributor:

What are the approximate functions and APIs of the resource selector scope?

Contributor (PR author):

I can flesh this out and update the proposal. But the general idea was to support a PriorityClass scope in the same way the Kubernetes ResourceQuota does: https://kubernetes.io/docs/concepts/policy/resource-quotas/#quota-scopes

However, for a first pass of this feature this may not be necessary. If a FederatedResourceQuota is defined in the namespace, then it can monitor all resources in that namespace rather than filtering by scope. Let me know what you think.
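
For illustration only, one possible shape for such a scope field, loosely modeled on the native ResourceQuota's spec.scopeSelector; this is an abridged sketch, not the actual Karmada API:

// Hypothetical API sketch: a scope selector on the FederatedResourceQuota
// spec, reusing the core/v1 ScopeSelector type so PriorityClass-based scoping
// mirrors Kubernetes ResourceQuota semantics. Field set is abridged.
package v1alpha1

import corev1 "k8s.io/api/core/v1"

type FederatedResourceQuotaSpec struct {
	// Overall is the set of desired hard limits for each named resource.
	Overall corev1.ResourceList `json:"overall"`

	// ScopeSelector restricts the quota to resources that match the given
	// scopes (for example a PriorityClass), mirroring the semantics of
	// Kubernetes ResourceQuota scope selectors.
	// +optional
	ScopeSelector *corev1.ScopeSelector `json:"scopeSelector,omitempty"`
}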

Contributor:

support a PriorityClass scope in the same way the Kubernetes ResourceQuota does

Got it. This is also one of the features I expect.

However, for a first pass of this feature this may not be necessary. If a FederatedResourceQuota is defined in the namespace, then it can monitor all resources in that namespace rather than filtering by scope. Let me know what you think.

In terms of API design, we can try to make it as complete as possible, or at least ensure sufficient scalability. In terms of implementation, we can proceed step by step at a steady pace.


// FederatedResourceQuota represents the name of the quota that will be used for this resource
// +optional
FederatedResourceQuota string `json:"federatedResourceQuota,omitempty"`
Contributor:

Does this mean that only one FederatedResourceQuota can be applied within the same namespace?

Contributor (PR author):

This just means that the ResourceBinding will be tied to at most one FederatedResourceQuota. If we support scopes, then we can include multiple quotas in the same namespace, and the ResourceBinding would pick up the quota that matches the scope.

Contributor:

This just means that the ResourceBinding will be tied to at most one FederatedResourceQuota.

Could you explain the reason for this constraint? It seems that the native ResourceQuota doesn't limit a pod to using only one ResourceQuota.

Member:

That's a good question.

Currently, this proposal explains how and when the .spec.FederatedResourceQuota in ResourceBinding should be set. The question is why the ResourceBinding needs to keep a reference to the FederatedResourceQuota at all.

Member:

Echoing from the proposal:

Reconciliation: Controller will reconcile whenever a resource binding is created, updated, or deleted. Controller will only reconcile if the resource in question has a pointer to the FederatedResourceQuota.

Is this the reason for introducing the reference? For the controller to figure out which ResourceBindings it should care about.

Contributor (PR author):

Could you explain the reason for this constraint? It seems that the native ResourceQuota doesn't limit a pod to using only one ResourceQuota.

This isn't a hard constraint, but the decision was meant to reduce the scope of the proposal. I would also want to know more use cases for this, since both of the user stories in the proposal only require one FederatedResourceQuota per ResourceBinding.

Contributor (PR author):

The question is why ResourceBinding needs to keep the reference to the FederatedResourceQuota?

Your reasoning below was correct; this was to make the controller logic simpler. Since the controller is updating the FederatedResourceQuota status, it needs to know which ResourceBindings are tied to the specific FederatedResourceQuota, hence the reference.
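
For what it's worth, a minimal controller-runtime sketch of that lookup; it assumes the proposed spec.federatedResourceQuota field exists on ResourceBinding (it does not today) and uses a field index so the controller can list only the relevant bindings:

// Hypothetical sketch: index ResourceBindings by the proposed
// spec.federatedResourceQuota reference so the quota status controller can
// list only the bindings tied to a given FederatedResourceQuota.
package controller

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"

	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

const rbQuotaIndex = "spec.federatedResourceQuota"

func SetupIndex(mgr manager.Manager) error {
	return mgr.GetFieldIndexer().IndexField(context.TODO(), &workv1alpha2.ResourceBinding{}, rbQuotaIndex,
		func(obj client.Object) []string {
			rb := obj.(*workv1alpha2.ResourceBinding)
			// FederatedResourceQuota is the field proposed in this PR; treat it as hypothetical here.
			if rb.Spec.FederatedResourceQuota == "" {
				return nil
			}
			return []string{rb.Spec.FederatedResourceQuota}
		})
}

// bindingsForQuota lists the ResourceBindings that reference the given quota.
func bindingsForQuota(ctx context.Context, c client.Client, namespace, quotaName string) (*workv1alpha2.ResourceBindingList, error) {
	list := &workv1alpha2.ResourceBindingList{}
	err := c.List(ctx, list,
		client.InNamespace(namespace),
		client.MatchingFields{rbQuotaIndex: quotaName})
	return list, err
}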

Contributor:

https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/staging/src/k8s.io/apiserver/pkg/quota/v1/interfaces.go#L45-L65

Hi @mszacillo, I guess the role of the FederatedResourceQuota reference is similar to the Evaluator.Matches method of the native ResourceQuota. Both are used to determine whether the resource can be applied against this quota. It's just that one maintains a method, while the other maintains a field in the API.

Contributor:

In terms of extensibility, I prefer the latter. Moreover, maintaining an API field can also be quite burdensome. Of course, if I'm wrong in my understanding, please correct me.

Contributor (PR author):

@zhzhuang-zju I'll take a look at the Evaluator.Matches implementation a bit more, as I'm not familiar with the method. But I agree that maintaining API fields can be burdensome. Let me catch up on how Evaluator.Matches works and I can update the proposal with this information.
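
For readers following along, a loose Karmada-side paraphrase of the Evaluator.Matches idea; the types and method set below are illustrative, not the upstream interface or any existing Karmada code:

// Illustrative sketch: instead of storing a quota reference on the
// ResourceBinding, an evaluator decides at admission/reconcile time whether a
// binding is covered by a given quota.
package quota

import (
	corev1 "k8s.io/api/core/v1"

	policyv1alpha1 "github.com/karmada-io/karmada/pkg/apis/policy/v1alpha1"
	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// Evaluator is a hypothetical analogue of the native quota Evaluator.
type Evaluator interface {
	// Matches reports whether the binding is covered by the quota,
	// e.g. same namespace and (later) a matching scope selector.
	Matches(quota *policyv1alpha1.FederatedResourceQuota, binding *workv1alpha2.ResourceBinding) (bool, error)
	// Usage returns the resource usage the binding contributes to the quota.
	Usage(binding *workv1alpha2.ResourceBinding) (corev1.ResourceList, error)
}

// namespaceEvaluator is a minimal first-pass rule: a quota covers every
// binding in its own namespace.
type namespaceEvaluator struct{}

func (namespaceEvaluator) Matches(q *policyv1alpha1.FederatedResourceQuota, rb *workv1alpha2.ResourceBinding) (bool, error) {
	return q.Namespace == rb.Namespace, nil
}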


Note: Since the controller is listening to RBs, the FederatedResourceQuota will be calculated after the resource binding has been created or updated.

If a user bulk-applies a bunch of resources at once, it could be possible for the user to go above the quota’s limits. In this case, we should also check that the quota is honored before deciding to schedule the resource to a member cluster.
Contributor:

We can draw on the implementation of the native ResourceQuota to keep the quota restrictions from becoming ineffective when users bulk-apply resources. I have an idea here that we can discuss. Maybe during the next meeting?

Contributor (PR author):

Sounds great! Yes, this is definitely a concern, which is why we were discussing that a change to the scheduler may be necessary to additionally enforce quota limits there.

Contributor (PR author):

Huge apologies for not being able to make the meeting this week. Perhaps we can discuss your thoughts here? @zhzhuang-zju

Contributor:

Of course. When dealing with bulk applies, the admission control of the native ResourceQuota aggregates the validation requests by namespace and processes all the requests under one namespace serially, in a single pass.

[diagram: namespace queue and per-namespace request map in the native ResourceQuota admission control]

To make this easier to follow, I will use this diagram to explain. A validation request first enters the namespace queue. At the same time, a corresponding map is maintained, where the key is the namespace and the values are the validation requests that have already entered the queue. A worker routine then retrieves a namespace from the queue and obtains all the validation requests under that namespace from the map. In this way, the validation requests under one namespace are always processed within one worker routine, and they are processed serially, which ensures that bulk applies cannot exceed the quota limit.
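
A condensed sketch of that aggregation pattern, heavily simplified from the upstream quota admission controller (the types here are illustrative):

// Simplified illustration of the per-namespace aggregation described above:
// requests are grouped by namespace in a map, a queue holds the namespaces,
// and a worker drains each namespace's requests serially so concurrent applies
// in the same namespace cannot race past the quota.
package admission

import "sync"

type request struct {
	namespace string
	result    chan error // completed by the worker
}

type quotaAdmitter struct {
	mu      sync.Mutex
	pending map[string][]*request // namespace -> queued requests
	queue   chan string           // namespaces with pending work
}

func newQuotaAdmitter() *quotaAdmitter {
	a := &quotaAdmitter{pending: map[string][]*request{}, queue: make(chan string, 1024)}
	go a.worker()
	return a
}

// Admit enqueues the request and blocks until the worker has evaluated it.
func (a *quotaAdmitter) Admit(ns string) error {
	r := &request{namespace: ns, result: make(chan error, 1)}
	a.mu.Lock()
	a.pending[ns] = append(a.pending[ns], r)
	a.mu.Unlock()
	a.queue <- ns
	return <-r.result
}

func (a *quotaAdmitter) worker() {
	for ns := range a.queue {
		a.mu.Lock()
		batch := a.pending[ns]
		delete(a.pending, ns)
		a.mu.Unlock()
		// All requests for this namespace are checked serially against the
		// quota, so a bulk apply cannot overshoot the running total.
		for _, r := range batch {
			r.result <- a.checkQuota(ns)
		}
	}
}

func (a *quotaAdmitter) checkQuota(namespace string) error {
	// Placeholder: compute the delta usage and compare it against the quota's limits.
	return nil
}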

Contributor:

However, karmada-webhook supports multiple instances, which may make it difficult to draw on this approach directly.

Contributor (PR author):

Got it! This makes a lot of sense, thank you for the thorough explanation. Thinking about this more, I think it's clear that the webhook approach will not cover our edge cases (such as bulk applies) - I'll update the proposal with the admission controller approach instead, borrowing from the ResourceQuota implementation in Kubernetes.


![failover-example-2](resources/failover-example-2.png)

Could we support dynamic assignment of FederatedResourceQuota? Potentially yes, but there are some drawbacks with that approach:
Contributor:

@mszacillo For the failover scenario, the work in the failed cluster may not be cleared immediately. Do you want the resource usage of the failed cluster to be included in the total resource usage, or do you prefer to ignore that part?

Contributor (PR author):

I think this can be ignored, because Karmada should be expected to clean up the resource. From the user's perspective, they have applied their resource once to the Karmada control plane, so it should be reflected in the resource quota accordingly.


As part of this change, we will introduce a new validating webhook:

1. The new validating webhook will watch all types of resources, at least all native workloads (Deployments) and supported CRDs (FlinkDeployments). The webhook will use Karmada's `ResourceInterpreter#GetReplicas` to calculate the predicted delta resource usage for the quota. If the applied resource goes above the limit, then the webhook will deny the request.
Contributor:

If we implement admission control on the control plane, is it still necessary to propagate the resource quota to the member clusters?

Contributor (PR author):

No, we won't need to propagate resource quotas.
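
A rough sketch of the control-plane admission check described in the excerpt above; the interpreter call is abstracted behind a function type, the mapping between quota names like requests.cpu and plain cpu keys is omitted, and the usage tracking is simplified, so this is an assumption-laden sketch rather than the final design:

// Rough sketch: use the resource interpreter to estimate the incoming object's
// resource delta and deny the request when it would push the namespace quota
// over its overall limits.
package webhook

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

	policyv1alpha1 "github.com/karmada-io/karmada/pkg/apis/policy/v1alpha1"
)

// replicaEstimator abstracts the ResourceInterpreter GetReplicas call for this sketch.
type replicaEstimator func(obj *unstructured.Unstructured) (replicas int32, perReplica corev1.ResourceList, err error)

// validate denies the incoming object if its estimated usage would push the
// quota's overall usage above any of its limits. "used" is the quota's current
// aggregated usage (how it is tracked is out of scope here).
func validate(obj *unstructured.Unstructured, quota *policyv1alpha1.FederatedResourceQuota, used corev1.ResourceList, estimate replicaEstimator) error {
	replicas, perReplica, err := estimate(obj)
	if err != nil {
		return err
	}
	for name, limit := range quota.Spec.Overall {
		per, ok := perReplica[name]
		if !ok {
			continue
		}
		// delta = per-replica request * replica count (in milli-units).
		delta := resource.NewMilliQuantity(per.MilliValue()*int64(replicas), resource.DecimalSI)
		newUsed := used[name].DeepCopy()
		newUsed.Add(*delta)
		if newUsed.Cmp(limit) > 0 {
			return fmt.Errorf("denied: %s would exceed FederatedResourceQuota %s/%s (%s > %s)",
				name, quota.Namespace, quota.Name, newUsed.String(), limit.String())
		}
	}
	return nil
}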

@zhzhuang-zju (Contributor) commented:

Hi @mszacillo, can you add me to the reviewers of your proposal? Thanks~

@mszacillo (Contributor, Author) commented:

Hi @mszacillo, can you add me to the reviewers of your proposal? Thanks~

Yes, absolutely :)

Signed-off-by: mszacillo <mszacillo@bloomberg.net>
@mszacillo (Contributor, Author) commented:

Hello - apologies for the delay, I had the Easter holidays last weekend and was quite sick. I've gone ahead and updated the proposal to address the latest feedback. @zhzhuang-zju I've also included your diagram in the proposal's admission controller section.

@zhzhuang-zju (Contributor) commented:

Hello - apologies for the delay, I had the Easter holidays last weekend and was quite sick.

It’s all good! Hope you’re fully recovered now.

I've also included your diagram in the proposal's admission controller section.

Thanks

@RainbowMango (Member) commented:

Hello - apologies for the delay, I had the Easter holidays last weekend and was quite sick. I've gone ahead and updated the proposal to address the latest feedback. @zhzhuang-zju I've also included your diagram in the proposal's admission controller section.

Sorry to hear that, take care. Hope you're doing fine now. No worries!
@zhzhuang-zju has talked a lot with me about this proposal; he has worked out several potential ways to implement this feature and hopes to talk with you at the next meeting. Hopefully, we can reach an agreement there.

@mszacillo (Contributor, Author) commented:

@RainbowMango @zhzhuang-zju Sounds great! See you both at the community meeting.

@RainbowMango (Member) commented:

@mszacillo For this proposal, I think we already have a clear use case and definition of the feature. Could we merge this version first and then fill in the implementation details in follow-up PRs? This would allow us to focus on the updated parts incrementally, which might help move this feature along faster.

@RainbowMango (Member) commented:

/lgtm
/approve
/hold
Just adding all the necessary tags for merging this PR.

If you agree to merge this version first, please unhold it by leaving a /hold cancel comment on a separate line. After that, the PR will be merged, and then we can start a new PR for further updates.

@karmada-bot karmada-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Apr 29, 2025
@karmada-bot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2025
@mszacillo (Contributor, Author) commented:

@RainbowMango Sure, that sounds good to me! Happy to add updates to this proposal as we implement.

@mszacillo (Contributor, Author) commented:

/hold cancel

@karmada-bot karmada-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2025
@karmada-bot karmada-bot merged commit 8cbefd8 into karmada-io:master Apr 29, 2025
24 checks passed
@mszacillo (Contributor, Author) commented May 2, 2025:

Hi @RainbowMango, @zhzhuang-zju,

I created a tentative list of iteration tasks for the proposal. Feel free to change these around if necessary:

Proposal:

Iteration Tasks -- Part-1: Enhance FederatedResourceQuota status controller

  • Update federated-resource-quota-status-controller to skip reconcile if staticAssignment is not set
  • FederatedResourceQuota validating webhook should block updates to enable staticAssignment on Update requests
  • Create new overall federated resource quota status controller (see the reconcile sketch after this list):
    • Watch create/update events for FRQ, and update federatedresourcequota.status.overall
    • Watch deletion of RB -> find related FRQ -> recalculate quota usage
    • Periodically recalculate FRQ status

---- Part 1 Improvements beyond MVP ----

  • [API Change] Add Scope to the FederatedResourceQuotaSpec
  • Consider improvements to reconcile strategy for overall FRQ status controller (use cache instead of listing all RB each reconcile)

Iteration Tasks -- Part-2: Validation of Quota before Scheduling

  • Add ResourceBinding validation webhook to monitor updates to resourcebinding.spec and resourcebinding.spec.replicaRequirements after scheduling result
    • Check delta in quota usage and deny request if above limits
    • Otherwise update federatedresourcequota.status.overallUsed and then approve update request
  • Determine if changes necessary to scheduler when reacting to denied RB updates
  • FederatedResourceQuota controller should requeue unschedulable RBs to be scheduled in the case of a change to overall quota usage
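
As referenced in Part-1 above, here is a minimal reconcile sketch for the overall status controller; the status field names and the simple list-and-sum strategy follow the proposal and are treated as assumptions here:

// Minimal sketch of the overall FederatedResourceQuota status controller:
// on each reconcile, list the ResourceBindings in the quota's namespace, sum
// their replica requirements, and write the result to the quota status.
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	policyv1alpha1 "github.com/karmada-io/karmada/pkg/apis/policy/v1alpha1"
	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

type FRQStatusController struct {
	client.Client
}

func (c *FRQStatusController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	quota := &policyv1alpha1.FederatedResourceQuota{}
	if err := c.Get(ctx, req.NamespacedName, quota); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Naive strategy: list every ResourceBinding in the namespace; the cache
	// or field-index improvement above would replace this.
	bindings := &workv1alpha2.ResourceBindingList{}
	if err := c.List(ctx, bindings, client.InNamespace(quota.Namespace)); err != nil {
		return ctrl.Result{}, err
	}

	used := corev1.ResourceList{}
	for _, rb := range bindings.Items {
		if rb.Spec.ReplicaRequirements == nil {
			continue
		}
		for name, per := range rb.Spec.ReplicaRequirements.ResourceRequest {
			total := used[name]
			// per-replica request multiplied by the desired replica count.
			total.Add(*resource.NewMilliQuantity(per.MilliValue()*int64(rb.Spec.Replicas), resource.DecimalSI))
			used[name] = total
		}
	}

	// Status field names follow the proposal; treated as assumptions here.
	quota.Status.Overall = quota.Spec.Overall
	quota.Status.OverallUsed = used
	return ctrl.Result{}, c.Status().Update(ctx, quota)
}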

I'm happy to divide up this work however we'd like. @liwang0513 and I are both available to contribute. Some of these tasks are already covered by our internal implementation and would only need minimal changes.

@RainbowMango (Member) commented:

Great thanks to @mszacillo. Could you please create a new issue for the umbrella tasks, so we can track the progress there?

In addition, for the MVP version, I would prefer to have both the webhook (for validation) and the controller (for calculation), because they are both mandatory parts of this feature. So, shall we organize all tasks that should be in the MVP into Part-1, and leave the others to Part-2?

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/design Categorizes issue or PR as related to design. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FederatedResourceQuota should be failover friendly
7 participants