Proposal to enhance FederatedResourceQuota to enforce resource limits directly on Karmada level #6181
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage Diff (master vs. #6181):
  Coverage   47.95% -> 48.27%   (+0.31%)
  Files         674 ->    677   (+3)
  Lines       55841 ->  56067   (+226)
  Hits        26781 ->  27068   (+287)
  Misses      27311 ->  27229   (-82)
  Partials     1749 ->   1770   (+21)

View full report in Codecov by Sentry.
/assign
Start working on this.
Thank you @mszacillo so much for bringing this up. I like this feature~
It would be great to add a user story section to this proposal to elaborate on the use case, because it's hard to connect the failover feature with quota management.
0a81ad7 to 1a2f8e3 (compare)
Hi @RainbowMango, I have gone ahead and improved the diagrams to make the motivation clearer. They now include FlinkDeployments, with their resource usage, and show how failover does not work with the existing statically assigned ResourceQuotas. Please let me know if there are any other concerns regarding this proposal. Thanks!
@whitewindmills Would you like to take a look at this proposal as well? It is highly likely that the scheduler should be involved to enforce the FederatedResourceQuota.
Got it, I'll take a look ASAP.
Hi @RainbowMango, I've updated the proposal following the discussions we had at KubeCon last week! I've also included another user story for the Apache Spark Platform use-case, which was written by @CaesarTY. Hopefully we can begin working on this feature soon, before the next Karmada release.
/cc
Hey @mszacillo, thank you for your proposal. It is very interesting and meaningful. I will keep an eye on this proposal and hope that I can be of some help with it.
Certain types of applications, especially big data and analytics jobs (e.g., Apache Spark on Kubernetes), do not always request all required resources up front. Instead, they may dynamically request additional resources (executors) at runtime, depending on current workload size, task parallelism, and job progress.

In a multi-cluster environment managed by Karmada, we currently lack a way to enforce resource usage limits across clusters for such elastic workloads. Traditional Kubernetes ResourceQuota objects operate at the namespace level within a single cluster and are not designed to account for cross-cluster resource consumption or dynamic scaling behavior.
In a multi-cluster environment managed by Karmada, we currently lack a way to enforce resource usage limits across clusters for such elastic workloads.
Could this point be expanded in more detail? For example, could you give a specific example using a Spark application?
Sure. A typical example is a Spark application running in dynamic allocation mode, which adjusts the number of executor pods at runtime based on the job's needs. Here is a simple spark-submit command:
spark-submit \
--class org.apache.spark.examples.JavaWordCount \
--master k8s://https://<api-server> \
--deploy-mode cluster \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.initialExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=50 \
--conf spark.kubernetes.namespace=analytics \
--conf spark.kubernetes.container.image=myrepo/spark:latest \
local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar \
s3a://my-bucket/text-data/
In this setup:
- The job starts with 5 executors.
- Executors are dynamically scaled up to 50 as needed.
- Each executor is a pod that may land on any eligible node.
It would be great if we had a way to enforce resource usage limits across clusters for such elastic workloads.
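To make the ask concrete, a single control-plane quota for the analytics namespace could look roughly like the sketch below (the values are purely illustrative):
apiVersion: policy.karmada.io/v1alpha1
kind: FederatedResourceQuota
metadata:
  name: spark-analytics
  namespace: analytics
spec:
  overall:
    # Illustrative ceiling: roughly the driver plus up to 50 executors,
    # enforced across all member clusters rather than per cluster.
    requests.cpu: "60"
    requests.memory: 240Gi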
As long as the max # of executors is something that can be found in the CRD spec, you can probably configure your custom interpreter to account for this in your replica requirements, which in turn would be reflected in the FederatedResourceQuota.
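For illustration only, a declarative interpreter customization along those lines might look roughly like the following, assuming the spark-operator's SparkApplication CRD with spec.dynamicAllocation.maxExecutors and spec.executor fields (a sketch, not something defined in this proposal):
apiVersion: config.karmada.io/v1alpha1
kind: ResourceInterpreterCustomization
metadata:
  name: sparkapplication-replicas
spec:
  target:
    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
  customizations:
    replicaResource:
      luaScript: |
        function GetReplicas(obj)
          -- Use the dynamic-allocation ceiling so quota accounting reflects the worst case.
          local replica = obj.spec.dynamicAllocation.maxExecutors
          local requirement = {
            resourceRequest = {
              cpu = obj.spec.executor.cores,
              -- Note: SparkApplication memory is Spark-style (e.g. "512m") and may need
              -- converting to a Kubernetes quantity before it is used for quota accounting.
              memory = obj.spec.executor.memory
            }
          }
          return replica, requirement
        end
Whether to count maxExecutors, initialExecutors, or the live replica count is exactly the trade-off discussed below.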
The job starts with 5 executors.
Executors are dynamically scaled up to 50 as needed.
@CaesarTY Thanks for your detailed description. For this elastic scenario, do you want to calculate the usage based on 5, 50, or the number of replicas in the member cluster?
Ideally, the usage should be calculated by the real-time number of replicas (driver + executors pods resources) and enforced by FederatedResourceQuota across the tier
Note that once the Spark application is deployed to a member cluster, there is no way for Karmada to suspend the executor scale-up. (Sure, Karmada can calculate the real-time resource consumption, but it can only be used to check the following Spark applications.)
I think @zhzhuang-zju's question is whether we should enforce quota interception based on the initial or maximum usage.
In my opinion, we should always use the maximum usage that can ensure the total amount of resource usage does not exceed the quota limits.
If the resource name is `limits.cpu`, then we should use the maximum usage, taking `maxExecutors` into account.
apiVersion: policy.karmada.io/v1alpha1
kind: FederatedResourceQuota
metadata:
name: foo
spec:
overall:
limits.cpu: 100
If the resource name is `requests.cpu`, then we should use the requested usage, taking `initialExecutors` into account.
apiVersion: policy.karmada.io/v1alpha1
kind: FederatedResourceQuota
metadata:
name: foo
spec:
overall:
requests.cpu: 100
@CaesarTY I don't know if that meets your case. It would be great to share a sample of your ResourceQuota configuration.
1. FederatedResourceQuota API should enforce namespaced overall resource limits.
- FederatedResourceQuota Status will be updated whenever resources are applied against the relevant namespace
- We consider including a resource selector scope for the quota
What are the approximate functions and APIs of the resource selector scope?
I can flesh this out and update the proposal. But the general idea was to support a PriorityClass scope in the same way the Kubernetes ResourceQuota does: https://kubernetes.io/docs/concepts/policy/resource-quotas/#quota-scopes
However, for a first pass of this feature this may not be necessary. If a FederatedResourceQuota is defined in the namespace, then it can monitor all resources in that namespace rather than filtering by scope. Let me know what you think.
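To sketch the idea, a scoped FederatedResourceQuota could hypothetically mirror the Kubernetes ResourceQuota scopeSelector. The scopeSelector field below is not part of the current API and is shown only to illustrate the shape such an extension might take:
apiVersion: policy.karmada.io/v1alpha1
kind: FederatedResourceQuota
metadata:
  name: high-priority-quota
  namespace: analytics
spec:
  overall:
    requests.cpu: "100"
  # Hypothetical field, modeled on Kubernetes ResourceQuota scopes.
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["high"]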
support a PriorityClass scope in the same way the Kubernetes ResourceQuota does
Get it. This is also one of the features I expect.
However, for a first pass of this feature this may not be necessary. If a FederatedResourceQuota is defined in the namespace, then it can monitor all resources in that namespace rather than filtering by scope. Let me know what you think.
In terms of API design, we can try to make it as complete as possible, or at least ensure sufficient scalability. In terms of implementation, we can proceed step by step at a suitable pace.
// FederatedResourceQuota represents the name of the quota that will be used for this resource
// +optional
FederatedResourceQuota string `json:"federatedResourceQuota,omitempty"`
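For context, with that field set, a ResourceBinding carrying the reference might look roughly like the sketch below (based on the snippet above; unrelated fields are omitted and the names are invented for illustration):
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: flink-job-flinkdeployment
  namespace: analytics
spec:
  resource:
    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    namespace: analytics
    name: flink-job
  # Proposed field: names the quota that accounts for this binding's usage.
  federatedResourceQuota: foo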
Does this mean that only one FederatedResourceQuota can be applied within the same namespace?
This just means that the ResourceBinding will be tied to at most 1 FederatedResourceQuota. If we support scopes, then we can include multiple quotas in the same namespace, and the ResourceBinding would pick up the quota that matches the scope.
This just means that the ResourceBinding will be tied to at most 1 FederatedResourceQuota.
Could you explain the reason for this constraint? It seems that the native ResourceQuota doesn't limit a pod to using only one ResourceQuota.
That's a good question.
Currently, this proposal explains how and when the `.spec.FederatedResourceQuota` in ResourceBinding should be set. The question is why ResourceBinding needs to keep the reference to the FederatedResourceQuota.
Echo from the proposal:
Reconciliation: Controller will reconcile whenever a resource binding is created, updated, or deleted. Controller will only reconcile if the resource in question has a pointer to the FederatedResourceQuota.
Is this the reason for introducing the reference, i.e., so the controller can figure out which ResourceBindings it should care about?
Could you explain the reason for this constraint? It seems that the native ResourceQuota doesn't limit a pod to using only one ResourceQuota.
This isn't a hard constraint, but this decision was meant to decrease the scope of the proposal. I would also want to know more use-cases for this, since both of the user stories in the proposal only require one FederatedResourceQuota per ResourceBinding.
The question is why ResourceBinding needs to keep the reference to the FederatedResourceQuota?
Your reasoning below was correct, this was to make the controller logic simpler. Since the controller is updating the FederatedResourceQuota status, it needs to know which ResourceBindings are tied to the specific FederatedResourceQuota - hence the reference.
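As an illustration, after the controller aggregates usage from the ResourceBindings that reference it, the quota status might read roughly like this (the values are made up, and it assumes the existing overall/overallUsed status fields):
apiVersion: policy.karmada.io/v1alpha1
kind: FederatedResourceQuota
metadata:
  name: foo
  namespace: analytics
spec:
  overall:
    requests.cpu: "100"
status:
  overall:
    requests.cpu: "100"
  # Sum of requests across all ResourceBindings referencing this quota.
  overallUsed:
    requests.cpu: "30"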
https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/staging/src/k8s.io/apiserver/pkg/quota/v1/interfaces.go#L45-L65
Hi @mszacillo, I guess the role of `FederatedResourceQuota` is similar to the method `Evaluator.Matches` of the native ResourceQuota. Both are used to determine whether the resource can be applied to this quota. It's just that one maintains a method, while the other maintains a field in the API.
In terms of extensibility, I prefer the latter. Moreover, maintaining an API field can also be quite burdensome. Of course, if I'm wrong in my understanding, please correct me.
@zhzhuang-zju I'll take a look at the `Evaluator.Matches` implementation a bit more, as I'm not familiar with the method. But I agree that maintaining API fields can be burdensome. Let me catch up on how `Evaluator.Matches` works and I can update the proposal with this information.
Note: Since the controller is listening to RBs, the FederatedResourceQuota will be calculated after the resource binding has been created or updated.

If a user bulk-applies a bunch of resources at once, it could be possible for the user to go above the quota's limits. In this case, we should also check that the quota is honored before deciding to schedule the resource to a member cluster.
We can draw on the implementation of the native ResourceQuota to avoid the quota restrictions becoming ineffective when users bulk-apply resources. I have an idea here that we can discuss. Maybe during the next meeting?
Sounds great! Yes, this is definitely a concern, which is why we were discussing that a change to the scheduler may be necessary to additionally enforce quota limits there.
Huge apologies for not being able to make the meeting this week. Perhaps we can discuss your thoughts here? @zhzhuang-zju
Of course. When dealing with bulk applications, the admission control of the native resource quota will aggregate the validation requests according to the namespace and process all the requests under one namespace serially at one time.
For the convenience of understanding, I will use this diagram to explain. As you can see from the diagram, a verification request first enters the namespace queue. At the same time, a corresponding map is maintained, where the key is the namespace and the values are the verification requests that have already entered the queue. A work routine then retrieves a namespace from the queue and obtains all the verification requests under that namespace from the map. In this way, the verification requests under one namespace are always processed within a single work routine, and they are processed serially, which ensures that bulk applications cannot exceed the quota limit.
However, karmada-webhook supports multiple instances, which may make it difficult to directly draw on.
Got it! This makes a lot of sense, thank you for the thorough explanation. Thinking about this more, I think it's clear that the webhook approach will not cover our edge cases (such as bulk applies, for instance). I'll update the proposal with the admission controller approach instead, borrowing from the ResourceQuota implementation in Kubernetes.


Could we support dynamic assignment of FederatedResourceQuota? Potentially yes, but there are some drawbacks with that approach:
@mszacillo For the scenario of failure migration, the work in the failed cluster may not be cleared immediately. Do you want the resource usage of the failed cluster to be included in the total resource usage, or do you prefer to ignore this part?
I think this can be ignored, because it should be expected for Karmada to clean up the resource. From the perspective of the user, they have applied their resource once to the Karmada control-plane, so it should be reflected in the resource quota accordingly.
As part of this change, we will introduce a new validating webhook:

1. The new validating webhook will watch all types of resources, at least all native workloads (Deployments) and supported CRDs (FlinkDeployments). The webhook will use Karmada's `ResourceInterpreter#GetReplicas` to calculate the predicted delta resource usage for the quota. If the applied resource goes above the limit, then the webhook will deny the request.
If we implement admission control on the control plane, is it still necessary to propagate the resource quota to the member clusters?
No, we won't need to propagate resource quotas.
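For illustration, the control-plane registration for the validating webhook described above might look roughly like the sketch below; the service name, path, and rules are assumptions made for the sketch and are not taken from the proposal:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: federatedresourcequota-enforcement
webhooks:
- name: quota-enforcement.karmada.io
  clientConfig:
    service:
      name: karmada-webhook
      namespace: karmada-system
      path: /validate-resource-usage
  rules:
  - apiGroups: ["apps", "flink.apache.org"]
    apiVersions: ["*"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments", "flinkdeployments"]
  failurePolicy: Fail
  sideEffects: None
  admissionReviewVersions: ["v1"]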
Hi @mszacillo, can you add me to the reviewers of your proposal? Thanks~
Yes, absolutely :)
Signed-off-by: mszacillo <mszacillo@bloomberg.net>
Hello - apologies for the delay, I had Easter holidays last weekend and was quite sick. I've gone ahead and updated the proposal to address the latest feedback. @zhzhuang-zju I've also included your diagram in the proposal for the admission controller section.
It’s all good! Hope you’re fully recovered now.
Thanks
Sorry to hear that, take care. Hope you're doing fine now. No worries!
@RainbowMango @zhzhuang-zju Sounds great! See you both at the community meeting.
@mszacillo For this proposal, I think we already have a clear use case and definition of the feature. Could we merge this version first and then fill in the implementation details in follow-up PRs? This would allow us to focus on the updated parts incrementally, which might help move this feature forward faster.
/lgtm
If you agree to merge this first, please unhold this by leaving a comment.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: RainbowMango. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@RainbowMango Sure, that sounds good to me! Happy to add updates to this proposal as we implement.
/hold cancel
Hi @RainbowMango, @zhzhuang-zju, I created a tentative list of iteration tasks for the proposal. Feel free to change these around if necessary:
Proposal:
Iteration Tasks -- Part-1: Enhance FederatedResourceQuota status controller
---- Part 1 Improvements beyond MVP ----
Iteration Tasks -- Part-2: Validation of Quota before Scheduling
I'm happy to divide up this work however we'd like. @liwang0513 and I are both available to contribute. Some of these are already taken care of by our internal implementation and would need only minimal changes.
Great thanks to @mszacillo. Could you please create a new issue for the umbrella tasks so we can track the progress there? In addition, for the MVP version, I prefer to have both the webhook (for validation) and the controller (for calculation), because they are mandatory parts of this feature. So, shall we organize all tasks that should be in the MVP into Part-1, and leave the others to Part-2?
What type of PR is this?
/kind design
What this PR does / why we need it:
This proposal enhances the FederatedResourceQuota so that it can impose namespaced resource limits directly on the Karmada control-plane level.
Which issue(s) this PR fixes:
Fixes #5179
Special notes for your reviewer:
Following our discussion during the community meeting, there were some questions regarding API server request latency as a result of adding an admission webhook to enforce resource limits. I ran some tests by measuring the time taken to apply some FlinkDeployments to the Karmada control-plane:
Without webhook: Average e2e request latency over 100 retries: 370 ms
With webhook: Average e2e request latency over 100 retries: 390 ms
The request latency increases slightly, as expected, but we can run more comprehensive performance tests for this feature.
Does this PR introduce a user-facing change?: