Proposal for multiple pod template support #5085
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##           master    #5085      +/-   ##
==========================================
+ Coverage   28.21%   28.29%   +0.08%
==========================================
  Files         632      632
  Lines       43568    43635      +67
==========================================
+ Hits        12291    12345      +54
- Misses      30381    30388       +7
- Partials      896      902       +6

Flags with carried forward coverage won't be shown.
@RainbowMango - I've added this proposal to the discussion section of the community meeting tomorrow. By chance, would it be possible to move the meeting 30 minutes earlier? I've got a conflict at the moment.
I'm ok with it since this is the only topic for this meeting. I'll send a notice to the mailing group and slack channel, and gather feedback.
docs/proposals/scheduling/crd-scheduling-improvements/crd-scheduling-improvements.md
@RainbowMango Thanks so much for reviewing this with me during the community meeting! Just to add more context here, as I heard there is also work being done to support CRDs with multiple pod templates (like FlinkDeployment, or TensorFlow jobs for instance). For the FlinkDeployment, we cannot have replicas for the same job scheduled on different clusters - meaning we either schedule all pods on one cluster, or do not schedule at all. Once we schedule the CRD to a member cluster, all pod scheduling will be taken care of by the Flink operator. Thinking about this more, I think it makes more sense to approach this by using one of your suggestions, which was to make Components the top-level API definition, and have replicas defined within each individual component. If we need all replicas to be scheduled on one cluster, we can set the spreadConstraints on the related PropagationPolicy.
/assign
// The total number of replicas scheduled by this resource. Each replica will be represented by exactly one component of the resource.
TotalReplicas int32 `json:"totalReplicas,omitempty"`
I've included this field as a replacement for the existing Replicas
field, which is used very frequently within the Karmada codebase. Even though we are introducing the concept of components, Karmada will still ultimately be scheduling replicas - so I believe this slight refactor will make the implementation of this change less complex. This is again making an assumption that resources with more than 1 component will be scheduled on the same cluster.
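For illustration, a minimal sketch (not the proposed implementation; the `component` type and function name here are placeholders) of how TotalReplicas could simply be the sum of every component's replica count, so downstream scheduling code keeps reasoning in replicas:

```go
package main

import "fmt"

// component is a placeholder for whatever per-component type the proposal
// settles on; only the replica count matters for this sketch.
type component struct {
	name     string
	replicas int32
}

// totalReplicas sums the replica counts of all components; this sum is what
// would be stored in the TotalReplicas field.
func totalReplicas(components []component) int32 {
	var total int32
	for _, c := range components {
		total += c.replicas
	}
	return total
}

func main() {
	// FlinkDeployment-style example: one JobManager, two TaskManagers.
	fmt.Println(totalReplicas([]component{
		{name: "jobmanager", replicas: 1},
		{name: "taskmanager", replicas: 2},
	})) // 3
}
```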
For the proposed implementation, please refer to the next section.
### Accurate Estimator Changes |
Apologies if the explanation of the implementation proposal is a little bit dense. I can format this to be a little clearer, but otherwise please let me know if you have any questions that I can clarify.
Hi @RainbowMango @whitewindmills, I've gone ahead and made an update to the proposal doc with a more precise explanation of the implementation details. Please let me know if you have any comments or questions - and apologies in advance if this is a little dense; perhaps I can type this in LaTeX and attach an image of the algorithm-specific sections. :) Quick note - there still needs to be some work done on how multiple components would work from a customized resource modeling perspective. I'll try to add a section on that this weekend.
During maxReplica estimation, we will take the sum of all resource requirements for the CRD.

Total_CPU = component_1.replicas * (component_1.cpu) + component_2.replicas * (component_2.cpu) = (1 * 1) + (2 * 1) = 3 CPU.
Total_Memory = component_1.replicas * (component_1.memory) + component_2.replicas * (component_2.memory) = (1 * 2GB) + (2 * 1GB) = 8GB.
Isn't it 4GB?
Oops, yes :)
So this design doesn't consider replica division and is also not a completely accurate calculation, because the fragmentation issue is unavoidable, am I right?
Yes, this approach would not consider divided replicas, based on the use-cases we've compiled (#5115). At the moment, most CRDs get scheduled to a single cluster rather than spread across multiple. In terms of the precision of the calculation, you're right that it's not completely accurate. It's an estimate of how many CRDs could be fully packed on the destination cluster. However, we do guarantee that at least 1 CRD can be scheduled on a member. So there should never be a scenario where we schedule a CRD to a member cluster that does not have sufficient resources to hold it.
Hi @mszacillo I'm going to take two weeks off and might be slow to respond during this time. I believe this feature is extremely important, and I'll be focusing on it once I get back. Given this feature would get controllers and schedulers involved, it's not that easy to come up with an ideal solution in a short time. By the way, I guess, with the help of the default resource interpreter of FlinkDeployment (thanks for your contribution, by the way), this is probably not a blocker for you, am I right? Do you think the Application Failover feature has a higher priority than this?
Hi @RainbowMango, thanks for the heads up and enjoy your time off!
Yes, this is currently not a blocker for us. We can get by with the existing maxReplica estimation while we determine a solution for multiple podTemplate support, and we are instead focusing on the failover feature enhancements. For our MVP using Karmada we need two things:
After we've completed the above tickets, our order of priority will be publishing the implementation for the later steps of the failover history proposal, and then working on the multiple pod template support.
// Defines the requirements of an individual component of the resource.
// +optional
Components []Components `json:"components,omitempty"`
Components []ComponentRequirements
Right?
+1
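To make the suggested rename concrete, a rough sketch of what a `ComponentRequirements` entry could look like (field names and types here are assumptions for illustration, not the final Karmada API):

```go
package v1alpha2

import corev1 "k8s.io/api/core/v1"

// ComponentRequirements describes one component (pod template) of a resource.
// This is only a sketch of the shape suggested in this thread.
type ComponentRequirements struct {
	// Name identifies the component, e.g. "jobmanager" or "taskmanager".
	Name string `json:"name"`

	// Replicas is the number of replicas this component runs.
	Replicas int32 `json:"replicas"`

	// ResourceRequest is the resource request of a single replica of this
	// component, mirroring how per-replica requests are expressed today.
	ResourceRequest corev1.ResourceList `json:"resourceRequest,omitempty"`
}
```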
Hi, did you make any progress or is there any available prototype to try?
Will you support a new replicaSchedulingType like Gang?
`Assumption 1`: Resources with more than one replica will always be scheduled to the same cluster.
- This simplifies the scope of the problem, and accounts for the fact that it is non-trivial to schedule components of the same CRD across multiple clusters.

`Assumption 2`: MaxReplica estimation will use the sum of the resource requirements of every component's replica.
So please correct me if I am wrong: you mean that we consider all components (podTemplates) together to compute how many CRDs a cluster can deploy? The assign result is a replica number of the whole CRD, not one pod, right? What I am concerned about is what the targetCluster field of the RB returns, whether it remains the same or returns different replicas for different podTemplates.
And if that is true, is it not considered that multiple pods under one CRD are deployed on different nodes?
Will try to answer these questions, let me know if anything else needs clarification:
> we consider all components (podTemplates) together to compute how many CRDs a cluster can deploy
Correct, but just for the estimation portion. If we sum all requirements together we can estimate how many multiples of the entire CRD can be scheduled on the cluster's available resources (even if that ends up being an overestimate due to resource-partitioning).
> the assign result is a replica number of the whole CRD, not one pod, right
The assign result for the resource will still be in terms of pods, since that's the atomic scheduling unit in Karmada. In terms of the targetCluster field I believe it would be the same for both podTemplates, since this design assumes that CRDs with multiple pod templates want to be scheduled to the same cluster.
Providing an example for FlinkDeployment (for simplicity let's just estimate CPU capacity):
JobManagerPodTemplate
  replicas: 1
  replicaRequirements:
    cpu: 1
TaskManagerPodTemplate
  replicas: 2
  replicaRequirements:
    cpu: 1
Let's assume Cluster A has 7 available CPU. For estimation purposes we sum up the total CPU = (1 + 2) = 3. Total # of CRDs that can fit will be 7 / 3 = 2.
Once we verify that all individual components can be scheduled on available nodes (to account for resource fragmentation), we return the estimation in terms of replicas = 2 * (3) = 6 replicas.
The Karmada scheduler will then assign the 3 total replicas of the CRD to the 6 available replicas in Cluster A.
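A small sketch of that arithmetic under the stated assumptions (names are illustrative, not the actual estimator code):

```go
package main

import "fmt"

type component struct {
	replicas int32
	cpu      int64 // CPU request per replica, in whole cores for simplicity
}

// maxReplicaEstimate returns the estimate in replicas: the number of whole
// CRDs that fit into the available CPU, multiplied by the replicas per CRD.
func maxReplicaEstimate(availableCPU int64, components []component) int32 {
	var cpuPerCRD int64
	var replicasPerCRD int32
	for _, c := range components {
		cpuPerCRD += int64(c.replicas) * c.cpu
		replicasPerCRD += c.replicas
	}
	if cpuPerCRD == 0 {
		return 0
	}
	wholeCRDs := availableCPU / cpuPerCRD // 7 / 3 = 2
	return int32(wholeCRDs) * replicasPerCRD
}

func main() {
	components := []component{
		{replicas: 1, cpu: 1}, // JobManager
		{replicas: 2, cpu: 1}, // TaskManager
	}
	fmt.Println(maxReplicaEstimate(7, components)) // 6
}
```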
> And if that is true, is it not considered that multiple pods under one CRD are deployed on different nodes?
Well it is possible that different pods of the same CRD are deployed on different nodes. Unless you require something like gang-scheduling, where all pods must be scheduled on the same node. Is that something you require?
Not yet. I like this proposal very much, and am glad to help move this forward.
Maybe not, as Gang scheduling is already supported, I guess. Please let us know your use case if you are interested in this proposal.
I found the proposal in #5218. I think this is what I need. We need to ensure the minimum replicas of a job can run in one cluster.
Out of curiosity, which CRD are you working with? Until this pod template proposal is implemented, you could customize the way your replicas are interpreted so that the resource requirement takes the max(cpu, memory) of all your templates. Replica estimation would then be able to determine if you could schedule all your replicas on a single cluster.
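For instance, a rough sketch of that interim workaround (illustrative only, not the actual resource interpreter code) would take the element-wise maximum of the per-replica requests across all templates:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// maxRequirements returns, for every resource name, the largest request found
// in any of the given templates, so the existing single-template estimation
// stays conservative for multi-template CRDs.
func maxRequirements(templates ...corev1.ResourceList) corev1.ResourceList {
	out := corev1.ResourceList{}
	for _, t := range templates {
		for name, qty := range t {
			if cur, ok := out[name]; !ok || qty.Cmp(cur) > 0 {
				out[name] = qty
			}
		}
	}
	return out
}

func main() {
	jobManager := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("1"),
		corev1.ResourceMemory: resource.MustParse("2Gi"),
	}
	taskManager := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("2"),
		corev1.ResourceMemory: resource.MustParse("1Gi"),
	}
	merged := maxRequirements(jobManager, taskManager)
	cpu := merged[corev1.ResourceCPU]
	mem := merged[corev1.ResourceMemory]
	fmt.Println(cpu.String(), mem.String()) // 2 2Gi
}
```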
Hi, we also have some use cases of multi pod template with Volcano Job (see https://github.com/volcano-sh/volcano), and we have designed a complete API for multi pod template here: https://github.com/karmada-io/karmada/blob/79e374dd3840254e40d8abb33479fc81175a356b/docs/proposals/multi-template-partitionting-and-scheduler-extensibility/README.md#update-resourcebinding-api-definition, which we think is more robust. Could you take a look?
Here is a volcano job CRD example: https://github.com/volcano-sh/volcano/blob/master/example/MindSpore-example/mindspore_gpu/mindspore-gpu.yaml |
What type of PR is this?
/kind design
What this PR does / why we need it:
Described in document.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Proposal doc for CRD scheduling improvements. Posting proposal following discussion in community meeting.
Does this PR introduce a user-facing change?: