Proposal for multiple pod template support #5085
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##           master    #5085      +/-   ##
==========================================
+ Coverage   28.21%   28.29%   +0.08%
==========================================
  Files         632      632
  Lines       43568    43635      +67
==========================================
+ Hits        12291    12345      +54
- Misses      30381    30388       +7
- Partials      896      902       +6

Flags with carried forward coverage won't be shown.
@RainbowMango - I've added this proposal to the discussion section of the community meeting tomorrow. By chance, would it be possible to move the meeting 30 minutes earlier? I've got a conflict at the moment.
I'm ok with it since this is the only topic for this meeting. I'll send a notice to the mailing group and slack channel, and gather feedback.
docs/proposals/scheduling/crd-scheduling-improvements/crd-scheduling-improvements.md
@RainbowMango Thanks so much for reviewing this with me during the community meeting! Just to add more context here, as I heard there is also work being done to support CRDs with multiple pod templates (like FlinkDeployment, or TensorFlow jobs for instance). For the FlinkDeployment, we cannot have replicas for the same job scheduled on different clusters - meaning we either schedule all pods on one cluster, or do not schedule at all. Once we schedule the CRD to a member cluster, all pod scheduling will be taken care of by the Flink operator. Thinking about this more, I think it makes more sense to approach this by using one of your suggestions, which was to make Components the top-level API definition, and have replicas defined within each individual component. If we need all replicas to be scheduled on one cluster, we can set the spreadConstraints on the related PropagationPolicy.
/assign
// The total number of replicas scheduled by this resource. Each replica will be represented by exactly one component of the resource.
TotalReplicas int32 `json:"totalReplicas,omitempty"`
I've included this field as a replacement for the existing Replicas
field, which is used very frequently within the Karmada codebase. Even though we are introducing the concept of components, Karmada will still ultimately be scheduling replicas - so I believe this slight refactor will make the implementation of this change less complex. This is again making an assumption that resources with more than 1 component will be scheduled on the same cluster.
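For illustration, a minimal sketch (not the proposed implementation; the `component` type and function name here are placeholders) of how TotalReplicas could simply be the sum of every component's replica count, so downstream scheduling code keeps reasoning in replicas:

```go
package main

import "fmt"

// component is a placeholder for whatever per-component type the proposal
// settles on; only the replica count matters for this sketch.
type component struct {
	name     string
	replicas int32
}

// totalReplicas sums the replica counts of all components; this sum is what
// would be stored in the TotalReplicas field.
func totalReplicas(components []component) int32 {
	var total int32
	for _, c := range components {
		total += c.replicas
	}
	return total
}

func main() {
	// FlinkDeployment-style example: one JobManager, two TaskManagers.
	fmt.Println(totalReplicas([]component{
		{name: "jobmanager", replicas: 1},
		{name: "taskmanager", replicas: 2},
	})) // 3
}
```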
For the proposed implementation, please refer to the next section.
### Accurate Estimator Changes |
Apologies if the explanation of the implementation proposal is a little bit dense. I can format this to be a little clearer, but otherwise please let me know if you have any questions that I can clarify.
Hi @RainbowMango @whitewindmills, I've gone ahead and made an update to the proposal doc with a more precise explanation of the implementation details. Please let me know if you have any comments or questions - and apologies in advance if this is a little dense; perhaps I can type this in LaTeX and attach an image of the algorithm-specific sections. :) Quick note - there still needs to be some work done on how multiple components would work from a customized resource modeling perspective. I'll try to add a section on that this weekend.
During maxReplica estimation, we will take the sum of all resource requirements for the CRD.

Total_CPU = component_1.replicas * (component_1.cpu) + component_2.replicas * (component_2.cpu) = (1 * 1) + (2 * 1) = 3 CPU.
Total_Memory = component_1.replicas * (component_1.memory) + component_2.replicas * (component_2.memory) = (1 * 2GB) + (2 * 1GB) = 8GB.
Isn't it 4GB?
Oops, yes :)
So this design doesn't consider replica division and is also not a completely accurate calculation, because the fragmentation issue is unavoidable, am I right?
Yes, this approach would not consider divided replicas, based on the use-cases we've compiled (#5115). At the moment, most CRDs get scheduled to a single cluster rather than spread across multiple. In terms of the precision of the calculation, you're right that it's not completely accurate. It's an estimate of how many CRDs could be fully packed on the destination cluster. However, we do guarantee that at least 1 CRD can be scheduled on a member. So there should never be a scenario where we schedule a CRD to a member cluster that does not have sufficient resources to hold it.
Hi @mszacillo I'm going to take two weeks off and might be slow to respond during this time. I believe this feature is extremely important, and I'll be focusing on it once I get back. Given this feature would get controllers and schedulers involved, it's not that easy to come up with an ideal solution in a short time. By the way, I guess, with the help of the default resource interpreter of FlinkDeployment (thanks for your contribution, by the way), this is probably not a blocker for you, am I right? Do you think the Application Failover feature has a higher priority than this?
Hi @RainbowMango, thanks for the heads up and enjoy your time off!
Yes, this is currently not a blocker for us. We can get by with the existing maxReplica estimation while we determine a solution for multiple podTemplate support, and we are instead focusing on the failover feature enhancements. For our MVP using Karmada we need two things:
After we've completed the above tickets, our order of priority will be publishing the implementation for the later steps of the failover history proposal, and then working on the multiple pod template support.
// Defines the requirements of an individual component of the resource.
// +optional
Components []Components `json:"components,omitempty"`
Components []ComponentRequirements
Right?
+1
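To make the suggested rename concrete, a rough sketch of what a `ComponentRequirements` entry could look like (field names and types here are assumptions for illustration, not the final Karmada API):

```go
package v1alpha2

import corev1 "k8s.io/api/core/v1"

// ComponentRequirements describes one component (pod template) of a resource.
// This is only a sketch of the shape suggested in this thread.
type ComponentRequirements struct {
	// Name identifies the component, e.g. "jobmanager" or "taskmanager".
	Name string `json:"name"`

	// Replicas is the number of replicas this component runs.
	Replicas int32 `json:"replicas"`

	// ResourceRequest is the resource request of a single replica of this
	// component, mirroring how per-replica requests are expressed today.
	ResourceRequest corev1.ResourceList `json:"resourceRequest,omitempty"`
}
```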
Hi, did you make any progress or is there any available prototype to try?
Will you support a new replicaSchedulingType like Gang?
`Assumption 1`: Resources with more than one replica will always be scheduled to the same cluster.
- This simplifies the scope of the problem, and accounts for the fact that it is non-trivial to schedule components of the same CRD across multiple clusters.

`Assumption 2`: MaxReplica estimation will use the sum of the resource requirements of every component's replica.
So please correct me if I am wrong: you mean that we consider all components (podTemplates) together to compute how many CRDs a cluster can deploy? The assign result is a replica number of the whole CRD, not one pod, right? What I am concerned about is what the targetCluster field of the RB returns, whether it remains the same or returns different replicas for different podTemplates.
And if that is true, is it not considered that multiple pods under one CRD are deployed on different nodes?
Will try to answer these questions, let me know if anything else needs clarification:
> we consider all components (podTemplates) together to compute how many CRDs a cluster can deploy
Correct, but just for the estimation portion. If we sum all requirements together we can estimate how many multiples of the entire CRD can be scheduled on the cluster's available resources (even if that ends up being an overestimate due to resource-partitioning).
> the assign result is a replica number of the whole CRD, not one pod, right
The assign result for the resource will still be in terms of pods, since that's the atomic scheduling unit in Karmada. In terms of the targetCluster field I believe it would be the same for both podTemplates, since this design assumes that CRDs with multiple pod templates want to be scheduled to the same cluster.
Providing an example for FlinkDeployment (for simplicity let's just estimate CPU capacity):
JobManagerPodTemplate
  replicas: 1
  replicaRequirements:
    cpu: 1
TaskManagerPodTemplate
  replicas: 2
  replicaRequirements:
    cpu: 1
Let's assume Cluster A has 7 available CPU. For estimation purposes we sum up the total CPU = (1 + 2) = 3. Total # of CRDs that can fit will be 7 / 3 = 2.
Once we verify that all individual components can be scheduled on available nodes (to account for resource fragmentation), we return the estimation in terms of replicas = 2 * (3) = 6 replicas.
The Karmada scheduler will then assign the 3 total replicas of the CRD to the 6 available replicas in Cluster A.
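A small sketch of that arithmetic under the stated assumptions (names are illustrative, not the actual estimator code):

```go
package main

import "fmt"

type component struct {
	replicas int32
	cpu      int64 // CPU request per replica, in whole cores for simplicity
}

// maxReplicaEstimate returns the estimate in replicas: the number of whole
// CRDs that fit into the available CPU, multiplied by the replicas per CRD.
func maxReplicaEstimate(availableCPU int64, components []component) int32 {
	var cpuPerCRD int64
	var replicasPerCRD int32
	for _, c := range components {
		cpuPerCRD += int64(c.replicas) * c.cpu
		replicasPerCRD += c.replicas
	}
	if cpuPerCRD == 0 {
		return 0
	}
	wholeCRDs := availableCPU / cpuPerCRD // 7 / 3 = 2
	return int32(wholeCRDs) * replicasPerCRD
}

func main() {
	components := []component{
		{replicas: 1, cpu: 1}, // JobManager
		{replicas: 2, cpu: 1}, // TaskManager
	}
	fmt.Println(maxReplicaEstimate(7, components)) // 6
}
```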
> And if that is true, is it not considered that multiple pods under one CRD are deployed on different nodes?
Well it is possible that different pods of the same CRD are deployed on different nodes. Unless you require something like gang-scheduling, where all pods must be scheduled on the same node. Is that something you require?
Not yet. I like this proposal very much, and am glad to help move this forward.
Maybe not, as Gang scheduling is already supported, I guess. Please let us know your use case if you are interested in this proposal.
I found the proposal in #5218. I think this is what I need. We need to ensure the minimum replicas of a job can run in one cluster.
Out of curiosity, which CRD are you working with? Until this pod template proposal is implemented, you could customize the way your replicas are interpreted so that the resource requirement takes the max(cpu, memory) of all your templates. Replica estimation would then be able to determine if you could schedule all your replicas on a single cluster.
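For instance, a rough sketch of that interim workaround (illustrative only, not the actual resource interpreter code) would take the element-wise maximum of the per-replica requests across all templates:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// maxRequirements returns, for every resource name, the largest request found
// in any of the given templates, so the existing single-template estimation
// stays conservative for multi-template CRDs.
func maxRequirements(templates ...corev1.ResourceList) corev1.ResourceList {
	out := corev1.ResourceList{}
	for _, t := range templates {
		for name, qty := range t {
			if cur, ok := out[name]; !ok || qty.Cmp(cur) > 0 {
				out[name] = qty
			}
		}
	}
	return out
}

func main() {
	jobManager := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("1"),
		corev1.ResourceMemory: resource.MustParse("2Gi"),
	}
	taskManager := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("2"),
		corev1.ResourceMemory: resource.MustParse("1Gi"),
	}
	merged := maxRequirements(jobManager, taskManager)
	cpu := merged[corev1.ResourceCPU]
	mem := merged[corev1.ResourceMemory]
	fmt.Println(cpu.String(), mem.String()) // 2 2Gi
}
```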
Hi, we also have some use cases of multi pod template with Volcano Job (see https://github.com/volcano-sh/volcano), and we have designed a complete API for multi pod template here: https://github.com/karmada-io/karmada/blob/79e374dd3840254e40d8abb33479fc81175a356b/docs/proposals/multi-template-partitionting-and-scheduler-extensibility/README.md#update-resourcebinding-api-definition, which we think is more robust. Could you take a look?
Here is a volcano job CRD example: https://github.com/volcano-sh/volcano/blob/master/example/MindSpore-example/mindspore_gpu/mindspore-gpu.yaml |
What type of PR is this?
/kind design
What this PR does / why we need it:
Described in document.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Proposal doc for CRD scheduling improvements. Posting proposal following discussion in community meeting.
Does this PR introduce a user-facing change?: