|
| 1 | +--- |
| 2 | +title: CRD Component Scheduler Estimation |
| 3 | +authors: |
| 4 | + - "@mszacillo" |
| 5 | + - "@Dyex719" |
| 6 | +reviewers: |
| 7 | +- "@RainbowMango" |
| 8 | +- "@XiShanYongYe-Chang" |
| 9 | +- "@zhzhuang-zju" |
| 10 | +approvers: |
| 11 | +- "@RainbowMango" |
| 12 | + |
| 13 | +create-date: 2024-06-17 |
| 14 | +--- |
| 15 | +# CRD Component Scheduler Estimation |
| 16 | + |
| 17 | +## Summary |
| 18 | + |
| 19 | +Users may want to use Karmada for resource-aware scheduling of custom resources (CRDs). This is already possible |
| 20 | +if the CRD consists of a single podTemplate, which Karmada can parse if the user defines |
| 21 | +the ReplicaRequirements with this in mind. Resource-aware scheduling becomes more difficult, however, |
| 22 | +if the CRD consists of multiple podTemplates or pods with differing resource requirements. |
| 23 | + |
| 24 | +In the case of FlinkDeployments, there is only one podTemplate per CRD. However, this podTemplate contains information |
| 25 | +about the resource requirements of both the JobManager and TaskManager pods. Karmada cannot currently distinguish between |
| 26 | +these components with the existing ReplicaRequirements API definition. |
| 27 | + |
| 28 | +We could technically add up all the individual component requirements and input those into the replicaRequirements, but Karmada would |
| 29 | +treat this like a "super replica", and try to find a node in the destination namespace that could fit the entire replica. In many cases, |
| 30 | +this is simply not possible. |
| 31 | + |
| 32 | +In this proposal, we would like to enhance the accurate estimator to account for complex CRDs with multiple podTemplates or components. |
| 33 | + |
| 34 | +## Background on our Use-Case |
| 35 | + |
| 36 | +Karmada will be used as an intelligent scheduler for FlinkDeployments. We aim to use the accurate estimator (with the |
| 37 | +ResourceQuota plugin enabled) to estimate whether a FlinkDeployment can be fully scheduled in a candidate destination namespace. |
| 38 | +In order to make this estimation, we need to take into account all of the resource requirements of the components that will be |
| 39 | +scheduled by the Flink Operator. Once the CRD is scheduled by Karmada, the Flink Operator will take over the rest of the component |
| 40 | +scheduling as seen below. |
| 41 | + |
| 42 | + |
| 43 | + |
| 44 | +In the case of Flink, these components are the JobManager(s) (JM) and the TaskManager(s) (TM). Each of these components can consist of |
| 45 | +multiple pods, and the JM and TM frequently do not have the same resource requirements. |
| 46 | + |
| 47 | +## Motivation |
| 48 | + |
| 49 | +Karmada currently provides two methods of scheduling estimation: |
| 50 | +1. The general estimator (which analyzes total cluster resources to determine scheduling) |
| 51 | +2. The accurate estimator (which can inspect namespaced resource quotas and determine |
| 52 | + the number of potential replicas via the ResourceQuota plugin) |
| 53 | + |
| 54 | +This proposal aims to improve the second method by allowing users to define the components of a replica |
| 55 | +and provide precise resource requirements for each of them. |
| 56 | + |
| 57 | +## Goals |
| 58 | + |
| 59 | +- Provide a declarative pattern for defining the resourceRequests for individual replica components |
| 60 | +- Allow more accurate scheduling estimates for CRDs |
| 61 | + |
| 62 | +## Design Details |
| 63 | + |
| 64 | +### API change |
| 65 | + |
| 66 | +The main changes of this proposal are to the API definition of the ReplicaRequirements struct. The proposed change will |
| 67 | +add an optional `Components` field, which can be defined by the user via a ResourceInterpreterCustomization. |
| 68 | + |
| 69 | +Each component will have a `Name`, an optional `PodCount` (this may not be required depending on the accurate estimator implementation), and |
| 70 | +a `ResourceRequest`. These basic fields are necessary to allow the accurate estimator to determine whether all components of the CRD replica |
| 71 | +will be able to fit in the destination namespace. |
| 72 | + |
| 73 | +```go |
| 74 | +// ReplicaRequirements represents the requirements required by each replica. |
| 75 | +type ReplicaRequirements struct { |
| 76 | + // NodeClaim represents the node claim HardNodeAffinity, NodeSelector and Tolerations required by each replica. |
| 77 | + // +optional |
| 78 | + NodeClaim *NodeClaim `json:"nodeClaim,omitempty"` |
| 79 | + |
| 80 | + // ResourceRequest represents the resources required by each replica. |
| 81 | + // +optional |
| 82 | + ResourceRequest corev1.ResourceList `json:"resourceRequest,omitempty"` |
| 83 | + |
| 84 | + // A replica's total resource request may be subdivided into multiple components. |
| 85 | + // These components can be optionally defined in order to make scheduling estimation more precise. |
| 86 | + // +optional |
| 87 | + Components []Components `json:"components,omitempty"` |
| 88 | + |
| 89 | + // Namespace represents the resources namespaces |
| 90 | + // +optional |
| 91 | + Namespace string `json:"namespace,omitempty"` |
| 92 | + |
| 93 | + // PriorityClassName represents the resources priorityClassName |
| 94 | + // +optional |
| 95 | + PriorityClassName string `json:"priorityClassName,omitempty"` |
| 96 | +} |
| 97 | + |
| 98 | +// Components describes a single component of a replica, such as a Flink JobManager or TaskManager. |
| 99 | +type Components struct { |
| 100 | + // Name of the component |
| 101 | + Name string `json:"name"` |
| 102 | + |
| 103 | + // Number of total pods needed by the replica's component |
| 104 | + // +optional |
| 105 | + PodCount int32 `json:"podCount,omitempty"` |
| 106 | + |
| 107 | + // ResourceRequest of an individual pod of the component |
| 108 | + // +optional |
| 109 | + ResourceRequest corev1.ResourceList `json:"resourceRequest,omitempty"` |
| 110 | +} |
| 111 | +``` |
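
To illustrate how the new field might be populated, here is a minimal sketch for a FlinkDeployment. It uses simplified stand-in types rather than the real Karmada and `corev1` types, and the component names, pod counts, and quantities are hypothetical values chosen for illustration. The `totalRequest` helper is also an assumption of this sketch: it shows the summed "super replica" request that the per-component breakdown is meant to replace.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for corev1.ResourceList:
// resource name -> quantity (CPU in millicores, memory in MiB).
type ResourceList map[string]int64

// Component mirrors the proposed Components struct, using simplified types.
type Component struct {
	Name            string
	PodCount        int32
	ResourceRequest ResourceList
}

// flinkComponents returns hypothetical components for one FlinkDeployment replica.
func flinkComponents() []Component {
	return []Component{
		{Name: "jobmanager", PodCount: 1, ResourceRequest: ResourceList{"cpu": 1000, "memory": 2048}},
		{Name: "taskmanager", PodCount: 3, ResourceRequest: ResourceList{"cpu": 2000, "memory": 4096}},
	}
}

// totalRequest sums PodCount * ResourceRequest over all components,
// which is the request a single "super replica" would carry.
func totalRequest(components []Component) ResourceList {
	total := ResourceList{}
	for _, c := range components {
		for name, qty := range c.ResourceRequest {
			total[name] += int64(c.PodCount) * qty
		}
	}
	return total
}

func main() {
	total := totalRequest(flinkComponents())
	fmt.Printf("cpu=%dm memory=%dMi\n", total["cpu"], total["memory"])
}
```

With these example values the summed request is 7000m CPU and 14336Mi of memory, which no single pod needs; the largest individual pod requests only 2000m CPU and 4096Mi.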
| 112 | + |
| 113 | +### Accurate Estimator Changes |
| 114 | + |
| 115 | +Besides the change to the ReplicaRequirements API, we will need to make a code change to the accurate estimator's implementation, |
| 116 | +which can be found here: https://github.com/karmada-io/karmada/blob/5e354971c78952e4f992cc5e21ad3eddd8d6716e/pkg/estimator/server/estimate.go#L59. |
| 117 | + |
| 118 | +Currently the accurate estimator calculates the maxReplica count by: |
| 119 | +1. Running the maxReplica calculation for each plugin enabled by the accurate estimator. |
| 120 | +2. Looping through all nodes to determine whether the replica can fit on any of them. This accounts for the resource fragmentation issue. |
| 121 | + |
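
The two steps above can be sketched roughly as follows. This is not the Karmada implementation: the types are simplified stand-ins, and the sketch assumes the node step sums the whole replicas that fit on each node and then caps the result by the plugin estimate. It is only meant to show why a summed "super replica" request can yield zero even when quota is available.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for corev1.ResourceList (CPU in millicores).
type ResourceList map[string]int64

// Node holds the free resources on a node (hypothetical simplified type).
type Node struct {
	Name      string
	Available ResourceList
}

// replicasOnNode returns how many whole replicas fit on a single node.
// An empty request yields 0 in this sketch.
func replicasOnNode(node Node, request ResourceList) int32 {
	fit := int64(-1)
	for name, qty := range request {
		if qty <= 0 {
			continue
		}
		n := node.Available[name] / qty
		if fit < 0 || n < fit {
			fit = n
		}
	}
	if fit < 0 {
		return 0
	}
	return int32(fit)
}

// estimate caps the plugin estimate (step 1) by the node-level count (step 2).
func estimate(pluginEstimate int32, nodes []Node, request ResourceList) int32 {
	var nodeTotal int32
	for _, n := range nodes {
		nodeTotal += replicasOnNode(n, request)
	}
	if pluginEstimate < nodeTotal {
		return pluginEstimate
	}
	return nodeTotal
}

func main() {
	nodes := []Node{
		{Name: "node-a", Available: ResourceList{"cpu": 4000}},
		{Name: "node-b", Available: ResourceList{"cpu": 4000}},
	}
	// A summed 7000m request fits on no single 4000m node, so the result is 0.
	fmt.Println(estimate(1, nodes, ResourceList{"cpu": 7000}))
	// A 2000m request fits, so the plugin estimate of 1 is returned.
	fmt.Println(estimate(1, nodes, ResourceList{"cpu": 2000}))
}
```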
| 122 | +For step 2, we should change this calculation if there are subcomponents set for the replica. However, with the introduction of subcomponents to the |
| 123 | +replica, we begin to run into an interesting bin-packing problem. |
| 124 | + |
| 125 | + |
| 126 | + |
| 127 | +Here we have a couple of options we can think over: |
| 128 | + |
| 129 | +1. Calculate the precise number of ways we can pack all components into existing nodes *(not recommended)* |
| 130 | +- For this option we would have to loop through each subcomponent and through all nodes to calculate the total number of ways we can pack all subcomponents into the namespace. |
| 131 | +- This would become very expensive, and there is little benefit in being that precise when all we care about is whether the CRD can be scheduled at all. |
| 132 | + |
| 133 | +2. Confirm that all components can be scheduled into one combination of nodes |
| 134 | +- We would instead confirm that each component can fit into one of the possible nodes constrained by our destination namespace. |
| 135 | +- If we confirm that each component can fit in the available nodes, we would simply return the maxReplica estimation made by the plugin, since we know that the CRD can be fully scheduled in the namespace. |
| 136 | +- If we notice that one or more of the components cannot fit in any available node, we would ignore the maxReplica estimation made by the plugin and return 0. |
| 137 | + |
| 138 | + |
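
A minimal sketch of option 2, under the same simplifying assumptions (stand-in types, hypothetical function names, not the actual estimator code): every component's pod must fit on at least one node, otherwise the plugin estimate is discarded and 0 is returned.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for corev1.ResourceList (CPU in millicores).
type ResourceList map[string]int64

// Component carries the per-pod request of one replica component.
type Component struct {
	Name            string
	ResourceRequest ResourceList
}

// Node holds the free resources on a node (hypothetical simplified type).
type Node struct {
	Name      string
	Available ResourceList
}

// fits reports whether a single pod with the given request fits on the node.
func fits(node Node, request ResourceList) bool {
	for name, qty := range request {
		if node.Available[name] < qty {
			return false
		}
	}
	return true
}

// estimateWithComponents returns the plugin estimate if every component
// fits on at least one node, and 0 otherwise.
func estimateWithComponents(pluginEstimate int32, nodes []Node, components []Component) int32 {
	for _, c := range components {
		placed := false
		for _, n := range nodes {
			if fits(n, c.ResourceRequest) {
				placed = true
				break
			}
		}
		if !placed {
			return 0
		}
	}
	return pluginEstimate
}

func main() {
	nodes := []Node{
		{Name: "node-a", Available: ResourceList{"cpu": 4000}},
		{Name: "node-b", Available: ResourceList{"cpu": 4000}},
	}
	ok := []Component{
		{Name: "jobmanager", ResourceRequest: ResourceList{"cpu": 1000}},
		{Name: "taskmanager", ResourceRequest: ResourceList{"cpu": 2000}},
	}
	tooBig := []Component{
		{Name: "jobmanager", ResourceRequest: ResourceList{"cpu": 1000}},
		{Name: "taskmanager", ResourceRequest: ResourceList{"cpu": 8000}},
	}
	fmt.Println(estimateWithComponents(1, nodes, ok))     // plugin estimate is kept
	fmt.Println(estimateWithComponents(1, nodes, tooBig)) // estimate is discarded
}
```

This check treats each component independently: it does not deduct resources as pods are placed, nor does it use `PodCount`, so it can be optimistic when several pods compete for the same node. That trade-off is what keeps it cheap compared with option 1.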