FederatedResourceQuota should be failover friendly #5179
Comments
First of all, thanks for bringing this up. I'm glad to enhance it with real-world use cases. Generally, the
I guess your idea is to let the user declare a total quota by FederatedResourceQuota for a specific namespace, and the quota can be shared across clusters.
Thanks for taking a look!
Pretty much, yes. In our use case, we have a controller that syncs a tenant's ResourceQuota on each member cluster to be equal to the tenant's limits (let's say 40 CPU and 50 GB). Each cluster will have an identical static ResourceQuota, so that one cluster can accommodate all of the tenant's workloads if necessary in the case of DR. But we want the FederatedResourceQuota to monitor the existing quota usage across all clusters and set a limit on the amount of resources that can be applied to the Karmada control plane. In the comments you linked, these two were most relevant:
Perhaps this would require some sort of admission webhook that would prevent resources from being applied if their total resource usage would go above the limits defined in the FederatedResourceQuota. This would mirror the way that ResourceQuotas work in K8s. The more difficult part would be determining when to replenish the quota (perhaps when a Work is deleted?).
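To make the webhook idea concrete, here is a minimal sketch of the admission check it would perform: sum the `status.used` of each member cluster's ResourceQuota, add the incoming request, and reject if the overall FederatedResourceQuota limit would be exceeded. This is an illustration only; the `ResourceList`, `totalUsed`, and `admit` names are hypothetical, and plain `int64` quantities stand in for `resource.Quantity`.

```go
package main

import "fmt"

// ResourceList maps a resource name ("cpu", "memory") to a quantity in base
// units (millicores for CPU, bytes for memory). A plain int64 is used instead
// of k8s resource.Quantity to keep the sketch self-contained.
type ResourceList map[string]int64

// totalUsed sums the status.used reported by each member cluster's ResourceQuota.
func totalUsed(perCluster []ResourceList) ResourceList {
	sum := ResourceList{}
	for _, used := range perCluster {
		for name, qty := range used {
			sum[name] += qty
		}
	}
	return sum
}

// admit rejects a request whose usage, added to the current federated total,
// would exceed the overall limits declared in the FederatedResourceQuota.
func admit(overall, used, requested ResourceList) error {
	for name, limit := range overall {
		if used[name]+requested[name] > limit {
			return fmt.Errorf("quota exceeded for %s: used %d + requested %d > limit %d",
				name, used[name], requested[name], limit)
		}
	}
	return nil
}

func main() {
	overall := ResourceList{"cpu": 40000, "memory": 50 << 30} // 40 CPU, 50 GiB
	used := totalUsed([]ResourceList{
		{"cpu": 12000, "memory": 10 << 30}, // cluster A
		{"cpu": 20000, "memory": 20 << 30}, // cluster B
	})
	fmt.Println(admit(overall, used, ResourceList{"cpu": 5000}))  // 32+5 CPU fits
	fmt.Println(admit(overall, used, ResourceList{"cpu": 10000})) // 32+10 > 40, rejected
}
```

The replenishment question maps onto the same structure: deleting a Work would subtract its usage from the per-cluster totals before the next `admit` call.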
Yes, exactly. In addition, the scheduler should also take the resource quota into account and avoid scheduling workloads to clusters that would exceed the limit. By the way, I might be slow to respond on this topic, as I want to pay more attention to #5116 and #5085 and the others we planned in the current release. But I'm interested and glad to have this discussion, and hope to keep this open and welcome other people to join.
That's alright! Apologies for all the issues that have been filed recently - one at a time. :)
What would you like to be added:
A way for the FederatedResourceQuota to monitor existing ResourceQuotas (without managing those ResourceQuotas) and impose resource limits on the user based on the sum of all currently used quota.
Why is this needed:
The existing FederatedResourceQuota mirrors the behavior of a typical Kubernetes ResourceQuota by imposing total resource limits in a multi-cluster setup. This is done by creating statically distributed ResourceQuotas across the specified member clusters, whose limits sum to the limits defined in the FederatedResourceQuota. This works if the user does not need to worry about DR events, which require backup resources dedicated for failover in the event of a disaster.
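For reference, the static distribution described above can be expressed with the existing FederatedResourceQuota API (to the best of my understanding of the `policy.karmada.io/v1alpha1` fields; the namespace, cluster names, and figures are illustrative):

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: FederatedResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  overall:              # total limits across all member clusters
    cpu: "40"
    memory: 50Gi
  staticAssignments:    # a ResourceQuota with these hard limits is created
    - clusterName: cluster-a   # in each listed member cluster
      hard:
        cpu: "20"
        memory: 25Gi
    - clusterName: cluster-b
      hard:
        cpu: "20"
        memory: 25Gi
```

The limitation for our use case is exactly this split: each cluster only gets a fraction of the overall quota, so no single cluster can absorb all workloads during failover.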
In our case, since we are using Karmada for its failover feature, we would like clusters to have additional available quota for each namespace, so that in the event of a disaster all applications are able to be rescheduled:
In the diagram above, we can see that the total limits of the FederatedResourceQuota are 40/40 CPU and 50 GB / 50 GB memory. Individual clusters will have the same limit, so that in the case of a DR event, all workloads can be scheduled on one cluster.
Above, we see that during a failover all workloads from Cluster A will be migrated to Cluster B, where there will be enough available resources to schedule all required pods. With the existing statically defined ResourceQuotas, we cannot support this type of failover, because each cluster's quota is capped at its share of the total.
We've created this ticket to start a discussion on how best to address this limitation, and on whether this use case is valid.