-
Notifications
You must be signed in to change notification settings - Fork 146
feat: pod overlay for kubernetes scheduler (#1067,#1068) #1148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Please take a look @kiukchung and @d4l3k and let me know if this something that looks good. I'll then include more testing evidence if that's the case. |
5d53679 to
1eaaa28
Compare
1eaaa28 to
ce37a51
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1148 +/- ##
==========================================
- Coverage 91.65% 91.59% -0.06%
==========================================
Files 83 83
Lines 6483 6618 +135
==========================================
+ Hits 5942 6062 +120
- Misses 541 556 +15
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ce37a51 to
661bfa3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--
thanks for the update! LGTM. One small comment:
|
|
That's a good question, @kiukchung ! Technically we override keys for dicts so by default it would perhaps be least surprising if we also override the list but I think we can add a support for append also.
role.metadata = {
"kubernetes": lambda pod_dict: {
**pod_dict,
"spec": {
**pod_dict["spec"],
"tolerations": pod_dict["spec"].get("tolerations", []) + [{"key": "new"}]
}
}
}and we can load yaml or whatever using code: import yaml
import fsspec
with fsspec.open("file:///path/to/overlay.yaml", "r") as f:
overlay_dict = yaml.safe_load(f)
role.metadata={
"kubernetes": lambda pod: {
**pod,
"spec": {
**pod["spec"],
**overlay_dict.get("spec", {}),
"tolerations": pod["spec"].get("tolerations", []) + overlay_dict.get("spec", {}).get("tolerations", [])
}
}
}
{
"kubernetes": [
{
"op": "add",
"path": "/spec/nodeSelector/gpu",
"value": "true"
},
{
"op": "add",
"path": "/spec/tolerations/-",
"value": {
"key": "nvidia.com/gpu",
"operator": "Exists"
}
}
]
}I don't want to invent anything new here @kiukchung , it should be ideally some standard tools only, so perhaps a lambda would be ideal, especially since TorchX is investing in programmatic approaches. |
661bfa3 to
483f7af
Compare
|
@clumsy thanks for the detailed proposals! I like RFC 6902 if we were implementing a more flexible and complete overlay mechanism. I'll definitely take this into consideration for the "official" overlay API. In the meanwhile, considering what you have is scoped to the kubernetes scheduler and to unblock you, I think what you have is fine (e.g. upsert on maps, and no append on lists). Once the tests pass I'll merge. |
|
Does it mean you're in favor of the 1st revision with a new runopt vs. 2nd revision metadata lambda? @kiukchung |
Ah this is an area where TorchX lacks clarity for devs extending TorchX today. In my POV there are two somewhat tangential decisions here:
In regards to #1:
Therefore, both from a philosophical and technical POV it makes sense for us to carry overlays via the In regards to #2 Today we don't have an "official" specification on how the overlay is applied. Each scheduler implementing overlays decides how the overlay is applied. For instance, our internal scheduler implementation uses the following logic:
The logic you've implemented for k8s is #1 and #3 but not #2. If we adopt RFC-6902, then the overlay specified in That said, I think RFC-6902 should be adopted when we officially implement an overlay API, that could potentially look something like: |
98cf09a to
483f7af
Compare
|
Thanks for a very detailed context, @kiukchung This is super useful! I think with that I have a proposal to get us 1-3 using the approach you already approved (1st revision): we can use Please let me know what you think, @kiukchung |
483f7af to
3dd0f1e
Compare
|
Done @kiukchung |
Adding support for custom kubernetes pod overlay.
This fixes #1067 and #1068.
This way we don't need to add a runopts for each spec parameter.
Test plan:
[x] added unit tests