feat: make status and cancel command configurable #336
Conversation
@ebirn thank you for your contribution! I am open to configurable status and cancel commands. However, the following criteria apply:
edit: the
Hi @cmeesters, in its current form, Snakemake does not work with Slurm unless the accounting database is configured (i.e. slurmdbd + MariaDB). We have some smaller Slurm installations where we don't set up the accounting database. The assumption that `sacct` can be used without an accounting DB, as stated here: https://github.com/snakemake/snakemake-executor-plugin-slurm/blob/d8d605c5f31ff0148ad799efd34e87ce245f4461/snakemake_executor_plugin_slurm/__init__.py#L434C39-L434C52, is not valid in current versions of Slurm (see my PR description: `sacct` will not return any job information without the accounting DB).
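For context, the status query at the linked line is an `sacct` call; a sketch of that kind of query follows, using standard `sacct` flags (an approximation, not the verbatim command from the source):

```sh
# Approximation of an sacct-based status query (standard flags; not the
# verbatim command from the linked source). <run_uuid> is a placeholder.
# Without slurmdbd, this yields no job records in current Slurm versions.
sacct -X --parsable2 --noheader --format=JobIdRaw,State --name <run_uuid>
```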
I've left the default setting for the configurable status command at what it is now (the `sacct`-based query), and likewise for the `scancel` command. This way, the previous default behaviour does not change; the PR only adds the option to override these commands. As for feature completeness: it is ultimately a choice whether to support Slurm setups without an accounting database. The efficiency report would also require `sacct` (and the accounting database) to retrieve the resource usage statistics, but that option is also disabled by default.
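To illustrate the override idea: the actual option names introduced by this PR are not visible in this thread, so the flags below are hypothetical:

```sh
# Hypothetical option names, for illustration only: keep the defaults
# unless the user explicitly overrides the status/cancel commands.
snakemake --executor slurm \
  --slurm-status-command 'squeue --noheader --states=all --format=%i|%T' \
  --slurm-cancel-command 'scancel'
```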
I consider a cluster without an accounting DB as not completely set up. Hence, there is no assumption that `sacct` works without one.
We would, however, need a test suite to do that. Are you familiar with the test framework we use? Otherwise we might have to schedule a meeting, not necessarily with you (but welcoming you, if you like), for this is not an easy task. I'm afraid your case is not yet covered by our framework.
Large clusters will have accounting enabled, but we also provide Slurm on smaller, application-specific appliances, where accounting is not an issue but is used for ordering access to resources, e.g. GPUs. For the switch, do you imagine something like a dedicated flag?

I'm not familiar with the test framework, sorry.
Something along those lines, yes. I have to think about it. I am open to suggestions.
My argument has always been that without it, information is missing for admins and users alike. If the missing functionality is indeed now present via `squeue`, I will attempt a PR with test cases. We would need to update your PR accordingly. I understand the urgency in this case. (Although setting up Slurm with an accounting DB is rather straightforward.) Please understand that I am under teaching obligations and currently without a cluster to test on (it is being moved). So, no promises as to timely solutions.

As for the switch: I am, however, reluctant to introduce more and more switches; the UI should not become too obfuscated. So, whenever you have an idea in that regard, please let me know.
I can confirm that the functionality is available through `squeue` (see the specific example in the PR description). Two conditions must be met for `squeue` to behave equivalently; then this command will produce output identical to the current `sacct`-based one.
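As a sketch of the claimed equivalence, using standard `sacct`/`squeue` flags and an example job ID (not the exact commands from the PR description):

```sh
# Given a sufficiently long MinJobAge, both queries report the same
# 'JobID|State' pairs; 12345 is an example job ID.
sacct  -X --parsable2 --noheader --format=JobIdRaw,State --jobs 12345
squeue --noheader --states=all --format='%i|%T' --jobs 12345
```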
I have the code of this PR running successfully on our development Slurm setup. As for the UI of the CLI options, this will be your decision. However, I find it a clean design to keep the existing defaults (which will work on most Slurm setups) while allowing an expert switch to override the specific commands. Maybe you can mark those flags as an "expert" option? Otherwise you run the risk of introducing more and more special-case switches for different scenarios, just to accommodate the next, more specific scenario. (This is just my take on how to deal with CLI options.)
I have been working on a test. However, without a cluster (it should be up tomorrow 🤞), I cannot test a thing, and it is a mess right now, so no commit.

As to point 2: I think we need to query this parameterization to make the choice available. After all, this is volatile memory, and I assume it will not be configured at all for some substantial time on many clusters. What is your take on a sensible minimum time? My guess is that for long-running workflows it ought to be 12 h minimum, because a workflow can easily run for days in extreme scenarios and jobs need to be queried. On the other hand, a bigger cluster will not cache such information, as there may be too much to store (considering the number of jobs; a user might cause OOM issues for the ctld if a permutation test goes wrong). So, my bet would be to query this parameterization and make one or both options available based on the presence of commands (leaning precedence to `sacct`).

Please understand that we need to deploy this as stably as possible. I am glad you found a working solution and appreciate your PR!
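If querying the cluster's parameterization is the way to go, standard `scontrol` output should suffice; a sketch:

```sh
# Check how long finished jobs stay visible to squeue, and whether an
# accounting storage backend is configured at all.
scontrol show config | grep -i MinJobAge
scontrol show config | grep -i AccountingStorageType
```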
Slurm's behavior with `sacct` and `scancel` (with specific options) changes when no accounting database is configured:
My Slurm version in use:
Slurm job status
For checking the job status, an equivalent command to the current implementation can be run with `squeue` (given the Slurm config has a sufficiently long `MinJobAge`, e.g. `MinJobAge=86400`). Example of an alternative status command:
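A reconstruction along these lines, using standard `squeue` flags (`<run_uuid>` is a placeholder, and the exact format string in the PR may differ):

```sh
# squeue-based status check (reconstruction with standard flags): report
# 'JobID|State' for all jobs of this workflow run, including finished
# jobs still retained per MinJobAge. <run_uuid> is a placeholder.
squeue --noheader --states=all --format='%i|%T' --name <run_uuid>
```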
Slurm job cancel
For the cancel command, I'd rather remove the `--clusters=all` flag and make the default just to issue the cancel on the "local" cluster (with the option to override it). But I'm not an active Snakemake user, so the discussion about default settings is better left to someone else ;)
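For reference, the two variants under discussion, both using standard `scancel` options (`<jobid>` is a placeholder):

```sh
# Current behaviour: fan the cancellation out to all clusters; resolving
# cluster names goes through slurmdbd, so this fails without accounting.
scancel --clusters=all <jobid>

# Proposed default: cancel on the local cluster only.
scancel <jobid>
```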