Skip to content

Conversation

ebirn
Copy link

@ebirn ebirn commented Aug 6, 2025

Slurms behavior with sacct and scancel (with specific options) changed when no accounting database is configured:

[supercomputer ~]# sacct 
Slurm accounting storage is disabled

[supercomputer ~]# scancel 123 --clusters=all
scancel: error: Problem talking to database
scancel: error: No clusters can be reached now. Contact your admin to resolve the problem.
scancel: fatal: Could not get cluster information

My Slurm version in use:

# scancel --version
slurm 24.05.4

Slurm job status

For checking the job status an aequivalent commant to the current implementation can be run with squeue (given the Slurm config has sufficiently long, i.e. MinJobAge=86400).

Example an alternative status command:

$ squeue --format=%i\|%T --states=all --noheader --name b722e167-0559-4add-8c32-2fcc7aa80a62
3266|CANCELLED
3267|CANCELLED
[...]

slurm job cancel

For the cancel command, I'd rather remove the --clusters=all flag, and make the default just to issue the cancel on the "local" cluster (with the option to override it). But I'm not an active Snakemake user, so the discussion about default settings is better left to someone else ;)

Copy link
Contributor

coderabbitai bot commented Aug 6, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@ebirn ebirn changed the title make Slurm status and cancel command configurable feat: make status and cancel command configurable Aug 6, 2025
@cmeesters
Copy link
Member

cmeesters commented Aug 6, 2025

@ebirn thank you for your contribution!

I am open to configurable status and cancel commands. However, the following criteria apply:

  • there must be a viable alternative (which I fail not see, perhaps you can elaborate)
  • feature completeness must be preserved (that is particularly regarding the sacct command, which enables a number of checks. Also: sacct enables status checking for non-running and non-pending jobs, in contrast to squeue. sacct allows for checking of all jobs of a workflow with a single query, too.)

edit: the --all option has been introduced specifically because people required it, and it does no harm for non-multicluster setups. Yet, people have to have a fully functional cluster, including an accounting database.

@ebirn
Copy link
Author

ebirn commented Aug 6, 2025

Hi @cmeesters,

In the current form, Snakemake does not work with Slurm unless the accounting database is configured (i.e. slurmdbd + mariaDB). We have some smaller Slurm installations, where we don't setup the accouting database.

The assumption to be able to use sacct without accouting db as stated here: https://github.com/snakemake/snakemake-executor-plugin-slurm/blob/d8d605c5f31ff0148ad799efd34e87ce245f4461/snakemake_executor_plugin_slurm/__init__.py#L434C39-L434C52 is not valid in current versions of Slurm (see my PR description, sacct will not give any job information without accouting db).

squeue does allow to check all job status pending, runnning, completed, etc. see: https://slurm.schedmd.com/squeue.html#OPT_states So, for the purpose of checking job status, this can be done alternatively with squeue.

I've left the default setting for the configurable status command to what it is now (sacct) but just added the option for the user to provide their own status check command. It is not mandatory to provide these command, and the default behavior does not change

The scancel --clusters=all option I've put back in, as to not change default behaviour.

This way, the previous default behaviour does not change, the PR is only adding the option to override these commands.

As for feature completeness: It is either a choice to support Slurm setups without accounting database. The efficiency report would also require sacct (and the accounting database) to retrieve the resource usage statistics, but this option is by default also set to false.

@cmeesters
Copy link
Member

I consider a cluster without accounting db as not completely set up. Hence, there is no assumption, that sacct will work without an accounting db, but rather that some sort of accounting, albeit volatile, can easily be set up and is a rather sensible thing to do - for admins and users alike.

  • that squeue reliably displays completed jobs is news to me, I must admit. Will have to test (we are a year behind your SLURM version and the plugin has to work for a range of versions).
  • if squeue worked reliably in this regard, I would welcome an easy, (perhaps) preconfigured, switch. Wouldn't you agree to have a range of preconfigured options better in terms of abstracting complexity and ease of use?

We would, however, need a test suite to do that. Are you familiar with the test framework we use? Else we might have to schedule a meeting, no necessarily with you (but welcoming you, if you like) for this is not an easy task. I'm afraid, your case is not yet covered by our framework.

@ebirn
Copy link
Author

ebirn commented Aug 6, 2025

Large clusters will have accounting enabled, but we provide Slurm also on smaller application specific appliances, where accounting is not an issue, but it is used for ordering access to resources to i.e. GPUs.

For the switch, you imagine something like a --no-sacct switch, that switches to squeue for status and calls scancel without the --clusters=all option?

I'm not familiar with the test framework, sorry.

@cmeesters
Copy link
Member

For the switch, you imagine something like a

Something along the line - yes. Have to think about it. I am open for suggestions.

Large clusters will have accounting enabled ...

My argument was always, that without, there is information missing for admins and users alike. If the missing functionality is indeed now present for squeue, we can have that switch. I have to admit: I do wonder how retrieving the job status for completed jobs (longer than a minute or two in the past) can be accomplished without a DB. But I cannot test this myself - it does not work on our clusters. Can you confirm, it does for your cluster?

I will attempt a PR with test cases. We would need to update your PR accordingly.

I understand the urgency in this case. (Although setting up SLURM with an accounting DB is rather straightforward.) Please understand that I am under teaching obligations and currently without a cluster to test on (it is being moved). So, no promises as to timely solutions.

As for the scancel issue: This is related but different. How about a separate PR (I can do this), introducing a switch for multiclusters? In this case, the default would be without --all and no DB query can fail.

I am however reluctant to introduce more and more switches: the UI should not be too obfuscated. So, whenever you have an ideain that regard, let me know about it, please.

@ebirn
Copy link
Author

ebirn commented Aug 11, 2025

I can confirm that the functionality is available through squeue (see the specific example in the PR description). 2 conditions must be met for squeue to show this information:

  1. use the flag --states=all, to report all states of all jobs in slurmctld's memory. See https://slurm.schedmd.com/squeue.html#OPT_states
  2. The slurmd.conf contains a stanza to MinJobAge=<seconds> to keep completed jobs (and their state) in memory for a longer time - i.e. I use MinJobAge=86400, see https://slurm.schedmd.com/slurm.conf.html#OPT_MinJobAge

Then, this command will produce the identical output as the sacct:

squeue --format=%i\|%T --states=all --noheader --name <snakemake_uuid>
3266|CANCELLED
3267|CANCELLED
3268|COMPLETED
3269|RUNNING
3270|PENDING

I have the code of this PR running successfully on our development Slurm setup.

As for the UI of CLI options, this will be your decision. However, I found it as a clean setup, to keep the existing defaults (which will work on most Slurm setups), but allow an expert switch to override the specific commands. Maybe you can mark those flags as an "expert" option?

Otherwise you're running the risk of introducing more and more "special" case switches for different scenarios, just to accommodate the next more specific scenario. (this is just my take on how to deal with CLI options)

@cmeesters
Copy link
Member

I have been working on a test. However, without a cluster (should be up tomorrow 🤞), I cannot test a thing, and it is a mess right now, so no commit.

As to point 2: I think we need to query this parameterization to make the choice available. After all, this is volatile memory and I assume it will not be configured at all to some substantial time on many clusters. What is your take on a minimum time which is sensible? My guess for long-running workflows it ought to be 12 h minimum. This is because, a workflow can easily run for days in extreme scenarios and jobs need to be queried. On the other side, a bigger cluster will not cache such information as there may be too much information to store (considering the number of jobs - and a user might cause OOM issues for the ctld if a permutation test goes wrong). So, my bet would be to query this parameterization and make one or both options available, based on the presence of commands (leaning precedence to sacct) and offer the choice only, if both options are available and reasonably save. What do you think?

Please understand that we need to deploy this as stable as possible. I am glad, you found a working solution and appreciate your PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants