Update for Slurm appliance v2.3 #2
Merged
* fix fatimage build without secrets
* bump CI image
Fix typos in docs
* Test if upload as raw from volume is quicker
* test raw fatimage on top of raw fat image
* bump raw volume-backed images
* convert image to qcow2 during sync workflow
* test large runner
* try qemu compression
* free up space on runner
* test raw 9.5 base image
* Update rocky 8 to raw
---------
Co-authored-by: bertiethorpe <bertie443@gmail.com>
This enables defining extra packages and users based on some conditions,
for example:
appliances_extra_packages_other:
- "{{ 'cuda-toolkit' if 'cuda' in group_names }}"
Allow empty items in extra package and user lists
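As an illustration only: a conditional expression such as the one above renders to an empty string when its condition is false, so the consuming code has to tolerate empty entries. The appliance does this in Ansible; the Python sketch below only mirrors the idea, and the function name `drop_empty_items` is hypothetical.

```python
def drop_empty_items(items):
    """Return items with empty strings and None removed, preserving order.

    Hypothetical helper: mimics ignoring list entries whose conditional
    template rendered to nothing.
    """
    return [item for item in items if item not in ("", None)]


# Example: the middle entry stands for a condition that was false on this host.
rendered_packages = ["htop", "", "cuda-toolkit"]
packages_to_install = drop_empty_items(rendered_packages)
```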
The hpctests role creates a directory that is user-owned and group-owned by hpctests_user, which defaults to ansible_user. However, this group is not guaranteed to exist, especially in LDAP environments where groups dedicated to a single user are often not used. Support customising the value via the hpctests_group variable, while still defaulting to hpctests_user.
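The defaulting chain described above can be sketched as follows. This is a Python illustration of the lookup order only, not the role's actual Ansible code; the function name `resolve_owner` is hypothetical, while the variable names are those given in the commit message.

```python
def resolve_owner(variables):
    """Return (user, group) for the hpctests directory.

    hpctests_user falls back to ansible_user, and hpctests_group falls
    back to whatever hpctests_user resolved to.
    """
    user = variables.get("hpctests_user") or variables["ansible_user"]
    group = variables.get("hpctests_group") or user
    return user, group


# With no overrides, both owner and group follow ansible_user.
default_owner = resolve_owner({"ansible_user": "rocky"})
# In an LDAP-style environment, a shared group can be named explicitly.
ldap_owner = resolve_owner({"ansible_user": "rocky", "hpctests_group": "users"})
```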
Fix nightly-cleanup workflow
Fix creation of hpctests directory
* bump zenith client to 0.14
* try removing zenith_proxy_client_auth_params
* try host networking for zenith
* Revert "try host networking for zenith" This reverts commit aa11f439ffa6b594b8cc23af8d01a30f4e045e08.
* bind grafana to all interfaces for caas
* revert zenith pods to slirp4netns to allow rootless pods to reach host
* Revert "bind grafana to all interfaces for caas" This reverts commit 8d264278ed1b2847c85472ef4788d2a612fbf5eb.
* remove commented-out zenith config
* get caas hpctests working with root-squashed nfs
* combine caas groups files (now not symlinked from everything anyway)
* caas: mount homedirs on control node too for manila for consistency and fix homedir creation
* allow setting packer cleanup option on fatimage builds
* use grafana packages from ark/pulp
* bump CI image
* disable grafana repo when templating so existing logic works
* bump grafana from v9 to v10
* bump CI image
* fix site failing to template out dnf password for grafana repofile
* bump CI image
* avoid leaking Ark creds from local build into site.yml
* bump CI image
* add cluster_nodename_template and compute/login.nodename_template
* refactor nodename templating for DRY
* add environment_name to nodename template
* bump opentofu version in CI
* enable caas to turn on dnf_repos via extravars
* don't disable repos at end of play when dnf_repos_enabled
…1 (#668)
* bump OpenHPC snapshots to v3.1.1 (slurm 24.11.5) and v2.9.1 (slurm 23.11.11) for CVE-2025-43904
* bump CI image
* extend timeout for slurmdbd startup to cope with major version upgrade on startup
* bump OpenHPC snapshots to v3.1.1 (slurm 24.11.5) and v2.9.1 (slurm 23.11.11) for CVE-2025-43904
* bump CI image
* extend timeout for slurmdbd startup to cope with major version upgrade on startup
* configure openhpc for slurmdbd backup/update
* support mysql tasks in openhpc role
* remove slurmdbd startup timeout increase - got borked during merge from main
* mysql package now installed separately in role from openhpc_packages
* bump CI image to get mysql client installed
* delete snapshot when cleaning up in CI
* bump openhpc role to commit
* bump openhpc role to release
* bump OpenHPC snapshots to v3.1.1 (slurm 24.11.5) and v2.9.1 (slurm 23.11.11) for CVE-2025-43904
* bump CI image
* extend timeout for slurmdbd startup to cope with major version upgrade on startup
* configure openhpc for slurmdbd backup/update
* support mysql tasks in openhpc role
* remove slurmdbd startup timeout increase - got borked during merge from main
* mysql package now installed separately in role from openhpc_packages
* bump CI image to get mysql client installed
* delete snapshot when cleaning up in CI
* automate image release
Fix typo in comment
* PoC of automating partition/nodegroup config
* update rebuild docs
* fixup ondemand partitions for openhpc_partitions
* fixup rebuild role docs for openhpc_partitions
* fix caas for openhpc_partitions/openhpc_nodegroups
* make caas provisioning less confusing
* fix openhpc_partition config for stackhpc.openhpc groups->nodegroups change
* run stackhpc_openhpc validation
* fix caas nodegroups typo
* fix partitions for caas and non-rebuilt-enabled clusters
* bump openhpc role to release
The bandwidthTest utility was removed from CUDA Samples v12.9 [1]. Remove it from the samples playbook for now: it can be replaced by nvbandwidth [2] later. [1] https://github.com/NVIDIA/cuda-samples/releases/tag/v12.9 [2] https://github.com/NVIDIA/nvbandwidth
Running `cookiecutter skeleton` fails with:
Unable to create file 'inventory/group_vars/all/alertmanager.yml'
Error message: 'vault_alertmanager_slack_integration_app_creds' is undefined
This is caused by cookiecutter templating which needs to be escaped.
Fix a typo as well.
* optional home vol using vol size
* use home_volume_provisioning
* automatically modify nfs configuration depending on home volume
* remove dead tf code
* fix prod docs
* make state volume provisioning optional
* address review comments
* allow specifying root volume type
* Update environments/skeleton/{{cookiecutter.environment}}/tofu/variables.tf
Co-authored-by: Matt Anson <matta@stackhpc.com>
---------
Co-authored-by: Matt Anson <matta@stackhpc.com>
Move HPL source download to fatimage build
* bump openhpc: encoded munge key, always configless
* bump openhpc role to release
* bump CI image
* Revert "bump CI image" This reverts commit ee5acb5f8a0596e98e3743025fdcbfa979214fa2.
* bump CI image again to try to trigger CI
* Moved cookiecutter tofu to site environment
* updated CI environment
* Updated docs for new environment structure
* review comments
Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>
* docs updates
* typo
Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>
* removed topology from default groups + added docs
---------
Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>
* Image sync CI to use leafcloud S3
* Use leafcloud s3 config
* Update release image workflows and upgrades.md bucket url
Allow specifying security groups for individual login groups
* configure rstudio on compute node
* configure rstudio app on compute node
* fix rstudio_compute.yml
* improve ood app install logic
* Add MATLAB ood app configuration
* add and configure VSCode OOD app
* WIP: debug rstudio app errors
* reconfigure ood code-server app
* Document lack of out-the-box MATLAB functionality
* remove groups
* re-add inventory
* bump CI images
* Use lustre-release mirror
* bump ood app releases, update matlab submit script, optimise portal.yml
* fix widget selection order
* openondemand.md improvements
* bump CI images
* undo bastion edit
* del duplicate sentence in openondemand.md
* Update docs/openondemand.md
Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>
---------
Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>
* allow setting volume type for extra_volumes
* Added config drive option to tofu
* docs suggestions
Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>
* config drive defaults to null + added to additional nodegroups
---------
Co-authored-by: Steve Brasier <steveb@stackhpc.com>
Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>
* allow setting volume type for extra_volumes
* Added config drive option to tofu
* add option for additional user data
* docs suggestions
Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>
* made user_data node variables non-nullable
---------
Co-authored-by: Steve Brasier <steveb@stackhpc.com>
Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>
slurm.conf common env updated
When the line displaying the group of nodes used tokenises as four items, it results in a parse error when Python attempts to cast the line as integers and floats.
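One way to guard against this kind of failure is to treat a four-token line as data only if every token actually converts to a number. The sketch below is a minimal illustration, not the fix applied in this PR: the function names, the four-column assumption, and the sample lines are all hypothetical, since the real output format is not shown here.

```python
def _to_number(token):
    """Convert a token to int if possible, else float; raise ValueError otherwise."""
    try:
        return int(token)
    except ValueError:
        return float(token)


def parse_numeric_row(line, expected_fields=4):
    """Return a tuple of numbers for a data row, or None for any other line.

    A line that merely happens to split into `expected_fields` tokens
    (for example, one listing node names) is rejected instead of raising.
    """
    tokens = line.split()
    if len(tokens) != expected_fields:
        return None
    try:
        return tuple(_to_number(token) for token in tokens)
    except ValueError:
        return None


# Hypothetical sample lines: one numeric row, one node listing with four tokens.
data_row = parse_numeric_row("1 1000 0.35 2.86")
node_row = parse_numeric_row("Nodes: node-0 node-1 node-2")
```

Returning `None` for non-numeric lines lets the caller skip them with a simple filter rather than wrapping the whole parsing loop in exception handling.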
sjpb
approved these changes
Sep 5, 2025
Collaborator
sjpb
left a comment
It worked, so let's call it good. Thank you!
No description provided.