5 changes: 5 additions & 0 deletions ansible/adhoc/cudatests.yml
@@ -7,3 +7,8 @@
- ansible.builtin.import_role:
Collaborator

Given we don't even run deviceQuery, I think we should just remove this task entirely TBH. But leave the role for now, pending more thought!

    name: cuda
    tasks_from: samples.yml

- name: Run CUDA bandwidth tasks
  ansible.builtin.import_role:
    name: cuda
    tasks_from: bandwidth.yml
3 changes: 3 additions & 0 deletions ansible/roles/cuda/defaults/main.yml
@@ -16,3 +16,6 @@ cuda_samples_programs:
- bandwidthTest
# cuda_devices: # discovered from deviceQuery run
cuda_persistenced_state: started
# variables for nvbandwidth (for bandwidth.yml tasks run in cudatests.yml)
cuda_bandwidth_path: "/var/lib/{{ ansible_user }}/cuda_bandwidth"
cuda_bandwidth_release_url: "https://github.com/NVIDIA/nvbandwidth/archive/refs/tags/v0.8.tar.gz"
Collaborator

Can we add a separate _version var please and use that in the above? Makes grepping/overriding easier.
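
A minimal sketch of what this could look like, assuming the tag is split into its own variable and the URL and unpacked directory are derived from it. The names `cuda_bandwidth_version` and `cuda_bandwidth_src_dir` are hypothetical and do not appear in the PR:

```yaml
# Hypothetical defaults: cuda_bandwidth_version and cuda_bandwidth_src_dir are illustrative names
cuda_bandwidth_version: "0.8"
cuda_bandwidth_path: "/var/lib/{{ ansible_user }}/cuda_bandwidth"
cuda_bandwidth_release_url: "https://github.com/NVIDIA/nvbandwidth/archive/refs/tags/v{{ cuda_bandwidth_version }}.tar.gz"
# The tasks in this PR expect the archive to unpack to nvbandwidth-0.8, i.e. nvbandwidth-<version>
cuda_bandwidth_src_dir: "{{ cuda_bandwidth_path }}/nvbandwidth-{{ cuda_bandwidth_version }}"
```

With something like this, the hardcoded `nvbandwidth-0.8` paths in `bandwidth.yml` could refer to the derived directory instead, so overriding the version would only need one variable.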

56 changes: 56 additions & 0 deletions ansible/roles/cuda/tasks/bandwidth.yml
@@ -0,0 +1,56 @@
---
- name: Ensure cuda_bandwidth_path exists
Collaborator

Suggested change
- - name: Ensure cuda_bandwidth_path exists
+ - name: Ensure cuda_bandwidth_path directory exists

  ansible.builtin.file:
    state: directory
    path: "{{ cuda_bandwidth_path }}"
    owner: "{{ ansible_user }}"
    group: "{{ ansible_user }}"
    mode: "0755"

- name: Download CUDA bandwidth test release
  ansible.builtin.unarchive:
    remote_src: true
    src: "{{ cuda_bandwidth_release_url }}"
    dest: "{{ cuda_bandwidth_path }}"
    owner: "{{ ansible_user }}"
    group: "{{ ansible_user }}"
    creates: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8"

- name: Creates CUDA bandwidth test build directory
Collaborator

Can you make the name: consistent with the name: on the first task please?

  ansible.builtin.file:
    state: directory
    path: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build"
    mode: "0755"

- name: Ensure cudatests directory exists
  ansible.builtin.file:
    path: "{{ appliances_environment_root }}/cudatests"
    state: directory
    mode: '0755'
  delegate_to: localhost

- name: Build CUDA bandwidth test
  ansible.builtin.shell:
    cmd: |
      source /cvmfs/software.eessi.io/versions/2023.06/init/bash &&
      module load Boost/1.82.0-GCC-12.3.0 &&
      . /etc/profile.d/sh.local && cmake .. &&
      make -j {{ ansible_processor_vcpus }}
    chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build"
    creates: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/nvbandwidth"

- name: Run CUDA bandwidth test

Check failure on line 42 in ansible/roles/cuda/tasks/bandwidth.yml
GitHub Actions / Lint / Lint
no-changed-when: Commands should not change things if nothing needs doing.
Collaborator

I think this needs changed_when: true to subdue the check error

  ansible.builtin.shell: |
Collaborator

So:

  1. Rather than using export you can use the environment keyword (see the Ansible docs).
  2. Why do we have to mess with LD_LIBRARY_PATH?
  3. If we really do, this approach won't work because it hardcodes the microarch (zen4), which will definitely break (e.g. on an Intel processor), and the versions, which doesn't seem robust.

Is it not sufficient to just activate EESSI again? And maybe load some EESSI modules?

    export LD_LIBRARY_PATH=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GCCcore/12.3.0/lib64:\
    /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Boost/1.82.0-GCC-12.3.0/lib
    ./nvbandwidth
  args:
    chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/"
  register: cuda_bandwidth_output
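
The two review comments on this task (the `changed_when` lint fix and activating EESSI instead of hardcoding `LD_LIBRARY_PATH`) could be combined roughly as follows. This is a sketch only: it assumes that sourcing the EESSI init script and loading the same Boost module used in the build task is enough for `nvbandwidth` to find its libraries at run time, which has not been verified here.

```yaml
- name: Run CUDA bandwidth test
  ansible.builtin.shell:
    cmd: |
      source /cvmfs/software.eessi.io/versions/2023.06/init/bash &&
      module load Boost/1.82.0-GCC-12.3.0 &&
      ./nvbandwidth
    chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/"
    executable: /bin/bash  # `source` is a bash builtin, so do not rely on /bin/sh
  register: cuda_bandwidth_output
  changed_when: true  # as suggested in review, to satisfy the no-changed-when lint rule
```

Letting EESSI resolve the library paths avoids baking the CPU microarchitecture into the task. If extra variables do turn out to be necessary, the task-level `environment:` keyword is the place for them rather than `export` inside the script.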

- name: Save CUDA bandwidth output to bandwidth_results.txt
Collaborator

Is there no useful summary we can do here?

  ansible.builtin.copy:
Collaborator

Given this is fetching a file, why does this not use ansible.builtin.fetch?

    content: "{{ cuda_bandwidth_output.stdout }}"
    dest: "{{ appliances_environment_root }}/cudatests/bandwidth_results.txt"
Collaborator

When cuda group contains multiple nodes they will all write to the same file.

    mode: '0644'
  delegate_to: localhost
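
One possible way to address the comment about multiple nodes writing to the same file is to put the host name into the destination path. The sketch below assumes a per-host file name is acceptable; the `bandwidth_results_<host>.txt` pattern is illustrative and not from the PR.

```yaml
- name: Save CUDA bandwidth output to a per-host results file
  ansible.builtin.copy:
    content: "{{ cuda_bandwidth_output.stdout }}"
    # inventory_hostname differs per node, so each host writes its own file
    dest: "{{ appliances_environment_root }}/cudatests/bandwidth_results_{{ inventory_hostname }}.txt"
    mode: '0644'
  delegate_to: localhost
```

Because the output here lives in a registered variable rather than in a file on the node, `copy` with `content:` can write it directly on the control host. `ansible.builtin.fetch` would be the natural choice if the test output were first written to a file on the node; by default it also stores fetched files under a per-host directory.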