@danpetreamd danpetreamd commented Sep 22, 2025

Motivation

Fix AAC-specific issues that prevented the project from running.

Technical Details

  • get the rocminfo path at runtime
  • get the container runtime dynamically (docker vs podman)
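Both lookups can be illustrated with a small bash sketch. This is not the code from this PR: `first_available_cmd` is a hypothetical helper name, and preferring docker over podman is an assumption made only for the example.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the PR): print the location of the first
# candidate command that `command -v` can resolve on the current PATH.
# Returns 1 if none of the candidates is available.
first_available_cmd() {
    local candidate
    for candidate in "$@"; do
        if command -v "${candidate}" >/dev/null 2>&1; then
            command -v "${candidate}"
            return 0
        fi
    done
    return 1
}

# Assumption: docker is preferred when both runtimes are installed.
container_runtime="$(first_available_cmd docker podman)" || container_runtime=""

# rocminfo is looked up on PATH instead of at a hard-coded location.
rocminfo_path="$(first_available_cmd rocminfo)" || rocminfo_path=""

echo "container runtime: ${container_runtime:-not found}"
echo "rocminfo: ${rocminfo_path:-not found}"
```

On a node where ROCm is provided through modules, rocminfo would only be found after the module has been loaded, which is why a PATH lookup at runtime is more robust than a fixed path.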

Test Plan

  • ran a few models successfully on AAC

Steps to test:

  1. Get access to AAC
  2. SSH into AAC login node.

ssh -i <path_to_private_key> <username>@aac10.amd.com

  3. Allocate an AAC node to run the test on.

salloc --reservation=gpu-38_reservation --exclusive --mem=0

  4. Load ROCm. AAC nodes have no ROCm installation by default;
    ROCm versions are loaded through environment modules.

module load rocm

  5. Create setup.sh with the following contents:

#!/usr/bin/env bash

# ------------------------------------------------------------------------------
# Function to clone or update a repository.
# Arguments:
#   1. repo_path: URL of the repository to clone.
#   2. repo_branch: Branch of the repository to checkout.
# Usage:
#   clone_repo <repo_path> <repo_branch>
# ------------------------------------------------------------------------------
clone_repo() {
    local repo_path="$1"
    local repo_branch="$2"
    local repo_basename repo_username repo_name
    local repo_info="${repo_path#*:}"

    repo_username="$(basename "$(dirname "${repo_info}")")"
    repo_basename=$(basename "${repo_info}" .git)
    repo_name="${repo_basename}_${repo_username}"

    if [ ! -d "${repo_name}" ]; then
        # First run: clone the repository together with its submodules.
        echo "Cloning ${repo_name}..." >&2
        git clone --recursive "${repo_path}" "${repo_name}" >&2 || return 1
    elif [ -d "${repo_name}/.git" ]; then
        # Existing clone: fetch the latest refs from all remotes.
        echo "Updating ${repo_name}..." >&2
        git -C "${repo_name}" fetch --all >&2 || return 1
    else
        echo "... Error: ${repo_name} exists but is not a git repository." >&2
        echo "... Please remove ${repo_name} and try again." >&2
        return 1
    fi

    # Stop here if the requested branch cannot be checked out.
    git -C "${repo_name}" checkout "${repo_branch}" >&2 || return 1
    git -C "${repo_name}" show --oneline -s >&2

    echo "${repo_name}"
}


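# Abort if the clone or checkout failed, so the steps below never run
# in the wrong directory.
repo_name=$(clone_repo "https://github.com/danpetreamd/MAD.git" "aac_tweaks") || exit 1

pushd "${repo_name}" || exit 1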

echo
echo "installing prerequisites..."
pip install -r requirements.txt
echo

echo
echo "verifying installation..."
madengine run --tags pyt_huggingface_bert
echo

popd

  6. Make setup.sh executable:

chmod u+x setup.sh

  7. Execute setup.sh from the directory where you created it:

./setup.sh
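
The directory name that clone_repo chooses can be checked in isolation. The sketch below repeats the same parameter expansion and basename/dirname calls as the script; `derive_repo_dir` is a hypothetical wrapper name, and the SSH-style URL is a made-up example used only to show that both URL forms are handled.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper around the naming logic used in clone_repo:
# strip everything up to the first ':' and combine the repository name
# with the owner name as "<repo>_<owner>".
derive_repo_dir() {
    local repo_path="$1"
    local repo_info="${repo_path#*:}"
    local repo_username repo_basename

    repo_username="$(basename "$(dirname "${repo_info}")")"
    repo_basename="$(basename "${repo_info}" .git)"
    echo "${repo_basename}_${repo_username}"
}

# URL used in setup.sh; prints MAD_danpetreamd
derive_repo_dir "https://github.com/danpetreamd/MAD.git"

# Made-up SSH-style URL; prints somerepo_someuser
derive_repo_dir "git@example.com:someuser/somerepo.git"
```

Including the owner in the directory name keeps clones of the same repository from different forks side by side without collisions.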

Test Result

  • all tests passed
    • pull the MAD repo
    • install MAD prerequisites, including madengine
    • run pyt_huggingface_bert model

Notes:

The setup.sh script uses the code from my MAD and madengine forks.
Before PR#94 and PR#41 are merged, I need to update setup.sh and the MAD requirements.txt to point to the ROCm MAD and madengine repositories.

Submission Checklist

@danpetreamd danpetreamd marked this pull request as draft September 22, 2025 14:18
@danpetreamd danpetreamd marked this pull request as ready for review September 22, 2025 14:20
@danpetreamd danpetreamd mentioned this pull request Sep 23, 2025
@danpetreamd danpetreamd marked this pull request as draft September 24, 2025 16:24
Used get_cmd() to refactor functions that invoke commands which are not standard and may not be installed in a known location on every system.
@danpetreamd danpetreamd marked this pull request as ready for review September 25, 2025 15:36