Commit 8e08909

Adds infrastructure for various Sapphire Rapids instances on AWS (#6)
* Adds infrastructure for various Sapphire Rapids instances on AWS
* Changes scripts to relative paths for c7i.24xlarge
* Converts all scripts over to relative paths
* Adjusts per-user memory requests based on testing
* Adds the plotting extra to Thicket
* Removes all absolute paths from all configs and scripts
* Fixes extras syntax for Thicket install from Git
* Pins the versions of Hatchet and Thicket to be used for the tutorial
* Fixes bug where version identifier was written in Spack syntax instead of pip syntax
* Pins the version of the AWS EBS to 1.45.0 since master was breaking things
1 parent 76af8c3 commit 8e08909

65 files changed (+3538 −1 lines)

2025-HPDC/docker/Dockerfile.thicket

Lines changed: 2 additions & 1 deletion
```diff
@@ -11,7 +11,8 @@ FROM ghcr.io/llnl/caliper:hpdc-2025
 USER root
 
 RUN . /opt/global_py_venv/bin/activate && \
-    python3 -m pip install git+https://github.com/LLNL/thicket.git@develop
+    python3 -m pip install llnl-hatchet==2024.1.3 && \
+    python3 -m pip install "llnl-thicket[plotting] @ git+https://github.com/LLNL/thicket.git@develop-2024-11-02"
 # python3 -m pip install llnl-thicket[extrap,plotting]==2024.2.1
 
 USER ${NB_USER}
```
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Deploy hpdc-2025-c7i-24xlarge to AWS Elastic Kubernetes Service (EKS)

These config files and scripts can be used to deploy the hpdc-2025-c7i-24xlarge tutorial to EKS.

The sections below walk you through the steps to deploy your cluster. All commands in these sections should be run from the same directory as this README.

## Step 1: Create EKS cluster

To create an EKS cluster with your configured settings, run the following:

```bash
$ ./create_cluster.sh
```

Be aware that this step can take 15-30 minutes to complete.
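The contents of `create_cluster.sh` are not part of this diff; a minimal sketch of what it presumably wraps, assuming it feeds the checked-in `eksctl-config.yaml` to `eksctl` (the same config file that `cleanup.sh` later passes to `eksctl delete cluster`), is:

```bash
# Hypothetical sketch, not the actual script: create the cluster from the
# same config file that cleanup.sh deletes the cluster with.
eksctl create cluster --config-file ./eksctl-config.yaml
```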

## Step 2: Configure Kubernetes within the EKS cluster

After creating the cluster, we need to configure Kubernetes and its addons. In particular, we need to set up the Kubernetes autoscaler, which allows our tutorial to scale to as many users as our cluster's resources can handle.

To configure Kubernetes and the autoscaler, run the following:

```bash
$ ./configure_kubernetes.sh
```
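The script body is likewise not shown in this diff; based on the config files referenced in Step 4's troubleshooting list, a plausible sketch of the kind of commands it runs is:

```bash
# Hypothetical sketch: apply the autoscaler and storage-class configs that
# Step 4 tells you to reconfigure after editing. The file names come from
# this README; everything else here is an assumption.
kubectl apply -f ./cluster-autoscaler.yaml
kubectl apply -f ./storage-class.yaml
```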

## Step 3: Deploy JupyterHub to the EKS cluster

With the cluster properly created and configured, we can now deploy JupyterHub to the cluster to manage everything else about our tutorial.

To deploy JupyterHub, run the following:

```bash
$ ./deploy_jupyterhub.sh
```
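For reference, a hedged sketch of the Helm deployment this likely performs: the release name `hpdc-2025-c7i-24xlarge-jupyter` and the values file `helm-config.yaml` come from elsewhere in this commit, while the chart repository is the standard Zero to JupyterHub one and an assumption here.

```bash
# Hypothetical sketch, not the actual deploy script.
helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update
helm upgrade --install hpdc-2025-c7i-24xlarge-jupyter jupyterhub/jupyterhub \
    --values ./helm-config.yaml
```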

## Step 4: Verify that everything is working

After deploying JupyterHub, we need to make sure that all the necessary components are working properly.

To check this, run the following:

```bash
$ ./check_jupyterhub_status.sh
```

If everything worked properly, you should see output like this:

```
NAME                              READY   STATUS    RESTARTS   AGE
continuous-image-puller-2gqrw     1/1     Running   0          30s
continuous-image-puller-gb7mj     1/1     Running   0          30s
hub-8446c9d589-vgjlw              1/1     Running   0          30s
proxy-7d98df9f7-s5gft             1/1     Running   0          30s
user-scheduler-668ff95ccf-fw6wv   1/1     Running   0          30s
user-scheduler-668ff95ccf-wq5xp   1/1     Running   0          30s
```

Be aware that the hub pod (i.e., hub-8446c9d589-vgjlw above) may take a minute or so to start.
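If you would rather watch the pods come up than re-run the status script, a standard kubectl invocation (not part of this commit) is:

```bash
# Watch pod status in the default namespace until the hub pod reaches Running.
kubectl --namespace=default get pods --watch
```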

If something went wrong, you will have to edit the config YAML files to get things working. Before trying to work things out yourself, check the FAQ to see if your issue has already been addressed.

Depending on which file you edited, you may have to run different commands to update the EKS cluster and the JupyterHub deployment. Follow the steps below to update:
1. If you only edited `helm-config.yaml`, try to just update the JupyterHub deployment by running `./update_jupyterhub_deployment.sh` (see the sketch after this list)
2. If step 1 failed, fully tear down the JupyterHub deployment with `./tear_down_jupyterhub.sh` and then re-deploy it with `./deploy_jupyterhub.sh`
3. If you edited `cluster-autoscaler.yaml` or `storage-class.yaml`, tear down the JupyterHub deployment with `./tear_down_jupyterhub.sh`. Then, reconfigure Kubernetes with `./configure_kubernetes.sh`, and re-deploy JupyterHub with `./deploy_jupyterhub.sh`
4. If you edited `eksctl-config.yaml`, fully tear down the cluster with `cleanup.sh`, and then restart from the top of this README
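As with the other helper scripts, `./update_jupyterhub_deployment.sh` is not shown in this diff; assuming the Helm release sketched in Step 3, a plausible sketch of the update it performs is:

```bash
# Hypothetical sketch: re-apply helm-config.yaml to the existing release.
helm upgrade hpdc-2025-c7i-24xlarge-jupyter jupyterhub/jupyterhub \
    --values ./helm-config.yaml
```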

## Step 5: Get the public cluster URL

Now that everything's ready to go, we need to get the public URL to the cluster.

To do this, run the following:

```bash
$ ./get_jupyterhub_url.sh
```
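The script itself is not included in this diff. Assuming the standard Zero to JupyterHub chart (an assumption, see Step 3), the public address can also be fetched manually:

```bash
# The Zero to JupyterHub chart exposes the proxy via a LoadBalancer service
# named proxy-public; EXTERNAL-IP is the public address once provisioned.
kubectl --namespace=default get svc proxy-public
```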

Note that it can take several minutes after the URL is available for it to actually redirect to JupyterHub.

## Step 6: Distribute URL and password to attendees

Now that we have our public URL, we can give the attendees everything they need to join the tutorial.

For attendees to access JupyterHub, they simply need to enter the public URL (from Step 5) in their browser of choice. This will take them to a login page. The login credentials are as follows:
* Username: anything the attendee wants (note: this should be unique for every user; otherwise, users will share pods)
* Password: the password specified towards the top of `helm-config.yaml`

Once the attendees log in with these credentials, the Kubernetes autoscaler will spin up a pod for them (and grab new resources, if needed). This pod will contain a JupyterLab instance with the tutorial materials and environment already prepared for them.
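To confirm that the autoscaler is actually grabbing new resources as attendees log in, a standard check (not part of this commit) is:

```bash
# New worker nodes should appear as the autoscaler scales out, and one pod
# per logged-in attendee should appear in the default namespace (user pods
# are typically named jupyter-<username>, an assumption from the defaults).
kubectl get nodes
kubectl --namespace=default get pods
```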

At this point, you can start presenting your interactive tutorial!

## Step 7: Clean up everything

Once you are done with your tutorial, you should clean everything up so that there are no continuing, unnecessary expenses to your AWS account. To do this, simply run the following:

```bash
$ ./cleanup.sh
```

After cleaning everything up, you can verify that everything has been cleaned up by going to the AWS web console and ensuring nothing from your tutorial still exists in CloudFormation and EKS.
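If you prefer to verify from the command line rather than the web console, a hedged pair of checks is:

```bash
# The tutorial cluster should no longer be listed...
eksctl get cluster
# ...and its CloudFormation stacks should only show up as deleted.
aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE
```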
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
```bash
#!/usr/bin/env bash

set -e

if ! command -v kubectl >/dev/null 2>&1; then
    echo "ERROR: 'kubectl' is required to configure a Kubernetes cluster on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://kubernetes.io/docs/tasks/tools/#kubectl"
    exit 1
fi

# Find the hub pod in the default namespace and print its logs.
hub_pod_id=$(kubectl get pods -n default --no-headers=true | awk '/hub/{print $1}')
kubectl logs "$hub_pod_id"
```
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
```bash
#!/usr/bin/env bash

set -e

if ! command -v kubectl >/dev/null 2>&1; then
    echo "ERROR: 'kubectl' is required to configure a Kubernetes cluster on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://kubernetes.io/docs/tasks/tools/#kubectl"
    exit 1
fi

if [ $# -ne 1 ]; then
    echo "Usage: ./check_init_container_log.sh <pod_name>"
    exit 1
fi

# Print the logs of the tutorial's init container in the given pod.
kubectl logs "$1" -c init-tutorial-service
```
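A usage example with a hypothetical pod name (pass whatever `kubectl get pods` actually shows):

```bash
# jupyter-alice is a made-up user-pod name, used here only for illustration.
$ ./check_init_container_log.sh jupyter-alice
```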
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
```bash
#!/usr/bin/env bash

set -e

if ! command -v kubectl >/dev/null 2>&1; then
    echo "ERROR: 'kubectl' is required to configure a Kubernetes cluster on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://kubernetes.io/docs/tasks/tools/#kubectl"
    exit 1
fi

# List all pods in the default namespace (hub, proxy, user pods, etc.).
kubectl --namespace=default get pods

echo "If there are issues with any pods, you can get more details with:"
echo "    $ kubectl --namespace=default describe pod <pod-name>"
```
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
```bash
#!/usr/bin/env bash

set -e

if ! command -v kubectl >/dev/null 2>&1; then
    echo "ERROR: 'kubectl' is required to configure a Kubernetes cluster on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://kubernetes.io/docs/tasks/tools/#kubectl"
    exit 1
fi

if ! command -v eksctl >/dev/null 2>&1; then
    echo "ERROR: 'eksctl' is required to create a Kubernetes cluster on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://eksctl.io/installation/"
    exit 1
fi

if ! command -v helm >/dev/null 2>&1; then
    echo "ERROR: 'helm' is required to configure and launch JupyterHub on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://helm.sh/docs/intro/install/"
    exit 1
fi

# Temporarily allow errors in the script so that the script won't fail
# if the JupyterHub deployment failed or was previously torn down
set +e
echo "Tearing down JupyterHub and uninstalling everything related to Helm:"
helm uninstall hpdc-2025-c7i-24xlarge-jupyter
set -e

echo ""
echo "Deleting all pods from the EKS cluster:"
kubectl delete pod --all-namespaces --all --force

echo ""
echo "Deleting the EKS cluster:"
eksctl delete cluster --config-file ./eksctl-config.yaml --wait

echo ""
echo "Everything is now cleaned up!"
```
