Commit 66dc7cf

Add workaround for slurm bug with sbcast.
Slurm version 24.11.0 introduced a bug in the sbcast utility: files are copied to the compute nodes' local file system under /tmp instead of to the tmpfs. This broke the spark-start script. As a workaround, the spark worker script is now copied out to the tmpfs on each compute node with srun and cp instead of with sbcast.
1 parent 0fd4c1b commit 66dc7cf
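
For context, a minimal sketch of the substitution (the paths below are illustrative placeholders, not the ones used in spark-start): sbcast reads a file on the host where it runs and writes it to a path on every node in the allocation, while the workaround keeps the source file on a shared filesystem and has each node copy it to its local tmpfs via srun.

# Affected by the 24.11.0 sbcast bug: the file may land under /tmp on the nodes.
sbcast /shared/conf/worker.sh /node-local-tmpfs/worker.sh

# Workaround sketch: the source lives on a shared filesystem, so run cp on the
# allocated nodes instead (--ntasks-per-node=1 avoids redundant copies per node;
# the commit itself uses a plain srun invocation).
srun --ntasks-per-node=1 cp /shared/conf/worker.sh /node-local-tmpfs/worker.sh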

File tree

1 file changed: +4 −4 lines

spark-start

Lines changed: 4 additions & 4 deletions
@@ -241,7 +241,7 @@ echo "SPARK_MASTER_URL: ${SPARK_MASTER_URL}"
 echo "SPARK_MASTER_WEBUI: ${SPARK_MASTER_WEBUI}"
 
 # Create a worker starter script for non-daemonized spark workers.
-cat > ${SCRATCH}/tmp/sparkworker.sh <<EOF
+cat > ${SPARK_CONF_DIR}/sparkworker.sh <<EOF
 #!/bin/bash
 ulimit -u 16384 -n 16384
 export SPARK_CONF_DIR=${SPARK_CONF_DIR}
@@ -252,10 +252,10 @@ exec spark-class org.apache.spark.deploy.worker.Worker "${SPARK_MASTER_URL}" &>
 EOF
 
 # Broadcast the worker script to all slurm nodes.
-chmod +x ${SCRATCH}/tmp/sparkworker.sh
-sbcast ${SCRATCH}/tmp/sparkworker.sh "${SCRATCH}/sparkworker.sh" \
+chmod +x ${SPARK_CONF_DIR}/sparkworker.sh
+srun cp ${SPARK_CONF_DIR}/sparkworker.sh "${SCRATCH}/sparkworker.sh" \
     || fail "Could not broadcast worker start script to nodes"
-rm -f ${SCRATCH}/tmp/sparkworker.sh
+rm -f ${SPARK_CONF_DIR}/sparkworker.sh
 
 # Modify the worker script on the node that will run the spark driver. Reduce
 # the resources requested by the worker to leave resources for the driver.
