This repository provides a fully functional Docker-based environment for running an Apache Spark cluster on Hadoop YARN or in standalone mode.
✅ Supports executing MapReduce and Spark applications on YARN or in standalone mode. Additionally, it sets up an HDFS cluster for reliable data storage and access.
✅ Integrated Jupyter notebook for interactive data analysis and visualization using PySpark and Scala.
✅ Fully compatible with our modernized HiBench version, enabling performance benchmarking of Spark workloads.
- Base OS: Ubuntu 20.04
- Spark: v3.3.2 with Scala v2.12 and Python (PySpark) v3.8
- Hadoop: v3.3.2
Download and extract the contents into a folder of your choice (be sure there is enough space for the images and shared data).
Run the following to build the required Docker images:
On Windows (PowerShell):
```
.\scripts\build_images.ps1
```
If Windows blocks the downloaded script, unblock it: right-click build_images.ps1, select Properties, and at the bottom of the General tab check Unblock.
On Linux/macOS:
```
cd scripts
make build_base_image
make build_master_image
make build_slave_image
make build_jupyter_image
```
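To verify that the images were built (the exact image names and tags depend on the build scripts, so the filter below is only a guess):
```
docker images | grep -iE 'spark|hadoop|jupyter'
```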
To start a cluster with 1 master and 3 slave nodes:
```
docker-compose -p spark-cluster up -d
```
To change the number of slave nodes, update docker-compose.yml accordingly.
The docker-compose setup also creates an internal Nginx reverse proxy to access the log files of the slaves.
To ensure the slave hostnames resolve from the host machine, add the following lines to the host's /etc/hosts file:
```
127.0.0.1 spark-cluster-master
127.0.0.1 spark-cluster-slave-1
127.0.0.1 spark-cluster-slave-2
127.0.0.1 spark-cluster-slave-3
```
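On Linux/macOS the entries can be appended manually, for example (a minimal sketch; requires sudo):
```
sudo tee -a /etc/hosts > /dev/null <<'EOF'
127.0.0.1 spark-cluster-master
127.0.0.1 spark-cluster-slave-1
127.0.0.1 spark-cluster-slave-2
127.0.0.1 spark-cluster-slave-3
EOF
```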
or, on Windows, run this script as administrator in PowerShell:
```
..\scripts\add_hosts.ps1
```
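With the host entries in place, the slave log links from the YARN web UI should open from the host through the proxy. A rough reachability check (this assumes the proxy forwards the NodeManager web UI on its default port 8042; the actual port depends on the Nginx configuration in this repository):
```
curl -I http://spark-cluster-slave-1:8042
```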
To stop and remove all associated containers, volumes, and networks:
```
docker-compose down
```
To restart existing containers:
```
docker-compose start
```
To stop running containers:
```
docker-compose stop
```

| Role | Master | Slaves |
|---|---|---|
| HDFS NameNode | ✅ | ❌ |
| HDFS SecondaryNameNode | ✅ | ❌ |
| HDFS DataNode | ❌ | ✅ |
| YARN ResourceManager | ✅ | ❌ |
| YARN NodeManager | ❌ | ✅ |
| Spark History Server | ✅ | ❌ |
| Spark Master (standalone) | ✅ | ❌ |
| Spark Worker (standalone) | ❌ | ✅ |
- HDFS NameNode: Central service managing file system metadata.
- HDFS SecondaryNameNode: Periodically merges fsimage and edit logs.
- HDFS DataNode: Stores actual data blocks; distributed across slaves.
- YARN ResourceManager: Manages cluster resources and job scheduling.
- YARN NodeManager: Runs containers and reports usage; one per slave.
- Spark History Server: Displays completed Spark jobs (UI).
- Spark Master/Worker: Not used in YARN mode (YARN handles scheduling).
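A quick way to see this split on a running cluster is to list the Java daemons in each container (the container names here are assumptions; list the real ones with docker ps --format '{{.Names}}', and jps must be available inside the images):
```
# Master: expect NameNode, SecondaryNameNode, ResourceManager and the Spark HistoryServer
docker exec hadoop-spark-cluster-master bash -lc jps
# A slave: expect DataNode and NodeManager
docker exec spark-cluster-slave-1 bash -lc jps
```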
To control the CPU and memory usage of each Spark/YARN node (especially the slaves), Docker resource limits can be declared in the docker-compose.yml file, for example with deploy.resources.limits and/or mem_limit and cpus:
```
services:
  slave:
    image: spark_slave:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'   # Limit to 2 CPU cores
          memory: 4g    # Limit to 4 GB of RAM
    mem_limit: 4g       # mem_limit/cpus cover docker-compose versions that ignore the deploy section
    cpus: 2
    ...
```
In the yarn-site.xml distributed to all nodes, the following properties must reflect the same limits:
```
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
</property>
```
Ensure yarn.scheduler.minimum-allocation-mb and minimum-allocation-vcores are ≤ these values.
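To confirm the NodeManagers picked up these limits, one option is to query YARN from inside the master container (assuming the yarn CLI is on the PATH there):
```
# Open a shell in the master container...
docker exec -it hadoop-spark-cluster-master /bin/bash
# ...then list the nodes with the memory and vcores they advertise
yarn node -list -showDetails
```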
Spark must be configured according to the maximum resources provided by YARN:
```
spark.executor.instances 3
spark.executor.memory 3g
spark.executor.cores 1
spark.driver.memory 2g
spark.driver.cores 1
```
Consistent resource settings across Docker, YARN, and Spark prevent resource oversubscription and ensure correct job execution within physical limits.
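As a sanity check on the memory math: with spark.executor.memory=3g, Spark on YARN adds the default overhead of max(384 MiB, 10% of 3072 MiB) = 384 MiB, and with the default yarn.scheduler.minimum-allocation-mb of 1024 the resulting 3456 MiB request is rounded up to 4096 MiB, which exactly matches the per-node limit configured above. The same settings can also be passed per job; a sketch of the equivalent spark-submit flags, run from inside the master container (the example JAR ships with the Spark distribution used in the image):
```
spark-submit --master yarn --deploy-mode client \
  --num-executors 3 --executor-memory 3g --executor-cores 1 \
  --driver-memory 2g \
  --class org.apache.spark.examples.SparkPi \
  ~/spark-3.3.2-bin-hadoop3/examples/jars/spark-examples*.jar 100
```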
The cluster exposes the following web UIs:
- Jupyter notebook
- YARN ResourceManager
- Hadoop HDFS NameNode UI
- Spark Standalone UI
- Spark History Server
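Which host ports these UIs are published on depends on the port mappings in docker-compose.yml; one way to list them for the master container (container name assumed, see docker ps):
```
docker port hadoop-spark-cluster-master
```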
You can connect to the master container via SSH:
```
ssh sparker@spark-cluster-master -p 2222
```
Password: `sparker`
Volume mappings and folder descriptions:
| Folder | Mapped Path | Purpose |
|---|---|---|
| `conf-master` | `/home/sparker/hadoop-*/conf` | Master Hadoop config |
| `conf-slave` | `/home/sparker/hadoop-*/conf` | Slave Hadoop config |
| `master` | - | Dockerfile for master |
| `slave` | - | Dockerfile for slave |
| `shared-master` | `/home/sparker/shared` | Shared directory for master |
| `shared-slave` | `/home/sparker/shared` | Shared directory for each slave |
| `hibench-data` | `/home/sparker/HiBench` | Volume for HiBench benchmarks |
| `jupyter` | `/home/sparker/work` | Jupyter notebook files |
You can drop your Spark JARs into the shared folder and run them using spark-submit.
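For example, on the host (the JAR name below is a placeholder; shared-master is the folder mapped to /home/sparker/shared in the table above):
```
# Make the application JAR visible inside the master container
cp my-app_2.12-1.0.jar shared-master/
```
After connecting to the master container as shown below, the JAR can then be submitted with something like `spark-submit --master yarn --deploy-mode client --class com.example.MyApp /home/sparker/shared/my-app_2.12-1.0.jar` (the class name is illustrative).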
To access the Spark master container:
```
docker exec -it hadoop-spark-cluster-master /bin/bash
```
If unsure of the container name:
```
docker ps --format '{{.Names}}'
```
Example: submit the SparkPi job on YARN from inside the master container:
```
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode client \
  --driver-memory 1g --executor-memory 1g --executor-cores 1 \
  ~/spark-3.3.2-bin-hadoop3/examples/jars/spark-examples*.jar 1000
```
This cluster is ready to use with our customized HiBench for performance benchmarking of Spark applications, with optional MapReduce-based data generation.
This version is a heavily improved fork of bartosz25/spark-docker.