Dockerized env: [JupyterLab server => Spark (master <-> 1 worker)]
Deploying a local Spark cluster (standalone) can be tricky.
Most online resources focus on a single-driver installation, with Spark in a custom environment or via jupyter-docker-stacks.
Here are my notes on running Spark locally, with a JupyterLab interface, one Master and one Worker, using Docker Compose.
All the PySpark dependencies are already configured in a container, with access to your local files (in an existing directory).
You might also want to do it the easy way (not local, though) using the free Databricks Community Edition.
- Install Docker Engine, either through Docker Desktop or Docker Engine directly. Personally, I use the latter.
- Make sure Docker Compose is installed, or install it.
- Resources:
Medium article on installing and basic use of Docker. The official Docker resources should be enough, though.
Jupyter-Docker-Stacks. "A set of ready-to-run Docker images containing Jupyter applications".
The source article I (very slightly) adapted the docker-compose file from.
Install Docker Engine (apt-get), official resource.
Install Docker Compose (apt-get), official resource.
After installing Docker Engine/Compose on Linux, do not forget the post-installation steps (see the sketch below).
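For reference, a minimal sketch of the usual Linux post-installation step, which lets your user run Docker without sudo; the official post-installation guide covers the full list:

# Allow running Docker without sudo (standard Linux post-install step)
sudo groupadd docker            # the group may already exist
sudo usermod -aG docker $USER   # add your user to the docker group
newgrp docker                   # or log out and back in
docker run hello-world          # quick sanity check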
- Git clone this repository, or create a new one (name of your choice).
- Open a terminal, cd into that directory, and make sure the docker-compose.yml file is present (copy it in if needed).
See spark-cluster/docker-compose.yml in this repository.
Basically, the yml file tells Docker Compose how to run the Spark Master, the Worker, and JupyterLab, and mounts your current working directory so you have access to your local files whenever you bring the cluster up. An illustrative sketch of what such a file looks like is shown below.
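This is only an illustrative sketch, not the actual docker-compose.yml from this repository: the image references, commands, and mount paths are assumptions on my part, chosen to match the images (spark:3.3.1, pyspark-notebook), the ports, and the spark://spark:7077 URL used elsewhere in this README.

services:
  spark:                        # Spark Master; service name matches spark://spark:7077
    image: spark:3.3.1
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "8080:8080"             # Master web UI
      - "7077:7077"             # Master RPC port
  spark-worker:                 # single Worker registering with the Master
    image: spark:3.3.1
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark:7077
    ports:
      - "8081:8081"             # Worker web UI
    depends_on:
      - spark
  jupyterlab:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"             # JupyterLab
    volumes:
      - .:/home/jovyan/work     # mount the current directory into the notebook container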
- Run docker compose
cd my-directory
docker compose up
# or, depending on your Docker Compose installation:
docker-compose up
Docker Compose will automatically download the required images (spark:3.3.1 for the Master and Worker, pyspark-notebook for the JupyterLab interface) and start everything. On subsequent runs, it will only start the containers.
Access the different interfaces at:
JupyterLab interface: http://localhost:8888
Spark Master UI: http://localhost:8080
Spark Worker UI: http://localhost:8081
You can use the demo notebook spark-cluster.ipynb for a ready-to-use PySpark notebook, or simply create a new one and run this to create a SparkSession:
from pyspark.sql import SparkSession

# URL of the Spark Master (the "spark" service defined in docker-compose.yml)
URL_SPARK = "spark://spark:7077"

spark = (
    SparkSession.builder
    .appName("spark-ml")
    .config("spark.executor.memory", "4g")  # Spark config keys are prefixed with "spark."
    .master(URL_SPARK)
    .getOrCreate()
)
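Once the session is created, a quick sanity check (my own suggestion, not part of the repository) is to run a tiny job and confirm in the Master UI at http://localhost:8080 that the application and its executor appear:

# Trivial distributed job: sums the integers 0..999 on the worker
spark.range(1000).selectExpr("sum(id) AS total").show()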
If you use spark-cluster.ipynb, a demo example shows how to build a spark.ml prediction Pipeline() with a random forest regressor on a well-known dataset.
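The notebook contains the full example; the following is only a rough sketch of such a pipeline, with a made-up toy DataFrame and column names (x1, x2, label) that are not taken from the notebook:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

# Hypothetical toy data; the demo notebook uses its own dataset
df = spark.createDataFrame(
    [(1.0, 2.0, 10.0), (2.0, 3.0, 15.0), (3.0, 5.0, 25.0), (4.0, 7.0, 35.0)],
    ["x1", "x2", "label"],
)

# Assemble the feature columns into a single vector, then fit a random forest regressor
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=20)

pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(df)
model.transform(df).select("label", "prediction").show()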