Dockerized env: [JupyterLab server => Spark (master <-> 1 worker)]
Deploying a local Spark cluster (standalone) can be tricky.
Most online resources focus on a single-driver installation, with Spark in a custom environment or via jupyter-docker-stacks.
Here are my notes on running Spark locally, with a JupyterLab interface, one Master and one Worker, using Docker Compose.
All the PySpark dependencies are already configured in a container, with access to your local files (in an existing directory).
You might also want to do it the easy way (not local, though) using the free Databricks Community Edition.
- Install Docker Engine, either through Docker Desktop or Docker Engine directly. Personally, I use the latter.
- Make sure Docker Compose is installed, or install it.
- Resources:
Medium article on installing and basic use of Docker. The official Docker resources should be enough, though.
Jupyter-Docker-Stacks. "A set of ready-to-run Docker images containing Jupyter applications".
The source article I (very slightly) adapted the docker-compose file from.
Install Docker Engine (apt-get), official resource.
Install Docker Compose (apt-get), official resource.
After installing Docker Engine/Compose on Linux, do not forget the post-installation steps (see the sketch below).
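For reference, a minimal sketch of the usual Linux post-installation step, which lets your user run Docker without sudo; the official post-installation guide covers the full list:

# Allow running Docker without sudo (standard Linux post-install step)
sudo groupadd docker            # the group may already exist
sudo usermod -aG docker $USER   # add your user to the docker group
newgrp docker                   # or log out and back in
docker run hello-world          # quick sanity check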
- Git clone this repository, or create a new one (name of your choice).
- Open a terminal, cd into that directory, and make sure the docker-compose.yml file is present (copy it in if needed).
See spark-cluster/docker-compose.yml in this repository.
Basically, the yml file tells Docker Compose how to run the Spark Master, the Worker, and JupyterLab, and mounts your current working directory so you have access to your local files whenever you bring the cluster up. An illustrative sketch of what such a file looks like is shown below.
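This is only an illustrative sketch, not the actual docker-compose.yml from this repository: the image references, commands, and mount paths are assumptions on my part, chosen to match the images (spark:3.3.1, pyspark-notebook), the ports, and the spark://spark:7077 URL used elsewhere in this README.

services:
  spark:                        # Spark Master; service name matches spark://spark:7077
    image: spark:3.3.1
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "8080:8080"             # Master web UI
      - "7077:7077"             # Master RPC port
  spark-worker:                 # single Worker registering with the Master
    image: spark:3.3.1
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark:7077
    ports:
      - "8081:8081"             # Worker web UI
    depends_on:
      - spark
  jupyterlab:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"             # JupyterLab
    volumes:
      - .:/home/jovyan/work     # mount the current directory into the notebook container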
- Run docker compose
cd my-directory
docker compose up
# or, depending on your Docker Compose installation:
docker-compose up
Docker Compose will automatically download the required images (spark:3.3.1 for the Master and Worker, pyspark-notebook for the JupyterLab interface) and start everything. On subsequent runs, it will only start the containers.
Access the different interfaces at:
JupyterLab interface: http://localhost:8888
Spark Master UI: http://localhost:8080
Spark Worker UI: http://localhost:8081
You can use the demo notebook spark-cluster.ipynb for a ready-to-use PySpark notebook, or simply create a new one and run this to create a SparkSession:
from pyspark.sql import SparkSession

# URL of the Spark Master (the "spark" service defined in docker-compose.yml)
URL_SPARK = "spark://spark:7077"

spark = (
    SparkSession.builder
    .appName("spark-ml")
    .config("spark.executor.memory", "4g")  # Spark config keys are prefixed with "spark."
    .master(URL_SPARK)
    .getOrCreate()
)
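Once the session is created, a quick sanity check (my own suggestion, not part of the repository) is to run a tiny job and confirm in the Master UI at http://localhost:8080 that the application and its executor appear:

# Trivial distributed job: sums the integers 0..999 on the worker
spark.range(1000).selectExpr("sum(id) AS total").show()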
If you use spark-cluster.ipynb, a demo example shows how to build a spark.ml prediction Pipeline() with a random forest regressor on a well-known dataset.
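The notebook contains the full example; the following is only a rough sketch of such a pipeline, with a made-up toy DataFrame and column names (x1, x2, label) that are not taken from the notebook:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

# Hypothetical toy data; the demo notebook uses its own dataset
df = spark.createDataFrame(
    [(1.0, 2.0, 10.0), (2.0, 3.0, 15.0), (3.0, 5.0, 25.0), (4.0, 7.0, 35.0)],
    ["x1", "x2", "label"],
)

# Assemble the feature columns into a single vector, then fit a random forest regressor
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=20)

pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(df)
model.transform(df).select("label", "prediction").show()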