
How to: deploy a local Spark cluster (standalone) w/ Docker (Linux)

License: MIT made-with-python

Dockerized env: [JupyterLab server => Spark (master <-> 1 worker)]

Deploying a local Spark cluster (standalone) can be tricky.
Most online resources focus on a single-driver installation, with Spark in a custom env or via jupyter-docker-stacks.
Here are my notes on working with Spark locally, through a JupyterLab interface, with one master and one worker, using Docker Compose.
All the PySpark dependencies come already configured in a container, with access to your local files (in an existing directory).
You might also want to do it the easy way (not local, though) using Databricks Community Edition (free).

1. Prerequisites


  • Install Docker, either through Docker Desktop or directly as Docker Engine. I personally use the latter.
  • Make sure Docker Compose is installed, or install it.
  • Resources:
    Medium article on installing and getting started with Docker. The official Docker resources should be enough, though.
    Jupyter-Docker-Stacks. "A set of ready-to-run Docker images containing Jupyter applications".
    The source article I (very slightly) adapted the docker-compose file from.
    Install Docker Engine (apt-get), official resource.
    Install Docker Compose (apt-get), official resource.

2. How to


After installing Docker Engine / Docker Compose on Linux, do not forget the post-installation steps.

  1. Git clone this repository or create a new one (name of your choice)
  2. Open a terminal, cd into your directory and make sure the docker-compose.yml file is present (copy it in if needed):
    version: '3'
    services:
      spark:
        image: bitnami/spark:3.3.1
        environment:
          - SPARK_MODE=master
        ports:
          - '8080:8080'
          - '7077:7077'
        volumes:
          - $PWD:/home/jovyan/work
      spark-worker:
        image: bitnami/spark:3.3.1
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=4G
          - SPARK_EXECUTOR_MEMORY=4G
          - SPARK_WORKER_CORES=4
        ports:
          - '8081:8081'
        volumes:
          - $PWD:/home/jovyan/work
      jupyter:
        image: jupyter/pyspark-notebook:spark-3.3.1
        ports:
          - '8888:8888'
        volumes:
          - $PWD:/home/jovyan/work

Basically, the yml file tells Docker Compose how to run the Spark master, the worker and JupyterLab. Your local disk (current working directory) will be accessible from the containers every time you run the following command:

  3. Run docker compose
cd my-directory
docker compose up
# or, depending on your Docker Compose install:
docker-compose up

Docker Compose will automatically download the required images (bitnami/spark:3.3.1 for the master and worker, jupyter/pyspark-notebook for the JupyterLab interface) and run the whole thing. On subsequent runs, it will simply start them.

3. Profit: JupyterLab interface, Spark cluster (standalone) mode

Access the different interfaces at:

JupyterLab interface: http://localhost:8888
Spark Master: http://localhost:8080
Spark Worker: http://localhost:8081

You can use the demo notebook spark-cluster.ipynb for a ready-to-use PySpark notebook, or simply create a new one and run this to get a SparkSession:

from pyspark.sql import SparkSession

# Spark master URL: "spark" is the master service name defined in docker-compose.yml
URL_SPARK = "spark://spark:7077"

spark = (
    SparkSession.builder
    .appName("spark-ml")
    .config("spark.executor.memory", "4g")
    .master(URL_SPARK)
    .getOrCreate()
)
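
If you want to make sure the notebook really talks to the standalone cluster (and not a local-mode Spark), a minimal sanity check like the sketch below should do; the expected values assume the docker-compose file above:

# Quick sanity check: the master URL should point to the standalone cluster
print(spark.sparkContext.master)   # expected: spark://spark:7077

# Trivial distributed job, executed on the worker
df = spark.range(1_000_000)
print(df.count())                  # expected: 1000000

# Reminder: files from your local working directory are mounted under /home/jovyan/work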

Bonus: notebook, predict using spark.ml Pipeline()


If you use spark-cluster.ipynb, a demo example shows how to build a spark.ml prediction Pipeline() with a random forest regressor on a well-known dataset.
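
The notebook is the reference for the full workflow; for orientation only, a minimal sketch of such a pipeline could look like this (the CSV path and the "label" column are hypothetical placeholders, not the dataset used in the notebook):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Hypothetical dataset: any CSV with numeric feature columns and a "label" column,
# placed in the mounted working directory
df = spark.read.csv("work/data.csv", header=True, inferSchema=True)
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Assemble the feature columns into a single vector, then fit a random forest
assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "label"],
    outputCol="features",
)
rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=50)

pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(train)

# Predict on the held-out set and evaluate with RMSE
predictions = model.transform(test)
rmse = RegressionEvaluator(labelCol="label", metricName="rmse").evaluate(predictions)
print(f"RMSE: {rmse:.3f}")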
