OKDP/spark-web-proxy

Monitor live Spark applications within Spark History Server UI on Kubernetes.

spark-web-proxy acts as a reverse proxy for the Spark History Server and the Spark UI. It complements the Spark History Server by seamlessly integrating the UIs of live (running) Spark applications. The web proxy enables real-time, dynamic discovery and monitoring of running Spark applications (without delay) alongside completed applications, all within your existing Spark History Server web UI.

The proxy is non-intrusive and independent of any specific version of Spark or the Spark History Server. It supports all Spark application deployment modes, including Kubernetes jobs, the Spark Operator, and notebooks (Jupyter, etc.).

Screenshot: Spark History Server UI

Requirements

Note

A running Spark History Server is required. You can use the following Spark History Server helm chart.

Installation

To deploy the Spark Web Proxy, refer to the helm chart README for customization options and installation guidelines.

The web proxy can also be deployed as a sidecar container alongside your existing Spark History Server. In that case, make sure the property configuration.spark.service is set to localhost.

In both cases, you need to expose the Spark Web Proxy ingress instead of your Spark History Server ingress.

Spark History Server and Spark Jobs Configuration

Both the Spark History Server and the Spark jobs themselves must be configured to write event logs to the same shared, writable directory.

Spark History Server:

spark.history.fs.logDirectory /path/to/the/same/shared/event/logs

Spark Jobs:

spark.eventLog.enabled true
spark.eventLog.dir /path/to/the/same/shared/event/logs
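
For example, when creating a session from PySpark, a minimal sketch might look like the following (the application name is hypothetical and the log directory is a placeholder that must match the one configured on the Spark History Server side):

from pyspark.sql import SparkSession

# Minimal sketch: enable event logging so that running applications show up
# alongside completed ones. The directory below is a placeholder.
spark = (
    SparkSession.builder
    .appName("event-log-example")  # hypothetical application name
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/path/to/the/same/shared/event/logs")
    .getOrCreate()
)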

Spark Reverse Proxy Support

The web proxy supports Spark's reverse proxy feature for Spark web UIs, enabled with the property spark.ui.reverseProxy=true in your Spark jobs. In that case, the web proxy configuration property configuration.spark.ui.proxyBase should be set to /proxy.
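
For example, a job submitted from PySpark could enable it on its SparkConf (a minimal sketch; in a notebook you would typically reuse the conf object shown further below):

from pyspark import SparkConf

# Minimal sketch: enable Spark's built-in reverse proxy for the job's web UI.
conf = SparkConf().set("spark.ui.reverseProxy", "true")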

For more configuration properties, refer to the Spark Monitoring configuration page.

Spark jobs deployment

Cluster mode

In cluster mode, no additional configuration is needed: by default, Spark adds the label spark-role: driver and the spark-ui port to the Spark driver pods, as shown in the following:

apiVersion: v1
kind: Pod
metadata:
  labels:
    ...
    spark-role: driver
spec:
  containers:
  - args:
    - driver
    name: spark-kubernetes-driver
    ports:
    ...
    - containerPort: 4040
      name: spark-ui
      protocol: TCP
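
For illustration only (this is not necessarily how the web proxy is implemented internally), such driver pods can be discovered through the Kubernetes API by their spark-role=driver label. The sketch below uses the kubernetes Python client and assumes in-cluster or kubeconfig credentials are available:

from kubernetes import client, config

# Illustration: list Spark driver pods by label and print the pod IP and the
# spark-ui container port. This is a sketch of label-based discovery, not the
# web proxy's actual code.
try:
    config.load_incluster_config()   # running inside the cluster
except config.ConfigException:
    config.load_kube_config()        # fall back to a local kubeconfig

v1 = client.CoreV1Api()
pods = v1.list_pod_for_all_namespaces(label_selector="spark-role=driver")
for pod in pods.items:
    ui_ports = [
        p.container_port
        for c in pod.spec.containers
        for p in (c.ports or [])
        if p.name == "spark-ui"
    ]
    print(pod.metadata.name, pod.status.pod_ip, ui_ports)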

Notebooks and Client mode

In client mode, the web proxy relies on the Spark History Server REST API: /api/v1/applications/[app-id]/environment to get the Spark driver IP and UI port, and /api/v1/applications/[app-id] to get the application status.
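
For illustration, the same information can be fetched manually from the Spark History Server REST API. The sketch below uses the requests library; the server URL is an assumption (18080 is the default History Server port) and [app-id] is a placeholder:

import requests

base_url = "http://spark-history-server:18080/api/v1"  # assumed History Server address
app_id = "[app-id]"  # placeholder: the running application's ID

# Application status (e.g. whether the attempt is still running).
app = requests.get(f"{base_url}/applications/{app_id}", timeout=10).json()

# Environment of the application; sparkProperties is a list of [key, value]
# pairs from which the driver host and UI port can be read.
env = requests.get(f"{base_url}/applications/{app_id}/environment", timeout=10).json()
spark_props = dict(env["sparkProperties"])
print(app["name"], spark_props.get("spark.driver.host"), spark_props.get("spark.ui.port"))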

By default, Spark does not expose the spark.ui.port property in the environment properties, so you should set it explicitly during job submission or using a listener.

Here is an example of how to set spark.ui.port in a Jupyter notebook:

import socket
from pyspark import SparkConf

def find_available_port(start_port=4041, max_port=4100):
    """Find the next available port starting from start_port."""
    for port in range(start_port, max_port):
        # connect_ex returns a non-zero error code when nothing is listening,
        # meaning the port is free and can be used for the Spark UI.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            if s.connect_ex(("localhost", port)) != 0:
                return port
    raise Exception(f"No available ports found in range {start_port}-{max_port}")

conf = SparkConf()  # or reuse the SparkConf you already build in the notebook
conf.set("spark.ui.port", str(find_available_port()))

Authentication

The Spark Web Proxy is independent of any specific authentication mechanism. It simply forwards credentials and headers to the running Spark instances without modifying or enforcing authentication itself.

This allows you to use the Spark Authentication Filter or any other authentication solution to secure both the Spark History Server and the Spark jobs, ensuring user authentication and authorization.