spark-web-proxy acts as a reverse proxy for the Spark History Server and Spark UIs. It complements the Spark History Server by seamlessly integrating the UIs of live (running) Spark applications. The web proxy enables real-time discovery and monitoring of running Spark applications (without delay) alongside completed ones, all within your existing Spark History Server web UI.
The proxy is non-intrusive and independent of any specific version of Spark or the Spark History Server. It supports all Spark application deployment modes, including Kubernetes jobs, the Spark Operator, and notebooks (Jupyter, etc.).
- Kubernetes cluster
- Spark History Server
- Helm installed
Note
You can use the following Spark History Server Helm chart.
To deploy the Spark Web Proxy, refer to the Helm chart README for customization options and installation guidelines.
The web proxy can also be deployed as a sidecar container alongside your existing Spark History Server. In that case, make sure the property configuration.spark.service is set to localhost.
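For instance, a minimal Helm values excerpt for the sidecar setup could look like the sketch below; only the configuration.spark.service property comes from this page, so verify the surrounding values structure against the chart's README:

```yaml
# Sketch of a Helm values excerpt for the sidecar deployment.
# Only configuration.spark.service is documented above; check the rest of the
# values schema against the chart's README.
configuration:
  spark:
    service: localhost   # the proxy reaches the History Server within the same pod
```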
In both cases, you need to use the Spark Web Proxy ingress instead of your Spark History Server ingress.
Both the Spark History Server and the Spark jobs themselves must be configured to log events, and to log them to the same shared, writable directory.
```
spark.history.fs.logDirectory /path/to/the/same/shared/event/logs
spark.eventLog.enabled true
spark.eventLog.dir /path/to/the/same/shared/event/logs
```

The web proxy supports the Spark reverse proxy feature for Spark web UIs, enabled by setting the property spark.ui.reverseProxy=true in your Spark jobs. In that case, the web proxy configuration property configuration.spark.ui.proxyBase should be set to /proxy.
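As an illustration, when submitting jobs with the Spark Operator these properties can be passed through the sparkConf field of a SparkApplication resource. The following is only a partial sketch: the resource name and log path are placeholders, and the rest of the spec (image, main application file, driver and executor settings) is omitted:

```yaml
# Partial SparkApplication sketch (Spark Operator); name and path are placeholders.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-spark-job                  # placeholder
spec:
  sparkConf:
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: "/path/to/the/same/shared/event/logs"
    spark.ui.reverseProxy: "true"     # optional: enables Spark's reverse proxy feature
  # remaining fields (type, mode, image, mainApplicationFile, driver, executor) omitted
```

If you enable spark.ui.reverseProxy this way, remember to also set configuration.spark.ui.proxyBase to /proxy on the web proxy side, as noted above.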
For more configuration properties, refer to the Spark Monitoring configuration page.
In cluster mode, no additional configuration is needed: by default, Spark adds the spark-role: driver label and the spark-ui port to the Spark driver pods, as shown below:
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    ...
    spark-role: driver
spec:
  containers:
  - args:
    - driver
    name: spark-kubernetes-driver
    ports:
    ...
    - containerPort: 4040
      name: spark-ui
      protocol: TCP
```

In client mode, the web proxy relies on the /api/v1/applications/[app-id]/environment Spark History Server REST API to get the Spark driver IP and UI port, and on /api/v1/applications/[app-id] to get the application status.
By default, Spark does not expose the spark.ui.port property in the environment properties, so you should set it explicitly during job submission or via a listener.
Here is an example of how to set spark.ui.port in a Jupyter notebook:
```python
import socket

def find_available_port(start_port=4041, max_port=4100):
    """Find the next available port starting from start_port."""
    for port in range(start_port, max_port):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            # connect_ex returns a non-zero error code when nothing listens on the port
            if s.connect_ex(("localhost", port)) != 0:
                return port
    raise Exception(f"No available ports found in range {start_port}-{max_port}")

# set the chosen port on the SparkConf ('conf') before creating the SparkSession
conf.set("spark.ui.port", find_available_port())
```

The Spark Web Proxy is independent of any specific authentication mechanism. It simply forwards credentials and headers to the running Spark instances without modifying or enforcing authentication itself.
This allows you to use the Spark Authentication Filter or any other authentication solution to secure both the Spark History Server and the Spark jobs, ensuring user authentication and authorization.
