
PySpark Project Creation

  1. Create the project directory.
  2. Copy the launch_spark_submit script into it (required if a notebook is also running on the same Spark installation):

```bash
#!/bin/bash
# Clear the driver setting so spark-submit uses the plain Python driver
unset PYSPARK_DRIVER_PYTHON
# Forward all arguments to spark-submit
spark-submit "$@"
# Re-export the Jupyter driver setting
export PYSPARK_DRIVER_PYTHON=jupyter
```
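
This wrapper clears PYSPARK_DRIVER_PYTHON so that spark-submit does not try to start Jupyter as the Python driver, forwards all of its arguments to spark-submit, and then re-exports the Jupyter setting.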

  3. Now create the entry program entry.py with a 'main' (a minimal sketch follows the setup.py example below).
  4. Create another directory 'additionalCode'.
  5. cd additionalCode
  6. Create setup.py:

```python
from setuptools import setup

setup(
    name='PySparkUtilities',
    version='0.1dev',
    packages=['utilities'],
    license='''Creative Commons Attribution-Noncommercial-Share Alike license''',
    long_description='''An example of how to package code for PySpark'''
)
```
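
A minimal entry.py (referenced from step 3) might look like the sketch below, assuming the egg built in the later steps is passed via --py-files. The utilities.gutils module and its square function are hypothetical placeholders, not part of the original project; a matching sample module is sketched at the end of this page.

```python
# entry.py -- minimal sketch; utilities.gutils and square() are
# hypothetical placeholders for your own packaged modules.
from pyspark import SparkContext

import utilities.gutils as gutils  # resolved from the --py-files egg


def main():
    sc = SparkContext(appName='PySparkProject')
    rdd = sc.parallelize(range(100))
    # Apply a helper from the packaged utilities module on the executors
    total = rdd.map(gutils.square).sum()
    print('Sum of squares: %d' % total)
    sc.stop()


if __name__ == '__main__':
    main()
```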

  7. mkdir utilities
  8. Copy your modules inside it (a sample module and the expected layout are sketched after this list).
  9. In additionalCode, execute: python setup.py bdist_egg
  10. This will create a dist directory.
  11. dist will contain the egg file.
  12. To run:

```bash
./launch_spark_submit.sh --master local[4] --py-files additionalCode/dist/PySparkUtilities-0.1.dev0-py2.7.egg entry.py
```
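
As a concrete illustration of steps 7 and 8, here is a hypothetical utilities/gutils.py module providing the square helper used in the entry.py sketch above, together with the directory layout this walkthrough assumes; the file names not given in the steps are placeholders.

```python
# utilities/gutils.py -- hypothetical helper module packaged into the egg.
#
# Assumed layout before running `python setup.py bdist_egg`:
#
#   <project directory>/
#     launch_spark_submit.sh
#     entry.py
#     additionalCode/
#       setup.py
#       utilities/
#         __init__.py   # empty file; makes 'utilities' an importable package
#         gutils.py


def square(x):
    """Return x squared; called from entry.py inside rdd.map()."""
    return x * x
```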