This project demonstrates an Oozie workflow with a PySpark action. It assumes that all the PySpark logic lives in a Python library that only needs a HiveContext and a date to run. The library is distributed to all the workers on the cluster, and a pipeline within it is kicked off daily, depending on the availability of some data sources.
An Oozie workflow with a PySpark action is non-trivial, because:

1. Oozie doesn't have native support for a PySpark action.
2. YARN can kill your Oozie container when running PySpark in a shell.
3. Distributing eggs with PySpark can be a challenge due to problems with the `PYTHON_EGG_CACHE` (depending on cluster setup).
Issue 1 is solved by calling `spark-submit` from a shell action; issue 2 by setting the appropriate configuration in the workflow definition; and issue 3 by setting the `PYTHON_EGG_CACHE` for the driver in the calling Python script and for the executors in the `spark-submit` invocation, as sketched below.
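For illustration, here is a minimal sketch of the kind of script such a shell action could launch. The driver script name `run_pipeline.py` and the `./egg_cache` directory are assumptions for the example, not names from this repo:

```
#!/usr/bin/env bash
# Hypothetical submit script launched by the Oozie shell action;
# the coordinator passes the run date as the first argument.
RUN_DATE="$1"

# Driver-side egg cache. This repo sets PYTHON_EGG_CACHE inside the
# calling Python script; exporting it here is an alternative when the
# driver runs in this container (yarn-client mode).
export PYTHON_EGG_CACHE=./egg_cache

spark-submit \
  --master yarn \
  --py-files dist/project-1.0-py3.5.egg \
  --conf spark.executorEnv.PYTHON_EGG_CACHE=./egg_cache \
  run_pipeline.py "$RUN_DATE"
```

The `spark.executorEnv.<name>` settings are Spark's standard mechanism for injecting environment variables into executors, which is what makes the executor-side egg cache configurable from `spark-submit`.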
Note: this repo will crash as-is (and this time it's not Oozie's fault!), because it is a stripped/anonymized version of a real workflow. Change the references to namenodes, HDFS paths, Oozie URLs, `/user/`-style directories, etc.
Create a Python egg by `cd`-ing into the Python project directory, building the egg, and copying it to `dist/` in the Oozie workflow folder:

```
$ cd python-project/
$ python setup.py bdist_egg
$ cp dist/project-1.0-py3.5.egg ../oozie-workflow/dist/
```
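Since an egg is a plain zip archive, you can sanity-check what went into it before shipping (the filename depends on the version in `setup.py` and the Python version used to build it):

```
$ unzip -l dist/project-1.0-py3.5.egg
```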
Run `update_and_run.sh` to upload `oozie-workflow/` to HDFS and run the Oozie coordinator.
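The actual script differs (see the anonymization note above), but it boils down to something like the following sketch, where the HDFS path and the Oozie URL are placeholders:

```
#!/usr/bin/env bash
# Hypothetical sketch of update_and_run.sh: refresh the workflow
# folder on HDFS, then (re)submit the coordinator.
hdfs dfs -rm -r -f /user/someuser/oozie-workflow
hdfs dfs -put oozie-workflow /user/someuser/oozie-workflow
oozie job -oozie http://oozie-host:11000/oozie \
  -config oozie-workflow/job.properties -run
```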
Coordinator to run a workflow:

- `job.properties`: Coordinator properties (start & end date, job tracker, etc.)
- `coordinator.xml`: Coordinator setting the data dependencies.
- `workflow.xml`: Workflow specification.
- `bin/`: Scripts used in the workflow.
- `dist/`: Folder with the Python egg used in the workflow.
Python project: the project code lives in `pythonproject/`, and the egg build instructions are in `setup.py`.