Skip to content

Latest commit

 

History

History
54 lines (36 loc) · 1.3 KB

pyspark.md

File metadata and controls

54 lines (36 loc) · 1.3 KB

PySpark

This document assumes you already have python.

To run PySpark, we first need to add it to PYTHONPATH:

export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH"

Make sure that the version under ${SPARK_HOME}/python/lib/ matches the filename of py4j or you will encounter ModuleNotFoundError: No module named 'py4j' while executing import pyspark.

For example, if the file under ${SPARK_HOME}/python/lib/ is py4j-0.10.9.3-src.zip, then the export PYTHONPATH statement above should be changed to

export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"

Now you can run Jupyter or IPython to test if things work. Go to some other directory, e.g. ~/tmp.

Download a CSV file that we'll use for testing:

wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv

Now let's run ipython (or jupyter notebook) and execute:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

df = spark.read \
    .option("header", "true") \
    .csv('taxi+_zone_lookup.csv')

df.show()

Test that writing works as well:

df.write.parquet('zones')