spark-intro

This is a list of resources to make it easier to get started with Spark. You can download Spark from here.

You may ask: why learn yet another framework for 'Big Data' processing when I already know Hadoop? To quote from Spark's Wikipedia entry:

In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.

You might find this video interesting. I found the lecture slides (slides 1-9 and 16-19) by Prof. Randal Burns very useful for understanding the Spark concepts that lead to such a speedup.


Spark requires a cluster manager and a distributed storage system. It ships with its own standalone cluster manager, and it can also run on a Hadoop YARN cluster. Instructions on how to use a YARN cluster on AWS can be accessed here.


The Python programming guide is accessible here. The supported transformations and actions are also documented there. At the time of writing, Spark supports Python 2.7 (Python 3 support was recently merged, so it may become available soon).
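As a rough illustration of the distinction the programming guide draws, here is a minimal sketch: transformations such as `filter` and `map` only describe a new RDD, while actions such as `count` and `collect` trigger the computation. The helper names `is_even`, `square` and `even_squares` are made up for this example, and `sc` is assumed to be an existing `SparkContext` (for instance the one predefined in the `pyspark` shell).

```python
def is_even(n):
    """Predicate used by the filter transformation."""
    return n % 2 == 0


def square(n):
    """Function used by the map transformation."""
    return n * n


def even_squares(sc, numbers):
    """Return (count, values) for the squares of the even numbers.

    `sc` is assumed to be an existing pyspark SparkContext.
    """
    rdd = sc.parallelize(numbers)              # distribute a local collection
    squares = rdd.filter(is_even).map(square)  # transformations: lazy, nothing runs yet
    squares.cache()                            # ask Spark to keep the result in memory
    total = squares.count()                    # action: triggers the computation
    values = squares.collect()                 # action: can reuse the cached data
    return total, values
```

Calling `cache()` before the two actions is what lets the second one reuse the in-memory result instead of recomputing it, which is the repeated-query pattern mentioned in the quote above.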

I have added some sample programs under the python folder. A more comprehensive set of examples is accessible here.

To work and test directly in an IDE like PyCharm, you can either add the pyspark module as a source folder in your project or set the appropriate environment variables (refer to setup.py).
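The environment-variable route might look like the sketch below. It is not the repository's `setup.py`; it assumes the layout of the pre-built Spark downloads, where the `pyspark` package lives under `$SPARK_HOME/python` and a bundled py4j source zip sits under `$SPARK_HOME/python/lib`. The function name `add_pyspark_to_path` is hypothetical.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home=None):
    """Make `import pyspark` work outside spark-submit (e.g. in an IDE).

    Assumes the pre-built distribution layout: <spark_home>/python holds the
    pyspark package and <spark_home>/python/lib holds a py4j-*-src.zip.
    Returns the list of paths that were added to sys.path.
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("Set SPARK_HOME or pass spark_home explicitly")
    os.environ["SPARK_HOME"] = spark_home

    python_dir = os.path.join(spark_home, "python")
    # Glob for the py4j zip so the version number is not hard-coded.
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))

    added = []
    for path in [python_dir] + py4j_zips:
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added
```

Calling this once at the top of a script, with the directory where Spark was unpacked, should be enough for a subsequent `import pyspark` to resolve inside the IDE.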

To run a program on your local machine, you can execute: `/Users/User1/Developer/spark-1.3.0-bin-hadoop2.4/bin/spark-submit wordcount.py`

To run it on a Spark cluster (standalone or EC2), specify the cluster address with the `--master` argument: `/Users/User1/Developer/spark-1.3.0-bin-hadoop2.4/bin/spark-submit --master spark://<master-ip>:7077 wordcount.py`
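For reference, a `wordcount.py` along the following lines would work with both commands. This is a sketch, not necessarily the sample in the python folder: the `tokenize` helper and the input-path argument are assumptions made for illustration. The master is deliberately not set in code, since a value set directly on `SparkConf` takes precedence over the `--master` flag passed to `spark-submit`.

```python
import sys
from operator import add


def tokenize(line):
    """Split a line into lowercase words on whitespace."""
    return line.lower().split()


def word_counts(sc, path):
    """Return a list of (word, count) pairs for the text file at `path`."""
    return (sc.textFile(path)
              .flatMap(tokenize)              # one record per word
              .map(lambda word: (word, 1))    # pair each word with a count of 1
              .reduceByKey(add)               # sum the counts per word
              .collect())                     # action: bring results to the driver


if __name__ == "__main__" and len(sys.argv) > 1:
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("wordcount")  # no setMaster(): spark-submit decides
    sc = SparkContext(conf=conf)
    for word, count in word_counts(sc, sys.argv[1]):
        print("%s\t%d" % (word, count))
    sc.stop()
```

In this sketch the input file is passed as an extra argument after the script name, for example `spark-submit wordcount.py input.txt`, where `input.txt` is a placeholder.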