spark-intro

This is an attempt to list resources that make it easier to get started with Spark. You can download it from here.

You may ask: why learn yet another framework for 'Big Data' processing when I already know Hadoop? To quote from Spark's Wikipedia entry:

In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.

You might find this video interesting. I found the lecture slides (slides 1-9 and 16-19) by Prof. Randal Burns very useful for understanding the Spark concepts that lead to this speedup.

Spark requires a cluster manager and a distributed storage system. It ships with its own native (standalone) cluster manager and can also run on a Hadoop YARN cluster. Instructions for using a YARN cluster on AWS can be accessed here.

The Python programming guide is accessible here. It also documents the various transformations and actions that Spark supports. Spark supports Python 2.7 (Python 3 support was recently merged, so it might become available soon).
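To make the distinction between transformations and actions concrete, here is a minimal sketch. The function and helper names (`summarize`, `is_even`, `square`) are made up for illustration and are not part of this repository. Transformations such as `filter` and `map` only describe a new dataset, while actions such as `count` and `reduce` trigger the actual computation. Calling `cache()` asks Spark to keep the dataset in memory so the second action does not recompute it.

```python
from operator import add


def is_even(n):
    # Predicate used by the filter transformation.
    return n % 2 == 0


def square(n):
    # Function applied by the map transformation.
    return n * n


def summarize(sc, numbers):
    """Return (count, sum) of the squares of the even numbers.

    `sc` is assumed to be an existing pyspark SparkContext.
    """
    # Transformations: lazy, nothing is computed yet.
    squares = sc.parallelize(numbers).filter(is_even).map(square)
    # Ask Spark to keep the result in memory once it has been computed.
    squares.cache()
    # Actions: each one triggers computation; the second reuses the cache.
    return squares.count(), squares.reduce(add)
```

With a local context, for example `sc = SparkContext("local", "demo")`, calling `summarize(sc, range(10))` should return `(5, 120)`.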

I have added some sample programs under the python folder. A more comprehensive set of examples is accessible here.

To work and test directly in an IDE like PyCharm, you can either add the pyspark module as a source folder in your project or set the appropriate environment variables (refer to setup.py).
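The environment-variable route can be sketched as below. This is not a copy of the repository's setup.py; it is only one common way to do it, and it assumes the usual layout of a Spark binary distribution, where the Python sources live under `python/` and the bundled py4j archive under `python/lib/`. The helper name `add_pyspark_to_path` is hypothetical.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Point this Python process at a Spark installation.

    Assumes `spark_home` contains `python/` and `python/lib/py4j-*-src.zip`.
    Returns the list of paths that were considered.
    """
    # Spark's launcher scripts look for this variable.
    os.environ["SPARK_HOME"] = spark_home
    python_dir = os.path.join(spark_home, "python")
    # py4j ships as a zip inside the distribution; the version varies.
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))
    paths = [python_dir] + py4j_zips
    for path in paths:
        if path not in sys.path:
            sys.path.insert(0, path)
    return paths
```

Calling it at the top of a script, before `import pyspark`, with the directory where you unpacked Spark should let the IDE's interpreter find the module.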

To run a program on your local machine, execute: /Users/User1/Developer/spark-1.3.0-bin-hadoop2.4/bin/spark-submit wordcount.py

To run it on a Spark cluster (standalone or EC2), specify the cluster address with the --master argument, as below: /Users/User1/Developer/spark-1.3.0-bin-hadoop2.4/bin/spark-submit --master spark://<master-ip>:7077 wordcount.py
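For readers who want to see what a script submitted this way might look like, here is a minimal word count sketch. It is not necessarily the same as the wordcount.py in the python folder. The helper names (`tokenize`, `to_pair`) and the choice to pass the input path as a command-line argument are assumptions made for illustration.

```python
import re
import sys
from operator import add


def tokenize(line):
    # Lowercase the line and split it into words.
    return re.findall(r"[a-z0-9']+", line.lower())


def to_pair(word):
    # Emit (word, 1) so counts can be summed per word.
    return (word, 1)


def main(path):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    # The master is left unset so that spark-submit's --master decides it.
    sc = SparkContext(appName="WordCountSketch")
    counts = (sc.textFile(path)      # one record per line of input
                .flatMap(tokenize)   # one record per word
                .map(to_pair)        # (word, 1)
                .reduceByKey(add))   # (word, total)
    for word, count in counts.collect():
        print("%s\t%d" % (word, count))
    sc.stop()


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

With this sketch, the input file would be appended after the script name on the spark-submit command line.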
