spark-yarn
Launch the Spark shell, following the instructions for your environment:
$ spark-shell
Access the 'Resource Manager UI' as follows:
- Log in to Ambari
- Click on the YARN service
- From the 'Quick Links' drop-down menu at the top, open 'Resource Manager UI'
Inspect the running applications.
You won't see the spark-shell application, because it was launched in 'local' mode and therefore never registered with YARN.
In Spark Shell, try the following:
sc.master
You will see 'local[*]' as master.
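To confirm from inside the shell that nothing is running on the cluster, you can print a few more properties of the context. This is a minimal sketch that assumes the `sc` SparkContext that spark-shell creates for you; the exact values depend on your machine.

```scala
// Assumes the predefined `sc` SparkContext of a running spark-shell.
println(s"master      = ${sc.master}")             // the master URL the shell connected to
println(s"is local    = ${sc.isLocal}")            // true when the master is a local[...] URL
println(s"app name    = ${sc.appName}")            // name the application would be listed under
println(s"parallelism = ${sc.defaultParallelism}") // default number of partitions/tasks
```

In local mode the driver and the executor threads all live in this one JVM, which is why YARN has nothing to show.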
Quit the Spark shell by typing exit (or :quit if exit is not recognized).
Let's launch spark-shell again, this time connected to YARN:
$ spark-shell --master yarn --deploy-mode client \
--driver-memory 512m --executor-memory 512m \
--num-executors 2 --executor-cores 1
Once the shell is running, try the following:
sc.master
If the console output is too noisy, reduce the log level so that only warnings and errors are printed:
sc.setLogLevel("WARN")
Also inspect the YARN Resource Manager UI. You should now see the Spark shell listed as a running application.
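To match the entry in the Resource Manager UI with this shell, you can print the application ID and read back the settings that were passed on the command line. This is a sketch that assumes the predefined `sc` of a shell started with --master yarn. Depending on the Spark version, the master may be reported as `yarn-client` or `yarn`.

```scala
// Assumes the predefined `sc` SparkContext of a spark-shell started with --master yarn.
println(s"master         = ${sc.master}")
println(s"application id = ${sc.applicationId}") // compare with the ID shown in the Resource Manager UI

// getOption avoids an exception for keys that were not set
val conf = sc.getConf
val keys = Seq(
  "spark.driver.memory",
  "spark.executor.memory",
  "spark.executor.instances",
  "spark.executor.cores")
keys.foreach { key =>
  println(s"$key = ${conf.getOption(key).getOrElse("(not set)")}")
}
```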
The options passed to spark-shell mean the following:
- --master yarn : submit the application to the YARN cluster
- --deploy-mode client : use client mode, which keeps the driver on the machine where the shell was started; interactive applications like the Spark shell need this mode
- --driver-memory 512m : memory allocated to the driver
- --executor-memory 512m : memory allocated to each executor in YARN
- --num-executors 2 : use 2 executors for processing
- --executor-cores 1 : use only 1 CPU core per executor
We are 'throttling down' the resource usage, as we are running on a small virtual machine.
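To get a feel for what these settings cost on a small VM, here is a rough estimate of the YARN container size requested per executor. It is only a sketch under stated assumptions that are not part of this lab: Spark adds a memory overhead of max(384 MB, 10% of the executor memory), and YARN rounds each request up to a multiple of its minimum allocation (1024 MB is assumed here; your cluster may use a different value or a different scheduler policy). The names `ContainerEstimate`, `overheadMb` and `containerMb` are hypothetical helpers.

```scala
// Rough, illustrative estimate only. The overhead rule and the 1024 MB
// minimum allocation are assumptions; check your cluster's configuration.
object ContainerEstimate {
  val MinOverheadMb = 384
  val OverheadFraction = 0.10

  // overhead that is added on top of --executor-memory
  def overheadMb(executorMemoryMb: Int): Int =
    math.max(MinOverheadMb, (executorMemoryMb * OverheadFraction).toInt)

  // request rounded up to the next multiple of the minimum allocation
  def containerMb(executorMemoryMb: Int, minAllocationMb: Int = 1024): Int = {
    val requested = executorMemoryMb + overheadMb(executorMemoryMb)
    ((requested + minAllocationMb - 1) / minAllocationMb) * minAllocationMb
  }
}

val perExecutor = ContainerEstimate.containerMb(512)
println(s"Estimated container per executor: $perExecutor MB")
println(s"Estimated total for 2 executors: ${2 * perExecutor} MB")
```

Under these assumptions a 512 MB executor ends up in a 1024 MB container, so the two executors together take about 2048 MB, before counting the application master's own container.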
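Let's do a simple DataFrame computation in this Spark shell.
// load a single JSON file into a DataFrame
val clickstream = sqlContext.read.json("/user/root/clickstream/in-json/clickstream.json")
clickstream.count
clickstream.show
// register the DataFrame as a temporary table so it can be queried with SQL
clickstream.registerTempTable("clickstream")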
// count traffic per domain from highest to lowest
sqlContext.sql("select domain, count(*) as total from clickstream group by domain order by total desc").show
// now load the entire JSON directory instead of a single file
val clickstream = sqlContext.read.json("/user/root/clickstream/in-json/")
clickstream.count
// re-register so that the table name points at the new DataFrame
clickstream.registerTempTable("clickstream")
sqlContext.sql("select domain, count(*) as total from clickstream group by domain order by total desc").show
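The same per-domain count can also be expressed with the DataFrame API instead of SQL. This is a sketch that assumes the `clickstream` DataFrame loaded above and its `domain` column; the names `counts` and `trafficPerDomain` are only illustrative.

```scala
import org.apache.spark.sql.functions.desc

// Assumes `clickstream` is the DataFrame loaded above.
// groupBy(...).count() adds a column named "count" with the rows per group
val counts = clickstream.groupBy("domain").count()

// rename to match the SQL version and sort from highest to lowest
val trafficPerDomain = counts.withColumnRenamed("count", "total").orderBy(desc("total"))

trafficPerDomain.show()
```

Both forms should produce the same rows. The DataFrame version avoids registering a temporary table first.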