Skip to content

apreto/poc-tuning-spark-join

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Tuning/optimizing Spark Joins (BACKLOG-30038 Spike)

This builds an scala app in a uber jar for using with spark-submit.

Build the uber jar from cmd line using sbt:

  • /opt/sbt-1.2.7/sbt/bin/sbt assembly

This will an uber-jar to be used with spark-submit on target/scala-2.11/BL30038-1.0-assembly.jar. Run it with:

  • /opt/spark/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --master local --class B01DatasetJoinNoPartition target/scala-2.11/BL30038-1.0-assembly.jar

Available classes on uber jar to run with spark-submit:

  • A01RDDJoinNoPartition
  • A02RDDJoinWithPartition
  • B01DatasetJoinNoPartition
  • B02DatasetJoinWithPartition
  • B03DatasetJoinBroadcast

For running in yarn mode on a cluster (e.g., EMR), copy jar file there and try something like:

  • spark-submit --master yarn --class MainClassName BL30038-1.0-assembly.jar
  • spark-submit --master yarn --driver-memory 4g --executor-memory 2g --executor-cores 3 --class MainClassName BL30038-1.0-assembly.jar
  • spark-submit --master yarn --driver-memory 4g --executor-memory 2g --executor-cores 3 --num-executors 5 --class MainClassName BL30038-1.0-assembly.jar

About

Proof of concept - Spark Join Tuning (BACKLOG-30038)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published