This repository contains two projects:
A tool built with Spark SQL and Hive to derive performance metrics from the log files generated by Spark. To use the tool you need a version of Spark that includes Hive support; to add such support to your Spark build, please refer to the Spark build documentation.
To generate log files from Spark, add the following properties to your spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir file://some/directory/of/your/choice
Note that the Spark event log directory can also be on another file system such as HDFS (e.g. hdfs://user/logs). In order to build the DAG from the logs, the latest development version of Spark (1.4.0-SNAPSHOT) should be used.
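The event log that Spark writes to this directory is a file of newline-delimited JSON listener events. As a rough illustration of the kind of information the tool extracts, the sketch below pulls per-stage durations out of such events; the field names follow the Spark 1.x JSON event format, and the sample event is synthetic, not taken from a real log.

```python
import json

def stage_durations(lines):
    """Map stage ID -> duration (ms) from event-log JSON lines."""
    durations = {}
    for line in lines:
        event = json.loads(line)
        # Stage timing lives on the SparkListenerStageCompleted event
        if event.get("Event") == "SparkListenerStageCompleted":
            info = event["Stage Info"]
            durations[info["Stage ID"]] = (
                info["Completion Time"] - info["Submission Time"]
            )
    return durations

# Synthetic example event (placeholder timestamps, not from a real log)
sample = [json.dumps({
    "Event": "SparkListenerStageCompleted",
    "Stage Info": {"Stage ID": 0,
                   "Submission Time": 1000,
                   "Completion Time": 1750},
})]
print(stage_durations(sample))  # {0: 750}
```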
Use the -u (--usage) parameter to show the usage guide.
The tool can be built using Maven with
mvn clean package -Dmaven.test.skip=true
This generates a fat jar with all the needed dependencies. Once built, the tool can be invoked using the spark-submit script provided with Spark.
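For example, a spark-submit invocation printing the usage guide might look like this (the jar name and main class below are placeholders, not taken from the repository; check your build output and the usage guide for the actual names):

```shell
# Jar name and main class are hypothetical placeholders
spark-submit --class sparklogprocessor.Main \
  target/spark-log-processor-with-dependencies.jar -u
```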
The Performance Estimator is a tool that estimates the runtime of a Spark application starting from an estimate of the time spent in each stage. It uses the DAG built by the Spark Log Processor (and exported with the -e parameter).
To use the performance estimation tool, first build it, then run the executable jar with the following parameters:
-i
to specify the input folder containing the DAGs (or a single DAG) of each job
-p
to specify the input file containing the performance information of the stages; such a file can be exported by the Spark Log Processor tool
Note: currently only DAGs containing stages can be processed.
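Putting the two parameters together, a run might look like this (the jar and file names are placeholders, not taken from the repository):

```shell
# -i: folder with the exported DAGs; -p: stage performance file
java -jar performance-estimator.jar -i ./dags -p ./stage-perf.txt
```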
To build the tool, run
mvn install