hdf5-on-spark

This repository contains the source code for a paper I wrote detailing performance improvements to how the Plant Phenotyping and Imaging Research Center (P2IRC) project stores its HDF5 files. The implementation provides an HDF5 reader that ingests our HDF5 files into a binary format for storage on Hadoop HDFS. We also use Elasticsearch to store metadata about the partitions and blocks, which allows faster filtering and RDD reconstruction.
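To illustrate why a metadata index speeds up filtering, here is a minimal Python sketch of block pruning by time range. It is not the code from this repository: the `BlockMeta` record, its field names, and the example paths are hypothetical, and a plain list stands in for the Elasticsearch index. The idea is only that blocks whose time range cannot match a query are skipped before any data is read.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical metadata record for one stored block (not the real schema)."""
    path: str      # illustrative location of the binary block on HDFS
    time_min: int  # smallest timestamp in the block
    time_max: int  # largest timestamp in the block


def select_blocks(index: List[BlockMeta], time_gte: int, time_lte: int) -> List[BlockMeta]:
    """Keep only the blocks whose [time_min, time_max] overlaps the query range."""
    return [b for b in index if b.time_max >= time_gte and b.time_min <= time_lte]


# A plain list stands in for the metadata index in this sketch.
index = [
    BlockMeta("hdfs://localhost:9000/example/block-0", 1000, 1999),
    BlockMeta("hdfs://localhost:9000/example/block-1", 2000, 2999),
    BlockMeta("hdfs://localhost:9000/example/block-2", 3000, 3999),
]

# Only block-0 and block-1 overlap the range 1500..2500, so block-2 is never read.
selected = select_blocks(index, time_gte=1500, time_lte=2500)
for block in selected:
    print(block.path)
```

Only the selected blocks would then need to be loaded to rebuild the dataset for a job.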

This implementation is very specific to our use case and setup. However, it is open source so that other developers can learn from it when looking for a way to process HDF5 files on a Spark/Hadoop cluster (or with another storage backend).

It's also possible to write some Python APIs to make job submission easier for data scientists.

Feel free to open an issue if you have any questions!

Many thanks to Noah who helped with the initial code for this project.

Development Setup

To get set up for development, there are a few things you need to install first:

  • Scala 2.11
  • Java 8
  • Hadoop 2.7.7
  • Spark 2.4.1

Then you can set up your Hadoop cluster by doing the following:

  • Download and extract the precompiled Hadoop archive
  • Ensure you have sshd running with passwordless login on your local machine (i.e., set up SSH keys)
  • Set the JAVA_HOME variable in etc/hadoop/hadoop-env.sh, for example: export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_121.jdk/Contents/Home
  • Configure:

etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- Optionally change the NameNode metadata directory (dfs.name.dir is the deprecated alias of dfs.namenode.name.dir) -->
    <property>
        <name>dfs.name.dir</name>
        <value>/usr/local/hadoop/dfs/name</value>
        <final>true</final>
    </property>
</configuration>
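If you want to double-check what a configuration file actually sets, the following Python sketch reads the name/value pairs out of a Hadoop `*-site.xml` document. The function name `read_hadoop_properties` is made up for this example, and the XML is inlined as a string rather than read from your install directory.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text: str) -> dict:
    """Map each <property>'s <name> to its <value> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Same content as the core-site.xml shown above, inlined for the example.
CORE_SITE = """<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>"""

props = read_hadoop_properties(CORE_SITE)
print(props["fs.defaultFS"])
```

To check a real install, pass the contents of etc/hadoop/core-site.xml or etc/hadoop/hdfs-site.xml instead of the inline string.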

Then you can set up your Spark cluster by doing the following:

  • Download the prebuilt Spark package and extract it.
  • Edit some of the configuration files to your liking, for instance:

conf/spark-defaults.conf:

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory                1g
spark.executor.memory              4g
spark.driver.cores                 1
spark.executor.cores               2
spark.driver.extraJavaOptions      -Djava.library.path=/path/to/this/repo/lib
spark.executor.extraJavaOptions    -Djava.library.path=/path/to/this/repo/lib

The Java library path is of particular importance: when running locally without an assembled jar, it must point to this repository's lib/ folder so the native libraries can be found. If you submit jobs with an assembled jar, it can be safely ignored.
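A quick way to confirm which library path the driver and executors will receive is to parse the file yourself. The sketch below is an assumption-laden helper, not part of this repository: `parse_spark_defaults` and `library_path` are hypothetical names, and the parser only handles the whitespace-separated `key value` layout used above (Spark itself accepts more of the Java properties syntax).

```python
from typing import Optional


def parse_spark_defaults(text: str) -> dict:
    """Parse whitespace-separated 'key value' lines, skipping blanks and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def library_path(java_options: str) -> Optional[str]:
    """Return the value of -Djava.library.path=... from a JVM options string, if present."""
    for token in java_options.split():
        if token.startswith("-Djava.library.path="):
            return token.split("=", 1)[1]
    return None


# Inlined excerpt of the spark-defaults.conf shown above.
SPARK_DEFAULTS = """
# spark.master                     spark://master:7077
spark.driver.memory                1g
spark.executor.memory              4g
spark.driver.extraJavaOptions      -Djava.library.path=/path/to/this/repo/lib
spark.executor.extraJavaOptions    -Djava.library.path=/path/to/this/repo/lib
"""

conf = parse_spark_defaults(SPARK_DEFAULTS)
for key in ("spark.driver.extraJavaOptions", "spark.executor.extraJavaOptions"):
    print(key, "->", library_path(conf[key]))
```

If the printed path does not point at this repository's lib/ folder, the native HDF5 libraries will likely fail to load when running locally.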

conf/spark-env.sh: If you have multiple Java versions installed, you may want to set JAVA_HOME explicitly for the driver and executors.

JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home

Lastly, you can start/stop your local cluster by using the following commands.

  • Startup:
# In Hadoop Install Directory
sbin/start-dfs.sh
# WebUI Running on http://localhost:50070

# In Spark Install Directory
sbin/start-all.sh
# WebUI Running on http://localhost:8080
  • Shutdown:
# In Hadoop Install Directory
sbin/stop-dfs.sh

# In Spark Install Directory
sbin/stop-all.sh
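After starting the cluster, you can check from a script whether the two web UIs are listening. This is a small sketch using only the Python standard library. The helper name `is_port_open` is made up for the example, and the ports are the ones mentioned in the startup comments above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the startup comments above.
SERVICES = {"HDFS NameNode web UI": 50070, "Spark master web UI": 8080}

for name, port in SERVICES.items():
    status = "up" if is_port_open("localhost", port) else "down"
    print(f"{name} (port {port}): {status}")
```

An open port only shows that something is listening there, so it is still worth opening the web UIs once to confirm the services look healthy.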

To submit jobs to a running cluster you can use the included scripts/ingest.sh, but first make sure to copy the .env.template to .env and edit it accordingly.

Some benchmark commands

  • Python Naive Filter
# Combined keith data
python3 src/main/python/naiveFilter.py /trux/data/P2IRC/hdf5/keith --time-gte 1533833720014 --time-lte 1533837525405
# Small data
python3 src/main/python/naiveFilter.py /trux/data/P2IRC/hdf5/small-data --time-gte 1527207282644 --time-lte 1527238446092
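For readers who want a feel for what a naive time-range filter does, here is a minimal sketch that accepts the same flags as the commands above. It is not the contents of naiveFilter.py: it assumes the timestamps are epoch milliseconds (the 13-digit values suggest this), and it filters an in-memory list instead of reading HDF5 files.

```python
import argparse
from datetime import datetime, timezone
from typing import Iterable, List


def build_parser() -> argparse.ArgumentParser:
    """Accept a data path plus inclusive lower/upper time bounds."""
    parser = argparse.ArgumentParser(description="Naive time-range filter (illustrative sketch)")
    parser.add_argument("path", help="directory containing the data")
    parser.add_argument("--time-gte", type=int, required=True, help="inclusive lower bound")
    parser.add_argument("--time-lte", type=int, required=True, help="inclusive upper bound")
    return parser


def naive_filter(timestamps: Iterable[int], time_gte: int, time_lte: int) -> List[int]:
    """Scan every timestamp and keep those inside the inclusive range."""
    return [t for t in timestamps if time_gte <= t <= time_lte]


def to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds (assumed unit) to a UTC datetime, truncated to seconds."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)


# Parse the same arguments as the "Combined keith data" command above.
args = build_parser().parse_args(
    ["/trux/data/P2IRC/hdf5/keith", "--time-gte", "1533833720014", "--time-lte", "1533837525405"]
)

# Made-up sample timestamps: one before, one inside, and one after the range.
sample = [1533833000000, 1533835000000, 1533838000000]
kept = naive_filter(sample, args.time_gte, args.time_lte)
print(kept, to_utc(args.time_gte).isoformat(), to_utc(args.time_lte).isoformat())
```

Because every record is scanned, the cost grows with the total data size rather than with the size of the matching range, which is what makes this approach a useful baseline for benchmarks.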

Libraries

We make use of two HDF5 libraries, jhdf5 and hdf5-java. Everything in lib/ except jarhdf5 belongs to jhdf5 and needs to be there.

We include both the batteries-included jar and the jarhdf5 jar.