
BigFrame User Guide

Xiaodan Zhu edited this page Jun 24, 2015 · 20 revisions

BigFrame is a benchmark generator that captures your requirements in terms of the 3Vs emphasized in Big Data environments: Volume, Variety, and Velocity. Given the benchmark specification you provide, it generates:

  • The set of data for initial load (with data loading utility)
  • The refresh pattern for each data set (with refresh driver)
  • The query stream (with query implementation and driver to run on different systems)
  • The benchmark metrics

Prerequisites

BigFrame relies on Hadoop for parallel data generation, so Hadoop must be installed beforehand.

BigFrame requires:

  • JDK 1.6 at minimum (JDK 1.7 is recommended)
  • Hadoop 1.0.4 (other versions are not tested)

To build BigFrame from source, execute the following command:

sbt/sbt assembly
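
Since the build and the later data generation both depend on the prerequisites listed above, it can help to check for them before building. The following is a minimal sketch, not part of BigFrame: the helper names `require_cmd` and `build_bigframe` are hypothetical, and it assumes that `java` and `hadoop` are expected to be on your PATH.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the named command is on PATH, 1 otherwise.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing prerequisite: $1" >&2
    return 1
}

# Hypothetical wrapper: check both prerequisites, then run the build.
# Nothing runs until build_bigframe is called explicitly.
build_bigframe() {
    require_cmd java || return 1
    require_cmd hadoop || return 1
    sbt/sbt assembly
}
```

Calling `build_bigframe` from BigFrame's root directory stops with a message on the first missing tool instead of failing partway through the build.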

Benchmark specification

You can tailor the specification to your needs by modifying the file

conf/bigframe-core.xml

For example, to select the application domain to benchmark, specify the corresponding domain name:

<property>
	<name>bigframe.application.domain</name>
	<value>BI</value>
	<description>
		Choose the application domain you want to benchmark on.
		Currently, supported applications are: BI
	</description>
</property>

To specify the data volume, enter a number representing your data size:

<property>
	<name>bigframe.datavolume</name>
	<value>10</value>
	<description>
		tiny: around 10GB
		small: around 100GB
		medium: around 1TB
		large: around 10TB
		extra large: around 100TB 
	</description>
</property>

Besides the 3Vs, you can also specify which engine the queries will run on. Furthermore, if a query involves several data types, you can tell BigFrame which engine should handle each data type. For example:

<property>
	<name>bigframe.queryengine.relational</name>
	<value>hadoop</value>
</property>

<property>
	<name>bigframe.queryengine.graph</name>
	<value>hive-tez</value>
</property>

<property>
	<name>bigframe.queryengine.nested</name>
	<value>spark</value>
</property>

<property>
	<name>bigframe.queryengine.text</name>
	<value>hive-tez</value>
</property>

You need to install and set up the corresponding systems before actually running the queries; BigFrame will not do this for you.
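
Before running anything, it can be useful to confirm what the specification file actually contains. The sketch below is not part of BigFrame: `get_property` is a hypothetical helper, and it assumes the one-tag-per-line layout shown in the examples above, where each `<value>` follows its `<name>` on a later line.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> of a named property.
#   $1 = path to the specification file
#   $2 = property name, e.g. bigframe.datavolume
# Assumes <name> and <value> sit on separate lines, as in the examples above.
get_property() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}
```

For example, passing the path to your bigframe-core.xml together with `bigframe.queryengine.nested` prints the engine configured for nested data.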

BigFrame Configuration

Before running BigFrame, edit conf/config.sh to set the following variables:

HADOOP_HOME: The Hadoop installation directory. By default, BigFrame tries to read it from the environment variables.
TPCDS_LOCAL: A temporary directory for the intermediate data produced by the TPC-DS generator.

There are other variables related to the drivers. For example, if you want to run the benchmark on Spark, you need to tell BigFrame where to find Spark. This is done by setting the SPARK_HOME parameter as follows:

SPARK_HOME=/path/to/spark_home
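
As a rough illustration, the relevant part of conf/config.sh might look like the sketch below. The variable names come from the text above; every path is a placeholder assumption, and the fall-back-to-environment pattern is only one way to mirror the described default for HADOOP_HOME, not necessarily how the shipped file does it.

```shell
# Use HADOOP_HOME from the environment if it is set; otherwise fall back to
# a placeholder path (assumption: replace with your Hadoop install directory).
HADOOP_HOME="${HADOOP_HOME:-/path/to/hadoop_home}"

# Temporary directory for intermediate TPC-DS generator data (placeholder).
TPCDS_LOCAL="${TPCDS_LOCAL:-/tmp/tpcds_tmp}"

# Only needed when the benchmark runs on Spark (placeholder path).
SPARK_HOME="${SPARK_HOME:-/path/to/spark_home}"

export HADOOP_HOME TPCDS_LOCAL SPARK_HOME
```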

Run BigFrame

After finishing the setup above, you can run BigFrame. The first program to run is the data generator. To start it, type the following command in BigFrame's root directory:

./bin/datagen -mode datagen

It will then try to generate the data sets you specified. Make sure that HDFS and the MapReduce engine are running before you start it.

After data generation finishes, you can run the benchmark queries with this command:

./bin/qgen -mode runqueries

It will prepare a set of queries based on your benchmark specification, and then run the queries on the system you specified.
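
The two steps can be chained so that queries only run when data generation succeeded. This is a minimal sketch under stated assumptions: `run_benchmark` and `BIGFRAME_BIN` are hypothetical names, the launchers are assumed to live in `bin` under BigFrame's root directory, and both are assumed to exit with a non-zero status on failure.

```shell
#!/bin/sh
# Directory holding the datagen and qgen launchers. The ./bin default is an
# assumption based on running from BigFrame's root directory.
BIGFRAME_BIN="${BIGFRAME_BIN:-./bin}"

# Hypothetical wrapper: generate the data, then run the queries only if
# data generation reported success.
run_benchmark() {
    "$BIGFRAME_BIN/datagen" -mode datagen || {
        echo "data generation failed; skipping queries" >&2
        return 1
    }
    "$BIGFRAME_BIN/qgen" -mode runqueries
}
```

Calling `run_benchmark` from BigFrame's root directory runs both phases in order and returns the exit status of the last step that ran.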