Skip to content

Running Shark with Tachyon

Haoyuan Li edited this page Jun 28, 2013 · 24 revisions

Shark 0.7 adds a new storage format to support efficiently reading data from Tachyon, which enables data sharing and isolation across instances of Shark.

export TACHYON_MASTER="ec2-67-202-40-159.compute-1.amazonaws.com:9999"

Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.

Downloads | User Documentation | Developer Documentation | Acknowledgement

Highlights

Fast Execution Engine

Shark is built on top of Spark, a data-parallel execution engine that is fast and fault-tolerant. Even if data are on disk, Shark can be noticeably faster than Hive because of the fast execution engine. It avoids the high task launching overhead of Hadoop MapReduce and does not require materializing intermediate data between stages on disk. Thanks to this fast engine, Shark can answer queries in sub-second latency.

Columnar Memory Store

Analytical queries usually focus on a particular subset or time window, e.g., http logs from the previous month, touching only the (small) dimension tables and a small portion of the fact table. These queries exhibit strong temporal locality, and in many cases, it is plausible to fit the working set into a cluster’s memory.

Shark allows users to exploit this temporal locality by storing their working set of data across a cluster's memory, or in database terms, to create in-memory materialized views. Common data types can be cached in a columnar format (as Java primitives arrays), which is very efficient for storage and garbage collection, yet provides maximum performance (orders of magnitude faster than reading data from disk). Below is an example on how to cache data in Shark:

CREATE TABLE logs_last_month_cached AS SELECT * FROM logs WHERE time > date(...);

SELECT page, count(*) c FROM logs_last_month_cached GROUP BY page ORDER BY c DESC LIMIT 10;

Spark/Machine Learning Integration

Shark provides a simple API for programmers to convert results from SQL queries into a special type of RDDs (Resilient Distributed Datasets). This integrates SQL query processing with machine learning, and provides a unified system for data analysis using both SQL and sophisticated statistical learning functions.

val youngUsers = sql2rdd("SELECT * FROM users WHERE age < 20")
println(youngUsers.count)
val featureMatrix = youngUsers.map(extractFeatures(_))
kmeans(featureMatrix)

Running Shark Locally: Get Shark up and running on a single node for a quick spin in ~ 5 mins.

Running Shark on EC2: Launch a Shark cluster on Amazon EC2 in ~ 10 mins, including examples on how to query data in S3.

Running Shark on a Cluster: Get Shark up and running on your own cluster.

Shark User Guide: An introduction to running Shark and its API.

Building Shark from Source Code

Compatibility with Apache Hive: Deploying Shark in existing Hive Warehouses.

Downloads

shark-0.7.0-hadoop1-bin.tgz — Shark 0.7.0 binary with patched Hive 0.9 and Spark 0.7.2 jars - Hadoop1/CDH3

shark-0.7.0-hadoop2-bin.tgz — Shark 0.7.0 binary with patched Hive 0.9 and Spark 0.7.2 jars - Hadoop2/CDH4

amp-hive-0.9.0-shark-0.7.0.tgz — Patched Hive 0.9 - Shark 0.7.0

Older Versions:

shark-0.2.1-bin.tgz — Shark 0.2.1 binary with patched Hive 0.9 and Spark 0.6.1 jars - Hadoop1/CDH3

shark-0.2.1-bin-hadoop2.tgz — Shark 0.2.1 binary with patched Hive 0.9 and Spark 0.6.2 jars - Hadoop2/CDH4

shark-0.2-bin.tgz — Shark 0.2 binary with patched Hive 0.9 and Spark 0.6.2 jars

hive-0.9.0-bin.tar.gz — Patched Hive 0.9

Related Projects

Spark: The in-memory cluster computing framework that powers Shark.

Apache Hive: Apache Hive data warehouse system.

Apache Mesos: cluster manager that provides efficient resource isolation and sharing across distributed applications.