Running Shark with Tachyon

Shark 0.7 adds a new storage format that supports efficient reads from Tachyon, which enables data sharing and isolation across Shark instances. The slides give a good overview of the benefits of using Tachyon to cache Shark's tables. In summary, the major benefits are:

In-memory data sharing across multiple Shark instances (i.e. stronger isolation)
Instant recovery of in-memory tables
Reduced heap size, and therefore faster GC in Shark
If a table is larger than the available memory, only the hot columns are cached in memory

Setup

To use Shark on Tachyon, you first need to set up Tachyon, in either Local Mode or Cluster Mode, with HDFS.

Then edit shark-env.sh and add the following line, replacing the address with that of your Tachyon master:

export TACHYON_MASTER="ec2-67-202-40-159.compute-1.amazonaws.com:9999"
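The edit to shark-env.sh can also be scripted. The following is a minimal sketch, not part of Shark itself: set_tachyon_master is a hypothetical helper, and the location of shark-env.sh under $SHARK_HOME/conf is an assumption about your installation. It appends the export line only if the file does not already contain one, so running it twice is harmless.

```shell
# Hypothetical helper (not shipped with Shark): add a TACHYON_MASTER export
# to a shark-env.sh file.
#   $1 = path to shark-env.sh
#   $2 = Tachyon master address as host:port
set_tachyon_master() {
  env_file="$1"
  master="$2"
  # Create the file if it does not exist yet.
  touch "$env_file" || return 1
  # Append only when no TACHYON_MASTER export is present, so reruns are safe.
  if ! grep -q '^export TACHYON_MASTER=' "$env_file"; then
    printf 'export TACHYON_MASTER="%s"\n' "$master" >> "$env_file"
  fi
}

# Example call (the conf/ path is an assumption; adjust it for your install):
# set_tachyon_master "$SHARK_HOME/conf/shark-env.sh" "ec2-67-202-40-159.compute-1.amazonaws.com:9999"
```

If the file already sets TACHYON_MASTER to a different address, the helper leaves it untouched; edit the existing line by hand in that case.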

Cache Shark table in Tachyon

There are two ways to cache a table in Tachyon:

Specify TBLPROPERTIES("shark.cache" = "tachyon"), for example:

CREATE TABLE data TBLPROPERTIES("shark.cache" = "tachyon") AS SELECT a, b, c FROM data_on_disk WHERE month="May";

Give the table a name ending with _tachyon, for example:

CREATE TABLE orders_tachyon AS SELECT * FROM orders;
