Shark 0.8.1
Release date: Jan 15, 2014
Shark 0.8.1 introduces a set of performance, maintenance, and usability features, with an emphasis on improved Hive compatibility, Tachyon support, Spark integration, and table generating functions. This release requires:
- Scala 2.9.3
- Spark 0.8.1
- AMPLab's Hive 0.9 distribution. Binaries are provided in the `hive-0.9.0-bin.tgz` shipped with this release.
Caching Semantics
To simplify caching and table recovery semantics, we've implemented a write-through cache as the default for in-memory tables (i.e., tables created with the `_cached` name suffix or with the `shark.cache` table property set to `MEMORY`).
Any table data written to the in-memory, columnar cache is synchronized with the backing, fault-tolerant store specified by the Hive warehouse directory (e.g., HDFS). Since table metadata and in-memory data are both persistent, such tables can now be automatically recovered across Shark session restarts.
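As a minimal sketch, creating such a write-through table from the Shark shell could look like the following, assuming `sc` is the SharkContext bound by `SHARK_HOME/bin/shark-shell` and using made-up table and column names:

```scala
// Minimal sketch, assuming `sc` is the SharkContext provided by
// SHARK_HOME/bin/shark-shell; table and column names are illustrative.

// Option 1: the _cached name suffix marks the table as in-memory.
sc.sql("CREATE TABLE page_views_cached (user_id INT, url STRING)")

// Option 2: the explicit table property; MEMORY is the write-through level
// described above, so data is also persisted under the Hive warehouse directory.
sc.sql("""CREATE TABLE page_views_mem (user_id INT, url STRING)
          TBLPROPERTIES ('shark.cache' = 'MEMORY')""")
```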
Additional notes on table caching semantics:
- You can now create a cached, `MEMORY`-level table by simply caching the underlying table: `CACHE <table_name>`.
- Append operations (i.e., using `INSERT` or `LOAD`) on `MEMORY` tables may be slower due to the additional write to the persistent store.
- Tables targeted with the `CACHE` command, as well as tables created with the `_cached` name suffix, are always pinned at the `MEMORY` level. To revert to the ephemeral scheme offered in v0.8.0 and prior, create a table with the `shark.cache` table property set to `MEMORY_ONLY` and a name that does not include the `_cached` suffix (see the sketch after this list).
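A minimal sketch of the commands described in the notes above, again assuming the shell-provided SharkContext `sc` and illustrative table names:

```scala
// Minimal sketch, assuming `sc` is the SharkContext provided by SHARK_HOME/bin/shark-shell.

// Pin an existing on-disk table at the MEMORY level.
sc.sql("CACHE page_views")

// Appends go through the write-through cache, so they also hit the
// persistent store and may be slower than in v0.8.0.
sc.sql("INSERT INTO TABLE page_views SELECT * FROM page_views_staging")

// Revert to the ephemeral, memory-only scheme of v0.8.0 and earlier:
// no _cached suffix and shark.cache set to MEMORY_ONLY.
sc.sql("""CREATE TABLE page_views_tmp (user_id INT, url STRING)
          TBLPROPERTIES ('shark.cache' = 'MEMORY_ONLY')""")
```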
Partitioned Tables
Users can now create and cache partitioned tables. Unlike RDD partitions, which correspond to Hadoop splits, Hive "partitions" are analogous to indexes. Each partition is represented by an RDD and is identified by the set of runtime values for the virtual partitioning columns specified at table creation.
In-memory partitioned tables also adhere to partition-level cache policies, which can be toggled through the `shark.cache.policy` table property and customized by implementing the `CachePolicy` interface (an LRU implementation is provided).
During query execution, Shark uses partitioning keys to automatically filter input partitions. This feature can be combined with RDD-partition-level pruning on non-partitioned columns to further decrease the amount of data that needs to be fetched and scanned.
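A minimal sketch of a cached, Hive-partitioned table, assuming the shell-provided SharkContext `sc`; the `shark.cache.policy` value shown is illustrative rather than a definitive class name:

```scala
// Minimal sketch, assuming `sc` is the SharkContext provided by SHARK_HOME/bin/shark-shell.
// The shark.cache.policy value below is illustrative; any implementation of the
// CachePolicy interface (such as the provided LRU policy) can be plugged in.
sc.sql("""CREATE TABLE logs_cached (user_id INT, url STRING)
          PARTITIONED BY (dt STRING)
          TBLPROPERTIES ('shark.cache.policy' = 'LRUCachePolicy')""")

// A predicate on the partitioning column prunes the scan down to the
// matching partition RDD(s) before any row-level filtering happens.
sc.sql("SELECT COUNT(*) FROM logs_cached WHERE dt = '2014-01-15'")
```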
Tachyon Support
The complete set of commands supported for in-memory Shark tables stored in the Spark-managed heap is now supported for Tachyon-backed tables as well. This includes the Hive-partitioned table and table recovery features added in this 0.8.1 release.
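A minimal sketch, assuming the shell-provided SharkContext `sc` and assuming Tachyon storage is selected through the `shark.cache` table property (the `tachyon` value is an assumption for illustration):

```scala
// Minimal sketch, assuming `sc` is the SharkContext provided by SHARK_HOME/bin/shark-shell,
// and assuming Tachyon storage is selected via the shark.cache table property
// (the 'tachyon' value is an assumption for illustration).
sc.sql("""CREATE TABLE logs_tachyon (user_id INT, url STRING)
          PARTITIONED BY (dt STRING)
          TBLPROPERTIES ('shark.cache' = 'tachyon')""")

// The same commands used on heap-backed in-memory tables apply here as well.
sc.sql("SELECT COUNT(*) FROM logs_tachyon WHERE dt = '2014-01-15'")
```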
Spark Integration
Stability and usability improvements have been added to reduce friction when converting between native Spark RDDs and Shark tables. A key pair of features is `SharkContext`'s `sqlRdd()` functions and the `rddToTable` implicit conversions, both of which can automatically deduce data types and update the necessary metadata for transitions between RDDs and Shark tables. Both can be tried out by launching a Shark shell (`SHARK_HOME/bin/shark-shell`); see the sketch below.
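A minimal sketch of the round trip, using the `sqlRdd()` and `rddToTable` names from above; the exact signatures, the `saveAsTable` helper, and the tuple types are assumptions for illustration, not a definitive API reference:

```scala
// Run inside the Scala REPL started by SHARK_HOME/bin/shark-shell, which binds a
// SharkContext (assumed here to be `sc`). The signatures below are assumptions
// based on the feature names in these notes, not a definitive API reference.

// SQL -> RDD: sqlRdd() is described as deducing the column types itself.
val views = sc.sqlRdd[(Int, String)]("SELECT user_id, url FROM page_views_cached")

// RDD -> table: the rddToTable implicit conversion is described as updating the
// table metadata; the saveAsTable helper and its arguments are hypothetical.
val scores = sc.parallelize(Seq((1, 0.9), (2, 0.4)))
scores.saveAsTable("scores_cached", Seq("id", "score"))
```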
Table Generating Functions (TGFs)
Shark can now call into libraries that generate tables through TGFs. This enables Shark to easily access external libraries, such as Spark’s machine learning library (MLLIB).
Calls can be made into TGFs by executing `GENERATE tgf_name(params)` or `GENERATE tgf_name(params) SAVE table_name`. TGFs are flexible: they can take arbitrary tables and parameters as inputs and produce a new table with an accompanying schema.
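A minimal sketch, assuming the shell-provided SharkContext `sc`; `KMeans` and `tweets` are a hypothetical TGF name and input table used only for illustration:

```scala
// Minimal sketch, assuming `sc` is the SharkContext provided by SHARK_HOME/bin/shark-shell.
// KMeans is a hypothetical TGF name and tweets a hypothetical input table; any
// registered TGF is invoked with the same GENERATE syntax.

// Return the generated table directly to the caller.
sc.sql("GENERATE KMeans(tweets, 10)")

// Or materialize it as a new Shark table with its accompanying schema.
sc.sql("GENERATE KMeans(tweets, 10) SAVE tweet_clusters")
```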
Other improvements
- To reduce the overhead for Hive-partitioned table scans, job configurations are only broadcasted once and shared throughout the entire read operation over a partitioned table. Previously, these configuration variables were broadcasted once per partition.
- Commands that use `COUNT DISTINCT` operations but don't include grouping keys are automatically rewritten to generate query plans that can take advantage of multiple reducers (set through the `mapred.reduce.tasks` property) and increased parallelism. This eliminates the previous single-reducer bottleneck (see the sketch after this list).
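A minimal sketch of the rewrite in action, assuming the shell-provided SharkContext `sc` and illustrative table and column names:

```scala
// Minimal sketch, assuming `sc` is the SharkContext provided by SHARK_HOME/bin/shark-shell;
// the table and column names are illustrative.

// Reducers made available to the rewritten query plan.
sc.sql("SET mapred.reduce.tasks=32")

// No grouping keys: this used to funnel through a single reducer, but is now
// rewritten into a plan that spreads the distinct aggregation across reducers.
sc.sql("SELECT COUNT(DISTINCT user_id) FROM page_views")
```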
Credits
Michael Armbrust - test util improvements
Harvey Feng - Tachyon support, caching semantics, partitioned table, release manager
Ali Ghodsi - table generating functions
Mark Hamstra - build fix
Cheng Hao - work on removing Hive operator dependencies
Nandu Jayakumar - code and style cleanup
Andy Konwinski - build script fix
Haoyuan Li - Tachyon integration
Xi Liu - byte buffer overflow bug fix
Sundeep Narravula - support for database namespaces for cached tables, code cleanup
Patrick Wendell - bug fix
Reynold Xin - caching semantics, Spark integration, miscellaneous bug fixes
Thanks to everyone who contributed!