Shark 0.7.0
Release date: June 6, 2013
We are happy to announce Shark 0.7.0, a new release with a number of bug fixes and improvements. In particular, we have added experimental support for the Tachyon project. The current release requires:
- Scala 2.9.3
- Spark 0.7.2
- OpenJDK 7, Oracle HotSpot JDK 7, or Oracle HotSpot JDK 6u23+ (because we use certain Unsafe operations that are available only in more recent JDKs)
You can download the pre-packaged binary tarballs on our GitHub Wiki: https://github.com/amplab/shark/wiki
Release Versioning
With this release, we are experimenting with a simplified versioning scheme for Shark. The major release number for Shark will synchronize with the major Spark release number.
Tachyon Integration
Tachyon is a new project at UC Berkeley AMPLab that acts as a distributed in-memory storage layer on top of HDFS. Shark’s in-memory columnar storage engine has been rewritten to work with Tachyon, and users can choose to save an in-memory table into Tachyon. By decoupling the lifespan of the in-memory tables from the lifespan of the Shark processes, Tachyon provides a number of benefits:
- In-memory tables can now be shared by multiple Shark / Spark instances.
- JVM garbage collection times are shorter because of smaller JVM heap sizes for Shark processes.
- In-memory tables can survive when rogue applications crash Shark processes.
To choose Tachyon as the storage system for in-memory tables, set the table property “shark.cache” to “tachyon”, e.g.
CREATE TABLE data TBLPROPERTIES("shark.cache" = "tachyon") AS
SELECT a, b, c FROM data_on_disk WHERE month = "May";
Improved sql2rdd/sql2console API
We have improved the reliability of the sql2rdd and sql2console APIs. In particular, they are now used extensively in unit tests.
New Data Types and Data Serialization/Deserialization Formats
We added two new data types to the memory store: timestamp and binary. We also added Avro serialization and deserialization so Shark can read Avro files.
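As a rough illustration, the statement below loads timestamp and binary columns into an in-memory table, reusing the table property from the Tachyon example. It is only a sketch: the table events_on_disk and its columns event_time (a string) and payload (a binary column) are hypothetical, and it assumes the standard Hive CAST syntax applies.

```sql
-- Hypothetical source table: events_on_disk(event_time STRING, payload BINARY).
-- The resulting in-memory table holds a timestamp column and a binary column.
CREATE TABLE events_mem TBLPROPERTIES("shark.cache" = "tachyon") AS
SELECT CAST(event_time AS TIMESTAMP) AS event_time, payload
FROM events_on_disk;
```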
Improved LIMIT Support
Shark now avoids launching any tasks if a query or a subquery uses LIMIT 0. For quick exploratory queries, Shark launches one task at a time when LIMIT is specified.
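The two queries below sketch the cases described here, using the data table from the Tachyon example; the row count of 10 is an arbitrary choice for illustration.

```sql
-- LIMIT 0: no rows are requested, so per the note above no tasks are launched.
SELECT a, b, c FROM data LIMIT 0;

-- Small LIMIT: an exploratory peek at a few rows.
SELECT a, b, c FROM data LIMIT 10;
```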
Appending Data Into In-Memory Tables
You can now insert (with or without overwrite) additional data into in-memory tables.
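As a sketch, the statements below first append to and then overwrite the data table from the Tachyon example. They assume that Shark accepts the standard Hive INSERT INTO TABLE and INSERT OVERWRITE TABLE forms for in-memory tables; the month values are hypothetical.

```sql
-- Append rows to the existing in-memory table (assumes Hive-style INSERT INTO).
INSERT INTO TABLE data
SELECT a, b, c FROM data_on_disk WHERE month = "June";

-- Replace the table's current contents with a fresh selection.
INSERT OVERWRITE TABLE data
SELECT a, b, c FROM data_on_disk WHERE month = "May";
```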
Enhanced EC2/S3/EMR Support
We have enhanced EC2/S3/EMR support in Shark. For example, the Shark CLI can now execute queries defined in an S3 file (bin/shark -f s3://...). Shark also picks up AWS credentials directly from environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
Better Support for Hadoop2/CDH4
The latest releases of Spark and Shark include pre-compiled binaries for both the Hadoop1 and Hadoop2 storage APIs, eliminating the need for users to build them themselves. We’ve also updated the documentation to point out major “gotchas” encountered when running on Hadoop2.
Better Memory Management and Cluster Resource UI
Thanks to the new features in Spark, you can now monitor the status of in-memory storage and cluster nodes on Spark’s web UI.
Credits
We would like to thank Mikhail Bautin, Tathagata Das, Harvey Feng, Mark Hamstra, Cheng Hao, Jon Hartlaub, Nandu Jayakumar, Jey Kottalam, Haoyuan Li, Josh Rosen, Ram Sriharsha, Patrick Wendell, and Reynold Xin for their contributions.