
Shark 0.7.0

@rxin rxin released this 17 Oct 23:24
· 855 commits to master since this release

Release date: June 6, 2013

We are happy to announce Shark 0.7.0, a new release with a number of bug fixes and improvements. In particular, we have added experimental support for the Tachyon project. The current release requires:

  • Scala 2.9.3
  • Spark 0.7.2
  • OpenJDK 7, Oracle HotSpot JDK 7, or Oracle HotSpot JDK 6u23+ (Shark uses certain Unsafe operations that are available only in these more recent JDKs)

You can download the pre-packaged binary tarballs on our GitHub Wiki: https://github.com/amplab/shark/wiki

Release Versioning

With this release, we are experimenting with a simplified versioning scheme for Shark. The major release number for Shark will synchronize with the major Spark release number.

Tachyon Integration

Tachyon is a new project at UC Berkeley AMPLab that acts as a distributed in-memory storage layer on top of HDFS. Shark’s in-memory columnar storage engine has been rewritten to work with Tachyon, and users can choose to save an in-memory table into Tachyon. By decoupling the lifespan of the in-memory tables from the lifespan of the Shark processes, Tachyon provides a number of benefits:

  • In-memory tables can now be shared by multiple Shark / Spark instances.
  • JVM garbage collection times are shorter because of smaller JVM heap sizes for Shark processes.
  • In-memory tables can survive when rogue applications crash Shark processes.

To choose Tachyon as the storage system for in-memory tables, set the table property “shark.cache” to “tachyon”, e.g.

CREATE TABLE data TBLPROPERTIES("shark.cache" = "tachyon") AS
SELECT a, b, c FROM data_on_disk WHERE month="May";

Improved sql2rdd/sql2console API

We have improved the reliability of the sql2rdd and sql2console APIs. In particular, they are now used extensively in unit tests.

New Data Types and Data Serialization/Deserialization Formats

We added two new data types to the memory store: timestamp and binary. We also added Avro serialization and deserialization so Shark can read Avro files.

Improved LIMIT Support

Shark now avoids launching any tasks if a query or a subquery uses LIMIT 0. For quick exploratory queries, Shark launches one task at a time when LIMIT is specified.
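
As a small illustration of the two cases, here is a sketch that assumes the `data` table from the Tachyon example above already exists. The limit value of 10 is arbitrary and chosen only for illustration.

```sql
-- LIMIT 0: per the note above, Shark answers this without launching any tasks.
SELECT * FROM data LIMIT 0;

-- LIMIT n: per the note above, Shark launches one task at a time,
-- which suits quick exploratory queries.
SELECT a, b, c FROM data LIMIT 10;
```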

Appending Data Into In-Memory Tables

You can now insert (with or without overwrite) additional data into in-memory tables.
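
A minimal sketch of both forms, assuming standard HiveQL INSERT syntax and reusing the `data` and `data_on_disk` tables from the Tachyon example above. The "June" filter value is hypothetical and only illustrates loading a second batch.

```sql
-- Append: add more rows to the existing in-memory table.
INSERT INTO TABLE data
SELECT a, b, c FROM data_on_disk WHERE month="June";

-- Overwrite: replace the current contents of the in-memory table.
INSERT OVERWRITE TABLE data
SELECT a, b, c FROM data_on_disk WHERE month="May";
```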

Enhanced EC2/S3/EMR Support

We have enhanced EC2/S3/EMR support in Shark. For example, the Shark CLI can now execute queries defined in a file stored on S3 (bin/shark -f s3://...). Shark also picks up AWS credentials directly from the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

Better Support for Hadoop2/CDH4

The latest releases of Spark and Shark include pre-compiled binaries for both the Hadoop1 and Hadoop2 storage APIs, so users no longer need to build the binaries themselves. We have also updated the documentation to point out major “gotchas” encountered when running on Hadoop2.

Better Memory Management and Cluster Resource UI

Thanks to the new features in Spark, you can now monitor the status of in-memory storage and cluster nodes on Spark’s web UI.

Credits

We would like to thank Mikhail Bautin, Tathagata Das, Harvey Feng, Mark Hamstra, Cheng Hao, Jon Hartlaub, Nandu Jayakumar, Jey Kottalam, Haoyuan Li, Josh Rosen, Ram Sriharsha, Patrick Wendell, and Reynold Xin for their contributions.