Shark 0.9.1
Release date: April 10, 2014
Shark 0.9.1 is a maintenance release that stabilizes 0.9.0, the release that raised Scala compatibility to 2.10.3 and Hive compatibility to 0.11. The core dependencies for this version are:
- Scala 2.10.3
- Spark 0.9.1
- AMPLab’s Hive 0.9.0
- (Optional) Tachyon 0.4.1
Hive Compatibility
We’ve extensively upgraded the Shark codebase to be Hive 0.11 compliant. Existing users can now run Shark as a drop-in replacement against existing Hive 0.11 metastores.
Two major components added during this upgrade process are support for new windowing and analytics functions, and SharkServer2. More detail is available in the respective sections below.
Analytics Functions
Windowing functions
Shark now supports the windowing functions added by HIVE-896. All of the supported window functions follow the SQL standard.
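To make the semantics concrete, here is a minimal Python sketch of what a standard ranking window such as RANK() OVER (PARTITION BY dept ORDER BY salary DESC) computes. It is an illustration only, not Shark or Hive code; the function name rank_over, the column names, and the sample rows are all hypothetical.

```python
from itertools import groupby

def rank_over(rows, partition_by, order_by):
    """Emulate RANK() OVER (PARTITION BY partition_by ORDER BY order_by DESC)."""
    ranked = []
    # groupby needs its input sorted by the partition key.
    ordered = sorted(rows, key=lambda r: r[partition_by])
    for _, part in groupby(ordered, key=lambda r: r[partition_by]):
        part = sorted(part, key=lambda r: r[order_by], reverse=True)
        rank, prev = 0, object()  # sentinel never equals a real value
        for pos, row in enumerate(part, start=1):
            # Ties share a rank; the next distinct value skips ahead to its position.
            if row[order_by] != prev:
                rank, prev = pos, row[order_by]
            ranked.append({**row, "rank": rank})
    return ranked

# Hypothetical sample data.
employees = [
    {"dept": "eng", "name": "ann", "salary": 120},
    {"dept": "eng", "name": "bob", "salary": 120},
    {"dept": "eng", "name": "cy", "salary": 100},
    {"dept": "ops", "name": "di", "salary": 90},
]
ranks = {r["name"]: r["rank"] for r in rank_over(employees, "dept", "salary")}
```

Unlike a GROUP BY, the window keeps every input row and only attaches a per-partition value to it, which is why the result has as many rows as the input.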
Rollups
Shark also supports enhanced aggregation in the form of rollups. This feature allows users to compute aggregations over multiple groups easily and efficiently. For example, the following query uses the new GROUPING SETS clause:
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a)
The above query is equivalent to running multiple aggregations as follows:
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION ALL
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
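The equivalence can be checked on a small in-memory table. The Python sketch below is an illustration of the semantics only, not how Shark executes the query; the helper grouping_sets_sum and the sample rows are hypothetical.

```python
from collections import defaultdict

def grouping_sets_sum(rows, grouping_sets, all_keys=("a", "b"), value="c"):
    """Sum `value` once per grouping set; key columns outside the set become None (SQL NULL)."""
    out = []
    for keys in grouping_sets:
        totals = defaultdict(int)
        for row in rows:
            # Columns that are not part of this grouping set are reported as None.
            group = tuple(row[k] if k in keys else None for k in all_keys)
            totals[group] += row[value]
        out.extend(group + (total,) for group, total in totals.items())
    return out

# Hypothetical contents of tab1.
tab1 = [
    {"a": 1, "b": "x", "c": 10},
    {"a": 1, "b": "y", "c": 5},
    {"a": 2, "b": "x", "c": 7},
]

# GROUP BY a, b GROUPING SETS ((a, b), a)
combined = grouping_sets_sum(tab1, [("a", "b"), ("a",)])

# The UNION ALL form: one aggregation per grouping set, concatenated.
union_all = grouping_sets_sum(tab1, [("a", "b")]) + grouping_sets_sum(tab1, [("a",)])
```

Both forms produce the same rows; the practical difference is that the GROUPING SETS form states the query once instead of repeating it per grouping.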
SharkServer2
SharkServer2 is an improved Thrift server that’s compatible with the HiveServer2 developed in Hive 0.11. SharkServer2 supports concurrent client connections and query executions. Semantics are the same as for HiveServer2.
To start a SharkServer2:
$ bin/shark --service sharkserver2
To connect to the server from remote clients, you can use JDBC with the network address and port that the server is listening on. For example, to use the Beeline CLI:
$ bin/beeline
beeline> !connect jdbc:hive2://localhost:10000/default
Usability
- A table named <table name>_cached is now cached in the MEMORY_ONLY ephemeral layer (Spark block manager), which is consistent with pre-0.8.0 behavior. Previously, Shark was using MEMORY, which incurs added latency in DDL commands due to writes to both persistent and ephemeral storage.
- CACHE <table name> IN <cache type> can be used to specify the cache layer for a table. This is equivalent to ALTER TABLE <table name> SET TBLPROPERTIES('shark.cache'='<cache type>'). <cache type> can be MEMORY, MEMORY_ONLY, or TACHYON.
Maven Central and Easier Deployment
To simplify deployment and installation, we’ve uploaded all AMPLab Hive and Shark binaries to Maven Central under the edu.berkeley.cs.shark organization. HIVE_HOME is now obsolete, and Hive binary downloads are no longer required to begin running Shark. Instead, simply download the Shark binaries and execute SHARK_HOME/bin/shark.
To include Shark as a dependency in your application:
For an sbt build file:
libraryDependencies ++= Seq("edu.berkeley.cs.shark" %% "shark" % "0.9.1")
For Maven, in the dependencies section in pom.xml:
<dependency>
  <groupId>edu.berkeley.cs.shark</groupId>
  <artifactId>shark</artifactId>
  <version>0.9.1</version>
</dependency>
Query Execution and Performance Improvements
- Delta encoding for int and long primitives stored in columnar format. To save memory, we only store differences between consecutive values in each int or long column.
- Table scans over Hive-partitioned tables (i.e., tables created using a PARTITIONED BY clause) now broadcast a single configuration for each table scan, as opposed to a number of broadcasts linear in the number of partitions for that table.
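The delta encoding idea can be illustrated with a short Python sketch. This shows the general technique only, under the assumption of a plain list of integers; it does not reproduce Shark's actual column format or byte layout, and the names delta_encode and delta_decode are hypothetical.

```python
def delta_encode(values):
    """Keep the first value, then store each value as its difference from the previous one."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Rebuild the original values with a running sum over the deltas."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values

# Hypothetical column of slowly increasing values, e.g. timestamps or ids.
column = [1000, 1001, 1003, 1003, 1010]
encoded = delta_encode(column)
decoded = delta_decode(encoded)
```

The saving comes from the deltas being small: when consecutive values are close together, each difference can be stored in fewer bytes than a full int or long.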
Download Links
Shark with Hadoop 1
Shark with Hadoop 2 (cdh5)
Credits
Michael Armbrust - SharkServer bugfix, Scala 2.10 upgrade
Oleg Danilov - Hive 0.11 upgrade, bug fixes
Aaron Davidson - Tachyon API revamp, improved caching semantics
Harvey Feng - Hive 0.11, Spark 0.9 upgrade, release manager
Cheng Hao - Windowing functions, join refactor
Nandu Jayakumar - Delta encoding
Andy Konwinski - Build script fix
Steven Leung - Bug fix for partitioned table stats
ChengXiang Li - Yarn compatibility
Antonio Lupher - Hive 0.11 upgrade, lateral view improvements
Sundeep Narravula - Job cancellation using JDBC
Brian O’Neill - Build fix
Kay Ousterhout - Improved logging messages
Ahir Reddy - Python support
Sun Rui - Testing, analytic function support
Sergey Soldatov - Hive 0.11 upgrade, serialization bug fix
Henry Wang - SharkServer2 addition
Reynold Xin - SparkConf integration
Tian Yi - Combiner bug fix
Yury Yudin - Hive 0.11 support
Thanks to everyone who contributed!