Documentation

5-minute quick start guide

In this tutorial, you'll learn how to set up a very simple Spark application that reads data from and writes data to Cassandra. Before you start, you need basic knowledge of Apache Cassandra and Apache Spark; refer to the Cassandra documentation and the Spark documentation.

Prerequisites

Install and launch a Cassandra 2.0 cluster and a Spark cluster.

Configure a new Scala project with the following dependencies:

  • Apache Spark 0.9 or 1.0 and its dependencies
  • Apache Cassandra thrift and clientutil libraries matching the version of Cassandra
  • DataStax Cassandra driver for your Cassandra version

This driver does not depend on the Cassandra server code.

  • For a detailed dependency list, see the project dependencies in the project/CassandraSparkBuild.scala file.
  • For dependency versions, see the project/Versions.scala file.

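Add the spark-cassandra-connector jar and its dependency jars to the following classpaths:

  • the classpath of your project
  • the classpath of every Spark cluster node

In an sbt build, the connector is referenced by the following coordinates, where Version stands for the connector version you use:

 "com.datastax.spark" %% "spark-cassandra-connector" % Version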
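As a minimal sketch, the connector coordinates shown above could be declared in a project's build.sbt roughly as follows. The name connectorVersion and the value "x.y.z" are placeholders introduced here for illustration only; they are not taken from the project.

```scala
// build.sbt (sketch): declares the connector as a library dependency.
// "x.y.z" is a placeholder, not a real release number. Replace it with the
// connector version that matches your Spark and Cassandra versions.
val connectorVersion = "x.y.z"

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion
```

The %% operator appends the Scala binary version of your build to the artifact name, so the project's scalaVersion must match a Scala version the connector is published for.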

The easiest way to do this is to build the assembled connector jar with

 sbt assembly

This will generate a jar file with all of the required dependencies in

 spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-*.jar

Then add this jar to your Spark executor classpath by adding the following line to your spark-defaults.conf:

 spark.executor.extraClassPath  spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-$CurrentVersion-SNAPSHOT.jar

This driver is also compatible with the Spark distribution provided in DataStax Enterprise 4.5.

Preparing example Cassandra schema

Create a simple keyspace and table in Cassandra. Run the following statements in cqlsh:

CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
CREATE TABLE test.kv(key text PRIMARY KEY, value int);

Then insert some example data:

INSERT INTO test.kv(key, value) VALUES ('key1', 1);
INSERT INTO test.kv(key, value) VALUES ('key2', 2);
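If you prefer to prepare the schema from code rather than from cqlsh, the same statements can be sent through the connector. The sketch below assumes that the CassandraConnector helper from com.datastax.spark.connector.cql is available in your connector version and that a Cassandra node is reachable at 127.0.0.1; the object name PrepareSchema is hypothetical.

```scala
import org.apache.spark.SparkConf
import com.datastax.spark.connector.cql.CassandraConnector

// Hypothetical helper application: runs the CQL statements shown above.
object PrepareSchema {
  def main(args: Array[String]): Unit = {
    // Assumption: one of the Cassandra nodes listens on 127.0.0.1.
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")

    // withSessionDo hands a session to the block and releases it afterwards.
    CassandraConnector(conf).withSessionDo { session =>
      session.execute("CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
      session.execute("CREATE TABLE test.kv(key text PRIMARY KEY, value int)")
      session.execute("INSERT INTO test.kv(key, value) VALUES ('key1', 1)")
      session.execute("INSERT INTO test.kv(key, value) VALUES ('key2', 2)")
    }
  }
}
```

Running it a second time fails on the CREATE statements because the keyspace and table already exist, so treat it as a one-off setup step.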

Now you're ready to write your first Spark program using Cassandra.

Setting up SparkContext

Before creating the SparkContext, set the spark.cassandra.connection.host property to the address of one of the Cassandra nodes:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf(true)
   .set("spark.cassandra.connection.host", "127.0.0.1")

Create a SparkContext. Substitute 127.0.0.1 with the actual address of your Spark Master (or use "local" to run in local mode):

val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)

Enable Cassandra-specific functions on the SparkContext and RDD:

import com.datastax.spark.connector._

Loading and analyzing data from Cassandra

Use the sc.cassandraTable method to view the test.kv table as a Spark RDD:

val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.first)
println(rdd.map(_.getInt("value")).sum)

Saving data from RDD to Cassandra

Add two more rows to the table:

val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
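The steps in this guide can be combined into one standalone application, sketched below. It assumes the test.kv table from the schema section exists, that Cassandra is reachable at 127.0.0.1, and that Spark runs in local mode. The object name QuickStart is hypothetical, and the final read-back using getString is an illustrative addition rather than part of the steps above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Brings in the implicit conversions that older Spark versions need for sum on numeric RDDs.
import org.apache.spark.SparkContext._
// Adds cassandraTable to SparkContext and saveToCassandra to RDDs.
import com.datastax.spark.connector._

// Hypothetical application name; the calls mirror the snippets in this guide.
object QuickStart {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")

    // "local" runs Spark in-process; pass "spark://<master-host>:7077" to use a cluster.
    val sc = new SparkContext("local", "test", conf)

    try {
      // Read the table and compute a few simple aggregates.
      val rdd = sc.cassandraTable("test", "kv")
      println(rdd.count)
      println(rdd.map(_.getInt("value")).sum)

      // Write two more rows; tuple fields map to the listed columns in order.
      sc.parallelize(Seq(("key3", 3), ("key4", 4)))
        .saveToCassandra("test", "kv", SomeColumns("key", "value"))

      // Read the table again to confirm the new rows are visible.
      sc.cassandraTable("test", "kv")
        .map(row => (row.getString("key"), row.getInt("value")))
        .collect()
        .foreach(println)
    } finally {
      // Always release Spark resources, even if a step above fails.
      sc.stop()
    }
  }
}
```

Because key is the primary key, running the application again overwrites the key3 and key4 rows instead of duplicating them.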

Next - Connecting to Cassandra