Merge branch 'master' of github.com:datastax/spark-cassandra-connector
RussellSpitzer committed Aug 22, 2016
2 parents 67df163 + ab4eda2 commit aa7fa41
Showing 14 changed files with 373 additions and 289 deletions.
52 changes: 21 additions & 31 deletions doc/0_quick_start.md
@@ -4,8 +4,8 @@

In this tutorial, you'll learn how to set up a very simple Spark application for reading and writing data from/to Cassandra.
Before you start, you need to have basic knowledge of Apache Cassandra and Apache Spark.
Refer to [Cassandra documentation](http://www.datastax.com/documentation/cassandra/2.0/cassandra/gettingStartedCassandraIntro.html)
and [Spark documentation](https://spark.apache.org/docs/0.9.1/).
Refer to [DataStax](http://docs.datastax.com/en/cassandra/latest/) and [Cassandra documentation](http://cassandra.apache.org/doc/latest/getting_started/index.html)
and [Spark documentation](https://spark.apache.org/docs/latest/).

### Prerequisites

@@ -16,8 +16,15 @@
Configure a new Scala project with Apache Spark and its dependency.
The dependencies are easily retrieved via the spark-packages.org website. For example, if you're using `sbt`, your build.sbt should include something like this:

resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.0-s_2.11"

libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.0-s_2.10"

The spark-packages libraries can also be used with `spark-submit` and `spark-shell`. These
commands will place the connector and all of its dependencies on the classpath of the
Spark Driver and all Spark Executors.

$SPARK_HOME/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.0-s_2.10
$SPARK_HOME/bin/spark-submit --packages datastax:spark-cassandra-connector:1.6.0-s_2.10

For the list of available versions, see:
- https://spark-packages.org/package/datastax/spark-cassandra-connector

@@ -26,14 +33,6 @@
This driver does not depend on the Cassandra server code.
- For a detailed dependency list, see [project/CassandraSparkBuild.scala](../project/CassandraSparkBuild.scala)
- For dependency versions, see [project/Versions.scala](../project/Versions.scala)

Add the `spark-cassandra-connector` jar and its dependency jars to the following classpaths:

- the classpath of your project
- the classpath of every Spark cluster node

**Make sure the Connector version you use coincides with your Spark version (e.g. Spark 1.2.x with Connector 1.2.x)**:

"com.datastax.spark" %% "spark-cassandra-connector" % Version

### Building
See [Building And Artifacts](12_building_and_artifacts.md)

@@ -54,28 +53,18 @@
INSERT INTO test.kv(key, value) VALUES ('key2', 2);

Now you're ready to write your first Spark program using Cassandra.

### Setting up `SparkContext`
As usual, start by importing Spark:

```scala
import org.apache.spark._
```

Before creating the `SparkContext`, set the `spark.cassandra.connection.host` property to the address of one
of the Cassandra nodes:
### Loading up the Spark-Shell

```scala
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "127.0.0.1")
```
Create a `SparkContext`. Substitute `127.0.0.1` with the actual address of your Spark Master
(or use `"local"` to run in local mode):
Run the `spark-shell` with the packages line for your version. To configure
the default Spark configuration, pass key-value pairs with `--conf`:

```scala
val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
```
$SPARK_HOME/bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 \
--packages datastax:spark-cassandra-connector:1.6.0-s_2.10

This command would set the Spark Cassandra Connector parameter
`spark.cassandra.connection.host` to `127.0.0.1`. Change this
to the address of one of the nodes in your Cassandra cluster.

Enable Cassandra-specific functions on the `SparkContext`, `RDD`, and `DataFrame`:

```scala
@@ -101,3 +90,4 @@
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
```
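
The loading and saving examples from this file are collapsed in the hunk above; as a minimal sketch of what they look like (assuming the `test.kv` table created earlier, a `SparkContext` named `sc`, and the standard connector import):

```scala
import com.datastax.spark.connector._

// Read the test.kv table into an RDD of CassandraRow and inspect it
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.first)

// Write two more rows back to the same table
val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
```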

[Next - Connecting to Cassandra](1_connecting.md)
[Jump to - Accessing data with DataFrames](14_data_frames.md)
4 changes: 3 additions & 1 deletion doc/10_embedded.md
@@ -2,7 +2,9 @@

## The `spark-cassandra-connector-embedded` Artifact

The `spark-cassandra-connector-embedded` artifact can be used as a test or prototype dependency to spin up embedded servers for testing ideas, quickly learning, integration, etc.
The `spark-cassandra-connector-embedded` artifact can be used as a test
or prototype dependency to spin up embedded servers for testing ideas,
quickly learning, integration, etc.
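
For sbt users, a sketch of pulling it in as a test-only dependency (the artifact coordinates and version below are assumptions; check the published artifacts for your connector release):

```scala
// build.sbt -- assumed coordinates and version, verify before use
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector-embedded" % "1.6.0" % "test"
```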

Pulling this dependency in allows you to:

11 changes: 4 additions & 7 deletions doc/12_building_and_artifacts.md
@@ -11,7 +11,7 @@
the `binary.version` thereof:

sbt -Dscala-2.11=true

For Spark see: [Building Spark for Scala 2.11](http://spark.apache.org/docs/1.2.0/building-spark.html)
For Spark see: [Building Spark for Scala 2.11](http://spark.apache.org/docs/latest/building-spark.html)

For Scala 2.11 tasks:

@@ -74,14 +74,11 @@
The easiest way to do this is to make the assembled connector jar using

sbt assembly
Remember that if you need to build the assembly jar against Scala 2.11:

sbt -Dscala-2.11=true assembly
This will generate a jar file with all of the required dependencies in

spark-cassandra-connector/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-*.jar

This jar can be used with `spark-submit` via `--jars`:

spark-submit --jars spark-cassandra-connector-assembly.jar

Alternatively, add this jar to your Spark executor classpath by adding the following line to your spark-defaults.conf:

spark.executor.extraClassPath spark-cassandra-connector/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-$CurrentVersion-SNAPSHOT.jar

This driver is also compatible with the Spark distribution provided in
[DataStax Enterprise](http://datastax.com/docs/latest-dse/).
2 changes: 1 addition & 1 deletion doc/13_1_setup_spark_shell.md
@@ -3,7 +3,7 @@
## Setting up Cassandra

The easiest way to get quickly started with Cassandra is to follow the instructions provided by
[DataStax](http://docs.datastax.com/en/cassandra/2.1/cassandra/install/install_cassandraTOC.html)
[DataStax](http://docs.datastax.com/en/cassandra/latest/cassandra/install/install_cassandraTOC.html)

## Setting up Spark

4 changes: 2 additions & 2 deletions doc/13_spark_shell.md
@@ -2,7 +2,7 @@

## Using the Spark Cassandra Connector with the Spark Shell

These instructions were last confirmed with C* 3.0.5, Spark 1.6.1 and Connector 1.6.0-M2.
These instructions were last confirmed with C* 3.0.5, Spark 1.6.1 and Connector 1.6.0.

For this guide, we assume an existing Cassandra deployment, running either locally or on a cluster, a local installation of Spark, and an optional Spark cluster. For detailed setup instructions, see [setup spark-shell](13_1_setup_spark_shell.md)

@@ -18,7 +18,7 @@
Find additional versions at [Spark Packages](http://spark-packages.org/package/datastax/spark-cassandra-connector)
```bash
cd spark/install/dir
#Include the --master if you want to run against a spark cluster and not local mode
./bin/spark-shell [--master sparkMasterAddress] --jars yourAssemblyJar --packages datastax:spark-cassandra-connector:1.6.0-M2-s_2.10 --conf spark.cassandra.connection.host=yourCassandraClusterIp
./bin/spark-shell [--master sparkMasterAddress] --jars yourAssemblyJar --packages datastax:spark-cassandra-connector:1.6.0-s_2.10 --conf spark.cassandra.connection.host=yourCassandraClusterIp
```

By default, Spark will log everything to the console and this may be a bit of an overload. To change this, copy and modify the `log4j.properties` template file
28 changes: 17 additions & 11 deletions doc/14_data_frames.md
@@ -23,12 +23,12 @@
Those followed with a default of N/A are required; all others are optional.
| cluster | The group of the Cluster Level Settings to inherit | String | "default"|
| pushdown | Enables pushing down predicates to C* when applicable | (true,false) | true |

####Read, Writing and CassandraConnector Options
#### Read, Writing and CassandraConnector Options
Any normal Spark Connector configuration options for Connecting, Reading or Writing
can be passed through as DataFrame options as well. When using the `read` command below, these
options should appear exactly the same as when set in the SparkConf.
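
For example, a minimal sketch (reusing the quick-start `test.kv` table and an existing `sqlContext`) that passes a connection setting as a read option instead of through the `SparkConf`:

```scala
// spark.cassandra.connection.host is passed as a DataFrame option here;
// the key is the same one that would otherwise be set in the SparkConf.
val kvDf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "kv",
    "keyspace" -> "test",
    "spark.cassandra.connection.host" -> "127.0.0.1"
  ))
  .load()
```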

####Setting Cluster and Keyspace Level Options
#### Setting Cluster and Keyspace Level Options
The connector also provides a way to describe the options which should be applied to all
DataFrames within a cluster or within a keyspace. When a property has been specified at the
table level it will override the default keyspace or cluster property.
@@ -37,7 +37,7 @@
To add these properties, add keys to your `SparkConf` in the format

clusterName:keyspaceName/propertyName.

Example Changing Cluster/Keyspace Level Properties
#### Example Changing Cluster/Keyspace Level Properties
```scala
sqlContext.setConf("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
sqlContext.setConf("default:test/spark.cassandra.input.split.size_in_mb", "128")
@@ -66,6 +66,8 @@
val lastdf = sqlContext
).load() // This DataFrame will use a spark.cassandra.input.split.size of 48
```


#### Example Using TypeSafe Parameter Configuration Options
There are also some helper methods which simplify setting Spark Cassandra Connector related parameters. They are part
of `CassandraSqlContext`:
```scala
@@ -89,7 +91,7 @@
The most programmatic way to create a data frame is to invoke a `read` command on the `SQLContext`.
You can then use `options` to provide a `Map[String, String]` of options as described above.
Then finish by calling `load` to actually get a `DataFrame`.

Example Creating a DataFrame using a Read Command
#### Example Creating a DataFrame using a Read Command
```scala
val df = sqlContext
.read
@@ -108,6 +110,7 @@
There are also some helper methods which can make creating data frames easier. They are provided by the
`org.apache.spark.sql.cassandra` package. In the following example, all the commands used to create a data frame are
equivalent:

#### Example Using Format Helper Functions
```scala
import org.apache.spark.sql.cassandra._

@@ -123,13 +126,13 @@
val df2 = sqlContext
.load()
```

###Creating DataFrames using Spark SQL
### Creating DataFrames using Spark SQL

Accessing DataFrames using Spark SQL involves creating temporary tables and specifying the
source as `org.apache.spark.sql.cassandra`. The `OPTIONS` passed to this table are used to
establish a relation between the CassandraTable and the internally used DataSource.
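
The registration example below is collapsed in this diff; a minimal sketch of such a statement, issued through `sqlContext.sql` (the option names follow the table at the top of this file, and `test.words` is the table used in the example that follows):

```scala
// Register a temporary table backed by the Cassandra table test.words
sqlContext.sql(
  """CREATE TEMPORARY TABLE words
    |USING org.apache.spark.sql.cassandra
    |OPTIONS (
    |  table "words",
    |  keyspace "test",
    |  cluster "Test Cluster",
    |  pushdown "true"
    |)""".stripMargin)

// The relation can now be queried with Spark SQL
sqlContext.sql("SELECT * FROM words").show()
```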

Example Creating a Source Using Spark SQL:
#### Example Creating a Source Using Spark SQL:

Create Relation with the Cassandra table test.words
```scala
@@ -171,7 +174,7 @@
DataFrames provide a save function which allows them to persist their data to any
DataSource. The connector supports using this feature to persist a DataFrame to a Cassandra
Table.

Example Copying Between Two Tables Using DataFrames
#### Example Copying Between Two Tables Using DataFrames
```scala
val df = sqlContext
.read
@@ -187,6 +190,8 @@
df.write

Similarly to reading Cassandra tables into DataFrames, there are some helper methods for the write path which are
provided by the `org.apache.spark.sql.cassandra` package. In the following example, all the commands are equivalent:

#### Example Using Helper Commands to Write DataFrames
```scala
import org.apache.spark.sql.cassandra._

@@ -201,7 +206,7 @@
df.write

```

###Setting Connector specific options on data frames
### Setting Connector specific options on DataFrames
Connector specific options can be set by invoking the `options` method on either `DataFrameReader` or `DataFrameWriter`.
There are several settings you may want to change in `ReadConf`, `WriteConf`, `CassandraConnectorConf`, `AuthConf` and
others. Those settings are identified by instances of the `ConfigParameter` case class, which offers an easy way to apply
@@ -233,7 +238,7 @@
Once the new table is created, you can persist the DataFrame to the new table using the save method described above.
The partition key and clustering key of the newly generated table can be set by passing in a list of
names of columns which should be used as partition key and clustering key.

Example Transform DataFrame and Save to New Table
#### Example Creating a Cassandra Table from a DataFrame
```scala
// Add spark connector specific methods to DataFrame
import com.datastax.spark.connector._
Expand All @@ -257,7 +262,7 @@ renamed.write
.save()
```

###Pushing down clauses to Cassandra
### Pushing down clauses to Cassandra
The DataFrame API will automatically push down valid where clauses to Cassandra as long as the
pushdown option is enabled (it defaults to enabled).
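
As a quick illustration before the worked example below, a sketch (reusing the quick-start `test.kv` table, where `key` is the partition key, and an existing `sqlContext`) of a filter the connector can push down:

```scala
val kv = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "kv", "keyspace" -> "test"))
  .load()

// An equality predicate on the partition key column is eligible for pushdown,
// so Cassandra filters the rows rather than Spark scanning the whole table.
val oneKey = kv.filter(kv("key") === "key1")
oneKey.explain() // inspect the physical plan for the pushed filter
oneKey.show()
```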

@@ -281,6 +286,7 @@
First we can create a DataFrame and see that it has no `pushdown filters` set in it. This
means all requests will go directly to C* and we will require reading all of the data to `show`
this DataFrame.

#### Example Catalyst Optimization with Cassandra Server Side Pushdowns
```scala
val df = sqlContext
.read
@@ -343,7 +349,7 @@
dfWithPushdown.show
+-----+----+-----+
```

####Pushdown Filter Examples
#### Example Pushdown Filters
Example table
```sql
CREATE KEYSPACE IF NOT EXISTS pushdowns WITH replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
2 changes: 2 additions & 0 deletions doc/15_python.md
@@ -24,6 +24,7 @@
http://spark-packages.org/package/datastax/spark-cassandra-connector
A DataFrame can be created which links to Cassandra by using the `org.apache.spark.sql.cassandra`
source and by specifying keyword arguments for `keyspace` and `table`.

#### Example Loading a Cassandra Table as a Pyspark DataFrame
```python
sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
@@ -47,6 +48,7 @@
source and by specifying keyword arguments for `keyspace` and `table`.

A DataFrame can be saved to an *existing* Cassandra table by using the `org.apache.spark.sql.cassandra` source and by specifying keyword arguments for `keyspace` and `table` and a save mode (`append`, `overwrite`, `error` or `ignore`, see [Data Sources API doc](https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes)).

#### Example Saving to a Cassandra Table as a Pyspark DataFrame
```python
df.write\
.format("org.apache.spark.sql.cassandra")\