Merge branch 'master' of github.com:datastax/spark-cassandra-connector
RussellSpitzer committed Aug 22, 2016
2 parents 67df163 + ab4eda2 commit aa7fa41
Showing 14 changed files with 373 additions and 289 deletions.
52 changes: 21 additions & 31 deletions doc/0_quick_start.md
@@ -4,8 +4,8 @@

In this tutorial, you'll learn how to set up a very simple Spark application for reading and writing data from/to Cassandra.
Before you start, you need to have basic knowledge of Apache Cassandra and Apache Spark.
Refer to [Cassandra documentation](http://www.datastax.com/documentation/cassandra/2.0/cassandra/gettingStartedCassandraIntro.html)
and [Spark documentation](https://spark.apache.org/docs/0.9.1/).
Refer to [DataStax](http://docs.datastax.com/en/cassandra/latest/) and [Cassandra documentation](http://cassandra.apache.org/doc/latest/getting_started/index.html)
and [Spark documentation](https://spark.apache.org/docs/latest/).

### Prerequisites

@@ -16,8 +16,15 @@
Configure a new Scala project with Apache Spark and its dependency.
The dependencies are easily retrieved via the spark-packages.org website. For example, if you're using `sbt`, your build.sbt should include something like this:

resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.0-s_2.11"

libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.0-s_2.10"

The spark-packages libraries can also be used with `spark-submit` and `spark-shell`. These
commands will place the connector and all of its dependencies on the classpath of the
Spark Driver and all Spark Executors.

$SPARK_HOME/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.0-s_2.10
$SPARK_HOME/bin/spark-submit --packages datastax:spark-cassandra-connector:1.6.0-s_2.10

For the list of available versions, see:
- https://spark-packages.org/package/datastax/spark-cassandra-connector

@@ -26,14 +33,6 @@
This driver does not depend on the Cassandra server code.
- For a detailed dependency list, see [project/CassandraSparkBuild.scala](../project/CassandraSparkBuild.scala)
- For dependency versions, see [project/Versions.scala](../project/Versions.scala)

Add the `spark-cassandra-connector` jar and its dependency jars to the following classpaths:

- the classpath of your project
- the classpath of every Spark cluster node

**Make sure the Connector version you use coincides with your Spark version (e.g. Spark 1.2.x with Connector 1.2.x)**:

"com.datastax.spark" %% "spark-cassandra-connector" % Version

### Building
See [Building And Artifacts](12_building_and_artifacts.md)

@@ -54,28 +53,18 @@
INSERT INTO test.kv(key, value) VALUES ('key2', 2);

Now you're ready to write your first Spark program using Cassandra.

### Setting up `SparkContext`
As usual, start by importing Spark:

```scala
import org.apache.spark._
```

Before creating the `SparkContext`, set the `spark.cassandra.connection.host` property to the address of one
of the Cassandra nodes:
### Loading up the Spark-Shell

```scala
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "127.0.0.1")
```
Create a `SparkContext`. Substitute `127.0.0.1` with the actual address of your Spark Master
(or use `"local"` to run in local mode):
Run the `spark-shell` with the packages line for your version. To configure
the default Spark configuration, pass key-value pairs with `--conf`:

```scala
val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
```
$SPARK_HOME/bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 \
--packages datastax:spark-cassandra-connector:1.6.0-s_2.10

This command would set the Spark Cassandra Connector parameter
`spark.cassandra.connection.host` to `127.0.0.1`. Change this
to the address of one of the nodes in your Cassandra cluster.

Enable Cassandra-specific functions on the `SparkContext`, `RDD`, and `DataFrame`:

```scala
@@ -101,3 +90,4 @@
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
```
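
The loading and saving examples from this file are collapsed in the hunk above; as a minimal sketch of what they look like (assuming the `test.kv` table created earlier, a `SparkContext` named `sc`, and the standard connector import):

```scala
import com.datastax.spark.connector._

// Read the test.kv table into an RDD of CassandraRow and inspect it
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.first)

// Write two more rows back to the same table
val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
```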

[Next - Connecting to Cassandra](1_connecting.md)
[Jump to - Accessing data with DataFrames](14_data_frames.md)
4 changes: 3 additions & 1 deletion doc/10_embedded.md
@@ -2,7 +2,9 @@

## The `spark-cassandra-connector-embedded` Artifact

The `spark-cassandra-connector-embedded` artifact can be used as a test or prototype dependency to spin up embedded servers for testing ideas, quickly learning, integration, etc.
The `spark-cassandra-connector-embedded` artifact can be used as a test
or prototype dependency to spin up embedded servers for testing ideas,
quickly learning, integration, etc.
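
For sbt users, a sketch of pulling it in as a test-only dependency (the artifact coordinates and version below are assumptions; check the published artifacts for your connector release):

```scala
// build.sbt -- assumed coordinates and version, verify before use
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector-embedded" % "1.6.0" % "test"
```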

Pulling this dependency in allows you to:

11 changes: 4 additions & 7 deletions doc/12_building_and_artifacts.md
@@ -11,7 +11,7 @@
the `binary.version` thereof:

sbt -Dscala-2.11=true

For Spark see: [Building Spark for Scala 2.11](http://spark.apache.org/docs/1.2.0/building-spark.html)
For Spark see: [Building Spark for Scala 2.11](http://spark.apache.org/docs/latest/building-spark.html)

For Scala 2.11 tasks:

@@ -74,14 +74,11 @@
The easiest way to do this is to make the assembled connector jar using

sbt assembly
Remember that if you need to build the assembly jar against Scala 2.11:

sbt -Dscala-2.11=true assembly
This will generate a jar file with all of the required dependencies in

spark-cassandra-connector/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-*.jar

This jar can be used with `spark-submit` via `--jars`:

spark-submit --jars spark-cassandra-connector-assembly.jar

Alternatively, add this jar to your Spark executor classpath by adding the following line to your spark-defaults.conf:

spark.executor.extraClassPath spark-cassandra-connector/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-$CurrentVersion-SNAPSHOT.jar

This driver is also compatible with the Spark distribution provided in
[DataStax Enterprise](http://datastax.com/docs/latest-dse/).
2 changes: 1 addition & 1 deletion doc/13_1_setup_spark_shell.md
@@ -3,7 +3,7 @@
## Setting up Cassandra

The easiest way to get quickly started with Cassandra is to follow the instructions provided by
[DataStax](http://docs.datastax.com/en/cassandra/2.1/cassandra/install/install_cassandraTOC.html)
[DataStax](http://docs.datastax.com/en/cassandra/latest/cassandra/install/install_cassandraTOC.html)

## Setting up Spark

4 changes: 2 additions & 2 deletions doc/13_spark_shell.md
@@ -2,7 +2,7 @@

## Using the Spark Cassandra Connector with the Spark Shell

These instructions were last confirmed with C* 3.0.5, Spark 1.6.1 and Connector 1.6.0-M2.
These instructions were last confirmed with C* 3.0.5, Spark 1.6.1 and Connector 1.6.0.

For this guide, we assume an existing Cassandra deployment, running either locally or on a cluster, a local installation of Spark, and an optional Spark cluster. For detailed setup instructions, see [setup spark-shell](13_1_setup_spark_shell.md)

@@ -18,7 +18,7 @@
Find additional versions at [Spark Packages](http://spark-packages.org/package/datastax/spark-cassandra-connector)
```bash
cd spark/install/dir
#Include the --master if you want to run against a spark cluster and not local mode
./bin/spark-shell [--master sparkMasterAddress] --jars yourAssemblyJar --packages datastax:spark-cassandra-connector:1.6.0-M2-s_2.10 --conf spark.cassandra.connection.host=yourCassandraClusterIp
./bin/spark-shell [--master sparkMasterAddress] --jars yourAssemblyJar --packages datastax:spark-cassandra-connector:1.6.0-s_2.10 --conf spark.cassandra.connection.host=yourCassandraClusterIp
```

By default, Spark will log everything to the console and this may be a bit of an overload. To change this, copy and modify the `log4j.properties` template file
28 changes: 17 additions & 11 deletions doc/14_data_frames.md
@@ -23,12 +23,12 @@
Those followed with a default of N/A are required; all others are optional.
| cluster | The group of the Cluster Level Settings to inherit | String | "default"|
| pushdown | Enables pushing down predicates to C* when applicable | (true,false) | true |

####Read, Writing and CassandraConnector Options
#### Read, Writing and CassandraConnector Options
Any normal Spark Connector configuration options for Connecting, Reading or Writing
can be passed through as DataFrame options as well. When using the `read` command below, these
options should appear exactly the same as when set in the SparkConf.
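
For example, a minimal sketch (reusing the quick-start `test.kv` table and an existing `sqlContext`) that passes a connection setting as a read option instead of through the `SparkConf`:

```scala
// spark.cassandra.connection.host is passed as a DataFrame option here;
// the key is the same one that would otherwise be set in the SparkConf.
val kvDf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "kv",
    "keyspace" -> "test",
    "spark.cassandra.connection.host" -> "127.0.0.1"
  ))
  .load()
```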

####Setting Cluster and Keyspace Level Options
#### Setting Cluster and Keyspace Level Options
The connector also provides a way to describe the options which should be applied to all
DataFrames within a cluster or within a keyspace. When a property has been specified at the
table level it will override the default keyspace or cluster property.
@@ -37,7 +37,7 @@
To add these properties, add keys to your `SparkConf` in the format

clusterName:keyspaceName/propertyName.

Example Changing Cluster/Keyspace Level Properties
#### Example Changing Cluster/Keyspace Level Properties
```scala
sqlContext.setConf("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
sqlContext.setConf("default:test/spark.cassandra.input.split.size_in_mb", "128")
@@ -66,6 +66,8 @@
val lastdf = sqlContext
).load() // This DataFrame will use a spark.cassandra.input.split.size of 48
```


#### Example Using TypeSafe Parameter Configuration Options
There are also some helper methods which simplify setting Spark Cassandra Connector related parameters. They are part
of `CassandraSqlContext`:
```scala
@@ -89,7 +91,7 @@
The most programmatic way to create a data frame is to invoke a `read` command on the `SQLContext`.
You can then use `options` to provide a `Map[String, String]` of options as described above.
Then finish by calling `load` to actually get a `DataFrame`.

Example Creating a DataFrame using a Read Command
#### Example Creating a DataFrame using a Read Command
```scala
val df = sqlContext
.read
@@ -108,6 +110,7 @@
There are also some helper methods which can make creating data frames easier. They are provided by the
`org.apache.spark.sql.cassandra` package. In the following example, all the commands used to create a data frame are
equivalent:

#### Example Using Format Helper Functions
```scala
import org.apache.spark.sql.cassandra._

@@ -123,13 +126,13 @@
val df2 = sqlContext
.load()
```

###Creating DataFrames using Spark SQL
### Creating DataFrames using Spark SQL

Accessing DataFrames using Spark SQL involves creating temporary tables and specifying the
source as `org.apache.spark.sql.cassandra`. The `OPTIONS` passed to this table are used to
establish a relation between the CassandraTable and the internally used DataSource.
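
The registration example below is collapsed in this diff; a minimal sketch of such a statement, issued through `sqlContext.sql` (the option names follow the table at the top of this file, and `test.words` is the table used in the example that follows):

```scala
// Register a temporary table backed by the Cassandra table test.words
sqlContext.sql(
  """CREATE TEMPORARY TABLE words
    |USING org.apache.spark.sql.cassandra
    |OPTIONS (
    |  table "words",
    |  keyspace "test",
    |  cluster "Test Cluster",
    |  pushdown "true"
    |)""".stripMargin)

// The relation can now be queried with Spark SQL
sqlContext.sql("SELECT * FROM words").show()
```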

Example Creating a Source Using Spark SQL:
#### Example Creating a Source Using Spark SQL:

Create Relation with the Cassandra table test.words
```scala
@@ -171,7 +174,7 @@
DataFrames provide a save function which allows them to persist their data to any
DataSource. The connector supports using this feature to persist a DataFrame to a Cassandra
Table.

Example Copying Between Two Tables Using DataFrames
#### Example Copying Between Two Tables Using DataFrames
```scala
val df = sqlContext
.read
@@ -187,6 +190,8 @@
df.write

Similarly to reading Cassandra tables into DataFrames, there are some helper methods for the write path which are
provided by the `org.apache.spark.sql.cassandra` package. In the following example, all the commands are equivalent:

#### Example Using Helper Commands to Write DataFrames
```scala
import org.apache.spark.sql.cassandra._

@@ -201,7 +206,7 @@
df.write

```

###Setting Connector specific options on data frames
### Setting Connector specific options on DataFrames
Connector specific options can be set by invoking the `options` method on either `DataFrameReader` or `DataFrameWriter`.
There are several settings you may want to change in `ReadConf`, `WriteConf`, `CassandraConnectorConf`, `AuthConf` and
others. Those settings are identified by instances of the `ConfigParameter` case class, which offers an easy way to apply
@@ -233,7 +238,7 @@
Once the new table is created, you can persist the DataFrame to the new table using the save method described above.
The partition key and clustering key of the newly generated table can be set by passing in a list of
names of columns which should be used as partition key and clustering key.

Example Transform DataFrame and Save to New Table
#### Example Creating a Cassandra Table from a DataFrame
```scala
// Add spark connector specific methods to DataFrame
import com.datastax.spark.connector._
Expand All @@ -257,7 +262,7 @@ renamed.write
.save()
```

###Pushing down clauses to Cassandra
### Pushing down clauses to Cassandra
The DataFrame API will automatically push down valid where clauses to Cassandra as long as the
pushdown option is enabled (it defaults to enabled).
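
As a quick illustration before the worked example below, a sketch (reusing the quick-start `test.kv` table, where `key` is the partition key, and an existing `sqlContext`) of a filter the connector can push down:

```scala
val kv = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "kv", "keyspace" -> "test"))
  .load()

// An equality predicate on the partition key column is eligible for pushdown,
// so Cassandra filters the rows rather than Spark scanning the whole table.
val oneKey = kv.filter(kv("key") === "key1")
oneKey.explain() // inspect the physical plan for the pushed filter
oneKey.show()
```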

@@ -281,6 +286,7 @@
First we can create a DataFrame and see that it has no `pushdown filters` set in it. This
means all requests will go directly to C* and we will require reading all of the data to `show`
this DataFrame.

#### Example Catalyst Optimization with Cassandra Server Side Pushdowns
```scala
val df = sqlContext
.read
@@ -343,7 +349,7 @@
dfWithPushdown.show
+-----+----+-----+
```

####Pushdown Filter Examples
#### Example Pushdown Filters
Example table
```sql
CREATE KEYSPACE IF NOT EXISTS pushdowns WITH replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
2 changes: 2 additions & 0 deletions doc/15_python.md
@@ -24,6 +24,7 @@
http://spark-packages.org/package/datastax/spark-cassandra-connector
A DataFrame can be created which links to Cassandra by using the `org.apache.spark.sql.cassandra`
source and by specifying keyword arguments for `keyspace` and `table`.

#### Example Loading a Cassandra Table as a Pyspark DataFrame
```python
sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
@@ -47,6 +48,7 @@
source and by specifying keyword arguments for `keyspace` and `table`.

A DataFrame can be saved to an *existing* Cassandra table by using the `org.apache.spark.sql.cassandra` source and by specifying keyword arguments for `keyspace` and `table` and a save mode (`append`, `overwrite`, `error` or `ignore`, see [Data Sources API doc](https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes)).

#### Example Saving to a Cassandra Table as a Pyspark DataFrame
```python
df.write\
.format("org.apache.spark.sql.cassandra")\