This project provides a simple example of how to write map-only jobs that export data from Hadoop into Cassandra, using a Cassandra-driver-based insert from within the map task.
This example was created because users were known to be using, or interested in, this approach, yet little documentation for it was available.
The problem that this example solves is as follows:
- how to load large volumes of data into Cassandra from Hadoop in parallel, without using a reducer.
The solution taken in this example is a map-only MapReduce job that leverages the DataStax Java Driver 2.0 to insert data directly into Cassandra from within the map tasks.
This is a deliberately simplified example that inserts key/value pairs (the full split is inserted as-is). The approach can be expanded upon by leveraging a different InputFormat to create different splits, or by transforming the data within the map task.
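As a rough illustration of the idea, a map task along these lines could open a driver session in `setup()`, insert each record in `map()`, and close the connection in `cleanup()`. This is a sketch, not the repository's actual `MRExample` code: the class name, keyspace, and table (`mrexample.kv(key text PRIMARY KEY, value text)`) are assumptions made for the example.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Hypothetical map-only export task: no reducer, each mapper writes
// its records straight into Cassandra through the DataStax driver.
public class CassandraExportMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private Cluster cluster;
    private Session session;
    private PreparedStatement insert;

    @Override
    protected void setup(Context context) {
        // One driver connection per map task; in the real job the
        // contact points come from the NODES constant in MRExample.java.
        cluster = Cluster.builder().addContactPoints("127.0.0.1").build();
        session = cluster.connect("mrexample");
        insert = session.prepare("INSERT INTO kv (key, value) VALUES (?, ?)");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the record's byte offset as the row key and the raw
        // line as the value; nothing is emitted to the MR framework.
        session.execute(insert.bind(key.toString(), value.toString()));
    }

    @Override
    protected void cleanup(Context context) {
        // close() tears down the session and cluster (driver 2.0+).
        cluster.close();
    }
}
```

Doing the connect in `setup()` rather than `map()` matters: the driver's `Cluster` object is expensive to build, so each map task should create exactly one and reuse its prepared statement for every insert.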
Follow these steps to execute the example:
- Set up your environment
- We used Hadoop 2.2 with YARN
- We used DataStax Enterprise 4.0.1 (open-source Cassandra 2.0 could be used as well)
- Create the schema in Cassandra using the schema.ddl file
- ./cqlsh -f schema.ddl
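The contents of schema.ddl are not reproduced in this README; a minimal schema consistent with the key/value inserts described above could look like the following (the keyspace name, table name, and replication settings are illustrative assumptions, not the repository's actual DDL):

```sql
CREATE KEYSPACE mrexample
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE mrexample.kv (
  key   text PRIMARY KEY,
  value text
);
```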
- Copy the pom.xml and src directory into a local directory, or clone the repository
- Use Maven to build the project
- In the MRExample.java file, change the following line to include your node IP addresses: private static final String NODES = "ENTER YOUR NODES LIST HERE";
- Build a jar containing the MRExample class
- Explicitly download the DataStax Java Driver 2.0.1 jar and its dependencies so they can be passed to Hadoop
- The required jars are the ones listed after -libjars in the command below
- Execute the following hadoop command
- hadoop jar {yourjar.jar} com.datastax.mrexample.MRExample -libjars cassandra-driver-core-2.0.1.jar,guava-16.0.1.jar,metrics-core-3.0.2.jar,netty-3.9.0.Final.jar,lz4-1.2.0.jar,testng-6.8.8.jar,snappy-java-1.0.4.1.jar {input path on hadoop} {output path on hadoop}