diff --git a/README.md b/README.md
index 6c5f70d64..3ad427067 100644
--- a/README.md
+++ b/README.md
@@ -114,6 +114,7 @@ See [Building And Artifacts](doc/12_building_and_artifacts.md)
   - [Python](doc/15_python.md)
   - [Frequently Asked Questions](doc/FAQ.md)
   - [Configuration Parameter Reference Table](doc/reference.md)
+  - [Tips for Developing the Spark Cassandra Connector](doc/developers.md)
 
 ## Online Training
 ### DataStax Academy
@@ -137,8 +138,11 @@ Make sure you have installed and enabled the Scala Plugin.
 Open the project with IntelliJ IDEA and it will automatically create the project structure
 from the provided SBT configuration.
 
+[Tips for Developing the Spark Cassandra Connector](doc/developers.md)
+
 Before contributing your changes to the project, please make sure that all unit tests and
 integration tests pass. Don't forget to add an appropriate entry at the top of CHANGES.txt.
+Create a Jira at the [Spark Cassandra Connector Jira](https://datastax-oss.atlassian.net/projects/SPARKC/issues).
 Finally open a pull-request on GitHub and await review.
 Please prefix pull request description with the JIRA number, for example: "SPARKC-123: Fix the ...".
 
diff --git a/doc/developers.md b/doc/developers.md
new file mode 100644
index 000000000..560c1249e
--- /dev/null
+++ b/doc/developers.md
@@ -0,0 +1,84 @@
+# Documentation
+
+## Developer Tips
+
+### Getting Started
+
+The Spark Cassandra Connector is built using sbt. There is a premade
+launching script for sbt, so it is unnecessary to download it. To invoke
+this script you can run `./sbt/sbt` from a clone of this repository.
+
+For information on setting up your clone, please follow the [GitHub
+Help](https://help.github.com/articles/cloning-a-repository/).
+
+Once in the sbt shell you will be able to build and run tests for the
+connector without any Spark or Cassandra nodes running. The most common
+commands to use when developing the connector are
+
+1. `test` - Runs the unit tests for the project.
+2. `it:test` - Runs the integration tests with embedded C* and Spark.
+3. `assembly` - Builds a fat jar for use with `--jars` in `spark-submit` or `spark-shell`.
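+
+For example, a typical local pass over these commands, run from the repository
+root with the bundled launcher, might look like the sketch below (illustrative
+only; the same commands can also be entered one at a time inside the sbt shell):
+
+```bash
+# run the unit tests
+./sbt/sbt test
+
+# run the integration tests against the embedded C* and Spark
+./sbt/sbt it:test
+
+# build a fat jar to pass to --jars in spark-submit or spark-shell
+./sbt/sbt assembly
+```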
+
+The integration tests located in `spark-cassandra-connector/src/it` should
+probably be the first place to look for anyone considering adding code.
+They contain many examples of exercising a connector feature against the
+embedded Cassandra and Spark nodes, and they are the core of our test
+coverage.
+
+### Sub-Projects
+
+The connector currently contains several sub-projects.
+
+#### spark-cassandra-connector
+This sub-project contains all of the actual connector code and is where
+any new features or tests should go. This Scala project also contains the
+Java API and related code.
+
+#### spark-cassandra-connector-embedded
+The code used to start the embedded services used in the integration tests.
+This contains methods for starting up C* as a thread within the running
+test code.
+
+#### spark-cassandra-connector-doc
+Code for building the reference documentation. This uses the code from
+`spark-cassandra-connector` to determine what belongs in the reference
+file. It should mostly be used for regenerating the reference file after
+new parameters have been added or old parameters have been changed. Tests
+in `spark-cassandra-connector` will throw errors if the reference file is
+not up to date. To fix this, run `spark-cassandra-connector-doc/run` to
+update the file. It is still necessary to commit the changed file after
+running this sub-project.
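+
+Concretely, refreshing the reference file might look like the following sketch
+(it assumes you are at the repository root and that `doc/reference.md` is the
+generated file, as linked from the README):
+
+```bash
+# regenerate doc/reference.md from the current parameter definitions
+./sbt/sbt spark-cassandra-connector-doc/run
+
+# commit the regenerated file so the reference-file check passes
+git add doc/reference.md
+git commit -m "Regenerate configuration reference"
+```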
+
+#### spark-cassandra-connector-perf
+Code for performance-based tests. Add any performance comparisons needed
+to this project.
+
+### Continuous Testing
+
+It's often useful when implementing new features to have the tests run
+in a loop on code change. sbt provides a method for this with the
+`~` operator. With this, `~ test` will run the unit tests every time a
+change in the source code is detected. This is often useful in
+conjunction with `testOnly`, which runs a single test suite. So if a new
+feature were being added to the integration suite `foo`, you may want to run
+`~ it:testOnly foo`, which would run only the suite you are interested in,
+in a loop, while you are modifying the connector code.
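+
+For example, to keep a single integration suite running in a loop while you
+edit, you might use something like the sketch below;
+`com.datastax.spark.connector.MyFeatureSpec` is only a placeholder for
+whatever suite you are working on:
+
+```bash
+# rerun one integration suite on every source change
+./sbt/sbt "~ it:testOnly com.datastax.spark.connector.MyFeatureSpec"
+```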
+
+### Packaging
+
+`spark-shell` and `spark-submit` are able to load libraries from a local cache,
+and the Spark Cassandra Connector can take advantage of this. For example,
+if you wanted to test the Maven artifacts produced for your current build,
+you could run `publishM2`, which would generate the needed artifacts and
+pom in your local cache. You can then reference this from `spark-shell`
+or `spark-submit` using the following command
+```bash
+./bin/spark-shell --repositories file:/Users/yourUser/.m2/repository --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0-14-gcfca49e
+```
+Where you would change the revision `1.6.0-14-gcfca49e` to match the output
+of your publish command.
+
+This same method should work with `publishLocal`
+after the merging of [SPARK-12666](https://issues.apache.org/jira/browse/SPARK-12666).
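+
+Putting the two steps together, a local round trip might look like the sketch
+below; the repository path and the revision placeholder are examples and will
+differ on your machine:
+
+```bash
+# publish the connector artifacts and pom to the local Maven cache
+./sbt/sbt publishM2
+
+# use the freshly published revision from spark-shell
+./bin/spark-shell \
+  --repositories file:/Users/yourUser/.m2/repository \
+  --packages com.datastax.spark:spark-cassandra-connector_2.10:<revision-from-publish-output>
+```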
diff --git a/project/Settings.scala b/project/Settings.scala
index 09d84586c..558ba3550 100644
--- a/project/Settings.scala
+++ b/project/Settings.scala
@@ -103,6 +103,7 @@ object Settings extends Build {
     spAppendScalaVersion := true,
     spIncludeMaven := true,
     spIgnoreProvided := true,
+    spShade := true,
     credentials += Credentials(Path.userHome / ".ivy2" / ".credentials")
   )
 
@@ -205,7 +206,8 @@ object Settings extends Build {
   lazy val defaultSettings = projectSettings ++ mimaSettings ++ releaseSettings ++ testSettings
 
   lazy val rootSettings = Seq(
-    cleanKeepFiles ++= Seq("resolution-cache", "streams", "spark-archives").map(target.value / _)
+    cleanKeepFiles ++= Seq("resolution-cache", "streams", "spark-archives").map(target.value / _),
+    updateOptions := updateOptions.value.withCachedResolution(true)
   )
 
   lazy val demoSettings = projectSettings ++ noPublish ++ Seq(
@@ -230,7 +232,7 @@ object Settings extends Build {
       cp
     }
   )
 
-  lazy val assembledSettings = defaultSettings ++ customTasks ++ sparkPackageSettings ++ sbtAssemblySettings
+  lazy val assembledSettings = defaultSettings ++ customTasks ++ sbtAssemblySettings ++ sparkPackageSettings
 
   val testOptionSettings = Seq(
     Tests.Argument(TestFrameworks.ScalaTest, "-oDF"),
@@ -347,8 +349,7 @@ object Settings extends Build {
     assemblyShadeRules in assembly := {
       val shadePackage = "shade.com.datastax.spark.connector"
       Seq(
-        ShadeRule.rename("com.google.common.**" -> s"$shadePackage.google.common.@1").inAll,
-        ShadeRule.rename("io.netty.**" -> s"$shadePackage.netty.@1").inAll
+        ShadeRule.rename("com.google.common.**" -> s"$shadePackage.google.common.@1").inAll
       )
     }
   )
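With the netty rule removed, only Guava is relocated under
`shade.com.datastax.spark.connector`. As a rough way to sanity-check a locally
built assembly, you can list the jar contents and look for the relocated
package; this is only a sketch, and the exact jar path depends on your Scala
version, the build's version string, and the target overrides introduced below:

```bash
./sbt/sbt assembly
# relocated Guava classes should appear under the shade prefix;
# adjust the jar path to wherever your build wrote the assembly
jar tf spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-*.jar \
  | grep 'shade/com/datastax/spark/connector/google/common' | head
```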
diff --git a/project/SparkCassandraConnectorBuild.scala b/project/SparkCassandraConnectorBuild.scala
index cdf669441..3bad27e98 100644
--- a/project/SparkCassandraConnectorBuild.scala
+++ b/project/SparkCassandraConnectorBuild.scala
@@ -19,6 +19,8 @@
 import java.io.File
 
 import sbt._
 import sbt.Keys._
+import sbtassembly._
+import sbtassembly.AssemblyKeys._
 import sbtsparkpackage.SparkPackagePlugin.autoImport._
 import pl.project13.scala.sbt.JmhPlugin
@@ -36,7 +38,7 @@ object CassandraSparkBuild extends Build {
     name = "root",
     dir = file("."),
     settings = rootSettings ++ Seq(cassandraServerClasspath := { "" }),
-    contains = Seq(embedded, connector, demos)
+    contains = Seq(embedded, connectorDistribution, demos)
   ).disablePlugins(AssemblyPlugin, SparkPackagePlugin)
 
   lazy val cassandraServerProject = Project(
@@ -59,14 +61,84 @@ object CassandraSparkBuild extends Build {
       "org.scala-lang" % "scala-compiler" % scalaVersion.value))
   ).disablePlugins(AssemblyPlugin, SparkPackagePlugin) configs IntegrationTest
 
-  lazy val connector = CrossScalaVersionsProject(
+  /**
+   * Do not include shaded dependencies, so they will not be listed in the pom created for this
+   * project.
+   *
+   * Run the compile from the shaded project since we are no longer including shaded libs.
+   */
+  lazy val connectorDistribution = CrossScalaVersionsProject(
     name = namespace,
-    conf = assembledSettings ++ Seq(libraryDependencies ++= Dependencies.connector ++ Seq(
+    conf = assembledSettings ++ Seq(
+      libraryDependencies ++= Dependencies.connectorDistribution ++ Seq(
       "org.scala-lang" % "scala-reflect" % scalaVersion.value,
-      "org.scala-lang" % "scala-compiler" % scalaVersion.value % "test,it")) ++ pureCassandraSettings
-  ).copy(dependencies = Seq(embedded % "test->test;it->it,test;")
+      "org.scala-lang" % "scala-compiler" % scalaVersion.value % "test,it"),
+      //Use the assembly which contains all of the libs, not just the shaded ones
+      assembly in spPackage := (assembly in shadedConnector).value,
+      //Use the pom file for this project in spark packages, not the assembly pom
+      spMakePom := makePom.value,
+      assembly := (assembly in fullConnector).value,
+      //Use the shaded jar as our packageTarget
+      packageBin := {
+        val shaded = (assembly in shadedConnector).value
+        val targetName = target.value
+        val expected = target.value / s"$namespace-${version.value}.jar"
+        IO.move(shaded, expected)
+        val log = streams.value.log
+        log.info(s"""Shaded jar moved to $expected""".stripMargin)
+        expected
+      },
+      sbt.Keys.`package` := packageBin.value)
+      ++ pureCassandraSettings
+      //Update the distribution tasks to use the shaded jar
+      ++ {for (taskKey <- Seq(publishLocal in Compile, publish in Compile, publishM2 in Compile)) yield {
+        packagedArtifacts in taskKey := {
+          val previous = (packagedArtifacts in Compile).value
+          val shadedJar = (artifact.value.copy(configurations = Seq(Compile)) -> packageBin.value)
+          //Clobber the old build artifact with the shaded jar
+          previous + shadedJar
+        }
+      }}
+  ).copy(dependencies = Seq(embedded % "test->test;it->it,test;")
   ) configs IntegrationTest
 
+  /** Because the distribution project has to mark the shaded jars as provided to remove them from
+    * the distribution dependencies, we provide this additional project to build a fat jar which
+    * includes everything. The artifact produced by this project is unshaded while the assembly
+    * remains shaded.
+    */
+  lazy val fullConnector = CrossScalaVersionsProject(
+    name = s"$namespace-full",
+    conf = assembledSettings ++ Seq(
+      libraryDependencies ++= Dependencies.connectorAll
+        ++ Dependencies.includedInShadedJar
+        ++ Seq(
+          "org.scala-lang" % "scala-reflect" % scalaVersion.value,
+          "org.scala-lang" % "scala-compiler" % scalaVersion.value % "test,it"),
+      target := target.value / "full"
+    )
+    ++ pureCassandraSettings,
+    base = Some(namespace)
+  ).copy(dependencies = Seq(embedded % "test->test;it->it,test")) configs IntegrationTest
+
+
+  lazy val shadedConnector = CrossScalaVersionsProject(
+    name = s"$namespace-shaded",
+    conf = assembledSettings ++ Seq(
+      libraryDependencies ++= Dependencies.connectorNonShaded
+        ++ Dependencies.includedInShadedJar
+        ++ Seq(
+          "org.scala-lang" % "scala-reflect" % scalaVersion.value,
+          "org.scala-lang" % "scala-compiler" % scalaVersion.value % "test,it"),
+      target := target.value / "shaded",
+      test in assembly := {},
+      publishArtifact in (Compile, packageBin) := false)
+      ++ pureCassandraSettings,
+    base = Some(namespace)
+  ).copy(dependencies = Seq(embedded % "test->test;it->it,test;")
+  ) configs IntegrationTest
+
+
   lazy val demos = RootProject(
     name = "demos",
     dir = demosPath,
@@ -77,7 +149,7 @@ object CassandraSparkBuild extends Build {
     id = "simple-demos",
     base = demosPath / "simple-demos",
     settings = demoSettings,
-    dependencies = Seq(connector, embedded)
+    dependencies = Seq(connectorDistribution, embedded)
   ).disablePlugins(AssemblyPlugin, SparkPackagePlugin)
 
 /*
   lazy val kafkaStreaming = CrossScalaVersionsProject(
@@ -86,26 +158,26 @@ object CassandraSparkBuild extends Build {
       libraryDependencies ++= (CrossVersion.partialVersion(scalaVersion.value) match {
         case Some((2, minor)) if minor < 11 => Dependencies.kafka
         case _ => Seq.empty
-      }))).copy(base = demosPath / "kafka-streaming", dependencies = Seq(connector, embedded))
+      }))).copy(base = demosPath / "kafka-streaming", dependencies = Seq(connectorAll, embedded))
 */
   lazy val twitterStreaming = Project(
     id = "twitter-streaming",
     base = demosPath / "twitter-streaming",
     settings = demoSettings ++ Seq(libraryDependencies ++= Dependencies.twitter),
-    dependencies = Seq(connector, embedded)
+    dependencies = Seq(connectorDistribution, embedded)
   ).disablePlugins(AssemblyPlugin, SparkPackagePlugin)
 
   lazy val refDoc = Project(
     id = s"$namespace-doc",
     base = file(s"$namespace-doc"),
     settings = defaultSettings ++ Seq(libraryDependencies ++= Dependencies.spark)
-  ) dependsOn connector
+  ) dependsOn connectorDistribution
 
   lazy val perf = Project(
     id = s"$namespace-perf",
     base = file(s"$namespace-perf"),
     settings = projectSettings,
-    dependencies = Seq(connector, embedded)
+    dependencies = Seq(connectorDistribution, embedded)
   ) enablePlugins(JmhPlugin)
 
   def crossBuildPath(base: sbt.File, v: String): sbt.File = base / s"scala-$v" / "src"
@@ -113,8 +185,9 @@
   /* templates */
   def CrossScalaVersionsProject(name: String,
                                 conf: Seq[Def.Setting[_]],
-                                reliesOn: Seq[ClasspathDep[ProjectReference]] = Seq.empty) =
-    Project(id = name, base = file(name), dependencies = reliesOn, settings = conf ++ Seq(
+                                reliesOn: Seq[ClasspathDep[ProjectReference]] = Seq.empty,
+                                base: Option[String] = None) =
+    Project(id = name, base = file(base.getOrElse(name)), dependencies = reliesOn, settings = conf ++ Seq(
       unmanagedSourceDirectories in (Compile, packageBin) +=
         crossBuildPath(baseDirectory.value, scalaBinaryVersion.value),
       unmanagedSourceDirectories in (Compile, doc) +=
@@ -139,22 +212,37 @@ object Artifacts {
   import Versions._
 
   implicit class Exclude(module: ModuleID) {
-    def guavaExclude: ModuleID =
+    def guavaExclude(): ModuleID =
       module exclude("com.google.guava", "guava")
 
-    def sparkExclusions: ModuleID = module.guavaExclude
+    /** We will just include netty-all as a dependency. */
+    def nettyExclude(): ModuleID = module
+      .exclude("io.netty", "netty")
+      .exclude("io.netty", "netty-buffer")
+      .exclude("io.netty", "netty-codec")
+      .exclude("io.netty", "netty-common")
+      .exclude("io.netty", "netty-handler")
+      .exclude("io.netty", "netty-transport")
+
+    def driverExclusions(): ModuleID =
+      guavaExclude().nettyExclude()
+        .exclude("io.dropwizard.metrics", "metrics-core")
+        .exclude("org.slf4j", "slf4j-api")
+
+
+    def sparkExclusions(): ModuleID = module.nettyExclude
       .exclude("org.apache.spark", s"spark-core_$scalaBinary")
 
-    def logbackExclude: ModuleID = module
+    def logbackExclude(): ModuleID = module
       .exclude("ch.qos.logback", "logback-classic")
       .exclude("ch.qos.logback", "logback-core")
 
-    def replExclusions: ModuleID = module.guavaExclude
+    def replExclusions(): ModuleID = nettyExclude().guavaExclude()
       .exclude("org.apache.spark", s"spark-bagel_$scalaBinary")
       .exclude("org.apache.spark", s"spark-mllib_$scalaBinary")
       .exclude("org.scala-lang", "scala-compiler")
 
-    def kafkaExclusions: ModuleID = module
+    def kafkaExclusions(): ModuleID = module
      .exclude("org.slf4j", "slf4j-simple")
      .exclude("com.sun.jmx", "jmxri")
      .exclude("com.sun.jdmk", "jmxtools")
@@ -164,12 +252,13 @@ object Artifacts {
   val akkaActor = "com.typesafe.akka" %% "akka-actor" % Akka % "provided" // ApacheV2
   val akkaRemote = "com.typesafe.akka" %% "akka-remote" % Akka % "provided" // ApacheV2
   val akkaSlf4j = "com.typesafe.akka" %% "akka-slf4j" % Akka % "provided" // ApacheV2
-  val cassandraDriver = "com.datastax.cassandra" % "cassandra-driver-core" % CassandraDriver guavaExclude // ApacheV2
+  val cassandraDriver = "com.datastax.cassandra" % "cassandra-driver-core" % CassandraDriver driverExclusions() // ApacheV2
   val config = "com.typesafe" % "config" % Config % "provided" // ApacheV2
   val guava = "com.google.guava" % "guava" % Guava
   val jodaC = "org.joda" % "joda-convert" % JodaC
   val jodaT = "joda-time" % "joda-time" % JodaT
   val lzf = "com.ning" % "compress-lzf" % Lzf % "provided"
+  val netty = "io.netty" % "netty-all" % Netty
   val slf4jApi = "org.slf4j" % "slf4j-api" % Slf4j % "provided" // MIT
   val jsr166e = "com.twitter" % "jsr166e" % JSR166e // Creative Commons
   val airlift = "io.airlift" % "airline" % Airlift
@@ -261,11 +350,48 @@ object Dependencies {
 
   val spark = Seq(sparkCore, sparkStreaming, sparkSql, sparkCatalyst, sparkHive, sparkUnsafe)
 
-  val connector = testKit ++ metrics ++ jetty ++ logging ++ akka ++ cassandra ++ spark.map(_ % "provided") ++ Seq(
-    config, guava, jodaC, jodaT, lzf, jsr166e)
+  /**
+   * Dependencies which will be shaded in our distribution artifact and not listed on the
+   * distribution artifact's dependency list.
+   */
+  val includedInShadedJar = Seq(guava, cassandraDriver)
+
+  /**
+   * This is the full dependency list required to build an assembly containing all dependencies
+   * required to run the connector, none of them marked as provided except for those which will
+   * be on the classpath because of Spark.
+   */
+  val connectorAll = testKit ++
+    metrics ++
+    jetty ++
+    logging ++
+    akka ++
+    cassandra ++
+    spark.map(_ % "provided") ++
+    Seq(config, jodaC, jodaT, lzf, netty, jsr166e) ++
+    includedInShadedJar
+
+  /**
+   * Mark the shaded dependencies as provided; this removes them from the artifacts to be
+   * downloaded by build systems. This avoids downloading a Cassandra driver which does not have
+   * its Guava references shaded.
+   */
+  val connectorDistribution = (connectorAll.toSet -- includedInShadedJar.toSet).toSeq ++
+    includedInShadedJar.map(_ % "provided")
+
+  /**
+   * When building the shaded jar we want the assembly task to ONLY include the shaded libs; to
+   * accomplish this we set all other dependencies as provided.
+   */
+  val connectorNonShaded = (connectorAll.toSet -- includedInShadedJar.toSet).toSeq.map { dep =>
+    dep.configurations match {
+      case Some(conf) => dep
+      case _ => dep % "provided"
+    }
+  }
 
   val embedded = logging ++ spark ++ cassandra ++ Seq(
-    cassandraServer % "it,test", Embedded.jopt, Embedded.sparkRepl, Embedded.kafka, Embedded.snappy, guava)
+    cassandraServer % "it,test", Embedded.jopt, Embedded.sparkRepl, Embedded.kafka, Embedded.snappy, guava, netty)
 
   val perf = logging ++ spark ++ cassandra
diff --git a/project/Versions.scala b/project/Versions.scala
index d3954743e..78a58583d 100644
--- a/project/Versions.scala
+++ b/project/Versions.scala
@@ -42,6 +42,7 @@ object Versions {
   val Kafka = "0.8.2.1"
   val Kafka210 = "0.8.1.1"
   val Lzf = "0.8.4"
+  val Netty = "4.0.33.Final"
   val CodaHaleMetrics = "3.0.2"
   val ScalaCheck = "1.12.5"
   val ScalaMock = "3.2"
diff --git a/project/plugins.sbt b/project/plugins.sbt
index abcc86ea7..846139133 100644
--- a/project/plugins.sbt
+++ b/project/plugins.sbt
@@ -19,7 +19,7 @@ addSbtPlugin("org.scoverage" % "sbt-scoverage" % "1.0.4")
 
 //SbtAssembly 0.12.0 is included in sbt-spark-package
 resolvers += "Spark Packages Main repo" at "https://dl.bintray.com/spark-packages/maven"
 
-addSbtPlugin("org.spark-packages" %% "sbt-spark-package" % "0.2.3")
+addSbtPlugin("org.spark-packages" %% "sbt-spark-package" % "0.2.5")
 
 addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.2.6")