Overview

This example shows how to run a Spark Streaming job on the BigInsights cluster that consumes messages from IBM MessageHub (Kafka). Note that the Spark Streaming job runs on a single node and not as a YARN application.

See also

  • For a similar project aimed at IBM Data Science Experience (DSX), see https://github.com/snowch/SparkMessageHubScala
  • Please add your +1 to this ticket, which requests similar examples for PySpark and Spark 2.x
  • This example is similar to SparkMessageHubScalaYarn, but in that version the Spark Streaming job runs as a YARN application. If you are just starting with Spark Streaming and MessageHub, start with this example, as it is much easier to debug.

Developer experience

Developers will gain the most from these examples if they are:

  • Comfortable using Windows, OS X or *nix command prompts
  • Familiar with Scala
  • Familiar with the Gradle build tool
  • Familiar with Spark concepts such as Spark Streaming
  • Familiar with MessageHub (Kafka)

Example Requirements

  • You meet the prerequisites in the top-level README
  • You have performed the setup instructions in the top-level README
  • You are using BigInsights 4.2 with Spark 1.6 (this solution has not been tested with BigInsights 4.3 and Spark 2.0)
  • You have set up an instance of MessageHub
  • If you want to use the Python script for sending messages to Kafka, you must have:
    • Python 3.x installed
    • kafka-python library installed (pip install kafka-python)
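
The Python prerequisites above can be checked with a few lines of code before running anything else. This is only a sketch: the function name `check_prerequisites` is made up for illustration, and it relies on kafka-python being importable under the module name `kafka`.

```python
import importlib.util
import sys


def check_prerequisites(min_version=(3, 0)):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < min_version:
        problems.append("Python %d.x or later is required" % min_version[0])
    # kafka-python installs a top-level module named 'kafka'.
    if importlib.util.find_spec("kafka") is None:
        problems.append("kafka-python is missing (pip install kafka-python)")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

If the script prints nothing, both requirements appear to be met. A module named `kafka` could in principle come from a different package, so treat a clean result as a hint rather than proof.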

Run the example

To run the examples, in a command prompt window:

  • Copy vcap.json_template to vcap.json
  • Edit vcap.json with your MessageHub connection details
  • Change into the directory containing this example and run Gradle to execute the example
    • ./gradlew Example (OS X / *nix)
    • gradlew.bat Example (Windows)
  • While the streaming example is running, open a new terminal window and run the Python script, e.g.
    • python send_message.py 12345
  • SSH into your cluster
  • Run hdfs dfs -ls /user/<<your_username>>/ to see the output folders
  • Run hdfs dfs -cat /user/<<your_username>>/test-<<output_folder_uuid>>/*

You should see something like this:

-bash-4.1$ hdfs dfs -cat /user/snowch/test-0458b164-007e-465a-89fe-13aa8f40f35d/*
(null,12345)
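
Each output line looks like a `(key,value)` pair. The `null` key is presumably there because the message was sent without a key; the source does not state this, so treat it as an assumption. If you want to post-process the output, a minimal parsing sketch might look like this (the helper name `parse_output_line` is hypothetical):

```python
import re

# Matches a tuple-style line such as "(null,12345)".
_TUPLE_LINE = re.compile(r"^\((?P<key>[^,]*),(?P<value>.*)\)$")


def parse_output_line(line):
    """Split one '(key,value)' line into a (key, value) pair.

    The literal text 'null' is mapped to None for the key.
    Returns None when the line does not look like a pair.
    """
    match = _TUPLE_LINE.match(line.strip())
    if match is None:
        return None
    key = match.group("key")
    return (None if key == "null" else key, match.group("value"))


print(parse_output_line("(null,12345)"))  # prints (None, '12345')
```

Everything after the first comma is treated as the value, so values that themselves contain commas are kept intact.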

When you have finished, press CTRL-C in the window running ./gradlew Example, then run ./gradlew KillAll to kill the streaming job on the cluster.
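
For orientation, the message-sending step could be sketched with kafka-python roughly as follows. This is not the repository's send_message.py. It assumes that vcap.json is a flat JSON object with the fields `kafka_brokers_sasl`, `user` and `password` (check your own file, the layout may differ), and the topic name `mytopic` is a placeholder.

```python
import json
import ssl
import sys


def build_producer_config(credentials):
    """Map MessageHub credentials to kafka-python KafkaProducer keyword args.

    ASSUMPTION: the field names 'kafka_brokers_sasl', 'user' and 'password'
    match your vcap.json; adjust them if your file is laid out differently.
    """
    return {
        "bootstrap_servers": credentials["kafka_brokers_sasl"],
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        "sasl_plain_username": credentials["user"],
        "sasl_plain_password": credentials["password"],
        "ssl_context": ssl.create_default_context(),
    }


def send(message, credentials, topic):
    """Send one message (with no key) to the given topic and wait for delivery."""
    # Imported lazily so the config helper works without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(**build_producer_config(credentials))
    producer.send(topic, value=message.encode("utf-8"))
    producer.flush()
    producer.close()


def main(argv):
    if len(argv) != 2:
        print("usage: python send_message.py <message>")
        return 1
    with open("vcap.json") as f:
        credentials = json.load(f)
    send(argv[1], credentials, topic="mytopic")  # placeholder topic name
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the credential mapping in its own function makes it easy to inspect the connection settings without contacting the broker.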

Decomposition Instructions

The ./build.gradle script runs the example. It builds the jar and then uses an SSH plugin to:

  • Copy build/libs/SparkMessageHubScala-all.jar to the BigInsights cluster
  • From the SSH session, execute the MessageHubConsumer Spark Streaming job
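
To make the two steps above concrete, here is a rough sketch of equivalent manual commands, built as argument lists. It is not taken from build.gradle. The use of `scp`, the fully qualified class name `com.ibm.cds.spark.samples.MessageHubConsumer`, the `local[2]` master (chosen because the job runs on a single node rather than on YARN), and the user and host values are all assumptions.

```python
import shlex

JAR = "build/libs/SparkMessageHubScala-all.jar"  # path named in this README


def copy_command(user, host, remote_dir="."):
    """argv for copying the jar to the cluster (scp is an assumption)."""
    return ["scp", JAR, "%s@%s:%s/" % (user, host, remote_dir)]


def submit_command(main_class="com.ibm.cds.spark.samples.MessageHubConsumer",
                   master="local[2]"):
    """argv for launching the streaming job on the cluster node.

    ASSUMPTION: the main class lives in the package mentioned in this README
    and the job is started in local mode; the real build may do it differently.
    """
    return ["spark-submit", "--class", main_class, "--master", master,
            "SparkMessageHubScala-all.jar"]


if __name__ == "__main__":
    # Hypothetical user and host, for display only; nothing is executed.
    for argv in (copy_command("your_username", "your-cluster-host"),
                 submit_command()):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

The script only prints the commands, which is useful when debugging a failed Gradle run by hand.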

It is recommended that you spend time investigating the classes in the package com.ibm.cds.spark.samples.*. You will probably need to make changes in these classes, especially in the initConfigKeys() method in MessageHubConfig.