Stream Processing Changes using Azure Cosmos DB Change Feed and Apache Spark

Denny Lee edited this page Sep 6, 2017 · 3 revisions

In this sample, we set up a streaming source of tweets, filtered by particular hashtags, that feeds a Cosmos DB collection. Change feed support is enabled on that collection, and we showcase how the Spark compute engine can connect to the change feed.

The example will allow you to:

  1. Feed a streaming source to a Cosmos DB collection
    • Create a VM and use the Stream feed from Twitter to Cosmos DB code to populate your Cosmos DB collection with Twitter data
    • Run the Twitter script as a background job
  2. Spin up an HDI Spark cluster to read the Cosmos DB Change Feed
    • Spin up an HDI Spark Cluster
    • Read the data from a notebook

Feeding a streaming source to a Cosmos DB collection

The code and instructions to run the sample script which filters live tweets based on custom hashtags and feeds that stream to Cosmos DB can be found at Stream feed from Twitter to Cosmos DB.
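The heart of that script is a filter that keeps only tweets mentioning the tracked hashtags before they are written to Cosmos DB. A minimal sketch of that filtering step follows; the field names and hashtags here are hypothetical, and the linked script defines the real ones.

```python
# Hedged sketch of hashtag filtering, as performed before a tweet document
# is written to Cosmos DB. Field names and hashtags are placeholders.

def matches_hashtags(tweet, hashtags):
    """Return True if the tweet's text contains any tracked hashtag."""
    text = tweet.get("text", "").lower()
    return any(tag.lower() in text for tag in hashtags)

tracked = ["#Azure", "#CosmosDB"]  # hypothetical filter terms
sample = {"id": "1", "text": "Trying the #CosmosDB change feed with Spark"}

print(matches_hashtags(sample, tracked))  # True
```

Only documents that pass this check would be inserted into the collection, so the change feed downstream contains just the filtered stream.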

Instructions to spin up a VM on Azure where you can run the script as a background job can be found at Create a Linux virtual machine with the Azure portal.
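Once the VM is up, the streaming script can be launched as a background job so it keeps running after the SSH session ends. A minimal sketch, assuming the sample script is saved as `stream_tweets.py` (a placeholder name):

```shell
# Run the streaming script in the background, detached from the terminal.
# "stream_tweets.py" is a placeholder for the sample script's filename.
nohup python stream_tweets.py > tweets.log 2>&1 &
echo $! > stream.pid   # save the PID; stop later with: kill $(cat stream.pid)
```

`tail -f tweets.log` can then be used to watch the script's output while it feeds the collection.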

Spin up an HDI Spark cluster to read the Cosmos DB Change Feed

To spin up your own HDI Spark cluster and upload the Spark Connector jars (allowing the Spark cluster to connect to Cosmos DB), please refer to the instructions at Spark to Cosmos DB Connector Setup.
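In an HDI Jupyter notebook, an uploaded connector jar is typically attached before the Spark session starts via the `%%configure` magic. A hedged sketch follows; the storage path and jar filename are placeholders for wherever you uploaded the jars in the setup instructions:

```
%%configure -f
{ "jars": ["wasb:///example/jars/azure-cosmosdb-spark-uber.jar"] }
```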

You can then log into the Jupyter notebook service (as per the previously linked instructions) and upload the Twitter Source with Apache Spark and Azure Cosmos DB Change Feed notebook (here's the HTML output), which lets you watch the Cosmos DB change feed via Spark as it reads the Twitter data.
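For reference, a hedged sketch of the read configuration such a notebook uses with the azure-cosmosdb-spark connector. The account endpoint, key, and database/collection names are placeholders, and the exact option names should be checked against the connector's documentation:

```python
# Hedged sketch: configuration for reading the Cosmos DB change feed via the
# azure-cosmosdb-spark connector. Endpoint, key, and names are placeholders.
change_feed_config = {
    "Endpoint": "https://<your-account>.documents.azure.com:443/",
    "Masterkey": "<your-primary-key>",
    "Database": "TwitterDB",       # hypothetical database name
    "Collection": "tweets",        # hypothetical collection name
    "ReadChangeFeed": "true",      # read the change feed, not the full collection
    "ChangeFeedQueryName": "tweets-change-feed",  # identifies this reader's continuation state
    "ChangeFeedStartFromTheBeginning": "false",   # only pick up new changes
}

# In the notebook's Spark session, the feed would be read along these lines:
#   df = (spark.read
#              .format("com.microsoft.azure.cosmosdb.spark")
#              .options(**change_feed_config)
#              .load())
#   df.show()
```

Re-running the read with the same `ChangeFeedQueryName` returns only documents changed since the previous read, which is what makes the repeated reads in the notebook show the stream advancing.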