# Azure Cosmos DB Connector for Apache Spark

## Configuration Reference Guide

This guide lists the available configurations of azure-cosmosdb-spark. Depending on your scenario, different configurations should be used to optimize performance and throughput. Note that these configurations are passed through to the Azure Cosmos DB Java SDK.

Configuration keys are case-insensitive and, for now, configuration values are always strings.


## Reading from Cosmos DB Collection

This section contains the configuration parameters used by azure-cosmosdb-spark to read an Azure Cosmos DB collection. For example, below are code snippets in Python and Scala.

**Python Example**

```python
# Configure connection to your collection (Python)
readConfig = {
  "Endpoint" : "https://doctorwho.documents.azure.com:443/",
  "Masterkey" : "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" : "DepartureDelays",
  "Collection" : "flights_pcoll"
}
```

**Scala Example**

```scala
// Configure connection to your collection (Scala)
val readConfig = Config(Map(
  "Endpoint" -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey" -> "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" -> "DepartureDelays",
  "Collection" -> "flights_pcoll"
))
```
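
Once a read configuration like the one above is defined, it is handed to Spark to load the collection as a DataFrame. The snippet below is a minimal sketch rather than part of this guide: it assumes a SparkSession named `spark`, the azure-cosmosdb-spark jar on the classpath, and the connector's implicit imports for the `cosmosDB` read helper.

```scala
// Minimal sketch: load the collection described by readConfig as a DataFrame.
// Assumes a SparkSession named `spark` and the azure-cosmosdb-spark jar on the classpath.
import com.microsoft.azure.cosmosdb.spark.schema._   // enables spark.read.cosmosDB(...)
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config

val flights = spark.read.cosmosDB(readConfig)  // schema is inferred from sampled documents
flights.printSchema()
println(flights.count())
```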

### Reading: Required Parameters

These parameters are required for azure-cosmosdb-spark to read from an Azure Cosmos DB collection.

| Parameter | Description |
| --- | --- |
| Collection | This specifies the name of your Azure Cosmos DB collection. |
| Database | This specifies the name of your Azure Cosmos DB database. |
| Endpoint | This specifies the endpoint, in the format of a URI, to connect to your Azure Cosmos DB account, e.g. https://doctorwho.documents.azure.com:443/ |
| Masterkey | This specifies the master key that allows azure-cosmosdb-spark to connect to your endpoint, database, and collection. |

 

### Reading: Common Batch Parameters

While these parameters are optional, they are often used to improve read batch performance.

| Parameter | Description |
| --- | --- |
| query_custom | Use this parameter to provide your own Cosmos DB query, overriding azure-cosmosdb-spark's default behavior of building the query by parsing the Spark SQL WHERE clause (predicate pushdown), e.g. SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA' |
| query_pagesize | Sets the size of the query result page for each query request. To optimize for throughput, use a large page size to reduce the number of round trips needed to fetch query results when using a predicate, and use a small page size to reduce the likelihood of 429 errors (request rate too large) when specifying a query_custom that does not have a predicate pushdown. |
| PreferredRegions | A semicolon-separated list of preferred Azure regions for the read, e.g. West US; Southeast Asia |
| SamplingRatio | Typically set to 1.0 |
| schema_samplesize | The number of documents / rows to scan so that Apache Spark can infer the schema. A collection can contain multiple document types; it may be necessary to increase this number so that enough documents are scanned for Spark to infer all of the different schemas and merge them into a single Spark DataFrame. |

 

**Python Example**

```python
# Configure connection to your collection (Python)
readConfig = {
  "Endpoint" : "https://doctorwho.documents.azure.com:443/",
  "Masterkey" : "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" : "DepartureDelays",
  "Collection" : "flights_pcoll",
  "PreferredRegions" : "West US; Southeast Asia",
  "schema_samplesize" : "1000",
  "query_pagesize" : "200000",
  "query_custom" : "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'"
}
```

 

**Scala Example**

```scala
// Configure connection to your collection (Scala)
val readConfig = Config(Map(
  "Endpoint" -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey" -> "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" -> "DepartureDelays",
  "Collection" -> "flights_pcoll",
  "PreferredRegions" -> "West US; Southeast Asia",
  "schema_samplesize" -> "1000",
  "query_pagesize" -> "200000",
  "query_custom" -> "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'"
))
```
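
As an alternative to hard-coding query_custom, the connector's default behavior (described above) is to derive the Cosmos DB query from the Spark SQL WHERE clause. The sketch below illustrates that path under stated assumptions (a SparkSession named `spark`, the connector jar on the classpath, and a placeholder master key); it is not a snippet from this guide.

```scala
// Sketch: rely on the default predicate pushdown instead of query_custom.
// The connector parses the Spark SQL WHERE clause and pushes it down to Cosmos DB.
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config

val pushdownConfig = Config(Map(
  "Endpoint"       -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey"      -> "<your master key>",
  "Database"       -> "DepartureDelays",
  "Collection"     -> "flights_pcoll",
  "query_pagesize" -> "200000"   // large page size: fewer round trips when a predicate is pushed down
))

val flights = spark.read.cosmosDB(pushdownConfig)
flights.createOrReplaceTempView("flights")

// Equivalent to the query_custom example above, but expressed in Spark SQL.
val seaDepartures = spark.sql(
  "SELECT date, delay, distance, origin, destination FROM flights WHERE origin = 'SEA'")
seaDepartures.show(10)
```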

 

### Reading: Required Change Feed Parameters

To read the Azure Cosmos DB change feed instead of the collection (i.e. batch) directly, you only need to add the ChangeFeedCheckpointLocation and ChangeFeedQueryName parameters. You should also add the ConnectionMode and query_custom parameters for better performance.

| Parameter | Description |
| --- | --- |
| ChangeFeedCheckpointLocation | The path to local file storage where continuation tokens are persisted in case of node failures, e.g. /tmp/checkpointlocation1406 |
| ChangeFeedQueryName | A custom string to identify the query. The connector keeps track of the collection continuation tokens for different change feed queries separately. If readchangefeed is true, this is a required configuration. |
| ConnectionMode | Specifies whether the connection goes through the data partitions or the gateway; for change feed, specify Gateway. |
| query_custom | This is the same query_custom from Common Batch Parameters. For large documents (e.g. Twitter JSON), use this parameter to reduce the size of the change feed read, e.g. SELECT c.id, c.created_at, c.user.screen_name, c.user.lang, c.user.location, c.text, c.retweet_count, c.entities.hashtags, c.entities.user_mentions, c.favorited, c.source FROM c |

 

**Python Example**

```python
# Configure connection to your collection (Python)
readConfig = {
  "Endpoint" : "https://doctorwho.documents.azure.com:443/",
  "Masterkey" : "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" : "DepartureDelays",
  "Collection" : "flights_pcoll",
  "ConnectionMode" : "Gateway",
  "ChangeFeedQueryName" : "CF1406",
  "ChangeFeedCheckpointLocation" : "/tmp/checkpointlocation1406",
  "query_custom" : "SELECT c.id, c.created_at, c.user.screen_name, c.user.lang, c.user.location, c.text, c.retweet_count, c.entities.hashtags, c.entities.user_mentions, c.favorited, c.source FROM c"
}
```

 

**Scala Example**

```scala
// Configure connection to your collection (Scala)
val readConfig = Config(Map(
  "Endpoint" -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey" -> "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" -> "DepartureDelays",
  "Collection" -> "flights_pcoll",
  "ConnectionMode" -> "Gateway",
  "ChangeFeedQueryName" -> "CF1406",
  "ChangeFeedCheckpointLocation" -> "/tmp/checkpointlocation1406",
  "query_custom" -> "SELECT c.id, c.created_at, c.user.screen_name, c.user.lang, c.user.location, c.text, c.retweet_count, c.entities.hashtags, c.entities.user_mentions, c.favorited, c.source FROM c"
))
```
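
The configurations above take effect once the read is pointed at the change feed via the readchangefeed parameter, which is listed under Optional Change Feed Parameters below. As a hedged sketch (same assumptions as the earlier sketches: a SparkSession named `spark`, the connector on the classpath, and a placeholder master key), loading the change feed as a DataFrame might look like this:

```scala
// Sketch: read the change feed (rather than the collection itself) as a DataFrame.
// readchangefeed = "true" switches the source to the change feed; continuation
// tokens are checkpointed under ChangeFeedCheckpointLocation.
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config

val changeFeedConfig = Config(Map(
  "Endpoint"                     -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey"                    -> "<your master key>",
  "Database"                     -> "DepartureDelays",
  "Collection"                   -> "flights_pcoll",
  "ConnectionMode"               -> "Gateway",
  "ChangeFeedQueryName"          -> "CF1406",
  "ChangeFeedCheckpointLocation" -> "/tmp/checkpointlocation1406",
  "readchangefeed"               -> "true"
))

val changes = spark.read.cosmosDB(changeFeedConfig)  // documents changed since the last checkpoint
println(changes.count())
```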

 

### Reading: Optional Change Feed Parameters

Below are optional change feed parameters.

| Parameter | Description |
| --- | --- |
| changefeedstartfromthebeginning | Sets whether the change feed should start from the beginning (true) or from the current point in time (false, the default). |
| changefeedusenexttoken | A boolean value to support processing failure scenarios. It indicates that the current change feed batch has been handled gracefully and the RDD should use the next continuation tokens to get the subsequent batch of changes. |
| readchangefeed | Indicates that the collection content is fetched from the Cosmos DB change feed. The default value is false. |
| rollingchangefeed | Indicates whether the change feed should contain only the changes since the last query. The default value is false, which means changes are counted from the first read of the collection. |
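
As a hedged sketch of how these flags might be combined for incremental batch reads (the exact checkpointing behavior depends on the connector version; the imports, placeholder master key, and `spark` session are the same assumptions as in the earlier sketches): the first read replays the feed from the beginning, and a later read sets changefeedusenexttoken once the previous batch has been handled.

```scala
// Sketch: incremental change feed batches using the optional flags above.
val base = Map(
  "Endpoint"                     -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey"                    -> "<your master key>",
  "Database"                     -> "DepartureDelays",
  "Collection"                   -> "flights_pcoll",
  "ConnectionMode"               -> "Gateway",
  "ChangeFeedQueryName"          -> "CF1406",
  "ChangeFeedCheckpointLocation" -> "/tmp/checkpointlocation1406",
  "readchangefeed"               -> "true"
)

// First pass: replay the feed from the beginning of the collection.
val firstBatch = spark.read.cosmosDB(Config(base + ("changefeedstartfromthebeginning" -> "true")))
firstBatch.write.mode("append").parquet("/tmp/flights-changes")   // batch handled gracefully...

// Subsequent pass: advance to the next continuation tokens and fetch only newer changes.
val nextBatch = spark.read.cosmosDB(Config(base + ("changefeedusenexttoken" -> "true")))
```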

 

### Reading: Optional Parameters

These miscellaneous optional parameters give you finer-grained control over your reads, if desired.

| Parameter | Description |
| --- | --- |
| query_emitverbosetraces | Sets the option to allow queries to emit verbose traces for investigation. |
| query_enablescan | Sets the option to enable scans on queries that could not be served by an Azure Cosmos DB index because the index was disabled (indexes on all attributes are enabled by default in Azure Cosmos DB). |
| query_maxbuffereditemcount | Sets the maximum number of items that can be buffered client side during parallel query execution. A positive value limits the number of buffered items to the set value. If it is set to less than 0, the system automatically decides the number of items to buffer. |
| query_maxdegreeofparallelism | Sets the number of concurrent operations run client side during parallel query execution. A positive value limits the number of concurrent operations to the set value. If it is set to less than 0, the system automatically decides the number of concurrent operations to run. As azure-cosmosdb-spark maps each collection partition to an executor, this value has minimal effect on the read operation. |
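
For completeness, below is a short sketch of layering these tuning knobs onto a read configuration; the values are illustrative assumptions, not recommendations from this guide.

```scala
// Sketch: optional query tuning knobs added to a read configuration (illustrative values).
import com.microsoft.azure.cosmosdb.spark.config.Config

val tunedReadConfig = Config(Map(
  "Endpoint"                      -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey"                     -> "<your master key>",
  "Database"                      -> "DepartureDelays",
  "Collection"                    -> "flights_pcoll",
  "query_enablescan"              -> "true",   // allow scans when no index can serve the query
  "query_maxbuffereditemcount"    -> "10000",  // cap client-side buffering during parallel queries
  "query_maxdegreeofparallelism"  -> "-1"      // let the client decide concurrency automatically
))
```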

Back to the top