Adds documentation for DocumentDB as a source. #7137

Merged
merged 35 commits into from
May 24, 2024
79d3f43
Adds documentation for DocumentDB as a source.
dlvenable May 14, 2024
e444d3a
Merge branch 'main' into data-prepper-documentdb
vagimeli May 20, 2024
c82878e
Apply suggestions from code review
dlvenable May 20, 2024
a5d9d8a
Made some other changes to the documentation per the PR.
dlvenable May 24, 2024
fc88eda
Merge branch 'main' into data-prepper-documentdb
vagimeli May 24, 2024
f59c570
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
8c9b630
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
dd1130d
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
960f6ce
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
b8372f6
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
d19bfb7
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
f7e120e
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
ed6daae
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
7d76a74
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
bc33f53
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
134c4af
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
bf743ab
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
cbb33f1
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
666752c
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
79dc780
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
797ad86
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
c8da4bb
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
b78c736
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
78d551c
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
176dfa7
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
a350940
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
a66cc1c
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
9e7e8fb
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
07c6f25
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
df48ebc
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
6abf7bd
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
4e3e546
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
69102ee
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
dd10324
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
cd27480
Update _data-prepper/pipelines/configuration/sources/documentdb.md
vagimeli May 24, 2024
94 changes: 94 additions & 0 deletions _data-prepper/pipelines/configuration/sources/documentdb.md
@@ -0,0 +1,94 @@
---
layout: default
title: documentdb
parent: Sources
grand_parent: Pipelines
nav_order: 2
---

# documentdb

Check failure on line 9 (GitHub Actions / style-job, vale via reviewdog): [OpenSearch.HeadingCapitalization] 'documentdb' is a heading and should be in sentence case.

Check failure on line 9 (GitHub Actions / style-job, vale via reviewdog): [OpenSearch.Spelling] Error: documentdb. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

The `documentdb` source reads documents from [Amazon DocumentDB](https://aws.amazon.com/documentdb/) collections.
It can read historical data from an export and keep up-to-date on the data using DocumentDB [Change Streams](https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html).
Review comment (Collaborator): Note: Amazon DocumentDB always requires "Amazon" to precede it, and AWS does not capitalize "change streams". Also, rather than "keep up to date", would "stay current with the data" work?


The `documentdb` source will read data from DocumentDB and put that data into an [Amazon S3](https://aws.amazon.com/s3/) bucket.

Check failure on line 14 (GitHub Actions / style-job, vale via reviewdog): [OpenSearch.HeadingCapitalization] 'aws' is a heading and should be in sentence case.

Check failure on line 14 (GitHub Actions / style-job, vale via reviewdog): [OpenSearch.Spelling] Error: aws. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

From there, other Data Prepper workers will read from the S3 bucket to process data.

## Usage

```
version: "2"
documentdb-pipeline:
  source:
    documentdb:
      host: "docdb-mycluster.cluster-random.us-west-2.docdb.amazonaws.com"
      port: 27017
      authentication:
        username: ${{aws_secrets:secret:username}}
        password: ${{aws_secrets:secret:password}}
      aws:
        sts_role_arn: "arn:aws:iam::123456789012:role/MyRole"
      s3_bucket: my-bucket
      s3_region: us-west-2
      collections:
        - collection: my-collection
          export: true
          stream: true
      acknowledgments: true
```

## Configuration

You can use the following options to configure the `documentdb` source.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`host` | Yes | String | The hostname of the DocumentDB cluster.
`port` | No | Integer | The port number of the DocumentDB cluster. Defaults to `27017`.
`trust_store_file_path` | No | String | The path to a trust store file that contains the public certificate for the DocumentDB cluster.

Check failure on line 48 (GitHub Actions / style-job, vale via reviewdog): [OpenSearch.SubstitutionsError] Use 'truststore' instead of 'trust store'.
`trust_store_password` | No | String | The password for the trust store specified by `trust_store_file_path`.

Check failure on line 49 (GitHub Actions / style-job, vale via reviewdog): [OpenSearch.SubstitutionsError] Use 'truststore' instead of 'trust store'.
`authentication` | Yes | authentication | The authentication configuration. See [authentication](#authentication) for more information.
`collections` | Yes | List | A list of [collection](#collection) configurations. Exactly one collection is required.
`s3_bucket` | Yes | String | The Amazon S3 bucket to use for processing events from DocumentDB.
`s3_prefix` | No | String | An optional key prefix in Amazon S3. By default, there is no key prefix.
`s3_region` | No | String | The AWS region where the S3 bucket resides.
`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.

Check failure on line 55 (GitHub Actions / style-job, vale via reviewdog): [OpenSearch.Spelling] Error: aws. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.
`id_key` | No | String | When specified, the `_id` field from DocumentDB will be set to the key name specified by `id_key`. You can use this when you need more information beyond an ObjectId string saved to your sink. By default, the `_id` field is not included in the event.
`direct_connection` | No | Boolean | When `true`, the MongoDB driver will connect directly to the specified DocumentDB server or servers without discovering and connecting to the entire replica set. Defaults to `true`.
`read_preference` | No | String | Determines how to read from DocumentDB. See [Read Preference Modes](https://www.mongodb.com/docs/v3.6/reference/read-preference/#read-preference-modes) for details. Defaults to `primaryPreferred`.
Review comment (Collaborator): "Determines the method used to read from Amazon DocumentDB"?

`disable_s3_read_for_leader` | No | Boolean | When `true`, the current leader node will not read from S3. It will only do the work of reading the stream. Defaults to `false`.
`partition_acknowledgment_timeout` | No | Duration | Configures how long a node will hold onto a partition. Defaults to `2h`.
`acknowledgments` | No | Boolean | When `true`, enables [end-to-end acknowledgments]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#end-to-end-acknowledgments) on this source after events are sent to the sinks.
`insecure` | No | Boolean | Disables TLS. Defaults to `false`. Do not use this value in production.
`ssl_insecure_disable_verification` | No | Boolean | Disables TLS hostname verification. Defaults to `false`. Do not use this value in production. Use the `trust_store_file_path` to verify the hostname.

Review comment (Contributor), suggested change:
`ssl_insecure_disable_verification` | No | Boolean | Disables TLS hostname verification. Defaults to `false`. Use the `trust_store_file_path` to verify the hostname. Do not use this value in production.

Reply (Member Author): @vagimeli, I took a different approach here. I found the clearest way to communicate this was to leave the order as it is. I made some changes in my latest revision which include the word "instead" to try to clarify that one configuration is better than the other.
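
As a rough sketch of how the optional TLS settings in the table above might fit together, the following fragment points the source at a truststore instead of disabling verification. The truststore path and password shown here are hypothetical placeholders, not values from the original documentation, and the fragment omits the other required options (`authentication`, `collections`, `s3_bucket`, and `aws`) for brevity.

```yaml
documentdb-pipeline:
  source:
    documentdb:
      host: "docdb-mycluster.cluster-random.us-west-2.docdb.amazonaws.com"
      # Hypothetical truststore location and password; replace with your own.
      trust_store_file_path: "/path/to/truststore.jks"
      trust_store_password: "my-truststore-password"
      # Both options are left at their documented defaults so that TLS and
      # hostname verification stay enabled.
      insecure: false
      ssl_insecure_disable_verification: false
```

Leaving `insecure` and `ssl_insecure_disable_verification` at `false` matches the guidance in the table that neither should be enabled in production.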


### authentication

Check failure on line 65 (GitHub Actions / style-job, vale via reviewdog): [OpenSearch.HeadingCapitalization] 'authentication' is a heading and should be in sentence case.

Review comment (Collaborator): I'm assuming that this and the following headings are intentionally lowercase.

Reply (Contributor): Yes, this is the configuration setting's proper name. I'll set it in code format for clarity.

The following parameters allow you to configure authentication to the DocumentDB cluster.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`username` | Yes | String | The username to use when authenticating with the DocumentDB cluster. Supports automatic refresh.
`password` | Yes | String | The password to use when authenticating with the DocumentDB cluster. Supports automatic refresh.

### collection

Check failure on line 75 (GitHub Actions / style-job, vale via reviewdog): [OpenSearch.HeadingCapitalization] 'collection' is a heading and should be in sentence case.

The following parameters allow you to configure the collection to read from the DocumentDB cluster.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`collection` | Yes | String | The name of the collection.
`export` | No | Boolean | Whether to include an export or full load. Defaults to `true`.
`stream` | No | Boolean | Whether to enable a stream. Defaults to `true`.
`partition_count` | No | Integer | Defines the number of partitions to create in Amazon S3. Defaults to `100`.
`export_batch_size` | No | Integer | Defaults to `10000`.
`stream_batch_size` | No | Integer | Defaults to `1000`.
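
To illustrate how the collection options above might be combined, here is a minimal sketch of a `collections` entry that skips the export and reads only from the stream. The collection name is reused from the usage example, and the numeric values simply restate the documented defaults; treat the combination as an assumption about a plausible setup rather than a recommended configuration.

```yaml
collections:
  - collection: my-collection
    # Skip the export (full load) and read only from the stream.
    export: false
    stream: true
    # The values below restate the documented defaults explicitly.
    partition_count: 100
    export_batch_size: 10000
    stream_batch_size: 1000
```

Because the `collections` option requires exactly one collection, the list contains a single entry.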

### aws

Option | Required | Type | Description
:--- | :--- | :--- | :---
`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
`aws_sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the source plugin.
`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the STS role. For more information, see the `ExternalID` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference.
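
The following fragment sketches how the `aws` options above could appear together under the `documentdb` source. The role ARN is taken from the usage example; the external ID and the header name and value are hypothetical placeholders chosen only for illustration.

```yaml
aws:
  sts_role_arn: "arn:aws:iam::123456789012:role/MyRole"
  # Hypothetical external ID; use the value agreed upon with the role owner.
  sts_external_id: "my-external-id"
  # Hypothetical header override; the key and value are placeholders.
  aws_sts_header_overrides:
    x-example-header: "example-value"
```

If `sts_role_arn` is omitted, the table above indicates that the standard SDK credential behavior applies.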