From 1c94268fe05fa1cd3810a2a6128a2d27827275dd Mon Sep 17 00:00:00 2001 From: Taylor Gray Date: Wed, 15 May 2024 09:41:05 -0500 Subject: [PATCH 01/44] Add new s3 sink documentation for Data Prepper 2.8 Signed-off-by: Taylor Gray --- .../pipelines/configuration/sinks/s3.md | 57 +++++++++++++++---- 1 file changed, 46 insertions(+), 11 deletions(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 71cb7b1f70..00d8e6adfe 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -70,20 +70,48 @@ In order to use the `s3` sink, configure AWS Identity and Access Management (IAM } ``` +## Cross-account S3 access + +When Data Prepper fetches data from an S3 bucket, it verifies the ownership of the bucket using the +[bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html). +By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured, and a bucket is not included in one of these mapped configurations, the `default_bucket_owner` will default to the account id from the `aws.sts_role_arn`. + +If you plan to ingest data from multiple S3 buckets but each bucket is associated with a different S3 account, you need to configure Data Prepper to check for cross-account S3 access, according to the following conditions: + +- If all S3 buckets you want data from belong to the same account, set `default_bucket_owner` to the account ID of the bucket account holder. +- If your S3 buckets are in multiple accounts, use a `bucket_owners` map. + +The following example shows a `my-bucket-01` that is owned by `123456789012` and `my-bucket-02` that is owned by `999999999999`, the `bucket_owners` map calls both bucket owners with their account IDs, as shown in the following configuration: + +``` +sink: + - s3: + default_bucket_owner: 111111111111 + bucket_owners: + my-bucket-01: 123456789012 + my-bucket-02: 999999999999 +``` + +You can use both `bucket_owners` and `default_bucket_owner` together. + ## Configuration Use the following options when customizing the `s3` sink. -Option | Required | Type | Description -:--- | :--- | :--- | :--- -`bucket` | Yes | String | The name of the S3 bucket to which objects are stored. The `name` must match the name of your object store. -`codec` | Yes | [Codec](#codec) | The codec determining the format of output data. -`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information. -`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3. -`object_key` | No | Sets the `path_prefix` and the `file_pattern` of the object store. The file pattern is always `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the root directory of the bucket. The `path_prefix` is configurable. -`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`. -`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type. -`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`. +Option | Required | Type | Description +:--- |:---------|:------------------------------------------------| :--- +`bucket` | Yes | String | The name of the S3 bucket to which the sink writes. 
Supports sending to dynamic buckets using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, `test-${/bucket_id}`. If a dynamic bucket cannot be accessed, it will be sent to the `default_bucket` if one is configured. Otherwise, the object data will be dropped. +`default_bucket` | No | String | The static name of the bucket to send to when a dynamic bucket in `bucket` is not able to be accessed. +`bucket_owners` | No | Map | A map of bucket names that includes the IDs of the accounts that own the buckets. For more information, see [Cross-account S3 access](#s3_bucket_ownership). +`default_bucket_owner` | No | String | The AWS account ID for the owner of an S3 bucket. For more information, see [Cross-account S3 access](#s3_bucket_ownership). +`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object. +`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information. +`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3. +`aggregate_threshold` | No | [Aggregate Threshold](#threshold-configuration) | Configures when and how to start flushing objects when using dynamic path_prefix to create many groups in memory. +`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` and the `file_pattern` of the object store. The file pattern is always `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the root directory of the bucket. The `path_prefix` is configurable. +`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`. +`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type. +`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`. ## aws @@ -106,6 +134,13 @@ Option | Required | Type | Description `maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`. `event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`. +## Aggregate threshold configuration + +Option | Required | Type | Description +:--- |:-----------------------------------|:-------| :--- +`flush_capacity_ratio` | No | Float | The percentage of groups to be force flushed when the aggregate_threshold maximum_size is reached. Default is 0.5 +`maximum_size` | Yes | String | The maximum number of bytes to accumulate before force flushing objects. For example, `128mb`. + ## Buffer type @@ -119,7 +154,7 @@ Option | Required | Type | Description Option | Required | Type | Description :--- | :--- | :--- | :--- -`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket. +`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). 
For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the my_partition_key value. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket. ## codec From 34f3d3434db1d4cdb9b62b1370db7b8a62a35109 Mon Sep 17 00:00:00 2001 From: Taylor Gray Date: Tue, 21 May 2024 11:14:42 -0500 Subject: [PATCH 02/44] Apply suggestions from code review Co-authored-by: Melissa Vagi Signed-off-by: Taylor Gray --- .../pipelines/configuration/sinks/s3.md | 36 +++++++++---------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 00d8e6adfe..920820c468 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -74,14 +74,14 @@ In order to use the `s3` sink, configure AWS Identity and Access Management (IAM When Data Prepper fetches data from an S3 bucket, it verifies the ownership of the bucket using the [bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html). -By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured, and a bucket is not included in one of these mapped configurations, the `default_bucket_owner` will default to the account id from the `aws.sts_role_arn`. +By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of these mapped configurations, the `default_bucket_owner` defaults to the account id from the `aws.sts_role_arn`. -If you plan to ingest data from multiple S3 buckets but each bucket is associated with a different S3 account, you need to configure Data Prepper to check for cross-account S3 access, according to the following conditions: +When ingesting data from multiple S3 buckets with different account associations, configure Data Prepper for cross-account S3 access based on the following conditions: -- If all S3 buckets you want data from belong to the same account, set `default_bucket_owner` to the account ID of the bucket account holder. -- If your S3 buckets are in multiple accounts, use a `bucket_owners` map. +- For S3 buckets under the same account, set `default_bucket_owner` to that account's ID. +- For S3 buckets across multiple accounts, use a `bucket_owners` map. -The following example shows a `my-bucket-01` that is owned by `123456789012` and `my-bucket-02` that is owned by `999999999999`, the `bucket_owners` map calls both bucket owners with their account IDs, as shown in the following configuration: +The `bucket_owners` map specifies account IDs for buckets across accounts, for example, `my-bucket-01` owned by `123456789012` and `my-bucket-02` owned by `999999999999`, as shown in the following configuration: ``` sink: @@ -92,7 +92,7 @@ sink: my-bucket-02: 999999999999 ``` -You can use both `bucket_owners` and `default_bucket_owner` together. +Both `bucket_owners` and `default_bucket_owner` can be used together. ## Configuration @@ -100,18 +100,18 @@ Use the following options when customizing the `s3` sink. Option | Required | Type | Description :--- |:---------|:------------------------------------------------| :--- -`bucket` | Yes | String | The name of the S3 bucket to which the sink writes. 
Supports sending to dynamic buckets using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, `test-${/bucket_id}`. If a dynamic bucket cannot be accessed, it will be sent to the `default_bucket` if one is configured. Otherwise, the object data will be dropped. -`default_bucket` | No | String | The static name of the bucket to send to when a dynamic bucket in `bucket` is not able to be accessed. -`bucket_owners` | No | Map | A map of bucket names that includes the IDs of the accounts that own the buckets. For more information, see [Cross-account S3 access](#s3_bucket_ownership). -`default_bucket_owner` | No | String | The AWS account ID for the owner of an S3 bucket. For more information, see [Cross-account S3 access](#s3_bucket_ownership). -`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object. -`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information. -`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3. -`aggregate_threshold` | No | [Aggregate Threshold](#threshold-configuration) | Configures when and how to start flushing objects when using dynamic path_prefix to create many groups in memory. -`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` and the `file_pattern` of the object store. The file pattern is always `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the root directory of the bucket. The `path_prefix` is configurable. -`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`. -`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type. -`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`. +`bucket` | Yes | String | Specifies the S3 bucket name for the sink. Supports using dynamic bucket naming using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket is inaccessible and no `default_bucket` is configured, then the object data is dropped. +`default_bucket` | No | String | Static bucket for inaccessible dynamic buckets in `bucket`. +`bucket_owners` | No | Map | Map of bucket names and their account owner IDs for cross-account access. See [Cross-account S3 access](#s3_bucket_ownership). +`default_bucket_owner` | No | String | AWS account ID for an S3 bucket owner. See [Cross-account S3 access](#s3_bucket_ownership). +`codec` | Yes | [Codec](#codec) | Serializes data in S3 objects. +`aws` | Yes | AWS | AWS configuration. See [aws](#aws). +`threshold` | Yes | [Threshold](#threshold-configuration) | Condition for writing objects to S3. +`aggregate_threshold` | No | [Aggregate threshold](#threshold-configuration) | Condition for flushing objects with dynamic `path_prefix`. +`object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the root directory of the bucket. The `path_prefix` is configurable. +`compression` | No | String | Compression algorithm (`none`, `gzip`, `snappy`). Default is `none`. +`buffer_type` | No | [Buffer type](#buffer-type) | Buffer type configuration. 
+`max_retries` | No | Integer | Maximum retries for S3 ingestion requests. Default is `5`. ## aws From f8ef1181a520b2bc9adf96c3d6a08ccba3854d09 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Fri, 24 May 2024 10:29:50 -0600 Subject: [PATCH 03/44] Update s3.md Clean up formatting. Signed-off-by: Melissa Vagi Signed-off-by: Melissa Vagi --- .../pipelines/configuration/sinks/s3.md | 58 +++++-------------- 1 file changed, 15 insertions(+), 43 deletions(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 920820c468..585fd97677 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -16,18 +16,17 @@ The `s3` sink uses the following format when batching events: ${pathPrefix}events-%{yyyy-MM-dd'T'HH-mm-ss'Z'}-${currentTimeInNanos}-${uniquenessId}.${codecSuppliedExtension} ``` -When a batch of objects is written to S3, the objects are formatted similarly to the following: +When a batch of objects is written to Amazon S3, the objects are formatted similarly to the following: ``` my-logs/2023/06/09/06/events-2023-06-09T06-00-01-1686290401871214927-ae15b8fa-512a-59c2-b917-295a0eff97c8.json ``` - For more information about how to configure an object, see the [Object key](#object-key-configuration) section. ## Usage -The following example creates a pipeline configured with an s3 sink. It contains additional options for customizing the event and size thresholds for which the pipeline sends record events and sets the codec type `ndjson`: +The following example creates a pipeline configured with an `s3` sink. It contains additional options for customizing the event and size thresholds for which the pipeline sends record events and sets the codec type `ndjson`: ``` pipeline: @@ -72,9 +71,7 @@ In order to use the `s3` sink, configure AWS Identity and Access Management (IAM ## Cross-account S3 access -When Data Prepper fetches data from an S3 bucket, it verifies the ownership of the bucket using the -[bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html). -By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of these mapped configurations, the `default_bucket_owner` defaults to the account id from the `aws.sts_role_arn`. +When Data Prepper fetches data from an S3 bucket, it verifies the ownership of the bucket using the [bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html). By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of these mapped configurations, the `default_bucket_owner` defaults to the account id from the `aws.sts_role_arn`. When ingesting data from multiple S3 buckets with different account associations, configure Data Prepper for cross-account S3 access based on the following conditions: @@ -113,7 +110,7 @@ Option | Required | Type | Descriptio `buffer_type` | No | [Buffer type](#buffer-type) | Buffer type configuration. `max_retries` | No | Integer | Maximum retries for S3 ingestion requests. Default is `5`. -## aws +## `aws` Option | Required | Type | Description :--- | :--- | :--- | :--- @@ -122,8 +119,6 @@ Option | Required | Type | Description `sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin. 
`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the role. For more information, see the `ExternalId` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference. - - ## Threshold configuration Use the following options to set ingestion thresholds for the `s3` sink. When any of these conditions are met, Data Prepper will write events to an S3 object. @@ -141,7 +136,6 @@ Option | Required | Type | Description `flush_capacity_ratio` | No | Float | The percentage of groups to be force flushed when the aggregate_threshold maximum_size is reached. Default is 0.5 `maximum_size` | Yes | String | The maximum number of bytes to accumulate before force flushing objects. For example, `128mb`. - ## Buffer type `buffer_type` is an optional configuration that determines how Data Prepper temporarily stores data before writing an object to S3. The default value is `in_memory`. Use one of the following options: @@ -156,70 +150,48 @@ Option | Required | Type | Description :--- | :--- | :--- | :--- `path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the my_partition_key value. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket. - -## codec +## `codec` The `codec` determines how the `s3` source formats data written to each S3 object. -### avro codec +### `avro` codec The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document. -Because Avro requires a schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema. -In general, you should define your own schema because it will most accurately reflect your needs. - -We recommend that you make your Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions). -Without the null union, each field must be present or the data will fail to write to the sink. -If you can be certain that each each event has a given field, you can make it non-nullable. +Because Avro requires a schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema. In general, you should define your own schema because it will most accurately reflect your needs. -When you provide your own Avro schema, that schema defines the final structure of your data. -Therefore, any extra values inside any incoming events that are not mapped in the Arvo schema will not be included in the final destination. -To avoid confusion between a custom Arvo schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of the `include_keys` or `exclude_keys` with a custom schema. +We recommend that you make your Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions). Without the null union, each field must be present or the data will fail to write to the sink. If you can be certain that each each event has a given field, you can make it non-nullable. -In cases where your data is uniform, you may be able to automatically generate a schema. -Automatically generated schemas are based on the first event received by the codec. 
-The schema will only contain keys from this event. -Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema. -Automatically generated schemas make all fields nullable. -Use the sink's `include_keys` and `exclude_keys` configurations to control what data is included in the auto-generated schema. +When you provide your own Avro schema, that schema defines the final structure of your data. Therefore, any extra values inside any incoming events that are not mapped in the Arvo schema will not be included in the final destination. To avoid confusion between a custom Arvo schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of the `include_keys` or `exclude_keys` with a custom schema. +In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event received by the codec. +The schema will only contain keys from this event. Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema. Automatically generated schemas make all fields nullable. Use the sink's `include_keys` and `exclude_keys` configurations to control what data is included in the auto-generated schema. Option | Required | Type | Description :--- | :--- | :--- | :--- `schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to true. `auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event. - ### ndjson codec -The `ndjson` codec writes each line as a JSON object. - -The `ndjson` codec does not take any configurations. - +The `ndjson` codec writes each line as a JSON object. The `ndjson` codec does not take any configurations. ### json codec -The `json` codec writes events in a single large JSON file. -Each event is written into an object within a JSON array. - +The `json` codec writes events in a single large JSON file. Each event is written into an object within a JSON array. Option | Required | Type | Description :--- | :--- | :--- | :--- `key_name` | No | String | The name of the key for the JSON array. By default this is `events`. - ### parquet codec -The `parquet` codec writes events into a Parquet file. -When using the Parquet codec, set the `buffer_type` to `in_memory`. +The `parquet` codec writes events into a Parquet file. When using the Parquet codec, set the `buffer_type` to `in_memory`. -The Parquet codec writes data using the Avro schema. -Because Parquet requires an Avro schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema. -However, we generally recommend that you define your own schema so that it can best meet your needs. +The Parquet codec writes data using the schema. Because Parquet requires an Avro schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema. However, we generally recommend that you define your own schema so that it can best meet your needs. For details on the Avro schema and recommendations, see the [Avro codec](#avro-codec) documentation. 
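The same null-union guidance applies to the `parquet` codec below, since it also relies on an Avro schema. As a minimal sketch of that recommendation, the following shows only the `codec` portion of an `s3` sink; the `Event` record and the `message` and `status_code` field names are placeholders for illustration, not values taken from this documentation:

```
codec:
  avro:
    # Hypothetical schema for illustration only; field names are placeholders.
    # Each field is a null union with a null default, so an event may omit it
    # and still be written to the sink.
    schema: >
      {
        "type": "record",
        "name": "Event",
        "fields": [
          {"name": "message", "type": ["null", "string"], "default": null},
          {"name": "status_code", "type": ["null", "int"], "default": null}
        ]
      }
```

Swapping `avro` for `parquet` in this fragment leaves the schema portion unchanged.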
- Option | Required | Type | Description :--- | :--- | :--- | :--- `schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to true. From e608d5e212f4c16aee344f7aba680c305b0618c9 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 15:22:51 -0600 Subject: [PATCH 04/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 585fd97677..8b2838cb35 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -71,7 +71,9 @@ In order to use the `s3` sink, configure AWS Identity and Access Management (IAM ## Cross-account S3 access -When Data Prepper fetches data from an S3 bucket, it verifies the ownership of the bucket using the [bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html). By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of these mapped configurations, the `default_bucket_owner` defaults to the account id from the `aws.sts_role_arn`. +When Data Prepper fetches data from an S3 bucket, it verifies bucket ownership using [bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html). + +By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of these mapped configurations, `default_bucket_owner` defaults to the account id from `aws.sts_role_arn`. When ingesting data from multiple S3 buckets with different account associations, configure Data Prepper for cross-account S3 access based on the following conditions: From 6b472fa39d35e7625e6d456cb64365314f49c234 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 15:24:00 -0600 Subject: [PATCH 05/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 8b2838cb35..0507cb5f1b 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -91,7 +91,7 @@ sink: my-bucket-02: 999999999999 ``` -Both `bucket_owners` and `default_bucket_owner` can be used together. +`bucket_owners` and `default_bucket_owner` can be used together. 
## Configuration From bd2fbba486037215c022c58ac40217c7e050fd9f Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 15:27:11 -0600 Subject: [PATCH 06/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 0507cb5f1b..02c82665f1 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -135,7 +135,7 @@ Option | Required | Type | Description Option | Required | Type | Description :--- |:-----------------------------------|:-------| :--- -`flush_capacity_ratio` | No | Float | The percentage of groups to be force flushed when the aggregate_threshold maximum_size is reached. Default is 0.5 +`flush_capacity_ratio` | No | Float | The percentage of groups to be force-flushed when `aggregate_threshold maximum_size` is reached. Default is `0.5`. `maximum_size` | Yes | String | The maximum number of bytes to accumulate before force flushing objects. For example, `128mb`. ## Buffer type From 426baedb7729c7570f054d0681ea86cc462c89f4 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 15:28:26 -0600 Subject: [PATCH 07/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 02c82665f1..8a8465b0f5 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -136,7 +136,7 @@ Option | Required | Type | Description Option | Required | Type | Description :--- |:-----------------------------------|:-------| :--- `flush_capacity_ratio` | No | Float | The percentage of groups to be force-flushed when `aggregate_threshold maximum_size` is reached. Default is `0.5`. -`maximum_size` | Yes | String | The maximum number of bytes to accumulate before force flushing objects. For example, `128mb`. +`maximum_size` | Yes | String | The maximum number of bytes to accumulate before force-flushing objects. For example, `128mb`. ## Buffer type From 64f2193178ad2b67e13a35714ddfe3982d38d845 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 15:29:49 -0600 Subject: [PATCH 08/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 8a8465b0f5..38ca44935f 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -150,7 +150,7 @@ Option | Required | Type | Description Option | Required | Type | Description :--- | :--- | :--- | :--- -`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the my_partition_key value. 
The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket. +`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the `my_partition_key` value. The prefix path should end with `/`. By default, Data Prepper writes objects to the S3 bucket root. ## `codec` From b19ec96ac08460436a6993a6201ed5f5e0cc4824 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 15:32:53 -0600 Subject: [PATCH 09/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 38ca44935f..9ba93588a6 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -160,7 +160,7 @@ The `codec` determines how the `s3` source formats data written to each S3 objec The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document. -Because Avro requires a schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema. In general, you should define your own schema because it will most accurately reflect your needs. +Because Avro requires a schema, you may either define the schema yourself or have Data Prepper automatically generate it. It is recommended that you define your schema to accurately reflect your needs. We recommend that you make your Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions). Without the null union, each field must be present or the data will fail to write to the sink. If you can be certain that each each event has a given field, you can make it non-nullable. From 339cb2c6532d028db592f57b0d558cc998d4b718 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 15:36:39 -0600 Subject: [PATCH 10/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 9ba93588a6..8cb7f4bb78 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -162,7 +162,7 @@ The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) d Because Avro requires a schema, you may either define the schema yourself or have Data Prepper automatically generate it. It is recommended that you define your schema to accurately reflect your needs. -We recommend that you make your Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions). Without the null union, each field must be present or the data will fail to write to the sink. If you can be certain that each each event has a given field, you can make it non-nullable. 
+It is recommended making Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions) to allow missing values; otherwise, ensure all required fields are present for each event. Use non-nullable fields only when certain of their presence. When you provide your own Avro schema, that schema defines the final structure of your data. Therefore, any extra values inside any incoming events that are not mapped in the Arvo schema will not be included in the final destination. To avoid confusion between a custom Arvo schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of the `include_keys` or `exclude_keys` with a custom schema. From bb8ae3f7d752f9b7d837e9f14e1ee78389a01768 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 15:37:51 -0600 Subject: [PATCH 11/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 8cb7f4bb78..8c55e37df6 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -166,7 +166,7 @@ It is recommended making Avro fields use a null [union](https://avro.apache.org/ When you provide your own Avro schema, that schema defines the final structure of your data. Therefore, any extra values inside any incoming events that are not mapped in the Arvo schema will not be included in the final destination. To avoid confusion between a custom Arvo schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of the `include_keys` or `exclude_keys` with a custom schema. -In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event received by the codec. +In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event. Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema. Automatically generated schemas make all fields nullable. Use the sink's `include_keys` and `exclude_keys` configurations to control what data is included in the auto-generated schema. Option | Required | Type | Description From 7330f631eec38b40761632a926e14c33132f94ce Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 15:40:00 -0600 Subject: [PATCH 12/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 8c55e37df6..38a1d8eb50 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -188,7 +188,7 @@ Option | Required | Type | Description ### parquet codec -The `parquet` codec writes events into a Parquet file. When using the Parquet codec, set the `buffer_type` to `in_memory`. +The `parquet` codec writes events into a Parquet file. When using the codec, set `buffer_type` to `in_memory`. 
The Parquet codec writes data using the schema. Because Parquet requires an Avro schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema. However, we generally recommend that you define your own schema so that it can best meet your needs. From 6ff8db9ebdf904265c938851b34f98c9154048a1 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 15:41:54 -0600 Subject: [PATCH 13/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 38a1d8eb50..a6a52c5730 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -190,7 +190,7 @@ Option | Required | Type | Description The `parquet` codec writes events into a Parquet file. When using the codec, set `buffer_type` to `in_memory`. -The Parquet codec writes data using the schema. Because Parquet requires an Avro schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema. However, we generally recommend that you define your own schema so that it can best meet your needs. +The `parquet` codec writes data using the schema. Because Parquet requires an Avro schema, you may either define the schema yourself or have Data Prepper automatically generate it. It is recommended that you define your schema to accurately reflect your needs. For details on the Avro schema and recommendations, see the [Avro codec](#avro-codec) documentation. From 52d63f1ed118f9f56ce2700c671e49c69cb9b4aa Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Thu, 6 Jun 2024 16:16:39 -0600 Subject: [PATCH 14/44] Update s3.md Signed-off-by: Melissa Vagi Signed-off-by: Melissa Vagi --- .../pipelines/configuration/sinks/s3.md | 74 +++++++++++-------- 1 file changed, 43 insertions(+), 31 deletions(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index a6a52c5730..d158a14afd 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -15,18 +15,20 @@ The `s3` sink uses the following format when batching events: ``` ${pathPrefix}events-%{yyyy-MM-dd'T'HH-mm-ss'Z'}-${currentTimeInNanos}-${uniquenessId}.${codecSuppliedExtension} ``` +{% include copy-curl.html %} When a batch of objects is written to Amazon S3, the objects are formatted similarly to the following: ``` my-logs/2023/06/09/06/events-2023-06-09T06-00-01-1686290401871214927-ae15b8fa-512a-59c2-b917-295a0eff97c8.json ``` +{% include copy-curl.html %} -For more information about how to configure an object, see the [Object key](#object-key-configuration) section. +For more information about how to configure an object, refer to [Object key](#object-key-configuration). ## Usage -The following example creates a pipeline configured with an `s3` sink. It contains additional options for customizing the event and size thresholds for which the pipeline sends record events and sets the codec type `ndjson`: +The following example creates a pipeline configured with an `s3` sink. 
It contains additional options for customizing the event and size thresholds for which the pipeline sends record events and sets the codec type as `ndjson`: ``` pipeline: @@ -48,10 +50,11 @@ pipeline: ndjson: buffer_type: in_memory ``` +{% include copy-curl.html %} ## IAM permissions -In order to use the `s3` sink, configure AWS Identity and Access Management (IAM) to grant Data Prepper permissions to write to Amazon S3. You can use a configuration similar to the following JSON configuration: +To use the `s3` sink, configure AWS Identity and Access Management (IAM) to grant Data Prepper permissions to write to Amazon S3. You can use a configuration similar to the following JSON configuration: ```json { @@ -68,19 +71,20 @@ In order to use the `s3` sink, configure AWS Identity and Access Management (IAM ] } ``` +{% include copy-curl.html %} ## Cross-account S3 access When Data Prepper fetches data from an S3 bucket, it verifies bucket ownership using [bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html). -By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of these mapped configurations, `default_bucket_owner` defaults to the account id from `aws.sts_role_arn`. +By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of the mapped configurations, `default_bucket_owner` defaults to the account ID in `aws.sts_role_arn`. `bucket_owners` and `default_bucket_owner` can be used together. When ingesting data from multiple S3 buckets with different account associations, configure Data Prepper for cross-account S3 access based on the following conditions: - For S3 buckets under the same account, set `default_bucket_owner` to that account's ID. - For S3 buckets across multiple accounts, use a `bucket_owners` map. -The `bucket_owners` map specifies account IDs for buckets across accounts, for example, `my-bucket-01` owned by `123456789012` and `my-bucket-02` owned by `999999999999`, as shown in the following configuration: +A `bucket_owners` map specifies account IDs for buckets across accounts, for example as shown in the following configuration, with `my-bucket-01` owned by `123456789012` and `my-bucket-02` owned by `999999999999`: ``` sink: @@ -90,8 +94,7 @@ sink: my-bucket-01: 123456789012 my-bucket-02: 999999999999 ``` - -`bucket_owners` and `default_bucket_owner` can be used together. +{% include copy-curl.html %} ## Configuration @@ -99,16 +102,16 @@ Use the following options when customizing the `s3` sink. Option | Required | Type | Description :--- |:---------|:------------------------------------------------| :--- -`bucket` | Yes | String | Specifies the S3 bucket name for the sink. Supports using dynamic bucket naming using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket is inaccessible and no `default_bucket` is configured, then the object data is dropped. +`bucket` | Yes | String | Specifies the sink's S3 bucket name. Supports using dynamic bucket naming using [Data Prepper expressions]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket is inaccessible and no `default_bucket` is configured, then the object data is dropped. 
`default_bucket` | No | String | Static bucket for inaccessible dynamic buckets in `bucket`. -`bucket_owners` | No | Map | Map of bucket names and their account owner IDs for cross-account access. See [Cross-account S3 access](#s3_bucket_ownership). -`default_bucket_owner` | No | String | AWS account ID for an S3 bucket owner. See [Cross-account S3 access](#s3_bucket_ownership). +`bucket_owners` | No | Map | Map of bucket names and their account owner IDs for cross-account access. Refer to [Cross-account S3 access](#s3_bucket_ownership). +`default_bucket_owner` | No | String | AWS account ID for an S3 bucket owner. Refer to [Cross-account S3 access](#s3_bucket_ownership). `codec` | Yes | [Codec](#codec) | Serializes data in S3 objects. -`aws` | Yes | AWS | AWS configuration. See [aws](#aws). +`aws` | Yes | AWS | AWS configuration. Refer to [aws](#aws). `threshold` | Yes | [Threshold](#threshold-configuration) | Condition for writing objects to S3. `aggregate_threshold` | No | [Aggregate threshold](#threshold-configuration) | Condition for flushing objects with dynamic `path_prefix`. -`object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the root directory of the bucket. The `path_prefix` is configurable. -`compression` | No | String | Compression algorithm (`none`, `gzip`, `snappy`). Default is `none`. +`object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the bucket's root directory. `path_prefix` is configurable. +`compression` | No | String | Compression algorithm: `none`, `gzip`, and `snappy`. Default is `none`. `buffer_type` | No | [Buffer type](#buffer-type) | Buffer type configuration. `max_retries` | No | Integer | Maximum retries for S3 ingestion requests. Default is `5`. @@ -117,22 +120,24 @@ Option | Required | Type | Descriptio Option | Required | Type | Description :--- | :--- | :--- | :--- `region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). -`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). +`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon Simple Queue Service (Amazon SQS) and Amazon S3. Defaults to `null`, which uses [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). `sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin. -`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the role. For more information, see the `ExternalId` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference. +`sts_external_id` | No | String | An AWS STS external ID used when Data Prepper assumes the role. 
For more information, refer to the `ExternalId` section under [AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) in the AWS STS API reference. ## Threshold configuration -Use the following options to set ingestion thresholds for the `s3` sink. When any of these conditions are met, Data Prepper will write events to an S3 object. +Use the following options to set ingestion thresholds for the `s3` sink. Data Prepper writes events to an S3 object when any of these conditions occur. Option | Required | Type | Description :--- | :--- | :--- | :--- `event_count` | Yes | Integer | The number of Data Prepper events to accumulate before writing an object to S3. `maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`. -`event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`. +`event_collect_timeout` | Yes | String | The maximum amount of timeout before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`. ## Aggregate threshold configuration +Use the following options to set rules or limits that trigger certain actions or behavior when an aggregated value crosses a defined threshold. + Option | Required | Type | Description :--- |:-----------------------------------|:-------| :--- `flush_capacity_ratio` | No | Float | The percentage of groups to be force-flushed when `aggregate_threshold maximum_size` is reached. Default is `0.5`. @@ -140,14 +145,18 @@ Option | Required | Type | Description ## Buffer type -`buffer_type` is an optional configuration that determines how Data Prepper temporarily stores data before writing an object to S3. The default value is `in_memory`. Use one of the following options: +`buffer_type` is an optional configuration that determines how Data Prepper temporarily stores data before writing an object to S3. The default value is `in_memory`. + +Use one of the following options: - `in_memory`: Stores the record in memory. -- `local_file`: Flushes the record into a file on your local machine. This uses your machine's temporary directory. +- `local_file`: Flushes the record into a file on your local machine. This option uses your machine's temporary directory. - `multipart`: Writes using the [S3 multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html). Every 10 MB is written as a part. ## Object key configuration +Use the following options to define how object keys are constructed for objects stored in S3. + Option | Required | Type | Description :--- | :--- | :--- | :--- `path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the `my_partition_key` value. The prefix path should end with `/`. By default, Data Prepper writes objects to the S3 bucket root. @@ -158,41 +167,44 @@ The `codec` determines how the `s3` source formats data written to each S3 objec ### `avro` codec -The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document. 
+The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document. Because Avro requires a schema, you may either define the schema or have Data Prepper automatically generate it. It is recommended that you define the schema to accurately reflect your needs. -Because Avro requires a schema, you may either define the schema yourself or have Data Prepper automatically generate it. It is recommended that you define your schema to accurately reflect your needs. +When you provide your own Avro schema, that schema defines the final structure of your data. Any extra values inside any incoming events that are not mapped in the Arvo schema will not be included in the final destination. Data Prepper does not allow the use of the `include_keys` or `exclude_keys` with a custom schema. This is to avoid confusion between a custom Arvo schema and the `include_keys` or `exclude_keys` sink configurations. -It is recommended making Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions) to allow missing values; otherwise, ensure all required fields are present for each event. Use non-nullable fields only when certain of their presence. +In cases where your data is uniform, you may be able to automatically generate a schema. Auto-generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event, and you must have all keys present in all events in order for the auto-generated schema to produce a working schema. Auto-generated schemas make all fields nullable. Use the `include_keys` and `exclude_keys` sink configurations to control what data is included in the auto-generated schema. -When you provide your own Avro schema, that schema defines the final structure of your data. Therefore, any extra values inside any incoming events that are not mapped in the Arvo schema will not be included in the final destination. To avoid confusion between a custom Arvo schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of the `include_keys` or `exclude_keys` with a custom schema. +It is recommended making Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions) to allow missing values; otherwise, ensure all required fields are present for each event. Use non-nullable fields only when certain of their presence. -In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event that the codec receives. -The schema will only contain keys from this event. Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema. Automatically generated schemas make all fields nullable. Use the sink's `include_keys` and `exclude_keys` configurations to control what data is included in the auto-generated schema. +Use the following options to configure the codec. Option | Required | Type | Description :--- | :--- | :--- | :--- `schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to true. `auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event. -### ndjson codec +### `ndjson` codec The `ndjson` codec writes each line as a JSON object. 
The `ndjson` codec does not take any configurations. -### json codec +### `json` codec The `json` codec writes events in a single large JSON file. Each event is written into an object within a JSON array. +Use the following options to configure the codec. + Option | Required | Type | Description :--- | :--- | :--- | :--- `key_name` | No | String | The name of the key for the JSON array. By default this is `events`. -### parquet codec +### `parquet` codec The `parquet` codec writes events into a Parquet file. When using the codec, set `buffer_type` to `in_memory`. The `parquet` codec writes data using the schema. Because Parquet requires an Avro schema, you may either define the schema yourself or have Data Prepper automatically generate it. It is recommended that you define your schema to accurately reflect your needs. -For details on the Avro schema and recommendations, see the [Avro codec](#avro-codec) documentation. +For details on the Avro schema and recommendations, refer to [Avro codec](#avro-codec). + +Use the following options to configure the codec. Option | Required | Type | Description :--- | :--- | :--- | :--- @@ -201,7 +213,7 @@ Option | Required | Type | Description ### Setting a schema with Parquet -The following example shows you how to configure the `s3` sink to write Parquet data into a Parquet file using a schema for [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records): +The following example pipeline shows how to configure the `s3` sink to write Parquet data into a Parquet file using a schema for [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records): ``` pipeline: @@ -244,4 +256,4 @@ pipeline: event_collect_timeout: PT15M buffer_type: in_memory ``` - +{% include copy-curl.html %} From 4e30d2d05c578182f36dbc84c9811992d1562a23 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Fri, 21 Jun 2024 15:58:29 -0600 Subject: [PATCH 15/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index d158a14afd..a62128c3b6 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -167,7 +167,7 @@ The `codec` determines how the `s3` source formats data written to each S3 objec ### `avro` codec -The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document. Because Avro requires a schema, you may either define the schema or have Data Prepper automatically generate it. It is recommended that you define the schema to accurately reflect your needs. +The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document. Because Avro requires a schema, you may either define the schema or have Data Prepper automatically generate it. Defining your own schema is recommended, as this will allow it to be tailored to your particular use case. When you provide your own Avro schema, that schema defines the final structure of your data. Any extra values inside any incoming events that are not mapped in the Arvo schema will not be included in the final destination. Data Prepper does not allow the use of the `include_keys` or `exclude_keys` with a custom schema. 
This is to avoid confusion between a custom Arvo schema and the `include_keys` or `exclude_keys` sink configurations. From eb83359ebb31d37a80d991c875b70ccc372f40f0 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Fri, 21 Jun 2024 16:01:12 -0600 Subject: [PATCH 16/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index a62128c3b6..edccefc32d 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -173,7 +173,7 @@ When you provide your own Avro schema, that schema defines the final structure o In cases where your data is uniform, you may be able to automatically generate a schema. Auto-generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event, and you must have all keys present in all events in order for the auto-generated schema to produce a working schema. Auto-generated schemas make all fields nullable. Use the `include_keys` and `exclude_keys` sink configurations to control what data is included in the auto-generated schema. -It is recommended making Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions) to allow missing values; otherwise, ensure all required fields are present for each event. Use non-nullable fields only when certain of their presence. +Avro fields should use a null [union](https://avro.apache.org/docs/current/specification/#unions), as this will allow missing values. Otherwise, all required fields must be present for each event. Use non-nullable fields only when you are certain they exist. Use the following options to configure the codec. From 3dc6c55a614396c38d06c1354d35d19ac803761c Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Fri, 21 Jun 2024 16:02:40 -0600 Subject: [PATCH 17/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index edccefc32d..406d4841ce 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -140,7 +140,7 @@ Use the following options to set rules or limits that trigger certain actions or Option | Required | Type | Description :--- |:-----------------------------------|:-------| :--- -`flush_capacity_ratio` | No | Float | The percentage of groups to be force-flushed when `aggregate_threshold maximum_size` is reached. Default is `0.5`. +`flush_capacity_ratio` | No | Float | The percentage of groups to be force-flushed when `aggregate_threshold maximum_size` is reached. Percentage is expressed from `0.0`--`1.0`. Default is `0.5`. `maximum_size` | Yes | String | The maximum number of bytes to accumulate before force-flushing objects. For example, `128mb`. 
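For illustration only, an `aggregate_threshold` might be combined with a dynamic `path_prefix` along the lines of the following sketch; the bucket name, the prefix, and the `/log_type` event key are placeholders rather than values taken from this documentation:

```
sink:
  - s3:
      bucket: example-logs-bucket            # placeholder bucket name
      aggregate_threshold:
        maximum_size: 128mb                  # force flushing starts at this accumulated size
        flush_capacity_ratio: 0.5            # flush half of the open groups when the limit is reached
      object_key:
        path_prefix: logs/${/log_type}/%{yyyy}/%{MM}/%{dd}/   # dynamic prefix creates many groups in memory
```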
## Buffer type From 074f1dae209f752a4fcb07ea3422e582d7c147e4 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Fri, 21 Jun 2024 16:12:46 -0600 Subject: [PATCH 18/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 406d4841ce..15ac54e263 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -77,7 +77,7 @@ To use the `s3` sink, configure AWS Identity and Access Management (IAM) to gran When Data Prepper fetches data from an S3 bucket, it verifies bucket ownership using [bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html). -By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of the mapped configurations, `default_bucket_owner` defaults to the account ID in `aws.sts_role_arn`. `bucket_owners` and `default_bucket_owner` can be used together. +By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of the mapped configurations, `default_bucket_owner` defaults to the account ID in `aws.sts_role_arn`. You can configure both `bucket_owners` and `default_bucket_owner` and apply the settings together. When ingesting data from multiple S3 buckets with different account associations, configure Data Prepper for cross-account S3 access based on the following conditions: From 0b1e89f84290ab75ed1ad933b048fc847cea9ea7 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Fri, 21 Jun 2024 16:13:56 -0600 Subject: [PATCH 19/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 15ac54e263..d7e97e2eb8 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -200,7 +200,7 @@ Option | Required | Type | Description The `parquet` codec writes events into a Parquet file. When using the codec, set `buffer_type` to `in_memory`. -The `parquet` codec writes data using the schema. Because Parquet requires an Avro schema, you may either define the schema yourself or have Data Prepper automatically generate it. It is recommended that you define your schema to accurately reflect your needs. +The `parquet` codec writes data using the schema. Because Parquet requires an Avro schema, you may either define the schema yourself or have Data Prepper automatically generate it. Defining your own schema is recommended, as this will allow it to be tailored to your particular use case. For details on the Avro schema and recommendations, refer to [Avro codec](#avro-codec). 
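As a minimal sketch, assuming an invented record name and invented fields, a `parquet` codec with a user-defined Avro schema might be configured as follows:

```
sink:
  - s3:
      buffer_type: in_memory                 # the parquet codec requires the in_memory buffer type
      codec:
        parquet:
          schema: >
            {
              "type": "record",
              "name": "ExampleEvent",
              "fields": [
                {"name": "message", "type": ["null", "string"], "default": null},
                {"name": "status",  "type": ["null", "int"],    "default": null}
              ]
            }
```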
From 518d112cf96a3a99de5355afb9bb177a08a75913 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 14:58:56 -0600 Subject: [PATCH 20/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index d7e97e2eb8..bf4b2a6ee1 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -28,7 +28,7 @@ For more information about how to configure an object, refer to [Object key](#ob ## Usage -The following example creates a pipeline configured with an `s3` sink. It contains additional options for customizing the event and size thresholds for which the pipeline sends record events and sets the codec type as `ndjson`: +The following example creates a pipeline configured with an `s3` sink. It contains additional options for customizing the event and size thresholds for the pipeline and sets the codec type as `ndjson`: ``` pipeline: From 4745f640b9086b88ecd7bbf9a6b98c3811da9ad6 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 14:59:22 -0600 Subject: [PATCH 21/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index bf4b2a6ee1..31ce97e690 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -75,7 +75,7 @@ To use the `s3` sink, configure AWS Identity and Access Management (IAM) to gran ## Cross-account S3 access -When Data Prepper fetches data from an S3 bucket, it verifies bucket ownership using [bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html). +When Data Prepper fetches data from an S3 bucket, it verifies bucket ownership using a [bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html). By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of the mapped configurations, `default_bucket_owner` defaults to the account ID in `aws.sts_role_arn`. You can configure both `bucket_owners` and `default_bucket_owner` and apply the settings together. From 9700cc76c26186b20f6532a9cafef811376326c4 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:00:11 -0600 Subject: [PATCH 22/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 31ce97e690..c4246b5532 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -81,7 +81,7 @@ By default, the S3 sink does not require `bucket_owners`. 
If `bucket_owners` is When ingesting data from multiple S3 buckets with different account associations, configure Data Prepper for cross-account S3 access based on the following conditions: -- For S3 buckets under the same account, set `default_bucket_owner` to that account's ID. +- For S3 buckets belonging to the same account, set `default_bucket_owner` to that account's ID. - For S3 buckets across multiple accounts, use a `bucket_owners` map. A `bucket_owners` map specifies account IDs for buckets across accounts, for example as shown in the following configuration, with `my-bucket-01` owned by `123456789012` and `my-bucket-02` owned by `999999999999`: From f65bfb0c1eb2681681806cb4b4790a3893a123b3 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:00:29 -0600 Subject: [PATCH 23/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index c4246b5532..1bae03d4d9 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -82,7 +82,7 @@ By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is When ingesting data from multiple S3 buckets with different account associations, configure Data Prepper for cross-account S3 access based on the following conditions: - For S3 buckets belonging to the same account, set `default_bucket_owner` to that account's ID. -- For S3 buckets across multiple accounts, use a `bucket_owners` map. +- For S3 buckets belonging to multiple accounts, use a `bucket_owners` map. A `bucket_owners` map specifies account IDs for buckets across accounts, for example as shown in the following configuration, with `my-bucket-01` owned by `123456789012` and `my-bucket-02` owned by `999999999999`: From 67ebd6360457cdc032170505416adb6b52b73762 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:00:43 -0600 Subject: [PATCH 24/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 1bae03d4d9..5dc82efca7 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -84,7 +84,7 @@ When ingesting data from multiple S3 buckets with different account associations - For S3 buckets belonging to the same account, set `default_bucket_owner` to that account's ID. - For S3 buckets belonging to multiple accounts, use a `bucket_owners` map. -A `bucket_owners` map specifies account IDs for buckets across accounts, for example as shown in the following configuration, with `my-bucket-01` owned by `123456789012` and `my-bucket-02` owned by `999999999999`: +A `bucket_owners` map specifies account IDs for buckets belonging to multiple accounts. 
For example, in the following configuration, `my-bucket-01` is owned by `123456789012` and `my-bucket-02` is owned by `999999999999`: ``` sink: From 340ea25b0f751de852b9e12408d9087be1e824ec Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:00:57 -0600 Subject: [PATCH 25/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 5dc82efca7..83467f87dd 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -102,7 +102,7 @@ Use the following options when customizing the `s3` sink. Option | Required | Type | Description :--- |:---------|:------------------------------------------------| :--- -`bucket` | Yes | String | Specifies the sink's S3 bucket name. Supports using dynamic bucket naming using [Data Prepper expressions]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket is inaccessible and no `default_bucket` is configured, then the object data is dropped. +`bucket` | Yes | String | Specifies the sink's S3 bucket name. Supports dynamic bucket naming using [Data Prepper expressions]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket is inaccessible and no `default_bucket` is configured, then the object data is dropped. `default_bucket` | No | String | Static bucket for inaccessible dynamic buckets in `bucket`. `bucket_owners` | No | Map | Map of bucket names and their account owner IDs for cross-account access. Refer to [Cross-account S3 access](#s3_bucket_ownership). `default_bucket_owner` | No | String | AWS account ID for an S3 bucket owner. Refer to [Cross-account S3 access](#s3_bucket_ownership). From d51418f452f515a0950402e82e0a4507a7f3c077 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:01:04 -0600 Subject: [PATCH 26/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 83467f87dd..cb19bc06a8 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -104,7 +104,7 @@ Option | Required | Type | Descriptio :--- |:---------|:------------------------------------------------| :--- `bucket` | Yes | String | Specifies the sink's S3 bucket name. Supports dynamic bucket naming using [Data Prepper expressions]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket is inaccessible and no `default_bucket` is configured, then the object data is dropped. `default_bucket` | No | String | Static bucket for inaccessible dynamic buckets in `bucket`. -`bucket_owners` | No | Map | Map of bucket names and their account owner IDs for cross-account access. Refer to [Cross-account S3 access](#s3_bucket_ownership). +`bucket_owners` | No | Map | A map of bucket names and their account owner IDs for cross-account access. 
Refer to [Cross-account S3 access](#s3_bucket_ownership). `default_bucket_owner` | No | String | AWS account ID for an S3 bucket owner. Refer to [Cross-account S3 access](#s3_bucket_ownership). `codec` | Yes | [Codec](#codec) | Serializes data in S3 objects. `aws` | Yes | AWS | AWS configuration. Refer to [aws](#aws). From 8afd19d140f5c9d0c086e438fd94b5f750d9f799 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:01:15 -0600 Subject: [PATCH 27/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index cb19bc06a8..ad70779b6d 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -103,7 +103,7 @@ Use the following options when customizing the `s3` sink. Option | Required | Type | Description :--- |:---------|:------------------------------------------------| :--- `bucket` | Yes | String | Specifies the sink's S3 bucket name. Supports dynamic bucket naming using [Data Prepper expressions]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket is inaccessible and no `default_bucket` is configured, then the object data is dropped. -`default_bucket` | No | String | Static bucket for inaccessible dynamic buckets in `bucket`. +`default_bucket` | No | String | A static bucket for inaccessible dynamic buckets in `bucket`. `bucket_owners` | No | Map | A map of bucket names and their account owner IDs for cross-account access. Refer to [Cross-account S3 access](#s3_bucket_ownership). `default_bucket_owner` | No | String | AWS account ID for an S3 bucket owner. Refer to [Cross-account S3 access](#s3_bucket_ownership). `codec` | Yes | [Codec](#codec) | Serializes data in S3 objects. From 7729a67494a0cc559cb0e7b31b300823a6b9ce92 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:01:24 -0600 Subject: [PATCH 28/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index ad70779b6d..571606845c 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -105,7 +105,7 @@ Option | Required | Type | Descriptio `bucket` | Yes | String | Specifies the sink's S3 bucket name. Supports dynamic bucket naming using [Data Prepper expressions]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket is inaccessible and no `default_bucket` is configured, then the object data is dropped. `default_bucket` | No | String | A static bucket for inaccessible dynamic buckets in `bucket`. `bucket_owners` | No | Map | A map of bucket names and their account owner IDs for cross-account access. Refer to [Cross-account S3 access](#s3_bucket_ownership). -`default_bucket_owner` | No | String | AWS account ID for an S3 bucket owner. Refer to [Cross-account S3 access](#s3_bucket_ownership). 
+`default_bucket_owner` | No | String | The AWS account ID for an S3 bucket owner. Refer to [Cross-account S3 access](#s3_bucket_ownership). `codec` | Yes | [Codec](#codec) | Serializes data in S3 objects. `aws` | Yes | AWS | AWS configuration. Refer to [aws](#aws). `threshold` | Yes | [Threshold](#threshold-configuration) | Condition for writing objects to S3. From 521ac9391cc3a689447cf093e589aa11f1b05dd1 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:01:32 -0600 Subject: [PATCH 29/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 571606845c..f994ea2d1f 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -107,7 +107,7 @@ Option | Required | Type | Descriptio `bucket_owners` | No | Map | A map of bucket names and their account owner IDs for cross-account access. Refer to [Cross-account S3 access](#s3_bucket_ownership). `default_bucket_owner` | No | String | The AWS account ID for an S3 bucket owner. Refer to [Cross-account S3 access](#s3_bucket_ownership). `codec` | Yes | [Codec](#codec) | Serializes data in S3 objects. -`aws` | Yes | AWS | AWS configuration. Refer to [aws](#aws). +`aws` | Yes | AWS | The AWS configuration. Refer to [aws](#aws). `threshold` | Yes | [Threshold](#threshold-configuration) | Condition for writing objects to S3. `aggregate_threshold` | No | [Aggregate threshold](#threshold-configuration) | Condition for flushing objects with dynamic `path_prefix`. `object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the bucket's root directory. `path_prefix` is configurable. From cf65a3791762bdf91582025bb1671338e358c586 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:01:43 -0600 Subject: [PATCH 30/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index f994ea2d1f..7d415cc0ec 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -109,7 +109,7 @@ Option | Required | Type | Descriptio `codec` | Yes | [Codec](#codec) | Serializes data in S3 objects. `aws` | Yes | AWS | The AWS configuration. Refer to [aws](#aws). `threshold` | Yes | [Threshold](#threshold-configuration) | Condition for writing objects to S3. -`aggregate_threshold` | No | [Aggregate threshold](#threshold-configuration) | Condition for flushing objects with dynamic `path_prefix`. +`aggregate_threshold` | No | [Aggregate threshold](#threshold-configuration) | A condition for flushing objects with a dynamic `path_prefix`. `object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the bucket's root directory. `path_prefix` is configurable. 
`compression` | No | String | Compression algorithm: `none`, `gzip`, and `snappy`. Default is `none`. `buffer_type` | No | [Buffer type](#buffer-type) | Buffer type configuration. From b9c58fe69d33665dd5162c94f943c07530b1f518 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:01:55 -0600 Subject: [PATCH 31/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 7d415cc0ec..5faa6691d8 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -110,7 +110,7 @@ Option | Required | Type | Descriptio `aws` | Yes | AWS | The AWS configuration. Refer to [aws](#aws). `threshold` | Yes | [Threshold](#threshold-configuration) | Condition for writing objects to S3. `aggregate_threshold` | No | [Aggregate threshold](#threshold-configuration) | A condition for flushing objects with a dynamic `path_prefix`. -`object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the bucket's root directory. `path_prefix` is configurable. +`object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, these objects are found in the bucket's root directory. `path_prefix` is configurable. `compression` | No | String | Compression algorithm: `none`, `gzip`, and `snappy`. Default is `none`. `buffer_type` | No | [Buffer type](#buffer-type) | Buffer type configuration. `max_retries` | No | Integer | Maximum retries for S3 ingestion requests. Default is `5`. From 3e23dc40409e8cf88320571d96619f4df246be5c Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:02:08 -0600 Subject: [PATCH 32/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 5faa6691d8..df7df1418f 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -111,7 +111,7 @@ Option | Required | Type | Descriptio `threshold` | Yes | [Threshold](#threshold-configuration) | Condition for writing objects to S3. `aggregate_threshold` | No | [Aggregate threshold](#threshold-configuration) | A condition for flushing objects with a dynamic `path_prefix`. `object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, these objects are found in the bucket's root directory. `path_prefix` is configurable. -`compression` | No | String | Compression algorithm: `none`, `gzip`, and `snappy`. Default is `none`. +`compression` | No | String | The compression algorithm: Either `none`, `gzip`, or `snappy`. Default is `none`. `buffer_type` | No | [Buffer type](#buffer-type) | Buffer type configuration. 
`max_retries` | No | Integer | Maximum retries for S3 ingestion requests. Default is `5`. From f4e6e2bcbfe12998a246aae2cb5258e66c1f119b Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:02:18 -0600 Subject: [PATCH 33/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index df7df1418f..9a459a8570 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -112,7 +112,7 @@ Option | Required | Type | Descriptio `aggregate_threshold` | No | [Aggregate threshold](#threshold-configuration) | A condition for flushing objects with a dynamic `path_prefix`. `object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, these objects are found in the bucket's root directory. `path_prefix` is configurable. `compression` | No | String | The compression algorithm: Either `none`, `gzip`, or `snappy`. Default is `none`. -`buffer_type` | No | [Buffer type](#buffer-type) | Buffer type configuration. +`buffer_type` | No | [Buffer type](#buffer-type) | The buffer type configuration. `max_retries` | No | Integer | Maximum retries for S3 ingestion requests. Default is `5`. ## `aws` From dffe88168e80fadc0811ff5a04eedce80aeeac8a Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:02:26 -0600 Subject: [PATCH 34/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 9a459a8570..f70dd9dfc9 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -113,7 +113,7 @@ Option | Required | Type | Descriptio `object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, these objects are found in the bucket's root directory. `path_prefix` is configurable. `compression` | No | String | The compression algorithm: Either `none`, `gzip`, or `snappy`. Default is `none`. `buffer_type` | No | [Buffer type](#buffer-type) | The buffer type configuration. -`max_retries` | No | Integer | Maximum retries for S3 ingestion requests. Default is `5`. +`max_retries` | No | Integer | The maximum number of retries for S3 ingestion requests. Default is `5`. 
## `aws` From e2fd6a2cd7289ebb36d6e7bbe234fa0634b5a217 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:36:58 -0600 Subject: [PATCH 35/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index f70dd9dfc9..9fcdf13ab6 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -120,7 +120,7 @@ Option | Required | Type | Descriptio Option | Required | Type | Description :--- | :--- | :--- | :--- `region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). -`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon Simple Queue Service (Amazon SQS) and Amazon S3. Defaults to `null`, which uses [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). +`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon Simple Queue Service (Amazon SQS) and Amazon S3. Defaults to `null`, which uses the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). `sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin. `sts_external_id` | No | String | An AWS STS external ID used when Data Prepper assumes the role. For more information, refer to the `ExternalId` section under [AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) in the AWS STS API reference. From 559d0a15140d9728f686b20c2154126d4ba9ae4e Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:37:11 -0600 Subject: [PATCH 36/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 9fcdf13ab6..72ddce8cd9 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -132,7 +132,7 @@ Option | Required | Type | Description :--- | :--- | :--- | :--- `event_count` | Yes | Integer | The number of Data Prepper events to accumulate before writing an object to S3. `maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`. -`event_collect_timeout` | Yes | String | The maximum amount of timeout before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`. +`event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`. 
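For example, a threshold that writes an object once the accumulated events, accumulated size, or elapsed time reaches a configured limit might look similar to the following sketch; the specific values are illustrative only:

```
sink:
  - s3:
      threshold:
        event_count: 2000              # example event count
        maximum_size: 50mb             # example maximum accumulated size
        event_collect_timeout: PT15M   # example ISO-8601 timeout
```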
## Aggregate threshold configuration From a0b8d9441be120878827833df80c69092864ea66 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:37:32 -0600 Subject: [PATCH 37/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 72ddce8cd9..da855fbdb1 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -140,7 +140,7 @@ Use the following options to set rules or limits that trigger certain actions or Option | Required | Type | Description :--- |:-----------------------------------|:-------| :--- -`flush_capacity_ratio` | No | Float | The percentage of groups to be force-flushed when `aggregate_threshold maximum_size` is reached. Percentage is expressed from `0.0`--`1.0`. Default is `0.5`. +`flush_capacity_ratio` | No | Float | The percentage of groups to be force-flushed when `aggregate_threshold maximum_size` is reached. The percentage is expressed as a number between `0.0` and `1.0`. Default is `0.5`. `maximum_size` | Yes | String | The maximum number of bytes to accumulate before force-flushing objects. For example, `128mb`. ## Buffer type From 11a26125733f0cbccc2e96c3e070901e3c787138 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:37:41 -0600 Subject: [PATCH 38/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index da855fbdb1..6767fe2367 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -167,7 +167,7 @@ The `codec` determines how the `s3` source formats data written to each S3 objec ### `avro` codec -The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document. Because Avro requires a schema, you may either define the schema or have Data Prepper automatically generate it. Defining your own schema is recommended, as this will allow it to be tailored to your particular use case. +The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document. Because Avro requires a schema, you may either define the schema or have Data Prepper automatically generate it. Defining your own schema is recommended because this will allow it to be tailored to your particular use case. When you provide your own Avro schema, that schema defines the final structure of your data. Any extra values inside any incoming events that are not mapped in the Arvo schema will not be included in the final destination. Data Prepper does not allow the use of the `include_keys` or `exclude_keys` with a custom schema. This is to avoid confusion between a custom Arvo schema and the `include_keys` or `exclude_keys` sink configurations. 
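As a hypothetical example of this behavior, the following `avro` codec defines a schema containing only `client_ip` and `bytes_sent`; with a schema like this, any other keys in incoming events would not be written to the destination. The field names are invented for illustration:

```
sink:
  - s3:
      codec:
        avro:
          schema: >
            {
              "type": "record",
              "name": "ExampleEvent",
              "fields": [
                {"name": "client_ip",  "type": ["null", "string"], "default": null},
                {"name": "bytes_sent", "type": ["null", "long"],   "default": null}
              ]
            }
```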
From cb19c6b5ddd1d7769909fdc5eac73d722c6d9254 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:38:07 -0600 Subject: [PATCH 39/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 6767fe2367..27198c0890 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -169,7 +169,7 @@ The `codec` determines how the `s3` source formats data written to each S3 objec The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document. Because Avro requires a schema, you may either define the schema or have Data Prepper automatically generate it. Defining your own schema is recommended because this will allow it to be tailored to your particular use case. -When you provide your own Avro schema, that schema defines the final structure of your data. Any extra values inside any incoming events that are not mapped in the Arvo schema will not be included in the final destination. Data Prepper does not allow the use of the `include_keys` or `exclude_keys` with a custom schema. This is to avoid confusion between a custom Arvo schema and the `include_keys` or `exclude_keys` sink configurations. +When you provide your own Avro schema, that schema defines the final structure of your data. Any extra values in any incoming events that are not mapped in the Avro schema will not be included in the final destination. Data Prepper does not allow the use of `include_keys` or `exclude_keys` with a custom schema so as to avoid confusion between a custom Avro schema and the `include_keys` or `exclude_keys` sink configurations. In cases where your data is uniform, you may be able to automatically generate a schema. Auto-generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event, and you must have all keys present in all events in order for the auto-generated schema to produce a working schema. Auto-generated schemas make all fields nullable. Use the `include_keys` and `exclude_keys` sink configurations to control what data is included in the auto-generated schema. From b7b1d930616c8408f398eb87634636dfbeeaa60d Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:38:28 -0600 Subject: [PATCH 40/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 27198c0890..c591b4cb4f 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -171,7 +171,7 @@ The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) d When you provide your own Avro schema, that schema defines the final structure of your data. Any extra values in any incoming events that are not mapped in the Avro schema will not be included in the final destination. 
Data Prepper does not allow the use of `include_keys` or `exclude_keys` with a custom schema so as to avoid confusion between a custom Avro schema and the `include_keys` or `exclude_keys` sink configurations. -In cases where your data is uniform, you may be able to automatically generate a schema. Auto-generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event, and you must have all keys present in all events in order for the auto-generated schema to produce a working schema. Auto-generated schemas make all fields nullable. Use the `include_keys` and `exclude_keys` sink configurations to control what data is included in the auto-generated schema. +In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event, and all keys must be present in all events in order for the automatically generated schema to produce a working schema. Automatically generated schemas make all fields nullable. Use the `include_keys` and `exclude_keys` sink configurations to control which data is included in the automatically generated schema. Avro fields should use a null [union](https://avro.apache.org/docs/current/specification/#unions), as this will allow missing values. Otherwise, all required fields must be present for each event. Use non-nullable fields only when you are certain they exist. From 34c043996282ab601496f8db8fab6cc48c274e92 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:38:35 -0600 Subject: [PATCH 41/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index c591b4cb4f..988c21b515 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -173,7 +173,7 @@ When you provide your own Avro schema, that schema defines the final structure o In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event, and all keys must be present in all events in order for the automatically generated schema to produce a working schema. Automatically generated schemas make all fields nullable. Use the `include_keys` and `exclude_keys` sink configurations to control which data is included in the automatically generated schema. -Avro fields should use a null [union](https://avro.apache.org/docs/current/specification/#unions), as this will allow missing values. Otherwise, all required fields must be present for each event. Use non-nullable fields only when you are certain they exist. +Avro fields should use a null [union](https://avro.apache.org/docs/current/specification/#unions) because this will allow missing values. Otherwise, all required fields must be present for each event. Use non-nullable fields only when you are certain they exist. Use the following options to configure the codec. 
From 27ef838ddec54fd48c073ad9a829f659f0a64d6e Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:38:49 -0600 Subject: [PATCH 42/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 988c21b515..9c79b3f4f2 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -200,7 +200,7 @@ Option | Required | Type | Description The `parquet` codec writes events into a Parquet file. When using the codec, set `buffer_type` to `in_memory`. -The `parquet` codec writes data using the schema. Because Parquet requires an Avro schema, you may either define the schema yourself or have Data Prepper automatically generate it. Defining your own schema is recommended, as this will allow it to be tailored to your particular use case. +The `parquet` codec writes data using the schema. Because Parquet requires an Avro schema, you may either define the schema yourself or have Data Prepper automatically generate it. Defining your own schema is recommended because this will allow it to be tailored to your particular use case. For details on the Avro schema and recommendations, refer to [Avro codec](#avro-codec). From 8ef550ebfd0ab32770bd08268b9735dd2135a335 Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:39:02 -0600 Subject: [PATCH 43/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Co-authored-by: Nathan Bower Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 9c79b3f4f2..42d26134fd 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -202,7 +202,7 @@ The `parquet` codec writes events into a Parquet file. When using the codec, set The `parquet` codec writes data using the schema. Because Parquet requires an Avro schema, you may either define the schema yourself or have Data Prepper automatically generate it. Defining your own schema is recommended because this will allow it to be tailored to your particular use case. -For details on the Avro schema and recommendations, refer to [Avro codec](#avro-codec). +For more information about the Avro schema, refer to [Avro codec](#avro-codec). Use the following options to configure the codec. From f2ab4ae6639c5ae21657d62f2bea9f998c2ac88e Mon Sep 17 00:00:00 2001 From: Melissa Vagi Date: Tue, 25 Jun 2024 15:41:20 -0600 Subject: [PATCH 44/44] Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Melissa Vagi --- _data-prepper/pipelines/configuration/sinks/s3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 42d26134fd..d1413f6ffc 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -171,7 +171,7 @@ The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) d When you provide your own Avro schema, that schema defines the final structure of your data. 
Any extra values in any incoming events that are not mapped in the Avro schema will not be included in the final destination. Data Prepper does not allow the use of `include_keys` or `exclude_keys` with a custom schema so as to avoid confusion between a custom Avro schema and the `include_keys` or `exclude_keys` sink configurations. -In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event, and all keys must be present in all events in order for the automatically generated schema to produce a working schema. Automatically generated schemas make all fields nullable. Use the `include_keys` and `exclude_keys` sink configurations to control which data is included in the automatically generated schema. +In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event, and all keys must be present in all events in order to automatically generate a working schema. Automatically generated schemas make all fields nullable. Use the `include_keys` and `exclude_keys` sink configurations to control which data is included in the automatically generated schema. Avro fields should use a null [union](https://avro.apache.org/docs/current/specification/#unions) because this will allow missing values. Otherwise, all required fields must be present for each event. Use non-nullable fields only when you are certain they exist.
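For example, in a hypothetical schema fragment like the following, `message` uses a null union and may be absent from an event, while `timestamp` is non-nullable and must be present in every event:

```
codec:
  avro:
    schema: >
      {
        "type": "record",
        "name": "ExampleEvent",
        "fields": [
          {"name": "message",   "type": ["null", "string"], "default": null},
          {"name": "timestamp", "type": "long"}
        ]
      }
```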