diff --git a/x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc b/x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
index 41f7847f005..8c49e0733c0 100644
--- a/x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
+++ b/x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
@@ -82,9 +82,99 @@ Please see <> for alternate AWS
   expand_event_list_from_field: Records
 ----
 
+<<<<<<< HEAD
 The `aws-s3` input supports the following configuration options plus the
 <<{beatname_lc}-input-{type}-common-options>> described later.
 
+=======
+[float]
+=== Document ID Generation
+
+This aws-s3 input feature prevents the duplication of events in Elasticsearch by
+generating a custom document `_id` for each event, rather than relying on
+Elasticsearch to automatically generate one. Each document in an Elasticsearch
+index must have a unique `_id`, and {beatname_uc} uses this property to avoid
+ingesting duplicate events.
+
+The custom `_id` is based on several pieces of information from the S3 object:
+the Last-Modified timestamp, the bucket ARN, the object key, and the byte
+offset of the data in the event.
+
+Duplicate prevention is particularly useful in scenarios where {beatname_uc}
+needs to retry an operation. {beatname_uc} guarantees at-least-once delivery,
+meaning it will retry any failed or incomplete operations. These retries may be
+triggered by issues with the host, `{beatname_uc}`, network connectivity, or
+services such as Elasticsearch, SQS, or S3.
+
+[float]
+==== Limitations of `_id`-Based Deduplication
+
+There are some limitations to consider when using `_id`-based deduplication in
+Elasticsearch:
+
+* Deduplication works only within a single index. The same `_id` can exist in
+  different indices, which is important if you're using data streams or index
+  aliases. When the backing index rolls over, a duplicate may be ingested.
+
+* Indexing operations in Elasticsearch may take longer when an `_id` is
+  specified. Elasticsearch needs to check if the ID already exists before
+  writing, which can increase the time required for indexing.
+
+[float]
+==== Disabling Duplicate Prevention
+
+If you want to disable the `_id`-based deduplication, you can remove the
+document `_id` using the <> processor in
+{beatname_uc}.
+
+["source","yaml",subs="attributes"]
+----
+{beatname_lc}.inputs:
+  - type: aws-s3
+    queue_url: https://queue.amazonaws.com/80398EXAMPLE/MyQueue
+    processors:
+      - drop_fields:
+          fields:
+            - '@metadata._id'
+          ignore_missing: true
+----
+
+Alternatively, you can remove the `_id` field using an Elasticsearch Ingest
+Node pipeline.
+
+["source","json",subs="attributes"]
+----
+{
+  "processors": [
+    {
+      "remove": {
+        "if": "ctx.input?.type == \"aws-s3\"",
+        "field": "_id",
+        "ignore_missing": true
+      }
+    }
+  ]
+}
+----
+
+[float]
+=== Handling Compressed Objects
+
+S3 objects that use the gzip format
+(https://rfc-editor.org/rfc/rfc1952.html[RFC 1952]) with the DEFLATE compression
+algorithm are automatically decompressed during processing. This is achieved by
+checking for the gzip file magic header.
+
+[float]
+=== Configuration
+
+The `aws-s3` input supports the following configuration options plus the
+<<{beatname_lc}-input-{type}-common-options>> described later.
+
+NOTE: For time durations, valid time units are - "ns", "us" (or "µs"), "ms",
+"s", "m", "h". For example, "2h"
+
+>>>>>>> 7fd2d46de (x-pack/filebeat/docs/ - document gzip S3 object handling (#42306))
 [float]
 ==== `api_timeout`
 