x-pack/filebeat/docs/ - document gzip S3 object handling (#42306)

Document how compressed objects are handled by the aws-s3 input.

(cherry picked from commit 7fd2d46)

# Conflicts:
#	x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
elastic · Jan 17, 2025 · de906f4
1 parent 308418e
Showing 1 changed file with 90 additions and 0 deletions.
diff --git a/x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc b/x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
@@ -82,9 +82,99 @@ Please see <<aws-credentials-config,Configuration parameters>> for alternate AWS
   expand_event_list_from_field: Records
 ----
 
+<<<<<<< HEAD
 The `aws-s3` input supports the following configuration options plus the
 <<{beatname_lc}-input-{type}-common-options>> described later.
 
+=======
+[float]
+=== Document ID Generation
+
+This aws-s3 input feature prevents the duplication of events in Elasticsearch by
+generating a custom document `_id` for each event, rather than relying on
+Elasticsearch to automatically generate one. Each document in an Elasticsearch
+index must have a unique `_id`, and {beatname_uc} uses this property to avoid
+ingesting duplicate events.
+
+The custom `_id` is based on several pieces of information from the S3 object:
+the Last-Modified timestamp, the bucket ARN, the object key, and the byte
+offset of the data in the event.
+
+Duplicate prevention is particularly useful in scenarios where {beatname_uc}
+needs to retry an operation. {beatname_uc} guarantees at-least-once delivery,
+meaning it will retry any failed or incomplete operations. These retries may be
+triggered by issues with the host, `{beatname_uc}`, network connectivity, or
+services such as Elasticsearch, SQS, or S3.
+
+[float]
+==== Limitations of `_id`-Based Deduplication
+
+There are some limitations to consider when using `_id`-based deduplication in
+Elasticsearch:
+
+* Deduplication works only within a single index. The same `_id` can exist in
+  different indices, which is important if you're using data streams or index
+  aliases. When the backing index rolls over, a duplicate may be ingested.
+
+* Indexing operations in Elasticsearch may take longer when an `_id` is
+  specified. Elasticsearch needs to check if the ID already exists before
+  writing, which can increase the time required for indexing.
+
+[float]
+==== Disabling Duplicate Prevention
+
+If you want to disable the `_id`-based deduplication, you can remove the
+document `_id` using the <<drop-fields,`drop_fields`>> processor in
+{beatname_uc}.
+
+["source","yaml",subs="attributes"]
+----
+{beatname_lc}.inputs:
+  - type: aws-s3
+    queue_url: https://queue.amazonaws.com/80398EXAMPLE/MyQueue
+    processors:
+      - drop_fields:
+          fields:
+            - '@metadata._id'
+          ignore_missing: true
+----
+
+Alternatively, you can remove the `_id` field using an Elasticsearch Ingest
+Node pipeline.
+
+["source","json",subs="attributes"]
+----
+{
+  "processors": [
+    {
+      "remove": {
+        "if": "ctx.input?.type == \"aws-s3\"",
+        "field": "_id",
+        "ignore_missing": true
+      }
+    }
+  ]
+}
+----
+
+[float]
+=== Handling Compressed Objects
+
+S3 objects that use the gzip format
+(https://rfc-editor.org/rfc/rfc1952.html[RFC 1952]) with the DEFLATE compression
+algorithm are automatically decompressed during processing. This is achieved by
+checking for the gzip file magic header.
+
+[float]
+=== Configuration
+
+The `aws-s3` input supports the following configuration options plus the
+<<{beatname_lc}-input-{type}-common-options>> described later.
+
+NOTE: For time durations, valid time units are - "ns", "us" (or "µs"), "ms",
+"s", "m", "h". For example, "2h"
+
+>>>>>>> 7fd2d46de (x-pack/filebeat/docs/ - document gzip S3 object handling (#42306))
 [float]
 ==== `api_timeout`