x-pack/filebeat/docs/ - document gzip S3 object handling (#42306)
Document how compressed objects are handled by the aws-s3 input.

(cherry picked from commit 7fd2d46)

# Conflicts:
#	x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
andrewkroh authored and mergify[bot] committed Jan 17, 2025
1 parent 308418e commit de906f4
Showing 1 changed file with 90 additions and 0 deletions.
90 changes: 90 additions & 0 deletions x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
@@ -82,9 +82,99 @@ Please see <<aws-credentials-config,Configuration parameters>> for alternate AWS
expand_event_list_from_field: Records
----

<<<<<<< HEAD
The `aws-s3` input supports the following configuration options plus the
<<{beatname_lc}-input-{type}-common-options>> described later.

=======
[float]
=== Document ID Generation

This aws-s3 input feature prevents the duplication of events in Elasticsearch by
generating a custom document `_id` for each event, rather than relying on
Elasticsearch to automatically generate one. Each document in an Elasticsearch
index must have a unique `_id`, and {beatname_uc} uses this property to avoid
ingesting duplicate events.

The custom `_id` is based on several pieces of information from the S3 object:
the Last-Modified timestamp, the bucket ARN, the object key, and the byte
offset of the data in the event.

Duplicate prevention is particularly useful in scenarios where {beatname_uc}
needs to retry an operation. {beatname_uc} guarantees at-least-once delivery,
meaning it will retry any failed or incomplete operations. These retries may be
triggered by issues with the host, `{beatname_uc}`, network connectivity, or
services such as Elasticsearch, SQS, or S3.
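
To illustrate why a retry produces the same `_id`, here is a minimal Python
sketch that derives a deterministic identifier from the four inputs listed
above. The function name `make_document_id`, the SHA-256 hash, the separator,
and the 20-character truncation are all assumptions made for illustration;
they are not the algorithm {beatname_uc} actually uses.

```python
import hashlib
from datetime import datetime, timezone


def make_document_id(last_modified: datetime, bucket_arn: str,
                     object_key: str, offset: int) -> str:
    """Hypothetical helper: hash the four inputs into a stable ID.

    The hash function and output length are illustrative assumptions,
    not the real implementation.
    """
    parts = [
        # Normalize to UTC so the same instant always serializes identically.
        last_modified.astimezone(timezone.utc).isoformat(),
        bucket_arn,
        object_key,
        str(offset),
    ]
    # Join with a NUL byte so adjacent fields cannot run together ambiguously.
    digest = hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()
    return digest[:20]


# Example values below are made up for the demonstration.
ts = datetime(2025, 1, 17, 12, 0, 0, tzinfo=timezone.utc)
first = make_document_id(ts, "arn:aws:s3:::example-bucket", "logs/a.json", 0)
retry = make_document_id(ts, "arn:aws:s3:::example-bucket", "logs/a.json", 0)
other = make_document_id(ts, "arn:aws:s3:::example-bucket", "logs/a.json", 128)
print(first == retry, first == other)
```

Because the identifier depends only on properties of the object and the
position of the event inside it, re-reading the same object after a failure
yields the same IDs, while events at different offsets get different ones.
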
[float]
==== Limitations of `_id`-Based Deduplication

There are some limitations to consider when using `_id`-based deduplication in
Elasticsearch:

* Deduplication works only within a single index. The same `_id` can exist in
  different indices, which is important if you're using data streams or index
  aliases. When the backing index rolls over, a duplicate may be ingested.
* Indexing operations in Elasticsearch may take longer when an `_id` is
  specified. Elasticsearch needs to check if the ID already exists before
  writing, which can increase the time required for indexing.

[float]
==== Disabling Duplicate Prevention
If you want to disable the `_id`-based deduplication, you can remove the
document `_id` using the <<drop-fields,`drop_fields`>> processor in
{beatname_uc}.
["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: aws-s3
  queue_url: https://queue.amazonaws.com/80398EXAMPLE/MyQueue
  processors:
    - drop_fields:
        fields:
          - '@metadata._id'
        ignore_missing: true
----
Alternatively, you can remove the `_id` field using an Elasticsearch Ingest
Node pipeline.
["source","json",subs="attributes"]
----
{
  "processors": [
    {
      "remove": {
        "if": "ctx.input?.type == \"aws-s3\"",
        "field": "_id",
        "ignore_missing": true
      }
    }
  ]
}
----
[float]
=== Handling Compressed Objects
S3 objects that use the gzip format
(https://rfc-editor.org/rfc/rfc1952.html[RFC 1952]) with the DEFLATE compression
algorithm are automatically decompressed during processing. This is achieved by
checking for the gzip file magic header.
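
As a rough illustration of magic-header detection, the Python sketch below
checks whether an object body starts with the two-byte gzip identifier
(`0x1f 0x8b`, as defined in RFC 1952) and decompresses it only in that case.
The helper name `maybe_decompress` is hypothetical, and the sketch works on an
in-memory byte string for simplicity; it is not how {beatname_uc} itself
implements the check.

```python
import gzip

# First two bytes of every gzip member (ID1 and ID2 in RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_decompress(body: bytes) -> bytes:
    """Hypothetical helper: gunzip the body only if it carries the gzip magic."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    # Anything else is passed through unchanged.
    return body


compressed = gzip.compress(b'{"message": "hello"}\n')
print(maybe_decompress(compressed))
print(maybe_decompress(b"plain text line\n"))
```

Detecting the format from the content itself means the decision does not have
to depend on the object key ending in `.gz` or on a `Content-Encoding` value.
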
[float]
=== Configuration
The `aws-s3` input supports the following configuration options plus the
<<{beatname_lc}-input-{type}-common-options>> described later.

NOTE: For time durations, the valid time units are "ns", "us" (or "µs"), "ms",
"s", "m", and "h". For example, "2h".

>>>>>>> 7fd2d46de (x-pack/filebeat/docs/ - document gzip S3 object handling (#42306))
[float]
==== `api_timeout`