[aws] [s3] Introduce ignore_older & start_timestamp for S3 input allowing better registry cleanups #41817

@Kavindu-Dodan Kavindu-Dodan commented Nov 27, 2024

Proposed commit message

Introduce ignore_older and start_timestamp properties to the AWS S3 input. This is a follow-up to #41694.

The options introduced here act as filters on input objects. If an object does not match the filters derived from them, its entry is cleaned up from the registry, which reduces filebeat memory consumption.

The new configuration options are:

  • ignore_older: A look-back duration, counted back from the current time; only objects last modified within it are accepted for processing
  • start_timestamp: A timestamp from which objects are accepted for processing

For both options, the object's last modified timestamp is used for the comparison. See the Use cases section for further explanation.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

None, because both options are disabled by default. However, when the options introduced here are used, the following can affect the user:

  • When start_timestamp is defined, objects whose last modified timestamp is earlier than start_timestamp are not processed (documented; see footnote 1)
  • When ignore_older is defined, objects that fall outside the look-back period at the time a polling run starts are ignored (documented; see footnote 1)
  • When both start_timestamp and ignore_older are defined, the initial run processes all objects back to start_timestamp. Subsequent runs skip objects that fall outside the ignore_older period, even if processing previously failed for an object. (documented; see footnote 1)

How to test this PR locally

  • Build filebeat from the changeset included in this PR
  • Prepare a source S3 bucket with objects (you may use the tool in footnote 2 to create entries)
  • Configure filebeat with different values for ignore_older and start_timestamp to see how data ingestion changes. See the Use cases section for further explanation.

Related issues

Use cases

Consider the diagrams below, in which three objects, Object A, Object B and Object C, have last modified timestamps t1, t2 and t3.

Then consider how filebeat processes objects and tracks registry entries in the following scenarios.

Default behavior

If neither option is set, filebeat processes all objects, and the internal registry keeps tracking all of them unless they are removed from the bucket.

image

Use start_timestamp

If start_timestamp is used, objects newer than the timestamp are accepted for processing. The registry will keep growing unless objects are removed from the bucket by other means (e.g., a lifecycle policy).

image

Use ignore_older

If ignore_older is defined, the input processes objects whose last modified timestamp falls within the provided duration, calculated back from the current time. The registry tracks objects within the current time window; entries for other objects are eventually cleaned up by subsequent runs.

image

Use both ignore_older & start_timestamp

If both properties are defined:

  • The initial run includes objects newer than start_timestamp (ignoring the ignore_older duration).
  • Subsequent runs only consider objects within the ignore_older duration.

image

Footnotes

  1. https://github.com/elastic/beats/pull/41817/files#diff-422765b7341c5bbf6de7af38927e34e00a5073b188585a7af3c4fee1175b64a6

  2. https://github.com/Kavindu-Dodan/data-gen

@Kavindu-Dodan Kavindu-Dodan added enhancement Team:obs-ds-hosted-services Label for the Observability Hosted Services team backport-8.x Automated backport to the 8.x branch with mergify labels Nov 27, 2024
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch 2 times, most recently from 4924d70 to 79ae2c1 Compare November 27, 2024 22:32
@Kavindu-Dodan Kavindu-Dodan requested a review from a team November 29, 2024 16:06
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch from 52fad61 to 6f5472c Compare December 3, 2024 23:06
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch from 6f5472c to ec00024 Compare December 6, 2024 16:58
@Kavindu-Dodan Kavindu-Dodan marked this pull request as ready for review December 6, 2024 17:12
@Kavindu-Dodan Kavindu-Dodan requested review from a team as code owners December 6, 2024 17:12
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch 2 times, most recently from 85f883e to fb4990b Compare December 6, 2024 19:53
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch from fb4990b to dab88c6 Compare December 6, 2024 20:48
@leehinman leehinman requested a review from faec December 9, 2024 16:29
@leehinman leehinman left a comment

LGTM, but I'd like @faec to take a look.

@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch from dab88c6 to 459e3c9 Compare December 31, 2024 15:37
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch from ce2786a to d678713 Compare January 2, 2025 20:49
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch from d678713 to 3a4c0bd Compare January 3, 2025 14:17
@bturquet bturquet added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jan 6, 2025
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Signed-off-by: Kavindu Dodanduwa <[email protected]>
Signed-off-by: Kavindu Dodanduwa <[email protected]>
Signed-off-by: Kavindu Dodanduwa <[email protected]>
Signed-off-by: Kavindu Dodanduwa <[email protected]>
Signed-off-by: Kavindu Dodanduwa <[email protected]>

# Conflicts:
#	x-pack/filebeat/input/awss3/s3_test.go
Signed-off-by: Kavindu Dodanduwa <[email protected]>
Signed-off-by: Kavindu Dodanduwa <[email protected]>
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch from 3a4c0bd to 82db670 Compare January 7, 2025 15:13
@rdner rdner removed their request for review January 7, 2025 15:42
@Kavindu-Dodan Kavindu-Dodan merged commit 4ba7d1c into elastic:main Jan 7, 2025
22 checks passed
mergify bot pushed a commit that referenced this pull request Jan 7, 2025
…wing better registry cleanups (#41817)

* add changelog entry

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* sort config entries

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* introduce ignore old and start timestamp configurations and document them

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* add filtering logic

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* filter tests

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* add component test for filtering and fix lint issues

Signed-off-by: Kavindu Dodanduwa <[email protected]>

# Conflicts:
#	x-pack/filebeat/input/awss3/s3_test.go

* add changelog entry

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* improve documentation

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* review changes - improve naming, change signature and improve documentation

Signed-off-by: Kavindu Dodanduwa <[email protected]>

---------

Signed-off-by: Kavindu Dodanduwa <[email protected]>
(cherry picked from commit 4ba7d1c)
@Kavindu-Dodan Kavindu-Dodan deleted the feat/s3-input-start-time-and-ignore-old branch January 7, 2025 19:24
Kavindu-Dodan added a commit that referenced this pull request Jan 7, 2025
…wing better registry cleanups (#41817) (#42246)


Co-authored-by: Kavindu Dodanduwa <[email protected]>
Labels
backport-8.x Automated backport to the 8.x branch with mergify enhancement Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team Team:obs-ds-hosted-services Label for the Observability Hosted Services team