Process logs in batches and expose source info #7

roman-mazur · 2015-01-12T11:48:33Z

The previous implementation listed all the objects in the bucket.
If you feed it a considerably large bucket for the first time, this can
take too long, delaying the actual event import, and can also exhaust
your machine's memory.
Now we list at most `@batch_size` objects and start processing them.
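
A minimal sketch of the batching idea, not the code in this PR. `S3Object` and `next_batch` are hypothetical names, and a plain `Enumerator` stands in for the bucket listing so the example runs without the AWS SDK. The point it illustrates is that only `batch_size` entries are ever pulled from the listing.

```ruby
# Hypothetical stand-in for an S3 object summary (not the AWS SDK type).
S3Object = Struct.new(:key, :last_modified)

# Take at most batch_size objects from a listing.
# Enumerable#first(n) stops iterating once n items have been collected,
# so the rest of a very large listing is never fetched.
def next_batch(listing, batch_size)
  listing.first(batch_size)
end

# Simulated listing of a huge bucket that counts how many objects were produced.
fetched = 0
listing = Enumerator.new do |yielder|
  1.upto(1_000_000) do |i|
    fetched += 1
    yielder << S3Object.new(format("logs/%07d.gz", i), Time.at(i))
  end
end

batch = next_batch(listing, 3)
puts batch.map(&:key).inspect
puts "objects listed: #{fetched}"
```

With a real bucket the same effect would presumably come from limiting the listing request itself; how the plugin does that is not shown here.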

This change also exposes `s3_bucket` and `s3_key`, which might be very
useful, like the `path` exposed by the `file` plugin.
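
As an illustration only, attaching the source fields could look like the sketch below. The helper name `with_s3_source` is hypothetical, and a plain Hash stands in for a Logstash event so the example runs without Logstash; only the field names `s3_bucket` and `s3_key` come from this PR.

```ruby
# Hypothetical helper, not the plugin's actual code.
# Returns a copy of the event with the S3 source information added.
def with_s3_source(event, bucket_name, object_key)
  event.merge("s3_bucket" => bucket_name, "s3_key" => object_key)
end

# Bucket name and key below are made-up example values.
event = with_s3_source({ "message" => "GET /index.html 200" },
                       "my-logs", "2015/01/12/access.log.gz")
puts event["s3_bucket"]
puts event["s3_key"]
```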

NB!
I haven't tested how this now works with `backup_bucket`. Please let me know if you see any potential issues.
I would also add some tests if I could figure out the best way to mock the S3 bucket input for this plugin.


elasticsearch-release · 2015-11-02T11:06:10Z

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run; then say 'jenkins, test it'.

roman-mazur · 2015-11-02T11:30:39Z

This PR was submitted a long time ago. Since then, new Logstash versions have been released, and AFAIK source information is already exposed.
I need to understand whether it makes sense to the owners to add a batching feature to this input plugin. If yes, I will rebase and adapt. If not, let's close this PR.

Freyert · 2020-06-16T21:38:59Z

Batching would be tremendous. I honestly don't think they maintain this plugin anymore with how little feedback I've seen on other PRs.

eherot · 2023-10-25T15:41:31Z

lib/logstash/inputs/s3.rb


-    return sorted_objects = objects.keys.sort {|a,b| objects[a] <=> objects[b]}


It seems like you've eliminated the sort step. In cases where alphabetical order matches chronological order by `last_modified` time, this will work (since the S3 API always returns results in alphabetical order). If the two orders differ, this creates a problem, because the sincedb assumes that objects are always handled in the same order.
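
To make the concern concrete, here is a small sketch, assuming `objects` maps each key to its last-modified time as the removed line suggests. The helper name is hypothetical, and the tie-break on the key is an addition of this sketch, not something in the original line.

```ruby
# Order keys by last-modified time, oldest first.
# Ties are broken by key so the order is deterministic (an assumption of
# this sketch; the removed line compared timestamps only).
def sort_keys_by_last_modified(objects)
  objects.keys.sort_by { |key| [objects[key], key] }
end

base = Time.at(0)
objects = {
  "a.log" => base + 20, # alphabetically first, but modified last
  "b.log" => base + 10,
  "c.log" => base + 10
}

puts objects.keys.sort.inspect                    # listing (alphabetical) order
puts sort_keys_by_last_modified(objects).inspect  # chronological order
```

Note that a global sort like this needs the whole listing in memory, which is the cost the batching change set out to avoid, so the two goals pull in opposite directions.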

ph self-assigned this Jan 12, 2015

jordansissel added the missing cla label Sep 22, 2015

roaksoax added the status:needs-triage label Apr 6, 2021

eherot reviewed Oct 25, 2023
