Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process logs in batches and expose source info #7

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

roman-mazur
Copy link

Previous implementation was listing all the objects in the bucket.
If you feed a considerably large bucket for the first time, this can
take too much time postponing actual events import and also blow
your machine memory.
Now we list at most @batch_size objects and start their processing.

Also this change exposes s3_bucket and s3_key that might be very
useful. Like path exposed by the file plugin.

NB!
I haven't tested how it now works with backup_bucket. Please let me know if you see any potential issues.
I would also make some tests, if I figured out the best way to mock s3 bucket input for this plugin...

Previous implementation was listing all the objects in the bucket.
If you feed a considerably large bucket for the first time, this can
take too much time postponing actual events import and also blow
your machine memory.
Now we list at most `@batch_size` objects and start their processing.

Also this change exposes `s3_bucket` and `s3_key` that might be very
useful. Like `path` exposed by the `file` plugin.
@ph ph self-assigned this Jan 12, 2015
@elasticsearch-release
Copy link

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run; then say 'jenkins, test it'.

@roman-mazur
Copy link
Author

So this PR was submitted long time ago. Since that moment new logstash versions has been released. And AFAIK source information is already exposed.
I need to understand if it makes sense for the owners to add batching feature to this input plugin. If yes, I will rebase and adapt. If no, let's close this PR.

@Freyert
Copy link

Freyert commented Jun 16, 2020

Batching would be tremendous. I honestly don't think they maintain this plugin anymore with how little feedback I've seen on other PRs.


return sorted_objects = objects.keys.sort {|a,b| objects[a] <=> objects[b]}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like you've eliminated the sort step. In cases where alphabetical order matches chronological order by last_modified time, this will work (since the S3 API always returns results in alphabetical order) but if not this creates a problem because the sincedb assumes that objects are always handled in the same order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants