s3 input plugin not handling shutdown correctly, leading to duplicates once started again #226
In the meantime I have made a modification to v3.4.1 on my own fork so that it does not stop in the middle of a file (i.e. my option (b) that I suggested above), and confirmed that it now works as expected when Logstash receives a shutdown signal while it is busy processing something like a 6 MB gzip-compressed, JSON-encoded file.
I made the modification on top of 3.4.1, since that was the version I was using earlier: v3.4.1...PadaKwaak:v342. Once the issue I've raised in #225 with 3.6.0 has been fixed, I'll see if I can create a pull request where the above modification can be enabled or disabled with a setting, so that people can choose whether to use it.
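For illustration, a minimal sketch of that kind of change (hypothetical names, not the actual diff on the fork): the shutdown flag is only checked between files, so the file currently being read is always finished and its sincedb entry written before Logstash exits.

```ruby
# Hypothetical sketch only, not the fork's actual code: the shutdown flag is
# honoured between objects rather than per line, so a file that is already
# being read is finished and its sincedb entry written before shutdown.
class S3PollerSketch
  def initialize
    @stop = false
    @sincedb = {}
  end

  # Called from the shutdown handler.
  def stop!
    @stop = true
  end

  def run(objects, queue)
    objects.each do |object|
      break if @stop                      # check only at file boundaries

      object[:lines].each { |line| queue << line }
      @sincedb[object[:key]] = Time.now   # per-file checkpoint, written once the file is done
    end
  end
end
```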
The duplication here is caused by an interruption before a file can update its checkpoint/timestamp in sincedb. The sincedb is updated per file, so when the pipeline restarts it processes the same file again. As it does not cause missing events, I do not classify this as a bug. A better way to handle it would be to enhance sincedb to keep a checkpoint per file and record the latest event id. The whole plugin is getting old and needs refactoring. From a system administration point of view it is crucial to keep the restart fast, so I would avoid stalling the shutdown.
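To make the suggested enhancement concrete, a per-file checkpoint record might look roughly like the following. This is a hypothetical sketch, not an existing plugin feature; the field names and values are examples only.

```ruby
require 'json'

# Hypothetical per-file checkpoint record (not an existing plugin feature):
# besides the per-file "last modified" timestamp, it remembers how far into
# the object processing got, so a restart could resume instead of re-reading
# the file from the beginning.
checkpoint = {
  'bucket'        => 'my-bucket',                  # example values only
  'key'           => 'logs/2021/05/big-file.json.gz',
  'last_modified' => '2021-05-01T10:00:00Z',
  'position'      => 1_048_576,                    # byte offset already emitted
  'last_event_id' => 'evt-000042'                  # id of the last event pushed downstream
}

File.write('.sincedb_s3_sketch', JSON.dump(checkpoint))
```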
The s3 input plugin does not store the position in the file it was busy processing when it detects that it should stop.
From the log file, you can see that it took the code path that stops reading in the middle of the file.
Because it simply stops processing and does not store the position it has already reached, when you start Logstash again it parses the same lines again, which leads to duplicates.
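To make that behaviour concrete, here is a simplified sketch of that kind of per-line shutdown check (hypothetical names, not the plugin's exact code):

```ruby
# Simplified sketch of a per-line shutdown check (not the plugin's exact code):
# when a stop is requested mid-file, the method returns without recording how
# far it got, so sincedb is never updated for this object and the whole file
# is read again on the next start.
def process_local_log_sketch(queue, lines, stop_requested)
  lines.each do |line|
    # Bail out immediately; nothing about the current position is persisted.
    return false if stop_requested.call

    queue << line
  end
  true  # only a fully processed file leads to a sincedb update by the caller
end
```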
I would have expected the S3 input plugin to either
a) continue processing until the end of the file and then update the sincedb file, or
b) stop processing immediately, before it reads the next line, and write the current position in the file to the sincedb file, like the file input plugin does (see the sketch after this list).
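For illustration, a sketch of what option (b) could look like (hypothetical helper names; the file input plugin has its own sincedb mechanism, and this is not its actual code):

```ruby
# Hypothetical sketch of option (b): stop before reading the next line, but
# persist the byte offset reached so far, so that a restart can seek past the
# lines that were already emitted.
def read_with_checkpoint(io, sincedb, key, stop_requested)
  io.seek(sincedb.fetch(key, 0))     # resume where the previous run stopped
  until io.eof?
    if stop_requested.call
      sincedb[key] = io.pos          # remember the position before bailing out
      return :stopped
    end
    yield io.readline
    sincedb[key] = io.pos            # could also be updated less often, e.g. per batch
  end
  :done
end
```

On restart, the stored offset would let the plugin skip the part of the file it has already emitted, at the cost of somewhat more frequent sincedb writes.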
Upload a very large file onto S3 and let the S3 plugin ingest the file.
While it is busy ingesting the file, shut Logstash down (e.g. by sending it SIGTERM).
You should then notice a warning in the log that the plugin stopped reading in the middle of the file.
When you then start Logstash again, it processes the same records again.