Potential thread safety issue with LzoDecompressor #106
Comments
I think this was fixed in #103
Just tried with the latest commit, but the problem remains.
Too bad. Does each thread read from a different file, or do multiple threads read from the same file? Anything you can add here to reproduce easily will be very useful.
So it looks like this is due to something that changed in Hadoop 2: when using the basic textFile method from Spark, it expects the input to be splittable (in my case the files are not indexed). Discussed on SO. Anyway, using the input format avoids this problem. Should I close this issue?
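For reference, the workaround described above (naming the input format explicitly instead of relying on sc.textFile) looks roughly like the sketch below in Spark's Java API. This is an untested illustration, not code from the project: the helper name `lzoLines` and the path argument are placeholders, and it assumes hadoop-lzo and Spark are on the classpath.

```java
import com.hadoop.mapreduce.LzoTextInputFormat; // from hadoop-lzo
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LzoReadSketch {
    // Hypothetical helper: read .lzo text files without letting Spark assume
    // the (non-indexed) input is splittable, by specifying the input format.
    public static JavaRDD<String> lzoLines(JavaSparkContext sc, String path) {
        JavaPairRDD<LongWritable, Text> records = sc.newAPIHadoopFile(
                path,
                LzoTextInputFormat.class,  // instead of sc.textFile(path)
                LongWritable.class,        // key: byte offset
                Text.class,                // value: line contents
                new Configuration());
        return records.map(pair -> pair._2().toString());
    }
}
```

With this, `lzoLines(sc, "hdfs:///path/to/data.lzo")` replaces `sc.textFile("hdfs:///path/to/data.lzo")` for non-indexed LZO input.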
So this was in fact because the reader tried to read from an arbitrary offset, right? Thanks for the update.
So will it be fixed in a future version? I hope sc.textFile can decompress and split any input file correctly and automatically.
I don't know if it has been fixed, but you can avoid this problem by using LzoTextInputFormat with the lower-level API methods, where you can specify the input format. @rangadi yeah, this is the problem. The reader thinks the input is splittable and tries to read at an arbitrary offset, which yields an invalid format. For small files that don't need to be split, in theory the problem should not happen.
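The "arbitrary offset yields an invalid format" failure mode is easy to demonstrate with any stream compressor that, like non-indexed LZO, cannot be entered mid-stream. The sketch below uses the JDK's GZIP classes purely as a stand-in for illustration (hadoop-lzo is not involved): decompressing from offset 0 works, while handing the decompressor a slice starting at an arbitrary mid-stream offset fails, which mirrors what happens when a splitter gives the middle of a non-indexed .lzo file to a reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitOffsetDemo {
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "line1\nline2\nline3\n".repeat(100).getBytes();
        byte[] compressed = compress(raw);

        // From offset 0 the decompressor sees a valid stream header and succeeds.
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed))) {
            System.out.println("from offset 0: " + in.readAllBytes().length + " bytes");
        }

        // From an arbitrary mid-stream offset the bytes are not a valid stream
        // start, so decompression fails -- the "invalid format" seen above.
        int offset = compressed.length / 2;
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(
                compressed, offset, compressed.length - offset))) {
            in.readAllBytes();
            System.out.println("from mid-stream offset: unexpectedly succeeded");
        } catch (IOException e) {
            System.out.println("from mid-stream offset: failed with "
                    + e.getClass().getSimpleName());
        }
    }
}
```

An indexed LZO file sidesteps this because the index records block boundaries, i.e. offsets at which a reader may legitimately start.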
I'm hitting the same problem. Has it been fixed in a later version? I am currently using hadoop-lzo-0.4.19.jar (it looks like it was published in 2011 or 2013, which is quite old).
The problem occurs when trying to read LZO-compressed files with Spark using sc.textFile(...).
But it works fine when using LzoTextInputFormat, with the same dataset and job config.
I encounter multiple errors, and sometimes a few others.
Those happen only when running multiple threads per JVM (multiple executor-cores).
We are using a snapshot version of 0.4.20 starting from this commit.
Thanks