Potential thread safety issue with LzoDecompressor #106

Closed
EugenCepoi opened this issue May 20, 2015 · 8 comments

@EugenCepoi

The problem occurs when trying to read LZO-compressed files with Spark using sc.textFile(...).
But it works fine when using LzoTextInputFormat, with the same dataset and job config.

I encounter multiple errors like:

java.lang.InternalError: lzo1x_decompress_safe returned: -6
    at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
    at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:315)
    at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
    at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:252)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)

And sometimes a few like:

Compressed length 892154724 exceeds max block size 67108864 (probably corrupt file)
  at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:291)

These happen only when running multiple threads per JVM (multiple executor cores).
We are using a snapshot version of 0.4.20 starting from this commit.

Thanks

@EugenCepoi EugenCepoi changed the title Potential thread safety issue Potential thread safety issue with LzoDecompressor May 20, 2015
@rangadi
Contributor

rangadi commented May 20, 2015

I think this was fixed in #103

@EugenCepoi
Author

Just tried with the latest commit, but the problem remains.

@rangadi
Contributor

rangadi commented May 22, 2015

Too bad. Does each thread read from a different file, or do multiple threads read from the same file? Anything you can add here to reproduce this easily would be very useful.

@EugenCepoi
Author

So it looks like it is due to something that changed in Hadoop 2: when using the basic textFile method from Spark, it expects the input to be splittable (in my case the files are not indexed).

Discussed on SO. Anyway, using the input format avoids this problem.
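
For reference, a minimal sketch of that workaround in Spark (Scala), assuming this library's com.hadoop.mapreduce.LzoTextInputFormat and a hypothetical HDFS path:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // sc.textFile treats the .lzo files as splittable plain text and may start
    // reading at an arbitrary mid-stream offset; LzoTextInputFormat knows the
    // LZO block structure and only splits at valid boundaries (or not at all
    // when no .index file exists).
    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat]("hdfs:///data/*.lzo")
      .map { case (_, text) => text.toString }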

Should I close this issue?

@rangadi
Contributor

rangadi commented May 29, 2015

So this was in fact because the reader was trying to read from an arbitrary offset, right? Thanks for the update.

@LTzycLT

LTzycLT commented May 23, 2016

So will it be fixed in a future version? I hope sc.textFile can decompress and split any input file correctly and automatically.

@EugenCepoi
Author

EugenCepoi commented May 23, 2016

I don't know if it has been fixed, but you can use LzoTextInputFormat with the lower-level API methods that let you specify the input format, to avoid this problem.

@rangadi yeah, this is the problem. The reader thinks the input is splittable and tries to read at an arbitrary offset, which yields an invalid format. For small files that don't need to be split, in theory the problem should not happen.
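
If switching the input format is not an option, here is a sketch of a fallback based on the same reasoning (assuming sc.textFile honors the standard Hadoop 2 split-size key; the huge minimum split size should make each file a single split, i.e. the small-file case above):

    // Heavy-handed fallback: force one split per file so the reader never
    // seeks to an arbitrary offset inside the LZO stream. You lose
    // parallelism within a file; each file is read by a single task.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.minsize", Long.MaxValue)
    val lines = sc.textFile("hdfs:///data/*.lzo")

The proper fix for splittability is to index the .lzo files (hadoop-lzo ships LzoIndexer and DistributedLzoIndexer for this), which is what LzoTextInputFormat reads to split at block boundaries.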

@leafjungle

I am hitting the same problem. Is it fixed in a later version? I am currently using hadoop-lzo-0.4.19.jar (it looks like it was published in 2011 or 2013, which is quite old).
