
Processing .gz files #630

Open
srinubabuin opened this issue Jun 7, 2023 · 3 comments

Comments

@srinubabuin

Hi Team,
While processing .gz files with Cobrix, we get the following error:

There are some files in abc.gz that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (3018 bytes per record). Check the logs for the names of the files.

But my abc.gz contains only one file. Does Cobrix support processing .gz files? If not, can we pass an InputStream to Cobrix instead of a file path?

@yruslan
Collaborator

yruslan commented Jun 7, 2023

Hi @srinubabuin ,

No, compression is not supported, and neither are InputStreams (although I'm not 100% sure what you mean there).

The best option is to unpack the file first.
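Unpacking first can be done in plain JVM code before handing the file to Cobrix. A minimal sketch using only java.util.zip (the object name is illustrative; the 3018-byte record size is taken from the error message above):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzipUtil {
  // Compress a byte array with gzip (used here only to build test data).
  def gzip(data: Array[Byte]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val gz  = new GZIPOutputStream(bos)
    gz.write(data)
    gz.close()
    bos.toByteArray
  }

  // Decompress a gzipped byte array fully into memory.
  // For large files, stream to disk instead of buffering everything.
  def gunzip(data: Array[Byte]): Array[Byte] = {
    val in  = new GZIPInputStream(new ByteArrayInputStream(data))
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](8192)
    var n   = in.read(buf)
    while (n >= 0) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
    in.close()
    out.toByteArray
  }
}
```

Once unpacked, the payload length is again a multiple of the copybook record size, so the divisibility check passes.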

@srinubabuin
Author

Hi Yruslan,
https://github.com/AbsaOSS/cobrix/blob/master/spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala
In this code we are ultimately reading a BufferedFSDataInputStream constructed from filePath, so can I pass a BufferedFSDataInputStream directly instead of a filePath?

private var bufferedStream = new BufferedFSDataInputStream(getHadoopPath(filePath), fileSystem, startOffset, Constants.defaultStreamBufferInMB, maximumBytes)

@yruslan
Collaborator

yruslan commented Jun 7, 2023

Sorry, I'm not sure I understand. Keep in mind that the file will be read on the executors, not on the driver node, and you cannot pass a stream from the driver to an executor. You need to create the stream on the executor, and you can create it from the file path.

Alternatively, you can use RDDs to read and uncompress the input files, and then apply the record extractor to them. The example is called "Working example 3 - Using RDDs and record parsers directly" at https://github.com/AbsaOSS/cobrix
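A sketch of that RDD route: decompress each .gz payload on the executors and split it into fixed-length records (3018 bytes, per the error above). Only the splitting helper is runnable here; the Spark wiring is sketched in comments, and feeding the resulting records to a Cobrix parser follows the linked README example. All names are illustrative:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.GZIPInputStream

object GzRecordSplitter {
  // Decompress one gzipped file payload in memory.
  def gunzip(data: Array[Byte]): Array[Byte] = {
    val in  = new GZIPInputStream(new ByteArrayInputStream(data))
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](8192)
    var n   = in.read(buf)
    while (n >= 0) { out.write(buf, 0, n); n = in.read(buf) }
    in.close()
    out.toByteArray
  }

  // Split an uncompressed payload into fixed-length records.
  def splitRecords(data: Array[Byte], recordSize: Int): Seq[Array[Byte]] =
    data.grouped(recordSize).toSeq

  // In Spark this runs on the executors, e.g. (sketch):
  //   val records = spark.sparkContext
  //     .binaryFiles("/data/*.gz")
  //     .flatMap { case (_, pds) => splitRecords(gunzip(pds.toArray()), 3018) }
  // `records` can then be parsed with a Cobrix record parser as shown in
  // "Working example 3 - Using RDDs and record parsers directly".
}
```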
