Is it possible to read a nested Binary Field? #658

Open
Il-Pela opened this issue Feb 16, 2024 · 1 comment
Labels
question Further information is requested

Comments

Il-Pela commented Feb 16, 2024

Background

Let's say that I'm reading a "normal" AVRO file using Spark. One of the fields in the schema of this Avro is a binary field encoded as EBCDIC that should be decoded using a COBOL copybook referenced by another field within the same schema.
Each record can potentially have its own copybook (so the binary might have a different schema for each record), and the goal is to produce a JSON version of the binary field to store somewhere else.

The DF looks something like this:

ID  SCHEMA_ID  BINARY_FIELD  FIELD1  FIELD2  .....
1   001        M1B1N4R11     valueX  valueZ  ..
2   010        M1B1N4R12     valueY  valueW  ..

And in the folder copycobol/ I have:

  • 001.cob
  • 010.cob

Question

Is it possible to leverage the library to decode a field instead of a file? Or do I have to save the binary field temporarily in a file and decode it from there?

Thank you for any suggestion! :)

Il-Pela added the question label Feb 16, 2024
yruslan (Collaborator) commented Feb 19, 2024

Hi, thanks for the interest in the library.
Yes, it is possible to use Cobrix in this case, but it can be quite involved. You can't use the spark-cobol Spark data source to decode the data; you have to do it manually, like this:

  1. You need to parse each copybook to get an AST:
       val copybookForField1 = CopybookParser.parseSimple(copyBookContents)
  2. Then, you can decode each value by applying the parsed copybook to the binary field:
       val row = RecordExtractors.extractRecord(copybookForField1.ast, field1Bytes, 0, handler = handler)
       val record = handler.create(row.toArray, copybookForField1.ast)
     The resulting record will be an Array[Any], and each subfield can be cast to the corresponding Java data type.
  3. If you want decoding to happen in parallel, handled by Spark SQL, you can write a UDF per field. Each UDF could hold a pre-parsed copybook and just apply extractRecord() and handler.create() to each value. The resulting output can be a JSON string; see the sketch below for how Jackson could be used to convert each record to JSON.
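
For illustration, here is a minimal sketch of how these steps could be wired together as a Spark UDF. It assumes the Cobrix package path shown below, a hypothetical decodeRecord helper that stands in for the extractRecord()/handler.create() calls above, and the copycobol/ file layout from the question; this is not the definitive API, so adjust it to your Cobrix version.

    import scala.io.Source

    import org.apache.spark.sql.functions.{col, udf}

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    // Assumed Cobrix import -- adjust the package path to the Cobrix version you use.
    import za.co.absa.cobrix.cobol.parser.CopybookParser

    // Step 1: pre-parse one copybook per SCHEMA_ID so parsing happens only once,
    // mirroring the copycobol/ folder from the question.
    val copybooks = Map(
      "001" -> CopybookParser.parseSimple(Source.fromFile("copycobol/001.cob").mkString),
      "010" -> CopybookParser.parseSimple(Source.fromFile("copycobol/010.cob").mkString)
    )

    // Step 2: a hypothetical helper standing in for the RecordExtractors.extractRecord()
    // and handler.create() calls shown above; assumed to return the decoded subfields
    // as a map of field name -> value.
    def decodeRecord(schemaId: String, bytes: Array[Byte]): Map[String, Any] = {
      val copybook = copybooks(schemaId)
      // val row = RecordExtractors.extractRecord(copybook.ast, bytes, 0, handler = handler)
      // val record = handler.create(row.toArray, copybook.ast)
      // ... build the field name -> value map from `record` ...
      ???
    }

    // Step 3: a UDF that decodes the binary field and renders it as a JSON string
    // with Jackson (jackson-module-scala).
    val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

    val decodeToJson = udf { (schemaId: String, bytes: Array[Byte]) =>
      mapper.writeValueAsString(decodeRecord(schemaId, bytes))
    }

    // Usage: add a JSON column next to the original binary column.
    val decoded = df.withColumn("BINARY_JSON",
      decodeToJson(col("SCHEMA_ID"), col("BINARY_FIELD")))

Note that the parsed copybooks and the mapper are captured by the UDF closure, so in a real job you may want to broadcast them (or re-create them per executor) rather than rely on closure serialization.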

Let me know if you decide to do it and have any issues.
