Let's say that I'm reading a "normal" Avro file using Spark. One of the fields in the Avro schema is a binary field encoded as EBCDIC that should be decoded using a COBOL copybook referenced by another field within the same schema.
Potentially each record can have its own copybook (so the binary field might have a different schema for each record), and the goal is to produce a JSON version of the binary field to store somewhere else.
The DF looks something like this:
| ID | SCHEMA_ID | BINARY_FIELD | FIELD1 | FIELD2 | ..... |
|----|-----------|--------------|--------|--------|-------|
| 1  | 001       | M1B1N4R11    | valueX | valueZ | ..    |
| 2  | 010       | M1B1N4R12    | valueY | valueW | ..    |
And in the folder `copycobol/` I have:
- `001.cob`
- `010.cob`
Question
Is it possible to leverage the library to decode a field instead of a file? Or do I have to save the binary field temporarily in a file and decode it from there?
Thank you for any suggestion! :)
Hi, thanks for your interest in the library.
Yes, it is possible to use Cobrix in this case, but it can be quite involved. You can't use the spark-cobol Spark data source to decode the data; you have to do it manually, like this:
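The original snippet is not shown here, but the manual decoding could look roughly like the sketch below. The exact package paths, the handler class name, and the `extractRecord` signature are assumptions that vary between Cobrix versions, so please check the Cobrix sources for the current API.

```scala
import za.co.absa.cobrix.cobol.parser.CopybookParser
import za.co.absa.cobrix.cobol.reader.extractors.record.RecordExtractors

// Read the copybook matching this record's SCHEMA_ID (e.g. copycobol/001.cob).
val copybookContents: String = scala.io.Source.fromFile("copycobol/001.cob").mkString

// Parse the copybook once; the parsed AST can be reused for every record
// that shares the same SCHEMA_ID.
val copybook = CopybookParser.parseTree(copybookContents)

// The EBCDIC-encoded payload taken from the BINARY_FIELD column.
val binaryField: Array[Byte] = ???

// Extract the record starting at offset 0. The handler maps raw COBOL
// fields to JVM objects; its concrete class name is version-dependent
// (see the Cobrix codebase), so this line is an assumption.
val handler = new SimpleRecordHandler
val record = RecordExtractors.extractRecord(copybook.ast, binaryField, 0, handler = handler)
```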
The resulting record will be an `Array[Any]`, and each subfield can be cast to the corresponding Java data type.
If you want the decoding to happen in parallel, handled by Spark SQL, you can write a UDF per field. Each UDF could hold a pre-parsed copybook and just apply `extractRecord()` and `handler.create()` to each value. The resulting output can be a JSON string. See how Jackson could be used to convert each record to JSON:
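The Jackson snippet is not shown here either; the sketch below illustrates the idea under stated assumptions. `decodeToMap` is a hypothetical helper standing in for the Cobrix decoding step above; only the Jackson and Spark UDF calls (`ObjectMapper.writeValueAsString`, `org.apache.spark.sql.functions.udf`) are real APIs.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: decode one BINARY_FIELD value using the pre-parsed
// copybook (extractRecord + handler.create, as described above) and return
// the decoded fields as a java.util.Map so Jackson can serialize it directly.
def decodeToMap(bytes: Array[Byte]): java.util.Map[String, Any] = ???

// A UDF that Spark applies to each row in parallel on the executors.
val decodeToJson = udf { (bytes: Array[Byte]) =>
  // ObjectMapper is not serializable, so it is created inside the UDF;
  // in practice you would cache one instance per executor.
  val mapper = new ObjectMapper()
  mapper.writeValueAsString(decodeToMap(bytes))
}

// Add a JSON column next to the original binary column.
val result = df.withColumn("BINARY_JSON", decodeToJson(col("BINARY_FIELD")))
```

Since each record may reference a different copybook, the UDF would need access to all pre-parsed copybooks (e.g. a map keyed by `SCHEMA_ID`) and pick the right one per row.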