Avro field names not getting parsed, just values #311

cjlyons81 · 2022-10-22T20:59:14Z

Hi, and thank you for sharing this code! So far it seems very promising and will reduce a lot of complexity in our pipelines. I have worked out how to use it from PySpark (pyspark 3.3.0, Scala 2.12), and this is my submit command:

spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,za.co.absa:abris_2.12:6.3.0 \
  main.py

This is my conversion (I put the code in a separate file, so I pass in the SparkContext):

df2 = df.withColumn("parsed", ab.from_avro("value", from_avro_abris_settings, sc))

My issue is that it doesn't seem to be parsing out the field names, just the values, with the exception of a single field name (Entry_ID):

"parsed": [
    "CDC.REMEDY.T1908",
    "U",
    "2022-10-22 16:29:49.011585",
    "2022-10-22 16:29:54.224000",
    "00000151760236354984",
    [
      "Entry_ID"
    ],
    {},
    "INC0001",
    "User123",
    1666454247,
......

Example of the matching schema entries:

 {
   "subject":"CDC.REMEDY.T1908-value",
   "version":1,
   "id":3,
   "schema":{
      "type":"record",
      "name":"T1908",
      "namespace":"CDC.REMEDY",
      "fields":[
         {
            "name":"table",
            "type":[
               "null",
               "string"
            ],
            "default":null
         },
         {
            "name":"op_type",
            "type":[
               "null",
               "string"
            ],
            "default":null
         },
         {
            "name":"op_ts",
            "type":[
               "null",
               "string"
            ],
            "default":null
         },
         {
            "name":"current_ts",
            "type":[
               "null",
               "string"
            ],
            "default":null
         },
         {
            "name":"pos",
            "type":[
               "null",
               "string"
            ],
            "default":null
         },
         {
            "name":"primary_keys",
            "type":[
               "null",
               {
                  "type":"array",
                  "items":"string"
               }
            ],
            "default":null
         },
         {
            "name":"tokens",
            "type":[
               "null",
               {
                  "type":"map",
                  "values":"string"
               }
            ],
            "default":null
         },
         {
            "name":"Entry_ID",
            "type":[
               "null",
               "string"
            ],
            "default":null
         },
         {
            "name":"Submitter",
            "type":[
               "null",
               "string"
            ],
            "default":null
         },
         {
            "name":"Submit_Date",
            "type":[
               "null",
               "long"
            ],
            "default":null
         },
.....

Am I missing something simple or is this some type of compatibility issue?

Thanks for any help/guidance you can offer!

Answered by kevinwallimann

Oct 24, 2022

kevinwallimann · 2022-10-24T09:04:23Z

Maintainer

Hi @cjlyons81
Thanks for your interest in our library. I am not sure I understand your issue correctly. What command do you use to arrive at the output?

"parsed": [
    "CDC.REMEDY.T1908",
    "U",
......

The field names are converted to DataFrame column names. You can see them by inspecting the schema, e.g. df2.printSchema(). Also, it is often convenient to select the fields directly instead of keeping the wrapper struct (parsed in your case). E.g. in Scala you can write:

val df2 = df.select(from_avro(col("value"), from_avro_abris_settings, sc) as 'data).select("data.*")
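
The thread never states which command produced the values-only output in the question, so the following is only a guess at the mechanism, not a diagnosis. PySpark's Row is a tuple subclass, and tuple-like records serialize to JSON as bare arrays unless they are converted to dicts first. A stdlib-only analogy, using a namedtuple as a hypothetical stand-in for a decoded row (this is not Spark code):

```python
import json
from collections import namedtuple

# Hypothetical stand-in for one decoded record. pyspark.sql.Row is likewise
# a tuple subclass, so it behaves the same way under json.dumps.
Record = namedtuple("Record", ["table", "op_type", "Entry_ID"])
rec = Record("CDC.REMEDY.T1908", "U", "INC0001")

# Serialized as a tuple: only the values survive, in positional order.
as_array = json.dumps(rec)

# Converted to a dict first: the field names are carried along.
as_object = json.dumps(rec._asdict())

print(as_array)   # ["CDC.REMEDY.T1908", "U", "INC0001"]
print(as_object)  # {"table": "CDC.REMEDY.T1908", "op_type": "U", "Entry_ID": "INC0001"}
```

In PySpark itself the counterpart of _asdict() would be Row.asDict(recursive=True), assuming the rows were collected to the driver before being dumped.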

1 reply

cjlyons81 Oct 24, 2022
Author

Thanks for getting back so quickly! Yes, I had assumed that the parsing simply places the deserialized object in the target column, mostly because of my lack of exposure to nested struct objects. I was able to get it to work with

df2 = df.withColumn("parsed", ab.from_avro("value", from_avro_abris_settings, sc)).select("parsed.*")

as you described. Such a simple fix ;)

Thanks again for the help and the code!
