Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Top level array of strings field not working on Athena #75

Open
billjohnston opened this issue Aug 24, 2021 · 1 comment
Open

Top level array of strings field not working on Athena #75

billjohnston opened this issue Aug 24, 2021 · 1 comment

Comments

@billjohnston
Copy link

billjohnston commented Aug 24, 2021

Having trouble getting a top level array field to work with Athena

Athena create table query:

CREATE EXTERNAL TABLE IF NOT EXISTS db.table (
    id STRING, 
    ...etc
    tags ARRAY<STRING>
)
ROW FORMAT SERDE 
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket'

parquetjs schema:

new ParquetSchema({
    id: { type: 'UTF8' },
    ...etc
    tags: { type: 'UTF8', repeated: true },
})

I'm able to upload and create the parquet file with this, and queries that don't include the tags field are working fine:

SELECT id from db.table 

But if I run any query that includes the tags field I get this error:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<path_to_the_parquet_file> (offset=0, length=8612):
org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO

I noticed there is a LIST field type that is supposed to work with Athena, but I'm not sure how I'd specify a top level list of strings

@billjohnston billjohnston changed the title Top level array field not working on athena on Athena Top level array field not working on Athena Aug 24, 2021
@billjohnston billjohnston changed the title Top level array field not working on Athena Top level array of strings field not working on Athena Aug 24, 2021
@lwillmeth
Copy link

lwillmeth commented Nov 4, 2021

Did you ever figure this out? I spent a couple hours today and finally came up with this schema:

new parquetJs.ParquetSchema({
    names: {
        type: 'LIST',
        fields: {
            list: {
                repeated: true,
                fields: {
                    element: {
                        type: 'UTF8'
                    }
                }
            }
        }
    }
})

I also had to prepare the data going into the names field by wrapping values with list and element, like so:

{
  names: {
    list: [
      { element: 'cathy' },
      { element: 'Zooplar the Magnificent' },
      { element: 'tim' }
    ]
}

Using this pattern, AWS Glue sees the names field as array<string>, and they show up in Athena queries.

It feels like this was a LOT harder than it should be, so if anyone knows of an easier way, please do let me know. 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants