We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Having trouble getting a top level array field to work with Athena
Athena create table query:
CREATE EXTERNAL TABLE IF NOT EXISTS db.table ( id STRING, ...etc tags ARRAY<STRING> ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://bucket'
parquetjs schema:
new ParquetSchema({ id: { type: 'UTF8' }, ...etc tags: { type: 'UTF8', repeated: true }, })
I'm able to upload and create the parquet file with this, and queries that don't include the tags field are working fine:
tags
SELECT id from db.table
But if I run any query that includes the tags field I get this error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<path_to_the_parquet_file> (offset=0, length=8612): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO
I noticed there is a LIST field type that is supposed to work with Athena, but I'm not sure how I'd specify a top level list of strings
LIST
The text was updated successfully, but these errors were encountered:
Did you ever figure this out? I spent a couple hours today and finally came up with this schema:
new parquetJs.ParquetSchema({ names: { type: 'LIST', fields: { list: { repeated: true, fields: { element: { type: 'UTF8' } } } } } })
I also had to prepare the data going into the names field by wrapping values with list and element, like so:
list
element
{ names: { list: [ { element: 'cathy' }, { element: 'Zooplar the Magnificent' }, { element: 'tim' } ] }
Using this pattern, AWS Glue sees the names field as array<string>, and they show up in Athena queries.
array<string>
It feels like this was a LOT harder than it should be, so if anyone knows of an easier way, please do let me know. 😅
Sorry, something went wrong.
No branches or pull requests
Having trouble getting a top level array field to work with Athena
Athena create table query:
parquetjs schema:
I'm able to upload and create the parquet file with this, and queries that don't include the
tags
field are working fine:But if I run any query that includes the
tags
field I get this error:I noticed there is a
LIST
field type that is supposed to work with Athena, but I'm not sure how I'd specify a top level list of stringsThe text was updated successfully, but these errors were encountered: