Add support for annotations #6

Open
raganhan opened this issue Aug 30, 2018 · 0 comments

Hive types have no concept of annotations or metadata, but we can use a special column to preserve the necessary information.

Original SerDe property proposal:

Specify the column that will hold all annotation information for the whole row. This column must be
declared in the table and will not be mapped to any value. The annotations are kept as a list of symbols, one per
column: each symbol is the column name, annotated with that column value's annotations in their original order.

Important: when mapping containers to a Hive collection type, e.g. array and map, this only works for top-level
annotations. Currently there is no mechanism to preserve the annotations of nested mapped values.

WITH SERDEPROPERTIES (
    "annotation.column" = "<column_name_to_keep_annotations>"
)

Example:

-- Ion Document
/*
{ age: year::32, height: cm::178 }
{ age: year::31, height: in::70 }
*/

CREATE EXTERNAL TABLE withAnnotations (
    age         int,
    height      int,
    annotations string
)
WITH SERDEPROPERTIES (
    "annotation.column" = "annotations"
)

-- resulting table 
| age   | height | annotations               | 
| ----- | ------ | ------------------------- |
| 32    | 178    | "[year::age, cm::height]" |
| 31    | 70     | "[year::age, in::height]" |
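
As a rough illustration of the encoding shown in the table above, the following Python sketch builds the annotation string for one row. It is not the SerDe implementation: the function name `encode_row_annotations` and the input shape (column name mapped to a list of annotation symbols) are hypothetical, and skipping columns that carry no annotations is an assumption the proposal does not spell out.

```python
def encode_row_annotations(row_annotations):
    """Build the annotation column value for one row.

    row_annotations: dict mapping column name -> list of annotation symbols,
    in the order they appear on the Ion value (hypothetical input shape).
    """
    entries = []
    for column, annotations in row_annotations.items():
        if not annotations:
            # Assumption: columns without annotations are left out of the list.
            continue
        # Ion text syntax: each annotation is followed by '::', ending with
        # the annotated symbol, which here is the column name.
        entries.append("::".join(list(annotations) + [column]))
    return "[" + ", ".join(entries) + "]"


# First row of the example: { age: year::32, height: cm::178 }
print(encode_row_annotations({"age": ["year"], "height": ["cm"]}))
# prints [year::age, cm::height]
```

Only the top-level annotations of each column are covered here, matching the limitation described above.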

Other considerations

  • The proposal above needs to be extended to support annotations on nested values, for example: {my_column: ["first", annon::"second"]}
  • Need to define a UDF to allow users to create queries that use annotations
  • Zack suggested using a Map instead of String to keep the annotation data. This makes it easier for users to manipulate the annotation column with existing UDFs
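
To make the Map suggestion a little more concrete, here is a small Python sketch of what the annotation column could look like as a map from column name to its list of annotations, and how a query might filter on it. The helper names `annotations_as_map` and `has_annotation` are hypothetical and only stand in for whatever UDF ends up being defined. In Hive this would presumably correspond to a `map<string, array<string>>` column, but that type choice is an assumption, not part of the proposal.

```python
def annotations_as_map(row_annotations):
    """Keep only annotated columns: column name -> list of annotation symbols."""
    return {col: list(anns) for col, anns in row_annotations.items() if anns}


def has_annotation(annotation_map, column, annotation):
    """Hypothetical stand-in for a lookup done with map/array UDFs in a query."""
    return annotation in annotation_map.get(column, [])


# The two rows from the example above, with a map-valued annotations column.
rows = [
    {"age": 32, "height": 178,
     "annotations": annotations_as_map({"age": ["year"], "height": ["cm"]})},
    {"age": 31, "height": 70,
     "annotations": annotations_as_map({"age": ["year"], "height": ["in"]})},
]

# Select only the heights that were recorded in centimeters.
heights_cm = [r["height"] for r in rows
              if has_annotation(r["annotations"], "height", "cm")]
print(heights_cm)  # prints [178]
```

Compared with the string encoding, the map form lets a filter like the one above work without parsing the annotation text first.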