Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for patternProperties #26

Open
pcorbel opened this issue Jun 24, 2019 · 7 comments
Open

Add support for patternProperties #26

pcorbel opened this issue Jun 24, 2019 · 7 comments

Comments

@pcorbel
Copy link
Contributor

pcorbel commented Jun 24, 2019

Issue:
If a tap have some fields not explicitly declared, but declared in a

"patternProperties": {
  ".+": {}
}

block, they won't be loaded into Redshift.
While, when using another target (like target-csv), the fields are available.

How to reproduce:

  • data.jsonl.txt
    ( a file with a schema and an example record)

  • issues.csv.txt
    A CSV generated by the following command
    cat data.jsonl | target-csv

Version:
Python 3.7.3
target-csv==0.3.0
target-redshift==0.0.7

Documentation:
The link to the target-csv flatten function

@AlexanderMann
Copy link
Collaborator

@pcorbel so I took a look at things in tap-csv and the way I think it actually works is that it simply takes the first record for a stream, and then uses that records keys for the headers and subsequent keys for all future records.

Since it's not doing any batching, it doesn't have any ability to add new keys as they come up.

So for instance, if you have:

{"patternProperties": {
  "a.+": {type: int}
}}

{a: 123}
{a_1: 456}
{a_2: 789}
...

You'll get:

a
123
NULL
NULL

All records will pass validation, but you'll still effectively lose the data in a_1 and a_2.

Is this the same as your experience and is this how you were expecting the support herein to work?

@pcorbel
Copy link
Contributor Author

pcorbel commented Jun 24, 2019

@AlexanderMann With the taps I work with, all keys are always represented, so I'll always have

{a: 123, a_1: None, a_2: None}
{a: None, a_1: 456, a_2: None}
{a: None, a_1: None, a_2: 789}

and I think a lot of API/taps are working that way.

I think it would still be perfectible but it would be an enormous improvement to implement that like the CSV target.

@pcorbel
Copy link
Contributor Author

pcorbel commented Jul 23, 2019

@AlexanderMann Hello Alexander, could you give me some pointers to where to begin to implement it target-csv way?

@AlexanderMann
Copy link
Collaborator

Hey @pcorbel. So there's an issue over in datamill-co/target-postgres#129 which deals with a similar issue you're describing here.

I'm thinking that your option of using target-csv and simply taking the fields off of the first record is a good alternative option.

Since this repo depends on target-postgres pretty heavily, whatever solution we come up with for target-redshift will most likely effect target-postgres.

(also, sorry about the tardy reply, vacation and then wrapping up a project with a client)

@AlexanderMann
Copy link
Collaborator

Also, @pcorbel forgot to ask, but what taps are you using which are leveraging patternProperties?

1 similar comment
@AlexanderMann
Copy link
Collaborator

Also, @pcorbel forgot to ask, but what taps are you using which are leveraging patternProperties?

@pcorbel
Copy link
Contributor Author

pcorbel commented Aug 5, 2019

@AlexanderMann I was using tap-jira and most of the fields in an issue was discarded because they were in a patternProperties field

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants