-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: show schemas difference when throwing InvalidSchema exception #350
ENH: show schemas difference when throwing InvalidSchema exception #350
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! I left some feedback in the review. I'm concerned about the case where column orders don't line up.
else: | ||
diffs_left = len(schema_difference) - 3 | ||
schema_difference = schema_difference[:3] | ||
if diffs_left != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this always evaluate to True
? Might be throwing off our branch test-coverage numbers if so.
if len(schema_difference) < 4: | ||
diff_to_show = "\n".join(schema_difference) | ||
else: | ||
diffs_left = len(schema_difference) - 3 | ||
schema_difference = schema_difference[:3] | ||
if diffs_left != 0: | ||
schema_difference.append("And {} more left.".format(diffs_left)) | ||
diff_to_show = "\n".join(schema_difference) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Since the diff_to_show = "\n".join(schema_difference)
is duplicated, we can refactor this a bit.
if len(schema_difference) < 4: | |
diff_to_show = "\n".join(schema_difference) | |
else: | |
diffs_left = len(schema_difference) - 3 | |
schema_difference = schema_difference[:3] | |
if diffs_left != 0: | |
schema_difference.append("And {} more left.".format(diffs_left)) | |
diff_to_show = "\n".join(schema_difference) | |
if len(schema_difference) > 3: | |
diffs_left = len(schema_difference) - 3 | |
schema_difference = schema_difference[:3] | |
schema_difference.append("And {} more left.".format(diffs_left)) | |
diff_to_show = "\n".join(schema_difference) |
"""Calculates difference in dataframe and BigQuery schemas. | ||
|
||
Compares dataframe and BigQuery schemas to identify exact differences | ||
in each field (field can be missing in the dataframe or field can have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing fields in the dataframe should be OK, right? It's extra fields in the dataframe that can be a problem (unless we allow field addition, as requested here: #107)
for field_remote in fields_remote: | ||
for field_local in fields_local: | ||
if field_local["name"] == field_remote["name"]: | ||
if field_local["type"] != field_remote["type"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might miss type mismatches if the order of the columns is different in either field_remote
or field_local
(or does _clean_schema_fields
sort them?). I think we'd want:
- Loop through only
fields_local
(dataframe) - Check if the field name isn't in
fields_remote
(table) - Check if the field types don't match
), | ||
( | ||
[ | ||
{"name": "A", "type": "FLOAT"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll want some tests where the order of the columns doesn't line up. That way we can be sure we aren't showing errors that aren't actually errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Last review was meant to be request changes
Checking in - this PR has set for some time. Close? Revise? |
Hey, sorry, unfortunately, I do not have time to work on this one |
After merging this PR -
to_gbq
method will show a more detailed description of an error in case ofappend
mode.In particular, it will show at most 3 differences in schemas (both missing field and the same field but a different type) and will indicate how many more differences left (if any).
In #349 there was a suggestion to use
set
to compare schemas, but the elements havedict
type, so they are not hashable. So I decided to go with a more straightforward solution and compare dataframe and BigQuery fields one by one. And I am doing this only if the dataframe schema is not a subset of a BigQuery schema to leave the number of successful path operations the same.Do you want me to add more system-level tests to verify different exception messages or is it enough to have them covered by unit tests?
nox -s blacken lint
docs/source/changelog.rst
entry