ENH: show schemas difference when throwing InvalidSchema exception #350

artemrys · 2020-12-22T15:29:12Z

After this PR is merged, the to_gbq method will show a more detailed error description in append mode.

In particular, it will show at most 3 schema differences (both missing fields and fields with the same name but a different type) and will indicate how many more differences are left, if any.

In #349 there was a suggestion to use a set to compare schemas, but the elements are dicts, so they are not hashable. I decided to go with a more straightforward solution and compare the dataframe and BigQuery fields one by one. I do this only if the dataframe schema is not a subset of the BigQuery schema, so the number of operations on the successful path stays the same.
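For readers unfamiliar with the hashability issue mentioned above, here is a minimal sketch. The field lists are made-up sample data, and the tuple-based workaround is shown only as one possible alternative; it is not what this PR implements.

```python
# Sample schema fields (illustrative data only, not taken from the PR).
fields_local = [{"name": "A", "type": "FLOAT"}, {"name": "B", "type": "STRING"}]
fields_remote = [{"name": "A", "type": "INTEGER"}, {"name": "B", "type": "STRING"}]

try:
    set(fields_local)
    hashable = True
except TypeError:
    # dicts are mutable and define no __hash__, so they cannot be set members.
    hashable = False

# A possible alternative (an assumption, not the PR's approach):
# reduce each field to a hashable (name, type) tuple before building sets.
local_pairs = {(f["name"], f["type"]) for f in fields_local}
remote_pairs = {(f["name"], f["type"]) for f in fields_remote}
only_local = local_pairs - remote_pairs
```

A set difference like this says that something differs, but it does not by itself distinguish a missing field from a type mismatch, which is one reason a field-by-field comparison can produce clearer messages.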

Do you want me to add more system-level tests to verify different exception messages or is it enough to have them covered by unit tests?

closes Improve InvalidSchema exception by adding specific fields that do not match #349
tests added / passed
passes nox -s blacken lint
docs/source/changelog.rst entry

…_gbq``.

tswast

Thanks for the contribution! I left some feedback in the review. I'm concerned about the case where column orders don't line up.

tswast · 2021-01-08T15:51:31Z

pandas_gbq/schema.py

+    else:
+        diffs_left = len(schema_difference) - 3
+        schema_difference = schema_difference[:3]
+        if diffs_left != 0:


Doesn't this always evaluate to True? Might be throwing off our branch test-coverage numbers if so.

tswast · 2021-01-08T15:55:33Z

pandas_gbq/schema.py

+    if len(schema_difference) < 4:
+        diff_to_show = "\n".join(schema_difference)
+    else:
+        diffs_left = len(schema_difference) - 3
+        schema_difference = schema_difference[:3]
+        if diffs_left != 0:
+            schema_difference.append("And {} more left.".format(diffs_left))
+        diff_to_show = "\n".join(schema_difference)


Nit: Since the diff_to_show = "\n".join(schema_difference) is duplicated, we can refactor this a bit.

Suggested change

-    if len(schema_difference) < 4:
-        diff_to_show = "\n".join(schema_difference)
-    else:
-        diffs_left = len(schema_difference) - 3
-        schema_difference = schema_difference[:3]
-        if diffs_left != 0:
-            schema_difference.append("And {} more left.".format(diffs_left))
-        diff_to_show = "\n".join(schema_difference)
+    if len(schema_difference) > 3:
+        diffs_left = len(schema_difference) - 3
+        schema_difference = schema_difference[:3]
+        schema_difference.append("And {} more left.".format(diffs_left))
+    diff_to_show = "\n".join(schema_difference)
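The refactor suggested above could be sketched as a standalone helper along these lines. The function name `format_schema_difference` and the `max_shown` parameter are hypothetical and introduced only for illustration; the PR keeps this logic inline.

```python
def format_schema_difference(schema_difference, max_shown=3):
    """Join at most ``max_shown`` differences and note how many were omitted.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    # Copy the slice so the caller's list is not modified.
    shown = list(schema_difference[:max_shown])
    diffs_left = len(schema_difference) - max_shown
    # Only mention omitted entries when there actually are some, so no
    # always-true branch is left behind.
    if diffs_left > 0:
        shown.append("And {} more left.".format(diffs_left))
    return "\n".join(shown)
```

With a single join at the end, the short and long cases share one code path, which also avoids the redundant `diffs_left != 0` check.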

tswast · 2021-01-08T16:01:49Z

pandas_gbq/schema.py

+    """Calculates difference in dataframe and BigQuery schemas.
+
+    Compares dataframe and BigQuery schemas to identify exact differences
+    in each field (field can be missing in the dataframe or field can have


Missing fields in the dataframe should be OK, right? It's extra fields in the dataframe that can be a problem (unless we allow field addition, as requested here: #107)

tswast · 2021-01-08T16:07:11Z

pandas_gbq/schema.py

+    for field_remote in fields_remote:
+        for field_local in fields_local:
+            if field_local["name"] == field_remote["name"]:
+                if field_local["type"] != field_remote["type"]:


This might miss type mismatches if the order of the columns is different in either field_remote or field_local (or does _clean_schema_fields sort them?). I think we'd want:

Loop through only fields_local (dataframe)

Check if the field name isn't in fields_remote (table)

Check if the field types don't match
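A minimal sketch of the three steps listed above, assuming each field is a dict with "name" and "type" keys as in the unit tests. The function name and message wording are hypothetical, not the PR's actual code.

```python
def schema_difference(fields_local, fields_remote):
    """List the problems found in the dataframe fields, independent of column order.

    Hypothetical sketch: name, signature, and messages are assumptions.
    """
    # Index the table fields by name so column order cannot affect the result.
    remote_types = {field["name"]: field["type"] for field in fields_remote}
    differences = []
    # Loop through only the dataframe fields; table fields that are absent
    # from the dataframe are deliberately not reported.
    for field in fields_local:
        name, local_type = field["name"], field["type"]
        if name not in remote_types:
            differences.append(
                "Field '{}' is not in the table schema.".format(name)
            )
        elif local_type != remote_types[name]:
            differences.append(
                "Field '{}' has type {} in the dataframe but {} in the table.".format(
                    name, local_type, remote_types[name]
                )
            )
    return differences
```

This sketch compares type strings exactly; whether types should be normalized first (for example by case) is left open here.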

tswast · 2021-01-08T16:09:50Z

tests/unit/test_schema.py

+        ),
+        (
+            [
+                {"name": "A", "type": "FLOAT"},


I think we'll want some tests where the order of the columns doesn't line up. That way we can be sure we aren't showing errors that aren't actually errors.
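Tests of the kind requested here might look like the sketch below. `find_mismatches` is a hypothetical stand-in for the comparison under test, defined inline only so the example runs on its own; the real tests would call the function in pandas_gbq/schema.py instead.

```python
def find_mismatches(fields_local, fields_remote):
    """Hypothetical stand-in: names of dataframe fields that are missing from
    the table or have a different type, regardless of column order."""
    remote_types = {f["name"]: f["type"] for f in fields_remote}
    return [
        f["name"]
        for f in fields_local
        if remote_types.get(f["name"]) != f["type"]
    ]


def test_reordered_columns_are_not_reported():
    local = [{"name": "B", "type": "STRING"}, {"name": "A", "type": "FLOAT"}]
    remote = [{"name": "A", "type": "FLOAT"}, {"name": "B", "type": "STRING"}]
    # Same fields in a different order: nothing should be flagged.
    assert find_mismatches(local, remote) == []


def test_reordered_columns_with_type_mismatch():
    local = [{"name": "B", "type": "STRING"}, {"name": "A", "type": "FLOAT"}]
    remote = [{"name": "A", "type": "INTEGER"}, {"name": "B", "type": "STRING"}]
    # Only the genuine type mismatch on "A" should be flagged.
    assert find_mismatches(local, remote) == ["A"]


# Run the checks directly so the sketch works without a test runner.
test_reordered_columns_are_not_reported()
test_reordered_columns_with_type_mismatch()
```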

tswast

Oops. The last review was meant to be "request changes".

meredithslota · 2021-09-10T18:29:27Z

Checking in - this PR has sat for some time. Close? Revise?

artemrys · 2021-09-23T10:14:03Z

Hey, sorry, unfortunately I do not have time to work on this one.

artemrys marked this pull request as ready for review December 23, 2020 15:05

tswast approved these changes Jan 8, 2021

tswast requested changes Jan 8, 2021

product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Jul 17, 2021

tswast mentioned this pull request Sep 24, 2021

Pandas-gbq falsely claiming an invalid schema #390

Closed

tswast closed this Dec 23, 2021

artemrys deleted the feature/better-message-when-different-schemas branch December 31, 2021 14:55
