Round trip tests & various fixes #42

kylebarron · 2024-04-17T21:10:34Z

Change list

- Adds manual checking for recursive dict-equality, which allows for some variations between the expected and result dicts:
  - We allow numbers to differ within a given precision tolerance.
  - We consider key: None and a missing key to be equivalent.
  - We allow RFC3339 date strings with varying precision levels, as long as they
    represent the same parsed datetime.
- Adds STAC items from 12 different collections (chosen mostly at random) from Planetary Computer.
- Supports 3-dimensional bounding boxes.
- Tests float bbox downcasting with allowed precision; tests that the bbox round trip is exactly equal with no downcasting.
- Casts null timestamp columns to timestamp[millisecond] type.
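
For readers who want a concrete picture of the allowed variations listed above, here is a minimal sketch of such a recursive comparison. It is not the code added in this PR: the names assert_json_equal and parse_rfc3339, the default precision, and the path argument are all hypothetical and chosen only for illustration.

```python
from datetime import datetime
from typing import Any, Optional


def parse_rfc3339(value: str) -> Optional[datetime]:
    """Return a datetime if value parses as a timestamp, else None (hypothetical helper)."""
    if value.endswith("Z"):
        # datetime.fromisoformat only accepts a trailing "Z" on Python 3.11+.
        value = value[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


def is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def assert_json_equal(
    result: Any, expected: Any, *, precision: float = 1e-9, path: str = "$"
) -> None:
    """Recursively compare two parsed JSON values with the relaxations described above."""
    if isinstance(result, dict) and isinstance(expected, dict):
        for key in result.keys() | expected.keys():
            # dict.get returns None for a missing key, so "key: None" and
            # "key absent" compare as equivalent.
            assert_json_equal(
                result.get(key), expected.get(key),
                precision=precision, path=f"{path}.{key}",
            )
        return
    if isinstance(result, list) and isinstance(expected, list):
        assert len(result) == len(expected), f"{path}: list lengths differ"
        for i, (r, e) in enumerate(zip(result, expected)):
            assert_json_equal(r, e, precision=precision, path=f"{path}[{i}]")
        return
    if is_number(result) and is_number(expected):
        assert abs(result - expected) <= precision, f"{path}: {result} != {expected}"
        return
    if isinstance(result, str) and isinstance(expected, str) and result != expected:
        # Different strings are still accepted if both parse to the same instant.
        parsed_result = parse_rfc3339(result)
        assert parsed_result is not None and parsed_result == parse_rfc3339(expected), (
            f"{path}: {result!r} != {expected!r}"
        )
        return
    assert result == expected, f"{path}: {result!r} != {expected!r}"
```

With this sketch, a result that gains "title": None on a link and a datetime of "2022-12-12T16:00:00.000000Z" compares equal to an original with no title and "2022-12-12T16:00:00Z".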

Closes #39

Original PR description:

Creating this as a stub to ask about some behavior in the round tripping.

In particular, the round tripping is working except for this diff:

--- orig.json	2024-04-17 16:53:13
+++ new.json	2024-04-17 16:53:14
@@ -1,167 +1,174 @@
 {
   "assets": {
     "image": {
       "eo:bands": [
         {
           "common_name": "red",
+          "description": null,
           "name": "Red"
         },
         {
           "common_name": "green",
+          "description": null,
           "name": "Green"
         },
         {
           "common_name": "blue",
+          "description": null,
           "name": "Blue"
         },
         {
           "common_name": "nir",
           "description": "near-infrared",
           "name": "NIR"
         }
       ],
       "href": "https://naipeuwest.blob.core.windows.net/naip/v002/pr/2022/pr_030cm_2022/18065/51/m_1806551_nw_20_030_20221212_20230329.tif",
       "roles": [
         "data"
       ],
       "title": "RGBIR COG tile",
       "type": "image/tiff; application=geotiff; profile=cloud-optimized"
     },
     "rendered_preview": {
       "href": "https://planetarycomputer.microsoft.com/api/data/v1/item/preview.png?collection=naip&item=pr_m_1806551_nw_20_030_20221212_20230329&assets=image&asset_bidx=image%7C1%2C2%2C3&format=png",
       "rel": "preview",
       "roles": [
         "overview"
       ],
       "title": "Rendered preview",
       "type": "image/png"
     },
     "thumbnail": {
       "href": "https://naipeuwest.blob.core.windows.net/naip/v002/pr/2022/pr_030cm_2022/18065/m_1806551_nw_20_030_20221212_20230329.200.jpg",
       "roles": [
         "thumbnail"
       ],
       "title": "Thumbnail",
       "type": "image/jpeg"
     },
     "tilejson": {
       "href": "https://planetarycomputer.microsoft.com/api/data/v1/item/tilejson.json?collection=naip&item=pr_m_1806551_nw_20_030_20221212_20230329&assets=image&asset_bidx=image%7C1%2C2%2C3&format=png",
       "roles": [
         "tiles"
       ],
       "title": "TileJSON with default rendering",
       "type": "application/json"
     }
   },
   "bbox": [
     -65.75386,
     18.183872,
     -65.683663,
     18.253643
   ],
   "collection": "naip",
   "geometry": {
     "coordinates": [
       [
         [
           -65.683663,
           18.184851
         ],
         [
           -65.684718,
           18.253643
         ],
         [
           -65.75386,
           18.25266
         ],
         [
           -65.752778,
           18.183872
         ],
         [
           -65.683663,
           18.184851
         ]
       ]
     ],
     "type": "Polygon"
   },
   "id": "pr_m_1806551_nw_20_030_20221212_20230329",
   "links": [
     {
       "href": "https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip",
       "rel": "collection",
+      "title": null,
       "type": "application/json"
     },
     {
       "href": "https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip",
       "rel": "parent",
+      "title": null,
       "type": "application/json"
     },
     {
       "href": "https://planetarycomputer.microsoft.com/api/stac/v1/",
       "rel": "root",
+      "title": null,
       "type": "application/json"
     },
     {
       "href": "https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip/items/pr_m_1806551_nw_20_030_20221212_20230329",
       "rel": "self",
+      "title": null,
       "type": "application/geo+json"
     },
     {
       "href": "https://planetarycomputer.microsoft.com/api/data/v1/item/map?collection=naip&item=pr_m_1806551_nw_20_030_20221212_20230329",
       "rel": "preview",
       "title": "Map of item",
       "type": "text/html"
     }
   ],
   "properties": {
-    "datetime": "2022-12-12T16:00:00Z",
+    "datetime": "2022-12-12T16:00:00.000000Z",
     "gsd": 0.3,
     "naip:state": "pr",
     "naip:year": "2022",
     "proj:bbox": [
       208796.4,
       2012712.9000000001,
       216113.69999999998,
       2020332.3
     ],
     "proj:centroid": {
       "lat": 18.21876,
       "lon": -65.71875
     },
     "proj:epsg": 26920,
     "proj:shape": [
       25398,
       24391
     ],
     "proj:transform": [
       0.3,
       0.0,
       208796.4,
       0.0,
       -0.3,
       2020332.3,
       0.0,
       0.0,
       1.0
     ],
     "providers": [
       {
         "name": "USDA Farm Service Agency",
         "roles": [
           "producer",
           "licensor"
         ],
         "url": "https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/"
       }
     ]
   },
   "stac_extensions": [
     "https://stac-extensions.github.io/eo/v1.0.0/schema.json",
     "https://stac-extensions.github.io/projection/v1.0.0/schema.json"
   ],
   "stac_version": "1.0.0",
   "type": "Feature"
 }

That is, because we coerce to a columnar representation, pyarrow by default generates null values for any keys that were not originally present. Should we add a step to remove any null value? One tricky part is that we don't want to remove any field that was originally null. Essentially, Arrow/Parquet have a null but no separate notion of undefined. So when converting Arrow -> JSON, we can't know whether a null was present in the original JSON or was introduced by the conversion. The key could have existed and been set to null.
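
To make the ambiguity concrete, here is a small pure-Python sketch. The strip_nulls function is hypothetical (it is the kind of cleanup step being asked about, not something in the library), and the round-tripped dict is written out by hand to stand in for what comes back from the columnar representation.

```python
from typing import Any


def strip_nulls(value: Any) -> Any:
    """Recursively drop dict keys whose value is None (hypothetical cleanup step)."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(v) for v in value]
    return value


# Two different inputs: one omits "title", the other sets it to null explicitly.
link_without_title = {"rel": "root", "type": "application/json"}
link_with_null_title = {"rel": "root", "title": None, "type": "application/json"}

# After the round trip both inputs would look like this, because the column
# exists for every row and holds null wherever no value was supplied.
round_tripped = {"rel": "root", "title": None, "type": "application/json"}

cleaned = strip_nulls(round_tripped)
print(cleaned == link_without_title)    # True
print(cleaned == link_with_null_title)  # False: the explicit null was lost
```

Stripping restores the first input but silently changes the second, which is the trade-off in question.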

The other diff here is in the datetime string formatting. Both of these date strings conform to ISO 8601/RFC 3339, so it's probably ok to leave the output string as is?

Should we write more complex dict-equality logic in the tests? We could allow:

- Float precision differences in the bbox if downcast=True in to_arrow._convert_bbox_to_struct.
- Datetime formatting differences.
- Null values in the recreated JSON.
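
As a rough illustration of the first point, the sketch below shows how much a bbox coordinate moves when stored as float32. It uses only the standard library and assumes a plain round-to-nearest cast; the rounding that _convert_bbox_to_struct actually applies when downcasting may differ, and to_float32 is a hypothetical helper.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (float64) to the nearest float32 and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


xmin = -65.75386  # bbox value from the NAIP item above
downcast = to_float32(xmin)
error = abs(downcast - xmin)

# float32 has a 24-bit significand, so round-to-nearest gives a relative
# error of at most 2**-24 for values in the normal range.
bound = abs(xmin) * 2**-24
print(downcast != xmin, error <= bound)
```

A tolerance for the downcast case would need to be at least this bound, whereas the non-downcast round trip can be checked for exact equality.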

TomAugspurger · 2024-04-18T18:37:02Z

Should we add a step to remove any null value

IMO, no. I think that would inevitably lead to the ambiguities you mentioned around not wanting to remove nulls that were originally there.

The other diff here is in the datetime string formatting. Both of these date strings conform to ISO 8601/RFC 3339, so it's probably ok to leave the output string as is?

Agreed. As long as it's valid (according to STAC) I'm OK with any formatting difference, as long as we document any potential precision loss issues, if there are any.

Should we write more complex dict-equality logic in the tests? We could allow:

stac_geoparquet.utils has an assert_equal function that uses singledispatch. We could register a method for dictionaries (and datetimes, lists, ...) if needed.

kylebarron · 2024-04-18T19:19:13Z

stac_geoparquet.utils has an assert_equal function that uses singledispatch. We could register a method for dictionaries (and datetimes, lists, ...) if needed.

Is conversion between dict and pystac.Item always lossless? There's already an overload for pystac.Item equality... should we coerce the dict to a pystac.Item for equality?

Or maybe it would be better to create a new type for ItemDict that just wraps a dict but allows us to create a new overload based on this ItemDict type. (It seems bad to create a dispatch overload that works on any two dicts.)

TomAugspurger · 2024-04-19T13:57:39Z

Or maybe it would be better to create a new type for ItemDict

I think either that, or if we do make an overload for dict then it should just do assert left == right (unless we want to try to dispatch on some key of the dictionary to make some special rules; but final fallback should be a plain left == right).
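
A sketch of how the two options discussed here could sit side by side with functools.singledispatch. Everything in it is hypothetical: it does not reproduce the signature or behavior of stac_geoparquet.utils.assert_equal, ItemDict is the wrapper type floated above, and the item-specific rule is simplified to top-level keys only.

```python
from functools import singledispatch
from typing import Any


class ItemDict:
    """Hypothetical thin wrapper marking a dict as a STAC item."""

    def __init__(self, data: dict) -> None:
        self.data = data


@singledispatch
def assert_equal(result: Any, expected: Any) -> None:
    # Final fallback: plain equality.
    assert result == expected, f"{result!r} != {expected!r}"


@assert_equal.register(dict)
def _assert_dict_equal(result: dict, expected: Any) -> None:
    # Arbitrary dicts get no special rules, only left == right.
    assert result == expected, f"{result!r} != {expected!r}"


@assert_equal.register(ItemDict)
def _assert_item_dict_equal(result: ItemDict, expected: Any) -> None:
    # Item-specific rule (simplified): a missing top-level key equals None.
    assert isinstance(expected, ItemDict), "expected must also be an ItemDict"
    for key in result.data.keys() | expected.data.keys():
        left, right = result.data.get(key), expected.data.get(key)
        assert left == right, f"{key}: {left!r} != {right!r}"
```

Note that singledispatch dispatches on the type of the first argument only, so the ItemDict overload has to check the type of expected itself.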

TomAugspurger

Thanks!

Looks good overall. Just two small questions on the assert equal definitions.

TomAugspurger · 2024-04-19T20:32:19Z

tests/test_arrow.py

+    Raises:
+        AssertionError: If the two values are not equal
+    """
+    if isinstance(result, (list, tuple)) and isinstance(expected, (list, tuple)):


Do we care whether these should be considered equal?

assert_equal([1, 2], (1, 2))

IIUC, the current implementation considers them equal. If that's deliberate, we might want to add it to the documented allowed variations.

I believe the JSON parser only creates lists, not tuples, so we can remove the tuples from the check here, if that's preferred

I separated the two checks in f50f64c (#42). Does that look ok to you?
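
For reference, one way the list and tuple checks can be separated so that a list never compares equal to a tuple. This is only a sketch of the idea and not necessarily what f50f64c does; assert_sequence_equal is a hypothetical name.

```python
from typing import Any


def assert_sequence_equal(result: Any, expected: Any) -> None:
    """Compare two sequences, requiring both to be lists or both to be tuples."""
    if isinstance(result, list):
        assert isinstance(expected, list), "result is a list but expected is not"
    elif isinstance(result, tuple):
        assert isinstance(expected, tuple), "result is a tuple but expected is not"
    else:
        raise TypeError(f"unsupported type: {type(result).__name__}")

    assert len(result) == len(expected), "sequence lengths differ"
    for index, (left, right) in enumerate(zip(result, expected)):
        assert left == right, f"index {index}: {left!r} != {right!r}"
```

With this shape, assert_sequence_equal([1, 2], (1, 2)) raises an AssertionError instead of passing.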

TomAugspurger · 2024-04-19T20:35:07Z

tests/test_arrow.py

+    key_name: str,
+) -> None:
+    """Compare two JSON numbers"""
+    assert abs(result - expected) <= precision, (


Can nan get to this point, and if so will this fail since nan != nan by definition?

I didn't think JSON was able to store NaN.

Seems to be not standardized, but used?

In [1]: import json
In [2]: json.dumps({"a": float("nan")})
Out[2]: '{"a": NaN}'
In [3]: json.loads(_)
Out[3]: {'a': nan}

The Python docs mention

It also understands NaN, Infinity, and -Infinity as their corresponding float values, which is outside the JSON spec.

I'd say if the JSON parser that we're using supports NaN (and IIUC this code path should only be hit by JSON that was parsed by pyarrow) then let's add code to handle NaN.

85b17ac (#42) Allows NaN equality
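
A minimal sketch of what NaN handling in the number comparison can look like. It mirrors the parameters visible in the snippet under review (result, expected, precision, key_name) but is not the code from 85b17ac; the function name is hypothetical.

```python
import math


def assert_number_equal(
    result: float, expected: float, precision: float, key_name: str
) -> None:
    """Compare two JSON numbers, treating NaN as equal to NaN."""
    # nan != nan by definition, so handle the both-NaN case explicitly.
    if math.isnan(result) and math.isnan(expected):
        return
    # If only one side is NaN, the difference is NaN and "<=" is False,
    # so the assertion below still fails as it should.
    assert abs(result - expected) <= precision, (
        f"{key_name}: {result} != {expected} within {precision}"
    )
```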

TomAugspurger · 2024-04-19T20:46:35Z

One thing I noticed about the implementation, which is selecting bbox fields by position. Might need to update that based on the discussion in opengeospatial/geoparquet#202.

kylebarron · 2024-04-19T22:03:46Z

One thing I noticed about the implementation, which is selecting bbox fields by position.

Fixed in 5c028ae (#42)

TomAugspurger · 2024-04-22T13:29:18Z

Thanks!

kylebarron added 2 commits April 17, 2024 16:35

Add naip test

911469b

WIP round trip test

339bee9

kylebarron added 2 commits April 19, 2024 13:44

Merge branch 'main' into kyle/tests

fa7a63e

Fix tests

3fd1727

kylebarron marked this pull request as ready for review April 19, 2024 18:34

kylebarron changed the title ~~WIP: Round trip tests~~ Round trip tests Apr 19, 2024

Improve docstring

8c32560

kylebarron requested a review from TomAugspurger April 19, 2024 18:38

kylebarron added 2 commits April 19, 2024 15:14

Add 3dep lidar tests

b61f1e3

Add more test files

00144e7

kylebarron changed the title ~~Round trip tests~~ Round trip tests & various fixes Apr 19, 2024

kylebarron added 2 commits April 19, 2024 15:52

Merge branch 'main' into kyle/tests

5db7060

Ignore pystac client from mypy

7f7a53c

TomAugspurger approved these changes Apr 19, 2024

access struct fields by name

5c028ae

kylebarron added 2 commits April 19, 2024 18:06

Allow NaN equality

85b17ac

separate list and tuple checks

f50f64c

TomAugspurger merged commit fd7c9a4 into stac-utils:main Apr 22, 2024
1 check passed
