Round trip tests & various fixes #42

Merged 12 commits into stac-utils:main on Apr 22, 2024

Conversation

kylebarron
Collaborator

@kylebarron kylebarron commented Apr 17, 2024

Change list

  • Adds a manual, recursive dict-equality check that tolerates some variations between the expected and result dicts:

    • Numbers may differ by up to a given precision.
    • key: None and a missing key are considered equivalent.
    • RFC 3339 date strings may vary in precision, as long as they
      represent the same parsed datetime.
  • Adds STAC items from 12 different collections (chosen mostly at random) from Planetary Computer.

  • Supports 3-dimensional bounding boxes.

  • Tests float bbox downcasting with an allowed precision; tests that the bbox round trip is exactly equal with no downcasting.

  • Casts null timestamp columns to the timestamp[millisecond] type.
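
The allowed variations in the first bullet can be sketched as a small recursive comparison. This is a minimal illustration and not the code merged in this PR: the names `assert_json_equal` and `_parse_rfc3339` and the default `precision` value are assumptions made for the example.

```python
from datetime import datetime
from typing import Any


def _parse_rfc3339(value: str) -> datetime:
    """Parse an RFC 3339 string; raises ValueError if it cannot be parsed."""
    # datetime.fromisoformat only accepts a trailing "Z" on Python 3.11+,
    # so normalize it to an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def assert_json_equal(
    result: Any, expected: Any, precision: float = 1e-7, key: str = "$"
) -> None:
    """Recursively compare two parsed JSON values with the tolerances above."""
    if isinstance(result, dict) and isinstance(expected, dict):
        for k in result.keys() | expected.keys():
            # `key: None` and a missing key are treated as equivalent,
            # because .get() returns None for a missing key.
            assert_json_equal(result.get(k), expected.get(k), precision, f"{key}.{k}")
    elif isinstance(result, list) and isinstance(expected, list):
        assert len(result) == len(expected), f"{key}: length mismatch"
        for i, (r, e) in enumerate(zip(result, expected)):
            assert_json_equal(r, e, precision, f"{key}[{i}]")
    elif isinstance(result, bool) or isinstance(expected, bool):
        # bool is a subclass of int, so handle it before the numeric branch.
        assert result is expected, f"{key}: {result!r} != {expected!r}"
    elif isinstance(result, (int, float)) and isinstance(expected, (int, float)):
        assert abs(result - expected) <= precision, f"{key}: {result} != {expected}"
    elif isinstance(result, str) and isinstance(expected, str):
        if result != expected:
            # Strings that differ as text may still be the same datetime.
            try:
                same = _parse_rfc3339(result) == _parse_rfc3339(expected)
            except ValueError:
                same = False
            assert same, f"{key}: {result!r} != {expected!r}"
    else:
        assert result == expected, f"{key}: {result!r} != {expected!r}"
```

With this sketch, `assert_json_equal({"title": None, "gsd": 0.3}, {"gsd": 0.3})` passes, while a number that differs by more than `precision` raises an `AssertionError` naming the offending key path.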

Closes #39


Original PR description:

Creating this as a stub to ask about some behavior in the round tripping.

In particular, the round tripping is working except for this diff:

--- orig.json	2024-04-17 16:53:13
+++ new.json	2024-04-17 16:53:14
@@ -1,167 +1,174 @@
 {
   "assets": {
     "image": {
       "eo:bands": [
         {
           "common_name": "red",
+          "description": null,
           "name": "Red"
         },
         {
           "common_name": "green",
+          "description": null,
           "name": "Green"
         },
         {
           "common_name": "blue",
+          "description": null,
           "name": "Blue"
         },
         {
           "common_name": "nir",
           "description": "near-infrared",
           "name": "NIR"
         }
       ],
       "href": "https://naipeuwest.blob.core.windows.net/naip/v002/pr/2022/pr_030cm_2022/18065/51/m_1806551_nw_20_030_20221212_20230329.tif",
       "roles": [
         "data"
       ],
       "title": "RGBIR COG tile",
       "type": "image/tiff; application=geotiff; profile=cloud-optimized"
     },
     "rendered_preview": {
       "href": "https://planetarycomputer.microsoft.com/api/data/v1/item/preview.png?collection=naip&item=pr_m_1806551_nw_20_030_20221212_20230329&assets=image&asset_bidx=image%7C1%2C2%2C3&format=png",
       "rel": "preview",
       "roles": [
         "overview"
       ],
       "title": "Rendered preview",
       "type": "image/png"
     },
     "thumbnail": {
       "href": "https://naipeuwest.blob.core.windows.net/naip/v002/pr/2022/pr_030cm_2022/18065/m_1806551_nw_20_030_20221212_20230329.200.jpg",
       "roles": [
         "thumbnail"
       ],
       "title": "Thumbnail",
       "type": "image/jpeg"
     },
     "tilejson": {
       "href": "https://planetarycomputer.microsoft.com/api/data/v1/item/tilejson.json?collection=naip&item=pr_m_1806551_nw_20_030_20221212_20230329&assets=image&asset_bidx=image%7C1%2C2%2C3&format=png",
       "roles": [
         "tiles"
       ],
       "title": "TileJSON with default rendering",
       "type": "application/json"
     }
   },
   "bbox": [
     -65.75386,
     18.183872,
     -65.683663,
     18.253643
   ],
   "collection": "naip",
   "geometry": {
     "coordinates": [
       [
         [
           -65.683663,
           18.184851
         ],
         [
           -65.684718,
           18.253643
         ],
         [
           -65.75386,
           18.25266
         ],
         [
           -65.752778,
           18.183872
         ],
         [
           -65.683663,
           18.184851
         ]
       ]
     ],
     "type": "Polygon"
   },
   "id": "pr_m_1806551_nw_20_030_20221212_20230329",
   "links": [
     {
       "href": "https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip",
       "rel": "collection",
+      "title": null,
       "type": "application/json"
     },
     {
       "href": "https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip",
       "rel": "parent",
+      "title": null,
       "type": "application/json"
     },
     {
       "href": "https://planetarycomputer.microsoft.com/api/stac/v1/",
       "rel": "root",
+      "title": null,
       "type": "application/json"
     },
     {
       "href": "https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip/items/pr_m_1806551_nw_20_030_20221212_20230329",
       "rel": "self",
+      "title": null,
       "type": "application/geo+json"
     },
     {
       "href": "https://planetarycomputer.microsoft.com/api/data/v1/item/map?collection=naip&item=pr_m_1806551_nw_20_030_20221212_20230329",
       "rel": "preview",
       "title": "Map of item",
       "type": "text/html"
     }
   ],
   "properties": {
-    "datetime": "2022-12-12T16:00:00Z",
+    "datetime": "2022-12-12T16:00:00.000000Z",
     "gsd": 0.3,
     "naip:state": "pr",
     "naip:year": "2022",
     "proj:bbox": [
       208796.4,
       2012712.9000000001,
       216113.69999999998,
       2020332.3
     ],
     "proj:centroid": {
       "lat": 18.21876,
       "lon": -65.71875
     },
     "proj:epsg": 26920,
     "proj:shape": [
       25398,
       24391
     ],
     "proj:transform": [
       0.3,
       0.0,
       208796.4,
       0.0,
       -0.3,
       2020332.3,
       0.0,
       0.0,
       1.0
     ],
     "providers": [
       {
         "name": "USDA Farm Service Agency",
         "roles": [
           "producer",
           "licensor"
         ],
         "url": "https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/"
       }
     ]
   },
   "stac_extensions": [
     "https://stac-extensions.github.io/eo/v1.0.0/schema.json",
     "https://stac-extensions.github.io/projection/v1.0.0/schema.json"
   ],
   "stac_version": "1.0.0",
   "type": "Feature"
 }

That is, because we coerce to a columnar representation, pyarrow by default generates null values for any keys that were not originally present. Should we add a step to remove any null value? One tricky part is that we don't want to remove any field that was originally null. Essentially, Arrow and Parquet have a null but no separate undefined. So when converting Arrow -> JSON, we can't necessarily know whether a null was present in the original JSON: the key could have existed and been set to null.
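
To make the null-versus-missing ambiguity concrete, here is a pure-Python stand-in for a row-to-column-to-row conversion. It is only an illustration of the idea; it does not use pyarrow, and the helper names `rows_to_columns` and `columns_to_rows` are hypothetical.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Build one column per key seen in any row, filling gaps with None."""
    keys: List[str] = []
    for row in rows:
        for k in row:
            if k not in keys:
                keys.append(k)
    return {k: [row.get(k) for row in rows] for k in keys}


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Rebuild row dicts; every row now carries every column's key."""
    n = len(next(iter(columns.values()), []))
    return [{k: col[i] for k, col in columns.items()} for i in range(n)]


bands = [
    {"common_name": "red", "name": "Red"},
    {"common_name": "nir", "description": "near-infrared", "name": "NIR"},
]
round_tripped = columns_to_rows(rows_to_columns(bands))

# A key that was explicitly null and a key that was absent both come back
# as None, so the original distinction cannot be recovered.
ambiguous = columns_to_rows(rows_to_columns([{"a": None}, {}]))
```

After the round trip, the "red" band has gained `"description": None`, matching the diff above, and both rows in `ambiguous` are `{"a": None}`.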

The other diff here is in the datetime string formatting. Both of these date strings conform to ISO 8601/RFC 3339, so it's probably ok to leave the output string as is?

Should we write more complex dict-equality logic in the tests? We could allow:

  • float precision differences in the bbox if downcast=True in to_arrow._convert_bbox_to_struct
  • datetime formatting differences
  • null values in the recreated JSON
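
As a rough illustration of the first item, the sketch below assumes that downcasting means storing bbox coordinates as 32-bit floats. That target type is an assumption; the PR text does not state it. The helper `to_float32` is hypothetical and uses only the standard library.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (float64) to the nearest float32 and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# bbox values taken from the NAIP item in the diff above
bbox = [-65.75386, 18.183872, -65.683663, 18.253643]
downcast = [to_float32(v) for v in bbox]

# Absolute error introduced by the float64 -> float32 -> float64 round trip
errors = [abs(a - b) for a, b in zip(bbox, downcast)]
```

For coordinates of this magnitude the error stays on the order of a few millionths of a degree, so a test comparing downcast values needs a tolerance, whereas without downcasting the values can be compared for exact equality.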

@TomAugspurger
Collaborator

Should we add a step to remove any null value

IMO, no. I think that would inevitably lead to the ambiguities you mentioned around not wanting to remove nulls that were originally there.

The other diff here is in the datetime string formatting. Both of these date strings conform to ISO 8601/RFC 3339, so it's probably ok to leave the output string as is?

Agreed. As long as it's valid (according to STAC), I'm OK with any formatting difference, provided we document any potential precision loss issues, if there are any.

Should we write more complex dict-equality logic in the tests? We could allow:

stac_geoparquet.utils has an assert_equal function that uses singledispatch. We could register a method for dictionaries (and datetimes, lists, ...) if needed.
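
For readers unfamiliar with the mechanism, here is a minimal sketch of how `functools.singledispatch` registration works. The `assert_equal` defined here is a stand-alone example and not the function in stac_geoparquet.utils; the registered types and the float tolerance are assumptions.

```python
from functools import singledispatch
from typing import Any


@singledispatch
def assert_equal(result: Any, expected: Any) -> None:
    """Fallback for unregistered types: plain equality."""
    assert result == expected, f"{result!r} != {expected!r}"


@assert_equal.register(list)
def _(result: list, expected: Any) -> None:
    # Compare element by element so nested values are dispatched too.
    assert isinstance(expected, list), f"expected a list, got {type(expected)}"
    assert len(result) == len(expected), "length mismatch"
    for r, e in zip(result, expected):
        assert_equal(r, e)


@assert_equal.register(float)
def _(result: float, expected: Any) -> None:
    # Illustrative tolerance only; a real test would make this configurable.
    assert abs(result - expected) <= 1e-9, f"{result} != {expected}"
```

Note that `singledispatch` selects the implementation from the type of the first argument only, so each registered function still has to check the type of `expected` itself.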

@kylebarron
Collaborator Author

stac_geoparquet.utils has an assert_equal function that uses singledispatch. We could register a method for dictionaries (and datetimes, lists, ...) if needed.

Is conversion between dict and pystac.Item always lossless? There's already an overload for pystac.Item equality... should we coerce the dict to a pystac.Item for equality?

Or maybe it would be better to create a new type for ItemDict that just wraps a dict but allows us to create a new overload based on this ItemDict type. (It seems bad to create a dispatch overload that works on any two dicts.)

@TomAugspurger
Collaborator

Or maybe it would be better to create a new type for ItemDict

I think either that, or if we do make an overload for dict then it should just do assert left == right (unless we want to try to dispatch on some key of the dictionary to make some special rules; but final fallback should be a plain left == right).
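
The two options discussed here can be sketched together. In this hypothetical example, `ItemDict` is a marker subclass of `dict` (one possible reading of "wraps a dict"), the special rule applied to it is invented for illustration, and plain dicts fall back to `left == right`.

```python
from functools import singledispatch
from typing import Any


class ItemDict(dict):
    """Hypothetical marker type: a dict known to hold a STAC item."""


@singledispatch
def assert_equal(left: Any, right: Any) -> None:
    assert left == right, f"{left!r} != {right!r}"


@assert_equal.register(dict)
def _(left: dict, right: Any) -> None:
    # Plain dicts get no special rules, only strict equality.
    assert left == right, f"{left!r} != {right!r}"


@assert_equal.register(ItemDict)
def _(left: ItemDict, right: Any) -> None:
    # Illustrative special rule: ignore top-level keys whose value is None.
    def strip(d: dict) -> dict:
        return {k: v for k, v in d.items() if v is not None}

    assert strip(left) == strip(right), f"{left!r} != {right!r}"
```

Because dispatch picks the most specific registered type, wrapping a value as `ItemDict(...)` opts in to the relaxed comparison, while two ordinary dicts are still compared strictly.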

@kylebarron kylebarron marked this pull request as ready for review April 19, 2024 18:34
@kylebarron kylebarron changed the title WIP: Round trip tests Round trip tests Apr 19, 2024
@kylebarron kylebarron changed the title Round trip tests Round trip tests & various fixes Apr 19, 2024
Collaborator

@TomAugspurger TomAugspurger left a comment

Thanks!

Looks good overall. Just two small questions on the assert equal definitions.

    Raises:
        AssertionError: If the two values are not equal
    """
    if isinstance(result, (list, tuple)) and isinstance(expected, (list, tuple)):
Collaborator

Do we care whether these should be considered equal?

assert_equal([1, 2], (1, 2))

IIUC, the current implementation considers them equal. If that's deliberate, might want to add it to the documented allowed variations.

Collaborator Author

I believe the JSON parser only creates lists, not tuples, so we can remove the tuples from the check here, if that's preferred

Collaborator Author

I separated the two checks in f50f64c (#42). Does that look ok to you?
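
The point about the JSON parser can be checked with the standard library. The sketch below also shows one way a list-only check could look; `assert_sequence_equal` is a hypothetical name and is not the code from commit f50f64c.

```python
import json
from typing import Any

# json serializes a tuple as a JSON array, and arrays always parse as lists.
parsed = json.loads(json.dumps({"bbox": (1, 2)}))


def assert_sequence_equal(result: Any, expected: Any) -> None:
    """Require both sides to be lists before comparing them."""
    assert isinstance(result, list), f"result is {type(result).__name__}, not list"
    assert isinstance(expected, list), f"expected is {type(expected).__name__}, not list"
    assert result == expected, f"{result!r} != {expected!r}"
```

With separate type checks, `assert_sequence_equal([1, 2], (1, 2))` fails with a message that says which side was not a list.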

    key_name: str,
) -> None:
    """Compare two JSON numbers"""
    assert abs(result - expected) <= precision, (
Collaborator

Can nan get to this point, and if so will this fail since nan != nan by definition?

Collaborator Author

I didn't think JSON was able to store NaN.

Collaborator

Seems to be not standardized, but used?

In [1]: import json

In [2]: json.dumps({"a": float("nan")})
Out[2]: '{"a": NaN}'

In [3]: json.loads(_)
Out[3]: {'a': nan}

The Python docs mention

It also understands NaN, Infinity, and -Infinity as their corresponding float values, which is outside the JSON spec.

I'd say if the JSON parser that we're using supports NaN (and IIUC this code path should only be hit by JSON that was parsed by pyarrow) then let's add code to handle NaN.

Collaborator Author

85b17ac (#42) allows NaN equality.
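
A NaN-aware number comparison might look like the sketch below. It mirrors the parameter names visible in the diff fragment above, but it is an assumption about the general approach and not the contents of commit 85b17ac.

```python
import json
import math


def assert_number_equal(
    result: float, expected: float, precision: float, key_name: str
) -> None:
    """Compare two JSON numbers, treating NaN as equal to NaN."""
    # Exact equality also covers Infinity == Infinity, where the
    # subtraction below would produce NaN.
    if result == expected:
        return
    if math.isnan(result) or math.isnan(expected):
        assert math.isnan(result) and math.isnan(expected), (
            f"{key_name}: {result} != {expected}"
        )
        return
    assert abs(result - expected) <= precision, (
        f"{key_name}: {result} != {expected} within {precision}"
    )


# Python's json module accepts the non-standard NaN literal.
value = json.loads('{"a": NaN}')["a"]
assert_number_equal(value, float("nan"), 1e-7, "a")
```

The early `result == expected` return is a design choice for this sketch: it keeps infinities comparable without a separate branch.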

@TomAugspurger
Collaborator

One thing I noticed about the implementation: it selects bbox fields by position. That might need updating based on the discussion in opengeospatial/geoparquet#202.

@kylebarron
Collaborator Author

kylebarron commented Apr 19, 2024

One thing I noticed about the implementation: it selects bbox fields by position.

Fixed in 5c028ae (#42)
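
To illustrate why selecting by name is more robust than selecting by position, here is a pure-Python sketch that converts a STAC bbox list to a named struct and back, for both 2D and 3D boxes. The field names (`xmin`, `ymin`, `zmin`, `xmax`, `ymax`, `zmax`) and the helper names are assumptions for the example, and this is not the code from commit 5c028ae.

```python
from typing import Dict, List


def bbox_to_struct(bbox: List[float]) -> Dict[str, float]:
    """Convert a 4- or 6-element bbox list into a dict of named fields."""
    if len(bbox) == 4:
        xmin, ymin, xmax, ymax = bbox
        return {"xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax}
    if len(bbox) == 6:
        # 3D order: all minimum coordinates first, then all maximums.
        xmin, ymin, zmin, xmax, ymax, zmax = bbox
        return {
            "xmin": xmin, "ymin": ymin, "zmin": zmin,
            "xmax": xmax, "ymax": ymax, "zmax": zmax,
        }
    raise ValueError(f"bbox must have 4 or 6 elements, got {len(bbox)}")


def struct_to_bbox(struct: Dict[str, float]) -> List[float]:
    """Rebuild the bbox list by field name, whatever the field order is."""
    if "zmin" in struct:
        names = ("xmin", "ymin", "zmin", "xmax", "ymax", "zmax")
    else:
        names = ("xmin", "ymin", "xmax", "ymax")
    return [struct[name] for name in names]


# Fields stored in a different order still round-trip correctly.
reordered = {"xmin": -65.75386, "xmax": -65.683663, "ymin": 18.183872, "ymax": 18.253643}
bbox = struct_to_bbox(reordered)
```

A positional lookup on `reordered` would have swapped `ymin` and `xmax`; looking fields up by name does not depend on how the struct happens to be laid out.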

@TomAugspurger TomAugspurger merged commit fd7c9a4 into stac-utils:main Apr 22, 2024
1 check passed
@TomAugspurger
Collaborator

Thanks!

Issue linked to this pull request: Test files with sequence of items