Update usage and schema documentation #70
Merged
13 commits
148afaf
Allow pystac.Item input
kylebarron fd2d967
Add doc with schema considerations
kylebarron 60c8cc5
Add page with drawbacks
kylebarron cf20e98
drawbacks toc
kylebarron 15b1e31
Add usage file
kylebarron c181a07
Add pgstac page
kylebarron 3e85aa4
Update docs/usage.md
kylebarron e705de6
Update stac_geoparquet/arrow/_batch.py
kylebarron 1ed0285
Update docs/usage.md
kylebarron 3b3bfc2
fix typo
kylebarron f6174a5
wording
kylebarron 94e910b
Merge branch 'main' into kyle/docs-usage
kylebarron d02a69b
add example
kylebarron
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# pgstac integration | ||
|
||
`stac_geoparquet.pgstac_reader` has some helpers for working with items coming from a `pgstac.items` table. It takes care of | ||
|
||
- Rehydrating the dehydrated items | ||
- Partitioning by time | ||
- Injecting dynamic links and assets from a STAC API | ||
|
||
::: stac_geoparquet.pgstac_reader.CollectionConfig | ||
options: | ||
show_if_no_docstring: true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# Drawbacks | ||
|
||
Trying to represent STAC data in GeoParquet has some drawbacks. | ||
|
||
## Unable to represent undefined values | ||
|
||
Parquet is unable to represent the difference between _undefined_ and _null_, and so is unable to perfectly round-trip STAC data with _undefined_ values. | ||
|
||
In JSON a value can have one of three states: defined, undefined, or null. The `"b"` key in the next three examples illustrates this: | ||
|
||
Defined: | ||
|
||
```json | ||
{ | ||
"a": 1, | ||
"b": "foo" | ||
} | ||
``` | ||
|
||
Undefined: | ||
|
||
```json | ||
{ | ||
"a": 2 | ||
} | ||
``` | ||
|
||
Null: | ||
|
||
```json | ||
{ | ||
"a": 3, | ||
"b": null | ||
} | ||
``` | ||
|
||
Because Parquet is a columnar format, it is only able to represent undefined at the _column_ level. So if those three JSON items above were converted to Parquet, the column `"b"` would exist because it exists in the first and third item, and the second item would have `"b"` inferred as `null`: | ||
|
||
| a | b | | ||
| --- | ----- | | ||
| 1 | "foo" | | ||
| 2 | null | | ||
| 3 | null | | ||
|
||
Then when the second item is converted back to JSON, it will be returned as | ||
|
||
```json | ||
{ | ||
"a": 2, | ||
"b": null | ||
} | ||
``` | ||
|
||
which is not strictly equal to the input. | ||
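The round trip above can be reproduced without any Parquet library. The following is a minimal sketch that uses plain Python dictionaries as a stand-in for a columnar store; the helper names `to_columns` and `to_rows` are hypothetical and are not part of `stac_geoparquet`. It only illustrates why a missing key comes back as `null` once any other row defines that column.

```python
import json


def to_columns(items):
    """Pivot row-oriented dicts into columns, filling missing keys with None."""
    # A column exists as soon as any item defines the key, so "undefined"
    # can only be represented for keys that no item defines at all.
    names = []
    for item in items:
        for key in item:
            if key not in names:
                names.append(key)
    return {name: [item.get(name) for item in items] for name in names}


def to_rows(columns):
    """Pivot columns back into one dict per row."""
    n_rows = len(next(iter(columns.values()), []))
    return [
        {name: values[i] for name, values in columns.items()}
        for i in range(n_rows)
    ]


items = [{"a": 1, "b": "foo"}, {"a": 2}, {"a": 3, "b": None}]
round_tripped = to_rows(to_columns(items))

# The second item now carries an explicit null for "b"
print(json.dumps(round_tripped[1]))  # {"a": 2, "b": null}
```

The first and third items survive unchanged; only the item with the undefined key differs from its input.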
|
||
## Schema difficulties | ||
|
||
JSON is schemaless while Parquet requires a strict schema, and it can be very difficult to unite these two systems. This is such an important consideration that we have a [documentation page](./schema.md) just to discuss this point. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
# Schema considerations | ||
|
||
A STAC Item is a JSON object that describes an external geospatial dataset. The STAC specification defines a common core, plus a variety of extensions. Additionally, STAC Items may include custom extensions outside the common ones. Crucially, the majority of the specified fields in the core spec and extensions define optional keys. Those keys often differ across STAC collections and may even differ within a single collection across items. | ||
|
||
STAC's flexibility is a blessing and a curse. The flexibility of schemaless JSON allows for very easy writing as each object can be dumped separately to JSON. Every item is allowed to have a different schema. And newer items are free to have a different schema than older items in the same collection. But this write-time flexibility makes it harder to read as there are no guarantees (outside STAC's few required fields) about what fields exist. | ||
|
||
Parquet is the complete opposite of JSON. Parquet has a strict schema that must be known before writing can start. This puts the burden of work onto the writer instead of the reader. Reading Parquet is very efficient because the file's metadata defines the exact schema of every record. This also enables use cases like reading specific columns that would not be possible without a strict schema. | ||
|
||
This conversion from schemaless to strict-schema is the difficult part of converting STAC from JSON to GeoParquet, especially for large STAC datasets, which are often larger than memory. | ||
|
||
## Full scan over input data | ||
|
||
The most foolproof way to convert STAC JSON to GeoParquet is to perform a full scan over input data. This is done automatically by [`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow] when a schema is not provided. | ||
|
||
This is time-consuming because it requires two full passes over the input data: one to infer a common schema and another to actually write to Parquet (though items are never fully held in memory, allowing this process to scale). | ||
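The two-pass idea can be sketched with the standard library alone. This is an illustration of the approach, not the implementation used by `parse_stac_ndjson_to_arrow`; the helper names `infer_fields` and `normalize` and the sample records are hypothetical. The first pass records which top-level keys and JSON types occur, and the second pass streams records out with every inferred key present.

```python
import io
import json


def infer_fields(lines):
    """Pass 1: record the Python type names seen under each top-level key."""
    fields = {}
    for line in lines:
        for key, value in json.loads(line).items():
            fields.setdefault(key, set()).add(type(value).__name__)
    return fields


def normalize(lines, fields):
    """Pass 2: yield one record at a time, with every inferred key present."""
    for line in lines:
        record = json.loads(line)
        yield {key: record.get(key) for key in fields}


# Hypothetical newline-delimited input; a real run would open a file twice
ndjson = '{"id": "a", "gsd": 10}\n{"id": "b", "platform": "x"}\n'

# Only one line is parsed at a time in either pass, so memory use stays flat
fields = infer_fields(io.StringIO(ndjson))
records = list(normalize(io.StringIO(ndjson), fields))
```

Because the input is read twice, the cost is roughly double that of a single pass, which is the trade-off described above.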
|
||
## User-provided schema | ||
|
||
Alternatively, the user can pass in an Arrow schema themselves using the `schema` parameter of [`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow]. This `schema` must match the on-disk schema of the STAC JSON data. | ||
|
||
## Multiple schemas per collection | ||
|
||
It is also possible to write multiple Parquet files with STAC data where each Parquet file may have a different schema. This simplifies the conversion and writing process but makes reading and using the Parquet data harder. | ||
|
||
### Merging data with schema mismatch | ||
|
||
If you've created STAC GeoParquet data where the schema has changed between files, you can use [`pyarrow.concat_tables`][pyarrow.concat_tables] with `promote_options="permissive"` to combine multiple STAC GeoParquet files. | ||
|
||
```py | ||
import pyarrow as pa | ||
import pyarrow.parquet as pq | ||
|
||
table_1 = pq.read_table("stac1.parquet") | ||
table_2 = pq.read_table("stac2.parquet") | ||
combined_table = pa.concat_tables([table_1, table_2], promote_options="permissive") | ||
``` | ||
|
||
## Future work | ||
|
||
Schema handling is an area where future work can improve the reliability and ease of use of STAC GeoParquet. | ||
|
||
It's possible that in the future we could automatically infer an Arrow schema from the STAC specification's published JSON Schema files. If you're interested in this, open an issue and discuss. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,67 @@ | ||
# Usage | ||
|
||
[Apache Arrow](https://arrow.apache.org/) is used as the in-memory interchange format between all formats. While some end-to-end helper functions are provided, the user can go through Arrow objects for maximal flexibility in the conversion process. | ||
|
||
All functionality that goes through Arrow is currently exported via the `stac_geoparquet.arrow` namespace. | ||
|
||
## `dict`/JSON - Arrow conversion | ||
|
||
### Convert `dict`s to Arrow | ||
|
||
Use [`parse_stac_items_to_arrow`][stac_geoparquet.arrow.parse_stac_items_to_arrow] to convert STAC items either in memory or on disk to a stream of Arrow record batches. This accepts either an iterable of Python `dict`s or an iterable of [`pystac.Item`][pystac.Item] objects. | ||
|
||
For example: | ||
|
||
```py | ||
import pyarrow as pa | ||
import pystac | ||
|
||
import stac_geoparquet | ||
|
||
item = pystac.read_file( | ||
"https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2A_MSIL2A_20230112T104411_R008_T29NPE_20230113T053333" | ||
) | ||
assert isinstance(item, pystac.Item) | ||
|
||
record_batch_reader = stac_geoparquet.arrow.parse_stac_items_to_arrow([item]) | ||
table = record_batch_reader.read_all() | ||
``` | ||
|
||
### Convert JSON to Arrow | ||
|
||
[`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow] is a helper function to take one or more JSON or newline-delimited JSON files on disk, infer the schema from all of them, and convert the data to a stream of Arrow record batches. | ||
|
||
### Convert Arrow to `dict`s | ||
|
||
Use [`stac_table_to_items`][stac_geoparquet.arrow.stac_table_to_items] to convert a table or stream of Arrow record batches of STAC data to a generator of Python `dict`s. This accepts either a `pyarrow.Table` or a `pyarrow.RecordBatchReader`, which allows conversions of larger-than-memory files in a streaming manner. | ||
|
||
### Convert Arrow to JSON | ||
|
||
Use [`stac_table_to_ndjson`][stac_geoparquet.arrow.stac_table_to_ndjson] to convert a table or stream of Arrow record batches of STAC data to a newline-delimited JSON file. This accepts either a `pyarrow.Table` or a `pyarrow.RecordBatchReader`, which allows conversions of larger-than-memory files in a streaming manner. | ||
|
||
## Parquet | ||
|
||
Use [`to_parquet`][stac_geoparquet.arrow.to_parquet] to write STAC Arrow data from memory to a path or file-like object. This is a special function to ensure that [GeoParquet](https://geoparquet.org/) 1.0 or 1.1 metadata is written to the Parquet file. | ||
|
||
[`parse_stac_ndjson_to_parquet`][stac_geoparquet.arrow.parse_stac_ndjson_to_parquet] is a helper that connects reading (newline-delimited) JSON on disk to writing out to a Parquet file. | ||
|
||
No special API is required for reading a STAC GeoParquet file back into Arrow. You can use [`pyarrow.parquet.read_table`][pyarrow.parquet.read_table] or [`pyarrow.parquet.ParquetFile`][pyarrow.parquet.ParquetFile] directly to read the STAC GeoParquet data back into Arrow. | ||
|
||
## Delta Lake | ||
|
||
|
||
Use [`parse_stac_ndjson_to_delta_lake`][stac_geoparquet.arrow.parse_stac_ndjson_to_delta_lake] to read (newline-delimited) JSON on disk and write out to a Delta Lake table. | ||
|
||
No special API is required for reading a STAC Delta Lake table back into Arrow. You can use the [`DeltaTable`][deltalake.DeltaTable] class directly to read the data back into Arrow. | ||
|
||
!!! important | ||
Arrow has a null data type, where every value in the column is always null, but Delta Lake does not. This means that for any column inferred to have a `null` data type, writing to Delta Lake will error with | ||
``` | ||
_internal.SchemaMismatchError: Invalid data type for Delta Lake: Null | ||
``` | ||
|
||
This is a problem because if all items in a STAC Collection have a `null` JSON key, it gets inferred as an Arrow `null` type. For example, the `3dep-lidar-copc` collection in the tests has `start_datetime` and `end_datetime` fields, and so according to the spec, `datetime` is always `null`. This column would need to be cast to a timestamp type before being written to Delta Lake. | ||
|
||
This means we cannot write this collection to Delta Lake **solely with automatic schema inference**. | ||
|
||
In such cases, users may need to manually update the inferred schema to cast any `null` type to another Delta Lake-compatible type. |
I think that an example here (or on the index page?) from STAC -> geoparquet would be valuable.
We have the one currently at https://github.com/stac-utils/stac-geoparquet?tab=readme-ov-file#usage. Even an example using a simple `pystac.Item()` built locally would be good to have.
I added a simple example below