Add documentation website (#65)
* Add documentation website

Co-authored-by: Pete Gadomski <[email protected]>
kylebarron and gadomski authored Jun 21, 2024
1 parent dd580cd commit e13f237
Showing 18 changed files with 303 additions and 82 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/continuous-integration.yml
@@ -22,7 +22,7 @@ jobs:
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: python -m pip install -e .[pgstac,pc,test]
run: python -m pip install -e .[pgstac,pc,test,docs]

- name: Run tests
run: pytest tests -v
@@ -32,3 +32,7 @@ jobs:

- name: Type check
run: mypy .

# Ensure docs build without warnings
- name: Check docs
run: mkdocs build --strict
46 changes: 46 additions & 0 deletions .github/workflows/deploy-mkdocs.yml
@@ -0,0 +1,46 @@
name: Publish docs via GitHub Pages

# Only run manually or on new tags starting with `v`
on:
push:
tags:
- "v*"
workflow_dispatch:

jobs:
build:
name: Deploy docs
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@v4
# We need to additionally fetch the gh-pages branch for mike deploy
with:
fetch-depth: 0

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: python -m pip install -e .[pgstac,pc,test,docs]

- name: Deploy docs
env:
GIT_COMMITTER_NAME: CI
GIT_COMMITTER_EMAIL: [email protected]
run: |
# Get most recent git tag
# https://stackoverflow.com/a/7261049
# We don't use {{github.ref_name}} because if triggered manually, it
# will be a branch name instead of a tag version.
VERSION=$(git describe --tags --abbrev=0)
# Only push docs if no letters in git tag after the first character
# (usually the git tag will have v as the first character)
if ! echo $VERSION | sed 's/^.//' | grep -q "[A-Za-z]"; then
mike deploy $VERSION latest --update-aliases --push
fi
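
The tag filter in the deploy step above can be exercised locally. This is a sketch that wraps the workflow's own `sed`/`grep` pipeline in a hypothetical `should_deploy` helper (the function name is illustrative, not part of the workflow):

```shell
# Deploy only when the tag is a plain version after its first character
# (e.g. v0.5.0); skip pre-releases that contain letters (e.g. v0.5.0a1).
should_deploy() {
  version="$1"
  # Strip the first character (usually the leading "v"), then look for letters.
  if ! echo "$version" | sed 's/^.//' | grep -q "[A-Za-z]"; then
    echo "deploy"
  else
    echo "skip"
  fi
}

should_deploy v0.5.0    # prints "deploy"
should_deploy v0.5.0a1  # prints "skip"
```

This is why the workflow uses `git describe --tags --abbrev=0` rather than `${{ github.ref_name }}`: on a manual `workflow_dispatch` run, `ref_name` would be a branch name, which would always contain letters and never deploy.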
2 changes: 2 additions & 0 deletions .gitignore
@@ -3,3 +3,5 @@ __pycache__
dist
.direnv
stac_geoparquet/_version.py
.cache
site
11 changes: 8 additions & 3 deletions README.md
@@ -1,10 +1,16 @@
# STAC-geoparquet

Convert STAC items to GeoParquet.
Convert [STAC](https://stacspec.org/en) items between JSON, [GeoParquet](https://geoparquet.org/), [pgstac](https://github.com/stac-utils/pgstac), and [Delta Lake](https://delta.io/).

## Purpose

This library helps convert [STAC Items](https://github.com/radiantearth/stac-spec/blob/master/overview.md#item-overview) to [GeoParquet](https://github.com/opengeospatial/geoparquet). While STAC Items are commonly distributed as individual JSON files on object storage or through a [STAC API](https://github.com/radiantearth/stac-api-spec), STAC GeoParquet allows users to access a large number of STAC items in bulk without making repeated HTTP requests.
The STAC spec defines a JSON-based schema, but managing and searching through many millions of STAC items in JSON format is hard: JSON is verbose on disk, and you must parse an entire Item's JSON into memory to extract even a small piece of information, say the `datetime` and one `asset` of an Item.

GeoParquet can be a good complement to JSON for many bulk-access and analytic use cases. While STAC Items are commonly distributed as individual JSON files on object storage or through a [STAC API](https://github.com/radiantearth/stac-api-spec), STAC GeoParquet allows users to access a large number of STAC items in bulk without making repeated HTTP requests.

For analytic questions like "find the items in the Sentinel-2 collection in June 2024 over New York City with cloud cover of less than 20%" it can be much, much faster to find the relevant data from a GeoParquet source than from JSON, because GeoParquet needs to load only the relevant columns for that query, not the full data.

See the [STAC-GeoParquet specification](./spec/stac-geoparquet-spec.md) for details on the exact schema of the written Parquet files.

## Usage

@@ -30,7 +36,6 @@ Note that `stac_geoparquet` lifts the keys in the item `properties` up to the to
>>> items2 = list(stac_geoparquet.arrow.stac_table_to_items(table2))
```
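
The property-lifting behavior the README notes above can be illustrated with a small, hypothetical helper — this is not part of the library's API, just a sketch of what `stac_geoparquet` does internally when flattening an Item:

```python
def lift_properties(item: dict) -> dict:
    """Sketch: move keys from an Item's `properties` to the top level."""
    flat = {k: v for k, v in item.items() if k != "properties"}
    flat.update(item.get("properties", {}))
    return flat


# A minimal, illustrative STAC Item fragment:
item = {
    "type": "Feature",
    "id": "item-1",
    "properties": {"datetime": "2024-06-21T00:00:00Z", "eo:cloud_cover": 12.5},
}

flat = lift_properties(item)
# flat now has `datetime` and `eo:cloud_cover` as top-level keys,
# and no nested `properties` object.
```

Because of this lifting, a property whose name collides with a top-level Item key (e.g. `id`) cannot be represented, which is why the spec disallows such collisions.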

See the [specification](./spec/stac-geoparquet-spec.md) for details on the output stac-geoparquet dataset.

## pgstac integration

5 changes: 5 additions & 0 deletions docs/api/arrow.md
@@ -0,0 +1,5 @@
# `stac_geoparquet.arrow`

Arrow-based format conversions.

::: stac_geoparquet.arrow
7 changes: 7 additions & 0 deletions docs/api/legacy.md
@@ -0,0 +1,7 @@
# Direct GeoPandas conversion (Legacy)

The API listed here was the initial non-Arrow-based STAC-GeoParquet implementation, converting between JSON and GeoPandas directly. For large collections of STAC items, using the new Arrow-based functionality (under the `stac_geoparquet.arrow` namespace) will be more performant.

::: stac_geoparquet.to_geodataframe
::: stac_geoparquet.to_item_collection
::: stac_geoparquet.to_dict
1 change: 1 addition & 0 deletions docs/index.md
1 change: 1 addition & 0 deletions docs/spec/stac-geoparquet-spec.md
1 change: 1 addition & 0 deletions docs/usage.md
@@ -0,0 +1 @@
# Usage
132 changes: 132 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,132 @@
site_name: stac-geoparquet
repo_name: stac-geoparquet
repo_url: https://github.com/stac-utils/stac-geoparquet
site_description: Convert STAC items between JSON, GeoParquet, pgstac, and Delta Lake.
# Note: trailing slash recommended with mike:
# https://squidfunk.github.io/mkdocs-material/setup/setting-up-versioning/#publishing-a-new-version
site_url: https://stac-utils.github.io/stac-geoparquet/
docs_dir: docs

extra:
social:
- icon: "fontawesome/brands/github"
link: "https://github.com/stac-utils"
version:
provider: mike

nav:
- index.md
- usage.md
- Specification: spec/stac-geoparquet-spec.md
- API Reference:
- api/arrow.md
- Legacy: api/legacy.md
# - api/pgstac.md

watch:
- stac_geoparquet
- docs

theme:
name: material
palette:
# Palette toggle for automatic mode
- media: "(prefers-color-scheme)"
toggle:
icon: material/brightness-auto
name: Switch to light mode

# Palette toggle for light mode
- media: "(prefers-color-scheme: light)"
primary: deep purple
accent: indigo
toggle:
icon: material/brightness-7
name: Switch to dark mode

# Palette toggle for dark mode
- media: "(prefers-color-scheme: dark)"
scheme: slate
primary: deep purple
accent: indigo
toggle:
icon: material/brightness-4
name: Switch to system preference

font:
text: Roboto
code: Roboto Mono

features:
- content.code.annotate
- content.code.copy
- navigation.indexes
- navigation.instant
- navigation.tracking
- search.suggest
- search.share

plugins:
- search
- social
- mike:
alias_type: "copy"
canonical_version: "latest"
- mkdocstrings:
enable_inventory: true
handlers:
python:
options:
docstring_section_style: list
docstring_style: google
line_length: 80
separate_signature: true
show_root_heading: true
show_signature_annotations: true
show_source: false
show_symbol_type_toc: true
signature_crossrefs: true
extensions:
- griffe_inherited_docstrings

import:
- https://arrow.apache.org/docs/objects.inv
- https://delta-io.github.io/delta-rs/objects.inv
- https://docs.python.org/3/objects.inv
- https://geoarrow.github.io/geoarrow-rs/python/latest/objects.inv
- https://geopandas.org/en/stable/objects.inv
- https://numpy.org/doc/stable/objects.inv
- https://pandas.pydata.org/pandas-docs/stable/objects.inv
- https://pystac.readthedocs.io/en/stable/objects.inv
- https://shapely.readthedocs.io/en/stable/objects.inv

# https://github.com/developmentseed/titiler/blob/50934c929cca2fa8d3c408d239015f8da429c6a8/docs/mkdocs.yml#L115-L140
markdown_extensions:
- admonition
- attr_list
- codehilite:
guess_lang: false
- def_list
- footnotes
- md_in_html
- pymdownx.arithmatex
- pymdownx.betterem
- pymdownx.caret:
insert: false
- pymdownx.details
- pymdownx.emoji:
emoji_index: !!python/name:material.extensions.emoji.twemoji
emoji_generator: !!python/name:material.extensions.emoji.to_svg
- pymdownx.escapeall:
hardbreak: true
nbsp: true
- pymdownx.magiclink:
hide_protocol: true
repo_url_shortener: true
- pymdownx.smartsymbols
- pymdownx.superfences
- pymdownx.tasklist:
custom_checkbox: true
- pymdownx.tilde
- toc:
permalink: true
8 changes: 8 additions & 0 deletions pyproject.toml
@@ -31,6 +31,14 @@ source = "vcs"
version-file = "stac_geoparquet/_version.py"

[project.optional-dependencies]
docs = [
"black",
"griffe-inherited-docstrings",
"mike>=2",
"mkdocs-material[imaging]>=9.5",
"mkdocs",
"mkdocstrings[python]>=0.25.1",
]
pgstac = [
"fsspec",
"psycopg[binary,pool]",
43 changes: 21 additions & 22 deletions spec/stac-geoparquet-spec.md
@@ -8,9 +8,9 @@ library, but aims to provide guidance for anyone putting STAC data into GeoParqu

## Use cases

* Provide a STAC GeoParquet that mirrors a static Collection as a way to query the whole dataset instead of reading every specific GeoJSON file.
* As an output format for STAC API responses that is more efficient than paging through thousands of pages of GeoJSON.
* Provide efficient access to specific fields of a STAC item, thanks to Parquet's columnar format.
- Provide a STAC GeoParquet that mirrors a static Collection as a way to query the whole dataset instead of reading every specific GeoJSON file.
- As an output format for STAC API responses that is more efficient than paging through thousands of pages of GeoJSON.
- Provide efficient access to specific fields of a STAC item, thanks to Parquet's columnar format.

## Guidelines

Expand All @@ -19,7 +19,7 @@ from JSON into nested structures. We do pull the properties to the top level, so
most of the fields should be the same in STAC and in GeoParquet.

| Field | GeoParquet Type | Required | Details |
|--------------------|----------------------|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ------------------ | -------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| type | String | Optional | This is just needed for GeoJSON, so it is optional and not recommended to include in GeoParquet |
| stac_extensions | List of Strings | Required | This column is required, but can be empty if no STAC extensions were used |
| id | String | Required | Required, should be unique within each collection |
@@ -30,20 +30,20 @@ most of the fields should be the same in STAC and in GeoParquet.
| collection | String | Optional | The ID of the collection this Item is a part of. See notes below on 'Collection' and 'Collection JSON' in the Parquet metadata |
| _property columns_ | _varies_ | - | Each property should use the relevant Parquet type, and be pulled out of the properties object to be a top-level Parquet field |

* Must be valid GeoParquet, with proper metadata. Ideally the geometry types are defined and as narrow as possible.
* Strongly recommend to only have one GeoParquet per STAC 'Collection'. Not doing this will lead to an expanded GeoParquet schema (the union of all the schemas of the collection) with lots of empty data
* Any field in 'properties' of the STAC item should be moved up to be a top-level field in the GeoParquet.
* STAC GeoParquet does not support properties that are named such that they collide with a top-level key.
* datetime columns should be stored as a [native timestamp][timestamp], not as a string
* The Collection JSON should be included in the Parquet metadata. See [Collection JSON](#collection-json) below.
* Any other properties that would be stored as GeoJSON in a STAC JSON Item (e.g. `proj:geometry`) should be stored as a binary column with WKB encoding. This simplifies the handling of collections with multiple geometry types.
- Must be valid GeoParquet, with proper metadata. Ideally the geometry types are defined and as narrow as possible.
- Strongly recommend to only have one GeoParquet per STAC 'Collection'. Not doing this will lead to an expanded GeoParquet schema (the union of all the schemas of the collection) with lots of empty data
- Any field in 'properties' of the STAC item should be moved up to be a top-level field in the GeoParquet.
- STAC GeoParquet does not support properties that are named such that they collide with a top-level key.
- datetime columns should be stored as a [native timestamp][timestamp], not as a string
- The Collection JSON should be included in the Parquet metadata. See [Collection JSON](#including-a-stac-collection-json-in-a-stac-geoparquet-collection) below.
- Any other properties that would be stored as GeoJSON in a STAC JSON Item (e.g. `proj:geometry`) should be stored as a binary column with WKB encoding. This simplifies the handling of collections with multiple geometry types.

### Link Struct

The GeoParquet dataset can contain zero or more Link Structs. Each Link Struct has 2 required fields and 2 optional ones:

| Field Name | Type | Description |
|------------|--------|-------------------------------------------------------------------------------------------------------------------------------------|
| ---------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------- |
| href | string | **REQUIRED.** The actual link in the format of an URL. Relative and absolute links are both allowed. |
| rel | string | **REQUIRED.** Relationship between the current document and the linked document. See chapter "Relation types" for more information. |
| type | string | [Media type][media-type] of the referenced entity. |
@@ -56,7 +56,7 @@ See [Link Object][link] for more.
The GeoParquet dataset can contain zero or more Asset Structs. Each Asset Struct can have the following fields:

| Field Name | Type | Description |
|-------------|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ----------- | --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| href | string | **REQUIRED.** URI to the asset object. Relative and absolute URI are both allowed. |
| title | string | The displayed title for clients and users. |
| description | string | A description of the Asset providing additional details, such as how it was processed or created. [CommonMark 0.29](http://commonmark.org/) syntax MAY be used for rich text representation. |
@@ -82,19 +82,18 @@ A common use case of stac-geoparquet is to create a mirror of a STAC collection.

For example:

| Field Name | Type | Value |
|-------------|-----------|-------------------------------------|
| href | string | s3://example/uti/to/file.geoparquet |
| title | string | An example STAC geoparquet. |
| description | string | Example description. |
| type | string | application/vnd.apache.parquet |
| roles | \[string] | [collection-mirror]\* |
| Field Name | Type | Value |
| ----------- | --------- | -------------------------------- |
| href | string | s3://example/uri/to/file.parquet |
| title | string | An example STAC GeoParquet. |
| description | string | Example description. |
| type | string | `application/vnd.apache.parquet` |
| roles | \[string] | [collection-mirror]\* |

\*Note the IANA has not approved the new Media type `application/vnd.apache.parquet` yet, it's been (submitted for approval)[https://issues.apache.org/jira/browse/PARQUET-1889].
\*Note the IANA has not approved the new Media type `application/vnd.apache.parquet` yet, it's been [submitted for approval](https://issues.apache.org/jira/browse/PARQUET-1889).

The description should ideally include details about the spatial partitioning method.


## Mapping to other geospatial data formats

The principles here can likely be used to map into other geospatial data formats (GeoPackage, FlatGeobuf, etc), but we embrace Parquet's nested 'structs' for some of the mappings, so other formats will need to do something different. The obvious thing to do is to dump JSON into those fields, but that's outside the scope of this document, and we recommend creating a general document for that.
3 changes: 1 addition & 2 deletions stac_geoparquet/__init__.py
@@ -6,8 +6,7 @@

__all__ = [
"__version__",
"to_geodataframe",
"to_dict",
"to_geodataframe",
"to_item_collection",
"__version__",
]
14 changes: 14 additions & 0 deletions stac_geoparquet/arrow/__init__.py
@@ -9,4 +9,18 @@
DEFAULT_PARQUET_SCHEMA_VERSION,
SUPPORTED_PARQUET_SCHEMA_VERSIONS,
)
from ._delta_lake import parse_stac_ndjson_to_delta_lake
from ._to_parquet import parse_stac_ndjson_to_parquet, to_parquet

__all__ = (
"DEFAULT_JSON_CHUNK_SIZE",
"DEFAULT_PARQUET_SCHEMA_VERSION",
"parse_stac_items_to_arrow",
"parse_stac_ndjson_to_arrow",
"parse_stac_ndjson_to_delta_lake",
"parse_stac_ndjson_to_parquet",
"stac_table_to_items",
"stac_table_to_ndjson",
"SUPPORTED_PARQUET_SCHEMA_VERSIONS",
"to_parquet",
)
