Add documentation website (#65)
* Add documentation website

Co-authored-by: Pete Gadomski <[email protected]>
kylebarron and gadomski authored Jun 21, 2024
1 parent dd580cd commit e13f237
Showing 18 changed files with 303 additions and 82 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/continuous-integration.yml
@@ -22,7 +22,7 @@ jobs:
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: python -m pip install -e .[pgstac,pc,test]
run: python -m pip install -e .[pgstac,pc,test,docs]

- name: Run tests
run: pytest tests -v
@@ -32,3 +32,7 @@ jobs:

- name: Type check
run: mypy .

# Ensure docs build without warnings
- name: Check docs
run: mkdocs build --strict
46 changes: 46 additions & 0 deletions .github/workflows/deploy-mkdocs.yml
@@ -0,0 +1,46 @@
name: Publish docs via GitHub Pages

# Only run manually or on new tags starting with `v`
on:
push:
tags:
- "v*"
workflow_dispatch:

jobs:
build:
name: Deploy docs
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@v4
# We need to additionally fetch the gh-pages branch for mike deploy
with:
fetch-depth: 0

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: python -m pip install -e .[pgstac,pc,test,docs]

- name: Deploy docs
env:
GIT_COMMITTER_NAME: CI
GIT_COMMITTER_EMAIL: [email protected]
run: |
# Get most recent git tag
# https://stackoverflow.com/a/7261049
# We don't use {{github.ref_name}} because if triggered manually, it
# will be a branch name instead of a tag version.
VERSION=$(git describe --tags --abbrev=0)
# Only push docs if no letters in git tag after the first character
# (usually the git tag will have v as the first character)
if ! echo $VERSION | sed 's/^.//' | grep -q "[A-Za-z]"; then
mike deploy $VERSION latest --update-aliases --push
fi
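
The tag filter in the deploy step above can be exercised locally. This is a sketch that wraps the workflow's own `sed`/`grep` pipeline in a hypothetical `should_deploy` helper (the function name is illustrative, not part of the workflow):

```shell
# Deploy only when the tag is a plain version after its first character
# (e.g. v0.5.0); skip pre-releases that contain letters (e.g. v0.5.0a1).
should_deploy() {
  version="$1"
  # Strip the first character (usually the leading "v"), then look for letters.
  if ! echo "$version" | sed 's/^.//' | grep -q "[A-Za-z]"; then
    echo "deploy"
  else
    echo "skip"
  fi
}

should_deploy v0.5.0    # prints "deploy"
should_deploy v0.5.0a1  # prints "skip"
```

This is why the workflow uses `git describe --tags --abbrev=0` rather than `${{ github.ref_name }}`: on a manual `workflow_dispatch` run, `ref_name` would be a branch name, which would always contain letters and never deploy.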
2 changes: 2 additions & 0 deletions .gitignore
@@ -3,3 +3,5 @@ __pycache__
dist
.direnv
stac_geoparquet/_version.py
.cache
site
11 changes: 8 additions & 3 deletions README.md
@@ -1,10 +1,16 @@
# STAC-geoparquet

Convert STAC items to GeoParquet.
Convert [STAC](https://stacspec.org/en) items between JSON, [GeoParquet](https://geoparquet.org/), [pgstac](https://github.com/stac-utils/pgstac), and [Delta Lake](https://delta.io/).

## Purpose

This library helps convert [STAC Items](https://github.com/radiantearth/stac-spec/blob/master/overview.md#item-overview) to [GeoParquet](https://github.com/opengeospatial/geoparquet). While STAC Items are commonly distributed as individual JSON files on object storage or through a [STAC API](https://github.com/radiantearth/stac-api-spec), STAC GeoParquet allows users to access a large number of STAC items in bulk without making repeated HTTP requests.
The STAC spec defines a JSON-based schema, but managing and searching through many millions of STAC items in JSON format is hard: JSON is verbose on disk, and you must parse an entire Item's JSON into memory to extract even a small piece of information, say the `datetime` and one `asset` of an Item.

GeoParquet can be a good complement to JSON for many bulk-access and analytic use cases. While STAC Items are commonly distributed as individual JSON files on object storage or through a [STAC API](https://github.com/radiantearth/stac-api-spec), STAC GeoParquet allows users to access a large number of STAC items in bulk without making repeated HTTP requests.

For analytic questions like "find the items in the Sentinel-2 collection in June 2024 over New York City with cloud cover of less than 20%" it can be much, much faster to find the relevant data from a GeoParquet source than from JSON, because GeoParquet needs to load only the relevant columns for that query, not the full data.

See the [STAC-GeoParquet specification](./spec/stac-geoparquet-spec.md) for details on the exact schema of the written Parquet files.

## Usage

@@ -30,7 +36,6 @@ Note that `stac_geoparquet` lifts the keys in the item `properties` up to the to
>>> items2 = list(stac_geoparquet.arrow.stac_table_to_items(table2))
```
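
The property-lifting behavior the README notes above can be illustrated with a small, hypothetical helper — this is not part of the library's API, just a sketch of what `stac_geoparquet` does internally when flattening an Item:

```python
def lift_properties(item: dict) -> dict:
    """Sketch: move keys from an Item's `properties` to the top level."""
    flat = {k: v for k, v in item.items() if k != "properties"}
    flat.update(item.get("properties", {}))
    return flat


# A minimal, illustrative STAC Item fragment:
item = {
    "type": "Feature",
    "id": "item-1",
    "properties": {"datetime": "2024-06-21T00:00:00Z", "eo:cloud_cover": 12.5},
}

flat = lift_properties(item)
# flat now has `datetime` and `eo:cloud_cover` as top-level keys,
# and no nested `properties` object.
```

Because of this lifting, a property whose name collides with a top-level Item key (e.g. `id`) cannot be represented, which is why the spec disallows such collisions.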

See the [specification](./spec/stac-geoparquet-spec.md) for details on the output stac-geoparquet dataset.

## pgstac integration

5 changes: 5 additions & 0 deletions docs/api/arrow.md
@@ -0,0 +1,5 @@
# `stac_geoparquet.arrow`

Arrow-based format conversions.

::: stac_geoparquet.arrow
7 changes: 7 additions & 0 deletions docs/api/legacy.md
@@ -0,0 +1,7 @@
# Direct GeoPandas conversion (Legacy)

The API listed here was the initial non-Arrow-based STAC-GeoParquet implementation, converting between JSON and GeoPandas directly. For large collections of STAC items, using the new Arrow-based functionality (under the `stac_geoparquet.arrow` namespace) will be more performant.

::: stac_geoparquet.to_geodataframe
::: stac_geoparquet.to_item_collection
::: stac_geoparquet.to_dict
1 change: 1 addition & 0 deletions docs/index.md
1 change: 1 addition & 0 deletions docs/spec/stac-geoparquet-spec.md
1 change: 1 addition & 0 deletions docs/usage.md
@@ -0,0 +1 @@
# Usage
132 changes: 132 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,132 @@
site_name: stac-geoparquet
repo_name: stac-geoparquet
repo_url: https://github.com/stac-utils/stac-geoparquet
site_description: Convert STAC items between JSON, GeoParquet, pgstac, and Delta Lake.
# Note: trailing slash recommended with mike:
# https://squidfunk.github.io/mkdocs-material/setup/setting-up-versioning/#publishing-a-new-version
site_url: https://stac-utils.github.io/stac-geoparquet/
docs_dir: docs

extra:
social:
- icon: "fontawesome/brands/github"
link: "https://github.com/stac-utils"
version:
provider: mike

nav:
- index.md
- usage.md
- Specification: spec/stac-geoparquet-spec.md
- API Reference:
- api/arrow.md
- Legacy: api/legacy.md
# - api/pgstac.md

watch:
- stac_geoparquet
- docs

theme:
name: material
palette:
# Palette toggle for automatic mode
- media: "(prefers-color-scheme)"
toggle:
icon: material/brightness-auto
name: Switch to light mode

# Palette toggle for light mode
- media: "(prefers-color-scheme: light)"
primary: deep purple
accent: indigo
toggle:
icon: material/brightness-7
name: Switch to dark mode

# Palette toggle for dark mode
- media: "(prefers-color-scheme: dark)"
scheme: slate
primary: deep purple
accent: indigo
toggle:
icon: material/brightness-4
name: Switch to system preference

font:
text: Roboto
code: Roboto Mono

features:
- content.code.annotate
- content.code.copy
- navigation.indexes
- navigation.instant
- navigation.tracking
- search.suggest
- search.share

plugins:
- search
- social
- mike:
alias_type: "copy"
canonical_version: "latest"
- mkdocstrings:
enable_inventory: true
handlers:
python:
options:
docstring_section_style: list
docstring_style: google
line_length: 80
separate_signature: true
show_root_heading: true
show_signature_annotations: true
show_source: false
show_symbol_type_toc: true
signature_crossrefs: true
extensions:
- griffe_inherited_docstrings

import:
- https://arrow.apache.org/docs/objects.inv
- https://delta-io.github.io/delta-rs/objects.inv
- https://docs.python.org/3/objects.inv
- https://geoarrow.github.io/geoarrow-rs/python/latest/objects.inv
- https://geopandas.org/en/stable/objects.inv
- https://numpy.org/doc/stable/objects.inv
- https://pandas.pydata.org/pandas-docs/stable/objects.inv
- https://pystac.readthedocs.io/en/stable/objects.inv
- https://shapely.readthedocs.io/en/stable/objects.inv

# https://github.com/developmentseed/titiler/blob/50934c929cca2fa8d3c408d239015f8da429c6a8/docs/mkdocs.yml#L115-L140
markdown_extensions:
- admonition
- attr_list
- codehilite:
guess_lang: false
- def_list
- footnotes
- md_in_html
- pymdownx.arithmatex
- pymdownx.betterem
- pymdownx.caret:
insert: false
- pymdownx.details
- pymdownx.emoji:
emoji_index: !!python/name:material.extensions.emoji.twemoji
emoji_generator: !!python/name:material.extensions.emoji.to_svg
- pymdownx.escapeall:
hardbreak: true
nbsp: true
- pymdownx.magiclink:
hide_protocol: true
repo_url_shortener: true
- pymdownx.smartsymbols
- pymdownx.superfences
- pymdownx.tasklist:
custom_checkbox: true
- pymdownx.tilde
- toc:
permalink: true
8 changes: 8 additions & 0 deletions pyproject.toml
@@ -31,6 +31,14 @@ source = "vcs"
version-file = "stac_geoparquet/_version.py"

[project.optional-dependencies]
docs = [
"black",
"griffe-inherited-docstrings",
"mike>=2",
"mkdocs-material[imaging]>=9.5",
"mkdocs",
"mkdocstrings[python]>=0.25.1",
]
pgstac = [
"fsspec",
"psycopg[binary,pool]",
43 changes: 21 additions & 22 deletions spec/stac-geoparquet-spec.md
@@ -8,9 +8,9 @@ library, but aims to provide guidance for anyone putting STAC data into GeoParqu

## Use cases

* Provide a STAC GeoParquet that mirrors a static Collection as a way to query the whole dataset instead of reading every specific GeoJSON file.
* As an output format for STAC API responses that is more efficient than paging through thousands of pages of GeoJSON.
* Provide efficient access to specific fields of a STAC item, thanks to Parquet's columnar format.
- Provide a STAC GeoParquet that mirrors a static Collection as a way to query the whole dataset instead of reading every specific GeoJSON file.
- As an output format for STAC API responses that is more efficient than paging through thousands of pages of GeoJSON.
- Provide efficient access to specific fields of a STAC item, thanks to Parquet's columnar format.

## Guidelines

Expand All @@ -19,7 +19,7 @@ from JSON into nested structures. We do pull the properties to the top level, so
most of the fields should be the same in STAC and in GeoParquet.

| Field | GeoParquet Type | Required | Details |
|--------------------|----------------------|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ------------------ | -------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| type | String | Optional | This is just needed for GeoJSON, so it is optional and not recommended to include in GeoParquet |
| stac_extensions | List of Strings | Required | This column is required, but can be empty if no STAC extensions were used |
| id | String | Required | Required, should be unique within each collection |
@@ -30,20 +30,20 @@ most of the fields should be the same in STAC and in GeoParquet.
| collection | String | Optional | The ID of the collection this Item is a part of. See notes below on 'Collection' and 'Collection JSON' in the Parquet metadata |
| _property columns_ | _varies_ | - | Each property should use the relevant Parquet type, and be pulled out of the properties object to be a top-level Parquet field |

* Must be valid GeoParquet, with proper metadata. Ideally the geometry types are defined and as narrow as possible.
* Strongly recommend to only have one GeoParquet per STAC 'Collection'. Not doing this will lead to an expanded GeoParquet schema (the union of all the schemas of the collection) with lots of empty data
* Any field in 'properties' of the STAC item should be moved up to be a top-level field in the GeoParquet.
* STAC GeoParquet does not support properties that are named such that they collide with a top-level key.
* datetime columns should be stored as a [native timestamp][timestamp], not as a string
* The Collection JSON should be included in the Parquet metadata. See [Collection JSON](#collection-json) below.
* Any other properties that would be stored as GeoJSON in a STAC JSON Item (e.g. `proj:geometry`) should be stored as a binary column with WKB encoding. This simplifies the handling of collections with multiple geometry types.
- Must be valid GeoParquet, with proper metadata. Ideally the geometry types are defined and as narrow as possible.
- Strongly recommend to only have one GeoParquet per STAC 'Collection'. Not doing this will lead to an expanded GeoParquet schema (the union of all the schemas of the collection) with lots of empty data
- Any field in 'properties' of the STAC item should be moved up to be a top-level field in the GeoParquet.
- STAC GeoParquet does not support properties that are named such that they collide with a top-level key.
- datetime columns should be stored as a [native timestamp][timestamp], not as a string
- The Collection JSON should be included in the Parquet metadata. See [Collection JSON](#including-a-stac-collection-json-in-a-stac-geoparquet-collection) below.
- Any other properties that would be stored as GeoJSON in a STAC JSON Item (e.g. `proj:geometry`) should be stored as a binary column with WKB encoding. This simplifies the handling of collections with multiple geometry types.

### Link Struct

The GeoParquet dataset can contain zero or more Link Structs. Each Link Struct has 2 required fields and 2 optional ones:

| Field Name | Type | Description |
|------------|--------|-------------------------------------------------------------------------------------------------------------------------------------|
| ---------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------- |
| href | string | **REQUIRED.** The actual link in the format of an URL. Relative and absolute links are both allowed. |
| rel | string | **REQUIRED.** Relationship between the current document and the linked document. See chapter "Relation types" for more information. |
| type | string | [Media type][media-type] of the referenced entity. |
@@ -56,7 +56,7 @@ See [Link Object][link] for more.
The GeoParquet dataset can contain zero or more Asset Structs. Each Asset Struct can have the following fields:

| Field Name | Type | Description |
|-------------|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ----------- | --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| href | string | **REQUIRED.** URI to the asset object. Relative and absolute URI are both allowed. |
| title | string | The displayed title for clients and users. |
| description | string | A description of the Asset providing additional details, such as how it was processed or created. [CommonMark 0.29](http://commonmark.org/) syntax MAY be used for rich text representation. |
@@ -82,19 +82,18 @@ A common use case of stac-geoparquet is to create a mirror of a STAC collection.

For example:

| Field Name | Type | Value |
|-------------|-----------|-------------------------------------|
| href | string | s3://example/uti/to/file.geoparquet |
| title | string | An example STAC geoparquet. |
| description | string | Example description. |
| type | string | application/vnd.apache.parquet |
| roles | \[string] | [collection-mirror]\* |
| Field Name | Type | Value |
| ----------- | --------- | -------------------------------- |
| href | string | s3://example/uri/to/file.parquet |
| title | string | An example STAC GeoParquet. |
| description | string | Example description. |
| type | string | `application/vnd.apache.parquet` |
| roles | \[string] | [collection-mirror]\* |

\*Note the IANA has not approved the new Media type `application/vnd.apache.parquet` yet, it's been (submitted for approval)[https://issues.apache.org/jira/browse/PARQUET-1889].
\*Note the IANA has not approved the new Media type `application/vnd.apache.parquet` yet, it's been [submitted for approval](https://issues.apache.org/jira/browse/PARQUET-1889).

The description should ideally include details about the spatial partitioning method.


## Mapping to other geospatial data formats

The principles here can likely be used to map into other geospatial data formats (GeoPackage, FlatGeobuf, etc), but we embrace Parquet's nested 'structs' for some of the mappings, so other formats will need to do something different. The obvious thing to do is to dump JSON into those fields, but that's outside the scope of this document, and we recommend creating a general document for that.
3 changes: 1 addition & 2 deletions stac_geoparquet/__init__.py
@@ -6,8 +6,7 @@

__all__ = [
"__version__",
"to_geodataframe",
"to_dict",
"to_geodataframe",
"to_item_collection",
"__version__",
]
14 changes: 14 additions & 0 deletions stac_geoparquet/arrow/__init__.py
@@ -9,4 +9,18 @@
DEFAULT_PARQUET_SCHEMA_VERSION,
SUPPORTED_PARQUET_SCHEMA_VERSIONS,
)
from ._delta_lake import parse_stac_ndjson_to_delta_lake
from ._to_parquet import parse_stac_ndjson_to_parquet, to_parquet

__all__ = (
"DEFAULT_JSON_CHUNK_SIZE",
"DEFAULT_PARQUET_SCHEMA_VERSION",
"parse_stac_items_to_arrow",
"parse_stac_ndjson_to_arrow",
"parse_stac_ndjson_to_delta_lake",
"parse_stac_ndjson_to_parquet",
"stac_table_to_items",
"stac_table_to_ndjson",
"SUPPORTED_PARQUET_SCHEMA_VERSIONS",
"to_parquet",
)
