diff --git a/README.md b/README.md index 9551e5a..c0c7c36 100644 --- a/README.md +++ b/README.md @@ -2,9 +2,41 @@ See the site [https://guide.cloudnativegeo.org/](https://guide.cloudnativegeo.org/) -This site is built using [Quarto](https://quarto.org/docs/get-started/) -To preview the site locally, install quarto and run: +**Guide**: tinyurl.com/cogeo-guide -```sh -quarto preview -``` +**Source Code**: https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide + +--- + +## Why This Guide Exists + +This guide aims to provide comprehensive documentation on cloud-optimized geospatial formats. Geospatial data is growing in volume and complexity, and this guide serves to highlight best practices for handling such data, especially in a cloud environment. + +For more details, visit the official site: [Cloud-Optimized Geospatial Formats Guide](https://developmentseed.org/cloud-optimized-geospatial-formats-guide/). + +## How to Get Involved + +1. Read the [Get Involved](./contributing.qmd) guide for detailed contribution guidelines. +2. For questions or discussions, start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions/new/choose). + +## Installation and Local Preview + +This site is built using [Quarto](https://quarto.org/docs/get-started/). To preview the site locally, follow these steps: + +1. Install Quarto from [here](https://quarto.org/). +2. Clone this repository. +3. Run the following command in the project root: + + ```sh + quarto preview + ``` + +## License + +This project is licensed under the Creative Commons Attribution 4.0 International license. + +Preferred citation: `Barciauskas, A et al. 2023. Cloud Optimized Geospatial Formats Guide. CC-By-4.0`. + +## Questions? + +If you have any questions or ideas for this guide, please start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions/new/choose). 
diff --git a/contributing.qmd b/contributing.qmd index a7e0072..94ed315 100644 --- a/contributing.qmd +++ b/contributing.qmd @@ -2,65 +2,91 @@ title: Get Involved --- -We encourage contributions to this guide. +## Introduction + +We encourage contributions to this guide. The guide's goal is to provide documentation on the best practices for the current state-of-the-art cloud-optimized formats. These formats are evolving, and so will the guide. ## Pre-requisites If you wish to preview the site locally, install [quarto](https://quarto.org/). You will also need to be familiar with [quarto markdown](https://quarto.org/docs/authoring/markdown-basics.html). -## Core principles +## Communication Channels + +Discussions can occur in [GitHub Discussions](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) and issues can be raised at [GitHub Issues](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/issues). + + +- **GitHub Discussions**: Ideal for questions, feature requests, or general conversations about the project. Use this space for collaborative discussions or if you're unsure where to start. -1. This guide is intended to be opinionated, but acknowledges there is no one-size-fits all solution. -2. This guide should provide best information and guidance available, but acknowledge there are many existing resources out there developed by experts. Those resources should be linked as appropriate. +- **GitHub Issues**: Use this for reporting bugs, suggesting enhancements, or other tasks that require tracking and possibly code changes. -## Additional criteria +## Core Principles -* All examples should use open data. If an example uses Earthdata, it must include an example of how to provide credentials ([Earthdata registration](https://urs.earthdata.nasa.gov/users/new) is open to anyone). -* Landing pages with no code should be use [quarto markdown (`.qmd`)](https://quarto.org/docs/authoring/markdown-basics.html). 
-* Pages with executable code should be [iPython Notebooks (`.ipynb`)](https://ipython.org/notebook.html) +1. This guide intends to be opinionated but acknowledges that there is no one-size-fits-all solution. +2. This guide should provide the best information and guidance available but acknowledge that many existing resources have been developed by experts. Those resources should be linked as appropriate. + +## Additional Criteria + +- All examples should use open data. If an example uses data from NASA Earthdata, it must include an example of providing credentials ([Earthdata registration](https://urs.earthdata.nasa.gov/users/new) is available to anyone). +- Landing pages with no code should use [quarto markdown (`.qmd`)](https://quarto.org/docs/authoring/markdown-basics.html). +- Pages with executable code should be [Jupyter Notebooks (`.ipynb`)](https://ipython.org/notebook.html). ## Code of Conduct -* Be inclusive, respectful and understanding of others' backgrounds and contexts. -* Look for and foster diverse perspectives. -* If you experience any harmful behavior, please contact aimee@developmentseed.org or alex@developmentseed.org. +- Be inclusive, respectful, and understanding of others' backgrounds and contexts. +- Look for and foster diverse perspectives. +- If you experience any harmful behavior, please get in touch with [Aimee](mailto:aimee@developmentseed.org) or [Alex](mailto:alex@developmentseed.org). + +## Bug Reporting & Feature Requests -## How to contribute +Before submitting a bug report or a feature request, please start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) to see if the issue has already been addressed or if it can be resolved through discussion. -### 0. General +### General Steps -Fork the repository, make changes, use `quarto preview` to make sure the changes look good and open a pull request. +1. Fork the repository. +2. Clone your fork locally. +3. Create a new branch for your changes. +4. 
Make your changes and use `quarto preview` to make sure they look good. +5. Open a pull request. -Once the pull request is opened, and the github `preview.yml` workflow runs ("Deploy PR previews"), you should have a preview available for review at https://guide.cloudnativegeo.org/pr-preview/pr-. +Once the pull request is opened, and the GitHub `preview.yml` workflow runs ("Deploy PR previews"), you should have a preview available for review at `https://developmentseed.org/cloud-optimized-geospatial-formats-guide/pr-preview/pr-`. A bot will comment on your PR when the PR preview is ready. -### 1. Adding a new format +### Specific Contributions -1. Create a folder with the formats name and, within that folder, an `intro.qmd`. The `intro.qmd` file should describe the basics about that format. +#### 1. Adding a New Format + +Follow the steps outlined in the General Steps, then: + +1. Create a folder with the format's name and, within that folder, an `intro.qmd`. 2. Link to the `intro.qmd` page in the `index.qmd` (the **Welcome** page) file and `_quarto.yml` table of contents. -3. Optionally, add a notebook with examples of creating and accessing (via the cloud) a file of that format. We suggest including an [`environment.yml` file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually) that defines the Conda packages necessary for the given notebook. -### 2. Modify or add to an existing format +#### 2. Modify or Add to an Existing Format Feel free to modify or add to existing content if you think it could be improved. -### 3. Adding a cookbook +#### 3. Adding a Cookbook + +Cookbooks should address common questions and present solutions for cloud-optimized access and visualization. To create a cookbook, either add a notebook directly to this repository in the cookbooks directory OR use an external link and add it to cookbooks/index.qmd. + +#### 4. 
(Optional) Update Slides + +If you have made substantive changes, consider updating the [Overview Slides](./overview.qmd). These slides are generated using [Quarto and Reveal.js](https://quarto.org/docs/presentations/revealjs/) so can be updated with markdown syntax. + +#### 5. Add Yourself to the List of Authors + +Add yourself to the list of authors on the [Welcome](./index.qmd#authors) page. -Cookbooks should address common questions and present solutions for cloud-optimized access and visualization. For example: +#### 6. Final Steps Before Merging -* How do I chose the chunk shape for my Zarr? -* How do I determine and preview the resampling algorithm for my COG overview? -* How do I create STAC metadata for my cloud-optimized data? -* How do I provide visualizations fo this data? -* How do I provide subsetted/query/analytical access this data? +Once your PR is approved and all checks have passed, a project maintainer will merge your changes into the main repository. -To create a cookbook, either add a notebook directly to this repository in the cookbooks directory OR use an external link and add it to cookbooks/index.qmd. +## Licensing -### 4. (Optional) Update slides +This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/). For attribution requirements, please look at the [license terms](http://creativecommons.org/licenses/by/4.0/). -If you have made substantive changes, consider if the [Overview Slides](./overview.qmd) should be updated. These slides are generated using [Quarto and Reveal.js](https://quarto.org/docs/presentations/revealjs/) so can be updated with markdown syntax. +Preferred citation: `Barciauskas, A et al. 2023. Cloud Optimized Geospatial Formats Guide. CC-By-4.0`. -### 5. Add yourself to the list of authors on the [Welcome](./index.qmd#authors) page! +## Contact -### 6. 
Once your PR is approved, merge away. +For questions on how to contribute, start a discussion in the [GitHub Discussions](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) section. -{{< include _thankyous.qmd >}} \ No newline at end of file +{{< include _thankyous.qmd >}} diff --git a/index.qmd b/index.qmd index 6626972..ceccfa1 100644 --- a/index.qmd +++ b/index.qmd @@ -3,34 +3,45 @@ title: "Cloud-Optimized Geospatial Formats Guide" subtitle: "Methods for Generating and Testing Cloud-Optimized Geospatial Formats" --- -If you wish to provide optimized access to geospatial data, this guide is for you. Given the large and growing size of geospatial data, users can no longer rely solely on file download to achieve their science goals. +## Why Cloud Optimize? -## Built for the community, by the community. +Geospatial data is experiencing exponential growth in both size and complexity. As a result, traditional data access methods, such as file downloads, have become increasingly impractical for achieving scientific objectives. With the limitations of these older methods becoming more apparent, cloud-optimized geospatial formats present a much-needed solution. -There is no one-size-fits-all approach to cloud-optimized data, but the community has developed many tools for creating and assessing geospatial formats that should be organized and shared. +Cloud optimization enables efficient, on-the-fly access to geospatial data, offering several advantages: -With this guide, we provide the landscape of cloud-optimized geospatial formats and provide the best-known answers to common questions. +1. **Reduced Latency**: Subsets of the raw data can be fetched and processed much faster compared to downloading entire files. +2. **Scalability**: Cloud-optimized formats are usually stored on cloud object storage, which is infinitely scalable. 
Object storage supports many parallel read requests when combined with metadata about where different data bits are stored, making it easier to work with large datasets. +3. **Flexibility**: Cloud-optimized formats allow for high levels of customization, enabling users to tailor data access to their specific needs. Additionally, advanced query capabilities provide the freedom to perform complex operations on the data without downloading and processing entire datasets. +4. **Cost-Effectiveness**: Reduced data transfer and storage needs can lower costs. Many of these formats offer compression options, which reduce storage costs. -## How to get involved +If you want to provide optimized access to geospatial data, this guide is designed to help you understand the best practices and tools available for cloud-optimized geospatial formats. + +## Built for the Community, by the Community. + +There is no one-size-fits-all approach to cloud-optimized data. Still, the community has developed many tools for creating and assessing geospatial formats that should be organized and shared. + +This guide provides the landscape of cloud-optimized geospatial formats and the best-known answers to common questions. + +## How to Get Involved If you want to contribute or modify content, read the [Get Involved](./contributing.qmd) page. If you have a question or idea for this guide, please start a [Github Discussion](https://github.com/cloudnativegeo/cloud-optimized-geospatial-formats-guide/discussions/new/choose). -## The opportunity +## The Opportunity -Just putting data on the cloud does not solve the big geospatial data problem. Users cannot reasonably wait to download, store and work with large files on their machines. To have access to data in memory, large volumes of data must be available via subsetting methods. +Storing data in the cloud does not on its own solve geospatial's data problem. 
Users cannot reasonably wait to download, store, and work with large files on their machines. To access data in memory, large volumes of data must be available via subsetting methods. -While it is possible to provide subsetting as a service, this requires maintanence of additional servers and network latency (data has to go to the server where the subsetting service is running and then to the user). With cloud-optimized formats and the appropriate libraries, subsets of data can be accessed remotely without the introduction of an additional server. +While it is possible to provide subsetting as a service, this requires ongoing maintenance of additional servers and adds extra network latency when accessing data (data has to go to the server where the subsetting service is running and then to the user). With cloud-optimized formats and the appropriate libraries, subsets of data can be accessed directly from an end user's machine without introducing an additional server. -Regardless, users will be accessing data over a network, which must be considered when designing the cloud-optimized format. Traditional geospatial formats are optimized for on-disk access via small internal chunks. The introduction of a network introduces latency and the number of requests must considered. +Regardless, users will access data over a network, which must be considered when designing the cloud-optimized format. Traditional geospatial formats are optimized for on-disk access via small internal chunks. A network introduces latency, and the number of requests must be considered. As a community, we have arrived at the following **cloud-optimized format pattern:** 1. Metadata includes addresses for data blocks. 2. Metadata is stored in a consistent format and location. 3. Metadata can be read once. -3. Metadata can be used to read the underlying file format which supports subsetted access via adressable chunks, internal tiling or both. +4. 
Metadata can be used to read the underlying file format, which supports subsetted access via addressable chunks, internal tiling, or both. These characteristics allow for parallelized and partial reading. @@ -42,13 +53,13 @@ The diagram below depicts how some of the cloud-optimized formats discussed in t Notes: -* Some data formats cover multiple data types, specifically: - * GeoJSON can be used for both vector and point data. - * HDF5 can be used for point data or data cubes (or both via groups). - * GeoParquet and FlatGeobuf can be used for vector data or point data. -* LAS files are intended for 3D points, not 2D points (since COPC files are compressed LAS files, same goes for COPC files). -* [TopoJSON](https://github.com/topojson/topojson) (an extension of GeoJSON that encodes topology) and [newline-delimited GeoJSON](https://stevage.github.io/ndgeojson/) are types of GeoJSON worth mentioning but not explicitly represented in the diagram. -* GeoTIFF and GeoParquet are geospatial versions of the non-geospatial file formats TIFF and Parquet, respectively. FlatGeobuf builds upon the non-geospatial [flatbuffers](https://github.com/google/flatbuffers) serialization library (though flatbuffers is not a standalone file format) +- Some data formats cover multiple data types, specifically: + - GeoJSON can be used for vector and point cloud data. + - HDF5 can be used for point cloud data or data cubes (or both via groups). + - GeoParquet and FlatGeobuf can be used for vector data or point cloud data. +- LAS files are intended for 3D points, not 2D points (since COPC files are compressed LAS files, the same goes for COPC files). +- [TopoJSON](https://github.com/topojson/topojson) (an extension of GeoJSON that encodes topology) and [newline-delimited GeoJSON](https://stevage.github.io/ndgeojson/) are types of GeoJSON worth mentioning but are not explicitly represented in the diagram. 
+- GeoTIFF and GeoParquet are geospatial versions of the non-geospatial file formats TIFF and Parquet, respectively. FlatGeobuf builds upon the non-geospatial [flatbuffers](https://github.com/google/flatbuffers) serialization library (though flatbuffers is not a standalone file format). ## Table of Contents @@ -63,22 +74,23 @@ Notes: 3. [Cookbooks](./cookbooks/index.qmd) -## Running examples +## Running Examples -Most of the data formats covered in this guide have a Jupyter Notebook example that covers the basics of how to read and write the given format. At the top of each notebook is a link to an `environment.yml` file that describes what libraries need to be installed for the notebook to run correctly. You can use [Conda](https://www.anaconda.com/download) or [Mamba](https://mamba.readthedocs.io/en/latest/index.html) (a successor to Conda with faster package installs) to install the environment needed to run the notebook. +Most of the data formats covered in this guide have a Jupyter Notebook example that covers the basics of reading and writing the given format. At the top of each notebook is a link to an `environment.yml` file describing what libraries need to be installed for the notebook to run correctly. You can use [Conda](https://www.anaconda.com/download) or [Mamba](https://mamba.readthedocs.io/en/latest/index.html) (a successor to Conda with faster package installs) to install the environment needed to run the notebook. ## Authors * Aimee Barciauskas * Alex Mandel * Kyle Barron +* Zac Deziel * [Overview Slide](./overview.qmd) credits: Vincent Sarago, Chris Holmes, Patrick Quinn, Matt Hanson, Ryan Abernathey -## Questions to ask when generating cloud-optimized geospatial data in any format +## Questions to Ask When Generating Cloud-Optimized Geospatial Data in Any Format 1. What variable(s) should be included in the new data format? 2. Will you create copies to optimize for different needs? -3. What is the intended use case or usage profile? 
Will this product be used for visualization, analysis or both? +3. What is the intended use case or usage profile? Will this product be used for visualization, analysis, or both? 4. What is the expected access method? 5. How much of your data is typically rendered or selected at once? diff --git a/overview.qmd b/overview.qmd index 2862f9b..aad35c3 100644 --- a/overview.qmd +++ b/overview.qmd @@ -19,7 +19,7 @@ Source: https://github.com/cloudnativegeo/cloud-optimized-geospatial-formats-gui Google Slides version of this content: [Cloud-Optimized Geospatial Formats](https://docs.google.com/presentation/d/1F89kcrtX9LNQPTOuwyL5FRex_8--Vlg-DA8GJNzWqGk/edit?usp=sharing). ::: {.incremental} -# What makes cloud-optimized challenging? +# What Makes Cloud-Optimized Challenging? * No one size fits all approach * Earth observation data may be processed into raster, vector and point cloud data types and stored in a long list of data formats and structures. @@ -28,13 +28,13 @@ Google Slides version of this content: [Cloud-Optimized Geospatial Formats](http * ... hopefully only a few new methods and concepts are necessary. ::: -# What makes cloud-optimized challenging? +# What Makes Cloud-Optimized Challenging? ![](./images/2019-points-lines-polygons.png) image source: ui.josiahparry.com/spatial-analysis.html -# What makes cloud-optimized challenging? +# What Makes Cloud-Optimized Challenging? :::: {.columns} @@ -52,21 +52,21 @@ Authors: Chris Durbin, Patrick Quinn, Dana Shum :::: -# What does cloud-optimized mean? +# What Does Cloud-Optimized Mean? File formats are read-oriented to support: * Partial reads * Parallel reads -## What does cloud-optimized mean? +## What Does Cloud-Optimized Mean? * File metadata in one read * When accessing data over the internet, such as when data is in cloud storage, latency is high when compared with local storage so it is preferable to fetch lots of data in fewer reads. 
* An easy win is metadata in one read, which can be used to read a cloud-native dataset. * A cloud-native dataset is one with small addressable chunks via files, internal tiles, or both. -## What does cloud-optimized mean? +## What Does Cloud-Optimized Mean? :::: {.columns} @@ -178,7 +178,7 @@ image source: https://xarray.dev/ image source: https://fsspec.github.io/kerchunk/detail.html ::: -## Zarr specs in development +## Zarr Specs in Development * V2 and older specs exist, however, * A cross-organization working group has just formed to establish a GeoZarr standards working group, organized by Brianna Pagán (NASA) and includes representatives from many other orgs in the industry. @@ -258,7 +258,7 @@ image source: https://www.wherobots.ai/post/spatial-data-parquet-and-apache-sedo [Return to Cloud-Optimized Geospatial Formats Guide](https://guide.cloudnativegeo.org/) or ... -## Not quite +## Not Quite * These formats and their tooling are in active development * Some formats were not mentioned, such as EPT, geopkg, tiledb, Cloud-Optimized HDF5. This presentation was scoped to those known best by the authors.