From af6103fad1d2ebf28cf90f9c0b60135749987eae Mon Sep 17 00:00:00 2001 From: Zachary Deziel Date: Wed, 6 Sep 2023 11:46:24 -0700 Subject: [PATCH 1/8] Update docs on contributing Add more information to README for a better first impression via GitHub. Add licensing, contact, and introduction to contributing.qmd. Add name to authors in index.qmd. --- README.md | 52 +++++++++++++++++++++++++++---- contributing.qmd | 80 ++++++++++++++++++++++++++++++++---------------- index.qmd | 1 + 3 files changed, 100 insertions(+), 33 deletions(-) diff --git a/README.md b/README.md index 5452032..c3c354e 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,50 @@ # Cloud-Optimized Geospatial Formats Guide -See the site [https://developmentseed.org/cloud-optimized-geospatial-formats-guide/](https://developmentseed.org/cloud-optimized-geospatial-formats-guide/) +## Introduction -This site is built using [Quarto](https://quarto.org/docs/get-started/) -To preview the site locally, install quarto and run: +This guide aims to provide comprehensive documentation on cloud-optimized geospatial formats. Geospatial data is growing in volume and complexity, and this guide serves to highlight best practices for handling such data, especially in a cloud environment. -```sh -quarto preview -``` +## Why This Guide Exists + +Traditional methods of downloading and processing geospatial data are becoming increasingly impractical due to the sheer size of datasets. This guide is designed to help data providers offer optimized access to such data. It covers community-developed tools and answers common questions about cloud-optimized geospatial formats. + +For more details, visit the official site: [Cloud-Optimized Geospatial Formats Guide](https://developmentseed.org/cloud-optimized-geospatial-formats-guide/). + +## Features + +- Overview of cloud-optimized geospatial formats. +- Guidelines for generating and testing these formats. +- Community-contributed cookbooks for common tasks. 
+- Examples are built with open data and Jupyter Notebooks. + +## How to Get Involved + +1. Read the [Get Involved](./contributing.qmd) guide for detailed contribution guidelines. +2. For questions or discussions, start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions/new/choose). + +## Installation and Local Preview + +This site is built using [Quarto](https://quarto.org/docs/get-started/). To preview the site locally, follow these steps: + +1. Install Quarto from [here](https://quarto.org/). +2. Clone this repository. +3. Run the following command in the project root: + + ```sh + quarto preview + ``` + +## Authors + +- Aimee Barciauskas +- Alex Mandel +- Kyle Barron +- Zac Deziel + +## License + +This project is licensed under the Creative Commons Attribution 4.0 International license. + +## Questions? + +If you have any questions or ideas for this guide, please start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions/new/choose). diff --git a/contributing.qmd b/contributing.qmd index 4569f54..e3be527 100644 --- a/contributing.qmd +++ b/contributing.qmd @@ -2,22 +2,28 @@ title: Get Involved --- -We encourage contributions to this guide. +## Introduction + +We encourage contributions to this guide. The goal of the guide on cloud-optimized geospatial formats is to provide documentation on the best practices for the current state of the art cloud-optimized formats. These formats are evolving, and so will the guide. ## Pre-requisites If you wish to preview the site locally, install [quarto](https://quarto.org/). You will also need to be familiar with [quarto markdown](https://quarto.org/docs/authoring/markdown-basics.html). 
-## Core principles +## Communication Channels + +Discussions can take place in GitHub Discussions: [GitHub Discussions](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) and issues can be raised at [GitHub Issues](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/issues). -1. This guide is intended to be opinionated, but acknowledges there is no one-size-fits all solution. -2. This guide should provide best information and guidance available, but acknowledge there are many existing resources out there developed by experts. Those resources should be linked as appropriate. +## Core Principles -## Additional criteria +1. This guide is intended to be opinionated, but acknowledges there is no one-size-fits-all solution. +2. This guide should provide the best information and guidance available, but acknowledge there are many existing resources out there developed by experts. Those resources should be linked as appropriate. + +## Additional Criteria * All examples should use open data. If an example uses Earthdata, it must include an example of how to provide credentials ([Earthdata registration](https://urs.earthdata.nasa.gov/users/new) is open to anyone). -* Landing pages with no code should be use [quarto markdown (`.qmd`)](https://quarto.org/docs/authoring/markdown-basics.html). -* Pages with executable code should be [iPython Notebooks (`.ipynb`)](https://ipython.org/notebook.html) +* Landing pages with no code should use [quarto markdown (`.qmd`)](https://quarto.org/docs/authoring/markdown-basics.html). +* Pages with executable code should be [iPython Notebooks (`.ipynb`)](https://ipython.org/notebook.html). ## Code of Conduct @@ -25,40 +31,60 @@ If you wish to preview the site locally, install [quarto](https://quarto.org/). * Look for and foster diverse perspectives. * If you experience any harmful behavior, please contact aimee@developmentseed.org or alex@developmentseed.org. 
-## How to contribute +## Bug Reporting + +For bug reports, please submit an issue on the GitHub repository. + +## Feature Requests + +For feature requests, please submit an issue on the GitHub repository. + +## How to Contribute -### 0. General +### General Steps -Fork the repository, make changes, use `quarto preview` to make sure the changes look good and open a pull request. +1. Fork the repository. +2. Clone your fork locally. +3. Create a new branch for your changes. +4. Make your changes and use `quarto preview` to make sure they look good. +5. Open a pull request. -Once the pull request is opened, and the github `preview.yml` workflow runs ("Deploy PR previews"), you should have a preview available for review at https://developmentseed.org/cloud-optimized-geospatial-formats-guide/pr-preview/pr-. +Once the pull request is opened, and the GitHub `preview.yml` workflow runs ("Deploy PR previews"), you should have a preview available for review at `https://developmentseed.org/cloud-optimized-geospatial-formats-guide/pr-preview/pr-`. -### 1. Adding a new format +### Specific Contributions -1. Create a folder with the formats name and, within that folder, an `intro.qmd`. The `intro.qmd` file should describe the basics about that format. +#### 1. Adding a New Format + +Follow the steps outlined in the General Steps, then: + +1. Create a folder with the format's name and, within that folder, an `intro.qmd`. 2. Link to the `intro.qmd` page in the `index.qmd` (the **Welcome** page) file and `_quarto.yml` table of contents. -3. Optionally, add a notebook with examples of creating and accessing (via the cloud) a file of that format. We suggest including an [`environment.yml` file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually) that defines the Conda packages necessary for the given notebook. -### 2. Modify or add to an existing format +#### 2. 
Modify or Add to an Existing Format Feel free to modify or add to existing content if you think it could be improved. -### 3. Adding a cookbook +#### 3. Adding a Cookbook + +Cookbooks should address common questions and present solutions for cloud-optimized access and visualization. To create a cookbook, either add a notebook directly to this repository in the cookbooks directory OR use an external link and add it to cookbooks/index.qmd. + +#### 4. (Optional) Update Slides + +If you have made substantive changes, consider updating the [Overview Slides](./overview.qmd). These slides are generated using [Quarto and Reveal.js](https://quarto.org/docs/presentations/revealjs/) so can be updated with markdown syntax. + +#### 5. Add Yourself to the List of Authors + +Add yourself to the list of authors on the [Welcome](./index.qmd#authors) page. -Cookbooks should address common questions and present solutions for cloud-optimized access and visualization. For example: +#### 6. Final Steps Before Merging -* How do I chose the chunk shape for my Zarr? -* How do I determine and preview the resampling algorithm for my COG overview? -* How do I create STAC metadata for my cloud-optimized data? -* How do I provide visualizations fo this data? -* How do I provide subsetted/query/analytical access this data? +Once your PR is approved and all checks have passed, a project maintainer will merge your changes into the main repository. -To create a cookbook, either add a notebook directly to this repository in the cookbooks directory OR use an external link and add it to cookbooks/index.qmd. +## Licensing -## 4. (Optional) Update slides +Contributions to this project are accepted under the Creative Commons Attribution 4.0 International license. -If you have made substantive changes, consider if the [Overview Slides](./overview.qmd) should be updated. 
These slides are generated using [Quarto and Reveal.js](https://quarto.org/docs/presentations/revealjs/) so can be updated with markdown syntax. +## Contact -## 5. Add yourself to the list of authors on the [Welcome](./index.qmd#authors) page! +For questions on how to contribute, start a discussion in the GitHub Discussions section. -## 6. Once your PR is approved, merge away. diff --git a/index.qmd b/index.qmd index 7d568a7..20cd229 100644 --- a/index.qmd +++ b/index.qmd @@ -72,6 +72,7 @@ Most of the data formats covered in this guide have a Jupyter Notebook example t * Aimee Barciauskas * Alex Mandel * Kyle Barron +* Zac Deziel * [Overview Slide](./overview.qmd) credits: Vincent Sarago, Chris Holmes, Patrick Quinn, Matt Hanson, Ryan Abernathey ## Questions to ask when generating cloud-optimized geospatial data in any format From e6a73b03b6ed69f69bd870a317afc58fcb2f9f12 Mon Sep 17 00:00:00 2001 From: Zachary Deziel Date: Wed, 6 Sep 2023 12:03:39 -0700 Subject: [PATCH 2/8] Review and edit phrasing of contributing and index --- contributing.qmd | 26 ++++++++++++++------------ index.qmd | 36 +++++++++++++++++------------------- 2 files changed, 31 insertions(+), 31 deletions(-) diff --git a/contributing.qmd b/contributing.qmd index e3be527..37fee77 100644 --- a/contributing.qmd +++ b/contributing.qmd @@ -4,32 +4,32 @@ title: Get Involved ## Introduction -We encourage contributions to this guide. The goal of the guide on cloud-optimized geospatial formats is to provide documentation on the best practices for the current state of the art cloud-optimized formats. These formats are evolving, and so will the guide. +We encourage contributions to this guide. The guide's goal on cloud-optimized geospatial formats is to provide documentation on the best practices for the current state-of-the-art cloud-optimized formats. These formats are evolving, and so will the guide. ## Pre-requisites -If you wish to preview the site locally, install [quarto](https://quarto.org/). 
You will also need to be familiar with [quarto markdown](https://quarto.org/docs/authoring/markdown-basics.html). +If you wish to preview the site locally, install quarto. You will also need to be familiar with quarto markdown. ## Communication Channels -Discussions can take place in GitHub Discussions: [GitHub Discussions](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) and issues can be raised at [GitHub Issues](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/issues). +Discussions can occur in [GitHub Discussions](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) and issues can be raised at [GitHub Issues](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/issues). ## Core Principles -1. This guide is intended to be opinionated, but acknowledges there is no one-size-fits-all solution. -2. This guide should provide the best information and guidance available, but acknowledge there are many existing resources out there developed by experts. Those resources should be linked as appropriate. +1. This guide intends to be opinionated but acknowledges no one-size-fits-all solution. +2. This guide should provide the best information and guidance available but acknowledge that experts develop many existing resources. Those resources should be linked as appropriate. ## Additional Criteria -* All examples should use open data. If an example uses Earthdata, it must include an example of how to provide credentials ([Earthdata registration](https://urs.earthdata.nasa.gov/users/new) is open to anyone). -* Landing pages with no code should use [quarto markdown (`.qmd`)](https://quarto.org/docs/authoring/markdown-basics.html). -* Pages with executable code should be [iPython Notebooks (`.ipynb`)](https://ipython.org/notebook.html). +- All examples should use open data. 
If an example uses Earthdata, it must include an example of providing credentials ([Earthdata registration](https://urs.earthdata.nasa.gov/users/new) is available to anyone). +- Landing pages with no code should use [quarto markdown (`.qmd`)](https://quarto.org/docs/authoring/markdown-basics.html). +- Pages with executable code should be [iPython Notebooks (`.ipynb`)](https://ipython.org/notebook.html). ## Code of Conduct -* Be inclusive, respectful and understanding of others' backgrounds and contexts. -* Look for and foster diverse perspectives. -* If you experience any harmful behavior, please contact aimee@developmentseed.org or alex@developmentseed.org. +- Be inclusive, respectful, and understanding of others' backgrounds and contexts. +- Look for and foster diverse perspectives. +- If you experience any harmful behavior, please get in touch with [Aimee](mailto:aimee@developmentseed.org) or [Alex](mailto:alex@developmentseed.org). ## Bug Reporting @@ -84,7 +84,9 @@ Once your PR is approved and all checks have passed, a project maintainer will m Contributions to this project are accepted under the Creative Commons Attribution 4.0 International license. +For more information, see the full [License](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/blob/main/LICENSE). + ## Contact -For questions on how to contribute, start a discussion in the GitHub Discussions section. +For questions on how to contribute, start a discussion in the [GitHub Discussions](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) section. diff --git a/index.qmd b/index.qmd index 20cd229..bfdf44a 100644 --- a/index.qmd +++ b/index.qmd @@ -3,13 +3,13 @@ title: "Cloud-Optimized Geospatial Formats Guide" subtitle: "Methods for Generating and Testing Cloud-Optimized Geospatial Formats" --- -If you wish to provide optimized access to geospatial data, this guide is for you. 
Given the large and growing size of geospatial data, users can no longer rely solely on file download to achieve their science goals. +If you wish to provide optimized access to geospatial data, this guide is for you. Given geospatial data’s growing size, users can no longer rely solely on file downloads to achieve their science goals. ## Built for the community, by the community. -There is no one-size-fits-all approach to cloud-optimized data, but the community has developed many tools for creating and assessing geospatial formats that should be organized and shared. +There is no one-size-fits-all approach to cloud-optimized data. Still, the community has developed many tools for creating and assessing geospatial formats that should be organized and shared. -With this guide, we provide the landscape of cloud-optimized geospatial formats and provide the best-known answers to common questions. +This guide provides the landscape of cloud-optimized geospatial formats and the best-known answers to common questions. ## How to get involved @@ -19,18 +19,18 @@ If you have a question or idea for this guide, please start a [Github Discussion ## The opportunity -Just putting data on the cloud does not solve the big geospatial data problem. Users cannot reasonably wait to download, store and work with large files on their machines. To have access to data in memory, large volumes of data must be available via subsetting methods. +Putting data in the cloud does not solve the big geospatial data problem. Users cannot reasonably wait to download, store, and work with large files on their machines. Large volumes of data must be available via subsetting methods to access data in memory. -While it is possible to provide subsetting as a service, this requires maintanence of additional servers and network latency (data has to go to the server where the subsetting service is running and then to the user). 
With cloud-optimized formats and the appropriate libraries, subsets of data can be accessed remotely without the introduction of an additional server. +While it is possible to provide subsetting as a service, this requires maintaining additional servers and adds network latency (data has to go to the server where the subsetting service is running and then to the user). With cloud-optimized formats and the appropriate libraries, subsets of data can be accessed remotely without introducing an additional server. -Regardless, users will be accessing data over a network, which must be considered when designing the cloud-optimized format. Traditional geospatial formats are optimized for on-disk access via small internal chunks. The introduction of a network introduces latency and the number of requests must considered. +Regardless, users will access data over a network, which must be considered when designing the cloud-optimized format. Traditional geospatial formats are optimized for on-disk access via small internal chunks. A network introduces latency, and the number of requests must be considered. As a community, we have arrived at the following **cloud-optimized format pattern:** 1. Metadata includes addresses for data blocks. 2. Metadata is stored in a consistent format and location. 3. Metadata can be read once. -3. Metadata can be used to read the underlying file format which supports subsetted access via adressable chunks, internal tiling or both. +4. Metadata can be used to read the underlying file format, which supports subsetted access via addressable chunks, internal tiling, or both. These characteristics allow for parallelized and partial reading. @@ -42,13 +42,13 @@ The diagram below depicts how some of the cloud-optimized formats discussed in t Notes: -* Some data formats cover multiple data types, specifically: - * GeoJSON can be used for both vector and point data. - * HDF5 can be used for point data or data cubes (or both via groups). 
- * GeoParquet and FlatGeobuf can be used for vector data or point data. -* LAS files are intended for 3D points, not 2D points (since COPC files are compressed LAS files, same goes for COPC files). -* [TopoJSON](https://github.com/topojson/topojson) (an extension of GeoJSON that encodes topology) and [newline-delimited GeoJSON](https://stevage.github.io/ndgeojson/) are types of GeoJSON worth mentioning but not explicitly represented in the diagram. -* GeoTIFF and GeoParquet are geospatial versions of the non-geospatial file formats TIFF and Parquet, respectively. FlatGeobuf builds upon the non-geospatial [flatbuffers](https://github.com/google/flatbuffers) serialization library (though flatbuffers is not a standalone file format) +- Some data formats cover multiple data types, specifically: + - GeoJSON can be used for both vector and point data. + - HDF5 can be used for point data or data cubes (or both via groups). + - GeoParquet and FlatGeobuf can be used for vector data or point data. +- LAS files are intended for 3D points, not 2D points (since COPC files are compressed LAS files, the same goes for COPC files). +- [TopoJSON](https://github.com/topojson/topojson) (an extension of GeoJSON that encodes topology) and [newline-delimited GeoJSON](https://stevage.github.io/ndgeojson/) are types of GeoJSON worth mentioning but have yet to be explicitly represented in the diagram. +- GeoTIFF and GeoParquet are geospatial versions of the non-geospatial file formats TIFF and Parquet, respectively. FlatGeobuf builds upon the non-geospatial [flatbuffers](https://github.com/google/flatbuffers) serialization library (though flatbuffers is not a standalone file format) ## Table of Contents @@ -65,7 +65,7 @@ Notes: ## Running examples -Most of the data formats covered in this guide have a Jupyter Notebook example that covers the basics of how to read and write the given format. 
At the top of each notebook is a link to an `environment.yml` file that describes what libraries need to be installed for the notebook to run correctly. You can use [Conda](https://www.anaconda.com/download) or [Mamba](https://mamba.readthedocs.io/en/latest/index.html) (a successor to Conda with faster package installs) to install the environment needed to run the notebook. +Most of the data formats covered in this guide have a Jupyter Notebook example that covers the basics of reading and writing the given format. At the top of each notebook is a link to an environment.yml file describing what libraries need to be installed to run correctly. You can use Conda or Mamba (a successor to Conda with faster package installs) to install the environment needed to run the notebook. ## Authors @@ -79,8 +79,6 @@ Most of the data formats covered in this guide have a Jupyter Notebook example t 1. What variable(s) should be included in the new data format? 2. Will you create copies to optimize for different needs? -3. What is the intended use case or usage profile? Will this product be used for visualization, analysis or both? +3. What is the intended use case or usage profile? Will this product be used for visualization, analysis, or both? 4. What is the expected access method? -5. How much of your data is typically rendered or selected at once? - - +5. How much of your data is typically rendered or selected at once? \ No newline at end of file From eaa0f63692efc4d9fb7250a91cd1a47ed1cae6d6 Mon Sep 17 00:00:00 2001 From: Zachary Deziel Date: Wed, 6 Sep 2023 16:02:32 -0700 Subject: [PATCH 3/8] Update README based on PR review Add link to website of guide and link to source code at very top. Add preferred citation to License. 
--- README.md | 24 +++++++----------------- 1 file changed, 7 insertions(+), 17 deletions(-) diff --git a/README.md b/README.md index c3c354e..f0d6a0e 100644 --- a/README.md +++ b/README.md @@ -1,21 +1,18 @@ # Cloud-Optimized Geospatial Formats Guide -## Introduction +--- -This guide aims to provide comprehensive documentation on cloud-optimized geospatial formats. Geospatial data is growing in volume and complexity, and this guide serves to highlight best practices for handling such data, especially in a cloud environment. +**Guide**: tinyurl.com/cogeo-guide -## Why This Guide Exists +**Source Code**: https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide -Traditional methods of downloading and processing geospatial data are becoming increasingly impractical due to the sheer size of datasets. This guide is designed to help data providers offer optimized access to such data. It covers community-developed tools and answers common questions about cloud-optimized geospatial formats. +--- -For more details, visit the official site: [Cloud-Optimized Geospatial Formats Guide](https://developmentseed.org/cloud-optimized-geospatial-formats-guide/). +## Why This Guide Exists -## Features +This guide aims to provide comprehensive documentation on cloud-optimized geospatial formats. Geospatial data is growing in volume and complexity, and this guide serves to highlight best practices for handling such data, especially in a cloud environment. -- Overview of cloud-optimized geospatial formats. -- Guidelines for generating and testing these formats. -- Community-contributed cookbooks for common tasks. -- Examples are built with open data and Jupyter Notebooks. +For more details, visit the official site: [Cloud-Optimized Geospatial Formats Guide](https://developmentseed.org/cloud-optimized-geospatial-formats-guide/). ## How to Get Involved @@ -34,13 +31,6 @@ This site is built using [Quarto](https://quarto.org/docs/get-started/). 
To prev quarto preview ``` -## Authors - -- Aimee Barciauskas -- Alex Mandel -- Kyle Barron -- Zac Deziel - ## License This project is licensed under the Creative Commons Attribution 4.0 International license. From f40e881bff3dee616b61c0fdfc474a4f8b684466 Mon Sep 17 00:00:00 2001 From: Zachary Deziel Date: Wed, 6 Sep 2023 16:05:01 -0700 Subject: [PATCH 4/8] Update contributing.qmd based on PR review Add missing links to tool references. Rename iPython notebooks to Jupyter Notebooks. Add comment directing people to discussions before opening bug reports or feature requests. --- contributing.qmd | 18 +++++------------- 1 file changed, 5 insertions(+), 13 deletions(-) diff --git a/contributing.qmd b/contributing.qmd index 37fee77..42c69b3 100644 --- a/contributing.qmd +++ b/contributing.qmd @@ -8,7 +8,7 @@ We encourage contributions to this guide. The guide's goal on cloud-optimized ge ## Pre-requisites -If you wish to preview the site locally, install quarto. You will also need to be familiar with quarto markdown. +If you wish to preview the site locally, install [quarto](https://quarto.org/). You will also need to be familiar with [quarto markdown](https://quarto.org/docs/authoring/markdown-basics.html). ## Communication Channels @@ -23,7 +23,7 @@ Discussions can occur in [GitHub Discussions](https://github.com/developmentseed - All examples should use open data. If an example uses Earthdata, it must include an example of providing credentials ([Earthdata registration](https://urs.earthdata.nasa.gov/users/new) is available to anyone). - Landing pages with no code should use [quarto markdown (`.qmd`)](https://quarto.org/docs/authoring/markdown-basics.html). -- Pages with executable code should be [iPython Notebooks (`.ipynb`)](https://ipython.org/notebook.html). +- Pages with executable code should be [Jupyter Notebooks (`.ipynb`)](https://ipython.org/notebook.html). 
## Code of Conduct @@ -31,15 +31,9 @@ Discussions can occur in [GitHub Discussions](https://github.com/developmentseed - Look for and foster diverse perspectives. - If you experience any harmful behavior, please get in touch with [Aimee](mailto:aimee@developmentseed.org) or [Alex](mailto:alex@developmentseed.org). -## Bug Reporting +## Bug Reporting & Feature Requests -For bug reports, please submit an issue on the GitHub repository. - -## Feature Requests - -For feature requests, please submit an issue on the GitHub repository. - -## How to Contribute +Before submitting a bug report or a feature request, please start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) to see if the issue has already been addressed or if it can be resolved through discussion. ### General Steps @@ -82,9 +76,7 @@ Once your PR is approved and all checks have passed, a project maintainer will m ## Licensing -Contributions to this project are accepted under the Creative Commons Attribution 4.0 International license. - -For more information, see the full [License](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/blob/main/LICENSE). +This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/). For attribution requirements, please look at the [license terms](http://creativecommons.org/licenses/by/4.0/). ## Contact From 58e8ae2203fc874aa3851dd7db886fd3ac6ab238 Mon Sep 17 00:00:00 2001 From: Zachary Deziel Date: Wed, 6 Sep 2023 16:07:02 -0700 Subject: [PATCH 5/8] Update index.qmd based on PR review Expand section on Why cloud optimize. Change reference of point data to point cloud data. Add missing links to external tools. 
--- index.qmd | 21 ++++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/index.qmd b/index.qmd index bfdf44a..b97f767 100644 --- a/index.qmd +++ b/index.qmd @@ -3,7 +3,18 @@ title: "Cloud-Optimized Geospatial Formats Guide" subtitle: "Methods for Generating and Testing Cloud-Optimized Geospatial Formats" --- -If you wish to provide optimized access to geospatial data, this guide is for you. Given geospatial data’s growing size, users can no longer rely solely on file downloads to achieve their science goals. +## Why Cloud Optimize? + +Geospatial data is experiencing exponential growth in both size and complexity. As a result, traditional data access methods, such as file downloads, have become increasingly impractical for achieving scientific objectives. With the limitations of these older methods becoming more apparent, cloud-optimized geospatial formats present a much-needed solution. + +Cloud optimization enables efficient, on-the-fly access to geospatial data, offering several advantages: + +1. **Reduced Latency**: Data can be fetched and processed much faster compared to downloading entire files. +2. **Scalability**: Cloud-optimized formats are designed to scale, making it easier to work with large datasets. +3. **Flexibility**: Users can access only the necessary data portions, making the process more efficient. +4. **Cost-Effective**: Reduced data transfer and storage needs can lower costs. + +If you want to provide optimized access to geospatial data, this guide is designed to help you understand the best practices and tools available for cloud-optimized geospatial formats. ## Built for the community, by the community. @@ -43,9 +54,9 @@ The diagram below depicts how some of the cloud-optimized formats discussed in t Notes: - Some data formats cover multiple data types, specifically: - - GeoJSON can be used for both vector and point data. - - HDF5 can be used for point data or data cubes (or both via groups). 
- - GeoParquet and FlatGeobuf can be used for vector data or point data. + - GeoJSON can be used for vector and point cloud data. + - HDF5 can be used for point cloud data or data cubes (or both via groups). + - GeoParquet and FlatGeobuf can be used for vector data or point cloud data. - LAS files are intended for 3D points, not 2D points (since COPC files are compressed LAS files, the same goes for COPC files). - [TopoJSON](https://github.com/topojson/topojson) (an extension of GeoJSON that encodes topology) and [newline-delimited GeoJSON](https://stevage.github.io/ndgeojson/) are types of GeoJSON worth mentioning but have yet to be explicitly represented in the diagram. - GeoTIFF and GeoParquet are geospatial versions of the non-geospatial file formats TIFF and Parquet, respectively. FlatGeobuf builds upon the non-geospatial [flatbuffers](https://github.com/google/flatbuffers) serialization library (though flatbuffers is not a standalone file format) @@ -65,7 +76,7 @@ Notes: ## Running examples -Most of the data formats covered in this guide have a Jupyter Notebook example that covers the basics of reading and writing the given format. At the top of each notebook is a link to an environment.yml file describing what libraries need to be installed to run correctly. You can use Conda or Mamba (a successor to Conda with faster package installs) to install the environment needed to run the notebook. +Most of the data formats covered in this guide have a Jupyter Notebook example that covers the basics of reading and writing the given format. At the top of each notebook is a link to an environment.yml file describing what libraries need to be installed to run correctly. You can use [Conda](https://www.anaconda.com/download) or [Mamba](https://mamba.readthedocs.io/en/latest/index.html) (a successor to Conda with faster package installs) to install the environment needed to run the notebook. 
## Authors From a88b4f886adc27e1b617a0ab391eeb5590b3f645 Mon Sep 17 00:00:00 2001 From: Zachary Deziel Date: Thu, 7 Sep 2023 08:36:57 -0700 Subject: [PATCH 6/8] Update contributing.qmd and index.qmd based on Kyle's review --- contributing.qmd | 11 ++++++++--- index.qmd | 6 +++--- 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/contributing.qmd b/contributing.qmd index 42c69b3..9296d99 100644 --- a/contributing.qmd +++ b/contributing.qmd @@ -4,7 +4,7 @@ title: Get Involved ## Introduction -We encourage contributions to this guide. The guide's goal on cloud-optimized geospatial formats is to provide documentation on the best practices for the current state-of-the-art cloud-optimized formats. These formats are evolving, and so will the guide. +We encourage contributions to this guide. The guide's goal is to provide documentation on the best practices for the current state-of-the-art cloud-optimized formats. These formats are evolving, and so will the guide. ## Pre-requisites @@ -14,6 +14,11 @@ If you wish to preview the site locally, install [quarto](https://quarto.org/). Discussions can occur in [GitHub Discussions](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) and issues can be raised at [GitHub Issues](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/issues). + +- **GitHub Discussions**: Ideal for questions, feature requests, or general conversations about the project. Use this space for collaborative discussions or if you're unsure where to start. + +- **GitHub Issues**: Use this for reporting bugs, suggesting enhancements, or other tasks that require tracking and possibly code changes. + ## Core Principles 1. This guide intends to be opinionated but acknowledges no one-size-fits-all solution. @@ -21,7 +26,7 @@ Discussions can occur in [GitHub Discussions](https://github.com/developmentseed ## Additional Criteria -- All examples should use open data. 
If an example uses Earthdata, it must include an example of providing credentials ([Earthdata registration](https://urs.earthdata.nasa.gov/users/new) is available to anyone). +- All examples should use open data. If an example uses data from NASA Earthdata, it must include an example of providing credentials ([Earthdata registration](https://urs.earthdata.nasa.gov/users/new) is available to anyone). - Landing pages with no code should use [quarto markdown (`.qmd`)](https://quarto.org/docs/authoring/markdown-basics.html). - Pages with executable code should be [Jupyter Notebooks (`.ipynb`)](https://ipython.org/notebook.html). @@ -43,7 +48,7 @@ Before submitting a bug report or a feature request, please start a [GitHub Disc 4. Make your changes and use `quarto preview` to make sure they look good. 5. Open a pull request. -Once the pull request is opened, and the GitHub `preview.yml` workflow runs ("Deploy PR previews"), you should have a preview available for review at `https://developmentseed.org/cloud-optimized-geospatial-formats-guide/pr-preview/pr-`. +Once the pull request is opened, and the GitHub `preview.yml` workflow runs ("Deploy PR previews"), you should have a preview available for review at `https://developmentseed.org/cloud-optimized-geospatial-formats-guide/pr-preview/pr-`. A bot will comment on your PR when the PR preview is ready. ### Specific Contributions diff --git a/index.qmd b/index.qmd index b97f767..45822de 100644 --- a/index.qmd +++ b/index.qmd @@ -12,7 +12,7 @@ Cloud optimization enables efficient, on-the-fly access to geospatial data, offe 1. **Reduced Latency**: Data can be fetched and processed much faster compared to downloading entire files. 2. **Scalability**: Cloud-optimized formats are designed to scale, making it easier to work with large datasets. 3. **Flexibility**: Users can access only the necessary data portions, making the process more efficient. -4. **Cost-Effective**: Reduced data transfer and storage needs can lower costs. 
+4. **Cost-Effectiveness**: Reduced data transfer and storage needs can lower costs. If you want to provide optimized access to geospatial data, this guide is designed to help you understand the best practices and tools available for cloud-optimized geospatial formats. @@ -30,9 +30,9 @@ If you have a question or idea for this guide, please start a [Github Discussion ## The opportunity -Putting data in the cloud does not solve the big geospatial data problem. Users cannot reasonably wait to download, store, and work with large files on their machines. Large volumes of data must be available via subsetting methods to access data in memory. +Storing data in the cloud does not on its own solve geospatial's data problem. Users cannot reasonably wait to download, store, and work with large files on their machines. Large volumes of data must be available via subsetting methods to access data in memory. -While it is possible to provide subsetting as a service, this requires maintenance of additional servers and network latency (data has to go to the server where the subsetting service is running and then to the user). With cloud-optimized formats and the appropriate libraries, subsets of data can be accessed remotely without introducing an additional server. +While it is possible to provide subsetting as a service, this requires ongoing maintenance of additional servers as well as extra network latency when accessing data (data has to go to the server where the subsetting service is running and then to the user). With cloud-optimized formats and the appropriate libraries, subsets of data can be accessed directly from an end user's machine without introducing an additional server. Regardless, users will access data over a network, which must be considered when designing the cloud-optimized format. Traditional geospatial formats are optimized for on-disk access via small internal chunks. A network introduces latency, and the number of requests must be considered.
From a79eb8a4f2117e11d4b708430e078b44da0f7d71 Mon Sep 17 00:00:00 2001 From: Zachary Deziel Date: Thu, 7 Sep 2023 13:29:53 -0700 Subject: [PATCH 7/8] Add preferred citation to LICENSE section --- README.md | 2 ++ contributing.qmd | 2 ++ 2 files changed, 4 insertions(+) diff --git a/README.md b/README.md index f0d6a0e..a449d12 100644 --- a/README.md +++ b/README.md @@ -35,6 +35,8 @@ This site is built using [Quarto](https://quarto.org/docs/get-started/). To prev This project is licensed under the Creative Commons Attribution 4.0 International license. +Preferred citation: `Barciauskas, A et al. 2023. Cloud Optimized Geospatial Formats Guide. CC-By-4.0`. + ## Questions? If you have any questions or ideas for this guide, please start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions/new/choose). diff --git a/contributing.qmd b/contributing.qmd index 9296d99..4de2559 100644 --- a/contributing.qmd +++ b/contributing.qmd @@ -83,6 +83,8 @@ Once your PR is approved and all checks have passed, a project maintainer will m This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/). For attribution requirements, please look at the [license terms](http://creativecommons.org/licenses/by/4.0/). +Preferred citation: `Barciauskas, A et al. 2023. Cloud Optimized Geospatial Formats Guide. CC-By-4.0`. + ## Contact For questions on how to contribute, start a discussion in the [GitHub Discussions](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) section. From b0926979147a3fb8c5f9b14daa620d7359ff0555 Mon Sep 17 00:00:00 2001 From: Zachary Deziel Date: Mon, 11 Sep 2023 13:10:48 -0700 Subject: [PATCH 8/8] Update based on Aimee's feedback Add details to why cloud optimize in index.qmd. 
Standardize capitalization of words in sub-section headers. --- index.qmd | 18 +++++++++--------- overview.qmd | 16 ++++++++-------- 2 files changed, 17 insertions(+), 17 deletions(-) diff --git a/index.qmd b/index.qmd index 45822de..aed74d2 100644 --- a/index.qmd +++ b/index.qmd @@ -9,26 +9,26 @@ Geospatial data is experiencing exponential growth in both size and complexity. Cloud optimization enables efficient, on-the-fly access to geospatial data, offering several advantages: -1. **Reduced Latency**: Data can be fetched and processed much faster compared to downloading entire files. -2. **Scalability**: Cloud-optimized formats are designed to scale, making it easier to work with large datasets. -3. **Flexibility**: Users can access only the necessary data portions, making the process more efficient. -4. **Cost-Effectiveness**: Reduced data transfer and storage needs can lower costs. +1. **Reduced Latency**: Subsets of the raw data can be fetched and processed much faster compared to downloading entire files. +2. **Scalability**: Cloud-optimized formats are usually stored on cloud object storage, which scales to very large data volumes. Object storage supports many parallel read requests when combined with metadata about where different pieces of data are stored, making it easier to work with large datasets. +3. **Flexibility**: Cloud-optimized formats allow for high levels of customization, enabling users to tailor data access to their specific needs. Additionally, advanced query capabilities provide the freedom to perform complex operations on the data without downloading and processing entire datasets. +4. **Cost-Effectiveness**: Reduced data transfer and storage needs can lower costs. Many of these formats offer compression options, which reduce storage costs. If you want to provide optimized access to geospatial data, this guide is designed to help you understand the best practices and tools available for cloud-optimized geospatial formats.
-## Built for the community, by the community. +## Built for the Community, by the Community. There is no one-size-fits-all approach to cloud-optimized data. Still, the community has developed many tools for creating and assessing geospatial formats that should be organized and shared. This guide provides the landscape of cloud-optimized geospatial formats and the best-known answers to common questions. -## How to get involved +## How to Get Involved If you want to contribute or modify content, read the [Get Involved](./contributing.qmd) page. If you have a question or idea for this guide, please start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions/new/choose). -## The opportunity +## The Opportunity Storing data in the cloud does not on its own solve geospatial's data problem. Users cannot reasonably wait to download, store, and work with large files on their machines. Large volumes of data must be available via subsetting methods to access data in memory. @@ -74,7 +74,7 @@ Notes: 3. [Cookbooks](./cookbooks/index.qmd) -## Running examples +## Running Examples Most of the data formats covered in this guide have a Jupyter Notebook example that covers the basics of reading and writing the given format. At the top of each notebook is a link to an environment.yml file describing what libraries need to be installed to run correctly. You can use [Conda](https://www.anaconda.com/download) or [Mamba](https://mamba.readthedocs.io/en/latest/index.html) (a successor to Conda with faster package installs) to install the environment needed to run the notebook.
@@ -86,7 +86,7 @@ Most of the data formats covered in this guide have a Jupyter Notebook example t * Zac Deziel * [Overview Slide](./overview.qmd) credits: Vincent Sarago, Chris Holmes, Patrick Quinn, Matt Hanson, Ryan Abernathey -## Questions to ask when generating cloud-optimized geospatial data in any format +## Questions to Ask When Generating Cloud-Optimized Geospatial Data in Any Format 1. What variable(s) should be included in the new data format? 2. Will you create copies to optimize for different needs? diff --git a/overview.qmd b/overview.qmd index ad8f68c..903211d 100644 --- a/overview.qmd +++ b/overview.qmd @@ -19,7 +19,7 @@ Source: https://github.com/developmentseed/cloud-optimized-geospatial-formats-gu Google Slides version of this content: [Cloud-Optimized Geospatial Formats](https://docs.google.com/presentation/d/1F89kcrtX9LNQPTOuwyL5FRex_8--Vlg-DA8GJNzWqGk/edit?usp=sharing). ::: {.incremental} -# What makes cloud-optimized challenging? +# What Makes Cloud-Optimized Challenging? * No one size fits all approach * Earth observation data may be processed into raster, vector and point cloud data types and stored in a long list of data formats and structures. @@ -28,13 +28,13 @@ Google Slides version of this content: [Cloud-Optimized Geospatial Formats](http * ... hopefully only a few new methods and concepts are necessary. ::: -# What makes cloud-optimized challenging? +# What Makes Cloud-Optimized Challenging? ![](./images/2019-points-lines-polygons.png) image source: ui.josiahparry.com/spatial-analysis.html -# What makes cloud-optimized challenging? +# What Makes Cloud-Optimized Challenging? :::: {.columns} @@ -52,21 +52,21 @@ Authors: Chris Durbin, Patrick Quinn, Dana Shum :::: -# What does cloud-optimized mean? +# What Does Cloud-Optimized Mean? File formats are read-oriented to support: * Partial reads * Parallel reads -## What does cloud-optimized mean? +## What Does Cloud-Optimized Mean?
* File metadata in one read * When accessing data over the internet, such as when data is in cloud storage, latency is high when compared with local storage so it is preferable to fetch lots of data in fewer reads. * An easy win is metadata in one read, which can be used to read a cloud-native dataset. * A cloud-native dataset is one with small addressable chunks via files, internal tiles, or both. -## What does cloud-optimized mean? +## What Does Cloud-Optimized Mean? :::: {.columns} @@ -178,7 +178,7 @@ image source: https://xarray.dev/ image source: https://fsspec.github.io/kerchunk/detail.html ::: -## Zarr specs in development +## Zarr Specs in Development * V2 and older specs exist, however, * A cross-organization working group has just formed to establish a GeoZarr standards working group, organized by Brianna Pagán (NASA) and includes representatives from many other orgs in the industry. @@ -258,7 +258,7 @@ image source: https://www.wherobots.ai/post/spatial-data-parquet-and-apache-sedo [Return to Cloud-Optimized Geospatial Formats Guide](https://developmentseed.org/cloud-optimized-geospatial-formats-guide/) or ... -## Not quite +## Not Quite * These formats and their tooling are in active development * Some formats were not mentioned, such as EPT, geopkg, tiledb, Cloud-Optimized HDF5. This presentation was scoped to those known best by the authors.