Commit
Merge pull request #51 from cloudnativegeo/docs/improve-contributing
Enhance Documentation: Update `contributing.qmd`, `README.md`, and `index.qmd`
kylebarron authored Sep 19, 2023
2 parents 5c44a71 + 0615f0d commit cc49f69
Showing 4 changed files with 136 additions and 66 deletions.
42 changes: 37 additions & 5 deletions README.md
@@ -2,9 +2,41 @@

See the site [https://guide.cloudnativegeo.org/](https://guide.cloudnativegeo.org/)

This site is built using [Quarto](https://quarto.org/docs/get-started/)
To preview the site locally, install quarto and run:
**Guide**: <a href="https://tinyurl.com/cogeo-guide" target="_blank">tinyurl.com/cogeo-guide</a>

```sh
quarto preview
```
**Source Code**: <a href="https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide" target="_blank">https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide</a>

---

## Why This Guide Exists

This guide aims to provide comprehensive documentation on cloud-optimized geospatial formats. Geospatial data is growing in volume and complexity, and this guide serves to highlight best practices for handling such data, especially in a cloud environment.

For more details, visit the official site: [Cloud-Optimized Geospatial Formats Guide](https://developmentseed.org/cloud-optimized-geospatial-formats-guide/).

## How to Get Involved

1. Read the [Get Involved](./contributing.qmd) guide for detailed contribution guidelines.
2. For questions or discussions, start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions/new/choose).

## Installation and Local Preview

This site is built using [Quarto](https://quarto.org/docs/get-started/). To preview the site locally, follow these steps:

1. Install Quarto from [here](https://quarto.org/).
2. Clone this repository.
3. Run the following command in the project root:

```sh
quarto preview
```

## License

This project is licensed under the Creative Commons Attribution 4.0 International license.

Preferred citation: `Barciauskas, A et al. 2023. Cloud Optimized Geospatial Formats Guide. CC-By-4.0`.

## Questions?

If you have any questions or ideas for this guide, please start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions/new/choose).
90 changes: 58 additions & 32 deletions contributing.qmd
@@ -2,65 +2,91 @@
title: Get Involved
---

We encourage contributions to this guide.
## Introduction

We encourage contributions to this guide. The guide's goal is to provide documentation on the best practices for the current state-of-the-art cloud-optimized formats. These formats are evolving, and so will the guide.

## Pre-requisites

If you wish to preview the site locally, install [quarto](https://quarto.org/). You will also need to be familiar with [quarto markdown](https://quarto.org/docs/authoring/markdown-basics.html).

## Core principles
## Communication Channels

Discussions can occur in [GitHub Discussions](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) and issues can be raised at [GitHub Issues](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/issues).


- **GitHub Discussions**: Ideal for questions, feature requests, or general conversations about the project. Use this space for collaborative discussions or if you're unsure where to start.

1. This guide is intended to be opinionated, but acknowledges there is no one-size-fits all solution.
2. This guide should provide best information and guidance available, but acknowledge there are many existing resources out there developed by experts. Those resources should be linked as appropriate.
- **GitHub Issues**: Use this for reporting bugs, suggesting enhancements, or other tasks that require tracking and possibly code changes.

## Additional criteria
## Core Principles

* All examples should use open data. If an example uses Earthdata, it must include an example of how to provide credentials ([Earthdata registration](https://urs.earthdata.nasa.gov/users/new) is open to anyone).
* Landing pages with no code should be use [quarto markdown (`.qmd`)](https://quarto.org/docs/authoring/markdown-basics.html).
* Pages with executable code should be [iPython Notebooks (`.ipynb`)](https://ipython.org/notebook.html)
1. This guide intends to be opinionated but acknowledges that there is no one-size-fits-all solution.
2. This guide should provide the best information and guidance available, while acknowledging that many existing resources have already been developed by experts. Those resources should be linked as appropriate.

## Additional Criteria

- All examples should use open data. If an example uses data from NASA Earthdata, it must include an example of providing credentials ([Earthdata registration](https://urs.earthdata.nasa.gov/users/new) is available to anyone).
- Landing pages with no code should use [quarto markdown (`.qmd`)](https://quarto.org/docs/authoring/markdown-basics.html).
- Pages with executable code should be [Jupyter Notebooks (`.ipynb`)](https://ipython.org/notebook.html).

## Code of Conduct

* Be inclusive, respectful and understanding of others' backgrounds and contexts.
* Look for and foster diverse perspectives.
* If you experience any harmful behavior, please contact [email protected] or [email protected].
- Be inclusive, respectful, and understanding of others' backgrounds and contexts.
- Look for and foster diverse perspectives.
- If you experience any harmful behavior, please get in touch with [Aimee](mailto:[email protected]) or [Alex](mailto:[email protected]).

## Bug Reporting & Feature Requests

## How to contribute
Before submitting a bug report or a feature request, please start a [GitHub Discussion](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) to see if the issue has already been addressed or if it can be resolved through discussion.

### 0. General
### General Steps

Fork the repository, make changes, use `quarto preview` to make sure the changes look good and open a pull request.
1. Fork the repository.
2. Clone your fork locally.
3. Create a new branch for your changes.
4. Make your changes and use `quarto preview` to make sure they look good.
5. Open a pull request.

Once the pull request is opened, and the github `preview.yml` workflow runs ("Deploy PR previews"), you should have a preview available for review at https://guide.cloudnativegeo.org/pr-preview/pr-<YOUR-PR-NUMBER-HERE>.
Once the pull request is opened and the GitHub `preview.yml` workflow ("Deploy PR previews") has run, a preview should be available for review at `https://developmentseed.org/cloud-optimized-geospatial-formats-guide/pr-preview/pr-<YOUR-PR-NUMBER-HERE>`. A bot will comment on your PR when the preview is ready.

### 1. Adding a new format
### Specific Contributions

1. Create a folder with the formats name and, within that folder, an `intro.qmd`. The `intro.qmd` file should describe the basics about that format.
#### 1. Adding a New Format

Follow the General Steps above, then:

1. Create a folder with the format's name and, within that folder, an `intro.qmd`.
2. Link to the `intro.qmd` page in the `index.qmd` (the **Welcome** page) file and `_quarto.yml` table of contents.
3. Optionally, add a notebook with examples of creating and accessing (via the cloud) a file of that format. We suggest including an [`environment.yml` file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually) that defines the Conda packages necessary for the given notebook.

### 2. Modify or add to an existing format
#### 2. Modify or Add to an Existing Format

Feel free to modify or add to existing content if you think it could be improved.

### 3. Adding a cookbook
#### 3. Adding a Cookbook

Cookbooks should address common questions and present solutions for cloud-optimized access and visualization. To create a cookbook, either add a notebook directly to this repository in the `cookbooks` directory or add an external link to `cookbooks/index.qmd`.

#### 4. (Optional) Update Slides

If you have made substantive changes, consider updating the [Overview Slides](./overview.qmd). These slides are generated using [Quarto and Reveal.js](https://quarto.org/docs/presentations/revealjs/), so they can be updated with Markdown syntax.

#### 5. Add Yourself to the List of Authors

Add yourself to the list of authors on the [Welcome](./index.qmd#authors) page.

Cookbooks should address common questions and present solutions for cloud-optimized access and visualization. For example:
#### 6. Final Steps Before Merging

* How do I chose the chunk shape for my Zarr?
* How do I determine and preview the resampling algorithm for my COG overview?
* How do I create STAC metadata for my cloud-optimized data?
* How do I provide visualizations fo this data?
* How do I provide subsetted/query/analytical access this data?
Once your PR is approved and all checks have passed, a project maintainer will merge your changes into the main repository.

To create a cookbook, either add a notebook directly to this repository in the cookbooks directory OR use an external link and add it to cookbooks/index.qmd.
## Licensing

### 4. (Optional) Update slides
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/). For attribution requirements, please look at the [license terms](http://creativecommons.org/licenses/by/4.0/).

If you have made substantive changes, consider if the [Overview Slides](./overview.qmd) should be updated. These slides are generated using [Quarto and Reveal.js](https://quarto.org/docs/presentations/revealjs/) so can be updated with markdown syntax.
Preferred citation: `Barciauskas, A et al. 2023. Cloud Optimized Geospatial Formats Guide. CC-By-4.0`.

### 5. Add yourself to the list of authors on the [Welcome](./index.qmd#authors) page!
## Contact

### 6. Once your PR is approved, merge away.
For questions on how to contribute, start a discussion in the [GitHub Discussions](https://github.com/developmentseed/cloud-optimized-geospatial-formats-guide/discussions) section.

{{< include _thankyous.qmd >}}
{{< include _thankyous.qmd >}}
54 changes: 33 additions & 21 deletions index.qmd
@@ -3,34 +3,45 @@ title: "Cloud-Optimized Geospatial Formats Guide"
subtitle: "Methods for Generating and Testing Cloud-Optimized Geospatial Formats"
---

If you wish to provide optimized access to geospatial data, this guide is for you. Given the large and growing size of geospatial data, users can no longer rely solely on file download to achieve their science goals.
## Why Cloud Optimize?

## Built for the community, by the community.
Geospatial data is experiencing exponential growth in both size and complexity. As a result, traditional data access methods, such as file downloads, have become increasingly impractical for achieving scientific objectives. With the limitations of these older methods becoming more apparent, cloud-optimized geospatial formats present a much-needed solution.

There is no one-size-fits-all approach to cloud-optimized data, but the community has developed many tools for creating and assessing geospatial formats that should be organized and shared.
Cloud optimization enables efficient, on-the-fly access to geospatial data, offering several advantages:

With this guide, we provide the landscape of cloud-optimized geospatial formats and provide the best-known answers to common questions.
1. **Reduced Latency**: Subsets of the raw data can be fetched and processed much faster than downloading entire files.
2. **Scalability**: Cloud-optimized formats are usually stored on cloud object storage, which is highly scalable. Combined with metadata about where different pieces of data are stored, object storage supports many parallel read requests, making it easier to work with large datasets.
3. **Flexibility**: Cloud-optimized formats allow for high levels of customization, enabling users to tailor data access to their specific needs. Additionally, advanced query capabilities provide the freedom to perform complex operations on the data without downloading and processing entire datasets.
4. **Cost-Effectiveness**: Reduced data transfer and storage needs can lower costs. Many of these formats offer compression options, which reduce storage costs.
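
As a rough illustration of the parallel reads mentioned in the list above, the following Python sketch issues several range reads concurrently. It is only a sketch under stated assumptions: `fetch_range`, `parallel_read`, and the in-memory `store` are hypothetical stand-ins for ranged requests against object storage, not part of any library covered by this guide.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_range(store, start, end):
    """Hypothetical stand-in for a ranged GET; returns bytes [start, end)."""
    return store[start:end]


def parallel_read(store, ranges, max_workers=4):
    """Fetch several byte ranges concurrently, preserving the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: fetch_range(store, *r), ranges))


# Fake "object": 1024 bytes held in memory instead of in cloud storage.
store = bytes(range(256)) * 4
ranges = [(0, 16), (512, 528), (1008, 1024)]
parts = parallel_read(store, ranges)
print([len(p) for p in parts])
```

In a real setting each call would be a network request, so running them concurrently hides part of the per-request latency.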

## How to get involved
If you want to provide optimized access to geospatial data, this guide is designed to help you understand the best practices and tools available for cloud-optimized geospatial formats.

## Built for the Community, by the Community.

There is no one-size-fits-all approach to cloud-optimized data. Still, the community has developed many tools for creating and assessing geospatial formats that should be organized and shared.

This guide provides the landscape of cloud-optimized geospatial formats and the best-known answers to common questions.

## How to Get Involved

If you want to contribute or modify content, read the [Get Involved](./contributing.qmd) page.

If you have a question or idea for this guide, please start a [GitHub Discussion](https://github.com/cloudnativegeo/cloud-optimized-geospatial-formats-guide/discussions/new/choose).

## The opportunity
## The Opportunity

Just putting data on the cloud does not solve the big geospatial data problem. Users cannot reasonably wait to download, store and work with large files on their machines. To have access to data in memory, large volumes of data must be available via subsetting methods.
Storing data in the cloud does not, on its own, solve the big geospatial data problem. Users cannot reasonably wait to download, store, and work with large files on their machines. To access data in memory, large volumes of data must be available via subsetting methods.

While it is possible to provide subsetting as a service, this requires maintanence of additional servers and network latency (data has to go to the server where the subsetting service is running and then to the user). With cloud-optimized formats and the appropriate libraries, subsets of data can be accessed remotely without the introduction of an additional server.
While it is possible to provide subsetting as a service, this requires ongoing maintenance of additional servers and adds extra network latency when accessing data (data has to go to the server where the subsetting service is running and then to the user). With cloud-optimized formats and the appropriate libraries, subsets of data can be accessed directly from an end user's machine without introducing an additional server.

Regardless, users will be accessing data over a network, which must be considered when designing the cloud-optimized format. Traditional geospatial formats are optimized for on-disk access via small internal chunks. The introduction of a network introduces latency and the number of requests must considered.
Regardless, users will access data over a network, which must be considered when designing the cloud-optimized format. Traditional geospatial formats are optimized for on-disk access via small internal chunks. A network introduces latency, and the number of requests must be considered.
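
To make the point about latency and request count concrete, here is a back-of-the-envelope sketch in Python. The round-trip latency, bandwidth, and chunk sizes are illustrative assumptions rather than measurements, and the model deliberately ignores parallel requests and caching.

```python
import math


def estimated_read_seconds(total_bytes, chunk_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate sequential read time: one request per chunk plus transfer time."""
    n_requests = math.ceil(total_bytes / chunk_bytes)
    return n_requests * latency_s + total_bytes / bandwidth_bytes_per_s


MIB = 1024 * 1024
subset = 100 * MIB          # assumed size of the subset being read
latency = 0.05              # assumed 50 ms round trip per request
bandwidth = 100 * MIB       # assumed 100 MiB/s throughput

# Small on-disk style chunks versus larger network-friendly chunks.
small = estimated_read_seconds(subset, 64 * 1024, latency, bandwidth)
large = estimated_read_seconds(subset, 8 * MIB, latency, bandwidth)
print(f"64 KiB chunks: {small:.2f} s; 8 MiB chunks: {large:.2f} s")
```

Under these assumptions the small-chunk layout is dominated by per-request latency, which is why chunk size deserves attention when data is read over a network.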

As a community, we have arrived at the following **cloud-optimized format pattern:**

1. Metadata includes addresses for data blocks.
2. Metadata is stored in a consistent format and location.
3. Metadata can be read once.
3. Metadata can be used to read the underlying file format which supports subsetted access via adressable chunks, internal tiling or both.
4. Metadata can be used to read the underlying file format, which supports subsetted access via addressable chunks, internal tiling, or both.

These characteristics allow for parallelized and partial reading.
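
The pattern above can be sketched with a toy container in Python. This is not any real format: the layout (a 4-byte length, a JSON index, then the chunk data) and the names `write_chunked`, `read_range`, `read_index`, and `read_chunk` are assumptions made purely for illustration, with `read_range` standing in for an HTTP range request.

```python
import json
import struct


def write_chunked(chunks):
    """Pack chunks into one blob: 4-byte index length, JSON index, then data."""
    entries, body, pos = [], b"", 0
    for chunk in chunks:
        entries.append({"offset": pos, "length": len(chunk)})
        body += chunk
        pos += len(chunk)
    index = json.dumps(entries).encode("utf-8")
    return struct.pack("<I", len(index)) + index + body


def read_range(blob, start, length):
    """Hypothetical stand-in for a ranged request against object storage."""
    return blob[start:start + length]


def read_index(blob):
    """Read the metadata up front; return (index, position where data begins)."""
    (index_len,) = struct.unpack("<I", read_range(blob, 0, 4))
    index = json.loads(read_range(blob, 4, index_len))
    return index, 4 + index_len


def read_chunk(blob, index, data_start, i):
    """Fetch only chunk i, using the address recorded in the metadata."""
    entry = index[i]
    return read_range(blob, data_start + entry["offset"], entry["length"])


blob = write_chunked([b"tile-0", b"tile-one", b"tile-2"])
index, data_start = read_index(blob)
print(read_chunk(blob, index, data_start, 1))
```

Because the index sits at a known location and records every chunk's address, a reader can fetch it once and then request any chunk independently, which is what makes partial and parallel reads possible.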

@@ -42,13 +53,13 @@ The diagram below depicts how some of the cloud-optimized formats discussed in t

Notes:

* Some data formats cover multiple data types, specifically:
* GeoJSON can be used for both vector and point data.
* HDF5 can be used for point data or data cubes (or both via groups).
* GeoParquet and FlatGeobuf can be used for vector data or point data.
* LAS files are intended for 3D points, not 2D points (since COPC files are compressed LAS files, same goes for COPC files).
* [TopoJSON](https://github.com/topojson/topojson) (an extension of GeoJSON that encodes topology) and [newline-delimited GeoJSON](https://stevage.github.io/ndgeojson/) are types of GeoJSON worth mentioning but not explicitly represented in the diagram.
* GeoTIFF and GeoParquet are geospatial versions of the non-geospatial file formats TIFF and Parquet, respectively. FlatGeobuf builds upon the non-geospatial [flatbuffers](https://github.com/google/flatbuffers) serialization library (though flatbuffers is not a standalone file format)
- Some data formats cover multiple data types, specifically:
- GeoJSON can be used for vector and point cloud data.
- HDF5 can be used for point cloud data or data cubes (or both via groups).
- GeoParquet and FlatGeobuf can be used for vector data or point cloud data.
- LAS files are intended for 3D points, not 2D points (since COPC files are compressed LAS files, the same goes for COPC files).
- [TopoJSON](https://github.com/topojson/topojson) (an extension of GeoJSON that encodes topology) and [newline-delimited GeoJSON](https://stevage.github.io/ndgeojson/) are types of GeoJSON worth mentioning but have yet to be explicitly represented in the diagram.
- GeoTIFF and GeoParquet are geospatial versions of the non-geospatial file formats TIFF and Parquet, respectively. FlatGeobuf builds upon the non-geospatial [flatbuffers](https://github.com/google/flatbuffers) serialization library (though flatbuffers is not a standalone file format).

## Table of Contents

@@ -63,22 +74,23 @@ Notes:
3. [Cookbooks](./cookbooks/index.qmd)


## Running examples
## Running Examples

Most of the data formats covered in this guide have a Jupyter Notebook example that covers the basics of how to read and write the given format. At the top of each notebook is a link to an `environment.yml` file that describes what libraries need to be installed for the notebook to run correctly. You can use [Conda](https://www.anaconda.com/download) or [Mamba](https://mamba.readthedocs.io/en/latest/index.html) (a successor to Conda with faster package installs) to install the environment needed to run the notebook.
Most of the data formats covered in this guide have a Jupyter Notebook example that covers the basics of reading and writing the given format. At the top of each notebook is a link to an `environment.yml` file describing what libraries need to be installed for the notebook to run correctly. You can use [Conda](https://www.anaconda.com/download) or [Mamba](https://mamba.readthedocs.io/en/latest/index.html) (a successor to Conda with faster package installs) to install the environment needed to run the notebook.

## Authors

* Aimee Barciauskas
* Alex Mandel
* Kyle Barron
* Zac Deziel
* [Overview Slide](./overview.qmd) credits: Vincent Sarago, Chris Holmes, Patrick Quinn, Matt Hanson, Ryan Abernathey

## Questions to ask when generating cloud-optimized geospatial data in any format
## Questions to Ask When Generating Cloud-Optimized Geospatial Data in Any Format

1. What variable(s) should be included in the new data format?
2. Will you create copies to optimize for different needs?
3. What is the intended use case or usage profile? Will this product be used for visualization, analysis or both?
3. What is the intended use case or usage profile? Will this product be used for visualization, analysis, or both?
4. What is the expected access method?
5. How much of your data is typically rendered or selected at once?
