diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 0000000..da8be8c --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,23 @@ +name: CI + +# On every pull request, but only on push to main +on: + push: + branches: + - main + pull_request: + +jobs: + tests: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v2 + - name: Set up Python 3.11 + uses: actions/setup-python@v2 + with: + python-version: 3.11 + + - name: run pre-commit + run: | + python -m pip install pre-commit + pre-commit run --all-files diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 0000000..39e9680 --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,8 @@ +repos: + - repo: local + hooks: + - id: glossary-alphabetical + name: glossary-alphabetical + entry: python scripts/glossary_alphabetical.py + language: system + files: 'glossary.qmd' diff --git a/_quarto.yml b/_quarto.yml index 67980c2..e7c8fb6 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -65,6 +65,8 @@ website: - section: PMTiles contents: - pmtiles/intro.qmd + - href: glossary.qmd + text: Glossary - href: cookbooks/index.qmd text: Cookbooks - href: contributing.qmd diff --git a/glossary.qmd b/glossary.qmd new file mode 100644 index 0000000..c4c2069 --- /dev/null +++ b/glossary.qmd @@ -0,0 +1,365 @@ +--- +title: Glossary +format: + html: + toc: true + toc-depth: 5 + toc-expand: true +--- + +This glossary aims to describe all the jargon of geospatial in the cloud! Are we missing something? [Create an issue](https://github.com/cloudnativegeo/cloud-optimized-geospatial-formats-guide/issues/new?title=Glossary+suggestion:&labels=glossary) to suggest an improvement. + +#### Amazon S3 (S3) + +The [object storage](#object-storage-cloud-storage) service offered by Amazon. Part of [Amazon Web Services](#amazon-web-services-aws). + +#### Amazon Web Services (AWS) + +[Cloud computing](#cloud) services offered by Amazon. + +#### Archive format + +A file format which stores one or more other files, possibly with [compression](#compression). Examples include [ZIP archives](#zip-archive) and [PMTiles](#pmtiles). + +#### Array Dimensions + +The number of variables represented by an array. If an array represents longitude, latitude, time, and temperature, the array has four dimensions. + +#### Asynchronous + +A manner of scaling computing, to allow more operations to happen at the same time. + +Think of a glass of water. Synchronous computing is akin to having one straw: when you've finished drinking all you wish to drink, you give the straw to your friend for them to drink. Asynchronous computing is akin to sharing the straw between you and your friend. There's still only one straw, but you can hand off sips. Parallel computing (like [multithreading](#multithreading) or multiprocessing) is like having two straws, where both you and your friend can drink out of the glass at the same time. + +#### Bandwidth + +The speed at which data travels over a network. Usually used in reference to downloading or uploading files. + +See also: [latency](#latency). + +#### Chunk + +A grouping of data as part of a file format. + +In a [COG](#cloud-optimized-geotiff-cog), this refers to a slice of the full array, usually 256 pixels high by 256 pixels wide (256x256), or 512 pixels high by 512 pixels wide (512x512). + +In a [GeoParquet](#geoparquet) file, this refers to a slice of a group of columns, where the slice has the same number of rows in each column. + +#### Chunk size + +The size of each [chunk](#chunk) in a file format. + +The chunk size plays a large part in how efficient [random access](#random-access) within the file can be. If the chunk size is too small, then the [metadata](#metadata) describing the file and the chunk byte ranges will be very large, and many [HTTP range requests](#http-range-request) may have to be made for each small piece desired within the file. On the other hand, if the chunk size is too large, then a reader will have to read a large amount of data even for a very small query. + +#### Cloud + +Computing services hosted by an external provider, where the provider pays for the upfront cost of buying hardware, earning a profit by selling services. This allows users to scale workloads efficiently because users do not need to pay large upfront costs for computers. These rented services can include compute time or [object storage](#object-storage-cloud-storage). + +Usually refers to services hosted by [Amazon](#amazon-web-services-aws), [Google](#google-cloud-gcp), or [Microsoft](#microsoft-azure). + +#### Cloud-Optimized + +The property of a file format to be able to read a meaningful _part_ of the file without needing to download _all_ of the file. In particular, this means the file can be used efficiently from [cloud storage](#object-storage-cloud-storage) via [HTTP range requests](#http-range-request). + +#### Cloud-Optimized GeoTIFF (COG) + +An extension of [GeoTIFF](#geotiff) with well-defined internal chunking, designed for efficient [random access](#random-access) of the contained raster data. + +#### Cloud-Optimized Point Cloud (COPC) + +A [cloud-optimized](#cloud-optimized) file format for [point cloud data](#point-cloud-data). + +#### Compression + +An algorithm that makes data smaller, at the cost of having to encode data into the compressed format before saving and having to decode data out of the compressed format before usage. In most cases, the benefits of smaller file sizes when stored outweigh the time it takes to encode and decode the compressed format. + +Compression can either be [external](#external-compression) or [internal](#internal-compression) to a file, and can either be [lossless](#lossless-compression) or [lossy](#lossy-compression). + +#### Content Delivery Network (CDN) + +A globally-distributed network of storage servers designed to cache HTTP requests so that future requests can use the cached copy instead of asking the origin server or storage. + +#### Coordinate Reference System (CRS) + +Also called a projection + +#### Data Type + +The data type refers to the specific encoding in which values are stored in binary. Data types can be numeric or non-numeric, including string, binary, or nested data structures. The usual numeric data types used most often for scientific data include: + +- `Byte` or `Int8`: signed integer with 8 bit capacity, which can hold values from -128 to 127 (inclusive). +- `Unsigned byte` or `Uint8`: unsigned integer with 8 bit capacity, which can hold values from 0 to 255 (inclusive). +- `Int` or `Int16`: signed integer with 16 bit capacity, which can hold values from -32,768 to 32,767 (inclusive). +- `Unsigned int` or `Uint16`: unsigned integer with 16 bit capacity, which can hold values from 0 to 65,535 (inclusive). +- `Short` or `Int32`: signed integer with 32 bit capacity, which can hold values from -2,147,483,648 to 2,147,483,647 (inclusive). +- `Unsigned short` or `Uint32`: unsigned integer with 32 bit capacity, which can hold values from 0 to 4,294,967,295 (inclusive). +- `Long` or `Int64`: signed integer with 64 bit capacity, which can hold values from -9223372036854775808 to 9223372036854775807 (inclusive). +- `Unsigned long` or `Uint64`: unsigned integer with 64 bit capacity, which can hold values from 0 to 18446744073709551615 (inclusive). +- `float`: 32 bit floating point number. +- `double`: 64 bit floating point number. + +For a good explainer on how floating point numbers work, refer to [this blog post](https://ciechanow.ski/exposing-floating-point/). + +#### Deflate + +A [lossless](#lossless-compression) [compression](#compression) codec used as part of [ZIP archives](#zip-archive) and internally within [COG](#cloud-optimized-geotiff-cog) and [GeoTIFF](#geotiff) files. + +#### Entwine Point Cloud (EPT) + +A [cloud-optimized](#cloud-optimized) file format for [point cloud data](#point-cloud-data). Entwine has largely been superseded by [COPC](#cloud-optimized-point-cloud-copc) because COPC is backwards-compatible with previous point cloud data formats. + +#### EPSG code + +A [projection definition](#coordinate-reference-system-crs) referring to the EPSG database. EPSG codes tend to be four or five digits and tend to be easier to remember and use than longer definitions, such as [WKT strings](#wkt-projection-definition) or [PROJJSON](#projjson). The downside of EPSG codes is that the program needs to have the EPSG database available so that it can perform a lookup from the EPSG code to the full projection definition. + +#### External compression + +[Compression](#compression) that is not part of a file format's own specification, and which is added on after the main file has been saved. This tends to be used as part of [ZIP archives](#zip-archive) (with file extension `.zip`) or with standalone [gzip compression](#gzip) (file extension `.gz`). + +External compression tends to make a file no longer [cloud-optimized](#cloud-optimized), as it is usually no longer possible to read part of the file without fetching the entire file, as the entire file is necessary for decompression. + +This is in contrast to [internal compression](#internal-compression). + +#### fsspec + +A Python library for abstracting across several different file storage solutions, including local file storage, [cloud storage](#object-storage-cloud-storage), and HTTP web urls. Allows uploading and downloading files to each backend with a consistent API. + +#### GDAL + +The _Geospatial Data Abstraction Library_, a widely-used open-source library for converting between different [raster data](#raster-data) formats, as well as reprojecting between [coordinate reference systems](#coordinate-reference-system-crs). + +Its common command-line tools include `gdalinfo` and `gdal_translate`. It can be used from Python with the `rasterio` library or from R with the `terra` library. + +GDAL includes [OGR](#ogr) for processing vector data. + +#### GeoJSON + +A file format for [vector data](#vector-data), built on top of [JSON](https://en.wikipedia.org/wiki/JSON). GeoJSON is a common format for transferring vector data to web browsers, because it's easy for most programming languages to read and write, but tends to have a large size. It's not [cloud-optimized](#cloud-optimized) because it can't be partially parsed; the entire file needs to be downloaded in order to use. + +#### GeoPackage + +A file format for [vector data](#vector-data). GeoPackage supports multiple layers as part of a single file. Because a GeoPackage is internally stored as a [SQLite database](https://en.wikipedia.org/wiki/SQLite), it is not [cloud-optimized](#cloud-optimized) because the entire file must be downloaded in order to read any part of the file. + +#### GeoPandas + +A Python library for using and managing [vector data](#vector-data), organized around [geospatial data frames](#geospatial-data-frame). + +#### GeoParquet + +An extension of the [Parquet](#parquet) file format to store geospatial [vector data](#vector-data). Can be read and written by tools including [GDAL](#gdal) and [GeoPandas](#geopandas). + +#### Geospatial Data Frame + +A tabular data structure for storing geospatial [vector data](#vector-data), where each geometry is paired with one or more attributes in a given row. + +A geospatial data frame structure works best when every feature has the same range of attributes, such as when there is a timestamp or other value associated with _every_ geometry. + +[GeoPandas](#geopandas) in Python and [sf](#sf) in R are two common implementations of the geospatial data frame concept. + +#### GeoTIFF + +An extension of [TIFF](#tagged-image-file-format-tiff) to store geospatially-referenced image and raster data. Includes extra information such as the [coordinate reference system](#coordinate-reference-system-crs) and [geotransform](#geotransform). + +#### Geotransform + +A set of six numbers that describe where a [raster image](#raster-data) lies within its [coordinate reference system](#coordinate-reference-system-crs). + +The geotransform describes the resolution and real-world location of each pixel. The geotransform needs to be used in conjunction with a projection definition for pixels to be located accurately. + +For more information (in a Python context) read [Python affine transforms](https://www.perrygeo.com/python-affine-transforms.html). + +#### Google Cloud (GCP) + +[Cloud computing](#cloud) services provided by Google. + +#### gzip + +A type of [lossless compression](#lossless-compression) for general use. Gzip is based on the [deflate](#deflate) algorithm and tends to be used standalone for [external compression](#external-compression). Files that end with `.gz` have been encoded with gzip compression. + +#### Hilbert curve + +A type of [space-filling curve](#space-filling-curve) used as part of many [spatial indexes](#spatial-index) that ensures that objects near each other in two-dimensional space (e.g. longitude-latitude) are also near each other when ordered in a file. + +#### HTTP Range Request + +HTTP is the protocol that governs how computers ask for data across a network. HTTP range requests is a part of the HTTP specification that defines how to ask for a _specific_ byte range from a file, instead of the entire file. + +HTTP range requests is a core part of what makes a file format [cloud-optimized](#cloud-optimized), because it means that part of a geospatial data file can be read and used without needing to download the entire file. + +#### Internal compression + +[Compression](#compression) that is part of a file format's own specification. + +File formats such as [COG](#cloud-optimized-geotiff-cog), [COPC](#cloud-optimized-point-cloud-copc), and [GeoParquet](#geoparquet) include internal compression. Internal compression is useful for [cloud-optimized](#cloud-optimized) data formats because it allows [internal chunks](#chunk) to be fetched with [range requests](#http-range-request) but still have smaller sizes from compression. + +For files that have already been internally compressed, adding another layer of [external compression](#external-compression), such as [ZIP](#zip-archive) or [gzip](#gzip), will likely not make the file smaller, and only serve to reduce performance by requiring an extra decompression step before the data can be used. + +#### JPEG + +A [lossy compression](#lossy-compression) codec used for visual images. It tends to have a better compression ratio than lossless compression codecs like [deflate](#deflate) or [LZW](#lzw). + +#### Latency + +The time it takes for data to _start_ being retrieved from a server. + +See also: [bandwidth](#bandwidth). + +#### LERC + +LERC (Limited Error Raster Compression) is a [lossy](#lossy-compression) but very efficient compression algorithm for floating point [raster data](#raster-data). This compression rounds values to a precision provided by the user and tends to be useful e.g. for elevation data where the source data is known to not have precision beyond a known value. + +LERC is a relatively new algorithm and may not be supported everywhere. For example, [GDAL](#gdal) needs to be compiled with the LERC driver in order to load a [GeoTIFF](#geotiff) with LERC compression. + +#### Lossless compression + +A type of [compression](#compression) where the exact original values **can** be recovered after decompression. This means that the compression process does not lose any information. Lossless compression codecs tend to give larger file sizes than [lossy compression](#lossy-compression) codecs. + +Examples include [deflate](#deflate), [LZW](#lzw), [gzip](#gzip), and [ZSTD](#zstd). + +#### Lossy compression + +A type of [compression](#compression) where the exact original values **cannot** be recovered after decompression. This means that the compression process will lose information. Lossy compression codecs tend to give smaller file sizes than [lossless compression](#lossless-compression) codecs. + +Examples include [LERC](#lerc) and [JPEG](#jpeg). + +#### LZW + +A [lossless compression](#lossless-compression) codec for general use. It tends to be slightly slower than [deflate](#deflate). + +#### Mapbox Vector Tile + +A file format for tiled [vector data](#vector-data), usually used for visualization on web maps. [PMTiles](#pmtiles) is a [cloud-optimized](#cloud-optimized) [archive format](#archive-format) for storing millions of Mapbox Vector Tile files in an efficient manner, accessible via [HTTP range requests](#http-range-request). + +#### Metadata + +Information _about_ the actual data, saved as part of the file format. This allows for recreating the exact data that existed before saving and for [cloud-optimized](#cloud-optimized) data formats, usually stores the byte ranges of relevant data sections within the file, which allows for using [HTTP range requests](#http-range-request) for efficient [random access](#random-access) to that data section. + +#### Microsoft Azure + +[Cloud computing](#cloud) services offered by Microsoft. + +#### Multi-dimensional raster data + +A type of gridded [raster data](#raster-data) where multiple dimensions help conceptualize various attribtes. For example, a data value may exist for every longitude, latitude, time, and elevation, in which case the raster data would have four dimensions. + +#### Multithreading + +A manner of scaling computing, to allow more operations to happen at the same time. + +Think of a glass of water. Synchronous computing is akin to having one straw: when you've finished drinking all you wish to drink, you give the straw to your friend for them to drink. [Asynchronous](#asynchronous) computing is akin to sharing the straw between you and your friend. There's still only one straw, but you can hand off sips. Parallel computing (including multithreading and multiprocessing) is like having two straws, where both you and your friend can drink out of the glass at the same time. + +#### Numpy + +The foundational Python library for managing multi-dimensional array data. + +#### Object storage (Cloud storage) + +Object storage, or cloud storage, refers to massively scalable [cloud](#cloud) storage solutions like [Amazon S3](#amazon-s3-s3). It is relatively cheap, able to hold files small or large, and supports reading data via [HTTP range requests](#http-range-request). Most open geospatial data is hosted in such cloud storage solutions. + +#### OGR + +A widely-used open-source library for converting between different [vector data](#vector-data) formats, as well as reprojecting between [coordinate reference systems](#coordinate-reference-system-crs). + +Its common command-line tools include `ogrinfo` and `ogr2ogr`. It can be used from Python with the `pyogrio` or `fiona` libraries or from R with the `sf` library. + +OGR is installed as part of [GDAL](#gdal). + +#### Overviews + +Downsampled (aggregated) data intended for visualization and stored as part of a file format. Overviews are part of the [COG](#cloud-optimized-geotiff-cog) specification, and allow reading "zoomed out" data without needing to read and downsample from "full resolution" data. + +Overviews are also known as _pyramids_. + +#### Parquet + +A file format for tabular data with [internal chunking](#chunk) and [internal compression](#internal-compression). Data are stored per _column_ instead of per _row_, making it very fast to select all data from a specific column. + +#### PDAL + +The _Point Data Abstraction Library_, a widely-used open-source library for converting between different [point cloud data](#point-cloud-data) formats and managing point cloud data. + +#### PMTiles + +A type of [cloud-optimized](#cloud-optimized) [archive format](#archive-format) for tiled data. It can be used with either [vector](#vector-data) or [raster](#raster-data) tiled data, but is most often used with [Mapbox Vector Tile](#mapbox-vector-tile) files. The individual tiles in a PMTiles file are accessible via [HTTP range requests](#http-range-request). + +#### Point Cloud Data + +A type of geospatial data storing three dimensional point locations along with attributes for each point. Point cloud data may come from LIDAR sensors or other photogrammetry, and may represent three-dimensional terrain or buildings. + +#### PROJJSON + +A [projection definition](#coordinate-reference-system-crs) that uses JSON for encoding. + +The specification is [defined as part of the PROJ project](https://proj.org/en/9.3/specifications/projjson.html) and used as part of the [GeoParquet](#geoparquet) [vector file](#vector-data) format. + +#### Random access + +The ability to quickly fetch _part_ of a file without reading the entire file. + +For example, consider videos on YouTube. If you select to watch a video starting from the ten minute mark, YouTube does not need to download the video up to that point. Rather it is able to use the [metadata](#metadata) from the video file to know what byte range in the video file corresponds to the ten minute mark, and then use [HTTP range requests](#http-range-request) to download only the part of the video you've reqested. + +The ability to perform efficient random access over a network is a core part of what makes data [cloud-optimized](#cloud-optimized). + +#### Raster data + +A type of geospatial data that stores regularly-gridded data with cells of known and constant size. This often comes from aerial or satellite imagery sensors. + +#### sf + +An R library for using and managing [vector data](#vector-data), organized around [geospatial data frames](#geospatial-data-frame). + +#### Shapefile + +A [vector data](#vector-data) file format. There are [many reasons](http://switchfromshapefile.org/) to no longer use Shapefile. + +#### Space-filling curve + +An algorithm that translates two- or n- dimensional data into one-dimensional data. In practice, this is used as part of [spatial indexes](#spatial-index) to group [vector geometries](#vector-data) nearby within a file according to their two-dimensional location. + +[Read more at Wikipedia](https://en.wikipedia.org/wiki/Space-filling_curve). + +#### Spatial index + +A data structure used for searching through spatial data more efficiently. + +For further reading: [A dive into spatial search algorithms](https://blog.mapbox.com/a-dive-into-spatial-search-algorithms-ebd0c5e39d2a). + +#### Tagged Image File Format (TIFF) + +A file format for image and raster data that supports [lossless compression](#lossless-compression). + +#### Vector data + +A type of geospatial data to represent points, lines, and polygons. + +#### Web Mercator + +A [coordinate reference system](#coordinate-reference-system-crs) often used with tiled data for web maps. + +#### Well-Known Binary (WKB) + +A binary encoding for vector geometries that many systems can read and write. For example, GeoParquet uses WKB in its definition of the geometry column. + +#### WKT (Geometry encoding) + +A text encoding for vector geometries that many systems can read and write. WKT tends to be larger in size and slower to read and write than [WKB](#well-known-binary-wkb), so only use WKT if you need to store geometries in a text file, such as a CSV. + +Note that this is different than [WKT (Projection definition)](#wkt-projection-definition). + +#### WKT (Projection definition) + +An encoding to store [coordinate reference system](#coordinate-reference-system-crs) information. + +There have been multiple versions of WKT; it is suggested to use WKT2 whenever possible. + +#### Zarr + +A chunked, compressed file format for [multi-dimensional raster data](#multi-dimensional-raster-data). + +#### ZIP Archive + +A type of [archive format](#archive-format) that is used to group together existing files. It can also be used to apply [external compression](#external-compression) onto existing files. + +#### ZSTD + +A very efficient [lossless compression](#lossless-compression) codec. ZSTD tends to give a very good compression ratio at very good performance, but may not be available everywhere. Check that your expected programs have access to ZSTD before using this on your data. diff --git a/scripts/glossary_alphabetical.py b/scripts/glossary_alphabetical.py new file mode 100644 index 0000000..a44acb3 --- /dev/null +++ b/scripts/glossary_alphabetical.py @@ -0,0 +1,20 @@ +def main(): + lines = [] + with open('glossary.qmd') as f: + for line in f: + if line.startswith("#### "): + lines.append(line[5:].strip()) + + # Case-insensitive + lines = [line.lower() for line in lines] + + # Sort + sorted_lines = sorted(lines) + + for orig, sort in zip(lines, sorted_lines): + if orig != sort: + print(f"'{sort}' is out of order; should be before '{orig}'") + exit(1) + +if __name__ == "__main__": + main()