From 4f253f0367798d0bd43520a871896034c21c6067 Mon Sep 17 00:00:00 2001 From: Jorge Sanz Date: Mon, 8 Jan 2024 18:41:01 +0100 Subject: [PATCH 1/6] First iteration, still WIP --- docs/maps/trouble-shooting.asciidoc | 181 +++++++++++++++++++++++++++- 1 file changed, 180 insertions(+), 1 deletion(-) diff --git a/docs/maps/trouble-shooting.asciidoc b/docs/maps/trouble-shooting.asciidoc index 3e4a6dfb42dc1..4bb4f02e71aa1 100644 --- a/docs/maps/trouble-shooting.asciidoc +++ b/docs/maps/trouble-shooting.asciidoc @@ -53,4 +53,183 @@ Increase <> for large data views. [float] ==== Custom tiles are not displayed * When using a custom tile service, ensure your tile server has configured https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS[Cross-Origin Resource Sharing (CORS)] so tile requests from your {kib} domain have permission to access your tile server domain. -* Ensure custom vector and tile services have the required coordinate system. Vector data must use EPSG:4326 and tiles must use EPSG:3857. \ No newline at end of file +* Ensure custom vector and tile services have the required coordinate system. Vector data must use EPSG:4326 and tiles must use EPSG:3857. + +[float] +=== Cleaning your data before uploading it to {es} + +// https://github.com/elastic/kibana/issues/135319 + +Geospatial fields in {es} have certain restrictions that need to be addressed before upload. On this section a few recipes will be presented to help troubleshooting common issues on this type of data. + +[float] +==== Convert to GeoJSON or Shapefile + +With https://gdal.org/programs/ogr2ogr.html[ogr2ogr] (part of the https://gdal.org[GDAL/OGR] suite) it is pretty straight forward to convert datasets from dozens of formats into a GeoJSON or Esri Shapefile. For example, converting a GPX file can be achieved with the following commands: + +[source,sh] +---- +# Example GPX file from https://www.topografix.com/gpx_sample_files.asp +# +# Convert the GPX waypoints layer into a GeoJSON file +$ ogr2ogr \ + -f GeoJSON "waypoints.geo.json" \ # Output format and file name + "fells_loop.pgx" \ # Input File Name + "waypoints" # Input Layer (usually same as file name) + +# Extract the routes +$ ogr2ogr -f "GeoJSON routes.geo.json" "fells_loop.pgx" "routes" +---- + +[float] +==== Set up the correct coordinate reference system (CRS) + +{es} only supports WGS84 Coordinate Reference System. Also with `ogr2ogr`, converting from one coordinate system to WGS84 is usually supported but it depends on the source CRS. + +On the following example, `ogr2ogr` transform a shapefile from https://epsg.org/crs_4269/NAD83.html[NAD83] to https://epsg.org/crs_4326/WGS-84.html[WGS84]. The input CRS is detected automatically thanks to the `.prj` sidecar file in the source dataset. + +[source,sh] +---- +# Example NAD83 file from https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip +# +# Convert the Census Counties shapefile to WGS84 (EPSG:4326) +$ ogr2ogr -f "Esri Shapefile" \ + "cb_2018_us_county_5m.4326.shp" \ + -t_srs "EPSG:4326" \ # EPSG:4326 is the code for WGS84 + "cb_2018_us_county_5m.shp" \ + "cb_2018_us_county_5m" +---- + +[float] +==== Explode records with large number of parts + +Sometimes geospatial datasets are composed by a small amount of geometries that contain a very large amount of individual part geometries. Depending on the final usage of a dataset, you may want to "explode" this type of dataset to keep one geometry per document, considerably increasing the performance of your index. 
+ +[source,sh] +---- +# Example NAD83 file from www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/ler_000b16a_e.zip +# +# Check the number of input features +$ ogrinfo -summary ler_000b16a_e.shp ler_000b16a_e \ + | grep "Feature Count" +Feature Count: 76 + +# Convert to WGS84 exploding the multiple geometries +$ ogr2ogr \ + -f "Esri Shapefile" "ler_000b16a_e.4326.shp" \ + -explodecollections \ + -t_srs "EPSG:4326" \ + "ler_000b16a_e.shp" \ + "ler_000b16a_e" + +# Check the number of geometries in the output file +# to confirm the 76 records are exploded into 27 thousand rows +$ ogrinfo -summary ler_000b16a_e.4326.shp ler_000b16a_e.4326 \ + | grep "Feature Count" +Feature Count: 27059 +---- + +[WARNING] +==== +A dataset with a very large amount of parts as on the example below may even hang in {kib} Maps file uploader. +==== + +[float] +==== Reduce the precision + +Some machine generated datasets are stored with more decimals that are strictly necessary. For reference, the GeoJSON RFC 7946 https://datatracker.ietf.org/doc/html/rfc7946#section-11.2[coordinate precision section] specifies six digits to be a common default to around 10 centimeters on the ground. The file uploader in the Maps application will automatically reduce the precision to 6 decimals but for big datasets it is better to do this before uploading. + +`ogr2ogr` generates GeoJSON files with 7 decimal degrees when requesting `RFC7946` compliant files but using the `COORDINATE_PRECISION` https://gdal.org/drivers/vector/geojson.html#layer-creation-options[GeoJSON layer creation option] it can be downsized even more if that is OK for the usage of the data. + +[source,sh] +---- +# Example NAD83 file from https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip +# +# Generate a 2008 GeoJSON file +$ ogr2ogr \ + -f GeoJSON "cb_2018_us_county_5m.4326.geo.json" \ + -t_srs "EPSG:4326" \ + -lco "RFC7946=NO" \ # Request a 2008 GeoJSON file + "cb_2018_us_county_5m.shp" \ + "cb_2018_us_county_5m" + +# Generate a RFC7946 GeoJSON file +$ ogr2ogr \ + -f GeoJSON "cb_2018_us_county_5m.4326.RFC7946.geo.json" \ + -t_srs "EPSG:4326" \ + -lco "RFC7946=YES" \ # Request a 2008 GeoJSON file + "cb_2018_us_county_5m.shp" \ + "cb_2018_us_county_5m" + +# Generate a RFC7946 GeoJSON file with just 5 decimal figures +$ ogr2ogr \ + -f GeoJSON "cb_2018_us_county_5m.4326.RFC7946_mini.geo.json" \ + -t_srs "EPSG:4326" \ + -lco "RFC7946=YES" \ # Request a RFC7946 GeoJSON file + -lco "COORDINATE_PRECISION=5" \ # Downsize to just 5 decimal positions + "cb_2018_us_county_5m.shp" \ + "cb_2018_us_county_5m" + +# Compare the size of the three output files +$ du -h cb_2018_us_county_5m.4326*.geo.json +7,4M cb_2018_us_county_5m.4326.geo.json +6,7M cb_2018_us_county_5m.4326.RFC7946.geo.json +6,1M cb_2018_us_county_5m.4326.RFC7946_mini.geo.json +---- + + +[float] +==== Simplifying region datasets + +Region datasets are polygon datasets where the boundaries of the documents don't overlap. This is common for administrative boundaries, land usage, and other continuous datasets. This type of datasets has the special feature that any geospatial operation modifying the lines of the polygons needs to be applied in the same way to the common sides of the polygons to avoid the generation of slivers or thin gaps and overlaps. + +https://github.com/mbloch/mapshaper[`mapshaper`] is an excellent tool to work with this type of datasets as it understands datasets of this nature and works with them accordingly. 
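+
+If `mapshaper` is not already available, it is distributed as a Node.js package, so it can be installed with `npm`. This is a minimal sketch, assuming a working Node.js setup; check the project README for other installation options:
+
+[source,sh]
+----
+# Install the mapshaper command line tools globally
+$ npm install -g mapshaper
+
+# Print the installed version to confirm the tools are available
+$ mapshaper -v
+----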
+ +Depending on the usage of a region dataset, different geospatial precisions may be adequate. A world countries dataset that is displayed for the entire planet does not need the same precision as a map of the countries in the South Asian continent. + +`mapshaper` offers a https://github.com/mbloch/mapshaper/wiki/Command-Reference#-simplify[`simplify`] command that accepts percentages, resolutions, and different simplification algorithms. + +[source,sh] +---- +# Example NAD83 file from https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip +# +# Generate a baseline GeoJSON file from OGR +$ ogr2ogr \ + -f GeoJSON "cb_2018_us_county_5m.ogr.geo.json" \ + -t_srs "EPSG:4326" \ + -lco RFC7946=YES \ + "cb_2018_us_county_5m.shp" \ + "cb_2018_us_county_5m" + +# Simplify at different percentages with mapshaper +$ for pct in 10 50 75 99; do \ + mapshaper \ + -i "cb_2018_us_county_5m.shp" \ # Input file + -proj "EPSG:4326" \ # Output projection + -simplify "${pct}%" \ # Simplification + -o cb_2018_us_county_5m.mapshaper_${pct}.geo.json; \ # Output file + done + +# Compare the size of the output files +$ du -h cb_2018_us_county_5m*.geo.json +2,0M cb_2018_us_county_5m.mapshaper_10.geo.json +4,1M cb_2018_us_county_5m.mapshaper_50.geo.json +5,3M cb_2018_us_county_5m.mapshaper_75.geo.json +6,7M cb_2018_us_county_5m.mapshaper_99.geo.json +6,7M cb_2018_us_county_5m.ogr.geo.json +---- + + +[float] +==== Fixing incorrect geometries + +The Maps application expects valid GeoJSON or Shapefile datasets. Apart from the mentioned CRS requirement, also geometries inside the dataset need to be valid. Both `ogr2ogr` and `mapshaper` have options to fix invalid geometries: + +* OGR https://gdal.org/programs/ogr2ogr.html#cmdoption-ogr2ogr-makevalid[`-makevalid`] option +* Mapshaper https://github.com/mbloch/mapshaper/wiki/Command-Reference#-clean[`-clean`] command + + +[float] +==== Conclusion + +Both tools are excellent geospatial ETL (Extract Transform and Load) utilities that can do much more than viewed here. Reading the documentation in detail is worth investment to improve the quality of the datasets by removing unwanted fields, refining data types, validating value domains, etc. Finally, being command line utilities, both can be automated and added to QA pipelines. 
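+
+As an example of such a cleanup step, the following sketch drops unneeded attributes while converting to GeoJSON. The `-select` option takes a comma-separated list of fields to keep; the `NAME` and `GEOID` field names are illustrative and need to be adapted to your own dataset:
+
+[source,sh]
+----
+# Keep only the NAME and GEOID fields while converting to WGS84;
+# every other attribute is dropped from the output file
+$ ogr2ogr \
+  -f GeoJSON "cb_2018_us_county_5m.slim.geo.json" \
+  -t_srs "EPSG:4326" \
+  -select "NAME,GEOID" \
+  "cb_2018_us_county_5m.shp" \
+  "cb_2018_us_county_5m"
+----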
From 1afbce6fe3a32f21e818846bc8cf4122c8b6df18 Mon Sep 17 00:00:00 2001 From: Jorge Sanz Date: Tue, 23 Jan 2024 18:44:09 +0100 Subject: [PATCH 2/6] Fixing some small issues and improving the snippets formatting --- docs/maps/trouble-shooting.asciidoc | 52 ++++++++++++++++------------- 1 file changed, 28 insertions(+), 24 deletions(-) diff --git a/docs/maps/trouble-shooting.asciidoc b/docs/maps/trouble-shooting.asciidoc index 4bb4f02e71aa1..e367f528d47a7 100644 --- a/docs/maps/trouble-shooting.asciidoc +++ b/docs/maps/trouble-shooting.asciidoc @@ -74,11 +74,11 @@ With https://gdal.org/programs/ogr2ogr.html[ogr2ogr] (part of the https://gdal.o # Convert the GPX waypoints layer into a GeoJSON file $ ogr2ogr \ -f GeoJSON "waypoints.geo.json" \ # Output format and file name - "fells_loop.pgx" \ # Input File Name + "fells_loop.gpx" \ # Input File Name "waypoints" # Input Layer (usually same as file name) -# Extract the routes -$ ogr2ogr -f "GeoJSON routes.geo.json" "fells_loop.pgx" "routes" +# Extract the routes layer into a GeoJSON file +$ ogr2ogr -f "GeoJSON" "routes.geo.json" "fells_loop.gpx" "routes" ---- [float] @@ -94,16 +94,16 @@ On the following example, `ogr2ogr` transform a shapefile from https://epsg.org/ # # Convert the Census Counties shapefile to WGS84 (EPSG:4326) $ ogr2ogr -f "Esri Shapefile" \ - "cb_2018_us_county_5m.4326.shp" \ + "cb_2018_us_county_5m.4326.shp" \ # Output file -t_srs "EPSG:4326" \ # EPSG:4326 is the code for WGS84 - "cb_2018_us_county_5m.shp" \ - "cb_2018_us_county_5m" + "cb_2018_us_county_5m.shp" \ # Input file + "cb_2018_us_county_5m" # Input layer ---- [float] ==== Explode records with large number of parts -Sometimes geospatial datasets are composed by a small amount of geometries that contain a very large amount of individual part geometries. Depending on the final usage of a dataset, you may want to "explode" this type of dataset to keep one geometry per document, considerably increasing the performance of your index. +Sometimes geospatial datasets are composed by a small amount of geometries that contain a very large amount of individual part geometries. A good example of this situation is on detailed world country boundaries datasets where records for countries like Canada or Philippines have hundreds of small island geometries. Depending on the final usage of a dataset, you may want to "explode" this type of dataset to keep one geometry per document, considerably increasing the performance of your index. [source,sh] ---- @@ -116,22 +116,23 @@ Feature Count: 76 # Convert to WGS84 exploding the multiple geometries $ ogr2ogr \ - -f "Esri Shapefile" "ler_000b16a_e.4326.shp" \ - -explodecollections \ - -t_srs "EPSG:4326" \ - "ler_000b16a_e.shp" \ - "ler_000b16a_e" + -f "Esri Shapefile" \ + "ler_000b16a_e.4326-parts.shp" \ # Output file + -explodecollections \ # Convert multiparts into single records + -t_srs "EPSG:4326" \ # Transform to WGS84 + "ler_000b16a_e.shp" \ # Input file + "ler_000b16a_e" # Input layer # Check the number of geometries in the output file # to confirm the 76 records are exploded into 27 thousand rows -$ ogrinfo -summary ler_000b16a_e.4326.shp ler_000b16a_e.4326 \ +$ ogrinfo -summary ler_000b16a_e.4326-parts.shp ler_000b16a_e.4326 \ | grep "Feature Count" Feature Count: 27059 ---- [WARNING] ==== -A dataset with a very large amount of parts as on the example below may even hang in {kib} Maps file uploader. 
+A dataset containing records with a very large amount of parts as the one from the example above may even hang in {kib} Maps file uploader. ==== [float] @@ -147,30 +148,33 @@ Some machine generated datasets are stored with more decimals that are strictly # # Generate a 2008 GeoJSON file $ ogr2ogr \ - -f GeoJSON "cb_2018_us_county_5m.4326.geo.json" \ - -t_srs "EPSG:4326" \ + -f GeoJSON \ + "cb_2018_us_county_5m.4326.geo.json" \ # Output file + -t_srs "EPSG:4326" \ # Convert to WGS84 -lco "RFC7946=NO" \ # Request a 2008 GeoJSON file "cb_2018_us_county_5m.shp" \ "cb_2018_us_county_5m" # Generate a RFC7946 GeoJSON file $ ogr2ogr \ - -f GeoJSON "cb_2018_us_county_5m.4326.RFC7946.geo.json" \ - -t_srs "EPSG:4326" \ - -lco "RFC7946=YES" \ # Request a 2008 GeoJSON file + -f GeoJSON \ + "cb_2018_us_county_5m.4326.RFC7946.geo.json" \ # Output file + -t_srs "EPSG:4326" \ # Convert to WGS84 + -lco "RFC7946=YES" \ # Request a RFC7946 GeoJSON file "cb_2018_us_county_5m.shp" \ "cb_2018_us_county_5m" # Generate a RFC7946 GeoJSON file with just 5 decimal figures $ ogr2ogr \ - -f GeoJSON "cb_2018_us_county_5m.4326.RFC7946_mini.geo.json" \ - -t_srs "EPSG:4326" \ + -f GeoJSON \ + "cb_2018_us_county_5m.4326.RFC7946_mini.geo.json" \ # Output file + -t_srs "EPSG:4326" \ # Convert to WGS84 -lco "RFC7946=YES" \ # Request a RFC7946 GeoJSON file -lco "COORDINATE_PRECISION=5" \ # Downsize to just 5 decimal positions "cb_2018_us_county_5m.shp" \ "cb_2018_us_county_5m" -# Compare the size of the three output files +# Compare the disk size of the three output files $ du -h cb_2018_us_county_5m.4326*.geo.json 7,4M cb_2018_us_county_5m.4326.geo.json 6,7M cb_2018_us_county_5m.4326.RFC7946.geo.json @@ -181,7 +185,7 @@ $ du -h cb_2018_us_county_5m.4326*.geo.json [float] ==== Simplifying region datasets -Region datasets are polygon datasets where the boundaries of the documents don't overlap. This is common for administrative boundaries, land usage, and other continuous datasets. This type of datasets has the special feature that any geospatial operation modifying the lines of the polygons needs to be applied in the same way to the common sides of the polygons to avoid the generation of slivers or thin gaps and overlaps. +Region datasets are polygon datasets where the boundaries of the documents don't overlap. This is common for administrative boundaries, land usage, and other continuous datasets. This type of datasets has the special feature that any geospatial operation modifying the lines of the polygons needs to be applied in the same way to the common sides of the polygons to avoid the generation of thin gap and overlap artifacts. https://github.com/mbloch/mapshaper[`mapshaper`] is an excellent tool to work with this type of datasets as it understands datasets of this nature and works with them accordingly. @@ -223,7 +227,7 @@ $ du -h cb_2018_us_county_5m*.geo.json [float] ==== Fixing incorrect geometries -The Maps application expects valid GeoJSON or Shapefile datasets. Apart from the mentioned CRS requirement, also geometries inside the dataset need to be valid. Both `ogr2ogr` and `mapshaper` have options to fix invalid geometries: +The Maps application expects valid GeoJSON or Shapefile datasets. Apart from the mentioned CRS requirement, also geometries inside the dataset need to be valid. 
Both `ogr2ogr` and `mapshaper` have options to fix invalid geometries:
+The Maps application expects valid GeoJSON or Shapefile datasets. Apart from the mentioned CRS requirement, also geometries inside the dataset need to be valid. Both `ogr2ogr` and `mapshaper` have options to try to fix invalid geometries:
 
 * OGR https://gdal.org/programs/ogr2ogr.html#cmdoption-ogr2ogr-makevalid[`-makevalid`] option
 * Mapshaper https://github.com/mbloch/mapshaper/wiki/Command-Reference#-clean[`-clean`] command

From 14fbbf4335d0a9f00a7e5d3dbd3253a43da4386a Mon Sep 17 00:00:00 2001
From: Jorge Sanz
Date: Wed, 24 Jan 2024 17:09:22 +0100
Subject: [PATCH 3/6] move the doc to the import data section

---
 docs/maps/clean-data.asciidoc       | 183 ++++++++++++++++++++++++++++
 docs/maps/index.asciidoc            |   1 +
 docs/maps/trouble-shooting.asciidoc | 183 ----------------------------
 3 files changed, 184 insertions(+), 183 deletions(-)
 create mode 100644 docs/maps/clean-data.asciidoc

diff --git a/docs/maps/clean-data.asciidoc b/docs/maps/clean-data.asciidoc
new file mode 100644
index 0000000000000..e3bc95140815a
--- /dev/null
+++ b/docs/maps/clean-data.asciidoc
@@ -0,0 +1,183 @@
+[role="xpack"]
+[[maps-clean-your-data]]
+=== Cleaning your data
+
+// https://github.com/elastic/kibana/issues/135319
+
+Geospatial fields in {es} have certain restrictions that need to be addressed before upload. This section presents a few recipes to help troubleshoot common issues with this type of data.
+
+[float]
+==== Convert to GeoJSON or Shapefile
+
+With https://gdal.org/programs/ogr2ogr.html[ogr2ogr] (part of the https://gdal.org[GDAL/OGR] suite) it is pretty straight forward to convert datasets from dozens of formats into a GeoJSON or Esri Shapefile. For example, converting a GPX file can be achieved with the following commands:
+
+[source,sh]
+----
+# Example GPX file from https://www.topografix.com/gpx_sample_files.asp
+#
+# Convert the GPX waypoints layer into a GeoJSON file
+$ ogr2ogr \
+  -f GeoJSON "waypoints.geo.json" \ # Output format and file name
+  "fells_loop.gpx" \ # Input File Name
+  "waypoints" # Input Layer (usually same as file name)
+
+# Extract the routes layer into a GeoJSON file
+$ ogr2ogr -f "GeoJSON" "routes.geo.json" "fells_loop.gpx" "routes"
+----
+
+[float]
+==== Set up the correct coordinate reference system (CRS)
+
+{es} only supports WGS84 Coordinate Reference System. Also with `ogr2ogr`, converting from one coordinate system to WGS84 is usually supported but it depends on the source CRS.
+
+On the following example, `ogr2ogr` transform a shapefile from https://epsg.org/crs_4269/NAD83.html[NAD83] to https://epsg.org/crs_4326/WGS-84.html[WGS84]. The input CRS is detected automatically thanks to the `.prj` sidecar file in the source dataset.
+
+[source,sh]
+----
+# Example NAD83 file from https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip
+#
+# Convert the Census Counties shapefile to WGS84 (EPSG:4326)
+$ ogr2ogr -f "Esri Shapefile" \
+  "cb_2018_us_county_5m.4326.shp" \ # Output file
+  -t_srs "EPSG:4326" \ # EPSG:4326 is the code for WGS84
+  "cb_2018_us_county_5m.shp" \ # Input file
+  "cb_2018_us_county_5m" # Input layer
+----
+
+[float]
+==== Improve performance by breaking out complex geometries into one geometry per document
+
+Sometimes geospatial datasets are composed of a small number of geometries that contain a very large number of individual part geometries. A good example of this situation is detailed world country boundary datasets, where records for countries like Canada or the Philippines have hundreds of small island geometries.
+Depending on the final usage of a dataset, you may want to break out this type of dataset to keep one geometry per document, considerably improving the performance of your index.
+
+[source,sh]
+----
+# Example NAD83 file from www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/ler_000b16a_e.zip
+#
+# Check the number of input features
+$ ogrinfo -summary ler_000b16a_e.shp ler_000b16a_e \
+  | grep "Feature Count"
+Feature Count: 76
+
+# Convert to WGS84 exploding the multiple geometries
+$ ogr2ogr \
+  -f "Esri Shapefile" \
+  "ler_000b16a_e.4326-parts.shp" \ # Output file
+  -explodecollections \ # Convert multiparts into single records
+  -t_srs "EPSG:4326" \ # Transform to WGS84
+  "ler_000b16a_e.shp" \ # Input file
+  "ler_000b16a_e" # Input layer
+
+# Check the number of geometries in the output file
+# to confirm the 76 records are exploded into 27 thousand rows
+$ ogrinfo -summary ler_000b16a_e.4326-parts.shp ler_000b16a_e.4326 \
+  | grep "Feature Count"
+Feature Count: 27059
+----
+
+[WARNING]
+====
+A dataset containing records with a very large amount of parts as the one from the example above may even hang the {kib} Maps file uploader.
+====
+
+[float]
+==== Reduce the precision
+
+Some machine generated datasets are stored with more decimals that are strictly necessary. For reference, the GeoJSON RFC 7946 https://datatracker.ietf.org/doc/html/rfc7946#section-11.2[coordinate precision section] specifies six digits to be a common default to around 10 centimeters on the ground. The file uploader in the Maps application will automatically reduce the precision to 6 decimals but for big datasets it is better to do this before uploading.
+
+`ogr2ogr` generates GeoJSON files with 7 decimal places when requesting `RFC7946`-compliant files, but with the `COORDINATE_PRECISION` https://gdal.org/drivers/vector/geojson.html#layer-creation-options[GeoJSON layer creation option] the precision can be reduced even further if that is acceptable for the intended usage of the data.
+
+[source,sh]
+----
+# Example NAD83 file from https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip
+#
+# Generate a 2008 GeoJSON file
+$ ogr2ogr \
+  -f GeoJSON \
+  "cb_2018_us_county_5m.4326.geo.json" \ # Output file
+  -t_srs "EPSG:4326" \ # Convert to WGS84
+  -lco "RFC7946=NO" \ # Request a 2008 GeoJSON file
+  "cb_2018_us_county_5m.shp" \
+  "cb_2018_us_county_5m"
+
+# Generate a RFC7946 GeoJSON file
+$ ogr2ogr \
+  -f GeoJSON \
+  "cb_2018_us_county_5m.4326.RFC7946.geo.json" \ # Output file
+  -t_srs "EPSG:4326" \ # Convert to WGS84
+  -lco "RFC7946=YES" \ # Request a RFC7946 GeoJSON file
+  "cb_2018_us_county_5m.shp" \
+  "cb_2018_us_county_5m"
+
+# Generate a RFC7946 GeoJSON file with just 5 decimal figures
+$ ogr2ogr \
+  -f GeoJSON \
+  "cb_2018_us_county_5m.4326.RFC7946_mini.geo.json" \ # Output file
+  -t_srs "EPSG:4326" \ # Convert to WGS84
+  -lco "RFC7946=YES" \ # Request a RFC7946 GeoJSON file
+  -lco "COORDINATE_PRECISION=5" \ # Downsize to just 5 decimal positions
+  "cb_2018_us_county_5m.shp" \
+  "cb_2018_us_county_5m"
+
+# Compare the disk size of the three output files
+$ du -h cb_2018_us_county_5m.4326*.geo.json
+7,4M cb_2018_us_county_5m.4326.geo.json
+6,7M cb_2018_us_county_5m.4326.RFC7946.geo.json
+6,1M cb_2018_us_county_5m.4326.RFC7946_mini.geo.json
+----
+
+
+[float]
+==== Simplifying region datasets
+
+Region datasets are polygon datasets where the boundaries of the documents don't overlap. This is common for administrative boundaries, land usage, and other continuous datasets.
+These datasets have the special property that any geospatial operation modifying the polygon lines needs to be applied in the same way to the shared sides of adjacent polygons, to avoid generating thin gap and overlap artifacts.
+
+https://github.com/mbloch/mapshaper[`mapshaper`] is an excellent tool for working with this type of dataset, as it understands the shared boundaries between polygons and processes them consistently.
+
+Depending on the usage of a region dataset, different geospatial precisions may be adequate. A world countries dataset that is displayed for the entire planet does not need the same precision as a map of the countries of South Asia.
+
+`mapshaper` offers a https://github.com/mbloch/mapshaper/wiki/Command-Reference#-simplify[`simplify`] command that accepts percentages, resolutions, and different simplification algorithms.
+
+[source,sh]
+----
+# Example NAD83 file from https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip
+#
+# Generate a baseline GeoJSON file from OGR
+$ ogr2ogr \
+  -f GeoJSON "cb_2018_us_county_5m.ogr.geo.json" \
+  -t_srs "EPSG:4326" \
+  -lco RFC7946=YES \
+  "cb_2018_us_county_5m.shp" \
+  "cb_2018_us_county_5m"
+
+# Simplify at different percentages with mapshaper
+$ for pct in 10 50 75 99; do \
+  mapshaper \
+    -i "cb_2018_us_county_5m.shp" \ # Input file
+    -proj "EPSG:4326" \ # Output projection
+    -simplify "${pct}%" \ # Simplification
+    -o cb_2018_us_county_5m.mapshaper_${pct}.geo.json; \ # Output file
+  done
+
+# Compare the size of the output files
+$ du -h cb_2018_us_county_5m*.geo.json
+2,0M cb_2018_us_county_5m.mapshaper_10.geo.json
+4,1M cb_2018_us_county_5m.mapshaper_50.geo.json
+5,3M cb_2018_us_county_5m.mapshaper_75.geo.json
+6,7M cb_2018_us_county_5m.mapshaper_99.geo.json
+6,7M cb_2018_us_county_5m.ogr.geo.json
+----
+
+
+[float]
+==== Fixing incorrect geometries
+
+The Maps application expects valid GeoJSON or Shapefile datasets. Apart from the mentioned CRS requirement, also geometries inside the dataset need to be valid. Both `ogr2ogr` and `mapshaper` have options to try to fix invalid geometries:
+
+* OGR https://gdal.org/programs/ogr2ogr.html#cmdoption-ogr2ogr-makevalid[`-makevalid`] option
+* Mapshaper https://github.com/mbloch/mapshaper/wiki/Command-Reference#-clean[`-clean`] command
+
+
+[float]
+==== Conclusion
+
+Both tools are excellent geospatial ETL (Extract Transform and Load) utilities that can do much more than viewed here. Reading the documentation in detail is worth investment to improve the quality of the datasets by removing unwanted fields, refining data types, validating value domains, etc. Finally, being command line utilities, both can be automated and added to QA pipelines.
diff --git a/docs/maps/index.asciidoc b/docs/maps/index.asciidoc
index f924e60cce9e6..fe0c43f6091f8 100644
--- a/docs/maps/index.asciidoc
+++ b/docs/maps/index.asciidoc
@@ -65,4 +65,5 @@ include::map-settings.asciidoc[]
 include::connect-to-ems.asciidoc[]
 include::import-geospatial-data.asciidoc[]
 include::indexing-geojson-data-tutorial.asciidoc[]
+include::clean-data.asciidoc[]
 include::trouble-shooting.asciidoc[]
diff --git a/docs/maps/trouble-shooting.asciidoc b/docs/maps/trouble-shooting.asciidoc
index e367f528d47a7..ebb7ec2a65aa8 100644
--- a/docs/maps/trouble-shooting.asciidoc
+++ b/docs/maps/trouble-shooting.asciidoc
@@ -54,186 +54,3 @@ Increase <> for large data views.
 [float]
==== Custom tiles are not displayed * When using a custom tile service, ensure your tile server has configured https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS[Cross-Origin Resource Sharing (CORS)] so tile requests from your {kib} domain have permission to access your tile server domain. * Ensure custom vector and tile services have the required coordinate system. Vector data must use EPSG:4326 and tiles must use EPSG:3857. - -[float] -=== Cleaning your data before uploading it to {es} - -// https://github.com/elastic/kibana/issues/135319 - -Geospatial fields in {es} have certain restrictions that need to be addressed before upload. On this section a few recipes will be presented to help troubleshooting common issues on this type of data. - -[float] -==== Convert to GeoJSON or Shapefile - -With https://gdal.org/programs/ogr2ogr.html[ogr2ogr] (part of the https://gdal.org[GDAL/OGR] suite) it is pretty straight forward to convert datasets from dozens of formats into a GeoJSON or Esri Shapefile. For example, converting a GPX file can be achieved with the following commands: - -[source,sh] ----- -# Example GPX file from https://www.topografix.com/gpx_sample_files.asp -# -# Convert the GPX waypoints layer into a GeoJSON file -$ ogr2ogr \ - -f GeoJSON "waypoints.geo.json" \ # Output format and file name - "fells_loop.gpx" \ # Input File Name - "waypoints" # Input Layer (usually same as file name) - -# Extract the routes layer into a GeoJSON file -$ ogr2ogr -f "GeoJSON" "routes.geo.json" "fells_loop.gpx" "routes" ----- - -[float] -==== Set up the correct coordinate reference system (CRS) - -{es} only supports WGS84 Coordinate Reference System. Also with `ogr2ogr`, converting from one coordinate system to WGS84 is usually supported but it depends on the source CRS. - -On the following example, `ogr2ogr` transform a shapefile from https://epsg.org/crs_4269/NAD83.html[NAD83] to https://epsg.org/crs_4326/WGS-84.html[WGS84]. The input CRS is detected automatically thanks to the `.prj` sidecar file in the source dataset. - -[source,sh] ----- -# Example NAD83 file from https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip -# -# Convert the Census Counties shapefile to WGS84 (EPSG:4326) -$ ogr2ogr -f "Esri Shapefile" \ - "cb_2018_us_county_5m.4326.shp" \ # Output file - -t_srs "EPSG:4326" \ # EPSG:4326 is the code for WGS84 - "cb_2018_us_county_5m.shp" \ # Input file - "cb_2018_us_county_5m" # Input layer ----- - -[float] -==== Explode records with large number of parts - -Sometimes geospatial datasets are composed by a small amount of geometries that contain a very large amount of individual part geometries. A good example of this situation is on detailed world country boundaries datasets where records for countries like Canada or Philippines have hundreds of small island geometries. Depending on the final usage of a dataset, you may want to "explode" this type of dataset to keep one geometry per document, considerably increasing the performance of your index. 
- -[source,sh] ----- -# Example NAD83 file from www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/ler_000b16a_e.zip -# -# Check the number of input features -$ ogrinfo -summary ler_000b16a_e.shp ler_000b16a_e \ - | grep "Feature Count" -Feature Count: 76 - -# Convert to WGS84 exploding the multiple geometries -$ ogr2ogr \ - -f "Esri Shapefile" \ - "ler_000b16a_e.4326-parts.shp" \ # Output file - -explodecollections \ # Convert multiparts into single records - -t_srs "EPSG:4326" \ # Transform to WGS84 - "ler_000b16a_e.shp" \ # Input file - "ler_000b16a_e" # Input layer - -# Check the number of geometries in the output file -# to confirm the 76 records are exploded into 27 thousand rows -$ ogrinfo -summary ler_000b16a_e.4326-parts.shp ler_000b16a_e.4326 \ - | grep "Feature Count" -Feature Count: 27059 ----- - -[WARNING] -==== -A dataset containing records with a very large amount of parts as the one from the example above may even hang in {kib} Maps file uploader. -==== - -[float] -==== Reduce the precision - -Some machine generated datasets are stored with more decimals that are strictly necessary. For reference, the GeoJSON RFC 7946 https://datatracker.ietf.org/doc/html/rfc7946#section-11.2[coordinate precision section] specifies six digits to be a common default to around 10 centimeters on the ground. The file uploader in the Maps application will automatically reduce the precision to 6 decimals but for big datasets it is better to do this before uploading. - -`ogr2ogr` generates GeoJSON files with 7 decimal degrees when requesting `RFC7946` compliant files but using the `COORDINATE_PRECISION` https://gdal.org/drivers/vector/geojson.html#layer-creation-options[GeoJSON layer creation option] it can be downsized even more if that is OK for the usage of the data. - -[source,sh] ----- -# Example NAD83 file from https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip -# -# Generate a 2008 GeoJSON file -$ ogr2ogr \ - -f GeoJSON \ - "cb_2018_us_county_5m.4326.geo.json" \ # Output file - -t_srs "EPSG:4326" \ # Convert to WGS84 - -lco "RFC7946=NO" \ # Request a 2008 GeoJSON file - "cb_2018_us_county_5m.shp" \ - "cb_2018_us_county_5m" - -# Generate a RFC7946 GeoJSON file -$ ogr2ogr \ - -f GeoJSON \ - "cb_2018_us_county_5m.4326.RFC7946.geo.json" \ # Output file - -t_srs "EPSG:4326" \ # Convert to WGS84 - -lco "RFC7946=YES" \ # Request a RFC7946 GeoJSON file - "cb_2018_us_county_5m.shp" \ - "cb_2018_us_county_5m" - -# Generate a RFC7946 GeoJSON file with just 5 decimal figures -$ ogr2ogr \ - -f GeoJSON \ - "cb_2018_us_county_5m.4326.RFC7946_mini.geo.json" \ # Output file - -t_srs "EPSG:4326" \ # Convert to WGS84 - -lco "RFC7946=YES" \ # Request a RFC7946 GeoJSON file - -lco "COORDINATE_PRECISION=5" \ # Downsize to just 5 decimal positions - "cb_2018_us_county_5m.shp" \ - "cb_2018_us_county_5m" - -# Compare the disk size of the three output files -$ du -h cb_2018_us_county_5m.4326*.geo.json -7,4M cb_2018_us_county_5m.4326.geo.json -6,7M cb_2018_us_county_5m.4326.RFC7946.geo.json -6,1M cb_2018_us_county_5m.4326.RFC7946_mini.geo.json ----- - - -[float] -==== Simplifying region datasets - -Region datasets are polygon datasets where the boundaries of the documents don't overlap. This is common for administrative boundaries, land usage, and other continuous datasets. 
This type of datasets has the special feature that any geospatial operation modifying the lines of the polygons needs to be applied in the same way to the common sides of the polygons to avoid the generation of thin gap and overlap artifacts. - -https://github.com/mbloch/mapshaper[`mapshaper`] is an excellent tool to work with this type of datasets as it understands datasets of this nature and works with them accordingly. - -Depending on the usage of a region dataset, different geospatial precisions may be adequate. A world countries dataset that is displayed for the entire planet does not need the same precision as a map of the countries in the South Asian continent. - -`mapshaper` offers a https://github.com/mbloch/mapshaper/wiki/Command-Reference#-simplify[`simplify`] command that accepts percentages, resolutions, and different simplification algorithms. - -[source,sh] ----- -# Example NAD83 file from https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip -# -# Generate a baseline GeoJSON file from OGR -$ ogr2ogr \ - -f GeoJSON "cb_2018_us_county_5m.ogr.geo.json" \ - -t_srs "EPSG:4326" \ - -lco RFC7946=YES \ - "cb_2018_us_county_5m.shp" \ - "cb_2018_us_county_5m" - -# Simplify at different percentages with mapshaper -$ for pct in 10 50 75 99; do \ - mapshaper \ - -i "cb_2018_us_county_5m.shp" \ # Input file - -proj "EPSG:4326" \ # Output projection - -simplify "${pct}%" \ # Simplification - -o cb_2018_us_county_5m.mapshaper_${pct}.geo.json; \ # Output file - done - -# Compare the size of the output files -$ du -h cb_2018_us_county_5m*.geo.json -2,0M cb_2018_us_county_5m.mapshaper_10.geo.json -4,1M cb_2018_us_county_5m.mapshaper_50.geo.json -5,3M cb_2018_us_county_5m.mapshaper_75.geo.json -6,7M cb_2018_us_county_5m.mapshaper_99.geo.json -6,7M cb_2018_us_county_5m.ogr.geo.json ----- - - -[float] -==== Fixing incorrect geometries - -The Maps application expects valid GeoJSON or Shapefile datasets. Apart from the mentioned CRS requirement, also geometries inside the dataset need to be valid. Both `ogr2ogr` and `mapshaper` have options to try to fix invalid geometries: - -* OGR https://gdal.org/programs/ogr2ogr.html#cmdoption-ogr2ogr-makevalid[`-makevalid`] option -* Mapshaper https://github.com/mbloch/mapshaper/wiki/Command-Reference#-clean[`-clean`] command - - -[float] -==== Conclusion - -Both tools are excellent geospatial ETL (Extract Transform and Load) utilities that can do much more than viewed here. Reading the documentation in detail is worth investment to improve the quality of the datasets by removing unwanted fields, refining data types, validating value domains, etc. Finally, being command line utilities, both can be automated and added to QA pipelines. 
From 6f54ed46e6e58dfbfbc160f42acd0e41d7cd60d6 Mon Sep 17 00:00:00 2001
From: Jorge Sanz
Date: Fri, 26 Jan 2024 16:14:42 +0100
Subject: [PATCH 4/6] feedback fix

---
 docs/maps/clean-data.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/maps/clean-data.asciidoc b/docs/maps/clean-data.asciidoc
index e3bc95140815a..39e4ca7db3891 100644
--- a/docs/maps/clean-data.asciidoc
+++ b/docs/maps/clean-data.asciidoc
@@ -1,6 +1,6 @@
 [role="xpack"]
 [[maps-clean-your-data]]
-=== Cleaning your data
+=== Clean your data
 
 // https://github.com/elastic/kibana/issues/135319
 

From a6606a91da76d49dc087e06a61534bf96cec6bcb Mon Sep 17 00:00:00 2001
From: Jorge Sanz
Date: Tue, 30 Jan 2024 19:31:43 +0100
Subject: [PATCH 5/6] Update docs/maps/clean-data.asciidoc

Co-authored-by: Nick Peihl
---
 docs/maps/clean-data.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/maps/clean-data.asciidoc b/docs/maps/clean-data.asciidoc
index 39e4ca7db3891..64634be67c4ed 100644
--- a/docs/maps/clean-data.asciidoc
+++ b/docs/maps/clean-data.asciidoc
@@ -82,7 +82,7 @@ A dataset containing records with a very large amount of parts as the one from t
 [float]
 ==== Reduce the precision
 
-Some machine generated datasets are stored with more decimals that are strictly necessary. For reference, the GeoJSON RFC 7946 https://datatracker.ietf.org/doc/html/rfc7946#section-11.2[coordinate precision section] specifies six digits to be a common default to around 10 centimeters on the ground. The file uploader in the Maps application will automatically reduce the precision to 6 decimals but for big datasets it is better to do this before uploading.
+Some machine-generated datasets are stored with more decimals than are strictly necessary. For reference, the GeoJSON RFC 7946 https://datatracker.ietf.org/doc/html/rfc7946#section-11.2[coordinate precision section] specifies six digits as a common default, which corresponds to around 10 centimeters on the ground. The file uploader in the Maps application will automatically reduce the precision to 6 decimals, but for big datasets it is better to do this before uploading.
 
 `ogr2ogr` generates GeoJSON files with 7 decimal places when requesting `RFC7946`-compliant files, but with the `COORDINATE_PRECISION` https://gdal.org/drivers/vector/geojson.html#layer-creation-options[GeoJSON layer creation option] the precision can be reduced even further if that is acceptable for the intended usage of the data.
 

From a649b888f30ac85f0ecdc2fb56cba322643a4663 Mon Sep 17 00:00:00 2001
From: Jorge Sanz
Date: Wed, 31 Jan 2024 12:36:01 +0100
Subject: [PATCH 6/6] Improvements from feedback

---
 docs/maps/clean-data.asciidoc | 14 +++++++-------
 docs/maps/index.asciidoc      |  2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/maps/clean-data.asciidoc b/docs/maps/clean-data.asciidoc
index 64634be67c4ed..538e84be53517 100644
--- a/docs/maps/clean-data.asciidoc
+++ b/docs/maps/clean-data.asciidoc
@@ -9,7 +9,7 @@ Geospatial fields in {es} have certain restrictions that need to be addressed be
 [float]
 ==== Convert to GeoJSON or Shapefile
 
-With https://gdal.org/programs/ogr2ogr.html[ogr2ogr] (part of the https://gdal.org[GDAL/OGR] suite) it is pretty straight forward to convert datasets from dozens of formats into a GeoJSON or Esri Shapefile. For example, converting a GPX file can be achieved with the following commands:
+Use https://gdal.org/programs/ogr2ogr.html[ogr2ogr] (part of the https://gdal.org[GDAL/OGR] suite) to convert datasets into a GeoJSON or Esri Shapefile. For example, use the following commands to convert a GPX file into GeoJSON:
 
 [source,sh]
 ----
 # Example GPX file from https://www.topografix.com/gpx_sample_files.asp
 #
 # Convert the GPX waypoints layer into a GeoJSON file
 $ ogr2ogr \
   -f GeoJSON "waypoints.geo.json" \ # Output format and file name
@@ -26,11 +26,11 @@ $ ogr2ogr -f "GeoJSON" "routes.geo.json" "fells_loop.gpx" "routes"
 ----
 
 [float]
-==== Set up the correct coordinate reference system (CRS)
+==== Convert to WGS84 Coordinate Reference System
 
-{es} only supports WGS84 Coordinate Reference System. Also with `ogr2ogr`, converting from one coordinate system to WGS84 is usually supported but it depends on the source CRS.
+{es} only supports the WGS84 coordinate reference system. Use `ogr2ogr` to convert from other coordinate systems to WGS84.
 
-On the following example, `ogr2ogr` transform a shapefile from https://epsg.org/crs_4269/NAD83.html[NAD83] to https://epsg.org/crs_4326/WGS-84.html[WGS84]. The input CRS is detected automatically thanks to the `.prj` sidecar file in the source dataset.
+In the following example, `ogr2ogr` transforms a shapefile from https://epsg.org/crs_4269/NAD83.html[NAD83] to https://epsg.org/crs_4326/WGS-84.html[WGS84]. The input CRS is detected automatically thanks to the `.prj` sidecar file in the source dataset.
 
 [source,sh]
 ----
@@ -171,13 +171,13 @@ $ du -h cb_2018_us_county_5m*.geo.json
 [float]
 ==== Fixing incorrect geometries
 
-The Maps application expects valid GeoJSON or Shapefile datasets. Apart from the mentioned CRS requirement, also geometries inside the dataset need to be valid. Both `ogr2ogr` and `mapshaper` have options to try to fix invalid geometries:
+The Maps application expects valid GeoJSON or Shapefile datasets. Apart from the mentioned CRS requirement, geometries need to be valid. Both `ogr2ogr` and `mapshaper` have options to try to fix invalid geometries:
 
 * OGR https://gdal.org/programs/ogr2ogr.html#cmdoption-ogr2ogr-makevalid[`-makevalid`] option
 * Mapshaper https://github.com/mbloch/mapshaper/wiki/Command-Reference#-clean[`-clean`] command
 
 
 [float]
-==== Conclusion
+==== And so much more
 
-Both tools are excellent geospatial ETL (Extract Transform and Load) utilities that can do much more than viewed here. Reading the documentation in detail is worth investment to improve the quality of the datasets by removing unwanted fields, refining data types, validating value domains, etc. Finally, being command line utilities, both can be automated and added to QA pipelines.
+`ogr2ogr` and `mapshaper` are excellent geospatial ETL (Extract, Transform, and Load) utilities that can do much more than what is shown here. Reading their documentation in detail is a worthwhile investment that improves the quality of your datasets by removing unwanted fields, refining data types, validating value domains, and so on. Finally, being command line utilities, both can be automated and added to QA pipelines.
diff --git a/docs/maps/index.asciidoc b/docs/maps/index.asciidoc
index fe0c43f6091f8..6023cbef8a91d 100644
--- a/docs/maps/index.asciidoc
+++ b/docs/maps/index.asciidoc
@@ -64,6 +64,6 @@ include::search.asciidoc[]
 include::map-settings.asciidoc[]
 include::connect-to-ems.asciidoc[]
 include::import-geospatial-data.asciidoc[]
-include::indexing-geojson-data-tutorial.asciidoc[]
 include::clean-data.asciidoc[]
+include::indexing-geojson-data-tutorial.asciidoc[]
 include::trouble-shooting.asciidoc[]