
read_mdim cannot find array but cli gdalmdiminfo finds array just fine? #659

cboettig opened this issue Jan 2, 2024 · 17 comments

cboettig commented Jan 2, 2024

The following fails with error:

public_S3_url <- "https://storage.googleapis.com/neon-int-sae-tmp-files/ods/dataproducts/DP4/2020-04-26/GRSM/NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5"

co2 <- paste0("/vsicurl/", public_S3_url) |> stars::read_mdim("/GRSM/dp04/data/fluxCo2/nsae")
Error: Cannot find array

But the same array is easily located by gdalmdiminfo, so I don't understand why this error occurs:

system(paste("gdalmdiminfo",  paste0("/vsicurl/", public_S3_url), "-array", "/GRSM/dp04/data/fluxCo2/nsae"))

Testing using stars 0.6.4 with libraries:

Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

edzer commented Jan 3, 2024

There's some pretty heavy nested grouping going on in that file!


cboettig commented Jan 4, 2024

> There's some pretty heavy nested grouping going on in that file!

🤣 Haha, so true. There are so many ridiculous examples of nesting in HDF5 that I sometimes think they intentionally serialize data in the worst way possible (but then I remember that this is the pre-release format -- the current format is then manually gzip-compressed to an .h5.gz file). For context, these esoteric creatures are perhaps the flagship data product (eddy covariance measurements from flux towers) of the 81 NEON sites on which the NSF is spending nearly half a billion dollars: https://data.neonscience.org/data-products/DP4.00200.001


edzer commented Jan 4, 2024

I can get it to work with a hack that passes on the group nesting, giving something like

> stars::read_mdim("/vsicurl/https://storage.googleapis.com/neon-int-sae-tmp-files/ods/dataproducts/DP4/2020-04-26/GRSM/NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5", "nsae", groups = c("GRSM", "dp04", "data", "fluxCo2"))

Read failed for array nsae
stars object with 1 dimensions and 1 attribute
attribute(s):
      Min. 1st Qu. Median Mean 3rd Qu. Max.
nsae     0       0      0    0       0    0
dimension(s):
     from to offset delta
dim0    1 48    0.5     1
Warning message:
In CPL_read_mdim(file, array_name, groups, options, offset, count,  :
  GDAL Error 1: Array data type is not convertible to buffer data type

indicating that it gets to the array, opens it, but cannot read it.

If you look at the actual array values of the array, e.g. with

gdalmdiminfo -detailed /vsicurl/https://storage.googleapis.com/neon-int-sae-tmp-files/ods/dataproducts/DP4/2020-04-26/GRSM/NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5 -array /GRSM/dp04/data/fluxCo2/nsae

I think it may become clear why this may be (and remain?) a problem: the array values are database tuples.

Have you tried reading these data with some hdf5 package (h5?) directly?

edzer added a commit to r-spatial/sf that referenced this issue Jan 4, 2024
edzer added a commit that referenced this issue Jan 4, 2024
edzer added a commit that referenced this issue Jan 4, 2024

edzer commented Jan 4, 2024

These commits are probably going to be replaced by the way gdalmdiminfo does this.


cboettig commented Jan 4, 2024

> Have you tried reading these data with some hdf5 package (h5?) directly?

rhdf5 will subset this file remotely rather well. I couldn't get this to work with the others.

  public_S3_url <- "https://storage.googleapis.com/neon-int-sae-tmp-files/ods/dataproducts/DP4/2020-04-26/GRSM/NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5"
  
  #listGrp <- rhdf5::h5ls(file = public_S3_url, s3 = TRUE) # slow! 
  
  # accessing a given array is fast though
  nee <- rhdf5::h5read(file = public_S3_url,
                       name = glue::glue("{site}/dp04/data/fluxCo2/turb",
                                         site="GRSM"), 
                       s3 = TRUE) 

(rhdf5 can't handle the stupid case where these are gzipped; I know /vsigzip/ can be slow, but it works here. Still, you're probably right that gdalmdiminfo isn't ideal for this!)

In other news, I'd love to see proxy support for stars::read_mdim, e.g. for large Zarr archives, so we can do R examples like https://planetarycomputer.microsoft.com/dataset/daymet-daily-na#Example-Notebook


edzer commented Jan 5, 2024

> In other news, I'd love to see proxy support for stars::read_mdim, e.g. for large Zarr archives

what would you like to see beyond the ability to read one or more slices, or sub-cubes (currently in offset, count and step)?

edzer added a commit that referenced this issue Jan 5, 2024

cboettig commented Jan 5, 2024

> what would you like to see beyond the ability to read one or more slices, or sub-cubes (currently in offset, count and step)?

Thanks, and apologies: I probably should have opened a separate issue or Stack Overflow post, as this could just be my lack of understanding.

One part of this is just ergonomics. I imagine I can crop spatial extents by setting the appropriate combination of count and offset, and downsample with step, but this doesn't match the syntax we have elsewhere in stars to do things like crop, right? What about things like aggregation? In Python, the xarray syntax for all these operations appears to stay entirely consistent whether we're reading from Zarr, reading NetCDF, or even reading COGs from STAC.
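The crop-via-count/offset bookkeeping mentioned above can be sketched with a small helper. This is purely illustrative (not part of stars): it maps a 1-D extent onto the 0-based offset and cell count that read_mdim()-style subsetting expects, for a regular axis with a made-up origin and cell size.

```r
# Hypothetical helper: map an extent [xmin, xmax] on a regular axis
# (given its origin and cell size delta) onto a 0-based offset and a
# cell count, the quantities read_mdim()-style subsetting works with.
bbox_to_offset_count <- function(xmin, xmax, origin, delta) {
  offset <- floor((xmin - origin) / delta)            # first cell covering xmin
  count  <- ceiling((xmax - origin) / delta) - offset # cells needed up to xmax
  c(offset = offset, count = count)
}

# e.g. a 0.25-degree axis starting at -180, cropped to [-125, -114]
# (roughly California's longitude range):
bbox_to_offset_count(-125, -114, origin = -180, delta = 0.25)
# -> offset 220, count 44
```

This is exactly the arithmetic a crop-style interface could hide from the user.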

I think my idealized workflow would actually look something like:

library(spData)
ca <- us_states[us_states$NAME == "California",] |> st_transform(4326)

library(stars)
url <- "https://daymeteuwest.blob.core.windows.net/daymet-zarr/monthly/na.zarr"

# desired but not functional syntax:
x <- read_stars(url) |> st_crop(ca)

and have a lazy-read object where I can slice times (or other dimensions) on indices. Perhaps my expectations here are just entirely unreasonable. Yes, I know we currently need the GDAL decorators to even get this to parse:

dsn <- glue::glue('ZARR:"/vsicurl/{url}":/tmax')
tmax <- read_stars(dsn)
x <- read_stars(url) # without the decorators, this fails to parse

This works, but it already confuses my students -- this isn't the way I previously taught them to select arrays (tmax) in read_stars, and the prefix/quotation syntax causes trouble. I understand that mdim is also the preferred interface for Zarr, but the offset / count / step syntax gets progressively harder to anticipate, and again differs from the way these concepts were introduced previously for stars objects:

x <- read_mdim(dsn, count = c(NA, NA, 492), step=c(4, 4, 12))

The ergonomics are again hard here because we hardwire numerical values into code that could be obtained from the object metadata but are otherwise not immediately obvious to the user. This also gives me a bunch of warnings:

x <- read_mdim(dsn, count = c(NA, NA, 492), step=c(4, 4, 12))
Read failed for array prcp
Read failed for array swe
Read failed for array tmax
Read failed for array tmin
Read failed for array vp
Warning messages:
1: In CPL_read_mdim(file, array_name, options, offset, count, step,  :
  GDAL Error 1: arrayStartIdx[0] + (count[0]-1) * arrayStep[0] >= 492
2: In CPL_read_mdim(file, array_name, options, offset, count, step,  :
  GDAL Error 1: arrayStartIdx[0] + (count[0]-1) * arrayStep[0] >= 492
3: In CPL_read_mdim(file, array_name, options, offset, count, step,  :
  GDAL Error 1: arrayStartIdx[0] + (count[0]-1) * arrayStep[0] >= 492
4: In CPL_read_mdim(file, array_name, options, offset, count, step,  :
  GDAL Error 1: arrayStartIdx[0] + (count[0]-1) * arrayStep[0] >= 492
5: In CPL_read_mdim(file, array_name, options, offset, count, step,  :
  GDAL Error 1: arrayStartIdx[0] + (count[0]-1) * arrayStep[0] >= 492
6: In create_dimension(values = x, is_raster = !sf && i %in% 1:2, point = ifelse(length(x) ==  :
  dimension value(s) non-finite

and seems to have an invalid crs which I can't override:

> st_crs(x) <- 4326
Error in st_crs.character(x[[xy[1]]]$refsys) : invalid crs: udunits

and attempting to plot it crashes R on a machine with 100 GB of RAM.
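For what it's worth, the GDAL error above is plain index arithmetic: a strided read is rejected as soon as offset + (count - 1) * step reaches the dimension size. A minimal sketch (assuming 0-based offsets, as GDAL uses):

```r
# GDAL's bound check: a strided read is valid only if
#   offset + (count - 1) * step < n   (0-based, n = dimension size)
# With n = 492 and step = 12, requesting count = 492 overruns
# (0 + 491 * 12 = 5892 >= 492), which is exactly the warning above.
# The largest admissible count is:
max_count <- function(n, offset = 0, step = 1) {
  (n - 1 - offset) %/% step + 1
}
max_count(492, step = 12)
# [1] 41
```

So with step = 12 on the 492-step time axis, a count of at most 41 would pass the check -- yet another value the user currently has to derive from metadata by hand.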

edzer added a commit to r-spatial/sf that referenced this issue Jan 5, 2024
edzer added a commit that referenced this issue Jan 5, 2024

edzer commented Jan 5, 2024

This gives us

> stars::read_mdim("NEON.D07.GRSM.DP4.00200.001.nsae.2020-04-26.expanded.20230921T200400Z.h5", "/GRSM/dp04/data/fluxCo2/nsae")
                    timeBgn                  timeEnd        flux
1  2020-04-26T00:00:00.000Z 2020-04-26T00:29:59.000Z         NaN
2  2020-04-26T00:30:00.000Z 2020-04-26T00:59:59.000Z  -0.7405883
3  2020-04-26T01:00:00.000Z 2020-04-26T01:29:59.000Z  -0.2856589
4  2020-04-26T01:30:00.000Z 2020-04-26T01:59:59.000Z   2.7275024
5  2020-04-26T02:00:00.000Z 2020-04-26T02:29:59.000Z  14.2427594
6  2020-04-26T02:30:00.000Z 2020-04-26T02:59:59.000Z  -4.1640435
7  2020-04-26T03:00:00.000Z 2020-04-26T03:29:59.000Z -23.5961781
8  2020-04-26T03:30:00.000Z 2020-04-26T03:59:59.000Z -38.7475553
9  2020-04-26T04:00:00.000Z 2020-04-26T04:29:59.000Z  -5.9633267
10 2020-04-26T04:30:00.000Z 2020-04-26T04:59:59.000Z  -1.9345659
11 2020-04-26T05:00:00.000Z 2020-04-26T05:29:59.000Z   5.7533219
12 2020-04-26T05:30:00.000Z 2020-04-26T05:59:59.000Z   3.6831897
13 2020-04-26T06:00:00.000Z 2020-04-26T06:29:59.000Z  -5.8330796
14 2020-04-26T06:30:00.000Z 2020-04-26T06:59:59.000Z   2.9189290
15 2020-04-26T07:00:00.000Z 2020-04-26T07:29:59.000Z   3.0636229
16 2020-04-26T07:30:00.000Z 2020-04-26T07:59:59.000Z   3.0182718
17 2020-04-26T08:00:00.000Z 2020-04-26T08:29:59.000Z   2.9283239
18 2020-04-26T08:30:00.000Z 2020-04-26T08:59:59.000Z  -3.0542519
19 2020-04-26T09:00:00.000Z 2020-04-26T09:29:59.000Z   5.5374485
20 2020-04-26T09:30:00.000Z 2020-04-26T09:59:59.000Z   0.2513057
21 2020-04-26T10:00:00.000Z 2020-04-26T10:29:59.000Z   0.6168280
22 2020-04-26T10:30:00.000Z 2020-04-26T10:59:59.000Z  -5.4693606
23 2020-04-26T11:00:00.000Z 2020-04-26T11:29:59.000Z   0.3674658
24 2020-04-26T11:30:00.000Z 2020-04-26T11:59:59.000Z   1.4159392
25 2020-04-26T12:00:00.000Z 2020-04-26T12:29:59.000Z  -1.0238224
26 2020-04-26T12:30:00.000Z 2020-04-26T12:59:59.000Z   0.7054844
27 2020-04-26T13:00:00.000Z 2020-04-26T13:29:59.000Z  -8.9968282
28 2020-04-26T13:30:00.000Z 2020-04-26T13:59:59.000Z  12.3038478
29 2020-04-26T14:00:00.000Z 2020-04-26T14:29:59.000Z  -3.2665178
30 2020-04-26T14:30:00.000Z 2020-04-26T14:59:59.000Z  -3.2718385
31 2020-04-26T15:00:00.000Z 2020-04-26T15:29:59.000Z -10.0929240
32 2020-04-26T15:30:00.000Z 2020-04-26T15:59:59.000Z  -5.9556013
33 2020-04-26T16:00:00.000Z 2020-04-26T16:29:59.000Z  -8.2911191
34 2020-04-26T16:30:00.000Z 2020-04-26T16:59:59.000Z  -5.0356033
35 2020-04-26T17:00:00.000Z 2020-04-26T17:29:59.000Z  -8.3811585
36 2020-04-26T17:30:00.000Z 2020-04-26T17:59:59.000Z  -1.6936340
37 2020-04-26T18:00:00.000Z 2020-04-26T18:29:59.000Z  14.2182930
38 2020-04-26T18:30:00.000Z 2020-04-26T18:59:59.000Z -10.0673130
39 2020-04-26T19:00:00.000Z 2020-04-26T19:29:59.000Z  -1.8299035
40 2020-04-26T19:30:00.000Z 2020-04-26T19:59:59.000Z  -0.7272567
41 2020-04-26T20:00:00.000Z 2020-04-26T20:29:59.000Z   8.0110202
42 2020-04-26T20:30:00.000Z 2020-04-26T20:59:59.000Z         NaN
43 2020-04-26T21:00:00.000Z 2020-04-26T21:29:59.000Z  12.1214397
44 2020-04-26T21:30:00.000Z 2020-04-26T21:59:59.000Z  -2.1407618
45 2020-04-26T22:00:00.000Z 2020-04-26T22:29:59.000Z -65.3029157
46 2020-04-26T22:30:00.000Z 2020-04-26T22:59:59.000Z   0.4240327
47 2020-04-26T23:00:00.000Z 2020-04-26T23:29:59.000Z  -0.8366531
48 2020-04-26T23:30:00.000Z 2020-04-26T23:59:59.000Z -12.9685483

although it still needs a fix if the numeric data is not Float64.

I'll come back to your other comments - I agree, interfaces and usability are important.

edzer added a commit that referenced this issue Jul 24, 2024

edzer commented Jul 24, 2024

This hides the offset, count and step arguments behind st_crop() or [ calls, giving for example

library(sf)
# Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(spData)
# CRS from https://daac.ornl.gov/DAYMET/guides/Daymet_V4_Monthly_Climatology.html:
lcc = st_crs("+proj=lcc +lat_1=25 +lat_2=60 +lat_0=42.5 +lon_0=-100 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs")
ca <- us_states[us_states$NAME=="California",] |> st_transform(lcc)

library(stars)
# Loading required package: abind
url <- "https://daymeteuwest.blob.core.windows.net/daymet-zarr/monthly/na.zarr"
dsn <- glue::glue('ZARR:"/vsicurl/{url}"') # all arrays
(p <- read_mdim(dsn, proxy = TRUE)) # read only array names and dimensions
# stars_proxy object with 5 attributes in 5 file(s):
# $prcp
# [1] "[...]/na.zarr"

# $swe
# [1] "[...]/na.zarr"

# $tmax
# [1] "[...]/na.zarr"

# $tmin
# [1] "[...]/na.zarr"

# $vp
# [1] "[...]/na.zarr"

# dimension(s):
#      from   to   offset delta  refsys                             values x/y
# x       1 7814 -4560750  1000      NA                               NULL [x]
# y       1 8075  4984500 -1000      NA                               NULL [y]
# time    1  492       NA    NA POSIXct 1980-01-16 12:00:00,...,2020-12-16    
p["tmax",,,1:20] |>  # select one array and 20 time steps
  st_set_crs(lcc) |> # p doesn't have a CRS; set it to that of ca
  st_crop(ca) -> crp # crop the ca area
hook = function() plot(st_geometry(ca), col = NA, border = 'orange', add = TRUE) # plot border
plot(crp, hook = hook)
# downsample set to 7


edzer added a commit that referenced this issue Jul 24, 2024
goergen95 commented Aug 12, 2024

Hi,

I am running into an error on the GDAL side, with the latest sf and stars installed, when trying to run the example above:

#> Error 1: Decompressor blosc not handled

A quick search turned up the same error message in two earlier Pangeo threads (pangeo-data/pangeo#196 and pangeo-forge/pangeo-forge-recipes#58), but it is still unclear to me what is causing it.

reprex
remotes::install_github("r-spatial/sf")
#> Skipping install of 'sf' from a github remote, the SHA1 (6e13be91) has not changed since last install.
#>   Use `force = TRUE` to force installation
remotes::install_github("r-spatial/stars")
#> Skipping install of 'stars' from a github remote, the SHA1 (b1be4573) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(sf)
#> Linking to GEOS 3.12.2, GDAL 3.9.1, PROJ 9.4.1; sf_use_s2() is TRUE
library(spData)
lcc = st_crs("+proj=lcc +lat_1=25 +lat_2=60 +lat_0=42.5 +lon_0=-100 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs")
ca <- us_states[us_states$NAME=="California",] |> st_transform(lcc)

library(stars)
#> Loading required package: abind
url <- "https://daymeteuwest.blob.core.windows.net/daymet-zarr/monthly/na.zarr"
dsn <- glue::glue('ZARR:"/vsicurl/{url}"') # all arrays
(p <- read_mdim(dsn, proxy = TRUE)) # read only array names and dimensions
#> Warning in CPL_read_mdim(file, array_name, options, offset, count, step, : GDAL
#> Error 1: Decompressor blosc not handled
#> (the same warning is repeated ten more times, once per array read)
#> Error in which(sapply(x, function(i) inherits(i$values, "sfc"))): argument to 'which' is not logical
p["tmax",,,1:20] |>  # select one array and 20 time steps
  st_set_crs(lcc) |> # p doesn't have a CRS; set it to that of ca
  st_crop(ca) -> crp # crop the ca area
#> Error in eval(expr, envir, enclos): object 'p' not found
hook = function() plot(st_geometry(ca), col = NA, border = 'orange', add = TRUE) # plot border
plot(crp, hook = hook)
#> Error in eval(expr, envir, enclos): object 'crp' not found

Created on 2024-08-12 with reprex v2.1.0

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14)
#>  os       Ubuntu 22.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Etc/UTC
#>  date     2024-08-12
#>  pandoc   3.2 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  abind       * 1.4-5   2016-07-21 [1] CRAN (R 4.4.1)
#>  class         7.3-22  2023-05-03 [2] CRAN (R 4.4.1)
#>  classInt      0.4-10  2023-09-05 [1] CRAN (R 4.4.1)
#>  cli           3.6.3   2024-06-21 [1] RSPM (R 4.4.0)
#>  curl          5.2.1   2024-03-01 [1] RSPM (R 4.4.0)
#>  DBI           1.2.3   2024-06-02 [1] RSPM (R 4.4.0)
#>  digest        0.6.36  2024-06-23 [1] RSPM (R 4.4.0)
#>  e1071         1.7-14  2023-12-06 [1] CRAN (R 4.4.1)
#>  evaluate      0.24.0  2024-06-10 [1] RSPM (R 4.4.0)
#>  fastmap       1.2.0   2024-05-15 [1] RSPM (R 4.4.0)
#>  fs            1.6.4   2024-04-25 [1] RSPM (R 4.4.0)
#>  glue          1.7.0   2024-01-09 [1] RSPM (R 4.4.0)
#>  htmltools     0.5.8.1 2024-04-04 [1] RSPM (R 4.4.0)
#>  KernSmooth    2.23-24 2024-05-17 [2] CRAN (R 4.4.1)
#>  knitr         1.48    2024-07-07 [1] RSPM (R 4.4.0)
#>  lattice       0.22-6  2024-03-20 [2] CRAN (R 4.4.1)
#>  lifecycle     1.0.4   2023-11-07 [1] RSPM (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.4.0)
#>  proxy         0.4-27  2022-06-09 [1] CRAN (R 4.4.1)
#>  purrr         1.0.2   2023-08-10 [1] RSPM (R 4.4.0)
#>  R.cache       0.16.0  2022-07-21 [1] RSPM (R 4.4.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] RSPM (R 4.4.0)
#>  R.oo          1.26.0  2024-01-24 [1] RSPM (R 4.4.0)
#>  R.utils       2.12.3  2023-11-18 [1] RSPM (R 4.4.0)
#>  Rcpp          1.0.13  2024-07-17 [1] RSPM (R 4.4.0)
#>  remotes       2.5.0   2024-03-17 [1] RSPM (R 4.4.0)
#>  reprex        2.1.0   2024-01-11 [1] RSPM (R 4.4.0)
#>  rlang         1.1.4   2024-06-04 [1] RSPM (R 4.4.0)
#>  rmarkdown     2.27    2024-05-17 [1] RSPM (R 4.4.0)
#>  rstudioapi    0.16.0  2024-03-24 [1] RSPM (R 4.4.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.4.0)
#>  sf          * 1.0-17  2024-08-12 [1] Github (r-spatial/sf@6e13be9)
#>  sp            2.1-4   2024-04-30 [1] CRAN (R 4.4.1)
#>  spData      * 2.3.1   2024-05-31 [1] RSPM (R 4.4.0)
#>  spDataLarge   2.1.1   2024-08-12 [1] local
#>  stars       * 0.6-7   2024-08-12 [1] Github (r-spatial/stars@b1be457)
#>  styler        1.10.3  2024-04-07 [1] RSPM (R 4.4.0)
#>  units         0.8-5   2023-11-28 [1] CRAN (R 4.4.1)
#>  vctrs         0.6.5   2023-12-01 [1] RSPM (R 4.4.0)
#>  withr         3.0.1   2024-07-31 [1] RSPM (R 4.4.0)
#>  xfun          0.46    2024-07-18 [1] RSPM (R 4.4.0)
#>  yaml          2.3.10  2024-07-26 [1] RSPM (R 4.4.0)
#> 
#>  [1] /usr/local/lib/R/site-library
#>  [2] /usr/local/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


kadyb commented Aug 12, 2024

@goergen95 #564 and #663 are related.

goergen95 commented Aug 14, 2024

This is really awesome! I am currently trying to replicate some vector data cube analysis done by the folks at Earthmover in a recent blog post using xarray. I still face some challenges interpreting ERA5 longitude values (between 0 and 360°), and st_extract() does not seem to work as efficiently as one might hope, compared to the time it takes to generate the plot (which is amazingly fast).

remotes::install_github("r-spatial/stars", ref = "b1be457")
#> Skipping install of 'stars' from a github remote, the SHA1 (b1be4573) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(sf)
#> Linking to GEOS 3.12.2, GDAL 3.9.1, PROJ 9.4.1; sf_use_s2() is TRUE
library(stars)
#> Loading required package: abind
library(viridis)
#> Loading required package: viridisLite

# read zarr
url <- "https://storage.googleapis.com/weatherbench2/datasets/era5/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2/"
dsn <- glue::glue('ZARR:"/vsicurl/{url}"')
era5_ds <- read_mdim(dsn, proxy = TRUE, variable = c("2m_temperature", "10m_u_component_of_wind"))
# see https://confluence.ecmwf.int/display/CKB/ERA5%3A+What+is+the+spatial+reference
era5_ds <- st_set_dimensions(era5_ds, "longitude", values = st_get_dimension_values(era5_ds, "longitude") - 180)

# subset
start <- as.POSIXct("2017-01-01 00:00:00 UTC")
end <- as.POSIXct("2018-02-01 00:00:00 UTC")
time <- st_get_dimension_values(era5_ds, "time")
(era5_ds_sub <- era5_ds[,,,which(time >= start & time < end)])
#> stars_proxy object with 2 attributes in 2 file(s):
#> $`2m_temperature`
#> [1] "[...]/"
#> 
#> $`10m_u_component_of_wind`
#> [1] "[...]/"
#> 
#> dimension(s):
#>            from    to         offset   delta  refsys x/y
#> longitude     1  1440           -180    0.25      NA [x]
#> latitude      1   721          90.12   -0.25      NA [y]
#> time      84741 86324 1959-01-01 UTC 6 hours POSIXct

# plot
era5_ds_sub["2m_temperature",,,1583, drop = TRUE] |> # does not drop time
  plot(axes = TRUE, key.pos = 4, col = viridis_pal()) # how to put time as title?
#> Warning in st_is_longlat(x): bounding box has potentially an invalid value
#> range for longlat data
#> Warning in st_is_longlat(x): bounding box has potentially an invalid value
#> range for longlat data

# cities
parquet <- "https://huggingface.co/api/datasets/jamescalam/world-cities-geo/parquet/default/train/0.parquet"
cities <- read_sf(parquet)
cities <- st_as_sf(cities,coords = c("longitude", "latitude"), crs = "EPSG:4326")
(cities_eur <- subset(cities, continent == "Europe"))
#> Simple feature collection with 2844 features and 7 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -25.66667 ymin: 28 xmax: 39.74 ymax: 67.85572
#> Geodetic CRS:  WGS 84
#> # A tibble: 2,844 × 8
#>    city     country region continent     x     y     z            geometry
#>    <chr>    <chr>   <chr>  <chr>     <dbl> <dbl> <dbl>         <POINT [°]>
#>  1 Tirana   Albania South… Europe    4501. 1622. 4207.  (19.81889 41.3275)
#>  2 Durres   Albania South… Europe    4512. 1593. 4207. (19.44139 41.32306)
#>  3 Elbasan  Albania South… Europe    4508. 1648. 4189.  (20.08222 41.1125)
#>  4 Vlore    Albania South… Europe    4569. 1617. 4135. (19.48972 40.46667)
#>  5 Shkoder  Albania South… Europe    4458. 1580. 4269. (19.51258 42.06828)
#>  6 Fier-Ci… Albania South… Europe    4550. 1617. 4156. (19.56667 40.71667)
#>  7 Korce    Albania South… Europe    4521. 1716. 4148. (20.78083 40.61861)
#>  8 Berat    Albania South… Europe    4540. 1648. 4155. (19.95222 40.70583)
#>  9 Lushnje  Albania South… Europe    4531. 1623. 4175.   (19.705 40.94194)
#> 10 Kavaje   Albania South… Europe    4518. 1605. 4195. (19.55694 41.18556)
#> # ℹ 2,834 more rows

# extraction does not seem to work efficiently yet
era5_ds_sub <- st_set_crs(era5_ds_sub, st_crs(cities_eur))
(era5_ds_sub_eur <- st_crop(era5_ds_sub, cities_eur))
#> stars_proxy object with 2 attributes in 2 file(s):
#> $`2m_temperature`
#> [1] "[...]/"
#> 
#> $`10m_u_component_of_wind`
#> [1] "[...]/"
#> 
#> dimension(s):
#>            from    to         offset   delta  refsys x/y
#> longitude   618   879           -180    0.25  WGS 84 [x]
#> latitude     90   249          90.12   -0.25  WGS 84 [y]
#> time      84741 86324 1959-01-01 UTC 6 hours POSIXct    
#> call_list:
#> [[1]]
#> st_crop(x = x, y = y, crop = crop, epsilon = epsilon)
#> attr(,".Environment")
#> <environment: 0x5d1d5ce5c118>
#> 
#> This object has pending lazy operations: dimensions as printed may not reflect this.

# era5_cities_eur <- st_extract(era5_ds_sub_eur[,,,1], cities_eur[1,])

Created on 2024-08-14 with reprex v2.1.0
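As a side note, the `from 84741` in the printed time dimension above can be verified by hand, since the ERA5 axis starts at 1959-01-01 UTC with a 6-hour step:

```r
# Check that 2017-01-01 00:00 UTC lands at 1-based index 84741 on a
# 6-hourly axis starting at 1959-01-01 00:00 UTC (matching the 'from'
# value printed for the time dimension above).
origin <- as.POSIXct("1959-01-01 00:00:00", tz = "UTC")
start  <- as.POSIXct("2017-01-01 00:00:00", tz = "UTC")
idx <- as.numeric(difftime(start, origin, units = "hours")) / 6 + 1
idx
# [1] 84741
```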


edzer commented Aug 26, 2024

I'm seeing

Error: Cannot open "https://huggingface.co/api/datasets/jamescalam/world-cities-geo/parquet/default/train/0.parquet"; The file doesn't seem to exist.
In addition: Warning message:
In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Error 4: Failed to read TopoJSON data

any ideas why?

@goergen95

Maybe GDAL is not built with the Parquet driver? I saw something similar when libarrow was not installed on my system, but I cannot imagine why you get an error message about TopoJSON.

Error: Cannot open "/vsicurl/https://huggingface.co/api/datasets/jamescalam/world-cities-geo/parquet/default/train/0.parquet"; The file doesn't seem to exist.
In addition: Warning messages:
1: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Error 1: libarrow_dataset.so.1700: cannot open shared object file: No such file or directory


edzer commented Aug 26, 2024

Can you change this example so that I can run it without needing the Parquet driver?

@goergen95

Sure! I was working closely from the Earthmover blog post mentioned above, which uses that Parquet file; in the code below I manually specify some European cities instead, and now I see:

Error in `[[<-.data.frame`(`*tmp*`, sf_column, value = list(c(19.82, 41.33 : 
  replacement has 3 rows, data has 0
remotes::install_github("r-spatial/stars", ref = "b1be457")
library(sf)
library(stars)
library(viridis)

# read zarr
url <- "https://storage.googleapis.com/weatherbench2/datasets/era5/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2/"
dsn <- glue::glue('ZARR:"/vsicurl/{url}"')
era5_ds <- read_mdim(dsn, proxy = TRUE, variable = c("2m_temperature", "10m_u_component_of_wind"))
era5_ds <- st_set_dimensions(era5_ds, "longitude", values = st_get_dimension_values(era5_ds, "longitude") - 180)

# subset
start <- as.POSIXct("2017-01-01 00:00:00 UTC")
end <- as.POSIXct("2018-02-01 00:00:00 UTC")
time <- st_get_dimension_values(era5_ds, "time")
(era5_ds_sub <- era5_ds[,,,which(time >= start & time < end)])

# plot
era5_ds_sub["2m_temperature",,,1583, drop = TRUE] |> # does not drop time
  plot(axes = TRUE, key.pos = 4, col = viridis_pal()) # how to put time as title?

# manually specify some european cities
cities_eur <- data.frame(
  name = c("Tirana", "Berlin", "Madrid"),
  longitude = c(19.82, 13.40, -3.70),
  latitude = c(41.33, 52.52, 40.42)
)

cities_eur <- st_as_sf(cities_eur, coords = c("longitude", "latitude"), crs = "EPSG:4326")

# extraction does not seem to work efficiently yet
era5_ds_sub <- st_set_crs(era5_ds_sub, st_crs(cities_eur))
(era5_ds_sub_eur <- st_crop(era5_ds_sub, cities_eur))
era5_cities_eur <- st_extract(era5_ds_sub_eur[,,,1], cities_eur)

edzer added a commit that referenced this issue Aug 27, 2024

edzer commented Aug 27, 2024

Thanks! I got this to work:

library(sf)
# Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(stars)
# Loading required package: abind
library(viridis)
# Loading required package: viridisLite

# read zarr
url <- "https://storage.googleapis.com/weatherbench2/datasets/era5/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2/"
dsn <- glue::glue('ZARR:"/vsicurl/{url}"')
era5_ds <- read_mdim(dsn, proxy = TRUE, variable = c("2m_temperature", "10m_u_component_of_wind"))
# st_set_dimensions(era5_ds, "longitude", values = st_get_dimension_values(era5_ds, "longitude") - 180)
# This does nothing, and now causes an error

# subset
start <- as.POSIXct("2017-01-01 00:00:00 UTC")
end <- as.POSIXct("2018-02-01 00:00:00 UTC")
time <- st_get_dimension_values(era5_ds, "time")
(era5_ds_sub <- era5_ds[,,,which(time >= start & time < end)])
# stars_proxy object with 2 attributes in 2 file(s):
# $`2m_temperature`
# [1] "[...]/"

# $`10m_u_component_of_wind`
# [1] "[...]/"

# dimension(s):
#            from    to         offset   delta  refsys x/y
# longitude     1  1440         -0.125    0.25      NA [x]
# latitude      1   721          90.12   -0.25      NA [y]
# time      84741 86324 1959-01-01 UTC 6 hours POSIXct    

era5_ds_sub["2m_temperature",,,1583, drop = TRUE] -> x
tm = st_get_dimension_values(x, "time")

# plot
x |> plot(axes = TRUE, key.pos = 4, col = viridis_pal(), main = tm) # puts time as title
# downsample set to 1
# Warning messages:
# 1: In seq.default(len = length(x.dim)) :
#   partial argument match of 'len' to 'length.out'
# 2: In st_is_longlat(x) :
#   bounding box has potentially an invalid value range for longlat data
# 3: In st_is_longlat(x) :
#   bounding box has potentially an invalid value range for longlat data

# manually specify some european cities
#cities_eur <- data.frame(
#  name = c("Tirana", "Berlin", "Madrid"),
#  longitude = c(19.82, 13.40, -3.70),
#  latitude = c(41.33, 52.52, 40.42)
#)
cities_eur <- data.frame(
  name = c("Tirana", "Berlin"),
  longitude = c(19.82, 13.400),
  latitude = c(41.33, 52.52)
)

cities_eur <- st_as_sf(cities_eur, coords = c("longitude", "latitude"), crs = "EPSG:4326")

# extraction does not seem to work efficiently yet
era5_ds_sub <- st_set_crs(era5_ds_sub[,,,1], st_crs(cities_eur))
(era5_ds_sub_eur <- st_crop(era5_ds_sub, cities_eur))
# stars_proxy object with 2 attributes in 2 file(s):
# $`2m_temperature`
# [1] "[...]/"

# $`10m_u_component_of_wind`
# [1] "[...]/"

# dimension(s):
#           from  to         offset   delta  refsys x/y
# longitude   55  80         -0.125    0.25  WGS 84 [x]
# latitude   151 196          90.12   -0.25  WGS 84 [y]
# time         1   1 1959-01-01 UTC 6 hours POSIXct    
# call_list:
# [[1]]
# st_crop(x = x, y = y, crop = crop, epsilon = epsilon)
# attr(,".Environment")
# <environment: 0x637c744bedb0>

# This object has pending lazy operations: dimensions as printed may not reflect this.
(era5_cities_eur <- st_extract(era5_ds_sub_eur, cities_eur))
# Simple feature collection with 2 features and 2 fields
# Geometry type: POINT
# Dimension:     XY
# Bounding box:  xmin: 13.4 ymin: 41.33 xmax: 19.82 ymax: 52.52
# Geodetic CRS:  WGS 84
#   2m_temperature 10m_u_component_of_wind            geometry
# 1   275.4243 [K]         -1.266422 [m/s] POINT (19.82 41.33)
# 2   279.8133 [K]          5.374224 [m/s]  POINT (13.4 52.52)

Note that:

  • shifting longitude now causes an error (as it is ignored later on; it is also not the solution to this problem)
  • I moved st_crop() to happen after [,,,1], this is critical for the runtime
  • I took out Madrid because a crop with a bounding box that crosses the prime meridian (currently) doesn't go well with a grid spanning 0-360 degrees... consider this WIP!
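On the longitude point: subtracting 180 relabels the axis rather than wrapping it, which is presumably why it isn't the solution. The usual conversion from a 0-360 grid to [-180, 180) is a modular wrap -- a sketch only, not what stars does internally:

```r
# Wrap longitudes from [0, 360) to [-180, 180). Note this permutes the
# order of the cells, so the data columns would need reordering as well
# before the wrapped values could serve as a dimension.
wrap_lon <- function(lon) ((lon + 180) %% 360) - 180

wrap_lon(c(0, 90, 180, 270, 359.75))
# -> 0, 90, -180, -90, -0.25
```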
