Support for rasterisation to dask arrays #41
Comments
This definitely sounds like something useful and worth looking into 👍. If you do proceed, I would definitely want to preserve the current behavior and have the dask part be something that can be toggled on or off, maybe with the chunks argument you mention. I am not entirely sure about the feasibility, if you are thinking about that in terms of the difficulty level; that is something that will take some research. My first thought was that it may be useful to see if you could incorporate dask-geopandas. You might be able to update …
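For what it's worth, dask-geopandas already handles partitioning a GeoDataFrame, which could be one ingredient of such an integration. A minimal sketch of that piece only (the file name is hypothetical, and this is not an existing geocube code path):

```python
import dask_geopandas
import geopandas as gpd

# hypothetical input: any polygon layer readable by geopandas
gdf = gpd.read_file("polygons.gpkg")

# split the frame into partitions; each partition could then be rasterized
# independently and the resulting chunks reassembled into one dask array
dgdf = dask_geopandas.from_geopandas(gdf, npartitions=8)
print(dgdf.npartitions)  # 8
```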
This would be super helpful! I am just this moment fighting with a huge aerial photo (~340GB) for which I need to create a rasterized ROI mask based on a geopandas data frame...
Hey folks, I've not had the time to implement and test this properly and open a PR, but I did make a slightly hacky attempt that uses the "like" protocol to construct a rasterised dask array with geocube from a reference raster, copying the geospatial metadata (my use case). It could be a useful starting point for a general implementation:

```python
import itertools
from functools import partial
from typing import Union, Optional, List, Any

import xarray as xr
import shapely
import geopandas as gpd
import numpy as np
import dask.array as da
import rioxarray as rx
import geocube
import pandas as pd
import pyproj
from geocube.api.core import make_geocube
from dask import delayed
from rasterio.windows import Window


def make_geocube_like_dask(
    df: gpd.GeoDataFrame,
    measurements: Optional[List[str]],
    like: xr.core.dataarray.DataArray,
    fill: int = 0,
    rasterize_function: callable = partial(geocube.rasterize.rasterize_image, all_touched=True),
    **kwargs
):
    """
    Implements dask support for a subset of the make_geocube API.

    Requires using the "like" protocol for array construction. Chunk spatial
    dimensions are copied from the reference "like" raster.

    Parameters
    ----------
    df :
        A geopandas dataframe with relevant polygons to be rasterized
    measurements :
        Columns of the geopandas dataframe to rasterize
    like :
        A reference (rio)xarray dataset. Georeferencing and spatial dimensions
        from this are used to construct a matching raster from the polygons.
    fill :
        Fill value for rasterized shapes
    rasterize_function :
        Function used for rasterization. See the geocube make_geocube documentation.
    kwargs :
        Passed directly to make_geocube
    """
    # take spatial dims from the "like" xarray to construct chunk window boundaries
    # (rioxarray has channels first)
    row_chunks, col_chunks = [list(np.cumsum(c)) for c in like.chunks][1:]
    row_chunks[:0], col_chunks[:0] = [0], [0]
    # construct 2-tuples of row and col slices corresponding to first/last indices in each chunk
    row_slices, col_slices = list(zip(row_chunks, row_chunks[1:])), list(zip(col_chunks, col_chunks[1:]))
    # store blocks for recombination of mask chunks into the output dask array
    delayed_blocks = [[] for _ in range(len(row_slices))]
    # go through each chunk of the dask array pointing to the reference raster
    for ix, (rs, cs) in enumerate(itertools.product(row_slices, col_slices)):
        # calculate the block index in the final dask array
        row_ix, col_ix = np.unravel_index(ix, like.data.numblocks[1:])
        # slice out the chunk of the raster array with isel_window to preserve correct geospatial metadata
        window = Window.from_slices(rs, cs)
        blk_window = like.rio.isel_window(window)
        # attempted fix for the CRS multithreaded access bug when writing the raster to disk
        # (see: https://github.com/pyproj4/pyproj/issues/589)
        blk_window['spatial_ref'] = blk_window.spatial_ref.copy(deep=True)
        # create a delayed xarray with make_geocube for this chunk
        delayed_task = delayed(make_geocube)(
            df,
            measurements=measurements,
            like=blk_window,
            fill=fill,
            rasterize_function=rasterize_function,
        )
        # access the (delayed) numpy array underlying it
        delayed_block = delayed_task.to_array().data
        # convert to a dask array chunk of the correct shape
        chunk_shape = (
            len(measurements) if measurements is not None else 1,
            *like.data.blocks[0, row_ix, col_ix].shape[1:],
        )
        dask_block = da.from_delayed(
            delayed_block,
            shape=chunk_shape,
            dtype=np.uint8,
        )
        # insert into the block list at the correct position
        delayed_blocks[row_ix].append(dask_block)
    # assemble blocks into the output array
    mask_darr = da.block(delayed_blocks)
    # modify spatial metadata to reflect the common spatial reference and the different band meanings
    coords = {
        'band': list(measurements) if measurements is not None else ['mask'],
        'y': like.coords['y'],
        'x': like.coords['x'],
    }
    xarr = xr.DataArray(
        data=mask_darr,
        coords=coords,
        dims=['band', 'y', 'x'],
    )
    xarr['spatial_ref'] = like.spatial_ref
    xarr.attrs = like.attrs
    return xarr
```

This produces a dask-backed xarray of the correct shape and gives sensible results when plotted for some small test rasters and geodataframes. However, I am running into an issue with this implementation which I think is related to the concurrent access to the CRS object described in pyproj's UnicodeDecodeError when run multithreaded #589. It only seems to occur for larger rasters where dask is crunching multiple chunks at a time; when trying to compute pieces of the array, I get a traceback pointing at this error.
Any suggestions would be welcome!
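For reference, a rough sketch of how the helper above might be invoked (the raster path, chunk sizes, and the "landcover" column are hypothetical; it assumes the reference raster is opened lazily with rioxarray):

```python
import geopandas as gpd
import rioxarray

# hypothetical inputs: a large reference raster and a polygon layer with a "landcover" column
ref = rioxarray.open_rasterio("aerial_mosaic.tif", chunks=(1, 2048, 2048))
gdf = gpd.read_file("ground_truth.gpkg")

# lazily rasterize onto the same grid and chunking as the reference raster
mask = make_geocube_like_dask(gdf, measurements=["landcover"], like=ref)

# nothing is rasterized until part of the array is computed
subset = mask.isel(y=slice(0, 2048), x=slice(0, 2048)).compute()
```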
This is great, thanks for sharing 👍. I am fairly confident that you are running into UnicodeDecodeError when run multithreaded #589. The only way around that is to be able to construct the CRS within each thread.
@LiamRMoore, a fix for this is in the master branch of pyproj.
I reimplemented this using map_blocks:

```python
from functools import partial
from typing import Optional, List

import xarray as xr
import geopandas as gpd
import numpy as np
import rioxarray as rx
import geocube
from geocube.api.core import make_geocube
from dask.distributed import Client, Lock


def make_geocube_like_dask2(
    df: gpd.GeoDataFrame,
    measurements: Optional[List[str]],
    like: xr.core.dataarray.DataArray,
    fill: int = 0,
    rasterize_function: callable = partial(geocube.rasterize.rasterize_image, all_touched=True),
    **kwargs
):
    def rasterize_block(block):
        return (
            make_geocube(
                df,
                measurements=measurements,
                like=block,
                fill=fill,
                rasterize_function=rasterize_function,
            )
            .to_array(measurements[0])
            .assign_coords(block.coords)
        )

    like = like.rename(dict(zip(['band'], measurements)))
    return like.map_blocks(
        rasterize_block,
        template=like
    )
```
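A rough usage sketch for this map_blocks version (paths, chunk sizes, and the "landcover" column are hypothetical); the dask.distributed Client together with the lock= argument of rio.to_raster is one way to write the result out chunk by chunk without concurrent access to the output file:

```python
import geopandas as gpd
import rioxarray
from dask.distributed import Client, Lock

client = Client()  # local dask cluster; hypothetical setup
ref = rioxarray.open_rasterio("aerial_mosaic.tif", chunks=(1, 2048, 2048))
gdf = gpd.read_file("ground_truth.gpkg")

# lazy rasterization with the same grid/chunks as the reference raster
mask = make_geocube_like_dask2(gdf, measurements=["landcover"], like=ref)

# write block by block, serialising writes to the GeoTIFF across workers
(
    mask
    .rename({"landcover": "band"})  # restore a conventional band dimension before writing
    .rio.to_raster("landcover_mask.tif", tiled=True, lock=Lock("rio", client=client))
)
```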
Hi folks,
I've had a quick look through the source code and haven't been able to find this functionality, so apologies if it exists and I've missed something.
What do you think about the feasibility/attractiveness of being able to run make_geocube as a delayed operation, so that the returned xarray wraps a dask array rather than an in-memory numpy ndarray (by, say, passing a chunks argument somewhere, as in rioxarray.open_rasterio)?
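To make that concrete, a hedged sketch of what the proposed call could look like (the chunks keyword is the hypothetical part and is not in the current make_geocube signature; vector_data, measurements, and resolution are existing arguments):

```python
import geopandas as gpd
from geocube.api.core import make_geocube

gdf = gpd.read_file("ground_truth.gpkg")  # hypothetical vector file

# proposed usage: chunks analogous to rioxarray.open_rasterio(..., chunks=...)
cube = make_geocube(
    vector_data=gdf,
    measurements=["landcover"],
    resolution=(-0.5, 0.5),
    chunks=2048,  # hypothetical argument: return a lazily rasterized, dask-backed Dataset
)
```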
An example use case is in a heavy machine learning workload, where a neural network would be trained on an O(10-100) GB dataset of high-resolution aerial photography with rasterised vector layers representing ground truth data.
I'm happy to take a look at this, but I don't have the familiarity with the codebase to know where a good seam for it would be, or whether it's possible to do without breaking things downstream, so it would be nice to hear your thoughts.
Cheers!
L