Optimize make_geocube() when converting a GeoDataFrame with multiple columns #56
Comments
That is a lot of columns.
Sounds like a definite possibility. One issue you will run into is NULL values. You would likely have to add a dummy row at the end of your dataframe, fill it with NaN, and then fill the NULL values in the raster with the index of that dummy row.
This would get very complex very quickly. You would have to keep track of the weight of each row in the dataframe for each cell in the raster.
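To make the dummy-row idea concrete, here is a minimal NumPy sketch. The names are illustrative, and it assumes the index raster marks empty cells with -1 rather than geocube's default NaN fill:

```python
import numpy as np

def map_with_nulls(index_grid, values):
    """Map one column of values onto a pre-rasterized index grid.

    `index_grid`: 2D integer array from a single rasterization pass,
    with empty cells marked as -1 (an assumption for this sketch).
    `values`: 1D NumPy array holding one dataframe column.
    """
    # Append a trailing NaN entry: the "dummy row" at the end of the dataframe.
    padded = np.append(values.astype(float), np.nan)
    # Redirect empty cells to the dummy entry so take() yields NaN there.
    safe_index = np.where(index_grid < 0, len(values), index_grid)
    return padded.take(safe_index)
```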
So this worked for my purposes. The rasterization went from ~75min to under 2min.
And then a test that the output is the same as the unoptimized make_geocube(). This could be beefed up to account for NULL values.
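Something along these lines, where `make_geocube_optimized` is only a hypothetical stand-in for the fast path, not an existing geocube function:

```python
import xarray
from geocube.api.core import make_geocube

def test_matches_unoptimized(gdf, measurements, resolution):
    """Check a fast path against the unoptimized make_geocube()."""
    expected = make_geocube(
        vector_data=gdf, measurements=measurements, resolution=resolution
    )
    # make_geocube_optimized is hypothetical, standing in for the fast path.
    result = make_geocube_optimized(
        vector_data=gdf, measurements=measurements, resolution=resolution
    )
    for band in measurements:
        # NaN-aware comparison: NULL cells should be NaN in both outputs.
        xarray.testing.assert_allclose(result[band], expected[band])
```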
Hello there, I stumbled on this library from a Stack Overflow answer you posted, and I have to say I really dig it.
I'm working on a project where I have a GDF that needs to get rasterized, and this turned out to be the perfect solution. However, the GDFs that will be coming through the pipeline will be kind of big (150K rows x 700 columns). Right now the rasterization is becoming a bottleneck: it takes a little over an hour, while the other operations finish in minutes. We can cut down the resolution of some of this data on our end, but it seems like there could be some room to optimize the function.
For example, one column with the 150K shapely Point features rasterizes in about 5 seconds using the 'nearest' interpolation method. I believe it should be possible to run that 5-second algorithm, which aligns the outgoing grid with the 'nearest' vector features, just once, and then simply apply the same pattern across the other n columns of data, so that as n scales up, the time to execute the function doesn't scale with it.
For the 'nearest' option, imagine rasterizing a numbered index associated with the geometry features, and then simply mapping the remaining columns onto the new pattern (something akin to pandas' take()). I'm not sure what's going on exactly under the hood, but I imagine something analogous could be done for the other interpolation options.
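Here is a rough sketch of the idea from the outside, using the public API. It assumes `gdf` is the GeoDataFrame in question with all-numeric value columns; the resolution is a placeholder, and a real implementation would live inside make_geocube and handle NULL cells and dtypes properly:

```python
import numpy as np
from geocube.api.core import make_geocube

# One expensive pass: rasterize only an integer row index.
gdf = gdf.reset_index(drop=True)
gdf["row_index"] = np.arange(len(gdf), dtype=float)
index_cube = make_geocube(
    vector_data=gdf[["geometry", "row_index"]],
    resolution=(-0.0001, 0.0001),  # placeholder resolution
)
index_grid = index_cube.row_index.values  # NaN where no geometry landed

# Cheap passes: every other column becomes an array lookup, like take().
valid = ~np.isnan(index_grid)
idx = index_grid[valid].astype(int)
value_columns = [c for c in gdf.columns if c not in ("geometry", "row_index")]
for col in value_columns:
    band = np.full(index_grid.shape, np.nan)
    band[valid] = gdf[col].to_numpy(dtype=float)[idx]
    index_cube[col] = (index_cube.row_index.dims, band)
```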