
Breaking: Faster/better aggregate #763

Open · wants to merge 10 commits into `main`
Conversation

@rafaqz (Owner) commented Sep 25, 2024

aggregate was written so long ago, was never optimized, and a bunch of things never made sense.

This PR adds optional threading, general performance improvements, and specific fast paths for common methods like sum and mean. For a RasterStack this can be an order of magnitude or more improvement overall.

At the same time I fixed the dumb disaggregate arguments.

@tiemvanderdeure this was inspired by your comment that using aggregate actually proved to be slower than resample. So if you want to review...

checkbounds(src, u...)
# If a disk array, cache the src so we don't read too many times
src_parent = isdisk(src) ? DiskArrays.cache(parent(src)) : parent(src)
@inbounds broadcast!(dst, CartesianIndices(dst)) do I

Contributor:

would it be worthwhile to optionally run this threaded?

rafaqz (Owner):

Yeah, it could be a threaded for loop, I guess.

rafaqz (Owner):

Oh, the reason not to is that currently it works on GPU and DiskArrays as-is; threading will break that.

Contributor:

Well, if threaded is false by default then you could either broadcast or do a threaded for loop. But after seeing just how fast this is, I think that would only be relevant in some niche use cases.
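
The trade-off being discussed could be sketched like this (illustrative only: the helper name, signature, and keyword are hypothetical, not the PR's actual code):

```julia
# Sketch: keep the GPU/DiskArrays-friendly broadcast! as the default path,
# and opt into a plain threaded loop only for in-memory arrays.
function _aggregate_kernel!(f, dst; threaded::Bool=false)
    if threaded
        # Only appropriate for ordinary in-memory arrays
        Threads.@threads for I in CartesianIndices(dst)
            @inbounds dst[I] = f(I)
        end
    else
        # broadcast! also works for GPU arrays and cached DiskArrays
        broadcast!(f, dst, CartesianIndices(dst))
    end
    return dst
end
```

With `threaded=false` as the default, existing GPU and disk-backed behaviour would be unchanged.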

$FILENAME_KEYWORD
$SUFFIX_KEYWORD
$PROGRESS_KEYWORD
$THREADED_KEYWORD

Note: currently it is faster to aggregate over memory-backed arrays.

Contributor:

should this say disaggregate?

rafaqz (Owner):

It can be deleted; the caching should make it fast now.

rafaqz (Owner):

Actually no, cache is broken on our DiskArrays version 😭

Comment on lines 8 to 12
const SKIPMISSING_KEYWORD = """
- `skipmissing`: if `true`, any `missingval` will be skipped during aggregation, so that
only areas of all missing values will be aggregated to `missingval`. If `false`, any
aggregated area containing a `missingval` will be assigned `missingval`.
"""

Contributor:

add "false by default"

disaggregate!((locus,), dst, src, scale)
end
function disaggregate!(loci::Tuple{Locus,Vararg}, dst::AbstractRaster, src, scale)
function disaggregate!(dst::AbstractRaster, src, scale)
intscale = _scale2int(DisAg(), dims(src), scale)
broadcast!(dst, CartesianIndices(dst)) do I

Contributor:

It must be faster to loop through the src instead, like so, since we know it is smaller.

By looping through dst we have lots of unnecessary calls to upsample and lookups in src.

Suggested change
broadcast!(dst, CartesianIndices(dst)) do I
for I in CartesianIndices(src)
    upper = upsample.(Tuple(I), intscale)
    lower = upper .+ intscale .- 1
    I_dst = map(:, upper, lower)
    val = src[I]
    val = val === missingval(src) ? missingval(dst) : val
    view(dst, I_dst...) .= val
end
return dst
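
For concreteness, here is how the index math in the suggestion maps one src cell to a block of dst, assuming `upsample(i, scale) = (i - 1) * scale + 1` (an assumed definition, for illustration):

```julia
upsample(i, scale) = (i - 1) * scale + 1  # assumed definition, for illustration

intscale = (2, 2)
I = (2, 3)                       # a single index into src
upper = upsample.(I, intscale)   # (3, 5): top-left corner of the dst block
lower = upper .+ intscale .- 1   # (4, 6): bottom-right corner
I_dst = map(:, upper, lower)     # (3:4, 5:6): the block of dst filled from src[2, 3]
```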

Contributor:

Can't comment on unchanged lines, but if we implement this then downsample isn't used anywhere anymore.

rafaqz (Owner):

Right, yes, it will be faster. I think the hidden reason again is that it won't work on a GPU or a disk array, and I was trying to be generic.

rafaqz (Owner):

We could use Flatten.jl to see if there is an Array inside whatever wrappers, and if there is, do the loop; if not, the broadcast.

rafaqz (Owner):

We need DiskArrays to work with KernelAbstractions; then we could just do all of this with KernelAbstractions, and use Stencils.jl too to make it even faster.

Contributor:

What about this wouldn't work on GPU? We're just viewing into an array and filling it with some value, right?

rafaqz (Owner) commented Sep 26, 2024:

It's all scalar indexing. We need to launch a single kernel to do the whole lot in one go.

Well, it's lots of little views at least. But looking at it I don't think the original will work on GPU either - but it can work on DiskArrays, and it's fast now since we have cache.

rafaqz (Owner) commented Sep 26, 2024:

Broadcasts handle kernel launches for us. If you want to write manual code like that with blocks we need KernelAbstractions.jl.

(and actually for this to work on GPU we need KernelAbstractions. But it works on disk...)

Contributor:

Another idea is to do some clever reshapes and permutes to be able to broadcast without having to do indexing math ourselves. I tried this implementation and it works, but it's actually a little slower than the original. I guess the reshaping makes indexing that much slower.

intscale = _scale2int(DisAg(), dims(src), scale)
n = length(intscale)
reshape_dims = ntuple(i -> mod(i, 2) == 0 ? size(src)[(i ÷ 2)] : intscale[(i ÷ 2) + 1], Val(n * 2))
permute_order = (range(2; step = 2, length = n)..., range(1; step = 2, length = n)...)
dst_reshaped = Base.PermutedDimsArray(Base.reshape(dst, reshape_dims), permute_order)
broadcast!(dst_reshaped, parent(src)) do val
    val === missingval(src) ? missingval(dst) : val
end
return dst
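
A toy version of the reshape + permute idea, for a 2×3 `src` disaggregated by a scale of (2, 2) (illustrative only; it skips the missingval handling above):

```julia
src = [1 2 3; 4 5 6]            # size (2, 3)
dst = zeros(Int, 4, 6)          # each src cell expands to a 2×2 block
# Reshape dst to (scale1, n1, scale2, n2), then permute so the src axes lead;
# broadcasting src over the trailing scale axes fills each 2×2 block.
dst_view = PermutedDimsArray(reshape(dst, 2, 2, 2, 3), (2, 4, 1, 3))
dst_view .= src                 # now dst[1:2, 1:2] all equal src[1, 1], etc.
```

As noted, the extra reshape/permute indirection can make indexing slower than the straightforward broadcast.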

@tiemvanderdeure (Contributor):

Amazing! Did a few quick tests/profviews and it seems blazing fast - I can't come up with a way to improve it. I just had a quick thing for disaggregate.

A few other suggestions (short or long term) - now that we're making considerable changes to this function anyway:

  • would it make sense to implement aggregate for RasterSeries?
  • would it make sense to only aggregate along continuous dimensions by default, so that a raster like `Raster(rand(X(1:10), Band(["a", "b"])))` won't get aggregated along the Band dimension unless requested explicitly?
  • Would we want an interpolate function that works like disaggregate but does more than just copy values over? I guess this was the idea with the function argument in disaggregate in the first place?

@rafaqz (Owner) commented Sep 26, 2024:

would it make sense to implement aggregate for RasterSeries?

It's there! I also threaded it.

would it make sense to only aggregate along continuous dimensions by default, so that a raster like `Raster(rand(X(1:10), Band(["a", "b"])))` won't get aggregated along the Band dimension unless requested explicitly?

You mean AbstractSampled and NoLookup lookups are aggregated but Categorical are not? Yes, I thought about that too, looking at the R code doing that in the benchmarks. We basically never want to aggregate adjacent categories. You would use groupby for something like that.

Would we want an interpolate function that works like disaggregate but does more than just copy values over. I guess this was the idea with the function argument in disaggregate in the first place?

Yeah, it was going to be but never happened lol. Probably it should be another geostatistics gap-filling method instead of living here.

@tiemvanderdeure (Contributor):

You mean AbstractSampled and NoLookup lookups are aggregated but Categorical are not?

Yes I think that could be the default if scale is an integer. And to override it users can always provide a tuple and specify.

It could also depend on the order - so we don't aggregate along Unordered dimensions.

@rafaqz (Owner) commented Sep 27, 2024:

Ok, I've implemented this so you have to explicitly aggregate categorical dimensions. I'm not sure what to do with the categories - maybe we need to join them as strings somehow? Or return NoLookup?

With Unordered lookups we really can't aggregate them in a sensible way - I guess if people explicitly force it we can just return NoLookup too, to show that the lookups are effectively destroyed by the process?

Edit: I think we need an @info about which dimensions got aggregated so it's not mysterious, and let verbose=false turn it off if it annoys anyone.

@rafaqz rafaqz changed the title Faster aggregate Breaking: Faster aggregate Sep 27, 2024
@rafaqz rafaqz changed the title Breaking: Faster aggregate Breaking: Faster/better aggregate Sep 27, 2024
@rafaqz (Owner) commented Oct 9, 2024:

@tiemvanderdeure want to do one last review of this? Would be good to get it in with the other breaking changes.

@tiemvanderdeure (Contributor):

I think there's a commit missing here. The behaviour with unordered or missing lookups is not implemented or tested.

@tiemvanderdeure (Contributor):

But you said this might be on your old laptop, right?

@rafaqz (Owner) commented Oct 9, 2024:

Oh no, yes, it probably is. I'll have to put the NVMe in some other computer.

@tiemvanderdeure (Contributor):

Bump! Did you get your data back from the broken laptop?

@rafaqz (Owner) commented Dec 10, 2024:

Ugh, not yet, but I have it on an external drive at least. Will do it soon.
