Breaking: Faster/better aggregate
#763
base: main
Conversation
checkbounds(src, u...)
# If a disk array, cache the src so we don't read too many times
src_parent = isdisk(src) ? DiskArrays.cache(parent(src)) : parent(src)
@inbounds broadcast!(dst, CartesianIndices(dst)) do I
Would it be worthwhile to optionally run this threaded?
Yeah, it could be a threaded for loop, I guess.
Oh, the reason not to is that it currently works on GPU and DiskArrays as is; threading will break that.
Well, if threaded is false by default then you could either broadcast or do a threaded for loop. But after seeing just how fast this is, I think that would only be relevant in some niche use cases.
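For the record, a minimal sketch of what that dispatch could look like (hypothetical helper name; the broadcast path is what keeps GPU and DiskArrays support):

function _fill_dst!(f, dst; threaded=false)
    if threaded
        # plain threaded loop: only safe for in-memory arrays
        Threads.@threads for I in CartesianIndices(dst)
            @inbounds dst[I] = f(I)
        end
    else
        # generic path: broadcast handles GPU kernel launches and disk chunks
        broadcast!(f, dst, CartesianIndices(dst))
    end
    return dst
end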
src/methods/aggregate.jl
Outdated
$FILENAME_KEYWORD
$SUFFIX_KEYWORD
$PROGRESS_KEYWORD
$THREADED_KEYWORD

Note: currently it is faster to aggregate over memory-backed arrays.
Should this say disaggregate?
It can be deleted; the caching should make it fast now.
Actually no, cache is broken on our DiskArrays version 😭
src/methods/aggregate.jl
Outdated
const SKIPMISSING_KEYWORD = """
- `skipmissing`: if `true`, any `missingval` will be skipped during aggregation, so that
    only areas of all missing values will be aggregated to `missingval`. If `false`, any
    aggregated area containing a `missingval` will be assigned `missingval`.
"""
add "false by default"
src/methods/aggregate.jl
Outdated
    disaggregate!((locus,), dst, src, scale)
end
function disaggregate!(loci::Tuple{Locus,Vararg}, dst::AbstractRaster, src, scale)
function disaggregate!(dst::AbstractRaster, src, scale)
    intscale = _scale2int(DisAg(), dims(src), scale)
    broadcast!(dst, CartesianIndices(dst)) do I
It must be faster to loop through the src instead, like so, since we know it is smaller. By looping through dst we have lots of unnecessary calls to upsample and lookups in src - with an intscale of (10, 10), for example, every src cell is read 100 times.
Suggested change:
-broadcast!(dst, CartesianIndices(dst)) do I
+for I in CartesianIndices(src)
+    upper = upsample.(Tuple(I), intscale)
+    lower = upper .+ intscale .- 1
+    I_dst = map(:, upper, lower)
+    val = src[I]
+    val = val === missingval(src) ? missingval(dst) : val
+    view(dst, I_dst...) .= val
+end
+return dst
Can't comment on unchanged lines, but if we implement this then downsample isn't used anywhere anymore.
Right, yes, it will be faster. I think the hidden reason again is that it won't work on a GPU or a disk array, and I was trying to be generic.
We could use Flatten.jl to see if there is an array inside whatever wrappers; if there is, do the loop, if not, the broadcast.
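A rough sketch of that dispatch idea, using recursive parent unwrapping as a simple stand-in for the Flatten.jl query (all helper names hypothetical):

# walk `parent` until it stops changing; plain Arrays return themselves
_innermost(A) = parent(A) === A ? A : _innermost(parent(A))

function disaggregate_dispatch!(dst, src, intscale)
    if _innermost(src) isa Array && _innermost(dst) isa Array
        _disaggregate_loop!(dst, src, intscale)      # scalar indexing is cheap in memory
    else
        _disaggregate_broadcast!(dst, src, intscale) # GPU/disk-safe generic path
    end
end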
We need DiskArraysKernelAbstractions; then we could just do all of this with KernelAbstractions, and use Stencils.jl too to make it even faster.
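For reference, a minimal KernelAbstractions.jl sketch of the disaggregate fill as a single kernel launch (assuming an integer intscale tuple and ignoring missingval handling; not what this PR does):

using KernelAbstractions

# map each dst index back to its src cell; cld is the inverse of upsample
@kernel function disagg_kernel!(dst, @Const(src), intscale)
    I = @index(Global, Cartesian)
    J = CartesianIndex(cld.(Tuple(I), intscale))
    @inbounds dst[I] = src[J]
end

backend = get_backend(dst)
disagg_kernel!(backend)(dst, src, intscale; ndrange=size(dst))
KernelAbstractions.synchronize(backend)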
What about this wouldn't work on GPU? We're just viewing into an array and filling it with some value, right?
It's all scalar indexing. We need to launch a single kernel to do the whole lot in one go.
Well, it's lots of little views at least. But looking at it I don't think the original will work on GPU either - but it can work on DiskArrays, and it's fast now since we have cache.
Broadcasts handle kernel launches for us. If you want to write manual code like that with blocks we need KernelAbstractions.jl.
(And actually for this to work on GPU we need KernelAbstractions. But it works on disk...)
Another idea is to do some clever reshapes and permutes to be able to broadcast without having to do indexing math ourselves. I tried this implementation and it works, but it's actually a little slower than the original. I guess the reshaping makes indexing that much slower.
intscale = _scale2int(DisAg(), dims(src), scale)
n = length(intscale)
reshape_dims = ntuple(i -> mod(i, 2) == 0 ? size(src)[(i ÷ 2)] : intscale[(i ÷ 2) + 1], Val(n*2))
permute_order = (range(2; step = 2, length = n)..., range(1, step = 2, length = n)...)
dst_reshaped = Base.PermutedDimsArray(Base.reshape(dst, reshape_dims), permute_order)
broadcast!(dst_reshaped, parent(src)) do val
val === missingval(src) ? missingval(dst) : val
end
return dst
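To make the index gymnastics concrete: for a src of size (2, 3) and intscale (10, 10), dst has size (20, 30); the reshape splits it to (10, 2, 10, 3), and the permutation (2, 4, 1, 3) turns that into a (2, 3, 10, 10) view, so the (2, 3) src broadcasts over the trailing block dimensions.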
Amazing! Did a few quick tests/profviews and it seems blazing fast - I can't come up with a way to improve there. I just had a quick thing for disaggregate. A few other suggestions (short or long term) - now that we're making considerable changes to this function anyway:
It's there! I also threaded it.
You mean
Yeah it was going to be, but never happened lol. Probably it should be another geostatistics gap-filling method instead of here.
Yes I think that could be the default if It could also depend on the order - so we don't aggregate along
Ok I've implemented this so you have to explicitly aggregate categorical dimensions. I'm not sure what to do with the categories? Maybe we need to join as strings somehow? Or return With Edit: I think we need an aggregate
@tiemvanderdeure want to do one last review of this? Would be good to get it in with the other breaking changes.
I think there's a commit missing here. The behaviour with unordered or missing lookups is not implemented or tested.
But you said this might be on your old laptop, right?
Oh no, yes it probably is. I'll have to put the NVMe in some other computer.
Bump! Did you get your data back from the broken laptop?
Ugh, not yet, but I have it on an external drive at least. Will do it soon.
aggregate was written so long ago, was never optimized, and a bunch of things never made sense.

This PR adds optional threading, general performance improvements, and specific fast paths for common methods like sum and mean. For a RasterStack this can be an order of magnitude or more improvement overall.

At the same time I fixed the dumb disaggregate arguments.

@tiemvanderdeure this was inspired by your comment about aggregate proving to be actually slower than resample. So if you want to review...
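For a flavour of what such a fast path can look like, a minimal sketch with hypothetical names (assuming integer scales, a memory-backed matrix, and no missing values; not the PR's actual implementation):

using Statistics

# aggregate by taking the mean of each scale[1] x scale[2] block,
# reading each source element exactly once via views
function agg_mean(src::AbstractMatrix, scale::NTuple{2,Int})
    dst = similar(src, float(eltype(src)), size(src) .÷ scale)
    for I in CartesianIndices(dst)
        upper = (Tuple(I) .- 1) .* scale .+ 1
        block = view(src, map((u, s) -> u:u+s-1, upper, scale)...)
        dst[I] = mean(block)
    end
    return dst
end

agg_mean(rand(100, 100), (10, 10))  # 10x10 matrix of block means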