Speed up wgpu passes/allocations #56
Conversation
I only have two comments! Otherwise LGTM
// TODO: The semantics of this seems really strange, as we might be copying before other commands
// are flushed & done. Ideally, we should pass in the actual current encoder and queue commands there
// or remove the need to copy entirely. This is not used at the moment.
I think I would remove `copy` from the storage in this PR, since it's not safe to use without `sync`, which is removed in this PR.
Oh great! I've removed it and the code that was using it. Lmk if that's too aggressive, can also just replace wgpu copy specifically with a panic.
_ => None,
};

let pass = self.start_compute_pass();
I think I would rename `start_compute_pass` and `end_compute_pass` to `add_compute_pass` and `clear_compute_pass`, otherwise it's strange that we start way more compute passes than we end.
Yeah, good point. I've actually just inlined `start_compute_pass` now; it's probably not a good idea to call that from just anywhere anyway. Renamed `end_compute_pass` to `clear_compute_pass`.
It does prevent an O(n^2) in max_tasks, but it's not worth it.
LGTM
This PR significantly reduces the CPU overhead, and slightly improves the GPU performance, of CubeCL on wgpu. It got a bit bigger than intended, as there was a cascade of things to do, so bear with me!
Outline
wgpu 22.0.0 has a new method on `ComputePass`, `leak_lifetime`, which allows using the compute pass without a complicated lifetime. It's still safe (there are runtime checks as well); it just means you have to be more careful. This allows CubeCL to batch executions into one compute pass, which is faster. However, you can't use `encoder.copy_buffer_to_buffer` while a compute pass is still open. Since `client.create()` gets called before each `execute`, this prevents the compute pass optimization from really working.
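A minimal sketch of the idea (not the actual CubeCL code): keep one lifetime-erased compute pass open across several dispatches and only close it when an encoder-level command or a submit is needed. The PR text calls the method `leak_lifetime`; the released wgpu 22 API exposes the same idea as `ComputePass::forget_lifetime`, which is what this sketch assumes. All type and function names here are illustrative.

```rust
struct PassState {
    pass: Option<wgpu::ComputePass<'static>>,
    encoder: wgpu::CommandEncoder,
}

impl PassState {
    /// Lazily open a compute pass and erase its borrow of the encoder.
    fn compute_pass(&mut self) -> &mut wgpu::ComputePass<'static> {
        if self.pass.is_none() {
            let pass = self
                .encoder
                .begin_compute_pass(&wgpu::ComputePassDescriptor {
                    label: None,
                    timestamp_writes: None,
                })
                .forget_lifetime(); // assumed wgpu 22 API; validity is still checked at runtime
            self.pass = Some(pass);
        }
        self.pass.as_mut().unwrap()
    }

    /// Record one kernel into the currently open pass (opening it if needed).
    fn dispatch(
        &mut self,
        pipeline: &wgpu::ComputePipeline,
        bind_group: &wgpu::BindGroup,
        [x, y, z]: [u32; 3],
    ) {
        let pass = self.compute_pass();
        pass.set_pipeline(pipeline);
        pass.set_bind_group(0, bind_group, &[]);
        pass.dispatch_workgroups(x, y, z);
    }

    /// Close the pass. This must happen before encoder-level commands such as
    /// `copy_buffer_to_buffer`, and before finishing/submitting the encoder.
    fn clear_compute_pass(&mut self) {
        self.pass = None;
    }
}
```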
The way `client.create()` works is sub-optimal anyway: `queue.write_buffer_with` is faster, and on most GPUs doesn't need a staging buffer & copy at all, saving CPU and GPU work. This should also be a nice saving when e.g. uploading a batch of data for training.
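A minimal sketch of what an upload through `Queue::write_buffer_with` looks like, assuming `queue`, `buffer`, `offset`, and `data` already exist; the function name is illustrative. The queue hands back a staging view to fill directly (and can skip the extra copy on unified-memory GPUs), and the write is ordered at the start of the next submission.

```rust
use std::num::NonZeroU64;

fn upload(queue: &wgpu::Queue, buffer: &wgpu::Buffer, offset: u64, data: &[u8]) {
    let size = NonZeroU64::new(data.len() as u64).expect("empty upload");
    // Returns None if the write is invalid (out of bounds, misaligned, ...).
    let mut view = queue
        .write_buffer_with(buffer, offset, size)
        .expect("invalid write_buffer_with parameters");
    view.copy_from_slice(data);
    // The data reaches GPU memory when the view is dropped, before the next queue submission.
}
```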
Using `queue.write_buffer_with` is tricky, however: the GPU memory gets written at the start of the command buffer submitted to the queue. This doesn't work with pooled memory, though! CubeCL might want an order like `kernel A -> write_buffer -> kernel B`. Since the write actually happens first, kernel A would stamp over the memory. So we need a way to avoid reusing memory that was already used earlier in the pass. I've considered some other options, like a separate memory pool for `create()` allocations, but ultimately went with the approach that also applies to this issue, where you can exclude a set of resources when allocating. For this case we would only need to avoid overlapping memory regions, not whole buffers, but that would get more complicated and probably isn't worth it.
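A hypothetical sketch of the "exclude on allocation" idea; none of these names are the real CubeCL API. `reserve` takes a set of storage ids that the returned slice may not live in, so a buffer written via `write_buffer_with` is never one that an earlier kernel in the same pass already reads or writes.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct StorageId(u64);

struct SliceHandle {
    storage: StorageId,
    offset: u64,
    size: u64,
}

trait MemoryPool {
    /// Return a slice of `size` bytes that does not live in any storage in `exclude`.
    fn reserve(&mut self, size: u64, exclude: &HashSet<StorageId>) -> SliceHandle;
}

/// During `create()`: exclude every storage already touched by kernels recorded
/// in the currently open compute pass.
fn create(pool: &mut dyn MemoryPool, used_this_pass: &HashSet<StorageId>, len: u64) -> SliceHandle {
    pool.reserve(len, used_this_pass)
}
```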
An earlier attempt at the issue above was to exclude `ChunkId`s. This is OK but a bit strange: the requirement is really to not share buffers. In the implementation one chunk == one buffer, so it could work, but that raises the question of why there are `ChunkId`s in the first place. This seemed worth simplifying, so the API now just maps `StorageId`s to chunk metadata, as sketched below. This saves some overhead and lookups as well.
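A hypothetical sketch of the simplified bookkeeping, with illustrative names only: the pool keys chunk metadata directly by storage id, since one storage buffer corresponds to one chunk, removing the extra `ChunkId` indirection.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct StorageId(u64);

struct ChunkMeta {
    size: u64,
    /// (offset, size) of the slices currently carved out of this buffer.
    slices: Vec<(u64, u64)>,
}

struct Pool {
    chunks: HashMap<StorageId, ChunkMeta>,
}

impl Pool {
    /// One lookup by storage id replaces the old StorageId -> ChunkId -> chunk chain.
    fn chunk(&self, id: StorageId) -> Option<&ChunkMeta> {
        self.chunks.get(&id)
    }
}
```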
The current pool code was generic over the chunk and slice type, and mocked these for testing. That meant I couldn't get the storage ID, however, and imho this abstraction & mocking isn't great. I've removed it and made the tests work with a "real" setup, which exercises more parts of the code. This might catch some more bugs.
The `sync()` method in `mm.alloc()` and `mm.reserve()` made all of this trickier as well, with gnarlier semantics (how to deal with writes that happen as soon as we sync?). This doesn't seem to be used atm and imho sounds like trouble in general! I've removed this for now, but lmk if I'm missing the crucial reason it should be there.
The ability of `ComputeStorage` to copy buffers is also a bit problematic, as it probably also needs to be ordered properly within the encoder. I've just left a note of this for now as it's not currently used. I'm not sure what the plans are for extending the max memory, but it would be great if it could be done without copying.
Benchmarks
Some benchmarks on my Burn gaussian splatting code, measuring:
- Inference CPU
- Inference + GPU sync
- Training step CPU
Possible future work