
Speed up wgpu passes/allocations #56

Merged: 8 commits into tracel-ai:main on Aug 12, 2024

Conversation

ArthurBrussee (Contributor):
This PR significantly reduces CPU overhead, and slightly improves GPU performance, of cube on wgpu. It got a bit bigger than intended, as there was a cascade of things to do, so bear with me!

Outline

  • Wgpu 22.0.0 has a new method on ComputePass, leak_lifetime, which allows using the compute pass without a complicated lifetime. It's still safe (there are runtime checks as well); it just means you have to be more careful. This allows CubeCL to batch executions into one compute pass, which is faster.
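The batching idea can be sketched with plain types (a minimal model, not the actual CubeCL/wgpu API; `Pass`, `Server`, `execute` and `clear_compute_pass` are illustrative stand-ins): keep one pass open across executes and only close it when a non-pass command forces it.

```rust
// Illustrative model of batching dispatches into one open compute pass.
// `Pass` stands in for a lifetime-erased wgpu ComputePass.
struct Pass {
    dispatches: usize,
}

struct Server {
    pass: Option<Pass>,
    passes_created: usize,
}

impl Server {
    fn new() -> Self {
        Server { pass: None, passes_created: 0 }
    }

    // Reuse the open pass if there is one, instead of one pass per kernel.
    fn execute(&mut self) {
        if self.pass.is_none() {
            self.pass = Some(Pass { dispatches: 0 });
            self.passes_created += 1;
        }
        self.pass.as_mut().unwrap().dispatches += 1;
    }

    // Commands like copy_buffer_to_buffer must end the current pass first.
    fn clear_compute_pass(&mut self) {
        self.pass = None;
    }
}

fn main() {
    let mut server = Server::new();
    for _ in 0..100 {
        server.execute();
    }
    server.clear_compute_pass();
    // 100 dispatches were recorded, but only one pass was created.
    assert_eq!(server.passes_created, 1);
    println!("passes created: {}", server.passes_created);
}
```

The point is only the shape of the optimization: with a borrowed-lifetime pass, the pass could not outlive one call, forcing a pass per dispatch; erasing the lifetime lets it span many.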

  • However, you can't use encoder.copy_buffer_to_buffer while a compute pass is still open. Since client.create() gets called before each execute, this prevents the compute pass optimization from really working.

  • The way client.create() works is sub-optimal anyway: queue.write_buffer_with is faster and, on most GPUs, doesn't need a staging buffer & copy at all, saving CPU and GPU work. This should also be a nice saving when, e.g., uploading a batch of data for training.

  • Using queue.write_buffer_with is tricky, however: the GPU memory gets written at the start of the command buffer submitted to the queue. That doesn't work with pooled memory! Cube might want an order like kernel A -> write_buffer -> kernel B, but since the write happens first, kernel A stamps over the uploaded memory.
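A minimal model of why call order doesn't help here (plain Rust, no wgpu; all names are made up): queued writes are applied at the start of the submission, so an upload issued after kernel A still lands before kernel A runs, and a pooled buffer reused by kernel A clobbers it.

```rust
use std::collections::HashMap;

// A recorded kernel that writes `value` into buffer `writes`.
struct Kernel {
    writes: u32,
    value: i32,
}

// Illustrative model of a queue submission: queued buffer writes are
// applied up front, then encoded kernels run in recorded order.
struct Submission {
    queue_writes: Vec<(u32, i32)>, // (buffer id, value), applied first
    kernels: Vec<Kernel>,
}

fn submit(sub: &Submission, memory: &mut HashMap<u32, i32>) {
    // All write_buffer_with-style uploads land at the start...
    for (buf, v) in &sub.queue_writes {
        memory.insert(*buf, *v);
    }
    // ...then the kernels run, in order.
    for k in &sub.kernels {
        memory.insert(k.writes, k.value);
    }
}

fn main() {
    let mut memory: HashMap<u32, i32> = HashMap::new();
    // Intended order: kernel A (writes pooled buffer 0) -> upload 42 into
    // buffer 0 -> kernel B reads buffer 0. With pooled memory, buffer 0
    // is shared between kernel A's output and the upload.
    let sub = Submission {
        queue_writes: vec![(0, 42)],              // the upload, issued last
        kernels: vec![Kernel { writes: 0, value: 7 }], // kernel A
    };
    submit(&sub, &mut memory);
    // Kernel B would observe 7, not the uploaded 42: A stamped over it.
    assert_eq!(memory[&0], 7);
}
```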

  • So, we need a way to avoid memory that was used earlier in the pass. I've considered some other options, like a separate memory pool for create() allocations, but ultimately went with the approach that also applies to this issue: allowing a set of resources to be excluded when allocating. For this case we only need to avoid overlapping memory regions, not whole buffers, but that would be more complicated and probably isn't worth it.
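The exclusion idea in the bullet above, as a minimal sketch (StorageId, Slice and reserve here are hypothetical stand-ins for the real pool types): a free slice is skipped when its backing storage was already touched earlier in the pass.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the real storage handle type.
type StorageId = u32;

// A pooled slice backed by some storage buffer.
struct Slice {
    storage: StorageId,
    free: bool,
}

// Reserve a free slice, skipping any storage in the exclusion set.
fn reserve(slices: &mut [Slice], exclude: &HashSet<StorageId>) -> Option<StorageId> {
    for slice in slices.iter_mut() {
        if slice.free && !exclude.contains(&slice.storage) {
            slice.free = false;
            return Some(slice.storage);
        }
    }
    None // the real pool would fall back to allocating a fresh buffer
}

fn main() {
    let mut slices = vec![
        Slice { storage: 1, free: true }, // buffer already used this pass
        Slice { storage: 2, free: true },
    ];
    let exclude: HashSet<StorageId> = [1].into_iter().collect();
    // The allocator skips storage 1 even though its slice is free.
    assert_eq!(reserve(&mut slices, &exclude), Some(2));
}
```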

  • An earlier attempt at the issue above was to exclude ChunkIds. This works but is a bit strange: the real requirement is to not share buffers. In the current implementation one chunk == one buffer, so it could work, but that raises the question of why there are ChunkIds in the first place. This seemed viable to simplify, so the API now just maps StorageIds -> chunk metadata. This saves some overhead & lookups as well.
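A tiny sketch of the simplified mapping (the types here are invented stand-ins; the real pool tracks more metadata): chunk metadata keyed directly by StorageId, with no intermediate ChunkId to look up or keep in sync.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the real storage handle type.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct StorageId(u32);

// Per-chunk bookkeeping; one chunk == one buffer, so key it by StorageId.
struct ChunkMeta {
    size: u64,
}

struct Pool {
    chunks: HashMap<StorageId, ChunkMeta>,
}

impl Pool {
    // One lookup instead of StorageId -> ChunkId -> chunk.
    fn chunk(&self, id: StorageId) -> Option<&ChunkMeta> {
        self.chunks.get(&id)
    }
}

fn main() {
    let mut chunks = HashMap::new();
    chunks.insert(StorageId(0), ChunkMeta { size: 1024 });
    let pool = Pool { chunks };
    assert_eq!(pool.chunk(StorageId(0)).map(|c| c.size), Some(1024));
    assert!(pool.chunk(StorageId(1)).is_none());
}
```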

  • The current pool code was generic over the chunk and slice type, and mocked these for testing. That meant I couldn't get the storage ID, however, and imho this abstraction & mocking isn't great. I've removed it and made the tests work with a "real" setup, which exercises more parts of the code and might catch some more bugs.

  • The sync() method in mm.alloc() and mm.reserve() made all of this trickier as well, with gnarlier semantics (how do we deal with writes that happen as soon as we sync?). This doesn't seem to be used atm and imho sounds like trouble in general! I've removed it for now, but lmk if I'm missing a crucial reason it should be there.

  • The ability of ComputeStorage to copy buffers is also a bit problematic, as it probably also needs to be ordered properly within the encoder. I've just left a note about this for now, as it's not currently used. I'm not sure what the plans are for extending the max memory, but it would be great if it could be done without copying.

Benchmarks

Some benchmarks on my burn gaussian splatting code:

Inference CPU
(benchmark screenshot)

Inference + GPU sync
(benchmark screenshot)

Training step CPU
(benchmark screenshot)

Possible future work

  • Merge all constants into a single info struct for wgsl. Allocating a small 4-byte buffer for each float isn't very efficient.
  • Mark these constant buffers as uniform memory. This might require a separate pool of 'uniform buffers' after all, not sure.
  • More memory juggling to solve Deferred tensor allocations to avoid buffer sharing with conflicting visibility in WebGPU (burn#1996)
  • Remove the SmallMemoryPool: I suspect that if all allocations for uniforms were batched, it would be redundant.
  • Remove the simple memory allocator, as the new dynamic one is probably better all around.
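The first bullet (one info struct instead of many 4-byte buffers) might look roughly like this on the Rust side. The field names and layout are invented for illustration, and the WGSL-side struct would have to match the same layout:

```rust
// Illustrative sketch of packing per-kernel scalars into one struct
// uploaded as a single buffer, instead of one tiny buffer per value.
// Field names are made up; the wgsl struct must use a matching layout.
#[repr(C)]
#[derive(Clone, Copy)]
struct Info {
    rank: u32,
    num_elems: u32,
    scale: f32,
    _pad: u32, // keep the struct a multiple of 16 bytes for uniform use
}

fn main() {
    let info = Info { rank: 2, num_elems: 1024, scale: 0.5, _pad: 0 };
    // One 16-byte upload replaces three separate 4-byte buffers.
    assert_eq!(std::mem::size_of::<Info>(), 16);
    // Raw bytes to hand to a single buffer write.
    let bytes: &[u8] = unsafe {
        std::slice::from_raw_parts(
            (&info as *const Info).cast::<u8>(),
            std::mem::size_of::<Info>(),
        )
    };
    assert_eq!(bytes.len(), 16);
    println!("info struct is {} bytes", bytes.len());
}
```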

@nathanielsimard (Member) left a comment:
I only have two comments! Otherwise LGTM

Comment on lines 129 to 131
// TODO: The semantics of this seems really strange, as we might be copying before other commands
// are flushed & done. Ideally, we should pass in the actual current encoder and queue commands there
// or remove the need to copy entirely. This is not used at the moment.
nathanielsimard (Member):
I think I would remove copy from the storage in this PR, since it's not safe to use without sync, which is removed in this PR.

ArthurBrussee (Contributor, Author):
Oh great! I've removed it and the code that was using it. Lmk if that's too aggressive; I can also just replace the wgpu copy specifically with a panic.

_ => None,
};

let pass = self.start_compute_pass();
nathanielsimard (Member):
I think I would rename start_compute_pass and end_compute_pass to add_compute_pass and clear_compute_pass; otherwise it's strange that we start way more compute passes than we end.

ArthurBrussee (Contributor, Author):
Yeah, good point. I've actually just inlined start_compute_pass now; it's probably not a good idea to call that from anywhere else anyway. Renamed end_compute_pass to clear_compute_pass.

@nathanielsimard (Member) left a comment:
LGTM

@nathanielsimard nathanielsimard merged commit d94a07a into tracel-ai:main Aug 12, 2024
1 of 2 checks passed
@ArthurBrussee ArthurBrussee deleted the wgpu-speed branch August 13, 2024 13:42