-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix wgpu memory corruption for CubeCount::Dynamic #156
Conversation
// Keep track of any buffer that might be used in the wgpu queue, as we cannot copy into them | ||
// after they have any outstanding compute work. Calling get_resource repeatedly | ||
// will add duplicates to this, but that is ok. | ||
let handle = self.memory_management.get(binding.memory.clone()); | ||
self.storage_in_flight.push(handle.id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could remove duplicates by using a set? If the storage_in_flight
is cleared often it's probably not worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should actually probably create a data structure with methods. We use a slice in the memory management functions, but it could be a struct with the proper methods. Then we could change the data structure more easily (Vec under size of 10-20, a set if more than that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last time I checked this a set was slower, since this array isn't that big. But yeah probably worth just keeping the semantics simple here rather than optimizing. Added a "MemoryLock" type now. Also changed the function to take an Option<> as allocating an empty hashset for each call is a bit sad.
x.1 += 1; | ||
x.1 < 2 | ||
}); | ||
self.storage_in_flight.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why syncing means that we can remove the external readonly buffers from the list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right yes good point! The whole get_resource API is rather unsafe, as the "resource" is tied to the original allocation. As long as the allocation is upheld all the queue copy/write stuff here is fine as well, so it's the same issue as already exists - keeping a resource past the lifetime of the original allocation is unsafe.
I've implemented the solution we came up with in the Discord to keep the binding as part of the resource, or rather, wrapped stuff in a BindingResource so all servers can do this. Lmk how this approach looks!
A previous PR #56 has made it a bit trickier to deal with wgpu resources, and turns out there's another memory corruption possible now :) Resources used for
CubeCount::Dynamic
were not registered as "in use" which means it was possible for a copy operation & dispatch to be ordered incorrectly.Another problem is that
get_resource
is public which means users might add operations using a buffer themselves. These should also be tracked! The fix is to always register a buffer as 'used' if it's used inget_resource
.This fix also seems to alleviate the need for the mysterious "double use for used copy buffers" that was previously needed, mystery solved!
#56 has introduced two memory corruptions now, so it's clearly a bit too flaky... some simpler strategy to enable fast wgpu copying might be needed but don't have much inspiration atm!