Variable size buffer management #265
Merged
ehpor force-pushed the feature/messaging_system branch from 8f7a855 to 2d6905d on December 13, 2024 01:15
The results should be readable.
ehpor force-pushed the feature/messaging_system branch from 16acf49 to 8cbc70c on December 20, 2024 19:18
The bug, introduced by yesterday's refactor, is now fixed. @raphaelpclt
raphaelpclt approved these changes on Dec 23, 2024
Looks good from my limited c++ knowledge
I'll merge this then. Nothing depends on this new code yet, so fixes can be done at a later stage if I messed something up.
A first step to having shared memory streams with variable-sized data.
This PR introduces memory allocators that manage an external piece of memory, whether shared, GPU or local. It defines two allocators: PoolAllocator and FreeListAllocator.
Neither implementation uses locks (= mutexes). The intention behind this is to avoid deadlocking one process when another process crashes while holding the lock. Another reason is immunity to priority inversion: without locks, a low-priority process can never block a high-priority process by holding a lock that the latter needs.
This PR also includes a lock-free hash map implementation that will be needed later on.
Algorithm description
CAS loops
Being lock-free data structures, both PoolAllocator and FreeListAllocator make extensive use of the CAS (Compare And Swap) atomic instruction. This instruction atomically compares the value of an object with an expected value and, only if the two are equal, replaces it with a new value. It is typically used inside a retry loop, the CAS loop.
The CAS-loop gets the data from an object, performs an operation on that data and tries to set the new data on the object. If the data was still what we got originally, it wasn't modified in between and we can safely set the new value. If it was modified, it was changed by someone else and we need to do the operation again. The CAS is the base operation for many lock-free algorithms.
FYI: in C++, the above operation is often done using a do-while loop, rather than a while loop with exit criterion.
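As an illustration of the two ideas above, here is a minimal sketch using `std::atomic`. The first function spells out what a CAS does as ordinary (non-atomic) code; the second is a do-while CAS loop. The function names and the `2 * value + 1` operation are hypothetical and not taken from this PR.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// What a CAS does, written as ordinary code. The hardware performs all of
// this as one indivisible step; this plain version is NOT atomic and exists
// for illustration only.
template <typename T>
bool compare_and_swap_illustration(T &object, T &expected, T desired)
{
    if (object == expected)
    {
        object = desired;
        return true;
    }
    expected = object;  // Report the value that was actually found.
    return false;
}

// A CAS loop: atomically replace `value` by `2 * value + 1` (a hypothetical
// operation) and return the value that was stored.
inline std::uint64_t atomic_double_plus_one(std::atomic<std::uint64_t> &value)
{
    std::uint64_t old_value = value.load(std::memory_order_relaxed);
    std::uint64_t new_value;
    do
    {
        new_value = 2 * old_value + 1;
        // On failure, compare_exchange_weak writes the current value into
        // old_value, so the next iteration recomputes from fresh data.
    } while (!value.compare_exchange_weak(old_value, new_value,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed));
    return new_value;
}
```

`compare_exchange_weak` may fail spuriously, which is harmless here because the loop retries anyway.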
We use CAS loops for many operations inside the PoolAllocator and FreeListAllocator.
PoolAllocator
The pool allocator gives out resources from an internal pool. All resources are interchangeable, so the blocks of memory that it gives out are required to be the same size.
It maintains an internal stack structure to manage access. To allocate, it removes the top (= head) element from the stack and gives it to the user. To deallocate, it pushes the given element back onto the stack.
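The stack described above could be sketched roughly as follows. This is not the PR's implementation: the class name `PoolSketch`, the use of block indices instead of pointers, and the 32-bit tag packed next to the head index (a common way to guard against the ABA problem) are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical lock-free pool of `capacity` equally sized blocks. Block i
// would live at offset i * block_size in the managed memory; only the
// bookkeeping is shown here.
class PoolSketch
{
public:
    static constexpr std::uint32_t kInvalid = 0xFFFFFFFFu;

    explicit PoolSketch(std::uint32_t capacity)
        : next_(capacity), head_(pack(capacity > 0 ? 0u : kInvalid, 0))
    {
        // Initially every block is on the stack: 0 -> 1 -> ... -> end.
        for (std::uint32_t i = 0; i < capacity; ++i)
            next_[i].store(i + 1 < capacity ? i + 1 : kInvalid,
                           std::memory_order_relaxed);
    }

    // Pop the head of the stack. Returns kInvalid if the pool is empty.
    std::uint32_t allocate()
    {
        std::uint64_t old_head = head_.load(std::memory_order_acquire);
        std::uint64_t new_head;
        do
        {
            const std::uint32_t index = index_of(old_head);
            if (index == kInvalid)
                return kInvalid;
            const std::uint32_t next = next_[index].load(std::memory_order_relaxed);
            // Bumping the tag makes the CAS fail if the head was popped and
            // pushed back by someone else in the meantime.
            new_head = pack(next, tag_of(old_head) + 1);
        } while (!head_.compare_exchange_weak(old_head, new_head,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire));
        return index_of(old_head);
    }

    // Push a previously allocated block back onto the stack.
    void deallocate(std::uint32_t index)
    {
        assert(index < next_.size());
        std::uint64_t old_head = head_.load(std::memory_order_relaxed);
        std::uint64_t new_head;
        do
        {
            next_[index].store(index_of(old_head), std::memory_order_relaxed);
            new_head = pack(index, tag_of(old_head) + 1);
        } while (!head_.compare_exchange_weak(old_head, new_head,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }

private:
    static std::uint64_t pack(std::uint32_t index, std::uint32_t tag)
    {
        return (static_cast<std::uint64_t>(tag) << 32) | index;
    }
    static std::uint32_t index_of(std::uint64_t head) { return static_cast<std::uint32_t>(head); }
    static std::uint32_t tag_of(std::uint64_t head) { return static_cast<std::uint32_t>(head >> 32); }

    std::vector<std::atomic<std::uint32_t>> next_;  // next_[i]: block below i on the stack
    std::atomic<std::uint64_t> head_;               // (tag << 32) | index of top block
};
```

Storing indices rather than raw pointers keeps the bookkeeping valid when the same memory is mapped at different addresses in different processes.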
FreeListAllocator
The free-list allocator gives out blocks of memory of variable size. It maintains a set of BlockDescriptor objects that describe each block as an offset and size. We also include a bit signifying whether a block is free or allocated. Since block descriptors need to be modified atomically, i.e. offset, size and free bit all at the same time, each descriptor needs to fit in a 64-bit value. We use 32 bits for the offset, 31 bits for the size and 1 bit for the free bit.
The FreeListAllocator maintains a sorted singly-linked list of BlockDescriptors that are all free. Occasionally, while a block is being worked on, it may have its free bit set to False while it is still on the linked list. This is rectified as soon as the operation finishes.
During allocation, the algorithm searches for the first free block that is of larger or equal size. If the block is the right size, it removes it from the free list and returns the block. If it's too large, the block is split into two blocks, with one of the blocks being returned.
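The 32/31/1-bit layout and the split step can be sketched as below. The position of each field inside the 64-bit word, the names `BlockDescriptorSketch`, `pack`, `unpack` and `try_split`, and the choice to return the front part of the block are assumptions for illustration; the PR may differ on all of them. The linked-list handling (including unlinking an exactly fitting block) is left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Unpacked view of a block descriptor (hypothetical field order).
struct BlockDescriptorSketch
{
    std::uint32_t offset;  // 32 bits
    std::uint32_t size;    // 31 bits, so it must be below 2^31
    bool is_free;          // 1 bit
};

inline std::uint64_t pack(BlockDescriptorSketch d)
{
    assert(d.size < (1u << 31));
    return (static_cast<std::uint64_t>(d.offset) << 32) |
           (static_cast<std::uint64_t>(d.size) << 1) |
           (d.is_free ? 1u : 0u);
}

inline BlockDescriptorSketch unpack(std::uint64_t bits)
{
    return {static_cast<std::uint32_t>(bits >> 32),
            static_cast<std::uint32_t>((bits >> 1) & 0x7FFFFFFFu),
            (bits & 1u) != 0};
}

// Try to carve `size` bytes off the front of the free block stored in `slot`
// with a single CAS. On success, `slot` holds the (still free) remainder and
// `allocated` describes the part handed to the caller.
inline bool try_split(std::atomic<std::uint64_t> &slot, std::uint32_t size,
                      BlockDescriptorSketch &allocated)
{
    std::uint64_t old_bits = slot.load(std::memory_order_acquire);
    for (;;)
    {
        const BlockDescriptorSketch block = unpack(old_bits);
        if (!block.is_free || block.size < size)
            return false;  // Taken by someone else, or too small.
        const BlockDescriptorSketch remainder{block.offset + size, block.size - size, true};
        if (slot.compare_exchange_weak(old_bits, pack(remainder),
                                       std::memory_order_acq_rel,
                                       std::memory_order_acquire))
        {
            allocated = {block.offset, size, false};
            return true;
        }
        // CAS failed: old_bits now holds the current descriptor, so retry.
    }
}
```

Because offset, size and free bit live in one 64-bit word, a concurrent observer sees either the old block or the remainder, never a half-updated descriptor.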
During deallocation, the algorithm inserts the block into the linked list at the right location and attempts to coalesce adjacent blocks to form larger blocks. The coalescence is also done atomically, of course. An edge case needs to be handled here: two adjacent blocks can be inserted at the same time, so that neither is coalesced, resulting in two free adjacent blocks on the free list. Because of this, a block needs to iteratively check for adjacent blocks after coalescing, in case that race condition left two free blocks there. (Potentially we could try to coalesce every block pair in the free list, an operation that is also linear in the number of elements, but only requires traversing the linked list once rather than once for each coalescence.)
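The single-pass alternative mentioned in parentheses can be sketched in a deliberately simplified, single-threaded form. This assumes the free list is sorted by offset (the sort key is not stated above) and uses a plain `std::vector` instead of the atomic linked list; `FreeBlock` and `coalesce_all` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct FreeBlock
{
    std::uint32_t offset;
    std::uint32_t size;
};

// Merge every run of touching free blocks in one traversal. `blocks` must be
// sorted by offset and must not overlap.
inline std::vector<FreeBlock> coalesce_all(const std::vector<FreeBlock> &blocks)
{
    std::vector<FreeBlock> merged;
    for (const FreeBlock &block : blocks)
    {
        if (!merged.empty() &&
            merged.back().offset + merged.back().size == block.offset)
        {
            // The previous block ends exactly where this one starts.
            merged.back().size += block.size;
        }
        else
        {
            merged.push_back(block);
        }
    }
    return merged;
}
```

For example, free blocks at offsets 0, 16 and 32 with sizes 16, 16 and 8 collapse into one block of size 40, while a block at offset 64 stays separate. A concurrent version would have to perform each merge with CAS operations on the descriptors instead.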
Hash map
Since we're envisioning the need for key-value pair storage (i.e. a C++ map or a Python dictionary) in shared memory, we include a HashMap in this PR. This hash map is specifically for fixed-size strings and does not allow removal of elements. It uses MurmurHash3 to hash the strings and looks up the hash in the hash map. On the off chance that the key doesn't match the key stored in that hash bucket, the hash map looks in the bucket to the right and repeats that process until the right key is found (i.e. linear probing).
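A minimal single-threaded sketch of such an insert-only, linear-probing map is shown below. It is not the PR's lock-free implementation: the class name, the 32-byte key size and the `std::uint64_t` value type are assumptions, and FNV-1a is swapped in for MurmurHash3 to keep the example self-contained.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

class FixedKeyMapSketch
{
public:
    static constexpr std::size_t kKeySize = 32;  // hypothetical fixed key size
    using Key = std::array<char, kKeySize>;

    explicit FixedKeyMapSketch(std::size_t num_buckets) : buckets_(num_buckets)
    {
        assert(num_buckets > 0);
    }

    // Returns false if the key is too long, already present, or the map is full.
    bool insert(const std::string &key, std::uint64_t value)
    {
        if (key.size() > kKeySize)
            return false;
        const Key padded = pad(key);
        const std::size_t start = hash(padded) % buckets_.size();
        for (std::size_t probe = 0; probe < buckets_.size(); ++probe)
        {
            Bucket &bucket = buckets_[(start + probe) % buckets_.size()];
            if (!bucket.used)
            {
                bucket.key = padded;
                bucket.value = value;
                bucket.used = true;
                return true;
            }
            if (bucket.key == padded)
                return false;
            // Different key in this bucket: try the bucket to the right.
        }
        return false;
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const std::uint64_t *find(const std::string &key) const
    {
        if (key.size() > kKeySize)
            return nullptr;
        const Key padded = pad(key);
        const std::size_t start = hash(padded) % buckets_.size();
        for (std::size_t probe = 0; probe < buckets_.size(); ++probe)
        {
            const Bucket &bucket = buckets_[(start + probe) % buckets_.size()];
            if (!bucket.used)
                return nullptr;  // No removals, so an empty bucket ends the search.
            if (bucket.key == padded)
                return &bucket.value;
        }
        return nullptr;
    }

private:
    struct Bucket
    {
        Key key{};
        std::uint64_t value = 0;
        bool used = false;
    };

    // Copy the key into a zero-padded fixed-size array.
    static Key pad(const std::string &key)
    {
        Key padded{};
        std::memcpy(padded.data(), key.data(), key.size());
        return padded;
    }

    // 64-bit FNV-1a, used here as a stand-in for MurmurHash3.
    static std::uint64_t hash(const Key &key)
    {
        std::uint64_t h = 14695981039346656037ULL;
        for (char c : key)
        {
            h ^= static_cast<unsigned char>(c);
            h *= 1099511628211ULL;
        }
        return h;
    }

    std::vector<Bucket> buckets_;
};
```

Disallowing removal is what lets a lookup stop at the first empty bucket: a key that was ever inserted can never have an empty bucket between its home bucket and its actual position.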
Benchmarks
Pool allocator:
FreeListAllocator:
Linux scalability performs 10,000,000 allocations and then deallocates them in the same order. ThreadTest performs the same, but in batches of 10,000 allocations and deallocations. Both of these are best-case scenarios. The Larson benchmark performs allocations and puts each result at a random index in a 1,000-element list. If that index was already in use, it deallocates the block of memory stored there. Blocks of memory are randomized in size too. This is the typical allocation pattern for our use case.
So, in the best case, allocation and deallocation take about 12 ns; in the average case, about 140 ns.
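The Larson-style access pattern described above can be sketched as follows. This is not the benchmark code from the PR: `std::malloc`/`std::free` stand in for the allocator under test, the size range of 16 to 1024 bytes is an assumption, and timing is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <random>
#include <vector>

struct LarsonResult
{
    std::size_t deallocations;  // blocks freed because their slot was reused
    std::size_t live_at_end;    // blocks still allocated when the loop ended
};

// Allocate blocks of random size into random slots of a 1,000-element list,
// freeing whatever block previously occupied the chosen slot.
inline LarsonResult larson_style_workload(std::size_t iterations, unsigned seed)
{
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick_slot(0, 999);
    std::uniform_int_distribution<std::size_t> pick_size(16, 1024);  // assumed range
    std::vector<void *> slots(1000, nullptr);

    LarsonResult result{0, 0};
    for (std::size_t i = 0; i < iterations; ++i)
    {
        const std::size_t slot = pick_slot(rng);
        if (slots[slot] != nullptr)
        {
            std::free(slots[slot]);
            ++result.deallocations;
        }
        slots[slot] = std::malloc(pick_size(rng));
    }

    // Clean up the blocks that are still live.
    for (void *block : slots)
    {
        if (block != nullptr)
        {
            std::free(block);
            ++result.live_at_end;
        }
    }
    return result;
}
```

Unlike the two best-case benchmarks, this pattern frees blocks in an order unrelated to their allocation order and mixes sizes, which exercises splitting and coalescing in the free list.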
HashMap:
Inspiration
The free-list allocator was roughly inspired / based on: