
Variable size buffer management #265

Merged: 34 commits into develop from feature/messaging_system, Dec 23, 2024

Conversation

@ehpor commented Dec 2, 2024 (Collaborator)

A first step toward shared-memory streams with variable-sized data.

This PR introduces memory allocators that manage an external piece of memory, whether shared, GPU or local memory. It defines two allocators:

  • A pool allocator. This uses fixed-size blocks of memory and allows allocation and deallocation of these blocks in arbitrary order. The unallocated blocks are tracked internally on a free list, implemented as a stack.
  • A free-list allocator. This allows for variable-size blocks of memory, which are tracked by a singly-linked list called the free list. In this implementation, the free list is kept ordered to allow easier coalescing of neighboring blocks when both are free.

Neither implementation uses locks (= mutexes). The intention behind this is to avoid deadlocking one process when another process crashes while holding the lock. Another reason is tolerance to priority inversion: lock-free programming ensures that a low-priority process never blocks a high-priority process.

This PR also includes a lock-free hash map implementation that will be needed later on.

Algorithm description

CAS loops

Being lock-free data structures, both PoolAllocator and FreeListAllocator make extensive use of the CAS (Compare And Swap) atomic instruction. This instruction performs the following code atomically:

def atomic_compare_and_swap(atomic_variable, expected, new_value):
    # The entire body executes as one indivisible hardware instruction.
    if atomic_variable.value == expected:
        atomic_variable.value = new_value
        return True
    # On failure, C++'s compare_exchange additionally writes the value it
    # actually observed back into `expected`, so the caller can retry.
    return False

This can be used to perform the following operation:

def cas_loop(obj):
    while True:
        data = obj.data.value              # snapshot the current value
        new_data = perform_operation(data)

        # Install new_data only if obj.data still holds the snapshot;
        # if another thread changed it in the meantime, retry.
        if atomic_compare_and_swap(obj.data, data, new_data):
            return

The CAS loop reads the data from an object, performs an operation on that data and tries to install the new data on the object. If the stored data is still what we read originally, it wasn't modified in between and the new value can be set safely. If it was modified, someone else changed it and we need to redo the operation. CAS is the base operation for many lock-free algorithms.

FYI: in C++, the above operation is often written as a do-while loop rather than a while loop with an exit criterion.

We use CAS loops for many operations inside the PoolAllocator and FreeListAllocator.

PoolAllocator

The pool allocator gives out resources from an internal pool. All resources are identical, so the blocks of memory it hands out must all be the same size.

It maintains an internal stack structure to manage access. To allocate, it pops the top (= head) element off the stack and gives it to the user; to deallocate, it pushes the returned element back onto the stack.
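
For illustration, here is a minimal sketch of those two CAS loops on such a stack (a Treiber stack). The names (AtomicRef, Node, PoolStack) are hypothetical, not the PR's actual classes, and the CAS is emulated with a lock because pure Python has no hardware CAS instruction:

import threading

class AtomicRef:
    # Emulates an atomic reference for illustration only; the real code
    # would use a hardware CAS. The lock merely stands in for the
    # atomicity of the single instruction.
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new_value):
        with self._lock:
            if self._value is expected:
                self._value = new_value
                return True
            return False

class Node:
    def __init__(self, block, next_node=None):
        self.block = block  # the fixed-size memory block being handed out
        self.next = next_node

class PoolStack:
    def __init__(self, blocks):
        head = None
        for block in blocks:
            head = Node(block, head)  # thread all free blocks into a stack
        self.head = AtomicRef(head)

    def allocate(self):
        # CAS loop: pop the head node off the stack.
        while True:
            top = self.head.load()
            if top is None:
                return None  # pool exhausted
            if self.head.compare_and_swap(top, top.next):
                return top

    def deallocate(self, node):
        # CAS loop: push the node back onto the stack.
        while True:
            top = self.head.load()
            node.next = top
            if self.head.compare_and_swap(top, node):
                return

Note that a real pointer-based stack like this also has to guard against the ABA problem (the head changing and changing back between the read and the CAS), typically by packing a version tag or block index next to the pointer; the lock-based emulation above hides that issue.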

FreeListAllocator

The free-list allocator gives out blocks of memory of variable size. It maintains a set of BlockDescriptor objects that describe each block by an offset and a size, plus a bit signifying whether the block is free or allocated. Since a block descriptor needs to be modified atomically, i.e. offset, size and free bit all at once, it must fit in a 64-bit value. We use 32 bits for the offset, 31 bits for the size and 1 bit for the free bit.
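
For concreteness, a sketch of that packing; the exact bit layout (offset in the high bits, free bit in the lowest bit) is an assumption for illustration and may differ from the PR's actual layout. Because the whole descriptor is a single 64-bit word, one CAS updates offset, size and free bit together:

OFFSET_BITS, SIZE_BITS, FREE_BITS = 32, 31, 1

def pack_descriptor(offset, size, free):
    # 32-bit offset | 31-bit size | 1-bit free flag -> one 64-bit word.
    assert 0 <= offset < (1 << OFFSET_BITS)
    assert 0 <= size < (1 << SIZE_BITS)
    return (offset << (SIZE_BITS + FREE_BITS)) | (size << FREE_BITS) | int(free)

def unpack_descriptor(word):
    free = bool(word & 1)
    size = (word >> FREE_BITS) & ((1 << SIZE_BITS) - 1)
    offset = word >> (SIZE_BITS + FREE_BITS)
    return offset, size, free

assert unpack_descriptor(pack_descriptor(1024, 4096, True)) == (1024, 4096, True)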

The FreeListAllocator maintains a sorted singly-linked list of BlockDescriptors that are all free. Occasionally, while a block is being worked on, it may have its free bit set to False while still being on the linked list. This is rectified as soon as the operation finishes.

During allocation, the algorithm searches for the first free block of equal or larger size. If the block is exactly the right size, it is removed from the free list and returned. If it is too large, it is split into two blocks, one of which is returned.
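
Ignoring the atomics for readability, the allocation step looks roughly like the following sketch, where a sorted Python list of (offset, size) pairs stands in for the linked free list (hypothetical code, not the PR's; in the real allocator every list update is itself a CAS loop):

def allocate(free_list, size):
    # First-fit search over a free list kept sorted by offset.
    for i, (offset, block_size) in enumerate(free_list):
        if block_size == size:
            del free_list[i]  # exact fit: take the whole block
            return offset
        if block_size > size:
            # Too large: return the front part and leave the remainder
            # on the free list as a new, smaller free block.
            free_list[i] = (offset + size, block_size - size)
            return offset
    raise MemoryError("no free block large enough")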

During deallocation, the algorithm inserts the block into the linked list at the right location and attempts to coalesce adjacent blocks to form larger ones. The coalescing is also done atomically, of course. An edge case needs to be handled here: when two adjacent blocks are inserted at the same time, neither sees the other, so two adjacent free blocks can end up on the free list uncoalesced. Because of this, a block needs to iteratively re-check for adjacent free blocks after coalescing, in case that race condition left two free blocks next to each other. (Potentially we could instead try to coalesce every block pair in the free list; that operation is also linear in the number of elements, but requires traversing the linked list only once rather than once per coalescence.)
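
A matching single-threaded sketch of deallocation with coalescing, on the same (offset, size) representation as above; the real implementation performs the insertion and each merge atomically and then re-checks its neighbours to close the race just described:

import bisect

def deallocate(free_list, offset, size):
    # Insert the freed block at its sorted position in the free list.
    i = bisect.bisect(free_list, (offset, size))
    free_list.insert(i, (offset, size))

    # Coalesce with the right neighbour if it starts where this block ends.
    if i + 1 < len(free_list) and offset + size == free_list[i + 1][0]:
        free_list[i] = (offset, size + free_list[i + 1][1])
        del free_list[i + 1]

    # Coalesce with the left neighbour if it ends where this block starts.
    if i > 0 and free_list[i - 1][0] + free_list[i - 1][1] == offset:
        free_list[i - 1] = (free_list[i - 1][0],
                            free_list[i - 1][1] + free_list[i][1])
        del free_list[i]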

Hash map

Since we envision the need for key-value storage (like a C++ map or a Python dictionary) in shared memory, we include a HashMap in this PR. This hash map is specifically for fixed-size strings and does not allow removal of elements. It hashes the string with MurmurHash3 and uses the hash to pick a bucket. If the key doesn't match the key stored in that bucket, the hash map looks in the bucket to the right and repeats that process until the right key is found (linear probing).
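
As a rough sketch of that probing scheme (illustrative only: Python's built-in hash() stands in for MurmurHash3, and plain assignments stand in for the CAS a lock-free map would use to claim a bucket):

class FixedHashMap:
    # Open-addressing hash map with linear probing and no removal.
    def __init__(self, capacity):
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        # Walk right from the hashed bucket until we find this key or an
        # empty bucket. Assumes the table never completely fills up.
        i = hash(key) % len(self.keys)
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % len(self.keys)
        return i

    def insert(self, key, value):
        i = self._probe(key)
        self.keys[i] = key  # real code: claim the bucket with a CAS
        self.values[i] = value

    def lookup(self, key):
        i = self._probe(key)
        return self.values[i] if self.keys[i] == key else None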

Benchmarks

Pool allocator:

Linux Scalability:
Time: 0.643032 sec
Throughput: 3.11027e+07 ops/s

FreeListAllocator:

Linux Scalability:
Time: 0.238074 sec
Throughput: 8.40075e+07 ops/s
Threadtest:
Time: 0.20949 sec
Throughput: 9.547e+07 ops/s
Larson benchmark:
Time: 2.84551 sec
Throughput: 7.02826e+06 ops/s

Linux scalability performs 10,000,000 allocations and then deallocates them in the same order. ThreadTest does the same, but in batches of 10,000 allocations and deallocations. Both of these are best-case scenarios. The Larson benchmark performs allocations and stores each result at a random index in a 1,000-element list; if that index was already occupied, it first deallocates the block stored there. Block sizes are randomized too. This is a typical allocation pattern for our use case.

So, best case, about 12 ns per allocation or deallocation; average case, about 140 ns.
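
(Worked out from the throughputs above: 1 / (8.40×10⁷ ops/s) ≈ 11.9 ns per operation for the Linux-scalability best case, and 1 / (7.03×10⁶ ops/s) ≈ 142 ns for the Larson average case.)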

HashMap:

Insertion time: 46 ns
Lookup time: 51 ns

Inspiration

The free-list allocator was roughly inspired by / based on:

@ehpor ehpor added the enhancement New feature or request label Dec 2, 2024
@ehpor ehpor self-assigned this Dec 2, 2024
@ehpor ehpor marked this pull request as ready for review December 11, 2024 23:36
@ehpor ehpor force-pushed the feature/messaging_system branch from 8f7a855 to 2d6905d on December 13, 2024 01:15
@ehpor ehpor force-pushed the feature/messaging_system branch from 16acf49 to 8cbc70c on December 20, 2024 19:18
@raphaelpclt commented Dec 20, 2024 (Collaborator)

Testing on HiCAT. Compilation: ok
Benchmark results:

  • pool allocator: [benchmark screenshot]
  • free list allocator: [benchmark screenshot]
  • hash map: [benchmark screenshot]

@ehpor (Author) commented Dec 20, 2024

The bug, introduced by yesterday's refactor, is now fixed. @raphaelpclt

@raphaelpclt left a comment (Collaborator)

Looks good from my limited C++ knowledge.

@ehpor (Author) commented Dec 23, 2024

I'll merge this then, since nothing depends on this new code yet, so fixes can be done at a later stage if I messed something up.

@ehpor ehpor merged commit ff2e9d2 into develop Dec 23, 2024
6 checks passed
@ehpor ehpor deleted the feature/messaging_system branch December 23, 2024 23:28