
Variable size buffer management #265

Merged: 34 commits into develop from feature/messaging_system, Dec 23, 2024

Conversation

@ehpor commented Dec 2, 2024 (Collaborator)

A first step toward shared-memory streams with variable-sized data.

This PR introduces memory allocators that manage an external piece of memory, whether shared, GPU or local memory. It defines two allocators:

  • A pool allocator. This uses fixed-size blocks of memory and allows allocation and deallocation of these blocks in arbitrary order. The unallocated blocks are tracked internally on a free list, implemented as a stack.
  • A free-list allocator. This allows for variable-size blocks of memory, which are tracked by a singly-linked list called the free list. In this implementation, the free list is kept ordered to allow easier coalescing of neighboring blocks when both are free.

Neither implementation uses locks (= mutexes). The intention behind this is to avoid deadlocking one process when another process crashes while holding the lock. Another reason is tolerance to priority inversion: lock-free programming ensures that a low-priority process never blocks a high-priority process.

This PR also includes a lock-free hash map implementation that will be needed later on.

Algorithm description

CAS loops

Being lock-free data structures, both PoolAllocator and FreeListAllocator make extensive use of the CAS (Compare And Swap) atomic instruction. This instruction performs the following code atomically:

def atomic_compare_and_swap(atomic_variable, expected, new_value):
    # The entire body executes as one indivisible hardware instruction.
    if atomic_variable.value == expected:
        atomic_variable.value = new_value
        return True
    # On failure, C++'s compare_exchange additionally writes the value it
    # actually observed back into `expected`, so the caller can retry.
    return False

This can be used to perform the following operation:

def cas_loop(obj):
    while True:
        data = obj.data.value              # snapshot the current value
        new_data = perform_operation(data)

        # Install new_data only if obj.data still holds the snapshot;
        # if another thread changed it in the meantime, retry.
        if atomic_compare_and_swap(obj.data, data, new_data):
            return

The CAS loop reads the data from an object, performs an operation on that data and tries to install the new data on the object. If the stored data is still what we read originally, it wasn't modified in between and the new value can be set safely. If it was modified, someone else changed it and we need to redo the operation. CAS is the base operation for many lock-free algorithms.

FYI: in C++, the above operation is often written as a do-while loop rather than a while loop with an exit criterion.

We use CAS loops for many operations inside the PoolAllocator and FreeListAllocator.

PoolAllocator

The pool allocator gives out resources from an internal pool. All resources are identical, so the blocks of memory it hands out must all be the same size.

It maintains an internal stack structure to manage access. To allocate, it pops the top (= head) element off the stack and gives it to the user; to deallocate, it pushes the returned element back onto the stack.
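
For illustration, here is a minimal sketch of those two CAS loops on such a stack (a Treiber stack). The names (AtomicRef, Node, PoolStack) are hypothetical, not the PR's actual classes, and the CAS is emulated with a lock because pure Python has no hardware CAS instruction:

import threading

class AtomicRef:
    # Emulates an atomic reference for illustration only; the real code
    # would use a hardware CAS. The lock merely stands in for the
    # atomicity of the single instruction.
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new_value):
        with self._lock:
            if self._value is expected:
                self._value = new_value
                return True
            return False

class Node:
    def __init__(self, block, next_node=None):
        self.block = block  # the fixed-size memory block being handed out
        self.next = next_node

class PoolStack:
    def __init__(self, blocks):
        head = None
        for block in blocks:
            head = Node(block, head)  # thread all free blocks into a stack
        self.head = AtomicRef(head)

    def allocate(self):
        # CAS loop: pop the head node off the stack.
        while True:
            top = self.head.load()
            if top is None:
                return None  # pool exhausted
            if self.head.compare_and_swap(top, top.next):
                return top

    def deallocate(self, node):
        # CAS loop: push the node back onto the stack.
        while True:
            top = self.head.load()
            node.next = top
            if self.head.compare_and_swap(top, node):
                return

Note that a real pointer-based stack like this also has to guard against the ABA problem (the head changing and changing back between the read and the CAS), typically by packing a version tag or block index next to the pointer; the lock-based emulation above hides that issue.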

FreeListAllocator

The free-list allocator gives out blocks of memory of variable size. It maintains a set of BlockDescriptor objects that describe each block by an offset and a size, plus a bit signifying whether the block is free or allocated. Since a block descriptor needs to be modified atomically, i.e. offset, size and free bit all at once, it must fit in a 64-bit value. We use 32 bits for the offset, 31 bits for the size and 1 bit for the free bit.
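
For concreteness, a sketch of that packing; the exact bit layout (offset in the high bits, free bit in the lowest bit) is an assumption for illustration and may differ from the PR's actual layout. Because the whole descriptor is a single 64-bit word, one CAS updates offset, size and free bit together:

OFFSET_BITS, SIZE_BITS, FREE_BITS = 32, 31, 1

def pack_descriptor(offset, size, free):
    # 32-bit offset | 31-bit size | 1-bit free flag -> one 64-bit word.
    assert 0 <= offset < (1 << OFFSET_BITS)
    assert 0 <= size < (1 << SIZE_BITS)
    return (offset << (SIZE_BITS + FREE_BITS)) | (size << FREE_BITS) | int(free)

def unpack_descriptor(word):
    free = bool(word & 1)
    size = (word >> FREE_BITS) & ((1 << SIZE_BITS) - 1)
    offset = word >> (SIZE_BITS + FREE_BITS)
    return offset, size, free

assert unpack_descriptor(pack_descriptor(1024, 4096, True)) == (1024, 4096, True)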

The FreeListAllocator maintains a sorted singly-linked list of BlockDescriptors that are all free. Occasionally, while a block is being worked on, it may have its free bit set to False while still being on the linked list. This is rectified as soon as the operation finishes.

During allocation, the algorithm searches for the first free block of equal or larger size. If the block is exactly the right size, it is removed from the free list and returned. If it is too large, it is split into two blocks, one of which is returned.
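
Ignoring the atomics for readability, the allocation step looks roughly like the following sketch, where a sorted Python list of (offset, size) pairs stands in for the linked free list (hypothetical code, not the PR's; in the real allocator every list update is itself a CAS loop):

def allocate(free_list, size):
    # First-fit search over a free list kept sorted by offset.
    for i, (offset, block_size) in enumerate(free_list):
        if block_size == size:
            del free_list[i]  # exact fit: take the whole block
            return offset
        if block_size > size:
            # Too large: return the front part and leave the remainder
            # on the free list as a new, smaller free block.
            free_list[i] = (offset + size, block_size - size)
            return offset
    raise MemoryError("no free block large enough")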

During deallocation, the algorithm inserts the block into the linked list at the right location and attempts to coalesce adjacent blocks to form larger ones. The coalescing is also done atomically, of course. An edge case needs to be handled here: when two adjacent blocks are inserted at the same time, neither sees the other, so two adjacent free blocks can end up on the free list uncoalesced. Because of this, a block needs to iteratively re-check for adjacent free blocks after coalescing, in case that race condition left two free blocks next to each other. (Potentially we could instead try to coalesce every block pair in the free list; that operation is also linear in the number of elements, but requires traversing the linked list only once rather than once per coalescence.)
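
A matching single-threaded sketch of deallocation with coalescing, on the same (offset, size) representation as above; the real implementation performs the insertion and each merge atomically and then re-checks its neighbours to close the race just described:

import bisect

def deallocate(free_list, offset, size):
    # Insert the freed block at its sorted position in the free list.
    i = bisect.bisect(free_list, (offset, size))
    free_list.insert(i, (offset, size))

    # Coalesce with the right neighbour if it starts where this block ends.
    if i + 1 < len(free_list) and offset + size == free_list[i + 1][0]:
        free_list[i] = (offset, size + free_list[i + 1][1])
        del free_list[i + 1]

    # Coalesce with the left neighbour if it ends where this block starts.
    if i > 0 and free_list[i - 1][0] + free_list[i - 1][1] == offset:
        free_list[i - 1] = (free_list[i - 1][0],
                            free_list[i - 1][1] + free_list[i][1])
        del free_list[i]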

Hash map

Since we envision the need for key-value storage (like a C++ map or a Python dictionary) in shared memory, we include a HashMap in this PR. This hash map is specifically for fixed-size strings and does not allow removal of elements. It hashes the string with MurmurHash3 and uses the hash to pick a bucket. If the key doesn't match the key stored in that bucket, the hash map looks in the bucket to the right and repeats that process until the right key is found (linear probing).
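
As a rough sketch of that probing scheme (illustrative only: Python's built-in hash() stands in for MurmurHash3, and plain assignments stand in for the CAS a lock-free map would use to claim a bucket):

class FixedHashMap:
    # Open-addressing hash map with linear probing and no removal.
    def __init__(self, capacity):
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        # Walk right from the hashed bucket until we find this key or an
        # empty bucket. Assumes the table never completely fills up.
        i = hash(key) % len(self.keys)
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % len(self.keys)
        return i

    def insert(self, key, value):
        i = self._probe(key)
        self.keys[i] = key  # real code: claim the bucket with a CAS
        self.values[i] = value

    def lookup(self, key):
        i = self._probe(key)
        return self.values[i] if self.keys[i] == key else None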

Benchmarks

Pool allocator:

Linux Scalability:
Time: 0.643032 sec
Throughput: 3.11027e+07 ops/s

FreeListAllocator:

Linux Scalability:
Time: 0.238074 sec
Throughput: 8.40075e+07 ops/s
Threadtest:
Time: 0.20949 sec
Throughput: 9.547e+07 ops/s
Larson benchmark:
Time: 2.84551 sec
Throughput: 7.02826e+06 ops/s

Linux scalability performs 10,000,000 allocations and then deallocates them in the same order. ThreadTest does the same, but in batches of 10,000 allocations and deallocations. Both of these are best-case scenarios. The Larson benchmark performs allocations and stores each result at a random index in a 1,000-element list; if that index was already occupied, it first deallocates the block stored there. Block sizes are randomized too. This is a typical allocation pattern for our use case.

So, best case, about 12 ns per allocation or deallocation; average case, about 140 ns.
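
(Worked out from the throughputs above: 1 / (8.40×10⁷ ops/s) ≈ 11.9 ns per operation for the Linux-scalability best case, and 1 / (7.03×10⁶ ops/s) ≈ 142 ns for the Larson average case.)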

HashMap:

Insertion time: 46 ns
Lookup time: 51 ns

Inspiration

The free-list allocator was roughly inspired by / based on:

@ehpor ehpor added the enhancement New feature or request label Dec 2, 2024
@ehpor ehpor self-assigned this Dec 2, 2024
@ehpor ehpor marked this pull request as ready for review December 11, 2024 23:36
@ehpor ehpor force-pushed the feature/messaging_system branch from 8f7a855 to 2d6905d on December 13, 2024 01:15
@ehpor ehpor force-pushed the feature/messaging_system branch from 16acf49 to 8cbc70c on December 20, 2024 19:18
@raphaelpclt commented Dec 20, 2024 (Collaborator)

Testing on HiCAT. Compilation: ok
Benchmark results:

  • pool allocator: [benchmark screenshot]
  • free list allocator: [benchmark screenshot]
  • hash map: [benchmark screenshot]

@ehpor (Author) commented Dec 20, 2024

The bug, introduced by yesterday's refactor, is now fixed. @raphaelpclt

@raphaelpclt left a comment (Collaborator)

Looks good from my limited C++ knowledge.

@ehpor (Author) commented Dec 23, 2024

I'll merge this then, since nothing depends on this new code yet, so fixes can be done at a later stage if I messed something up.

@ehpor ehpor merged commit ff2e9d2 into develop Dec 23, 2024
6 checks passed
@ehpor ehpor deleted the feature/messaging_system branch December 23, 2024 23:28