New memory allocator: vsg::IntrusiveAllocator #1228
-
The original VSG allocator utilizes the [vsg::MemorySlots](https://github.com/vsg-dev/VulkanSceneGraph/blob/master/include/vsg/core/MemorySlots.h) class to manage the position of allocated and free slots in the memory blocks. I originally wrote the vsg::MemorySlots class for handling allocation of parts of vk/vsg::Buffer and vk/vsg::DeviceMemory, which require the memory slot tracking to be external to the actual Vulkan memory/buffers. To do the required allocated/available tracking the MemorySlots class uses std::multimap and std::map:

```cpp
std::multimap<size_t, size_t> _availableMemory;
std::map<size_t, size_t> _offsetSizes;
std::map<size_t, size_t> _reservedMemory;
```

This scheme has worked reliably, but it has an O(log n) cost for each allocation/deallocation, which means O(n log n) for all the allocations/deallocations, so when n gets large it starts falling behind allocators that can achieve an O(1) cost per allocation/deallocation and O(n) for all allocations/deallocations. This is what we see when comparing to std::new/delete: when we get into hundreds of millions of allocations the vsg::MemorySlots becomes a bottleneck.

While using size_t as the data type in the std::maps is great for scalability, it also adds a per-allocation overhead that, while not directly allocated in the memory block/vsg::Buffer/vsg::DeviceMemory, still has to occupy main memory. For hundreds of millions of nodes this cost becomes significant, more than doubling the total allocation footprint for these small node allocations.

All these memory and performance overheads are largely negligible for applications with tens of millions of memory allocations, so most applications won't notice them, and they'll benefit from the better traversal/scene graph access performance that block allocation provides, so all is good. However, as we've found out, for some really heavy duty applications that require hundreds of millions or even billions of allocations, the memory and performance overheads become their own bottleneck. The work I've undertaken with the new vsg::IntrusiveAllocator takes a completely different approach to vsg::MemorySlots: rather than tracking the allocated and available memory externally to the allocated memory slots, it stores the required indexing directly in the memory blocks.
The way the vsg::IntrusiveAllocator tracks all the possible slots of memory within a block is to allocate each memory block as a contiguous array of Element. The Element is a 4 byte struct with a union and nested struct within it, so that each Element can be accessed as a packed linked list in the form of 15 bit relative offsets, treated as a 4 byte index, or just treated as allocated memory. This 4 byte structure means that the minimum alignment is 4 bytes, and any larger alignment has to be a multiple of 4. This fits with how hardware manages things, as well as with std::new/delete, so it doesn't add an extra constraint that isn't already there. The data part of the Element struct is defined thus:

```cpp
struct Element
{
    union
    {
        uint32_t index;
        struct
        {
            unsigned int previous : 15;
            unsigned int next : 15;
            unsigned int status : 2;
        };
    };
    ....
};
```

Each memory block is managed by the IntrusiveAllocator::MemoryBlock class, which holds an Element* pointer, allocates the whole block with a single std::new/delete call, and then populates the block with Elements, with the first Element of each available slot packing the relative position of the previous slot, the relative position of the next slot, and a status of whether that slot is available (0) or allocated (1). The previous and next relative positions are 15 bit, so this limits the size of each memory allocation to 2^15-2; for allocations larger than this the IntrusiveAllocator falls back to using std::new/std::delete instead.

Allocating memory from blocks only has a performance advantage for data that consists of lots of small separate objects, like scene graph nodes or simple objects. Large data types like big vertex arrays or textures tend to be accessed either via a pointer or in their own batched processing, so they don't cause cache misses when allocated more randomly through memory, as within themselves they are coherent.

As well as the Element entries being used to mark the size and the next and previous memory slots as a form of relative linked list, for slots that are available to be allocated from, the MemoryBlock class also packs previous and next indices for the purposes of a linked free list. When a memory allocation is requested this linked list is walked to find a slot that is big enough for the allocation. That slot is then either consumed entirely, with its status changed to allocated and the slot removed from the free linked list, or, if there is enough memory remaining, the first part of the slot is returned as allocated memory while the second part is set up as a new free slot.
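The packing of the Element described above can be checked with a small standalone sketch. This reproduces just the data part of the struct (the elided members are omitted); the anonymous struct inside a union is a widely supported compiler extension that GCC, Clang and MSVC all accept:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the 4-byte Element: 15-bit relative offsets to the previous and
// next slots plus a 2-bit status, overlaid on a plain 32-bit index.
// 15 + 15 + 2 = 32 bits, so the whole thing fits in one uint32_t.
struct Element
{
    union
    {
        uint32_t index;
        struct
        {
            unsigned int previous : 15; // relative offset to previous slot
            unsigned int next : 15;     // relative offset to next slot
            unsigned int status : 2;    // 0 = available, 1 = allocated
        };
    };
};
```

The 15-bit fields can hold values up to 2^15-1 = 32767, which is where the 2^15-2 limit on slot size comes from, with larger requests falling back to std::new/delete.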
When creating new free slots the beginning of the slot is aligned to the alignment set up for that MemoryBlock - you can set this programmatically or just leave it to the defaults of 4 byte alignment for Nodes/Objects/Data and 16 byte for physics data (this fits with the PhysX constraints.) When deallocating memory the MemoryBlock class finds the Element before the pointer position being deleted to get the size of the slot and its next/previous slots. If either or both of the next/previous slots are already available for allocation, those slots are combined into a single available slot - so long as the new enlarged slot still fits below the 2^15-2 constraint. The free list is then adjusted to contain this new or changed slot so that it is available for reuse by subsequent allocations.

The IntrusiveAllocator implementation we have now is my second attempt at a rewrite of vsg::Allocator. I tried initially with a simpler non-intrusive approach that had lower memory overhead than the existing vsg::MemorySlots class, but its performance just didn't scale as well as required. The new approach is more complicated - which is why it took me a further 3 weeks to get it ready to present to the world. I am hoping this simplified explanation will help others get an idea of how it works, so that if they start reviewing the code they should understand what it is meant to do.

While I have hammered the new class using various examples and test datasets and fixed all the issues I have come across, it's still a complicated new implementation that affects everything the scene graph does. A screw up in allocating memory, releasing that memory for reuse, or incorrect reuse can all cause crashes or subtle issues that can be hard to track down. I have done what I can to try and avoid such issues, but I'm only human, mistakes can happen in design and implementation, and there could be corner cases that cause bugs or perform less well than desired that I haven't identified yet.
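The coalescing rule on deallocation can be sketched independently of the Element packing. The following is an illustrative helper of my own, not VSG code, showing how merging with free neighbours is capped by the 15-bit slot-size limit:

```cpp
#include <cassert>
#include <cstddef>

// Maximum slot size representable by the 15-bit relative offsets: 2^15 - 2.
constexpr size_t maxSlotSize = (1u << 15) - 2;

// Sketch of coalescing on deallocation: merge the freed slot with a free
// previous and/or next neighbour, but only while the merged slot still
// fits below the 15-bit limit described above.
size_t coalesce(size_t slotSize,
                size_t prevSize, bool prevFree,
                size_t nextSize, bool nextFree)
{
    size_t merged = slotSize;
    if (prevFree && merged + prevSize <= maxSlotSize) merged += prevSize;
    if (nextFree && merged + nextSize <= maxSlotSize) merged += nextSize;
    return merged;
}
```

Either merge is simply skipped when it would exceed the limit, leaving two separate free slots rather than one oversized one.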
To catch any remaining issues I really need the community's help to test out the IntrusiveAllocator branch against your own applications and data. All you should need to do is check out the IntrusiveAllocator branch, rebuild the VSG and sibling projects, then rebuild your application and start testing. While there are some minor API changes, for most applications I don't expect any modifications will be required. Let me know how you get on, either with a thumbs up if everything works fine or with details of any issues you've seen.
-
Here are the same tests run on the 3 systems I have here:
AMD 5700G + Kubuntu 22.04:
Intel Core i7 13700 + Windows 11:
Intel Core i7 1165G7 (laptop) + Kubuntu 24.04:
-
I haven't had any feedback on the IntrusiveAllocator branch yet, so I can only go on my own testing. If I don't get any negative feedback by tomorrow I'll assume we're all good and will go ahead and merge the IntrusiveAllocator branch with VSG master. So please test and let me know of successes or failures. Thanks.
-
Hi Robert,
I just tried the IntrusiveAllocator branch on my own application. There were no problems and it ran as expected.

The only metric I looked at was FPS for the same scene with the same view. After all the database paging finished I was getting 308 fps using the master branch and 312 fps using the IntrusiveAllocator branch. It's a big scene, but mainly a lot of vertices and textures - more like 10,000 nodes than 10,000,000. This might explain the moderate improvement.

I'm using Kubuntu 24.04 with an Nvidia 3070Ti video card.
Regards,
Roland
-
I am now doing final testing before I merge the IntrusiveAllocator branch with master. I've done a few more refinements over the past week that improve performance by around 4% over the figures I previously published. AMD 5700G + Kubuntu 22.04:
This means the new IntrusiveAllocator is 6x faster at allocation and 5.8x faster at deallocation than the previous vsg::Allocator. For applications that make hundreds of millions or billions of allocations/deallocations it'll be a noticeable improvement.
-
When I wrote the osg2vsg library back in the early days of the VSG project, I wrote an osggroups example that mirrored the vsggroups one so that I could test the scene graph creation, traversal and destruction performance of the OSG and VSG side by side. Running the comparison with the latest vsg::IntrusiveAllocator, when building a scene graph with 12 quad tree levels we get:
The VSG is now 8x faster at creating nodes, 13.5x faster at traversing, and 2x faster at destruction, whilst consuming 1/6th the amount of memory. These tests are pure CPU benchmarks; there isn't any OpenGL vs Vulkan influence, it's all down to the design and implementation differences between the scene graphs.
-
I have now completed my testing and, as I haven't seen any issues reported, have gone ahead and merged the work with VSG master: #1231
-
Over the past 4 weeks I have focused on development of a new vsg::Allocator that dramatically reduces the CPU overhead associated with allocating and deallocating scene graph and data objects in CPU memory.
The previous work on memory allocation implemented block allocation with object type affinity, which ensured all nodes would be allocated in adjacent memory, all data allocated in adjacent memory, general objects allocated in their own blocks and physics objects allocated in their own (with the 16 byte alignment required by PhysX.) This block allocation with affinity improves traversal performance by 2-3 times over objects allocated using std::new/delete, so is really helpful for scene graph performance. This is achieved because traversing nodes in adjacent memory improves cache coherency, reducing cache misses and improving the number of CPU instructions that can be run concurrently.
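The affinity categories and their default alignments described above can be sketched as follows. The enum and function names here are illustrative, not the actual VSG enums; only the categories and the 4/16 byte defaults come from the text:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch of per-affinity block pools: each category gets its
// own memory blocks so that objects of the same type land adjacent in
// memory, which is what gives the traversal cache-coherency win.
enum class Affinity { Objects, Nodes, Data, Physics };

// Default alignment per category: 4 bytes for Nodes/Objects/Data,
// 16 bytes for physics data as required by PhysX.
constexpr size_t defaultAlignment(Affinity a)
{
    return (a == Affinity::Physics) ? 16 : 4;
}
```

An allocator following this scheme would route each allocation request to the block pool for its category rather than a single shared heap.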
The problem with the original vsg::Allocator implementation is that the CPU overhead of allocating and deallocating objects was substantially higher than new/delete. For applications that do most of their allocations at start up and deallocations at exit this extra cost would be trivial, so the tradeoff of better runtime performance was well worth it. For applications with a modest number of objects being allocated/deallocated per frame the cost of allocation/deallocation still remained low enough to be negligible.
However, for some applications where hundreds of millions of objects are allocated and deallocated through the life of the application, the cost of allocation/deallocation has turned out to be nontrivial and has risen to the point of being a bottleneck in some specialist applications. My work over the past month tackles this allocation/deallocation bottleneck whilst retaining the benefits of block allocation with affinity that the original allocator provided.
The new allocator work can be found in two branches:
To make it possible to switch between the old and new allocators I made vsg::Allocator a pure virtual base class and moved the original implementation out into a vsg::BlockAllocator subclass of vsg::Allocator, and added a vsg::IntrusiveAllocator subclass of vsg::Allocator. The IntrusiveAllocator branch just has vsg::IntrusiveAllocator, while the AllocatorRefactor branch has both, and you can select which one you want to use at runtime by setting the VSG_ALLOCATOR env var to NEW or OLD. The default is the new vsg::IntrusiveAllocator. You can also set it programmatically by assigning the allocator of your choice via vsg::Allocator::instance(), as is done in the vsgallocator example.
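The VSG_ALLOCATOR switch could be implemented along these lines. This is a standalone sketch of the selection pattern, not the actual VSG code; the enum and function names are mine:

```cpp
#include <cstdlib>
#include <string>

// Illustrative sketch of choosing an allocator implementation from the
// VSG_ALLOCATOR environment variable: "OLD" selects the original block
// allocator, anything else (including unset or "NEW") selects the new
// intrusive allocator, matching the default described above.
enum class AllocatorType { Intrusive, Block };

AllocatorType selectAllocator()
{
    const char* env = std::getenv("VSG_ALLOCATOR");
    if (env && std::string(env) == "OLD") return AllocatorType::Block;
    return AllocatorType::Intrusive;
}
```

In VSG itself the chosen subclass would then be installed via vsg::Allocator::instance() at startup.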
For comparison purposes I have also implemented a StdAllocator in the vsgallocator and vsggroups examples so I can do performance comparisons using std::new/delete against the old BlockAllocator and the new IntrusiveAllocator. I modified the vsggroups example to use as my primary test bed by adding command line options to select between the three allocators and run different scene graph creation, traversal and deallocation tests. These modifications can be found in the AllocatorRefactor branch of vsgExamples.
To test the original allocator (vsg::BlockAllocator) on creating a quad scene graph with 12 levels:
Comparing to using std::new/delete allocator:
Comparing to the new vsg::IntrusiveAllocator:
Results are written to the console in the form:
Comparing the results on my AMD 5700G running on Kubuntu 22.04 I see:
This particular test just creates a hierarchy of vsg::Group or vsg::QuadGroup; it doesn't interleave any other node or object types into the scene graph, so the benefits of block allocation with affinity don't shine through - I illustrate that below. The big takeaway is that the old BlockAllocator is much slower than the StdAllocator at both allocation and deallocation - 5.76x slower on allocation, and 7.56x slower at deallocation. Ouch; for a scene graph with 223 million nodes it's a significant extra CPU cost.
The good news is that the new IntrusiveAllocator has similar allocation speed to the StdAllocator, and is only a bit slower on deallocation on my Linux/AMD 5700G system. The news gets even better under Windows 11 on my Intel Core i7 - when I tested on it I found that the IntrusiveAllocator is actually faster than the StdAllocator. It looks like the Linux memory allocator is far more optimized than the Microsoft one.
But there's even more good news when we look at the memory overhead of these three different allocators. If we compare the total memory used by vsggroups when it runs the above tests, we find that the std::map<size_t, size_t> containers used by the old vsg::BlockAllocator for indexing the allocated/available memory add significant memory usage for scene graphs with lots of small node allocations (the average node size in this test is 38 bytes) compared to std::new/delete, but the IntrusiveAllocator does even better, as we just pay an additional 4 bytes per allocation plus any alignment that might be needed. Results for the above vsggroups -l 12 runs are:
This memory usage is well below the 64GB I have on this machine, but for a scene graph on an embedded system, or for an even larger scene graph with billions of objects, the new allocator could avoid hitting virtual memory or perhaps running out of memory. I have to add, though, that these are very niche application types, so most users will probably never notice this memory footprint benefit, as typically the biggest memory usage will be large vertex arrays and textures rather than billions of individual objects, so the extra per object memory overhead will be less significant.
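A rough back-of-envelope calculation shows why the per-allocation bookkeeping matters at this scale. The ~32 bytes of red-black tree node overhead per std::map entry is my own assumption for a typical 64-bit implementation, not a measured VSG figure; the 4 bytes per allocation for the intrusive scheme comes from the description above:

```cpp
#include <cassert>
#include <cstddef>

// Assumed (not measured) cost of one std::map<size_t, size_t> entry on a
// typical 64-bit build: 2 * 8 bytes of payload plus ~32 bytes of
// red-black tree node overhead (pointers, colour, padding).
constexpr size_t mapEntryBytes = 2 * sizeof(size_t) + 32; // ~48 bytes

// The intrusive scheme pays one 4-byte Element per allocation.
constexpr size_t intrusiveBytes = 4;

size_t mapOverhead(size_t allocations) { return allocations * mapEntryBytes; }
size_t intrusiveOverhead(size_t allocations) { return allocations * intrusiveBytes; }
```

Under these assumptions, 100 million allocations cost roughly 4.8GB of map bookkeeping versus about 0.4GB of intrusive headers - and with an average node size of only 38 bytes, the map-based overhead alone exceeds the payload, which is consistent with the "more than doubling" of the footprint described earlier.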
Going back to the original motivation for doing block allocation with object type affinity, if we load a large city model in vsggroups using -i name and then test the time to run 100,000 traversals of the whole scene graph, we can see the old and new allocators pull well ahead of the StdAllocator (which uses std::new/delete):
Here the new allocator is 2% faster than the old one, but it's such a small margin that re-running the tests might show a swap of placing. However, the IntrusiveAllocator is 2.2x faster than the StdAllocator, which is well beyond the margin of error.
My plan is to merge the IntrusiveAllocator branch with VSG master in the first half of this week, then make a 1.1.6 dev release containing this work shortly after. What I'd like from the community is to check out the IntrusiveAllocator or AllocatorRefactor branches and test them out on your platforms and against your applications. Let me know how you get on, and if everything looks solid I'll go ahead and merge with VSG master.