
Default app.manifest should include <heapType>SegmentHeap</heapType> #43611

Open
benaadams opened this issue Sep 21, 2024 · 5 comments · May be fixed by dotnet/roslyn#75199
Labels
Area-NetSDK untriaged Request triage from a team member

Comments

@benaadams
Member

Describe the bug

While it doesn't affect managed code, if you are using a native allocator from a third-party library (e.g. RocksDB), the allocator will degrade badly over time and become a performance bottleneck.

Windows 10, version 2004 (build 19041) and later support a new allocator that doesn't suffer from this issue: https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests#heaptype

However, it needs to be specified in the app.manifest, which .NET doesn't do by default.

This can be overridden manually (e.g. NethermindEth/nethermind#7418), but this fast path should be enabled by default.
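For reference, a minimal app.manifest that opts into the segment heap looks roughly like this (the `heapType` element and the SMI/2020 `WindowsSettings` namespace are from the application-manifests docs linked above; the surrounding assembly skeleton is a sketch):

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <application xmlns="urn:schemas-microsoft-com:asm.v3">
    <windowsSettings xmlns="http://schemas.microsoft.com/SMI/2020/WindowsSettings">
      <!-- Opt this executable into the segment heap (Windows 10 2004+) -->
      <heapType>SegmentHeap</heapType>
    </windowsSettings>
  </application>
</assembly>
```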

To Reproduce

Build a .NET exe on Windows; this field is not included in the generated manifest.
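One way to confirm this (a sketch, not part of the SDK): extract the embedded manifest from the built exe (e.g. with `mt.exe -inputresource:app.exe;#1 -out:app.manifest`) and check for the element with a few lines of Python. The `manifest_has_segment_heap` helper below is hypothetical, written for this illustration:

```python
import xml.etree.ElementTree as ET

# Namespace used by the heapType setting per the application-manifests docs
NS = "http://schemas.microsoft.com/SMI/2020/WindowsSettings"

def manifest_has_segment_heap(manifest_xml: str) -> bool:
    """Return True if the manifest opts into the segment heap."""
    root = ET.fromstring(manifest_xml)
    # heapType lives under application/windowsSettings in the 2020 namespace
    for elem in root.iter(f"{{{NS}}}heapType"):
        if elem.text and elem.text.strip() == "SegmentHeap":
            return True
    return False

if __name__ == "__main__":
    sample = """<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
      <application xmlns="urn:schemas-microsoft-com:asm.v3">
        <windowsSettings xmlns="http://schemas.microsoft.com/SMI/2020/WindowsSettings">
          <heapType>SegmentHeap</heapType>
        </windowsSettings>
      </application>
    </assembly>"""
    print(manifest_has_segment_heap(sample))  # prints: True
```

Running this against a manifest produced by a default .NET build would print False, since the SDK does not emit the element today.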

@beyondflesh
Copy link

beyondflesh commented Sep 22, 2024

https://bugs.chromium.org/p/chromium/issues/detail?id=1102281

But which led to performance regression for WebXPRT3, Speedometer2, and JetStream2. Following are some regression data on Intel CFL i9-9900K

i9-9900K:
Speedometer2.0: ~ -5%
WebXPRT3: ~-5.8%
JetStream2: ~-6.2%

Vector35/binaryninja-api#2778
Opting into the Segment Heap on Windows improves analysis speed by ~20%

NTOS:
Before: Analysis update took 1398.097 seconds
After: Analysis update took 1121.882 seconds

JScript:
Before: Analysis update took 139.466 seconds
After: Analysis update took 113.748 seconds

These reflect a 20-25% improvement on analysis times.

more info on segment heap(internals) : https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals-wp.pdf

@benaadams
Member Author

https://bugs.chromium.org/p/chromium/issues/detail?id=1102281

But which led to performance regression for WebXPRT3, Speedometer2, and JetStream2. Following are some regression data on Intel CFL i9-9900K

Any more recent update in last 4 years?

The behaviour we see when using a RocksDB database as part of the app: performance is good for about 4 hours, then continuously nosedives with the default heap.

With the segment heap the performance continues to remain steady beyond a week of runtime.

An alternative might be to enable it when ServerGC is selected, since that indicates a long-running process (assuming the default heap hasn't improved in the intervening 4 years).
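For context, ServerGC is itself an explicit opt-in in the project file, so the SDK could plausibly key segment-heap emission off the same property (illustrative sketch only, not an existing SDK behaviour; `ServerGarbageCollection` is the real MSBuild property):

```xml
<PropertyGroup>
  <!-- Existing opt-in for long-running server workloads -->
  <ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>
```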

@beyondflesh

I'm not sure; best to ask Bruce Dawson about it. From some testing I did personally, the performance improvement or regression depends a lot on allocation frequency and size, which is why I posted my first comment: some apps benefit, while others either experience stutters or an overall perf drop of up to -20%.

Also, I'm not sure why Microsoft doesn't have any documentation on the segment heap or on how to best optimize apps for it / tune the heap; the best docs I can find about it come from reverse engineering and PoCs for exploits.

The excessive memory waste is believed to be significantly due to extra affinity slots being created with per-processor slists containing free memory. That is, in order to allow maximum parallelism the heap is creating new lists of free memory for use by individual processors. This may be worsened by my machine having lots of CPUs, which gives the heap an opportunity to create a lot of heap affinity slots.
https://issues.chromium.org/issues/40103406

https://blogs.windows.com/msedgedev/2020/06/17/improving-memory-usage/

https://bugs.chromium.org/p/chromium/issues/detail?id=1102281

Using the segment heap when running the speedometer benchmark leads to an extra 12.8 GB of VirtualAlloc/VirtualFree traffic which leads to an extra 5.1 to 6.8 seconds of CPU overhead. This is roughly enough to account for the observed regression.

Later improvements to the Segment Heap resulted in significant memory footprint reductions, at the cost of some CPU overhead required to more effectively manage the heap.
microsoft/Windows-Dev-Performance#39

@Sergio0694
Contributor

"it needs to be specified in the app.manifest which by default .NET isn't doing."

Just adding further context on this in case anyone reading this is now wondering if they also should go update their apps with this: if you have a packaged (MSIX) app, this is already the default. No action is needed in that case. The UI framework you're using doesn't matter (UWP, WinUI 3, WPF, WinForms, no UI framework at all, whatever), as long as you're deploying as a packaged app. This app manifest fix is only needed for unpackaged apps. For docs, see heap:HeapPolicy remarks 🙂

@kasperk81
Contributor

Later improvements to the Segment Heap resulted in significant memory footprint reductions, at the cost of some CPU overhead required to more effectively manage the heap.
microsoft/Windows-Dev-Performance#39

microsoft/Windows-Dev-Performance#106

Expected Behavior

The NT heap scales at an acceptable level, or the Segment Heap performs at least as well as the "classic" NT heap in every case.
Actual Behavior

The NT heap scales horrendously in some cases. The segment heap scales well but has worse performance in many cases.

Maybe add it as an opt-in from the SDK side to avoid surprises? E.g. setting <UseWindowsSegmentHeap>true</UseWindowsSegmentHeap> in the project file.
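A sketch of how that opt-in might look in a project file (the `<UseWindowsSegmentHeap>` property is the one proposed in this comment, not an existing SDK switch):

```xml
<PropertyGroup>
  <OutputType>Exe</OutputType>
  <TargetFramework>net9.0</TargetFramework>
  <!-- Hypothetical property: would cause the SDK to emit
       <heapType>SegmentHeap</heapType> into the default Win32 manifest -->
  <UseWindowsSegmentHeap>true</UseWindowsSegmentHeap>
</PropertyGroup>
```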
