Disable support for Linux THP on architectures other than amd64 #8702
THP seems to be causing problems on 32-bit and 64-bit ARM hosts. This change disables the test for THP on everything except 64-bit x86 hosts which are known to work.
CT Test Results: 3 files, 143 suites, 49m 21s. Results for commit dbba3df. This comment has been updated with latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally. (Erlang/OTP Github Action Bot)
case $OPSYS in #(
linux*) :
case $ARCH-$OPSYS in #(
amd64-linux*) :
Please add an option allowing people to opt-in. OTP-27.0.1 works fine out of the box on an M1 mini running Fedora 40, i.e. arm64-linux. The generated config.h says #define HAVE_LINUX_THP 1.
Is there a way to determine if the THP optimization actually kicks in or not?
Please add an option allowing people to opt-in. OTP-27.0.1 works fine out of the box on an M1 mini running Fedora 40, i.e. arm64-linux. The generated config.h says #define HAVE_LINUX_THP 1.
Thank you, that is good to know.
The linker flags that I am using align things to a 2MiB boundary, which makes sense for 64-bit x86, where THP uses a 2MiB page. However, the size of a transparent huge page is not guaranteed to be the same on all architectures, so that alignment is not guaranteed to be correct. I believe it is reasonable for 32-bit ARM, at least on the version of Debian for armhf that I installed on a QEMU system last night, but I am not sure if it is reasonable for the different variants of 64-bit ARM.
Since I don't have real hardware to test on at the moment, I am hesitant to enable other architectures. Leaving it in could be a no-op, or worse. I thought there was an argument to ./configure to opt out of checking for, and enabling, THP, but it does not show up when I run ./configure --help. (My mistake.) Giving users control of this does seem like a reasonable thing to do.
Can you please tell me what the value of /sys/kernel/mm/transparent_hugepage/hpage_pmd_size is on your M1? Apple uses a 16KiB page by default, so this might have a unique value on that system.
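For anyone who wants to check this on their own host, here is a minimal sketch of the comparison being asked about. It reads hpage_pmd_size and tests whether the 2MiB alignment mentioned above is a multiple of it. The function names and the ASSUMED_ALIGNMENT constant are made up for illustration and are not part of the OTP build.

```python
from pathlib import Path

# Path quoted in the comment above; it is absent on kernels built without THP.
HPAGE_PMD_SIZE = Path("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size")

# The 2MiB boundary the linker flags are said to use (illustrative constant).
ASSUMED_ALIGNMENT = 2 * 1024 * 1024


def read_pmd_size(path=HPAGE_PMD_SIZE):
    """Return the kernel's PMD-level THP size in bytes, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def alignment_matches(pmd_size, alignment=ASSUMED_ALIGNMENT):
    """True when `alignment` is a multiple of a known, positive THP size."""
    return pmd_size is not None and pmd_size > 0 and alignment % pmd_size == 0


if __name__ == "__main__":
    size = read_pmd_size()
    print("hpage_pmd_size:", size)
    print("2MiB alignment is a multiple of it:", alignment_matches(size))
```

On a host where the file holds 2097152 the check passes. With a larger value the 2MiB alignment is no longer a multiple of the huge page size, which is the situation this question is probing for.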
Is there a way to determine if the THP optimization actually kicks in or not?
To tell if the .text segment is being mapped with THP, look for the .text segment mapping in /proc/$pid/smaps for an Erlang node's process. If the entry for the .text segment has a non-zero value for FilePmdMapped, I believe everything should be working. Here is an example of what that looks like on my system:
00600000-00c00000 r-xp 00200000 00:1b 21865 /path/to/beam.frmptr.smp
Size: 6144 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 6144 kB
Pss: 6144 kB
Pss_Dirty: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 6144 kB
Private_Dirty: 0 kB
Referenced: 6144 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 6144 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 1
ProtectionKey: 0
VmFlags: rd ex mr mw me hg
Note a few additional things
- The mapping starts at 0x00600000, a multiple of 2MiB
- The value of Size is 6144 kB, also a multiple of 2MiB
- The value of THPeligible is 1
- The value of VmFlags includes hg
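The manual inspection described above can be scripted. What follows is a rough sketch, not anything shipped with OTP: it splits smaps text into entries and reports whether an executable mapping has a non-zero FilePmdMapped. The helper names parse_smaps and text_uses_thp are invented here, and the header regular expression assumes the usual "start-end perms offset dev inode path" layout.

```python
import re

# Header of an smaps entry: "start-end perms offset dev inode [path]".
_HEADER = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+[0-9a-f]+\s+\S+\s+\d+\s*(.*)$"
)


def parse_smaps(text):
    """Split smaps text into a list of dicts, one per mapping."""
    entries = []
    for line in text.splitlines():
        m = _HEADER.match(line)
        if m:
            entries.append({
                "start": int(m.group(1), 16),
                "end": int(m.group(2), 16),
                "perms": m.group(3),
                "path": m.group(4).strip(),
                "fields": {},
            })
        elif entries and ":" in line:
            # Field lines look like "FilePmdMapped:      6144 kB".
            key, _, value = line.partition(":")
            entries[-1]["fields"][key.strip()] = value.strip()
    return entries


def text_uses_thp(entry):
    """True when an executable mapping reports PMD-mapped file pages."""
    kb = int(entry["fields"].get("FilePmdMapped", "0 kB").split()[0])
    return "x" in entry["perms"] and kb > 0


if __name__ == "__main__":
    # Inspect the current process; substitute an Erlang node's pid as needed.
    with open("/proc/self/smaps") as f:
        for e in parse_smaps(f.read()):
            if "x" in e["perms"] and e["path"].startswith("/"):
                print(hex(e["start"]), e["path"], "THP:", text_uses_thp(e))
```

The THPeligible and VmFlags values end up in the same "fields" dictionary, so the other items in the list can be checked the same way.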
Can you please tell me what the value of /sys/kernel/mm/transparent_hugepage/hpage_pmd_size is on your M1?
mini_7_ls /sys/kernel/mm/transparent_hugepage/
/sys/kernel/mm/transparent_hugepage:
defrag hpage_pmd_size hugepages-128kB/ hugepages-2048kB/ hugepages-32768kB/ hugepages-512kB/ hugepages-8192kB/ shmem_enabled
enabled hugepages-1024kB/ hugepages-16384kB/ hugepages-256kB/ hugepages-4096kB/ hugepages-64kB/ khugepaged/ use_zero_page
mini_8_cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
mini_9_cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
33554432
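The output above can be summarised programmatically. The sketch below picks out the bracketed choice from the enabled file and the sizes encoded in the hugepages-NkB directory names. Both function names are hypothetical helpers written for this illustration.

```python
import re


def active_mode(enabled_text):
    """Return the bracketed choice, e.g. 'madvise' from 'always [madvise] never'."""
    m = re.search(r"\[(\w+)\]", enabled_text)
    return m.group(1) if m else None


def mthp_sizes_kb(names):
    """Extract sizes in kB from names such as 'hugepages-2048kB/'."""
    sizes = []
    for name in names:
        m = re.fullmatch(r"hugepages-(\d+)kB/?", name)
        if m:
            sizes.append(int(m.group(1)))
    return sorted(sizes)


if __name__ == "__main__":
    print(active_mode("always [madvise] never"))
    print(mthp_sizes_kb(["defrag", "hugepages-64kB/", "hugepages-32768kB/"]))
    # hpage_pmd_size from the output above, converted to kB for comparison.
    print(33554432 // 1024)
```

The reported hpage_pmd_size of 33554432 bytes is 32768 kB, matching the largest hugepages-NkB directory in the listing. As far as I understand the kernel documentation, the madvise setting limits THP to regions that have been marked with madvise(MADV_HUGEPAGE), so the mode is worth checking alongside the size.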
To tell if the .text segment is being mapped with THP look for the .text segment mapping in /proc/$pid/smaps for an Erlang node's process. If the entry for the .text segment has a non-zero value for FilePmdMapped I believe everything should be working.
For the one executable mapping it shows:
00600000-0096c000 r-xp 00200000 00:22 309231 /path/to/lib/erlang/erts-15.0.1/bin/beam.smp
Size: 3504 kB
KernelPageSize: 16 kB
MMUPageSize: 16 kB
Rss: 3184 kB
Pss: 3184 kB
Pss_Dirty: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 3184 kB
Private_Dirty: 0 kB
Referenced: 3184 kB
Anonymous: 0 kB
KSM: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 0
VmFlags: rd ex mr mw me
which I take it means the optimization didn't kick in.
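One piece of arithmetic may help explain the result. A PMD-sized huge page has to cover a range that is both aligned to, and as long as, the huge page size. The sketch below counts how many such ranges fit inside a mapping, using the addresses from the two smaps excerpts in this thread and the reported hpage_pmd_size of 33554432. The function name is invented for this example, and the check is only a necessary condition, not a statement about what the kernel actually decided.

```python
def aligned_huge_pages(start, end, huge_size):
    """Count huge_size-aligned chunks of length huge_size inside [start, end)."""
    first = -(-start // huge_size) * huge_size  # round start up to a boundary
    last = (end // huge_size) * huge_size       # round end down to a boundary
    return max(0, (last - first) // huge_size)


if __name__ == "__main__":
    MIB = 1024 * 1024
    # x86-64 example from earlier in the thread, 2MiB huge pages.
    print(aligned_huge_pages(0x00600000, 0x00C00000, 2 * MIB))
    # M1 mapping above, 32MiB huge pages (hpage_pmd_size = 33554432).
    print(aligned_huge_pages(0x00600000, 0x0096C000, 32 * MIB))
```

The first call finds three 2MiB chunks, which agrees with the 6144 kB FilePmdMapped value in the x86 example. The second finds none: the 3504 kB mapping is smaller than one 32MiB huge page and starts at an address that is not 32MiB aligned.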
mini_9_cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
33554432
Thank you.
AFAIK, Aarch64 has three "translation granules". The 4KiB granule is like x86 and offers 4KiB, 2MiB, and 1GiB pages. Your machine seems to use a 16KiB granule that offers 16KiB and 32MiB pages.
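The relationship between granule and huge page size can be worked out directly. This sketch assumes 8-byte page-table descriptors and that one table occupies exactly one granule, so a single entry one level above the leaf covers granule × (granule / 8) bytes. The function name is made up for the example.

```python
DESCRIPTOR_BYTES = 8  # assumed size of one page-table entry


def pmd_block_size(granule):
    """Bytes covered by one entry in the table directly above the leaf level."""
    entries_per_table = granule // DESCRIPTOR_BYTES
    return granule * entries_per_table


if __name__ == "__main__":
    for granule_kib in (4, 16, 64):
        size = pmd_block_size(granule_kib * 1024)
        print(f"{granule_kib}KiB granule: {size // (1024 * 1024)}MiB block")
```

For the 16KiB granule this gives 16384 × 2048 = 33554432 bytes, the same value the M1 reports in hpage_pmd_size, and for the 4KiB granule it gives the familiar 2MiB.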
The settings I chose had the 4KiB, 2MiB, and 1GiB page sizes in mind. Other granules are likely to benefit less from this optimization, since their page sizes are larger and give the TLB more coverage. That should mean fewer iTLB misses without the need to mess around with Linux "hugepages", a good thing.
That said, for things like the heap, a 32MiB page should be beneficial. That is a separate optimization I added and it is controlled by the +MMlp on|off flag. A 32MiB page should also be beneficial for the JIT cache, but my patches to enable that were never accepted by asmjit.
If you are seeing a lot of iTLB misses from the .text segment, measurable with perf(1), there is a feature in newer kernels called multi-size THP which can simulate a large page size using multiple PTEs. That might still be a win for smaller regions of memory like the .text segment of Erlang. See
https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html
Alas, I am not sure multi-size THP is relevant for the mapping of the .text segment using the current strategy. Some additional research is needed. However, even if it isn't, there is a trick that can be done at startup where
- the text segment is saved away
- the address space of the .text segment is unmapped and remapped using whatever options you want
- the text segment is copied back into place
Not so pretty but it can be worth a lot of performance and open-source code is already available to do this.
which I take it means the optimization didn't kick in.
Doesn't look like it. Try using large pages with the heap?
Hello! Did you want to do something more in this PR, or should we merge it as is?
I think the PR in its current form is good. However, I was unable to reproduce the crashes that were reported by the users of 32-bit and 64-bit ARM systems running Debian Linux with its defaults. Everything worked as expected on the hardware I had access to. I think we can do a better job with 64-bit ARM by adding logic to determine what page size(s) the kernel is using (I believe this can change between reboots) and using large pages where it makes sense. That is not an especially big change, but it is beyond the scope of the issues reported.
I'll include it in 27.1 then and we can figure out the other details later on.
Thanks!