Disable support for Linux THP on architectures other than amd64 #8702

Merged


lexprfuncall
Contributor

Transparent huge pages (THP) seem to be causing problems on 32-bit and 64-bit ARM hosts. This change disables the test for THP on everything except 64-bit x86 hosts, which are known to work.

Contributor

github-actions bot commented Aug 8, 2024

CT Test Results

    3 files    143 suites   49m 21s ⏱️
1 591 tests 1 542 ✅ 49 💤 0 ❌
2 330 runs  2 256 ✅ 74 💤 0 ❌

Results for commit dbba3df.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@lexprfuncall
Contributor Author

I believe this should resolve #8652 and #8696.

case $OPSYS in #(
linux*) :
case $ARCH-$OPSYS in #(
amd64-linux*) :
Contributor

Please add an option allowing people to opt-in. OTP-27.0.1 works fine out of the box on an M1 mini running Fedora 40, i.e. arm64-linux. The generated config.h says #define HAVE_LINUX_THP 1.

Is there a way to determine if the THP optimization actually kicks in or not?

Contributor Author

Please add an option allowing people to opt-in. OTP-27.0.1 works fine out of the box on an M1 mini running Fedora 40, i.e. arm64-linux. The generated config.h says #define HAVE_LINUX_THP 1.

Thank you, that is good to know.

The linker flags that I am using align everything to a 2MiB boundary. That makes sense for 64-bit x86, which uses a 2MiB page for THP. However, the size of a transparent huge page is not guaranteed to be the same on all architectures, so that alignment is not guaranteed to be correct. I believe it is reasonable for 32-bit ARM, at least on the version of Debian for armhf that I installed on a QEMU system last night, but I am not sure if it is reasonable for the different variants of 64-bit ARM.

Since I don't have real hardware to test on at the moment, I am hesitant to enable other architectures. Leaving it in could be a no-op, or worse. I thought there was an argument to ./configure to opt out of checking for, and enabling, THP, but it does not show up when I run ./configure --help. (My mistake.) Giving users control of this does seem like a reasonable thing to do.

Can you please tell me what the value of /sys/kernel/mm/transparent_hugepage/hpage_pmd_size is on your M1? Apple uses a 16KiB page by default, so this might have a unique value on that system.
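
A minimal sketch of the comparison behind that question, written in Python for brevity: it reads `hpage_pmd_size` and checks whether the 2MiB link-time alignment is a multiple of it. The helper names and the `LINK_ALIGNMENT` constant are hypothetical and are not part of this change.

```python
from pathlib import Path

# sysfs file holding the PMD-sized THP page size, in bytes.
HPAGE_PMD_SIZE = Path("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size")

# Assumption: the alignment requested by the linker flags (2MiB, as on x86-64).
LINK_ALIGNMENT = 2 * 1024 * 1024


def read_hpage_pmd_size(path=HPAGE_PMD_SIZE):
    """Return the THP size in bytes, or None if the file is unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def alignment_matches(pmd_size, alignment=LINK_ALIGNMENT):
    """True when the link-time alignment is a multiple of the THP size."""
    return pmd_size is not None and pmd_size > 0 and alignment % pmd_size == 0


if __name__ == "__main__":
    size = read_hpage_pmd_size()
    print("hpage_pmd_size:", size, "alignment ok:", alignment_matches(size))
```

With a 2MiB huge page the check passes; with any larger huge page size a 2MiB-aligned segment is not guaranteed to start on a huge page boundary.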

Is there a way to determine if the THP optimization actually kicks in or not?

To tell if the .text segment is being mapped with THP, look for the .text segment mapping in /proc/$pid/smaps for an Erlang node's process. If the entry for the .text segment has a non-zero value for FilePmdMapped, I believe everything should be working. Here is an example of what that looks like on my system:

00600000-00c00000 r-xp 00200000 00:1b 21865                              /path/to/beam.frmptr.smp
Size:               6144 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                6144 kB
Pss:                6144 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:      6144 kB
Private_Dirty:         0 kB
Referenced:         6144 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:      6144 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
ProtectionKey:         0
VmFlags: rd ex mr mw me hg 

Note a few additional things:

  1. The mapping starts at 0x00600000, a multiple of 2MiB
  2. The value of Size is 6144 kB, also a multiple of 2MiB
  3. The value of THPeligible is 1
  4. The value of VmFlags includes hg
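
The checks listed above can be automated roughly as follows. This is a sketch only: the function names are made up for illustration, the 2MiB default assumes the x86-64 huge page size, and `SAMPLE` is a shortened copy of the entry shown above.

```python
import re

# Header line of an smaps entry: "start-end perms offset dev inode path".
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\S+\s*(.*)$")
TWO_MIB = 2 * 1024 * 1024  # assumption: x86-64 PMD huge page size


def parse_smaps_entry(text):
    """Parse one smaps entry into a dict of its header and fields."""
    lines = text.strip().splitlines()
    m = HEADER.match(lines[0].strip())
    if m is None:
        raise ValueError("not an smaps header: %r" % lines[0])
    entry = {
        "start": int(m.group(1), 16),
        "end": int(m.group(2), 16),
        "perms": m.group(3),
        "path": m.group(4),
    }
    for line in lines[1:]:
        key, _, value = line.partition(":")
        entry[key.strip()] = value.strip()
    return entry


def kb(entry, key):
    """Return a 'N kB' field as an integer number of kB (0 if absent)."""
    return int(entry.get(key, "0 kB").split()[0])


def thp_report(entry, huge=TWO_MIB):
    """Evaluate the checks listed above for one executable mapping."""
    return {
        "start_aligned": entry["start"] % huge == 0,
        "size_aligned": (entry["end"] - entry["start"]) % huge == 0,
        "thp_eligible": entry.get("THPeligible") == "1",
        "has_hg_flag": "hg" in entry.get("VmFlags", "").split(),
        "file_pmd_mapped_kb": kb(entry, "FilePmdMapped"),
    }


SAMPLE = """\
00600000-00c00000 r-xp 00200000 00:1b 21865      /path/to/beam.frmptr.smp
Size:               6144 kB
FilePmdMapped:      6144 kB
THPeligible:    1
VmFlags: rd ex mr mw me hg
"""

if __name__ == "__main__":
    print(thp_report(parse_smaps_entry(SAMPLE)))
```

For a live node, the same functions could be fed the relevant entry from /proc/$pid/smaps; splitting the file into entries is left out here.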

Contributor

Can you please tell me what the value of /sys/kernel/mm/transparent_hugepage/hpage_pmd_size is on your M1?

mini_7_ls /sys/kernel/mm/transparent_hugepage/
/sys/kernel/mm/transparent_hugepage:
defrag             hpage_pmd_size     hugepages-128kB/   hugepages-2048kB/  hugepages-32768kB/ hugepages-512kB/   hugepages-8192kB/  shmem_enabled 
enabled            hugepages-1024kB/  hugepages-16384kB/ hugepages-256kB/   hugepages-4096kB/  hugepages-64kB/    khugepaged/        use_zero_page 
mini_8_cat /sys/kernel/mm/transparent_hugepage/enabled 
always [madvise] never
mini_9_cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size 
33554432
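
For reference, a small sketch of how those two values decode: the bracketed word in `enabled` is the mode in effect, and `hpage_pmd_size` is a byte count. The helper names are illustrative only.

```python
def active_thp_mode(enabled_text):
    """Return the bracketed (currently selected) mode from the 'enabled' file."""
    for word in enabled_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def bytes_to_mib(n):
    """Convert a byte count to whole MiB."""
    return n // (1024 * 1024)


if __name__ == "__main__":
    print(active_thp_mode("always [madvise] never"))  # madvise
    print(bytes_to_mib(33554432))                      # 32
```

So THP is in `madvise` mode here and a PMD-sized huge page is 32MiB.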

To tell if the .text segment is being mapped with THP look for the .text segment mapping in /proc/$pid/smaps for an Erlang node's process. If the entry for the .text segment has a non-zero value for FilePmdMapped I believe everything should be working. Here is an example of what that looks like on my system

For the one executable mapping it shows:

00600000-0096c000 r-xp 00200000 00:22 309231                             /path/to/lib/erlang/erts-15.0.1/bin/beam.smp
Size:               3504 kB
KernelPageSize:       16 kB
MMUPageSize:          16 kB
Rss:                3184 kB
Pss:                3184 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:      3184 kB
Private_Dirty:         0 kB
Referenced:         3184 kB
Anonymous:             0 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
VmFlags: rd ex mr mw me

which I take it means the optimization didn't kick in.
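
One plausible contributing factor, given the 32MiB `hpage_pmd_size` reported above, is that this mapping cannot contain even one aligned PMD-sized huge page. A rough arithmetic sketch (the helper is hypothetical, and the kernel's actual eligibility rules involve more than this):

```python
MIB = 1024 * 1024


def whole_huge_pages(start, end, huge_size):
    """Count huge-page-aligned, huge-page-sized chunks inside [start, end)."""
    first = -(-start // huge_size) * huge_size  # round start up
    last = (end // huge_size) * huge_size       # round end down
    return max(0, (last - first) // huge_size)


if __name__ == "__main__":
    # x86-64 mapping from earlier in the thread, 2MiB huge pages.
    print(whole_huge_pages(0x00600000, 0x00C00000, 2 * MIB))   # 3
    # The mapping above, with a 32MiB huge page size.
    print(whole_huge_pages(0x00600000, 0x0096C000, 32 * MIB))  # 0
```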

Contributor Author

mini_9_cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
33554432

Thank you.

AFAIK, AArch64 has three "translation granules". The 4KiB granule is like x86 and offers 4KiB, 2MiB, and 1GiB pages. Your machine seems to use a 16KiB granule that offers 16KiB and 32MiB pages.
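
The 2MiB and 32MiB figures follow from the translation table geometry: a table occupies one granule and holds 8-byte entries, so one block entry a level above the base page covers granule × (granule / 8) bytes. A short sketch of that arithmetic (the function name is illustrative, and the 64KiB row is derived from the same rule rather than from anything measured in this thread):

```python
DESCRIPTOR_BYTES = 8  # size of one AArch64 translation table entry


def pmd_block_size(granule):
    """Bytes covered by one block entry at the level above the base page.

    A table occupies one granule and holds granule / 8 entries, each of
    which maps one granule-sized page at the last level.
    """
    return granule * (granule // DESCRIPTOR_BYTES)


if __name__ == "__main__":
    for granule_kib in (4, 16, 64):
        size = pmd_block_size(granule_kib * 1024)
        print(f"{granule_kib}KiB granule -> {size // (1024 * 1024)}MiB block")
```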

The settings I chose had the 4KiB, 2MiB, and 1GiB page sizes in mind. Other granules are likely to benefit less from this optimization since their page sizes are larger, giving the TLB more coverage. That should mean fewer iTLB misses without the need to mess around with Linux "hugepages", a good thing.

That said, for things like the heap, a 32MiB page should be beneficial. That is a separate optimization I added, and it is controlled by the +MMlp on|off flag. A 32MiB page should also be beneficial for the JIT cache, but my patches to enable that were never accepted by asmjit.

If you are seeing a lot of iTLB misses from the .text segment, measurable with perf(1), there is a feature in newer kernels called multi-size THP which can simulate a large page size using multiple PTEs. That might still be a win for smaller regions of memory like the .text segment of Erlang. See

https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html
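
To see which multi-size THP sizes a kernel exposes and how each is configured, the per-size `hugepages-<size>kB/enabled` files described in that document can be listed. A minimal sketch, with illustrative function names:

```python
import os
import re

THP_DIR = "/sys/kernel/mm/transparent_hugepage"
_NAME = re.compile(r"^hugepages-(\d+)kB$")


def mthp_sizes_kb(names):
    """Extract the sizes (in kB) from 'hugepages-<size>kB' directory names."""
    sizes = []
    for name in names:
        m = _NAME.match(name.rstrip("/"))
        if m:
            sizes.append(int(m.group(1)))
    return sorted(sizes)


def mthp_modes(base=THP_DIR):
    """Map each mTHP size to the raw contents of its 'enabled' file."""
    modes = {}
    try:
        names = os.listdir(base)
    except OSError:
        return modes  # THP sysfs directory not present
    for size in mthp_sizes_kb(names):
        path = os.path.join(base, f"hugepages-{size}kB", "enabled")
        try:
            with open(path) as f:
                modes[size] = f.read().strip()
        except OSError:
            pass  # unreadable entry, skip it
    return modes


if __name__ == "__main__":
    for size, mode in mthp_modes().items():
        print(f"{size} kB: {mode}")
```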

Alas, I am not sure multi-size THP is relevant for the mapping of the .text segment using the current strategy. Some additional research is needed. However, even if it isn't, there is a trick that can be done at startup where

  1. the text segment is saved away
  2. the address space of the .text segment is unmapped and remapped using whatever options you want
  3. the text segment is copied back into place

Not so pretty, but it can be worth a lot of performance, and open-source code is already available to do this.

which I take it means the optimization didn't kick in.

Doesn't look like it. Try using large pages with the heap?

@rickard-green rickard-green added the team:VM Assigned to OTP team VM label Aug 12, 2024
@jhogberg jhogberg self-assigned this Aug 19, 2024
@jhogberg jhogberg added the testing currently being tested, tag is used by OTP internal CI label Aug 19, 2024
@bjorng bjorng removed the testing currently being tested, tag is used by OTP internal CI label Aug 19, 2024
@garazdawi
Contributor

Hello! Did you want to do something more in this PR, or should we merge it as is?

@lexprfuncall
Contributor Author

Hello! Did you want to do something more in this PR, or should we merge it as is?

I think the PR in its current form is good. However, I was unable to reproduce the crashes that were reported by the users of 32-bit and 64-bit ARM systems running Debian Linux using its defaults. Everything worked as expected on the hardware I had access to.

I think we can do a better job with 64-bit ARM by adding logic to determine what page size(s) the kernel is using (I believe this can change between reboots) and using large pages where it makes sense. That is not an especially big change, but it is beyond the scope of the issues reported.
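
As a rough illustration of what such logic might look like (this is not code from the PR): query the base page size at run time and only treat large pages as worthwhile when the text segment can fill at least one huge page. The policy and the function names are assumptions made for the example.

```python
import os

MIB = 1024 * 1024


def base_page_size():
    """Base page size used by the running kernel, in bytes."""
    return os.sysconf("SC_PAGE_SIZE")


def large_pages_worthwhile(pmd_size, text_size):
    """Hypothetical policy: use large pages only when the text segment is
    at least as big as one PMD-sized huge page."""
    return pmd_size is not None and 0 < pmd_size <= text_size


if __name__ == "__main__":
    print("base page size:", base_page_size())
    # 2MiB huge pages with a 6MiB text segment.
    print(large_pages_worthwhile(2 * MIB, 6 * MIB))
    # 32MiB huge pages with a 3504 kB text segment.
    print(large_pages_worthwhile(32 * MIB, 3504 * 1024))
```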

@garazdawi
Contributor

I'll include it in 27.1 then and we can figure out the other details later on.

@garazdawi garazdawi added the testing currently being tested, tag is used by OTP internal CI label Sep 4, 2024
@garazdawi garazdawi added this to the OTP-27.1 milestone Sep 4, 2024
@garazdawi garazdawi self-assigned this Sep 4, 2024
@garazdawi garazdawi removed the testing currently being tested, tag is used by OTP internal CI label Sep 4, 2024
@garazdawi garazdawi merged commit ee24604 into erlang:maint Sep 5, 2024
18 checks passed
@garazdawi
Contributor

Thanks!

6 participants