Benchmark Raspberry Pi 5 Linux kernel NUMA patch #36

Closed
geerlingguy opened this issue Jul 6, 2024 · 6 comments

geerlingguy commented Jul 6, 2024

I would like to see whether the NUMA patch at https://lore.kernel.org/lkml/[email protected]/ has any bearing on HPL performance and/or efficiency scores, especially if the effect is reproducible and significant.

The stated numbers for Geekbench 6 are 5-ish and 20-ish percent improvements for single/multicore. I would like to see if there's any impact for HPL (which is inherently multicore, and very RAM-speed-dependent). Also measure the power usage to see if this affects power draw positively, negatively, or not at all.

NOTE: I'm testing with an 8GB Raspberry Pi 5. Default clocks, Raspberry Pi 5 Active Cooler, ambient temperature 80°F/26.7°C.

geerlingguy commented Jul 6, 2024

Baseline

pi@pi5:~/linux $ uname -a
Linux pi5 6.6.31+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux

Geekbench 6

| Type | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Single | 802 | 799 | 801 | 801 |
| Multi | 1721 | 1717 | 1731 | 1723 |
| Link | result | result | result | - |

HPL / Top 500

| Stat | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Power (avg) | 11.5W | 11.2W | 11.1W | 11.3W |
| Result | 28.839 Gflops | 28.395 Gflops | 28.545 Gflops | 28.593 Gflops |
| Efficiency | 2.51 Gflops/W | 2.53 Gflops/W | 2.57 Gflops/W | 2.54 Gflops/W |
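
The efficiency row is the HPL result divided by the average power draw for the same run. A minimal sketch of that arithmetic follows; the `efficiency` helper is a hypothetical name, not part of the benchmark tooling. Power is only reported to 0.1 W, so a recomputed value can differ from the table in the last digit, and the Average column appears to be the mean of the three per-run efficiencies.

```shell
# Efficiency = HPL result (Gflops) / average power (W), rounded to 2 places.
# "efficiency" is a hypothetical helper name used only for this sketch.
efficiency() {
  awk -v gflops="$1" -v watts="$2" 'BEGIN { printf "%.2f\n", gflops / watts }'
}

efficiency 28.839 11.5   # Run 1 of the baseline table
efficiency 28.545 11.1   # Run 3 of the baseline table
```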
Representative result:
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             292.97             2.8839e+01
HPL_pdgesv() start time Fri Jul  5 22:52:13 2024

HPL_pdgesv() end time   Fri Jul  5 22:57:06 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
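
The Gflops figure in the result line can be cross-checked from N and Time. The sketch below assumes the standard HPL operation count of (2/3)·N³ + (3/2)·N² floating-point operations; `hpl_gflops` is a hypothetical helper name. Because Time is printed to two decimals, the recomputed value can differ slightly in the last digit for other runs.

```shell
# Recompute HPL Gflops from the matrix order N and the wall time in seconds.
# Assumes the usual HPL operation count: (2/3)*N^3 + (3/2)*N^2.
hpl_gflops() {
  awk -v n="$1" -v secs="$2" 'BEGIN {
    flops = (2.0 / 3.0) * n * n * n + (3.0 / 2.0) * n * n
    printf "%.3f\n", flops / secs / 1e9
  }'
}

hpl_gflops 23314 292.97   # N and Time from the result line above
```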

Power consumption graph showing the tail end of an HPL run and two of the Geekbench 6 runs. System uptime is over 24 hours.

[Image: Screenshot 2024-07-06 at 10 25 45 PM]

geerlingguy commented Jul 7, 2024

After applying NUMA patch

# After a quick rebuild of the current Pi 64-bit kernel source:

pi@pi5:~ $ uname -a
Linux pi5 6.6.36-v8-16k+ #1 SMP PREEMPT Sat Jul  6 23:19:18 CDT 2024 aarch64 GNU/Linux

  1. Download mbox.gz from this patch thread.
  2. On the Pi, in the linux checkout created by following Raspberry Pi's "Build the Linux kernel" guide, run: git am PATCH-2-2-arm64-numa-Add-NUMA-emulation-for-ARM64.mbox (skip empty messages)
  3. Configure NUMA emulation with make menuconfig (requires libncurses-dev, installed via apt):
    1. Enable "Kernel Features" > "NUMA Memory Allocation and Scheduler Support" (and enable "NUMA emulation" when it appears).
    2. Save the config and exit.
  4. Rebuild the kernel and reboot.

# After rebuilding the kernel with the NUMA patch:

pi@pi5:~ $ uname -a
Linux pi5 6.6.36-v8-16k+ #2 SMP PREEMPT Sun Jul  7 00:43:44 CDT 2024 aarch64 GNU/Linux

IMPORTANT NOTE: The following results were taken with the NUMA Emulation patch applied, but without adding numa=fake=4 to cmdline.txt. See follow-up comment below with results after setting that parameter.


Geekbench 6

| Type | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Single | 795 | 801 | 802 | 799 |
| Multi | 1636 | 1626 | 1638 | 1633 |
| Link | result | result | result | - |

Single core: 0.25% slower
Multicore: 5.36% slower

HPL / Top 500

| Stat | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Power (avg) | 11.4W | 11.0W | 11.1W | 11.2W |
| Result | 31.348 Gflops | 30.621 Gflops | 30.958 Gflops | 30.976 Gflops |
| Efficiency | 2.75 Gflops/W | 2.78 Gflops/W | 2.78 Gflops/W | 2.77 Gflops/W |

Result: 8.00% faster
Efficiency: 8.66% more efficient
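
A note on how the percentages can be reproduced: the figures quoted in this comment appear to match a symmetric percent difference (the gap divided by the mean of the two averages) rather than a change relative to the baseline. That is an inference from the arithmetic, not something stated in the thread. The sketch below computes both; `pct_diff` and `pct_change` are hypothetical helper names.

```shell
# Symmetric percent difference: |a - b| divided by the mean of a and b.
pct_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    d = a - b; if (d < 0) d = -d
    printf "%.2f\n", d / ((a + b) / 2) * 100
  }'
}

# Percent change of a value relative to the baseline (negative means lower).
pct_change() {
  awk -v base="$1" -v val="$2" 'BEGIN { printf "%.2f\n", (val - base) / base * 100 }'
}

pct_diff   1723 1633       # Geekbench 6 multicore averages: baseline vs patched
pct_change 1723 1633
pct_diff   28.593 30.976   # HPL averages: baseline vs patched
pct_change 28.593 30.976
```

With these inputs the symmetric form gives 5.36 and 8.00, matching the quoted numbers, while the baseline-relative form gives -5.22 and 8.33.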

Representative result:
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             269.52             3.1348e+01
HPL_pdgesv() start time Sun Jul  7 15:05:55 2024

HPL_pdgesv() end time   Sun Jul  7 15:10:25 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

@geerlingguy

I also ran Geekbench 6 just after boot (1 min uptime) with the NUMA patch in place. Here's the result: https://browser.geekbench.com/v6/cpu/6820837 (801 / 1636).

@geerlingguy

And another Geekbench 6 run about 1 hour after boot, after a cooldown period of 10 minutes following all the previous tests: https://browser.geekbench.com/v6/cpu/6821505 (799 / 1637). So there is no noticeable difference, at least on this Pi 5 8GB running this Linux kernel, between runs immediately following boot and runs much later.

Going to move some other performance testing over to geerlingguy/sbc-reviews#21

@will127534

Hi @geerlingguy, I'm running through your steps and I think we also need to add numa=fake=4 to the cmdline.txt.

geerlingguy commented Jul 8, 2024

@will127534 - heh... as I was writing up a bit of a post on this... I realized that exact step was missing. I'm going to re-test now. Adding numa=fake=4 to /boot/firmware/cmdline.txt and rebooting, I now see:

pi@pi5:~ $ dmesg
...
[    0.000000] NUMA: No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000007fffffff]
[    0.000000] Faking a node at [mem 0x0000000080000000-0x00000000ffffffff]
[    0.000000] Faking a node at [mem 0x0000000100000000-0x000000017fffffff]
[    0.000000] Faking a node at [mem 0x0000000180000000-0x00000001ffffffff]
...
[    0.000000] Kernel command line: reboot=w coherent_pool=1M 8250.nr_uarts=1 pci=pcie_bus_safe  smsc95xx.macaddr=D8:3A:DD:84:FB:3A vc_mem.mem_base=0x3fc00000 vc_mem.mem_size=0x40000000  console=ttyAMA10,115200 console=tty1 root=PARTUUID=9f1af6e7-02 rootfstype=ext4 fsck.repair=yes numa=fake=4 rootwait
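
One quick way to confirm that the parameter took effect is to count the "Faking a node" lines in the kernel log; with numa=fake=4 there should be four, each spanning 0x80000000 bytes (2 GiB) in the log above. The sketch below is a hypothetical helper, `count_fake_nodes`, that reads log text from stdin so it can be tried without a Pi. On the device itself, `numactl --hardware` or the node directories under /sys/devices/system/node give the same information.

```shell
# Count the "Faking a node" lines in kernel log text read from stdin.
# "|| true" keeps the exit status zero when grep finds no match (it still prints 0).
count_fake_nodes() {
  grep -c 'Faking a node at' || true
}

# On the Pi (may need sudo, depending on dmesg restrictions):
#   dmesg | count_fake_nodes
# Self-contained example using two of the log lines shown above:
printf '%s\n' \
  '[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000007fffffff]' \
  '[    0.000000] Faking a node at [mem 0x0000000080000000-0x00000000ffffffff]' \
  | count_fake_nodes
```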

Geekbench 6

Run with: numactl --interleave=all ./geekbench6 (numactl installed with sudo apt install -y numactl).

| Type | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Single | 854 | 853 | 851 | 853 |
| Multi | 1949 | 1947 | 1936 | 1944 |
| Link | result | result | result | - |

Single core: 6.29% faster
Multicore: 12.05% faster

HPL / Top 500

Modified the mpirun command in the main.yml playbook to prepend numactl --interleave=all.
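
As a rough sketch of that change, the helper below prints the numactl prefix only when more than one NUMA node is exposed in sysfs, so the same launch line works with or without numa=fake. The `numa_prefix` name and the `mpirun -np 4 ./xhpl` launch line are assumptions for illustration; the actual command in main.yml is not shown in this thread.

```shell
# Print the numactl prefix when more than one NUMA node directory exists
# under the given sysfs path; print nothing otherwise.
numa_prefix() {
  node_dir="${1:-/sys/devices/system/node}"
  count=$(ls -d "$node_dir"/node[0-9]* 2>/dev/null | wc -l | tr -d ' ')
  if [ "$count" -gt 1 ]; then
    printf 'numactl --interleave=all '
  fi
}

# Dry run: show the command that would be executed.
# The mpirun arguments here are hypothetical, not copied from the playbook.
echo "$(numa_prefix)mpirun -np 4 ./xhpl"
```

Printing the command rather than executing it keeps the sketch safe to try on a machine without MPI or numactl installed.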

| Stat | Run 1 | Run 2 | Run 3 | Average |
| --- | --- | --- | --- | --- |
| Power (avg) | 12.0W | 12.1W | 12.0W | 12.0W |
| Result | 33.204 Gflops | 33.194 Gflops | 33.143 Gflops | 33.180 Gflops |
| Efficiency | 2.78 Gflops/W | 2.74 Gflops/W | 2.76 Gflops/W | 2.76 Gflops/W |

Result: 14.85% faster
Efficiency: 8% more efficient

Representative result:
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             254.46             3.3204e+01
HPL_pdgesv() start time Mon Jul  8 13:15:36 2024

HPL_pdgesv() end time   Mon Jul  8 13:19:51 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
