Benchmark Raspberry Pi 5 Linux kernel NUMA patch #36

geerlingguy · 2024-07-06T02:20:27Z

I would like to see if the NUMA patch here: https://lore.kernel.org/lkml/[email protected]/ — has any bearing on HPL performance and/or efficiency scores. Especially if it's reproducible and significant.

The stated numbers for Geekbench 6 are 5-ish and 20-ish percent improvements for single/multicore. I would like to see if there's any impact for HPL (which is inherently multicore, and very RAM-speed-dependent). Also measure the power usage to see if this affects power draw positively, negatively, or not at all.

NOTE: I'm testing with an 8GB Raspberry Pi 5. Default clocks, Raspberry Pi 5 Active Cooler, ambient temperature 80°F/26.7°C.

geerlingguy · 2024-07-06T03:41:29Z

Baseline

pi@pi5:~/linux $ uname -a
Linux pi5 6.6.31+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux

Geekbench 6

Type	Run 1	Run 2	Run 3	Average
Single	802	799	801	801
Multi	1721	1717	1731	1723
Link	result	result	result	-

HPL / Top 500

Stat	Run 1	Run 2	Run 3	Average
Power (avg)	11.5W	11.2W	11.1W	11.3W
Result	28.839 Gflops	28.395 Gflops	28.545 Gflops	28.593 Gflops
Efficiency	2.51 Gflops/W	2.53 Gflops/W	2.57 Gflops/W	2.54 Gflops/W
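
The efficiency row is the HPL result divided by the average power draw. Here is a minimal Python sketch of that arithmetic, using figures from the table above; the function name is illustrative. Recomputing from the rounded power values can differ from the table in the last digit (Run 2 gives 2.54 rather than 2.53), presumably because the original figures used unrounded power readings, which is an assumption.

```python
from statistics import mean


def efficiency(gflops: float, watts: float) -> float:
    """Gflops per watt, rounded to two decimals like the table."""
    return round(gflops / watts, 2)


# Run 1 of the baseline: 28.839 Gflops at 11.5 W average power.
run1 = efficiency(28.839, 11.5)

# The Average column matches the mean of the three listed per-run values.
avg_listed = round(mean([2.51, 2.53, 2.57]), 2)

print(run1, avg_listed)
```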

Representative result:

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             292.97             2.8839e+01
HPL_pdgesv() start time Fri Jul  5 22:52:13 2024

HPL_pdgesv() end time   Fri Jul  5 22:57:06 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
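
As a sanity check on the output above, the Gflops figure can be reproduced from N and Time using the standard HPL operation count of 2/3·N³ + 3/2·N² floating-point operations. The sketch below is illustrative (the function names are not from the thread). It also estimates the size of the coefficient matrix, which at N = 23314 is roughly half of the 8GB of RAM.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Rate for solving an order-n dense system: (2/3 n^3 + 3/2 n^2) flops / time."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


def matrix_gib(n: int) -> float:
    """Approximate size of an n x n matrix of 8-byte doubles, in GiB."""
    return n * n * 8 / 2**30


# N and Time taken from the result line above.
print(f"{hpl_gflops(23314, 292.97):.3f} Gflops")
print(f"{matrix_gib(23314):.2f} GiB")
```

The computed rate agrees with the reported 2.8839e+01 to within the rounding of the printed time.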

Power consumption graph showing HPL run tail end, and two of the Geekbench 6 runs. System uptime is over 24 hours.

geerlingguy · 2024-07-07T02:48:30Z

After applying NUMA patch

# After a quick rebuild of the current Pi 64-bit kernel source:

pi@pi5:~ $ uname -a
Linux pi5 6.6.36-v8-16k+ #1 SMP PREEMPT Sat Jul  6 23:19:18 CDT 2024 aarch64 GNU/Linux

1. Download mbox.gz from this patch thread.
2. On the Pi, in the linux checkout from Raspberry Pi's "Build the Linux kernel" guide, run: git am PATCH-2-2-arm64-numa-Add-NUMA-emulation-for-ARM64.mbox (skip empty messages)
3. Configure NUMA emulation with make menuconfig (requires libncurses-dev to be installed via apt):
   1. Enable "Kernel Features" > "NUMA Memory Allocation and Scheduler Support" (and enable "NUMA emulation" when it appears).
   2. Save the config and exit.
4. Rebuild the kernel and reboot.

# After rebuilding the kernel with the NUMA patch:

pi@pi5:~ $ uname -a
Linux pi5 6.6.36-v8-16k+ #2 SMP PREEMPT Sun Jul  7 00:43:44 CDT 2024 aarch64 GNU/Linux

IMPORTANT NOTE: The following results were taken with the NUMA Emulation patch applied, but without adding numa=fake=4 to cmdline.txt. See follow-up comment below with results after setting that parameter.
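
A quick way to tell whether a given boot actually has the parameter set is to look for it on the kernel command line (on a running system, the contents of /proc/cmdline). The following is a minimal sketch; the helper name and both sample strings are hypothetical, and it only handles the plain integer form numa=fake=N used in this thread.

```python
import re
from typing import Optional


def fake_numa_nodes(cmdline: str) -> Optional[int]:
    """Return N if 'numa=fake=N' appears as its own parameter, else None."""
    match = re.search(r"(?:^|\s)numa=fake=(\d+)(?=\s|$)", cmdline)
    return int(match.group(1)) if match else None


# Hypothetical command lines for illustration; on a real system, pass in
# the text read from /proc/cmdline instead.
without_param = "console=tty1 rootfstype=ext4 fsck.repair=yes rootwait"
with_param = "console=tty1 rootfstype=ext4 fsck.repair=yes numa=fake=4 rootwait"

print(fake_numa_nodes(without_param))
print(fake_numa_nodes(with_param))
```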

Geekbench 6

Type	Run 1	Run 2	Run 3	Average
Single	795	801	802	799
Multi	1636	1626	1638	1633
Link	result	result	result	-

Single core: 0.25% slower
Multicore: 5.36% slower

HPL / Top 500

Stat	Run 1	Run 2	Run 3	Average
Power (avg)	11.4W	11.0W	11.1W	11.2W
Result	31.348 Gflops	30.621 Gflops	30.958 Gflops	30.976 Gflops
Efficiency	2.75 Gflops/W	2.78 Gflops/W	2.78 Gflops/W	2.77 Gflops/W

Result: 8.00% faster
Efficiency: 8.66% more efficient
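
The percentages in this comment appear to be symmetric percent differences (the difference divided by the mean of the two averages) rather than changes relative to the baseline. That reading is inferred from the numbers, not stated in the thread. The sketch below computes both from the averages in the tables; the function names are illustrative.

```python
def percent_difference(a: float, b: float) -> float:
    """Symmetric percent difference: |a - b| relative to the mean of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100


def relative_change(baseline: float, new: float) -> float:
    """Signed percent change of new relative to baseline."""
    return (new - baseline) / baseline * 100


# (baseline average, patched average) taken from the tables in this thread.
pairs = {
    "Geekbench single": (801, 799),
    "Geekbench multi": (1723, 1633),
    "HPL result": (28.593, 30.976),
    "HPL efficiency": (2.54, 2.77),
}

for name, (base, new) in pairs.items():
    print(
        f"{name}: {percent_difference(base, new):.2f}% difference, "
        f"{relative_change(base, new):+.2f}% vs. baseline"
    )
```

Measured against the baseline instead, the multicore drop is about 5.2% and the HPL gain about 8.3%, so the two conventions differ slightly but lead to the same conclusion.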

Representative result:

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             269.52             3.1348e+01
HPL_pdgesv() start time Sun Jul  7 15:05:55 2024

HPL_pdgesv() end time   Sun Jul  7 15:10:25 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

geerlingguy · 2024-07-07T19:32:20Z

I also ran Geekbench 6 just after boot (1 min uptime) with the NUMA patch in place. Here's the result: https://browser.geekbench.com/v6/cpu/6820837 (801 / 1636).

geerlingguy · 2024-07-07T22:14:11Z

And another Geekbench 6 run about 1 hour after boot, after a 10-minute cooldown following all the previous tests: https://browser.geekbench.com/v6/cpu/6821505 (799 / 1637). So there's no noticeable difference, at least on this Pi 5 8GB running this Linux kernel, between runs immediately following boot and runs much later.

Going to move some other performance testing over to geerlingguy/sbc-reviews#21

will127534 · 2024-07-08T06:45:38Z

Hi @geerlingguy, I'm running through your steps and I think we also need to add numa=fake=4 to the cmdline.txt.

geerlingguy · 2024-07-08T17:21:20Z

@will127534 - heh... as I was writing up a bit of a post on this, I realized that exact step was missing. I'm going to re-test now. After adding numa=fake=4 to /boot/firmware/cmdline.txt and rebooting, I now see:

pi@pi5:~ $ dmesg
...
[    0.000000] NUMA: No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000007fffffff]
[    0.000000] Faking a node at [mem 0x0000000080000000-0x00000000ffffffff]
[    0.000000] Faking a node at [mem 0x0000000100000000-0x000000017fffffff]
[    0.000000] Faking a node at [mem 0x0000000180000000-0x00000001ffffffff]
...
[    0.000000] Kernel command line: reboot=w coherent_pool=1M 8250.nr_uarts=1 pci=pcie_bus_safe  smsc95xx.macaddr=D8:3A:DD:84:FB:3A vc_mem.mem_base=0x3fc00000 vc_mem.mem_size=0x40000000  console=ttyAMA10,115200 console=tty1 root=PARTUUID=9f1af6e7-02 rootfstype=ext4 fsck.repair=yes numa=fake=4 rootwait
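
The four "Faking a node" ranges in the dmesg output split the 8GB of RAM evenly. A minimal Python sketch that parses those lines and computes each emulated node's size follows; the function name is illustrative, and the sample text is copied from the output above with the timestamps dropped.

```python
import re
from typing import List

DMESG = """\
Faking a node at [mem 0x0000000000000000-0x000000007fffffff]
Faking a node at [mem 0x0000000080000000-0x00000000ffffffff]
Faking a node at [mem 0x0000000100000000-0x000000017fffffff]
Faking a node at [mem 0x0000000180000000-0x00000001ffffffff]
"""


def fake_node_sizes_gib(log: str) -> List[float]:
    """Size in GiB of each emulated node, from its inclusive [start-end] range."""
    pattern = re.compile(r"Faking a node at \[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\]")
    return [
        (int(end, 16) - int(start, 16) + 1) / 2**30
        for start, end in pattern.findall(log)
    ]


sizes = fake_node_sizes_gib(DMESG)
print(sizes, sum(sizes))
```

Each range covers 2 GiB, for 8 GiB in total. On the running system, numactl --hardware should list the same nodes.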

Geekbench 6

Run with: numactl --interleave=all ./geekbench6 (numactl was installed with sudo apt install -y numactl).

Type	Run 1	Run 2	Run 3	Average
Single	854	853	851	853
Multi	1949	1947	1936	1944
Link	result	result	result	-

Single core: 6.29% faster
Multicore: 12.05% faster

HPL / Top 500

Modified the mpirun command in the main.yml playbook to prepend numactl --interleave=all.

Stat	Run 1	Run 2	Run 3	Average
Power (avg)	12.0W	12.1W	12.0W	12.0W
Result	33.204 Gflops	33.194 Gflops	33.143 Gflops	33.180 Gflops
Efficiency	2.78 Gflops/W	2.74 Gflops/W	2.76 Gflops/W	2.76 Gflops/W

Result: 14.85% faster
Efficiency: 8% more efficient

Representative result:

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             254.46             3.3204e+01
HPL_pdgesv() start time Mon Jul  8 13:15:36 2024

HPL_pdgesv() end time   Mon Jul  8 13:19:51 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

geerlingguy mentioned this issue Jul 7, 2024 in Raspberry Pi 5 model B geerlingguy/sbc-reviews#21 (Open)

geerlingguy closed this as completed Jul 11, 2024
