
Raspberry Pi Compute Module 5 #58

Open
geerlingguy opened this issue Nov 27, 2024 · 7 comments

Comments

@geerlingguy
Owner

geerlingguy commented Nov 27, 2024

(image: cm5)

Basic information

All tests were run on the 4GB board, except as noted. Some tests scale with RAM.

Linux/system information

# output of `screenfetch`
         _,met$$$$$gg.           pi@cm5
      ,g$$$$$$$$$$$$$$$P.        OS: Debian 12 bookworm
    ,g$$P""       """Y$$.".      Kernel: aarch64 Linux 6.6.51+rpt-rpi-2712
   ,$$P'              `$$$.      Uptime: 12m
  ',$$P       ,ggs.     `$$b:    Packages: 1630
  `d$$'     ,$P"'   .    $$$     Shell: bash 5.2.15
   $$P      d$'     ,    $$P     Disk: 9.4G / 237G (5%)
   $$:      $$.   -    ,d$$'     CPU: ARM Cortex-A76 @ 4x 2.4GHz
   $$\;      Y$b._   _,d$P'      GPU: 
   Y$$.    `.`"Y$$$$P"'          RAM: 677MiB / 4045MiB
   `$$b      "-.__              
    `Y$$                        
     `Y$$.                      
       `$$b.                    
         `Y$$b.                 
            `"Y$b._             
                `""""      

# output of `uname -a`
Linux cm5 6.6.51+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.51-1+rpt3 (2024-10-08) aarch64 GNU/Linux

Benchmark results

CPU

Power

  • Idle power draw (at wall): 2.4 W (2.3 W for 'Lite' models)
  • Maximum simulated power draw (stress-ng --matrix 0): 8.5 W
  • During Geekbench multicore benchmark: 8.2 W
  • During top500 HPL benchmark: 9.4 W (10 W on 8GB)
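For scale, the measured draws work out to a small daily energy budget; a quick sanity check using the numbers above:

```shell
# Daily energy: watts x 24 h = Wh/day
awk 'BEGIN { printf "%.1f Wh/day at 2.4 W idle\n", 2.4 * 24 }'
awk 'BEGIN { printf "%.1f Wh/day at the 9.4 W HPL peak\n", 9.4 * 24 }'
```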

Disk

Pinedrive 256GB 2242 NVMe SSD at PCIe Gen 3

| Benchmark | Result |
|-----------|--------|
| iozone 4K random read | 63.01 MB/s |
| iozone 4K random write | 298.06 MB/s |
| iozone 1M random read | 820.16 MB/s |
| iozone 1M random write | 759.24 MB/s |
| iozone 1M sequential read | 823.04 MB/s |
| iozone 1M sequential write | 758.51 MB/s |

Built-in eMMC (32GB)

| Benchmark | Result |
|-----------|--------|
| iozone 4K random read | 34.71 MB/s |
| iozone 4K random write | 61.80 MB/s |
| iozone 1M random read | 314.97 MB/s |
| iozone 1M random write | 108.32 MB/s |
| iozone 1M sequential read | 316.19 MB/s |
| iozone 1M sequential write | 109.71 MB/s |
wget https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh
chmod +x disk-benchmark.sh
sudo MOUNT_PATH=/ TEST_SIZE=1g ./disk-benchmark.sh
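The 4K figures are easier to compare against SSD spec sheets as IOPS; assuming the script's MB/s means 10^6 bytes per second, the conversion is just MB/s × 10^6 / 4096:

```shell
# Convert the NVMe 4K random results above from MB/s to IOPS (4096-byte ops)
awk 'BEGIN { printf "4K random read:  %.0f IOPS\n", 63.01  * 1e6 / 4096 }'
awk 'BEGIN { printf "4K random write: %.0f IOPS\n", 298.06 * 1e6 / 4096 }'
```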

Network

iperf3 results:

Built-in 1 Gbps Ethernet (BCM54210PE)

  • iperf3 -c $SERVER_IP: 938 Mbps
  • iperf3 -c $SERVER_IP --reverse: 884 Mbps
  • iperf3 -c $SERVER_IP --bidir: 931 Mbps up, 663 Mbps down

WiFi (built-in PCB antenna)

  • iperf3 -c $SERVER_IP: 249 Mbps
  • iperf3 -c $SERVER_IP --reverse: 240 Mbps
  • iperf3 -c $SERVER_IP --bidir: 133 Mbps up, 96.6 Mbps down
pi@cm5:~ $ iwconfig wlan0
wlan0     IEEE 802.11  ESSID:"GE_5G"  
          Mode:Managed  Frequency:5.2 GHz  Access Point: 6C:CD:D6:61:8F:21   
          Bit Rate=390 Mb/s   Tx-Power=31 dBm   
          Retry short limit:7   RTS thr:off   Fragment thr:off
          Power Management:on
          Link Quality=57/70  Signal level=-53 dBm  
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:0  Invalid misc:0   Missed beacon:0

WiFi (external antenna)

  • iperf3 -c $SERVER_IP: 250 Mbps
  • iperf3 -c $SERVER_IP --reverse: 245 Mbps
  • iperf3 -c $SERVER_IP --bidir: 113 Mbps up, 120 Mbps down
pi@cm5:~ $ iwconfig wlan0
wlan0     IEEE 802.11  ESSID:"GE_5G"  
          Mode:Managed  Frequency:5.2 GHz  Access Point: 6C:CD:D6:61:8F:21   
          Bit Rate=433.3 Mb/s   Tx-Power=31 dBm   
          Retry short limit:7   RTS thr:off   Fragment thr:off
          Power Management:on
          Link Quality=53/70  Signal level=-57 dBm  
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:18  Invalid misc:0   Missed beacon:0

(Antenna orientation was optimized for best signal before each run using wavemon: `sudo apt install wavemon`, then run `wavemon`.)
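For reference, the built-in antenna's iperf3 throughput is roughly two thirds of the PHY rate iwconfig reports (249 Mbps over a 390 Mb/s link), which is in the normal range for 802.11 protocol overhead:

```shell
# iperf3 throughput vs. reported PHY bit rate for the built-in antenna run
awk 'BEGIN { printf "%.0f%% of the 390 Mb/s link rate\n", 249 / 390 * 100 }'
```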

GPU

glmark2

glmark2-es2 / glmark2-es2-wayland results:

=======================================================
    glmark2 2023.01
=======================================================
    OpenGL Information
    GL_VENDOR:      Broadcom
    GL_RENDERER:    V3D 7.1
    GL_VERSION:     OpenGL ES 3.1 Mesa 23.2.1-1~bpo12+rpt3
    Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
    Surface Size:   800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 2563 FrameTime: 0.390 ms
[build] use-vbo=true: FPS: 3419 FrameTime: 0.293 ms
[texture] texture-filter=nearest: FPS: 2839 FrameTime: 0.352 ms
[texture] texture-filter=linear: FPS: 2839 FrameTime: 0.352 ms
[texture] texture-filter=mipmap: FPS: 2883 FrameTime: 0.347 ms
[shading] shading=gouraud: FPS: 2867 FrameTime: 0.349 ms
[shading] shading=blinn-phong-inf: FPS: 2487 FrameTime: 0.402 ms
[shading] shading=phong: FPS: 2109 FrameTime: 0.474 ms
[shading] shading=cel: FPS: 2045 FrameTime: 0.489 ms
[bump] bump-render=high-poly: FPS: 1406 FrameTime: 0.711 ms
[bump] bump-render=normals: FPS: 3072 FrameTime: 0.326 ms
[bump] bump-render=height: FPS: 2873 FrameTime: 0.348 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 1176 FrameTime: 0.851 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 482 FrameTime: 2.078 ms
[pulsar] light=false:quads=5:texture=false: FPS: 2943 FrameTime: 0.340 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 289 FrameTime: 3.467 ms
[desktop] effect=shadow:windows=4: FPS: 1080 FrameTime: 0.926 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 537 FrameTime: 1.863 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 529 FrameTime: 1.893 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 570 FrameTime: 1.755 ms
[ideas] speed=duration: FPS: 2264 FrameTime: 0.442 ms
[jellyfish] <default>: FPS: 1211 FrameTime: 0.826 ms
[terrain] <default>: FPS: 77 FrameTime: 13.035 ms
[shadow] <default>: FPS: 184 FrameTime: 5.463 ms
[refract] <default>: FPS: 84 FrameTime: 11.928 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 3268 FrameTime: 0.306 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 2297 FrameTime: 0.435 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 3222 FrameTime: 0.310 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 2733 FrameTime: 0.366 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 1898 FrameTime: 0.527 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 2618 FrameTime: 0.382 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 2622 FrameTime: 0.381 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 1778 FrameTime: 0.563 ms
=======================================================
                                  glmark2 Score: 1916 
=======================================================

Note: This benchmark requires an active display on the device. Not all devices may be able to run glmark2-es2, so in that case, make a note and move on!
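As a reading aid: glmark2's FrameTime column is just the reciprocal of the FPS value, so you can check rows quickly (the printed FPS is rounded, which can shift the last digit):

```shell
# FrameTime (ms) ~= 1000 / FPS; check against the [build] rows above
for fps in 2563 3419; do
  awk -v f="$fps" 'BEGIN { printf "%d FPS -> %.3f ms\n", f, 1000 / f }'
done
```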

Ollama

ollama LLM model inference results:

| Pi Model | CPU/GPU | LLM | Rate |
|----------|---------|-----|------|
| Raspberry Pi CM5 - 4GB | CPU | llama3.2:3b | 4.58 Tokens/s |
| Raspberry Pi CM5 - 8GB | CPU | llama3.2:3b | 4.53 Tokens/s |
| Raspberry Pi CM5 - 8GB | CPU | llama3.1:8b | 1.93 Tokens/s |
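The Tokens/s figures are ollama's reported eval rate (`ollama run <model> --verbose` prints it; the REST API returns `eval_count` and `eval_duration` in nanoseconds). The rate is just `eval_count / eval_duration`; the numbers below are illustrative, chosen to match the 4GB result:

```shell
# tokens/s = eval_count / (eval_duration in seconds); values are illustrative
awk 'BEGIN {
  eval_count = 458            # tokens generated
  eval_duration_ns = 100e9    # 100 s spent on generation
  printf "%.2f Tokens/s\n", eval_count / (eval_duration_ns / 1e9)
}'
```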

Power consumption was a steady 9.3W during inference.

Memory

tinymembench results:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :   5303.7 MB/s (0.2%)
 C copy backwards (32 byte blocks)                    :   5333.1 MB/s (0.2%)
 C copy backwards (64 byte blocks)                    :   5328.4 MB/s
 C copy                                               :   6061.3 MB/s (0.1%)
 C copy prefetched (32 bytes step)                    :   6031.9 MB/s
 C copy prefetched (64 bytes step)                    :   6036.9 MB/s
 C 2-pass copy                                        :   5433.6 MB/s
 C 2-pass copy prefetched (32 bytes step)             :   6003.8 MB/s (0.1%)
 C 2-pass copy prefetched (64 bytes step)             :   5996.6 MB/s
 C fill                                               :  12660.7 MB/s (0.2%)
 C fill (shuffle within 16 byte blocks)               :  12630.7 MB/s
 C fill (shuffle within 32 byte blocks)               :  12628.8 MB/s
 C fill (shuffle within 64 byte blocks)               :  12642.2 MB/s
 NEON 64x2 COPY                                       :   5996.0 MB/s (1.0%)
 NEON 64x2x4 COPY                                     :   5996.6 MB/s
 NEON 64x1x4_x2 COPY                                  :   6006.0 MB/s
 NEON 64x2 COPY prefetch x2                           :   5517.6 MB/s
 NEON 64x2x4 COPY prefetch x1                         :   5587.1 MB/s
 NEON 64x2 COPY prefetch x1                           :   5494.3 MB/s
 NEON 64x2x4 COPY prefetch x1                         :   5596.0 MB/s (0.6%)
 ---
 standard memcpy                                      :   6012.7 MB/s
 standard memset                                      :  12646.0 MB/s (0.3%)
 ---
 NEON LDP/STP copy                                    :   6012.5 MB/s (0.1%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :   6014.6 MB/s (0.2%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :   6013.5 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   5997.7 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   5995.9 MB/s
 NEON LD1/ST1 copy                                    :   6002.0 MB/s
 NEON STP fill                                        :  12634.8 MB/s (0.8%)
 NEON STNP fill                                       :  12640.2 MB/s (0.7%)
 ARM LDP/STP copy                                     :   6011.4 MB/s (0.6%)
 ARM STP fill                                         :  12403.9 MB/s (0.4%)
 ARM STNP fill                                        :  12408.2 MB/s (0.2%)

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 NEON LDP/STP copy (from framebuffer)                 :   1939.3 MB/s (0.7%)
 NEON LDP/STP 2-pass copy (from framebuffer)          :   1737.0 MB/s (0.2%)
 NEON LD1/ST1 copy (from framebuffer)                 :   1945.3 MB/s (0.2%)
 NEON LD1/ST1 2-pass copy (from framebuffer)          :   1736.1 MB/s
 ARM LDP/STP copy (from framebuffer)                  :   1894.0 MB/s (0.1%)
 ARM LDP/STP 2-pass copy (from framebuffer)           :   1732.5 MB/s (0.1%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.1 ns          /     1.5 ns 
    262144 :    1.6 ns          /     2.0 ns 
    524288 :    2.3 ns          /     2.9 ns 
   1048576 :    8.3 ns          /    11.3 ns 
   2097152 :   15.1 ns          /    19.0 ns 
   4194304 :   51.5 ns          /    77.4 ns 
   8388608 :   79.8 ns          /   108.0 ns 
  16777216 :   94.9 ns          /   119.5 ns 
  33554432 :  104.4 ns          /   126.1 ns 
  67108864 :  110.0 ns          /   130.3 ns
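At the 2.4 GHz CPU clock, the ~110 ns full-random latency in the last row (64 MiB buffer, mostly DRAM accesses) corresponds to a few hundred stalled cycles per miss:

```shell
# ns x GHz = cycles; 64 MiB single-random-read latency from the table above
awk 'BEGIN { printf "%.0f cycles at 2.4 GHz\n", 110.0 * 2.4 }'
```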

sbc-bench results

Run sbc-bench and paste a link to the results here: https://0x0.st/XRKg.txt

Phoronix Test Suite

Results from pi-general-benchmark.sh:

Launch version of Pi OS

  • pts/encode-mp3: 11.739 sec
  • pts/x264 4K: 4.32 fps
  • pts/x264 1080p: 18.06 fps
  • pts/phpbench: 435778
  • pts/build-linux-kernel (defconfig): 2222.151 sec

December update (Pi OS with NUMA faking and SDRAM tweaks)

  • pts/encode-mp3: 11.708 sec
  • pts/x264 4K: 4.08 fps
  • pts/x264 1080p: 17.63 fps
  • pts/phpbench: 431800
  • pts/build-linux-kernel (defconfig): 2110.457 sec

Other benchmarks

  • Boot time (Pi OS 64-bit Desktop): 22.92s to SSH login, 23.90s to GUI with menu bar
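Boot time here was measured to SSH login and to the GUI; `systemd-analyze` gives a comparable kernel + userspace split. A sketch of pulling the total out of its summary line (the sample line is illustrative, not captured from this CM5):

```shell
# Parse the total boot time out of a systemd-analyze summary line (sample)
line='Startup finished in 2.1s (kernel) + 20.8s (userspace) = 22.9s'
printf '%s\n' "$line" | awk -F'= ' '{ print $2 }'
```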
@schoolpost

Thanks for all the thorough performance numbers. Can you confirm the CM5 uses the D0 variant of the BCM2712?

@geerlingguy
Owner Author

@schoolpost - All the ones I've seen are D0, yes.

@geerlingguy
Owner Author

Home Assistant Yellow is also a drop-in upgrade, in addition to all the Compute Module carrier boards I mentioned in my video: https://www.youtube.com/watch?v=X4blR5Ua3S0

geerlingguy added a commit that referenced this issue Nov 27, 2024
@melroy89

melroy89 commented Dec 1, 2024

Hi! Me again 😃. Are you sure the Broadcom BCM2712 has an H.264 hardware decode/encode block? I've read a lot online that the HEVC decoder in the BCM2712 is not provided by Broadcom, but was designed by Raspberry Pi themselves. And in this new BCM2712 chip (used in the CM5), the H.264 decoder was removed entirely:

https://www.phoronix.com/forums/forum/hardware/processors-memory/1412037-raspberry-pi-5-benchmarks-significantly-better-performance-improved-i-o?p=1412202#post1412202

The reason I'm bringing this up is that you mentioned the Raspberry Pi is used in IP KVM cards, which is true, but those KVM cards actually need hardware H.264 codec support in order to function well. In fact, you could argue the CM5 is worse now.

@geerlingguy
Owner Author

@melroy89 - It has H.265 decode (up to 4K 60 fps), but not hardware H.264 encode or decode (nor any other encode/decode, at least none that is usable by Raspberry Pi... there could be dark silicon on there that Broadcom used for other customers).

The CM5 is fully capable of 1080p encode and decode in software, and can even hit 4K depending on bitrate and what you're after, outside of the hardware H.265 decode path. See: Can the Raspberry Pi 5 handle 4K?

I haven't yet tested the CM5 on IP KVM cards—in my video I was clear that the Compute Module fits, but that and many other CM4 boards have not been tested with CM5 yet. For tracking on that testing, please follow this issue: geerlingguy/raspberry-pi-pcie-devices#686

@geerlingguy
Owner Author

Re-testing a few things now that the SDRAM/NUMA tweaks are in Pi OS proper (just needs an update to activate them).

@geerlingguy
Owner Author

Added some results up above, but here's new SBC-bench results: https://0x0.st/Xh2H.txt

Some interesting diffs:

- libc memcpy copy                                 :   5705.0 MB/s (3, 0.2%)
- libc memchr scan                                 :  13722.6 MB/s (2)
- libc memset fill                                 :  12567.9 MB/s (3, 0.9%)
+ libc memcpy copy                                 :   5892.4 MB/s (2)
+ libc memchr scan                                 :  14188.5 MB/s (2)
+ libc memset fill                                 :   9675.1 MB/s (3, 1.4%)

-  * memcpy: 5705.0 MB/s, memchr: 13722.6 MB/s, memset: 12567.9 MB/s
-  * 16M latency: 119.2 118.6 118.8 117.8 120.8 134.6 130.2 139.9 
-  * 128M latency: 136.3 135.2 147.2 135.1 136.4 134.9 135.8 137.1 
-  * 7-zip MIPS (3 consecutive runs): 11250, 11265, 11217 (11240 avg), single-threaded: 3164
-  * `aes-256-cbc     540307.67k  1003613.67k  1256029.53k  1332866.39k  1365516.29k  1367834.62k`
-  * `aes-256-cbc     540608.23k  1003560.11k  1255918.42k  1332845.23k  1365215.91k  1368135.00k`
+  * memcpy: 5892.4 MB/s, memchr: 14188.5 MB/s, memset: 9675.1 MB/s
+  * 16M latency: 100.1 101.0 103.3 100.9 99.88 114.8 130.4 146.9 
+  * 128M latency: 116.7 115.2 116.6 115.2 119.5 115.8 116.6 118.4 
+  * 7-zip MIPS (3 consecutive runs): 11819, 11858, 11809 (11830 avg), single-threaded: 3306
+  * `aes-256-cbc     540521.66k  1003777.26k  1256054.36k  1332929.88k  1365549.06k  1368053.08k`
+  * `aes-256-cbc     540683.01k  1003568.53k  1256005.03k  1332878.68k  1365235.03k  1368211.46k`
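Comparing the multi-core 7-zip averages before and after the SDRAM/NUMA tweaks (11240 → 11830 MIPS) works out to roughly a 5% uplift:

```shell
# Percent improvement in multi-core 7-zip MIPS after the update
awk 'BEGIN { printf "%.1f%% faster\n", (11830 - 11240) / 11240 * 100 }'
```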
