Raspberry Pi 400 #59

Open
geerlingguy opened this issue Nov 30, 2024 · 0 comments
raspberry-pi-400-hero-back-ports

Basic information

NOTE: I never uploaded my initial test results to this repository, because the repository was not created until 2023. As time goes on, I've tried filling in more of the Pi lineup, as I like to have comparisons between all product families. Thus, this machine was re-tested running the latest Pi OS and firmware as of late 2024.

Linux/system information

# output of `screenfetch`
         _,met$$$$$gg.           pi@pi400
      ,g$$$$$$$$$$$$$$$P.        OS: Debian 12 bookworm
    ,g$$P""       """Y$$.".      Kernel: aarch64 Linux 6.6.62+rpt-rpi-v8
   ,$$P'              `$$$.      Uptime: 0m
  ',$$P       ,ggs.     `$$b:    Packages: 1920
  `d$$'     ,$P"'   .    $$$     Shell: bash 5.2.15
   $$P      d$'     ,    $$P     Disk: 13G / 119G (12%)
   $$:      $$.   -    ,d$$'     CPU: ARM Cortex-A72 @ 4x 1.8GHz
   $$\;      Y$b._   _,d$P'      GPU: 
   Y$$.    `.`"Y$$$$P"'          RAM: 415MiB / 3791MiB
   `$$b      "-.__              
    `Y$$                        
     `Y$$.                      
       `$$b.                    
         `Y$$b.                 
            `"Y$b._             
                `""""   

# output of `uname -a`
Linux pi400 6.6.62+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux

Benchmark results

CPU

Power

  • Idle power draw (at wall): 2.7 W
  • Maximum simulated power draw (stress-ng --matrix 0): 5.2 W
  • During Geekbench multicore benchmark: 6.3 W
  • During top500 HPL benchmark: 6.4 W

Disk

SanDisk Extreme 128 GB microSD

| Benchmark | Result |
| --- | --- |
| iozone 4K random read | 8.76 MB/s |
| iozone 4K random write | 4.39 MB/s |
| iozone 1M random read | 43.66 MB/s |
| iozone 1M random write | 34.71 MB/s |
| iozone 1M sequential read | 43.66 MB/s |
| iozone 1M sequential write | 35.50 MB/s |

Run the benchmark on any other attached storage device (e.g. eMMC, microSD, NVMe, SATA) and add the results under an additional heading.

Also consider running the PiBenchmarks.com script.

Network

iperf3 results:

  • iperf3 -c $SERVER_IP: TODO Mbps
  • iperf3 -c $SERVER_IP --reverse: TODO Mbps
  • iperf3 -c $SERVER_IP --bidir: TODO Mbps up, TODO Mbps down

(Be sure to test all interfaces, noting any that are non-functional.)
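The three iperf3 runs listed above could be scripted roughly as follows. This is a minimal sketch, not part of the original test procedure: the helper names `iperf3_command`, `run_iperf3`, and `mbps_received` are hypothetical, and the JSON field path `end.sum_received.bits_per_second` is an assumption about the layout of iperf3's `--json` report that should be checked against the installed version. The `--bidir` report contains additional summary sections, so this sketch only extracts a rate for the forward and reverse runs.

```python
import json
import subprocess


def iperf3_command(server_ip, mode="forward"):
    """Build the iperf3 client command for one of the three test modes."""
    cmd = ["iperf3", "-c", server_ip, "--json"]
    if mode == "reverse":
        cmd.append("--reverse")
    elif mode == "bidir":
        cmd.append("--bidir")
    elif mode != "forward":
        raise ValueError(f"unknown mode: {mode}")
    return cmd


def run_iperf3(server_ip, mode="forward"):
    """Run iperf3 against a reachable server and return the parsed JSON report."""
    completed = subprocess.run(
        iperf3_command(server_ip, mode),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


def mbps_received(report):
    """Receiver-side throughput in Mbps (assumed field names, see lead-in)."""
    return report["end"]["sum_received"]["bits_per_second"] / 1e6
```

Calling `mbps_received(run_iperf3(server_ip))` and `mbps_received(run_iperf3(server_ip, "reverse"))` would give the first two figures for a given server address; the bidirectional figures are easier to read from the plain-text report until the JSON layout is confirmed.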

GPU

glmark2

glmark2-es2 / glmark2-es2-wayland results:

=======================================================
    glmark2 2023.01
=======================================================
    OpenGL Information
    GL_VENDOR:      Broadcom
    GL_RENDERER:    V3D 4.2
    GL_VERSION:     OpenGL ES 3.1 Mesa 23.2.1-1~bpo12+rpt3
    Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
    Surface Size:   800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 1027 FrameTime: 0.974 ms
[build] use-vbo=true: FPS: 1527 FrameTime: 0.655 ms
[texture] texture-filter=nearest: FPS: 1252 FrameTime: 0.799 ms
[texture] texture-filter=linear: FPS: 1218 FrameTime: 0.821 ms
[texture] texture-filter=mipmap: FPS: 1150 FrameTime: 0.870 ms
[shading] shading=gouraud: FPS: 1171 FrameTime: 0.854 ms
[shading] shading=blinn-phong-inf: FPS: 940 FrameTime: 1.065 ms
[shading] shading=phong: FPS: 720 FrameTime: 1.389 ms
[shading] shading=cel: FPS: 687 FrameTime: 1.457 ms
[bump] bump-render=high-poly: FPS: 589 FrameTime: 1.701 ms
[bump] bump-render=normals: FPS: 1222 FrameTime: 0.819 ms
[bump] bump-render=height: FPS: 1123 FrameTime: 0.891 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 447 FrameTime: 2.241 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 222 FrameTime: 4.524 ms
[pulsar] light=false:quads=5:texture=false: FPS: 1339 FrameTime: 0.747 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 108 FrameTime: 9.300 ms
[desktop] effect=shadow:windows=4: FPS: 438 FrameTime: 2.284 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 188 FrameTime: 5.338 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 194 FrameTime: 5.176 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 233 FrameTime: 4.306 ms
[ideas] speed=duration: FPS: 914 FrameTime: 1.095 ms
[jellyfish] <default>: FPS: 421 FrameTime: 2.379 ms
[terrain] <default>: FPS: 26 FrameTime: 39.011 ms
[shadow] <default>: FPS: 110 FrameTime: 9.099 ms
[refract] <default>: FPS: 36 FrameTime: 28.205 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 1432 FrameTime: 0.699 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 700 FrameTime: 1.430 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 1331 FrameTime: 0.752 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 1036 FrameTime: 0.966 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 607 FrameTime: 1.649 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 983 FrameTime: 1.018 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 983 FrameTime: 1.018 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 604 FrameTime: 1.656 ms
=======================================================
                                  glmark2 Score: 755 
=======================================================
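Each scene line in the output above pairs an FPS value with a frame time, and the two are related by FrameTime ≈ 1000 / FPS (in milliseconds). The sketch below parses one scene line and recomputes the frame time as a sanity check. The function names and the regular expression are illustrative assumptions based only on the line format shown here, not on glmark2's own code.

```python
import re

# Matches lines such as:
#   [build] use-vbo=false: FPS: 1027 FrameTime: 0.974 ms
SCENE_RE = re.compile(
    r"^\[(?P<scene>[^\]]+)\]\s+(?P<options>.*?):\s+FPS:\s+(?P<fps>\d+)"
    r"\s+FrameTime:\s+(?P<ms>[\d.]+)\s+ms"
)


def parse_scene_line(line):
    """Return scene name, options, FPS and frame time, or None for other lines."""
    match = SCENE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "scene": match.group("scene"),
        "options": match.group("options"),
        "fps": int(match.group("fps")),
        "frame_time_ms": float(match.group("ms")),
    }


def frame_time_ms(fps):
    """Frame time in milliseconds implied by a frames-per-second figure."""
    return 1000.0 / fps


result = parse_scene_line("[build] use-vbo=false: FPS: 1027 FrameTime: 0.974 ms")
# The recomputed value should agree with the reported one to within rounding.
print(result["fps"], round(frame_time_ms(result["fps"]), 3))
```

Because the reported FPS is a rounded integer, small differences in the last digit of the frame time are expected for some scenes.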

GravityMark

GravityMark results:

1. Download the latest version of GravityMark: https://gravitymark.tellusim.com
2. Run `chmod +x [downloaded_file.run]`
3. Run `sudo ./[downloaded_file.run]` and press `y` to accept the terms.
4. Open the link it prints, and run the Benchmark defaults, changing to 720p resolution and 50,000 asteroids.

Note: These benchmarks require an active display on the device. Not all devices may be able to run glmark2-es2, so in that case, make a note and move on!

Ollama

ollama LLM model inference results:

| Pi Model | CPU/GPU | LLM | Rate | Power |
| --- | --- | --- | --- | --- |
| Raspberry Pi 400 - 4GB | CPU | llama3.2:3b | 1.60 Tokens/s | 6 W |

Note that Ollama will run on the CPU if no valid GPU / drivers are present. Be sure to note whether Ollama runs on the CPU, GPU, or a dedicated NPU.
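To compare inference efficiency across boards, the rate and power figures in the Ollama result above can be combined into energy per token. The sketch below is plain arithmetic on those two numbers; the function names are hypothetical, and because the 6 W power figure appears to be rounded, the results are only approximate.

```python
def joules_per_token(power_watts, tokens_per_second):
    """Energy spent per generated token: watts are joules per second."""
    return power_watts / tokens_per_second


def tokens_per_watt_hour(power_watts, tokens_per_second):
    """Tokens generated per watt-hour of energy at a steady rate and power."""
    return tokens_per_second * 3600.0 / power_watts


# Figures from the Pi 400 row: 1.60 tokens/s at roughly 6 W.
print(round(joules_per_token(6.0, 1.60), 2))       # about 3.75 J per token
print(round(tokens_per_watt_hour(6.0, 1.60)))      # about 960 tokens per Wh
```

The same two functions apply to any other row in a comparison table, as long as the power figure was measured during the inference run itself.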

TODO: See this issue for discussion about a full suite of standardized GPU benchmarks.

Memory

tinymembench results:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :   2574.7 MB/s (2.3%)
 C copy backwards (32 byte blocks)                    :   2580.6 MB/s (0.3%)
 C copy backwards (64 byte blocks)                    :   2577.9 MB/s (0.3%)
 C copy                                               :   2442.4 MB/s (0.3%)
 C copy prefetched (32 bytes step)                    :   2557.6 MB/s (0.4%)
 C copy prefetched (64 bytes step)                    :   2554.6 MB/s (0.3%)
 C 2-pass copy                                        :   1855.1 MB/s (0.5%)
 C 2-pass copy prefetched (32 bytes step)             :   2202.1 MB/s (0.2%)
 C 2-pass copy prefetched (64 bytes step)             :   2205.1 MB/s (0.2%)
 C fill                                               :   3076.9 MB/s (1.1%)
 C fill (shuffle within 16 byte blocks)               :   3032.4 MB/s (0.6%)
 C fill (shuffle within 32 byte blocks)               :   3040.0 MB/s (0.7%)
 C fill (shuffle within 64 byte blocks)               :   3034.4 MB/s (0.5%)
 NEON 64x2 COPY                                       :   2555.8 MB/s (0.3%)
 NEON 64x2x4 COPY                                     :   2556.9 MB/s (0.3%)
 NEON 64x1x4_x2 COPY                                  :   2557.2 MB/s (0.3%)
 NEON 64x2 COPY prefetch x2                           :   2553.5 MB/s (0.2%)
 NEON 64x2x4 COPY prefetch x1                         :   2552.4 MB/s (0.2%)
 NEON 64x2 COPY prefetch x1                           :   2557.8 MB/s (0.3%)
 NEON 64x2x4 COPY prefetch x1                         :   2549.9 MB/s (0.3%)
 ---
 standard memcpy                                      :   2564.8 MB/s (0.3%)
 standard memset                                      :   3066.5 MB/s (0.9%)
 ---
 NEON LDP/STP copy                                    :   2552.3 MB/s (0.3%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :   2548.9 MB/s (0.2%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :   2547.3 MB/s (0.3%)
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   2558.1 MB/s (0.2%)
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   2561.2 MB/s (0.3%)
 NEON LD1/ST1 copy                                    :   2559.0 MB/s (0.3%)
 NEON STP fill                                        :   3035.2 MB/s (0.8%)
 NEON STNP fill                                       :   2862.7 MB/s (0.5%)
 ARM LDP/STP copy                                     :   2553.1 MB/s (0.3%)
 ARM STP fill                                         :   3054.3 MB/s (1.0%)
 ARM STNP fill                                        :   2870.2 MB/s (0.9%)

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 NEON LDP/STP copy (from framebuffer)                 :    765.8 MB/s (0.3%)
 NEON LDP/STP 2-pass copy (from framebuffer)          :    654.9 MB/s (0.2%)
 NEON LD1/ST1 copy (from framebuffer)                 :    824.9 MB/s (5.5%)
 NEON LD1/ST1 2-pass copy (from framebuffer)          :    689.0 MB/s (0.2%)
 ARM LDP/STP copy (from framebuffer)                  :    551.3 MB/s
 ARM LDP/STP 2-pass copy (from framebuffer)           :    523.8 MB/s (0.2%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    1.2 ns          /     2.2 ns 
     65536 :    4.7 ns          /     7.4 ns 
    131072 :    7.2 ns          /     9.9 ns 
    262144 :   10.3 ns          /    13.1 ns 
    524288 :   11.9 ns          /    15.2 ns 
   1048576 :   29.9 ns          /    45.8 ns 
   2097152 :   82.8 ns          /   120.4 ns 
   4194304 :  111.4 ns          /   144.7 ns 
   8388608 :  132.2 ns          /   163.0 ns 
  16777216 :  142.5 ns          /   171.7 ns 
  33554432 :  148.0 ns          /   176.5 ns 
  67108864 :  155.7 ns          /   188.2 ns 

sbc-bench results

Run sbc-bench and paste a link to the results here: https://0x0.st/XRdJ.bin

Phoronix Test Suite

Results from pi-general-benchmark.sh:

  • pts/encode-mp3: 24.500 sec
  • pts/x264 4K: 1.70 fps
  • pts/x264 1080p: 7.65 fps
  • pts/phpbench: 202897
  • pts/build-linux-kernel (defconfig): 6849.540 sec