M4 Mac mini #57

Open
geerlingguy opened this issue Nov 11, 2024 · 27 comments

@geerlingguy
Owner

geerlingguy commented Nov 11, 2024

Basic information

  • Board URL (official): https://www.apple.com/mac-mini/
  • Board purchased from: Apple (direct)
  • Board purchase date: October 29, 2024 (arrived Nov 11, 2024)
  • Board specs (as tested): M4 10/10/16-core, 32GB RAM, 1TB SSD, 10 GbE
  • Board price (as tested): 1499.00

Linux/system information

# output of `screenfetch`
                 -/+:.          jgeerling@jeff-mini
                :++++.          OS: 64bit macOS  
               /+++/.           Kernel: arm64 Darwin 24.1.0
       .:-::- .+/:-``.::-       Uptime: 5h 39m
    .:/++++++/::::/++++++/:`    Packages: 183
  .:///////////////////////:`   Shell: zsh 5.9
  ////////////////////////`     Resolution: 3840x2160 
 -+++++++++++++++++++++++`      DE: Aqua
 /++++++++++++++++++++++/       WM: Quartz Compositor
 /sssssssssssssssssssssss.      WM Theme: Blue (Dark)
 :ssssssssssssssssssssssss-     Font: FMonoMedium
  osssssssssssssssssssssssso/`  Disk: 190G / 995G (20%)
  `syyyyyyyyyyyyyyyyyyyyyyyy+`  CPU: Apple M4
   `ossssssssssssssssssssss/    GPU: Apple M4 
     :ooooooooooooooooooo+.     RAM: 3974MiB / 32768MiB
      `:+oo+/:-..-:/+o+/-      

# output of `uname -a`
Darwin jeff-mini.local 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132 arm64

Benchmark results

CPU

Power

  • Idle power draw (at wall): 4.1 W
  • Maximum simulated power draw (stress-ng --matrix 0): 31.2 W
  • During Geekbench multicore benchmark: 36 W
  • During top500 HPL benchmark: 39.6 W
  • During Cinebench 2024: 38 W

Disk

Internal Apple Storage

| Benchmark | Result |
| --- | --- |
| AmorphousDiskMark 4K random read QD64 | 1113.00 MB/s |
| AmorphousDiskMark 4K random write QD64 | 121.97 MB/s |
| AmorphousDiskMark 1M sequential read | 3017.64 MB/s |
| AmorphousDiskMark 1M sequential write | 3196.68 MB/s |
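
For a rough sense of what the 4K random numbers mean in operations per second, here is a small conversion sketch. It assumes 1 MB = 1,000,000 bytes and 4 KiB (4096-byte) requests; neither is stated in the results, and the `iops` helper is just an illustrative name.

```python
# Convert 4K random throughput (MB/s) into approximate IOPS.
# Assumptions (not stated in the results): 1 MB = 1,000,000 bytes,
# and every request is exactly 4 KiB (4096 bytes).
BYTES_PER_MB = 1_000_000
BLOCK_SIZE = 4096


def iops(mb_per_s: float, block_size: int = BLOCK_SIZE) -> float:
    """Return the number of I/O operations per second for a given throughput."""
    return mb_per_s * BYTES_PER_MB / block_size


results = {
    "4K random read QD64": 1113.00,
    "4K random write QD64": 121.97,
}

for name, mbps in results.items():
    print(f"{name}: {mbps:.2f} MB/s is roughly {iops(mbps):,.0f} IOPS")
```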

Network

iperf3 results:

  • iperf3 -c $SERVER_IP: 9.40 Gbps
  • iperf3 -c $SERVER_IP --reverse: 9.38 Gbps
  • iperf3 -c $SERVER_IP --bidir: 9.37 Gbps up, 7.73 Gbps down

The 10 GbE connection adds about 2W to total system power draw.

(Be sure to test all interfaces, noting any that are non-functional.)
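
As a sanity check on the iperf3 figures, the sketch below estimates the best-case TCP goodput on a 10 GbE link. It assumes a standard 1500-byte MTU, IPv4, and TCP timestamps enabled; none of these settings are confirmed in the thread, so treat the result as a ballpark.

```python
# Best-case TCP goodput on 10 GbE versus the measured iperf3 result.
# Assumptions (not confirmed in the thread): 1500-byte MTU, IPv4 without
# options, TCP timestamps enabled (12 bytes of TCP options per segment).
LINE_RATE_GBPS = 10.0
MTU = 1500
IP_HEADER = 20
TCP_HEADER = 20
TCP_TIMESTAMP_OPTION = 12
# Ethernet header (14) + FCS (4) + preamble/SFD (8) + inter-frame gap (12)
ETHERNET_OVERHEAD = 14 + 4 + 8 + 12

payload = MTU - IP_HEADER - TCP_HEADER - TCP_TIMESTAMP_OPTION
wire_bytes = MTU + ETHERNET_OVERHEAD
max_goodput_gbps = LINE_RATE_GBPS * payload / wire_bytes

measured_gbps = 9.40  # iperf3 -c $SERVER_IP result reported above
print(f"payload per frame: {payload} of {wire_bytes} bytes on the wire")
print(f"theoretical max:   {max_goodput_gbps:.2f} Gbps")
print(f"measured:          {measured_gbps:.2f} Gbps "
      f"({measured_gbps / max_goodput_gbps:.1%} of max)")
```

Under those assumptions the single-direction results are essentially at line rate.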

GPU

  • Cinebench 2024: 3787
  • Geekbench (Metal): 56652
  • Geekbench (OpenCL): 37773

Memory

tinymembench results:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :  30582.1 MB/s (2.4%)
 C copy backwards (32 byte blocks)                    :  30488.7 MB/s (2.7%)
 C copy backwards (64 byte blocks)                    :  30756.7 MB/s (0.8%)
 C copy                                               :  31050.0 MB/s (0.7%)
 C copy prefetched (32 bytes step)                    :  31217.9 MB/s (0.3%)
 C copy prefetched (64 bytes step)                    :  31255.8 MB/s (1.8%)
 C 2-pass copy                                        :  25266.3 MB/s (1.3%)
 C 2-pass copy prefetched (32 bytes step)             :  25340.9 MB/s (1.5%)
 C 2-pass copy prefetched (64 bytes step)             :  25332.9 MB/s (1.5%)
 C fill                                               :  45000.0 MB/s (10.3%)
 C fill (shuffle within 16 byte blocks)               :  35503.4 MB/s (2.5%)
 C fill (shuffle within 32 byte blocks)               :  37420.2 MB/s (4.0%)
 C fill (shuffle within 64 byte blocks)               :  41411.4 MB/s (6.9%)
 NEON 64x2 COPY                                       :  44108.2 MB/s (1.9%)
 NEON 64x2x4 COPY                                     :  44995.9 MB/s (3.8%)
 NEON 64x1x4_x2 COPY                                  :  43933.6 MB/s (3.5%)
 NEON 64x2 COPY prefetch x2                           :  38081.9 MB/s (4.2%)
 NEON 64x2x4 COPY prefetch x1                         :  37652.8 MB/s (0.8%)
 NEON 64x2 COPY prefetch x1                           :  38499.6 MB/s (1.7%)
 NEON 64x2x4 COPY prefetch x1                         :  36585.0 MB/s (1.6%)
 ---
 standard memcpy                                      :  44986.9 MB/s (2.0%)
 standard memset                                      :  69795.4 MB/s (1.2%)
 ---
 NEON LDP/STP copy                                    :  44254.6 MB/s (4.7%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :  45326.7 MB/s (4.9%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :  43931.5 MB/s (3.8%)
 NEON LDP/STP copy pldl1keep (32 bytes step)          :  44670.6 MB/s (2.6%)
 NEON LDP/STP copy pldl1keep (64 bytes step)          :  44082.8 MB/s (1.3%)
 NEON LD1/ST1 copy                                    :  42881.2 MB/s (1.6%)
 NEON STP fill                                        :  80754.5 MB/s (5.4%)
 NEON STNP fill                                       :  68623.4 MB/s (0.5%)
 ARM LDP/STP copy                                     :  43418.8 MB/s (0.3%)
 ARM STP fill                                         :  82462.2 MB/s (5.2%)
 ARM STNP fill                                        :  68986.8 MB/s (1.3%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.1 ns 
    131072 :    0.0 ns          /     0.0 ns 
    262144 :    2.0 ns          /     3.0 ns 
    524288 :    2.9 ns          /     3.8 ns 
   1048576 :    3.4 ns          /     4.1 ns 
   2097152 :    3.7 ns          /     4.1 ns 
   4194304 :    5.1 ns          /     5.6 ns 
   8388608 :    6.1 ns          /     6.3 ns 
  16777216 :   12.7 ns          /    17.6 ns 
  33554432 :   49.1 ns          /    71.5 ns 
  67108864 :   71.1 ns          /    91.0 ns 

sbc-bench results

The script doesn't run on macOS.

Phoronix Test Suite

Results from pi-general-benchmark.sh:

  • pts/encode-mp3: DNF (doesn't install on macOS)
  • pts/x264 4K: 12.82 fps
  • pts/x264 1080p: 55.53 fps
  • pts/phpbench: 1125967
  • pts/build-linux-kernel (defconfig): DNF (doesn't run on macOS)

Run inside a Docker container:

  • pts/encode-mp3: 4.250 s
  • pts/x264 4K: 25.27 fps
  • pts/x264 1080p: 108.50 fps
  • pts/phpbench: 932720
  • pts/build-linux-kernel (defconfig): 383.776 s
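
To put the native and Docker runs side by side, this sketch computes the Docker-to-native ratio for the three tests that completed in both environments. It only does the arithmetic on the numbers above and says nothing about why they differ; the dictionary keys are illustrative labels.

```python
# Ratio of Docker results to native macOS results (higher is better for
# all three tests). Only tests that completed in both environments are listed.
native = {
    "x264 4K (fps)": 12.82,
    "x264 1080p (fps)": 55.53,
    "phpbench (score)": 1125967,
}
docker = {
    "x264 4K (fps)": 25.27,
    "x264 1080p (fps)": 108.50,
    "phpbench (score)": 932720,
}


def speedup(test: str) -> float:
    """Return the Docker result divided by the native result for one test."""
    return docker[test] / native[test]


for test in native:
    print(f"{test}: Docker is {speedup(test):.2f}x native")
```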

Additional Benchmarks

Ollama (LLMs)

See: https://github.com/geerlingguy/ollama-benchmark?tab=readme-ov-file#findings and geerlingguy/ollama-benchmark#2

| System | CPU/GPU | Model | Eval Rate | Power (Peak) |
| --- | --- | --- | --- | --- |
| M4 Mac mini (10 core CPU) / 32GB | GPU | llama3.2:3b | 41.31 Tokens/s | 30.1 W |
| M4 Mac mini (10 core CPU) / 32GB | GPU | llama3.1:8b | 20.95 Tokens/s | 29.4 W |
| M4 Mac mini (10 core CPU) / 32GB | GPU | llama2:13b | 13.60 Tokens/s | 29.8 W |
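
One way to compare these rows is tokens per second per watt. The sketch below divides the eval rate by the peak power from the table; because it uses peak rather than average power, the real efficiency is probably somewhat better than these figures suggest.

```python
# Tokens/s per watt for each model, using the peak power column.
# Assumption: peak power stands in for average power during generation,
# so these numbers likely understate the real efficiency.
runs = [
    # (model, eval rate in tokens/s, peak power in W)
    ("llama3.2:3b", 41.31, 30.1),
    ("llama3.1:8b", 20.95, 29.4),
    ("llama2:13b", 13.60, 29.8),
]


def tokens_per_watt(eval_rate: float, power_w: float) -> float:
    """Return tokens per second per watt of wall power."""
    return eval_rate / power_w


for model, rate, power in runs:
    print(f"{model}: {tokens_per_watt(rate, power):.2f} tokens/s/W")
```
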
@geerlingguy
Owner Author

This system is much more of a 'Single Board Computer' than a couple of the Ampere systems I'm also testing in this repo; and in a nice advancement over some SBCs, it has a power supply integrated in its diminutive body.

For a nice video showing how to upgrade the storage (lol why Apple still doesn't just go to M.2 is crazy): https://www.youtube.com/watch?v=cJPXLE9uPr8

And Snazzy Labs has a good teardown: https://www.youtube.com/watch?v=OYlF0NVXS70

I don't feel obligated to gut mine, since plenty of other people already have done so to theirs. I do feel obligated to test it like crazy before putting it into service. My current plan is to replace my M2 MacBook Air at home with this machine, and then the Air might come to the office and replace my old 2013 Air for a 'bench' laptop.

@geerlingguy geerlingguy changed the title from "M4 Pro Mac Mini" to "M4 Pro Mac mini" on Nov 12, 2024
@geerlingguy
Owner Author

I think my setup at my home desk leaves something to be desired. I noticed the M4 Pro CPU spikes to 100°C during multi-core tests like Cinebench. Geekbench 6 doesn't hit the cores as hard for as long, but it may have brief throttling as well, affecting the overall score.

Screenshot 2024-11-12 at 12 08 01 AM

I'm going to re-run all the tests in a better test environment at the studio tomorrow.

@ThomasKaiser

why Apple still doesn't just go to M.2 is crazy

M.2 as in 'standard NVMe SSD with own controller'? Well, two reasons against:

  • lower profits
  • power efficiency ruined. With every consumer SSD I have tried so far, consumption was much higher than without one (SBC use case: eMMC vs. OS on NVMe), and I guess a lot of what Apple is doing with the I/O subsystem is optimizing for both performance and battery life (or low consumption in the Mini's case). Unfortunately I haven't seen any tests so far that apply the 'race to idle' concept to I/O as well, but fine-grained measurements on any Apple Silicon hardware make it obvious that the I/O subsystem spends most of its time in a 'deep sleep' state or something lower (I'm not that familiar with the terminology in this area).

@geerlingguy
Owner Author

geerlingguy commented Nov 12, 2024

At the office, where the Mac mini is in open air (ambient temp 23°C / 73°F), Cinebench is still pushing the SoC to 100°C pretty quickly. Maybe Apple's fan curve isn't aggressive enough?

I can hear the fan, but it's certainly very quiet. Only barely audible in the studio. Fan is at 1920 rpm

@geerlingguy
Owner Author

geerlingguy commented Nov 12, 2024

After installing Macs Fan Control and setting the fan to 4900 rpm (max), the fan is audible (sounds about like the Qualcomm Dev Kit, lol), and temps are now down to 80°C after a minute:

Screenshot 2024-11-12 at 10 13 35 AM

Going to re-run Cinebench multi at max fan speed to see how it fares, and compare the two runs:

| Test | Default fan settings (1900 rpm) | High fan speed (4000 rpm) |
| --- | --- | --- |
| Cinebench 2024 Single | 169 | 175 |
| Cinebench 2024 Multi | 893 | 801 |

The full system power draw averages 37.1W while running Cinebench 2024 continuously with default fan curve. Maxed out, I was getting up to 40.8W.

Odd result, the multi score is actually lower with the CPU not hitting max temps throughout! Not sure why, but color me surprised. Maybe silence is bliss.
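
To quantify the difference between the two fan settings, this sketch computes the percentage change for each Cinebench score. It is plain arithmetic on the table values; `pct_change` is an illustrative helper name.

```python
# Percentage change in Cinebench 2024 scores, default fan curve vs. high fan speed.
scores = {
    # test: (default fan curve, high fan speed)
    "Cinebench 2024 Single": (169, 175),
    "Cinebench 2024 Multi": (893, 801),
}


def pct_change(before: float, after: float) -> float:
    """Return the relative change from before to after, in percent."""
    return (after - before) / before * 100


for test, (default_fan, high_fan) in scores.items():
    print(f"{test}: {pct_change(default_fan, high_fan):+.1f}%")
```

So the single-core score rose by a few percent while the multi-core score dropped by roughly a tenth.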

@geerlingguy
Owner Author

geerlingguy commented Nov 12, 2024

Cinebench results with Apple's default fan curve:

Screenshot 2024-11-12 at 10 39 03 AM

Cinebench results with 3000 rpm fan speed:

Screenshot 2024-11-12 at 11 55 54 AM

@geerlingguy geerlingguy changed the title from "M4 Pro Mac mini" to "M4 Mac mini" on Nov 12, 2024
@geerlingguy
Owner Author

Heh... I thought I had ordered an M4 Pro. Little did I know, I ordered an M4. Oopsie! It's still way faster than the M2 Air I'm replacing, and it still has 32 GB of RAM (double my Air), a 1 TB SSD (double my Air), and 10 GbE (which my Air had to use a TB3 dongle for), so I'm happy with it, but disappointed I didn't look more closely when I ordered it.

Though looking at the prices now, I'm okay with not paying the like $500 premium for the Pro + RAM upgrade.

@geerlingguy
Owner Author

geerlingguy commented Nov 12, 2024

Here's the power graph through the Cinebench runs:

Screenshot 2024-11-12 at 11 52 28 AM

It goes multi (default fan curve), GPU (default fan curve), single (3000 rpm), multi (3000 rpm), GPU (3000 rpm).

@geerlingguy
Owner Author

geerlingguy commented Nov 12, 2024

@ThomasKaiser - Since the sbc-bench.sh script doesn't run on macOS/Darwin, would you like any tests in particular since I have my machine up and ready for some benchmarks? (Outside of what I've already run.)

@geerlingguy
Owner Author

Disk results with AmorphousDiskMark 4.0.1:

Screenshot 2024-11-12 at 2 41 17 PM

@geerlingguy
Owner Author

I ran my https://github.com/geerlingguy/ollama-benchmark a couple of times, and even tried loading the llama3.1:70b model, but that wanted almost 40 GB of RAM, which meant 12 GB of swap and slaughtered the performance. Need more RAM to run larger models.

@andrewginns

andrewginns commented Nov 13, 2024

Would be great to run this standardised benchmark of llama models!

ggerganov/llama.cpp#4167

The pinned commit is very old (but required for obvious reasons) so newer versions of the repo might also be interesting to explore to see the impact and efficiency of the GPU/Neural engine.

@film42

film42 commented Nov 13, 2024

I can hear the fan, but it's certainly very quiet. Only barely audible in the studio. Fan is at 1920 rpm

That's good news, but ouch! That seems like a really slow fan. Curious to know how well it handles high-res video transcoding. I'm sure I'm not the only one who thinks this screams "my next media server" for only $550 (on Amazon).

@nreilly

nreilly commented Nov 13, 2024

At the office, where the Mac mini is in open air (ambient temp 23°C / 73°F), Cinebench is still pushing the SoC to 100°C pretty quickly. Maybe Apple's fan curve isn't aggressive enough?

I can hear the fan, but it's certainly very quiet. Only barely audible in the studio. Fan is at 1920 rpm

Please consider testing in High Power Mode.

@geerlingguy
Owner Author

@nreilly - Low Power mode has been switched off in all my testing:

Screenshot 2024-11-12 at 11 42 13 PM

@nreilly

nreilly commented Nov 13, 2024

@nreilly - Low Power mode has been switched off in all my testing

The Mac mini (2024) should have 3 options for the energy level. Low, Automatic and High. The Apple article indicates it's available for the Mac mini (2024), and doesn't clarify it's only for the M4 Pro versions, so I would expect it to be an option for you, but your screenshot obviously says no.

macos-ventura-system-settings-battery-energy-mode-on-battery-high-power

@imadcat

imadcat commented Nov 13, 2024

Great thx for sharing! Is there also a computer vision (non LLM) AI benchmark result?

@JayBrown

If only APFS and Disk Utility offered RAID5 functionality… I'd use this as a home server in a heartbeat, with two external M.2 thunderbolt enclosures, each sporting an M.2 gen3 x2 to 6xSATA adapter… using three SATA SSDs on each adapter for a combined storage pool. Afaik the only way to implement RAID5 is by using OWC's SoftRAID, and that's a subscription model, and I don't know if we should trust their software for an application as fundamental as storage. OpenZFS seems to be buggy as hell on macOS, so RAIDz1 is probably out of the question. Come on, Apple, give us more RAID levels already. 🙏

@mibosshard

The M4 (non-Pro) only has Low Power Mode; only the M4 Pro Mac mini has High Power Mode. In Low Power Mode, the M4 SoC consumes less than 7 W under full load (HandBrake H.265 encoding on the CPU cores).
https://browser.geekbench.com/v6/cpu/compare/8799758?baseline=8802915

@geerlingguy
Owner Author

I unplugged 10 GbE and just used WiFi 6, and idle power consumption goes from 6W to 4.1W.

The HPL efficiency score increased from 6.74 Gflops/W to 7.57 Gflops/W (much wow).
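
In relative terms, the change looks like this. The sketch only restates the figures from this comment (6 W vs. 4.1 W idle, 6.74 vs. 7.57 Gflops/W); the variable names are illustrative.

```python
# Relative effect of unplugging 10 GbE, using the figures from this comment.
idle_with_10gbe_w = 6.0
idle_wifi_only_w = 4.1
hpl_eff_with_10gbe = 6.74  # Gflops/W
hpl_eff_wifi_only = 7.57   # Gflops/W

idle_saving_w = idle_with_10gbe_w - idle_wifi_only_w
idle_saving_pct = idle_saving_w / idle_with_10gbe_w * 100
hpl_gain_pct = (hpl_eff_wifi_only - hpl_eff_with_10gbe) / hpl_eff_with_10gbe * 100

print(f"idle power saved: {idle_saving_w:.1f} W ({idle_saving_pct:.1f}% lower)")
print(f"HPL efficiency gain: {hpl_gain_pct:.1f}%")
```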

@film42

film42 commented Nov 20, 2024

Wow! Does Apple power down the radios completely when Wi-Fi and Bluetooth are disabled? Can’t help but wonder if the radios are using power at idle when 10 GbE is connected.

@ThomasKaiser

I unplugged 10 GbE and just used WiFi 6

Have you also tried to 'downgrade' the Ethernet link by connecting to a GbE or 2.5GbE switch port and measured idle consumption?

@geerlingguy
Owner Author

geerlingguy commented Nov 20, 2024

@ThomasKaiser - I haven't tried that yet. May do so soon, left my power cord at home so I can't test right now :( (at least unless I have a compatible AC cord laying around in one of my boxes lol)

@geerlingguy
Owner Author

For some comparisons:

hpl efficiency m4 mac mini

llm performance m4 mac mini

@akarabach

Looks like a good candidate to run as a home server. With 2 Thunderbolt 5 ports it's not a problem anymore to connect even 990 Pro SSDs. Would love to see that kind of video on the channel!

@ThomasKaiser

ThomasKaiser commented Nov 21, 2024

Since the sbc-bench.sh script doesn't run on macOS/Darwin, would you like any tests in particular

@geerlingguy in case the Mini is still in 'testing territory' with Xcode installed, it should take just a few minutes of your time to get 7-zip 16.02 scores (16.02 to be comparable with sbc-bench scores; versions from 17.04 on up may perform differently):

git clone https://github.com/ThomasKaiser/p7zip
cd p7zip
make -j$(sysctl -n hw.ncpu) INSTALL=install CC=gcc CXX=g++ OPTFLAGS='-O3'
for i in 1 2 3 4 5 ; do bin/7za b -mmt=1 | grep "Tot:" ; done | sort -n | tail -n1 ; for i in 1 2 3 4 5 ; do bin/7za b -mmt=6 | grep "Tot:" ; done | sort -n | tail -n1

This will download the 16.02 sources (with all known vulnerabilities patched), build them, and run the 7-zip benchmark single-threaded and on all 6 P-cores in parallel (5 continuous runs each, then displaying the best score of each).

Numbers to be compared with (counting/measuring only P-cores on the Macs):

  • 192-core AmpereOne A192-32X (3200 MHz): single: 4783, multi: 745720 (155.91 ratio)
  • 8-core Apple M1 Pro (3228 MHz): single: 5540, multi: 46540 (8.40 ratio)
  • 8-core Apple M1 Max (3228 MHz): single: 5845, multi: 52170 (8.93 ratio)
  • 8-core Apple M2 Pro (3504 MHz): single: 6830, multi: 57120 (8.36 ratio)
  • 4-core Apple M3 (4054 MHz): single: 7570, multi: 31380 (4.15 ratio)
  • 6-core Apple M4 (? MHz): single: ?, multi: ? (? ratio)

(the ratio for all the Apple results is obviously strange, since it should be lower than the thread count)
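
The ratios in the list are simply the multi-threaded score divided by the single-threaded score. A minimal sketch that recomputes them and flags any result scaling beyond its core count (the tuple layout and `scaling` helper are illustrative, not part of sbc-bench):

```python
# Recompute multi/single scaling ratios from the 7-zip scores listed above
# and flag results whose ratio exceeds the number of cores used.
results = [
    # (system, cores used, single-threaded score, multi-threaded score)
    ("AmpereOne A192-32X", 192, 4783, 745720),
    ("Apple M1 Pro", 8, 5540, 46540),
    ("Apple M1 Max", 8, 5845, 52170),
    ("Apple M2 Pro", 8, 6830, 57120),
    ("Apple M3", 4, 7570, 31380),
]


def scaling(single: int, multi: int) -> float:
    """Return how many times faster the multi-threaded run scored."""
    return multi / single


for system, cores, single, multi in results:
    ratio = scaling(single, multi)
    note = "above core count" if ratio > cores else "below core count"
    print(f"{system}: {ratio:.2f} ratio on {cores} cores ({note})")
```

Every Apple entry lands above its P-core count, while the AmpereOne stays below its 192 cores.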

In a following run I checked in parallel with sudo powermetrics | grep frequency whether the CPU frequencies were consistent between single-threaded and multi-threaded runs, and found no difference on the systems tested (unlike the M1 and M2 in the past, which were clocked lower when all P-cores were active than with only 1 or 2).

@DaveHarwoodNZ

why Apple still doesn't just go to M.2 is crazy

M.2 as in 'standard NVMe SSD with own controller'? Well, two reasons against:

  • lower profits
  • power efficiency ruined.

The integrated controller appears to offer low-overhead encryption which shows almost no performance degradation. Not that many people will compare devices with encryption enabled.
