Improve matrix multiplication using the Java Vector API on Apple silicon. #7

mukel · 2024-07-21T17:21:01Z

llama.cpp runs incredibly fast on Apple silicon, I ran a build with pure CPU, and it is closer to the memory bandwidth e.g. 28 tokens/s on an M3 Pro.
llama3.java seems to be rather slow on Apple silicon e.g. Q8_0 runs as fast as Q4_0 at about 4 tokens/s, something is off. On PC it's within ~10% of llama.cpp

stephanj · 2024-07-30T11:34:49Z

⚠️ Claude Sonnet 3.5 feedback (hopefully this helps?) :

"Based on your observations, it seems there may be some optimization opportunities for llama3.java on Apple Silicon platforms. Let's analyze this situation and consider some potential reasons and solutions:

Vectorization:
llama.cpp likely uses highly optimized SIMD (Single Instruction, Multiple Data) instructions specific to Apple Silicon (NEON instructions). The Java implementation may not be taking full advantage of these vector instructions.

Possible solution: Ensure that the Java Vector API is being utilized effectively. You might want to check if the JVM is properly recognizing and using the NEON instructions on Apple Silicon.

Memory access patterns:
Efficient cache usage and memory access patterns are crucial for performance, especially when dealing with large models and matrices.

Possible solution: Analyze and optimize the memory access patterns in the Java code. This could involve restructuring data to improve cache locality or using more efficient data structures.

JIT compilation:
The Java Just-In-Time (JIT) compiler might not be optimizing the code as effectively as the native C++ compiler used for llama.cpp.

Possible solution: You could try running the Java application for longer periods to allow the JIT compiler more time to optimize the hot code paths. You might also experiment with different JVM flags to tune the JIT compiler's behavior.

Quantization implementation:
The implementation of the quantization methods (Q4_0 and Q8_0) in Java might not be as efficient as the C++ counterparts.

Possible solution: Review and optimize the quantization code, possibly by profiling to identify bottlenecks.

Native code integration:
llama.cpp might be using some platform-specific optimizations that are difficult to replicate in pure Java.

Possible solution: Consider using JNI (Java Native Interface) to integrate some critical C++ code directly, especially for the most performance-sensitive parts of the application.

Garbage collection:
Java's garbage collection could be introducing pauses or overhead that's not present in the C++ version.

Possible solution: Profile the application to see if garbage collection is a significant factor. If so, you might need to optimize object allocation and consider using different GC algorithms or tuning GC parameters.

Threading model:
The threading and parallelization strategy in the Java version might not be as efficient as the C++ version on Apple Silicon.

Possible solution: Review the parallelization approach, ensuring it's well-suited to the specific architecture of Apple Silicon chips.

To address these issues:

Profile the Java application thoroughly to identify the main bottlenecks.
Compare the assembly output of critical sections between llama.cpp and the JIT-compiled Java code to spot any significant differences in instruction usage.
Experiment with different JVM flags and versions to see if there are significant performance variations.
Consider reaching out to the Java community or Apple's developer resources for insights on optimizing Java performance on Apple Silicon.

By systematically addressing these potential issues, you may be able to significantly improve the performance of llama3.java on Apple Silicon, bringing it closer to the performance levels you're seeing with llama.cpp."

RainerXE · 2024-10-15T18:14:39Z

Hi,
I profiled the on graalvm 24ea16 with visual vm in JIT mode and I seems that most time (94+%) is used up by: FloatTensor.vectorDot (). Hope this helps.

RainerXE · 2024-10-15T19:52:15Z

There must be something wrong with the Graalvm 24 EA at least on the M1 Mac. I tested the benchmarks from this site:https://alexklibisz.com/2023/02/25/accelerating-vector-operations-jvm-jdk-incubator-vector-project-panama#complete-benchmark-results and got MUCH worse results using JEP338 then
m1macbookpro-graalvm 24ea16.txt
using the baseline (s. attached).

RainerXE mentioned this issue Oct 17, 2024

[GR-59102] Vector API (JEP338) Implementation is very slow on M1 Mac oracle/graal#9889

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve matrix multiplication using the Java Vector API on Apple silicon. #7

Improve matrix multiplication using the Java Vector API on Apple silicon. #7

mukel commented Jul 21, 2024

stephanj commented Jul 30, 2024 •

edited

Loading

RainerXE commented Oct 15, 2024

RainerXE commented Oct 15, 2024

Improve matrix multiplication using the Java Vector API on Apple silicon. #7

Improve matrix multiplication using the Java Vector API on Apple silicon. #7

Comments

mukel commented Jul 21, 2024

stephanj commented Jul 30, 2024 • edited Loading

RainerXE commented Oct 15, 2024

RainerXE commented Oct 15, 2024

stephanj commented Jul 30, 2024 •

edited

Loading