Improve matrix multiplication using the Java Vector API on Apple silicon. #7
"Based on your observations, there may be some optimization opportunities for llama3.java on Apple Silicon. Some potential causes and solutions:
- Ensure the Java Vector API is actually being used effectively. Check whether the JVM recognizes Apple Silicon and maps Vector API operations onto NEON instructions.
- Analyze and optimize the memory access patterns in the Java code, e.g. restructure data to improve cache locality or use more efficient data structures.
- Run the application for longer so the JIT compiler has more time to optimize the hot code paths, and experiment with JVM flags that tune the JIT compiler's behavior.
- Review and optimize the quantization code, profiling first to identify bottlenecks.
- Consider using JNI (Java Native Interface) to integrate some critical C++ code directly, especially for the most performance-sensitive parts of the application.
- Profile the application to see whether garbage collection is a significant factor. If so, optimize object allocation and consider different GC algorithms or GC parameter tuning.
- Review the parallelization approach, ensuring it is well suited to the architecture of Apple Silicon chips.

By systematically addressing these potential issues, you may be able to significantly improve the performance of llama3.java on Apple Silicon, bringing it closer to the performance you're seeing with llama.cpp."
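The first suggestion (verifying that the Vector API maps onto NEON) can be sanity-checked with a small standalone program. This is only a sketch: the class name `VectorCheck` and the array sizes are illustrative, and it must be run with `--add-modules jdk.incubator.vector` since the Vector API is still incubating. On Apple Silicon, `SPECIES_PREFERRED` should report a 128-bit species (NEON's register width); anything else suggests the JVM is falling back to scalar code.

```java
// Sketch: print the preferred vector species and run a vectorized dot product.
// Run with: java --add-modules jdk.incubator.vector VectorCheck.java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorCheck {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) sum += a[i] * b[i]; // scalar tail
        return sum;
    }

    public static void main(String[] args) {
        // On NEON this should print a 128-bit species (e.g. 4 float lanes).
        System.out.println("Preferred species: " + SPECIES
                + " (" + SPECIES.vectorBitSize() + " bits)");
        float[] a = new float[1024], b = new float[1024];
        for (int i = 0; i < a.length; i++) { a[i] = i * 0.5f; b[i] = 1.0f; }
        System.out.println("dot = " + dot(a, b)); // dot = 261888.0
    }
}
```

Timing this loop against a plain scalar loop (after JIT warmup) gives a quick read on whether the Vector API is delivering any speedup on a given JVM build.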
Hi,

There must be something wrong with GraalVM 24 EA, at least on the M1 Mac. I tested the benchmarks from this site: https://alexklibisz.com/2023/02/25/accelerating-vector-operations-jvm-jdk-incubator-vector-project-panama#complete-benchmark-results and got MUCH worse results using JEP 338 than
llama.cpp runs incredibly fast on Apple silicon: I ran a pure-CPU build and it gets close to the memory-bandwidth limit, e.g. 28 tokens/s on an M3 Pro.
llama3.java seems rather slow on Apple silicon: Q8_0 runs only as fast as Q4_0, at about 4 tokens/s, so something is off. On PC it's within ~10% of llama.cpp.
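The "close to the memory bandwidth" claim can be made concrete with a back-of-the-envelope bound. The numbers below are assumptions for illustration (roughly 150 GB/s of usable bandwidth on an M3 Pro, and an ~8 GB Q8_0 8B model): a bandwidth-bound decoder must stream every weight once per generated token, so:

```java
// Sketch with assumed numbers: estimate the memory-bandwidth ceiling on tokens/s.
public class TokensPerSecond {
    public static void main(String[] args) {
        double bandwidthGBs = 150.0; // assumed usable memory bandwidth, GB/s
        double modelGB = 8.0;        // assumed Q8_0 8B model size, GB
        // Each generated token reads the full set of weights once,
        // so tokens/s cannot exceed bandwidth divided by model size.
        double maxTokensPerSec = bandwidthGBs / modelGB;
        System.out.printf("Upper bound: %.2f tokens/s%n", maxTokensPerSec); // 18.75
    }
}
```

Under these assumptions the ceiling is ~19 tokens/s, so 4 tokens/s is nowhere near bandwidth-bound. The Q8_0 vs Q4_0 observation points the same way: a bandwidth-bound decoder would run Q4_0 roughly twice as fast as Q8_0 (half the bytes per weight), so identical speeds suggest the bottleneck is compute, not memory.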