Explanation for performance gap #91
Comments
I reached at most 13.62 GFLOP/s in FlopsCL with float16 and a large number of loop iterations. One of the important aspects is balancing kernel length against the number of iterations. I have done many measurements, but will publish them in October at the latest.
So one big factor is the ALUs. You only get the full 24 GFLOPS if you utilize both ALUs in every clock cycle! Since the multiplication ALU does not support that many opcodes, it is definitely not utilized that much. And of course the other problem is memory bandwidth: compared to the fairly powerful compute units, the memory interfaces are very slow. And as @pfoof hinted (I think), kernel code that is too big (or branches that skip too many instructions) might also lead to cache misses when loading the instructions. But I don't have any numbers for that.
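To make the ALU argument concrete, here is a minimal sketch of where the 24 GFLOPS figure would come from. The hardware parameters (12 QPUs, 4 physical SIMD lanes per QPU, a 250 MHz clock) are assumptions taken from the commonly cited VideoCore IV description, not numbers stated in this thread, and `peak_gflops` is a hypothetical helper.

```python
# Assumed VideoCore IV parameters (not stated in this thread).
NUM_QPUS = 12        # assumed number of QPUs
LANES_PER_QPU = 4    # assumed physical SIMD lanes per QPU
CLOCK_HZ = 250e6     # assumed GPU clock in Hz


def peak_gflops(alus_busy_per_cycle: float) -> float:
    """Peak GFLOPS for a given average number of ALUs (0..2)
    issuing a floating-point operation per lane per clock cycle."""
    flops_per_second = NUM_QPUS * LANES_PER_QPU * alus_busy_per_cycle * CLOCK_HZ
    return flops_per_second / 1e9


both_alus = peak_gflops(2.0)  # add ALU and mul ALU both busy every cycle
one_alu = peak_gflops(1.0)    # only one ALU doing float work per cycle

print(f"both ALUs busy: {both_alus:.1f} GFLOPS")
print(f"one ALU busy:   {one_alu:.1f} GFLOPS")
```

Under these assumptions, code that keeps only one of the two ALUs busy with floating-point work is capped at half of the theoretical peak before memory bandwidth or instruction cache effects even come into play.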
Hey @doe300, I couldn't find any other way to contact you, and I would like to share my master's thesis research on VC4CL:
@pfoof, very interesting read, thanks for sharing! I would have hoped the Raspberry Pi fares better with power/computation, but I guess I just have to try to improve the performance 😉 I definitely have to look at your thesis in more detail, especially at the detailed benchmarks, result interpretations and comparisons between Raspberry Pi CPU and GPU performance!
I am curious about performance measurements and theoretical performance numbers. The often stated theoretical performance of the VideoCore IV is 24 GFLOPS.
The author of py-videocore manages to get to 8.32 GFLOPS with hand-optimized code:
https://qiita.com/9_ties/items/e0fdd165c1c7df6bb8ee
The fastest claimed measurement with clpeak using VC4CL is also just above 8 GFLOPS. On my Raspberry Pi, I measure about 6.3 GFLOPS.
So even a synthetic benchmark and hand-optimized code reach only about one third of the theoretical performance. For desktop GPUs, clpeak mostly finds about the same performance as stated by the manufacturer. Where does this large performance gap come from?
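As a quick check of the "about one third" estimate, the figures quoted in this thread can be put against the 24 GFLOPS number. This is only arithmetic on the reported values; the dictionary labels and the helper `fraction_of_peak` are illustrative names, not part of any tool mentioned here.

```python
THEORETICAL_GFLOPS = 24.0  # often stated peak for the VideoCore IV

# Measurements as quoted in this thread, in GFLOPS.
measured = {
    "py-videocore, hand-optimized": 8.32,
    "clpeak with VC4CL, reporter's Raspberry Pi": 6.3,
    "FlopsCL with float16 (from the comments)": 13.62,
}


def fraction_of_peak(gflops: float, peak: float = THEORETICAL_GFLOPS) -> float:
    """Return the measured throughput as a fraction of the theoretical peak."""
    return gflops / peak


fractions = {name: fraction_of_peak(value) for name, value in measured.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of peak")
```

The hand-optimized and clpeak results land between roughly a quarter and a third of the stated peak, while the FlopsCL float16 figure from the comments is a little over half.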