Benchmarks and issues on Starfive Visionfive-2 with PowerVR B-Series BXE-4-32 #29
-
Looks like it would indeed need some adjustments. For example, the local memory size of 4K is quite small, so it wouldn't allow Winograd and would probably restrict the block size for matrix multiplication/convolution.
Note that dlprimitives includes an extensive benchmark and testing suite - that is a good place to start. Make sure you build dlprimitives standalone as well. Thanks
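For reference, the local memory figure (and the other device limits mentioned above) can be read out of clinfo or with a short pyopencl query; a minimal sketch, assuming pyopencl is installed:

```python
import pyopencl as cl

# List the local memory size reported by every OpenCL device on the system;
# the BXE-4-32 is expected to report roughly 4 KiB here.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(device.name, "-", device.local_mem_size, "bytes of local memory")
```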
-
Additional note: I am using extra heat sinks, so there should be no thermal throttling during longer runs; so far I have never seen a package temperature above 65°C.
-
OK, looking into the device manual I can see that it can reach a theoretical capacity of 35.4 GFLOPS at a 594 MHz clock.
I can already see that in my measurements I get only around 50% of the theoretical FLOPS - but that may be a limit of the test or of the GPU. Now let's try some tuning:
To check performance, run the GEMM benchmark.
That would run only the 512x512x512 GEMM test, so you can see the result and how many FLOPS you can reach. You can also try to find optimal parameters for GEMM (and thus for convolution!)
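If the exact dlprimitives benchmark invocation is not at hand, a rough cross-check of the achieved 512x512x512 GEMM throughput can also be done from PyTorch through the pytorch_dlprim backend. This is only a sketch, not the dlprimitives tool itself; the "ocl:0" device string is an assumption and depends on how your pytorch_dlprim build registers the device (older builds use "privateuseone:0"):

```python
import time
import torch

# Assumption: adjust the device string to whatever your pytorch_dlprim build registers.
device = torch.device("ocl:0")
n = 512
a = torch.randn(n, n, device=device)
b = torch.randn(n, n, device=device)

# Warm up so that kernel compilation is not included in the measurement.
for _ in range(3):
    torch.matmul(a, b)
torch.matmul(a, b).cpu()  # crude synchronization via a copy back to the host

iters = 20
start = time.time()
for _ in range(iters):
    c = torch.matmul(a, b)
c.cpu()  # force completion of the queued kernels
elapsed = time.time() - start

# A square n x n x n GEMM performs 2*n^3 floating point operations.
print(f"~{2 * n**3 * iters / elapsed / 1e9:.1f} GFLOPS")
```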
-
This thread is to present and discuss the results I get when running dlprimitives & pytorch_dlprim on my Visionfive2. Unfortunately, Imagination is very secretive about the actual specs of the B-Series BXE-4-32 GPU.
Starfive, at least, is not secretive at all about their hardware, so I know that the GPU is part of the same physical chip as the CPU, inside the JH7110 package. (Images: JH7110 diagram, full board view)
You can find the full clinfo here: https://gist.github.com/JackTemaki/f9e86cae57e00569692ab495ce7b8965
So it does share the same memory with the CPU. It is maybe also worth mentioning that I have the 8GB version of the board.
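Whether the device actually advertises host-unified memory (and how much global memory it exposes) can be confirmed with clinfo or a short pyopencl query; a minimal sketch:

```python
import pyopencl as cl

# Report unified-memory support and global memory size for each GPU device;
# on the 8GB board the global memory is shared with the CPU.
for platform in cl.get_platforms():
    for device in platform.get_devices(device_type=cl.device_type.GPU):
        print(device.name,
              "| host-unified memory:", bool(device.host_unified_memory),
              "| global memory:", device.global_mem_size // (1024 ** 2), "MiB")
```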
I will try to run all available tests/benchmarks tomorrow, but I can already say that for the MNIST forward pass (gradients & loss deactivated; copying data to the device, forwarding, copying back, displaying the results after detach) the speedup is marginal.
The average step time when using all 4 CPU cores is 0.35 seconds (~300% CPU utilization), while with the GPU it is 0.3 seconds at 50% CPU utilization.
So the code was basically (within the provided mnist.py script):
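A minimal sketch of what that inference-only loop looks like, reconstructed from the description above (the Net model, test_loader, and the "ocl:0" device string are assumptions taken from a typical mnist.py setup, not the original code):

```python
import torch

# Assumption: device string depends on how the pytorch_dlprim build registers
# itself, e.g. "ocl:0" for newer builds or "privateuseone:0" for older ones.
device = torch.device("ocl:0")

model = Net().to(device)  # Net as defined in the provided mnist.py
model.eval()

with torch.no_grad():  # gradients & loss deactivated
    for data, _ in test_loader:
        data = data.to(device)            # copy the batch to the device
        output = model(data)              # forward pass only
        pred = output.detach().cpu()      # copy back to the host after detach
        print(pred.argmax(dim=1)[:10])    # display (part of) the results
```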