Benchmarks and issues on Starfive Visionfive-2 with PowerVR B-Series BXE-4-32 #29
-
Looks like it would indeed need some adjustments. For example, the local memory size of 4K is quite small, so it wouldn't allow Winograd and would probably restrict the block size for matrix multiplication/convolution.
Note that dlprimitives includes an extensive benchmark and testing suite - that is a good place to start. Make sure you build dlprimitives standalone as well. Thanks
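For reference, the local memory figure (and the other device limits mentioned above) can be read out of clinfo or with a short pyopencl query; a minimal sketch, assuming pyopencl is installed:

```python
import pyopencl as cl

# List the local memory size reported by every OpenCL device on the system;
# the BXE-4-32 is expected to report roughly 4 KiB here.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(device.name, "-", device.local_mem_size, "bytes of local memory")
```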
-
Additional note: I am using extra heat sinks, so there should be no thermal throttling during longer runs; so far I have never seen a package temperature above 65°C.
-
OK, looking into the device manual I can see that it can reach a theoretical capacity of 35.4 GFLOPS at a 594 MHz clock.
I can already see that in my measurements I get only around 50% of the theoretical FLOPS - but that may be a limit of the test or of the GPU. Now let's try some tuning:
To check performance, run the GEMM benchmark.
That would run only the 512x512x512 GEMM test, so you can see the result and how many FLOPS you can reach. You can also try to find optimal parameters for GEMM (and thus for convolution!)
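If the exact dlprimitives benchmark invocation is not at hand, a rough cross-check of the achieved 512x512x512 GEMM throughput can also be done from PyTorch through the pytorch_dlprim backend. This is only a sketch, not the dlprimitives tool itself; the "ocl:0" device string is an assumption and depends on how your pytorch_dlprim build registers the device (older builds use "privateuseone:0"):

```python
import time
import torch

# Assumption: adjust the device string to whatever your pytorch_dlprim build registers.
device = torch.device("ocl:0")
n = 512
a = torch.randn(n, n, device=device)
b = torch.randn(n, n, device=device)

# Warm up so that kernel compilation is not included in the measurement.
for _ in range(3):
    torch.matmul(a, b)
torch.matmul(a, b).cpu()  # crude synchronization via a copy back to the host

iters = 20
start = time.time()
for _ in range(iters):
    c = torch.matmul(a, b)
c.cpu()  # force completion of the queued kernels
elapsed = time.time() - start

# A square n x n x n GEMM performs 2*n^3 floating point operations.
print(f"~{2 * n**3 * iters / elapsed / 1e9:.1f} GFLOPS")
```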
-
This thread is to present and discuss the results I get when running dlprimitives & pytorch_dlprim on my Visionfive2. Unfortunately, Imagination is very secretive about the actual specs of the B-Series BXE-4-32 GPU.
Starfive, at least, is not secretive at all about their hardware, so I know that the GPU is part of the same physical chip as the CPU, inside the JH7110 package. (Images: JH7110 diagram, full board view)
You can find the full clinfo here: https://gist.github.com/JackTemaki/f9e86cae57e00569692ab495ce7b8965
So it does share the same memory with the CPU. It is maybe also worth mentioning that I have the 8GB version of the board.
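Whether the device actually advertises host-unified memory (and how much global memory it exposes) can be confirmed with clinfo or a short pyopencl query; a minimal sketch:

```python
import pyopencl as cl

# Report unified-memory support and global memory size for each GPU device;
# on the 8GB board the global memory is shared with the CPU.
for platform in cl.get_platforms():
    for device in platform.get_devices(device_type=cl.device_type.GPU):
        print(device.name,
              "| host-unified memory:", bool(device.host_unified_memory),
              "| global memory:", device.global_mem_size // (1024 ** 2), "MiB")
```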
I will try to run all available tests/benchmarks tomorrow, but I can already say that for the MNIST forward pass (gradients & loss deactivated; copying data to the device, forwarding, copying back, displaying the results after detach) the speedup is marginal.
The average step time when using all 4 CPU cores is 0.35 seconds (~300% CPU utilization), while with the GPU it is 0.3 seconds at 50% CPU utilization.
So the code was basically (within the provided mnist.py script):
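A minimal sketch of what that inference-only loop looks like, reconstructed from the description above (the Net model, test_loader, and the "ocl:0" device string are assumptions taken from a typical mnist.py setup, not the original code):

```python
import torch

# Assumption: device string depends on how the pytorch_dlprim build registers
# itself, e.g. "ocl:0" for newer builds or "privateuseone:0" for older ones.
device = torch.device("ocl:0")

model = Net().to(device)  # Net as defined in the provided mnist.py
model.eval()

with torch.no_grad():  # gradients & loss deactivated
    for data, _ in test_loader:
        data = data.to(device)            # copy the batch to the device
        output = model(data)              # forward pass only
        pred = output.detach().cpu()      # copy back to the host after detach
        print(pred.argmax(dim=1)[:10])    # display (part of) the results
```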