rvv-bench: XiangShan performance problems #3200

camel-cdr · 2024-07-14T23:03:13Z

Before start

I have read the XiangShan Documents. 我已经阅读过香山文档。
I have searched the previous issues and did not find anything relevant. 我已经搜索过之前的 issue，并没有找到相关的。
I have searched the previous discussions and did not find anything relevant. 我已经搜索过之前的 discussions，并没有找到相关的。

Describe you problem

XiangShan performs unexpectedly bad for some cases described below.

What did you do before

There isn't much I could do, see Additional context for the full detail.

Environment

XiangShan branch: master
XiangShan commit id: 1461d8f
gcc version: 12.3.0

Additional context

Hi, I've finally got most of the code from my benchmark to run on the XiangShan rtl simulation.

While the performance is promising, XiangShan is quite slow compared to other implementations in some of the benchmarks.

You can view the results here and compare it to the C910 from XuanTie here.

The benchmarks that didn't run aren't included in the results, and I'll try to create separate issues for those once I looked at them in more detail. Build instruction are on the benchmark page, I build the DefaultConfig with DRAMsim3 from the master branch on the 2024-07-13.

Note: for future readers, once the website updates, you can still find the older results under this commit

Performance comparison to C910

Let's start with the good results, in the byteswap, LUT4 and *ascii to utf16/utf32 benchmarks XiangShan cleanly outperforms the C910 in scalar and vector as would be expected.

*On ascii to utf16/utf32 the segmented load/store implementation is a lot slower than the C910, but AFAIK the complex load/stores aren't optimized on XiangShan yet.

memcpy and memset are slow for LMUL<8

For memset, the fastest RVV implementation on XiangShan is about 2x faster than the fastest one for the C910.
On memcpy the fastest XiangShan RVV implementation is actually a bit slower than the fastest C910 implementation.

Note: You can toggle the selected benchmark in the legend of the graphs by clicking on them.

However, XiangShan performs very badly with smaller LMUL, on both memcpy and memset.
LMUL=1 memcpy (rvv_m1) is 5x slower on XiangShan than on the C910, and LMUL=1 memset is ~1.8x slower.

Compare the memset rvv_m1, and rvv_tail_m1 implementations, and notice that rvv_tail_m1 matches the optimal performance of rvv_m8. rvv_m1 is just a simple not unrolled, LMUL=1 vse8.v strip mining loop, rvv_tail_m1 is equivalent, but it moves the vsetvli outside the loop and only operates on vlmax inside the loop.
The performance difference indicates that XiangShan currently handles vsetvli instructions very inefficiently.

strlen and utf8 count: anything involving masks is slow

I'm not sure why the RVV strlen implementations, even the one that isn't using vle8ff.v, are slower than a SWAR (musl) implementation. Both RVV implementations are about 2.5x slower than on the C910.

Similarly in utf8 count, the RVV implementation is surprisingly slow compared to C910, the C910 is >3x faster.
This doesn't make much sense to me, since changing LMUL, unrolling the loop, or moving vsetvli outside the loop, don't impact performance at all, which is opposed to the observations in memset/memcpy.
The only difference that could explain the performance problem that I can see is that both operate on vector masks. Maybe that introduces a weird dependency in XiangShan?

no idea why these are slow, might be a mixture of the above

The C910 outperforms XiangShan in scalar code for the mergelines 2/3 benchmark, where 2/3 characters are detected and removed. For the cases where removal is less frequent, XiangShan performs better.
On the vectorized code the C910 always beats XiangShan, since the code makes heavy use of masks this is probably the explanation for that.

~~For mandelbrot I again have no idea what's going on in scalar. XiangShan is almost 2x slower than the C910, and only slightly faster than the X60.~~
The vectorized versions are also about 2x slower than the C910, and even the in-order X60 that has a VLEN=256, but XiangShan should be beating both of those given its performance target. The inner loop uses multiple vsetvlis, and a vector mask, that could again be the cause for the slow performance.

XiangShan outperforms the C910 on scalar poly1305 as expected, however the vectorized implementation is once again about 2x slower than on the C910.
Here, the hop loop doesn't use vector mask vector masks, nor vsetvli. It does use one vlseg4e32, but that should be overshadowed by the other vector operations.

Conclusion

Please take a look at the benchmark results your self, and maybe reproduce it for further investigation.

I think that currently, XiangShan has a big problem with handling vsetvli and operations on vector masks efficiently.
This should be investigated, and once fixed it's probably better to redo the measurements, since this will have an impact on almost all vectorized implementations.

The one case ~~two cases~~ where the scalar code is slower are quite weird, ~~especially the mandelbrot one should be investigated.~~
I've attached the scalar assembly code for both since I used a different compiler versions to compile for the C910.

The text was updated successfully, but these errors were encountered:

Anzooooo · 2024-07-15T13:12:33Z

@camel-cdr
We appreciate your testing and finding problems. We will investigate the cause and optimize it as soon as possible.
At present, we plan to modify the execution logic of the segment instruction and solve the blocking problem of vsetvli instruction decoding, which will bring some performance improvement.
Thank you again for your attention and support to Xiangshan. We will reply if there is any new progress in the future.

camel-cdr · 2024-08-25T17:08:40Z

FYI, the lacking scalar mandelbrot performance was a measurement error on my part.
I introduced guards to the vectorizor into the mandelbort implementation such that it can't be vectorized, but the way I did it slowed down the scalar codegen considerably. My C910 measurements where done without this change.

Now it performs as expected, and even beats my, admitedly quite old, desktop:

scalar fp32 mandelbrot 64x64 with 64 iterations:
Zen1 1600x:  1264882 cycles
XiangShanV2: 1361856 cycles
XiangShanV3: 1011363 cycles

I'll try to rerun the full benchmark and update the website soon.

camel-cdr · 2024-09-04T18:04:05Z

I've uploaded updated measurements from yesterday's master branch: https://camel-cdr.github.io/rvv-bench-results/xiangshanv3/
Now all benchmarks ran successfully, but there wasn't any noticeable change in vector performance.
The scalar poly1305 benchmark was somehow speed up by 2x.

Here is a quick summary:

benchmark	speedup over scalar
memcpy	good speedup at LMUL=8, struggles with LMUL<8
memset	matches 64-bit GPR memset with LMUL=8, slower for LMUL<8
strlen	all slower than 64-bit GPR strlen, vle8ff.v very slow
utf8_count	matches 64-bit GPR utf8_count at all LMUL
mergeliens	good speedup; higher LMUL a lot better, except for rvv_vslide
mandelbrot	good speedup at LMUL=2, moderate speedup at LMUL=1
byteswap	good speedup from unrolled LMUL=1 vrgathers
	non unrolled LMUL=1 vrgather should've also given a speedup, but didn't
LUT4	good speedup from vrgather, vluxei/vloxei match scalar at LMUL=2, and are faster at LMUL>2
ascii_to_ut16	good speedup from LMUL>1 rvv_ext, rvv_vss roughly matches scalar, rvv_vsseg is a lot slower
ascii_to_ut32	same as above
chacha20	a lot slower than scalar
poly1305	slower than scalar

camel-cdr added the problem Problem requiring help label Jul 14, 2024

Anzooooo self-assigned this Jul 15, 2024

camel-cdr mentioned this issue Aug 13, 2024

Result for Banana Pi BPI-F3 camel-cdr/rvv-bench-results#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rvv-bench: XiangShan performance problems #3200

rvv-bench: XiangShan performance problems #3200

camel-cdr commented Jul 14, 2024 •

edited

Loading

Anzooooo commented Jul 15, 2024

camel-cdr commented Aug 25, 2024

camel-cdr commented Sep 4, 2024 •

edited

Loading

rvv-bench: XiangShan performance problems #3200

rvv-bench: XiangShan performance problems #3200

Comments

camel-cdr commented Jul 14, 2024 • edited Loading

Before start

Describe you problem

What did you do before

Environment

Additional context

Performance comparison to C910

memcpy and memset are slow for LMUL<8

strlen and utf8 count: anything involving masks is slow

no idea why these are slow, might be a mixture of the above

Conclusion

Anzooooo commented Jul 15, 2024

camel-cdr commented Aug 25, 2024

camel-cdr commented Sep 4, 2024 • edited Loading

camel-cdr commented Jul 14, 2024 •

edited

Loading

camel-cdr commented Sep 4, 2024 •

edited

Loading