rvv-bench: XiangShan performance problems #3200

Open
3 tasks done
camel-cdr opened this issue Jul 14, 2024 · 3 comments

@camel-cdr
Contributor

camel-cdr commented Jul 14, 2024

Before start

  • I have read the XiangShan Documents.
  • I have searched the previous issues and did not find anything relevant.
  • I have searched the previous discussions and did not find anything relevant.

Describe your problem

XiangShan performs unexpectedly badly in some of the cases described below.

What did you do before

There isn't much I could do; see Additional context for the full details.

Environment

  • XiangShan branch: master
  • XiangShan commit id: 1461d8f
  • gcc version: 12.3.0

Additional context

Hi, I've finally got most of the code from my benchmark to run on the XiangShan RTL simulation.

While the performance is promising, XiangShan is quite slow compared to other implementations in some of the benchmarks.

You can view the results here and compare them to the C910 from XuanTie here.

The benchmarks that didn't run aren't included in the results; I'll try to create separate issues for those once I've looked at them in more detail. Build instructions are on the benchmark page. I built the DefaultConfig with DRAMsim3 from the master branch on 2024-07-13.

Note for future readers: once the website updates, you can still find the older results under this commit.

Performance comparison to C910

Let's start with the good results: in the byteswap, LUT4, and *ascii to utf16/utf32 benchmarks, XiangShan cleanly outperforms the C910 in both scalar and vector code, as would be expected.

*On ascii to utf16/utf32 the segmented load/store implementation is a lot slower than on the C910, but AFAIK the complex loads/stores aren't optimized on XiangShan yet.
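
Not the benchmark's exact code, but a minimal sketch of what the segmented-store variant of ascii to utf16 presumably boils down to (each ASCII byte gets paired with a zero high byte via vsseg2e8.v; the little-endian destination and the register assignments are my assumptions):

    # a0 = uint16_t *dst, a1 = const char *src, a2 = byte count
        vsetvli    t0, a2, e8, m1, ta, ma
        vmv.v.i    v9, 0              # the high byte of every utf16 code unit is zero
    1:
        vsetvli    t0, a2, e8, m1, ta, ma
        vle8.v     v8, (a1)           # load a chunk of ascii bytes
        vsseg2e8.v v8, (a0)           # interleaved store: ascii, 0, ascii, 0, ...
        add        a1, a1, t0
        slli       t1, t0, 1
        add        a0, a0, t1         # dst advances 2 bytes per element
        sub        a2, a2, t0
        bnez       a2, 1b

The point is that every iteration goes through the segmented store, so an unoptimized vsseg path dominates the whole loop.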

memcpy and memset are slow for LMUL<8

For memset, the fastest RVV implementation on XiangShan is about 2x faster than the fastest one for the C910.
On memcpy the fastest XiangShan RVV implementation is actually a bit slower than the fastest C910 implementation.

Note: You can toggle the selected benchmark in the legend of the graphs by clicking on them.

However, XiangShan performs very badly with smaller LMUL, on both memcpy and memset.
LMUL=1 memcpy (rvv_m1) is 5x slower on XiangShan than on the C910, and LMUL=1 memset is ~1.8x slower.

Compare the memset rvv_m1 and rvv_tail_m1 implementations, and notice that rvv_tail_m1 matches the optimal performance of rvv_m8. rvv_m1 is just a simple, non-unrolled, LMUL=1 vse8.v strip-mining loop; rvv_tail_m1 is equivalent, but it moves the vsetvli outside the loop and only operates on vlmax inside the loop.
The performance difference indicates that XiangShan currently handles vsetvli instructions very inefficiently.
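
To make that concrete, here is roughly the shape of the two loops (a sketch under my own register assignments, not the benchmark's exact code):

    # rvv_m1: vsetvli re-executed on every iteration
    # a0 = dst, a1 = fill byte, a2 = n
        vsetvli  t0, a2, e8, m1, ta, ma
        vmv.v.x  v8, a1
    1:
        vse8.v   v8, (a0)
        add      a0, a0, t0
        sub      a2, a2, t0
        vsetvli  t0, a2, e8, m1, ta, ma
        bnez     a2, 1b

    # rvv_tail_m1: vsetvli hoisted, the loop body always stores vlmax bytes
    # (assumes n >= vlmax; the remainder is handled by one final vsetvli)
        vsetvli  t0, zero, e8, m1, ta, ma   # t0 = vlmax
        vmv.v.x  v8, a1
    1:
        vse8.v   v8, (a0)
        add      a0, a0, t0
        sub      a2, a2, t0
        bgeu     a2, t0, 1b
        vsetvli  t0, a2, e8, m1, ta, ma     # tail of fewer than vlmax bytes
        vse8.v   v8, (a0)

If vsetvli were cheap, the two loops should run at essentially the same speed, since they execute almost the same stores.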

strlen and utf8 count: anything involving masks is slow

I'm not sure why the RVV strlen implementations, even the one that isn't using vle8ff.v, are slower than a SWAR (musl) implementation. Both RVV implementations are about 2.5x slower than on the C910.

Similarly, in utf8 count the RVV implementation is surprisingly slow compared to the C910, which is >3x faster.
This doesn't make much sense to me, since changing LMUL, unrolling the loop, or moving vsetvli outside the loop doesn't impact performance at all, which is the opposite of what I observed in memset/memcpy.
The only difference I can see that could explain the performance problem is that both operate on vector masks. Maybe that introduces a weird dependency in XiangShan?
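
For context, the core of a typical RVV utf8 count loop looks roughly like the sketch below (my own register choices, not the benchmark's exact code); the strlen loops are similar, just with vmseq.vi plus vfirst.m instead of vcpop.m. Every iteration produces a mask and immediately consumes it:

    # a0 = src, a1 = byte count, a2 = character count (initialized to 0)
        li       t1, -65                 # signed bytes > -65 are not utf8 continuation bytes
    1:
        vsetvli  t0, a1, e8, m8, ta, ma
        vle8.v   v8, (a0)
        vmsgt.vx v0, v8, t1              # mask of non-continuation bytes
        vcpop.m  t2, v0                  # count the set mask bits
        add      a2, a2, t2
        add      a0, a0, t0
        sub      a1, a1, t0
        bnez     a1, 1b

If producing or consuming masks serializes execution, this loop stalls no matter how LMUL or unrolling is changed, which would match the observation above.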

no idea why these are slow, might be a mixture of the above

The C910 outperforms XiangShan in scalar code for the mergelines 2/3 benchmark, where 2/3 of the characters are detected and removed. For the cases where removal is less frequent, XiangShan performs better.
In the vectorized code the C910 always beats XiangShan; since the code makes heavy use of masks, that is probably the explanation.

For mandelbrot I again have no idea what's going on in scalar: XiangShan is almost 2x slower than the C910, and only slightly faster than the X60.
The vectorized versions are also about 2x slower than on the C910, and even than on the in-order X60 with VLEN=256, but XiangShan should be beating both of those given its performance target. The inner loop uses multiple vsetvlis and a vector mask, which could again be the cause of the slow performance.
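
I don't know the exact shape of the inner loop, but the per-iteration escape test presumably comes down to something like this (a made-up fp32 sketch, assuming ft0 holds 4.0 and the iteration-count cap is omitted):

    # v8 = re(z), v9 = im(z), v12 = per-lane iteration counters
    1:
        # ... one mandelbrot step on the vector of points ...
        vfmul.vv  v10, v8, v8             # re(z)^2
        vfmacc.vv v10, v9, v9             # + im(z)^2  ->  |z|^2
        vmfle.vf  v0, v10, ft0            # mask of lanes still inside the escape radius
        vadd.vi   v12, v12, 1, v0.t       # bump the counters of the active lanes only
        vcpop.m   t1, v0
        bnez      t1, 1b                  # keep iterating while any lane is still active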

XiangShan outperforms the C910 on scalar poly1305 as expected; however, the vectorized implementation is once again about 2x slower than on the C910.
Here, the hot loop uses neither vector masks nor vsetvli. It does use one vlseg4e32, but that should be overshadowed by the other vector operations.
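
For reference, vlseg4e32 de-interleaves four 32-bit fields per element into four destination registers in a single instruction; a sketch with made-up registers:

    # a0 points at packed groups of four 32-bit words, a1 = number of groups
        vsetvli     t0, a1, e32, m1, ta, ma
        vlseg4e32.v v8, (a0)    # v8 = field 0, v9 = field 1, v10 = field 2, v11 = field 3

Even if XiangShan currently expands this into many µops, it is one instruction next to a lot of other vector arithmetic, so, as said above, it shouldn't dominate the loop.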

Conclusion

Please take a look at the benchmark results yourself, and maybe reproduce them for further investigation.

I think that currently, XiangShan has a big problem with handling vsetvli and operations on vector masks efficiently.
This should be investigated, and once fixed it's probably better to redo the measurements, since this will have an impact on almost all vectorized implementations.

The two cases where the scalar code is slower are quite weird; especially the mandelbrot one should be investigated.
I've attached the scalar assembly code for both, since I used a different compiler version to compile for the C910.

@camel-cdr camel-cdr added the problem Problem requiring help label Jul 14, 2024
@Anzooooo Anzooooo self-assigned this Jul 15, 2024
@Anzooooo
Member

@camel-cdr
We appreciate your testing and the problems you found. We will investigate the causes and optimize them as soon as possible.
At present, we plan to modify the execution logic of the segment instructions and solve the blocking problem of vsetvli instruction decoding, which should bring some performance improvement.
Thank you again for your attention and support for XiangShan. We will reply if there is any new progress.

@camel-cdr
Contributor Author

FYI, the lacking scalar mandelbrot performance was a measurement error on my part.
I introduced guards against the vectorizer into the mandelbrot implementation so that it can't be vectorized, but the way I did it slowed down the scalar codegen considerably. My C910 measurements were done without this change.

Now it performs as expected, and even beats my, admittedly quite old, desktop:

scalar fp32 mandelbrot 64x64 with 64 iterations:
Zen1 1600x:  1264882 cycles
XiangShanV2: 1361856 cycles
XiangShanV3: 1011363 cycles

I'll try to rerun the full benchmark and update the website soon.

@camel-cdr
Contributor Author

camel-cdr commented Sep 4, 2024

I've uploaded updated measurements from yesterday's master branch: https://camel-cdr.github.io/rvv-bench-results/xiangshanv3/
Now all benchmarks ran successfully, but there wasn't any noticeable change in vector performance.
The scalar poly1305 benchmark was somehow sped up by 2x.

Here is a quick summary:

benchmark: speedup over scalar
memcpy: good speedup at LMUL=8, struggles with LMUL<8
memset: matches 64-bit GPR memset with LMUL=8, slower for LMUL<8
strlen: all slower than 64-bit GPR strlen, vle8ff.v very slow
utf8_count: matches 64-bit GPR utf8_count at all LMUL
mergelines: good speedup; higher LMUL a lot better, except for rvv_vslide
mandelbrot: good speedup at LMUL=2, moderate speedup at LMUL=1
byteswap: good speedup from unrolled LMUL=1 vrgathers; the non-unrolled LMUL=1 vrgather should've also given a speedup, but didn't
LUT4: good speedup from vrgather; vluxei/vloxei match scalar at LMUL=2 and are faster at LMUL>2
ascii_to_utf16: good speedup from LMUL>1 rvv_ext, rvv_vss roughly matches scalar, rvv_vsseg is a lot slower
ascii_to_utf32: same as above
chacha20: a lot slower than scalar
poly1305: slower than scalar
