Implement checksum computation using DynASM #1275
Conversation
Force-pushed from 35956d7 to 5c077f9.
Neat. Note, I think one of the virtues of the current generic implementation is its simplicity; you get to check optimized results against something that's more or less comprehensible and more or less easy to verify against the reference implementation. But perhaps that isn't so important. Where do you see this going? |
OK, I think you have a point. If the new library eventually replaces the current one, it would be nice to have a simple version of the algorithm that is easy to grasp and that results can be compared against. Actually, I had an implementation in plain Lua to compare the results of the new implementation against, but at the last minute I decided to compare against the current implementation. I pushed a new commit with the Lua implementation. |
This sounds like a great idea! But as this is explored further, please keep in mind the impact of AVX2 (and also AVX512, should it be used in the future) on CPU frequency scaling, as it is a potential source of jitter. There's some discussion here: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/ |
I modified the main loop to sum two 64-bit words on each iteration, in other words, summing at 16-byte strides. Then I added a new waterfall level to handle 8-byte offsets. Also, handling the remaining bytes no longer requires a loop. With those changes the generic implementation is better than the SSE implementation in all cases:
|
@tobyriddell Thanks for the pointer, it was an interesting read. According to the article, it seems that per-core frequency decreases when using AVX/AVX2 multiplication instructions. OTOH, computing the checksum only involves addition and shift instructions, which do not seem to degrade per-core frequency. From the article:
|
Added a new level of loop unrolling with 4 qwords. I learned there's some work already done by @lukego on a similar PR #899; Luke already rewrote an AVX2 version of the algorithm in DynASM. Perhaps both issues could get combined somehow. IMHO it's not worth writing an SSE version of the algorithm, as the generic version with 4-qword loop unrolling is already better than the SSE version in all cases. However, it's probably worth having an AVX version of the algorithm that makes use of the YMM registers. Benchmarks for 4-qword unrolling:
$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen: 0.122729
SSE2: 0.125478
AVX2: 0.140538
New: 0.077302
2M; 550 bytes
Gen: 0.20596
SSE2: 0.098654
AVX2: 0.049787
New: 0.047055
1M; 1500 bytes
Gen: 0.273965
SSE2: 0.100557
AVX2: 0.058187
New: 0.068768 |
It would be nice if the benchmarks printed comparable numbers -- nanoseconds per byte and nanoseconds per checksum. |
Updated results with nanoseconds per byte and nanoseconds per checksum.
|
What system did you use to check the timings? Can you try with the E5-2620v3 servers we have and also with a skylake? I guess a skylake laptop or desktop, given that I don't think we have skylake xeon servers yet. In the context of #1194 (comment) I think we should probably remove the AVX2 and SSE variants. Probably need to add a "snabbmark" case for this hash versus the C hash. I think also you need to do a randomized test to make sure this version computes the same as the reference one written in C; i.e. for all lengths from the min to the max, generate a few random packets, compute checksum via C and dynasm, and assert dynasm result equals C. |
I think the results I posted were from the E5-2620v3, but in any case I ran the benchmark again:
E5-2620v3 (Haswell)
$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen: elapse: 0.126402; ns_per_csum: 87.78; ns_per_byte: 1.99
SSE2: elapse: 0.135262; ns_per_csum: 93.93; ns_per_byte: 2.13
AVX2: elapse: 0.130680; ns_per_csum: 90.75; ns_per_byte: 2.06
New: elapse: 0.078119; ns_per_csum: 54.25; ns_per_byte: 1.23
2M; 550 bytes
Gen: elapse: 0.212166; ns_per_csum: 1060.83; ns_per_byte: 1.93
SSE2: elapse: 0.095822; ns_per_csum: 479.11; ns_per_byte: 0.87
AVX2: elapse: 0.054535; ns_per_csum: 272.68; ns_per_byte: 0.50
New: elapse: 0.079838; ns_per_csum: 399.19; ns_per_byte: 0.73
1M; 1500 bytes
Gen: elapse: 0.293406; ns_per_csum: 2934.06; ns_per_byte: 1.96
SSE2: elapse: 0.105560; ns_per_csum: 1055.60; ns_per_byte: 0.70
AVX2: elapse: 0.059494; ns_per_csum: 594.94; ns_per_byte: 0.40
New: elapse: 0.115417; ns_per_csum: 1154.17; ns_per_byte: 0.77
Laptop (i7-6700HQ CPU @ 2.60GHz; Skylake)
$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen: elapse: 0.122758; ns_per_csum: 85.25; ns_per_byte: 1.94
SSE2: elapse: 0.121412; ns_per_csum: 84.31; ns_per_byte: 1.92
AVX2: elapse: 0.122230; ns_per_csum: 84.88; ns_per_byte: 1.93
New: elapse: 0.074836; ns_per_csum: 51.97; ns_per_byte: 1.18
2M; 550 bytes
Gen: elapse: 0.215340; ns_per_csum: 1076.70; ns_per_byte: 1.96
SSE2: elapse: 0.089518; ns_per_csum: 447.59; ns_per_byte: 0.81
AVX2: elapse: 0.062259; ns_per_csum: 311.30; ns_per_byte: 0.57
New: elapse: 0.053170; ns_per_csum: 265.85; ns_per_byte: 0.48
1M; 1500 bytes
Gen: elapse: 0.301346; ns_per_csum: 3013.46; ns_per_byte: 2.01
SSE2: elapse: 0.095864; ns_per_csum: 958.64; ns_per_byte: 0.64
AVX2: elapse: 0.065900; ns_per_csum: 659.00; ns_per_byte: 0.44
New: elapse: 0.076165; ns_per_csum: 761.65; ns_per_byte: 0.51
As for the requests, I can tackle those changes, sure. |
The snippets of code that deal with the remaining bytes should be ifs, not whiles. For instance, if there are 7 remaining bytes to sum, that number is decomposed as 4 + 2 + 1. For every number lower than 8, the decomposition is a sum of distinct values, so none of those loops would ever execute more than once.
The change required supporting a third argument 'initial', returning the value in host byte order, and adapting some selftests.
I added a few more commits that address some of the requested changes:
|
|
@dpino In case you haven’t already, I recommend playing with |
Just an example run on my old Ivy Bridge i7-3770 desktop:
On our old E5-2620v3 (Haswell-EP) server:
|
On a Skylake mobile CPU (i7-7500U):
|
For me this is LGTM! |
On an AMD Ryzen 5 1600 (turbo off):
LGTM as well! |
Merged into max-next. |
This is something I've been working on and I thought it was worth sharing. It's a replacement, written in DynASM, for the checksum computation algorithms.
Right now the algorithm is written using intrinsics (checksum.c) and there are 3 versions of the algorithm, each targeting a different architecture: generic, SSE2 and AVX2. So far I only have one generic implementation, but it's better than the current generic version. It's also better than all the versions for small packets, but not for medium and large packets. As a side note, it struck me that AVX architectures (Sandy Bridge) are using the SSE2 version of the algorithm. I wonder if it would be possible to take advantage of the AVX instruction set to write a more specific version for this architecture.
Here are some benchmarks: