
Implement checksum computation using DynASM #1275

Merged: 15 commits, Jun 6, 2019

Conversation

@dpino (Contributor) commented Feb 9, 2018

This is something I've been working on that I thought was worth sharing: a replacement, written in DynASM, for the checksum computation algorithms.

Right now the algorithm is written using intrinsics (checksum.c) and there are three versions, each targeting a different architecture: generic, SSE2 and AVX2. So far I only have one generic implementation, but it's better than the current generic version. It's also better than all of the versions for small packets, though not for medium and large packets.

As a side note, it struck me that AVX architectures (Sandy Bridge) use the SSE2 version of the algorithm. I wonder whether it would be possible to take advantage of the AVX instruction set to write a more specific version for that architecture.
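For context, every variant here computes the standard one's-complement internet checksum (RFC 1071). A minimal reference sketch in C, for illustration only and not the actual checksum.c code:

    #include <stdint.h>
    #include <stddef.h>

    /* One's-complement internet checksum (RFC 1071): sum the data as
       16-bit big-endian words, fold the carries back in, complement. */
    uint16_t cksum_reference(const uint8_t *data, size_t len)
    {
        uint64_t sum = 0;
        while (len > 1) {
            sum += (uint32_t)(data[0] << 8 | data[1]);
            data += 2;
            len -= 2;
        }
        if (len == 1)
            sum += (uint32_t)(data[0] << 8);    /* pad the odd byte with zero */
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16); /* end-around carry fold */
        return (uint16_t)~sum;
    }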

Here are some benchmarks:

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    0.122575
SSE2:   0.125627
AVX2:   0.137349
New:    0.102393
2M; 550 bytes
Gen:    0.19948
SSE2:   0.0907
AVX2:   0.04979
New:    0.108642
1M; 1500 bytes
Gen:    0.27743
SSE2:   0.101574
AVX2:   0.059514
New:    0.128075

@wingo (Contributor) commented Feb 9, 2018

Neat. Note, I think one of the virtues of the current generic implementation is its simplicity; you get to check optimized results against something that's more or less comprehensible and more or less easy to verify against the reference implementation. But perhaps that isn't so important. Where do you see this going?

@dpino (Contributor, Author) commented Feb 9, 2018

OK, I think you've got a point. If the new library eventually replaces the current one, it would be nice to have a simple version of the algorithm that is easy to grasp and can be used to compare results against.

Actually, I had an implementation in plain Lua to compare the results of the new implementation against, but at the last minute I decided to compare against the current implementation instead.

I pushed a new commit with the Lua implementation.

@tobyriddell commented

This sounds like a great idea! But as this is explored further, please keep in mind the impact of AVX2 (and also AVX-512, should it be used in the future) on CPU frequency scaling, as it is a potential source of jitter.

There's some discussion here: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

@dpino (Contributor, Author) commented Feb 9, 2018

I modified the main loop to sum two 64-bit words on each iteration, in other words summing at 16-byte strides. Then I added a new waterfall level to handle 8-byte offsets. Also, handling the remaining bytes no longer requires a loop. With those changes the generic implementation beats the SSE implementation in all cases (a rough C sketch of the loop shape follows the numbers):

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    0.125743
SSE2:   0.126113
AVX2:   0.143601
New:    0.071434
2M; 550 bytes
Gen:    0.192014
SSE2:   0.096879
AVX2:   0.049316
New:    0.067029
1M; 1500 bytes
Gen:    0.276876
SSE2:   0.099367
AVX2:   0.057041
New:    0.082595
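The shape of that new main loop, sketched in C; the unaligned memcpy loads, names and accumulator layout here are my assumptions, not the actual code:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Sketch: two 64-bit additions per iteration (a 16-byte stride),
       with end-around carries tracked in two independent accumulators
       that are folded together at the end. */
    static uint64_t sum_stride16(const uint8_t *p, size_t len)
    {
        uint64_t a = 0, b = 0;
        while (len >= 16) {
            uint64_t x, y;
            memcpy(&x, p, 8);         /* unaligned-safe 64-bit loads */
            memcpy(&y, p + 8, 8);
            a += x; if (a < x) a++;   /* one's-complement (carry-in) add */
            b += y; if (b < y) b++;
            p += 16; len -= 16;
        }
        a += b; if (a < b) a++;
        /* ...then an 8-byte waterfall level and a loop-free byte tail... */
        return a;
    }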

@dpino (Contributor, Author) commented Feb 9, 2018

@tobyriddell Thanks for the pointer, it was an interesting read. According to the article, per-core frequency decreases when using AVX/AVX2 multiplication instructions. OTOH, computing the checksum only involves addition and shift instructions, which do not seem to degrade per-core frequency. From the article:

Another interesting distinction is that ChaCha20-Poly1305 with AVX2 is slightly slower in OpenSSL but is the same in BoringSSL. Why might that be? The reason here is that the BoringSSL code does not use AVX2 multiplication instructions for Poly1305, and only uses simple xor, shift and add operations for ChaCha20, which allows it to run at the base frequency.

@dpino (Contributor, Author) commented Feb 11, 2018

Added a new level of loop unrolling with 4 qwords.

I learned there's some work already done by @lukego on a similar PR, #899: Luke already rewrote an AVX2 version of the algorithm in DynASM. Perhaps both issues could be combined somehow. IMHO it's not worth writing an SSE version of the algorithm, since the generic version with a loop unrolling of 4 qwords is already better than the SSE version in all cases. However, it's probably worth having an AVX version of the algorithm that makes use of the YMM registers.
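A hedged sketch of what such a 4-qword unrolled level can look like, extending the 16-byte sketch above (again, names and layout are mine):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Sketch: 4-qword (32-byte) unrolled level with four independent
       carry-tracking accumulators to expose instruction-level parallelism. */
    static uint64_t sum_stride32(const uint8_t *p, size_t len)
    {
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        while (len >= 32) {
            uint64_t x0, x1, x2, x3;
            memcpy(&x0, p,      8);
            memcpy(&x1, p + 8,  8);
            memcpy(&x2, p + 16, 8);
            memcpy(&x3, p + 24, 8);
            s0 += x0; if (s0 < x0) s0++;  /* end-around carry adds */
            s1 += x1; if (s1 < x1) s1++;
            s2 += x2; if (s2 < x2) s2++;
            s3 += x3; if (s3 < x3) s3++;
            p += 32; len -= 32;
        }
        s0 += s1; if (s0 < s1) s0++;
        s2 += s3; if (s2 < s3) s2++;
        s0 += s2; if (s0 < s2) s0++;
        /* ...fall through to the 16- and 8-byte levels and the tail... */
        return s0;
    }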

Benchmarks for 4 qwords unrolling:

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    0.122729
SSE2:   0.125478
AVX2:   0.140538
New:    0.077302
2M; 550 bytes
Gen:    0.20596
SSE2:   0.098654
AVX2:   0.049787
New:    0.047055
1M; 1500 bytes
Gen:    0.273965
SSE2:   0.100557
AVX2:   0.058187
New:    0.068768

@wingo (Contributor) commented Mar 2, 2018

It would be nice if the benchmarks printed comparable numbers -- nanoseconds per byte and nanoseconds per checksum.
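For reference, converting elapsed time into those figures is simple arithmetic; a C sketch, where report, n, size and elapsed are my placeholder names rather than Snabb code:

    #include <stdio.h>

    /* Derive per-checksum and per-byte figures from elapsed seconds,
       given the number of checksums (n) and the packet size in bytes. */
    static void report(const char *name, double elapsed, double n, double size)
    {
        double ns_per_csum = elapsed * 1e9 / n;
        printf("%s\telapse: %f; ns_per_csum: %.2f; ns_per_byte: %.2f\n",
               name, elapsed, ns_per_csum, ns_per_csum / size);
    }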

@dpino (Contributor, Author) commented Apr 25, 2018

Updated results with nanoseconds per byte and nanoseconds per checksum.

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    elapse: 0.126386; ns_per_csum: 87.77; ns_per_byte: 1.99
SSE2:   elapse: 0.123921; ns_per_csum: 86.06; ns_per_byte: 1.96
AVX2:   elapse: 0.119802; ns_per_csum: 83.20; ns_per_byte: 1.89
New:    elapse: 0.075025; ns_per_csum: 52.10; ns_per_byte: 1.18
2M; 550 bytes
Gen:    elapse: 0.211609; ns_per_csum: 1058.04; ns_per_byte: 1.92
SSE2:   elapse: 0.102081; ns_per_csum: 510.40; ns_per_byte: 0.93
AVX2:   elapse: 0.063685; ns_per_csum: 318.42; ns_per_byte: 0.58
New:    elapse: 0.054159; ns_per_csum: 270.79; ns_per_byte: 0.49
1M; 1500 bytes
Gen:    elapse: 0.291010; ns_per_csum: 2910.10; ns_per_byte: 1.94
SSE2:   elapse: 0.099104; ns_per_csum: 991.04; ns_per_byte: 0.66
AVX2:   elapse: 0.066498; ns_per_csum: 664.98; ns_per_byte: 0.44
New:    elapse: 0.074388; ns_per_csum: 743.88; ns_per_byte: 0.50

@dpino changed the title from "[WIP] Implement checksum computation using DynASM" to "Implement checksum computation using DynASM" on Apr 25, 2018
@dpino force-pushed the checksum-dynasm branch from 311196c to 7b2c42f on April 25, 2018
@dpino closed this on Jun 15, 2018
@dpino deleted the checksum-dynasm branch on June 15, 2018
@dpino restored the checksum-dynasm branch on August 2, 2018
@dpino reopened this on Aug 14, 2018
@wingo (Contributor) commented Aug 14, 2018

What system did you use to check the timings? Can you try with the E5-2620v3 servers we have, and also with a Skylake? I guess a Skylake laptop or desktop, given that I don't think we have Skylake Xeon servers yet. In the context of #1194 (comment) I think we should probably remove the AVX2 and SSE variants. We probably also need to add a "snabbmark" case for this checksum versus the C checksum. I think you also need a randomized test to make sure this version computes the same result as the reference one written in C; i.e. for all lengths from the min to the max, generate a few random packets, compute the checksum via C and DynASM, and assert that the DynASM result equals the C result.
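The randomized equivalence test described here is simple to sketch; cksum_reference and cksum_dynasm below are placeholders for the C and DynASM entry points:

    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    uint16_t cksum_reference(const uint8_t *data, size_t len); /* C version */
    uint16_t cksum_dynasm(const uint8_t *data, size_t len);    /* DynASM version */

    /* For every length from min to max (max <= 2048 here), checksum a few
       random buffers with both implementations and assert they agree. */
    static void selftest_random(size_t min, size_t max)
    {
        uint8_t buf[2048];
        for (size_t len = min; len <= max; len++) {
            for (int trial = 0; trial < 4; trial++) {
                for (size_t i = 0; i < len; i++)
                    buf[i] = (uint8_t)rand();
                assert(cksum_reference(buf, len) == cksum_dynasm(buf, len));
            }
        }
    }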

@dpino (Contributor, Author) commented Aug 14, 2018

I think the results I posted were from the E5-2620v3, but in any case I ran the benchmark again:

E5-2620v3 (Haswell)

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    elapse: 0.126402; ns_per_csum: 87.78; ns_per_byte: 1.99
SSE2:   elapse: 0.135262; ns_per_csum: 93.93; ns_per_byte: 2.13
AVX2:   elapse: 0.130680; ns_per_csum: 90.75; ns_per_byte: 2.06
New:    elapse: 0.078119; ns_per_csum: 54.25; ns_per_byte: 1.23
2M; 550 bytes
Gen:    elapse: 0.212166; ns_per_csum: 1060.83; ns_per_byte: 1.93
SSE2:   elapse: 0.095822; ns_per_csum: 479.11; ns_per_byte: 0.87
AVX2:   elapse: 0.054535; ns_per_csum: 272.68; ns_per_byte: 0.50
New:    elapse: 0.079838; ns_per_csum: 399.19; ns_per_byte: 0.73
1M; 1500 bytes
Gen:    elapse: 0.293406; ns_per_csum: 2934.06; ns_per_byte: 1.96
SSE2:   elapse: 0.105560; ns_per_csum: 1055.60; ns_per_byte: 0.70
AVX2:   elapse: 0.059494; ns_per_csum: 594.94; ns_per_byte: 0.40
New:    elapse: 0.115417; ns_per_csum: 1154.17; ns_per_byte: 0.77

Laptop (i7-6700HQ CPU @ 2.60GHz; Skylake)

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    elapse: 0.122758; ns_per_csum: 85.25; ns_per_byte: 1.94
SSE2:   elapse: 0.121412; ns_per_csum: 84.31; ns_per_byte: 1.92
AVX2:   elapse: 0.122230; ns_per_csum: 84.88; ns_per_byte: 1.93
New:    elapse: 0.074836; ns_per_csum: 51.97; ns_per_byte: 1.18
2M; 550 bytes
Gen:    elapse: 0.215340; ns_per_csum: 1076.70; ns_per_byte: 1.96
SSE2:   elapse: 0.089518; ns_per_csum: 447.59; ns_per_byte: 0.81
AVX2:   elapse: 0.062259; ns_per_csum: 311.30; ns_per_byte: 0.57
New:    elapse: 0.053170; ns_per_csum: 265.85; ns_per_byte: 0.48
1M; 1500 bytes
Gen:    elapse: 0.301346; ns_per_csum: 3013.46; ns_per_byte: 2.01
SSE2:   elapse: 0.095864; ns_per_csum: 958.64; ns_per_byte: 0.64
AVX2:   elapse: 0.065900; ns_per_csum: 659.00; ns_per_byte: 0.44
New:    elapse: 0.076165; ns_per_csum: 761.65; ns_per_byte: 0.51

As for the requests, I can tackle those changes, sure.

dpino added 2 commits on August 23, 2018:
The changes required to support a third argument 'initial', return the value in host byte order, and adapt some selftests.
@dpino (Contributor, Author) commented Aug 23, 2018

I added a few more commits that address some of the requested changes:

  • Removed the AVX2 and SSE2 implementations.
  • Randomized tests. I adapted the selftest in lib.checksum, which validates that the checksum computed by the C algorithm (cksum_generic) is the same as the one computed by ipsum (the DynASM implementation). The selftest in arch.checksum also validates that the checksum computation is correct by comparing the result against an algorithm written in Lua. (I think we should remove the cksum_generic algorithm in the future, as it's no longer used except for validation.)
  • Snabbmark is pending. I plan to move the benchmark in arch.checksum there; a rough shape of such a timing loop is sketched below.
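Not the actual snabbmark code (which is Lua and can use lib.pmu for cycle counts), but the rough shape of such a timing loop in C:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    uint16_t cksum_dynasm(const uint8_t *data, size_t len); /* placeholder */

    /* Time n checksums over a fixed buffer; report nanoseconds per
       iteration and per byte, in the spirit of the snabbmark output
       shown later in the thread. */
    static void bench(const uint8_t *buf, size_t size, long n)
    {
        struct timespec t0, t1;
        volatile uint16_t result = 0;  /* keep the calls from being elided */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < n; i++)
            result = cksum_dynasm(buf, size);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("Size=%zu bytes: %.2f ns per iteration (result: %u); %.2f ns per byte\n",
               size, ns / n, (unsigned)result, ns / n / size);
    }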

@dpino (Contributor, Author) commented Aug 23, 2018

  • Added a checksum subprogram to snabbmark.

@eugeneia (Member) commented

@dpino In case you haven't already, I recommend playing with lib.pmu in that snabbmark. It should be an interesting case study for peeking at the inner workings of the CPU!

@wingo (Contributor) commented Sep 5, 2018

Just an example run on my old Ivy Bridge i7-3770 desktop:

$ sudo taskset -c 2 ./snabb snabbmark checksum
C: Size=44 bytes; MPPS=14 M: 30.16 cycles, 7.88 ns per iteration (result: 30438); 0.18 ns per byte
ASM: Size=44 bytes; MPPS=14 M: 21.35 cycles, 5.60 ns per iteration (result: 30438); 0.13 ns per byte
C: Size=550 bytes; MPPS=2 M: 227.78 cycles, 59.74 ns per iteration (result: 15425); 0.11 ns per byte
ASM: Size=550 bytes; MPPS=2 M: 139.13 cycles, 36.50 ns per iteration (result: 15425); 0.07 ns per byte
C: Size=1516 bytes; MPPS=1 M: 610.06 cycles, 159.51 ns per iteration (result: 8540); 0.11 ns per byte
ASM: Size=1516 bytes; MPPS=1 M: 386.77 cycles, 101.32 ns per iteration (result: 8540); 0.07 ns per byte

On our old E5-2620v3 (Haswell-EP) server:

$ sudo taskset -c 4 ./snabb snabbmark checksum
[pmu /sys/devices/cpu/rdpmc: 1 -> 2]
C: Size=44 bytes; MPPS=14 M: 27.07 cycles, 11.29 ns per iteration (result: 14151); 0.26 ns per byte
ASM: Size=44 bytes; MPPS=14 M: 20.04 cycles, 8.35 ns per iteration (result: 14151); 0.19 ns per byte
C: Size=550 bytes; MPPS=2 M: 342.32 cycles, 142.64 ns per iteration (result: 36504); 0.26 ns per byte
ASM: Size=550 bytes; MPPS=2 M: 133.46 cycles, 55.61 ns per iteration (result: 36504); 0.10 ns per byte
C: Size=1516 bytes; MPPS=1 M: 942.14 cycles, 392.58 ns per iteration (result: 41872); 0.26 ns per byte
ASM: Size=1516 bytes; MPPS=1 M: 380.32 cycles, 158.51 ns per iteration (result: 41872); 0.10 ns per byte

@wingo (Contributor) commented Sep 5, 2018

On a Skylake mobile CPU (i7-7500U):

$ sudo taskset -c 1 ./snabb snabbmark checksum
No PMU available: CPU not recognized: GenuineIntel-6-8E
C: Size=44 bytes; MPPS=14 M: 7.08 ns per iteration (result: 14899); 0.16 ns per byte
ASM: Size=44 bytes; MPPS=14 M: 5.81 ns per iteration (result: 14899); 0.13 ns per byte
C: Size=550 bytes; MPPS=2 M: 93.85 ns per iteration (result: 13253); 0.17 ns per byte
ASM: Size=550 bytes; MPPS=2 M: 24.20 ns per iteration (result: 13253); 0.04 ns per byte
C: Size=1516 bytes; MPPS=1 M: 268.12 ns per iteration (result: 6361); 0.18 ns per byte
ASM: Size=1516 bytes; MPPS=1 M: 69.42 ns per iteration (result: 6361); 0.05 ns per byte

@wingo (Contributor) commented Sep 5, 2018

For me this is LGTM!

@eugeneia (Member) commented Sep 5, 2018

On an AMD Ryzen 5 1600 (turbo off):

dyser$ sudo taskset -c 3 ./snabb snabbmark checksum
[pmu: /sbin/modprobe msr]
sh: /sbin/modprobe: No such file or directory
No PMU available: requires /dev/cpu/*/msr (Linux 'msr' module)
C: Size=44 bytes; MPPS=14 M: 6.60 ns per iteration (result: 24346); 0.15 ns per byte
ASM: Size=44 bytes; MPPS=14 M: 4.39 ns per iteration (result: 24346); 0.10 ns per byte
C: Size=550 bytes; MPPS=2 M: 56.49 ns per iteration (result: 19817); 0.10 ns per byte
ASM: Size=550 bytes; MPPS=2 M: 23.79 ns per iteration (result: 19817); 0.04 ns per byte
C: Size=1516 bytes; MPPS=1 M: 140.99 ns per iteration (result: 52607); 0.09 ns per byte
ASM: Size=1516 bytes; MPPS=1 M: 73.39 ns per iteration (result: 52607); 0.05 ns per byte

LGTM as well!

@eugeneia (Member) commented Jan 7, 2019

Merged into max-next.

@eugeneia added the merged label on Jan 7, 2019
@eugeneia merged commit 542a179 into snabbco:master on Jun 6, 2019