
Implement checksum computation using DynASM #1275

Merged: 15 commits, Jun 6, 2019

Conversation

@dpino (Contributor) commented Feb 9, 2018

This is something I've been working on that I thought was worth sharing: a replacement, written in DynASM, for the checksum computation algorithms.

Right now the algorithm is written using intrinsics (checksum.c) and there are three versions, each targeting a different architecture: generic, SSE2 and AVX2. So far I only have one generic implementation, but it's better than the current generic version. It's also better than all of the versions for small packets, though not for medium and large packets.

As a side note, it struck me that AVX architectures (Sandy Bridge) use the SSE2 version of the algorithm. I wonder whether it would be possible to take advantage of the AVX instruction set to write a more specific version for that architecture.
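For context, every variant here computes the standard one's-complement internet checksum (RFC 1071). A minimal reference sketch in C, for illustration only and not the actual checksum.c code:

    #include <stdint.h>
    #include <stddef.h>

    /* One's-complement internet checksum (RFC 1071): sum the data as
       16-bit big-endian words, fold the carries back in, complement. */
    uint16_t cksum_reference(const uint8_t *data, size_t len)
    {
        uint64_t sum = 0;
        while (len > 1) {
            sum += (uint32_t)(data[0] << 8 | data[1]);
            data += 2;
            len -= 2;
        }
        if (len == 1)
            sum += (uint32_t)(data[0] << 8);    /* pad the odd byte with zero */
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16); /* end-around carry fold */
        return (uint16_t)~sum;
    }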

Here are some benchmarks:

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    0.122575
SSE2:   0.125627
AVX2:   0.137349
New:    0.102393
2M; 550 bytes
Gen:    0.19948
SSE2:   0.0907
AVX2:   0.04979
New:    0.108642
1M; 1500 bytes
Gen:    0.27743
SSE2:   0.101574
AVX2:   0.059514
New:    0.128075

@wingo (Contributor) commented Feb 9, 2018

Neat. Note, I think one of the virtues of the current generic implementation is its simplicity; you get to check optimized results against something that's more or less comprehensible and more or less easy to verify against the reference implementation. But perhaps that isn't so important. Where do you see this going?

@dpino (Contributor, Author) commented Feb 9, 2018

OK, I think you've got a point. If the new library eventually replaces the current one, it would be nice to have a simple version of the algorithm that is easy to grasp and can be used to compare results against.

Actually, I had an implementation in plain Lua to compare the results of the new implementation against, but at the last minute I decided to compare against the current implementation instead.

I pushed a new commit with the Lua implementation.

@tobyriddell commented

This sounds like a great idea! But as this is explored further, please keep in mind the impact of AVX2 (and also AVX-512, should it be used in the future) on CPU frequency scaling, as it is a potential source of jitter.

There's some discussion here: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

@dpino (Contributor, Author) commented Feb 9, 2018

I modified the main loop to sum two 64-bit words on each iteration, in other words summing at 16-byte strides. Then I added a new waterfall level to handle 8-byte offsets. Also, handling the remaining bytes no longer requires a loop. With those changes the generic implementation beats the SSE implementation in all cases (a rough C sketch of the loop shape follows the numbers):

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    0.125743
SSE2:   0.126113
AVX2:   0.143601
New:    0.071434
2M; 550 bytes
Gen:    0.192014
SSE2:   0.096879
AVX2:   0.049316
New:    0.067029
1M; 1500 bytes
Gen:    0.276876
SSE2:   0.099367
AVX2:   0.057041
New:    0.082595
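The shape of that new main loop, sketched in C; the unaligned memcpy loads, names and accumulator layout here are my assumptions, not the actual code:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Sketch: two 64-bit additions per iteration (a 16-byte stride),
       with end-around carries tracked in two independent accumulators
       that are folded together at the end. */
    static uint64_t sum_stride16(const uint8_t *p, size_t len)
    {
        uint64_t a = 0, b = 0;
        while (len >= 16) {
            uint64_t x, y;
            memcpy(&x, p, 8);         /* unaligned-safe 64-bit loads */
            memcpy(&y, p + 8, 8);
            a += x; if (a < x) a++;   /* one's-complement (carry-in) add */
            b += y; if (b < y) b++;
            p += 16; len -= 16;
        }
        a += b; if (a < b) a++;
        /* ...then an 8-byte waterfall level and a loop-free byte tail... */
        return a;
    }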

@dpino (Contributor, Author) commented Feb 9, 2018

@tobyriddell Thanks for the pointer, it was an interesting read. According to the article, per-core frequency decreases when using AVX/AVX2 multiplication instructions. OTOH, computing the checksum only involves addition and shift instructions, which do not seem to degrade per-core frequency. From the article:

Another interesting distinction is that ChaCha20-Poly1305 with AVX2 is slightly slower in OpenSSL but is the same in BoringSSL. Why might that be? The reason here is that the BoringSSL code does not use AVX2 multiplication instructions for Poly1305, and only uses simple xor, shift and add operations for ChaCha20, which allows it to run at the base frequency.

@dpino (Contributor, Author) commented Feb 11, 2018

Added a new level of loop unrolling with 4 qwords.

I learned there's some work already done by @lukego on a similar PR, #899: Luke already rewrote an AVX2 version of the algorithm in DynASM. Perhaps both issues could be combined somehow. IMHO it's not worth writing an SSE version of the algorithm, since the generic version with a loop unrolling of 4 qwords is already better than the SSE version in all cases. However, it's probably worth having an AVX version of the algorithm that makes use of the YMM registers.
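A hedged sketch of what such a 4-qword unrolled level can look like, extending the 16-byte sketch above (again, names and layout are mine):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Sketch: 4-qword (32-byte) unrolled level with four independent
       carry-tracking accumulators to expose instruction-level parallelism. */
    static uint64_t sum_stride32(const uint8_t *p, size_t len)
    {
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        while (len >= 32) {
            uint64_t x0, x1, x2, x3;
            memcpy(&x0, p,      8);
            memcpy(&x1, p + 8,  8);
            memcpy(&x2, p + 16, 8);
            memcpy(&x3, p + 24, 8);
            s0 += x0; if (s0 < x0) s0++;  /* end-around carry adds */
            s1 += x1; if (s1 < x1) s1++;
            s2 += x2; if (s2 < x2) s2++;
            s3 += x3; if (s3 < x3) s3++;
            p += 32; len -= 32;
        }
        s0 += s1; if (s0 < s1) s0++;
        s2 += s3; if (s2 < s3) s2++;
        s0 += s2; if (s0 < s2) s0++;
        /* ...fall through to the 16- and 8-byte levels and the tail... */
        return s0;
    }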

Benchmarks for 4 qwords unrolling:

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    0.122729
SSE2:   0.125478
AVX2:   0.140538
New:    0.077302
2M; 550 bytes
Gen:    0.20596
SSE2:   0.098654
AVX2:   0.049787
New:    0.047055
1M; 1500 bytes
Gen:    0.273965
SSE2:   0.100557
AVX2:   0.058187
New:    0.068768

@wingo (Contributor) commented Mar 2, 2018

It would be nice if the benchmarks printed comparable numbers -- nanoseconds per byte and nanoseconds per checksum.
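For reference, converting elapsed time into those figures is simple arithmetic; a C sketch, where report, n, size and elapsed are my placeholder names rather than Snabb code:

    #include <stdio.h>

    /* Derive per-checksum and per-byte figures from elapsed seconds,
       given the number of checksums (n) and the packet size in bytes. */
    static void report(const char *name, double elapsed, double n, double size)
    {
        double ns_per_csum = elapsed * 1e9 / n;
        printf("%s\telapse: %f; ns_per_csum: %.2f; ns_per_byte: %.2f\n",
               name, elapsed, ns_per_csum, ns_per_csum / size);
    }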

@dpino (Contributor, Author) commented Apr 25, 2018

Updated results with nanoseconds per byte and nanoseconds per checksum.

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    elapse: 0.126386; ns_per_csum: 87.77; ns_per_byte: 1.99
SSE2:   elapse: 0.123921; ns_per_csum: 86.06; ns_per_byte: 1.96
AVX2:   elapse: 0.119802; ns_per_csum: 83.20; ns_per_byte: 1.89
New:    elapse: 0.075025; ns_per_csum: 52.10; ns_per_byte: 1.18
2M; 550 bytes
Gen:    elapse: 0.211609; ns_per_csum: 1058.04; ns_per_byte: 1.92
SSE2:   elapse: 0.102081; ns_per_csum: 510.40; ns_per_byte: 0.93
AVX2:   elapse: 0.063685; ns_per_csum: 318.42; ns_per_byte: 0.58
New:    elapse: 0.054159; ns_per_csum: 270.79; ns_per_byte: 0.49
1M; 1500 bytes
Gen:    elapse: 0.291010; ns_per_csum: 2910.10; ns_per_byte: 1.94
SSE2:   elapse: 0.099104; ns_per_csum: 991.04; ns_per_byte: 0.66
AVX2:   elapse: 0.066498; ns_per_csum: 664.98; ns_per_byte: 0.44
New:    elapse: 0.074388; ns_per_csum: 743.88; ns_per_byte: 0.50

@dpino changed the title from "[WIP] Implement checksum computation using DynASM" to "Implement checksum computation using DynASM" on Apr 25, 2018
@dpino force-pushed the checksum-dynasm branch from 311196c to 7b2c42f on April 25, 2018
@dpino closed this on Jun 15, 2018
@dpino deleted the checksum-dynasm branch on June 15, 2018
@dpino restored the checksum-dynasm branch on August 2, 2018
@dpino reopened this on Aug 14, 2018
@wingo (Contributor) commented Aug 14, 2018

What system did you use to check the timings? Can you try with the E5-2620v3 servers we have, and also with a Skylake? I guess a Skylake laptop or desktop, given that I don't think we have Skylake Xeon servers yet. In the context of #1194 (comment) I think we should probably remove the AVX2 and SSE variants. We probably also need to add a "snabbmark" case for this checksum versus the C checksum. I think you also need a randomized test to make sure this version computes the same result as the reference one written in C; i.e. for all lengths from the min to the max, generate a few random packets, compute the checksum via C and DynASM, and assert that the DynASM result equals the C result.
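The randomized equivalence test described here is simple to sketch; cksum_reference and cksum_dynasm below are placeholders for the C and DynASM entry points:

    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    uint16_t cksum_reference(const uint8_t *data, size_t len); /* C version */
    uint16_t cksum_dynasm(const uint8_t *data, size_t len);    /* DynASM version */

    /* For every length from min to max (max <= 2048 here), checksum a few
       random buffers with both implementations and assert they agree. */
    static void selftest_random(size_t min, size_t max)
    {
        uint8_t buf[2048];
        for (size_t len = min; len <= max; len++) {
            for (int trial = 0; trial < 4; trial++) {
                for (size_t i = 0; i < len; i++)
                    buf[i] = (uint8_t)rand();
                assert(cksum_reference(buf, len) == cksum_dynasm(buf, len));
            }
        }
    }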

@dpino (Contributor, Author) commented Aug 14, 2018

I think the results I posted were from the E5-2620v3, but in any case I ran the benchmark again:

E5-2620v3 (Haswell)

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    elapse: 0.126402; ns_per_csum: 87.78; ns_per_byte: 1.99
SSE2:   elapse: 0.135262; ns_per_csum: 93.93; ns_per_byte: 2.13
AVX2:   elapse: 0.130680; ns_per_csum: 90.75; ns_per_byte: 2.06
New:    elapse: 0.078119; ns_per_csum: 54.25; ns_per_byte: 1.23
2M; 550 bytes
Gen:    elapse: 0.212166; ns_per_csum: 1060.83; ns_per_byte: 1.93
SSE2:   elapse: 0.095822; ns_per_csum: 479.11; ns_per_byte: 0.87
AVX2:   elapse: 0.054535; ns_per_csum: 272.68; ns_per_byte: 0.50
New:    elapse: 0.079838; ns_per_csum: 399.19; ns_per_byte: 0.73
1M; 1500 bytes
Gen:    elapse: 0.293406; ns_per_csum: 2934.06; ns_per_byte: 1.96
SSE2:   elapse: 0.105560; ns_per_csum: 1055.60; ns_per_byte: 0.70
AVX2:   elapse: 0.059494; ns_per_csum: 594.94; ns_per_byte: 0.40
New:    elapse: 0.115417; ns_per_csum: 1154.17; ns_per_byte: 0.77

Laptop (i7-6700HQ CPU @ 2.60GHz; Skylake)

$ sudo ./snabb snsh -t lib.newchecksum
selftest: newchecksum
14.4M; 44 bytes
Gen:    elapse: 0.122758; ns_per_csum: 85.25; ns_per_byte: 1.94
SSE2:   elapse: 0.121412; ns_per_csum: 84.31; ns_per_byte: 1.92
AVX2:   elapse: 0.122230; ns_per_csum: 84.88; ns_per_byte: 1.93
New:    elapse: 0.074836; ns_per_csum: 51.97; ns_per_byte: 1.18
2M; 550 bytes
Gen:    elapse: 0.215340; ns_per_csum: 1076.70; ns_per_byte: 1.96
SSE2:   elapse: 0.089518; ns_per_csum: 447.59; ns_per_byte: 0.81
AVX2:   elapse: 0.062259; ns_per_csum: 311.30; ns_per_byte: 0.57
New:    elapse: 0.053170; ns_per_csum: 265.85; ns_per_byte: 0.48
1M; 1500 bytes
Gen:    elapse: 0.301346; ns_per_csum: 3013.46; ns_per_byte: 2.01
SSE2:   elapse: 0.095864; ns_per_csum: 958.64; ns_per_byte: 0.64
AVX2:   elapse: 0.065900; ns_per_csum: 659.00; ns_per_byte: 0.44
New:    elapse: 0.076165; ns_per_csum: 761.65; ns_per_byte: 0.51

As for the requests, I can tackle those changes, sure.

dpino added 2 commits on August 23, 2018:
The changes required to support a third argument 'initial', return the value in host byte order, and adapt some selftests.
@dpino (Contributor, Author) commented Aug 23, 2018

I added a few more commits that address some of the requested changes:

  • Removed the AVX2 and SSE2 implementations.
  • Randomized tests. I adapted the selftest in lib.checksum, which validates that the checksum computed by the C algorithm (cksum_generic) is the same as the one computed by ipsum (the DynASM implementation). The selftest in arch.checksum also validates that the checksum computation is correct by comparing the result against an algorithm written in Lua. (I think we should remove the cksum_generic algorithm in the future, as it's no longer used except for validation.)
  • Snabbmark is pending. I plan to move the benchmark in arch.checksum there; a rough shape of such a timing loop is sketched below.
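Not the actual snabbmark code (which is Lua and can use lib.pmu for cycle counts), but the rough shape of such a timing loop in C:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    uint16_t cksum_dynasm(const uint8_t *data, size_t len); /* placeholder */

    /* Time n checksums over a fixed buffer; report nanoseconds per
       iteration and per byte, in the spirit of the snabbmark output
       shown later in the thread. */
    static void bench(const uint8_t *buf, size_t size, long n)
    {
        struct timespec t0, t1;
        volatile uint16_t result = 0;  /* keep the calls from being elided */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < n; i++)
            result = cksum_dynasm(buf, size);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("Size=%zu bytes: %.2f ns per iteration (result: %u); %.2f ns per byte\n",
               size, ns / n, (unsigned)result, ns / n / size);
    }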

@dpino (Contributor, Author) commented Aug 23, 2018

  • Added a checksum subprogram to snabbmark.

@eugeneia (Member) commented

@dpino In case you haven't already, I recommend playing with lib.pmu in that snabbmark. It should be an interesting case study for peeking at the inner workings of the CPU!

@wingo (Contributor) commented Sep 5, 2018

Just an example run on my old Ivy Bridge i7-3770 desktop:

$ sudo taskset -c 2 ./snabb snabbmark checksum
C: Size=44 bytes; MPPS=14 M: 30.16 cycles, 7.88 ns per iteration (result: 30438); 0.18 ns per byte
ASM: Size=44 bytes; MPPS=14 M: 21.35 cycles, 5.60 ns per iteration (result: 30438); 0.13 ns per byte
C: Size=550 bytes; MPPS=2 M: 227.78 cycles, 59.74 ns per iteration (result: 15425); 0.11 ns per byte
ASM: Size=550 bytes; MPPS=2 M: 139.13 cycles, 36.50 ns per iteration (result: 15425); 0.07 ns per byte
C: Size=1516 bytes; MPPS=1 M: 610.06 cycles, 159.51 ns per iteration (result: 8540); 0.11 ns per byte
ASM: Size=1516 bytes; MPPS=1 M: 386.77 cycles, 101.32 ns per iteration (result: 8540); 0.07 ns per byte

On our old E5-2620v3 (Haswell-EP) server:

$ sudo taskset -c 4 ./snabb snabbmark checksum
[pmu /sys/devices/cpu/rdpmc: 1 -> 2]
C: Size=44 bytes; MPPS=14 M: 27.07 cycles, 11.29 ns per iteration (result: 14151); 0.26 ns per byte
ASM: Size=44 bytes; MPPS=14 M: 20.04 cycles, 8.35 ns per iteration (result: 14151); 0.19 ns per byte
C: Size=550 bytes; MPPS=2 M: 342.32 cycles, 142.64 ns per iteration (result: 36504); 0.26 ns per byte
ASM: Size=550 bytes; MPPS=2 M: 133.46 cycles, 55.61 ns per iteration (result: 36504); 0.10 ns per byte
C: Size=1516 bytes; MPPS=1 M: 942.14 cycles, 392.58 ns per iteration (result: 41872); 0.26 ns per byte
ASM: Size=1516 bytes; MPPS=1 M: 380.32 cycles, 158.51 ns per iteration (result: 41872); 0.10 ns per byte

@wingo (Contributor) commented Sep 5, 2018

On a Skylake mobile CPU (i7-7500U):

$ sudo taskset -c 1 ./snabb snabbmark checksum
No PMU available: CPU not recognized: GenuineIntel-6-8E
C: Size=44 bytes; MPPS=14 M: 7.08 ns per iteration (result: 14899); 0.16 ns per byte
ASM: Size=44 bytes; MPPS=14 M: 5.81 ns per iteration (result: 14899); 0.13 ns per byte
C: Size=550 bytes; MPPS=2 M: 93.85 ns per iteration (result: 13253); 0.17 ns per byte
ASM: Size=550 bytes; MPPS=2 M: 24.20 ns per iteration (result: 13253); 0.04 ns per byte
C: Size=1516 bytes; MPPS=1 M: 268.12 ns per iteration (result: 6361); 0.18 ns per byte
ASM: Size=1516 bytes; MPPS=1 M: 69.42 ns per iteration (result: 6361); 0.05 ns per byte

@wingo (Contributor) commented Sep 5, 2018

For me this is LGTM!

@eugeneia (Member) commented Sep 5, 2018

On an AMD Ryzen 5 1600 (turbo off):

dyser$ sudo taskset -c 3 ./snabb snabbmark checksum
[pmu: /sbin/modprobe msr]
sh: /sbin/modprobe: No such file or directory
No PMU available: requires /dev/cpu/*/msr (Linux 'msr' module)
C: Size=44 bytes; MPPS=14 M: 6.60 ns per iteration (result: 24346); 0.15 ns per byte
ASM: Size=44 bytes; MPPS=14 M: 4.39 ns per iteration (result: 24346); 0.10 ns per byte
C: Size=550 bytes; MPPS=2 M: 56.49 ns per iteration (result: 19817); 0.10 ns per byte
ASM: Size=550 bytes; MPPS=2 M: 23.79 ns per iteration (result: 19817); 0.04 ns per byte
C: Size=1516 bytes; MPPS=1 M: 140.99 ns per iteration (result: 52607); 0.09 ns per byte
ASM: Size=1516 bytes; MPPS=1 M: 73.39 ns per iteration (result: 52607); 0.05 ns per byte

LGTM as well!

@eugeneia (Member) commented Jan 7, 2019

Merged into max-next.

@eugeneia added the merged label on Jan 7, 2019
@eugeneia merged commit 542a179 into snabbco:master on Jun 6, 2019