[wip] IP checksum in AVX2 assembler (prototype rewrite) #899

lukego · 2016-04-27T12:49:38Z

This branch rewrites the AVX2 checksum routine with DynASM assembler instead of GCC intrinsics.

My motivations and perceived benefits are a little subjective:

Practice writing assembler code with DynASM - a powerful secret weapon.
Practice using x86 SIMD instructions - untapped potential growing exponentially.
Reduce reliance on GCC - a boring complex external dependency.
Get Snabb to compile on older Red Hat/CentOS - where the bundled GCC lacks AVX2 support.
Get to know this checksum routine better - originally ported from another project.
Have a chance to try and squeeze more performance.

I will be satisfied if the assembler version is at least as short as the C version and also at least as fast.

Looks promising so far. The version here is basically working but seems to be missing a carry bit somewhere (off-by-one on some tests). The code is a little over half the size of the C version. The performance seems at least as good.

I just have simple microbenchmarks for now. I would like to get a more thorough performance test like #755 upstream to verify this change. Meanwhile, here is how it compares to the C code on the microbenchmark included in the PR (1 million iterations on the same input). One case, with a 150-byte input:

ASM
EVENT                                             TOTAL       /call      /block       /byte
cycles                                       45,552,172      45.552       9.718       0.304
instructions                                 84,139,687      84.140      17.950       0.561
call                                          1,000,000       1.000       0.213       0.007
block                                         4,687,500       4.688       1.000       0.031
byte                                        150,000,000     150.000      32.000       1.000

C
EVENT                                             TOTAL       /call      /block       /byte
cycles                                       64,792,114      64.792      13.822       0.432
instructions                                248,176,655     248.177      52.944       1.655
call                                          1,000,000       1.000       0.213       0.007
block                                         4,687,500       4.688       1.000       0.031
byte                                        150,000,000     150.000      32.000       1.000

Here is a little table with some other values:

Code vs Bytes	31	32	100	150	500	1,500	5,000	10,000	100,000
C (cycles)	84*	85*	263*	65	96	206	498	970	10,005
ASM (cycles)	44	19	47	46	51	116	402	772	9,085

EDIT: Marked with * the small sizes where our C code actually punts to non-SIMD.

So: fun and encouraging so far but more work to be done. Could still be some big mistakes that invalidate this code and/or results for now.

cc @tonyrog and @jsnell who may also be interested.

See source code comments for implementation status/notes.

Fixes an overflow bug where the 32-bit accumulators were summed using a 16-bit add instruction. Checksums now seem to be correct (same as existing routine) for up to 128KB inputs.

lukego · 2016-04-28T04:38:23Z

Fixed the bug with 4993c37. This code works fine up to 128KB inputs in casual testing. That limitation seems okay to me i.e. not worth writing more code to increase it because packets are not that big.
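To make the carry issue concrete, here is a minimal scalar sketch in C. It is not the DynASM code from this branch: the names `fold32`, `fold32_truncating` and `cksum_ref` are hypothetical, and the meaning given to `initial` (a value added to the sum before complementing) is an assumption. It shows how adding the two halves of a 32-bit accumulator at 16-bit width can drop a carry and leave the result off by one. For a single 32-bit accumulator, 65,536 words of 0xFFFF plus a 16-bit initial value is exactly 0xFFFFFFFF, so 128KB is the largest input that cannot overflow in this sketch; whether the same reasoning sets the limit in the assembler routine is not confirmed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit sum of 16-bit words down to 16 bits, wrapping carries
 * back into the low bits (ones-complement addition). */
uint16_t fold32(uint32_t sum)
{
    sum = (sum & 0xffffu) + (sum >> 16); /* at most 0x1fffe, may set bit 16 */
    sum = (sum & 0xffffu) + (sum >> 16); /* absorbs that last carry */
    return (uint16_t)sum;
}

/* Buggy variant for comparison: adding the halves at 16-bit width
 * silently discards the carry out of bit 15. */
uint16_t fold32_truncating(uint32_t sum)
{
    return (uint16_t)((uint16_t)sum + (uint16_t)(sum >> 16));
}

/* Hypothetical scalar reference: Internet checksum of buf[0..len),
 * reading 16-bit words in network (big-endian) order.
 * Assumption: 'initial' is simply added to the sum. */
uint16_t cksum_ref(const uint8_t *buf, size_t len, uint16_t initial)
{
    /* 65,536 words * 0xFFFF + 0xFFFF still fits in 32 bits. */
    assert(len <= 128u * 1024u);

    uint32_t sum = initial;
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;    /* odd trailing byte, zero padded */
    return (uint16_t)~fold32(sum);
}
```

With a sum of 0x1FFFF the correct fold gives 1 while the truncating fold gives 0, which is the kind of off-by-one that only shows up on some inputs.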

The next step is to integrate and test/benchmark more extensively both with synthetic benchmarks and end-to-end tests (offloading checksums from QEMU VMs).

From the comment: This routine executes a VZEROUPPER instruction before returning in order to flush 256-bit AVX register state and avoid potentially expensive SSE-AVX transition penalties. This is a cheap form of insurance against taking ~75-cycle penalties when mixing SSE and AVX code in the same program. See https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties and particularly section 3.3.

The assembler routine now has the same function interface as the C functions and should be able to serve as a drop-in replacement.

snabbmark can now measure the performance of the built-in IP checksum routines and presents the results for an apples-to-apples comparison. The benchmark parameters are currently hard-coded. Length is randomly chosen from a "log uniform" distribution (favoring smaller values but drawn from a large range). Alignment is randomized. The intention is to favor robust routines that are not sensitive to alignment and do not depend on predictable branches. Currently the alignment is forced to be even. Initially this was to be realistic for normal protocols, but I discovered that odd addresses actually crash the SSE implementation. That bug has to be addressed separately.
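For readers who want to reproduce this kind of input mix, the sketch below shows one way to draw lengths and even offsets in C. It is not the snabbmark code: `approx_log_uniform` and `even_offset` are hypothetical names, and the distribution is an integer-only approximation of log-uniform (each power-of-two bucket is equally likely, then a value is picked uniformly inside the bucket). It relies on `rand()`, so it has a small modulo bias and is only meant for rough benchmarking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Approximate log-uniform draw from [1, 2^max_bits): every bucket
 * [2^k, 2^(k+1)) is equally likely, so small sizes dominate while
 * large sizes still occur. max_bits is capped at 15 because
 * RAND_MAX is only guaranteed to be at least 32767. */
size_t approx_log_uniform(unsigned max_bits)
{
    assert(max_bits >= 1 && max_bits <= 15);
    unsigned k = (unsigned)rand() % max_bits;   /* bucket index 0..max_bits-1 */
    size_t base = (size_t)1 << k;               /* bucket start, 2^k */
    return base + (size_t)rand() % base;        /* uniform inside the bucket */
}

/* Random even offset in [0, max_offset], mirroring the "alignment is
 * forced to be even" restriction described above. */
size_t even_offset(size_t max_offset)
{
    return ((size_t)rand() % (max_offset / 2 + 1)) * 2;
}
```

Calling `approx_log_uniform(14)` yields lengths from 1 to 16,383 bytes, which covers small, standard and jumbo packet sizes.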

lukego · 2016-04-30T17:14:49Z

Added snabbmark checksum benchmark:

$ taskset -c 1 sudo ./snabb snabbmark checksum
VARIANT          BYTES/PACKET    BYTES/CYCLE  CYCLES/PACKET
base                  631.331          0.326       1935.081
asm                   631.331          4.244        148.770
avx2                  631.331          2.743        230.180
sse2                  631.331          2.318        272.319

This is intended to be fairly harsh and realistic. The input sizes, contents, and alignments are randomized. I draw the input sizes from a log-uniform distribution, which is my current favorite for packet sizes (mostly small but also including large and jumbo sizes). I also updated the assembler routine to have the same interface as the others (added the initial argument).

The assembler routine shows the best results by far. The older AVX routine is likely suffering from the logic that selectively falls back on the generic routine on small inputs (both for the unpredictable branch and because the current implementation of generic checksum is beyond awful -- should fix that as a matter of principle even if we are using the SIMD one in practice.)

This is starting to look like effort well spent! IP checksum is the main hotspot for Virtio-net with client/server workloads e.g. running iperf in VM. Cycles saved here should translate directly into extra capacity for the NFV application.

Before, there were three separate C checksum implementations (generic, SSE, AVX), each compiled with different compiler settings. These were fairly complex due to SIMD intrinsics. The SSE implementation was also incorrect and would segfault with odd-numbered addresses.

Now there is one C checksum routine that is compiled with two different compiler settings (default/SSE and AVX). The checksum routine is written in a very simple style that GCC successfully vectorizes automatically (tested with GCC 4.8.5, 4.9.3, 5.3.0). I experimented with "waving a voodoo chicken" in a few different ways (number of accumulators = 1, 2, 4; accumulator size = 32-bit, 64-bit). This formulation seems to work best for GCC. This does feel like hocus-pocus that exposes us to GCC behavior that is not nailed down, but that bothers me less than the high-brow intrinsics code.

I have retained the AVX2 assembler implementation with DynASM because I have not been able to beat that with GCC yet.

Current scoreboard:

VARIANT          BYTES/PACKET    BYTES/CYCLE  CYCLES/PACKET
base                  631.331          2.939        214.796
asm                   631.331          4.161        151.719
avx2                  631.331          3.416        184.825

lukego · 2016-05-01T07:46:32Z

This branch is taking a little bit of a different turn:

I found a fairly straightforward formulation of checksum in C that GCC is able to automatically vectorize when compiled with -O3. I hacked the Makefile to compile this function twice: once as cksum_avx2() using -mavx2 and once as cksum() using default settings (i.e. SSE). This is simpler and faster than the hand-coded SSE and AVX routines based on C intrinsics so I have removed those.
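As an illustration of what a "straightforward formulation" can look like, here is a sketch in C. It is not the routine from this branch: `cksum_simple` is a hypothetical name, the handling of `initial` is an assumption, and the choice of two 64-bit accumulators is just one point in the space of accumulator counts and sizes. Whether a particular GCC version vectorizes this loop at -O3 has not been verified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: Internet checksum with a plain counted loop.
 * High and low bytes of each 16-bit word go into separate 64-bit
 * accumulators, so the loop body is two independent reductions and
 * the result does not depend on host byte order. */
uint16_t cksum_simple(const uint8_t *p, size_t len, uint16_t initial)
{
    assert(p != NULL || len == 0);

    uint64_t hi = 0, lo = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2) {
        hi += p[i];        /* most significant byte of the word */
        lo += p[i + 1];    /* least significant byte of the word */
    }
    if (i < len)
        hi += p[i];        /* odd trailing byte, zero padded */

    /* Recombine, then fold carries until the sum fits in 16 bits. */
    uint64_t sum = (hi << 8) + lo + initial;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Building the same source file twice, once with default flags and once with -mavx2, would give the two variants described above from a single function body.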

I have retained the AVX2 assembler variant as this is still the fastest by a significant margin.

Current scoreboard:

VARIANT          BYTES/PACKET    BYTES/CYCLE  CYCLES/PACKET
base                  631.331          2.921        216.117
asm                   631.331          4.172        151.310
avx2                  631.331          3.435        183.775

The next step is to eliminate either the C/AVX2 implementation or the assembler one. The open problem for the assembler one right now is the wart that it temporarily overwrites the memory trailing the input, which is a different and more complex interface that may not suit all usages. The open problem for the C/AVX2 implementation is that it is slower than the assembler.
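The interface wart can be sketched in C as follows. This is not the assembler routine: `cksum_padded` and `BLOCK` are hypothetical, and the sketch assumes the caller guarantees that the buffer is writable and has capacity up to the next multiple of 32 bytes. It shows the trade-off: the inner loop only ever sees whole blocks, but the pointer cannot be `const` and the bytes after the input are briefly modified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK = 32 };  /* hypothetical block size, one 256-bit register */

/* Hypothetical sketch of the zero-padding trick. Assumption: buf is
 * writable and has room for len rounded up to a multiple of BLOCK. */
uint16_t cksum_padded(uint8_t *buf, size_t len, uint16_t initial)
{
    assert(buf != NULL);

    size_t padded = (len + BLOCK - 1) / BLOCK * BLOCK;
    size_t tail = padded - len;             /* always less than BLOCK */
    uint8_t saved[BLOCK];

    memcpy(saved, buf + len, tail);         /* save the trailing bytes */
    memset(buf + len, 0, tail);             /* temporarily zero them */

    uint64_t sum = initial;
    for (size_t off = 0; off < padded; off += BLOCK)
        for (size_t i = 0; i < BLOCK; i += 2)
            sum += ((uint32_t)buf[off + i] << 8) | buf[off + i + 1];

    memcpy(buf + len, saved, tail);         /* restore the trailing bytes */

    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16); /* fold carries to 16 bits */
    return (uint16_t)~sum;
}
```

Because the memory after the input is overwritten for the duration of the call, a routine like this is not safe if another thread reads those bytes at the same time, or if the input sits at the very end of a mapped region.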

jsnell · 2016-05-01T18:17:41Z

My usual worry with auto-vectorization for critical hotspots is that at some point it'll fail to trigger for whatever reason, and you'll end up silently running the naive version of the code. It's ok in a closed environment where the compiler versions are guaranteed to be fixed and only get upgraded at very specific points in time, but more problematic when you as the author have little control over how this gets compiled.

Though the simplicity of that new C code is pretty sweet.

lukego · 2016-05-02T12:10:47Z

@jsnell Yes, I know what you mean. In Snabb we pin exact versions of LuaJIT and DynASM so that we can "geek out" on them but have tried to be conservative with gcc, glibc, etc. I am not especially comfortable with either the auto-vectorized or the vectorized-with-C-intrinsics versions.

I imagine it is interesting in other projects where they are dedicated C hackers and pick one compiler (ICC, CLANG, or GCC) and geek out on its features for vectorization etc. at least until the distros come along and decide to use a different compiler, compile for a generic target architecture, skip the performance tests, etc :).

Questions I am struggling with now are:

Is the trick of zero-padding the input good (simpler code thanks to known properties of inputs) or bad (complex interface due to lazy coding)?
Is it worth writing an SSE assembler version by hand to reduce dependence on GCC?
How does performance compare with theoretical limits? i.e. based on throughput and latency of the instructions involved (how busy is it keeping the ALUs and what is the limiting factor on performance?)

AkihiroSuda · 2018-08-03T07:32:59Z

What is the current status?

eugeneia · 2018-08-15T09:59:49Z

@AkihiroSuda @dpino is hacking on a fast IP checksum implementation in the same spirit here: #1275

lib/checksum_simd.dasl: IP checksum in AVX2 assembler (prototype)

049d558

lukego changed the title ~~[wip] lib/checksum_simd.dasl: IP checksum in AVX2 assembler (prototype)~~ [wip] IP checksum in AVX2 assembler (prototype rewrite) Apr 27, 2016

lukego added 2 commits April 28, 2016 04:25

lib.checksum_simd: Fix overflow bug - now works

4993c37

lib/checksum_simd.dasl: Improved comments and selftest

14d78e4

lukego added 3 commits April 30, 2016 08:57

lib.checksum_simd: Add support for 'initial' argument

9ea8031

This was referenced May 5, 2016

Help wanted: LuaJIT pros to contribute to Snabb LuaJIT/LuaJIT#177

Closed

next: Changes queued for the v2016.05 release #888

Merged

dpino added a commit to dpino/snabb that referenced this pull request Aug 10, 2017

Merge pull request snabbco#899 from dpino/extend-snabb-softwire-v2-wi…

5c137e2

…th-alarms Extend snabb-softwire-v2 schema with alarms

dpino mentioned this pull request Feb 11, 2018

Implement checksum computation using DynASM #1275

Merged

This was referenced Nov 28, 2023

Segment: revise checksum to not allocate and blit robur-coop/utcp#30

Closed

TCP checksum computation: improve performance robur-coop/utcp#31

Open
