[wip] IP checksum in AVX2 assembler (prototype rewrite) #899
base: master
Conversation
See source code comments for implementation status/notes.
Fixes an overflow bug where the 32-bit accumulators were summed using a 16-bit add instruction. Checksums now seem to be correct (same as existing routine) for up to 128KB inputs.
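For context, a minimal C sketch of the folding step in question (illustrative only, not the DynASM code): the 32-bit partial sums have to be combined with 32-bit-or-wider adds and only then folded down to 16 bits with the carries added back in; combining them with a 16-bit add silently drops the high bits.

```c
#include <stdint.h>

/* Illustrative fold of a 32-bit partial sum into a 16-bit ones'-complement
 * (IP) checksum. The partial sums must be combined with 32-bit or wider
 * arithmetic first; a 16-bit add at that stage would lose the carries. */
static uint16_t fold32(uint32_t sum)
{
    /* Add the upper 16 bits back into the lower 16 bits until no carry remains. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;   /* final complement gives the checksum */
}
```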
Fixed the bug with 4993c37. This code works fine up to 128KB inputs in casual testing. That limitation seems okay to me, i.e. not worth writing more code to raise it, because packets are not that big. The next step is to integrate and test/benchmark more extensively, both with synthetic benchmarks and end-to-end tests (offloading checksums from QEMU VMs).
From the comment: This routine executes a VZEROUPPER instruction before returning in order to flush 256-bit AVX register state and avoid potential expensive SSE-AVX transition penalties. This is a cheap form of insurance against taking ~ 75 cycle penalties when mixing SSE and AVX code in the same program. See https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties and particularly section 3.3.
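As an illustration of the same precaution in C (a sketch, not the routine from this branch), intrinsics code can flush the 256-bit state via `_mm256_zeroupper()`, which compiles to VZEROUPPER. Compilers targeting AVX often insert this automatically at function boundaries; a hand-written assembler routine has to emit it explicitly.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch only: accumulate packed 32-bit words with AVX2, then clear the upper
 * YMM state with VZEROUPPER before returning so that later SSE code does not
 * pay a state-transition penalty. */
static uint32_t sum_words_avx2(const uint8_t *p, size_t nblocks)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < nblocks; i++) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(p + 32 * i));
        acc = _mm256_add_epi32(acc, v);       /* add eight 32-bit lanes */
    }
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += lanes[i];
    _mm256_zeroupper();                        /* emits VZEROUPPER */
    return sum;
}
```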
The assembler routine now has the same function interface as the C functions and should be able to serve as a drop-in replacement.
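For illustration, "same interface" means something along these lines (hypothetical prototypes; the actual names and parameter types in the tree may differ):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical prototypes: all variants (generic C, SSE, AVX2, and now the
 * DynASM assembler routine) share one signature, so callers can swap
 * implementations freely. Names here are illustrative only. */
uint16_t checksum_generic(const uint8_t *data, size_t length, uint16_t initial);
uint16_t checksum_avx2(const uint8_t *data, size_t length, uint16_t initial);
```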
snabbmark can now measure the performance of the built-in IP checksum routines and presents the results for apples-to-apples comparison. The benchmark parameters are currently hard-coded. Length is randomly chosen from a "log uniform" distribution (favoring smaller values but drawn from a large range). Alignment is randomized. The intention is to favor robust routines that are not sensitive to alignment and that do not depend on branches being predictable. Currently the alignment is forced to be even. Initially this was to be realistic for normal protocols, but I discovered that odd addresses actually crash the SSE implementation; that bug has to be addressed separately.
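To make the parameter generation concrete, here is a sketch of how such lengths and offsets could be drawn (illustrative helper names and ranges, not the snabbmark code):

```c
#include <stdlib.h>
#include <math.h>
#include <stddef.h>

/* Sketch of the randomized benchmark parameters described above (not the
 * snabbmark code; helper names and ranges are made up for illustration). */
static double uniform01(void)
{
    return rand() / (RAND_MAX + 1.0);          /* uniform in [0,1) */
}

/* Log-uniform length: uniform in log-space, so small sizes dominate
 * but large/jumbo sizes still show up regularly. */
static size_t random_length(size_t min, size_t max)
{
    double lo = log((double)min), hi = log((double)max);
    return (size_t)exp(lo + uniform01() * (hi - lo));
}

/* Randomized but (for now) even alignment offset into the buffer. */
static size_t random_even_offset(size_t max_offset)
{
    return ((size_t)(uniform01() * (double)max_offset)) & ~(size_t)1;
}
```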
Added a benchmark of the built-in checksum routines to snabbmark. This is intended to be fairly harsh and realistic. The input sizes, contents, and alignments are randomized. I draw the input sizes from a log-uniform distribution, which is my current favorite for packet sizes (mostly small but also including large and jumbo sizes). I also updated the assembler routine to have the same interface as the others.

The assembler routine shows the best results by far. The older AVX routine is likely suffering from the logic that selectively falls back on the generic routine for small inputs (both because of the unpredictable branch and because the current implementation of the generic checksum is beyond awful -- we should fix that as a matter of principle even if we are using the SIMD one in practice).

This is starting to look like effort well spent! IP checksum is the main hotspot for Virtio-net with client/server workloads, e.g. running iperf in a VM. Cycles saved here should translate directly into extra capacity for the NFV application.
Before, there were three separate C checksum implementations (generic, SSE, AVX), each compiled with different compiler settings. These were fairly complex due to SIMD intrinsics. The SSE implementation was also incorrect and would segfault on odd-numbered addresses.

Now there is one C checksum routine that is compiled with two different compiler settings (default/SSE and AVX). The checksum routine is written in a very simple style that GCC successfully vectorizes automatically (tested with GCC 4.8.5, 4.9.3, and 5.3.0). I experimented with "waving a voodoo chicken" in a few different ways (# accumulators = 1, 2, 4; accumulator size = 32-bit, 64-bit) and this formulation seems to work best for GCC. This does feel like hocus-pocus that exposes us to GCC behavior that is not nailed down, but that bothers me less than the high-brow intrinsics code.

I have retained the AVX2 assembler implementation with DynASM because I have not been able to beat that with GCC yet. Current scoreboard:

VARIANT  BYTES/PACKET  BYTES/CYCLE  CYCLES/PACKET
base     631.331       2.939        214.796
asm      631.331       4.161        151.719
avx2     631.331       3.416        184.825
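For a rough idea of what "very simple style" means here, a sketch of one such formulation (illustrative only, not the exact routine from the branch; it uses a single 64-bit accumulator, while the comment above describes experimenting with several variants):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of a plain-loop checksum that a compiler can
 * auto-vectorize: accumulate 16-bit big-endian words into a 64-bit sum,
 * then fold. Details (initial value, byte-order conventions) simplified. */
static uint16_t ip_checksum_simple(const uint8_t *data, size_t length)
{
    uint64_t sum = 0;
    size_t i;
    /* Main loop: plain scalar code, written so the compiler can vectorize it. */
    for (i = 0; i + 1 < length; i += 2)
        sum += (uint16_t)((data[i] << 8) | data[i + 1]);
    if (i < length)                       /* odd trailing byte */
        sum += (uint16_t)(data[i] << 8);
    /* Fold the 64-bit sum down to 16 bits, adding carries back in. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```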
This branch is taking a little bit of a different turn: I found a fairly straightforward formulation of checksum in C that GCC is able to automatically vectorize when compiled with the AVX compiler settings. I have retained the AVX2 assembler variant as this is still the fastest by a significant margin (see the scoreboard above).
The next step is to eliminate either the C/AVX2 implementation or the assembler one. The open problem with the assembler one right now is the wart that it temporarily overwrites the memory trailing the input, which amounts to a different and more complex interface that may not suit all usages. The open problem with the C/AVX2 implementation is that it is slower than the assembler.
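To spell out the "overwrites the memory trailing the input" wart, here is a hypothetical sketch of the general save/pad/restore idea (not the assembler's actual mechanism; `checksum_full_blocks` is a made-up placeholder):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Placeholder for a routine that only handles whole 32-byte blocks. */
extern uint16_t checksum_full_blocks(const uint8_t *data, size_t padded_length);

/* Hypothetical save/pad/restore wrapper: round the input up to a full SIMD
 * block by temporarily zeroing the bytes just past the end, run the
 * full-block routine, then restore the saved bytes. This only works if the
 * caller guarantees the trailing memory is mapped and writable, which is
 * the interface wart discussed above. Zero padding does not change a
 * ones'-complement sum. */
static uint16_t checksum_with_trailing_overwrite(uint8_t *data, size_t length)
{
    size_t padded = (length + 31) & ~(size_t)31;   /* next multiple of 32 */
    size_t tail = padded - length;
    uint8_t saved[32];
    memcpy(saved, data + length, tail);            /* save trailing bytes */
    memset(data + length, 0, tail);                /* zero-pad the input */
    uint16_t sum = checksum_full_blocks(data, padded);
    memcpy(data + length, saved, tail);            /* restore trailing bytes */
    return sum;
}
```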
My usual worry with auto-vectorization for critical hotspots is that at some point it'll fail to trigger for whatever reason, and you'll end up silently running the naive version of the code. It's ok in a closed environment where the compiler versions are guaranteed to be fixed and only get upgraded at very specific points in time, but more problematic when you as the author have little control over how this gets compiled. Though the simplicity of that new C code is pretty sweet.
@jsnell Yes, I know what you mean. In Snabb we pin exact versions of LuaJIT and DynASM so that we can "geek out" on them, but we have tried to be conservative with gcc, glibc, etc. I am not especially comfortable with either the auto-vectorized or the vectorized-with-C-intrinsics versions. I imagine it is interesting in other projects where there are dedicated C hackers who pick one compiler (ICC, Clang, or GCC) and geek out on its features for vectorization etc., at least until the distros come along and decide to use a different compiler, compile for a generic target architecture, skip the performance tests, etc. :). Questions I am struggling with now are:
What is the current status?
@AkihiroSuda @dpino is hacking on a fast IP checksum implementation in the same spirit here: #1275
This branch rewrites the AVX2 checksum routine with DynASM assembler instead of GCC intrinsics.
My motivations and perceived benefits are a little subjective:
I will be satisfied if the assembler version is at least as short as the C version and also at least as fast.
Looks promising so far. The version here is basically working but seems to be missing a carry bit somewhere (off-by-one on some tests). The code is a little over half the size of the C version. The performance seems at least as good.
Just have simple microbenchmarks for now. I would like to get a more thorough performance test like #755 upstream to verify this change. Meanwhile, here is how it looks in comparison to the C code, based on the microbenchmark included in the PR (1 million iterations on the same input). For the case of a 150-byte input:
Here is a little table with some other values:
EDIT: Marked with * the small sizes where our C code actually punts to non-SIMD.

So: fun and encouraging so far, but more work to be done. There could still be some big mistakes that invalidate this code and/or these results for now.
cc @tonyrog and @jsnell who may also be interested.