Low DPDK throughput in NFV benchmark #665
Comments
@eugeneia Could you please update the issue text to say exactly what we know about the differences between the fast and slow test environments and how to run both tests? (In the discussion on the other issue I lose track a bit of exactly what code is being tested and what its results are.) Standard question whenever a mysterious 30% performance difference appears: Can the NUMA affinity have changed? (Could check e.g. in |
@lukego I started testing from a clean slate, to be honest I didn't know anything and that's why I did not want to pollute the issue with previous speculations. I am quite sure that it's unrelated to NUMA affinity, as nothing changed there and the other benchmarks are not affected. So here we go, fresh start. There are two different versions of DPDK being tested:
I also tested on two different machines:
Finally I tested with two different versions of QEMU (2.4.0 and 2.1.0). Now to the results:
So while there is a performance hit when upgrading to DPDK 2.1, the reason for the decrease in performance on davos indeed seems to be the new QEMU (2.4.0) version. @lukego On a side note, legacy DPDK does indeed seem to work with 1000 byte packets. |
Can you tell me how to reproduce please? We should not apply any patches to DPDK. 9000 byte is what I believe won't work in the older DPDK. |
Good news, I have found the culprit: QEMU is the bottleneck, not DPDK as I thought. I have built a docker image identical to what the CI uses except that it contains QEMU https://github.com/SnabbCo/qemu/tree/v2.1.0-vhostuser plus the patch which increases hard-coded Virtio vring size (snabbco/qemu@7a94322). (I actually reproduced the wrong QEMU version before, that's why I couldn't reproduce on grindelwald.) This image should be an almost exact replica of the opaque blobs and QEMU we used in bench_env times (except a slight difference in the QEMU source for which I couldn't find the patch / context): https://hub.docker.com/r/eugeneia/snabb-nfv-test-legacy/ You can reproduce like so:
This yielded expected performance (5.3Mpps on grindelwald) and no DPDK packet loss when I ran it. |
Nice work. :-) |
Great detective work, Max! The Docker workflow really does make it easy to reproduce tests! I ran this on chur and the results seem consistent when taking into account the slower CPU:
Next I would really like to get in control of the patches. Specifically I would like to migrate over to testing with the latest releases of QEMU and DPDK without any patches applied. That is the intended target software environment. If the performance does not match expectations then we would dig in to look for the root cause and try to address this in the snabb code rather than by adding/reviving patches. Is this easy to set up? Patches can really take on a life of their own :-). I have a little bit of Nix envy here: Nix makes the exact versions/patches being tested completely transparent, even down to kernels and libc both on the host and inside the VMs, whereas Docker seems to make this very opaque. Dockerhub makes it extremely easy to reproduce the test environment but the summary page doesn't provide any insight into what code is actually in the container. (There is actually one QEMU patch that I do still recommend applying, "f393aea Add G_IO_HUP handler for socket chardev", but that should not be needed for the DPDK benchmark. It's only to allow the snabb process to restart without restarting QEMU.) |
I am in the process of building a patch-less “vanilla” image. |
Here is the “vanilla” image containing latest stable QEMU and DPDK (and kernel 3.19 instead of 3.13 because DPDK required 3.14+): https://hub.docker.com/r/eugeneia/snabb-nfv-test-vanilla/ |
Awesome, thanks! I am really impressed with the Docker workflow you have cooked up, I think that the effort has already paid off in terms of time saved on manually reproducing test environments. Here is a braindump on the general theme of compatibility and performance with different software versions from our upstream Snabb Switch perspective: The main thing is to make the latest upstream versions work and prevent them from breaking/regressing in the future. Then over time we build up a long trail of compatibility with consecutive versions. Supporting older and/or patched versions is less important unless specially required for some reason. We want to take responsibility for making all the components work well together: Snabb, QEMU, DPDK/Linux guests, etc. If there is a problem then we need to find a solution even if that involves creating workarounds in Snabb Switch, making temporary patches to QEMU/DPDK, and working with upstream communities to merge fixes so that we can drop patches. In this sense it is better when a problem is caused by Snabb Switch, where it should be easy to fix, rather than a change in another project that we need to somehow deal with. On performance issues like packet drops it can be that we need to take a "holistic" view of the relationships between all of the components rather than looking at one in isolation. For example if a new QEMU version leads to packet loss then this doesn't necessarily mean that there is a bug in QEMU but rather that the interactions between the components have changed. Particularly, QEMU is not involved in packet processing at all (this is done directly in shared memory between Snabb & DPDK) so it cannot be a direct processing bottleneck, but it is involved in negotiating the sizes and features of the shared memory rings and this can indirectly affect performance. Likewise even between the processing components it is hard to point a finger at one and say that it is the problem.
If the DPDK guest is dropping packets then all we know is that it is receiving faster than it is transmitting (and the difference is the dropped packets). This could be due to local DPDK behavior, or subtle Virtio-net behavior that gives the receive ring larger capacity than the transmit ring, or subtle Snabb Switch behavior that bursts packets onto the receive ring faster than it takes them off the transmit ring, and so on. The most practical method I know of for diagnosing holistic performance problems is to bisect, i.e. to identify two software versions that are as close as possible but with one being "good" and one being "bad". In this example it seems like we have "good" behavior from the fully patched QEMU 2.1 and "bad" behavior from the unpatched QEMU 2.4.1 and the next problem is to isolate this more narrowly e.g. does the problem appear when we drop one of our QEMU patches (like the vring size increase) or did it appear in QEMU 2.2 or 2.3 or 2.4 and so on. I have a fantasy that our build infrastructure could make it easy to generate a test matrix e.g. by running commands like:
I am not sure if this is practical with the Docker-based test bootstrapping? If so that would be interesting! I believe that this is a built-in feature of Hydra where build/test parameters can be specified as e.g. enums and all combinations can be tested (seen in a blog post on Hydra). I am keen to dig into this on the side in case it would be a nice solution for us in the future. |
@eugeneia Running the "vanilla" test I don't see traffic passing so it seems like we have a compatibility issue between Snabb master + QEMU 2.4.1 + DPDK 2.1. Do you agree? If so I can create a separate ticket for that. |
No I tested it and it works for me (albeit with decreased performance, since both the DPDK and QEMU “regressions” kick in).
Edit: I have only tested on chur. |
My guess is that it's these two patches that are impacting the throughput:
And maybe this one (I am not sure what it does): virtualopensystems/dpdk@dae0a7f ( [virtio] Initialize the queues even if VIRTIO_NET_F_CTRL_VQ is not negotiated) Will try to verify this guess. |
Is there an easy way to confirm that based on the Docker workflow? This seems worth confirming before investing serious time in making a fix. |
I did some tests, results in relative numbers:
I could confirm that virtualopensystems/dpdk@dae0a7f is unrelated to performance. We already knew virtualopensystems/dpdk@7807fbb makes up for 20%, so I am thinking I misapplied snabbco/qemu@7a94322 (the code changed a bit since then, this is my adaptation: eugeneia/qemu@101ec94). Makes sense? |
Regarding an easy way: I scratched my head a bit, but I ended up just branching snabbswitch-docker and building new images. The image building process takes a bit, but at least if you decide to share you can just |
I take that back: Removing the last DPDK patch from the equation does not affect performance. That leaves us with:
I might be approaching this from the wrong direction (e.g. eliminating patches that don't help instead of “bisecting” to the first bad commits) but since we have only ~3 relevant patches and probably thousands of commits from QEMU and DPDK upstream (which I probably don't understand)... |
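For comparison, the "bisecting" approach mentioned above can be sketched as a binary search over an ordered list of versions. This is a generic illustration, not the workflow used in this thread: `first_bad` and the `is_good` callback are hypothetical names, and the sketch assumes the first version is known good, the last is known bad, and the behavior flips exactly once in between.

```python
def first_bad(versions, is_good):
    """Return the first "bad" version in an ordered list.

    Assumes versions[0] is good, versions[-1] is bad, and every version
    after the first bad one is also bad.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[hi]


if __name__ == "__main__":
    # Stand-in for "build an image and run the benchmark"; purely illustrative.
    releases = ["2.1", "2.2", "2.3", "2.4"]
    print(first_bad(releases, lambda version: version < "2.3"))
```

In practice `is_good` would build the environment for that version and compare the measured throughput against a threshold, so each call is expensive. The point of bisecting is that it needs only about log2(n) such runs.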
OK, I am now reasonably certain that DPDK is the component we need to patch / focus on. I have run another test using a vanilla/legacy hybrid image, using “legacy DPDK” and vanilla QEMU + eugeneia/qemu@101ec94:
There could maybe be one detail invalidating the result, which is kernel versions (legacy doesn't compile with 3.19 so it uses 3.13, vanilla doesn't compile with 3.13 so it uses 3.19). I guess I could apply eugeneia/dpdk@75f58c6 to vanilla to be really sure. Anyways, my takeaway from this is that we need to focus on DPDK and find out where the performance decrease comes from. If I understand correctly, l2fwd is just an example program, and is mostly untouched since 2013. So maybe it needs to be updated to adapt to DPDK development. |
Latest insights on this issue:
|
@eugeneia do you have a quick tip for how I could run the snabbnfv in benchmark mode ( |
See |
My goal is to run the full benchmark (packetblaster+qemu+snabb) but to control the |
No, not really. 😞 You could edit this line in between runs. Suboptimal, I know... |
Have been thinking about indirect descriptors a bit more. I am starting to think that they are an expensive feature that should be avoided. Direct descriptors only require the device/hypervisor/snabbnfv to make one L3 cache access to access packet payload. Indirect descriptors require two L3 cache accesses: first to resolve the address of the payload and second to actually access it. These L3 accesses are dependent on each other so the CPU won't be able to parallelize them (second can't start until the address is provided by the first). This makes me think that indirect descriptors will generally have higher per-packet overhead than direct descriptors. This would be visible in the DPDK l2fwd benchmark (high packet rate) but not with Linux kernel VMs (low packet rate, bottleneck is checksum offload). Could be that this can be resolved with clever assembler code in #719 to access multiple packets in parallel but I am not sure. (Could also be that I am mis-analysing the situation entirely.) Just flagging to @nnikolaev @dpino @wingo that you may be able to expect better efficiency with direct descriptors rather than indirect ones but I am not sure yet. Ideas/input/data welcome. (My understanding is that indirect descriptors are mostly useful for working around the impractically small vring size that is hard-coded in QEMU but the existing CI benchmarks show that it is possible to achieve good performance even with such small vrings.) |
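The dependency argument above can be made concrete with a toy Python model that counts the dependent memory accesses needed to reach a packet payload. This is a simplified sketch of the reasoning in the comment, not the Virtio data structures or Snabb code: the `Descriptor` class, the `tables` dict and the addresses are hypothetical, and the access needed to read the ring descriptor itself is ignored for both cases.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    addr: int       # address this descriptor points to
    indirect: bool  # True: addr points to a descriptor table, not to payload


def dependent_loads(desc, tables):
    """Count dependent memory accesses needed to reach the payload."""
    loads = 0
    if desc.indirect:
        # First access: fetch the indirect table to learn the payload address.
        desc = tables[desc.addr][0]
        loads += 1
    # Final access: read the payload itself. It cannot start until the
    # payload address is known, so the accesses are serialized.
    loads += 1
    return loads


if __name__ == "__main__":
    direct = Descriptor(addr=0x1000, indirect=False)
    indirect = Descriptor(addr=0x2000, indirect=True)
    tables = {0x2000: [Descriptor(addr=0x1000, indirect=False)]}
    print(dependent_loads(direct, tables), dependent_loads(indirect, tables))
```

In this model a direct descriptor costs one access and an indirect descriptor costs two, and the second cannot begin before the first completes. Whether that translates into a measurable per-packet cost is exactly the open question raised above.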
See also the excellent test suite walkthrough that @eugeneia wrote. Down the bottom you see the much higher results when testing with an older version of DPDK l2fwd that did not use indirect descriptors. (Hope we can improve the situation for both.) |
Closing because #1001 landed. |
program/snabbnfv/packetblaster_bench.sh
performs badly due to DPDK dropping packets. See #588.