Experimental performance test: Virtio-net options (take one) #796

Open · lukego opened this issue Mar 2, 2016 · 7 comments

@lukego (Member) commented Mar 2, 2016

This is another experimental performance test following in the recent tradition of #778 and so on.

Big picture

I am starting to see where this "experimental" testing work is leading:

  1. Shell scripts that define test campaigns, e.g. comparing a set of branches × programs × workloads, and produce CSV output.
  2. Rmarkdown report templates that can convert the CSV files into high-level analysis i.e. relevant graphs and statistics.
  3. Hydra infrastructure to automatically execute a test campaign on an available lab server and publish all of the results (graphs, code, logs, CSV files, ...) in a perfectly reproducible way (cc @domenkozar).

That is how nirvana looks to me right now :-). It is still on the horizon, but here I will at least show you where I am so far.

This test

Now I want to perform a simple test that investigates the effect of various Virtio-net options on the NFV DPDK benchmark with different versions of DPDK in the guest. Specifically, I want to see how performance is affected when we suppress "Mergeable RX buffers", "Indirect descriptors", or both.

First I manually created the branches no-mrg, no-indirect, and neither, based on next. These branches modify the list of supported options in net_device.lua.
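
Roughly like this (a sketch from memory; the comments name the standard Virtio feature bits each branch suppresses):

git checkout -b no-mrg next        # suppress "Mergeable RX buffers" (VIRTIO_NET_F_MRG_RXBUF)
git checkout -b no-indirect next   # suppress "Indirect descriptors" (VIRTIO_RING_F_INDIRECT_DESC)
git checkout -b neither next       # suppress both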

Next I wrote a shell script to run the tests in a loop.

#!/usr/bin/env bash

n=${n:-10} # repetitions per branch (assumed default; $n came from the environment in the original run)

for commit in next no-indirect no-mrg neither; do
    echo "@@@@ branch $commit"
    echo "Running $n tests of $commit"
    git checkout $commit
    git log --oneline -1

    sudo make clean
    (cd src && scripts/dock.sh "(cd ..; make -j)")

    for i in $(seq $n); do
        echo "@nfv-selftest"
        (cd src && SNABB_PCI0=0000:01:00.0 \
                   SNABB_PCI1=0000:01:00.1 \
                   scripts/dock.sh program/snabbnfv/selftest.sh)
        echo "@dpdk-test"
        for dpdk in 2.1 1.7; do
            echo "@@@@ dpdk v${dpdk}"
            for pktsize in 64 128 256; do
                echo "@@@@ pktsize ${pktsize}B"
                (cd src && SNABB_TEST_IMAGE=snabbco/nfv-dpdk${dpdk} \
                           SNABB_PCI_INTEL0=0000:01:00.0 \
                           SNABB_PCI_INTEL1=0000:01:00.1 \
                           scripts/dock.sh CAPFILE=$pktsize program/snabbnfv/packetblaster_bench.sh)
            done
        done
    done
done

Next I wrote an awk script to pick up the @@@@ variable-value lines and convert the output into CSV. The script also recognizes error messages and records them as a rate of 0.

#!/usr/bin/env awk -f

BEGIN {
    running = 0;
    print("Mpps,branch,dpdk,pktsize")
}

/@@@@ branch /  { branch = $3 }
/@@@@ dpdk /    { dpdk = $3 }
/@@@@ pktsize / { pktsize = $3 }

/assertion failed!/ { printf("0,%s,%s,%s\n", branch, dpdk, pktsize) }
/Rate\(Mpps\):/       { printf("%f,%s,%s,%s\n", $2, branch, dpdk, pktsize) }

Finally I wrote an R script to run Tukey's test, to see what we can conclude about the relative performance of the branches, and to do a simple visualization:

library(lattice)

d <- read.csv('virtio-options-1.csv')
summary(d)
df <- data.frame(Mpps=d$Mpps, branch=d$branch, dpdk=d$dpdk, pktsize=d$pktsize)

TukeyHSD(aov(Mpps ~ branch, data=df), ordered=TRUE)

png('mpps-branch-dpdk.png')
bwplot(pktsize~Mpps|branch*dpdk, data=d)
dev.off()

Results (spoiler: flawed data)

Here is the result of Tukey's test:

  Tukey multiple comparisons of means
    95% family-wise confidence level
    factor levels have been ordered

Fit: aov(formula = Mpps ~ branch, data = df)

$branch
                            diff        lwr       upr     p adj
no-indirect-no-mrg  0.0048548387 -0.4762325 0.4859422 0.9999936
neither-no-mrg      0.0051451613 -0.4759422 0.4862325 0.9999924
next-no-mrg         0.0769516129 -0.4061034 0.5600066 0.9763492
neither-no-indirect 0.0002903226 -0.4807970 0.4813777 1.0000000
next-no-indirect    0.0720967742 -0.4109582 0.5551517 0.9803973
next-neither        0.0718064516 -0.4112485 0.5548614 0.9806245

This says that we cannot claim with 95% confidence that there is any difference between the branches: for every comparison the lower bound of the confidence interval (lwr) is negative and the upper bound (upr) is positive, which means that "zero effect" lies within the interval.

Sounds suspicious to me! I actually expect these code changes to make a significant difference.

Let's look at the visualization:

[Image: mpps-branch-dpdk.png — box plots of Mpps for each combination of branch, DPDK version, and packet size]

Here we have a separate box plot for each combination of branch, DPDK version, and packet size. On close inspection we can see that the DPDK version makes a significant difference, and so does packet size, but the branch makes almost no perceptible difference.

My first thought was that the R script was somehow merging all the samples together. However, that does not hold, because the results are not quite identical. My next thought was that the test script had somehow always tested the same branch, and that is what happened: my Git working directory was dirty, so the git checkout in the script was failing. I had failed to include stderr in the log, so I did not notice this while scanning through the file as a sanity check. (Thankfully I did include git log --oneline -1, so I can now see that it was indeed the same version being tested each time.)

So! I successfully applied R to a new data set, apparently came to the right conclusion by being careful to both visualize and statistically analyse the data, and I have a clear next step: rerun the test with correct data.

Cool! More to follow.

@darius (Contributor) commented Mar 2, 2016

To help catch problems like this one early, I tend to stick a line like

set -euo pipefail # 'bash strict mode'

at the top of my shell scripts -- then they fail if any of their commands fails unexpectedly.

@lukego (Member, Author) commented Mar 3, 2016

@darius Yeah. I am not sure what is right for these test runs. I start this script running and expect it to continue for hours or days and provide me with a symmetric/orthogonal data set. I think I would prefer to have errors clearly logged rather than actually stopping the test run (and leaving the machine idle).
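
Something like this wrapper might be the middle ground (a sketch, untested): a failed case costs a loud line in the log rather than an idle machine.

run_case() {
    # Run one test command; log a failure loudly instead of aborting the run.
    "$@"
    local status=$?
    if [ "$status" -ne 0 ]; then
        echo "@@@@ ERROR: '$*' exited with status $status"
    fi
}

Each invocation in the loop would then become e.g. run_case sh -c '(cd src && ... selftest.sh)'.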

@lukego (Member, Author) commented Mar 3, 2016

Reflection: There is a problem with the test execution script in that background activity on the server could impact one branch's tests more than another's. The statistical tests assume that any such background activity is randomly distributed between the tests.

The trouble is the structure of the shell script:

for commit in next no-indirect no-mrg neither; do
    for i in $(seq $n); do
        for dpdk in 2.1 1.7; do
            for pktsize in 64 128 256; do
               ...
            done
        done
    done
done

which does the testing very sequentially: run all tests for one branch, then all tests for the next branch, etc. This is a problem because if there were background activity on the server some of the time, e.g. independent activity by somebody else, it might have a big impact on one branch but not the others. That would lead the analysis to report significant differences when in fact there are none.

One solution would be to perform the tests in a random order with something like:

# Create a list of test cases
true > tests.txt
for commit in next no-indirect no-mrg neither; do
    for i in $(seq $n); do
        for dpdk in 2.1 1.7; do
            for pktsize in 64 128 256; do
                echo "$commit $dpdk $pktsize" >> tests.txt
            done
        done
    done
done

# Shuffle the cases into a random order
shuf -o tests.txt tests.txt

# Execute the tests
while read -r commit dpdk pktsize; do
    ...
done < tests.txt

This way the impact of background activities should be randomly distributed between the test cases. This would satisfy an assumption of the statistical analysis. Tests with background activity would still be valid and the analysis would take the variation into account when reporting its confidence level about the results (e.g. lots of background activity leads to larger +/- bounds around the results).

@lukego (Member, Author) commented Mar 3, 2016

@domenkozar I am starting to dream about how to hook this into Nix. I have an idea now for how Hydra could automatically execute a test campaign across all available lab servers and publish the result. Please shoot it down and tell me what I get wrong :)

The first part is that we would define one Nix expression for each individual measurement. The names of the expressions would be something like:

packetforward-snabb2016.03-dpdk2.1-64byte-1
packetforward-snabb2016.03-dpdk2.1-64byte-2
packetforward-snabb2016.03-dpdk2.1-64byte-3
packetforward-snabb2016.03-dpdk2.1-128byte-1
packetforward-snabb2016.03-dpdk2.1-128byte-2
packetforward-snabb2016.03-dpdk2.1-128byte-3
packetforward-snabb2016.03-dpdk2.1-256byte-1
...

and so for a simple experiment the number of entries could be something like 2 (branches) * 2 (VM versions) * 3 (packet sizes) * 10 (measurements per scenario) = 120 Nix expressions. (More complex scenarios could have a thousand expressions or more.)

Then we would ask Hydra to build all of these expressions. So once we start this experiment there would be 120 new items in the Hydra build queue, and it would distribute them across all available slaves that meet the requirements (e.g. one of the lugano servers, which all have equivalent hardware).

Once that completes, we will want to aggregate the results. This could be done with a new Nix expression that depends on all of the measurement expressions, so that their outputs are accessible on the file system, and that skims through the logs to create a CSV file and generate a report.
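
That step could even reuse the awk extractor from above, something like this (a sketch; it assumes each measurement's log ends up under a common results directory):

# Hypothetical layout: one directory per measurement, each containing a log.
awk -f extract.awk results/*/log > campaign.csv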

It could even be that we submit just one job to Hydra - the report expression - and the test execution happens automatically because Hydra builds the measurement expressions as its dependencies?

This sounds utopian to me:

  • Hydra would execute benchmarks across all available servers, so a 10h benchmark could be completed in 1h on 10 servers.
  • Each individual test scenario would be addressable with Nix: you could build it from scratch with nix-build (execute test), get it from a binary cache (fetch result from an earlier run), or submit it to Hydra to build somewhere / some time later.
  • The complete software environment for every test would be well-defined and permanently reproducible.

What's not to like?

@domenkozar (Member) commented

@lukego Very close. I wouldn't collect the results from the machines, but rather from Hydra. That is going to be easier: you need to ask Hydra to know which derivations were built, so you might as well download the results from it too (or go to the servers directly, but then you need to use both HTTP and SSH).

Other than that it sounds doable. We'd limit it to one job per slave to avoid measurement artifacts.

@lukego (Member, Author) commented Mar 3, 2016

Sounds neat.

I am thinking that the jobs would also need to run with the lock script (#773) so that they are synchronized with other activities on the lab servers, e.g. manual testing. This way we could run a Hydra slave on most or all lab servers and share them between interactive and CI use. (If needed we could extend the lock script so that interactive work preempts/aborts Hydra jobs, making those automatically loop/back off until they manage to complete without preemption.)
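
For illustration, the synchronization could be as simple as this (a sketch using flock(1), not the actual #773 lock script; the lock path is made up):

# Take the shared lab lock before benchmarking, waiting up to an hour
# for other users of the server to finish.
exec 9> /tmp/snabb-lab.lock
flock --timeout 3600 9 || { echo "could not get lab lock"; exit 1; }
# ... run the benchmark while holding the lock on fd 9 ...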

@domenkozar (Member) commented

All Nix builds are executed in a chroot, so I'll have to see whether it bind-mounts something like /run or /tmp; otherwise I could add that so the lock can be shared.

@eugeneia added the idea label Apr 25, 2016
dpino pushed a commit to dpino/snabb that referenced this issue May 29, 2017: "Fix small nit in test on lwaftr-kristian"