Experimental performance test: Virtio-net options (take one) #796
To help catch problems like this one early, I tend to stick a line like the one sketched below at the top of my shell scripts -- then they fail if any of their commands fails unexpectedly.
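A minimal sketch of such a line, assuming bash (the exact flags are a matter of taste):

```
#!/usr/bin/env bash
# Abort the script as soon as any command exits with a non-zero status.
set -e
# A stricter variant also catches unset variables and failures inside pipelines:
# set -euo pipefail
```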
@darius Yeah. I am not sure what is right for these test runs. I start this script running and expect it to continue for hours or days and provide me with a symmetric/orthogonal data set. I think I would prefer to have errors clearly logged rather than actually stopping the test run (and leaving the machine idle).
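One way to get that behaviour, as a sketch rather than the script actually used (the function and log file names are illustrative):

```
# Run one test case; on failure, log it with a timestamp and keep going
# instead of aborting the whole multi-day run.
errlog=errors.log
run_case() {
    if ! "$@"; then
        echo "$(date -u +%FT%TZ) FAILED: $*" >> "$errlog"
    fi
}

# Example usage inside the test loop:
# run_case ./run-benchmark.sh "$commit" "$dpdk" "$pktsize"
```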
Reflection: There is a problem with the test execution script: background activity on the server could impact one branch's tests more than another's. The statistical tests assume that any such background activity is randomly distributed between the tests. The trouble is the structure of the shell script:

```
for commit in next no-indirect no-mrg neither; do
    for i in $(seq $n); do
        for dpdk in 2.1 1.7; do
            for pktsize in 64 128 256; do
                ...
            done
        done
    done
done
```

which does the testing strictly sequentially: run all tests for one branch, then all tests for the next branch, and so on. This is a problem because if there is some background activity on the server some of the time, e.g. some independent activity by somebody else, it might have a big impact on one of the branches but not the others. This would lead the analysis to report significant differences when in fact there are none. One solution would be to perform the tests in a random order with something like:

```
# Create a list of test cases
true > tests.txt
for commit in next no-indirect no-mrg neither; do
    for i in $(seq $n); do
        for dpdk in 2.1 1.7; do
            for pktsize in 64 128 256; do
                echo "$commit $dpdk $pktsize" >> tests.txt
            done
        done
    done
done
# Shuffle the cases into a random order
shuf -o tests.txt tests.txt

# Execute the tests
while read commit dpdk pktsize; do
    ...
done < tests.txt
```

This way the impact of background activity should be randomly distributed between the test cases, which satisfies an assumption of the statistical analysis. Tests affected by background activity would still be valid, and the analysis would take the variation into account when reporting its confidence level about the results (e.g. lots of background activity leads to larger +/- bounds around the results).
@domenkozar I am starting to dream about how to hook this into Nix. I have an idea now for how Hydra could automatically execute a test campaign across all available lab servers and publish the result. Please shoot it down and tell me what I get wrong :)

First, we would define one Nix expression for each individual measurement. The name of each expression would identify one combination of branch, DPDK version, packet size, and repetition (see the sketch below), and so for a simple experiment the number of entries could be something like 2 (branches) * 2 (VM versions) * 3 (packet sizes) * 10 (measurements per scenario) = 120 Nix expressions. (More complex scenarios could have a thousand expressions or more.)

Then we would ask Hydra to build all of these expressions. Once we start the experiment there would be 120 new items in the Hydra build queue, and Hydra would distribute them across all available slaves that meet the requirements (e.g. being one of the lugano servers, which all have equivalent hardware).

Once that completes we would want to aggregate the results. This could be done with a new Nix expression that depends on all of the measurement expressions, so that their outputs are accessible on the file system, and that skims through the logs to create a CSV file and generate a report. It could even be that we submit just one job to Hydra - the report expression - and the test execution happens automatically because Hydra detects the measurement expressions as requirements.

This sounds utopian to me. What's not to like?
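For concreteness, a rough sketch of how the per-measurement job names in that simple experiment could be enumerated (the naming scheme here is purely illustrative, not an actual Hydra jobset):

```
# Hypothetical naming scheme: 2 branches * 2 DPDK versions * 3 packet sizes * 10 runs = 120 jobs.
for branch in next no-mrg; do
    for dpdk in 2.1 1.7; do
        for pktsize in 64 128 256; do
            for run in $(seq 10); do
                echo "benchmark.$branch.dpdk-$dpdk.pktsize-$pktsize.run-$run"
            done
        done
    done
done | wc -l    # prints 120
```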
@lukego Very close. I wouldn't collect the results from the machines, but rather from Hydra. That is going to be easier: since you need to use Hydra to know which derivations were built, you might as well download the results from it too (or go to the servers directly, but then you need to use both HTTP and SSH). Other than that it sounds doable. We'd limit it to one job per slave to avoid artifacts.
Sounds neat. I am thinking that the jobs would also need to run with the
All Nix builds are executed in a chroot, so I'll have to see if it bind mounts something like
This is another experimental performance test following in the recent tradition of #778 and so on.
Big picture
I am starting to see a picture in my mind of where this "experimental" testing work is leading now:

That is how nirvana looks to me right now :-). It is still on the horizon, but in this PR I will at least show you where I am so far.
This test
Now I want to perform a simple test that investigates the effect of various Virtio-net options on the NFV DPDK benchmark with different versions of DPDK in the guest. Specifically, to see how performance is impacted if we suppress "Mergeable RX buffers" or "Indirect descriptors" or both.
First I manually created branches `no-mrg`, `no-indirect`, and `neither` based on `next`. These branches modify the supported options list in `net_device.lua`.

Next I wrote a shell script to run the tests in a loop.
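A minimal sketch of what such a loop can look like (illustrative only; run-nfv-benchmark.sh is a hypothetical wrapper, not the actual benchmark command):

```
# Loop over branches, DPDK versions, and packet sizes, logging everything.
for commit in next no-indirect no-mrg neither; do
    git checkout "$commit" 2>&1        # include stderr so checkout failures show up in the log
    git log --oneline -1               # record exactly which version is under test
    for dpdk in 2.1 1.7; do
        for pktsize in 64 128 256; do
            ./run-nfv-benchmark.sh "$dpdk" "$pktsize" 2>&1
        done
    done
done | tee test.log
```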
Next I wrote an awk script to pick up the `@@@@ variable value` lines and convert the output into CSV. The script also recognizes error messages and counts them as 0.

Finally I wrote an R script to do Tukey's test, to see what we can predict about the performance of each branch, and to do a simple visualization.
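To illustrate the awk step, a minimal sketch assuming benchmark output lines of the form `@@@@ <variable> <value>` (the error-message handling shown here is an assumption):

```
# Extract "@@@@ variable value" lines from the benchmark log into CSV;
# the error-to-zero rule and the "score" label are illustrative.
awk '/^@@@@/       { print $2 "," $3; next }
     /error|Error/ { print "score,0" }' test.log > results.csv
```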
Results (spoiler: flawed data)
Here is the result of Tukey's test:
This says that we cannot say with 95% confidence that there is any difference between the branches. The lower bound of the confidence interval (`lwr`) is negative and the upper bound (`upr`) is positive, which means that "zero effect" is within the expected results.

Sounds suspicious to me! I actually expect these code changes to make a significant difference.
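For reference, a minimal sketch of the kind of call that produces a table with `diff`, `lwr`, and `upr` columns (the CSV column names are assumptions, not taken from the actual script):

```
# Tukey's HSD on per-branch scores: one row per pair of branches, with the
# estimated difference (diff) and its 95% confidence interval (lwr, upr).
Rscript -e '
  d <- read.csv("results.csv")          # assumed columns: branch, score
  fit <- aov(score ~ branch, data = d)
  print(TukeyHSD(fit, conf.level = 0.95))
'
```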
Let's look at the visualization:
Here we have a separate box plot for each combination of branch, DPDK version, and packet size. On close inspection we can see that DPDK version makes a significant difference, and so does packet size, but that `branch` makes an almost imperceptibly small difference.

My first thought was that the R script was merging all the samples together somehow. However, that does not hold, because the results are not quite identical. My next thought was that the test script had somehow always tested the same branch, and that is what happened. It turns out that my Git working directory was dirty, so the `git checkout` in the script was failing. I had failed to include `stderr` in the log, so I did not notice this while scanning through the file as a sanity check. (Thankfully I did include `git log --oneline -1`, so I can now see that it was indeed the same version being tested each time.)

So! I successfully applied R to a new data set, apparently came to the right conclusion by being careful to both visualize and statistically analyse the data, and have a clear next step, i.e. to rerun the test with correct data.
Cool! More to follow.