[wip] snabbmark: Add preliminary "byteops" benchmark #755
Conversation
This benchmark comprehensively measures the performance of various byte-oriented operations (copy, checksum) with many different variations (implementation; distribution of input sizes; alignment of src/dst/len; etc). This is working but messy first-cut code.
@petebristow what do you think about this in the context of #692? I am thinking that this PR basically steals that idea but keeps everything local. I can imagine having additional functions for benchmarking apps and app networks alongside this one for benchmarking byte-oriented functions. (This one can also be improved a bit, e.g. to recognize that the checksum tests ignore the destination arguments and so prune those out of the permutation space.) These could initially share code via subroutines.
Baby steps... I asked my mum about this kind of data analysis (she's good with statistics) and the keyword she gave me was SST (Total Sum of Squares). This led me to the Khan Academy videos on Analysis of Variance (Inferential Statistics). Looks promising. "A little knowledge is a dangerous thing"... early days yet. Please pipe up if you are interested in these things :).
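For concreteness, here is a minimal sketch in R of what SST is, using made-up numbers rather than anything from this thread:

```r
# Total Sum of Squares: the total variation of the measurements
# around their grand mean. ANOVA partitions this total into
# components explained by each factor plus residual noise.
y <- c(10.2, 9.8, 14.1, 13.9, 10.0, 14.3)  # hypothetical measurements
sst <- sum((y - mean(y))^2)
print(sst)
```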
Closing this PR for now. I will reopen when I have something new to show.
My mum tells me that she did an analysis of variance with a few variables and it says that the benchmarks are not sensitive to the alignment of the source operand, but they are sensitive to the combination of operation (cksum, cksumavx2, memcpy) and displacement (cache level). Here is a picture of the latter. This seems to be saying that the base checksum is always slow, the AVX2 checksum varies by a factor of 2 depending on L1/L2/L3/DRAM, and the memcpy operation is dramatically faster in L1 cache (up to 12KB working set size). I find it encouraging that this kind of thing can be detected purely numerically without any explanation of what the data actually represents. Everything with a grain of salt at this stage, but still... feels like progress.
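A picture of this kind can be drawn with base R's `interaction.plot`. The sketch below assumes the `data4.csv` columns used in the script later in the thread; it is not the code that produced the original plot.

```r
d <- read.csv(file='data4.csv', sep=';')
d$disp <- as.factor(d$disp)
# One trace per operation (the 'name' column), showing the mean
# byte/cyc at each displacement: the operation x cache-level
# interaction described above.
interaction.plot(x.factor=d$disp, trace.factor=d$name, response=d$byte.cyc)
```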
So, seriously, is R the coolest thing in the universe? Probably. I very easily loaded the CSV file into R:
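```r
# (presumably the same read.csv call as in the full script below)
d <- read.csv(file='data4.csv', sep=';')
```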
and then, completely by magic, it is able to just tell me the answers to all of these big questions I have been wondering about. Like:
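```r
# Presumably a call like the one in the full script below; the
# output table from the original comment is not reproduced here.
summary(aov(d.byte.cyc ~ d.srcalign * d.disp * d.lenalign, data=df))
```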
which says to me that:
Holy smokes! Imagine taking hours of benchmarking numbers and getting a detailed analysis of them in one second. Mind: blown. This seems like exactly the tool that I need to move forward with micro-optimizations like the asm blitter in #719, where I want to understand how robust the optimization is to different workloads. EDIT: Actually it is saying that the alignments do matter a little bit... I said they don't matter based on their apparently much smaller effect.
ahem :) I think the conclusion above is okay but the details are wrong: I didn't properly declare that some more of the numeric columns in the CSV file are "factors". Correcting that, we get a similar table:
where column […]. Grain, salt, etc.
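One quick way to check how R has classified each column (a generic R idiom, not from the original comment):

```r
d <- read.csv(file='data4.csv', sep=';')
# Columns reported as "numeric" or "integer" will be treated by aov()
# as continuous covariates rather than experimental factors.
sapply(d, class)
```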
If anybody wants to try then here is the full script:

```r
#!/usr/bin/env Rscript
print('loading data4.csv file')
d <- read.csv(file='data4.csv', sep=';')
# Columns that are "factors" in the experiment.
# (R would auto-detect if the values were non-numeric.)
d$srcalign <- as.factor(d$srcalign)
d$lenalign <- as.factor(d$lenalign)
d$disp <- as.factor(d$disp)
# Create a "data frame" with some columns to look at
df <- data.frame(d$byte.cyc, d$disp, d$srcalign, d$lenalign, d$name)
# run the analysis of variance
print('running analysis of variance')
summary(aov(d.byte.cyc ~ d.srcalign * d.disp * d.lenalign, data=df))
```

which expects to find `data4.csv` in the same directory. The expected output:
Handy link: R Tutorial, which was the best one I found.
Here is a Google Doc summarizing some more thoughts that my mum shared about this data set. I haven't fully digested them yet.
Fun weekend hack...
This branch adds a new command, `snabbmark byteops`, that measures byte-oriented operations with diverse parameters and produces a comprehensive CSV file. The intention is to systematically measure and compare the performance of operations like memcpy and checksum at different levels of the cache hierarchy, with different alignments, and with different distributions of input sizes. This is in the same spirit as #688 and #744.
TLDR: Full CSV output for 10 runs on lugano-1. (45K rows.)
The parameters tested are: `memcpy`, `cksum`, `cksumavx2`. (This program seems to "fuzz" a problem in `cksumsse2` that needs to be looked into!)
The resulting CSV file includes:

- `nbatch`: Aggregate number of iterations measured using the same parameters.
- `nbytes`: Aggregate bytes for all iterations in the batch.
- `nanos`: Nanoseconds elapsed to process the whole batch.
- `cycles`, `ref_cycles`, `instructions`, `l1-hits`, `l2-hits`, `l3-hits`, `l3-misses`, `branch-misses`: Performance counter report for the batch.

Here is some example output:
Is there anybody brave enough to try and analyze this data? :-)
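A small starter sketch, assuming the column names listed above (this code is not part of the PR):

```r
d <- read.csv(file='data4.csv', sep=';')
# Bytes per cycle is a natural throughput metric here: total bytes
# processed in a batch divided by the cycles it consumed.
d$bytes_per_cycle <- d$nbytes / d$cycles
# Mean throughput per operation; the 'name' column is assumed to
# identify the operation, as in the R script earlier in the thread.
aggregate(bytes_per_cycle ~ name, data=d, FUN=mean)
```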