Improve accuracy of CPU-bound benchmarks #1428
Conversation
Thanks. I'll take a look at it. Which results do you find the most striking? Alpine create rows or something else?
Interesting how Karyon seems to be significantly more affected than the others by the change 🤔
@krausest thanks. I would say that apart from the scenarios with creation/deletion of rows, all others have better results, albeit, as you noted @fabiospampinato, I suppose it is because the …
I took a look at it and I'm not so sure... I chose the following frameworks (pretty similar to your choice above):

Sum of squares for the new implementation (new page per iteration):

Sum of squares for the old implementation (new page per benchmark):

As you can see, the sum of squares is lower for the old implementation. If we look at each benchmark, we see update 10th standing out. My current conclusion: except for update 10th row, your proposal is worse for accuracy. It could be interesting to check whether we could indeed improve accuracy by using a new page per iteration for update 10th row. I can provide some Python scripts if someone wants to investigate.
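For anyone who wants to reproduce this kind of check without the Python scripts, here is a minimal sketch of one reasonable reading of the sum-of-squares comparison, written in JavaScript to match the snippet later in this thread. The data shape ({ f, b, v: { total: [...] } } entries) is borrowed from that snippet; the function and its name are an illustration, not the actual scripts offered here.

// Sum of squared deviations from each framework's mean, accumulated per benchmark.
// `results` is assumed to be an array of { f, b, v: { total: number[] } } entries.
const sumOfSquaresPerBenchmark = results => {
  const perBenchmark = {};
  results.forEach(({ b, v }) => {
    if (!v?.total) return;
    const mean = v.total.reduce((s, x) => s + x, 0) / v.total.length;
    const ss = v.total.reduce((s, x) => s + (x - mean) ** 2, 0);
    perBenchmark[b] = (perBenchmark[b] ?? 0) + ss;
  });
  return perBenchmark;
};

Comparing the two maps benchmark by benchmark (new page per iteration vs. new page per benchmark) is where update 10th would stand out.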
To be honest, I didn't dig so deep into this; I raised it just by staring at the results table and visually comparing the data. I agree with your suggestion to use sum of squares as the metric, but I will go further and suggest the sample variance as a more complete measure of dispersion. Below are the results:

Results
deviation: {
"actual": 70,
"probe": 38,
"equal": 0
}
global_variance: {
"actual": 90317.94378434446,
"probe": 77131.07725527778
}
total_variance_per_scenario: {
"01_run1k": {
"actual": 420.8260004444444,
"probe": 195.72536325555555
},
"02_replace1k": {
"actual": 425.1723292111113,
"probe": 415.9624710555551
},
"03_update10th1k_x16": {
"actual": 53264.40420355556,
"probe": 39700.66003022223
},
"04_select1k": {
"actual": 21912.573555022227,
"probe": 20733.772931200005
},
"05_swap1k": {
"actual": 1982.985489422223,
"probe": 1085.9812504000015
},
"06_remove-one-1k": {
"actual": 1323.9257832888882,
"probe": 649.5034381333339
},
"07_create10k": {
"actual": 5061.0647073555665,
"probe": 4479.431361922221
},
"08_create1k-after1k_x2": {
"actual": 2082.4557004,
"probe": 1952.440614511111
},
"09_clear1k_x8": {
"actual": 3844.5360156444463,
"probe": 7917.599794577779
}
}

The code used to obtain these results is next.

Code
const compare = (actual, probe) => {
  const weight = {
    '01_run1k': 1,
    '02_replace1k': 1,
    '03_update10th1k_x16': 16,
    '04_select1k': 16,
    '05_swap1k': 4,
    '06_remove-one-1k': 4,
    '07_create10k': 1,
    '08_create1k-after1k_x2': 2,
    '09_clear1k_x8': 8
  };

  // https://en.wikipedia.org/wiki/Sample_variance
  const variance = sample => {
    const l = sample.length;
    const m = sample.reduce((s, i) => i + s, 0) / l;
    return sample.reduce((s, i) => Math.pow(i - m, 2) + s, 0) / (l - 1);
  };

  // framework variance per scenario (weighted, lower is better)
  const items = {};
  Object.entries({actual, probe}).forEach(([kind, values]) => {
    values.forEach(i => {
      if (i?.v?.total) {
        const key = `${i.f}-${i.b}`;
        items[key] ??= {type: i.b};
        items[key][kind] = variance(i.v.total) * weight[i.b];
      }
    });
  });

  // scenarios with high variance between implementations (lower is better)
  const count = Object.entries(items).reduce((a, [, v]) => {
    a.actual += v.actual > v.probe ? 1 : 0;
    a.probe += v.actual < v.probe ? 1 : 0;
    a.equal += v.actual === v.probe ? 1 : 0;
    return a;
  }, {actual: 0, probe: 0, equal: 0});

  // total variance per scenario (lower is better)
  const total = {};
  Object.entries(items).forEach(([, {type, actual, probe}]) => {
    total[type] ??= {actual: 0, probe: 0};
    total[type].actual += actual;
    total[type].probe += probe;
  });

  // global variance (lower is better)
  const global = {actual: 0, probe: 0};
  Object.entries(total).forEach(([, {actual, probe}]) => {
    global.actual += actual;
    global.probe += probe;
  });

  console.log('total_variance_per_scenario:', total);
  console.log('global_variance:', global);
  console.log('deviation:', count);
};

From the results above it can be observed that the new implementation shows worse variance only in 09_clear1k_x8; in every other scenario, and globally, the variance is lower.
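For reference, a hedged sketch of how the comparator above might be wired up; the file names and the idea of loading plain JSON dumps are assumptions for illustration, not part of the benchmark repository:

// Hypothetical invocation: both inputs are arrays of { f, b, v: { total: [...] } }
// entries, matching what compare() reads above.
const actual = require('./results-same-page.json'); // assumed file: new page per benchmark (current behavior)
const probe = require('./results-new-page.json');   // assumed file: new page per iteration (proposed change)
compare(actual, probe);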
fwiw, this is true of every benchmark ever written.
@leeoniya I got your point, but that statement was about the contradiction in the variance results, not the performance of the benchmark itself, i.e. we are witnessing worse variance values in a more performant environment (if it was really the same as in the official benchmark, i.e. a MacBook Pro 14 (32 GB RAM, 8/14 cores, OSX 14.0)).
I tried it with the Chrome 119 numbers (actual = same tab, probe = new tab):
So in this run your suggestion indeed looked better in all cases! There's one caveat: I'm currently seeing a few errors where the trace is mostly empty (the error says that no click event is included in the trace). I had 5 such errors for the same-tab approach and 35 for the new-tab approach. I currently have no idea how to mitigate this error. If I find something we might switch to the new-tab approach.
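For illustration, a minimal sketch of the kind of check that would report such an error, assuming a standard Chrome trace JSON file; the event name and field layout (an EventDispatch event with args.data.type === 'click') are assumptions about how Chrome traces record dispatched clicks, not code from the benchmark driver:

const fs = require('fs');

// Returns true if the trace file contains at least one dispatched click event.
const traceHasClick = path => {
  const json = JSON.parse(fs.readFileSync(path, 'utf8'));
  // Traces are either a bare array of events or wrapped in { traceEvents: [...] }.
  const events = Array.isArray(json) ? json : json.traceEvents ?? [];
  return events.some(
    e => e.name === 'EventDispatch' && e.args?.data?.type === 'click'
  );
};

// A mostly empty trace fails this check, which is presumably what triggers the error above.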
Good results indeed; now everything is in its place and the experiment aligns with the theory, i.e. we got what was expected: a sizeable improvement for the CPU-throttled scenarios, the ones that are most "environment-sensitive". As for the empty traces and errors, I see these are not specifically related to this change, only magnified by it. To me they look more like page-tracing issues, so maybe it would be better to have a separate thread for that.
As per #1493, I'm closing this here; similar functionality was integrated into master.
The rationale behind this is that running multiple iterations of a benchmark within the same tab can distort the results due to the browser's optimization strategies. Consequently, this introduces unreliable variance when comparing different types of frameworks, whether by nature (compiled vs. runtime) or by behavior (e.g., innerHTML vs. createElement).
This aligns with my observations, where certain frameworks lag behind others in some scenarios, even when using manual DOM manipulations.
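As a rough illustration of the two strategies (this is a sketch, not the benchmark driver's actual code; the URL handling, iteration count, and runBenchmark callback are placeholders), a Puppeteer-style loop that opens a fresh page per iteration might look like this:

const puppeteer = require('puppeteer');

// Reusing one page for every iteration (the old behavior) lets JIT warm-up and
// other browser optimizations carry over between runs; opening a fresh page per
// iteration (the proposed behavior) starts each run cold.
const runIterations = async (url, iterations, runBenchmark) => {
  const browser = await puppeteer.launch();
  const durations = [];
  for (let i = 0; i < iterations; i++) {
    const page = await browser.newPage();                   // fresh page for each iteration
    await page.goto(url, { waitUntil: 'networkidle0' });
    durations.push(await runBenchmark(page));               // placeholder measurement step
    await page.close();                                     // discard all accumulated state
  }
  await browser.close();
  return durations;
};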
To observe the impact of this change, a small experiment was run in the following environment:
HW: HP EliteBook 8470p, i5-3320M × 4, 16.0 GiB
SW: Ubuntu 23.04, Chromium 118.0.5993.70 (Official Build) snap (64-bit)
Given the constraints of low-spec hardware, I opted for a minimal set of well-known (keyed) frameworks. vanillajs-1 serves as the control. Additionally, I included karyon, a framework I maintain and whose behavior I am familiar with across the various test scenarios.
The benchmark figures are below:
and the results.ts.