workflows: make cockroach-microbench-ci as required #135899
Conversation
Tested this with both the label and without the label.
Is the runtime of this new job longer than the current longest required jobs (unit tests and lint)?
Yes, the current runtime is ~31 min, which is longer than lint's ~26 min.
Any chance we could narrow the set of tests in the required job a bit and/or split it across jobs such that we aren't regressing on end-to-end time for CI?
If a patch changes the backupccl package, I don't see the need to run the microbenchmarks in the storage package. Before merging any sort of microbenchmark test as required in CI, can we limit the benchmarks run to the packages that were changed in the commit? I'm quite worried this change would lead to very flaky CI.
I'm also skeptical this would add more test coverage. Like, we already run these benchmarks once a night, right? If the owning team gets a failure the morning after they merged a problematic patch, they should be able to quickly bisect the regression. Also, this would be quite expensive? Is this really worth it?
I agree with the concerns about this change. We can't be making CI take any longer to complete. I am also concerned about introducing spurious CI failures.
The script uses a single microbenchmark.
There's no way to split it since it is a single microbenchmark.
Same as above: since it is a single microbenchmark, it can't be limited to the packages changed in the commit. As for the flakiness part, this has been addressed by analysing the variance of the subsequent runs, which were plotted here. There were also changes by @nvanbenschoten to reduce the variance (#131711). Apart from the above changes, I have analysed the past month of microbenchmark runs, and @srosenberg and I arrived at a threshold value of 20%.
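For illustration, here is a minimal Go sketch of the kind of threshold check being described: compare an old and a new ns/op measurement against a relative cutoff and fail when the slowdown exceeds it. The function names, standalone-program shape, and the hard-coded 20% constant are assumptions for the example, not the actual CI script.

```go
package main

import "fmt"

// regressionThreshold is the relative slowdown (new vs. old) above which the
// check is assumed to fail; 20% mirrors the threshold discussed above.
const regressionThreshold = 0.20

// checkRegression compares old and new ns/op values for one benchmark and
// returns an error when the slowdown exceeds the threshold.
func checkRegression(name string, oldNsPerOp, newNsPerOp float64) error {
	delta := (newNsPerOp - oldNsPerOp) / oldNsPerOp
	if delta > regressionThreshold {
		return fmt.Errorf("%s regressed by %.1f%% (old %.0f ns/op, new %.0f ns/op)",
			name, delta*100, oldNsPerOp, newNsPerOp)
	}
	return nil
}

func main() {
	// Hypothetical numbers, purely for illustration.
	if err := checkRegression("BenchmarkExample", 1000, 1300); err != nil {
		fmt.Println("regression detected:", err)
	}
}
```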
This can be addressed by reducing the count of the benchmark runs, but that will result in more variance in the benchmark output. @srosenberg, wdyt about this? Is that the trade-off we want to take for faster CI?
#106661 mentions how roachperf has not been able to detect small regressions that have been incrementally added by PRs; the CI job aims to address this by stopping the PR from getting merged in the first place.
Obviously, since
How many spurious failures have occurred since this change was made? Have any legitimate failures been observed?
I'm not quite following this. If small regressions are hard to notice due to being small, how does looking for them before vs. after merge change the scale of the regression we can reliably detect? It seems like if we're able to mechanically determine that one SHA is a regression compared to another SHA with high confidence -- such that we can run it on the PR's SHA vs. its merge base -- then we could also just do so once or twice a day, comparing the current HEAD SHA to the last one, and file an issue (with the delta commit log) if it has regressed. Does the comparison really need to be done for every PR?
I also don't know if I would consider 20% to be a "small" regression. It also seems like a regression of that magnitude would be clearly visible in the nightlies.
The idea of doing the comparison on every PR is that it forces the engineer to fix regressions in the PR itself, rather than waiting for the nightlies to run and then going back to analyse the code to detect and fix the issue. That makes the turnaround time higher; it is reduced by making the engineer fix the issue in the current PR itself.
Are there DB engineers asking for the benchmarks to be run on all PRs before merging?
Sounds like we need to migrate off roachperf and onto a time series database that can conduct regression monitoring automatically :D
I think I didn't explain fully before how we arrived at this number. The comparison was initially skipped in this PR: #129390. Later I ran the microbenchmark around 100 times on a single commit SHA and realised that there's a lot of variance in the benchmark output, whose plot I posted in my comment above (https://lookerstudio.google.com/reporting/d707c4cc-5b86-490f-891f-b023766bb89f/page/n1tCE). The variance ranged from 5% to even 10% on the same commit SHA. This led us to believe that we needed to increase the threshold to catch real regressions, so 20% might seem like a big number, but given the variance in microbenchmark runs, it is something we had to do.
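To make the variance argument concrete, here is a small Go sketch (not the analysis that was actually run) that computes the coefficient of variation of repeated ns/op samples taken on the same SHA; a CV of 5-10% is what motivates picking a threshold well above the noise floor. The sample values are made up.

```go
package main

import (
	"fmt"
	"math"
)

// coefficientOfVariation returns stddev/mean for a set of ns/op samples,
// i.e. the run-to-run noise of the benchmark expressed as a fraction.
func coefficientOfVariation(samples []float64) float64 {
	var sum float64
	for _, s := range samples {
		sum += s
	}
	mean := sum / float64(len(samples))

	var sq float64
	for _, s := range samples {
		sq += (s - mean) * (s - mean)
	}
	stddev := math.Sqrt(sq / float64(len(samples)))
	return stddev / mean
}

func main() {
	// Hypothetical ns/op values from repeated runs of one benchmark on the
	// same commit SHA.
	samples := []float64{1000, 1050, 980, 1100, 1020, 1075}
	fmt.Printf("cv = %.1f%%\n", coefficientOfVariation(samples)*100)
}
```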
I will get the data tomorrow IST and post it here.
I think so; one ask is the issue that is linked to this ticket, and another is a Slack thread here: https://cockroachlabs.slack.com/archives/C0366GQV668/p1723134389600209
Do we need to force engineers to do this? Is there evidence that, if we present engineers with reliable information that they just merged a significant regression, they won't act quickly to fix/revert/etc.?
While that's true, it isn't clear that that is too much turnaround time, or that cutting it down is worth the cost (either in $ or, more importantly, in productivity due to a slower CI pipeline). That said, if we really wanted something quicker, I think we still have options that aren't to be in the blocking path of every PR. For example, I believe we already have a CI that runs every time master's SHA has changed ("merge to master" or "publish bleeding edge" or something? dev-inf/rel-eng would know the details), in which the list of candidate changes compared to the last run would be relatively small (just what was in the bors batch).
I guess I would rephrase the above differently. The operating assumption is that the specifically chosen microbenchmarks are fairly stable, and that the asserted performance difference denotes a sizable regression with high probability. If that's the case, then how is this different from requiring that the unit tests pass? Catching a regression after an offending PR is merged translates to doing more (infra) work. When would you not want to block a PR, after learning that it contains a sizable performance regression?
One difference might be that unit tests fail regularly and are frequently broken by routine PRs. How often do single-PR performance regressions to the tune of 20% in a single microbenchmark occur? Once per month, week, year? Setting aside developer experience concerns (e.g. are there false positives, etc.), if this error case does not frequently occur, it is very unlikely that this is even worth the added CI cost.
The same argument could be made for microbenchmarks, which can be seen as unit tests that assert on performance. We know from experience that some performance bugs can quickly turn into unavailability bugs. Fundamentally, I don't see a difference.
That's a reasonable question. I suspect it's the primary reason why few places succeed in enforcing this check on PRs; that and a lack of discipline to address both correctness and performance in the same PR. Looking at similar microbenchmarks in Go (e.g., |
I ran the comparison between subsequent commits from 9th Oct to 9th Nov. These are some of the potential PRs that could have been caught by the job: #134325
We're about to go through the codebase and fix many minor regressions as part of our perf efforts. Not forcing even 20% regressions to be reckoned with in the moment is incompatible with that effort. I'm a strong +1 to adding something in the critical path of the development workflow that at least requires a manual override before a merge.

Reading between the lines in this thread, I get a sense that folks are worried that this is just going to be noisy and unhelpful. As is, that might be true. But I believe that, done right, this check does not have to be "yet another annoying thing" in your day-to-day work. Instead, it could be a valuable tool that we love to have around, one that's helpful and high-signal and one that automatically does for us twelve steps that we'd otherwise have to trudge through the next day.

I've been running microbenchmarks frequently recently and I don't think this needs to take 31 minutes, unless there is some insurmountable problem where getting the build done already takes 20. I'm happy to work with test-eng to look into the benchmark selection and ways to squeeze out more signal.

My vote would be making the check faster and the ergonomics better first, and then making it mandatory. For example, I spot a few PRs in this list that I'm pretty sure had exactly no impact on performance. But I definitely am not going to check, because that would be quite annoying to actually carry out - an ergonomics issue. It needs to be so pleasant that you'll want to take a look.
It's quite expensive to go back after the fact, revisit code that was merged before, and run benchmarks. Saying that we just "don't care" about perf regressions during CI also communicates, well, that it's just not that important. I'd posit that regressions are often small fixes before the merge - or at least can be accepted consciously - but after the merge they amount to an annoying and expensive context switch that in practice we don't very proactively do.

I agree with previous comments that the regression check must not be permitted to increase overall CI length. I would simply call this a requirement.

If I understand correctly, currently (after digging through a stack trace) this is how the job fails if it catches a regression:
I'd really want the

By the way, I see that the current check makes a new run only for the commit to be tested, and pulls the old numbers off S3. I think that is a mistake - it introduces correlated bias into the results that (I think) significantly reduces the chances of keeping noise low. Instead, we should stash the binaries on S3 and run both the old and new code on the same machine for a direct comparison, at least for CPU time measurements. (Even reusing binaries can be problematic, since one binary may be more efficient than another simply due to some random compiler choices, but I think that is something we will have to live with.)
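As a sketch of that interleaved idea (not what the CI job currently does), a driver could fetch both prebuilt test binaries and alternate runs of each on the same machine, so that environmental drift affects old and new equally. The binary paths and benchmark name below are hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runOnce invokes a prebuilt Go test binary for a single benchmark run using
// the standard -test.* flags that compiled test binaries accept.
func runOnce(bin, benchRegexp string) (string, error) {
	out, err := exec.Command(bin,
		"-test.run=^$",             // skip unit tests
		"-test.bench="+benchRegexp, // run only the selected benchmark
		"-test.benchmem",
		"-test.count=1",
	).CombinedOutput()
	return string(out), err
}

func main() {
	oldBin := "./old.test" // hypothetical path: binary built at the merge base
	newBin := "./new.test" // hypothetical path: binary built at the PR's SHA
	const iterations = 5

	for i := 0; i < iterations; i++ {
		// Alternate old/new so that noise from the shared machine is spread
		// across both sides; the raw output can then be fed to a comparison tool.
		for _, bin := range []string{oldBin, newBin} {
			out, err := runOnce(bin, "BenchmarkExample")
			if err != nil {
				fmt.Println(bin, "failed:", err)
				continue
			}
			fmt.Print(out)
		}
	}
}
```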
I added a change to the script that removes the restriction of running the microbenchmark on one CPU. This has decreased the CI job from ~31 min to ~22 min.
We considered this while implementing it the first time, but it can further increase the CI job time, since I can't really run both benchmarks at the same time on the same machine. Therefore, this solution was designed with the assumption that the CI jobs run on the same machine types. I do agree that this can still have variance (as I have seen while running a single benchmark multiple times on the same machine without any changes); this was done to save CI time.
I think this can be done. We can sync up to see how this can be implemented; I'm happy to work on this if the engineers feel it will be helpful for debugging.
That's not true; the benchmark is really a family of 40 (!) microbenchmarks. We really shouldn't be running all of these.
That makes sense if you consider the benchmark selection sacred, but it very much isn't. The above set of benchmarks doesn't make much sense; it's quite repetitive. If we don't get the variance down, there's not much point to the whole exercise. I'd strongly suggest we consider narrowing down the set of benchmarks, and that we run both the before and after interleaved in the same environment (aka exactly what

We could curate a few of the

I also think down the road we can get even better at allocations, since we could entertain creating alloc profiles and stripping them of "one-off" allocations (i.e. anything not hit as often as we'd expect if it were on the critical path).

As for the UX, I'm quite partial to this output, which is created by
Re the CPU and alloc profiles, I would take them in a separate one-off run of the benchmark at the end (since profiling has overhead).
The thing that strikes me as a leap is going straight to deeming that it must be in the critical path when we haven't first tried to see if our problem is solved by it being on the development workflow path at all, such as in the form of a reliable post-merge notification of a detected regression.
Because a broken unit test harms the productivity of other engineers: if the code in some library doesn't compile or does not produce the correct results when called, I cannot build and test the behavior of features in my library and parts of the codebase if those parts can't compile their dependencies or are getting incorrect inputs from them. Breaking a core library could bring all development across all teams to a halt. Whereas if a library becomes x% slower for one night, that is far less likely to have a broad impact on productivity across the organization. My impression is that we believe we've had quiet, unnoticed perf regressions sneaking in, but I believe we also all agree that this is because they simply were not identified at the time / brought to the attention of the author of the change that introduced them.

I think getting the per-run time of this job down is essential no matter when we run it. If we want to run it per-merge, it also must be comparing to the merge base, not current HEAD -- otherwise every time someone merges an optimization, all other wholly unrelated PRs will start to flake as regressions. (Alternatively it could benchmark the new SHA as rebased on HEAD, but none of our CIs do that and it would likely be confusing to engineers to have one odd job doing that.)

But even if you could get the job down to, say, 20 minutes, it will still make the CI pipeline flakier, just because every job, and especially such a long-running one, can be preempted these days. So every job we elect to run on the critical path increases the chance of a flake on that path, even if the jobs themselves are 100% reliable, so we have to be sure that the benefit of doing so justifies the cost. In this case, it really seems like we can get the lion's share of that benefit from a post-merge check, can't we?
I've been convinced that a pre-merge microbenchmark job would be good to add, so long as it doesn't affect general productivity. I'll stay away from the cost discussions. One reason I came out swinging against this change is that my team doesn't really rely on microbenchmarks, so as Tobi has said, I really fear this change could bog down time to merge.

Furthermore, I think we already run some sort of bench job pre-merge in extended CI (which is optional). This job seems incredibly flaky and is often the bottleneck to extended CI runs. Since many engineers wait for extended CI to complete before merging, we already have a bench job slowing down time to merge! @tbg Are KV engineers using this extended CI bench job in their workflow? Since @sambhav-jain-16's work seems much more targeted, I'm hoping one outcome of it will be to sunset extended CI's bench job.
Let's table the discussion on whether this should be required. Personally, I don't want to require it, at least not in the "you can't merge until you've fixed it" kind of sense. But I do think we should run these in the critical path and make the result visible, i.e. you'd have to ack your way through a regression to merge it. (Plus, you get the nice feeling of having done a good job when you actually made things faster.) The important part is to do it well, so that it's a pleasure to engage with. I don't want to make this the selling point here, but I think putting the perf tooling into people's PRs will also be a real boost to the perf culture at CRL.

I'd argue that the async path is what we have tried so far. For example, here, every week. But it is fundamentally a slow path. It's like fixing a bug after the merge, except - as you point out - the bug doesn't "really hurt that much", and surely as an author you have other things to do, which in my estimate puts the chances of smaller issues being fixed at approximately zero percent. We can't run every benchmark in CI and expect that to be useful (looking at you, bench job), so we will need an "async path" for most benchmarks anyway. I am all for that, but that is not this.
I've never seen anyone use this job. I'm also not sure what it's there for: certainly it's not providing any value to the CI pipeline; you can't even see the results, and benchmarks are meaningless without a reference point. It maybe convinces us that each of the benchmarks actually compiles and can run without failing? We should just do that in nightlies, since it's rare to break one and we compile them already when running the regular tests. If they fail in nightly, they should just file issues. As a CI job, I'd remove this ASAP.
I think this is conflating the effect of two variables and has some element of correlation-vs-causation blurring: our path so far has indeed been async, but it has also been extremely non-specific. If I merge a regression, it is up to me to see a message from a slackbot that doesn't mention me or my change some days later, and then think to go check if that might have been me. The fact that that happens after I merge rather than before I merge is not, I think, the primary reason it isn't acted on, but rather the fact that it is so non-specific and thus unactionable. I'd argue that we haven't yet tried a path where we give the authors of regressions specific, actionable feedback about their changes, either before or after they merge those changes, so it isn't clear to me that we can conclude the async path is a dead end based on our lack of success so far. But all that said, if this new CI job can be made <10 mins, is strictly comparing to the merge base, and doesn't flake -- and we're ready to disable it as soon as any of those are observed not to hold -- then this isn't a hill I'm prepared to die on.
Doesn't this jeopardize the benchmark results in some way? Generally people choose to run benchmarks on a constant number of CPUs with limited parallelism to make the results more accurate. Did you do some sort of validation that the benchmark you're testing has not been affected by this change?
If you're going to change some fundamental aspect of how the benchmarks are run to make the runtime shorter, I would prefer to make that change in a separate PR; then you can watch it for ~a week to determine that we haven't made the job flakier before making it required in CI.
How did you produce the list? I see the first PR is mine, #134325. But this PR did not introduce any major performance regression (that I have been made aware of).
Yes, this was validated in the outputs I posted above; they were run without pinning the CPU and showed similar characteristics. I added that change in this PR itself to test it out, since it looks like I need to make some fundamental changes to the PR anyway, as some folks have concerns.
There are suggestions by @tbg on using a different benchmark altogether that he has asked me to explore. These will all be in separate PRs. I'm also going to wait for @srosenberg to comment on what he thinks about the suggestions pointed out above.
Were microbenchmarks run against the change to see if there's a regression?

Again, this change doesn't intend to capture a "major" regression, but smaller regressions that might not get caught in the nightlies and over time get accumulated, as mentioned here. The list was accumulated from the bucket that stores the output of the CI job itself and the same
I can't really disagree with you. I'd like to see something useful; if it's right after the PR in a Slack message - fine by me too. Useful comes first. I don't think what we have right now is useful yet.
Perhaps surprisingly, giving Go just one CPU is actually a bad idea, as far as I understand. These benchmarks run entire servers. Servers have async work, which, at one CPU, interrupts foreground work. So you want to have at least a few processors so that these benchmarks - which still only do sequential client requests - end up with predictable variance. Basically they measure the speed at which a single client can go against a system that has enough slack. This is useful. It's different from measuring how fast the system can go when maximally loaded, which is also useful but not what we're after here, as it's typically less stable and much more sensitive to the environment. Either way, running with one CPU is neither. It's "what if we ran all of CRDB on one CPU". We know that isn't great, and we don't care about it either.
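A toy benchmark can illustrate the point, assuming nothing about the actual CI microbenchmarks: a goroutine doing periodic background work stands in for a server's async tasks, and running with `-cpu=1,4` shows how, at one CPU, that work must interleave with the measured loop. Everything below (package name, helper, workload sizes) is made up for the sketch.

```go
package onecpu

import (
	"testing"
	"time"
)

// busyWork burns a little CPU, standing in for both foreground and
// background work in this toy example.
func busyWork(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += i * i
	}
	return s
}

// BenchmarkWithBackgroundWork measures a foreground loop while a ticker
// goroutine performs periodic "async" work. With GOMAXPROCS=1 the background
// work preempts the measured loop; with a few CPUs it runs alongside it.
// Run with: go test -bench=BenchmarkWithBackgroundWork -cpu=1,4
func BenchmarkWithBackgroundWork(b *testing.B) {
	stop := make(chan struct{})
	go func() {
		t := time.NewTicker(100 * time.Microsecond)
		defer t.Stop()
		for {
			select {
			case <-stop:
				return
			case <-t.C:
				busyWork(1000) // simulated background bookkeeping
			}
		}
	}()
	defer close(stop)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		busyWork(1000) // the foreground work being measured
	}
}
```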
Re this, I also have my doubts about this list. For example, #133298 certainly did not introduce any regression; there aren't even any production code changes in there unless requests are being traced. So I think this shows us that we need to do better than what we have as of this PR. We're really nowhere close to deciding where these checks should live - let's make them useful first and then work up a stronger conviction of what their fate should be. Here's the benchmarking output:
You're the one claiming this PR introduced a performance regression, so where is the data in support of this? |
This change aims to make `cockroach-microbench-ci` a required step. There is an addition of a label `performance-regression-expected` that can be used to skip the comparison in case a regression is expected.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-40772
Fixes: #106661
Release note: None