workflows: make cockroach-microbench-ci as required #135899
Conversation
Tested this with both the label and without the label.
Is the runtime of this new job longer than the current longest required jobs (unit tests and lint)?
Yes, the current runtime is ~31 min, which is longer than lint's ~26 min.
Any chance we could narrow the set of tests in the required job a bit and/or split it across jobs such that we aren't regressing on end-to-end time for CI?
If a patch changes the backupccl package, I don't see the need to run the microbenchmarks in the storage package. Before merging any sort of microbenchmark test as required in CI, can we limit the benchmarks run to the packages that were changed in the commit? I'm quite worried this change would lead to very flaky CI.
I'm also skeptical this would add more test coverage. Like, we already run these benchmarks once a night, right? If the owning team gets a failure the morning after they merged a problematic patch, they should be able to quickly bisect the regression. Also, this would be quite expensive? Is this really worth it?
I agree with the concerns about this change. We can't be making CI take any longer to complete. I am also concerned about introducing spurious CI failures.
The script uses a single microbenchmark.
There's no way to split it since it is a single microbenchmark.
Same as above: since it is a single microbenchmark, it can't be limited to the packages changed in the commit. As for the flakiness part, this has been addressed by analysing the variance of the subsequent runs, which were plotted here. There were also changes by @nvanbenschoten to reduce the variance (#131711). Apart from the above changes, I have analysed the past month of microbenchmark runs, and @srosenberg and I arrived at a threshold value of 20%.
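For illustration, here is a minimal Go sketch of the kind of threshold check being described: compare an old and a new ns/op measurement against a relative cutoff and fail when the slowdown exceeds it. The function names, standalone-program shape, and the hard-coded 20% constant are assumptions for the example, not the actual CI script.

```go
package main

import "fmt"

// regressionThreshold is the relative slowdown (new vs. old) above which the
// check is assumed to fail; 20% mirrors the threshold discussed above.
const regressionThreshold = 0.20

// checkRegression compares old and new ns/op values for one benchmark and
// returns an error when the slowdown exceeds the threshold.
func checkRegression(name string, oldNsPerOp, newNsPerOp float64) error {
	delta := (newNsPerOp - oldNsPerOp) / oldNsPerOp
	if delta > regressionThreshold {
		return fmt.Errorf("%s regressed by %.1f%% (old %.0f ns/op, new %.0f ns/op)",
			name, delta*100, oldNsPerOp, newNsPerOp)
	}
	return nil
}

func main() {
	// Hypothetical numbers, purely for illustration.
	if err := checkRegression("BenchmarkExample", 1000, 1300); err != nil {
		fmt.Println("regression detected:", err)
	}
}
```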
This can be addressed by reducing the count of the benchmark runs, but that will result in more variance in the benchmark output. @srosenberg, wdyt about this? Is that the trade-off we want to take for faster CI?
#106661 mentions how roachperf has not been able to detect small regressions that have been incrementally added by PRs; the CI job aims to address this by stopping the PR from getting merged in the first place.
Obviously, since
How many spurious failures have occurred since this change was made? Have any legitimate failures been observed?
I'm not quite following this. If small regressions are hard to notice due to being small, how does looking for them before vs. after merge change the scale of the regression we can reliably detect? It seems like if we're able to mechanically determine that one SHA is a regression compared to another SHA with high confidence -- such that we can run it on the PR's SHA vs. its merge base -- then we could also just do so once or twice a day, comparing the current HEAD SHA to the last one, and file an issue (with the delta commit log) if it has regressed. Does the comparison really need to be done for every PR?
I also don't know if I would consider 20% to be a "small" regression. It also seems like a regression of that magnitude would be clearly visible in the nightlies.
The idea of doing the comparison on every PR is that it forces the engineer to fix regressions in the PR itself, rather than waiting for the nightlies to run and then going back to analyse the code to detect and fix the issue. That makes the turnaround time higher; it is reduced by making the engineer fix the issue in the current PR itself.
Are there DB engineers asking for the benchmarks to be run on all PRs before merging?
Sounds like we need to migrate off roachperf and onto a time series database that can conduct regression monitoring automatically :D
I think I didn't explain fully before how we arrived at this number. The comparison was initially skipped in this PR: #129390. Later I ran the microbenchmark around 100 times on a single commit SHA and realised that there's a lot of variance in the benchmark output, whose plot I posted in my comment above (https://lookerstudio.google.com/reporting/d707c4cc-5b86-490f-891f-b023766bb89f/page/n1tCE). The variance ranged from 5% to even 10% on the same commit SHA. This led us to believe that we needed to increase the threshold to catch real regressions, so 20% might seem like a big number, but given the variance in microbenchmark runs, it is something we had to do.
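To make the variance argument concrete, here is a small Go sketch (not the analysis that was actually run) that computes the coefficient of variation of repeated ns/op samples taken on the same SHA; a CV of 5-10% is what motivates picking a threshold well above the noise floor. The sample values are made up.

```go
package main

import (
	"fmt"
	"math"
)

// coefficientOfVariation returns stddev/mean for a set of ns/op samples,
// i.e. the run-to-run noise of the benchmark expressed as a fraction.
func coefficientOfVariation(samples []float64) float64 {
	var sum float64
	for _, s := range samples {
		sum += s
	}
	mean := sum / float64(len(samples))

	var sq float64
	for _, s := range samples {
		sq += (s - mean) * (s - mean)
	}
	stddev := math.Sqrt(sq / float64(len(samples)))
	return stddev / mean
}

func main() {
	// Hypothetical ns/op values from repeated runs of one benchmark on the
	// same commit SHA.
	samples := []float64{1000, 1050, 980, 1100, 1020, 1075}
	fmt.Printf("cv = %.1f%%\n", coefficientOfVariation(samples)*100)
}
```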
I will get the data tomorrow IST and post it here.
I think so; one ask is the issue that is linked to this ticket, and another is a Slack thread here: https://cockroachlabs.slack.com/archives/C0366GQV668/p1723134389600209
Do we need to force engineers to do this? Is there evidence that, if we present engineers with reliable information that they just merged a significant regression, they won't act quickly to fix/revert/etc.?
While that's true, it isn't clear that that is too much turnaround time, or that cutting it down is worth the cost (either in $ or, more importantly, in productivity due to a slower CI pipeline). That said, if we really wanted something quicker, I think we still have options that aren't to be in the blocking path of every PR. For example, I believe we already have a CI that runs every time master's SHA has changed ("merge to master" or "publish bleeding edge" or something? dev-inf/rel-eng would know the details), in which the list of candidate changes compared to the last run would be relatively small (just what was in the bors batch).
I guess I would rephrase the above differently. The operating assumption is that the specifically chosen microbenchmarks are fairly stable, and that the asserted performance difference denotes a sizable regression with high probability. If that's the case, then how is this different from requiring that the unit tests pass? Catching a regression after an offending PR is merged translates to doing more (infra) work. When would you not want to block a PR, after learning that it contains a sizable performance regression?
One difference might be that unit tests fail regularly and are frequently broken by routine PRs. How often do single-PR performance regressions to the tune of 20% in a single microbenchmark occur? Once per month, week, year? Setting aside developer experience concerns (e.g. are there false positives, etc.), if this error case does not frequently occur, it is very unlikely that this is even worth the added CI cost.
The same argument could be made for microbenchmarks, which can be seen as unit tests that assert on performance. We know from experience that some performance bugs can quickly turn into unavailability bugs. Fundamentally, I don't see a difference.
That's a reasonable question. I suspect it's the primary reason why few places succeed in enforcing this check on PRs; that and a lack of discipline to address both correctness and performance in the same PR. Looking at similar microbenchmarks in Go (e.g., |
I ran the comparison between subsequent commits from 9th Oct to 9th Nov. These are some of the potential PRs that could have been caught by the job: #134325
We're about to go through the codebase and fix many minor regressions as part of our perf efforts. Not forcing even 20% regressions to be reckoned with in the moment is incompatible with that effort. I'm a strong +1 to adding something in the critical path of the development workflow that at least requires a manual override before a merge.

Reading between the lines in this thread, I get a sense that folks are worried that this is just going to be noisy and unhelpful. As is, that might be true. But I believe that, done right, this check does not have to be "yet another annoying thing" in your day-to-day work. Instead, it could be a valuable tool that we love to have around, one that's helpful and high-signal and one that automatically does for us twelve steps that we'd otherwise have to trudge through the next day.

I've been running microbenchmarks frequently recently and I don't think this needs to take 31 minutes, unless there is some insurmountable problem where getting the build done already takes 20. I'm happy to work with test-eng to look into the benchmark selection and ways to squeeze out more signal.

My vote would be making the check faster and the ergonomics better first, and then making it mandatory. For example, I spot a few PRs in this list that I'm pretty sure had exactly no impact on performance. But I definitely am not going to check, because that would be quite annoying to actually carry out - an ergonomics issue. It needs to be so pleasant that you'll want to take a look.
It's quite expensive to go back after the fact, revisit code that was merged before, and run benchmarks. Saying that we just "don't care" about perf regressions during CI also communicates, well, that it's just not that important. I'd posit that regressions are often small fixes before the merge - or at least can be accepted consciously - but after the merge they amount to an annoying and expensive context switch that in practice we don't very proactively do.

I agree with previous comments that the regression check must not be permitted to increase overall CI length. I would simply call this a requirement.

If I understand correctly, currently (after digging through a stack trace) this is how the job fails if it catches a regression:
I'd really want the

By the way, I see that the current check makes a new run only for the commit to be tested, and pulls the old numbers off S3. I think that is a mistake - it introduces correlated bias into the results that (I think) significantly reduces the chances of keeping noise low. Instead, we should stash the binaries on S3 and run both the old and new code on the same machine for a direct comparison, at least for CPU time measurements. (Even reusing binaries can be problematic, since one binary may be more efficient than another simply due to some random compiler choices, but I think that is something we will have to live with.)
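As a sketch of that interleaved idea (not what the CI job currently does), a driver could fetch both prebuilt test binaries and alternate runs of each on the same machine, so that environmental drift affects old and new equally. The binary paths and benchmark name below are hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runOnce invokes a prebuilt Go test binary for a single benchmark run using
// the standard -test.* flags that compiled test binaries accept.
func runOnce(bin, benchRegexp string) (string, error) {
	out, err := exec.Command(bin,
		"-test.run=^$",             // skip unit tests
		"-test.bench="+benchRegexp, // run only the selected benchmark
		"-test.benchmem",
		"-test.count=1",
	).CombinedOutput()
	return string(out), err
}

func main() {
	oldBin := "./old.test" // hypothetical path: binary built at the merge base
	newBin := "./new.test" // hypothetical path: binary built at the PR's SHA
	const iterations = 5

	for i := 0; i < iterations; i++ {
		// Alternate old/new so that noise from the shared machine is spread
		// across both sides; the raw output can then be fed to a comparison tool.
		for _, bin := range []string{oldBin, newBin} {
			out, err := runOnce(bin, "BenchmarkExample")
			if err != nil {
				fmt.Println(bin, "failed:", err)
				continue
			}
			fmt.Print(out)
		}
	}
}
```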
I added a change to the script that removes the restriction of running the microbenchmark on one CPU. This has decreased the CI job from ~31 min to ~22 min.
We considered this while implementing it the first time, but it can further increase the CI job time, since I can't really run both benchmarks at the same time on the same machine. Therefore, this solution was designed with the assumption that the CI jobs run on the same machine types. I do agree that this can still have variance (as I have seen while running a single benchmark multiple times on the same machine without any changes); this was done to save CI time.
I think this can be done. We can sync up to see how this can be implemented; I'm happy to work on this if the engineers feel it will be helpful for debugging.
That's not true; the benchmark is really a family of 40 (!) microbenchmarks. We really shouldn't be running all of these.
That makes sense if you consider the benchmark selection sacred, but it very much isn't. The above set of benchmarks doesn't make much sense; it's quite repetitive. If we don't get the variance down, there's not much point to the whole exercise. I'd strongly suggest we consider narrowing down the set of benchmarks, and that we run both the before and after interleaved in the same environment (aka exactly what

We could curate a few of the

I also think down the road we can get even better at allocations, since we could entertain creating alloc profiles and stripping them of "one-off" allocations (i.e. anything not hit as often as we'd expect if it were on the critical path).

As for the UX, I'm quite partial to this output, which is created by
Re the CPU and alloc profiles, I would take them in a separate one-off run of the benchmark at the end (since profiling has overhead).
The thing that strikes me as a leap is going straight to deeming that it must be in the critical path when we haven't first tried to see if our problem is solved by it being on the development workflow path at all, such as in the form of a reliable post-merge notification of a detected regression.
Because a broken unit test harms the productivity of other engineers: if the code in some library doesn't compile or does not produce the correct results when called, I cannot build and test the behavior of features in my library and parts of the codebase if those parts can't compile their dependencies or are getting incorrect inputs from them. Breaking a core library could bring all development across all teams to a halt. Whereas if a library becomes x% slower for one night, that is far less likely to have a broad impact on productivity across the organization. My impression is that we believe we've had quiet, unnoticed perf regressions sneaking in, but I believe we also all agree that this is because they simply were not identified at the time / brought to the attention of the author of the change that introduced them.

I think getting the per-run time of this job down is essential no matter when we run it. If we want to run it per-merge, it also must be comparing to the merge base, not current HEAD -- otherwise every time someone merges an optimization, all other wholly unrelated PRs will start to flake as regressions. (Alternatively it could benchmark the new SHA as rebased on HEAD, but none of our CIs do that and it would likely be confusing to engineers to have one odd job doing that.)

But even if you could get the job down to, say, 20 minutes, it will still make the CI pipeline flakier, just because every job, and especially such a long-running one, can be preempted these days. So every job we elect to run on the critical path increases the chance of a flake on that path, even if the jobs themselves are 100% reliable, so we have to be sure that the benefit of doing so justifies the cost. In this case, it really seems like we can get the lion's share of that benefit from a post-merge check, can't we?
I've been convinced that a pre-merge microbenchmark job would be good to add, so long as it doesn't affect general productivity. I'll stay away from the cost discussions. One reason I came out swinging against this change is that my team doesn't really rely on microbenchmarks, so as Tobi has said, I really fear this change could bog down time to merge.

Furthermore, I think we already run some sort of bench job pre-merge in extended CI (which is optional). This job seems incredibly flaky and is often the bottleneck to extended CI runs. Since many engineers wait for extended CI to complete before merging, we already have a bench job slowing down time to merge! @tbg Are KV engineers using this extended CI bench job in their workflow? Since @sambhav-jain-16's work seems much more targeted, I'm hoping one outcome of it will be to sunset extended CI's bench job.
Let's table the discussion on whether this should be required. Personally, I don't want to require it, at least not in the "you can't merge until you've fixed it" kind of sense. But I do think we should run these in the critical path and make the result visible, i.e. you'd have to ack your way through a regression to merge it. (Plus, you get the nice feeling of having done a good job when you actually made things faster.) The important part is to do it well, so that it's a pleasure to engage with. I don't want to make this the selling point here, but I think putting the perf tooling into people's PRs will also be a real boost to the perf culture at CRL.

I'd argue that the async path is what we have tried so far. For example, here, every week. But it is fundamentally a slow path. It's like fixing a bug after the merge, except - as you point out - the bug doesn't "really hurt that much", and surely as an author you have other things to do, which in my estimate puts the chances of smaller issues being fixed at approximately zero percent. We can't run every benchmark in CI and expect that to be useful (looking at you, bench job), so we will need an "async path" for most benchmarks anyway. I am all for that, but that is not this.
I've never seen anyone use this job. I'm also not sure what it's there for: certainly it's not providing any value to the CI pipeline; you can't even see the results, and benchmarks are meaningless without a reference point. It maybe convinces us that each of the benchmarks actually compiles and can run without failing? We should just do that in nightlies, since it's rare to break one and we compile them already when running the regular tests. If they fail in nightly, they should just file issues. As a CI job, I'd remove this ASAP.
I think this is conflating the effect of two variables and has some element of correlation-vs-causation blurring: our path so far has indeed been async, but it has also been extremely non-specific. If I merge a regression, it is up to me to see a message from a slackbot that doesn't mention me or my change some days later, and then think to go check if that might have been me. The fact that that happens after I merge rather than before I merge is not, I think, the primary reason it isn't acted on, but rather the fact that it is so non-specific and thus unactionable. I'd argue that we haven't yet tried a path where we give the authors of regressions specific, actionable feedback about their changes, either before or after they merge those changes, so it isn't clear to me that we can conclude the async path is a dead end based on our lack of success so far. But all that said, if this new CI job can be made <10 mins, is strictly comparing to the merge base, and doesn't flake -- and we're ready to disable it as soon as any of those are observed not to hold -- then this isn't a hill I'm prepared to die on.
Doesn't this jeopardize the benchmark results in some way? Generally people choose to run benchmarks on a constant number of CPUs with limited parallelism to make the results more accurate. Did you do some sort of validation that the benchmark you're testing has not been affected by this change?
If you're going to change some fundamental aspect of how the benchmarks are run to make the runtime shorter, I would prefer to make that change in a separate PR; then you can watch it for ~a week to determine that we haven't made the job flakier before making it required in CI.
How did you produce the list? I see the first PR is mine, #134325. But this PR did not introduce any major performance regression (that I have been made aware of).
Yes, this was validated in the outputs I posted above; they were run without pinning the CPU and showed similar characteristics. I added that change in this PR itself to test it out, since it looks like I need to make some fundamental changes to the PR anyway, as some folks have concerns.
There are suggestions by @tbg on using a different benchmark altogether that he has asked me to explore. These will all be in separate PRs. I'm also going to wait for @srosenberg to comment on what he thinks about the suggestions pointed out above.
Were microbenchmarks run against the change to see if there's a regression?

Again, this change doesn't intend to capture a "major" regression, but smaller regressions that might not get caught in the nightlies and over time get accumulated, as mentioned here. The list was accumulated from the bucket that stores the output of the CI job itself and the same
I can't really disagree with you. I'd like to see something useful; if it's right after the PR in a Slack message - fine by me too. Useful comes first. I don't think what we have right now is useful yet.
Perhaps surprisingly, giving Go just one CPU is actually a bad idea, as far as I understand. These benchmarks run entire servers. Servers have async work, which, at one CPU, interrupts foreground work. So you want to have at least a few processors so that these benchmarks - which still only do sequential client requests - end up with predictable variance. Basically they measure the speed at which a single client can go against a system that has enough slack. This is useful. It's different from measuring how fast the system can go when maximally loaded, which is also useful but not what we're after here, as it's typically less stable and much more sensitive to the environment. Either way, running with one CPU is neither. It's "what if we ran all of CRDB on one CPU". We know that isn't great, and we don't care about it either.
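A toy benchmark can illustrate the point, assuming nothing about the actual CI microbenchmarks: a goroutine doing periodic background work stands in for a server's async tasks, and running with `-cpu=1,4` shows how, at one CPU, that work must interleave with the measured loop. Everything below (package name, helper, workload sizes) is made up for the sketch.

```go
package onecpu

import (
	"testing"
	"time"
)

// busyWork burns a little CPU, standing in for both foreground and
// background work in this toy example.
func busyWork(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += i * i
	}
	return s
}

// BenchmarkWithBackgroundWork measures a foreground loop while a ticker
// goroutine performs periodic "async" work. With GOMAXPROCS=1 the background
// work preempts the measured loop; with a few CPUs it runs alongside it.
// Run with: go test -bench=BenchmarkWithBackgroundWork -cpu=1,4
func BenchmarkWithBackgroundWork(b *testing.B) {
	stop := make(chan struct{})
	go func() {
		t := time.NewTicker(100 * time.Microsecond)
		defer t.Stop()
		for {
			select {
			case <-stop:
				return
			case <-t.C:
				busyWork(1000) // simulated background bookkeeping
			}
		}
	}()
	defer close(stop)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		busyWork(1000) // the foreground work being measured
	}
}
```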
Re this, I also have my doubts about this list. For example, #133298 certainly did not introduce any regression; there aren't even any production code changes in there unless requests are being traced. So I think this shows us that we need to do better than what we have as of this PR. We're really nowhere close to deciding where these checks should live - let's make them useful first and then work up a stronger conviction of what their fate should be. Here's the benchmarking output:
You're the one claiming this PR introduced a performance regression, so where is the data in support of this? |
This change aims to make `cockroach-microbench-ci` a required step. There is an addition of a label `performance-regression-expected` that can be used to skip the comparison in case a regression is expected.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-40772
Fixes: #106661
Release note: None