Add proof benchmarking in CI pipeline #651

Open
Nashtare opened this issue Sep 23, 2024 · 14 comments · May be fixed by #701

@Nashtare
Collaborator

We've lately noticed a performance hit due to a newly added feature, which highlights a more general issue: we have no performance-related regression checks. Proof generation timings naturally vary somewhat from run to run, but noticeable regressions could still be easily caught.

We should set up some CI job for this.
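
Something along these lines could work as a first cut; this is only a sketch, and the baseline file, witness input, and threshold below are hypothetical placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical CI step: time a fixed proving run and fail if it regresses
# beyond a tolerance compared to a stored baseline (all names are placeholders).
set -euo pipefail

BASELINE_FILE="ci/proving_baseline_seconds.txt"   # assumed to hold an integer number of seconds
THRESHOLD_PCT=10                                  # arbitrary regression tolerance

start=$(date +%s)
# Illustrative invocation: prove a fixed, representative block; the flags and
# witness.json input are stand-ins for whatever the real benchmark uses.
./target/release/leader --runtime in-memory --load-strategy monolithic \
    --block-batch-size 8 --proof-output-dir ./proofs stdio < witness.json
end=$(date +%s)

elapsed=$((end - start))
baseline=$(cat "$BASELINE_FILE")
limit=$((baseline * (100 + THRESHOLD_PCT) / 100))

echo "elapsed=${elapsed}s baseline=${baseline}s limit=${limit}s"
if [ "$elapsed" -gt "$limit" ]; then
    echo "Proving time regressed by more than ${THRESHOLD_PCT}%" >&2
    exit 1
fi
```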

@0xaatif
Contributor

0xaatif commented Oct 3, 2024

I'd love to see durable timings so that I can get a historical view.

Bonus points for repeated runs every night so I can get some error bars.

GH CI is the worst case for getting these metrics - noisy shared machines are going to get us wide error margins :D

@atanmarko
Member

atanmarko commented Oct 3, 2024

@0xaatif Yeah, that is a good idea; I was already thinking of adding a dedicated benchmark workflow scheduled every night. The current implementation is a quick check to make regressions obvious during PR development (we will see about the error margin and tweak accordingly); the reason I implemented it here is that it was easy and does not add any overhead.

Now, how do we keep historical workflow data in the CI?
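
One lightweight option (just a sketch; the results file and the bench-history branch below are made up) would be to append one JSON record per run and either commit it to a dedicated branch or upload it as a workflow artifact:

```bash
# Hypothetical post-benchmark step: append one JSON line per run so the
# history can be plotted later. $elapsed would come from the timing step.
RESULTS_FILE="bench-history/proving_times.jsonl"

printf '{"commit":"%s","date":"%s","elapsed_s":%d}\n' \
    "$GITHUB_SHA" "$(date -u +%FT%TZ)" "$elapsed" >> "$RESULTS_FILE"

# Either push the file to a dedicated branch...
git add "$RESULTS_FILE"
git commit -m "bench: record proving time for $GITHUB_SHA"
git push origin HEAD:bench-history
# ...or upload it as a workflow artifact instead and fetch past runs when needed.
```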

@Nashtare
Collaborator Author

Nashtare commented Oct 3, 2024

I think we need a stronger approach than the current one. A few points:

  • @0xaatif mentioned the variability of shared instances; we'd need our own runner
  • the current parameters (in feat: add ci proving time benchmark #694) are noisy (we need reproducible runs with as little room for noise as possible; this includes for instance keeping the prover state fully in memory to avoid any disk I/O delays, ...; see the sketch after this list)
  • relevant test case: the 2 blocks used in the script are very special edge cases (one with nearly empty blocks, one with a single block). The initial request for such a feature came from a perf regression observed over batches larger than the block pool size, for instance.
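
For the noise point, the usual knobs on a dedicated Linux runner would be pinning the prover to fixed cores and disabling frequency scaling; a rough sketch (the paths are Linux- and intel_pstate-specific and may differ per machine):

```bash
# Reduce run-to-run noise on a dedicated Linux benchmark machine (sketch, not exhaustive).

# Disable turbo boost (intel_pstate systems; the knob differs on AMD/acpi-cpufreq).
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Keep the clock at a fixed frequency with the performance governor.
sudo cpupower frequency-set -g performance

# Pin the prover to a fixed set of cores to avoid scheduler-induced variance.
taskset -c 0-15 ./target/release/leader --runtime in-memory --load-strategy monolithic ...
```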

I wouldn't know about historical workflow data, but IIRC in the past winterfell did have some benchmarking job in the CI. Maybe @bobbinth would know if there were some issues with it

@atanmarko
Member

atanmarko commented Oct 3, 2024

@Nashtare Ok, then I'll close this PR and make a dedicated benchmark workflow.

  • I suppose it should only run at a scheduled time (night EU/US) so as not to interfere with the other workflows. Maybe we could also run it on PR merge.
  • Dedicated machine (how different should it be from the current zero-ci machine)?
  • What data should be collected in the benchmark?
  • What are the representative blocks that should be used in the benchmark?
  • How should the data be kept? Would you just go through the GitHub Actions history for that workflow, or should we upload the data to some database?

@4l0n50
Contributor

4l0n50 commented Oct 3, 2024

What if we used a smaller security parameter? (@sai-deng mentioned this in another context.) It would be cheaper, and at least at first sight it would be weird to get a performance regression for security parameter s but not for s' < s.

@Nashtare
Collaborator Author

Nashtare commented Oct 3, 2024

Fair point, though we'd have trouble setting a meaningful threshold.
Realistically speaking, if we had some regression on the pure proving side (STARK or recursion) that incurred a 7% overhead in total, bringing the security level down from 100 bits to 30 could shrink this overhead to 2% (which is close to the noise threshold).

@sai-deng
Member

sai-deng commented Oct 3, 2024

I think we should use production security parameters for benchmarking. Additionally, using cycle counts instead of CPU time may provide more stable and consistent results.

@Nashtare
Collaborator Author

Nashtare commented Oct 3, 2024

> Additionally, using cycle counts instead of CPU time may provide more stable and consistent results.

Did you mean our zkCPU cycle count? This is a good metric but isn't sufficient. For instance, continuations brought a huge slowdown in effective CPU frequency because of memory flashing, but this wouldn't appear with just the cycle count. Also, none of the changes to the circuitry would be reflected.

@sai-deng
Member

sai-deng commented Oct 3, 2024

I mean the CPU cycle count on the benchmark machine.

@atanmarko
Member

atanmarko commented Oct 3, 2024

@sai-deng Cycle counts for a one- or two-hour proving job would be too big to practically handle and would not be easy to compare, no? I mean, if a job takes 75 minutes and after a change it takes 80 minutes, you get some idea, but if the second job is 12345778910111213 cycles slower you have no idea whether that is significant or not.

@sai-deng
Member

sai-deng commented Oct 3, 2024

I agree that handling large cycle counts can be tricky; we can record both CPU time and cycle counts, focusing on percentage changes rather than absolute values. I've also used perf stat in previous projects for jobs lasting an hour without issues.
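
For example, the comparison itself can be done on relative deltas rather than raw counts (numbers below are purely illustrative):

```bash
# Compare two cycle counts (or wall-clock times) as a percentage change.
old=12000000000000      # baseline cycles (illustrative)
new=12600000000000      # current run (illustrative)

awk -v old="$old" -v new="$new" \
    'BEGIN { printf "change: %+.2f%%\n", (new - old) / old * 100 }'
# -> change: +5.00%
```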

@atanmarko atanmarko linked a pull request Oct 4, 2024 that will close this issue
@atanmarko
Member

@sai-deng Seems `perf stat -e cycles` is not supported on the CI machine:

Performance counter stats for '/home/runner/work/zk_evm/zk_evm/target/release/leader --runtime in-memory --load-strategy monolithic --block-batch-size 8 --proof-output-dir /home/runner/work/zk_evm/zk_evm/proofs stdio':

   <not supported>      cycles                                                                

     280.332879270 seconds time elapsed

    6285.283435000 seconds user
     276.485728000 seconds sys

@sai-deng
Member

sai-deng commented Oct 4, 2024

Yes, we may need a self-hosted runner for cycle count.
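
On a bare-metal self-hosted runner where the PMU is exposed, something like the following should work (GitHub-hosted runners are virtualized, which is presumably why cycles shows up as <not supported> above); witness.json is a stand-in for the actual input:

```bash
# Check that hardware perf counters are accessible (needs a permissive
# perf_event_paranoid level, CAP_PERFMON, or root).
cat /proc/sys/kernel/perf_event_paranoid

# Collect cycles, instructions and task-clock for the proving run.
perf stat -e cycles,instructions,task-clock -- \
    ./target/release/leader --runtime in-memory --load-strategy monolithic \
    --block-batch-size 8 --proof-output-dir ./proofs stdio < witness.json
```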

@bobbinth

bobbinth commented Oct 4, 2024

> IIRC in the past winterfell did have some benchmarking job in the CI. Maybe @bobbinth would know if there were some issues with it

Yes, Winterfell had it in the past - but I think our approach was relatively simplistic and it added a lot of overhead for relatively little benefit. So, IIRC, rather than investing time to make it more sophisticated, we turned it off in favor of manual benchmarking.
