Performance regression on 20240802 for luindex #1200
It is interesting that it affects luindex so much, but not other benchmarks. |
Here are the eBPF traces of a nursery GC in the version right before #1182 was merged (I'll call it "no-assert" from now on) and the version right after it was merged ("yes-assert").

Version "no-assert": [Perfetto trace]
Version "yes-assert": [Perfetto trace]

The PR #1182 changes the way we generate "PrepareMutator" work packets.

```diff
-for mutator in <C::VM as VMBinding>::VMActivePlan::mutators() {
-    mmtk.scheduler.work_buckets[WorkBucketStage::Prepare]
-        .add(PrepareMutator::<C::VM>::new(mutator));
-}
+let prepare_mutator_packets = <C::VM as VMBinding>::VMActivePlan::mutators()
+    .map(|mutator| Box::new(PrepareMutator::<C::VM>::new(mutator)) as _)
+    .collect::<Vec<_>>();
+// Just in case the VM binding is inconsistent about the number of mutators and the actual mutator list.
+debug_assert_eq!(
+    prepare_mutator_packets.len(),
+    <C::VM as VMBinding>::VMActivePlan::number_of_mutators()
+);
+mmtk.scheduler.work_buckets[WorkBucketStage::Prepare].bulk_add(prepare_mutator_packets);
```

This change has two effects:

1. It adds a debug assertion that the number of mutators returned by mutators() matches number_of_mutators().
2. It adds all the "PrepareMutator" work packets to the "Prepare" bucket in a single bulk_add call instead of adding them one by one.
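To make the difference concrete, here is a minimal sketch of the two call patterns. The Bucket type, its fields, and its notification policy are assumptions made up for this illustration; this is not mmtk-core's actual WorkBucket implementation.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

/// Illustrative work bucket; not mmtk-core's actual WorkBucket type.
struct Bucket<P> {
    queue: Mutex<VecDeque<P>>,
    available: Condvar,
}

impl<P> Bucket<P> {
    /// Adding packets one at a time issues one wake-up per packet,
    /// so up to N parked workers get a chance to run.
    fn add(&self, packet: P) {
        self.queue.lock().unwrap().push_back(packet);
        self.available.notify_one();
    }

    /// Adding packets in bulk pushes the whole batch first and issues a
    /// single wake-up, so the first worker to wake up can drain many
    /// packets before the other workers are scheduled by the OS.
    fn bulk_add(&self, packets: Vec<P>) {
        self.queue.lock().unwrap().extend(packets);
        self.available.notify_one();
    }
}
```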
But part (2) has one side effect: it wakes fewer GC workers, and those workers grab many work packets from the buckets. See the Perfetto plot above. The reason is that in our current GC worker implementation, a worker steals one batch at a time (in …).

And one work packet in the "Prepare" stage is very unbalanced: "ScanCodeCacheRoots". It scanned 26845 code cache root slots and dominated the "Prepare" stage. (FYI, a derived "GenNurseryProcessEdges" work packet contained those 26845 slots from the code cache roots and dominated the "Closure" stage.) Before PR #1182, that work packet was usually picked up by a separate GC worker which had just done the …

In fact, it affects other plans (including SemiSpace) too, but it may not be as obvious because a full-heap GC spends more time in the closure stage, so the relative overhead is smaller. See the plot below.

The recently merged PR mmtk/mmtk-openjdk#282 makes code cache roots generational, and allows code cache roots to be split into multiple "ProcessEdgesWork" work packets, each having at most 4096 elements. Nursery GCs usually have zero newly added code cache roots, so the cost is completely eliminated. See the plot below. For full-heap GCs, it greatly improves the load balance of the "Closure" stage, but it does not help reduce the latency of the "Prepare" stage. See the plot below. And the reason is that …

In conclusion, the root causes are:

1. The "ScanCodeCacheRoots" work packet is much larger than the other root-scanning packets, so it dominates the "Prepare" stage on whichever worker picks it up.
2. After #1182, the "PrepareMutator" packets are added in bulk, so fewer GC workers are woken up, and the workers that wake up late find nothing left to steal.
Both of them contribute to the load imbalance of the "Prepare" stage.

eBPF traces

Here are some eBPF trace logs for those who are interested. You can load them into the Perfetto UI. In the following logs, the plan is GenImmix unless explicitly labelled. They record one out of every 10 GCs.
And the master branch, but recording every single GC.

About the added assertion

The assertion has no impact on the nursery GCs we observed. Those nursery GCs usually don't generate any … |
If I understand you right, our GC workers always steal a batch of packets, and when there aren't many packets, one worker may steal all the packets. The expected behavior would be that the other workers starve, steal packets from that worker, and eventually balance the load. Does that stealing actually happen? Is it costly to steal? |
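For context, here is a minimal sketch of batch stealing with the crossbeam-deque crate. I'm assuming the worker queues behave roughly like this; treat it as an illustration of steal_batch_and_pop, not a description of mmtk-core's exact scheduler code.

```rust
use crossbeam_deque::{Steal, Worker};

fn main() {
    // Two worker-local queues; `thief` will steal from `victim`.
    let victim: Worker<u32> = Worker::new_fifo();
    let thief: Worker<u32> = Worker::new_fifo();

    for i in 0..8 {
        victim.push(i);
    }

    // A steal moves a whole batch (up to half of the victim's queue, capped)
    // into the thief's queue and pops one item to run immediately.
    let stealer = victim.stealer();
    match stealer.steal_batch_and_pop(&thief) {
        Steal::Success(first) => {
            println!("stole a batch; running packet {first} first");
            println!("victim now holds {} packets", victim.len());
            println!("thief now holds {} packets", thief.len());
        }
        Steal::Empty => println!("nothing to steal"),
        Steal::Retry => println!("lost a race; retry"),
    }
}
```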
We may be able to use a prioritized queue for … |
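If "prioritized" means ordering packets by estimated cost, a minimal sketch could look like this. All names are hypothetical (mmtk-core has no such type), and the up-front cost estimate, e.g. the number of slots a packet will scan, is an assumption.

```rust
use std::collections::BinaryHeap;

/// Hypothetical prioritized bucket: packets with the largest estimated cost
/// are dequeued first, so a dominant packet (like one scanning tens of
/// thousands of code cache root slots) is picked up before the tiny ones.
struct PrioritizedBucket {
    // (estimated_cost, packet name); BinaryHeap is a max-heap, so the
    // largest cost pops first.
    heap: BinaryHeap<(usize, String)>,
}

impl PrioritizedBucket {
    fn new() -> Self {
        Self { heap: BinaryHeap::new() }
    }

    fn add(&mut self, estimated_cost: usize, name: &str) {
        self.heap.push((estimated_cost, name.to_string()));
    }

    fn poll(&mut self) -> Option<(usize, String)> {
        self.heap.pop()
    }
}
```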
In theory, they should be able to steal from other workers. But in practice, it seems to take longer for a thread to wake up due to OS scheduling. The "Prepare" work packet adds the "PrepareCollector" work packets, which are "designated work", so all workers should wake up and find a piece of "designated work" to do. However, some workers wake up earlier and some wake up later. The late workers (the ones that only got a "PrepareCollector" to execute) wake up so late that other workers have already finished executing the other work packets, which are usually small root-scanning packets that can be executed very quickly. That leaves them nothing to steal. From the eBPF plot, the latency of some … |
FYI, we discussed the OS scheduling problem in a previous issue: #1160 |
There seem to be several different issues:
|
I took another look at the regression on 2024-08-02.

Looking at the original log again

The raw execution logs from 2024-08-02 and the one immediately before that can be found in …
The luindex benchmark is from DaCapo 2006, with 6x min heap size (22MB * 6 = 132MB), executed with 20 invocations, with 2 iterations each. It does not use probes, so it only shows the total execution time of the last iteration, in the form of "===== DaCapo luindex PASSED in xxxxxx msec =====". From the log, we can find the mean execution time …
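For reference, here is a small sketch of how such a mean can be extracted from a raw log. The file name is a placeholder, and it only assumes the "===== DaCapo luindex PASSED in xxxxxx msec =====" line format quoted above.

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Placeholder path; point this at one of the raw CI log files.
    let log = fs::read_to_string("luindex.log")?;

    // Collect one time per invocation from lines like
    // "===== DaCapo luindex PASSED in xxxxxx msec ====="
    let times: Vec<f64> = log
        .lines()
        .filter_map(|line| {
            let rest = line.split("DaCapo luindex PASSED in").nth(1)?;
            rest.split_whitespace().next()?.parse::<f64>().ok()
        })
        .collect();

    if times.is_empty() {
        println!("no PASSED lines found");
    } else {
        let mean = times.iter().sum::<f64>() / times.len() as f64;
        println!("{} invocations, mean = {:.1} msec", times.len(), mean);
    }
    Ok(())
}
```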
The change of mean execution time is +8.19%.

Re-running on groundhog.moma

I re-ran luindex on groundhog.moma (the same CPU as hammer.moma and wrench.moma). It used GenImmix, a 132MB heap size, 20 invocations, and 2 iterations each, just like on hammer and wrench. "build3" is the commit for #1182, "build2" and "build1" are the two commits before that, and "build4" is the commit after that. If "build3" caused the regression, we should see the total time of build3 and build4 being similar, but obviously higher than build1 and build2.

I tried it three times, but the results were inconsistent.

The first attempt (20 invocations): [plot]
The second attempt (another 20 invocations, same configuration): [plot]
The third attempt (another 20 invocations, same configuration): [plot]

Plotty links: …
The results do not show that "build3" regressed. |
This time, I ran luindex for 40 invocations (instead of 20), with 5 iterations each (instead of 2). The result is: [plot]

A result like this shows that #1182 didn't cause the regression. It is more likely that other factors (environment changes) impacted the performance. |
@wks if you take a look at the log file, is there anything different in terms of kernel versions, CPU frequencies, etc. before and after the regression? |
Then it was probably from "Microcode update for Zenbleed" (https://github.com/anupli/momas/issues/14) which was applied to our machines on 2 Aug. |
The kernel versions are the same, and the CPU frequencies only differ by 0.8% (the frequency is actually faster after the regression). There is no microcode version logged, though. Maybe we should print … |
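If we want that in the logs, one option is to read it from /proc/cpuinfo, which on Linux/x86 exposes a "microcode" field per logical CPU. A minimal sketch (the output format is my own choice):

```rust
use std::fs;

fn main() {
    // On Linux/x86, /proc/cpuinfo repeats a "microcode" line for each logical
    // CPU; the first occurrence is enough to record as run metadata.
    let cpuinfo = fs::read_to_string("/proc/cpuinfo").unwrap_or_default();
    let microcode = cpuinfo
        .lines()
        .find(|line| line.starts_with("microcode"))
        .and_then(|line| line.split(':').nth(1))
        .map(str::trim)
        .unwrap_or("unknown");
    println!("microcode: {microcode}");
}
```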
I assume we can close this. It is not a performance regression, but a CI environment change. |
There is a 10% performance slowdown observed in our CI tests on 20240802 for luindex with OpenJDK GenImmix and StickyImmix:
The PR for the run should be #1182.