Performance regression on 20240802 for luindex #1200

Closed

qinsoon opened this issue Sep 12, 2024 · 13 comments

Labels: C-regression (Category: Correctness or performance regression)

qinsoon (Member) commented Sep 12, 2024

There is a 10% performance slowdown observed in our CI tests on 20240802 for luindex with OpenJDK GenImmix and StickyImmix.

The PR for the run should be #1182.

qinsoon added the C-regression label on Sep 12, 2024
wks (Collaborator) commented Sep 12, 2024

It is interesting that it affects luindex so much, but not other benchmarks.

wks (Collaborator) commented Sep 18, 2024

Here are the eBPF traces of a nursery GC in the version right before #1182 was merged (which I'll call "no-assert" from now on) and in the version right after it was merged ("yes-assert").

Version "no-assert":

[image: eBPF trace plot]

Version "yes-assert":

[image: eBPF trace plot]

PR #1182 changed the way we generate "PrepareMutator" work packets:

-            for mutator in <C::VM as VMBinding>::VMActivePlan::mutators() {
-                mmtk.scheduler.work_buckets[WorkBucketStage::Prepare]
-                    .add(PrepareMutator::<C::VM>::new(mutator));
-            }
+            let prepare_mutator_packets = <C::VM as VMBinding>::VMActivePlan::mutators()
+                .map(|mutator| Box::new(PrepareMutator::<C::VM>::new(mutator)) as _)
+                .collect::<Vec<_>>();
+            // Just in case the VM binding is inconsistent about the number of mutators and the actual mutator list.
+            debug_assert_eq!(
+                prepare_mutator_packets.len(),
+                <C::VM as VMBinding>::VMActivePlan::number_of_mutators()
+            );
+            mmtk.scheduler.work_buckets[WorkBucketStage::Prepare].bulk_add(prepare_mutator_packets);

This change has two effects:

  1. We assert that the number of work packets is equal to the number of mutators reported by the binding.
  2. We add all "PrepareMutator" work packets with one bulk_add instead of many add calls, reducing the number of synchronization operations (see the sketch below).
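For illustration, here is a minimal model of the add vs. bulk_add difference, assuming a simple mutex-plus-condvar bucket; it is only a sketch of why one bulk_add needs fewer synchronization operations than many add calls, not MMTk's actual BucketQueue.

```rust
// Sketch: `add` takes the lock and wakes a worker per packet, while
// `bulk_add` takes the lock once and wakes the workers once for a whole batch.
use std::sync::{Condvar, Mutex};

struct Bucket<T> {
    queue: Mutex<Vec<T>>,
    monitor: Condvar,
}

impl<T> Bucket<T> {
    // One lock acquisition and one wake-up per packet.
    fn add(&self, packet: T) {
        self.queue.lock().unwrap().push(packet);
        self.monitor.notify_one();
    }

    // One lock acquisition and one (broadcast) wake-up for the whole batch.
    fn bulk_add(&self, packets: Vec<T>) {
        self.queue.lock().unwrap().extend(packets);
        self.monitor.notify_all();
    }
}

fn main() {
    let bucket = Bucket { queue: Mutex::new(Vec::new()), monitor: Condvar::new() };
    bucket.add("PrepareMutator #1");
    bucket.bulk_add(vec!["PrepareMutator #2", "PrepareMutator #3"]);
    println!("{} packets queued", bucket.queue.lock().unwrap().len());
}
```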

But part (2) has a side effect: it wakes fewer GC workers, and each woken worker grabs many work packets from the bucket. See the Perfetto plots above. The reason is that in our current GC worker implementation, a worker steals one batch at a time (in BucketQueue::steal_batch_and_pop). That is supposed to reduce the amount of synchronization when the bucket has many packets, in which case the load balancing is still good. However, because the number of work packets in the "Prepare" stage is small, a few GC workers grabbed most of the packets and left none for the others. You can compare the two plots above.
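As a rough illustration of that batch stealing (assuming crossbeam-deque as a dependency; the packet values are made up), a single steal_batch_and_pop call moves a whole batch of packets to one worker:

```rust
// Sketch: with only a few packets queued, one steal transfers most of them.
use crossbeam_deque::{Steal, Worker};

fn main() {
    let bucket: Worker<u32> = Worker::new_fifo();
    for packet in 0..8 {
        bucket.push(packet); // e.g. the few packets of the "Prepare" stage
    }

    let thief: Worker<u32> = Worker::new_fifo();
    // One call pops a packet for the thief and transfers a batch along with it.
    if let Steal::Success(first) = bucket.stealer().steal_batch_and_pop(&thief) {
        let mut extra = 0;
        while thief.pop().is_some() {
            extra += 1;
        }
        println!("thief got packet {first} plus {extra} more in one steal");
    }

    let mut remaining = 0;
    while bucket.pop().is_some() {
        remaining += 1;
    }
    println!("{remaining} packets left for all other workers");
}
```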

And one work packet in the "Prepare" stage is particularly unbalanced: "ScanCodeCacheRoots". It scanned 26845 code cache root slots and dominated the "Prepare" stage. (FYI, a derived "GenNurseryProcessEdges" work packet contained those 26845 slots from the code cache roots and dominated the "Closure" stage.) Before PR #1182, that packet was usually picked up by a separate GC worker which had just finished its PrepareMutator packet, the only packet added at that time. After that PR, however, one GC worker usually picked up many other work packets together with "ScanCodeCacheRoots". See the circle in the second plot above. That lengthened the critical path of the "Prepare" stage.

In fact, this affects other plans (including SemiSpace) too, but it is less obvious there because a full-heap GC spends more time in the closure stage, so the relative overhead is smaller. See the plot below.

[image: eBPF trace plot]

The recently merged PR mmtk/mmtk-openjdk#282 makes code cache roots generational and allows them to be split into multiple "ProcessEdgesWork" work packets, each having at most 4096 elements. Nursery GCs usually have zero newly added code cache roots, so for them the cost is eliminated entirely. See the plot below.

[image: eBPF trace plot]
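A minimal sketch of that splitting, assuming hypothetical RootSlot and ProcessEdgesPacket types rather than the binding's real ones:

```rust
// Sketch: chunk the code cache root slots into packets of at most 4096 slots.
const MAX_SLOTS_PER_PACKET: usize = 4096;

#[derive(Clone, Copy)]
struct RootSlot(usize); // placeholder for a real slot/edge type

struct ProcessEdgesPacket {
    slots: Vec<RootSlot>,
}

fn make_packets(code_cache_roots: &[RootSlot]) -> Vec<ProcessEdgesPacket> {
    code_cache_roots
        .chunks(MAX_SLOTS_PER_PACKET)
        .map(|chunk| ProcessEdgesPacket { slots: chunk.to_vec() })
        .collect()
}

fn main() {
    let roots = vec![RootSlot(0); 26845]; // the slot count observed in the trace
    let packets = make_packets(&roots);
    // ceil(26845 / 4096) = 7 packets instead of a single oversized one.
    println!("{} packets, first has {} slots", packets.len(), packets[0].slots.len());
}
```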

For full-heap GCs, it greatly improves the load balance of the "Closure" stage, but it does not help reduce the latency of the "Prepare" stage. See the plot below.

[image: eBPF trace plot]

The reason is that ScanCodeCacheRoots is still a single work packet that scans all the code cache roots. Because the code cache roots are stored in two HashSet data structures, the scanning cannot be parallelized (see the illustration below).
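A generic illustration of that limitation (not the binding's code): a HashSet only exposes a sequential iterator, so partitioning it requires walking or copying it, whereas an indexable container can be split cheaply.

```rust
// Sketch: HashSet iteration cannot be split by index; a Vec/slice can.
use std::collections::HashSet;

fn main() {
    let code_cache_roots: HashSet<usize> = (0..10_000).collect();

    // To partition a HashSet we must first walk it (here, copying it out).
    let copied: Vec<usize> = code_cache_roots.iter().copied().collect();

    // A slice, by contrast, can be split by index for parallel scanning.
    let (first_half, second_half) = copied.split_at(copied.len() / 2);
    println!("{} + {} root slots", first_half.len(), second_half.len());
}
```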

In conclusion, the root causes are

  1. the extra-large ScanCodeCacheRoots work packet, and
  2. the "batch-stealing" behavior of the work-stealing library

Both of them contribute to the load imbalance of the "Prepare" stage.

eBPF traces

Here are some eBPF trace logs for those who are interested. You can load them into the Perfetto UI.

In the following logs, the plan is GenImmix unless explicitly labelled. They record one out of every 10 GCs.

And one from the master branch, but recording every single GC.

About the added assertion

The assertion has no impact on the nursery GCs we observed. Those nursery GCs usually don't generate any SweepChunk work packets at all, leaving nothing to assert.

qinsoon (Member, Author) commented Sep 19, 2024

If I understand you right, our GC workers always steal a batch of packets, and when there aren't many packets, one worker may steal all of them. The expected behavior would be that other workers starve and then steal packets from that worker, eventually balancing the load. Does that stealing actually happen? Is it costly to steal?

qinsoon (Member, Author) commented Sep 19, 2024

steal_batch would steal around half of the tasks, as specified in https://docs.rs/crossbeam-deque/latest/crossbeam_deque/struct.Stealer.html#method.steal_batch.

We may be able to use a prioritized queue for ScanCodeCacheRoots, so that some worker starts working on it from the very beginning while other workers work on other packets (see the sketch below).
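A rough sketch of that idea, with a plain max-heap standing in for a prioritized bucket; the cost values (and any packet name other than ScanCodeCacheRoots and PrepareMutator) are made up, and this is not MMTk's scheduler.

```rust
// Packets carry an estimated cost; workers pop the most expensive one first,
// so the known-large ScanCodeCacheRoots packet starts as early as possible.
use std::collections::BinaryHeap;

fn main() {
    // (estimated cost, packet name); BinaryHeap is a max-heap on the tuple.
    let mut bucket: BinaryHeap<(u64, &'static str)> = BinaryHeap::new();
    bucket.push((26_845, "ScanCodeCacheRoots")); // known to be oversized
    bucket.push((1, "PrepareMutator"));
    bucket.push((1, "SomeOtherSmallRootPacket")); // hypothetical small packet

    // An idle worker takes the most expensive packet first.
    while let Some((cost, name)) = bucket.pop() {
        println!("next packet: {name} (estimated cost {cost})");
    }
}
```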

wks (Collaborator) commented Sep 19, 2024

If I understand you right, our GC workers always steal a batch of packets, and when there aren't many packets, one worker may steal all of them. The expected behavior would be that other workers starve and then steal packets from that worker, eventually balancing the load. Does that stealing actually happen? Is it costly to steal?

In theory, they should be able to steal from other workers. But in practice, it seems to take longer for a thread to wake up due to OS scheduling.

[image: eBPF trace plot]

After the "Prepare" work packet which added the "PrepareCollector" work packets which are "designated work". All workers should wake up and find a "designated work" to do. However, some workers wake up earlier, and some workers wake up later. When a late worker wakes up (the ones that only got a "PrepareCollector" to execute), they wake up so late that other workers have already finished executing other work packets, which are usually small root-scanning packets that can be executed very quickly. That leaves them nothing to steal.

From the eBPF plot, the latency of some PrepareCollector packets is 56 µs. I am not sure whether this is simply the scheduling latency Linux is capable of, but if the system is running other processes, some processors may be busy with other tasks and never get a chance to run a GCWorker.

wks (Collaborator) commented Sep 19, 2024

FYI, we discussed the OS scheduling problem in a previous issue: #1160

qinsoon (Member, Author) commented Sep 19, 2024

There seem to be several different issues:

  1. notify_all in bulk_add vs notify_one in add. When we do notify_all, all the workers that are woken up will contend on the same work queue, and we may see worse latency (see the sketch after this list).
  2. Workers starve. When workers are idle and we add a small number of work packets with bulk_add, some workers just won't get any packet, and steal_batch makes the situation worse. We should probably be careful about using bulk_add instead of add when the number of packets is small.
  3. A foreseeably long-running packet may need to be put into a prioritized queue, so that an available worker starts on the prioritized packet while leaving the other packets to other workers.
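A minimal illustration of point 1, assuming a plain mutex-plus-condvar worker monitor rather than MMTk's actual one: a broadcast notify_all wakes every waiting worker at once, and they all contend on the same queue lock.

```rust
// Sketch only: four workers wait on one shared queue; a bulk-style push plus
// notify_all wakes all of them, and they serialize on the queue's mutex.
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let shared = Arc::new((Mutex::new(Vec::<u32>::new()), Condvar::new()));
    let mut workers = Vec::new();

    for id in 0..4 {
        let shared = Arc::clone(&shared);
        workers.push(thread::spawn(move || {
            let (queue, cv) = &*shared;
            let mut q = queue.lock().unwrap();
            // Every worker waits on the same condvar and the same lock.
            while q.is_empty() {
                q = cv.wait(q).unwrap();
            }
            let packet = q.pop();
            println!("worker {id} took packet {packet:?}");
        }));
    }

    thread::sleep(Duration::from_millis(50)); // let the workers go idle
    {
        let (queue, cv) = &*shared;
        queue.lock().unwrap().extend([10, 20, 30, 40]); // one bulk push
        cv.notify_all(); // wake every worker at once; they race for the lock
    }
    for w in workers {
        w.join().unwrap();
    }
}
```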

wks (Collaborator) commented Sep 23, 2024

I took another look at the regression on 2024-08-02.

Looking at the original log again

The raw execution logs from 2024-08-02 and the one immediately before that can be found in

The luindex benchmark is from DaCapo 2006, run at 6x min heap size (22MB * 6 = 132MB), with 20 invocations and 2 iterations each. It does not use probes, so it only reports the total execution time of the last iteration, in the form "===== DaCapo luindex PASSED in xxxxxx msec =====".

From the log, we can find the mean execution times:

| date | mean (ms) | std (ms) |
| --- | --- | --- |
| 2024-07-26 | 2211.8 | 8.87 |
| 2024-08-02 | 2393.0 | 16.91 |

The change of mean execution time is +8.19%.
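A quick sanity check of that figure:

```rust
// (2393.0 - 2211.8) / 2211.8 * 100 ≈ +8.19%
fn main() {
    let (before, after) = (2211.8_f64, 2393.0_f64); // mean times from the table
    println!("{:+.2}%", (after - before) / before * 100.0); // prints +8.19%
}
```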

Re-running on groundhog.moma

I re-ran luindex on groundhog.moma (the same CPU as hammer.moma and wrench.moma). It used GenImmix, 132MB heap size, 20 invocations, 2 iterations each, just like on hammer and wrench.

The "build3" is the commit for #1182. "build2" and "build1" are the two commits before that, and "build4" is the commit after that. If "build3" caused the regression, we should see the total time of build3 and build4 being similar, but obviously higher than build1 and build2.

I tried it three times, but the results were inconsistent.

The first attempt (20 invocations)

[image: benchmark results plot]

The second attempt (another 20 invocations, same configuration)

[image: benchmark results plot]

The third attempt (another 20 invocations, same configuration)

[image: benchmark results plot]

Plotty links:

The results do not show that "build3" regressed.

  • The error bars overlap.
  • "build3" was the fastest one in the first attempt, but the slowest in the subsequent attempts.
  • "build4" was after "build3". If "build3" regressed, "build4" should have bad performance, too. But it was not the case.
  • The logs from hammer and wrench showed a slowdown of 8.91% in total execution time, but the difference between the best and the worst between build1-build4 was only between 0.1%-0.2%.

wks (Collaborator) commented Sep 23, 2024

This time, I ran luindex for 40 invocations (instead of 20), with 5 iterations each (instead of 2). The result is:

[image: benchmark results plot]

Plotty: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|groundhog-2024-09-23-Mon-120221&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|41&Histogram%20(with%20CI)^build^benchmark&

A result like this shows that #1182 didn't cause the regression. It is more likely that other factors (environment changes) impacted the performance.

caizixian (Member) commented Sep 23, 2024

@wks if you take a look at the log file, is there anything different in terms of kernel versions, CPU frequencies, etc before and after the regression?

qinsoon (Member, Author) commented Sep 23, 2024

Then it was probably due to the "Microcode update for Zenbleed" (https://github.com/anupli/momas/issues/14), which was applied to our machines on 2 Aug.

wks (Collaborator) commented Sep 24, 2024

@wks if you take a look at the log file, is there anything different in terms of kernel versions, CPU frequencies, etc before and after the regression?

The kernel versions are the same, and the CPU frequencies only differ by 0.8% (the frequency is actually faster after the regression). There is no microcode version logged, though. Maybe we should print /proc/cpuinfo (which contains the microcode version and the CPU frequencies) into the log, too.

qinsoon (Member, Author) commented Sep 27, 2024

I assume we can close this. It is not a performance regression, but a CI environment change.

qinsoon closed this as completed on Sep 27, 2024