Library and Tools Subset build times regression (50+% increase) #82583
Comments
We updated the Roslyn compiler version in this commit range. Maybe there's a regression in the compiler?
This is related to #76454.
Looking at the time taken to build SPC (System.Private.CoreLib), the regression window becomes pretty obvious:
After 8 UTC, which matches the merge time of #81164.
Tagging subscribers to this area: @dotnet/area-infrastructure-libraries. (Issue details are quoted in the Description section below.)
@EgorBo To validate whether this build time increase was caused by the new length-based switch dispatch analysis, there is a command-line flag that will disable the analysis:
Can confirm that
Moving to Roslyn since this is definitely a compiler perf regression.
Did the Roslyn update revert (#82466) fix the 50% regression? The runtime pipeline is still timing out.
I've been doing some investigation locally, comparing build times before and after the revert PR, and saw no significant difference. An earlier comment pointed at PR #81164 as the source of the build-time regression, but I observed that the revert PR is not a complete revert of that original PR (#81164); the original PR also added some source generators. Running similar tests comparing #81164 against the commit preceding it also yielded no significant build-time difference. Could someone look at this with me to confirm my repro steps or to provide a before/after repro?
@hoyosjs had data showing the slowdown in the System.Private.CoreLib build, which didn't have any source-generator changes.
Let me check the data again.
It's probably more productive to just look at a trace before and after the original merge, instead of before and after the revert. That should identify the problem, regardless of where it lies.
The revert helped, even with the libraries build; the partial revert fixed the issues. The official commands are
@jcouv, @hoyosjs and I spent a few hours debugging the issue this afternoon. It appears the regression occurs because the VBCSCompiler process is crashing during the build. The crash causes us to fall back to csc, which drops the benefit of the compiler server and increases the build time. In the cases where VBCSCompiler does not crash, the build time is as expected. The stack trace of the crash is below. Confirmed this is happening on the 8.0 preview runtime.
@hoyosjs has dumps of the crash available for investigation. Moving back to runtime.
@jaredpar can a VBCSCompiler crash be turned into some sort of build failure?
/cc @jeffschwMSFT for visibility
This is a tricky question. Consider this case, where essentially a native AV rudely tears down the VBCSCompiler process. From the perspective of the compiler MSBuild task this manifests as a broken named pipe. Raising an error for a single named pipe failure is a bit of a dicey proposition because that will happen in the wild (I/O can and will fail). Hard to see us shipping that to customers due to the noise false positives would cause.

Looking at the totality of the MSBuild log in this case, though, made it very clear a bug was occurring. That's because at the solution build level I could see multiple compiler server broken pipe failures in the log (we log server events in the binlog). Once I saw that, it was a virtual lock that a crash was happening. But MSBuild tasks don't have solution-level views of events; they only see the subset of events that come their way, so there isn't a mechanism to make this determination today.

Essentially this is another case where MSBuild / binary log analyzers would be a benefit to the ecosystem. At the solution build level it's very easy to spot these failures, and several other types of issues that can come up.
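As a rough illustration of that kind of solution-level binlog analysis, here is a minimal C# sketch built on MSBuild's BinaryLogReplayEventSource (it assumes a reference to the Microsoft.Build package, and the "named pipe" text it matches on is an assumption rather than the actual wording of the server events):

```csharp
using System;
using Microsoft.Build.Logging;

class BinlogServerFailureScan
{
    static void Main(string[] args)
    {
        // Path to the binary log to scan; "msbuild.binlog" is just an assumed default.
        string binlogPath = args.Length > 0 ? args[0] : "msbuild.binlog";
        int brokenPipeCount = 0;

        var replay = new BinaryLogReplayEventSource();
        replay.AnyEventRaised += (sender, e) =>
        {
            // Count events whose message looks like a compiler-server pipe failure.
            // The matched text is an assumption; the real server events may be worded differently.
            if (e.Message != null &&
                e.Message.Contains("named pipe", StringComparison.OrdinalIgnoreCase))
            {
                brokenPipeCount++;
            }
        };
        replay.Replay(binlogPath);

        // One failure can be transient I/O; several across a solution build strongly
        // suggest the server process is crashing and compilations are falling back to csc.
        Console.WriteLine($"Possible compiler-server pipe failures: {brokenPipeCount}");
    }
}
```

Run it against the binlog produced by building with the -bl switch.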
This does not reproduce reliably. When we were debugging last night I'd say it reproduced something like 50% of the time. We could not get it to reproduce under a debugger (VS or WinDbg). Instead we had to flip the HKLM registry key to grab dumps on crashes and run the build a few times till we got the crash.
It's a poor substitute for what you actually want, but one option for this specific thing would be to keep state in MSBuild's persistence mechanism: on pipe failure, increment a persisted count, and if
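A minimal sketch of that persisted-counter idea, assuming IBuildEngine4's RegisterTaskObject/GetRegisteredTaskObject store is the persistence mechanism meant here; the task name, key, threshold, and the way the pipe failure gets reported to it are all placeholders:

```csharp
using Microsoft.Build.Framework;
using Microsoft.Build.Utilities;

// Hypothetical task that a compiler-task wrapper could invoke whenever it observes a
// broken server pipe, keeping a count for the duration of the build on this node.
public class RecordServerPipeFailure : Task
{
    private const string CounterKey = "CompilerServerPipeFailureCount"; // hypothetical key
    private const int Threshold = 3;                                    // hypothetical threshold

    public override bool Execute()
    {
        // IBuildEngine4 exposes MSBuild's per-build task-object store.
        var engine = (IBuildEngine4)BuildEngine;

        int count = engine.GetRegisteredTaskObject(CounterKey, RegisteredTaskObjectLifetime.Build) is int c ? c : 0;
        count++;

        engine.UnregisterTaskObject(CounterKey, RegisteredTaskObjectLifetime.Build);
        engine.RegisterTaskObject(CounterKey, count, RegisteredTaskObjectLifetime.Build, allowEarlyCollection: false);

        if (count >= Threshold)
        {
            Log.LogWarning($"Compiler server pipe failed {count} times in this build; the server may be crashing.");
        }

        return true;
    }
}
```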
That generally runs counter to the problem though. Generally, in the wild at least, it's one, maybe two crashes for a large build. The reason we end up with multiple broken pipe entries in the log is that we have builds running in parallel on different nodes, so each node gets a broken pipe for one crash and its individual heuristic counter stays at one. I agree this can have a positive impact though. Been using this trick in
Is there a real persistence mechanism? We've been using mutable
Also remember that killing the VBCSCompiler process is considered safe shutdown for csc. I don't think adding any extra layers for csc here would help much. Instead, I think CI should be generally interested in crashing processes, especially if we're running things on preview runtimes. Having some sort of monitoring for all crashing processes during a run seems useful.
Agree. It's also pretty easy to set up and could be done fairly generally at the Arcade level. Effectively all that needs to be done is pick a directory to put dumps in, say artifacts\dumps, and set the Windows Error Reporting LocalDumps keys:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps]
"DumpFolder"="c:\\generated\\path\\artifacts\\dumps"
"DumpCount"=dword:00000001
"DumpType"=dword:00000002

Have a step that publishes the content of that directory and another that fails the build if it's not empty.
while working around places we purposefully launch and crash processes, e.g. while testing Environment.FailFast.
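A minimal sketch of the "fail the build if the dump directory is not empty" step, with a hypothetical allow-list for processes that are crashed on purpose (such as the Environment.FailFast tests mentioned above); the paths and process names are assumptions:

```csharp
using System;
using System.IO;
using System.Linq;

class CheckDumpFolder
{
    static int Main(string[] args)
    {
        string dumpDir = args.Length > 0 ? args[0] : @"artifacts\dumps"; // assumed dump directory
        string[] expectedCrashers = { "FailFastTestHost" };              // hypothetical process names

        if (!Directory.Exists(dumpDir))
            return 0;

        // WER names local dumps like "<process>.exe.<pid>.dmp", so filter on the file name prefix.
        var unexpected = Directory.EnumerateFiles(dumpDir, "*.dmp")
            .Where(f => !expectedCrashers.Any(p =>
                Path.GetFileName(f).StartsWith(p, StringComparison.OrdinalIgnoreCase)))
            .ToList();

        foreach (var dump in unexpected)
            Console.Error.WriteLine($"Unexpected crash dump: {dump}");

        return unexpected.Count == 0 ? 0 : 1; // non-zero exit code fails the pipeline step
    }
}
```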
@hoyosjs could you please share a link to the dumps? I assume this is happening with the Preview 1 SDK?
Sent directly.
I notice that @jkoritzinsky had updated the thread statics allocation logic as part of #80189. That change is included in the SDK currently in main (8.0.0-alpha.1.23057.5).
There was a bug in that change that messed up zeroing for the dynamic case. The fix didn't make that alpha release.
Ah yeah, upgrading to Preview 1 would be the right fix then.
This is an issue when collecting thread statics for an unloadable assembly (Microsoft.CodeAnalysis.CodeStyle in this case). Then, as we loop to free the TLM, we hit a case where a CollectibleDynamicEntry is not properly zeroed.
Yeah, that's the right fix.
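For illustration, a rough C# sketch of the scenario described above: thread statics on a type loaded into a collectible AssemblyLoadContext, which exercises the dynamic/collectible thread-statics path when the context is unloaded. The assembly path, type name, and method name are placeholders, and the loaded type is assumed to declare [ThreadStatic] fields:

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.Loader;

class CollectibleThreadStaticsScenario
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void LoadUseAndUnload(string assemblyPath)
    {
        // Collectible context, as used for analyzer assemblies like Microsoft.CodeAnalysis.CodeStyle.
        var alc = new AssemblyLoadContext("analyzer", isCollectible: true);
        Assembly asm = alc.LoadFromAssemblyPath(assemblyPath);

        // Touching a type with [ThreadStatic] fields allocates its collectible
        // thread statics on this thread; the names here are hypothetical.
        Type t = asm.GetType("SomeAnalyzerType", throwOnError: false);
        t?.GetMethod("DoWork")?.Invoke(null, null);

        alc.Unload(); // later collection of the context frees its thread-statics storage
    }

    static void Main()
    {
        LoadUseAndUnload(@"path\to\analyzer.dll"); // placeholder path

        // Force the collectible context (and its thread statics) to actually be collected.
        for (int i = 0; i < 3; i++)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
    }
}
```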
Description
Runtime builds for "-subset tools+libs" have started taking much more time, as seen in timeouts in the performance pipelines and the official runtime build pipelines. The build time seems to have increased by 50+% (~40 min to 1+ hr in the perf pipeline).
Regression?
Data is below with diffs from before and after, although both performance pipeline runs used the same .NET SDK version: 8.0.100-alpha.1.23061.8.
Data
Specific runs that should properly capture the closest normal and regressed build: