Library and Tools Subset build times regression (50+% increase) #82583
Comments
We updated the Roslyn compiler version in this commit range. Maybe there's a regression in the compiler?
This is related to #76454.
Looking at the time taken to build SPC (System.Private.CoreLib), the regression window becomes pretty obvious:
After 8 UTC, which matches the merge time of #81164.
Tagging subscribers to this area: @dotnet/area-infrastructure-libraries. (Issue details are quoted in the Description section below.)
@EgorBo To validate whether this build time increase was caused by the new length-based switch dispatch analysis, there is a command-line flag that will disable the analysis:
Can confirm that
Moving to Roslyn since this is definitely a compiler perf regression.
Did the Roslyn update revert (#82466) fix the 50% regression? The runtime pipeline is still timing out.
I've been doing some investigation locally, comparing build times before and after the revert PR, and saw no significant difference. An earlier comment pointed at PR #81164 as the source of the build-time regression, but I observed that the revert PR is not a complete revert of that original PR (#81164); the original PR also added some source generators. Running similar tests comparing #81164 against the commit preceding it also yielded no significant build-time difference. Could someone look at this with me to confirm my repro steps or to provide a before/after repro?
@hoyosjs had data showing the slowdown in the System.Private.CoreLib build, which didn't have any source-generator changes.
Let me check the data again.
It's probably more productive to just look at a trace before and after the original merge, instead of before and after the revert. That should identify the problem, regardless of where it lies.
The revert helped, even with the libraries build; the partial revert fixed the issues. The official commands are
@jcouv, @hoyosjs and I spent a few hours debugging the issue this afternoon. It appears the regression occurs because the VBCSCompiler process is crashing during the build. The crash causes us to fall back to csc, which drops the benefit of the compiler server and increases the build time. In the cases where VBCSCompiler does not crash, the build time is as expected. The stack trace of the crash is below. Confirmed this is happening on the 8.0 preview runtime.
@hoyosjs has dumps of the crash available for investigation. Moving back to runtime.
@jaredpar can a VBCSCompiler crash be turned into some sort of build failure?
/cc @jeffschwMSFT for visibility
This is a tricky question. Consider this case, where essentially a native AV rudely tears down the VBCSCompiler process. From the perspective of the compiler MSBuild task this manifests as a broken named pipe. Raising an error for a single named pipe failure is a bit of a dicey proposition because that will happen in the wild (I/O can and will fail). Hard to see us shipping that to customers due to the noise false positives would cause.

Looking at the totality of the MSBuild log in this case, though, made it very clear a bug was occurring. That's because at the solution build level I could see multiple compiler server broken pipe failures in the log (we log server events in the binlog). Once I saw that, it was a virtual lock that a crash was happening. But MSBuild tasks don't have solution-level views of events; they only see the subset of events that come their way, so there isn't a mechanism to make this determination today.

Essentially this is another case where MSBuild / binary log analyzers would be a benefit to the ecosystem. At the solution build level it's very easy to spot these failures, and several other types of issues that can come up.
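As a rough illustration of that kind of solution-level binlog analysis, here is a minimal C# sketch built on MSBuild's BinaryLogReplayEventSource (it assumes a reference to the Microsoft.Build package, and the "named pipe" text it matches on is an assumption rather than the actual wording of the server events):

```csharp
using System;
using Microsoft.Build.Logging;

class BinlogServerFailureScan
{
    static void Main(string[] args)
    {
        // Path to the binary log to scan; "msbuild.binlog" is just an assumed default.
        string binlogPath = args.Length > 0 ? args[0] : "msbuild.binlog";
        int brokenPipeCount = 0;

        var replay = new BinaryLogReplayEventSource();
        replay.AnyEventRaised += (sender, e) =>
        {
            // Count events whose message looks like a compiler-server pipe failure.
            // The matched text is an assumption; the real server events may be worded differently.
            if (e.Message != null &&
                e.Message.Contains("named pipe", StringComparison.OrdinalIgnoreCase))
            {
                brokenPipeCount++;
            }
        };
        replay.Replay(binlogPath);

        // One failure can be transient I/O; several across a solution build strongly
        // suggest the server process is crashing and compilations are falling back to csc.
        Console.WriteLine($"Possible compiler-server pipe failures: {brokenPipeCount}");
    }
}
```

Run it against the binlog produced by building with the -bl switch.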
This does not reproduce reliably. When we were debugging last night I'd say it reproduced something like 50% of the time. We could not get it to reproduce under a debugger (VS or WinDbg). Instead we had to flip the HKLM registry key to grab dumps on crashes and run the build a few times till we got the crash.
It's a poor substitute for what you actually want, but one option for this specific thing would be to keep state in MSBuild's persistence mechanism: on pipe failure, increment a persisted count, and if
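A minimal sketch of that persisted-counter idea, assuming IBuildEngine4's RegisterTaskObject/GetRegisteredTaskObject store is the persistence mechanism meant here; the task name, key, threshold, and the way the pipe failure gets reported to it are all placeholders:

```csharp
using Microsoft.Build.Framework;
using Microsoft.Build.Utilities;

// Hypothetical task that a compiler-task wrapper could invoke whenever it observes a
// broken server pipe, keeping a count for the duration of the build on this node.
public class RecordServerPipeFailure : Task
{
    private const string CounterKey = "CompilerServerPipeFailureCount"; // hypothetical key
    private const int Threshold = 3;                                    // hypothetical threshold

    public override bool Execute()
    {
        // IBuildEngine4 exposes MSBuild's per-build task-object store.
        var engine = (IBuildEngine4)BuildEngine;

        int count = engine.GetRegisteredTaskObject(CounterKey, RegisteredTaskObjectLifetime.Build) is int c ? c : 0;
        count++;

        engine.UnregisterTaskObject(CounterKey, RegisteredTaskObjectLifetime.Build);
        engine.RegisterTaskObject(CounterKey, count, RegisteredTaskObjectLifetime.Build, allowEarlyCollection: false);

        if (count >= Threshold)
        {
            Log.LogWarning($"Compiler server pipe failed {count} times in this build; the server may be crashing.");
        }

        return true;
    }
}
```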
That generally runs counter to the problem though. Generally, in the wild at least, it's one, maybe two crashes for a large build. The reason we end up with multiple broken pipe entries in the log is that we have builds running in parallel on different nodes, so each node gets a broken pipe for one crash and its individual heuristic counter stays at one. I agree this can have a positive impact though. Been using this trick in
Is there a real persistence mechanism? We've been using mutable
Also remember that killing the VBCSCompiler process is considered safe shutdown for csc. I don't think adding any extra layers for csc here would help much. Instead, I think CI should be generally interested in crashing processes, especially if we're running things on preview runtimes. Having some sort of monitoring for all crashing processes during a run seems useful.
Agree. It's also pretty easy to set up and could be done fairly generally at the Arcade level. Effectively all that needs to be done is pick a directory to put dumps in, say artifacts\dumps, and set the Windows Error Reporting LocalDumps keys:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps]
"DumpFolder"="c:\\generated\\path\\artifacts\\dumps"
"DumpCount"=dword:00000001
"DumpType"=dword:00000002

Have a step that publishes the content of that directory and another that fails the build if it's not empty.
while working around places we purposefully launch and crash processes, e.g. while testing Environment.FailFast.
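A minimal sketch of the "fail the build if the dump directory is not empty" step, with a hypothetical allow-list for processes that are crashed on purpose (such as the Environment.FailFast tests mentioned above); the paths and process names are assumptions:

```csharp
using System;
using System.IO;
using System.Linq;

class CheckDumpFolder
{
    static int Main(string[] args)
    {
        string dumpDir = args.Length > 0 ? args[0] : @"artifacts\dumps"; // assumed dump directory
        string[] expectedCrashers = { "FailFastTestHost" };              // hypothetical process names

        if (!Directory.Exists(dumpDir))
            return 0;

        // WER names local dumps like "<process>.exe.<pid>.dmp", so filter on the file name prefix.
        var unexpected = Directory.EnumerateFiles(dumpDir, "*.dmp")
            .Where(f => !expectedCrashers.Any(p =>
                Path.GetFileName(f).StartsWith(p, StringComparison.OrdinalIgnoreCase)))
            .ToList();

        foreach (var dump in unexpected)
            Console.Error.WriteLine($"Unexpected crash dump: {dump}");

        return unexpected.Count == 0 ? 0 : 1; // non-zero exit code fails the pipeline step
    }
}
```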
@hoyosjs could you please share a link to the dumps? I assume this is happening with the Preview 1 SDK?
Sent directly.
I notice that @jkoritzinsky had updated the thread statics allocation logic as part of #80189. That change is included in the SDK currently in main (8.0.0-alpha.1.23057.5).
There was a bug in that change that messed up zeroing for the dynamic case. The fix didn't make that alpha release.
Ah yeah, upgrading to Preview 1 would be the right fix then.
This is an issue when collecting thread statics for an unloadable assembly (Microsoft.CodeAnalysis.CodeStyle in this case). Then, as we loop to free the TLM, we hit a case where a CollectibleDynamicEntry is not properly zeroed.
Yeah, that's the right fix.
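For illustration, a rough C# sketch of the scenario described above: thread statics on a type loaded into a collectible AssemblyLoadContext, which exercises the dynamic/collectible thread-statics path when the context is unloaded. The assembly path, type name, and method name are placeholders, and the loaded type is assumed to declare [ThreadStatic] fields:

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.Loader;

class CollectibleThreadStaticsScenario
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void LoadUseAndUnload(string assemblyPath)
    {
        // Collectible context, as used for analyzer assemblies like Microsoft.CodeAnalysis.CodeStyle.
        var alc = new AssemblyLoadContext("analyzer", isCollectible: true);
        Assembly asm = alc.LoadFromAssemblyPath(assemblyPath);

        // Touching a type with [ThreadStatic] fields allocates its collectible
        // thread statics on this thread; the names here are hypothetical.
        Type t = asm.GetType("SomeAnalyzerType", throwOnError: false);
        t?.GetMethod("DoWork")?.Invoke(null, null);

        alc.Unload(); // later collection of the context frees its thread-statics storage
    }

    static void Main()
    {
        LoadUseAndUnload(@"path\to\analyzer.dll"); // placeholder path

        // Force the collectible context (and its thread statics) to actually be collected.
        for (int i = 0; i < 3; i++)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
    }
}
```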
Description
Runtime builds for "-subset tools+libs" have started taking much more time, as seen in timeouts in the performance pipelines and the official runtime build pipelines. The build time seems to have increased by 50+% (~40 min to 1+ hr in the perf pipeline).
Regression?
Data is below with diffs from before and after, although both performance pipeline runs used the same .NET SDK version: 8.0.100-alpha.1.23061.8.
Data
Specific runs that should properly capture the closest normal and regressed build: