[BUG]: PowerShell task idles before generating script #4758

Open
1 of 4 tasks
jwikman opened this issue Apr 11, 2024 · 20 comments

@jwikman

jwikman commented Apr 11, 2024

What happened?

This was reported at https://developercommunity.visualstudio.com/t/PowerShell-task-idles-before-generating-/10636846, but I was advised to report it here as well.


We've got several self-hosted Azure DevOps agents running on one of our servers.

We've seen an issue that has escalated over the last month or so. I believe it correlates with updating all agents to the latest release (v3.236.1); at the same time we started running on PowerShell 7 (with pwsh: true).

After the server is started, our pipelines run fast for several days, or maybe a week or so.
Then we start to see performance problems, where many tasks in the pipeline take much longer to complete.

It seems it is the PowerShell tasks (PowerShell@2) that take longer and longer to start.
Checking the logs, we can see that the PowerShell task header is written, and then the task idles before the text "Generating script" is logged.
Looking at old logs, I can see that it used to take about 3-4 s before "Generating script" was printed.
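To quantify this, the delay can be read straight off a downloaded task log. A minimal sketch (assuming the default "<ISO-8601 timestamp>Z <text>" prefix shown in the excerpts below):

# Compute the delay between the task header and "Generating script." in a downloaded log
$lines = Get-Content .\task.log
$start = [datetimeoffset]::Parse((($lines -match '##\[section\]Starting:')[0] -split ' ')[0])
$gen   = [datetimeoffset]::Parse((($lines -match 'Generating script')[0] -split ' ')[0])
"Delay before 'Generating script': {0:N1} s" -f ($gen - $start).TotalSeconds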

2024-01-17T14:34:18.3486490Z ##[section]Starting: Create Build Container
2024-01-17T14:34:18.3652184Z ==============================================================================
2024-01-17T14:34:18.3652483Z Task : PowerShell
2024-01-17T14:34:18.3652629Z Description : Run a PowerShell script on Linux, macOS, or Windows
2024-01-17T14:34:18.3652823Z Version : 2.232.0
2024-01-17T14:34:18.3652970Z Author : Microsoft Corporation
2024-01-17T14:34:18.3653124Z Help : https://docs.microsoft.com/azure/devops/pipelines/tasks/utility/powershell
2024-01-17T14:34:18.3653358Z ==============================================================================
2024-01-17T14:34:22.2085506Z Generating script.
2024-01-17T14:34:22.2418319Z ========================== Starting Command Output ===========================
2024-01-17T14:34:22.2628841Z ##[command]"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -NoLogo -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -Command ". 'C:\Agent_AS015_scheduled_03\_work\_temp\ddcc7e46-c1ed-477b-badb-77d665f4bf1c.ps1'"

But lately, when the server slows down, this time keeps increasing and increasing until some colleague yells about it. Yesterday there were about 10 minutes between those two steps.

2024-04-10T13:22:50.2800169Z ##[section]Starting: Create Build Container
2024-04-10T13:22:50.2922041Z ==============================================================================
2024-04-10T13:22:50.2922184Z Task : PowerShell
2024-04-10T13:22:50.2922253Z Description : Run a PowerShell script on Linux, macOS, or Windows
2024-04-10T13:22:50.2922369Z Version : 2.237.5
2024-04-10T13:22:50.2922600Z Author : Microsoft Corporation
2024-04-10T13:22:50.2922694Z Help : https://docs.microsoft.com/azure/devops/pipelines/tasks/utility/powershell
2024-04-10T13:22:50.2922805Z ==============================================================================
2024-04-10T13:32:35.6715441Z Generating script.
2024-04-10T13:32:35.7043642Z ========================== Starting Command Output ===========================
2024-04-10T13:32:35.7294220Z ##[command]"C:\Program Files\PowerShell\7\pwsh.exe" -NoLogo -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -Command ". 'C:\Agent_AS015_01\_work\_temp\3243b6c1-4d9e-4855-b9eb-389a1dae31d0.ps1'"

If I restart the server, all is fine for some days, then this starts again.

When monitoring the server, I can see that the pipeline agents create a LOT (50 to 100 or so) of WMIC processes, which seems to cause high CPU load on WmiPrvSE.exe… Is this expected behavior? They all just seem to check the available memory on the host.
[screenshot: list of WMIC processes spawned by the agents]

Yesterday, just before restarting the server, I saw the above list of WMIC processes and they just seemed to live on.
Today, I still see a lot of WMIC processes being started and we get a CPU spike in WmiPrvSE, but after a few seconds (3 to 4) they are terminated and CPU goes back down.
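For reference, a snapshot like the following (standard CIM cmdlets run on the build host; a diagnostic sketch, not part of the agent) shows how many WMIC instances are alive and confirms what they are querying:

# List all running WMIC.exe instances together with their command lines
Get-CimInstance Win32_Process -Filter "Name = 'WMIC.exe'" |
    Select-Object ProcessId, CreationDate, CommandLine |
    Sort-Object CreationDate |
    Format-Table -AutoSize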

We've got 12 agents running on the same machine, with 32 cores and 128 GB of RAM, but since they are divided into two different agent pools there are most often no more than 6 agents running at the same time.

I believe I found the code that is starting the WMIC processes:

processStartInfo.Arguments = "OS GET FreePhysicalMemory,TotalVisibleMemorySize /Value";

But why it creates that many processes puzzles me… There is a loop in RunMemoryUtilizationMonitor() that should wait 5 s between each check, but it seems this does not work as intended - or am I mistaken?
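The query itself can also be timed in isolation (a sketch using the same arguments as the agent code above) to see how slow WMI gets once WmiPrvSE.exe is under load:

# Time the exact query the agent issues; under WMI provider load this can take
# several seconds instead of being near-instant
Measure-Command { wmic OS GET FreePhysicalMemory,TotalVisibleMemorySize /Value }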

Versions

Agent version: 3.236.1
Windows Version: Windows Server 2022, v21H2 (OS Build 20348.2402)

Environment type (Please select at least one environment where you face this issue)

  • [x] Self-Hosted
  • [ ] Microsoft Hosted
  • [ ] VMSS Pool
  • [ ] Container

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Azure DevOps Server Version (if applicable)

No response

Operating system

No response

Version control system

No response

Relevant log output

See under "What happened" above.
@Materia-xx

Same here, except in our case the agents just time out jobs before they generate the script. Reboots do help, but only for a short time before things start hanging again. Ironic that the WMIC call is apparently meant to measure resource utilization, yet running 50 of them in parallel is what caps the CPU.

@ismayilov-ismayil
Contributor

Hi @jwikman, thanks for reporting. It looks like this is related to the agent trying to get memory usage in each context, and since there are a lot of contexts in your case, it may produce more WMIC processes than expected. We will continue to investigate.

@jwikman
Author

jwikman commented Apr 12, 2024

Hi @ismayilov-ismayil
I've uploaded some logs to the problem reported on Developer Community, but they are not public. Can you access them as an MS representative somehow, or can I send you the logs in a non-public way?

I was running a very simple "Hello World" pipeline at the same time that we were experiencing this issue, with System Diagnostics enabled.

@jwikman
Author

jwikman commented Apr 13, 2024

Maybe #4559 is what's causing this?
That change was first released in v3.232.4, so I'll go ahead and reinstall all agents with v3.232.3 to see if that helps.

@jwikman
Author

jwikman commented Apr 13, 2024

I ran v3.232.4 for a few hours - 50+ pipelines with 1000+ tasks - and all looked very good.

Then I upgraded back to v3.236.1 and now all the WMIC processes are back; hundreds of processes are being executed almost all the time. I also see a lot of CPU spikes that weren't there on v3.232.4.

After looking into the code, I thought I could set AZP_ENABLE_RESOURCE_UTILIZATION_WARNINGS = false as a user-defined capability on the agents, to disable the resource utilization warnings that I suspected were the reason behind my problems (I got that from #4317).
But it seems that doesn't work; I still see this in the Initialize job task:
Knob: EnableResourceUtilizationWarnings = true Source: $(AZP_ENABLE_RESOURCE_UTILIZATION_WARNINGS)

How do I disable such a feature?
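One thing I have not tried yet is writing the knob into each agent's .env file (a sketch, assuming the agent's documented .env mechanism and using one of our agent folders as an example path):

# Hypothetical attempt: add the knob to one agent's .env file and restart its
# Windows service so the agent process picks it up (paths and service names vary)
$agentRoot = 'C:\Agent_AS015_01'
Add-Content -Path (Join-Path $agentRoot '.env') -Value 'AZP_ENABLE_RESOURCE_UTILIZATION_WARNINGS=false'
Get-Service vstsagent.* | Restart-Service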

Meanwhile, I'll revert back to v3.232.4.

@jwikman
Author

jwikman commented May 3, 2024

Last night I updated to v3.238.0 to see if anything had changed. But after three agents were updated, the CPU started to hit 100% about half of the time, and it seems this was due to all the WMIC processes...

@ismayilov-ismayil any progress on this issue?
Actually, it seems to be two issues - should I split this one into two?

  1. PowerShell task idles before generating scripts (this one, originally)
  2. WMIC calls end up with 100% CPU

Right now, I cannot test the first issue, due to the second.

@ismayilov-ismayil
Contributor

Hi @jwikman
We are working on this; you can follow the progress on the Developer Community portal: https://developercommunity.visualstudio.com/t/PowerShell-task-idles-before-generating-/10636846

@KonstantinTyukalov KonstantinTyukalov self-assigned this May 6, 2024
@jwikman
Author

jwikman commented May 6, 2024

OK, thanks @ismayilov-ismayil.
Then I know this has your attention, and we'll continue on the Developer Community portal.

@KonstantinTyukalov
Contributor

@jwikman, we rolled back the recent changes that may be the root cause of the issue.
Please let us know if the problem still bothers you.

@jwikman
Author

jwikman commented May 6, 2024

@KonstantinTyukalov, in which version of the agent can I expect this to be included?
And is there a PR where I can see the change? :)

@KonstantinTyukalov
Contributor

KonstantinTyukalov commented May 6, 2024

@jwikman it should work with your current agent, as we disabled the server-side feature flag.
We'll prepare a long-term fix soon.

@jwikman
Author

jwikman commented May 6, 2024

@KonstantinTyukalov ah, OK.
Was that the AZP_ENABLE_RESOURCE_UTILIZATION_WARNINGS knob that I was trying to set?
If so, isn't it possible to set those knobs at the agent level? I thought it was, after reading #4317...

I'll try upgrading our agents later tonight.

@jwikman
Author

jwikman commented May 7, 2024

This is the current status:

All agents are now running v3.238.0.
The WMIC processes are gone, so I guess your feature flag change did this. Thanks for that! (Still curious about how I can set those feature flags myself...? 🤔)

So now we can focus on the original problem: the delays when the PowerShell tasks are generating scripts.

After I updated all agents, I could still see a lot of these delays, from 40 s to several minutes.

Hence, I had to restart the build server, and after the reboot the delays are negligible again.

Let's see how this progresses over the next few days.

@KonstantinTyukalov
Contributor

@jwikman

Still curious about how I can set those feature flag myself...?

Unfortunately, you can't disable this feature yourself right now.
We are working on improving this perf monitor, and if it becomes possible to toggle it manually, we'll let you know.

Regarding the original issue: this looks like a delay in the PowerShell handler's work to prepare the task for execution.
In my repro, I see only an ~8-second delay between these 2 log lines.
Could you add the VSTSAGENT_TRACE environment variable to the agent process for verbose tracing and re-send your _diag logs in a private message on the Developer Community ticket?
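For example, one way to set it (a sketch; setting the variable machine-wide so the agent service inherits it after a restart):

# Set VSTSAGENT_TRACE for the whole machine, then restart the agent service(s)
[Environment]::SetEnvironmentVariable('VSTSAGENT_TRACE', 'true', 'Machine')
Get-Service vstsagent.* | Restart-Service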

@jwikman
Author

jwikman commented May 8, 2024

Unfortunately, you can't disable this feature yourself right now. We are working on improving this perf monitor, and if it becomes possible to toggle it manually, we'll let you know.

I see, thanks for letting me know.

Regarding the original issue: this looks like a delay in the PowerShell handler's work to prepare the task for execution. In my repro, I see only an ~8-second delay between these 2 log lines. Could you add the VSTSAGENT_TRACE environment variable to the agent process for verbose tracing and re-send your _diag logs in a private message on the Developer Community ticket?

OK, I've uploaded new logs to the DC ticket.

I created 6 agents in a new pool for this test, on the same server as the other agent pools. Besides these 6 agents, there were 8 other agents running other pipelines on the server. All are running v3.238.0.
The pipeline I executed ran the same job in parallel on all 6 agents.
The job consists of 3 very simple PowerShell tasks:

  • Lists all environment variables (in PowerShell 7)
  • Write "Hello World" in PowerShell 7
  • Write "Hello World" in PowerShell 5
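The inline scripts were essentially the following (a sketch; the PowerShell 5 vs 7 selection is done via the task's pwsh input, not in the script itself):

# Task 1: list all environment variables (runs under PowerShell 7)
Get-ChildItem Env: | Sort-Object Name
# Tasks 2 and 3: the same one-liner, run under PowerShell 7 and Windows PowerShell 5 respectively
Write-Host 'Hello World'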

I added the variable VSTSAGENT_TRACE set to true when I started the pipeline. I also ticked the Enable system diagnostics option.

A few observations:

  • Several tasks were delayed by up to a couple of minutes before the task did anything other than printing the task header (version, author, etc.).
    • During this time, the debug messages for Agent environment resources were printed every 6 seconds or so.
  • The actual task itself takes about 1 second.
  • The finishing task (Finalize Job) took several minutes - but maybe that is caused by the extensive logging in this case?

Could the Agent environment resources checks be causing these delays?

Just let me know if you need anything else.

@jwikman
Author

jwikman commented May 8, 2024

Hmmm. I ran my pipeline once again, but without the Enable system diagnostics option (VSTSAGENT_TRACE still set to true).

Now the delays are down to about 40 s, and the finishing task runs fine. The total execution time went from over 11 minutes down to 2:40.

I uploaded those logs as well, in case the system diagnostics option added too much overhead or something.

@jwikman
Author

jwikman commented May 13, 2024

Just as an update: almost a week since the last reboot, the delay before starting a pipeline task is between 2 and 5 minutes.
I'm running my repro pipeline again in the current state and will upload the logs to the Developer Community case later.
After that I'll restart the server to get it up to speed again.

@jwikman
Author

jwikman commented May 13, 2024

Oh, and this may be another clue for this issue:
Today, when everything was really slow, an "Azure file copy" (AzureFileCopy@6) task started to fail with authentication errors.
It turns out that the Azure file copy task retrieves an authentication token with a 10-minute lifetime, but the task now takes more than 10 minutes between retrieving the token and using it, causing the authentication to fail.

The logs from that step:

2024-05-13T08:39:22.6335864Z ==============================================================================
2024-05-13T08:39:22.6336111Z Task         : Azure file copy
2024-05-13T08:39:22.6336218Z Description  : Copy files to Azure Blob Storage or virtual machines
2024-05-13T08:39:22.6336394Z Version      : 6.239.5
2024-05-13T08:39:22.6336492Z Author       : Microsoft Corporation
2024-05-13T08:39:22.6336629Z Help         : https://docs.microsoft.com/azure/devops/pipelines/tasks/deploy/azure-file-copy
2024-05-13T08:39:22.6336842Z ==============================================================================
2024-05-13T08:43:33.4475525Z Added TLS 1.2 in session.
2024-05-13T08:45:30.9273237Z ##[command]Import-Module -Name C:\Program Files\WindowsPowerShell\Modules\Az.Accounts\2.17.0\Az.Accounts.psd1 -Global
2024-05-13T08:46:49.7738587Z ##[warning]You're using AzureRM which will be retired soon, please schedule an update.
2024-05-13T08:46:50.9308516Z ##[command]Clear-AzContext -Scope CurrentUser -Force -ErrorAction SilentlyContinue
2024-05-13T08:46:51.3628724Z ##[command]Clear-AzContext -Scope Process
2024-05-13T08:46:51.4119670Z ##[command]Connect-AzAccount 
2024-05-13T08:46:51.4120435Z Name                           Value                                                                                   
2024-05-13T08:46:51.4121412Z ----                           -----                                                                                   
2024-05-13T08:46:51.4121894Z Tenant                         {redacted}                                                    
2024-05-13T08:46:51.4122198Z Scope                          Process                                                                                 
2024-05-13T08:46:51.4122582Z FederatedToken                 {redacted}
2024-05-13T08:46:51.4122980Z Environment                    AzureCloud                                                                              
2024-05-13T08:46:51.4123553Z ApplicationId                  ***                                                    
2024-05-13T08:46:51.4126751Z ServicePrincipal               True                                                                                    
2024-05-13T08:46:51.4126899Z 
2024-05-13T08:46:51.4127011Z 
2024-05-13T08:46:51.4127077Z 
2024-05-13T08:46:53.3001201Z 
2024-05-13T08:46:53.3672456Z ##[command]Set-AzContext 
2024-05-13T08:46:53.3672742Z Name                           Value                                                                                   
2024-05-13T08:46:53.3673156Z ----                           -----                                                                                   
2024-05-13T08:46:53.3673635Z SubscriptionId                 {redacted}                                                    
2024-05-13T08:46:53.3673808Z 
2024-05-13T08:46:53.3673890Z 
2024-05-13T08:46:53.3674847Z 
2024-05-13T08:46:54.0284574Z Account                              SubscriptionName TenantId                             Environment
2024-05-13T08:46:54.0294558Z -------                              ---------------- --------                             -----------
2024-05-13T08:46:54.0300991Z *** {redacted}      {redacted} AzureCloud 
2024-05-13T08:46:54.0301260Z 
2024-05-13T08:46:54.0377676Z Name               : {redacted} ({redacted}) - {redacted} - {redacted}
2024-05-13T08:46:54.0378386Z                      {redacted}
2024-05-13T08:46:54.0380214Z Subscription       : {redacted}
2024-05-13T08:46:54.0381424Z Account            : ***
2024-05-13T08:46:54.0383260Z Environment        : AzureCloud
2024-05-13T08:46:54.0385204Z Tenant             : {redacted}
2024-05-13T08:46:54.0385602Z TokenCache         : 
2024-05-13T08:46:54.0387443Z VersionProfile     : 
2024-05-13T08:46:54.0388721Z ExtendedProperties : {}
2024-05-13T08:46:54.0389129Z 
2024-05-13T08:48:08.5776971Z ##[command]Import-Module -Name C:\Program Files\WindowsPowerShell\Modules\Az.Resources\6.16.1\Az.Resources.psd1 -Global
2024-05-13T09:08:17.7928883Z ##[command]Import-Module -Name C:\Program Files\WindowsPowerShell\Modules\Az.Storage\6.1.3\Az.Storage.psd1 -Global
2024-05-13T09:10:05.2987485Z ##[error]ClientAssertionCredential authentication failed: A configuration issue is preventing authentication - check the error message from the server for details. You can modify the configuration in the application registration portal. See https://aka.ms/msal-net-invalid-client for details.  Original exception: AADSTS700024: Client assertion is not within its valid time range. Current time: 2024-05-13T09:10:05.1641790Z, assertion valid from 2024-05-13T08:46:51.0000000Z, expiry time of assertion 2024-05-13T08:56:51.0000000Z. Review the documentation at https://docs.microsoft.com/azure/active-directory/develop/active-directory-certificate-credentials . Trace ID: 156bae26-5b86-4a0f-9ddf-6e8d4c480500 Correlation ID: 4e1e6d17-5a52-43c9-835e-c9fcc7880e4a Timestamp: 2024-05-13 09:10:05Z
2024-05-13T09:10:05.3425662Z ##[section]Finishing: Update azure storage with lowercase

As seen above, there is first a delay of about 4 minutes before the step actually starts.
Then it takes over 1 minute to import Az.Resources,
and then 20(!) minutes to import Az.Storage.
(Maybe I'm misinterpreting the timestamps and it is the import of Az.Resources that takes 20 minutes...?)
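To separate module-import cost from agent overhead, the same imports can be timed outside the agent (a sketch run in a plain Windows PowerShell 5 console, using the module paths from the log above):

# Time the same imports the task performs, without the agent in the loop
Measure-Command { Import-Module 'C:\Program Files\WindowsPowerShell\Modules\Az.Resources\6.16.1\Az.Resources.psd1' -Global }
Measure-Command { Import-Module 'C:\Program Files\WindowsPowerShell\Modules\Az.Storage\6.1.3\Az.Storage.psd1' -Global }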

After rebooting the server, I get these times for the imports (20 + 7 seconds):

2024-05-13T09:31:36.8458808Z ##[command]Import-Module -Name C:\Program Files\WindowsPowerShell\Modules\Az.Resources\6.16.1\Az.Resources.psd1 -Global
2024-05-13T09:31:55.5007557Z ##[command]Import-Module -Name C:\Program Files\WindowsPowerShell\Modules\Az.Storage\6.1.3\Az.Storage.psd1 -Global
2024-05-13T09:32:02.8702791Z  mime: C:\Agent_AS015_12\_work\_tasks\AzureFileCopy_eb72cb01-a7e5-427b-a8a1-1b31ccac8a43\6.239.5\MimeMapping.json


github-actions bot commented Nov 9, 2024

This issue has had no activity in 180 days. Please comment if it is not actually stale

@github-actions github-actions bot added the stale label Nov 9, 2024
@jwikman
Author

jwikman commented Nov 10, 2024

@KonstantinTyukalov any news on this?

We are still rebooting the build server daily as a workaround...

@github-actions github-actions bot removed the stale label Nov 10, 2024