Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resiliency againt startup issues #29

Open
NiceGuyIT opened this issue Jan 16, 2023 · 6 comments
Open

Resiliency againt startup issues #29

NiceGuyIT opened this issue Jan 16, 2023 · 6 comments

Comments

@NiceGuyIT
Copy link
Contributor

One server was offline and after researching the cause, I discovered there was an event log stating "A timeout was reached (30000 milliseconds) while waiting for the tacticalrmm service to connect.". It would be nice if the service (all OS's) was configured to stay running as best it can. For connectivity issues, retry logic is preferable over exiting after an initial failure to connect. If there's a domain configured, doing a fresh DNS lookup (can the agent clear the DNS cache?) and ping'ing the API until it's able to connect would be nice. If there's no domain configured, or if the agent configuration is corrupt, of course generate a friendly error message and exit.

Note: It's possible this could happen if the agent was restarted (computer rebooted) while the server was being updated and the API unavailable.

Log Name:      System
Source:        Service Control Manager
Date:          1/10/2023 5:22:50 PM
Event ID:      7009
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Win-Server04.****.****
Description:
A timeout was reached (30000 milliseconds) while waiting for the tacticalrmm service to connect.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}" EventSourceName="Service Control Manager" />
    <EventID Qualifiers="49152">7009</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8080000000000000</Keywords>
    <TimeCreated SystemTime="2023-01-11T01:22:50.657621300Z" />
    <EventRecordID>136367</EventRecordID>
    <Correlation />
    <Execution ProcessID="552" ThreadID="2404" />
    <Channel>System</Channel>
    <Computer>Win-Server04.****.****</Computer>
    <Security />
  </System>
  <EventData>
    <Data Name="param1">30000</Data>
    <Data Name="param2">tacticalrmm</Data>
    <Binary>74006100630074006900630061006C0072006D006D000000</Binary>
  </EventData>
</Event>
@silversword411
Copy link

silversword411 commented Jan 16, 2023

That looks like a Windows service error. Windows service said startup RMM...and when it checked to see if it was running, after 30000 milliseconds it wasn't. You sure AV didn't kill it?

Check agent.log to see if there's an error in there.

@NiceGuyIT
Copy link
Contributor Author

I started TacticalAgent from Mesh and this server does not have Bitdefender. I did not find any other event logs relating to the service and there was nothing in the agent.log.

It seems this message is coming from Windows when trying to start the service. This kb article explains how the registry key ServicesPipeTimeout can be increased from the default 30 seconds to 60 seconds. I view that as a workaround, not solution.

The answers to this StackOverflow question have many other scenarios where this error may occur, including scenarios that do not relate to a timeout.

While some of this may be due to my environment, there are things the TacticalAgent can do. One is to add dependencies to the service so the service control manager starts it a little later in the boot process. I.e. after networking is available. Another option is to add a scheduled task to start the service if it's not running.

@NiceGuyIT
Copy link
Contributor Author

Relevant Grafana issue 2060:

Similar to windows_exporter (And a bunch of other Go applications on windows) Grafana Agent can fail to start as a service on windows following a windows update or other high CPU Event during Service startup.

I wrote a fairly detailed analysis on the issue here that explains the cause but in an attempt to be brief this comes down to the way go initialises packages vs the time in which windows expects a response from an application starting as a service within 30s or 60s depending on your version of windows as per this diagram:

There's a PR linked in that issue that may be of use.

@NiceGuyIT NiceGuyIT changed the title Don't exit if there are connectivity issues Resiliency againt startup issues Jan 17, 2023
@NiceGuyIT
Copy link
Contributor Author

NiceGuyIT commented Jan 17, 2023

Here's the relevant Go issue to fix the runtime: Windows service timeout during system startup.

@silversword411
Copy link

silversword411 commented Jan 18, 2023

...when I start on a computer, it's usually 1-2 seconds for the service to start and show as running by windows checks.

How long is it there, and how can we measure that "time to running" (I know there's a powershell measure command that might do it?)?

I'm thinking under normal conditions TRMM agent from start request to running is less than 5 seconds. Are you sure there's not other extenuating circumstances in your test there making it take longer than 30 seconds? Is it nebula network delays that might be causing TRMM to stutter?

@NiceGuyIT
Copy link
Contributor Author

This happens only with high CPU usage. This thread specifically talks about this error happening only after rebooting to apply patches. This does not happen all the time. This is the first time I encountered this scenario while running TRMM for more than a year.

How long is it there, and how can we measure that "time to running" (I know there's a powershell measure command that might do it?)?

You can't measure this externally. The above thread mentioned they added an event log as the first action in main() but that was not getting called. This is because the timeout is in the actual startup of the program as explained in the Grafana issue.

'm thinking under normal conditions TRMM agent from start request to running is less than 5 seconds.

Under normal circumstances, that's true.

other extenuating circumstances in your test there making it take longer than 30 seconds?

Installing patches after a reboot could trigger this, but not all the time.

The link in the Grafana issue (here) explains how they are able to cause this to happen by limiting the CPU to 1/4 of a CPU in Hyper-V.

This issue is to address the Go runtime initialization slowness under high CPU load, as well as identify options that can be applied to alleviate the scenario.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants