Resiliency against startup issues #29
Comments
That looks like a Windows service error. Windows tried to start the RMM service, and when it checked whether the service was running after 30000 milliseconds, it wasn't. Are you sure AV didn't kill it? Check agent.log to see if there's an error in there.
I started TacticalAgent from Mesh and this server does not have Bitdefender. I did not find any other event logs relating to the service and there was nothing in the It seems this message is coming from Windows when trying to start the service. This kb article explains how the registry key The answers to this StackOverflow question have many other scenarios where this error may occur, including scenarios that do not relate to a timeout.
While some of this may be due to my environment, there are things the TacticalAgent can do. One is to add dependencies to the service so the service control manager starts it a little later in the boot process, i.e. after networking is available. Another option is to add a scheduled task that starts the service if it's not running.
Relevant Grafana issue 2060:
There's a PR linked in that issue that may be of use.
Here's the relevant Go issue to fix the runtime: Windows service timeout during system startup.
When I start it on a computer, the service usually takes 1-2 seconds to start and show as running in the Windows checks. How long does it take there, and how can we measure that "time to running"? (I know there's a PowerShell measure command that might do it.) I'd expect the TRMM agent to go from start request to running in less than 5 seconds under normal conditions. Are you sure there aren't other extenuating circumstances in your test making it take longer than 30 seconds? Could Nebula network delays be causing TRMM to stutter?
This happens only with high CPU usage. This thread specifically talks about this error happening only after rebooting to apply patches. This does not happen all the time. This is the first time I encountered this scenario while running TRMM for more than a year.
You can't measure this externally. The above thread mentioned they added an event log as the first action in
Under normal circumstances, that's true.
Installing patches after a reboot could trigger this, but not every time. The link in the Grafana issue (here) explains how they were able to reproduce it by limiting the VM to 1/4 of a CPU in Hyper-V. This issue is meant to address the slow Go runtime initialization under high CPU load, and to identify options that can alleviate the scenario.
One server was offline and, after researching the cause, I discovered an event log entry stating "A timeout was reached (30000 milliseconds) while waiting for the tacticalrmm service to connect.". It would be nice if the service (on all OSes) was configured to stay running as best it can. For connectivity issues, retry logic is preferable to exiting after an initial failure to connect. If there's a domain configured, doing a fresh DNS lookup (can the agent clear the DNS cache?) and pinging the API until it's able to connect would be nice. If there's no domain configured, or if the agent configuration is corrupt, it should of course generate a friendly error message and exit.
Note: It's possible this could happen if the agent was restarted (computer rebooted) while the server was being updated and the API unavailable.