Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

process not managed by Nomad killed on restart #23969

Closed
yigit-erkoca opened this issue Sep 16, 2024 · 29 comments · Fixed by #24265, #24538 or #24798
Closed

process not managed by Nomad killed on restart #23969

yigit-erkoca opened this issue Sep 16, 2024 · 29 comments · Fixed by #24265, #24538 or #24798

Comments

@yigit-erkoca
Copy link

yigit-erkoca commented Sep 16, 2024

Nomad version

Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85

Operating system and Environment details

Windows 10 Enterprise 2016 LTSB

Issue

Nomad instance is killing another process on the system that is not deployed by nomad.

I see in the application event viewer log the following log from source "Process Exit Monitor"
2024-09-16 23:20:06 The process 'C:\simulator\BoardSimulator.exe' was terminated by the process 'C:\Program Files\Infra\Nomad\nomad.exe' with termination code 1. The creation time for the exiting process was 0x01db07ac4767577e.

Reproduction steps

On system reboot, some applications like simulator executables start by other windows services, as well as nomad instance.
This is a sporadic issue, does not happen at every reboot, but maybe estimated to be once in every ~100.

Expected Result

Nomad not killing/interfering any other process on the system that is not deployed by using nomad jobs.

Actual Result

Nomad is terminating other processes that are running in the system.

Job file (if appropriate)

There is only one job using raw_exec. We are just setting it up.
NomadJob.txt

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

nomad.log
The timestamp of the process termination according to the eventlog: 2024-09-16 23:20:06

@yigit-erkoca yigit-erkoca changed the title Nomad on Windows killing another process on the system Nomad on Windows (raw_exec) killing another process on the system Sep 17, 2024
@jrasell
Copy link
Member

jrasell commented Sep 20, 2024

Hi @yigit-erkoca and thanks for raising this issue. Nomad should certainly not be terminating processes it does not control, however, having looked at the client logs nothing stands out initially. I'll mark this for further investigation and add it to our backlog. If you do get any additional information that might be useful, please add it to this issue.

@yigit-erkoca
Copy link
Author

Hi @yigit-erkoca and thanks for raising this issue. Nomad should certainly not be terminating processes it does not control, however, having looked at the client logs nothing stands out initially. I'll mark this for further investigation and add it to our backlog. If you do get any additional information that might be useful, please add it to this issue.

Hello @jrasell, thank you for your attention! Indeed I'm extending the logs to find out the root cause. I have a setup that I can reproduce this issue quite easily. I'll add more details to the issue and let you know. I appreciate the help.

@tgross tgross added the hcc/jira label Oct 3, 2024
@tgross tgross self-assigned this Oct 10, 2024
@tgross tgross moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Oct 10, 2024
@tgross tgross added this to the 1.9.x milestone Oct 15, 2024
@tgross
Copy link
Member

tgross commented Oct 15, 2024

It looks like the relevant portion of the logs is the following, which suggests that the spuriously killed process is getting killed while we're trying to restore the process for the previously-running task.

2024-09-16T23:20:06.267+0200 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=69909b8d-3d22-e823-e33b-ff669048d994 task=infra-cluster-package-server-task type=Received msg="Task received by client" failed=false
2024-09-16T23:20:06.268+0200 [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=69909b8d-3d22-e823-e33b-ff669048d994/infra-cluster-package-server-task/15a41d28
2024-09-16T23:20:06.268+0200 [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=69909b8d-3d22-e823-e33b-ff669048d994 task=infra-cluster-package-server-task error="failed to reattach to executor: error creating rpc client for executor plugin: Reattachment process not found" task_id=69909b8d-3d22-e823-e33b-ff669048d994/infra-cluster-package-server-task/15a41d28
2024-09-16T23:20:06.268+0200 [TRACE] client.alloc_runner.task_runner: failed to reattach to task; will not run until server is contacted: alloc_id=69909b8d-3d22-e823-e33b-ff669048d994 task=infra-cluster-package-server-task

What we see here is that the client starts up, looks in its local state, and tries to restore handles to the tasks that were running before the host shutdown (ref task_runner.go#L1290-L1321). Of course, none of them are going to be running, so that recovery will fail as expected because we fail to reattach to the executor process (ref driver.go#L288-L297).

We then call DestroyTask which should be returning ErrTaskNotFound (ref driver.go#L472) because the task handle cannot exist in the driver's state at this point. We won't see that error in the logs because we expect it so we explicitly ignore it (ref task_runner.go#L1310-L1312).

While we have saved the PID of the previously-running task in the local state, I don't see any code path where this PID is actually being made available for the driver to kill.

@yigit-erkoca you say you have a reliable reproduction, right? Can you confirm that the PID of the process getting incorrectly killed the same as the PID of the previously-running task or its executor process?

@tgross
Copy link
Member

tgross commented Oct 17, 2024

I did a little more testing and what's particularly weird about all of this is that the executor process is the one that actually sends the kill signal to the child process (or the Windows equivalent which is TerminateProcess). So if the executor process is gone -- which it must be because we rebooted -- then there's nothing to send a kill signal to another child process. It would definitely be helpful to know whether the PID we're trying to kill is the executor or previously-running task.

@yigit-erkoca
Copy link
Author

@tgross, thank you for looking into this!

We have indeed a quite reliable reproduction environment. We have reproduced it again today to provide you more information with an example.

Nomad is installed as windows service, and starts normally, starts all the executors, tasks, etc.
image

All the executors listed below; please note the highlighted instance (which will be the one we will be interested in)

image

After restarting/rebooting nomad will try to restore the tasks, and one of these occasions if there is any other process with the same PID above "5652" in this case notepad.exe gets the "5652".

image

Nomad during the restore action, it will kill the notepad.exe. Which can be seen in the windows event viewer below.

image

I hope this clarifies your question. From this, it looks like the PID is the executor's PID, not the task's PID.

If you need any further information, or things that we can try, please let us we can use our reproduction environment.

@tgross
Copy link
Member

tgross commented Oct 18, 2024

Thanks @yigit-erkoca, that's super helpful! That cuts the size of the problem in half and points me towards the go-plugin interface. I'll investigate further and report back.

@tgross
Copy link
Member

tgross commented Oct 18, 2024

I did a little bit of parallel investigation on Linux, and using killsnoop I verified that we never send a kill signal to the executor process PID after host restart (or any other PID, for that matter). So that further narrows this down to some quirk of Windows-specific behavior. That's a little weird, because this code should be mostly shared between them... the differences are in the executors themselves, which are not running at the time this all happens.

@tgross
Copy link
Member

tgross commented Oct 18, 2024

Looks like I found the culprit, and it totally can happen on other platforms. It's this line in go-plugin: cmd_reattach.go#L31.

  • We have a PID for the executor and try to reattach
  • os.FindProcess finds a process for that PID, just not the one we want
  • We can't connect to it because it's not an executor
  • go-plugin says, "oops, executor must be broken, let's kill it"

I'm having a chat with colleagues about the best approach to fix this. The obvious fix is to simply delete that line, but I assume there's a reason it exists. 😁 (Edit: this is an extremely old behavior hashicorp/go-plugin#4 which makes it even trickier)

@tgross tgross changed the title Nomad on Windows (raw_exec) killing another process on the system process not managed by Nomad killed on restart Oct 18, 2024
tgross added a commit to hashicorp/go-plugin that referenced this issue Oct 18, 2024
During reattachment, we look to see if the process corresponding to the stored
PID is running. If so, we try to connect to that process. If that fails, we kill
the process under the presumption it's not working, and return
ErrProcessNotFound.

But during reattachment we don't know that the PID we have is still valid. Which
means that the process we're trying to attach to may have exited and a different
process has spawned with the same PID. This results in some unrelated process
getting silently killed.

This impacts Nomad when running the `rawexec` or `exec` task drivers, because
the Nomad agent spawns an "executor" process via go-plugin to control the
workloads, and these executors are left running when Nomad exits. If the
executors die in the meantime (or the host is rebooted), then we can potentially
kill a random process on the host.

Because there's no way for go-plugin to know whether the process is a go-plugin
server without connecting, this kill is never really safe. Remove it.

Ref: hashicorp/nomad#23969
Ref: https://hashicorp.atlassian.net/browse/NET-11233
@peter-harmann-tfs
Copy link

@tgross Hello. We are running the fixed version, but it seems our processes are still being killed, this time likely from here.

We see "plugin failed to exit gracefully" in nomad logs just as our process ends unexpectedly. We are not able to reliably reproduce the issue at the moment (seems to occur under different circumstances than the original).

@tgross
Copy link
Member

tgross commented Nov 11, 2024

@peter-harmann-tfs just to confirm, you mean an unrelated process is getting killed? That is, one not managed by Nomad?

@peter-harmann-tfs
Copy link

@tgross Yes, it is an unrelated process.

Maybe to clarify, I am colleague of yigit-erkoca, so I am talking about the same deployment as the original issue.

@tgross
Copy link
Member

tgross commented Nov 11, 2024

Ok, thanks. I'll reopen and get this on the queue for another in-depth investigation.

@tgross tgross reopened this Nov 11, 2024
@tgross tgross removed their assignment Nov 12, 2024
@cenk-saldirim
Copy link

Hi, I am colleague of yigit-erkoca and peter-harmann-tfs.

We conducted a more detailed investigation regarding this issue, in order to provide steps to reproduce it.
When we correlated the time we identified a non-nomad process being killed with the timestamp in the Nomad logs, we observed the following log messages in sequence: 'plugin failed to exit gracefully' and 'reattached plugin process exited'.

We extended the logs in the reattach() and Kill() functions in the client.go file. This allowed us to log the reattach PID and reattach address information for the process that is being terminated.

The issue is finally reproduced, and observed that it occurs when the following sequence of events takes place:

  • Nomad has reattach PID and reattach Address information for the executor and try to reattach
  • The executor process has exited and a different process has spawned with the same PID. Additionally the reattach address is occupied by a different process (It's a rare scenario, but it can happen)
  • The reattach address is reachable, but no communication can be established (because the process using that reattach address is not an executor)
  • go-plugin decides to kill the process(which is not managed by nomad) using reattach PID information

I hope the information above is sufficient to identify the root cause of the problem. If more details are needed, we would be happy to provide them.

@tgross
Copy link
Member

tgross commented Nov 21, 2024

Thanks @cenk-saldirim. That scenario is exactly what hashicorp/go-plugin#320 was supposed to have fixed, which is particularly strange. Also, just a heads up that my colleague @mismithhisler is taking over the investigation from our end.

@cenk-saldirim
Copy link

Thanks for the information @tgross. I believe, the only difference between previous issue and the current one is the reattach address of the executor process which has exited is actively being used by another process. If that reattach address:port is not actively listening, go-plugin does not attempt to kill any process.

@tgross
Copy link
Member

tgross commented Nov 21, 2024

Ah! Sorry, I completely missed that... that complicates things quite a bit!

@mismithhisler
Copy link
Member

Hi @cenk-saldirim, thanks for all the information. Do you have any more logs you could attach here showing this most recent issue? The "plugin failed to exit gracefully" is helpful, but we might be able to narrow this search with more.

Also, can you confirm the clients that are having this issue are only running raw_exec tasks? And not running any external task drivers (i.e. Podman etc)?

@mismithhisler
Copy link
Member

@cenk-saldirim Just following up to my previous comment to let you know we can reproduce this and are working on a fix. Thanks again for the info you provided.

@cenk-saldirim
Copy link

Hi @mismithhisler, thank you very much for your finding and glad to hear that you can reproduce this. I can share the logs we collected while observing this behavior, if still needed.

@mismithhisler
Copy link
Member

I think we should be good on logs for now, thank you though!

@cenk-saldirim
Copy link

Hi @mismithhisler , @tgross,

We are running the fixed version of Nomad(v1.9.4) and from what we observed, only half of the problem seems to be fixed. In older versions, we observed that all processes spawned with the same PID as the previously created executor processes could be killed. With the latest fix(#24577), kill attempts are no longer made for some of these executor processes, but others are still being killed by Nomad. To summarize the issue, I would like to explain the reproduction scenario in our test environment below.

In our tests, we deploy 3 services using Nomad. While these services are being deployed, 6 different instances of nomad.exe are created.

Process Name PID Port listened
Process A nomad.exe 11424 10000
Process B nomad.exe 10236 10001
Process C nomad.exe 3272 10002
Process D nomad.exe 12052 14000
Process E nomad.exe 5452 14001
Process F nomad.exe 10576 14002

image

Let's assume that after a system reboot, 6 different processes with the same PIDs as the previously created ones are spawned(for test purpose, we are creating notepad.exe instances till we find the same PID). At the same time, the ports previously listened by nomad processes are actively being listened.

Process Name PID
Process A notepad.exe 11424
Process B notepad.exe 10236
Process C notepad.exe 3272
Process D notepad.exe 12052
Process E notepad.exe 5452
Process F notepad.exe 10576

After the Nomad service starts, it attempts to reattach to the executors it previously created, and we observe the following. Process D, E and F which are associated with port numbers 14000, 14001 and 14002 are no longer being killed by nomad(after the fix #24577). Process A, B and C which are associated with port numbers 10000, 10001 and 10002 are still being killed by nomad.

I hope the above information would help to identify the root cause. If more detailed information is needed, I would be happy to share.

@cenk-saldirim
Copy link

We also would like to know what is the difference between the processes created by nomad that are listening ports starting from 10000 and those listening ports starting from 14000? Another thing that I would like to mention, we have a temporary fix for this issue. It is not a proper fix and only works for windows. I hope this information also helps, you can find the changes here.

@mismithhisler
Copy link
Member

mismithhisler commented Jan 6, 2025

Hi @cenk-saldirim, thanks for updating us with your detailed findings, it's very helpful. Usually when running a task with logging enabled (which it is by default), two nomad processes will be started, one for the executor, and the other for the logger. On Linux, it's a bit easier to tell what is what.

My process tree looks similar to this when running a simple job:

\_ /usr/local/bin/nomad agent -config myconfig.hcl
   \_ /usr/local/bin/nomad logmon
   \_ /usr/local/bin/nomad executor
       \_ /usr/bin/sleep 5000s

The fact that we are still calling Kill() leads to think that we may be trying to kill the logmon process... I'm going to investigate this and see what I can find.

@cenk-saldirim
Copy link

Hi @mismithhisler, thank you for your detailed explanation and quick response. It's more clear to me. I believe you have already prepared a fix(#24798). I'll try to get a local build and test it as soon as possible.

@mismithhisler
Copy link
Member

@cenk-saldirim I accidentally merged this morning before reading your reply about testing. Please let me know if you are still seeing this issue, and we will open this back up and do some more investigation.

I was able to reproduce killing a random process when attempting to reattach to the logmon plugin, so I am feeling good that this will clear up the other half of this problem.

@cenk-saldirim
Copy link

@mismithhisler, I built Nomad locally with the latest fix(#24808), applied the reproduction scenario and observed through testing that the issue has been completely resolved. Thank you for resolving that issue in such a short time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment