This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Scan hangs if host is lost in the middle of a long running task #477

Open
kdelee opened this issue Nov 16, 2017 · 13 comments

Comments

@kdelee

kdelee commented Nov 16, 2017

There appears to be a bug in Ansible that causes a scan to hang if a host is lost.

A bug needs to be filed against Ansible once we have a reproducer that uses only Ansible.

@kdelee
Author

kdelee commented Nov 27, 2017

Similar problems:
ansible/ansible#13327
ansible/ansible#18305

The async/poll feature was suggested as a workaround:
http://docs.ansible.com/ansible/latest/playbooks_async.html

I'm trying it out, but no luck so far.

This is my playbook:

---

- name: reproduce lost host task hang
  hosts: all
  strategy: free
  tasks:
    - name: Do something that takes time 1
      shell: "ping -c 30 localhost"
      async: 30
      poll: 10
# after the first task completes, I kill one of the machines
    - name: Do something that takes time 2
      shell: "for i in {1..10}; do echo $i && sleep 2; done;"
      async: 30
      poll: 10
    - name: Do something that takes time 1
      shell: "ping -c 30 localhost"
      async: 30
      poll: 10
    - name: Do something that takes time 3
      shell: "for i in {1..10}; do echo $i && sleep 2; done;"
      async: 30
      poll: 10
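For what it's worth, here is a variant I haven't fully tested: fire each task with poll: 0 (fire-and-forget) and poll it separately with async_status, so a bounded retry count turns a dead host into a task failure instead of an indefinite hang. Task names and the retry/delay numbers below are made up for illustration.

```yaml
---
# Sketch: poll: 0 returns immediately; a separate async_status task
# polls the job with a bounded number of retries, so a lost host fails
# the status check rather than blocking the play forever.
- name: reproduce lost host task hang (async_status variant)
  hosts: all
  strategy: free
  tasks:
    - name: Do something that takes time
      shell: "ping -c 30 localhost"
      async: 60
      poll: 0
      register: long_task

    - name: Wait for the task, but give up after ~60 seconds
      async_status:
        jid: "{{ long_task.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 12
      delay: 5
```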

@kdelee
Author

kdelee commented Nov 29, 2017

Another possible workaround is setting a number of seconds in ansible.cfg after which ssh should check that the connection is still alive:

[ssh_connection]

ssh_args = -o ServerAliveInterval=n

where n is the number of seconds between checks that the server is still responding.
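If we want the detection time bounded more tightly, OpenSSH also has a client-side ServerAliveCountMax option that pairs with the interval; the connection is dropped after roughly count * interval seconds without a reply. The values below are illustrative, not tested here.

```ini
# ansible.cfg sketch: the ssh client gives up after
# ServerAliveCountMax * ServerAliveInterval seconds with no
# response from the server (here 3 * 10 = 30 seconds).
[ssh_connection]
ssh_args = -o ServerAliveInterval=10 -o ServerAliveCountMax=3
```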

@kdelee
Author

kdelee commented Nov 29, 2017

It looks like either of these is a possible workaround, but the config file is the least invasive and directly addresses the issue at hand: we want the scan to be able to continue even if we lose one host.

The async/poll approach better addresses the case where one of several hosts is stuck indefinitely actually doing something (i.e. it is still alive).

With the following ansible.cfg in the directory of execution:

#ansible.cfg
[ssh_connection]

ssh_args = -o ServerAliveInterval=10

And executing the following playbook:

---

- name: reproduce lost host task hang
  hosts: all
  strategy: free
  tasks:
    - name: Do something that takes time 1
      shell: "ping -c 30 localhost"
# after the first task completes, I kill one of the machines
    - name: Do something that takes time 2
      shell: "for i in {1..10}; do echo $i && sleep 2; done;"
    - name: Do something that takes time 3
      shell: "ping -c 30 localhost"
    - name: Do something that takes time 4
      shell: "for i in {1..10}; do echo $i && sleep 2; done;"

I get the following output when I kill a server while the 2nd long-running task is in progress (after the 1st one completes):

ansible-playbook -i hosts kill-host.yml 

PLAY [reproduce lost host task hang] *******************************************************************

TASK [Gathering Facts] *********************************************************************************
ok: [10.10.XXX.XXX]
ok: [localhost]

TASK [Do something that takes time 1] ******************************************************************
changed: [10.10.XXX.XXX]
changed: [localhost]

TASK [Do something that takes time 2] ******************************************************************
changed: [localhost]
fatal: [10.10.XXX.XXX]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Timeout, server 10.10.XXX.XXX not responding.\r\n", "unreachable": true}

TASK [Do something that takes time 3] ******************************************************************
changed: [localhost]

TASK [Do something that takes time 4] ******************************************************************
changed: [localhost]
	to retry, use: --limit @/home/elijah/test/kill-host.retry

PLAY RECAP *********************************************************************************************
10.10.XXX.XXX              : ok=2    changed=1    unreachable=1    failed=0   
localhost                  : ok=5    changed=4    unreachable=0    failed=0   

Now the question is -- how do we ship this? Do we just tell users to put this in their global config file? Do we edit the file for them in the install process?

@kdelee
Author

kdelee commented Nov 29, 2017

Ok, even better news: we can use the variables ansible_ssh_args or ansible_ssh_common_args in playbooks or in inventories. The distinction is that ansible_ssh_common_args appends to any user-configured args, while ansible_ssh_args overwrites them.

i.e. instead of at the config-file level, we can write in the playbook itself:

- name: with these args a host dying mid-task won't make my playbook hang
  hosts: all
  strategy: free
  vars:
      # this appends args to whatever other ones
      # the user may have set in their ansible cfg
      # alternatively we can overwrite them with
      # ansible_ssh_args
      ansible_ssh_common_args: '-o ServerAliveInterval=10'
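Since the variable also works at the inventory level, the same setting could be scoped to just the hosts we scan. A hypothetical INI inventory (group and host names made up):

```ini
# The variable can be attached to a group instead of a playbook,
# so only scan targets get the keepalive behavior.
[scan_targets]
10.10.0.5
10.10.0.6

[scan_targets:vars]
ansible_ssh_common_args=-o ServerAliveInterval=10
```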

@kdelee kdelee changed the title Ansible bug causes scan to hang if host is lost Ansible "facts of life" causes scan to hang if host is lost Nov 29, 2017
@chambridge

Awesome!

@kdelee
Author

kdelee commented Nov 29, 2017

I'll make a PR

@kdelee
Author

kdelee commented Dec 1, 2017

I've not been able to test this reliably because of other problems in both dev and master that break the scan. Most recently, see #501

@kdelee kdelee changed the title Ansible "facts of life" causes scan to hang if host is lost Scan hangs if host is lost in the middle of a long running task Dec 1, 2017
@kdelee
Author

kdelee commented Dec 1, 2017

As I've been testing this, the ssh config option appears to be a fix only if the host is lost in the middle of a task. It does not seem to help if the host is lost between tasks.

@noahl

noahl commented Dec 4, 2017

I've read through the linked Ansible issues, and the Ansible docs for async and strategy. I don't have a solution yet, but I at least want to see what I can do.

@kdelee how did you create this error?

@kdelee
Author

kdelee commented Dec 4, 2017

So I see this as one of three possible causes of mysterious "hangs forever" scenarios, which are:

  1. During a task that takes a long time to complete (a yum update, for example), we lose the ssh connection to the host. That is what this issue addresses.

I've recreated this with only ansible-playbook by scanning my localhost and one other machine on vCenter with the above tasks. I wait until the first task completes, plus a few more seconds so I can presume that the second task has begun (it doesn't produce output until after the task finishes), and then I kill the second scan host. You will see that the other tasks proceed because of the free strategy, but when we get to the end of the tasks for localhost, it hangs forever.

I was trying to re-create this in rho on Friday but kept running into other unrelated rho bugs, so I did not get all the way there. I think you could insert a task in one of the playbooks to give yourself time to do it, like I have above (have a task that alerts you that the next task will take a while, then have a task that sleeps for 60 seconds or whatever you need, and kill the host in that time period).

  2. It also seems that if a host is lost between the end of one task and the start of the next, we run into trouble. I think this is distinct from this issue. I don't have a reliable reproducer, but I believe it happens because we seem to hang if I turn off a machine while running lots of rapid-fire very quick tasks, and then the ssh-timeout setting does not help. I have not filed an issue for this yet.

  3. The third possibility, which is probably even more likely than total loss of network connection to a host, is certain machines in a scan hanging forever on a shell command. The machine is live and the connection good, but the command never exits because it is busy. The problem is that the scan never finishes because of this one badly behaved host. See issue Prevent scan hanging if command exceeds time limit #504
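For that third case, one option (assuming GNU coreutils `timeout` is available on the target, which rho does not currently rely on) is to bound the shell command itself, so a busy host produces a failure instead of a hang:

```shell
# Sketch: bound a potentially-hanging command with coreutils `timeout`.
# If the command exceeds the limit it is killed and `timeout` exits 124.
timeout 2 sh -c 'echo started; sleep 30; echo never reached'
echo "exit status: $?"   # prints "exit status: 124" after ~2 seconds
```

In a playbook this would look like `shell: "timeout 300 some_long_command"`, where 300 is whatever budget the task deserves.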

@chambridge

@kdelee Can we get the ansible_ssh_common_args: '-o ServerAliveInterval=10' change in as an incremental improvement?

I'd like it in for the next release so at least we know we are a bit better than before.

kdelee added a commit that referenced this issue Dec 4, 2017
Adds ansible ssh argument for rho playbook to check if the ssh
connection is still good every ten seconds. This helps with the case
of losing a host mid-task detailed in issue #477.
kdelee added a commit that referenced this issue Dec 4, 2017
Adds ansible ssh argument for rho playbook to check if the ssh
connection is still good every ten seconds. This helps with the case
of losing a host mid-task detailed in issue #477.
chambridge pushed a commit that referenced this issue Dec 5, 2017
Adds ansible ssh argument for rho playbook to check if the ssh
connection is still good every ten seconds. This helps with the case
of losing a host mid-task detailed in issue #477.
@chambridge

chambridge commented Dec 6, 2017

After a recent conversation with Mark I looked into the following:
http://go2linux.garron.me/linux/2011/02/limit-idle-ssh-sessions-time-avoid-unattended-ones-clientaliveinterval-clientalivecoun/

I wonder if the following would help, where we could essentially limit the time of any connection to something like 2 minutes:

ClientAliveInterval 120
ClientAliveCountMax 0
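One caveat worth checking: ClientAliveInterval and ClientAliveCountMax are sshd (server-side) options, so they would go in the scanned host's sshd_config rather than our client-side ansible.cfg, which may or may not fit our deployment story. A sketch of where they would live:

```ini
# /etc/ssh/sshd_config on the remote host (server side, not ansible.cfg).
# With CountMax 0, the server drops a client that fails a single alive
# probe after 120 seconds of inactivity.
ClientAliveInterval 120
ClientAliveCountMax 0
```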

@kdelee
Author

kdelee commented Dec 6, 2017

I understand that this would kill an "idle" session after 2 minutes. It would be good to see what it does if there is a long-running task (thinking of one of the yum facts; those can take a while) that runs longer than the timeout. Will it kill the session if there is an active process in the shell?

We also do want to kill even active sessions eventually, but maybe we should give them a bit more time than this.

Anyway, sounds promising.
