This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Scan hangs if host is lost in the middle of a long running task #477

Open
kdelee opened this issue Nov 16, 2017 · 13 comments

Comments

@kdelee

kdelee commented Nov 16, 2017

There appears to be a bug in Ansible that causes a scan to hang if a host is lost.

A bug needs to be filed against Ansible once we have a reproducer that uses only Ansible.

@kdelee
Author

kdelee commented Nov 27, 2017

Similar problems:
ansible/ansible#13327
ansible/ansible#18305

The async/poll feature was suggested as a workaround:
http://docs.ansible.com/ansible/latest/playbooks_async.html

I'm trying it out, but no luck so far.

This is my playbook:

---

- name: reproduce lost host task hang
  hosts: all
  strategy: free
  tasks:
    - name: Do something that takes time 1
      shell: "ping -c 30 localhost"
      async: 30
      poll: 10
# after the first task completes, I kill one of the machines
    - name: Do something that takes time 2
      shell: "for i in {1..10}; do echo $i && sleep 2; done;"
      async: 30
      poll: 10
    - name: Do something that takes time 1
      shell: "ping -c 30 localhost"
      async: 30
      poll: 10
    - name: Do something that takes time 3
      shell: "for i in {1..10}; do echo $i && sleep 2; done;"
      async: 30
      poll: 10
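For what it's worth, here is a variant I haven't fully tested: fire each task with poll: 0 (fire-and-forget) and poll it separately with async_status, so a bounded retry count turns a dead host into a task failure instead of an indefinite hang. Task names and the retry/delay numbers below are made up for illustration.

```yaml
---
# Sketch: poll: 0 returns immediately; a separate async_status task
# polls the job with a bounded number of retries, so a lost host fails
# the status check rather than blocking the play forever.
- name: reproduce lost host task hang (async_status variant)
  hosts: all
  strategy: free
  tasks:
    - name: Do something that takes time
      shell: "ping -c 30 localhost"
      async: 60
      poll: 0
      register: long_task

    - name: Wait for the task, but give up after ~60 seconds
      async_status:
        jid: "{{ long_task.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 12
      delay: 5
```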

@kdelee
Author

kdelee commented Nov 29, 2017

Another possible workaround is setting a number of seconds in ansible.cfg after which ssh should check that the connection is still alive:

[ssh_connection]

ssh_args = -o ServerAliveInterval=n

where n is the number of seconds between checks that the server is still responding.
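If we want the detection time bounded more tightly, OpenSSH also has a client-side ServerAliveCountMax option that pairs with the interval; the connection is dropped after roughly count * interval seconds without a reply. The values below are illustrative, not tested here.

```ini
# ansible.cfg sketch: the ssh client gives up after
# ServerAliveCountMax * ServerAliveInterval seconds with no
# response from the server (here 3 * 10 = 30 seconds).
[ssh_connection]
ssh_args = -o ServerAliveInterval=10 -o ServerAliveCountMax=3
```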

@kdelee
Author

kdelee commented Nov 29, 2017

It looks like either of these is a possible workaround, but the config file is the least invasive and directly addresses the issue at hand: we want the scan to be able to continue even if we lose one host.

The async/poll approach better addresses the case where one of several hosts is stuck indefinitely actually doing something (i.e. it is still alive).

With the following ansible.cfg in the directory of execution:

#ansible.cfg
[ssh_connection]

ssh_args = -o ServerAliveInterval=10

And executing the following playbook:

---

- name: reproduce lost host task hang
  hosts: all
  strategy: free
  tasks:
    - name: Do something that takes time 1
      shell: "ping -c 30 localhost"
# after the first task completes, I kill one of the machines
    - name: Do something that takes time 2
      shell: "for i in {1..10}; do echo $i && sleep 2; done;"
    - name: Do something that takes time 3
      shell: "ping -c 30 localhost"
    - name: Do something that takes time 4
      shell: "for i in {1..10}; do echo $i && sleep 2; done;"

I get the following output when I kill a server while the 2nd long-running task is in progress (after the 1st one completes):

ansible-playbook -i hosts kill-host.yml 

PLAY [reproduce lost host task hang] *******************************************************************

TASK [Gathering Facts] *********************************************************************************
ok: [10.10.XXX.XXX]
ok: [localhost]

TASK [Do something that takes time 1] ******************************************************************
changed: [10.10.XXX.XXX]
changed: [localhost]

TASK [Do something that takes time 2] ******************************************************************
changed: [localhost]
fatal: [10.10.XXX.XXX]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Timeout, server 10.10.XXX.XXX not responding.\r\n", "unreachable": true}

TASK [Do something that takes time 3] ******************************************************************
changed: [localhost]

TASK [Do something that takes time 4] ******************************************************************
changed: [localhost]
	to retry, use: --limit @/home/elijah/test/kill-host.retry

PLAY RECAP *********************************************************************************************
10.10.XXX.XXX              : ok=2    changed=1    unreachable=1    failed=0   
localhost                  : ok=5    changed=4    unreachable=0    failed=0   

Now the question is -- how do we ship this? Do we just tell users to put this in their global config file? Do we edit the file for them in the install process?

@kdelee
Author

kdelee commented Nov 29, 2017

Ok, even better news: we can use the variables ansible_ssh_args or ansible_ssh_common_args in playbooks or in inventories. The distinction is that ansible_ssh_common_args appends to any user-configured args, while ansible_ssh_args overwrites them.

i.e. instead of at the config-file level, we can write in the playbook itself:

- name: with these args a host dying mid-task won't make my playbook hang
  hosts: all
  strategy: free
  vars:
      # this appends args to whatever other ones
      # the user may have set in their ansible cfg
      # alternatively we can overwrite them with
      # ansible_ssh_args
      ansible_ssh_common_args: '-o ServerAliveInterval=10'
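Since the variable also works at the inventory level, the same setting could be scoped to just the hosts we scan. A hypothetical INI inventory (group and host names made up):

```ini
# The variable can be attached to a group instead of a playbook,
# so only scan targets get the keepalive behavior.
[scan_targets]
10.10.0.5
10.10.0.6

[scan_targets:vars]
ansible_ssh_common_args=-o ServerAliveInterval=10
```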

@kdelee kdelee changed the title Ansible bug causes scan to hang if host is lost Ansible "facts of life" causes scan to hang if host is lost Nov 29, 2017
@chambridge

Awesome!

@kdelee
Author

kdelee commented Nov 29, 2017

I'll make a PR

@kdelee
Author

kdelee commented Dec 1, 2017

I've not been able to test this reliably because of other problems in both dev and master that break the scan. Most recently, see #501

@kdelee kdelee changed the title Ansible "facts of life" causes scan to hang if host is lost Scan hangs if host is lost in the middle of a long running task Dec 1, 2017
@kdelee
Author

kdelee commented Dec 1, 2017

As I've been testing this, the ssh config option appears to be a fix only if the host is lost in the middle of a task. It does not seem to help if the host is lost between tasks.

@noahl

noahl commented Dec 4, 2017

I've read through the linked Ansible issues, and the Ansible docs for async and strategy. I don't have a solution yet, but I at least want to see what I can do.

@kdelee how did you create this error?

@kdelee
Author

kdelee commented Dec 4, 2017

So I see this as one of three possible causes of mysterious "hangs forever" scenarios, which are:

  1. During a task that takes a long time to complete (a yum update, for example), we lose the ssh connection to the host. That is what this issue addresses.

I've recreated this with only ansible-playbook by scanning my localhost and one other machine on vCenter with the above tasks. I wait until the first task completes, plus a few more seconds so I can presume that the second task has begun (it doesn't produce output until after the task finishes), and then I kill the second scan host. You will see that the other tasks proceed because of the free strategy, but when we get to the end of the tasks for localhost, it hangs forever.

I was trying to re-create this in rho on Friday but kept running into other unrelated rho bugs, so I did not get all the way there. I think you could insert a task in one of the playbooks to give yourself time to do it, like I have above (have a task that alerts you that the next task will take a while, then have a task that sleeps for 60 seconds or whatever you need, and kill the host in that time period).

  2. It also seems that if a host is lost between the end of one task and the start of the next, we run into trouble. I think this is distinct from this issue. I don't have a reliable reproducer, but I believe it happens because we seem to hang if I turn off a machine while running lots of rapid-fire very quick tasks, and then the ssh-timeout setting does not help. I have not filed an issue for this yet.

  3. The third possibility, which is probably even more likely than total loss of network connection to a host, is certain machines in a scan hanging forever on a shell command. The machine is live and the connection good, but the command never exits because it is busy. The problem is that the scan never finishes because of this one badly behaved host. See issue Prevent scan hanging if command exceeds time limit #504
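For that third case, one option (assuming GNU coreutils `timeout` is available on the target, which rho does not currently rely on) is to bound the shell command itself, so a busy host produces a failure instead of a hang:

```shell
# Sketch: bound a potentially-hanging command with coreutils `timeout`.
# If the command exceeds the limit it is killed and `timeout` exits 124.
timeout 2 sh -c 'echo started; sleep 30; echo never reached'
echo "exit status: $?"   # prints "exit status: 124" after ~2 seconds
```

In a playbook this would look like `shell: "timeout 300 some_long_command"`, where 300 is whatever budget the task deserves.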

@chambridge

@kdelee Can we get the ansible_ssh_common_args: '-o ServerAliveInterval=10' change in as an incremental improvement?

I'd like it in for the next release so at least we know we are a bit better than before.

kdelee added a commit that referenced this issue Dec 4, 2017
Adds ansible ssh argument for rho playbook to check if the ssh
connection is still good every ten seconds. This helps with the case
of losing a host mid-task detailed in issue #477.
kdelee added a commit that referenced this issue Dec 4, 2017
Adds ansible ssh argument for rho playbook to check if the ssh
connection is still good every ten seconds. This helps with the case
of losing a host mid-task detailed in issue #477.
chambridge pushed a commit that referenced this issue Dec 5, 2017
Adds ansible ssh argument for rho playbook to check if the ssh
connection is still good every ten seconds. This helps with the case
of losing a host mid-task detailed in issue #477.
@chambridge

chambridge commented Dec 6, 2017

After a recent conversation with Mark I looked into the following:
http://go2linux.garron.me/linux/2011/02/limit-idle-ssh-sessions-time-avoid-unattended-ones-clientaliveinterval-clientalivecoun/

I wonder if the following would help, where we could essentially limit the time of any connection to something like 2 minutes:

ClientAliveInterval 120
ClientAliveCountMax 0
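One caveat worth checking: ClientAliveInterval and ClientAliveCountMax are sshd (server-side) options, so they would go in the scanned host's sshd_config rather than our client-side ansible.cfg, which may or may not fit our deployment story. A sketch of where they would live:

```ini
# /etc/ssh/sshd_config on the remote host (server side, not ansible.cfg).
# With CountMax 0, the server drops a client that fails a single alive
# probe after 120 seconds of inactivity.
ClientAliveInterval 120
ClientAliveCountMax 0
```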

@kdelee
Author

kdelee commented Dec 6, 2017

I understand that this would kill an "idle" session after 2 minutes. It would be good to see what it does if there is a long-running task (thinking of one of the yum facts; those can take a while) that runs longer than the timeout. Will it kill the session if there is an active process in the shell?

We also do want to kill even active sessions eventually, but maybe we should give them a bit more time than this.

Anyway, sounds promising.
