Scan hangs if host is lost in the middle of a long running task #477
Comments
Similar problems: The async/poll feature was suggested as a workaround for them. I'm trying it out, but no luck as of yet. This is my playbook:
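For readers unfamiliar with the feature, here is a minimal sketch of async/poll on a single task. It is not the playbook from this comment; the task name, command, and timing values are illustrative assumptions.

```yaml
---
- name: async/poll sketch (hypothetical example)
  hosts: all
  tasks:
    - name: Long-running command with an upper time limit
      shell: "ping -c 30 localhost"
      # async: maximum number of seconds the task is allowed to run
      async: 60
      # poll: how often (in seconds) Ansible checks whether it has finished
      poll: 5
```

If the command is still running after the `async` limit, Ansible marks the task as failed instead of waiting on it indefinitely.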
Another possible workaround is setting a number of seconds in the Ansible config file:
[ssh_connection]
ssh_args = -o ServerAliveInterval=n
where n is the interval in seconds.
Either of these looks like a possible workaround, but the config file change is the least invasive and directly addresses the issue at hand: we want to be able to continue even if we lose one host. The async/poll approach is better suited to the case where one host of several is stuck indefinitely while actually doing something (i.e. it is still alive).
With the following config file:
#ansible.cfg
[ssh_connection]
ssh_args = -o ServerAliveInterval=10
And executing the following playbook:
---
- name: reproduce lost host task hang
  hosts: all
  strategy: free
  tasks:
    - name: Do something that takes time 1
      shell: "ping -c 30 localhost"
    # after the first task completes, I kill one of the machines
    - name: Do something that takes time 2
      shell: "for i in {1..10}; do echo $i && sleep 2; done;"
    - name: Do something that takes time 3
      shell: "ping -c 30 localhost"
    - name: Do something that takes time 4
      shell: "for i in {1..10}; do echo $i && sleep 2; done;"
I get the following output when I kill a server when the 2nd long running task is happening (after the 1st one completes).
ansible-playbook -i hosts kill-host.yml
PLAY [reproduce lost host task hang] *******************************************************************
TASK [Gathering Facts] *********************************************************************************
ok: [10.10.XXX.XXX]
ok: [localhost]
TASK [Do something that takes time 1] ******************************************************************
changed: [10.10.XXX.XXX]
changed: [localhost]
TASK [Do something that takes time 2] ******************************************************************
changed: [localhost]
fatal: [10.10.XXX.XXX]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Timeout, server 10.10.XXX.XXX not responding.\r\n", "unreachable": true}
TASK [Do something that takes time 3] ******************************************************************
changed: [localhost]
TASK [Do something that takes time 4] ******************************************************************
changed: [localhost]
to retry, use: --limit @/home/elijah/test/kill-host.retry
PLAY RECAP *********************************************************************************************
10.10.XXX.XXX : ok=2 changed=1 unreachable=1 failed=0
localhost : ok=5 changed=4 unreachable=0 failed=0
Now the question is: how do we ship this? Do we just tell users to put this in their global config file? Do we edit the file for them in the install process?
OK, even better news: we can use variables, i.e. instead of setting this at the config file level, we can write it in the playbook itself:
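A minimal sketch of what the playbook-level setting could look like, assuming the variable meant here is Ansible's `ansible_ssh_common_args` connection variable (the comment does not name it) and reusing the 10-second interval from the config file above:

```yaml
---
- name: reproduce lost host task hang
  hosts: all
  strategy: free
  vars:
    # Assumption: ansible_ssh_common_args is the intended variable; it appends
    # extra arguments to the ssh command line Ansible builds.
    ansible_ssh_common_args: "-o ServerAliveInterval=10"
  tasks:
    - name: Do something that takes time 1
      shell: "ping -c 30 localhost"
```

With OpenSSH's default `ServerAliveCountMax` of 3, an unresponsive host would be dropped after roughly 30 seconds.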
Awesome!
I've not been able to test this reliably because of other problems in both
As I've been testing this, the ssh config option appears only to be a fix if the host is lost in the midst of a task. It does not seem to help if the host is lost in between tasks.
I've read through the linked Ansible issues, and the Ansible docs for async and strategy. I don't have a solution yet, but I at least want to see what I can do. @kdelee how did you create this error?
So I see this as one of three possible causes for mysterious "hangs forever" scenarios, which are:
I've recreated this with only Ansible. I was trying to re-create it in rho on Friday but kept running into other, unrelated rho bugs, so I did not get the whole way. I think you could insert a task in one of the playbooks to give yourself time to do it, like I have above: have a task that alerts you that the next task will take a while, then have a task that sleeps for 60 seconds (or whatever you need), and kill the host in that time period.
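The "alert, then sleep" pair of tasks described here could be sketched as follows. The task names and message text are hypothetical, and the 60-second sleep is just the value mentioned above.

```yaml
---
- name: give yourself time to kill a host (hypothetical example)
  hosts: all
  tasks:
    - name: Announce that the next task will take a while
      debug:
        msg: "Next task sleeps for 60 seconds; kill the host now"
    - name: Sleep long enough to kill the host mid-task
      shell: "sleep 60"
```

Killing the host during the sleep task exercises the mid-task case. Killing it between the two tasks exercises the between-tasks case that the ssh option reportedly does not help with.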
@kdelee Can we get the I'd like it in for the next release so at least we know we are a bit better than before.
Adds ansible ssh argument for rho playbook to check if the ssh connection is still good every ten seconds. This helps with the case of losing a host mid-task detailed in issue #477.
After a recent conversation with Mark, I looked into the following: I wonder if the following would help, where we could essentially limit the time of any connection to something like 2 min.
I understand that this would kill an "idle" session after 2 min. It would be good to see what it does if there is some long-running task (I'm thinking of one of the yum facts; those can take a while) that is longer than the timeout. Will it kill the session if there is an active process in the shell? We do also want to kill even active sessions eventually, but maybe we should give them a bit more time than this. Anyway, sounds promising.
There appears to be a bug in ansible that will cause a scan to hang if a host is lost.
A bug needs to be filed against Ansible once we have a reproducer using just Ansible.