pytest-xdist server side timeout #57

Open
limaoscarjuliet opened this issue May 5, 2016 · 2 comments

@limaoscarjuliet

NOTE: This is not about a timeout for the test code itself (pytest-timeout works well for that); it is about the need for a timeout in pytest-xdist.

First, let me say a big thank you for pytest and pytest-xdist. We use them to run ~400 Docker containers on ~10 servers on AWS. It works wonders!

There are scenarios where pytest-xdist does not detect a remote session crash or disconnect and, as a result, waits for results forever.

Today's xdist code detects a session crash via EOF on the SSH session. When the network connection is torn down, the server marks the worker as dead and re-adds it. All good.

But... consider a scenario where the SSH is not torn down:

  1. Run N tests on multiple remote machines with pytest-xdist.
  2. Tests spawn a Python process on the remote machine via SSH.
  3. We run in boxed mode, so this process forks to run the actual test code.
  4. Process #2 gets killed or crashes.
  5. The SSH session stays up because process #3 inherited at least one of stdin/stdout/stderr from process #2 (standard SSH behavior).

In this case, the server-side xdist thinks the session is still up and waits for the results for a really, really long time ;-)
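
The inherited-descriptor effect in the steps above can be reproduced locally without SSH. The following is only a minimal sketch, assuming a POSIX system and Python 3; it uses a plain pipe in place of the SSH channel, and the names `seconds_until_eof`, `CHILD` and `GRANDCHILD` are made up for this illustration and are not part of xdist or execnet. The direct child exits immediately, but the reader sees EOF only after the grandchild, which inherited stdout, also exits.

```python
import subprocess
import sys
import time

# Grandchild: keeps the inherited stdout open for a while, then exits.
# It stands in for the forked test process (#3 in the list above).
GRANDCHILD = "import time; time.sleep({hold})"

# Child: starts the grandchild (which inherits stdout) and exits at once,
# standing in for the worker process (#2) that crashes or is OOM-killed.
CHILD = (
    "import subprocess, sys; "
    "subprocess.Popen([sys.executable, '-c', {grandchild!r}]); "
    "sys.exit(1)"
)


def seconds_until_eof(hold=1.0):
    """Return (child_exit_delay, eof_delay), both in seconds."""
    child_code = CHILD.format(grandchild=GRANDCHILD.format(hold=hold))
    start = time.monotonic()
    child = subprocess.Popen(
        [sys.executable, "-c", child_code], stdout=subprocess.PIPE
    )
    child.wait()  # the direct child is gone almost immediately
    exit_delay = time.monotonic() - start
    child.stdout.read()  # blocks until every holder of the pipe has closed it
    eof_delay = time.monotonic() - start
    return exit_delay, eof_delay


if __name__ == "__main__":
    exit_delay, eof_delay = seconds_until_eof(hold=1.0)
    print("child exited after %.2fs, EOF seen after %.2fs" % (exit_delay, eof_delay))
```

A reader that relies on EOF alone therefore cannot tell "worker died" apart from "worker is still busy" for as long as any descendant keeps the descriptor open.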

And yes, #2 does not normally crash. In our case it was OOM-killed quite persistently. A single OOM kill among tens of thousands of tests is enough to ruin the entire batch.

Please let me know if I can provide more info on this issue.

[root@nsth-c10 nsth] #.python --version
Python 2.7.10
[root@nsth-c10 nsth] #.py.test --version
This is pytest version 2.8.0, imported from /usr/local/lib/python2.7/site-packages/pytest-2.8.0-py2.7.egg/pytest.pyc
setuptools registered plugins:
pytest-xdist-1.13.1 at /usr/local/lib/python2.7/site-packages/pytest_xdist-1.13.1-py2.7.egg/xdist/boxed.pyc
pytest-xdist-1.13.1 at /usr/local/lib/python2.7/site-packages/pytest_xdist-1.13.1-py2.7.egg/xdist/looponfail.pyc
pytest-xdist-1.13.1 at /usr/local/lib/python2.7/site-packages/pytest_xdist-1.13.1-py2.7.egg/xdist/plugin.pyc
[root@nsth-c10 nsth] #.

P.S.
Moved from pytest-dev/pytest#1550

@RonnyPfannschmidt
Member

I think this one depends on #20: with the current codebase it's really tricky to introduce heartbeats on top of the support for node restarts.

Since we can't detect a dead SSH session due to its default behaviour, we need some kind of heartbeat mechanism so we can be aware of sessions in an unresponsive state.

I think this is an item for execnet itself.
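
To make the heartbeat idea concrete, here is a rough sketch of the bookkeeping the controlling side might do. It is not existing xdist or execnet code: the `HeartbeatMonitor` class, its methods, and the worker ids are all hypothetical, and how heartbeat messages would actually travel over the channel is left open.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per worker and report the silent ones.

    Hypothetical helper for illustration; not an xdist or execnet API.
    """

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout  # seconds of silence tolerated per worker
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen = {}

    def beat(self, worker_id):
        """Record that a heartbeat (or any message) arrived from a worker."""
        self._last_seen[worker_id] = self._clock()

    def forget(self, worker_id):
        """Stop tracking a worker that finished or was already replaced."""
        self._last_seen.pop(worker_id, None)

    def unresponsive(self):
        """Return the sorted ids of workers silent for longer than the timeout."""
        now = self._clock()
        return sorted(
            worker_id
            for worker_id, seen in self._last_seen.items()
            if now - seen > self.timeout
        )
```

The controlling loop would call `beat()` whenever anything arrives from a worker and poll `unresponsive()` periodically, treating the returned workers the same way it treats an EOF today.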

@limaoscarjuliet
Author

limaoscarjuliet commented May 6, 2016

We addressed the underlying root cause by increasing the amount of memory each container can use (docker --memory option). But, of course, there are other ways it may lock up or crash, so addressing this will help.

Thank you for taking this into account in the future.
