Issue with pytest-xdist Handling Out of Memory Errors (IndexError) #1155

loveleenamar9 opened this issue Nov 18, 2024 · 1 comment
loveleenamar9 commented Nov 18, 2024

Hi,
I am using pytest-xdist to run a test suite that includes subgraph tests. Sporadically, loading a large model causes a worker process to be killed for running out of memory (OOM), and the run then fails with an IndexError. pytest-xdist handles other worker crashes gracefully, but it appears to struggle with crashes caused by OOM. The worker crash itself is expected; the problem is that the crashed worker is not replaced properly in this case, which leads to the IndexError.

Below is an example of the error log:

2024-10-27T21:28:18Z  tensorflow	[gw13] [ 70%] FAILED layerwise/Mistral7b/test_model_layers_0.py::test_model_layers_0 
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	replacing crashed worker gw13
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> def worker_internal_error(
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         self, node: WorkerController, formatted_error: str
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     ) -> None:
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         """
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         pytest_internalerror() was called on the worker.
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         pytest_internalerror() arguments are an excinfo and an excrepr, which can't
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         be serialized, so we go with a poor man's solution of raising an exception
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         here ourselves using the formatted message.
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         """
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         self._active_nodes.remove(node)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         try:
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> >           assert False, formatted_error
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E           AssertionError: Traceback (most recent call last):
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 271, in wrap_session
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 session.exitstatus = doit(config, session) or 0
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 325, in _main
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 config.hook.pytest_runtestloop(session=session)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513, in __call__
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120, in _hookexec
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 182, in _multicall
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 return outcome.get_result()
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_result.py", line 100, in get_result
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 raise exc.with_traceback(exc.__traceback__)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103, in _multicall
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 res = hook_impl.function(*args)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/root/.local/lib/python3.10/site-packages/xdist/remote.py", line 174, in pytest_runtestloop
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 self.run_one_test()
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/root/.local/lib/python3.10/site-packages/xdist/remote.py", line 185, in run_one_test
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 item = items[self.item_index]
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E             IndexError: list index out of range
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E           assert False
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> 
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> /root/.local/lib/python3.10/site-packages/xdist/dsession.py:232: AssertionError
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> Traceback (most recent call last):
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/main.py", line 273, in wrap_session
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/main.py", line 327, in _main
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513, in __call__
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120, in _hookexec
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 139, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     raise exception.with_traceback(exception.__traceback__)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 122, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/logging.py", line 796, in pytest_runtestloop
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     res = hook_impl.function(*args)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/xdist/dsession.py", line 138, in pytest_runtestloop
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     self.loop_once()
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/xdist/dsession.py", line 152, in loop_once
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     raise RuntimeError("Unexpectedly no active workers available")
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> RuntimeError: Unexpectedly no active workers available

The issue can be reproduced by creating a dummy test that allocates a large amount of memory:

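```python
def test_oom():
    # Each inner list holds 2**28 references (about 2 GiB on 64-bit CPython),
    # so 175 of them ask for far more memory than the machine is expected to have.
    large_memory_allocation = []
    for _ in range(175):
        large_memory_allocation.append([0] * (1024**3 // 4))
```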

I suspect that the synchronization between the worker and the master process is not occurring correctly, leading to incomplete communication.

Note: This issue is observed only with a large test suite.

Could you please help identify what is causing this IndexError and how to resolve it, so that pytest-xdist can handle OOM errors gracefully?

Thanks!
Loveleen.

@RonnyPfannschmidt
Member

This does indeed look like a missed case in worker restart.

It's possibly related to the OOM hard kill preventing the worker from sending its messages.

Most normal worker restarts get some kind of message.
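
To illustrate the hard-kill point, here is a minimal sketch with a toy worker and controller connected by a pipe. It is not pytest-xdist code: `run_worker` and the message names `teststart` and `workerfinished` are hypothetical, and the sketch assumes a POSIX system where the OOM killer terminates the process with SIGKILL. A worker that exits normally gets to send its final message, while a killed worker just goes silent.

```python
import os
import signal


def run_worker(hard_kill: bool) -> list[str]:
    """Fork a toy worker and return the messages the parent received from it."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: the toy "worker"
        os.close(read_fd)
        os.write(write_fd, b"teststart\n")
        if hard_kill:
            # Simulate the OOM killer: SIGKILL cannot be caught, so no
            # cleanup code and no final message run after this line.
            os.kill(os.getpid(), signal.SIGKILL)
        os.write(write_fd, b"workerfinished\n")
        os._exit(0)
    # parent: the toy "controller"
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()  # returns at EOF, once the child is gone
    os.waitpid(pid, 0)  # reap the child
    return data.decode().split()


if __name__ == "__main__":
    print("normal exit:", run_worker(hard_kill=False))
    print("hard kill:  ", run_worker(hard_kill=True))
```

In the hard-kill case the controller only sees the pipe close, so any restart logic that waits for a final message from the worker has to cope with never receiving one.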
