
Tests using invoke_and_assert can sometimes attempt to forward signals from a thread that's not main #10895

Closed
chrisguidry opened this issue Oct 5, 2023 · 6 comments · Fixed by #11552
Labels
development Tech debt, refactors, CI, tests, and other related work.

Comments

@chrisguidry
Collaborator

First check

  • I am a contributor to the Prefect codebase

Description

This was the source of several test flakes that @abrookins identified here: https://github.com/PrefectHQ/prefect/pull/10889/files#diff-00c0a3f9f0ef58527f83db91f74ea7356bb8d9134b4db3e490e4e35d39c49445R12-R16

Impact

No response

Additional context

No response

@chrisguidry chrisguidry added the development Tech debt, refactors, CI, tests, and other related work. label Oct 5, 2023
@abrookins
Collaborator

Here's a recent example of this failure -- it can occur in many places, not just our CLI commands.

My concern is not only that tests are flaky -- though they are. The bigger problem is that this suggests any code wrapped in sync_compatible is not safe to use with signal handlers.

=================================== FAILURES ===================================
________________________ test_start_no_options[server] _________________________
[gw7] linux -- Python 3.11.5 /opt/hostedtoolcache/Python/3.11.5/x64/bin/python

mock_run_process = <AsyncMock id='140453732195280'>, command_group = 'server'

    @pytest.mark.parametrize("command_group", ["server"])
    def test_start_no_options(mock_run_process: AsyncMock, command_group: str):
>       invoke_and_assert(
            [command_group, "start"],
            expected_output_contains=[
                "prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api",
                "View the API reference documentation at http://127.0.0.1:4200/docs",
            ],
        )

tests/cli/test_server.py:26: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/prefect/testing/cli.py:99: in invoke_and_assert
    result = runner.invoke(app, command, catch_exceptions=False, input=user_input)
/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/typer/testing.py:21: in invoke
    return super().invoke(
/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/click/testing.py:408: in invoke
    return_value = cli.main(args=args or (), prog_name=prog_name, **extra)
/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/typer/core.py:778: in main
    return _main(
/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/typer/core.py:216: in _main
    rv = self.invoke(ctx)
/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/click/core.py:1688: in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/click/core.py:1688: in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/click/core.py:1434: in invoke
    return ctx.invoke(self.callback, **ctx.params)
/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/click/core.py:783: in invoke
    return __callback(*args, **kwargs)
/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/typer/main.py:683: in wrapper
    return callback(**use_params)  # type: ignore
src/prefect/cli/_utilities.py:41: in wrapper
    return fn(*args, **kwargs)
src/prefect/utilities/asyncutils.py:255: in coroutine_wrapper
    return call()
src/prefect/_internal/concurrency/calls.py:382: in __call__
    return self.result()
src/prefect/_internal/concurrency/calls.py:282: in result
    return self.future.result(timeout=timeout)
src/prefect/_internal/concurrency/calls.py:168: in result
    return self.__get_result()
/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
src/prefect/_internal/concurrency/calls.py:345: in _run_async
    result = await coro
src/prefect/cli/server.py:151: in start
    setup_signal_handlers_server(
src/prefect/utilities/processutils.py:364: in setup_signal_handlers_server
    setup_handler(signal.SIGINT, signal.SIGTERM, signal.SIGKILL)
src/prefect/utilities/processutils.py:349: in forward_signal_handler
    signal.signal(signum, handler)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

signalnum = <Signals.SIGINT: 2>
handler = <function forward_signal_handler.<locals>.handler at 0x7fbdd0eda480>

    @_wraps(_signal.signal)
    def signal(signalnum, handler):
>       handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
E       ValueError: signal only works in main thread of the main interpreter

/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/signal.py:56: ValueError
----------------------------- Captured stderr call -----------------------------
19:25:51.375 | DEBUG   | Dummy-1      | prefect._internal.concurrency - <function start at 0x7fbe18a21620> --> run async in new loop
19:25:51.375 | DEBUG   | Dummy-1      | prefect._internal.concurrency - Running call start(host='127.0.0.1', port=4200, keep_alive_timeout=5, log_level='WARNING', scheduler=False, analytics=F...) in thread 'Dummy-1'
19:25:51.375 | DEBUG   | Dummy-1      | prefect._internal.concurrency - <WatcherThreadCancelScope, name='start' RUNNING, runtime=0.00> entered
19:25:51.375 | DEBUG   | Dummy-1      | prefect._internal.concurrency - <WatcherThreadCancelScope, name='start' COMPLETED, runtime=0.00> exited
19:25:51.375 | DEBUG   | Dummy-1      | prefect._internal.concurrency - Executing coroutine for call start(host='127.0.0.1', port=4200, keep_alive_timeout=5, log_level='WARNING', scheduler=False, analytics=F...) in new loop
19:25:51.376 | DEBUG   | Dummy-1      | prefect._internal.concurrency - <AsyncCancelScope, name='start' RUNNING, runtime=0.00> entered
19:25:51.379 | DEBUG   | Dummy-1      | prefect._internal.concurrency - <AsyncCancelScope, name='start' COMPLETED, runtime=0.00> exited
19:25:51.379 | DEBUG   | Dummy-1      | prefect._internal.concurrency - Encountered exception in async call start(host='127.0.0.1', port=4200, keep_alive_timeout=5, log_level='WARNING', scheduler=False, analytics=F...)
Traceback (most recent call last):
  File "/home/runner/work/prefect/prefect/src/prefect/_internal/concurrency/calls.py", line 345, in _run_async
    result = await coro
             ^^^^^^^^^^
  File "/home/runner/work/prefect/prefect/src/prefect/cli/server.py", line 151, in start
    setup_signal_handlers_server(
  File "/home/runner/work/prefect/prefect/src/prefect/utilities/processutils.py", line 364, in setup_signal_handlers_server
    setup_handler(signal.SIGINT, signal.SIGTERM, signal.SIGKILL)
  File "/home/runner/work/prefect/prefect/src/prefect/utilities/processutils.py", line 349, in forward_signal_handler
    signal.signal(signum, handler)
  File "/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/signal.py", line 56, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: signal only works in main thread of the main interpreter
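
For reference, the same failure mode is easy to demonstrate outside of pytest; a minimal sketch using only the standard library (the helper name is illustrative):

# Minimal sketch: signal.signal raises ValueError on any thread other than
# the main thread, which is exactly what the CI traceback above shows.
import signal
import threading


def install_handler():
    try:
        signal.signal(signal.SIGINT, lambda signum, frame: None)
        print("handler installed")
    except ValueError as exc:
        # On a worker thread this prints:
        # "signal only works in main thread of the main interpreter"
        print(f"failed: {exc}")


install_handler()  # succeeds: we're on the main thread

worker = threading.Thread(target=install_handler, name="Dummy-1")
worker.start()     # fails with the same ValueError as in the traceback above
worker.join()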

@abrookins
Collaborator

This appears to be the root cause: pytest-dev/pytest-xdist#469

@abrookins
Collaborator

So it's not just invoke_and_assert: any time we call signal.signal while a test is running off the main thread, we'll hit this issue. I spent today researching whether there is a quick fix to how we run tests with xdist, but I don't see one.

So I'm considering other options: moving the signal-handler setup into a utility that logs when it's not on the main thread (and thus can't call signal.signal), and revisiting the strategy we use to mock and test signal handler calls. Currently, our mock calls signal.signal, which of course also fails in this case.
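
A rough sketch of that first option (the helper name and logger are hypothetical, not a final API):

# Hypothetical guard: only touch signal.signal on the main thread;
# otherwise log and skip, so the calling code still works when it is
# driven from a worker thread.
import logging
import signal
import threading
from typing import Callable

logger = logging.getLogger(__name__)


def safe_install_signal_handler(signum: int, handler: Callable) -> bool:
    """Install `handler` for `signum` only if we're on the main thread.

    Returns True if the handler was installed, False if it was skipped
    because we're on a worker thread (as happens under pytest-xdist/execnet).
    """
    if threading.current_thread() is not threading.main_thread():
        logger.debug("Skipping signal handler for %s: not on the main thread", signum)
        return False
    signal.signal(signum, handler)
    return True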

@chrisguidry
Collaborator Author

I'm working on a minimal repro test case now, but I'm having a hard time figuring out when xdist puts things on another thread...

@chrisguidry
Collaborator Author

This appears to be the root cause: pytest-dev/pytest-xdist#469

And they point to the underlying execnet: pytest-dev/execnet#96

@chrisguidry
Collaborator Author

I'm still not able to reproduce this, even with a basic test that simply asserts the test thread is the main thread.

import threading

import pytest


# Each of these should fail if xdist/execnet ever runs the test off the main thread.
@pytest.mark.parametrize("anything", list(range(100)))
def test_sync_which_thread_am_i(anything: int):
    assert threading.get_ident() == threading.main_thread().ident


@pytest.mark.parametrize("anything", list(range(100)))
async def test_async_which_thread_am_i(anything: int):
    assert threading.get_ident() == threading.main_thread().ident

My understanding of the execnet issue is that, on slower machines or under heavier load, execnet may need to send some work to another thread, so I have been testing this with a container limited to 0.2 CPU:

docker run -ti --rm -v ./:/prefect/ --cpus="0.2" --entrypoint bash python:3.11

Then in there, I run a couple of things to get bootstrapped:

git config --global --add safe.directory /prefect
pip install -e .[dev]

Then run pytest (with timeouts disabled):

pytest -n 2 tests/cli/test_signal_issue.py --timeout=0

After quite a few tries, I haven't been able to get this to happen. I still can't determine what combination of software versions, GitHub Actions environment, and system load triggers it, but I feel like this is starting to narrow in on the issue.
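
One diagnostic we could try in the meantime is a small conftest.py hook that flags any test that ends up off the main thread; a sketch only, assuming pytest's standard hookwrapper API:

# Hypothetical conftest.py snippet: warn whenever a test body runs on a
# non-main thread, so a CI run that hits this case leaves a trace.
import threading
import warnings

import pytest


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_call(item):
    current = threading.current_thread()
    if current is not threading.main_thread():
        warnings.warn(f"{item.nodeid} is running on non-main thread {current.name!r}")
    yield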
