Fix exception swallow in fixture due to parent cancel scope #77

Conversation

touilleMan
Member

Following #75, I've finally isolated the original issue:

import pytest
import trio
from contextlib import asynccontextmanager

async def die_soon(*, task_status=trio.TASK_STATUS_IGNORED):
    task_status.started()
    raise RuntimeError('Ooops !')

@asynccontextmanager
async def async_finalizer():
    try:
        yield
    finally:
        await trio.sleep(0)

@pytest.fixture
async def fixture(nursery):
    async with trio.open_nursery() as nursery1:
        async with async_finalizer():
            async with trio.open_nursery() as nursery2:
                await nursery2.start(die_soon)
                yield
                # Comment next line to make the test pass :'(
                nursery1.cancel_scope.cancel()

@pytest.mark.trio
async def test_try(fixture):
    await trio.sleep_forever()
  1. die_soon causes the fixture to crash
  2. Because of how pytest-trio works, the information that die_soon has crashed doesn't make the yield return with an exception
  3. Hence the line after the yield runs and nursery1 is cancelled
  4. Any async teardown operation in async_finalizer ends up with a Cancelled exception
  5. This Cancelled exception reaches nursery1, which concludes that no other exception has occurred

The sad part is that if we convert the fixture into a regular async context manager, everything works fine.
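
Roughly, the conversion I mean looks like this (just a sketch, reusing die_soon and async_finalizer from above; the names are made up):

@asynccontextmanager
async def fixture_as_context_manager():
    # Same structure as the fixture above, but used as a plain async
    # context manager instead of a pytest-trio fixture.
    async with trio.open_nursery() as nursery1:
        async with async_finalizer():
            async with trio.open_nursery() as nursery2:
                await nursery2.start(die_soon)
                yield
                nursery1.cancel_scope.cancel()

@pytest.mark.trio
async def test_try_with_context_manager():
    async with fixture_as_context_manager():
        await trio.sleep_forever()
    # Here the RuntimeError from die_soon surfaces (the test fails with it)
    # instead of being silently swallowed.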

I'm not really sure how we could fix that (well, I'm not even sure about the steps I've listed 😄). @njsmith, any idea?

touilleMan force-pushed the yield-fixture-crash-swallowed-with-parent-cancel branch from beff403 to 4f24fb2 on January 12, 2019 12:21
touilleMan changed the title from Add test_async_yield_fixture_crashed_then_parent_cancel to Showcase exception swallow in fixture due to parent cancel scope on Jan 12, 2019
This test showcases a bug where an exception occurring in a yield fixture
is silently swallowed when a parent cancel_scope gets cancelled during
teardown.
touilleMan force-pushed the yield-fixture-crash-swallowed-with-parent-cancel branch from d9b4eb5 to ab985c7 on May 30, 2019 12:39
@codecov

codecov bot commented May 30, 2019

Codecov Report

Merging #77 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #77      +/-   ##
==========================================
+ Coverage   99.78%   99.79%   +<.01%     
==========================================
  Files          19       19              
  Lines         476      483       +7     
  Branches       41       42       +1     
==========================================
+ Hits          475      482       +7     
  Misses          1        1
Impacted Files Coverage Δ
pytest_trio/_tests/test_async_yield_fixture.py 100% <100%> (ø) ⬆️
pytest_trio/plugin.py 99.48% <100%> (ø) ⬆️

@touilleMan
Member Author

@njsmith Would you have a look at this fix?
The bug is really nasty and I fear this only covers part of the trouble (or may lead to incorrect exception reporting in some cases...)

touilleMan changed the title from Showcase exception swallow in fixture due to parent cancel scope to Fix exception swallow in fixture due to parent cancel scope on May 30, 2019
@oremanj
Member

oremanj commented May 30, 2019

I think this is the same issue as python-trio/trio#455. An exception is propagating in a cancelled context; some __aexit__ or finally block executes a checkpoint; you wind up with a Cancelled propagating with the original exception as __context__; and when Trio swallows the Cancelled it can't tell whether the exception in __context__ was in the middle of being handled or what.

I'm not super comfortable with the proposed fix -- it feels special-cased against the particular example you've used to demonstrate the problem, rather than fixing the underlying problem.

The sad part is that if we convert the fixture into a regular async context manager, everything works fine.

I have some doubts about that. Note that the @pytest.fixture protocol is different from the @asynccontextmanager protocol, because the former doesn't throw in exceptions. That means that for an apples-to-apples comparison, the @asynccontextmanager version would have to say

try:
    yield
finally:
    nursery1.cancel_scope.cancel()

at which point I believe the same error will recur, demonstrating that this isn't really a pytest-trio problem.
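
Spelling that out against the original reproduction (just a sketch for illustration; it reuses die_soon and async_finalizer and moves the teardown into a finally):

@asynccontextmanager
async def fixture_as_context_manager():
    async with trio.open_nursery() as nursery1:
        async with async_finalizer():
            async with trio.open_nursery() as nursery2:
                await nursery2.start(die_soon)
                try:
                    yield
                finally:
                    # Unconditional teardown, mirroring what the pytest
                    # fixture does, since pytest never throws the test's
                    # outcome back in at the yield.
                    nursery1.cancel_scope.cancel()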

@touilleMan
Member Author

Thanks for the feedback @oremanj

I think this is the same issue as python-trio/trio#455.

I didn't know about this issue, but it seems pretty close indeed. My guess is that the weird yield in pytest-trio makes the issue more likely to occur there, and that's why I've only encountered this kind of issue in pytest-trio so far.

I'm not super comfortable with the proposed fix -- it feels special-cased against the particular example you've used to demonstrate the problem, rather than fixing the underlying problem.

I totally agree. The point is that currently we have tests that swallow errors, which is the worst kind of behavior (it's a pain to detect the bug, then an even bigger pain to find out the issue).

So I would say a hacky fix is still better than the current situation...

@njsmith
Member

njsmith commented Jun 7, 2019

This is a tough issue. @touilleMan can you share the real example where you ran into this?

The problem with the minimal example is that it's artificial, so it isn't obvious what it should do. Arguably it's already working correctly – you explicitly requested that the code inside the nursery1 block be cancelled, that code immediately stopped what it was doing (in this case, handling an exception), that's what you asked for, what's the problem? And if you change the example to something like this, then the proposed fix seems very weird (it means we exit with RuntimeError even though we caught it):

async def myfixture():
    try:
        async with trio.open_nursery() as nursery:
            await nursery.start(die_soon)
            yield
    except RuntimeError:
        pass

So we probably need to do something, but the minimal example doesn't make it obvious what change to make.

@touilleMan
Member Author

@njsmith We encountered this issue multiple times in the unit tests of our project Parsec.
To put it shortly, Parsec is a client/server application for sharing files with client-side encryption. This means a typical use case in our tests is:

  • init a server with the running_backend fixture
  • init a client with the alice_core fixture
  • in the actual test make the client and the server communicate

One important thing about the client (called core in our codebase) is that it spawns multiple coroutines to run background tasks called monitors (typically listening to the backend in case another client modifies a file we are working on, or determining when we should upload to the server the files that have been modified locally).
To do that, the core is created by a context manager which ends with a scope cancellation to stop the monitor coroutines.

Now going back to our test: if a monitor is buggy and blows up its coroutine, the alice_core fixture doesn't get the exception and does its regular teardown, cancelling the core's monitors nursery.
This makes a trivial bug pretty complex to spot, because we have to comment out the cancellation in the fixture (and there can be multiple suspect cancellations in different fixtures!) in order to reveal the exception.

I've created a branch in the project that demonstrates this nicely:

$ pip install -e .[all]
[...]
$ py.test tests/test_showcase_pytest_trio_issue_75.py -s
[...]
tests/test_showcase_pytest_trio_issue_75.py Test started...
Crash in messages_monitor :(
.
[...]
======= 1 passed in 0.25 seconds =======
$ py.test tests/test_showcase_pytest_trio_issue_75.py -s --detect-pytest-trio-issue-75
tests/test_showcase_pytest_trio_issue_75.py Test started...
Crash in messages_monitor :(
F
[...]
======= 1 failed in 0.33 seconds =======

The more I think about this issue, the more I wonder about this "fixture never raises exceptions" policy. This is useful to make trio fixtures work like regular pytest fixtures, but I don't see a good reason for pytest to do this in the first place (I guess the reason is that teardown should happen no matter the outcome of the test; however, that's what try/finally is for, and Python is all about consenting adults...).
So wouldn't it be better to just have trio fixtures raise the exception?

@njsmith
Member

njsmith commented Jun 7, 2019

Hmm, OK, I think I get part of it though. What's anomalous about myfixture is that it does crash the test – the yield never raises Cancelled, but it is a cancel point in the critical sense that if it gets cancelled, then the whole test immediately gets cancelled. So even if we catch the exception in the fixture, it's already had side effects outside the fixture. I think I finally get what you and @vxgmichel were saying in #75.

Ick.

So first, yeah, if crash has been called, then the test definitely needs to fail one way or another. That seems clear.

The simplest solution would be that if a fixture calls crash(None) but never follows up with an actual exception, then we raise a new exception like RuntimeError(f"fixture {fixture_name} crashed"). (This is sort of what @vxgmichel is doing here.) That would get us to a point where at least your tests reliably fail. And of course we'd have some docs to explain that this error happens when a fixture yield gets cancelled, usually because of a background task crashing, but then the fixture doesn't raise an exception, usually because it somehow caught the background task's exception.
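
Very roughly, the bookkeeping could look something like this (purely a sketch with made-up names such as FixtureCrashTracker, not the actual plugin code):

class FixtureCrashTracker:
    # Track, per fixture, whether a cancellation ever got explained by a
    # real exception; if not, synthesize an error so the test can't pass.
    def __init__(self):
        self._cancelled_without_exc = set()
        self._errors = []

    def crash(self, fixture_name, exc):
        if exc is None:
            # The fixture's yield got cancelled, but we don't know why yet.
            self._cancelled_without_exc.add(fixture_name)
        else:
            self._cancelled_without_exc.discard(fixture_name)
            self._errors.append(exc)

    def finalize(self):
        # A fixture was cancelled but never followed up with an actual
        # exception: fail loudly instead of letting the test pass.
        for name in sorted(self._cancelled_without_exc):
            self._errors.append(RuntimeError(f"fixture {name!r} crashed"))
        return self._errors

One tracker per test run: call crash(name, None) when a fixture's yield gets cancelled, crash(name, exc) once an actual exception surfaces, and finalize() at teardown to fail the test if the follow-up never came.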

If we want to do better, it would be by giving more information to help debug the problem. For something like the myfixture example above, I guess it will be pretty rare to get into this situation and won't be hard for users to figure out what they did wrong, so maybe that's good enough...?

There are also cases where you probably can't do any better. For example:

async def silly_fixture():
    async def cancel_me(cancel_scope):
        cancel_scope.cancel()

    async with trio.open_nursery() as nursery:
        nursery.start_soon(cancel_me, nursery.cancel_scope)
        yield

In this case, the fixture crashes the test, but there is no exception to report.

I guess this is an example of a case where it would be useful if Cancelled exceptions carried some hint about what triggered them – @smurfix keeps asking for them to capture the stack where cancel was called, which is expensive but might even be OK in a test setting. Or even just "someone called cancel()" could be a useful hint here, versus "there was an unhandled exception propagating".

But myfixture and silly_fixture are both super super artificial situations. In the actual situations you're hitting it's a lot more confusing, so probably some kind of additional hints would be helpful, but to figure out what those are, we need a better understanding of how exactly people get into this situation.

For fixtures using the nursery fixture, I guess this isn't an issue? I think it reliably passes any background exceptions to crash. And AFAICT the only case where it could be cancelled from outside is if one of our TrioFixture.run raised an exception, in which case you'd see that exception. So we're looking at fixtures that open their own nurseries (perhaps hidden inside some async context manager that came from a library).

In any case where this problem happens, the first step is for the fixture yield to be cancelled. The only ways this can happen are if some code inside the fixture explicitly calls .cancel() (like in silly_fixture), or else there's an unhandled exception that hits a nursery. If there's an unhandled exception, either it gets caught explicitly (like in myfixture), or else it gets accidentally swallowed like in @touilleMan's die_soon fixture, which requires an exception handler be cancelled.

So, this reduces to the previous case: for an exception to be accidentally swallowed, you either need an explicit call to .cancel(), or else you need a second unhandled exception. The case where one unhandled exception wipes out another one is important, but it does still result in the test reporting an exception, and in the multierror v2 design we have an idea for how to at least report the lost exception as a __context__ on the surviving exception.

So... conclusion: in order to get into the really confusing case, I think you need to either explicitly call .cancel() on a scope that's still running, or else explicitly catch an exception like in the myfixture example.

And, it's not just any cancel() that can cause problems: if the cancel() is the very last thing inside the cancel scope/nursery, then it's safe. (So for example, the nursery fixture is fine, and I think all the nurseries used implicitly inside trio-websocket are fine.)
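
To illustrate what I mean by "the very last thing" (hypothetical helpers, just showing the two shapes rather than a full reproduction of the bug):

import trio
from contextlib import asynccontextmanager

@asynccontextmanager
async def some_async_resource():
    # Hypothetical resource whose teardown needs to await something.
    try:
        yield
    finally:
        await trio.sleep(0)

async def background_task():
    await trio.sleep_forever()

async def risky_shutdown():
    # Risky shape: cancel() is not the last thing inside the nursery block,
    # so some_async_resource's teardown still runs inside the cancelled
    # scope, and an exception unwinding through it at that point can be
    # replaced by Cancelled and lost.
    async with trio.open_nursery() as outer:
        async with some_async_resource():
            outer.start_soon(background_task)
            outer.cancel_scope.cancel()

async def safe_shutdown():
    # Safe shape: cancel() is the very last statement directly inside the
    # nursery block, so nothing but the nursery's own exit runs in the
    # cancelled scope, and background exceptions still get reported.
    async with some_async_resource():
        async with trio.open_nursery() as outer:
            outer.start_soon(background_task)
            outer.cancel_scope.cancel()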

So I guess the general advice is:

  • Either you're explicitly catching and throwing away some exception; don't do that
  • Or you're cancelling a nursery while it's still running; maybe you should move the cancel to the end of the nursery instead.

Does that sound right?

@njsmith
Member

njsmith commented Jun 7, 2019

@njsmith

Or you're cancelling a nursery while it's still running; maybe you should move the cancel to the end of the nursery instead.

@touilleMan

To do that, the core is created by a context manager which ends with a scope cancellation to stop the monitor coroutines.

Well, score one for logic I guess :-). This context manager indeed does an explicit cancel, but not at the end of the nursery:

https://github.com/Scille/parsec-cloud/blob/07ebd60ec4de94f6073d5640ea35974a50524394/parsec/core/logged_core.py#L102

So I predict that if you remove 4 levels of indentation from that last line, then it will fix your problem. (I guess you might also need to add a monitor.nursery.cancel_scope.cancel() at the appropriate level too?)

@touilleMan
Member Author

@njsmith I haven't read your answer yet, but seeing such a big answer 3 minutes after my post feels like you are Batman 🦇 😄

@touilleMan
Member Author

So I predict that if you remove 4 levels of indentation from that last line, then it will fix your problem. (I guess you might also need to add a monitor.nursery.cancel_scope.cancel() at the appropriate level too?)

I've just tested it, and indeed your prediction was true ;-)

On top of that, your explanation really helped me understand the logic of this swallowed-by-cancelled exception.
Now it seems totally logical to me to put the cancel() at the very end of the nursery context manager. However, this is still incredibly misleading (one indentation level seems so harmless...), and I can't see how to make it less likely to shoot yourself in the foot.

@njsmith
Member

njsmith commented Jun 7, 2019

@touilleMan I wish! But it took me like an hour to write, I just started sooner :-)

@njsmith
Member

njsmith commented Jun 8, 2019

OK, so we definitely understand the problem better than we did before, that's good. And it seems like we can at least say it would be better if pytest-trio detected when a fixture called crash(None) and then never followed up with a crash(actual_exception). Maybe make it crash(unique_fixture_id, <exc or None>), so it can track this? That would at least make the problem clearer when it happens, and for the blatantly silly cases (like silly_fixture) I don't think there's anything better we can hope to do.

But you're also right that this is a super subtle footgun.

I've been trying to figure out if this can happen when your code is run outside pytest-trio. I think the answer is that there is one rare situation where if you get unlucky enough, you can lose background exceptions. Consider what happens if a background task crashes just as the body of the async context manager finishes and your yield returns. Something like:

async with logged_core_factory(...):
    await checkpoint()
  1. The checkpoint suspends the task, checks for cancellation, then schedules it to run again later.
  2. Then the scheduler switches to your monitor_backend_connection task, which exits with an exception. (I assume that something like that is happening in your actual test cases, just not at that exact moment?) This marks the monitor_nursery as cancelled, and we make a note to inject a Cancelled exception at the next chance we get.
  3. But before that, we run the main task again, and it returns from the checkpoint (which already checked for cancellation, it's done with that), then exits the async with logged_core_factory block.
  4. So the main task goes back into the body of logged_core_factory and executes root_nursery.cancel_scope.cancel().

And now your monitor_backend_connection exception is going to be lost.

Now... in your application, I don't know if this is a problem or not. Once the body of the async with logged_core_factory exits, I guess you want to cancel those background tasks like monitor_backend_connection, and if that cancellation kills them while they were in the process of raising an exception, or just about to raise an exception, then maybe that's fine, you don't care about that anymore because you're exiting.

What makes the pytest-trio case so weird is that the background task exception causes the body of the async with logged_core_factory to exit early by cancelling the inner nursery, but pytest-trio suppresses the exception, so the outer nursery still gets cancelled, which hides the original exception.

.....I really don't know what to do about this! You're totally right about this part:

The more I think about this issue, the more I wonder about this "fixture never raises exceptions" policy. This is useful to make trio fixtures work like regular pytest fixtures, but I don't see a good reason for pytest to do this in the first place (I guess the reason is that teardown should happen no matter the outcome of the test; however, that's what try/finally is for, and Python is all about consenting adults...).
So wouldn't it be better to just have trio fixtures raise the exception?

If pytest fixtures had followed the usual @contextmanager convention, then this wouldn't be an issue. But... I don't think there's anything we can do about it now. We can't just unilaterally declare that whenever pytest-trio is loaded, all pytest fixtures have different semantics, because people have existing fixtures that put teardown logic outside of finally blocks, and they'll break if the yield can start raising. We could declare that trio fixtures can raise Cancelled, while regular fixtures can't. But then people have to remember that some fixtures have one semantics, and some have the other semantics, and that seems confusing too. That's why I kept the pytest fixture semantics in the first place. But I hadn't realized then how easy it would be for people to write fixtures like

@fixture
async def my_fixture():
    async with my_existing_context_manager:
        yield

and now we're basically taking the weird pytest fixture semantics and forcing them onto my_existing_context_manager, which its authors might not be expecting at all.

Maybe it wouldn't be so bad if literally the only exception that can be raised from a fixture yield is Cancelled, and only when the fixture has a nursery and a background task and the background task crashed?

Can you look at your fixtures and see if there are any that rely on running code after yield, even if a background task crashes?

Other than that... I think the only option is to document it. You do seem to need a pretty specific combination of factors to trigger this, so I guess if the error message pointed people to a detailed writeup of what to look for, then most people would be able to find the cause pretty quickly?

@njsmith
Member

njsmith commented Jun 23, 2019

I think this can be closed now that #83 is merged... please re-open if I'm wrong
