Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reverted: Increase stream-cancellation-delay default to 1000 millis #590

Merged

Conversation

mdedetrich
Copy link
Contributor

@mdedetrich mdedetrich commented Sep 4, 2024

References: #422

Still need to update tests as per @jrudolph original comment, i.e.

That broke lots of tests that probably didn't automatically propagate closing so they now fail while waiting for the now ten times longer cancellation times... (Might still might sense to "just" fix those tests)

@pjfanning
Copy link
Contributor

Reading the Akka issue and PR seems to indicate that this will cause test failures. I can help if needed to try to fix up the test issues. I'm hoping to get Pekko HTTP 1.1.0 released but it seems useful to fix this.

@mdedetrich
Copy link
Contributor Author

mdedetrich commented Sep 4, 2024

Reading the Akka issue and PR seems to indicate that this will cause test failures. I can help if needed to try to fix up the test issues. I'm hoping to get Pekko HTTP 1.1.0 released but it seems useful to fix this.

The test failures are expected because some of the tests expect the stream to be cancelled within 100ms and now that is has been increased to 1000ms those tests are now broken (as per @jrudolph 's comment, the change itself is safe).

@mdedetrich
Copy link
Contributor Author

Ill have time to look into this tonight

@JD557
Copy link

JD557 commented Sep 4, 2024

Should this actually be marked as closing the issue?
It's better than the current status quo, but the underlying issue is still there.

For reference, we've used 1s for a long time with no issue until today, where we saw this resurface again. So 1s is not bullet proof.

@mdedetrich
Copy link
Contributor Author

Should this actually be marked as closing the issue? It's better than the current status quo, but the underlying issue is still there.

For reference, we've used 1s for a long time with no issue until today, where we saw this resurface again. So 1s is not bullet proof.

Fair point, ill change this so it won't close original issue

@mdedetrich mdedetrich force-pushed the increase-stream-cancellation-delay-default branch 2 times, most recently from fe0067b to 3998b45 Compare September 4, 2024 21:51
@mdedetrich mdedetrich force-pushed the increase-stream-cancellation-delay-default branch from 3998b45 to 5bdce88 Compare September 4, 2024 22:05
@mdedetrich mdedetrich marked this pull request as ready for review September 4, 2024 22:18
@mdedetrich
Copy link
Contributor Author

PR is ready, I managed to fix the issues with the tests.

@@ -6,4 +6,5 @@ pekko {
default-dispatcher.throughput = 1
}
stream.materializer.debug.fuzzing-mode = off
stream.testkit.all-stages-stopped-timeout = 20 s
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This timeout is related to Utils.assertAllStagesStopped which is designed to check that all stream stages finish when the test does. Since stream-cancellation-delay creates a stage in of itself this needs to be increased.

To me it makes sense to do this globally (for tests) rather than on a test by test basis because there is no issue in having it too long as it only increases the failure test case, not the positive.

The default value of this 5 s

Copy link
Contributor

@jrudolph jrudolph Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot remember the reason to choose 100ms for the delay cancellation limit. Having to increase the all-stages-stopped-timeout to higher values is somewhat of an indication of what kind of real world implications this change might have: cleanup of broken/closed connections might now take longer.

IIRC, the whole reason to introduce the delay was to deal with the fact that stream cancellation could not be well-configured in Akka 2.5. With 2.6, it's not 100% clear if the delay is even still needed or whether it could be solved in a better way (i.e. by resolving the cancellation race more intelligently, taking the cancellation reason into account at the right/more places) to avoid symptoms as shown in the ticket.

Copy link
Contributor Author

@mdedetrich mdedetrich Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot remember the reason to choose 100ms for the delay cancellation limit. Having to increase the all-stages-stopped-timeout to higher values is somewhat of an indication of what kind of real world implications this change might have: cleanup of broken/closed connections might now take longer.

iirc its only in a few tests which used Utils.assertAllStagesStopped started timing out so I can granularly update those tests. My ultimate reasoning for making this change global is was to reduce surprises for people writing future tests (i.e. giving more QoL to developers), more specifically avoiding the "I just updated/wrote a test and now its failing due to a finely tuned timeout and now I have to figure out that it was because of that said timeout"

I think that as long as you handle shutdown of pekko http gracefully I don't think it should cause issues in prod however I admit that many people may not even be handling shutdowns of pekko http correctly.

IIRC, the whole reason to introduce the delay was to deal with the fact that stream cancellation could not be well-configured in Akka 2.5. With 2.6, it's not 100% clear if the delay is even still needed or whether it could be solved in a better way (i.e. by resolving the cancellation race more intelligently, taking the cancellation reason into account at the right/more places) to avoid symptoms as shown in the ticket.

If this is the case then it should definitely be looked into, I am not aware of the updates to stream cancellation in akka 2.6 but judging by the date of the report at #422 (comment), if that stream cancellation update is now the default it didn't seem to fix the underlying issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My ultimate reasoning for making this change global is was to reduce surprises for people writing future tests (i.e. giving more QoL to developers), more specifically avoiding the "I just updated/wrote a test and now its failing due to a finely tuned timeout and now I have to figure out that it was because of that said timeout"

I agree with the goal of helping test writers but it's not good if tests only succeed or fail after a long timeout. Event the old 2 second timeout might have been somewhat of a test fixing bankruptcy issue where we didn't have the time to make sure all tests run through in a timely fashion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the goal of helping test writers but it's not good if tests only succeed or fail after a long timeout. Event the old 2 second timeout might have been somewhat of a test fixing bankruptcy issue where we didn't have the time to make sure all tests run through in a timely fashion.

Should I go through and undo this global timeout increase and only apply it to the specific tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a nice middle ground would be to increase the timeout in the configuration for the test classes that are affected? Not as neat as specifying it for an individual test, but more granular than applying it globally.

Copy link
Contributor Author

@mdedetrich mdedetrich Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats definitely true, however its usually only one specific test per test classes that requires an override (as an example with WebSocketIntegrationSpec its only send back 100 elements and then terminate without error even when not ordinarily closed that even needs the override.

Basically its only tests which explicitly cancel some part of request/response as part of what is being tested and given how big test suites normally are in pekko-http it might not be enough of a difference to be useful.

@@ -306,6 +306,7 @@ class H2SpecIntegrationSpec extends PekkoFreeSpec(
executable,
"-k", "-t",
"-p", port.toString,
"-o", "9",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://github.com/summerwind/h2spec?tab=readme-ov-file#usage for reference. The default value is 2 seconds, I got to 9 with manual testing by incrememting 1 each time.

Copy link
Contributor

@pjfanning pjfanning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mdedetrich
Copy link
Contributor Author

@pjfanning Feel free to merge when you see fit

@pjfanning pjfanning merged commit 932a22f into apache:main Sep 4, 2024
10 checks passed
@mdedetrich mdedetrich deleted the increase-stream-cancellation-delay-default branch September 4, 2024 22:27
pjfanning added a commit that referenced this pull request Sep 12, 2024
pjfanning added a commit that referenced this pull request Sep 21, 2024
@pjfanning pjfanning removed this from the 1.1.0 milestone Sep 21, 2024
@pjfanning pjfanning changed the title Increase stream-cancellation-delay default to 1000 millis Reverted: Increase stream-cancellation-delay default to 1000 millis Sep 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants