Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: addresses event completion race condition in pubsub test utility #2332

Merged
merged 7 commits into from
Aug 15, 2024

Conversation

jonathanj-square
Copy link
Contributor

fixes: #2266

The race conditions was caused by two issues.

  1. The WaitGroup down tick was not synchronized with the completion of the subscriber verb execution. Failure to wait for completion may result in a down time coming before a new event is dispatched (which ticks up the waitgroup) - which may result in the WaitGroup reaching zero too soon.
  2. Non-linear pubsub networks are not supported by simply counting up when an event is published and down after it is completed. The up count needs to match the number of live subscriptions (e.g. subscriptions with registered subscribers)

This synchronization scheme does not cover asynchronous event dispatch (e.g. events dispatched via go routines within the subscriber verb) such scenarios introduce a race condition that cannot be resolved by black box external synchronization.

… given the underlying issue is topology sensitive
…ce condition.

issue #1
the waitgroup used to await event processing completion is not synchronized with event handling completion. The waitgroup has its count synchronously incremented before the event gets published but the down tick is not explicitly synchronized with handler completion. The down tick occurs in `handlePubSubEvent` in response to a `publishEvent` signal - this signals the intention to distribute an event on a topic not the completion. The `subscriptionDidConsumeEvent` is the right event to target. Given the event is dispatched via a goroutine the race condition may or may not trigger failure depending on competition execution timings.

issue #2
event synchronization needs to consider the subscription topology in order to account for non-linear event propagation networks

remaining issue
synchronous event propagation from the subscriber verb is covered but not asynchronous propagation (e.g. a scenario where a verb spawns a goroutine that delays before writing an event to another topic might fire the event after the waitgroup reaches zero)
@jonathanj-square jonathanj-square requested review from a team and safeer and removed request for a team August 12, 2024 21:21
@ftl-robot ftl-robot mentioned this pull request Aug 12, 2024
Copy link
Contributor

@safeer safeer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 12, 2024
@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 13, 2024
@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 13, 2024
@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 13, 2024
@alecthomas alecthomas requested a review from matt2e August 13, 2024 02:04
Copy link
Collaborator

@matt2e matt2e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 13, 2024
@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 13, 2024
@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 13, 2024
@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 13, 2024
@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 14, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 15, 2024
@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 15, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 15, 2024
@jonathanj-square jonathanj-square added this pull request to the merge queue Aug 15, 2024
Merged via the queue into main with commit c6d889c Aug 15, 2024
17 checks passed
@jonathanj-square jonathanj-square deleted the jonathanj/otel/pubsubtest-race-condition branch August 15, 2024 03:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ftltest race condition when waiting for pubsub to complete
3 participants