
Fix yielding in progress loop #278

Merged (3 commits) on Aug 1, 2024

Conversation

mdorier
Contributor

@mdorier mdorier commented Jul 31, 2024

This PR fixes an issue that appears when we have two pools, with the progress loop running on the second one. If the progress loop is the only ULT in its pool, it will never yield, and the scheduler never gets a chance to check the first pool again, even if that pool has ULTs ready to run.

The fix adds a call to ABT_thread_yield right after HG_Progress, so that the scheduler gets a chance to take over.

Note that while this fixes the starvation problem, it highlights another potential issue (performance-related, this time): if the progress loop is alone in its pool, HG_Progress will be called with a timeout of 100 ms, even though ideally we would want to call it with a timeout of 0 so that we can immediately yield and check other pools. We could alert the user to this behavior by displaying a warning when the progress pool shares an execution stream with other pools.


codecov bot commented Jul 31, 2024

Codecov Report

Attention: Patch coverage is 81.25000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 57.26%. Comparing base (4cdf588) to head (4fa9572).

Files Patch % Lines
src/margo-core.c 40.00% 2 Missing and 1 partial ⚠️
tests/unit-tests/margo-init.c 88.88% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #278      +/-   ##
==========================================
+ Coverage   57.21%   57.26%   +0.04%     
==========================================
  Files          69       69              
  Lines        9740     9762      +22     
  Branches     1256     1256              
==========================================
+ Hits         5573     5590      +17     
- Misses       3399     3404       +5     
  Partials      768      768              


@carns
Member

carns commented Jul 31, 2024

Thanks, will look at this soon; just wanted to comment quickly on the discussion. If the progress loop is configured to run on an ES that must multiplex between two pools, then the only choice is between two (not ideal) configurations: a) setting progress_timeout_ub_msec to 0, as you pointed out, to avoid 100 ms stalls, but in exchange making the progress loop busy-spin so that it consumes 100% CPU even when margo is idle; or b) leaving the default value as is and accepting that the other pool will only get occasional attention. Neither is a very desirable configuration.

There are definitely use cases for taking a busy spin approach, but that can be accomplished in any configuration by setting the progress timeout; it doesn't require setting up multiple pools to share an execution stream.
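For reference, option a) corresponds to setting the knob named above in margo's JSON configuration. A minimal fragment might look like this (the surrounding structure is assumed and may vary by margo version; only the key itself is taken from the discussion above):

```json
{
    "progress_timeout_ub_msec": 0
}
```

With this setting HG_Progress returns immediately, so the progress loop busy-spins but yields often enough for other pools sharing the same ES.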

@mdorier mdorier requested a review from carns July 31, 2024 12:47
@mdorier
Contributor Author

mdorier commented Jul 31, 2024

In #279 I have added a warning that the configuration will cause unnecessary delays if any pool is sharing an ES with the progress pool. If that is something a user is comfortable with (e.g. if they have set the progress timeout to 0 to busy-spin), they can simply ignore the warning. I prefer this to the solution of automatically setting the timeout to 0 when another pool is used by the same ES.

- remove redundant yield that occurs after progress in some conditions
- refactor logic slightly for calculating timeout so that there is
  only one progress call instead of a special path for zero timeout
- move yield from within internal progress call to main progress loop,
  but after triggers, to avoid yielding when the progress loop has
  completions ready to handle
@carns
Member

carns commented Jul 31, 2024

Check my revision and see if it makes sense. The tests still pass for me with this configuration.

I started to comment that there was a superfluous yield call one level up the call stack in some cases, but then I realized we could probably consolidate the calls a little to help avoid that sort of thing, and I also suggested putting the remaining yield after trigger instead of after progress (mostly to avoid splitting up the progress/trigger path when there are completions to handle).

Member

@carns carns left a comment


committed a suggested modification

@mdorier
Contributor Author

mdorier commented Aug 1, 2024

Looks good to me. I'm merging.

@mdorier mdorier merged commit d480bca into main Aug 1, 2024
7 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 1, 2024