
Fix yielding in progress loop #278

Merged (3 commits) on Aug 1, 2024

Conversation

mdorier
Contributor

@mdorier mdorier commented Jul 31, 2024

This PR fixes an issue that appears when we have two pools, with the progress loop running on the second one. If the progress loop is the only ULT in its pool, it will never yield, and the scheduler never gets a chance to check the first pool again, even if that pool has ULTs ready to run.

The fix adds a call to ABT_thread_yield right after HG_Progress, so that the scheduler gets a chance to take over.

Note that while this fixes the starvation problem, it highlights another potential issue (performance-related, this time): if the progress loop is alone in its pool, HG_Progress will be called with a timeout of 100 ms, even though ideally we would want to call it with a timeout of 0 so that we can immediately yield and check other pools. We could alert the user to this behavior by displaying a warning when the progress pool shares an execution stream with other pools.


codecov bot commented Jul 31, 2024

Codecov Report

Attention: Patch coverage is 81.25000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 57.26%. Comparing base (4cdf588) to head (4fa9572).

Files Patch % Lines
src/margo-core.c 40.00% 2 Missing and 1 partial ⚠️
tests/unit-tests/margo-init.c 88.88% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #278      +/-   ##
==========================================
+ Coverage   57.21%   57.26%   +0.04%     
==========================================
  Files          69       69              
  Lines        9740     9762      +22     
  Branches     1256     1256              
==========================================
+ Hits         5573     5590      +17     
- Misses       3399     3404       +5     
  Partials      768      768              


@carns
Member

carns commented Jul 31, 2024

Thanks, will look at this soon; just wanted to comment quickly on the discussion. If the progress loop is configured to run on an ES that must multiplex between two pools, then the only choice is between two (not ideal) configurations: a) setting progress_timeout_ub_msec to 0, as you pointed out, to avoid 100 ms stalls, but in exchange making the progress loop busy-spin so that it consumes 100% CPU even when margo is idle; or b) leaving the default value as is and accepting that the other pool will only get occasional attention. Neither is a very desirable configuration.

There are definitely use cases for taking a busy spin approach, but that can be accomplished in any configuration by setting the progress timeout; it doesn't require setting up multiple pools to share an execution stream.
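For reference, option a) corresponds to setting the knob named above in margo's JSON configuration. A minimal fragment might look like this (the surrounding structure is assumed and may vary by margo version; only the key itself is taken from the discussion above):

```json
{
    "progress_timeout_ub_msec": 0
}
```

With this setting HG_Progress returns immediately, so the progress loop busy-spins but yields often enough for other pools sharing the same ES.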

@mdorier mdorier requested a review from carns July 31, 2024 12:47
@mdorier
Contributor Author

mdorier commented Jul 31, 2024

In #279 I have added a warning that the configuration will cause unnecessary delays if any pool is sharing an ES with the progress pool. If that is something a user is comfortable with (e.g. if they have set the progress timeout to 0 to busy-spin), they can simply ignore the warning. I prefer this to the solution of automatically setting the timeout to 0 when another pool is used by the same ES.

- remove redundant yield that occurs after progress in some conditions
- refactor logic slightly for calculating timeout so that there is
  only one progress call instead of a special path for zero timeout
- move yield from within internal progress call to main progress loop,
  but after triggers, to avoid yielding when the progress loop has
  completions ready to handle
@carns
Member

carns commented Jul 31, 2024

Check my revision and see if it makes sense. The tests still pass for me with this configuration.

I started to comment that there was a superfluous yield call one level up the call stack in some cases, but then I realized we could probably consolidate the calls a little to help avoid that sort of thing, and I also suggested putting the remaining yield after trigger instead of after progress (mostly to avoid splitting up the progress/trigger path when there are completions to handle).

Member

@carns carns left a comment


committed a suggested modification

@mdorier
Contributor Author

mdorier commented Aug 1, 2024

Looks good to me. I'm merging.

@mdorier mdorier merged commit d480bca into main Aug 1, 2024
7 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 1, 2024