Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better object lifetime management in margo #285

Merged
merged 9 commits into from
Aug 22, 2024

Conversation

mdorier
Copy link
Contributor

@mdorier mdorier commented Aug 16, 2024

This PR does the following:

  • Add a collection of functions to increment and decrement a refcount on margo-managed pools and xstreams; this allows margo_remove_pool/xstream to avoid removing a pool/xstream that could be in used at an upper level.
  • Add generic (C11) versions of many functions have been added to aggregate _by_index, _by_name, and _by_handle functions.
  • Change the way Argobots and Mercury are finalized: in margo_cleanup, instead of fully finalizing Mercury, we finalize everything except the HG class. This should ensure that RPC handlers complete, as before, but leaves the HG class available for freeing any addresses the user may still have. Mercury is then fully finalized (including HG class) at the end of margo_cleanup if the refcount is 0. If the refcount is not 0, it means Thallium (or another higher-level code) is holding onto the margo instance and will need it to at least destroy Argobots constructs and Mercury addresses.

This PR should make it easier to track who is using the margo instance in Thallium, allowing engine (and any other object that needs the margo instance) to increment and decrement its refcount, preventing it from being freed too early.

Copy link

codecov bot commented Aug 16, 2024

Codecov Report

Attention: Patch coverage is 69.02655% with 105 lines in your changes missing coverage. Please review.

Project coverage is 57.64%. Comparing base (eb83582) to head (394b9f2).
Report is 2 commits behind head on main.

Files Patch % Lines
src/margo-config.c 57.33% 45 Missing and 51 partials ⚠️
src/margo-core.c 76.92% 1 Missing and 5 partials ⚠️
src/margo-abt-config.c 62.50% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #285      +/-   ##
==========================================
+ Coverage   57.51%   57.64%   +0.12%     
==========================================
  Files          70       70              
  Lines        9910    10150     +240     
  Branches     1282     1337      +55     
==========================================
+ Hits         5700     5851     +151     
- Misses       3422     3464      +42     
- Partials      788      835      +47     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mdorier mdorier changed the title WIP: some modifications for better lifetime management in margo Better object lifetime management in margo Aug 16, 2024
@mdorier mdorier requested a review from carns August 16, 2024 15:17
@carns
Copy link
Member

carns commented Aug 16, 2024

I haven't reviewed the code yet, but just for discussion since (to my knowledge) we haven't had a dependency on c11 features before: are there any portability concerns?

@mdorier
Copy link
Contributor Author

mdorier commented Aug 16, 2024

_Generic is available in gcc since version 4.9. I suspect clang has been supporting it for just as long. Technically margo has already been using C11 for a while now, with _Atomic and bool.

@mdorier
Copy link
Contributor Author

mdorier commented Aug 16, 2024

I realize I will need to fix something: I thought delaying freeing the hg class would be enough but actually I want to also delay freeing the hg context until the end, in case the user still has RPC ids to deregister.

I'll also change "refincr" and "refdecr" to "ref_incr" and "release" for consistency with the ref count methods of the Margo instance.

@mdorier
Copy link
Contributor Author

mdorier commented Aug 17, 2024

Ok I think I'm done with this PR, here are the highlights:

With respect to margo instance ref-counting

  • When creating a margo instance, its refcount is 1;
  • The refcount is decreased by 1 after joining the progress thread in margo_finalize;
  • margo_cleanup will be called by margo_finalize/margo_wait_for_finalize/margo_finalize_and_wait only if the refcount is 0;
  • The effect of the above is that nothing changes for existing code: margo_init initializes the instance, and margo_finalize/margo_wait_for_finalize/margo_finalize_and_wait will free it;
  • margo_instance_ref_incr/release respectively increment/decrement the refcount;
  • margo_instance_release will call margo_cleanup if the refcount has gone to 0, unless the margo instance hasn't been finalized, in which case it calls margo_finalize (which itself will cause a margo_cleanup).

The effect of this is that we now have a proper refcount that we can use in Thallium and PyMargo. The engine class in Thallium can use ref_incr/release in the copy constructor, and we can build multiple engine instances wrapping the same margo instance, they will all be aware of the number of engines referencing the instance.

Another effect is that we can now do something like this:

mid = margo_init(...);
margo_instance_ref_incr(mid);
...
margo_finalize(mid);
/* here Argobots is still initialized, Mercury class and context are still available, so we can cleanup mutex, hg_addr_t, etc. with the need to install a finalize callback via margo_push_finalize_callback */
margo_instance_release(mid);
/* here the instance is actually gone */

With respect to re-counting of pools and xstreams managed by margo

  • This PR adds a bunch of functions to increase/decrease the refcount, and margo respects that refcount when a user requests the removal of a pool or xstream. Trying to remove a pool or xstream that has a refcount != 0 will result in HG_PERMISSION error.

This API will allow Bedrock to delegate the refcounting to Margo. Currently, Bedrock maintains a copy of the list of pools and xstreams just for the purpose of refcounting and preventing the removal of pools and xstreams that are used by providers. We can now leave that refcounting to the providers themself.

Testing

Tests have been updated.

Copy link
Member

@carns carns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to say that it was odd that the counterpart to ref_incr was release (rather than ref_decr), but then I realized that we had already started that convention with the instance functions, so we should stick with it.

It's too bad that it takes so many API calls but I don't see any way to avoid it. The code looks good to me.

@mdorier mdorier merged commit a840d90 into main Aug 22, 2024
7 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 22, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants