
Thread-safe access to graph cache #258

Merged
14 commits merged into rolling from yadu/raii-context on Sep 27, 2024
Conversation

@Yadunund (Member) commented Aug 6, 2024

Step 1 in addressing #249.

This PR updates the rmw_context_impl_s class to accurately manage the lifetime of its members while ensuring thread-safe data access. For starters, it ensures that Graph Cache updates and lookups are thread-safe.

In subsequent PRs, I will update the rmw_context_impl_s class to store rmw_publisher_data_t, etc., with member functions to manage and access their functionality.
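For context, a minimal sketch of the locking pattern this step introduces, as an editor's illustration with simplified stand-in names (ContextImplSketch, graph_put, graph_del) rather than the actual code in this PR: the context impl owns the graph cache and serializes every update and lookup behind a single mutex.

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Stub standing in for rmw_zenoh_cpp's GraphCache; the real class parses
// liveliness key expressions into a graph of ROS entities.
class GraphCache
{
public:
  void parse_put(const std::string & keyexpr) {tokens_.insert(keyexpr);}
  void parse_del(const std::string & keyexpr) {tokens_.erase(keyexpr);}

private:
  std::unordered_set<std::string> tokens_;
};

// The pattern: one mutex in the context impl guards every access to the
// graph cache, so concurrent updates and lookups cannot race.
class ContextImplSketch
{
public:
  void graph_put(const std::string & keyexpr)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    graph_cache_.parse_put(keyexpr);
  }

  void graph_del(const std::string & keyexpr)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    graph_cache_.parse_del(keyexpr);
  }

private:
  std::mutex mutex_;
  GraphCache graph_cache_;
};
```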

@Yadunund requested a review from clalancette on August 6, 2024 00:36
@Yadunund changed the title from "Make rmw_context_impl_s an RAII class" to "Thread-safe access to graph cache" on Aug 6, 2024
@Yadunund force-pushed the yadu/raii-context branch from 03c4ee1 to a9b34e2 on August 6, 2024 17:12
@clalancette (Collaborator) left a comment

I've left a few things to think about.

Review threads: rmw_zenoh_cpp/src/rmw_init.cpp (outdated, resolved); rmw_zenoh_cpp/src/rmw_zenoh.cpp (resolved)
@Yadunund force-pushed the yadu/raii-context branch from eb936da to f6d3ab1 on August 7, 2024 20:09
@MichaelOrlov commented:
@Yadunund Friendly ping to follow up on this issue.

@MichaelOrlov commented:
Discussion from maintenance triage: Decided to assign this issue to @Yadunund

@clalancette (Collaborator) commented:

rmw_context_impl_s::publish() etc

I think this is what I'm struggling with most in this change. It just doesn't seem right to me to be having one giant class that encapsulates all functionality of the RMW. It does fix the locking problem, but it doesn't seem very elegant.

If we ignore the threading problem for the moment, in my ideal world we'd have a ContextImpl class at the top-level. During rmw_create_node, we'd call ContextImpl::create_node(), which would return a Node object. During rmw_create_publisher, we'd call Node::create_publisher(), which would return a Publisher object. During rmw_publish, we'd call Publisher::publish(). (There would be similar classes for Subscription, ServiceClient, and ServiceServer). And the GraphCache would be embedded inside of the ContextImpl (since that is indeed a session-wide entity). In that class hierarchy, the functionality for each of these entities is encapsulated into its own class, and further they don't have to share one giant lock. This is, incidentally, how zenoh-cpp works (though there is no Node entity there, so you ask the Session to create a Publisher object).

Now we have to think about the locking. If we just did the above, it wouldn't be much different from what we have today, and it wouldn't fix our locking problems. So what we need is for each entity to be able to query its "parent" on whether it is still alive. For instance, in rmw_publish, we'd have to ask the Node whether the Publisher is still alive. But before that, we have to actually ask the ContextImpl whether the Node is still alive. (There's an argument to be made here that we have a similar problem with ContextImpl, but I'm not sure we can solve that with the current RMW API).

Thus, the Node class would keep a list of active Publisher, Subscription, ServiceServer, and ServiceClient objects. The ContextImpl class would keep a list of active Node objects. And the opaque data (rmw_publisher->data, etc) that we return from rmw_create_publisher (and siblings) would be a pointer to the ContextImpl, some kind of identifier for the Node, and some kind of identifier for the Publisher. Then we'd have enough information to walk the entire hierarchy and determine whether the thing we want is still alive. During an rmw_destroy_publisher, we'd remove it from the Node list, and thus further calls wouldn't work. rmw_destroy_subscription is a bit harder, because a callback may race here; we'd have to make the callback ask the node whether the subscription is still alive at this point.

I admit I haven't totally thought through all of the implications here. But I think this proposal creates a class hierarchy, but also allows us to add in additional checking/locking. What do you think of the general idea?
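To make the proposal concrete, here is a hypothetical sketch of one level of that hierarchy; all names (Node, Publisher, OpaquePublisherData, sketch_rmw_publish) are invented for illustration and are not actual rmw_zenoh_cpp types. The Node owns its Publishers and answers the "is it still alive?" query, and the opaque rmw data carries identifiers rather than a bare pointer to the entity itself.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>

class Publisher
{
public:
  void publish() {/* put the serialized message on the zenoh session */}
};

class Node
{
public:
  std::size_t add_publisher()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    const std::size_t id = next_id_++;
    publishers_[id] = std::make_shared<Publisher>();
    return id;
  }

  // The "is it still alive?" query: an empty pointer means destroyed.
  std::shared_ptr<Publisher> find_publisher(std::size_t id)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = publishers_.find(id);
    return it != publishers_.end() ? it->second : nullptr;
  }

  void remove_publisher(std::size_t id)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    publishers_.erase(id);  // later lookups see "not alive"
  }

private:
  std::mutex mutex_;
  std::size_t next_id_ = 0;
  std::unordered_map<std::size_t, std::shared_ptr<Publisher>> publishers_;
};

// Sketch of what rmw_publish would do with the opaque data. In the full
// proposal it would first ask the ContextImpl whether the node is alive;
// a raw Node pointer is used here only to keep the example short.
struct OpaquePublisherData
{
  Node * node;
  std::size_t publisher_id;
};

bool sketch_rmw_publish(OpaquePublisherData * data)
{
  auto pub = data->node->find_publisher(data->publisher_id);
  if (!pub) {
    return false;  // publisher (or its node) was already destroyed
  }
  pub->publish();
  return true;
}
```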

@Yadunund (Member, PR author) commented:

> [@clalancette's comment above, quoted in full]

Thanks for the additional thoughts here. I agree such a hierarchy would be ideal. What I'm not clear about is whether Node will have an API to return a Publisher::SharedPtr. My only concern is that we need to ensure that the only thing rmw_destroy_publisher does is remove the Publisher::SharedPtr from the container in which it is stored within Node; the destructor of Publisher then takes care of all the cleanup. That way, even if rmw_publish obtained this Publisher::SharedPtr from Node and is keeping it alive while rmw_destroy_publisher is invoked in a separate thread, the Publisher's destructor will only run after rmw_publish returns. The same would hold if Node is destroyed while rmw_publish is being invoked.

In any case, it seems this is a bridge we will need to cross only after we merge this PR, perhaps when addressing #259. Should we move the discussion there, since the changes here simply make rmw_context_impl_s a concrete class with private data members?
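A self-contained example of the lifetime argument above (editor's sketch with hypothetical names): "destroy" only erases the shared_ptr from the container, so a concurrent publish call that already holds a reference keeps the object alive, and the destructor runs only once that last reference is released.

```cpp
#include <chrono>
#include <cstdio>
#include <memory>
#include <mutex>
#include <thread>
#include <unordered_map>

struct Publisher
{
  ~Publisher() {std::puts("Publisher destroyed");}
  void publish()
  {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    std::puts("published");
  }
};

std::mutex g_mutex;
std::unordered_map<int, std::shared_ptr<Publisher>> g_publishers =
{{0, std::make_shared<Publisher>()}};

std::shared_ptr<Publisher> find_publisher(int id)
{
  std::lock_guard<std::mutex> lock(g_mutex);
  auto it = g_publishers.find(id);
  return it != g_publishers.end() ? it->second : nullptr;
}

int main()
{
  std::thread publish_thread(
    [] {
      // rmw_publish: grab the shared_ptr, then publish outside the lock.
      if (auto pub = find_publisher(0)) {
        pub->publish();  // object stays alive even if erased concurrently
      }
    });
  std::thread destroy_thread(
    [] {
      // rmw_destroy_publisher: only erase from the container.
      std::lock_guard<std::mutex> lock(g_mutex);
      g_publishers.erase(0);
    });
  publish_thread.join();
  destroy_thread.join();
  // Whenever the publishing thread wins the race for the reference,
  // "Publisher destroyed" prints after "published".
  return 0;
}
```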

@Yadunund (Member, PR author) commented Sep 3, 2024

In 7cc52bd, I updated the rmw_context_impl_s class to return a shared_ptr<GraphCache>. The graph_sub_data_handler() callback should still be thread-safe with the new implementation, since the callback and the other methods that query the shared_ptr<GraphCache> lock the same mutex within GraphCache.
Hoping this addresses the concern about the rmw_context_impl_s class embedding everything within it. I will update #259 in a similar way, so that rmw_context_impl_s returns a shared_ptr<NodeData>.
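Roughly, the revised shape (an editor's sketch with simplified names, not the literal diff in 7cc52bd): GraphCache guards its own state with an internal mutex, so every holder of the shared_ptr, the liveliness-subscriber callback as well as graph queries, synchronizes on the same lock.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_set>

class GraphCache
{
public:
  // Called from the liveliness-subscriber callback.
  void parse_put(const std::string & keyexpr)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    tokens_.insert(keyexpr);
  }

  // Called from graph lookups; same mutex, so no race with parse_put().
  std::size_t entity_count() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return tokens_.size();
  }

private:
  mutable std::mutex mutex_;
  std::unordered_set<std::string> tokens_;
};

class ContextImplSketch
{
public:
  // Accessor in the spirit of the PR: callers share one GraphCache that
  // is responsible for its own synchronization.
  std::shared_ptr<GraphCache> graph_cache() {return graph_cache_;}

private:
  std::shared_ptr<GraphCache> graph_cache_ = std::make_shared<GraphCache>();
};
```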

@Yadunund requested a review from clalancette on September 4, 2024 12:28
Review threads: rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp (5 threads, outdated, resolved)
Signed-off-by: Yadunund <[email protected]>
@ahcorde (Contributor) left a comment

Some minor fixes, otherwise LGTM

Review threads: rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp (2 threads, outdated, resolved); rmw_zenoh_cpp/src/detail/rmw_context_impl_s.hpp (2 threads, outdated, resolved)
@clalancette (Collaborator) left a comment

This needs a rebase to fix conflicts, and we should fix @ahcorde's comments, but this otherwise looks good to me.

Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadunund <[email protected]>
@Yadunund merged commit 67ed661 into rolling on Sep 27, 2024
8 checks passed
@Yadunund deleted the yadu/raii-context branch September 27, 2024 17:50
@Yadunund restored the yadu/raii-context branch September 27, 2024 18:13
YuanYuYuan pushed a commit to ZettaScaleLabs/rmw_zenoh that referenced this pull request Sep 30, 2024
* Make rmw_context_impl_s an RAII class

Signed-off-by: Yadunund <[email protected]>

* fix regression with graph_guard_condition not triggering when entity is removed

Signed-off-by: Yadunund <[email protected]>

* Have the context create the zenoh artifacts

Signed-off-by: Yadunund <[email protected]>

* Add comment for session() api

Signed-off-by: Yadunund <[email protected]>

* Style

Signed-off-by: Yadunund <[email protected]>

* Add api to register querying_sub cb in rmw_context_impl_s

Signed-off-by: Yadunund <[email protected]>

* Have rmw_context_impl_s return a shared_ptr to GraphCache

Signed-off-by: Yadunund <[email protected]>

* Add todo on thread safety

Signed-off-by: Yadunund <[email protected]>

* Update rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp

Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Yadu <[email protected]>

* Address feedback

Signed-off-by: Yadunund <[email protected]>

* Do not use allocator for creating graph_guard_condition

Signed-off-by: Yadunund <[email protected]>

* Address feedback

Signed-off-by: Yadunund <[email protected]>

---------

Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadu <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>
imstevenpmwork pushed a commit to ZettaScaleLabs/rmw_zenoh that referenced this pull request Sep 30, 2024
(Same squashed commit message as in the commit above.)
@Yadunund deleted the yadu/raii-context branch September 30, 2024 17:03
clalancette added a commit that referenced this pull request Dec 6, 2024
* chore: configure the compilation

* chore: complete the 1st version

* fix: memory leak

* fix: z_error_t -> z_result_t

* Fix `scouting/*/autoconnect/*` per eclipse-zenoh/zenoh@b31a410 (#3)

* chore: checkout the local zenoh-c

* chore: polish z_open

* feat: `z_bytes_serialize_from_slice` without copy

* Initialize `query_` member of `ZenohQuery`

* refactor: use `z_owned_slice_t` instead

* chore: adapt the latest change of zenoh-c dev/1.0.0

* chore: use `strncmp` to avoid copying

* refactor: use `z_view_keyexpr_t` to avoid copying

* chore: adapt the new changes from zenoh-c and fix the bug in liveliness

* fix: segmentation fault due to the unallocated query memory

* fix: workaround the ZID parsing issue

* fix Zenoh Config read/check

* adapt to recent zenoh-c API changes

* fix: adapt the latest change of batching config

* build: deprecate the zenohc_debug and include the zenohc dependency in the zenoh_c_vendor

* Use main branch for upgrading to Zenoh 1.0

* Increase the delay in scouting (#16)

* ci: fix the argument order in the style CI

* refactor: use `z_id_to_string`

* build: enable the unstable feature flag

* build: bump up the zenoh-c commit

* build: update zenoh-c version

* fix: set the max size of initial query queue to `SIZE_MAX - 1`

* fix: iterator memory leak

* feat: update to zenoh-c 1.0.0.8 changes

* chore(style): address `ament_cpplint` and `ament_uncrustify`

* fix: initiate zenoh logger

* chore: apply the suggestions

* chore: add the comments for the zenoh logger

* fix: store and destroy the subscriber properly

* chore: improve the null pointer check: NULL => nullptr

* Change liveliness tokens logs from warn to debug level (#22)

* fix: properly clone the pointer of query and reply to resolve the segfault in test_service__rmw_zenoh_cpp

* chore: update to zenoh-c 1.0.0.9 (#23)

* Thread-safe access to graph cache (#258)

* refactor(api): align with latest serialization changes

* chore(deps): bump up zenoh-c to 1.0.0.10

* chore(api): align with latest serialization changes

* fix: correct the sub_ke and selector_ke in the querying_subscriber

* fix: thread-safe publisher

* Enable history option for liveliness subscriber. (#27)

* refactor!: adopt the TLS config renaming

* refactor: allow Zenoh session to close without dropping

* fix: address the failure in rclcpp/test_wait_for_message of declaring a subscriber after the RMW has been shut down

* test: close but not drop the session

* fix: correct the merge

* chore: Explicit false in adminspace config

* fix: enable admin space in rmw router and ros nodes

* Bump zenoh-c version.

* Use the latest zenoh-c which fix some nav2 issues. (#31)

* Update config files according to Zenoh 1.0.0 DEFAULT_CONFIG.json5 (#33)

* chore(zenoh_c_vendor): bump up zenoh-c version

* refactor: remove the free_attachment

* Fix unset request header writer GUID in `rmw_take_response`

* fix: keyexpr is missing in the service

* Avoid touching Zenoh Session while exiting.

* Register function right after opening Zenoh Session.

* chore(deps): bump up zenoh-c to 1.0.1

* fix: use TRUE value to configure the feature flag

* fix: correct typo `attachement` to `attachment`

* refactor: remove the warning of subscriber reliability QoS

* Fix `z_view_string_t` to `std::string` conversion

* refactor: zc_liveliness_* -> z_liveliness_* and bump up zenoh-c version

* refactor: reorder the cancel functions

* chore: reorder some lines of code

* refactor: add `session_is_valid` check

* fixup! refactor: reorder the cancel functions

* fixup! refactor: zc_liveliness_* -> z_liveliness_* and bump up zenoh-c version

Signed-off-by: Luca Cominardi <[email protected]>
Signed-off-by: ChenYing Kuo <[email protected]>
Signed-off-by: Gabriele Baldoni <[email protected]>
Signed-off-by: Yadunund <[email protected]>
Co-authored-by: Mahmoud Mazouz <[email protected]>
Co-authored-by: yellowhatter <[email protected]>
Co-authored-by: Steven Palma <[email protected]>
Co-authored-by: Julien Enoch <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>