Fix replies coming in after a client is destroyed. #228

Merged
merged 3 commits into rolling from clalancette/fix-closure-semantics on Jul 5, 2024

Conversation

@clalancette (Collaborator, Author) commented Jul 1, 2024

It turns out that in Zenoh, once a query is in-flight (via
the z_get zenoh-c API), there is no way to cancel it. With
that in mind, consider a sequence of events:

  1. rmw_create_client
  2. rmw_send_request
  3. rmw_destroy_client

If all of that happens before the query callback is called,
we'll have freed up the structure that we passed to the zenoh
closure (client_data), and thus the structure will access
already freed memory.

The fix for this is relatively complicated. First of all,
anytime that there is a call to rmw_send_request (which calls z_get()),
it increases a per-client counter. rmw_destroy_client() checks
to see if that counter is 0. If it is, we know both that there are
no queries in flight, and also that there will be no new ones (because
we are cleaning the client up). Thus the structure is freed directly.
If the counter is greater than 0, then there is at least one query
in flight and it is not safe to free the structure. However, the
client_data structure is marked as "shutdown" at this point. There
will not be any new requests for this client, but the in-flight ones
still need to be dealt with.

For the in-flight ones, every time client_data_handler() (the
rmw_zenoh_cpp callback for queries) is called, it first decrements the
number of in-flight queries. If the client is shutdown, and the number
of in-flight queries is 0, then it is safe to free the client_data
structure. If the client is shutdown but there are other queries in
flight, no actual work is done except for the decrement.

There is one case which is not handled here at all, and that has
to do with timeouts. Currently the rmw_zenoh_cpp client query timeout
is set to essentially infinite. Thus, if a query is in-flight, but never
returns, the memory corresponding to that client will be leaked.
However, this is already an existing problem; this patch changes that
from undefined behavior (a use-after-free) to a memory leak.

In my testing, this fixes the problems pointed out in #186
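
A minimal sketch of this bookkeeping is below. The names num_in_flight_, is_shutdown_, and increment_in_flight_callbacks() appear in the diff and discussion; the other helper names and the locking details are illustrative assumptions, not necessarily the merged code.

  #include <cstddef>
  #include <mutex>

  // Illustrative sketch of the per-client bookkeeping; not the actual
  // rmw_zenoh_cpp class.
  class rmw_client_data_t
  {
  public:
    // Called by rmw_send_request() for every z_get() it issues.
    void increment_in_flight_callbacks()
    {
      std::lock_guard<std::mutex> lock(in_flight_mutex_);
      ++num_in_flight_;
    }

    // Called by rmw_destroy_client().  Returns true if at least one query is
    // still in flight, i.e. the caller must NOT free this structure yet.
    bool shutdown_and_query_in_flight()
    {
      std::lock_guard<std::mutex> lock(in_flight_mutex_);
      is_shutdown_ = true;
      return num_in_flight_ > 0;
    }

    // Called from the reply closure.  Returns true if this was the last
    // in-flight query on an already-shutdown client, i.e. whoever calls this
    // is now responsible for freeing the structure.
    bool decrement_queries_in_flight_and_is_shutdown()
    {
      std::lock_guard<std::mutex> lock(in_flight_mutex_);
      --num_in_flight_;
      return is_shutdown_ && num_in_flight_ == 0;
    }

  private:
    std::mutex in_flight_mutex_;
    std::size_t num_in_flight_{0};
    bool is_shutdown_{false};
  };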

@clalancette clalancette requested a review from Yadunund July 1, 2024 18:16

@clalancette (Collaborator, Author) commented:

I should note that this was on the advice of the Zenoh developers in Discord.

-    client_data->keyexpr), "", &client_data->zn_closure_reply, &opts);
+    z_loan(context_impl->session),
+    z_loan(client_data->keyexpr), "",
+    z_move(zn_closure_reply),

@Yadunund (Member) commented Jul 1, 2024:

Wondering if it makes sense for the rmw_client to own the callback. What happens if the callback has not been triggered yet (e.g. the service is busy) and the client is then destroyed?

Would it be better to keep the original implementation but ensure we free the callback with z_reply_null when we call rmw_client_fini?

According to the z_get documentation:

callback – The callback function that will be called on reception of replies for this query. Note that the reply parameter of the callback is passed by mutable reference, but will be dropped once your callback exits to help you avoid memory leaks. If you’d rather take ownership, please refer to the documentation of z_reply_null()

Will closing the Zenoh session ensure all callbacks are freed?
cc: @JEnoch

@clalancette (Collaborator, Author) replied:

Wondering if it makes sense for the rmw_client to own the callback. What happens if the callback has not been triggered yet (e.g. the service is busy) and the client is then destroyed?

Right, so that is a good question. And the answer is that it is up to rmw_zenoh_cpp to deal with this situation, by not freeing the underlying memory until all in-flight callbacks are accounted for. I'll point out that this is already a problem in rmw_zenoh_cpp; if a callback for a query comes in after rmw_destroy_client has been called, we'll access freed memory. I have an upcoming patch which fixes that, which is built on top of this. If you think it is better, I can combine that into this PR.

Would it be better to keep the original implementation but ensure we free the callback with z_reply_null when we call rmw_client_fini?

I'm not 100% sure, but I believe that this is a different thing. This PR is concerned with the ownership of the closure itself. To see what I am talking about, consider that in almost all other places where we call into the Zenoh API, we use either z_loan or z_move. The code before this PR was using a raw pointer reference, which is a code smell.

What you are talking about is the ownership of the reply. There, we are already taking ownership of the memory during the callback (via z_reply_null), so I don't think we need to make a change. Does that make sense?

Also, it might be worthwhile to get feedback from @JEnoch and @OlivierHecart, since they were the ones that suggested this to me to begin with.
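
For reference, the reply-ownership idiom being referred to looks roughly like this; a sketch based on the z_get documentation quoted above, not the exact rmw_zenoh_cpp code:

  void client_data_handler(z_owned_reply_t * reply, void * data)
  {
    // `data` is the client_data passed to z_closure() in this PR.
    // Take ownership of the reply so it can outlive this callback; otherwise
    // Zenoh drops it as soon as the callback returns.
    z_owned_reply_t owned_reply = *reply;
    *reply = z_reply_null();
    // ... store owned_reply (e.g. on the client's reply queue keyed by `data`)
    // and z_drop(z_move(owned_reply)) once the reply has been consumed ...
  }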

@Yadunund (Member) replied:

Thanks for clarifying. I think it would be good to combine those changes in this PR as well.

What you are talking about is the ownership of the reply.

Ah yeah, sorry, I got things mixed up, but my original concern still stands: does the callback get freed if the client is destroyed before we receive a reply?

@clalancette (Collaborator, Author) replied:

Ah yeah, sorry, I got things mixed up, but my original concern still stands: does the callback get freed if the client is destroyed before we receive a reply?

This is a question for the Zenoh developers, but I believe not. That's the reason for doing the z_move, so that zenoh-c/Zenoh proper is responsible for the lifetime of the closure. I'll also point out that in my tests with valgrind, this didn't show a problem.

@clalancette clalancette changed the title from "Make sure to give ownership of the closure to Zenoh." to "Fix replies coming in after a client is destroyed." on Jul 1, 2024

@clalancette (Collaborator, Author) commented:

OK, so I've now updated this PR to include all of the fixes necessary to fix our use-after-free bug here. I've also updated the title and the initial description to explain what all of this is doing. And there is another comment in the code explaining why we need all of this. Please take another look and let me know what you think.

@clalancette clalancette force-pushed the clalancette/fix-closure-semantics branch from 98962ea to e1af1f1 on July 1, 2024 at 20:52

@Yadunund (Member) left a comment:

Wow this is quite the tricky problem. Thanks for finding this suitable fix.

@clalancette (Collaborator, Author) commented:

Wow this is quite the tricky problem. Thanks for finding this suitable fix.

Yeah, agreed. This one is tricky. Let's hold off on merging this one for a little bit to give the Zenoh developers a chance to take a look and comment.


// See the comment about the "num_in_flight" class variable in the rmw_client_data_t class for
// why we need to do this.
client_data->increment_in_flight_callbacks();

A reviewer commented:

Shouldn't increment_in_flight_callbacks be called before z_get so that we are sure rmw_destroy_client is not called in parallel between z_get and increment_in_flight_callbacks? I don't know the rest of the code well enough to know whether rmw_destroy_client and rmw_send_request can be called concurrently or not.

@clalancette (Collaborator, Author) replied:

This is a good point. We probably actually have a larger problem here in that we don't have any kind of locking at all for concurrent calls of the APIs related to clients (and servers, and subscriptions, and publishers). That is something we should look into.

For now, I am going to move this increment ahead of the z_get call as you suggest.
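
Roughly, the reordered sequence would then look like this (reusing the code already in this PR, with the increment hoisted above the z_get() call):

  // Count the query as in flight before Zenoh can possibly invoke the
  // closure (or the client be torn down underneath it).
  client_data->increment_in_flight_callbacks();

  z_owned_closure_reply_t zn_closure_reply =
    z_closure(rmw_zenoh_cpp::client_data_handler, nullptr, client_data);
  z_get(
    z_loan(context_impl->session),
    z_loan(client_data->keyexpr), "",
    z_move(zn_closure_reply),
    &opts);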

@@ -2464,11 +2467,17 @@ rmw_send_request(
   // and any number.
   opts.consolidation = z_query_consolidation_latest();
   opts.value.payload = z_bytes_t{data_length, reinterpret_cast<const uint8_t *>(request_bytes)};
-  client_data->zn_closure_reply =
+  z_owned_closure_reply_t zn_closure_reply =

A reviewer commented:

Shouldn't we check that is_shutdown_ is false before issuing new queries for this client? (Maybe there is another check somewhere else that already prevents this that I'm not aware of.)

@clalancette (Collaborator, Author) replied:

Yes, good point. I'll add this in as well.
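
As a sketch of the guard in question (the is_shutdown() accessor and the exact error handling are assumptions, not necessarily what was merged):

  // Refuse to issue new queries once the client has been shut down;
  // otherwise a request sent during/after rmw_destroy_client() would
  // resurrect the in-flight bookkeeping.
  if (client_data->is_shutdown()) {
    RMW_SET_ERROR_MSG("client_data is shutdown");
    return RMW_RET_ERROR;
  }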

@@ -522,6 +552,20 @@ void client_data_handler(z_owned_reply_t * reply, void * data)
);
return;
}

A reviewer commented:

This decrement should be done in the drop function of the z_owned_closure_reply_t. There is no guarantee that the call function will be called only once per query, while drop will be called once per query at the end.
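
In other words, something like the sketch below; client_data_drop and the decrement helper are hypothetical names (the latter matching the bookkeeping sketch in the PR description), but the guarantee that drop runs exactly once per z_get() is what makes it the right place:

  // Hypothetical drop handler: runs exactly once per z_get(), after every
  // invocation of the `call` handler has finished.
  void client_data_drop(void * data)
  {
    auto client_data = static_cast<rmw_client_data_t *>(data);
    // Decrement the in-flight counter here; if the client was already
    // destroyed and this was the last outstanding query, free client_data.
    if (client_data->decrement_queries_in_flight_and_is_shutdown()) {
      delete client_data;
    }
  }

It would then be registered as the second (drop) argument of z_closure(), where this PR currently passes nullptr.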

@OlivierHecart left a comment:

num_in_flight_ should be decremented in the drop function of the z_owned_closure_reply_t.

That is, we should be using z_move() to hand responsibility
of the closure to Zenoh.

Signed-off-by: Chris Lalancette <[email protected]>

@clalancette (Collaborator, Author) commented:

num_in_flight_ should be decremented in the drop function of the z_owned_closure_reply_t.

Thanks. This was a very helpful comment; I didn't realize that drop was one of the parameters to the closure creation. That is absolutely the right place to free the memory.

However, looking at this closer leads me to another question. As you can see in this PR, we create a new closure and give ownership of it to Zenoh via z_get for every query:

  z_owned_closure_reply_t zn_closure_reply =
    z_closure(rmw_zenoh_cpp::client_data_handler, nullptr, client_data);
  z_get(
    z_loan(context_impl->session),
    z_loan(client_data->keyexpr), "",
    z_move(zn_closure_reply),
    &opts);

(this is more-or-less required by the z_get API). But the documentation in zenoh-c about closures says this:

 * It is guaranteed that:
 *
 *   - `call` will never be called once `drop` has started.
 *   - `drop` will only be called **once**, and **after every** `call` has ended.
 *   - The two previous guarantees imply that `call` and `drop` are never called concurrently.

That suggests to me that it is possible to use the same closure in multiple z_get calls. Is that correct? Would it be better for us to create the closure when we initialize the client (rmw_create_client), and then reuse it for every z_get with this client (rmw_send_request)? If so, what would that look like, since the API seems to require us to z_move the closure?

@OlivierHecart replied:

That suggests to me that it is possible to use the same closure in multiple z_get calls. Is that correct? Would it be better for us to create the closure when we initialize the client (rmw_create_client), and then reuse it for every z_get with this client (rmw_send_request)? If so, what would that look like, since the API seems to require us to z_move the closure?

I believe that what creates the confusion here is that in Zenoh a single z_get can lead to 0, 1, or multiple replies, which results in 0, 1, or multiple calls to call and finally one call to drop to indicate to the user that there are no more replies for this single z_get. With that in mind, the documentation does not necessarily suggest that the same closure can be used in multiple z_get calls.
So I believe that you indeed should not reuse the same closure for multiple z_get calls. I'll double-check and come back to you if it turns out otherwise, but for now you can go with this assumption.

@clalancette (Collaborator, Author) commented:

So I believe that you indeed should not reuse the same closure for multiple z_get calls. I'll double-check and come back to you if it turns out otherwise, but for now you can go with this assumption.

Fantastic, thank you for that explanation. I'll continue with that assumption, then.

@clalancette clalancette force-pushed the clalancette/fix-closure-semantics branch 2 times, most recently from ff53c04 to d694ad8 on July 3, 2024 at 19:31
@clalancette clalancette force-pushed the clalancette/fix-closure-semantics branch from d694ad8 to 1705100 on July 3, 2024 at 19:31

@clalancette (Collaborator, Author) commented:

All right, I believe I responded to all comments now. Please take another look when you get a chance.

@clalancette clalancette merged commit 5c09185 into rolling Jul 5, 2024
8 checks passed
@Yadunund Yadunund deleted the clalancette/fix-closure-semantics branch July 8, 2024 05:39