Fix replies coming in after a client is destroyed. #228

Merged
merged 3 commits into rolling from clalancette/fix-closure-semantics on Jul 5, 2024

Conversation

@clalancette (Collaborator, Author) commented Jul 1, 2024

It turns out that in Zenoh, once a query is in-flight (via
the z_get zenoh-c API), there is no way to cancel it. With
that in mind, consider a sequence of events:

  1. rmw_create_client
  2. rmw_send_request
  3. rmw_destroy_client

If all of that happens before the query callback is called,
we'll have freed up the structure that we passed to the zenoh
closure (client_data), and thus the structure will access
already freed memory.

The fix for this is relatively complicated. First of all,
anytime that there is a call to rmw_send_request (which calls z_get()),
it increases a per-client counter. rmw_destroy_client() checks
to see if that counter is 0. If it is, we know both that there are
no queries in flight, and also that there will be no new ones (because
we are cleaning the client up). Thus the structure is freed directly.
If the counter is greater than 0, then there is at least one query
in flight and it is not safe to free the structure. However, the
client_data structure is marked as "shutdown" at this point. There
will not be any new requests for this client, but the in-flight ones
still need to be dealt with.

For the in-flight ones, every time client_data_handler() (the
rmw_zenoh_cpp callback for queries) is called, it first decrements the
number of in-flight queries. If the client is shutdown, and the number
of in-flight queries is 0, then it is safe to free the client_data
structure. If the client is shutdown but there are other queries in
flight, no actual work is done except for the decrement.

There is one case which is not handled here at all, and that has
to do with timeouts. Currently the rmw_zenoh_cpp client query timeout
is set to essentially infinite. Thus, if a query is in-flight, but never
returns, the memory corresponding to that client will be leaked.
However, this is already an existing problem; this patch changes that
from undefined behavior (a use-after-free) to a memory leak.

In my testing, this fixes the problems pointed out in #186
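
A minimal sketch of this bookkeeping is below. The names num_in_flight_, is_shutdown_, and increment_in_flight_callbacks() appear in the diff and discussion; the other helper names and the locking details are illustrative assumptions, not necessarily the merged code.

  #include <cstddef>
  #include <mutex>

  // Illustrative sketch of the per-client bookkeeping; not the actual
  // rmw_zenoh_cpp class.
  class rmw_client_data_t
  {
  public:
    // Called by rmw_send_request() for every z_get() it issues.
    void increment_in_flight_callbacks()
    {
      std::lock_guard<std::mutex> lock(in_flight_mutex_);
      ++num_in_flight_;
    }

    // Called by rmw_destroy_client().  Returns true if at least one query is
    // still in flight, i.e. the caller must NOT free this structure yet.
    bool shutdown_and_query_in_flight()
    {
      std::lock_guard<std::mutex> lock(in_flight_mutex_);
      is_shutdown_ = true;
      return num_in_flight_ > 0;
    }

    // Called from the reply closure.  Returns true if this was the last
    // in-flight query on an already-shutdown client, i.e. whoever calls this
    // is now responsible for freeing the structure.
    bool decrement_queries_in_flight_and_is_shutdown()
    {
      std::lock_guard<std::mutex> lock(in_flight_mutex_);
      --num_in_flight_;
      return is_shutdown_ && num_in_flight_ == 0;
    }

  private:
    std::mutex in_flight_mutex_;
    std::size_t num_in_flight_{0};
    bool is_shutdown_{false};
  };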

@clalancette clalancette requested a review from Yadunund July 1, 2024 18:16

@clalancette (Collaborator, Author) commented:

I should note that this was on the advice of the Zenoh developers in Discord.

-    client_data->keyexpr), "", &client_data->zn_closure_reply, &opts);
+    z_loan(context_impl->session),
+    z_loan(client_data->keyexpr), "",
+    z_move(zn_closure_reply),

@Yadunund (Member) commented Jul 1, 2024:

Wondering if it makes sense for the rmw_client to own the callback. What happens if the callback has not been triggered yet (e.g. the service is busy) and the client is then destroyed?

Would it be better to keep the original implementation but ensure we free the callback with z_reply_null when we call rmw_client_fini?

According to the z_get documentation:

callback – The callback function that will be called on reception of replies for this query. Note that the reply parameter of the callback is passed by mutable reference, but will be dropped once your callback exits to help you avoid memory leaks. If you’d rather take ownership, please refer to the documentation of z_reply_null()

Will closing the Zenoh session ensure all callbacks are freed?
cc: @JEnoch

@clalancette (Collaborator, Author) replied:

Wondering if it makes sense for the rmw_client to own the callback. What happens if the callback has not been triggered yet (e.g. the service is busy) and the client is then destroyed?

Right, so that is a good question. And the answer is that it is up to rmw_zenoh_cpp to deal with this situation, by not freeing the underlying memory until all in-flight callbacks are accounted for. I'll point out that this is already a problem in rmw_zenoh_cpp; if a callback for a query comes in after rmw_destroy_client has been called, we'll access freed memory. I have an upcoming patch which fixes that, which is built on top of this. If you think it is better, I can combine that into this PR.

Would it be better to keep the original implementation but ensure we free the callback with z_reply_null when we call rmw_client_fini?

I'm not 100% sure, but I believe that this is a different thing. This PR is concerned with the ownership of the closure itself. To see what I am talking about, consider that in almost all other places where we call into the Zenoh API, we use either z_loan or z_move. The code before this PR was using a raw pointer reference, which is a code smell.

What you are talking about is the ownership of the reply. There, we are already taking ownership of the memory during the callback (via z_reply_null), so I don't think we need to make a change. Does that make sense?

Also, it might be worthwhile to get feedback from @JEnoch and @OlivierHecart, since they were the ones that suggested this to me to begin with.
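
For reference, the reply-ownership idiom being referred to looks roughly like this; a sketch based on the z_get documentation quoted above, not the exact rmw_zenoh_cpp code:

  void client_data_handler(z_owned_reply_t * reply, void * data)
  {
    // `data` is the client_data passed to z_closure() in this PR.
    // Take ownership of the reply so it can outlive this callback; otherwise
    // Zenoh drops it as soon as the callback returns.
    z_owned_reply_t owned_reply = *reply;
    *reply = z_reply_null();
    // ... store owned_reply (e.g. on the client's reply queue keyed by `data`)
    // and z_drop(z_move(owned_reply)) once the reply has been consumed ...
  }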

@Yadunund (Member) replied:

Thanks for clarifying. I think it would be good to combine those changes in this PR as well.

What you are talking about is the ownership of the reply.

Ah yeah, sorry, I got things mixed up, but my original concern still stands: does the callback get freed if the client is destroyed before we receive a reply?

@clalancette (Collaborator, Author) replied:

Ah yeah, sorry, I got things mixed up, but my original concern still stands: does the callback get freed if the client is destroyed before we receive a reply?

This is a question for the Zenoh developers, but I believe not. That's the reason for doing the z_move, so that zenoh-c/Zenoh proper is responsible for the lifetime of the closure. I'll also point out that in my tests with valgrind, this didn't show a problem.

@clalancette clalancette changed the title from "Make sure to give ownership of the closure to Zenoh." to "Fix replies coming in after a client is destroyed." on Jul 1, 2024

@clalancette (Collaborator, Author) commented:

OK, so I've now updated this PR to include all of the fixes necessary to fix our use-after-free bug here. I've also updated the title and the initial description to explain what all of this is doing. And there is another comment in the code explaining why we need all of this. Please take another look and let me know what you think.

@clalancette clalancette force-pushed the clalancette/fix-closure-semantics branch from 98962ea to e1af1f1 on July 1, 2024 at 20:52

@Yadunund (Member) left a comment:

Wow this is quite the tricky problem. Thanks for finding this suitable fix.

@clalancette (Collaborator, Author) commented:

Wow this is quite the tricky problem. Thanks for finding this suitable fix.

Yeah, agreed. This one is tricky. Let's hold off on merging this one for a little bit to give the Zenoh developers a chance to take a look and comment.


// See the comment about the "num_in_flight" class variable in the rmw_client_data_t class for
// why we need to do this.
client_data->increment_in_flight_callbacks();

A reviewer commented:

Shouldn't increment_in_flight_callbacks be called before z_get so that we are sure rmw_destroy_client is not called in parallel between z_get and increment_in_flight_callbacks? I don't know the rest of the code well enough to know whether rmw_destroy_client and rmw_send_request can be called concurrently or not.

@clalancette (Collaborator, Author) replied:

This is a good point. We probably actually have a larger problem here in that we don't have any kind of locking at all for concurrent calls of the APIs related to clients (and servers, and subscriptions, and publishers). That is something we should look into.

For now, I am going to move this increment ahead of the z_get call as you suggest.
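
Roughly, the reordered sequence would then look like this (reusing the code already in this PR, with the increment hoisted above the z_get() call):

  // Count the query as in flight before Zenoh can possibly invoke the
  // closure (or the client be torn down underneath it).
  client_data->increment_in_flight_callbacks();

  z_owned_closure_reply_t zn_closure_reply =
    z_closure(rmw_zenoh_cpp::client_data_handler, nullptr, client_data);
  z_get(
    z_loan(context_impl->session),
    z_loan(client_data->keyexpr), "",
    z_move(zn_closure_reply),
    &opts);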

@@ -2464,11 +2467,17 @@ rmw_send_request(
   // and any number.
   opts.consolidation = z_query_consolidation_latest();
   opts.value.payload = z_bytes_t{data_length, reinterpret_cast<const uint8_t *>(request_bytes)};
-  client_data->zn_closure_reply =
+  z_owned_closure_reply_t zn_closure_reply =

A reviewer commented:

Shouldn't we check that is_shutdown_ is false before issuing new queries for this client? (Maybe there is another check somewhere else that already prevents this that I'm not aware of.)

@clalancette (Collaborator, Author) replied:

Yes, good point. I'll add this in as well.
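
As a sketch of the guard in question (the is_shutdown() accessor and the exact error handling are assumptions, not necessarily what was merged):

  // Refuse to issue new queries once the client has been shut down;
  // otherwise a request sent during/after rmw_destroy_client() would
  // resurrect the in-flight bookkeeping.
  if (client_data->is_shutdown()) {
    RMW_SET_ERROR_MSG("client_data is shutdown");
    return RMW_RET_ERROR;
  }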

@@ -522,6 +552,20 @@ void client_data_handler(z_owned_reply_t * reply, void * data)
);
return;
}

A reviewer commented:

This decrement should be done in the drop function of the z_owned_closure_reply_t. There is no guarantee that the call function will be called only once per query, while drop will be called once per query at the end.
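
In other words, something like the sketch below; client_data_drop and the decrement helper are hypothetical names (the latter matching the bookkeeping sketch in the PR description), but the guarantee that drop runs exactly once per z_get() is what makes it the right place:

  // Hypothetical drop handler: runs exactly once per z_get(), after every
  // invocation of the `call` handler has finished.
  void client_data_drop(void * data)
  {
    auto client_data = static_cast<rmw_client_data_t *>(data);
    // Decrement the in-flight counter here; if the client was already
    // destroyed and this was the last outstanding query, free client_data.
    if (client_data->decrement_queries_in_flight_and_is_shutdown()) {
      delete client_data;
    }
  }

It would then be registered as the second (drop) argument of z_closure(), where this PR currently passes nullptr.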

@OlivierHecart left a comment:

num_in_flight_ should be decremented in the drop function of the z_owned_closure_reply_t.

That is, we should be using z_move() to hand responsibility
of the closure to Zenoh.

Signed-off-by: Chris Lalancette <[email protected]>

@clalancette (Collaborator, Author) commented:

num_in_flight_ should be decremented in the drop function of the z_owned_closure_reply_t.

Thanks. This was a very helpful comment; I didn't realize that drop was one of the parameters to the closure creation. That is absolutely the right place to free the memory.

However, looking at this closer leads me to another question. As you can see in this PR, we create a new closure and give ownership of it to Zenoh via z_get for every query:

  z_owned_closure_reply_t zn_closure_reply =
    z_closure(rmw_zenoh_cpp::client_data_handler, nullptr, client_data);
  z_get(
    z_loan(context_impl->session),
    z_loan(client_data->keyexpr), "",
    z_move(zn_closure_reply),
    &opts);

(this is more-or-less required by the z_get API). But the documentation in zenoh-c about closures says this:

 * It is guaranteed that:
 *
 *   - `call` will never be called once `drop` has started.
 *   - `drop` will only be called **once**, and **after every** `call` has ended.
 *   - The two previous guarantees imply that `call` and `drop` are never called concurrently.

That suggests to me that it is possible to use the same closure in multiple z_get calls. Is that correct? Would it be better for us to create the closure when we initialize the client (rmw_create_client), and then reuse it for every z_get with this client (rmw_send_request)? If so, what would that look like, since the API seems to require us to z_move the closure?

@OlivierHecart replied:

That suggests to me that it is possible to use the same closure in multiple z_get calls. Is that correct? Would it be better for us to create the closure when we initialize the client (rmw_create_client), and then reuse it for every z_get with this client (rmw_send_request)? If so, what would that look like, since the API seems to require us to z_move the closure?

I believe that what creates the confusion here is that in Zenoh a single z_get can lead to 0, 1, or multiple replies, which results in 0, 1, or multiple calls to call and finally one call to drop to indicate to the user that there are no more replies for this single z_get. With that in mind, the documentation does not necessarily suggest that the same closure can be used in multiple z_get calls.
So I believe that you indeed should not reuse the same closure for multiple z_get calls. I'll double-check and come back to you if it turns out otherwise, but for now you can go with this assumption.

@clalancette (Collaborator, Author) commented:

So I believe that you indeed should not reuse the same closure for multiple z_get calls. I'll double-check and come back to you if it turns out otherwise, but for now you can go with this assumption.

Fantastic, thank you for that explanation. I'll continue with that assumption, then.

@clalancette clalancette force-pushed the clalancette/fix-closure-semantics branch 2 times, most recently from ff53c04 to d694ad8 on July 3, 2024 at 19:31
@clalancette clalancette force-pushed the clalancette/fix-closure-semantics branch from d694ad8 to 1705100 on July 3, 2024 at 19:31

@clalancette (Collaborator, Author) commented:

All right, I believe I responded to all comments now. Please take another look when you get a chance.

@clalancette clalancette merged commit 5c09185 into rolling Jul 5, 2024
8 checks passed
@Yadunund Yadunund deleted the clalancette/fix-closure-semantics branch July 8, 2024 05:39