Fix replies coming in after a client is destroyed. #228
Conversation
I should note that this was on the advice of the Zenoh developers in Discord.

```diff
-  client_data->keyexpr), "", &client_data->zn_closure_reply, &opts);
+  z_loan(context_impl->session),
+  z_loan(client_data->keyexpr), "",
+  z_move(zn_closure_reply),
```
Wondering if it makes sense for the rmw_client to own the callback. What happens when the callback does not get triggered yet (the service is busy), but then the client is destroyed?

Would it be better to keep the original implementation but ensure we free the callback with z_reply_null when we call rmw_client_fini?
According to the z_get documentation:
> callback – The callback function that will be called on reception of replies for this query. Note that the reply parameter of the callback is passed by mutable reference, but will be dropped once your callback exits to help you avoid memory leaks. If you’d rather take ownership, please refer to the documentation of z_reply_null()
Will closing the Zenoh session ensure all callbacks are freed?
cc: @JEnoch
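For reference, a minimal sketch of the take-ownership pattern the quoted z_get documentation refers to, assuming the zenoh-c 0.10-era API (names are illustrative, not the actual rmw_zenoh_cpp handler):

```cpp
#include <zenoh.h>

// Illustrative reply callback. By default the reply handed to the callback is
// dropped by zenoh-c when the callback returns.
void example_reply_handler(z_owned_reply_t * reply, void * /*ctx*/)
{
  if (!z_reply_is_ok(reply)) {
    return;  // zenoh-c still drops the (error) reply after we return
  }

  // To keep the reply past the callback, take ownership of it and leave a
  // "null" reply behind so the automatic drop has nothing left to free.
  z_owned_reply_t owned = *reply;
  *reply = z_reply_null();

  // ... hand `owned` off (e.g. push it onto the client's reply queue) and call
  // z_reply_drop() once the rmw layer is done with it ...
  z_reply_drop(&owned);  // dropped immediately here only to keep the sketch leak-free
}
```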
> Wondering if it makes sense for the rmw_client to own the callback. What happens when the callback does not get triggered yet (the service is busy), but then the client is destroyed?
Right, so that is a good question. And the answer is that it is up to rmw_zenoh_cpp to deal with this situation, by not freeing the underlying memory until all in-flight callbacks are accounted for. I'll point out that this is already a problem in rmw_zenoh_cpp; if a callback for a query comes in after rmw_destroy_client has been called, we'll access freed memory. I have an upcoming patch which fixes that, which is built on top of this. If you think it is better, I can combine that into this PR.
> Would it be better to keep the original implementation but ensure we free the callback with z_reply_null when we call rmw_client_fini?
I'm not 100% sure, but I believe that this is a different thing. This PR is concerned with the ownership of the closure itself. To see what I am talking about, consider that in almost all other places where we call into the Zenoh API, we use either z_loan or z_move. The code before this PR was using a raw pointer reference, which is a code smell.

What you are talking about is the ownership of the reply. There, we are already taking ownership of the memory during the callback (via z_reply_null), so I don't think we need to make a change. Does that make sense?
Also, it might be worthwhile to get feedback from @JEnoch and @OlivierHecart, since they were the ones that suggested this to me to begin with.
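To make the ownership point concrete, the call site before and after this PR looks roughly like the following (a sketch reconstructed from the diff above; the z_closure construction and the drop handler name are assumptions, not the literal code):

```cpp
// Before this PR (code smell): pass a raw pointer to a closure that lives
// inside client_data, so rmw_zenoh_cpp still "owns" it after z_get returns.
z_get(
  z_loan(context_impl->session), z_loan(client_data->keyexpr), "",
  &client_data->zn_closure_reply, &opts);

// With this PR: build a local closure and hand it to Zenoh with z_move(), so
// zenoh-c is responsible for the closure's lifetime from that point on.
z_owned_closure_reply_t zn_closure_reply =
  z_closure(client_data_handler, client_data_drop, client_data);  // drop handler name is illustrative
z_get(
  z_loan(context_impl->session), z_loan(client_data->keyexpr), "",
  z_move(zn_closure_reply), &opts);
```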
Thanks for clarifying. I think it would be good to combine those changes in this PR as well.

> What you are talking about is the ownership of the reply.

Ah yeah, sorry, I got things mixed up, but my original concern still stands on whether the callback gets freed if the client is destroyed before we receive a reply.
> Ah yeah sorry I got things mixed up but my original concern still stands on whether the callback gets freed if the client is destroyed before we receive a reply.
This is a question for the Zenoh developers, but I believe not. That's the reason for doing the z_move, so that zenoh-c/Zenoh proper is responsible for the lifetime of the closure. I'll also point out that in my tests with valgrind, this didn't show a problem.

OK, so I've now updated this PR to include all of the changes necessary to fix our use-after-free bug here. I've also updated the title and the initial description to explain what all of this is doing. And there is another comment in the code explaining why we need all of this. Please take another look and let me know what you think.
(Branch force-pushed from 98962ea to e1af1f1.)
Wow, this is quite the tricky problem. Thanks for finding a suitable fix.

Yeah, agreed. This one is tricky. Let's hold off on merging this one for a little bit to give the Zenoh developers a chance to take a look and comment.
rmw_zenoh_cpp/src/rmw_zenoh.cpp (outdated diff):

```cpp
// See the comment about the "num_in_flight" class variable in the rmw_client_data_t class for
// why we need to do this.
client_data->increment_in_flight_callbacks();
```
Shouldn't increment_in_flight_callbacks be called before z_get, so that we are sure rmw_destroy_client is not called in parallel between z_get and increment_in_flight_callbacks? I don't know the rest of the code well enough to know whether rmw_destroy_client and rmw_send_request can be called concurrently or not.
This is a good point. We actually have a larger problem here in that we don't have any kind of locking at all for concurrent calls of the APIs related to clients (and servers, and subscriptions, and publishers). That is something we should look into.

For now, I am going to move this increment ahead of the z_get call as you suggest.
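In sketch form, the resulting ordering would look something like this (the decrement helper and the error handling are assumptions, not the literal change):

```cpp
// Bump the in-flight counter *before* launching the query, so a concurrent
// rmw_destroy_client() can never observe a zero counter while a query is
// about to go out.
client_data->increment_in_flight_callbacks();

if (z_get(
    z_loan(context_impl->session), z_loan(client_data->keyexpr), "",
    z_move(zn_closure_reply), &opts) < 0)
{
  // The query was never launched, so neither the callback nor the closure's
  // drop will ever run: undo the increment ourselves (hypothetical helper).
  client_data->decrement_in_flight_callbacks();
  RMW_SET_ERROR_MSG("unable to send query");
  return RMW_RET_ERROR;
}
```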
```diff
@@ -2464,11 +2467,17 @@ rmw_send_request(
   // and any number.
   opts.consolidation = z_query_consolidation_latest();
   opts.value.payload = z_bytes_t{data_length, reinterpret_cast<const uint8_t *>(request_bytes)};
-  client_data->zn_closure_reply =
+  z_owned_closure_reply_t zn_closure_reply =
```
Shouldn't we check that is_shutdown_ is false before issuing new queries for this client? (Maybe there is another check somewhere else that already prevents that, which I'm not aware of.)
Yes, good point. I'll add this in as well.
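For example, something along these lines near the top of rmw_send_request (the accessor name is illustrative; is_shutdown_ is the member discussed above):

```cpp
// Refuse to launch new queries once rmw_destroy_client() has marked this
// client as shut down.
if (client_data->is_shutdown()) {
  RMW_SET_ERROR_MSG("client was shut down before the request could be sent");
  return RMW_RET_ERROR;
}
```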
```diff
@@ -522,6 +552,20 @@ void client_data_handler(z_owned_reply_t * reply, void * data)
     );
     return;
   }
```
This decrement should be done in the drop function of the z_owned_closure_reply_t. There is no guarantee that the call function will be called only once per query, while drop will be called once per query at the end.
num_in_flight_ should be decremented in the drop function of the z_owned_closure_reply_t.
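In code, the suggestion looks roughly like this (the drop handler and decrement helper names are illustrative, assuming the zenoh-c closure API used elsewhere in this PR):

```cpp
// drop is invoked by zenoh-c exactly once per query, when the query is
// finalized, so it is the safe place for the in-flight bookkeeping. call may
// run once per reply (i.e. several times, or not at all).
void example_reply_drop(void * data)
{
  auto client_data = static_cast<rmw_client_data_t *>(data);
  client_data->decrement_in_flight_and_conditionally_delete();  // hypothetical helper
}

// Build the closure with both handlers before handing it to z_get() via z_move().
z_owned_closure_reply_t zn_closure_reply =
  z_closure(client_data_handler, example_reply_drop, client_data);
```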
That is, we should be using z_move() to hand responsibility of the closure to Zenoh. Signed-off-by: Chris Lalancette <[email protected]>
Thanks. This was a very helpful comment; I didn't realize that the call function could run more than once per query. However, looking at this closer leads me to another question. As you can see in this PR, we create a new closure and give ownership of it to Zenoh via z_move (this is more-or-less required by the z_get API). That suggests to me that it is possible to use the same closure in multiple z_get calls.
I believe that what creates the confusion here is that in Zenoh one single query can receive multiple replies (one per matching queryable), so call may run several times for that one query, while drop runs exactly once when the query is finalized.
Fantastic, thank you for that explanation. I'll continue with that assumption, then.
(Branch force-pushed from ff53c04 to d694ad8.)
Signed-off-by: Chris Lalancette <[email protected]>
(Branch force-pushed from d694ad8 to 1705100.)
All right, I believe I responded to all comments now. Please take another look when you get a chance.
It turns out that in Zenoh, once a query is in-flight (via
the z_get zenoh-c API), there is no way to cancel it. With
that in mind, consider a sequence of events:

1. rmw_create_client
2. rmw_send_request
3. rmw_destroy_client
If all of that happens before the query callback is called,
we'll have freed up the structure that we passed to the zenoh
closure (client_data), and thus the structure will access
already freed memory.
The fix for this is relatively complicated. First of all,
anytime that there is a call to rmw_send_request (which calls z_get()),
it increases a per-client counter. rmw_destroy_client() checks
to see if that counter is 0. If it is, we know both that there are
no queries in flight, and also that there will be no new ones (because
we are cleaning the client up). Thus the structure is freed directly.
If the counter is greater than 0, then there is at least one query
in flight and it is not safe to free the structure. However, the
client_data structure is marked as "shutdown" at this point. There
will not be any new requests for this client, but the in-flight ones
still need to be dealt with.
For the in-flight ones, every time client_data_handler() (the
rmw_zenoh_cpp callback for queries) is called, it first decrements the
number of in-flight queries. If the client is shutdown, and the number
of in-flight queries is 0, then it is safe to free the client_data
structure. If the client is shutdown but there are other queries in
flight, no actual work is done except for the decrement.
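A condensed, self-contained sketch of that bookkeeping (num_in_flight_ and is_shutdown_ are the members described above; the helper functions and locking are illustrative, not the literal rmw_zenoh_cpp code):

```cpp
#include <cstddef>
#include <mutex>

struct rmw_client_data_t
{
  // ... key expression, sequence numbers, reply queue, etc. ...
  std::mutex in_flight_mutex_;
  size_t num_in_flight_{0};
  bool is_shutdown_{false};
};

// rmw_destroy_client(): mark the client shut down; tell the caller whether the
// free must be deferred because queries are still in flight.
bool shutdown_and_query_in_flight(rmw_client_data_t * client_data)
{
  std::lock_guard<std::mutex> lock(client_data->in_flight_mutex_);
  client_data->is_shutdown_ = true;
  return client_data->num_in_flight_ > 0;
}

// Called from the query callback (or, per the review above, the closure's drop):
// decrement, and free the client only when it was already shut down and this
// was the last in-flight query.
void decrement_in_flight_and_conditionally_delete(rmw_client_data_t * client_data)
{
  bool delete_now = false;
  {
    std::lock_guard<std::mutex> lock(client_data->in_flight_mutex_);
    client_data->num_in_flight_--;
    delete_now = client_data->is_shutdown_ && client_data->num_in_flight_ == 0;
  }
  if (delete_now) {
    delete client_data;  // stands in for the real allocator-based cleanup
  }
}
```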
There is one case which is not handled here at all, and that has
to do with timeouts. Currently the rmw_zenoh_cpp client query timeout
is set to essentially infinite. Thus, if a query is in-flight, but never
returns, the memory corresponding to that client will be leaked.
However, this is already an existing problem; this patch changes that
from undefined behavior to a memory leak.
In my testing, this fixes the problems pointed out in #186