RPC invocation_id conflict #103

songweijia · 2019-02-21T22:48:32Z

When we construct an RPC message, we use a random number generated by Matt's library (mutils::long_rand()) as the invocation_id, which indexes into the results_map holding the promises of the RPC call. In the objectstore performance benchmark, I found duplicate invocation_ids within ~50K RPC calls, which throws a std::future "already retrieved" exception and then crashes. We need to fix this with non-conflicting invocation_ids. We also need to reclaim used invocation_ids: don't leave one-time garbage there forever!

Code location: send_return send(...)@derecho/remote_invocable.h:101.

songweijia · 2019-02-21T23:01:49Z

I think a simple solution is to have a set holding all the pending invocation_ids. When we create a new invocation_id, we test for duplication and insert it into the set. Once the RPC is done, we remove it from the set. But this is not lock-free.
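A minimal sketch of the pending-set idea, for illustration only. PendingIdSet, acquire(), and release() are hypothetical names, not Derecho APIs, and std::mt19937_64 stands in for mutils::long_rand(). The mutex is what makes this approach not lock-free.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <random>
#include <unordered_set>

// Hypothetical helper (not part of Derecho): tracks invocation IDs in flight.
class PendingIdSet {
public:
    // Draws random IDs until one is not currently pending, then records it.
    std::uint64_t acquire() {
        std::lock_guard<std::mutex> guard(mutex_);
        for (;;) {
            std::uint64_t candidate = dist_(rng_);
            // insert() reports false in .second if the ID is already pending.
            if (pending_.insert(candidate).second) {
                return candidate;
            }
        }
    }

    // Called once the RPC is done; returns false if the ID was not pending.
    bool release(std::uint64_t id) {
        std::lock_guard<std::mutex> guard(mutex_);
        return pending_.erase(id) == 1;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return pending_.size();
    }

private:
    mutable std::mutex mutex_;
    std::mt19937_64 rng_{std::random_device{}()};
    std::uniform_int_distribution<std::uint64_t> dist_;  // full 64-bit range
    std::unordered_set<std::uint64_t> pending_;
};
```

Because released IDs leave the set, memory use is bounded by the number of RPCs in flight rather than the number ever issued.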

songweijia · 2019-02-21T23:03:34Z

Another option is to use a sequencer, which is easier to manage than a set.
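A sequencer can be as small as an atomic counter. The sketch below is an assumption about what such a sequencer might look like, not the code that was committed; InvocationSequencer and next() are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sequencer: hands out consecutive IDs, unique within one instance.
class InvocationSequencer {
public:
    std::uint64_t next() {
        // fetch_add returns the value held before the increment. Relaxed
        // ordering suffices here because only uniqueness is required.
        return counter_.fetch_add(1, std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> counter_{0};
};
```

Unlike the set, this needs no lock and no lookup, but it does nothing by itself to reclaim entries in results_map.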

etremel · 2019-02-22T16:58:19Z

I agree that the results_map in a RemoteInvoker should be garbage-collected once each invocation has completed; this seems like an oversight. Also, I don't see why invocation IDs need to be random; they could just as easily be sequential, since they don't have to be globally unique (just unique within a RemoteInvoker instance). You might want to check with Matthew about why he designed it that way, though.

sagarjha · 2019-02-22T22:18:34Z

After discussions with Weijia, we decided to garbage collect the relevant data structures when all replies are available and the user has destroyed the reply objects. This is not very urgent right now as Weijia's fix to use a sequencer for invocation_ids allows him to dodge the bug.

mpmilano · 2019-02-23T01:28:34Z

All we need is unique, not random. So as long as the sequencing is globally unique you should be fine.

Also we should tie the lifetime of the entry in that table to the lifetime of the results object we give the programmer. No need for invasive GC when we can track the actual usage

songweijia · 2019-02-23T02:38:59Z

Why do we need a globally unique invocation_id? The target of the RPC call should know the caller's node id and the RPC signature. I had thought it should be fine as long as the invocation_id is unique in such a domain.

Sagar and I sketched a solution for GC like this: in the destructor of class QueryResults, we can test whether the promises are all fulfilled. If they are, we clear the entry; otherwise, we set a tombstone in the corresponding PendingResults entry in the map.
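The destructor-plus-tombstone sketch could look roughly like the following. This is a single-threaded illustration with stand-in types: PendingEntry, ResultsHandle, and on_reply() are hypothetical and only mimic the roles of PendingResults, the results object, and the reply path. Reclaiming a tombstoned entry when its last reply arrives is an assumption; the thread only says that a tombstone is set. Real code would also need to synchronize with the p2p message loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Stand-in for the per-invocation bookkeeping kept in the map.
struct PendingEntry {
    std::size_t replies_expected = 0;
    std::size_t replies_received = 0;
    bool tombstone = false;  // caller dropped its results object early
    bool complete() const { return replies_received == replies_expected; }
};

using ResultsMap = std::map<std::uint64_t, PendingEntry>;

// Stand-in for the results object handed to the caller.
class ResultsHandle {
public:
    ResultsHandle(ResultsMap& map, std::uint64_t id, std::size_t expected)
        : map_(map), id_(id) {
        map_[id_].replies_expected = expected;
    }
    ResultsHandle(const ResultsHandle&) = delete;
    ResultsHandle& operator=(const ResultsHandle&) = delete;

    ~ResultsHandle() {
        auto it = map_.find(id_);
        if (it == map_.end()) return;
        if (it->second.complete()) {
            map_.erase(it);               // all replies in: reclaim now
        } else {
            it->second.tombstone = true;  // replies outstanding: mark only
        }
    }

private:
    ResultsMap& map_;
    std::uint64_t id_;
};

// Reply path: records one reply and reclaims a tombstoned entry once complete.
inline void on_reply(ResultsMap& map, std::uint64_t id) {
    auto it = map.find(id);
    if (it == map.end()) return;
    ++it->second.replies_received;
    if (it->second.tombstone && it->second.complete()) {
        map.erase(it);
    }
}
```

This ties the lifetime of each map entry to the object the caller holds, so an entry disappears as soon as neither the caller nor the reply path needs it.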

mpmilano · 2019-02-23T02:43:55Z

That makes sense for the sketch. I'm actually super sick right now so I genuinely can't remember why I think there needs to be uniqueness

KenBirman · 2019-02-23T02:48:54Z

The reply has to be matched with the specific call.

KenBirman · 2019-02-23T13:50:08Z

What about this: if the caller were to send the address of a location at which the RPC target can place the result, and this is in a region of previously pinned memory enabled for one sided RDMA (in effect, in the SST), then the response can be written directly to the desired region of client memory. Plus, since that address would be playing a dedicated role while the RPC is underway, it can also serve as the unique request id!

mpmilano · 2019-02-26T20:56:13Z

Oh that is pretty good actually --- it would certainly be unique enough, and it would entirely bypass the existing RDMC-style buffer management. Since we're overhauling that anyway, it makes sense to me to take advantage of these things!

Re: Weijia's question: as of the last time I had a hand in this, the actual runtime process that ships around results to invocations is totally type-erased; the result datagram is only addressed based on the invocation ID, not based on the receiver types at all. This is why they needed to be globally unique; it would be enough for them to be unique within a single RDMC/top-level group, but so far we've only ever had one of those per process anyway.

songweijia · 2019-05-07T20:39:20Z

It turns out my fix had an issue. On receiving a reply message, the p2p message loop sets the returned value in RemoteInvoker::results_map[invocation_id]. At the same time, an application thread may be calling p2p_send() and inserting a new entry into RemoteInvoker::results_map. Since the p2p message loop has given up the lock on the results_map, those two things can happen concurrently. However, std::map, the type of results_map, is not thread safe: if one thread is inserting/deleting some key K1, retrieving the value of another key K2 may fail with a std::out_of_range exception. That's why the p2p message loop hits std::out_of_range exceptions nondeterministically.

In commit 6b7c1ac, I replaced the std::map with a pre-filled array. I also added a reset() method to the PendingResults class. p2p_send() resets the corresponding PendingResults slot in the array without any lock or map insertion/deletion operation. Therefore, the reply processing path in the p2p message loop is no longer affected. However, we still use several hundred KB for the results vector. We may need a better design later.
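For readers who have not looked at the commit, here is a rough sketch of the pre-filled array idea. It is not the code from 6b7c1ac: PendingSlot, SlotTable, begin_call(), and complete() are hypothetical names, only reset() is named in this thread, and mapping IDs to slots with a modulo is an assumption. The sketch also assumes fewer than N calls are ever in flight, so a slot is never reset while its reply is still pending.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <future>
#include <utility>

// Stand-in for PendingResults, reduced to a single promise.
template <typename T>
class PendingSlot {
public:
    // Replaces the promise so the slot can serve a new invocation.
    std::future<T> reset() {
        promise_ = std::promise<T>();
        return promise_.get_future();
    }
    void fulfill(T value) { promise_.set_value(std::move(value)); }

private:
    std::promise<T> promise_;
};

// Fixed-size table: no insertion or deletion ever changes its structure.
template <typename T, std::size_t N>
class SlotTable {
public:
    // Sender side: take the next sequential ID and reset its slot.
    std::future<T> begin_call(std::uint64_t& id_out) {
        id_out = next_id_++;
        return slots_[id_out % N].reset();
    }
    // Reply side: deliver the value to the slot the ID maps to.
    void complete(std::uint64_t id, T value) {
        slots_[id % N].fulfill(std::move(value));
    }

private:
    std::array<PendingSlot<T>, N> slots_;
    std::uint64_t next_id_ = 0;
};
```

Since the array never rebalances or reallocates, the sender and the reply path only touch the same memory when they address the same slot, which is the property the std::map version lacked. The cost is the fixed memory footprint mentioned above.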

songweijia added bug derecho labels Feb 21, 2019

songweijia assigned etremel, sagarjha and songweijia Feb 21, 2019

etremel changed the title ~~RPC invocation_id confliction~~ RPC invocation_id conflict Feb 22, 2019

songweijia added a commit that referenced this issue Feb 22, 2019

Partial bugfix for #103, we need to reclaim the used promises in Remo…

cc1a9f1

…teInvoker::result_map.

sagarjha assigned mpmilano Feb 22, 2019

sagarjha added the low_priority label Feb 22, 2019

songweijia removed the bug label May 7, 2019
