minogrpc: deadlock in non tree based routing #170
Labels: bug (Something isn't working)
This came up while trying out RABYT. Consider a scenario where we have 15 nodes (0-14). Node 0 broadcasts a message to the other 14 nodes; some of them are reached via intermediate nodes, so those messages take 2 hops. Each node, on receiving the message from Node 0, tries to send a response back to it. Let's assume the route to Node 0 is via Node 9. Here's how a deadlock can occur:
Node 0 holds an `RLock` on `session.parentsLock` while it tries to broadcast a message. At the same time, Node 9 tries to establish a connection to Node 0. When the `Stream` gRPC handler is invoked on the server side, it tries to acquire a write lock on `session.parentsLock` in the call to `sess.Listen()` and blocks. Eventually neither Node 0 nor Node 9 can make progress, and since the other nodes depend on Node 9 to route messages to Node 0, none of the nodes can continue.
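To make the cycle concrete, here is a minimal, self-contained Go sketch of the locking pattern, not the actual minogrpc code: `session`, `broadcast`, `listen`, and the `out` channel are illustrative stand-ins for the real `session.Send`, `sess.Listen()`, and the gRPC stream.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// session is an illustrative stand-in for minogrpc's session: out models
// the stream to a peer and is unbuffered, so a send blocks until read.
type session struct {
	parentsLock sync.RWMutex
	out         chan string
}

// broadcast plays the role of Node 0's send path: it holds the read lock
// on parentsLock for the whole duration of a blocking stream send.
func (s *session) broadcast(msg string) {
	s.parentsLock.RLock()
	defer s.parentsLock.RUnlock()
	s.out <- msg // blocks forever: the peer is stuck in listen below
}

// listen plays the role of the Stream handler calling sess.Listen when
// Node 9 connects: it needs the write lock to register the new parent.
func (s *session) listen() {
	s.parentsLock.Lock() // blocks forever: broadcast still holds the RLock
	defer s.parentsLock.Unlock()
	// ...register the connection, then start draining the stream...
}

func main() {
	s := &session{out: make(chan string)}

	go s.broadcast("hello") // Node 0 broadcasting under the RLock
	go s.listen()           // Node 9's incoming connection needing the Lock

	time.Sleep(time.Second)
	fmt.Println("both goroutines are still blocked: deadlock")
}
```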
Possible workaround: `session.RecvPacket` and `session.Send` can avoid holding the `RLock` by copying the references to the values of `s.parents` into a local slice. The risk then would be trying to use a relay that has already been closed when `session.Listen` exits. Maybe we could have a per-parent lock instead to avoid that, as suggested by @cache-nez.
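Below is a sketch of the copy-then-release idea, again with illustrative names (`sendToParents`, a `parents` map, plain channels standing in for relay streams), not the actual minogrpc code:

```go
package session

import "sync"

// session is an illustrative stand-in for minogrpc's session: parent
// relays guarded by an RWMutex, each relay modeled as a plain channel.
type session struct {
	parentsLock sync.RWMutex
	parents     map[string]chan string
}

// sendToParents sketches the proposed change: hold the read lock only
// while copying the parent references into a local slice, and release it
// before any blocking send, so an incoming Stream handler can still take
// the write lock on parentsLock.
func (s *session) sendToParents(msg string) {
	s.parentsLock.RLock()
	relays := make([]chan string, 0, len(s.parents))
	for _, r := range s.parents {
		relays = append(relays, r)
	}
	s.parentsLock.RUnlock() // no shared lock held during the sends below

	for _, r := range relays {
		// The risk noted above: if session.Listen exited in the meantime
		// and closed this relay, we are writing to a dead connection.
		r <- msg
	}
}
```

And a sketch of the per-parent-lock alternative suggested by @cache-nez, under the same assumptions: each parent carries its own mutex and closed flag, so closing a relay when `session.Listen` exits and sending on it are serialized without touching `session.parentsLock` at all.

```go
package session

import (
	"errors"
	"sync"
)

// relay models one parent with its own lock, independent of parentsLock.
type relay struct {
	mu     sync.Mutex
	closed bool
	out    chan string
}

// Close is what session.Listen would call for each parent on exit.
func (r *relay) Close() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.closed = true
}

// Send refuses to use a relay that has already been closed. It still
// blocks under the per-relay lock while sending, but that only stalls
// this one parent, not the shared lock that the Stream handler needs.
func (r *relay) Send(msg string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.closed {
		return errors.New("relay closed")
	}
	r.out <- msg
	return nil
}
```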
Comments

Logs for a sample run with 15 nodes (nodes 1-15, running on ports 2000-2009 and 20010-20014): https://drive.google.com/file/d/1a-TPPOXJzTFpmUEAFuIxhuzdeszeWChE/view?usp=sharing. A table of the nodes and their IDs accompanied the logs.