Why does each node have a different view of a node that rejoins the network in a full-mesh RDMACM configuration? #9784
We have four nodes: A, B, C, and D. They use RDMACM for full connectivity, which means each node is both a server and a client to every other node. When the process on node C is stopped and restarted after a few minutes, the other three nodes act as clients and actively initiate a connection to node C. However, only node D connects successfully. For nodes A and B, the connection to node C fails with an RDMA_CM_EVENT_REJECTED event. The status value of the event is 10 (according to IBTA, this means a stale connection). It seems that each node has a different view of the rejoining node C.

Even more strangely, just after node D successfully connects to node C, the connection between node A and node D (D as server) and the connection between node B and node D (D as server too) are disconnected almost simultaneously, because each side receives the RDMA_CM_EVENT_DISCONNECTED event from the other. Could you please help me check what the problem is? Thank you!
Replies: 1 comment
This seems to be caused by a duplicated Node GUID. I am going to close this.
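For anyone hitting the same symptom, here is a minimal sketch of how a duplicated Node GUID could be spotted across hosts. It assumes Linux hosts that expose each RDMA device's GUID at `/sys/class/infiniband/<device>/node_guid`. The helper names `local_node_guids` and `find_duplicate_guids`, the host names, and the GUID values in the example are all hypothetical and are not taken from the cluster described above. How the per-host values get collected (ssh, a config management tool, and so on) is left out.

```python
from collections import defaultdict
from pathlib import Path


def local_node_guids(sysfs_root="/sys/class/infiniband"):
    """Return {device_name: node_guid} for the RDMA devices on this host.

    Assumes the Linux sysfs layout <sysfs_root>/<device>/node_guid.
    Returns an empty dict if the directory does not exist.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    guids = {}
    for dev in sorted(root.iterdir()):
        guid_file = dev / "node_guid"
        if guid_file.is_file():
            guids[dev.name] = guid_file.read_text().strip().lower()
    return guids


def find_duplicate_guids(guids_by_host):
    """Map each Node GUID reported by more than one host to those hosts."""
    hosts_by_guid = defaultdict(set)
    for host, guids in guids_by_host.items():
        for guid in guids:
            # Normalize case and whitespace so equal GUIDs compare equal.
            hosts_by_guid[guid.strip().lower()].add(host)
    return {
        guid: sorted(hosts)
        for guid, hosts in hosts_by_guid.items()
        if len(hosts) > 1
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    collected = {
        "node-1": ["0002:c903:0000:0001"],
        "node-2": ["0002:c903:0000:0001"],
        "node-3": ["0002:c903:0000:0003"],
    }
    for guid, hosts in find_duplicate_guids(collected).items():
        print(f"Node GUID {guid} is reported by: {', '.join(hosts)}")
```

Run `local_node_guids()` on each host, gather the results into one dictionary keyed by host name, and pass it to `find_duplicate_guids`. Any GUID that comes back is shared by more than one host and is worth checking first.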