[qp_impl.hpp:131] poll till completion error: 12 transport retry counter exceeded #3
Comments
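Hello, could you provide concrete code that reproduces the problem? From the description alone I cannot see what is wrong. P.S. This project has moved to https://github.com/wxdwfc/rlibv2 for maintenance; if convenient, it would be better to use the new version. Thanks!
OK, the concrete code is as follows:
void Server::RDMAConnect(std::string& client_ip,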
int client_port,
int client_id) {
// Server has already registered two separate memory regions
/************************************* RDMA Connection ***************************************/
RDMA_LOG(INFO) << "Waiting for RDMA connecting compute nodes...";
auto qp0 = rdma_ctrl->create_rc_qp(QPIdx{.node_id = client_id, .worker_id = 0, .index = 0},
rdma_ctrl->get_device(),
nullptr);
while (qp0->connect(client_ip, client_port) != SUCC) {
usleep(2000);
}
auto qp1 = rdma_ctrl->create_rc_qp(QPIdx{.node_id = client_id, .worker_id = 0, .index = 1},
rdma_ctrl->get_device(),
nullptr);
while (qp1->connect(client_ip, client_port) != SUCC) {
usleep(2000);
}
RDMA_LOG(INFO) << "Server: QP connected!";
}
Client side, thread t1:
void PairQPConnect(RdmaCtrl* rdma_ctrl,
RemoteNode& remote_node, // struct RemoteNode {int node_id; std::string ip; int port;};
MemoryAttr remote_mr0, // has been prefetched via QP::get_remote_mr()
MemoryAttr remote_mr1, // has been prefetched via QP::get_remote_mr()
RNicHandler* opened_rnic) {
// Create the two queue pairs
MemoryAttr local_mr = rdma_ctrl->get_local_mr(CLIENT_MR_ID); // CLIENT_MR_ID is a magic number
RCQP* qp0 = rdma_ctrl->create_rc_qp(
QPIdx{.node_id = remote_node.node_id, .worker_id = 0, .index = 0},
opened_rnic,
&local_mr);
qp0->bind_remote_mr(remote_mr0);
RCQP* qp1 = rdma_ctrl->create_rc_qp(
QPIdx{.node_id = remote_node.node_id, .worker_id = 0, .index = 1},
opened_rnic,
&local_mr);
qp1->bind_remote_mr(remote_mr1);
// Queue pair connection, exchange queue pair info via TCP
while (qp0->connect(remote_node.ip, remote_node.port) != SUCC) {
usleep(2000);
}
while (qp1->connect(remote_node.ip, remote_node.port) != SUCC) {
usleep(2000);
}
RDMA_LOG(INFO) << "Client: QP connected!";
qp0_array[remote_node.node_id] = qp0;
qp1_array[remote_node.node_id] = qp1;
}
Client side, thread t1:
int node_id = GetRemoteNodeID();
RCQP* qp = qp0_array[node_id];
size_t data_size = 1024;
char* read_buf = (char*) Rmalloc(data_size);
memset(read_buf, 0, data_size);
uint64_t remote_offset = 0;
auto rc = qp->post_send_to_mr(local_mr, remote_mr0, IBV_WR_RDMA_READ, read_buf, data_size, remote_offset, IBV_SEND_SIGNALED);
if (rc != SUCC) {
RDMA_LOG(ERROR) << "client: post read fail. rc=" << rc;
}
ibv_wc wc{};
rc = qp->poll_till_completion(wc, no_timeout);
if (rc != SUCC) {
RDMA_LOG(ERROR) << "client: poll read fail. rc=" << rc;
}
Rfree(read_buf);
Client side, thread t2:
int node_id = GetRemoteNodeID();
RCQP* qp = qp1_array[node_id];
size_t data_size = 1024;
char* write_buf = (char*) Rmalloc(data_size);
memset(write_buf, 0, data_size);
uint64_t remote_offset = 0;
auto rc = qp->post_send_to_mr(local_mr, remote_mr1, IBV_WR_RDMA_WRITE, write_buf, data_size, remote_offset, IBV_SEND_SIGNALED);
if (rc != SUCC) {
RDMA_LOG(ERROR) << "client: post write fail. rc=" << rc;
}
ibv_wc wc{};
rc = qp->poll_till_completion(wc, no_timeout); // ERROR: [qp_impl.hpp:131] poll till completion error: 12 transport retry counter exceeded
if (rc != SUCC) {
RDMA_LOG(ERROR) << "client: poll write fail. rc=" << rc;
}
Rfree(write_buf);
If RCQP* qp = qp1_array[node_id]; in t2 is replaced with RCQP* qp = qp0_array[node_id]; then there is no problem.
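For readers decoding the number in the log line above: the 12 is the work-completion status, which in libibverbs corresponds to IBV_WC_RETRY_EXC_ERR. The snippet below is only a minimal sketch that mirrors a few values of `ibv_wc_status` locally so it compiles without `<infiniband/verbs.h>`. The names `WcStatus` and `describe_wc_status` are hypothetical and are not part of rlib or libibverbs; with the real headers, `ibv_wc_status_str(wc.status)` would be the call to use.

```cpp
#include <cassert>
#include <string>

// Local mirror of a few `ibv_wc_status` values (hypothetical enum name).
enum class WcStatus : int {
  Success = 0,          // IBV_WC_SUCCESS
  RetryExcErr = 12,     // IBV_WC_RETRY_EXC_ERR
  RnrRetryExcErr = 13   // IBV_WC_RNR_RETRY_EXC_ERR
};

// Hypothetical helper: turns a numeric completion status into readable text.
// Only the three statuses listed above are named; everything else is echoed.
std::string describe_wc_status(int status) {
  assert(status >= 0);  // completion statuses are non-negative enum values
  switch (static_cast<WcStatus>(status)) {
    case WcStatus::Success:
      return "success";
    case WcStatus::RetryExcErr:
      return "transport retry counter exceeded";
    case WcStatus::RnrRetryExcErr:
      return "RNR retry counter exceeded";
    default:
      return "other status " + std::to_string(status);
  }
}
```

Status 12 on an RC QP generally means the requester retransmitted without ever receiving an acknowledgement from the remote QP, which is different from status 13, where the responder answered but had no receive buffer ready.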
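Hello, at present, to connect multiple QPs in rlib it is recommended to rely on [...]. If you want to connect QPs individually, you can use https://github.com/wxdwfc/rlibv2. Finally, since rlib has now migrated to https://github.com/wxdwfc/rlibv2, [...]. Thanks!
OK, thanks for the answer!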
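Hello, and thank you for open-sourcing rlib! I ran into a problem while using it:
Background: on the client (machine 1), thread t1 establishes two RCQP connections (QP1 and QP2) with the server (machine 2), and then t1 spawns a new thread t2. After that, t1 uses QP1 to issue one-sided RDMA READs to the server, and t2 uses QP2 to issue one-sided RDMA WRITEs to the server. The RDMA READs and WRITEs of t1 and t2 run in parallel (the reads and writes do not conflict in any way).
Up to this point there should be no problem, but t2's RDMA WRITE never succeeds (I can tell because the server-side memory region is not modified), and t2 hits a "transport retry counter exceeded" error when polling the CQ.
According to the RDMA Aware Networks Programming User Manual (Rev 1.7), this error is explained as:
Strangely, if t2 uses QP1 for the RDMA WRITE, the write succeeds and polling works fine (note that QP1 and QP2 were both connected successfully, through two separate calls to the connect function in class RRCQP).
However, I do not want t1 and t2 to share one RCQP, because they would compete for the completion queue. For example, t1 might poll t2's ack and conclude that its own RDMA READ succeeded, when in fact it may not have read the remote data yet.
I hope you can help with this. Thank you!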
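On the concern about two threads taking each other's completions: in libibverbs every work request carries a `wr_id` that is echoed back in the matching `ibv_wc`, so a poller can tell whose completion it has received. The following is a minimal, single-threaded sketch of that idea using stand-in types. `Completion`, `SharedCq`, `poll_for` and `demo` are hypothetical names, not rlib or libibverbs APIs, and whether rlib's post calls expose `wr_id` is an assumption that is not confirmed here.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>
#include <unordered_map>

// Hypothetical stand-in for the two `ibv_wc` fields used in this sketch.
struct Completion {
  uint64_t wr_id;
  int status;  // 0 means success
};

// Hypothetical stand-in for a completion queue shared by several pollers.
// Not thread-safe: a real version would guard both containers with a mutex.
class SharedCq {
 public:
  void push(Completion c) { pending_.push_back(c); }

  // Returns the completion tagged `wr_id` if it has arrived. Completions that
  // belong to other requests are parked rather than consumed by mistake.
  std::optional<Completion> poll_for(uint64_t wr_id) {
    auto parked = parked_.find(wr_id);
    if (parked != parked_.end()) {
      Completion c = parked->second;
      parked_.erase(parked);
      return c;
    }
    while (!pending_.empty()) {
      Completion c = pending_.front();
      pending_.pop_front();
      if (c.wr_id == wr_id) return c;
      parked_.emplace(c.wr_id, c);
    }
    return std::nullopt;
  }

 private:
  std::deque<Completion> pending_;
  std::unordered_map<uint64_t, Completion> parked_;
};

// Usage: the WRITE's completion arrives first, yet the READ's poll skips it.
inline void demo() {
  constexpr uint64_t kReadId = 1, kWriteId = 2;
  SharedCq cq;
  cq.push({kWriteId, 0});
  assert(!cq.poll_for(kReadId));   // READ is not reported as done yet
  cq.push({kReadId, 0});
  assert(cq.poll_for(kReadId));    // now the READ's own completion is found
  assert(cq.poll_for(kWriteId));   // the parked WRITE completion is not lost
}
```

Giving each thread its own QP and CQ, as the post intends, avoids this bookkeeping entirely; the sketch only shows that a shared queue does not have to lead to misattributed completions.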