BTL/OFI: retry posting receive buffer
Under heavy load (at least with the HPE CXI provider), an attempt
to post a receive buffer can return -FI_EAGAIN.

This PR uses the OFI_RETRY_UNTIL_DONE macro to retry posting the receive
buffer when the fi_recv call returns -FI_EAGAIN.

Signed-off-by: Howard Pritchard <[email protected]>
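
For readers unfamiliar with the pattern, the retry-on-EAGAIN idea can be sketched in plain C. This is only an illustration under stated assumptions, not the actual OFI_RETRY_UNTIL_DONE definition from Open MPI: SKETCH_EAGAIN, SKETCH_RETRY_UNTIL_DONE, sketch_progress, fake_post_recv and post_one_recv are all hypothetical names, and fake_post_recv merely stands in for fi_recv so the example builds without libfabric.

```c
/* Stand-in for libfabric's FI_EAGAIN; the value is for illustration only. */
#define SKETCH_EAGAIN 11

/* Hypothetical progress hook. A real caller would typically make
 * communication progress here so the provider can free resources. */
static int progress_calls = 0;
static void sketch_progress(void) { progress_calls++; }

/* Evaluate EXPR repeatedly while it returns -SKETCH_EAGAIN and store the
 * final return code in RC. Any other code (success or a hard error)
 * ends the loop immediately. */
#define SKETCH_RETRY_UNTIL_DONE(EXPR, RC)      \
    do {                                       \
        for (;;) {                             \
            (RC) = (EXPR);                     \
            if (-SKETCH_EAGAIN != (RC)) {      \
                break;                         \
            }                                  \
            sketch_progress();                 \
        }                                      \
    } while (0)

/* Fake post operation: reports "try again" for the first *busy calls,
 * then succeeds. It plays the role of fi_recv in this sketch. */
static int fake_post_recv(int *busy)
{
    if (*busy > 0) {
        (*busy)--;
        return -SKETCH_EAGAIN;
    }
    return 0;
}

/* Post one buffer, retrying while the fake provider is busy. */
static int post_one_recv(int busy_calls)
{
    int rc;
    SKETCH_RETRY_UNTIL_DONE(fake_post_recv(&busy_calls), rc);
    return rc;
}
```

Calling post_one_recv(3) retries three times before returning 0. The caller's existing error check still applies afterwards, because only the transient "try again" code is retried and every other return value is passed through unchanged.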
hppritcha committed May 21, 2024
1 parent 6e99e02 commit c522de1
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions opal/mca/btl/ofi/btl_ofi_module.c
@@ -16,7 +16,7 @@
*
* Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All Rights reserved.
* Copyright (c) 2020 Google, LLC. All rights reserved.
-* Copyright (c) 2022-2023 Triad National Security, LLC. All rights
+* Copyright (c) 2022-2024 Triad National Security, LLC. All rights
* reserved.
* $COPYRIGHT$
*
@@ -31,6 +31,7 @@
#include "opal/mca/accelerator/accelerator.h"
#include "opal/mca/accelerator/base/base.h"
#include "opal/mca/btl/btl.h"
+#include "opal/mca/common/ofi/common_ofi.h"
#include "opal/mca/mpool/base/base.h"
#include "opal/mca/mpool/mpool.h"
#include "opal/util/printf.h"
@@ -412,9 +413,8 @@ int mca_btl_ofi_post_recvs(mca_btl_base_module_t *module, mca_btl_ofi_context_t

comp = mca_btl_ofi_frag_completion_alloc(module, context, frag, MCA_BTL_OFI_TYPE_RECV);

-rc = fi_recv(context->rx_ctx, &frag->hdr, MCA_BTL_OFI_RECV_SIZE, NULL, FI_ADDR_UNSPEC,
-&comp->comp_ctx);
-
+OFI_RETRY_UNTIL_DONE(fi_recv(context->rx_ctx, &frag->hdr, MCA_BTL_OFI_RECV_SIZE, NULL, FI_ADDR_UNSPEC,
+&comp->comp_ctx), rc);
if (FI_SUCCESS != rc) {
BTL_ERROR(("cannot post recvs"));
return OPAL_ERROR;