
Intranode GPU communication crashes in MPI called from Cabana::Gather::apply() #106

Open · patrickb314 opened this issue Feb 27, 2023 · 2 comments
Labels: bug (Something isn't working)

patrickb314 commented Feb 27, 2023:

CabanaMD with the standard in.lj test case crashes on both LLNL Lassen (Spectrum MPI or MVAPICH2) and LANL Chicoma (Cray MPICH) when communicating between GPUs on the same node. Inter-node communication works, though I suspect that is only because the inter-node send path error-checks buffers less strictly than the RMA routines MPI uses for intra-node transfers. GPU-aware communication is enabled in all cases.
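
For reference, the failing pattern boils down to handing a device buffer straight to MPI_Send between two ranks on the same node. Here's a minimal standalone sketch of that pattern (my own reconstruction, not code from Cabana or CabanaMD; it assumes two ranks on one node, one GPU each, and GPU-aware MPI enabled):

// Hypothetical reproducer sketch, not taken from either project.
// Run with two ranks pinned to one node, e.g. mpirun -np 2 ./a.out
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );
    int rank;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    const int n = 1024; // small message, like a halo gather
    double* buf;
    cudaMalloc( &buf, n * sizeof( double ) );
    cudaMemset( buf, 0, n * sizeof( double ) );

    if ( rank == 0 )
    {
        // Device pointer handed directly to MPI, as GPU-aware
        // Cabana communication does.
        MPI_Send( buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD );
    }
    else if ( rank == 1 )
    {
        MPI_Recv( buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                  MPI_STATUS_IGNORE );
        printf( "rank 1 received %d doubles\n", n );
    }

    cudaFree( buf );
    MPI_Finalize();
    return 0;
}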

The crash appears to be in the MPI_Send call invoked by Cabana::Gather::apply() (line 335 of Cabana_Halo.hpp). Here's the Lassen lwcore traceback from Spectrum MPI:

[email protected]:101
PAMI::Protocol::Get::GetRdma<PAMI::Device::Shmem::DmaModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,@libpami.so.3
PAMI::Protocol::Get::CompositeRGet<PAMI::Protocol::Get::RGet,@libpami.so.3
PAMI::Context::rget_impl(pami_rget_simple_t*)@libpami.so.3
[email protected]
process_rndv_msg@mca_pml_pami.so
pml_pami_recv_rndv_cb@mca_pml_pami.so
PAMI::Protocol::Send::EagerSimple<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,@libpami.so.3
[email protected]
mca_pml_pami_progress_wait@mca_pml_pami.so
mca_pml_pami_send@mca_pml_pami.so
PMPI_Send@libmpi_ibm.so.3
Cabana::Gather<Cabana::Halo<Kokkos::Device<Kokkos::Cuda,@()
void@()
Comm<System<Kokkos::Device<Kokkos::Cuda,@()
CbnMD<System<Kokkos::Device<Kokkos::Cuda,@()
main@()
---STACK
streeve (Member) commented Mar 1, 2023:

Thanks for the details - I'll test this out when I'm back from travel next week. It also looks like I need to manually restart the CI periodically to try to catch this type of bug.

streeve added the bug label Mar 1, 2023
patrickb314 (Author) commented:

On further exploration, it's unclear that this is a CabanaMD problem. I'm seeing multiple cases where small GPU-to-GPU intranode sends crash on these systems, but I haven't yet been able to isolate the cause. I'll update as I find out more.
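
To isolate it, my rough plan (a sketch under my own assumptions, not a confirmed reproducer) is to sweep message sizes in a bare device-buffer MPI_Send/MPI_Recv loop and see whether the crash tracks the eager-to-rendezvous protocol switch that the rndv frames in the traceback suggest:

// Hypothetical isolation sweep, not code from either project.
// Doubles the message size each iteration to find where (if anywhere)
// the intranode GPU-GPU path starts failing.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );
    int rank;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    const size_t max_bytes = 1 << 24; // 16 MiB upper bound
    char* buf;
    cudaMalloc( &buf, max_bytes );

    for ( size_t bytes = 1; bytes <= max_bytes; bytes <<= 1 )
    {
        if ( rank == 0 )
            MPI_Send( buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD );
        else if ( rank == 1 )
            MPI_Recv( buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE );
        MPI_Barrier( MPI_COMM_WORLD );
        if ( rank == 0 )
            printf( "ok at %zu bytes\n", bytes ); // last size printed before a crash is the suspect
    }

    cudaFree( buf );
    MPI_Finalize();
    return 0;
}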
