The current issue is mitigated by #2073. It now takes active effort to create unused bias weights in the convolution and fully-connected layers. This issue remains as a record in case we run into a similar problem in the future or refactor the weights class.
Description
@samadejacobs has observed hangs when training models with many convolution layers. On Lassen, he sees that either all even ranks or all odd ranks get stuck in an asynchronous allreduce in model::reconcile_weights (lbann/src/models/model.cpp, line 2161 at b9bb511).
@benson31 has reproduced the hang on Pascal (edit: it was on Lassen), although without the even/odd rank behavior. He observes that several non-blocking allreduces on CPU data return invalid request objects instead of the expected MPI_REQUEST_NULL.
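For reference, here is a minimal, standalone MPI sketch (an assumed pattern for illustration, not LBANN's actual code) of the non-blocking allreduce bookkeeping in question: MPI_Wait on a request started by MPI_Iallreduce resets the handle to MPI_REQUEST_NULL, so a handle that is neither pending nor MPI_REQUEST_NULL points at a problem in the caller's request tracking.

```cpp
// Minimal standalone MPI sketch (not LBANN code) of the non-blocking
// allreduce pattern referenced above. After MPI_Wait completes the
// operation, MPI resets the request handle to MPI_REQUEST_NULL.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  std::vector<float> buf(1024, 1.0f);
  MPI_Request req = MPI_REQUEST_NULL;

  // Start a non-blocking allreduce on CPU data.
  MPI_Iallreduce(MPI_IN_PLACE, buf.data(), static_cast<int>(buf.size()),
                 MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD, &req);

  // ... overlap other work here ...

  // Complete it; MPI_Wait frees the request and resets it to MPI_REQUEST_NULL.
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  if (req != MPI_REQUEST_NULL) {
    std::fprintf(stderr, "unexpected: request not reset after MPI_Wait\n");
  }

  MPI_Finalize();
  return 0;
}
```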
Minimal reproducer on Lassen
Interestingly, the hang shows up with num_layers=128 but not with num_layers=127.
Proposal
I think this is happening because we are constructing weights objects that are not properly configured before the setup stage. In particular, the weights dims and data distribution are set in layer::setup_data. The convolution layer can accept two weights objects, but doesn't configure the second one if bias is disabled.
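A simplified, hypothetical sketch of that failure mode (the classes below are placeholders, not LBANN's real layer and weights types): with bias disabled, a second user-attached weights object never gets its dims set during setup, so it enters later collective operations in an inconsistent state.

```cpp
// Simplified, hypothetical illustration of the suspected failure mode
// (placeholder classes, not LBANN's actual ones).
#include <cstdio>
#include <vector>

struct Weights {
  std::vector<int> dims;  // stays empty unless a layer configures it
};

struct ConvLayer {
  bool has_bias = false;
  std::vector<Weights*> weights;  // [0] = kernel, [1] = optional bias

  void setup_data() {
    weights.at(0)->dims = {16, 3, 3, 3};   // kernel weights always configured
    if (has_bias && weights.size() > 1) {
      weights.at(1)->dims = {16};          // bias weights skipped when bias is off
    }
  }
};

int main() {
  Weights kernel, bias;                    // user attaches both weights objects
  ConvLayer conv;
  conv.weights = {&kernel, &bias};
  conv.setup_data();                       // bias.dims is still empty afterwards
  std::printf("bias configured: %d\n", !bias.dims.empty());  // prints 0
  return 0;
}
```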
I think the best solution would be to force the user (or Python front-end) to fully and explicitly configure the dims and distribution of any weights objects they create. If the user doesn't provide weights, the layer can just create new ones with the right configuration, so this is not much less convenient than our current approach. This will remove the messy two-way interaction between weights and layers, allow for weights that are not owned by any layers, and be more convenient for importing weights and for sub-grid/sub-graph parallelism.
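As a rough sketch of the proposed contract (again with placeholder types and names, not LBANN's actual API), whoever creates a weights object would give it complete dims and a distribution up front, and layer setup would only validate:

```cpp
// Hypothetical sketch of the proposed contract: weights are fully described
// (dims + distribution) by their creator before setup, and layer setup only
// validates instead of filling in missing pieces. Placeholder types/names.
#include <stdexcept>
#include <vector>

struct Distribution {};  // stand-in for the parallel data distribution

struct Weights {
  std::vector<int> dims;
  Distribution dist;
  bool fully_configured() const { return !dims.empty(); }
};

struct Layer {
  std::vector<Weights*> weights;

  void setup_data() const {
    // Fail fast instead of silently configuring (or skipping) weights.
    for (const auto* w : weights) {
      if (!w->fully_configured()) {
        throw std::runtime_error("weights must be configured before layer setup");
      }
    }
  }
};

int main() {
  Weights kernel;
  kernel.dims = {16, 3, 3, 3};   // creator supplies the full shape up front
  Layer conv;
  conv.weights = {&kernel};
  conv.setup_data();             // passes; an unconfigured weights would throw
  return 0;
}
```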