You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Because of the essence of one-sided communication, the progress of different processes may vary a lot, especially under the heterogeneous environment. If simply write the code like
for e in range(epochs):
xxx
some_collective_ops
Then, the last collective ops will waste the advantage of one-sided communication. We need a better way to design the code or deal with this situation.
The text was updated successfully, but these errors were encountered:
Thoughts: 1. Use barrier function every N iterations, which can be useful for unstable performance but not useful for heterogeneous situation.
2. Run for a very long time and relied on the early stopping technology, whichever node/agent achieve the stopping criteria, sending a stop signal to the others and use the model of that agent as the final result.
Because of the essence of one-sided communication, the progress of different processes may vary a lot, especially under the heterogeneous environment. If simply write the code like
for e in range(epochs):
xxx
some_collective_ops
Then, the last collective ops will waste the advantage of one-sided communication. We need a better way to design the code or deal with this situation.
The text was updated successfully, but these errors were encountered: