Issue: Synchronization Error with Epoch Time Logging
In PR #308, we introduced logging of the epoch time. The intention was to log this information only on the main process, so we added a check that the process has rank 0 before logging.
However, this implementation appears to cause a synchronization error that locks the training process: training halts after the first epoch and does not continue.
Steps to Reproduce:
1. Run training with multiple processes.
2. Observe that training stops after the first epoch.
Expected Behavior:
Training should continue seamlessly across epochs.
Actual Behavior:
Training locks after the first epoch due to a synchronization error likely related to the rank-based logging check.
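Possible mechanism (an assumption, not confirmed by this issue): if the code inside the rank 0 check performs a collective operation such as a barrier or a reduction, rank 0 waits for the other ranks, which never reach that call. The sketch below simulates this with threads and `threading.Barrier` from the standard library rather than a real distributed backend. The names `WORLD_SIZE`, `end_of_epoch`, and `simulate` are hypothetical and do not come from the project's code.

```python
import threading

WORLD_SIZE = 4  # hypothetical number of ranks, simulated here with threads


def end_of_epoch(rank, barrier, guarded, results):
    """Simulate the end-of-epoch logging step on a single rank."""
    try:
        if guarded:
            # Suspected buggy pattern: the collective sits inside the rank
            # check, so only rank 0 ever reaches it.
            if rank == 0:
                barrier.wait(timeout=0.5)  # nobody else arrives
                results[rank] = "logged"
            else:
                results[rank] = "skipped"
        else:
            # Safer pattern: every rank joins the collective,
            # and only rank 0 writes the log line.
            barrier.wait(timeout=0.5)
            results[rank] = "logged" if rank == 0 else "synced"
    except threading.BrokenBarrierError:
        # A real backend would typically hang here instead of timing out.
        results[rank] = "timed out"


def simulate(guarded):
    """Run one simulated epoch end on all ranks and collect the outcomes."""
    barrier = threading.Barrier(WORLD_SIZE)
    results = {}
    threads = [
        threading.Thread(target=end_of_epoch, args=(rank, barrier, guarded, results))
        for rank in range(WORLD_SIZE)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print("guarded collective:  ", simulate(guarded=True))
    print("unguarded collective:", simulate(guarded=False))
```

In the guarded case rank 0 times out waiting at the barrier while the other ranks move on, which matches the observed lock after the first epoch. If this is the cause, moving any collective call outside the rank check and keeping only the log statement inside it should let training proceed.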