Yes, checkpointing is a must-have for this type of training, especially if you plan on training on a single GPU. Is there any progress? See my considerations below:
Given that it takes a very long time to train the model, it is essential to be able to save checkpoints.
Is checkpointing supported? How can I make sure a long training run is checkpointed? Why is it not documented?
Here is what I see in the code:
in train.py:

parser.add_argument('--checkpoint', dest='checkpoint', default=0, type=int,
                    help='Enables checkpoint saving of model')

in solver.py:

# Save model each epoch
if self.checkpoint:
    file_path = os.path.join(
        self.save_folder, 'epoch%d.pth.tar' % (epoch + 1))
    torch.save(self.model.serialize(self.model, self.optimizer, epoch + 1,
                                    tr_loss=self.tr_loss,
                                    cv_loss=self.cv_loss),
               file_path)
    print('Saving checkpoint model to %s' % file_path)
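So per-epoch saving exists and is gated on the --checkpoint flag (default 0, so it has to be enabled explicitly). For resuming, a minimal sketch of loading one of these files could look like the following; the package keys ('state_dict', 'optim_dict', 'epoch') are assumptions about what serialize() stores, not something confirmed from this repo, and model/optimizer are assumed to be built exactly as in the original run:

import torch


def load_checkpoint(checkpoint_path, model, optimizer):
    """Minimal resume sketch: restore model/optimizer state from a file
    written by model.serialize() in solver.py.

    NOTE: the package keys ('state_dict', 'optim_dict', 'epoch') are
    assumptions; print(package.keys()) to confirm what serialize() stores.
    """
    package = torch.load(checkpoint_path, map_location='cpu')
    model.load_state_dict(package['state_dict'])
    optimizer.load_state_dict(package['optim_dict'])
    return package['epoch']  # epoch to resume from


# Hypothetical usage, assuming model and optimizer were constructed with
# the same hyperparameters as the run that produced the checkpoint:
# start_epoch = load_checkpoint('epoch100.pth.tar', model, optimizer)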
Hi
I tried to train the model from a previous checkpoint.
For example, I trained the model for 100 epochs and got the final.pth.tar file.
I put the absolute path to it in run.sh in these lines:
but training exits with this log:
What could cause this tensor size problem?
Am I resuming training from the checkpoint correctly?
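For what it's worth, a size-mismatch error when loading a checkpoint usually means the model being built in the new run has different hyperparameters (layer sizes, number of filters, etc.) than the one that produced final.pth.tar, so load_state_dict finds tensors of different shapes. A quick sketch for pinpointing the offending parameter, again assuming the saved package keeps the weights under a 'state_dict' key and that model is the freshly built network:

import torch


def find_shape_mismatches(checkpoint_path, model):
    """Print every parameter whose shape in the checkpoint differs from the
    freshly built model. Assumes the saved package stores the weights under
    a 'state_dict' key; inspect package.keys() if that assumption is wrong.
    """
    package = torch.load(checkpoint_path, map_location='cpu')
    saved_state = package['state_dict']
    current_state = model.state_dict()

    for name, saved_tensor in saved_state.items():
        if name not in current_state:
            print('missing in current model:', name)
        elif current_state[name].shape != saved_tensor.shape:
            print('shape mismatch:', name,
                  tuple(saved_tensor.shape), 'vs', tuple(current_state[name].shape))


# Hypothetical usage with the model built from the current run.sh settings:
# find_shape_mismatches('final.pth.tar', model)

If any mismatches show up, double-check that the network arguments in run.sh match the ones used for the original 100-epoch run.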