training with more epochs #44

Closed
oussaifi-majdi opened this issue Sep 21, 2023 · 7 comments · Fixed by #45

Comments

@oussaifi-majdi

I'm running into time limits in Google Colab. I need to train on my data for 150 epochs, but Colab terminates the session after about 50 epochs. How can I resume from the last saved checkpoint after restarting the Colab session?

@naseemap47 naseemap47 self-assigned this Sep 22, 2023
@naseemap47 naseemap47 added the enhancement New feature or request label Sep 22, 2023
@naseemap47
Owner

Hi @oussaifi-majdi,
To address this, I added a new option for resuming model training.
I think this will solve your issue.
If you run into any problems, please let me know.
Thank you.
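
For readers new to the idea, here is a minimal sketch of what resuming means in general. It is not the code added to this project: `train_with_resume`, the JSON state file, and the `max_epochs_this_session` parameter are all hypothetical names used for illustration. The loop stores the index of the next epoch after every completed epoch, so a restarted session continues where the previous one stopped instead of starting from epoch 0.

```python
import json
from pathlib import Path


def train_with_resume(total_epochs, state_path, max_epochs_this_session=None):
    """Run epochs up to total_epochs, saving progress after each one.

    Returns the list of epoch indices executed in this session.
    """
    state_path = Path(state_path)

    # Resume from the stored epoch counter if an earlier session left one.
    start_epoch = 0
    if state_path.exists():
        start_epoch = json.loads(state_path.read_text())["next_epoch"]

    executed = []
    for epoch in range(start_epoch, total_epochs):
        if max_epochs_this_session is not None and len(executed) >= max_epochs_this_session:
            break  # stands in for the Colab session being cut off

        # ... one real training epoch (and a model checkpoint save) would go here ...
        executed.append(epoch)

        # Persist progress so a restart continues from the next epoch.
        state_path.write_text(json.dumps({"next_epoch": epoch + 1}))
    return executed
```

Calling the function once with a session limit of 50 and then again without a limit covers epochs 0 to 49 in the first session and 50 to 149 in the second. A real trainer would also need to restore model weights and optimizer state, which this sketch leaves out.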

@naseemap47 naseemap47 linked a pull request Sep 22, 2023 that will close this issue
@naseemap47 naseemap47 mentioned this issue Sep 22, 2023
@oussaifi-majdi
Author

@naseemap47 Thank you so much for your help with this issue! Your guidance and support were invaluable in resolving the problem.
Now I use the resume option to continue training:

python3 train.py --data /dir/dataset/data.yaml --batch 16 --epoch 120 --model yolo_nas_m --size 640 --resume

But how can I plot accuracy, precision, etc. with TensorBoard over the whole training run, from the first hours of training until all epochs are finished?

CHECKPOINT_DIR =?
EXPERIMENT_NAME =?
%load_ext tensorboard
%tensorboard --logdir {CHECKPOINT_DIR}/{EXPERIMENT_NAME} --port 6005
%reload_ext tensorboard

@naseemap47
Owner

Hi @oussaifi-majdi,
Here is an example. I think it will help you.
Example:

python3 train.py --data /dir/dataset/data.yaml --batch 6 --epoch 100 --model yolo_nas_m --size 640 --weight runs/train2/ckpt_latest.pth --resume
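
After a Colab restart it can be handy to locate the newest checkpoint automatically instead of typing the path by hand. The sketch below assumes that each run writes to its own subfolder of `runs/` and stores a file named `ckpt_latest.pth` there, as the path in the example above suggests. The helper names `find_latest_checkpoint` and `build_resume_command` are hypothetical and not part of the project; the command uses only flags that appear in this thread.

```python
from pathlib import Path


def find_latest_checkpoint(runs_dir="runs", filename="ckpt_latest.pth"):
    """Return the most recently modified checkpoint under runs_dir, or None."""
    runs_dir = Path(runs_dir)
    if not runs_dir.is_dir():
        return None
    # Look one level down: runs/<run name>/ckpt_latest.pth (assumed layout).
    candidates = list(runs_dir.glob(f"*/{filename}"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def build_resume_command(data_yaml, checkpoint, epochs,
                         batch=6, model="yolo_nas_m", size=640):
    """Assemble the train.py argument list from the flags shown in this thread."""
    return [
        "python3", "train.py",
        "--data", str(data_yaml),
        "--batch", str(batch),
        "--epoch", str(epochs),
        "--model", model,
        "--size", str(size),
        "--weight", str(checkpoint),
        "--resume",
    ]
```

The returned list can be passed to `subprocess.run`, or joined with spaces and pasted into a notebook cell. If no checkpoint is found the first function returns `None`, in which case training should be started without `--weight` and `--resume`.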

@oussaifi-majdi
Author

Thanks, sir, but if I resume training later using the --resume option, it may be difficult to get the full precision and accuracy curves from the first epoch to the end.
Is there a way to get the complete curves?

@naseemap47
Owner

Hi @oussaifi-majdi,
I fixed the issue; you can check now.
Thank you for finding it.
Please let me know whether this fixes your issue.
Thank you.

@oussaifi-majdi
Author

@naseemap47 Thanks, with #46 resume works well. The remaining problem is this: if we train epochs 0 to 70, stop, then resume and continue from 70 to 100, TensorBoard at the end only displays the recall, precision, F1, etc. curves for the last part of training (70 to 100), not for 1 to 100.
I found a possible solution at https://github.com/Deci-AI/super-gradients/blob/master/documentation/source/experiment_monitoring.md but it does not work with this project. It would be worth integrating one of those methods into the project: it would set the project apart from others and solve a very interesting problem.
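
Until such an integration exists, one possible workaround is to stitch the curves together outside TensorBoard. The sketch below is only an illustration and is unrelated to the super-gradients monitoring methods linked above. It assumes the per-epoch values of a metric have already been exported from each run as `(epoch, value)` pairs with absolute epoch numbers; how to export them is not covered here, and `stitch_metric_segments` is a hypothetical helper name.

```python
def stitch_metric_segments(segments):
    """Merge per-epoch metric segments from an interrupted run and its resumes.

    Each segment is a list of (epoch, value) pairs. Segments are passed in
    the order the runs happened; if an epoch appears in more than one
    segment, the value from the later run wins.
    """
    merged = {}
    for segment in segments:
        for epoch, value in segment:
            merged[epoch] = value
    # Return one continuous curve sorted by epoch.
    return sorted(merged.items())


# Example with made-up numbers: the first run stopped after epoch 2,
# and the resumed run repeated epoch 2 before continuing.
first_run = [(0, 0.10), (1, 0.20), (2, 0.30)]
resumed_run = [(2, 0.35), (3, 0.40)]
full_curve = stitch_metric_segments([first_run, resumed_run])
```

The merged list can then be plotted with any plotting tool to get a single curve over all epochs.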

@naseemap47
Owner

@oussaifi-majdi Thank you.
I will look into it.
Thank you for your support.


2 participants