About Training Epochs & GPU Memory #112

Open
user20421 opened this issue Nov 19, 2024 · 2 comments

Comments

@user20421

Thank you very much for your excellent work. I would like to build on it and try some new ideas, but I am currently running into a few issues. Could you please offer some suggestions or solutions? I would be very grateful for your help.
1. When I try to reproduce the YOLOv-S and YOLOv++-S models on a single 3080 Ti GPU following your configuration, I sometimes hit a "CUDA out of memory" error. Watching nvidia-smi, I noticed that GPU memory usage fluctuates significantly, which is uncommon in typical deep learning workloads. Could this behavior be related to the multi-scale training mode or the EMA training mode? (A small memory-logging sketch follows this list.)
2. Related to the first issue: when I attempt multi-GPU training, the program gets stuck and eventually throws a timeout error. Is there any way to enable single-machine multi-GPU training?
3. In your YOLOv-S experiments the maximum epoch is set to 7, and I noticed that the loss remains high near the end of training. I tried increasing the maximum epoch to 14, but the loss is still quite large. Does a high loss significantly impact the training results? Should I train for more epochs or reduce the learning rate to improve performance?
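For point 1, here is a minimal sketch (not from the YOLOV repository) that uses only standard `torch.cuda` statistics to log per-iteration peak memory; if the peak jumps on the iterations where the multi-scale input size is largest, that would point to multi-scale training as the cause:

```python
import torch

def log_gpu_memory(step: int, device: int = 0) -> None:
    """Print current and peak allocated GPU memory (MiB) for one training step.

    Hypothetical helper (not part of the YOLOV repo); it only uses standard
    torch.cuda statistics, so it can be dropped into any PyTorch training loop.
    """
    allocated = torch.cuda.memory_allocated(device) / 2**20
    peak = torch.cuda.max_memory_allocated(device) / 2**20
    print(f"step {step}: allocated={allocated:.0f} MiB, peak={peak:.0f} MiB")
    # Reset the peak counter so the next call reports only the next step's peak.
    torch.cuda.reset_peak_memory_stats(device)
```

Calling `log_gpu_memory(it)` at the end of every iteration makes it easy to see whether the spikes line up with changes in input resolution or with specific frames.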

@YuHengsss
Owner

For the first question: GPU memory usage is related to the number of candidate proposals. You can set a maximum limit as a hyper-parameter (300 will be fine for 11 GB of GPU memory):

self.maximal_limit = 0  # repository default; the reply above suggests ~300 for an 11 GB GPU
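For reference, a cap like this is typically applied by keeping only the highest-scoring candidate boxes before the feature-aggregation stage. The sketch below is illustrative only; the function and argument names are assumptions, not code from the YOLOV repository:

```python
import torch

def cap_proposals(proposals: torch.Tensor, scores: torch.Tensor, maximal_limit: int):
    """Keep at most `maximal_limit` proposals, ranked by confidence score.

    Illustrative sketch only: the names and call site are assumptions,
    not code from the YOLOV repository. A limit of 0 is treated as "no cap".
    """
    if maximal_limit <= 0 or scores.numel() <= maximal_limit:
        return proposals, scores
    topk_scores, topk_idx = torch.topk(scores, k=maximal_limit)
    return proposals[topk_idx], topk_scores
```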

For the second question: because of limited computational resources and my limited coding experience when I was working on video object detection, multi-GPU training is not well supported.

For the third question: some of the reported losses belong to the base detector and are not optimized (see here). In my experience on the ImageNet VID dataset, the best performance is reached after three or four epochs and decreases after that.
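As a rough illustration of that point (variable names are hypothetical, not the repo's actual code): when some loss components are only logged and excluded from the backward pass, the printed total can stay high even though the trainable part keeps decreasing.

```python
import torch

# Hypothetical illustration (names are not from the YOLOV codebase):
# the base detector's losses are reported but kept out of the backward
# pass, so the logged total stays large while the part that is actually
# optimized keeps shrinking.
base_detector_loss = torch.tensor(3.0)                  # frozen, logged only
head_param = torch.tensor(1.0, requires_grad=True)      # stand-in for the aggregation head
aggregation_loss = (head_param - 0.2) ** 2              # the only trainable term

aggregation_loss.backward()                             # gradients flow only through the head
logged_total = base_detector_loss + aggregation_loss.detach()
print(f"logged total loss: {logged_total.item():.3f}")  # dominated by the frozen term
```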

@user20421
Author

Thank you very much for your suggestions and for the outstanding work you have provided.
