-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Training] On Device Training is not working #18168
Comments
@IzanCatalan: Can you also paste your train recipe? What are your timelines - How urgent is this? |
@askhade Are you asking for my training code? It is the following and I would like to find a solution as soon as possible within this week:
|
@askhade, I have an update on this issue. I tried to train with a batch size of 256 instead of 8 images, and suddenly, the first epoch gave an accuracy of 0.54 instead of 0.31 (with batch size = 8). So, am I right when supposing that the higher the batch size, the higher the accuracy per epoch? However, I don't understand yet how it is possible when re-training a pre-trained model with an accuracy higher than 0.75 (a Resnet Model downloaded from the Official Onnx Model Zoo Repo), the first epoch degrades this accuracy to 0.54. So, my question is still the same: is this behaviour normal for On Device Training with OnnxRuntime Training? And, if it is so, how I can change it to perform a more optimal training which can give me a higher accuracy per epoch? |
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details. |
@IzanCatalan hello, do you have an updated link to the ONNX file you're generating artifacts from, or the name of which resnet50 model you're using? |
Describe the issue
I am re-training some onnx models from ONNX Model Zoo Repo, especially Resnet50. Previously, I created the artifacts according to onnx-runtime-training-examples Repo. In my case I create them with the following code:
I follow the same structure train-test defined in the onnx-runtime-training-examples Repo code. In my case, I load the Imagenet train and validate dataset:
However, every time I try to train, the accuracy I get is much lower than the accuracy obtained in ONNX Model Zoo Repo. For example, after 20 epochs, the accuracy reached is 0.53 compared to the original accuracy, which is 0.75. I don't insert any changes or modify the weights in the process, but it surprises me the fact that if I re-train a model which previously obtained 0.75 accuracy, I can barely reach 0.53. Moreover, the first epoch only gives 0.31 accuracy.
I would like to know if this is normal behaviour, and if it is so, I would like to know how I can re-train some pre-trained and pre-tested onnx models without losing a huge amount of accuracy.
To reproduce
I am running onnxruntime build from source for cuda 11.2, GCC 9.5, cmake 3.27 and python 3.8 with ubuntu 20.04.
Urgency
No response
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
onnxruntime-training 1.17.0+cu112
PyTorch Version
None
Execution Provider
CUDA
Execution Provider Library Version
Cuda 11.2
The text was updated successfully, but these errors were encountered: