
[Training] Retraining a YOLO V8n model on device #20201

Closed
OAHLSTM opened this issue Apr 4, 2024 · 1 comment
Labels
training: issues related to ONNX Runtime training; typically submitted using template

Comments

OAHLSTM commented Apr 4, 2024

Describe the issue

Hello,

I'm trying to retrain a YOLOv8n model on a custom dataset collected directly on an arm64 device running Linux. I'm using onnxruntime to generate the training artifacts, and for now I'm struggling to define a loss function for my model. I have the PyTorch model generated from Ultralytics.
I tried following the suggestion made by @baijumeswani on a similar issue.

import onnx
import torch
from onnxruntime.training import artifacts

class MyPTModelWithLoss(torch.nn.Module):  # must subclass nn.Module for export
    def __init__(self):
        super().__init__()
        ...

    def forward(self, ...):
        p, q, r = compute_logits()
        # Fold the loss computation into the forward pass itself.
        loss = loss1(p) + loss2(q) + loss3(r)
        return loss

pt_model = MyPTModelWithLoss(...)
torch.onnx.export(pt_model, ...)

onnx_model = onnx.load(<exported_onnx_model_path>)
# loss=None because the loss is already part of the exported graph.
artifacts.generate_artifacts(onnx_model, requires_grad=[...], frozen_params=[...], loss=None, optimizer=...)

This approach suggests adding the loss function at the end of the model's forward pass and passing None as the loss when generating the artifacts. The problem with that approach is that the gradient builder tries to build gradients for the operations used by the loss, such as ReduceMin and ReduceMax. However, there is no gradient definition for these operations, and it is not the correct practice to compute gradients for the loss.
I was wondering if there is a way to cut the graph into two subgraphs so the gradient is built only for the forward pass and not for the loss function. If not, what would be the best approach to generate the training artifacts in this case?
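
For reference, a minimal sketch of the other path that generate_artifacts supports, where ONNX Runtime appends one of its built-in losses and builds that loss's gradient itself (the file path and parameter name below are hypothetical, and MSE is used purely for illustration):

import onnx
from onnxruntime.training import artifacts

onnx_model = onnx.load("yolov8n_no_loss.onnx")  # hypothetical export without the loss

# Let ONNX Runtime append a built-in loss so that the loss subgraph and
# its gradient are constructed by ONNX Runtime itself.
artifacts.generate_artifacts(
    onnx_model,
    requires_grad=["model.22.dfl.conv.weight"],  # hypothetical parameter name
    frozen_params=[],
    loss=artifacts.LossType.MSELoss,
    optimizer=artifacts.OptimType.AdamW,
    artifact_directory="training_artifacts",
)

This only applies when the loss can be expressed as one of the built-in options, which a composite YOLO loss cannot, but it shows the case where the gradient of the loss is handled entirely by the artifact generator.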

Thank you for your support.

To reproduce

See the code snippet in the description above.

Urgency

This is really urgent; we are trying to deploy a retrainable YOLOv8 model on the device using the onnxruntime-training framework.

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.17.1

PyTorch Version

No response

Execution Provider

Default CPU

Execution Provider Library Version

No response

OAHLSTM added the training label on Apr 4, 2024
baijumeswani (Contributor) commented:

> However, there is no gradient definition for these operations, and it is not the correct practice to compute gradients for the loss.

Gradient computation should always start at the loss. What is being computed during backpropagation is the gradient of the loss with respect to the inputs of each node.

The goal of the training phase is to minimize the loss. So, we want to find the changes that need to be made to the weight parameters such that the loss is minimized. During backpropagation, we start with 1 (as the gradient of the loss w.r.t. itself), and as we encounter each node of the forward graph (in reverse order), we compute the gradient of the loss with respect to the inputs of that node.
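
For intuition, a tiny PyTorch check of the same idea (the numbers are illustrative only): loss.backward() seeds the backward pass with d(loss)/d(loss) = 1 and chains gradients back through every node.

import torch

# The backward pass is seeded with d(loss)/d(loss) = 1; autograd then
# chains gradients backwards through each node of the forward graph.
w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])
loss = (w * x).sum() ** 2      # forward graph: Mul -> Sum -> Square

loss.backward()                # same as loss.backward(torch.tensor(1.0))
print(w.grad)                  # d(loss)/dw = 2 * (w*x) * x = 36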

The problem right now is that, for the loss defined in your model, we don't have the necessary gradient operator kernels (i.e., ReduceMinGrad and ReduceMaxGrad). It might take some time for us to get to this work. Would you like to contribute and write the CPU kernels for these operators?
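
If waiting for those kernels is not an option, one possible workaround (an assumption on my part, not something confirmed in this thread) is to replace the hard min/max reductions in the loss with smooth surrogates built only from operators that already have gradient definitions, such as Softmax, Mul, and ReduceSum:

import torch

def soft_max(x: torch.Tensor, temperature: float = 50.0) -> torch.Tensor:
    # Differentiable stand-in for x.max(dim=-1): a softmax-weighted mean
    # that exports to Softmax/Mul/ReduceSum rather than ReduceMax.
    weights = torch.softmax(temperature * x, dim=-1)
    return (weights * x).sum(dim=-1)

def soft_min(x: torch.Tensor, temperature: float = 50.0) -> torch.Tensor:
    # Same idea for min: weight by the softmax of the negated values.
    weights = torch.softmax(-temperature * x, dim=-1)
    return (weights * x).sum(dim=-1)

Higher temperatures make the surrogate closer to the true min/max at the cost of sharper gradients; whether this approximation is acceptable depends on how the YOLO loss uses those reductions.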

OAHLSTM closed this as completed on Apr 10, 2024.