
[Training] Retraining a YOLO V8n model on device #20201

Closed
OAHLSTM opened this issue Apr 4, 2024 · 1 comment
Labels
training: issues related to ONNX Runtime training; typically submitted using template

Comments

OAHLSTM commented Apr 4, 2024

Describe the issue

Hello,

I'm trying to retrain a YOLOv8n model on a custom dataset collected directly on an arm64 device running Linux. I'm using onnxruntime to generate the training artifacts, and for now I'm struggling to define a loss function for my model. I have the PyTorch model generated from Ultralytics.
I tried following the suggestion made by @baijumeswani on a similar issue.

import onnx
import torch
from onnxruntime.training import artifacts

class MyPTModelWithLoss(torch.nn.Module):  # must subclass nn.Module for export
    def __init__(self):
        super().__init__()
        ...

    def forward(self, ...):
        p, q, r = compute_logits()
        # Fold the loss computation into the forward pass itself.
        loss = loss1(p) + loss2(q) + loss3(r)
        return loss

pt_model = MyPTModelWithLoss(...)
torch.onnx.export(pt_model, ...)

onnx_model = onnx.load(<exported_onnx_model_path>)
# loss=None because the loss is already part of the exported graph.
artifacts.generate_artifacts(onnx_model, requires_grad=[...], frozen_params=[...], loss=None, optimizer=...)

This approach suggests adding the loss function at the end of the model's forward pass and passing None as the loss when generating the artifacts. The problem with that approach is that the gradient builder tries to build gradients for the operations used by the loss, such as ReduceMin and ReduceMax. However, there is no gradient definition for these operations, and it is not the correct practice to compute gradients for the loss.
I was wondering if there is a way to cut the graph into two subgraphs so the gradient is built only for the forward pass and not for the loss function. If not, what would be the best approach to generate the training artifacts in this case?
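
For reference, a minimal sketch of the other path that generate_artifacts supports, where ONNX Runtime appends one of its built-in losses and builds that loss's gradient itself (the file path and parameter name below are hypothetical, and MSE is used purely for illustration):

import onnx
from onnxruntime.training import artifacts

onnx_model = onnx.load("yolov8n_no_loss.onnx")  # hypothetical export without the loss

# Let ONNX Runtime append a built-in loss so that the loss subgraph and
# its gradient are constructed by ONNX Runtime itself.
artifacts.generate_artifacts(
    onnx_model,
    requires_grad=["model.22.dfl.conv.weight"],  # hypothetical parameter name
    frozen_params=[],
    loss=artifacts.LossType.MSELoss,
    optimizer=artifacts.OptimType.AdamW,
    artifact_directory="training_artifacts",
)

This only applies when the loss can be expressed as one of the built-in options, which a composite YOLO loss cannot, but it shows the case where the gradient of the loss is handled entirely by the artifact generator.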

Thank you for your support.

To reproduce

See the code snippet in the description above.

Urgency

This is really urgent; we are trying to deploy a retrainable YOLOv8 model on the device using the onnxruntime-training framework.

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.17.1

PyTorch Version

No response

Execution Provider

Default CPU

Execution Provider Library Version

No response

OAHLSTM added the training label on Apr 4, 2024
baijumeswani (Contributor) commented:

> However, there is no gradient definition for these operations, and it is not the correct practice to compute gradients for the loss.

Gradient computation should always start at the loss. What is being computed during backpropagation is the gradient of the loss with respect to the inputs of each node.

The goal of the training phase is to minimize the loss. So, we want to find the changes that need to be made to the weight parameters such that the loss is minimized. During backpropagation, we start with 1 (as the gradient of the loss w.r.t. itself), and as we encounter each node of the forward graph (in reverse order), we compute the gradient of the loss with respect to the inputs of that node.
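
For intuition, a tiny PyTorch check of the same idea (the numbers are illustrative only): loss.backward() seeds the backward pass with d(loss)/d(loss) = 1 and chains gradients back through every node.

import torch

# The backward pass is seeded with d(loss)/d(loss) = 1; autograd then
# chains gradients backwards through each node of the forward graph.
w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])
loss = (w * x).sum() ** 2      # forward graph: Mul -> Sum -> Square

loss.backward()                # same as loss.backward(torch.tensor(1.0))
print(w.grad)                  # d(loss)/dw = 2 * (w*x) * x = 36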

The problem right now is that, for the loss defined in your model, we don't have the necessary gradient operator kernels (i.e., ReduceMinGrad and ReduceMaxGrad). It might take some time for us to get to this work. Would you like to contribute and write the CPU kernels for these operators?
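
If waiting for those kernels is not an option, one possible workaround (an assumption on my part, not something confirmed in this thread) is to replace the hard min/max reductions in the loss with smooth surrogates built only from operators that already have gradient definitions, such as Softmax, Mul, and ReduceSum:

import torch

def soft_max(x: torch.Tensor, temperature: float = 50.0) -> torch.Tensor:
    # Differentiable stand-in for x.max(dim=-1): a softmax-weighted mean
    # that exports to Softmax/Mul/ReduceSum rather than ReduceMax.
    weights = torch.softmax(temperature * x, dim=-1)
    return (weights * x).sum(dim=-1)

def soft_min(x: torch.Tensor, temperature: float = 50.0) -> torch.Tensor:
    # Same idea for min: weight by the softmax of the negated values.
    weights = torch.softmax(-temperature * x, dim=-1)
    return (weights * x).sum(dim=-1)

Higher temperatures make the surrogate closer to the true min/max at the cost of sharper gradients; whether this approximation is acceptable depends on how the YOLO loss uses those reductions.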

OAHLSTM closed this as completed on Apr 10, 2024.