[Training] On Device Training is not working #18168

IzanCatalan · 2023-10-30T16:31:05Z

Describe the issue

I am re-training some onnx models from ONNX Model Zoo Repo, especially Resnet50. Previously, I created the artifacts according to onnx-runtime-training-examples Repo. In my case I create them with the following code:

frozen_params = []
requires_grad = []
for init in onnx_model.graph.initializer:
    if init.name.endswith("running_mean") or init.name.endswith("running_var"):
        frozen_params.append(init.name)
    elif init.name not in frozen_params:
        requires_grad.append(init.name)

print(len(requires_grad), len(frozen_params))
print(frozen_params)
# Generate the training artifacts.
artifacts.generate_artifacts(
   onnx_model,
   requires_grad=requires_grad,
   frozen_params=frozen_params,
   loss=artifacts.LossType.CrossEntropyLoss,
   optimizer=artifacts.OptimType.AdamW,
   artifact_directory=sys.argv[2]
)

I follow the same structure train-test defined in the onnx-runtime-training-examples Repo code. In my case, I load the Imagenet train and validate dataset:

imagenet_data_train = datasets.ImageNet('/home/Imagenet', split="train", transform=transform_test)
imagenet_data_val = datasets.ImageNet('/home/Imagenet', split="val", transform=transform_test)
train_loader = torch.utils.data.DataLoader(imagenet_data_train,
                                          batch_size=8,
                                          shuffle=True,
                                          num_workers=4)

val_loader = torch.utils.data.DataLoader(imagenet_data_val,
                                          batch_size=8,
                                          shuffle=True,
                                          num_workers=4)

However, every time I try to train, the accuracy I get is much lower than the accuracy obtained in ONNX Model Zoo Repo. For example, after 20 epochs, the accuracy reached is 0.53 compared to the original accuracy, which is 0.75. I don't insert any changes or modify the weights in the process, but it surprises me the fact that if I re-train a model which previously obtained 0.75 accuracy, I can barely reach 0.53. Moreover, the first epoch only gives 0.31 accuracy.

I would like to know if this is normal behaviour, and if it is so, I would like to know how I can re-train some pre-trained and pre-tested onnx models without losing a huge amount of accuracy.

To reproduce

I am running onnxruntime build from source for cuda 11.2, GCC 9.5, cmake 3.27 and python 3.8 with ubuntu 20.04.

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime-training 1.17.0+cu112

PyTorch Version

None

Execution Provider

CUDA

Execution Provider Library Version

Cuda 11.2

The text was updated successfully, but these errors were encountered:

askhade · 2023-10-30T19:07:01Z

@IzanCatalan: Can you also paste your train recipe? What are your timelines - How urgent is this?

IzanCatalan · 2023-10-31T00:45:06Z

@askhade Are you asking for my training code? It is the following and I would like to find a solution as soon as possible within this week:

----------------------------------------------------------------------------------------------------------------------------------
# ----------------------------------------Trainining methods-------------------------------------------------------------------------
# -----------------------------------------------------------------------------------------------------------------------------------

# Util function to convert logits to predictions.
def get_pred(logits):
    return np.argmax(logits, axis=1)

# Training Loop :
def train(epoch):
    print("dentro train")
    model.train()
    print("setup train")
    losses = []
    p = 0
    for i, (data, target) in enumerate(train_loader):
        forward_inputs = [data.reshape(len(data),3,224,224).numpy().astype(np.float32),target.numpy().astype(np.int64)]
        train_loss = model(*forward_inputs)
        optimizer.step()
        model.lazy_reset_grad()
        losses.append(train_loss)
        p = i
    
    print("image", p)
    print(f'Epoch: {epoch+1},Train Loss: {sum(losses)/len(losses):.4f}')


# Test Loop :
def test(epoch):
    print("dentro train")
    model.eval()
    print("setup eval")
    losses = []
    metric = evaluate.load('accuracy')

    for _, (data, target) in enumerate(val_loader):
        forward_inputs = [data.reshape(len(data),3,224,224).numpy().astype(np.float32),target.numpy().astype(np.int64)]
        test_loss, logits = model(*forward_inputs)
        metric.add_batch(references=target, predictions=get_pred(logits))
        losses.append(test_loss)

    metrics = metric.compute()
    print(f'Epoch: {epoch+1}, Test Loss: {sum(losses)/len(losses):.4f}, Accuracy : {metrics["accuracy"]:.2f}')
    return metrics["accuracy"]


# --------------------------------------MAIN------------------------------------------------------
# pdb.set_trace()
print("STARTING")

# Define image transforms
normalize = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
transform_test = transforms.Compose([
   transforms.Resize(256),
   transforms.CenterCrop(224),
   transforms.ToTensor(),
   normalize
])


print("RUNNING")
imagenet_data_train = datasets.ImageNet('/mnt/beegfs/gap/izcagal/docker/Imagenet', split="train", transform=transform_test)
imagenet_data_val = datasets.ImageNet('/mnt/beegfs/gap/izcagal/docker/Imagenet', split="val", transform=transform_test)
train_loader = torch.utils.data.DataLoader(imagenet_data_train,
                                          batch_size=8,
                                          shuffle=True,
                                          num_workers=4)

val_loader = torch.utils.data.DataLoader(imagenet_data_val,
                                          batch_size=8,
                                          shuffle=True,
                                          num_workers=4)


print("DATASET LOADED")
# Instantiate the training session by defining the checkpoint state, module, and optimizer
# The checkpoint state contains the state of the model parameters at any given time.
checkpoint_state = orttraining.CheckpointState.load_checkpoint(
    "docker/training_artifacts/checkpoint")

model = orttraining.Module(
    "docker/training_artifacts/training_model.onnx",
    checkpoint_state,
    "docker/training_artifacts/eval_model2.onnx",
    "cuda"
)

optimizer = orttraining.Optimizer(
    "docker/training_artifacts/optimizer_model.onnx", model
)
    # --------------Training Phase----------------------
while (res < 0.69):
    train(epoch)
    res = test(epoch)
    print(res)
    epoch += 1

model.export_model_for_inferencing("inference.onnx", ["resnetv17_dense0_fwd"])

IzanCatalan · 2023-11-01T22:33:15Z

@askhade, I have an update on this issue. I tried to train with a batch size of 256 instead of 8 images, and suddenly, the first epoch gave an accuracy of 0.54 instead of 0.31 (with batch size = 8). So, am I right when supposing that the higher the batch size, the higher the accuracy per epoch?

However, I don't understand yet how it is possible when re-training a pre-trained model with an accuracy higher than 0.75 (a Resnet Model downloaded from the Official Onnx Model Zoo Repo), the first epoch degrades this accuracy to 0.54. So, my question is still the same: is this behaviour normal for On Device Training with OnnxRuntime Training? And, if it is so, how I can change it to perform a more optimal training which can give me a higher accuracy per epoch?

github-actions · 2023-12-23T15:00:55Z

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

carzh · 2024-01-19T17:29:57Z

@IzanCatalan hello, do you have an updated link to the ONNX file you're generating artifacts from, or the name of which resnet50 model you're using?

IzanCatalan added the training issues related to ONNX Runtime training; typically submitted using template label Oct 30, 2023

github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Oct 30, 2023

github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Dec 23, 2023

baijumeswani assigned carzh Jan 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Training] On Device Training is not working #18168

[Training] On Device Training is not working #18168

IzanCatalan commented Oct 30, 2023 •

edited

Loading

askhade commented Oct 30, 2023

IzanCatalan commented Oct 31, 2023

IzanCatalan commented Nov 1, 2023

github-actions bot commented Dec 23, 2023

carzh commented Jan 19, 2024 •

edited

Loading

[Training] On Device Training is not working #18168

[Training] On Device Training is not working #18168

Comments

IzanCatalan commented Oct 30, 2023 • edited Loading

Describe the issue

To reproduce

Urgency

ONNX Runtime Installation

ONNX Runtime Version or Commit ID

PyTorch Version

Execution Provider

Execution Provider Library Version

askhade commented Oct 30, 2023

IzanCatalan commented Oct 31, 2023

IzanCatalan commented Nov 1, 2023

github-actions bot commented Dec 23, 2023

carzh commented Jan 19, 2024 • edited Loading

IzanCatalan commented Oct 30, 2023 •

edited

Loading

carzh commented Jan 19, 2024 •

edited

Loading