I am trying to get an ONNX training graph for the Llama2_7b model. I can export the forward graph without any problem, but the issue occurs when I call generate_artifacts.
I don't get this error when I run a single-layer Transformer block (attention + MLP) with similar dimensions. What is causing the issue here?
Additionally, the loss function is throwing an error too: expected 2 but got 66 arguments. Please explain. Thank you!
Hi, do you have the full script for exporting & generating artifacts? Or could you provide the forward graph ONNX file?
Generally, we see these errors when the generated forward graph is incorrect, especially for the loss function, which expects a certain number of graph outputs. For LLMs in particular, unless the base Torch model being exported is in training mode, it usually uses a key-value cache (an inference-only optimization that adds extra inputs and outputs to the graph).
The Torch model passed to torch.onnx.export (the base_model) must be in training mode (i.e., you should be able to train with it directly), and the input and output names passed to the export function should correspond to the actual input and output names of the Torch model.
If you have a working PyTorch training script for Llama2_7b, you can use that to determine the correct input names and output names, and what inputs you need to pass in for it to be in training mode.
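For reference, here is a rough, untested sketch of that flow, assuming a Hugging Face LlamaForCausalLM checkpoint. The model name, sequence length, opset version, and trainable-parameter selection below are placeholders you would adapt to your setup:

```python
# Untested sketch: export a Hugging Face Llama-2 model in training mode, then
# generate ONNX Runtime training artifacts from the resulting forward graph.
import onnx
import torch
from onnxruntime.training import artifacts
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder checkpoint
model.train()                       # training mode, so dropout etc. stay in the graph
model.config.use_cache = False      # drop the KV cache: no past_key_values inputs/outputs
model.config.return_dict = False    # forward returns a plain tuple -> a single "logits" output

batch, seq_len = 1, 128             # illustrative shapes
input_ids = torch.randint(0, model.config.vocab_size, (batch, seq_len), dtype=torch.long)
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)

# A 7B model exceeds the 2 GB protobuf limit, so the exporter should spill the
# weights to external data files next to the .onnx file.
torch.onnx.export(
    model,
    (input_ids, attention_mask),
    "llama2_7b_forward.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    do_constant_folding=False,
    training=torch.onnx.TrainingMode.TRAINING,
    opset_version=17,
)

# Exported initializer names generally match the Torch parameter names.
requires_grad = [name for name, p in model.named_parameters() if p.requires_grad]
frozen_params = [name for name, p in model.named_parameters() if not p.requires_grad]

onnx_model = onnx.load("llama2_7b_forward.onnx")

artifacts.generate_artifacts(
    onnx_model,
    requires_grad=requires_grad,
    frozen_params=frozen_params,
    loss=artifacts.LossType.CrossEntropyLoss,
    optimizer=artifacts.OptimType.AdamW,
    artifact_directory="llama2_7b_artifacts",
)
```

Note that the loss block gets wired to the forward graph's outputs, so if the exported graph still carries the per-layer present key/value outputs from the cache, the loss hookup can fail with an argument-count mismatch like the "expected 2 but got 66" error above.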
Hi @carzh. Thanks a lot for the comment. Yes, that is exactly what I did, and I was able to resolve the issue. There was a mismatch in the input dimensions that was generating the error.
To reproduce
Urgency
Urgent
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.19.0
PyTorch Version
2.4.0
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.4