[Training] protobuf limit reached when generating training artifacts for a onnx model more than 2GB. #18874

hupreti · 2023-12-19T09:07:21Z

Describe the issue

Unable to generate training artifacts for a model larger than 2GB. The ONNX model was exported from a Hugging Face repo with its weights stored in an external file.

Model exported: OPT-1.3B

Error when passing onnx.ModelProto

Traceback (most recent call last):
  File "onnxruntime_artifacts.py", line 58, in <module>
    artifacts.generate_artifacts(
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/artifacts.py", line 137, in generate_artifacts
    _ = training_block(*[output.name for output in model.graph.output])
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/onnxblock/onnxblock.py", line 188, in __call__
    output = self.build(*args, **kwargs)
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/artifacts.py", line 107, in build
    return self._loss(*inputs_to_loss)
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/onnxblock/blocks.py", line 48, in __call__
    output = self.build(*args, **kwargs)
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/onnxblock/loss/loss.py", line 48, in build
    target_name = blocks.InputLike(loss_input_name)(target_name)
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/onnxblock/blocks.py", line 50, in __call__
    onnx.checker.check_model(self.base, True)
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnx/checker.py", line 145, in check_model
    raise ValueError(

ValueError: This protobuf of onnx model is too large (>2GB). Call check_model with model path instead.
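
The error message suggests validating by file path instead of passing the in-memory ModelProto. Below is a minimal sketch of that standalone check, not a fix for generate_artifacts itself, which in this version calls the checker internally on the in-memory model. The helper names, the `limit` parameter, and the threshold constant are illustrative assumptions; only `onnx.checker.check_model` is a real onnx call.

```python
import os

# Approximate 2 GB threshold named in the error message. The exact constant
# the onnx checker compares against is an assumption here.
APPROX_LIMIT_BYTES = 2 * 1000**3


def total_size_bytes(model_path, external_data_paths=()):
    """Size of the .onnx file plus any external weight files, in bytes."""
    paths = [model_path, *external_data_paths]
    return sum(os.path.getsize(p) for p in paths)


def exceeds_limit(model_path, external_data_paths=(), limit=APPROX_LIMIT_BYTES):
    """True if the model is unlikely to fit in a single protobuf once loaded."""
    return total_size_bytes(model_path, external_data_paths) > limit


def check_model_on_disk(model_path):
    """Validate by path so the checker reads the model from disk itself."""
    import onnx  # imported lazily so the size helpers work without onnx installed

    # check_model accepts a path; external data is resolved relative to it.
    onnx.checker.check_model(model_path, full_check=True)
```

For example, with hypothetical file names, `exceeds_limit("model.onnx", ["model.onnx.data"])` indicates whether `check_model_on_disk("model.onnx")` is the safer way to validate the export.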

Error when passing str (onnx file path)

Traceback (most recent call last):
  File "onnxruntime_artifacts.py", line 58, in <module>
    artifacts.generate_artifacts(
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/artifacts.py", line 137, in generate_artifacts
    _ = training_block(*[output.name for output in model.graph.output])

AttributeError: 'str' object has no attribute 'graph'
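
This second traceback shows generate_artifacts reading `model.graph` directly, so in this version a plain path string fails before any model is loaded. Below is a minimal caller-side sketch that accepts either form. `as_model_proto` is a hypothetical helper, not part of onnxruntime. For a model over 2GB on onnxruntime-training 1.16.3, this only avoids the AttributeError; the ValueError from the first traceback would still be expected.

```python
import os


def as_model_proto(model_or_path):
    """Return a ModelProto whether given a ModelProto or a path to a .onnx file."""
    if isinstance(model_or_path, (str, os.PathLike)):
        import onnx  # imported lazily; only needed when a path is passed

        # onnx.load also loads external weights stored next to the model file.
        return onnx.load(os.fspath(model_or_path))
    # Anything that is not a path is assumed to already be a ModelProto.
    return model_or_path
```

The result can then be passed as the first argument to `artifacts.generate_artifacts(...)`.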

To reproduce

artifacts.generate_artifacts(
    onnx_model,
    optimizer=artifacts.OptimType.AdamW,
    loss=artifacts.LossType.MSELoss,
    requires_grad=requires_grad,
    frozen_params=frozen_params,
    artifact_directory=output_dir,
    additional_output_names=["logits"],
)

Urgency

High

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnx-1.15.0 onnxruntime-1.16.3 onnxruntime-training-1.16.3

PyTorch Version

pytorch-2.0.1

Execution Provider

Default CPU

Execution Provider Library Version

No response


anujgupt-github · 2024-01-10T03:58:24Z

Any updates to this?

baijumeswani · 2024-01-26T18:03:36Z

> Unable to generate training artifacts for model having size greater than 2GB. The onnx model is exported from hugging face repo with weights in external file

Currently, we have not added support for models > 2GB. This is something we should address. I don't have an exact timeline for when I can get to this work, but I will add it as a backlog item on our end.

Originally, the intent behind on-device training was to be able to train relatively small models on the device (where there are memory and resource constraints). We did not prioritize large models (>2GB) for on-device training for this reason. But perhaps there are scenarios where this is useful.

Could you share your scenario and also maybe a repro for the error you're seeing?

carzh · 2024-06-21T21:31:51Z

After #20077 & #20958 (and other PRs), generate_artifacts now supports >2GB models. Since the changes are recent, you will need a build from main until the nightly training packages are pushed.

Please re-open if you continue to run into issues with >2GB models with generate_artifacts().

hupreti added the "training" label (issues related to ONNX Runtime training; typically submitted using template) on Dec 19, 2023

carzh closed this as completed on Jun 21, 2024
