[Training] protobuf limit reached when generating training artifacts for a onnx model more than 2GB. #18874

Closed
hupreti opened this issue Dec 19, 2023 · 3 comments
Labels
training issues related to ONNX Runtime training; typically submitted using template

Comments


hupreti commented Dec 19, 2023

Describe the issue

Unable to generate training artifacts for a model larger than 2GB. The ONNX model was exported from a Hugging Face repo, with its weights stored in an external file.

Model exported: OPT-1.3B

Error when passing onnx.ModelProto

Traceback (most recent call last):
  File "onnxruntime_artifacts.py", line 58, in <module>
    artifacts.generate_artifacts(
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/artifacts.py", line 137, in generate_artifacts
    _ = training_block(*[output.name for output in model.graph.output])
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/onnxblock/onnxblock.py", line 188, in __call__
    output = self.build(*args, **kwargs)
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/artifacts.py", line 107, in build
    return self._loss(*inputs_to_loss)
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/onnxblock/blocks.py", line 48, in __call__
    output = self.build(*args, **kwargs)
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/onnxblock/loss/loss.py", line 48, in build
    target_name = blocks.InputLike(loss_input_name)(target_name)
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/onnxblock/blocks.py", line 50, in __call__
    onnx.checker.check_model(self.base, True)
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnx/checker.py", line 145, in check_model
    raise ValueError(
ValueError: This protobuf of onnx model is too large (>2GB). Call check_model with model path instead.
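
For readers unfamiliar with the message above, here is a minimal sketch of the path-based call it refers to. The 2 GiB constant and both helper names are assumptions made for illustration; only `onnx.checker.check_model` comes from the traceback.

```python
# Assumption: protobuf caps one serialized message at about 2 GiB; the exact
# threshold that onnx.checker compares against may differ slightly.
PROTOBUF_LIMIT_BYTES = 2 * 1024**3


def needs_path_based_check(serialized_size_bytes: int) -> bool:
    """Return True when an in-memory ModelProto of this size would likely be rejected."""
    return serialized_size_bytes >= PROTOBUF_LIMIT_BYTES


def check_model_on_disk(model_path: str) -> None:
    """Validate a model by file path, as the ValueError above recommends."""
    import onnx  # imported lazily so the size helper runs without onnx installed

    # Passing a path lets the checker read the file (and any external data
    # files next to it) instead of serializing a ModelProto in memory.
    onnx.checker.check_model(model_path, full_check=True)
```

This does not by itself fix the failure in generate_artifacts, because the failing check_model call is made inside onnxruntime's blocks.py with an in-memory model; it only illustrates what the error message is asking for.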

Error when passing str (onnx file path)

Traceback (most recent call last):
  File "onnxruntime_artifacts.py", line 58, in <module>
    artifacts.generate_artifacts(
  File "/workspace/envs/training_env/lib/python3.8/site-packages/onnxruntime/training/artifacts.py", line 137, in generate_artifacts
    _ = training_block(*[output.name for output in model.graph.output])
AttributeError: 'str' object has no attribute 'graph'
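
The AttributeError indicates that, in this version, generate_artifacts reads `model.graph` directly and so expects a loaded model rather than a path. A caller-side guard might look like the sketch below; `as_model_proto` and its `loader` parameter are hypothetical names introduced here, not part of onnxruntime.

```python
import os
from typing import Any, Callable, Optional, Union


def as_model_proto(
    model: Union[str, os.PathLike, Any],
    loader: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Return an object with a `.graph` attribute, loading from disk if given a path."""
    if isinstance(model, (str, os.PathLike)):
        if loader is None:
            import onnx  # lazy import; onnx.load reads external data by default

            loader = onnx.load
        return loader(os.fspath(model))
    # Anything that is not a path is assumed to be an already loaded model.
    return model
```

This only avoids the AttributeError. On the versions listed in this issue, a model loaded this way with more than 2GB of weights would presumably still run into the ValueError from the first traceback.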

To reproduce

artifacts.generate_artifacts(
    onnx_model,
    optimizer=artifacts.OptimType.AdamW,
    loss=artifacts.LossType.MSELoss,
    requires_grad=requires_grad,
    frozen_params=frozen_params,
    artifact_directory=output_dir,
    additional_output_names=["logits"],
)
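
The repro leaves `requires_grad` and `frozen_params` undefined. One way such lists could be built from the model's initializer names is sketched below; the helper name, the prefix rule, and the parameter names in the usage comment are all assumptions, since the issue does not say which OPT-1.3B parameters were trained.

```python
from typing import Iterable, List, Tuple


def partition_parameters(
    initializer_names: Iterable[str],
    trainable_prefixes: Tuple[str, ...],
) -> Tuple[List[str], List[str]]:
    """Split parameter names into (requires_grad, frozen_params) by name prefix."""
    requires_grad: List[str] = []
    frozen_params: List[str] = []
    for name in initializer_names:
        # str.startswith accepts a tuple and matches any of its entries.
        if name.startswith(trainable_prefixes):
            requires_grad.append(name)
        else:
            frozen_params.append(name)
    return requires_grad, frozen_params


# Hypothetical usage with a loaded model:
#   names = [init.name for init in onnx_model.graph.initializer]
#   requires_grad, frozen_params = partition_parameters(names, ("lm_head.",))
```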

Urgency

High

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnx-1.15.0 onnxruntime-1.16.3 onnxruntime-training-1.16.3

PyTorch Version

pytorch-2.0.1

Execution Provider

Default CPU

Execution Provider Library Version

No response

@hupreti hupreti added the training issues related to ONNX Runtime training; typically submitted using template label Dec 19, 2023
@anujgupt-github

Any updates to this?

@baijumeswani
Contributor

> Unable to generate training artifacts for model having size greater than 2GB. The onnx model is exported from hugging face repo with weights in external file

We do not currently support models larger than 2GB. This is something we should address. I don't have an exact timeline for this work, but I will add it to our backlog.

Originally, the intent behind on-device training was to be able to train relatively small models on the device (where there are memory and resource constraints). We did not prioritize large models (>2GB) for on-device training for this reason. But perhaps there are scenarios where this is useful.

Could you share your scenario and also maybe a repro for the error you're seeing?

@carzh
Contributor

carzh commented Jun 21, 2024

After #20077 and #20958 (and other PRs), generate_artifacts now supports models larger than 2GB. Since the changes are recent, you will need to build from main until the nightly training packages are pushed.

Please re-open this issue if you continue to run into problems with models larger than 2GB in generate_artifacts().

@carzh carzh closed this as completed Jun 21, 2024