
[Training] Batchnormalisation supported? #17879

Closed
elephantpanda opened this issue Oct 11, 2023 · 4 comments
Assignees
Labels
ep:CUDA issues related to the CUDA execution provider training issues related to ONNX Runtime training; typically submitted using template

Comments


elephantpanda commented Oct 11, 2023

Describe the issue

When trying to generate the training model from a model containing a BatchNormalization node, I get the error:

failed to find NodeArg by name f2_bn2.running_var_grad

Is BatchNormalization not supported for training?

[screenshot "snapshot5": the error message]

This is a trained model converted from https://github.com/suragnair/alpha-zero-general (Othello).

To reproduce

Create training artifacts from a model containing a batch normalisation node.

Urgency

No response

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.15.1

PyTorch Version

1.13.1

Execution Provider

CUDA

Execution Provider Library Version

CUDA 1.13.1

@elephantpanda elephantpanda added the training issues related to ONNX Runtime training; typically submitted using template label Oct 11, 2023
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Oct 11, 2023
baijumeswani commented Oct 11, 2023

failed to find NodeArg by name f2_bn2.running_var_grad

The BatchNorm training inputs (running mean and running variance) are not trainable parameters. They are initializers that get updated during training, but no gradients are computed for them.

You will have to put the initializers in the frozen_params list while generating the training artifacts:

frozen_params = ["f2_bn2.running_var", "f2_bn2.running_mean"] # and potentially other similar initializers
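The suggestion above can be sketched as a small helper that collects every running-statistics initializer to freeze. The `f2_bn2.*` names come from this thread; the general rule assumed here is that any initializer ending in `.running_mean` or `.running_var` should be frozen:

```python
# Sketch: collect BatchNorm running-statistics initializers so they can be
# passed as frozen_params when generating the training artifacts.
def running_stat_params(initializer_names):
    """Return the names of running-mean/variance initializers."""
    suffixes = (".running_mean", ".running_var")
    return [n for n in initializer_names if n.endswith(suffixes)]

names = ["f2_bn2.weight", "f2_bn2.bias",
         "f2_bn2.running_mean", "f2_bn2.running_var"]
frozen_params = running_stat_params(names)

# These would then be passed to the artifact generation step, e.g. (assuming
# the onnxruntime-training artifacts API):
#
#   from onnxruntime.training import artifacts
#   artifacts.generate_artifacts(model,
#                                requires_grad=trainable_param_names,
#                                frozen_params=frozen_params,
#                                ...)
```

In a real model you would read the names from `model.graph.initializer` rather than hard-coding them, so every BatchNorm layer is covered.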

elephantpanda (Author)

Thanks, I thought that might be the problem. However, now I get this error:

[screenshot "snapshot5": the new error message]

baijumeswani (Contributor)

Could you share the forward ONNX model? I will try to find the cause of the issue.

baijumeswani (Contributor)

Closing this issue as I am not able to reproduce the error. If you're still encountering this, please reopen the issue.
