
[Training] Batchnormalisation supported? #17879

Closed
elephantpanda opened this issue Oct 11, 2023 · 4 comments
Assignees
Labels
ep:CUDA issues related to the CUDA execution provider training issues related to ONNX Runtime training; typically submitted using template

Comments


elephantpanda commented Oct 11, 2023

Describe the issue

When trying to generate the training model from a model containing a BatchNormalization node, I get the error:

failed to find NodeArg by name f2_bn2.running_var_grad

Is BatchNormalization not supported for training?

[screenshot "snapshot5": the error message]

This is a trained model converted from https://github.com/suragnair/alpha-zero-general (Othello).

To reproduce

Create training artifacts from a model containing a batch normalisation node.

Urgency

No response

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.15.1

PyTorch Version

1.13.1

Execution Provider

CUDA

Execution Provider Library Version

CUDA 1.13.1

@elephantpanda elephantpanda added the training issues related to ONNX Runtime training; typically submitted using template label Oct 11, 2023
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Oct 11, 2023
baijumeswani commented Oct 11, 2023

failed to find NodeArg by name f2_bn2.running_var_grad

The BatchNorm training inputs (running mean and running variance) are not trainable parameters. They are initializers that get updated during training, but no gradients are computed for them.

You will have to put the initializers in the frozen_params list while generating the training artifacts:

frozen_params = ["f2_bn2.running_var", "f2_bn2.running_mean"] # and potentially other similar initializers
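The suggestion above can be sketched as a small helper that collects every running-statistics initializer to freeze. The `f2_bn2.*` names come from this thread; the general rule assumed here is that any initializer ending in `.running_mean` or `.running_var` should be frozen:

```python
# Sketch: collect BatchNorm running-statistics initializers so they can be
# passed as frozen_params when generating the training artifacts.
def running_stat_params(initializer_names):
    """Return the names of running-mean/variance initializers."""
    suffixes = (".running_mean", ".running_var")
    return [n for n in initializer_names if n.endswith(suffixes)]

names = ["f2_bn2.weight", "f2_bn2.bias",
         "f2_bn2.running_mean", "f2_bn2.running_var"]
frozen_params = running_stat_params(names)

# These would then be passed to the artifact generation step, e.g. (assuming
# the onnxruntime-training artifacts API):
#
#   from onnxruntime.training import artifacts
#   artifacts.generate_artifacts(model,
#                                requires_grad=trainable_param_names,
#                                frozen_params=frozen_params,
#                                ...)
```

In a real model you would read the names from `model.graph.initializer` rather than hard-coding them, so every BatchNorm layer is covered.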

elephantpanda (Author)

Thanks, I thought that might be the problem. However, now I get this error:

[screenshot "snapshot5": the new error message]

baijumeswani (Contributor)

Could you share the forward ONNX model? I will try to find the cause of the issue.

baijumeswani (Contributor)

Closing this issue as I am not able to reproduce the error. If you're still encountering this, please reopen the issue.
