
AMP BF16 issue with batch norm layer #8496

Closed
lukeliu15 opened this issue Dec 17, 2024 · 0 comments · Fixed by #8498 or #8556
🐛 Bug

The following runtime error is raised when using autocast with dtype=torch.bfloat16 on a model containing a conv + batch norm layer:

RuntimeError: Bad StatusOr access: INTERNAL: during context [Unknown]: Seen floating point types of different precisions in %batch-norm-training.15 = (bf16[4,16,32,32]{3,2,1,0}, bf16[16]{0}, bf16[16]{0}) batch-norm-training(bf16[4,16,32,32]{3,2,1,0} %add.14, f32[16]{0} %p2.3, f32[16]{0} %p3.4), epsilon=1e-05, feature_index=1, but mixed precision is disallowed.

The HLO above shows batch-norm-training receiving bf16 activations (%add.14) together with f32 scale and offset parameters (%p2.3, %p3.4), which XLA rejects as mixed precision. The issue is NOT reproducible with dtype=torch.float16, or with torch.cuda.amp.autocast without XLA.

To Reproduce

import os

# Register installed PJRT plugins (e.g. the CUDA plugin) with torch_xla.
os.environ["XLA_REGISTER_INSTALLED_PLUGINS"] = "1"

import torch
from torch import nn
from torch_xla.amp import autocast
import torch_xla.core.xla_model as xm

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(16)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        return x

def main():
    device = xm.xla_device()
    model = SimpleModel().to(device)
    inputs = torch.randn(4, 3, 32, 32).to(device)

    # bf16 autocast: traces the conv + batch norm under bfloat16.
    with autocast(device, dtype=torch.bfloat16):
        output = model(inputs)

    # Compile and execute the traced graph; the error surfaces here.
    xm.mark_step()


if __name__ == '__main__':
    main()
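
For reference (an assumption on my part, not part of the original report): with a CUDA-enabled torch_xla install the script runs directly, and the backend can also be pinned explicitly via the PJRT_DEVICE environment variable. repro.py is a hypothetical filename for the script above.

PJRT_DEVICE=CUDA python repro.py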

Expected behavior

The above code should run without error.

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: CUDA
  • torch_xla version: 2.3.0
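
Until a fix lands, one possible workaround (a sketch, not verified against this exact setup; SimpleModelFP32BN is a hypothetical variant of the repro model) is to pin the batch norm to float32 so XLA sees a single precision in batch-norm-training:

import torch
from torch import nn

class SimpleModelFP32BN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(16)

    def forward(self, x):
        x = self.conv(x)
        # Under bf16 autocast, x is bf16 while bn's scale/offset stay f32.
        # Cast the activation up to f32 for the normalization, then back
        # to the incoming dtype, so all batch-norm operands match.
        x = self.bn(x.float()).to(x.dtype)
        return x

This trades a little extra memory bandwidth for uniform precision in the batch-norm op.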