
ttnn.mean op - Tensor Mismatch #869

Open
chandrasekaranpradeep opened this issue Oct 8, 2024 · 3 comments
@chandrasekaranpradeep
Summary:
The ttnn.mean op fails with a tensor mismatch (PCC: 0.7203957195745748).
Details:
ttnn.mean produces a tensor mismatch: the PCC drops to 0.72 when an input tensor of shape (1, 12, 3200) with dim = -1 is passed to the reduce_mean (i.e. ttnn.mean) op in Forge. The mismatch is observed when comparing the PyTorch and Forge (i.e. ttnn) outputs.

For more context, here is the exact error message:

  >       assert compare_with_golden_pcc(golden=fw_out, calculated=co_out[0], pcc=0.99)
E       assert False
E        +  where False = compare_with_golden_pcc(golden=tensor([[[0.4979],\n         [0.4969],\n         [0.5080],\n         [0.5029],\n         [0.5012],\n         [0.5046],\n         [0.4993],\n         [0.5034],\n         [0.5109],\n         [0.4984],\n         [0.4972],\n         [0.4963]]]), calculated=tensor([[[0.4648],\n         [0.4707],\n         [0.4844],\n         [0.4727],\n         [0.4570],\n         [0.4766],\n         [0.4727],\n         [0.4785],\n         [0.4883],\n         [0.4707],\n         [0.4766],\n         [0.4648]]]), pcc=0.99)

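For reference, a PCC check like the `compare_with_golden_pcc` helper above is a Pearson correlation over the flattened tensors. The helper's exact implementation lives in the Forge repo, so the sketch below is a reimplementation under that assumption:

```python
import torch

def pcc(golden: torch.Tensor, calculated: torch.Tensor) -> float:
    # Pearson correlation coefficient over the flattened tensors,
    # computed in float64 to avoid the metric itself losing precision.
    g = golden.flatten().to(torch.float64)
    c = calculated.flatten().to(torch.float64)
    g = g - g.mean()
    c = c - c.mean()
    return float((g @ c) / (g.norm() * c.norm()))

torch.manual_seed(0)
golden = torch.rand(1, 12, 3200).mean(dim=-1, keepdim=True)
# Matching kernels typically score > 0.99; a PCC of 0.72, as in this
# failure, indicates a large systematic deviation.
print(pcc(golden, golden + 0.03 * torch.randn_like(golden)))
```
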
Repro:
TTIR:

 module @ReduceMean attributes {tt.system_desc = #tt.system_desc<[{arch = <wormhole_b0>, grid = 8x8, l1_size = 1499136, num_dram_channels = 12, dram_channel_size = 1073741824, noc_l1_address_align_bytes = 16, pcie_address_align_bytes = 32, noc_dram_address_align_bytes = 32, l1_unreserved_base = 1024, erisc_l1_unreserved_base = 1024, dram_unreserved_base = 1024, dram_unreserved_end = 1073741824, physical_cores = {worker = [ 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7,  1x0,  1x1,  1x2,  1x3,  1x4,  1x5,  1x6,  1x7,  2x0,  2x1,  2x2,  2x3,  2x4,  2x5,  2x6,  2x7,  3x0,  3x1,  3x2,  3x3,  3x4,  3x5,  3x6,  3x7,  4x0,  4x1,  4x2,  4x3,  4x4,  4x5,  4x6,  4x7,  5x0,  5x1,  5x2,  5x3,  5x4,  5x5,  5x6,  5x7,  6x0,  6x1,  6x2,  6x3,  6x4,  6x5,  6x6,  6x7,  7x0,  7x1,  7x2,  7x3,  7x4,  7x5,  7x6,  7x7] dram = [ 8x0,  9x0,  10x0,  8x1,  9x1,  10x1,  8x2,  9x2,  10x2,  8x3,  9x3,  10x3]}, supported_data_types = [<f32>, <f16>, <bf16>, <bfp_f8>, <bfp_bf8>, <bfp_f4>, <bfp_bf4>, <bfp_f2>, <bfp_bf2>, <u32>, <u16>, <u8>], supported_tile_sizes = [ 4x16,  16x16,  32x16,  4x32,  16x32,  32x32]}], [0], [3 : i32], [ 0x0x0x0]>} {
  func.func @forward(%arg0: tensor<1x12x3200xf32> {ttir.name = "a"}) -> (tensor<1x12x1xf32> {ttir.name = "ReduceMean.output_reduce_avg_0"}) {
    %0 = tensor.empty() : tensor<1x12x1xf32>
    %1 = "ttir.mean"(%arg0, %0) <{dim_arg = [-1 : i32], keep_dim = true, operand_constraints = [#tt.operand_constraint<dram|l1|scalar|tile|none|interleaved|single_bank|height_sharded|width_sharded|block_sharded|any_layout|any_device|any_device_tile|l1_block_sharded>, #tt.operand_constraint<dram|l1|scalar|tile|none|interleaved|single_bank|height_sharded|width_sharded|block_sharded|any_layout|any_device|any_device_tile|l1_block_sharded>]}> : (tensor<1x12x3200xf32>, tensor<1x12x1xf32>) -> tensor<1x12x1xf32>
    return %1 : tensor<1x12x1xf32>
  }
}

TTNN test cases:

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc


def test_mean_pcc_issue(device):
    torch.manual_seed(0)

    input_shape = (1, 12, 3200)
    reduce_dim = -1

    torch_input_tensor = torch.rand(input_shape, dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)

    output_tensor = ttnn.mean(input_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)

    assert_with_pcc(torch_output_tensor, output_tensor)

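One possible explanation for the mismatch is accumulator precision; this is an assumption on our side, not confirmed behavior of ttnn.mean. Reducing 3200 elements with a bfloat16 accumulator loses the small per-element increments once the running sum grows large. The hypothetical sketch below shows how far a naive bf16-accumulated mean drifts from the fp32 reference:

```python
import torch

def bf16_accumulate_mean(t: torch.Tensor) -> torch.Tensor:
    # Naive sequential reduction with the accumulator kept in bfloat16.
    # Once the running sum grows large, bf16's 8-bit mantissa can no
    # longer represent the small per-element additions.
    acc = torch.zeros(t.shape[:-1], dtype=torch.bfloat16)
    for i in range(t.shape[-1]):
        acc = acc + t[..., i].to(torch.bfloat16)
    return (acc / t.shape[-1]).to(torch.float32)

torch.manual_seed(0)
x = torch.rand(1, 12, 3200, dtype=torch.float32)
ref = x.mean(dim=-1)
naive = bf16_accumulate_mean(x)
print((ref - naive).abs().max())  # large error from the saturated accumulator
```
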
Forge test cases:

git checkout pchandrasekaran/rms_norm_and_mean

# Before running the test, comment the xfail for data mismatch in this test
pytest forge/test/mlir/test_ops.py::test_reduce_mean[-1-input_shape2] -vss
@chandrasekaranpradeep (Author)

Created an issue in tt-metal for the ttnn.mean op tensor mismatch: tenstorrent/tt-metal#13621

@nvukobratTT (Contributor) commented Oct 9, 2024

@sdjordjevicTT we also confirmed that there is an issue on the ttnn side; here are the blocker issues at hand:

Note: This one is also marked as P0, as it exists in the Llama 3B model we're referencing.

@sdjordjevicTT (Contributor)

Great! Let's check with the TTNN folks what this is about.
