
[Bug Report] ttnn.mean op - Data Mismatch #13621

Open
chandrasekaranpradeep opened this issue Oct 9, 2024 · 16 comments

@chandrasekaranpradeep

Describe the bug
ttnn.mean throws an assertion error because of a data mismatch between the PyTorch and TTNN outputs: the PCC drops to 0.72 when an input tensor of shape (1, 12, 3200) with dim = -1 is passed to the ttnn.mean op.
For more context, here is the exact error message

def assert_with_pcc(expected_pytorch_result, actual_pytorch_result, pcc=0.9999):
        assert list(expected_pytorch_result.shape) == list(
            actual_pytorch_result.shape
        ), f"list(expected_pytorch_result.shape)={list(expected_pytorch_result.shape)} vs list(actual_pytorch_result.shape)={list(actual_pytorch_result.shape)}"
        pcc_passed, pcc_message = comp_pcc(expected_pytorch_result, actual_pytorch_result, pcc)
>       assert pcc_passed, construct_pcc_assert_message(pcc_message, expected_pytorch_result, actual_pytorch_result)
E       AssertionError: 0.7203957195745748
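For context, the PCC here is the Pearson correlation coefficient between the flattened expected and actual outputs. The internals of comp_pcc are not shown in this thread, so the following is only a minimal standalone sketch of such a check (an approximation for illustration, not the actual tt-metal helper):

```python
import torch

def pearson_cc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Flatten both tensors and compute the Pearson correlation
    # coefficient between them in float64 for numerical stability.
    x = expected.flatten().to(torch.float64)
    y = actual.flatten().to(torch.float64)
    return torch.corrcoef(torch.stack([x, y]))[0, 1].item()

# A perfect match yields a PCC of ~1.0; the failure above reports ~0.72.
a = torch.arange(8, dtype=torch.float32)
b = a.clone()
print(pearson_cc(a, b))
```

Note that the Pearson coefficient is invariant to per-tensor scaling and shifts, so a PCC this low indicates a structural discrepancy rather than a uniform offset.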

To Reproduce

Run the following test:

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc

def test_mean_pcc_issue(device):
    torch.manual_seed(0)

    input_shape = (1, 12, 3200)
    reduce_dim = -1
    
    torch_input_tensor = torch.rand(input_shape, dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
    
    output_tensor = ttnn.mean(input_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)
    
    assert_with_pcc(torch_output_tensor, output_tensor)

Expected behavior
The ttnn.mean output should match the PyTorch output, i.e. the PCC between them should meet the 0.9999 threshold.

@sdjordjevicTT
Contributor

@ntarafdar @sjameelTT can you please help me to find owners for this issue?

@ntarafdar
Contributor

Hey @sdjordjevicTT, asking around: it's a reduction op that doesn't have an owner. I'll ask the ttnn folks and get back to you.

@ntarafdar
Contributor

@sdjordjevicTT asked around and since there is no other owner for this, the TMs team will have to take this.
We cannot get to this until end of next week.

@ntarafdar ntarafdar assigned yugi957 and unassigned sjameelTT and ntarafdar Oct 29, 2024
@sdjordjevicTT
Contributor

Thanks @ntarafdar for picking this up. Great, I believe that should work for us.

@jvasilje
Collaborator

moving to a P1 issue. @sdjordjevicTT pls comment if you believe the P0 is justified.

@jvasilje jvasilje added P1 and removed P0 labels Oct 30, 2024
@sdjordjevicTT
Contributor

@nvukobratTT can comment more about priority, but I think this issue blocks Llama 3B bring-up on the Forge side.

@nvukobratTT

> moving to a P1 issue. @sdjordjevicTT pls comment if you believe the P0 is justified.

Confirming what @sdjordjevicTT mentioned, this one is a blocker for the Open Llama 3B model.

Additional details can be found on the MLIR issue as well:

@ntarafdar
Contributor

Spoke to Jasmine, and @bbradelTT is for now taking over reductions. I'm reassigning this to him.

@ntarafdar ntarafdar assigned bbradelTT and unassigned yugi957 Nov 5, 2024
@bbradelTT
Contributor

I tried to find out if there's any point at which there's a big drop off. Seemed like it might be somewhere between 1200 and 1400, but the PCC value goes up and down a fair amount:

    #input_shape = (1, 12, 3200) # .72
    #input_shape = (1, 12, 1600) # .76
    #input_shape = (1, 12, 1400) # .78
    #input_shape = (1, 12, 1376) # .81
    #input_shape = (1, 12, 1363) # .93
    #input_shape = (1, 12, 1369) # .83
    #input_shape = (1, 12, 1368) # .85
    #input_shape = (1, 12, 1367) # .71
    #input_shape = (1, 12, 1366) # .96
    #input_shape = (1, 12, 1350) # .95
    #input_shape = (1, 12, 1344) # .87
    #input_shape = (1, 12, 1300) # .91
    #input_shape = (1, 12, 1200) # .93
    #input_shape = (1, 12, 800) # .92
    #input_shape = (1, 12, 320) # .99
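One variable worth ruling out alongside the sweep above is reduced-precision accumulation. The probe below is torch-only and does not touch ttnn (it is purely an assumption to bound, not a reproduction of the TTNN kernel): it measures how much error a bfloat16 copy of the data introduces into the mean at the same widths.

```python
import torch

# Torch-only probe: compare a float32 mean against a mean computed on a
# bfloat16 copy of the data, for several of the widths probed above.
# This only bounds how much error 16-bit storage alone could contribute.
torch.manual_seed(0)
for w in (320, 800, 1200, 1400, 1600, 3200):
    x = torch.rand(1, 12, w)
    ref = x.mean(dim=-1)
    approx = x.to(torch.bfloat16).mean(dim=-1).to(torch.float32)
    max_err = (ref - approx).abs().max().item()
    print(f"width={w:5d}  max abs error={max_err:.5f}")
```

Errors on the order of 1e-3 are what bfloat16 rounding alone would produce; a PCC of 0.72 suggests a substantially larger discrepancy, so precision loss is unlikely to be the whole story.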

@sdjordjevicTT
Contributor

Hi @bbradelTT, do we have any updates regarding this mismatch problem?

@bbradelTT
Contributor

@sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.

@nvukobratTT

> @sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.

@bbradelTT thanks for the details. Can you clarify the following:

  1. What are the core issues with the reduce mean op and the related PCC drops?
    • With this info, we might be able to work around it until a fix is in place.
  2. Details around lowering the priority of this issue.
    • From the current standpoint, this issue should be treated as P0, as it blocks one of the Forge core models, Llama.

To be certain that this issue is properly tracked, I'm re-adding the P0 label once again. Please correct me if I'm missing some context as to why this one should still be a P1 issue.

Thanks!

@nvukobratTT nvukobratTT removed the P1 label Nov 13, 2024
@nvukobratTT nvukobratTT added the P0 label Nov 13, 2024
@bbradelTT
Contributor

@nvukobratTT

  1. At a high level, depending on the inputs, a number of different operations are involved (including transpose, auto format, and reshape) that don't work well when different dimensions are padded, and padding may not be done correctly in all cases (see [Bug Report] ttnn.max wrong results #12662 and [Bug Report] ttnn.mean returns wrong results and shapes #13647). I'm not sure of the root cause of the mismatch in this case, although in one of the other issues it appears to be because the entire tile is used, so there are extra 0s and the denominator is larger than it should be. That said, I still haven't isolated the exact issue for this specific scenario.
  2. I don't have context for the priority.
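The larger-denominator hypothesis in point 1 can be illustrated numerically. This is a hypothetical standalone sketch, not the TTNN kernel: summing a zero-padded row is harmless, but dividing by the padded width instead of the logical width biases the mean low.

```python
import torch

torch.manual_seed(0)

W = 1367                          # logical row width (one of the widths probed above)
TILE = 32                         # assumed tile width; rows pad up to a multiple of it
padded_W = ((W + TILE - 1) // TILE) * TILE

row = torch.rand(W)
padded_row = torch.zeros(padded_W)
padded_row[:W] = row              # the padding lanes stay zero

correct_mean = padded_row.sum() / W        # divide by the logical width
buggy_mean = padded_row.sum() / padded_W   # divide by the padded width

# The buggy mean is scaled down by exactly W / padded_W.
print(correct_mean.item(), buggy_mean.item(), W / padded_W)
```

For W = 1367 the padded width is 1376, so the mean is low by a factor of about 0.993. A constant per-row scale alone would not move the PCC, so the real kernel issue is likely more involved, but this shows how tile padding can leak into the denominator.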

Update for today:

  • I'm working on printing out the data in the circular buffers to see where any discrepancies are occurring. Will continue that work tomorrow.

@nvukobratTT

Thanks for pushing this one further @bbradelTT! Much appreciated 🙌

@bbradelTT
Contributor

Update for today:

  • I spent today in meetings and on my other currently active P0 that I didn't have a chance to look at yesterday.

@nvukobratTT

> Update for today:
>   • I spent today in meetings and on my other currently active P0 that I didn't have a chance to look at yesterday.

Thanks for the update and for letting us know, Borys :))

It's valuable for us to know the state of the issue and when we can expect it to be resolved, so that we can plan accordingly on our side as well. Thanks once again!
