
[Bug Report] ttnn.mean op - Data Mismatch #13621

Open
chandrasekaranpradeep opened this issue Oct 9, 2024 · 16 comments

@chandrasekaranpradeep

Describe the bug
ttnn.mean throws an assertion error because of a data mismatch between the PyTorch and TTNN outputs: the PCC drops to 0.72 when an input tensor of shape (1, 12, 3200) with dim = -1 is passed to the ttnn.mean op.
For more context, here is the exact error message

def assert_with_pcc(expected_pytorch_result, actual_pytorch_result, pcc=0.9999):
        assert list(expected_pytorch_result.shape) == list(
            actual_pytorch_result.shape
        ), f"list(expected_pytorch_result.shape)={list(expected_pytorch_result.shape)} vs list(actual_pytorch_result.shape)={list(actual_pytorch_result.shape)}"
        pcc_passed, pcc_message = comp_pcc(expected_pytorch_result, actual_pytorch_result, pcc)
>       assert pcc_passed, construct_pcc_assert_message(pcc_message, expected_pytorch_result, actual_pytorch_result)
E       AssertionError: 0.7203957195745748
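For context, the PCC here is the Pearson correlation coefficient between the flattened expected and actual outputs. The internals of comp_pcc are not shown in this thread, so the following is only a minimal standalone sketch of such a check (an approximation for illustration, not the actual tt-metal helper):

```python
import torch

def pearson_cc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Flatten both tensors and compute the Pearson correlation
    # coefficient between them in float64 for numerical stability.
    x = expected.flatten().to(torch.float64)
    y = actual.flatten().to(torch.float64)
    return torch.corrcoef(torch.stack([x, y]))[0, 1].item()

# A perfect match yields a PCC of ~1.0; the failure above reports ~0.72.
a = torch.arange(8, dtype=torch.float32)
b = a.clone()
print(pearson_cc(a, b))
```

Note that the Pearson coefficient is invariant to per-tensor scaling and shifts, so a PCC this low indicates a structural discrepancy rather than a uniform offset.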

To Reproduce

Run the following test:

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc

def test_mean_pcc_issue(device):
    torch.manual_seed(0)

    input_shape = (1, 12, 3200)
    reduce_dim = -1
    
    torch_input_tensor = torch.rand(input_shape, dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
    
    output_tensor = ttnn.mean(input_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)
    
    assert_with_pcc(torch_output_tensor, output_tensor)

Expected behavior
The ttnn.mean output should match the PyTorch output, i.e. the PCC between them should meet the 0.9999 threshold.

@sdjordjevicTT
Contributor

@ntarafdar @sjameelTT can you please help me to find owners for this issue?

@ntarafdar
Contributor

Hey @sdjordjevicTT, asking around: it's a reduction op that doesn't have an owner. I'll ask the ttnn folks and get back to you.

@ntarafdar
Contributor

@sdjordjevicTT asked around and since there is no other owner for this, the TMs team will have to take this.
We cannot get to this until end of next week.

@ntarafdar ntarafdar assigned yugi957 and unassigned sjameelTT and ntarafdar Oct 29, 2024
@sdjordjevicTT
Contributor

Thanks @ntarafdar for picking this up. Great, I believe that should work for us.

@jvasilje
Collaborator

moving to a P1 issue. @sdjordjevicTT pls comment if you believe the P0 is justified.

@jvasilje jvasilje added P1 and removed P0 labels Oct 30, 2024
@sdjordjevicTT
Contributor

@nvukobratTT can comment more about priority, but I think this issue blocks Llama 3B bring-up on the Forge side.

@nvukobratTT

> moving to a P1 issue. @sdjordjevicTT pls comment if you believe the P0 is justified.

Confirming what @sdjordjevicTT mentioned, this one is a blocker for the Open Llama 3B model.

Additional details can be found on the MLIR issue as well:

@ntarafdar
Contributor

Spoke to Jasmine, and @bbradelTT is for now taking over reductions. I'm reassigning this to him.

@ntarafdar ntarafdar assigned bbradelTT and unassigned yugi957 Nov 5, 2024
@bbradelTT
Contributor

I tried to find out if there's any point at which there's a big drop off. Seemed like it might be somewhere between 1200 and 1400, but the PCC value goes up and down a fair amount:

    #input_shape = (1, 12, 3200) # .72
    #input_shape = (1, 12, 1600) # .76
    #input_shape = (1, 12, 1400) # .78
    #input_shape = (1, 12, 1376) # .81
    #input_shape = (1, 12, 1363) # .93
    #input_shape = (1, 12, 1369) # .83
    #input_shape = (1, 12, 1368) # .85
    #input_shape = (1, 12, 1367) # .71
    #input_shape = (1, 12, 1366) # .96
    #input_shape = (1, 12, 1350) # .95
    #input_shape = (1, 12, 1344) # .87
    #input_shape = (1, 12, 1300) # .91
    #input_shape = (1, 12, 1200) # .93
    #input_shape = (1, 12, 800) # .92
    #input_shape = (1, 12, 320) # .99
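One variable worth ruling out alongside the sweep above is reduced-precision accumulation. The probe below is torch-only and does not touch ttnn (it is purely an assumption to bound, not a reproduction of the TTNN kernel): it measures how much error a bfloat16 copy of the data introduces into the mean at the same widths.

```python
import torch

# Torch-only probe: compare a float32 mean against a mean computed on a
# bfloat16 copy of the data, for several of the widths probed above.
# This only bounds how much error 16-bit storage alone could contribute.
torch.manual_seed(0)
for w in (320, 800, 1200, 1400, 1600, 3200):
    x = torch.rand(1, 12, w)
    ref = x.mean(dim=-1)
    approx = x.to(torch.bfloat16).mean(dim=-1).to(torch.float32)
    max_err = (ref - approx).abs().max().item()
    print(f"width={w:5d}  max abs error={max_err:.5f}")
```

Errors on the order of 1e-3 are what bfloat16 rounding alone would produce; a PCC of 0.72 suggests a substantially larger discrepancy, so precision loss is unlikely to be the whole story.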

@sdjordjevicTT
Contributor

Hi @bbradelTT, do we have any updates regarding this mismatch problem?

@bbradelTT
Contributor

@sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.

@nvukobratTT

> @sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.

@bbradelTT thanks for the details. Can you clarify the following:

  1. What are the core issues with the reduce mean op and the related PCC drops?
    • With this info, we might be able to work around it until a fix is in place.
  2. Details around lowering the priority of this issue.
    • From the current standpoint, this issue should be treated as P0, as it blocks one of the Forge core models, Llama.

To be certain that this issue is properly tracked, I'm re-adding the P0 label once again. Please correct me if I'm missing some context as to why this one should still be a P1 issue.

Thanks!

@nvukobratTT nvukobratTT removed the P1 label Nov 13, 2024
@nvukobratTT nvukobratTT added the P0 label Nov 13, 2024
@bbradelTT
Contributor

@nvukobratTT

  1. At a high level, depending on the inputs, a number of different operations are involved (including transpose, auto format, and reshape) that don't work well when different dimensions are padded, and padding may not be done correctly in all cases (see [Bug Report] ttnn.max wrong results #12662 and [Bug Report] ttnn.mean returns wrong results and shapes #13647). I'm not sure of the root cause of the mismatch in this case, although in one of the other issues it appears to be because the entire tile is used, so there are extra 0s and the denominator is larger than it should be. That said, I still haven't isolated the exact issue for this specific scenario.
  2. I don't have context for the priority.
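The larger-denominator hypothesis in point 1 can be illustrated numerically. This is a hypothetical standalone sketch, not the TTNN kernel: summing a zero-padded row is harmless, but dividing by the padded width instead of the logical width biases the mean low.

```python
import torch

torch.manual_seed(0)

W = 1367                          # logical row width (one of the widths probed above)
TILE = 32                         # assumed tile width; rows pad up to a multiple of it
padded_W = ((W + TILE - 1) // TILE) * TILE

row = torch.rand(W)
padded_row = torch.zeros(padded_W)
padded_row[:W] = row              # the padding lanes stay zero

correct_mean = padded_row.sum() / W        # divide by the logical width
buggy_mean = padded_row.sum() / padded_W   # divide by the padded width

# The buggy mean is scaled down by exactly W / padded_W.
print(correct_mean.item(), buggy_mean.item(), W / padded_W)
```

For W = 1367 the padded width is 1376, so the mean is low by a factor of about 0.993. A constant per-row scale alone would not move the PCC, so the real kernel issue is likely more involved, but this shows how tile padding can leak into the denominator.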

Update for today:

  • I'm working on printing out the data in the circular buffers to see where any discrepancies are occurring. Will continue that work tomorrow.

@nvukobratTT

Thanks for pushing this one further @bbradelTT! Much appreciated 🙌

@bbradelTT
Contributor

Update for today:

  • I spent today in meetings and on my other currently active P0 that I didn't have a chance to look at yesterday.

@nvukobratTT

> Update for today:
>   • I spent today in meetings and on my other currently active P0 that I didn't have a chance to look at yesterday.

Thanks for the update and for letting us know, Borys :))

It's valuable for us to know the state of the issue and when we can expect it to be resolved, so that we can plan accordingly on our side as well. Thanks once again!
