Falcon7b t3k/single-card demos hang non-deterministically #15059

Open
skhorasganiTT opened this issue Nov 14, 2024 · 2 comments

skhorasganiTT commented Nov 14, 2024

The Falcon7b t3k demo hangs non-deterministically, both on CI and locally. Hangs have been observed in the 1024 and 2048 sequence length tests, and in both the prefill and decode stages. It is unclear when the issue started, because recent CI runs (including this one for the commit below) have been passing. In addition, the error in the picture below sometimes occurs instead of a hang.

Commit: 16123a1
Command:
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest --disable-warnings -q -s --input-method=json --input-path='models/demos/t3000/falcon7b/input_data_t3000.json' models/demos/t3000/falcon7b/demo_t3000.py::test_demo_multichip[wormhole_b0-True-user_input0-8-True-perf_mode_2048_stochastic]

Example failing CI run: https://github.com/tenstorrent/tt-metal/actions/runs/11827306784
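Because the hang is intermittent, one way to estimate how often it occurs is to rerun the command above in a loop with a wall-clock timeout. The following is a minimal sketch, not part of tt-metal: the helper names (`run_once`, `repro_loop`, `falcon7b_t3k_command`) are hypothetical, and the timeout and run count are assumptions that would need tuning to the demo's normal runtime. It only classifies each run as pass, fail, or hang; it does not reset the device after a hang.

```python
import os
import subprocess
from collections import Counter


def run_once(cmd, timeout_s, extra_env=None):
    """Run cmd once and return 'pass', 'fail', or 'hang' (timeout exceeded)."""
    env = dict(os.environ, **(extra_env or {}))
    try:
        # subprocess.run kills the child process if the timeout expires.
        result = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "pass" if result.returncode == 0 else "fail"


def repro_loop(cmd, runs, timeout_s, extra_env=None):
    """Repeat the command and count how many runs pass, fail, or hang."""
    return Counter(run_once(cmd, timeout_s, extra_env) for _ in range(runs))


def falcon7b_t3k_command():
    """Return (cmd, env) for the t3k demo command quoted in this issue."""
    test_id = (
        "models/demos/t3000/falcon7b/demo_t3000.py::test_demo_multichip"
        "[wormhole_b0-True-user_input0-8-True-perf_mode_2048_stochastic]"
    )
    cmd = [
        "pytest", "--disable-warnings", "-q", "-s",
        "--input-method=json",
        "--input-path=models/demos/t3000/falcon7b/input_data_t3000.json",
        test_id,
    ]
    env = {"WH_ARCH_YAML": "wormhole_b0_80_arch_eth_dispatch.yaml"}
    return cmd, env
```

For example, `cmd, env = falcon7b_t3k_command()` followed by `print(repro_loop(cmd, runs=10, timeout_s=1800, extra_env=env))` would print a tally of outcomes. Both `runs=10` and `timeout_s=1800` are placeholder values.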

Image

@skhorasganiTT

@uaydonat possibly di/dt related

skhorasganiTT commented Nov 14, 2024

The single-card falcon7b functionality demo is also hanging non-deterministically on n300.
For example:
Single-card falcon7b functionality demo passing (16123a1): https://github.com/tenstorrent/tt-metal/actions/runs/11811548922/job/32905366114
Single-card falcon7b functionality demo hanging (a5d9979): https://github.com/tenstorrent/tt-metal/actions/runs/11816808517/job/32921118321

Note that 16123a1 was already hanging for t3k as stated in the issue description.

skhorasganiTT changed the title from "Falcon7b T3K demo hangs non-deterministically" to "Falcon7b demo hangs non-deterministically" Nov 14, 2024
skhorasganiTT changed the title from "Falcon7b demo hangs non-deterministically" to "Falcon7b t3k/single-card demos hang non-deterministically" Nov 14, 2024