Issue 4073: Fix for host-side hanging when an invalid DPRINT WAIT command is running on the device. #4103

tt-dma · 2023-12-01T01:36:58Z

The issue was that when an invalid WAIT is followed by more prints than fit in the print buffer, the core spins forever waiting for space in the print buffer. Since the WAIT is never satisfied, we get a deadlock in which the host also spins forever waiting for the core to finish.

Typically this isn't something we'd expect to happen (DPRINT RAISE/WAIT is only for ordering prints), but a user could write bad RAISE/WAITs, so at the very least we shouldn't hang.

The changes here try to detect this deadlock from the print server, then give a warning and exit cleanly if it happens.
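
To make the detection condition concrete, here is a minimal sketch of the check described above. It is not the code in this PR: `CoreState`, `is_wait_hang`, and `hang_warning` are hypothetical names, the warning prefix is made up, and the real print server tracks more state than this. In particular, a real server would also need to be sure that no other core can still raise the signal.

```cpp
#include <cstdint>
#include <set>
#include <string>

// Hypothetical snapshot of what the print server knows about one core.
struct CoreState {
    bool waiting_on_signal = false;  // server is holding this core's prints for a WAIT
    uint32_t wait_signal = 0;        // signal named by that WAIT
    bool print_buffer_full = false;  // device has no room left to write prints
};

// Returns true when the core looks deadlocked: its WAIT has not been
// answered by a RAISE, and its print buffer is full, so the device is
// spinning for buffer space that the server will never free.
bool is_wait_hang(const CoreState& core, const std::set<uint32_t>& raised_signals) {
    if (!core.waiting_on_signal) {
        return false;  // nothing is blocked on a WAIT
    }
    if (raised_signals.count(core.wait_signal) != 0) {
        return false;  // the WAIT can be satisfied, so prints will drain
    }
    return core.print_buffer_full;
}

// Builds a warning in the spirit of the message in this PR (assumed wording).
std::string hang_warning(int core_id, uint32_t wait_signal) {
    return "Possible DPRINT WAIT hang on core " + std::to_string(core_id) +
           ", waiting on a RAISE signal: " + std::to_string(wait_signal) + "\n";
}
```

A polling loop could call `is_wait_hang` for each core after reading its buffer and stop servicing prints once it returns true.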

Passing CI: https://github.com/tenstorrent-metal/tt-metal/actions/runs/7053409465

tt-dma · 2023-12-01T01:37:27Z

Also snuck in a fix for syntax highlighting in the DPRINT docs.

pgkeller · 2023-12-01T15:00:43Z

tt_metal/impl/debug/dprint_server.cpp

+                    ", waiting on a RAISE signal: " +
+                    to_string(wait_signal) + "\n";
+                stream << error_str << flush;
+                throw std::runtime_error(error_str);


I think rather than throw/catch we should use a TT_THROW for the error and not catch it, letting the program die. I'm open to other views though. If we don't let it die, I'm still not sure throw/catch is the right way to handle this vs. printing the warning and continuing.

tt-dma · 2023-12-01

One issue I ran into with using TT_THROW and letting the program die is that we then don't run any teardown functions, which messes things up for the next test(s) in the suite. Not sure if we have other tests that have a way around this?

pgkeller · 2023-12-01

Hmm. That has to be the test code capturing the throw and resuming; I guess we don't tear down properly in that case (I wonder if this will show up elsewhere at some point). Let's go with removing the throw/catch here and returning, then using the log warning as the message.

tt-dma · 2023-12-01

Ok, updated it to just return instead of throw/catch. I'm keeping the messages in both the default output and the print log, so it makes sense if the user only reads the print log file (it also makes it easy for the test to pick up).

tt-dma · 2023-12-09T09:23:36Z

CI passing after rebase: https://github.com/tenstorrent-metal/tt-metal/actions/runs/7148699190

tt-dma requested a review from pgkeller December 1, 2023 01:36

tt-dma requested review from DrJessop and davorchap as code owners December 1, 2023 01:36

pgkeller reviewed Dec 1, 2023

tt-dma force-pushed the dma/4073_dprint_wait_hang branch from f471a34 to f756cdb Compare December 1, 2023 22:41

pgkeller approved these changes Dec 1, 2023

DrJessop approved these changes Dec 11, 2023

tt-dma added 3 commits December 11, 2023 10:40

#0: Update code block language for DPRINT docs

21a6636

#4037: Add DPRINT server handling for invalid WAIT commands

3bc5d4d

An invalid WAIT command can cause the device to spin forever if the print buffer fills up afterwards. Add some detection for this in the print server so the host-side doesn't hang as well.

#4073: Throw instead of continue on DPRINT WAIT hang

6702969

tt-dma force-pushed the dma/4073_dprint_wait_hang branch from 8b08951 to 6702969 Compare December 11, 2023 18:40

tt-dma merged commit b6313e0 into main Dec 11, 2023
3 checks passed

tt-dma mentioned this pull request Dec 11, 2023

Host-side code hangs when DPRINT WAIT() is never answered #4073

Closed

tt-dma deleted the dma/4073_dprint_wait_hang branch January 16, 2024 20:07
