Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 4073: Fix for host-side hanging when an invalid DPRINT WAIT command is running on the device. #4103

Merged
merged 3 commits into from
Dec 11, 2023

Conversation

tt-dma
Copy link
Contributor

@tt-dma tt-dma commented Dec 1, 2023

The issue here was that when an invalid WAIT is followed by more prints than fit in the print buffer, the core will spin forever waiting for space in the print buffer. However, since the WAIT is never satisfied we get a deadlock where the host will also spin forever waiting for the core to finish.

Typically this isn't something we'd expect to have happen (since DPRINT RAISE/WAIT is only for ordering prints), but it's possible that a user could write bad RAISE/WAITs so we shouldn't hang at least.

Changes here are for trying to detect this deadlock from the print server, and give a warning and exit cleanly if it happens.

Passing CI: https://github.com/tenstorrent-metal/tt-metal/actions/runs/7053409465

@tt-dma
Copy link
Contributor Author

tt-dma commented Dec 1, 2023

Also snuck in a fix for syntax highlighting in the DPRINT docs

", waiting on a RAISE signal: " +
to_string(wait_signal) + "\n";
stream << error_str << flush;
throw std::runtime_error(error_str);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rather than throw/catch we should use a TT_THROW for the error and not catch it, let the program die. I'm open to other views though. If we don't let it die, still not sure throw/catch is the right way to handle this vs printing the warning and continuing.

Copy link
Contributor Author

@tt-dma tt-dma Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue I ran into with using TT_THROW and letting the program die, is that when we do that we don't run any teardown functions, which then messes things up for the next test(s) in the suite. Not sure if we have other tests that have a way around this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. That has to be the test code capturing the throw and resuming, guess we don't tear down properly in that case (wonder if this will show up elsewhere at some point). Let's go w/ removing the throw/catch here and returning, then using the log warning as the message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, updated it to just return instead of throw catch. Keeping the messages in both the default output and the print log so it makes sense if the user only reads the print log file (it also makes it easy for the test to pick up).

@tt-dma
Copy link
Contributor Author

tt-dma commented Dec 9, 2023

An invalid WAIT command can cause the device to spin forever if the
print buffer fills up afterwards. Add some detection for this in the
print server so the host-side doesn't hang as well.
@tt-dma tt-dma merged commit b6313e0 into main Dec 11, 2023
3 checks passed
@tt-dma tt-dma deleted the dma/4073_dprint_wait_hang branch January 16, 2024 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants