-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 4073: Fix for host-side hanging when an invalid DPRINT WAIT command is running on the device. #4103
Conversation
Also snuck in a fix for syntax highlighting in the DPRINT docs |
", waiting on a RAISE signal: " + | ||
to_string(wait_signal) + "\n"; | ||
stream << error_str << flush; | ||
throw std::runtime_error(error_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think rather than throw/catch we should use a TT_THROW for the error and not catch it, let the program die. I'm open to other views though. If we don't let it die, still not sure throw/catch is the right way to handle this vs printing the warning and continuing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue I ran into with using TT_THROW and letting the program die, is that when we do that we don't run any teardown functions, which then messes things up for the next test(s) in the suite. Not sure if we have other tests that have a way around this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. That has to be the test code capturing the throw and resuming, guess we don't tear down properly in that case (wonder if this will show up elsewhere at some point). Let's go w/ removing the throw/catch here and returning, then using the log warning as the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, updated it to just return instead of throw catch. Keeping the messages in both the default output and the print log so it makes sense if the user only reads the print log file (it also makes it easy for the test to pick up).
f471a34
to
f756cdb
Compare
CI passing after rebase: https://github.com/tenstorrent-metal/tt-metal/actions/runs/7148699190 |
An invalid WAIT command can cause the device to spin forever if the print buffer fills up afterwards. Add some detection for this in the print server so the host-side doesn't hang as well.
8b08951
to
6702969
Compare
The issue here was that when an invalid WAIT is followed by more prints than fit in the print buffer, the core will spin forever waiting for space in the print buffer. However, since the WAIT is never satisfied we get a deadlock where the host will also spin forever waiting for the core to finish.
Typically this isn't something we'd expect to have happen (since DPRINT RAISE/WAIT is only for ordering prints), but it's possible that a user could write bad RAISE/WAITs so we shouldn't hang at least.
Changes here are for trying to detect this deadlock from the print server, and give a warning and exit cleanly if it happens.
Passing CI: https://github.com/tenstorrent-metal/tt-metal/actions/runs/7053409465