#7724: Add prototype for autonomous streams for use in tunneller #8207

SeanNijjar · 2024-05-07T15:40:54Z

Summary

This PR adds a prototype for a chain of autonomous streams that will eventually be used to replace the ethernet tunneller component such that we can recover an erisc/ethernet link for user kernels. This will be possible because the streams are programmed to run completely autonomously without FW intervention after initial setup and final teardown.

In this PR, an end-to-end unit test is provided that generates a FW stream writer, relay stream (pair), and FW stream reader chain that forwards the specified number of messages. Configurable attributes are stream buffer sizes, total # messages, and num messages per phase.

This prototype runs with various configurations successfully. Collectively, if all tests are run, it comprises hours of runtime and thousands of stream resets, which pass without requiring a device reset.

Tests can be run with

./build/test/tt_metal/unit_tests_fast_dispatch --gtest_filter="*TestAutonomousRelayStreams"

The prototype seems stable for the first run coming out of reset but isn't stable after multiple back-to-back runs (it'll hang during the initial handshake). However, integration work can still begin while this bug is resolved. Additionally, I'd like to turn my attention to other feature work in the short term.

High Level Description

3 main components/kernels:

remote sender: this is a kernel that explicitly manages a stream via FW and sends data to the first relay stream
relay: this is the autonomous stream. The kernel code here is only responsible for setting up the blob in L1, kicking off the stream, and then waiting for finish signals from remote sender and remote receiver
remote receiver: this is a kernel that explicitly manages a stream via FW and receives data from the last relay stream

Remote sender and remote receiver are responsible for managing buffer rdptrs (including proper wraparound).
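As a rough illustration of the rdptr bookkeeping mentioned above, the sketch below advances a read pointer through a circular buffer and wraps it at the end. The function name and the choice of units are hypothetical and not taken from the kernels in this PR; it only models the wraparound arithmetic.

```python
def advance_rdptr(rdptr: int, msg_size: int, buffer_size: int) -> int:
    """Advance a circular-buffer read pointer by one message, wrapping at the end.

    All three values use the same unit (e.g. bytes or 16B words); the unit is
    an assumption of this sketch.
    """
    if not 0 <= rdptr < buffer_size:
        raise ValueError("rdptr must lie inside the buffer")
    if not 0 < msg_size <= buffer_size:
        raise ValueError("message must be non-empty and fit in the buffer")
    # Modulo arithmetic gives the wraparound behaviour.
    return (rdptr + msg_size) % buffer_size
```

For example, with a 16-unit buffer, a pointer at 12 that consumes a 4-unit message wraps back to 0.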

Phase Programming

Remote sender sends one message per phase
Relays and remote receiver send some num_messages_per_phase per phase (e.g. 256). This value is the same for both relay and remote receiver
- Relays and remote receiver alternate between 2 phases, switching phases after every num_messages_per_phase messages
- They are always set to next_phase_src|dest_change=true so that they can properly drain their buffers at the end of the phase
  - Without this draining at the end of the phase, senders/receivers will not be able to properly coordinate buffer rd/wrptrs across phase boundaries
  - (Keep in mind that we need num_messages_per_phase because stream hardware doesn't support unlimited messages per phase)
remote_sender will reset back to the first phase after sending 2 * num_messages_per_phase messages (and thus after 2 * num_messages_per_phase phase increments)
- e.g. if remote_sender starts in phase 1, and num_messages_per_phase=256, then after 512 messages/phases, it will reset to phase 1
remote_sender must set next_phase_src|dest_change=true for the phase before it would be sending the first message to its remote_receiver's new phase
- e.g. if num_messages_per_phase=8 and starting_phase=1, then next_phase_src|dest_change=true for phase=7 and phase=15 of remote_sender
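The phase arithmetic above can be modelled with a small sketch. It assumes 0-based message indices and only covers the two rules stated here: the sender advances one phase per message and returns to its starting phase after 2 * num_messages_per_phase messages, while the relays and remote receiver alternate between two phases. The function names are hypothetical, and the next_phase_src|dest_change flag is deliberately not modelled.

```python
def sender_phase(msg_index: int, starting_phase: int, num_messages_per_phase: int) -> int:
    """Phase used by remote_sender for message `msg_index` (0-based).

    The sender uses one phase per message and resets to its starting phase
    after 2 * num_messages_per_phase messages.
    """
    period = 2 * num_messages_per_phase
    return starting_phase + (msg_index % period)


def receiver_phase_slot(msg_index: int, num_messages_per_phase: int) -> int:
    """Which of the two alternating phases (0 or 1) a relay/receiver is in
    while handling message `msg_index` (0-based)."""
    return (msg_index // num_messages_per_phase) % 2
```

With starting_phase=1 and num_messages_per_phase=256, message index 511 is sent on phase 512 and message index 512 is back on phase 1, which matches the 512 messages/phases example above.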

Quirks

The current test is set up such that we read an input from DRAM and feed it through a CB to the remote_sender stream. For this reason, we require our noc-reads to be 32B aligned. This is a quirk specific to this test and not a general stream constraint. For that reason, you'll notice that stream_relay_remote_sender_reader.cpp and stream_relay_remote_receiver_writer.cpp insert and strip 16B of padding at the head of the packet (between buffer and payload). This maintains alignment constraints. When integrated with the dispatch datapath, this will not be required because we will not be reading/writing from/to DRAM.
I implement variable sized messages by "randomly" dropping some number of words from the payload in the stream_relay_remote_sender_reader kernel, without any of the downstream stream kernels being aware of this. The host-side output comparison knows how many words are dropped for each packet and skips checking those dropped words in the output tensor.
We currently bounce between +1 and +4096 for our starting phase based on current phase to avoid phase aliasing causing handshake problems. See "Stream Resets/Startup" note below.

Phase Selection

Streams are very finicky and have an undesirable trait: even after
they are reset, they expect the next phase they handshake on to be
different. So if in a prior run the sender finished on phase 1 and the
relay finished on phase 1, then neither stream should start on phase 1
in the next run.

For this reason, on stream startup, the FW inspects the stream's current
phase and, based on that, chooses a valid next starting phase. It sends
this starting phase information to its sender stream, if it has one. The
same is done for the downstream direction so receivers know which
remote_src_phase to handshake on.

Resets

After every run, we must tear down and reset the streams so they are
ready to use and able to handshake properly the next time a program uses
the AI accelerator. To reset properly, we need to ensure a few things:

The relay stream is not processing any data at the time of reset
- in other words, the full datapath should be flushed before reset
There should be no acks pending to be sent upstream.
- The receiver/relay kernels do this by checking for stream active and
  a special debug register

In a fully fleshed out design, this reset should ideally be done before
stream construction. Additionally, it must also be done in the event of
program failure (e.g. Ctrl+C, SIGKILL, etc.).
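A minimal sketch of the "safe to reset" check described above, assuming the two conditions can each be read as a boolean. The callables stand in for register reads (stream active, and the debug register that exposes pending acks); their names and the bounded polling loop are assumptions of this sketch, not the kernel implementation.

```python
from typing import Callable


def wait_until_safe_to_reset(
    stream_active: Callable[[], bool],
    acks_pending: Callable[[], bool],
    max_polls: int = 1000,
) -> bool:
    """Poll until the stream is idle and has no acks left to send upstream.

    Returns True once both conditions hold, or False if `max_polls` polls
    pass without reaching that state (hypothetical timeout behaviour).
    """
    for _ in range(max_polls):
        if not stream_active() and not acks_pending():
            return True
    return False
```

Only once this returns True would the reset be issued; what to do on a timeout is left open here.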

Limitations

There are some limitations that will always be true:

max_message_size == min(stream_buffer_size, sender_buffer_size)
- streams expect a header present for every message
- streams expect the entire message to be resident when send is
  started

There are currently some known limitations (may be lifted in future):

min # messages per phase = 128
max # messages per phase = 2048
- This is due to how the phase range selection is implemented
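These limits can be expressed as a host-side sanity check before launching a test. The sketch below is illustrative only: the function and constant names are hypothetical, and the numeric bounds are simply the ones listed above.

```python
from typing import List

MIN_MESSAGES_PER_PHASE = 128   # current known lower bound
MAX_MESSAGES_PER_PHASE = 2048  # current known upper bound


def validate_config(
    message_size: int,
    stream_buffer_size: int,
    sender_buffer_size: int,
    num_messages_per_phase: int,
) -> List[str]:
    """Return a list of human-readable violations (empty if the config is OK)."""
    errors = []
    # A message must be fully resident in both buffers.
    max_message_size = min(stream_buffer_size, sender_buffer_size)
    if message_size > max_message_size:
        errors.append(
            f"message_size {message_size} exceeds max_message_size {max_message_size}"
        )
    if not MIN_MESSAGES_PER_PHASE <= num_messages_per_phase <= MAX_MESSAGES_PER_PHASE:
        errors.append(
            f"num_messages_per_phase {num_messages_per_phase} outside "
            f"[{MIN_MESSAGES_PER_PHASE}, {MAX_MESSAGES_PER_PHASE}]"
        )
    return errors
```

A configuration with 64 messages per phase would be rejected here, which corresponds to the handshake hang described under Outstanding Bugs.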

Stream Resets/Startup

During teardown, we must guarantee:

all messages have been flushed through the datapath before initiating reset
FW managed remote sender/remote receiver are reset strictly before the autonomous streams
During rerun, we must guarantee that each core's starting phase is neither the current value held in the stream's curr_phase register, nor the relay/remote_receiver streams' remote_src_phase register.
- For this reason, we have an initialization phase where streams exchange phase information with their neighbours.
This is also why we unroll the relay stream phases, so that when we wrap around there is no ambiguity about phase IDs during handshake
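The rerun constraint can be sketched as a simple selection over candidate starting phases. The candidate list and the function name are hypothetical; the only rule taken from the text is that the chosen phase must differ from both the stream's curr_phase and the remote_src_phase value.

```python
from typing import Iterable


def choose_starting_phase(
    curr_phase: int,
    remote_src_phase: int,
    candidates: Iterable[int],
) -> int:
    """Pick the first candidate that matches neither register value."""
    forbidden = {curr_phase, remote_src_phase}
    for phase in candidates:
        if phase not in forbidden:
            return phase
    raise ValueError("no candidate starting phase avoids both register values")
```

Each core would run a check like this during the initialization phase and then pass the result on to its neighbours.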

Outstanding Bugs

fewer than 128 messages per phase leads to deterministic handshake hang
- this hang deterministically happens after min(num_phase_ranges,24) runs. 24 also happens to be the number of dest ready table entries for WH although it's unclear if this is a pure coincidence
  - disabling the dest_ready_table leads to immediate handshake hang and so wasn't pursued further

SeanNijjar · 2024-05-07T15:42:00Z

FYI @davorchap

Streams are autonomous data movement hardware engines present on every tensix and erisc core. Typically, streams are only capable of moving data in a static pattern from a predetermined ordering of senders/receivers. Luckily, for the tunneling use case, the producers and consumers are always the same and we only need to make sure we can forward messages indefinitely.

This prototype is the first step to enable streams in the dispatch datapath so that we may recover erisc cores for use by kernels. Since the stream can run autonomously with this setup, we can initialize it such that it implements tunnelling behaviour without erisc overhead. With the exception of some bandwidth sharing (L1 and ethernet) on the margins, a user kernel would never know the stream is busy working as the tunneler.

Indefinite message forwarding can be accomplished by creating two phases in the autonomous stream's blob and making the second phase point its next phase to the start of the first phase. This way, with the stream configured to auto-configure and auto-advance, it will end up looping forever.

The remaining challenge is to ensure that we can safely reset/tear down the stream so that the next time a program runs on the hardware, the remote sender dispatch core is able to establish a handshake with the relay stream. If it kept running in the background, the dispatch code path would have no idea how to intercept it and establish communication with it. Therefore we reset any time we need to tear down the dispatch datapath.

Streams are opaque and brittle, and this is not an originally intended use-case for them. However, ironically, it seems to map best with all of the other limitations provided with streams.

=== Phase Selection ===

Streams are very finicky and have an undesirable trait: even after they are reset, they expect the next phase they handshake on to be different. So if in a prior run the sender finished on phase 1 and the relay finished on phase 1, then neither stream should start on phase 1 in the next run.

For this reason, on stream startup, the FW inspects the stream's current phase and, based on that, chooses a valid next starting phase. It sends this starting phase information to its sender stream, if it has one. The same is done for the downstream direction so receivers know which `remote_src_phase` to handshake on.

=== Resets ===

After every run, we must tear down and reset the streams so they are ready to use and able to handshake properly the next time a program uses the AI accelerator. To reset properly, we need to ensure a few things:

1) The relay stream is *not* processing any data at the time of reset
- in other words, the full datapath should be flushed before reset
2) There should be no acks pending to be sent upstream.
- The receiver/relay kernels do this by checking for stream active and a special debug register

In a fully fleshed out design, this reset should ideally be done before stream construction. Additionally, it must also be done in the event of program failure (e.g. Ctrl+C, SIGKILL, etc.).

=== Limitations ===

There are some limitations that will always be true:

- max_message_size == min(stream_buffer_size, sender_buffer_size)
  - streams expect a header present for every message
  - streams expect the entire message to be resident when send is started

There are currently some known limitations (may be lifted in future):

- min # messages per phase = 128
  - fewer leads to deterministic handshake hang
  - this hang deterministically happens after min(num_phase_ranges,24) runs. 24 also happens to be the number of dest ready table entries for WH although it's unclear if this is a pure coincidence
    - disabling the dest_ready_table leads to immediate handshake hang and so wasn't pursued further
- max # messages per phase = 2048
  - This is due to how the phase range selection is implemented
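The two-phase loop described in this comment (the second phase pointing its next phase back at the first) can be pictured with a toy model. This is not the real blob layout: the PhaseConfig fields and the run_phases helper are hypothetical and only show why the stream keeps cycling once it is set to auto-advance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhaseConfig:
    name: str
    num_messages: int  # messages forwarded before the phase ends
    next_index: int    # index of the phase to auto-advance to


def run_phases(blob: List[PhaseConfig], total_messages: int, start_index: int = 0) -> List[str]:
    """Return the phase names visited while forwarding `total_messages`."""
    if any(phase.num_messages <= 0 for phase in blob):
        raise ValueError("each phase must forward at least one message")
    visited = []
    index = start_index
    remaining = total_messages
    while remaining > 0:
        phase = blob[index]
        visited.append(phase.name)
        remaining -= min(phase.num_messages, remaining)
        index = phase.next_index  # auto-advance; phase B points back to A
    return visited


# Two phases that point at each other form an endless loop.
blob = [PhaseConfig("A", 256, 1), PhaseConfig("B", 256, 0)]
```

Forwarding 1024 messages through this blob visits A, B, A, B; the loop only stops here because the model is given a finite message count, whereas the real stream runs until it is reset.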

SeanNijjar assigned ubcheema and aliuTT May 7, 2024

SeanNijjar unassigned ubcheema and aliuTT May 7, 2024

SeanNijjar requested review from ubcheema and aliuTT May 7, 2024 15:43

SeanNijjar self-assigned this May 7, 2024

SeanNijjar temporarily deployed to dev May 7, 2024 15:45 — with GitHub Actions Inactive

SeanNijjar temporarily deployed to dev May 7, 2024 15:52 — with GitHub Actions Inactive

SeanNijjar temporarily deployed to production May 7, 2024 16:13 — with GitHub Actions Inactive

SeanNijjar force-pushed the snijjar/issue-7724 branch 2 times, most recently from 5c10a94 to d35df90 Compare May 27, 2024 15:15

SeanNijjar force-pushed the snijjar/issue-7724 branch from d35df90 to 0781c54 Compare June 3, 2024 19:24

SeanNijjar temporarily deployed to dev June 3, 2024 19:26 — with GitHub Actions Inactive

SeanNijjar temporarily deployed to dev June 4, 2024 02:10 — with GitHub Actions Inactive

SeanNijjar temporarily deployed to dev June 4, 2024 02:11 — with GitHub Actions Inactive

SeanNijjar temporarily deployed to dev June 4, 2024 02:15 — with GitHub Actions Inactive

SeanNijjar had a problem deploying to dev June 4, 2024 02:29 — with GitHub Actions Error

SeanNijjar temporarily deployed to production June 4, 2024 02:32 — with GitHub Actions Inactive

SeanNijjar force-pushed the snijjar/issue-7724 branch from e00acf6 to bc6a3ca Compare June 4, 2024 03:47

SeanNijjar force-pushed the snijjar/issue-7724 branch from 07c9e83 to aa94961 Compare June 4, 2024 04:30

SeanNijjar merged commit d280a3f into main Jun 4, 2024
5 checks passed

SeanNijjar deleted the snijjar/issue-7724 branch June 4, 2024 04:31
