#7724: Add prototype for autonomous streams for use in tunneller

Streams are autonomous data movement hardware engines present on every tensix and erisc core. Typically, streams can only move data in a static pattern, following a predetermined ordering of senders and receivers. Luckily, for the tunnelling use case, the producers and consumers are always the same, and we only need to make sure we can forward messages indefinitely.

This prototype is the first step toward enabling streams in the dispatch datapath so that we can recover erisc cores for use by kernels. Since the stream can run autonomously with this setup, we can initialize it so that it implements tunnelling behaviour without erisc overhead. Apart from some bandwidth sharing (L1 and ethernet) on the margins, a user kernel would never know the stream is busy working as the tunneller.

Indefinite message forwarding is accomplished by creating two phases in the autonomous stream's blob and making the second phase point its next phase to the start of the first phase. With the stream configured to auto-configure and auto-advance, it loops forever.

The remaining challenge is to ensure that we can safely reset/teardown the stream so that the next time a program runs on the hardware, the remote sender dispatch core can establish a handshake with the relay stream. If the stream kept running in the background, the dispatch code path would have no way to intercept it and establish communication with it. Therefore we reset any time we need to tear down the dispatch datapath.

Streams are opaque and brittle, and this is not an originally intended use case for them. Ironically, though, it seems to map best onto all of the other limitations that streams come with.

=== Phase Selection ===

Streams are very finicky and have an undesirable trait: even after a reset, they expect the next phase they handshake on to be different from the last one. So if in a prior run the sender finished on phase 1 and the relay finished on phase 1, neither stream should start on phase 1 in the next run.

For this reason, on stream startup, the FW inspects the stream's current phase and, based on that, chooses a valid next starting phase. It sends this starting phase to its sender stream, if it has one. The same is done in the downstream direction so receivers know which `remote_src_phase` to handshake on.

=== Resets ===

After every run, we must tear down and reset the streams so they are ready to use and able to handshake properly the next time a program uses the AI accelerator. To reset properly, we need to ensure a few things:
1) The relay stream is *not* processing any data at the time of reset. In other words, the full datapath should be flushed before reset.
2) There should be no acks pending to be sent upstream.

The receiver/relay kernels do this by checking for stream active and a special debug register.

In a fully fleshed out design, this reset should ideally be done before stream construction. Additionally, it must also be done in the event of program failure (e.g. Ctrl+C, SIGKILL, etc.).

=== Limitations ===

There are some limitations that will always be true:
- max_message_size == min(stream_buffer_size, sender_buffer_size)
- streams expect a header present for every message
- streams expect the entire message to be resident when send is started

There are currently some known limitations (may be lifted in future):
- min # messages per phase = 128
  - fewer leads to deterministic handshake hang
  - this hang deterministically happens after min(num_phase_ranges,24) runs. 24 also happens to be the number of dest ready table entries for WH, although it's unclear if this is a pure coincidence
  - disabling the dest_ready_table leads to immediate handshake hang and so wasn't pursued further
- max # messages per phase = 2048
  - this is due to how the phase range selection is implemented
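
The phase selection rule and the numeric limits from this commit message can be checked off-device with a small host-side sketch. This is an illustration under stated assumptions, not the kernel code: `choose_start_phase` mirrors the 4096 threshold used by `get_first_available_phase_out_of_reset` in the diff but takes the post-reset phase as an argument instead of reading a stream register, and `messages_per_phase_ok` and `max_message_size` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical host-side mirror of get_first_available_phase_out_of_reset().
// The real helper reads the phase from the stream; here it is passed in.
// A stream that came out of reset in the low range restarts at 4096,
// otherwise it restarts at 1, so the handshake phase always differs.
constexpr uint32_t choose_start_phase(uint32_t phase_after_reset) {
    return phase_after_reset < 4096 ? 4096u : 1u;
}

// Per-phase bounds quoted in the commit message (may be lifted in future).
constexpr uint32_t kMinMessagesPerPhase = 128;
constexpr uint32_t kMaxMessagesPerPhase = 2048;

constexpr bool messages_per_phase_ok(uint32_t n) {
    return n >= kMinMessagesPerPhase && n <= kMaxMessagesPerPhase;
}

// max_message_size == min(stream_buffer_size, sender_buffer_size)
constexpr uint32_t max_message_size(uint32_t stream_buffer_size, uint32_t sender_buffer_size) {
    return std::min(stream_buffer_size, sender_buffer_size);
}
```

A sender and relay that both finished the previous run on phase 1 would each pick 4096 here, which satisfies the requirement that neither restarts on the phase it last used.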
tenstorrent · Jun 4, 2024 · d280a3f
1 parent 604789b
commit d280a3f
Showing 9 changed files with 2,174 additions and 0 deletions.
diff --git a/tests/tt_metal/tt_metal/test_kernels/dataflow/streams/stream_io_kernel_helpers.hpp b/tests/tt_metal/tt_metal/test_kernels/dataflow/streams/stream_io_kernel_helpers.hpp
@@ -0,0 +1,135 @@
+// SPDX-FileCopyrightText: © 2023 Tenstorrent Inc.
+//
+// SPDX-License-Identifier: Apache-2.0
+
+#pragma once
+
+#include <cstdint>
+
+#include "dataflow_api.h"
+#include "stream_interface.h"
+#include "tt_metal/hw/inc/wormhole/noc/noc_overlay_parameters.h"
+
+struct stream_state_t {
+    const uint32_t local_data_buffer_base_address;
+    const uint32_t local_msg_info_ptr_base_address;
+
+    uint32_t local_phase_id;
+    uint32_t messages_per_phase;
+    uint32_t msg_info_wrptr_addr;
+
+    uint32_t num_tiles_sent;
+    uint32_t tile_header_num_msgs;
+
+    uint32_t local_buffer_base_addr;
+    uint32_t local_buffer_size;
+    uint32_t local_msg_info_ptr;
+    uint32_t local_buffer_read_offset;
+
+    uint32_t remote_buffer_base_addr;
+    uint32_t remote_buffer_size;
+    uint32_t remote_msg_info_ptr;
+    uint32_t remote_buffer_write_offset;
+
+    uint32_t remote_phase_id;
+
+    uint32_t get_current_local_buffer_address() const {
+        return local_data_buffer_base_address + local_buffer_read_offset;
+    }
+};
+
+struct phase_iterator_t {
+    phase_iterator_t(uint32_t start_phase, uint32_t max_phase) :
+        phase_id(start_phase), max_phase(max_phase), start_phase(start_phase) {}
+    uint32_t phase_id;
+    uint32_t max_phase;
+    uint32_t start_phase;
+
+    FORCE_INLINE uint32_t get() const { return phase_id; }
+
+    FORCE_INLINE void increment() { phase_id = phase_id == max_phase ? start_phase : phase_id + 1; }
+};
+
+struct noc_endpoint_info_t {
+    uint32_t data_noc_id;
+    uint32_t update_noc_id;
+    uint32_t noc_x;
+    uint32_t noc_y;
+};
+
+#define STREAM_CFG(field, val) ((val) << (field))
+
+#define AUTO_CFG_HEADER(next_phase_num_cfg_reg_writes, curr_phase_num_msgs, phase_num_incr) \
+    ((uint32_t)(((next_phase_num_cfg_reg_writes) << 24) | ((curr_phase_num_msgs) << 12) | (phase_num_incr)))
+
+#define STREAM_REMOTE_DEST(dest_x, dest_y, dest_stream_id)                     \
+    (((dest_x) << STREAM_REMOTE_DEST_X) | ((dest_y) << STREAM_REMOTE_DEST_Y) | \
+     ((dest_stream_id) << STREAM_REMOTE_DEST_STREAM_ID))
+
+#define STREAM_REMOTE_SRC(src_x, src_y, src_stream_id) \
+    (((src_x) << STREAM_REMOTE_SRC_X) | ((src_y) << STREAM_REMOTE_SRC_Y) | ((src_stream_id) << REMOTE_SRC_STREAM_ID))
+
+FORCE_INLINE uint32_t
+blob_header_dw(uint32_t next_phase_num_cfg_reg_writes, uint32_t curr_phase_num_msgs, uint32_t phase_num_incr) {
+    return (next_phase_num_cfg_reg_writes << 24) | (curr_phase_num_msgs << 12) | phase_num_incr;
+}
+
+FORCE_INLINE void stream_phase_blob_run(
+    uint32_t stream_id, volatile uint32_t *blob_start_addr, uint32_t start_phase_num_cfg_regs) {
+    NOC_STREAM_WRITE_REG(stream_id, STREAM_PHASE_AUTO_CFG_PTR_REG_INDEX, reinterpret_cast<uint32_t>(blob_start_addr));
+    NOC_STREAM_WRITE_REG(
+        stream_id, STREAM_PHASE_AUTO_CFG_HEADER_REG_INDEX, start_phase_num_cfg_regs << NEXT_PHASE_NUM_CFG_REG_WRITES);
+    NOC_STREAM_WRITE_REG(
+        stream_id,
+        STREAM_MISC_CFG_REG_INDEX,
+        (0x1 << PHASE_AUTO_CONFIG) | (1 << NEXT_PHASE_SRC_CHANGE) | (1 << NEXT_PHASE_DEST_CHANGE));
+}
+FORCE_INLINE void stream_phase_blob_run(
+    uint32_t stream_id,
+    volatile uint32_t *blob_start_addr,
+    uint32_t num_messages_per_phase,
+    uint32_t start_phase_num_cfg_regs) {
+    NOC_STREAM_WRITE_REG(stream_id, STREAM_PHASE_AUTO_CFG_PTR_REG_INDEX, reinterpret_cast<uint32_t>(blob_start_addr));
+
+    NOC_STREAM_WRITE_REG(
+        stream_id,
+        STREAM_PHASE_AUTO_CFG_HEADER_REG_INDEX,
+        blob_header_dw(start_phase_num_cfg_regs, num_messages_per_phase, 1));
+    NOC_STREAM_WRITE_REG(
+        stream_id,
+        STREAM_MISC_CFG_REG_INDEX,
+        (0x1 << PHASE_AUTO_ADVANCE) | (0x1 << PHASE_AUTO_CONFIG) | (1 << NEXT_PHASE_SRC_CHANGE) |
+            (1 << NEXT_PHASE_DEST_CHANGE));
+    NOC_STREAM_WRITE_REG(stream_id, STREAM_PHASE_ADVANCE_REG_INDEX, 1);
+}
+
+FORCE_INLINE uint32_t blob_cfg_dw(uint32_t reg_index, uint32_t reg_val) { return (reg_val << 8) | reg_index; }
+
+FORCE_INLINE uint32_t set_blob_reg_field(uint32_t blob_dw, uint32_t field_width, uint32_t field_offset, uint32_t val) {
+    uint32_t mask = ((1u << field_width) - 1) << field_offset;
+    return (blob_dw & ~mask) | ((val << field_offset) & mask);
+}
+
+FORCE_INLINE uint32_t get_first_available_phase_out_of_reset(uint32_t stream_id) {
+    uint32_t stream_phase_coming_out_of_reset = stream_get_curr_phase(stream_id);
+    return (
+        stream_phase_coming_out_of_reset < 4096 ? 4096 : 1);
+}
+
+FORCE_INLINE uint32_t notify_remote_receiver_of_starting_phase(
+    uint32_t stream_id, uint32_t local_buffer_addr, uint64_t remote_receiver_noc_addr) {
+    uint32_t starting_phase = get_first_available_phase_out_of_reset(stream_id);
+    ASSERT(starting_phase > 0);
+    *reinterpret_cast<volatile uint32_t *>(local_buffer_addr) = starting_phase;
+    noc_async_write(local_buffer_addr, remote_receiver_noc_addr, sizeof(uint32_t));
+    // noc_semaphore_set_remote(local_buffer_addr, remote_receiver_noc_addr);
+    noc_async_writes_flushed();
+    return starting_phase;
+}
+
+FORCE_INLINE uint32_t wait_for_remote_source_starting_phase(volatile uint32_t *addr) {
+    while (*addr == 0) {
+        asm volatile("nop");
+    }
+    return *addr;
+}
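
The word layouts built by `blob_header_dw`, `blob_cfg_dw`, and the wrap-around in `phase_iterator_t` can be sanity-checked off-device. The sketch below copies the shifts from the helpers in this file and drops `FORCE_INLINE` and all register access. The `_host` names are hypothetical, and the bit ranges in the comments are only what the shifts imply, not a statement about the hardware register specification.

```cpp
#include <cstdint>

// Header word as packed by blob_header_dw(): the shifts place the next-phase
// config write count at bit 24, the message count at bit 12, and the phase
// increment at bit 0.
constexpr uint32_t blob_header_dw_host(uint32_t next_phase_num_cfg_reg_writes,
                                       uint32_t curr_phase_num_msgs,
                                       uint32_t phase_num_incr) {
    return (next_phase_num_cfg_reg_writes << 24) | (curr_phase_num_msgs << 12) | phase_num_incr;
}

// Config word as packed by blob_cfg_dw(): register index in the low byte,
// register value shifted above it.
constexpr uint32_t blob_cfg_dw_host(uint32_t reg_index, uint32_t reg_val) {
    return (reg_val << 8) | reg_index;
}

// Same wrap-around rule as phase_iterator_t: after max_phase, go back to
// start_phase; otherwise advance by one.
struct phase_iterator_host {
    uint32_t phase_id;
    uint32_t max_phase;
    uint32_t start_phase;

    phase_iterator_host(uint32_t start, uint32_t max) :
        phase_id(start), max_phase(max), start_phase(start) {}

    uint32_t get() const { return phase_id; }

    void increment() { phase_id = (phase_id == max_phase) ? start_phase : phase_id + 1; }
};
```

With these shifts, the documented maximum of 2048 messages per phase still fits below bit 24, so it does not spill into the config write count field.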