#7724: Add prototype for autonomous streams for use in tunneller #8207

Merged 1 commit on Jun 4, 2024

Commits on Jun 4, 2024

  1. #7724: Add prototype for autonomous streams for use in tunneller

    Streams are autonomous data movement hardware engines present on every
    tensix and erisc core. Typically, streams are only capable of moving
    data in a static pattern from a predetermined ordering of
    senders/receivers. Luckily, for the tunneling use case, the producers and
    consumers are always the same and we only need to make sure we can
    forward messages indefinitely.
    
    This prototype is the first step to enable streams in the dispatch
    datapath so that we may recover erisc cores for use by kernels. Since
    the stream can run autonomously with this setup, we can initialize it
    such that it implements tunnelling behaviour without erisc overhead.
    With the exception of some bandwidth sharing (L1 and ethernet) on the
    margins, a user kernel would never know the stream is busy working as
    the tunneler.
    
    Indefinite message forwarding can be accomplished by creating two phases
    in the autonomous stream's blob and making the second phase point its
    next phase to the start of the first phase. This way, with the stream
    configured to auto-configure and auto-advance, it will end up looping
    forever. The remaining challenge is to ensure that we can safely
    reset/teardown the stream so that the next time a program runs on the
    hardware, the remote sender dispatch core is able to establish a
    handshake with the relay stream. If it kept running in the background,
    the dispatch code path would have no idea how to intercept it and
    establish communication with it. Therefore we reset the stream any
    time we need to tear down the dispatch datapath.
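
    The looping arrangement described above can be modelled in a few
    lines. This is only an illustrative sketch, not the real blob format:
    `Phase`, `build_looping_blob`, and `phases_visited` are hypothetical
    names, and the "blob" here is a plain Python list rather than the
    register-level encoding the hardware consumes.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """Hypothetical stand-in for one phase entry in a stream's blob."""
    phase_id: int
    num_msgs: int
    next_index: int  # index of the phase the stream auto-advances to


def build_looping_blob(first_id, num_msgs):
    """Build two phases; the second points back at the first (a cycle)."""
    return [
        Phase(first_id, num_msgs, next_index=1),
        Phase(first_id + 1, num_msgs, next_index=0),
    ]


def phases_visited(blob, total_msgs):
    """List the phase ids an auto-advancing stream would pass through
    while forwarding total_msgs messages."""
    visited = []
    index = 0
    remaining = total_msgs
    while remaining > 0:
        phase = blob[index]
        visited.append(phase.phase_id)
        remaining -= phase.num_msgs
        index = phase.next_index
    return visited


blob = build_looping_blob(first_id=1, num_msgs=128)
print(phases_visited(blob, total_msgs=600))  # [1, 2, 1, 2, 1]
```

    Because the second entry's next pointer is index 0, the walk never
    runs off the end of the blob, however many messages are forwarded.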
    
    Streams are opaque and brittle, and this is not an originally intended
    use-case for them. However, ironically, it seems to map best onto all
    of the other limitations that streams impose.
    
    === Phase Selection ===
    Streams are very finicky and have an undesirable trait where even if
    they are reset, they expect the next phase they handshake on to be
    different. So if in a prior run the sender finished on phase 1 and the
    relay finished on phase 1, then neither stream should start on phase 1
    in the next run.
    
    For this reason, on stream startup, the FW inspects the stream's current
    phase and based on that, chooses a valid next starting phase. It sends
    this starting phase information to its sender stream, if it has one. The
    same is done for the downstream direction so receivers know which
    `remote_src_phase` to handshake on.
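
    One way to picture the selection rule is as a rotation through phase
    ranges. The sketch below is an assumption-laden illustration, not the
    firmware's algorithm: `NUM_PHASE_RANGES`, `PHASES_PER_RANGE`, and
    `next_starting_phase` are hypothetical, and the constants are chosen
    only to keep the example small.

```python
# Assumed, illustrative values; the real firmware constants are not given.
NUM_PHASE_RANGES = 4
PHASES_PER_RANGE = 2  # one looping pair of phases per range


def next_starting_phase(current_phase):
    """Pick the first phase of the range after the one holding current_phase.

    Since the chosen range always differs from the current one, the
    returned phase can never equal the phase the stream last stopped on.
    """
    current_range = (current_phase // PHASES_PER_RANGE) % NUM_PHASE_RANGES
    next_range = (current_range + 1) % NUM_PHASE_RANGES
    return next_range * PHASES_PER_RANGE


# A stream that finished its previous run on phase 1 restarts on phase 2.
print(next_starting_phase(1))  # 2
```

    In this model the value returned for the relay would also be what it
    sends upstream, and what a downstream receiver would use as its
    `remote_src_phase`.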
    
    === Resets ===
    After every run, we must tear down and reset the streams so they are
    ready to use and able to handshake properly the next time a program uses
    the AI accelerator. To reset properly, we need to ensure a few things:
    1) The relay stream is *not* processing any data at the time of reset
       - in other words, the full datapath should be flushed before reset
    2) There should be no acks pending to be sent upstream.
       The receiver/relay kernels do this by checking the stream's active
       status and a special debug register
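
    The two conditions amount to a poll-until-quiescent loop before the
    reset is issued. A minimal host-side sketch, assuming the two checks
    are exposed as callables: `stream_is_active` and `acks_pending` are
    hypothetical placeholders for the stream-active check and the debug
    register read, and the timeout is an addition for illustration.

```python
import time


def wait_until_safe_to_reset(stream_is_active, acks_pending,
                             timeout_s=1.0, poll_s=0.001):
    """Poll until the stream is idle and no acks are outstanding.

    Returns True once both conditions hold, or False if timeout_s
    elapses first, in which case the caller should not reset yet.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not stream_is_active() and not acks_pending():
            return True
        time.sleep(poll_s)
    return False


# Fake status source: the stream reports active twice, then goes idle.
polls = iter([True, True, False])
ok = wait_until_safe_to_reset(lambda: next(polls, False), lambda: False)
print(ok)  # True
```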
    
    In a fully fleshed out design, this reset should ideally be done before
    stream construction. Additionally, it must also be done in the event of
    program failure (e.g. Ctrl+C or SIGKILL).
    
    === Limitations ===
    
    There are some limitations that will always be true:
    - max_message_size == min(stream_buffer_size, sender_buffer_size)
      - streams expect a header present for every message
      - streams expect the entire message to be resident when send is
        started
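
    The size bound can be expressed as a simple check. This sketch
    assumes, without confirmation from the source, that the header counts
    against the limit; `max_message_size` and `message_fits` are
    hypothetical helper names, and the byte values are made up.

```python
def max_message_size(stream_buffer_size, sender_buffer_size):
    """Largest message that can be fully resident on both sides."""
    return min(stream_buffer_size, sender_buffer_size)


def message_fits(header_bytes, payload_bytes,
                 stream_buffer_size, sender_buffer_size):
    """Check a message (header included, by assumption) against the bound."""
    limit = max_message_size(stream_buffer_size, sender_buffer_size)
    return header_bytes + payload_bytes <= limit


print(max_message_size(stream_buffer_size=4096, sender_buffer_size=2048))  # 2048
```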
    
    There are currently some known limitations (may be lifted in future):
    - min # messages per phase = 128
      - fewer leads to deterministic handshake hang
        - this hang deterministically happens after min(num_phase_ranges,24)
          runs. 24 also happens to be the number of dest ready table entries
          for WH although it's unclear if this is a pure coincidence
          - disabling the dest_ready_table leads to immediate handshake hang
            and so wasn't pursued further
    - max # messages per phase = 2048
      - This is due to how the phase range selection is implemented
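
    The per-phase bounds above could be enforced when a phase is
    configured. A minimal sketch: the constant and function names are
    hypothetical, and only the values 128 and 2048 come from the list
    above.

```python
MIN_MSGS_PER_PHASE = 128   # fewer leads to a deterministic handshake hang
MAX_MSGS_PER_PHASE = 2048  # bound imposed by the phase range selection


def check_msgs_per_phase(num_msgs):
    """Return num_msgs if it lies within the currently known limits."""
    if not MIN_MSGS_PER_PHASE <= num_msgs <= MAX_MSGS_PER_PHASE:
        raise ValueError(
            f"messages per phase must be in "
            f"[{MIN_MSGS_PER_PHASE}, {MAX_MSGS_PER_PHASE}], got {num_msgs}"
        )
    return num_msgs


print(check_msgs_per_phase(128))  # 128
```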
    SeanNijjar committed Jun 4, 2024
    Commit aa94961