Tenstorrent has built the future of AI architecture and parallel programming. It achieves high performance on current AI models but is flexible and programmable, enabling the invention of future AI models and HPC (High-Performance Computing) applications without the constraints of current architectures. It is designed for both inference and training and, from the ground up, for scale-out of AI workloads, while also allowing scaling down to just a couple of cores. Additionally, it is built from cost-effective components: simple packages, GDDR memory, and Ethernet. This document describes the architecture.
- Scalable Architecture
- MIMD and Control of Both Compute and Data
- Everything is a RISCV kernel
- Bare Metal C/C++ kernels on RISCV
- User Kernels: Explicit and Decoupled Data Movement and Compute
- Data Movement Kernels
- Compute Kernels
- Ethernet Data Movement Kernels
- Read-Compute-Write kernel pipeline
- Dispatch Kernels
- Efficiency of Tiled-Based Compute and Data Movement
- Interleaved and Sharded Buffers
- Fast Kernel Dispatch
- FAQ
A Tensix Core is:
- 5 small RISC-V processors (aka "Baby RISCVs") that run C/C++ kernels and dispatch instructions to the compute and data movement engines
- 1 MB of SRAM (aka L1), a scratch pad accessible by all RISCVs and engines within the core
- Matrix engine (aka FPU) that performs matrix multiplication, elementwise, and dot product operations on small matrices (or tiles) of shape 32x32 and similar
- Vector engine (aka SFPU) for vectorized kernels such as Top-k, Sort and special functions such as GELU, Exp, and Sqrt
- Data Movement engine connected to 2 Networks on Chip (NoCs)
A chip is a collection of cores and I/O blocks, connected into a mesh via a NoC:
- Tensix compute cores (each with local SRAM)
- DRAM memory banks
- Ethernet cores for chip-to-chip interconnect
- PCIe link for host interface
- ARC core for board management and control
The high-bandwidth, large-capacity SRAM in each Tensix core is a form of near-memory compute. A Tensix core operating on its local SRAM achieves the "silicon peak" of what the current technology node allows. Tensix cores are connected into a mesh via 2 NoCs, and each Tensix core can communicate with any other Tensix core in the mesh, with off-chip DRAM, and with the Ethernet cores. GPUs fracture SRAM across levels within the chip: large register files, a small L1, and an L2. There, SRAM is primarily used for re-use and pre-fetch on the way to off-chip DRAM, not as a primary form of tensor storage, and large parts of it do not run at peak silicon speed. In contrast, in the TT architecture the entire SRAM sits in one single level, and its significant capacity allows it to be used for intermediates between operations, without relying on HBM as the primary storage for handing off data between operations.
The mesh-of-Tensix-cores architecture is the first to efficiently implement distributed shared memory within the chip and to enable programmers and compilers to optimize both the layout and the movement of data. In many AI and HPC operations, such as elementwise ops, tensors can be laid out (i.e. "sharded") across the SRAMs so that compute operates on the local data in place, without any data movement. See the Scalable Architecture section for further elaboration.
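To make sharding concrete, here is a minimal, purely illustrative sketch (plain C++ with hypothetical shard bookkeeping, not the tt-metal API) of a tensor split row-wise across cores so that an elementwise add touches only core-local data:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration: a rows x cols tensor sharded row-wise across
// num_cores cores. Each core owns a contiguous slab of rows in its local
// SRAM, so an elementwise add needs no data movement at all.
struct Shard {
    uint32_t start_row;          // first row owned by this core
    uint32_t num_rows;           // rows resident in this core's SRAM
    std::vector<float> a, b, c;  // stand-ins for core-local SRAM buffers
};

std::vector<Shard> shard_rows(uint32_t rows, uint32_t cols, uint32_t num_cores) {
    std::vector<Shard> shards(num_cores);
    uint32_t rows_per_core = (rows + num_cores - 1) / num_cores;
    for (uint32_t core = 0; core < num_cores; ++core) {
        uint32_t start = core * rows_per_core;
        uint32_t count = start < rows ? std::min(rows_per_core, rows - start) : 0;
        shards[core] = {start, count,
                        std::vector<float>(count * cols),
                        std::vector<float>(count * cols),
                        std::vector<float>(count * cols)};
    }
    return shards;
}

// Each core runs this on its own shard only: c = a + b, fully local.
void eltwise_add_local(Shard& s) {
    for (size_t i = 0; i < s.c.size(); ++i) s.c[i] = s.a[i] + s.b[i];
}
```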
The performance and efficiency of data movement in AI and HPC applications is as important as the raw compute capacity of the math engines. In Tensix, data movement is explicit and decoupled from compute. Data movement kernels use the data movement engine in each Tensix core to bring data from neighbouring cores or from off-chip DRAM into the local SRAM of the core, and then trigger the compute engine to operate on it. Data movement in the TT architecture can be pre-planned, optimized, and debugged separately from the compute. There are no caches, no global crossbars, no memory access coalescing, and none of the other complex mechanisms that traditional architectures use to hide data movement from the programmer or compiler. For deeper insight, see the section User Kernels: Explicit and Decoupled Data Movement and Compute.
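As an illustration, a minimal data movement (reader) kernel might look like the sketch below. It is written in the tt-metal dataflow-kernel style (kernel_main, dataflow_api.h, circular buffers); the runtime-argument layout and the fixed source NoC coordinates are hypothetical choices for this example, not a prescribed interface:

```cpp
#include <cstdint>
#include "dataflow_api.h"

// Sketch of a reader kernel: pull tiles from a remote source (e.g. a DRAM
// bank or another core's L1) into a local circular buffer so that the
// compute kernel on this core can consume them.
void kernel_main() {
    // Hypothetical runtime-argument layout for this example.
    uint32_t src_addr  = get_arg_val<uint32_t>(0);  // source base address
    uint32_t src_noc_x = get_arg_val<uint32_t>(1);  // source NoC x coordinate
    uint32_t src_noc_y = get_arg_val<uint32_t>(2);  // source NoC y coordinate
    uint32_t num_tiles = get_arg_val<uint32_t>(3);  // total tiles to read

    constexpr uint32_t cb_in0 = 0;                  // input circular buffer index
    const uint32_t tile_bytes = get_tile_size(cb_in0);

    for (uint32_t i = 0; i < num_tiles; ++i) {
        // Reserve space in the local CB, issue an asynchronous NoC read into
        // it, then make the tile visible to the compute kernel.
        cb_reserve_back(cb_in0, 1);
        uint32_t l1_write_addr = get_write_ptr(cb_in0);
        uint64_t src_noc_addr  = get_noc_addr(src_noc_x, src_noc_y, src_addr);
        noc_async_read(src_noc_addr, l1_write_addr, tile_bytes);
        noc_async_read_barrier();
        cb_push_back(cb_in0, 1);
        src_addr += tile_bytes;
    }
}
```

Note how nothing here touches the math engine: the reader only fills circular buffers, and the compute kernel (see the example later in this document) only waits on them, which is what makes the two sides separately plannable and debuggable.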
In Tensix, compute instructions operate on tiles -- 32x32 matrices of scalars. Operating on coarse chunks of data allows simple, single-threaded RISC-V processors to dispatch these instructions. Similarly, the data movement RISC-Vs issue asynchronous tile-sized data movement instructions to bring data into the scratch SRAM, allowing a large number of outstanding transfers to be generated by a single RISC-V data movement processor, concurrently with the compute engine.
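For a sense of the granularity, a small back-of-the-envelope helper (plain C++, assuming a 2-byte datum such as bfloat16) shows how a tensor decomposes into tiles and how few dispatch-granularity work items that implies:

```cpp
#include <cstdint>
#include <cstdio>

// One tile is a 32x32 matrix of scalars. Assuming a 2-byte datum
// (e.g. bfloat16), one tile occupies 32 * 32 * 2 = 2048 bytes.
constexpr uint32_t TILE_DIM    = 32;
constexpr uint32_t DATUM_BYTES = 2;  // assumption: bfloat16
constexpr uint32_t TILE_BYTES  = TILE_DIM * TILE_DIM * DATUM_BYTES;

int main() {
    // A 1024 x 1024 tensor decomposes into a 32 x 32 grid of tiles:
    // 1024 tiles in total, i.e. roughly a thousand tile-sized instructions
    // to dispatch instead of ~1M per-element operations.
    uint32_t rows = 1024, cols = 1024;
    uint32_t num_tiles = (rows / TILE_DIM) * (cols / TILE_DIM);
    std::printf("tiles = %u, bytes per tile = %u\n", num_tiles, TILE_BYTES);
    return 0;
}
```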
Each RISC-V processor runs single-threaded, and the core-to-thread mapping is 1:1. Thus, parallelization involves breaking the work across cores and dispatching kernels directly to cores. This is in contrast to a complex thread scheduling scheme in which a very large number of threads is time-slice scheduled onto a limited number of cores. As a result, in the TT architecture there is no context switching or complex thread scheduling. Once a kernel is dispatched to a core, it runs to completion without interruption or preemption by another thread. This simplifies reasoning about performance: it boils down to direct cycle counting of sections of a C/C++ kernel running on a bare-metal RISC-V core.
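A minimal sketch of this core-granularity parallelization (plain C++; the per_core_block_cnt naming mirrors the compute kernel example later in this document, everything else is illustrative):

```cpp
#include <cstdint>
#include <vector>

// Illustrative work split: divide total_blocks units of work (e.g. blocks of
// tiles) across an x_cores * y_cores grid. Each core is assigned a fixed
// slice up front and then runs its kernel to completion -- there is no
// thread scheduler, context switch, or preemption involved.
struct CoreWork {
    uint32_t core_x, core_y;
    uint32_t per_core_block_cnt;  // runtime arg handed to that core's kernel
};

std::vector<CoreWork> split_work(uint32_t total_blocks, uint32_t x_cores, uint32_t y_cores) {
    const uint32_t num_cores = x_cores * y_cores;
    const uint32_t base  = total_blocks / num_cores;
    const uint32_t extra = total_blocks % num_cores;  // first `extra` cores get one more block
    std::vector<CoreWork> work;
    work.reserve(num_cores);
    for (uint32_t i = 0; i < num_cores; ++i) {
        work.push_back({i % x_cores, i / x_cores, base + (i < extra ? 1u : 0u)});
    }
    return work;
}
```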
Equally important, it simplifies direct debugging of kernels via gdb step-through, breakpoints, and printf from the cores.
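For example, a kernel can emit diagnostics directly from a core. The sketch below assumes tt-metal's device-print (DPRINT) facility; treat the header path and stream syntax as an assumption rather than a definitive interface:

```cpp
#include <cstdint>
#include "dataflow_api.h"
#include "debug/dprint.h"  // assumed tt-metal device-print header

// Minimal sketch: print a runtime argument and a per-iteration marker from a
// bare-metal RISC-V core; the output is collected on the host side.
void kernel_main() {
    uint32_t num_tiles = get_arg_val<uint32_t>(0);
    DPRINT << "num_tiles = " << num_tiles << ENDL();
    for (uint32_t i = 0; i < num_tiles; ++i) {
        // ... move or process one tile ...
        DPRINT << "done tile " << i << ENDL();
    }
}
```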
AI workloads operate on tensors (N-dimensional data) and exhibit a high degree of locality and regularity in the fundamental compute operations:
- Elementwise operations are entirely local to each element of the tensor (in place) and can be performed without any data movement
- Matrix multiplication operations have regular communication across the rows and columns of the matrix
- Reduction operations can be decomposed across dimensions, such as columns, rows, and nearest neighbours in a matrix
- Window-based (stencil) operations, such as convolutions, exchange data with their neighbours
These data movement patterns (local, row/column, nearest neighbour) are most efficiently implemented via a regular and scalable mesh architecture.
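As a small illustration of the nearest-neighbour case, the sketch below (plain C++, illustrative coordinates only) lists the cores a given core would exchange stencil halo data with when the tensor is sharded across an x_cores * y_cores mesh; every such transfer is a single hop on the mesh NoC:

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: neighbours of core (x, y) in an x_cores * y_cores mesh
// for a stencil (e.g. convolution) halo exchange over a sharded tensor.
struct CoreXY { uint32_t x, y; };

std::vector<CoreXY> halo_neighbours(uint32_t x, uint32_t y,
                                    uint32_t x_cores, uint32_t y_cores) {
    std::vector<CoreXY> n;
    if (x > 0)           n.push_back({x - 1, y});  // left
    if (x + 1 < x_cores) n.push_back({x + 1, y});  // right
    if (y > 0)           n.push_back({x, y - 1});  // up
    if (y + 1 < y_cores) n.push_back({x, y + 1});  // down
    return n;
}
```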
Tenstorrent architecture is a mesh of cores within a chip and a mesh of chips at the cluster level. TODO: Describe Galaxy, break up the slide into two slides
TODO: Describe SRAM and DRAM as two levels of memory hierarchy within the mesh, both distributed and explicit data movement.
TODO: Describe that TT wins at scale-out, best compute density at the server and cluster level
- "Program cores not threads"
- Bare Metal C/C++ kernels on RISCV
- User Kernels
- Data Movement Kernels
- Compute Kernels
- Ethernet Data Movement Kernels
- Dispatch Kernels
- Interleaved Buffers
- Sharded Buffers
- GPU
- CPU
- FPGA
- TPU
TODO: 1) TOC, write it
TODO: 1) TOC, 2) write it
TODO: this is a placeholder
#include <cstdint>
#include "compute_kernel_api/eltwise_binary.h"
#include "compute_kernel_api/tile_move_copy.h"
namespace NAMESPACE {
void MAIN {
    uint32_t per_core_block_cnt = get_arg_val<uint32_t>(0);
    uint32_t per_core_block_size = get_arg_val<uint32_t>(1); // should be <= 8 in this kernel

    constexpr auto cb_in0 = tt::CB::c_in0;
    constexpr auto cb_in1 = tt::CB::c_in1;
    constexpr auto cb_out0 = tt::CB::c_out0;

    binary_op_init_common(cb_in0, cb_in1, cb_out0);
    add_tiles_init();

    for (uint32_t block = 0; block < per_core_block_cnt; ++block) {
        // wait for a block of tiles in each of input CBs
        cb_wait_front(cb_in0, per_core_block_size);
        cb_wait_front(cb_in1, per_core_block_size);

        tile_regs_acquire(); // acquire 8 tile registers
        // add a block of tiles
        for (uint32_t i = 0; i < per_core_block_size; ++i) {
            add_tiles(cb_in0, cb_in1, i, i, i);
        }
        tile_regs_commit(); // signal the packer

        tile_regs_wait(); // packer waits here
        // pack a block of tiles
        for (uint32_t i = 0; i < per_core_block_size; ++i) {
            pack_tile(i, cb_out0);
        }
        tile_regs_release(); // packer releases

        // pop a block of tiles from each of input CBs
        cb_pop_front(cb_in0, per_core_block_size);
        cb_pop_front(cb_in1, per_core_block_size);

        // push a block of tiles to output CB
        cb_push_back(cb_out0, per_core_block_size);
    }
}
}
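In this compute kernel, each iteration waits for a block of tiles in both input CBs, adds them pairwise into the tile registers, signals the packer, packs the results into the output CB, and then pops the inputs and pushes the output -- repeating until per_core_block_cnt blocks have been processed. A matching pair of data movement kernels keeps the input CBs fed and drains the output CB, following the explicit, decoupled data movement model described above.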