Model | Batch | Hardware | ttft (ms) | t/s/u | Target t/s/u | t/s | Release |
---|---|---|---|---|---|---|---|
Falcon7B-decode | 32 | e150 | | 4.2 | 4.4 | 134.4 | |
Falcon7B | 32 | n150 | 75 | 17.1 | 26 | 547.2 | v0.53.0-rc33 |
Mistral-7B | 32 | n150 | | 9.9 | 25 | 316.8 | v0.51.0-rc28 |
Mamba-2.8B | 32 | n150 | 48 | 12.3 | 41 | 393.6 | v0.51.0-rc26 |
LLaMA-3.1-8B | 1 | n150 | 291 | 22.9 | 23 | 22.9 | v0.53.0-rc16 |
Falcon7B (DP=8) | 256 | QuietBox | 101 | 14.4 | 26 | 3686.4 | v0.53.0-rc33 |
LLaMA-3.1-70B (TP=8) | 32 | QuietBox | 190 | 15.1 | 20 | 483.2 | v0.53.0-rc33 |
Falcon40B (TP=8) | 32 | QuietBox | | 5.3 | 36 | 169.6 | v0.53.0-rc33 |
Mixtral7Bx8 (TP=8) | 32 | QuietBox | 235 | 14.2 | 33 | 454.4 | v0.53.0-rc33 |
Falcon7B (DP=32) | 1024 | Galaxy | 242 | 4.4 | 26 | 4505.6 | v0.53.0-rc33 |
LLaMA-3.1-70B (DP=4, TP=8) | 128 | Galaxy | 190 | 14.3 | 20 | 1835.5 | v0.52.0-rc31 |
Last Update: November 4, 2024
Notes:
- TP = Tensor Parallel, DP = Data Parallel; these define the parallelization factors across multiple devices.
- The reported LLM performance is for an input sequence length (number of rows filled in the KV cache) of 128 for all models except Mamba (which can accept any sequence length).
- The t/s/u reported is the throughput of the first token generated after prefill, i.e., 1 / inter-token latency.
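As a rough illustration of how the LLM columns relate, the sketch below derives aggregate throughput and inter-token latency from the per-user figure. It assumes that t/s/u means tokens per second per user and that t/s equals t/s/u multiplied by the batch size; this relationship is inferred from the table values rather than stated explicitly. The `LlmRow` class and its field names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class LlmRow:
    """One row of the LLM table (hypothetical helper, not part of any TT API)."""
    model: str
    batch: int                      # number of concurrent users
    tokens_per_s_per_user: float    # the t/s/u column

    @property
    def tokens_per_s(self) -> float:
        # Assumption: aggregate throughput is per-user throughput times batch.
        return self.tokens_per_s_per_user * self.batch

    @property
    def inter_token_latency_ms(self) -> float:
        # From the note above: t/s/u = 1 / inter-token latency (in seconds).
        return 1000.0 / self.tokens_per_s_per_user


# Values copied from the table above.
rows = [
    LlmRow("Falcon7B on n150", 32, 17.1),
    LlmRow("Falcon7B (DP=8) on QuietBox", 256, 14.4),
    LlmRow("LLaMA-3.1-8B on n150", 1, 22.9),
]

for row in rows:
    print(
        f"{row.model}: {row.tokens_per_s:.1f} t/s, "
        f"{row.inter_token_latency_ms:.1f} ms between tokens"
    )
```

Under these assumptions, the first two rows reproduce the t/s values listed in the table (547.2 and 3686.4).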
Model | Batch | Hardware | fps | Target fps | Release |
---|---|---|---|---|---|
ResNet-50 (224x224) | 20 | e150 | 5,100 | 10,000 | |
ResNet-50 (224x224) | 16 | n150 | 4,670 | 7,000 | |
ResNet-50 (224x224) (DP=2) | 32 | n300 | 8,200 | 14,000 | |
ResNet-50 (224x224) (DP=8) | 128 | QuietBox | 32,250 | 56,000 | |
ResNet-50 (224x224) (DP=32) | 512 | Galaxy | 95,900 | 224,000 | |
ResNet-50 (224x224) (DP=64) | 1024 | Two Galaxies | 145,000 | 448,000 | |
ViT (224x224) | 9 | e150 | 1,360 | 2,000 | |
ViT (224x224) | 8 | n150 | 912 | 1,600 | |
Stable Diffusion 1.4 (512x512) | 1 | n150 | 0.167 | 0.3 | |
Yolo V4 (320x320) | 1 | n150 | 95 | 300 | |
Model | Batch | Hardware | sen/sec | Target sen/sec | Release |
---|---|---|---|---|---|
BERT-Large | 12 | e150 | 370 | 410 | |
BERT-Large | 8 | n150 | 270 | 400 | |
T5 small | | e150 | 140 | | |
Bloom | | e150 | 70 | | |
For the latest model updates and features, please see MODEL_UPDATES.md
- Advanced Performance Optimizations for Models (updated Oct 24th)
- Programming Mesh of Devices (updated Sept 9th)
- ViT Implementation in TT-NN on GS (updated Sept 22nd)
- LLMs Bring up in TT-NN (updated Oct 29th)
- YOLOv4 Implementation in TT-NN on WH (updated November 8th)
- Matrix Multiply FLOPS on WH (updated November 13th)
TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.
Get started with simple kernels.
- Matrix Engine (updated Sept 6th)
- Data Formats (updated Sept 7th)
- Reconfiguring Data Formats (updated Oct 17th)
- Handling special floating-point numbers (updated Oct 5th)
- Allocator (updated Oct 30th)
- Tensor Layouts (updated Sept 6th)
- Saturating DRAM Bandwidth (updated Sept 6th)
- Flash Attention on Wormhole (updated Sept 6th)
- CNNs on TT Architectures (updated Sept 6th)
- Ethernet and Multichip Basics (updated Sept 20th)
- Collective Communication Library (CCL) (updated Sept 20th)
- Blackhole Bring-Up Programming Guide (updated Oct 30th)