
# LLM Inference

## Table of Contents

- [Glossary](#glossary)
- [Open Source Software](#open-source-software)
- [Paper List](#paper-list)

## Glossary

- **Prompt**: the initial text or instruction given to the model.
- **Prompt Phase (Prefill Phase)**: the phase that processes the whole prompt in one pass and generates the first token.
- **Generation Phase (Decoding Phase)**: generates the next token based on the prompt and the previously generated tokens, in a token-by-token manner.
- **Autoregressive**: predicting one token at a time, conditioned on the previously generated tokens.
- **KV (Key-Value) Cache**: caching the attention Keys and Values during the Generation Phase, eliminating the recomputation of Keys and Values for previous tokens (see the sketch after this list).
- **Continuous Batching**: as opposed to static batching (which groups requests together and starts processing only when all requests in the batch are ready), continuously admits new requests into the running batch and maximizes memory utilization.
- **Offloading**: transferring data between GPU memory and main memory or NVMe storage, since GPU memory is limited.
- **Post-Training Quantization**: quantizing the weights and activations of the model after it has been trained.
- **Quantization-Aware Training**: incorporating quantization considerations during training.
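
To make the prefill/decoding split and the role of the KV cache concrete, here is a minimal sketch of autoregressive generation with a KV cache. The single toy attention layer, its random weights, and greedy sampling are illustrative assumptions, not the API of any library listed below.

```python
# Minimal sketch: prefill over the prompt, then token-by-token decoding,
# reusing cached Keys/Values so earlier positions are never recomputed.
import torch

torch.manual_seed(0)
vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)
wq = torch.nn.Linear(d_model, d_model, bias=False)
wk = torch.nn.Linear(d_model, d_model, bias=False)
wv = torch.nn.Linear(d_model, d_model, bias=False)
lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)

def attend(q, k, v):
    # q: (t, d) for the new tokens; k, v: (T, d) including all cached positions.
    # Causal masking within the prompt is omitted for brevity; only the last
    # position's logits are used below.
    scores = q @ k.T / d_model ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def step(token_ids, k_cache, v_cache):
    # Compute K/V only for the new tokens and append them to the cache.
    x = embed(token_ids)
    k_cache = torch.cat([k_cache, wk(x)], dim=0)
    v_cache = torch.cat([v_cache, wv(x)], dim=0)
    out = attend(wq(x), k_cache, v_cache)
    logits = lm_head(out[-1])            # logits for the next token
    return logits, k_cache, v_cache

prompt = torch.tensor([5, 17, 42])       # prompt / prefill phase: one pass over all prompt tokens
k_cache = torch.empty(0, d_model)
v_cache = torch.empty(0, d_model)
logits, k_cache, v_cache = step(prompt, k_cache, v_cache)

generated = []
for _ in range(8):                       # generation / decoding phase: one new token per step
    next_token = torch.argmax(logits).unsqueeze(0)
    generated.append(next_token.item())
    logits, k_cache, v_cache = step(next_token, k_cache, v_cache)

print("generated token ids:", generated)
```

Note how each decoding step feeds only the single new token through the model; the cache grows by one Key/Value row per step, which is exactly the memory that systems such as vLLM's PagedAttention manage.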

## Open Source Software

| Name | Hardware | Org |
| --- | --- | --- |
| Transformers | CPU / NVIDIA GPU / TPU / AMD GPU | Hugging Face |
| Text Generation Inference | CPU / NVIDIA GPU / AMD GPU | Hugging Face |
| gpt-fast | CPU / NVIDIA GPU / AMD GPU | PyTorch |
| TensorRT-LLM | NVIDIA GPU | NVIDIA |
| vLLM | NVIDIA GPU | UC Berkeley |
| llama.cpp / ggml | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | ggml |
| ctransformers | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | Ravindra Marella |
| DeepSpeed | CPU / NVIDIA GPU | Microsoft |
| FastChat | CPU / NVIDIA GPU / Apple Silicon | lmsys.org |
| MLC-LLM | CPU / NVIDIA GPU | MLC |
| LightLLM | CPU / NVIDIA GPU | SenseTime |
| LMDeploy | CPU / NVIDIA GPU | Shanghai AI Lab & SenseTime |
| OpenLLM | CPU / NVIDIA GPU / AMD GPU | BentoML |
| OpenPPL.nn / OpenPPL.nn.llm | CPU / NVIDIA GPU | OpenMMLab & SenseTime |
| ScaleLLM | NVIDIA GPU | Vectorch |
| RayLLM | CPU / NVIDIA GPU / AMD GPU | Anyscale |
| Xorbits Inference | CPU / NVIDIA GPU / AMD GPU | Xorbits |
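
As a concrete starting point, the snippet below shows plain single-request generation with Hugging Face Transformers (the first entry in the table). The checkpoint name `gpt2` and the prompt text are arbitrary placeholders; any causal LM on the Hub works the same way.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("LLM inference is", return_tensors="pt")
# use_cache=True enables the KV cache during the generation phase.
outputs = model.generate(**inputs, max_new_tokens=20, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The dedicated serving systems in the table (vLLM, TGI, TensorRT-LLM, LMDeploy, ...) wrap this same autoregressive loop behind a server with continuous batching, paged KV-cache memory, and quantized kernels.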

## Paper List

| Name | Paper Title | Venue | Artifact | Keywords | Recommend |
| --- | --- | --- | --- | --- | --- |
| LLaMA | LLaMA: Open and Efficient Foundation Language Models | arXiv 23 | Code | Pre-training | ⭐️⭐️⭐️⭐️⭐️ |
| Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | arXiv 23 | Model | Pre-training / Fine-tuning / Safety | ⭐️⭐️⭐️⭐️ |
| Multi-Query | Fast Transformer Decoding: One Write-Head is All You Need | arXiv 19 | | Architecture | ⭐️⭐️⭐️ |
| Grouped-Query | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | arXiv 23 | | Architecture | ⭐️⭐️⭐️ |
| RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding | arXiv 21 | | Position Encoding | ⭐️⭐️⭐️⭐️ |
| Megatron-LM | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | SC 21 | Code | Parallelism | ⭐️⭐️⭐️⭐️⭐️ |
| Google's Practice | Efficiently Scaling Transformer Inference | MLSys 23 | | Parallelism | ⭐️⭐️⭐️⭐️ |
| FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 22 | Code | Efficient Attention / GPU | ⭐️⭐️⭐️⭐️⭐️ |
| Orca | Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 22 | Code | Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
| PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 23 | Code | Efficient Attention / Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
| FlexGen | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 23 | Code | Offloading | ⭐️⭐️⭐️ |
| Speculative Decoding | Fast Inference from Transformers via Speculative Decoding | ICML 23 | | Sampling | ⭐️⭐️⭐️⭐️ |
| LLM.int8() | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 22 | Code | Quantization | ⭐️⭐️⭐️⭐️ |
| Alpa | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI 22 | Code | Parallelism | ⭐️⭐️⭐️ |
| GPipe | GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism | arXiv 19 | | Parallelism | ⭐️⭐️⭐️⭐️ |
| Beam Search | Beam Search Strategies for Neural Machine Translation | arXiv 17 | | Parallelism | ⭐️⭐️⭐️ |
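
The Speculative Decoding entry above is worth unpacking: a cheap draft model proposes a few tokens, the target model verifies them in one pass, and the accepted prefix matches what the target model alone would have produced. The sketch below shows a simplified greedy variant; the toy lookup-table "models" and the greedy acceptance rule are assumptions for illustration, whereas the paper uses stochastic rejection sampling and also emits a bonus token when every draft token is accepted.

```python
# Simplified greedy speculative decoding: draft proposes k tokens, the target
# keeps the longest prefix that matches its own greedy choices, so the output
# is identical to decoding with the target model alone.
import torch

torch.manual_seed(0)
vocab = 10
target_table = torch.randn(vocab, vocab)                       # toy target model: next-token logits given last token
draft_table = target_table + 0.1 * torch.randn(vocab, vocab)   # toy draft model: a noisy approximation

def greedy_next(table, token):
    return int(torch.argmax(table[token]))

def speculative_decode(seq, k=4, steps=16):
    while len(seq) < steps:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal, last = [], seq[-1]
        for _ in range(k):
            last = greedy_next(draft_table, last)
            proposal.append(last)
        # 2) Target model verifies the proposal (in a real system this is one
        #    batched forward pass over all proposed positions, not a loop).
        accepted, last = [], seq[-1]
        for tok in proposal:
            target_choice = greedy_next(target_table, last)
            if target_choice != tok:
                accepted.append(target_choice)   # replace the first mismatch with the target's token
                break
            accepted.append(tok)
            last = tok
        seq.extend(accepted)
    return seq

print(speculative_decode([0]))
```

When the draft model agrees with the target most of the time, several tokens are accepted per target-model pass, which is where the speedup comes from.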