
# LLM Inference

## Table of Contents

- [Glossary](#glossary)
- [Open Source Software](#open-source-software)
- [Paper List](#paper-list)

## Glossary

- **Prompt**: the initial text or instruction given to the model.
- **Prompt Phase (Prefill Phase)**: the phase that processes the whole prompt in one pass and generates the first token.
- **Generation Phase (Decoding Phase)**: generates the next token based on the prompt and the previously generated tokens, in a token-by-token manner.
- **Autoregressive**: predicting one token at a time, conditioned on the previously generated tokens.
- **KV (Key-Value) Cache**: caching the attention Keys and Values during the Generation Phase, eliminating the recomputation of Keys and Values for previous tokens (see the sketch after this list).
- **Continuous Batching**: as opposed to static batching (which groups requests together and starts processing only when all requests in the batch are ready), continuously admits new requests into the running batch and maximizes memory utilization.
- **Offloading**: transferring data between GPU memory and main memory or NVMe storage, since GPU memory is limited.
- **Post-Training Quantization**: quantizing the weights and activations of the model after it has been trained.
- **Quantization-Aware Training**: incorporating quantization considerations during training.
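
To make the prefill/decoding split and the role of the KV cache concrete, here is a minimal sketch of autoregressive generation with a KV cache. The single toy attention layer, its random weights, and greedy sampling are illustrative assumptions, not the API of any library listed below.

```python
# Minimal sketch: prefill over the prompt, then token-by-token decoding,
# reusing cached Keys/Values so earlier positions are never recomputed.
import torch

torch.manual_seed(0)
vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)
wq = torch.nn.Linear(d_model, d_model, bias=False)
wk = torch.nn.Linear(d_model, d_model, bias=False)
wv = torch.nn.Linear(d_model, d_model, bias=False)
lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)

def attend(q, k, v):
    # q: (t, d) for the new tokens; k, v: (T, d) including all cached positions.
    # Causal masking within the prompt is omitted for brevity; only the last
    # position's logits are used below.
    scores = q @ k.T / d_model ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def step(token_ids, k_cache, v_cache):
    # Compute K/V only for the new tokens and append them to the cache.
    x = embed(token_ids)
    k_cache = torch.cat([k_cache, wk(x)], dim=0)
    v_cache = torch.cat([v_cache, wv(x)], dim=0)
    out = attend(wq(x), k_cache, v_cache)
    logits = lm_head(out[-1])            # logits for the next token
    return logits, k_cache, v_cache

prompt = torch.tensor([5, 17, 42])       # prompt / prefill phase: one pass over all prompt tokens
k_cache = torch.empty(0, d_model)
v_cache = torch.empty(0, d_model)
logits, k_cache, v_cache = step(prompt, k_cache, v_cache)

generated = []
for _ in range(8):                       # generation / decoding phase: one new token per step
    next_token = torch.argmax(logits).unsqueeze(0)
    generated.append(next_token.item())
    logits, k_cache, v_cache = step(next_token, k_cache, v_cache)

print("generated token ids:", generated)
```

Note how each decoding step feeds only the single new token through the model; the cache grows by one Key/Value row per step, which is exactly the memory that systems such as vLLM's PagedAttention manage.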

## Open Source Software

| Name | Hardware | Org |
| --- | --- | --- |
| Transformers | CPU / NVIDIA GPU / TPU / AMD GPU | Hugging Face |
| Text Generation Inference | CPU / NVIDIA GPU / AMD GPU | Hugging Face |
| gpt-fast | CPU / NVIDIA GPU / AMD GPU | PyTorch |
| TensorRT-LLM | NVIDIA GPU | NVIDIA |
| vLLM | NVIDIA GPU | UC Berkeley |
| llama.cpp / ggml | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | ggml |
| ctransformers | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | Ravindra Marella |
| DeepSpeed | CPU / NVIDIA GPU | Microsoft |
| FastChat | CPU / NVIDIA GPU / Apple Silicon | lmsys.org |
| MLC-LLM | CPU / NVIDIA GPU | MLC |
| LightLLM | CPU / NVIDIA GPU | SenseTime |
| LMDeploy | CPU / NVIDIA GPU | Shanghai AI Lab & SenseTime |
| OpenLLM | CPU / NVIDIA GPU / AMD GPU | BentoML |
| OpenPPL.nn / OpenPPL.nn.llm | CPU / NVIDIA GPU | OpenMMLab & SenseTime |
| ScaleLLM | NVIDIA GPU | Vectorch |
| RayLLM | CPU / NVIDIA GPU / AMD GPU | Anyscale |
| Xorbits Inference | CPU / NVIDIA GPU / AMD GPU | Xorbits |
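
As a concrete starting point, the snippet below shows plain single-request generation with Hugging Face Transformers (the first entry in the table). The checkpoint name `gpt2` and the prompt text are arbitrary placeholders; any causal LM on the Hub works the same way.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("LLM inference is", return_tensors="pt")
# use_cache=True enables the KV cache during the generation phase.
outputs = model.generate(**inputs, max_new_tokens=20, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The dedicated serving systems in the table (vLLM, TGI, TensorRT-LLM, LMDeploy, ...) wrap this same autoregressive loop behind a server with continuous batching, paged KV-cache memory, and quantized kernels.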

## Paper List

| Name | Paper Title | Venue | Artifact | Keywords | Recommend |
| --- | --- | --- | --- | --- | --- |
| LLaMA | LLaMA: Open and Efficient Foundation Language Models | arXiv 23 | Code | Pre-training | ⭐️⭐️⭐️⭐️⭐️ |
| Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | arXiv 23 | Model | Pre-training / Fine-tuning / Safety | ⭐️⭐️⭐️⭐️ |
| Multi-Query | Fast Transformer Decoding: One Write-Head is All You Need | arXiv 19 | | Architecture | ⭐️⭐️⭐️ |
| Grouped-Query | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | arXiv 23 | | Architecture | ⭐️⭐️⭐️ |
| RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding | arXiv 21 | | Position Encoding | ⭐️⭐️⭐️⭐️ |
| Megatron-LM | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | SC 21 | Code | Parallelism | ⭐️⭐️⭐️⭐️⭐️ |
| Google's Practice | Efficiently Scaling Transformer Inference | MLSys 23 | | Parallelism | ⭐️⭐️⭐️⭐️ |
| FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 22 | Code | Efficient Attention / GPU | ⭐️⭐️⭐️⭐️⭐️ |
| Orca | Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 22 | Code | Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
| PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 23 | Code | Efficient Attention / Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
| FlexGen | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 23 | Code | Offloading | ⭐️⭐️⭐️ |
| Speculative Decoding | Fast Inference from Transformers via Speculative Decoding | ICML 23 | | Sampling | ⭐️⭐️⭐️⭐️ |
| LLM.int8() | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 22 | Code | Quantization | ⭐️⭐️⭐️⭐️ |
| Alpa | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI 22 | Code | Parallelism | ⭐️⭐️⭐️ |
| GPipe | GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism | arXiv 19 | | Parallelism | ⭐️⭐️⭐️⭐️ |
| Beam Search | Beam Search Strategies for Neural Machine Translation | arXiv 17 | | Parallelism | ⭐️⭐️⭐️ |
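
The Speculative Decoding entry above is worth unpacking: a cheap draft model proposes a few tokens, the target model verifies them in one pass, and the accepted prefix matches what the target model alone would have produced. The sketch below shows a simplified greedy variant; the toy lookup-table "models" and the greedy acceptance rule are assumptions for illustration, whereas the paper uses stochastic rejection sampling and also emits a bonus token when every draft token is accepted.

```python
# Simplified greedy speculative decoding: draft proposes k tokens, the target
# keeps the longest prefix that matches its own greedy choices, so the output
# is identical to decoding with the target model alone.
import torch

torch.manual_seed(0)
vocab = 10
target_table = torch.randn(vocab, vocab)                       # toy target model: next-token logits given last token
draft_table = target_table + 0.1 * torch.randn(vocab, vocab)   # toy draft model: a noisy approximation

def greedy_next(table, token):
    return int(torch.argmax(table[token]))

def speculative_decode(seq, k=4, steps=16):
    while len(seq) < steps:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal, last = [], seq[-1]
        for _ in range(k):
            last = greedy_next(draft_table, last)
            proposal.append(last)
        # 2) Target model verifies the proposal (in a real system this is one
        #    batched forward pass over all proposed positions, not a loop).
        accepted, last = [], seq[-1]
        for tok in proposal:
            target_choice = greedy_next(target_table, last)
            if target_choice != tok:
                accepted.append(target_choice)   # replace the first mismatch with the target's token
                break
            accepted.append(tok)
            last = tok
        seq.extend(accepted)
    return seq

print(speculative_decode([0]))
```

When the draft model agrees with the target most of the time, several tokens are accepted per target-model pass, which is where the speedup comes from.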