diff --git a/README.md b/README.md
index 66913e287..f37549b6c 100644
--- a/README.md
+++ b/README.md
@@ -4,10 +4,10 @@
 items that are not factual. If you find an item that is incorrect, please tag
 as an issue, so we can triage and determine whether to fix, or drop from
 our initial release.*

-# TorchAt *NORTHSTAR*
+# torchat *NORTHSTAR*
 A repo for building and using llama on servers, desktops and mobile.

-The TorchAt repo enables model inference of llama models (and other LLMs) on servers, desktop and mobile devices.
+The torchat repo enables model inference of llama models (and other LLMs) on servers, desktop and mobile devices.
 For a list of devices, see below, under *SUPPORTED SYSTEMS*.

 A goal of this repo, and the design of the PT2 components was to offer seamless integration and consistent workflows.
@@ -29,12 +29,12 @@ Featuring:
 and backend-specific mobile runtimes ("delegates", such as CoreML and Hexagon).

 The model definition (and much more!) is adopted from gpt-fast, so we support the same models. As new models are supported by gpt-fast,
-bringing them into TorchAt should be straight forward. In addition, we invite community contributions
+bringing them into torchat should be straightforward. In addition, we invite community contributions

 # Getting started

 Follow the `gpt-fast` [installation instructions](https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#installation).

-Because TorchAt was designed to showcase the latest and greatest PyTorch 2 features for Llama (and related llama-style) models, many of the features used in TorchAt are hot off the press. [Download PyTorch nightly](https://pytorch.org/get-started/locally/) with the latest steaming hot PyTorch 2 features.
+Because torchat was designed to showcase the latest and greatest PyTorch 2 features for Llama (and related llama-style) models, many of the features used in torchat are hot off the press. [Download PyTorch nightly](https://pytorch.org/get-started/locally/) with the latest steaming hot PyTorch 2 features.

 Install sentencepiece and huggingface_hub
@@ -67,6 +67,10 @@ export MODEL_DOWNLOAD=meta-llama/Llama-2-7b-chat-hf
 While we strive to support a broad range of models, we can't test all models. Consequently, we classify supported models as tested ✅,
 work in progress 🚧 and not tested. We invite community contributions of both new models, as well as test reports.

+Some common models are recognized by torchat based on their filename (via `Transformer.from_name()`). For models not recognized
+by filename, you can construct a model by initializing the `ModelArgs` dataclass that controls model construction from a parameter
+JSON file specified with `--params-path ${PARAMS_PATH}` and containing the appropriate model parameters (see the sketch below the
+table).
+
 | Model | tested | eager | torch.compile | AOT Inductor | ET Runtime | Fits on Mobile |
 |-----|--------|-------|-----|-----|-----|-----|
 tinyllamas/stories15M | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
@@ -81,6 +85,7 @@ codellama/CodeLlama-34b-Python-hf | -| ✅ | ✅ | ✅ | ✅ | ❌ |
 mistralai/Mistral-7B-v0.1 | 🚧 | ✅ | ✅ | ✅ | ✅ | ❹ |
 mistralai/Mistral-7B-Instruct-v0.1 | - | ✅ | ✅ | ✅ | ✅ | ❹ |
 mistralai/Mistral-7B-Instruct-v0.2 | - | ✅ | ✅ | ✅ | ✅ | ❹ |
+Llama3 | 🚧 | ✅ | ✅ | ✅ | ✅ | ❹ |

 *Key:* ✅ works correctly; 🚧 work in progress; ❌ not supported; ❹ requires 4bit groupwise quantization;
 📵 not on mobile phone (may fit some high-end devices such as tablets);
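+For a model that is not recognized by filename, the parameter file and invocation might look like the following sketch.
+The field names and values below are only an assumption (they approximate stories15M); check the `ModelArgs` dataclass in
+model.py for the exact schema.
+```
+# write a hypothetical parameter file describing the model architecture
+cat > ${PARAMS_PATH} <<'EOF'
+{
+  "dim": 288,
+  "n_layers": 6,
+  "n_heads": 6,
+  "vocab_size": 32000
+}
+EOF
+# generate with an explicit parameter file instead of filename-based recognition
+python generate.py --checkpoint-path ${MODEL_PATH} --params-path ${PARAMS_PATH} --prompt "Hello my name is"
+```
+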
@@ -89,10 +94,10 @@ mistralai/Mistral-7B-Instruct-v0.2 | - | ✅ | ✅ | ✅ | ✅ | ❹ |

 ### More downloading

-First cd into TorchAt. 
-We first create a directory for stories15M and download the model and tokenizers.
+First cd into torchat. We first create a directory for stories15M and download the model and tokenizers.
 We show how to download @Andrej Karpathy's stories15M tiny llama-style model that were used in llama2.c.
 Advantageously, stories15M is both a great example and quick to download and run across a range of platforms, ideal for introductions like this
-README and for [testing](https://github.com/pytorch-labs/TorchAt/blob/main/.github/workflows). We will be using it throughout
+README and for [testing](https://github.com/pytorch-labs/torchat/blob/main/.github/workflows). We will be using it throughout
 this introduction as our running example.

 ```
@@ -122,11 +127,11 @@ We use several variables in this example, which may be set as a preparatory step
 name of the directory holding the files for the corresponding model. You *must* follow this convention to ensure correct operation.

-* `MODEL_OUT` is the location where we store model and tokenizer information for a particular model. We recommend `checkpoints/${MODEL_NAME}`
+* `MODEL_DIR` is the location where we store model and tokenizer information for a particular model. We recommend `checkpoints/${MODEL_NAME}`
   or any other directory you already use to store model information.

 * `MODEL_PATH` describes the location of the model. Throughput the description
-  herein, we will assume that MODEL_PATH starts with a subdirectory of the TorchAt repo
+  herein, we will assume that MODEL_PATH starts with a subdirectory of the torchat repo
   named checkpoints, and that it will contain the actual model. In this case, the MODEL_PATH
   will thus be of the form ${MODEL_OUT}/model.{pt,pth}. (Both the extensions `pt` and `pth` are used to describe checkpoints.
   In addition, model may be replaced with the name of the model.)
@@ -143,7 +148,7 @@ You can set these variables as follows for the exemplary model15M model from And
 MODEL_NAME=stories15M
 MODEL_DIR=checkpoints/${MODEL_NAME}
 MODEL_PATH=${MODEL_OUT}/stories15M.pt
-MODEL_OUT=~/TorchAt-exports
+MODEL_OUT=~/torchat-exports
 ```

 When we export models with AOT Inductor for servers and desktops, and Executorch for mobile and edge devices,
@@ -179,13 +184,20 @@ environment:
 ./run ${MODEL_OUT}/model.{so,pte} -z ${MODEL_OUT}/tokenizer.bin
 ```

+### llama3 tokenizer
+
+Add the option to load the tiktoken tokenizer:
+```
+--tiktoken
+```
+
 # Generate Text

 ## Eager Execution

 Model definition in model.py, generation code in generate.py. The model checkpoint may have extensions `pth` (checkpoint and model definition) or `pt` (model checkpoint).
-At present, we always use the TorchAt model for export and import the checkpoint into this model definition
+At present, we always use the torchat model for export and import the checkpoint into this model definition
 because we have tested that model with the export descriptions described herein.

 ```
@@ -223,7 +235,7 @@ quantization to achieve this, as described below.

 We export the model with the export.py script. Running this script requires you first install executorch with pybindings,
 see [here](#setting-up-executorch-and-runner-et). At present, when exporting a model, the export command always uses the
-xnnpack delegate to export. (Future versions of TorchAt will support additional
+xnnpack delegate to export. (Future versions of torchat will support additional
 delegates such as Vulkan, CoreML, MPS, HTP in addition to Xnnpack as they are released for Executorch.)
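+
+As a minimal sketch of this flow (using only flags that appear elsewhere in this README; quantization is omitted here and
+covered below), an ExecuTorch export with the default xnnpack delegate and a subsequent run might look like:
+```
+# export the checkpoint to an ExecuTorch .pte program (xnnpack delegate)
+python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --output-pte-path ${MODEL_OUT}/${MODEL_NAME}.pte
+# run the exported program with the generate driver
+python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}.pte --prompt "Hello my name is"
+```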
@@ -250,8 +262,32 @@ device supported by Executorch, most models need to be compressed to fit in the
 target device's memory. We use quantization to achieve this.

+# llama3 support
+
+How to obtain snapshot (to be filled in when published by Meta, we use internal snapshot).
+
+Enable the llama3 tokenizer with the option `--tiktoken` (see also the discussion under tokenizer).
+
+Enable all export options for llama3 as described below.
+
+Identify and enable a runner/run.cpp with a binary tiktoken tokenizer. (May already be available in OSS.)
+We cannot presently run runner/run.cpp with llama3 until we have a C/C++ tokenizer implementation
+(the initial tiktoken is Python).
+
 # Optimizing your model for server, desktop and mobile devices

+## Model precision (dtype precision setting)
+
+You can specify the precision of the model for both generate and export (with eager, torch.compile, AOTI and ET, for all
+backends; mobile at present will primarily support fp32 with all options) with the `--dtype` option:
+```
+python generate.py --dtype [bf16 | fp16 | fp32] ...
+python export.py --dtype [bf16 | fp16 | fp32] ...
+```
+
+Unlike gpt-fast, which uses bfloat16 as the default, torchat uses float32 as the default. As a consequence you will have to
+pass `--dtype bf16` or `--dtype fp16` on server / desktop for best performance.
+
 ## Making your models fit and execute fast!

 Next, we'll show you how to optimize your model for mobile execution
@@ -260,7 +296,7 @@ AOTI). The basic model build for mobile surfaces two issues: Models quickly
 run out of memory and execution can be slow. In this section, we show
 you how to fit your models in the limited memory of a mobile device,
 and optimize execution speed -- both using quantization. This
-is the `TorchAt` repo after all!
+is the `torchat` repo after all!

 For high-performance devices such as GPUs, quantization provides a way
 to reduce the memory bandwidth required to and take advantage of the
@@ -274,6 +310,9 @@ We can specify quantization parameters with the --quantize option. The
 quantize option takes a JSON/dictionary with quantizers and quantization options.

+Both generate and export (for both ET and AOTI) accept quantization options. We only show a subset of the combinations
+to avoid a combinatorial explosion.
+
 #### Embedding quantization (8 bit integer, channelwise & groupwise)

 *Channelwise quantization*:
@@ -390,27 +429,58 @@ not been optimized for CUDA and CPU targets where the best performnance
 requires a group-wise quantized mixed dtype linear operator.

+#### 4-bit integer quantization (int4)
+To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use
+of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale.
+```
+python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:int4': {'group_size' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
+```
+
+Now you can run your model with the same command as before:
+```
+python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso] --prompt "Hello my name is"
+```
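+
+On server / desktop, the same int4 scheme can be combined with the dtype setting described above. The following is only a
+sketch (group size and paths as in the examples above; adjust for your model):
+```
+# export with bf16 model precision and groupwise int4 weight quantization to an AOT Inductor .dso
+python export.py --checkpoint-path ${MODEL_PATH} --dtype bf16 --quant "{'linear:int4': {'group_size' : 32} }" --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso
+# run the exported library with the generate driver
+python generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso --prompt "Hello my name is"
+```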

 #### 4-bit integer quantization (8da4w)
 To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use
 of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale.
 We also quantize activations to 8-bit, giving this scheme its name (8da4w = 8b dynamically quantized activations with 4b weights), and boost performance.
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:8da4w': {'group_size' : 7} }" --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte
+python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:8da4w': {'group_size' : 7} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
 ```

 Now you can run your model with the same command as before:
 ```
-python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte --prompt "Hello my name is"
+python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso...] --prompt "Hello my name is"
+```
+
+#### Quantization with GPTQ (gptq)
+
+```
+python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:gptq': {'group_size' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ] # may require additional options, check with AO team
 ```
-#### Quantization with GPTQ (8da4w-gptq)
-TBD.
+Now you can run your model with the same command as before:
+```
+python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso...] --prompt "Hello my name is"
+```

-#### Adding additional quantization schemes
+#### Adding additional quantization schemes (hqq)
 We invite contributors to submit established quantization schemes, with accuracy
 and performance results demonstrating soundness.

+# Loading GGUF models
+
+GGUF is a nascent industry-standard format, and we will read fp32, fp16 and some quantized formats (q4_0 and whatever is
+necessary to read llama2_78_q4_0.gguf).
+
+```
+--load_gguf # all other options as described elsewhere, works for generate and export, for all backends, but cannot be used with --quantize
+```
+
+```
+--dequantize_gguf
+```
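+
+As a sketch of typical usage (assuming `--load_gguf` takes the path of the GGUF file; the path below is illustrative):
+```
+# read a GGUF checkpoint directly; other generate options apply as usual, but --quantize cannot be combined with --load_gguf
+python generate.py --load_gguf ${MODEL_DIR}/llama2_78_q4_0.gguf --prompt "Hello my name is"
+```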