diff --git a/README.md b/README.md
index 58f517056..ab82344f8 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,14 @@
-# LLM Runtime
+# Neural Speed

-LLM Runtime is designed to provide the efficient inference of large language models (LLMs) on Intel platforms through the state-of-the-art (SOTA) model compression techniques. The work is highly inspired from [llama.cpp](https://github.com/ggerganov/llama.cpp), which organizes almost all the core code (e.g., kernels) in a single big file with a large number of pre-defined macros, thus making it not easy for developers to support a new model. Our LLM Runtime has the following features:
+Neural Speed is designed to provide efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) model compression techniques. The work is highly inspired by [llama.cpp](https://github.com/ggerganov/llama.cpp), which organizes almost all of the core code (e.g., kernels) in a single large file with many pre-defined macros, making it difficult for developers to support a new model. Neural Speed has the following features:

- Modular design to support new models
-- [Highly optimized low precision kernels](core/README.md)
+- [Highly optimized low precision kernels](neural_speed/core/README.md)
- Utilize AMX, VNNI, AVX512F and AVX2 instruction set
- Support CPU (x86 platforms only) and Intel GPU (WIP)
- Support 4bits and 8bits quantization

-> LLM Runtime is under active development so APIs are subject to change.
+> Neural Speed is under active development, so APIs are subject to change.

## Supported Hardware
| Hardware | Optimization |
@@ -22,7 +22,7 @@ LLM Runtime is designed to provide the efficient inference of large language mod

## Supported Models

-LLM Runtime supports the following models:
+Neural Speed supports the following models:

### Text Generation

@@ -198,211 +198,71 @@ LLM Runtime supports the following models:
-## How to Use -There are two methods for utilizing the LLM runtime: -- [Transformer-based API](#How-to-use-Transformer-based-API) -- [Straightforward Python script](#How-to-use-Straightforward-Python-script) +## Install -## How to use: Transformer-based API -### 1. Install -Install from binary +### Build Python package ```shell -pip install intel-extension-for-transformers -pip install -r requirements.txt # under graph folder +pip install . ``` -> Some models only support specific versions of transformers. Please refer to the table above or official documentation. -### 2. Run LLM with Transformer-based API +### Build executable only -You can use Python API to run Hugging Face model simply. Here is the sample code: -```python -from transformers import AutoTokenizer, TextStreamer -from intel_extension_for_transformers.transformers import AutoModelForCausalLM -model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model -prompt = "Once upon a time, there existed a little girl," - -tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) -inputs = tokenizer(prompt, return_tensors="pt").input_ids -streamer = TextStreamer(tokenizer) +```shell +# Linux and WSL +git submodule update --init --recursive +mkdir build +cd build +cmake .. -G Ninja +ninja +``` -model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True) -outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300) +```powershell +# Windows +# Install VisualStudio 2022 and open 'Developer PowerShell for VS 2022' +mkdir build +cd build +cmake .. +cmake --build . -j --config Release ``` -To directly load a GPTQ model, here is the sample code: -```python -from transformers import AutoTokenizer, TextStreamer -from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig +## How to Use +There are two methods for utilizing the Neural Speed: +- [Transformer-based API](#How-to-use-Transformer-based-API) +- [Straightforward Python script](#How-to-use-Python-script) -# Download Hugging Face GPTQ model to local path -model_name = "PATH_TO_MODEL" # local path to model -woq_config = WeightOnlyQuantConfig(use_gptq=True) -prompt = "Once upon a time, a little girl" -tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) -inputs = tokenizer(prompt, return_tensors="pt").input_ids -streamer = TextStreamer(tokenizer) -model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True) -outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300) -``` +## How to use: Transformer-based API + +> Please refer to [intel extension for transformers](https://github.com/intel/intel-extension-for-transformers) for detailed usage. + +### 1. Basic usage of Running LLM with Transformer-based API -To enable [StreamingLLM for infinite inference](./docs/infinite_inference.md), here is the sample code: +You can use Python API to run Hugging Face model simply. 
Here is the sample code: ```python from transformers import AutoTokenizer, TextStreamer -from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig +from intel_extension_for_transformers.transformers import AutoModelForCausalLM model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model -woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4") prompt = "Once upon a time, there existed a little girl," tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) inputs = tokenizer(prompt, return_tensors="pt").input_ids streamer = TextStreamer(tokenizer) -model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config) - -# Paper: https://arxiv.org/pdf/2309.17453.pdf -# Recommend n_keep=4 to do attention sinks (four initial tokens) and n_discard=-1 to drop half rencetly tokens when meet length threshold -outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300, ctx_size=100, n_keep=4, n_discard=-1) -``` - -https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b - -Argument description of WeightOnlyQuantConfig ([supported MatMul combinations](#supported-matrix-multiplication-data-types-combinations)): -| Argument | Type | Description | -| -------------- | ---------- | ----------------------------------------------------------------------- | -| compute_dtype | String | Data type of Gemm computation: int8/bf16/fp16/fp32 (default: fp32) | -| weight_dtype | String | Data type of quantized weight: int4/int8/fp8(=fp8_e4m3)/fp8_e5m2/fp4(=fp4_e2m1)/nf4 (default int4) | -| alg | String | Quantization algorithm: sym/asym (default sym) | -| group_size | Int | Group size: Int, 32/128/-1 (per channel) (default: 32) | -| scale_dtype | String | Data type of scales: fp32/bf16/fp8 (default fp32) | -| use_ggml | Bool | Enable ggml for quantization and inference (default: False) | -| use_quant | Bool | Determine whether or not the model will be quantized. 
(default: True) | - -Argument description of generate function: -| Argument | Type | Description | -| -------------- | ---------- | ----------------------------------------------------------------------- | -| inputs | Lists[Int] | Input ids after tokenizer | -| interactive | Bool | Interactive mode, use history commands when True (default: False) | -| n_keep | Int | Number of tokens to keep from the initial prompt (default: 0, -1 = all) | -| n_discard | Int | Number of tokens will be discarded (default: -1, -1 = half of tokens will be discarded) | -| shift_roped_k | Bool | Use ring-buffer and thus do not re-computing after reaching ctx_size (default: False) | -| ignore_prompt | Bool | Generate outputs w/o prompt (default: False) | -| batch_size | Int | Batch size for prompt processing (default: 512) | -| ctx_size | Int | Size of the prompt context (default: 512) | -| seed | Int | NG seed (default: -1, use random seed for < 0) | -| threads | Int | Number of threads to use during computation (default: min(available_core_num, OMP_NUM_THREADS)) | -| memory_dtype | str | Data type of the KV memory; one of f16, f32, auto (enables Fused Attention when possible otherwise fallback to f16) (default: auto) | -| repetition_penalty| Float | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| num_beams | Int | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| do_sample | Int | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| top_k | Int | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| top_p | Int | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| temperature | Float | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| min_new_tokens | Int | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| length_penalty | Float | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| early_stopping | Bool | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| max_new_tokens | Int | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| streamer | Class | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| stopping_criteria | Class | Please refer to [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | -| pad_token | Int | pad_token_id of [Transformer's generate](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/text_generation#generation) | - -### 3. 
Multi-Round Chat - -Chat with LLaMA2: -```python -from transformers import AutoTokenizer, TextStreamer -from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig - -# Please change to local path to model, llama2 does not support online conversion, currently. -model_name = "meta-llama/Llama-2-7b-chat-hf" -woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4") -tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) -streamer = TextStreamer(tokenizer) -model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True) - -while True: - prompt = input("> ").strip() - if prompt == "quit": - break - b_prompt = "[INST]{}[/INST]".format(prompt) # prompt template for llama2 - inputs = tokenizer(b_prompt, return_tensors="pt").input_ids - outputs = model.generate(inputs, streamer=streamer, interactive=True, ignore_prompt=True, do_sample=True) -``` - -Chat with ChatGLM2: -```python -from transformers import AutoTokenizer, TextStreamer -from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig - -model_name = "THUDM/chatglm2-6b" # or local path to model -woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4") -tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) -streamer = TextStreamer(tokenizer) -model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True) - -while True: - prompt = input("> ").strip() - if prompt == "quit": - break - prompt = tokenizer.build_prompt(prompt) # prompt template for chatglm2 - inputs = tokenizer([prompt], return_tensors="pt").input_ids - outputs = model.generate(inputs, streamer=streamer, interactive=True, ignore_prompt=True, do_sample=True, n_keep=2) +model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True) +outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300) ``` -Chat with Qwen: -```python -from transformers import AutoTokenizer, TextStreamer -from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig - -model_name = "Qwen/Qwen-7B-Chat" # or local path to model -woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4") -tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) -streamer = TextStreamer(tokenizer) -model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True) - -while True: - prompt = input("> ").strip() - if prompt == "quit": - break - prompt = "\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n".format(prompt) # prompt template for qwen - inputs = tokenizer([prompt], return_tensors="pt").input_ids - outputs = model.generate(inputs, streamer=streamer, interactive=True, ignore_prompt=True, do_sample=True) -``` ## How to use: Python script -Install from binary -```shell -pip install intel-extension-for-transformers -``` -Build from source -> :warning: **If you want to use ```from_pretrain``` API**: please follow [Transformer-based API](#How-to-use-Transformer-based-API) -```shell -# Linux and WSL -# make sure your path is in intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph folder -git submodule update --init --recursive -mkdir build -cd build -cmake .. 
-G Ninja -ninja -``` - -```powershell -# Windows -# Install VisualStudio 2022 and open 'Developer PowerShell for VS 2022' -# make sure your path is in intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph folder -mkdir build -cd build -cmake .. -cmake --build . -j --config Release -``` +> warning: **If you want to use ```from_pretrain``` API**: please follow [Transformer-based API](#How-to-use-Transformer-based-API) ### 1. Run LLM with Python Script You can run LLM with one-click python script including conversion, quantization and inference. ``` -python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see" +python neural_speed/scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see" ``` Argument description of run.py ([supported MatMul combinations](#supported-matrix-multiplication-data-types-combinations)): @@ -428,32 +288,32 @@ Argument description of run.py ([supported MatMul combinations](#supported-matri ## Advanced Usage -Besides the one-click script, LLM Runtime also offers the detailed script: 1) convert and quantize, and 2) inference. +Besides the one-click script, Neural Speed also offers the detailed script: 1) convert and quantize, and 2) inference. ### 1. Convert and Quantize LLM -LLM Runtime assumes the compatible model format as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can also convert the model by following the below steps: +Neural Speed assumes the compatible model format as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can also convert the model by following the below steps: ```bash # convert the model directly use model id in Hugging Face. (recommended) -python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b +python neural_speed/scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b # or you can download fp32 model (e.g., LLAMA2) from Hugging Face at first, then convert the pytorch model to ggml format. git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf -python scripts/convert.py --outtype f32 --outfile ne-f32.bin model_path +python neural_speed/scripts/convert.py --outtype f32 --outfile ne-f32.bin model_path -# To convert model with PEFT(Parameter-Efficient Fine-Tuning) adapter, you need to merge the PEFT adapter into the model first, use below command to merge the PEFT adapter and save the merged model, afterwards you can use 'scripts/convert.py' just like above mentioned. -python scripts/load_peft_and_merge.py --model_name_or_path meta-llama/Llama-2-7b-hf --peft_name_or_path dfurman/llama-2-7b-instruct-peft --save_path ./Llama-2-7b-hf-instruct-peft +# To convert model with PEFT(Parameter-Efficient Fine-Tuning) adapter, you need to merge the PEFT adapter into the model first, use below command to merge the PEFT adapter and save the merged model, afterwards you can use 'neural_speed/scripts/convert.py' just like above mentioned. 
+python neural_speed/scripts/load_peft_and_merge.py --model_name_or_path meta-llama/Llama-2-7b-hf --peft_name_or_path dfurman/llama-2-7b-instruct-peft --save_path ./Llama-2-7b-hf-instruct-peft # quantize weights of fp32 ggml bin # model_name: llama, llama2, mpt, falcon, gptj, starcoder, dolly # optimized INT4 model with group size 128 (recommended) -python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 128 --compute_dtype int8 +python neural_speed/scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 128 --compute_dtype int8 # Alternativly you could run ggml q4_0 format like following -python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_0.bin --weight_dtype int4 --use_ggml +python neural_speed/scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_0.bin --weight_dtype int4 --use_ggml # optimized INT4 model with group size 32 -python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 32 --compute_dtype int8 +python neural_speed/scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 32 --compute_dtype int8 ``` Argument description of quantize.py ([supported MatMul combinations](#supported-matrix-multiplication-data-types-combinations)): @@ -473,7 +333,7 @@ Argument description of quantize.py ([supported MatMul combinations](#supported- #### Supported Matrix Multiplication Data Types Combinations -Our LLM runtime supports INT4 / INT8 / FP8 (E4M3, E5M2) / FP4 (E2M1) / NF4 weight-only quantization and FP32 / FP16 / BF16 / INT8 computation forward matmul on the Intel platforms. Here are the all supported data types combinations for matmul operations (quantization and forward). +Our Neural Speed supports INT4 / INT8 / FP8 (E4M3, E5M2) / FP4 (E2M1) / NF4 weight-only quantization and FP32 / FP16 / BF16 / INT8 computation forward matmul on the Intel platforms. Here are the all supported data types combinations for matmul operations (quantization and forward). > This table will be updated frequently due to active development | Weight dtype | Compute dtype (default value if missing or wrong setting) | Scale dtype (default if missing or wrong setting) | algo (default if missing or wrong setting) | @@ -495,17 +355,17 @@ We provide LLM inference script to run the quantized model. Please reach [us](ma # please type prompt about codes when run `StarCoder`, for example, -p "def fibonnaci(". 
#Linux and WSL -OMP_NUM_THREADS= numactl -m 0 -C 0- python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t --color -p "She opened the door and see" +OMP_NUM_THREADS= numactl -m 0 -C 0- python neural_speed/scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t --color -p "She opened the door and see" # if you want to generate fixed outputs, please set --seed arg, for example: -OMP_NUM_THREADS= numactl -m 0 -C 0- python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t --color -p "She opened the door and see" --seed 12 +OMP_NUM_THREADS= numactl -m 0 -C 0- python neural_speed/scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t --color -p "She opened the door and see" --seed 12 # if you want to reduce repeated generated texts, please set --repeat_penalty (value > 1.0, default = 1.0), for example: -OMP_NUM_THREADS= numactl -m 0 -C 0- python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t --color -p "She opened the door and see" --repeat_penalty 1.2 +OMP_NUM_THREADS= numactl -m 0 -C 0- python neural_speed/scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t --color -p "She opened the door and see" --repeat_penalty 1.2 #Windows #Recommend to build and run our project in WSL to get a better and stable performance -python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t --color -p "She opened the door and see" +python neural_speed/scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t --color -p "She opened the door and see" ``` Argument description of inference.py: @@ -574,6 +434,3 @@ stopping_criteria = StoppingCriteriaList( outputs = model.generate(inputs, streamer=streamer, stopping_criteria=stopping_criteria) ``` - -### 6. Perplexity (measuring model quality) -You can use the [scripts/perplexity.py](./scripts/perplexity.py) script to over a given (subset of) dataset. Run `python scripts/perplexity.py --help` for detailed usage. For more infomation of the perplexity metric, see https://huggingface.co/docs/transformers/perplexity. diff --git a/neural_speed/scripts/clang-format.py b/clang-format.py similarity index 95% rename from neural_speed/scripts/clang-format.py rename to clang-format.py index 5f1743bd0..fcb1e8a43 100644 --- a/neural_speed/scripts/clang-format.py +++ b/clang-format.py @@ -60,7 +60,7 @@ def parse_args(argv=None): if __name__ == '__main__': if len(sys.argv) == 1: - args = parse_args(['', '--dirs', 'core', 'models', 'vectors', 'application']) + args = parse_args(['', '--dirs', 'neural_speed', 'bestla']) else: args = parse_args() clang_format_dir(args) diff --git a/developer_document.md b/developer_document.md index 64c7db381..12eb9bde9 100644 --- a/developer_document.md +++ b/developer_document.md @@ -6,7 +6,7 @@ However, LLM inference thing is complicated. It may have its own: 1. special tok For simplicity, we take [polyglot](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) as the example model. It has the same architecture as `GPT-NEOX` but only fewer layers. 
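If you want to double-check that architecture claim before porting, it can be read directly off the Hugging Face config. This is only an illustrative sketch and not one of the enabling steps that follow:

```python
from transformers import AutoConfig

# polyglot reuses the GPT-NeoX architecture, just with fewer layers
cfg = AutoConfig.from_pretrained("EleutherAI/polyglot-ko-5.8b")
print(cfg.model_type)         # expected: "gpt_neox"
print(cfg.num_hidden_layers)  # fewer layers than gpt-neox-20b
```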
-Firstly, we need to add its temp buffer in its [related model-arch header file](https://github.com/intel/intel-extension-for-transformers/blob/1.2.1/intel_extension_for_transformers/llm/runtime/graph/models/gptneox/gptneox.h) and [re-compile](https://github.com/intel/intel-extension-for-transformers/blob/1.2.1/intel_extension_for_transformers/llm/runtime/graph/README.md#1-install-llm-runtime). +Firstly, we need to add its temp buffer in its [related model-arch header file](neural_speed/models/gptneox/gptneox.h) and [re-compile](README.md#Install). ```diff static const model_scratch gptneox_mem_req(int n_layers) { switch (n_layers) { @@ -65,7 +65,7 @@ she open the door and see him. She looks at him and says, "How do you do?" He sa Once you make sure your model has the same generated tokens as PyTorch, you can deploy it by using low-bits precision like `INT4` data type and customized acceleration. Please refer to `Python API` section for more details. -# Enable graph cpp model process +# Enable cpp model process We enable a CPP model in the following four steps. ```mermaid @@ -80,11 +80,11 @@ graph LR; We need to implement corresponding serialization methods from pytorch format, which is mainly divided into the following three steps. ## 1.1. Hyperparamters -The term **"hyperparamters"** describes a value that is used to configure the behavior of a large language model; this is in contrast to the model's parameters, which are the weight that were derived in the training process that was used to create the model. Each model defines its own hyperparameter structure that defines the hyperparameter values accepted by that model. Valid ITREX graph files must list these values in the correct order, and each value must be represented using the correct data type. Although hyperparameters are different across models, some attributes appear in the hyperparameters for most models: +The term **"hyperparamters"** describes a value that is used to configure the behavior of a large language model; this is in contrast to the model's parameters, which are the weight that were derived in the training process that was used to create the model. Each model defines its own hyperparameter structure that defines the hyperparameter values accepted by that model. Valid ITREX model files must list these values in the correct order, and each value must be represented using the correct data type. Although hyperparameters are different across models, some attributes appear in the hyperparameters for most models: - n_vocab: the size of the model's vocabulary - n_embd: the size of the model's " embedding layer", which is used during prompt ingestion. - n_layer: the number of layers in the model; each layer represents a set of weights. -Here we will use [convert_gptneox.py](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/scripts/convert_gptneox.py#L96) as an example, +Here we will use [convert_gptneox.py](neural_speed/scripts/convert_gptneox.py#L96) as an example, ```python fout.write(struct.pack("i", hparams["num_attention_heads"])) fout.write(struct.pack("i", hparams.get("n_head_kv", 0))) # multi-query attention @@ -96,7 +96,7 @@ The above `fout` is the file we need to get, and the `num_attention`, `n_head_kv As the name implies, a model's vocabulary comprises components that are used by the model to generate language (text). 
However, unlike the vocabulary of a human, which consists of words, the vocabulary of a large language model consists of "tokens". A token can be an entire word, but oftentimes they are word fragments. Just like humans can compose millions of words from just a dozen or two letters, large language models use tokens to express a large number of words from a relatively smaller number of components. Consider a vocabulary with the following tokens: `whi`, `ch`, `le`, `who`, and `a`; this vocabulary can be used to create the English words `"which"`, `"while"`, `"who"`, `"a"`, and `"leach"`. How would the behavior change if the model contained the following tokens: `wh`, `ich`, `ile`, `o`, and `leach`? Choices such as these allow model-creators to tune the behavior and performance of their models. As described above, the model's hyperparameters typically contain a value that specifies the number of tokens in the vocabulary. The vocabulary is encoded as a list of tokens, each of which includes a 32-bit integer that specifies the length of the token. If your model has some new tokenizers, we suggest using a python tokenizer from transformers and feeding the input_ids to model Python API (python example in scripts folder) -Here we will use [convert_gptneox.py](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/scripts/convert_gptneox.py#L122) as an example to processed the vocabulary of gptneox and written it into `fout`. +Here we will use [convert_gptneox.py](neural_speed/scripts/convert_gptneox.py#L122) as an example to processed the vocabulary of gptneox and written it into `fout`. ```python encoder = tokenizer.vocab encoder.update(tokenizer.get_added_vocab()) @@ -108,7 +108,7 @@ byte_decoder = {v:k for k, v in byte_encoder.items()} Finally, and largest, component of a ITREX GRAPH file is the weights of the LLM that the file represents. Abstractly, a large language model is software that is used to generate language - just like software that is used to generate images can be improved by increasing the number of colors with which images can be rendered, large language models can be improved by increasing the number of weights in the model. The total number of weights in a model is referred to as the "size" of that model. For example, the dolly-v2-3b implementation of the gpt-neox-20b language model architecture is available in several sizes, like 3B and 20B, which stand for 3 billion and 20 billion, respectively. These numbers refer to the total number of weights in that model. As described in the hyperparameters section, weights are grouped in sets called "layers", which, like hyperparameters, have structures that are uniquely defined by the model architecture; within a layer, weights are grouped in structures called "tensors". So, for instance, both dolly-v2-3B and gpt-neox-20B use layers that comprise the same tensors, but dolly-v2-3B has relatively fewer layers when compared to gpt-neox-20B. -Here we will use [convert_gptneox.py](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/scripts/convert_gptneox.py#L149) as an example to convert model weights to `fout`. +Here we will use [convert_gptneox.py](neural_speed/scripts/convert_gptneox.py#L149) as an example to convert model weights to `fout`. ```python fout.write(struct.pack("iii", n_dims, len(str), ftype_cur)) for i in range(n_dims): @@ -120,7 +120,7 @@ data.tofile(fout) # 2. Model enablements ## 2.1. 
Model loading -- Model type: Refers to the type of the model, This can be compared to the model type in the Transformers library, we can see model_class in [model_type.h](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/model_utils/model_types.h#L68), here defines the basic properties of an ITREX graph model, including model_hparams, model_layer, model_struct.etc. If you have a new cpp model you should update [model_archs](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/model_utils/model_types.h#L68). +- Model type: Refers to the type of the model, This can be compared to the model type in the Transformers library, we can see model_class in [model_type.h](neural_speed/models/model_utils/model_types.h#L68), here defines the basic properties of an neural speed model, including model_hparams, model_layer, model_struct.etc. If you have a new cpp model you should update [model_archs](neural_speed/models/model_utils/model_types.h#L68). ```diff enum model_archs { MODEL_UNKNOWN, @@ -138,7 +138,7 @@ enum model_archs { + MODEL_NEW }; ``` -and update [model_name_to_arch()](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/model_utils/model_types.h#L395). +and update [model_name_to_arch()](neural_speed/models/model_utils/model_types.h#L395). ```diff private: model_name_to_arch() {} @@ -154,7 +154,7 @@ and update [model_name_to_arch()](https://github.com/intel/intel-extension-for-t + {"baichuan", MODEL_BAICHUAN}},{"new_model", MODEL_NEW_MODEL}}; }; ``` -- Set buffer size: we need to set the corresponding buffer size in model.h according to the size of parameters for the model, just like [gptneox.h](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/gptneox/gptneox.h), you should update [enum gptneox_model](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/gptneox/gptneox.h#L21), [model_scratch](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/gptneox/gptneox.h#L26) and [model class](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/gptneox/gptneox.h#L39). +- Set buffer size: we need to set the corresponding buffer size in model.h according to the size of parameters for the model, just like [gptneox.h](neural_speed/models/gptneox/gptneox.h), you should update [enum gptneox_model](neural_speed/models/gptneox/gptneox.h#L21), [model_scratch](neural_speed/models/gptneox/gptneox.h#L26) and [model class](neural_speed/models/gptneox/gptneox.h#L39). ```diff +#ifndef NEW_MODEL_H +#define NEW_MODEL_H @@ -193,13 +193,13 @@ and update [model_name_to_arch()](https://github.com/intel/intel-extension-for-t +#endif // NEW_MODEL_H ``` -- Model_load_internal: This function include model init and model load, The [model init function](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/gptneox/gptneox_utils.cpp#L42) initializes the model's hyperparameter, such as `n_layer` and `n_embd parameters`. 
+- Model_load_internal: This function include model init and model load, The [model init function](neural_speed/models/gptneox/gptneox_utils.cpp#L42) initializes the model's hyperparameter, such as `n_layer` and `n_embd parameters`. ```cpp n_embd = hparams.n_embd; n_vocab = hparams.n_vocab; n_layer = hparams.n_layer; ``` -The weights of the model in the ITREX Graph file will be loaded in [model load function](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/gptneox/gptneox_utils.cpp#L71). Here, we'll re-read some of the parameters and weights of the converted binary,include ffn, attention, and norm weight and bias, We'll use the mapping between the name and the weight to read the weight we need. It is shown below. +The weights of the model in the ITREX Graph file will be loaded in [model load function](neural_speed/models/gptneox/gptneox_utils.cpp#L71). Here, we'll re-read some of the parameters and weights of the converted binary,include ffn, attention, and norm weight and bias, We'll use the mapping between the name and the weight to read the weight we need. It is shown below. ```cpp model.others[0] = ml->get_tensor("gpt_neox.embed_in.weight", {n_embd, n_vocab}, NE_BACKEND_CPU); model.others[1] = ml->get_tensor("gpt_neox.final_layer_norm.weight", {n_embd}, NE_BACKEND_CPU); @@ -212,7 +212,7 @@ So when enabling a new model, we should implement the `new_model_utils.cpp` of t ## 2.2. Inference process -- Model_eval_internal: This function can be equivalent to the forward process in pytorch, which has the same computational process. In [gptneox.cpp](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/gptneox/gptneox.cpp), the model_eval_internal here will perform a complete operation on the input values, such as ffn, layernorm, mha, etc. Here's a layernorm operation: +- Model_eval_internal: This function can be equivalent to the forward process in pytorch, which has the same computational process. In [gptneox.cpp](neural_speed/models/gptneox/gptneox.cpp), the model_eval_internal here will perform a complete operation on the input values, such as ffn, layernorm, mha, etc. Here's a layernorm operation: ```cpp cur = ne_norm(ctx0, inpL); cur = ne_add(ctx0, ne_mul(ctx0, ne_repeat(ctx0, model.layers[il].norm[0], cur), cur), @@ -299,7 +299,7 @@ Most of our model examples only support single prompt processing. You need to ad ``` ## 2.3. Application -- Q4_0 quant : We can quantize the model generated by convert by adding a quant layer class to quantize it into an int4 low-bit file, so as to obtain better inference performance. Register quant layer class in your new_model_utils.cpp, just like [gptneox_utils.cpp](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/gptneox/gptneox_utils.cpp#L163), replace `gptneox_quant_layer` to your `new_model_quant_layer`. +- Q4_0 quant : We can quantize the model generated by convert by adding a quant layer class to quantize it into an int4 low-bit file, so as to obtain better inference performance. Register quant layer class in your new_model_utils.cpp, just like [gptneox_utils.cpp](neural_speed/models/gptneox/gptneox_utils.cpp#L163), replace `gptneox_quant_layer` to your `new_model_quant_layer`. 
```diff +class new_quant_layer : public quant_layer_base { + public: @@ -320,7 +320,7 @@ Most of our model examples only support single prompt processing. You need to ad +}; +REGISTER_QUANT_LAYER_CLASS(new_model); ``` -- Add new CMakeLists.txt: We need to add the newly added model to the following CMakeList.txt. New model CMakeList.txt just like [gptneox_CMakeLists.txt](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/gptneox/CMakeLists.txt), +- Add new CMakeLists.txt: We need to add the newly added model to the following CMakeList.txt. New model CMakeList.txt just like [gptneox_CMakeLists.txt](neural_speed/models/gptneox/CMakeLists.txt), ```diff +set(TARGET new_model) +add_library_w_warning(${TARGET} new_model.cpp new_model_utils.cpp ${MODEL_UTILS_SOURCE}) @@ -328,7 +328,7 @@ Most of our model examples only support single prompt processing. You need to ad +set_target_properties(${TARGET} PROPERTIES POSITION_INDEPENDENT_CODE ON) +target_link_libraries(${TARGET} PUBLIC ne_layers jblas::jblas) ``` - and and new_model to [models_CMakeLists.txt](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/models/CMakeLists.txt). + and and new_model to [models_CMakeLists.txt](neural_speed/models/CMakeLists.txt). ```diff add_subdirectory(opt) add_subdirectory(bloom) @@ -339,22 +339,22 @@ add_subdirectory(baichuan) ## 2.4. Python API -We support binding LLM runtime to transformer-based Python API, which is more convenient for customers to use. You need to modify the following files. -Please refer to [install-from-source](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/installation.md#install-from-source) and [how-to-use-transformer-based-api](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/README.md#how-to-use-transformer-based-api) of using Python API. +We support binding Neural Speed to transformer-based Python API, which is more convenient for customers to use. You need to modify the following files. +Please refer to [install-from-source](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/installation.md#install-from-source) and [how-to-use-transformer-based-api](neural_speed/README.md#how-to-use-transformer-based-api) of using Python API. > The Python API will automatically call the convert script and quantization script to convert the hugging face model into a quantified model. Please ensure that the scripts have been added. 
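For orientation, once the bindings described below are in place, the new model is consumed through the same transformer-based API shown in the README above. This is a minimal sketch, where `PATH_TO_NEW_MODEL` is a placeholder for the newly enabled model:

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "PATH_TO_NEW_MODEL"  # placeholder: local path or Hugging Face id of the newly enabled model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# load_in_4bit follows the README example; conversion and quantization happen automatically as noted above
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, trust_remote_code=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```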
Files need to be modified: -- `intel_extension_for_transformers/llm/runtime/graph/application/CMakeLists.txt` -- `intel_extension_for_transformers/llm/runtime/graph/application/main_pybind.cpp` -- `intel_extension_for_transformers/llm/runtime/graph/__init__.py` +- `neural_speed/application/CMakeLists.txt` +- `neural_speed/application/main_pybind.cpp` +- `neural_speed/__init__.py` If `new_model` will be added, modify the code as follows: ```diff -diff --git a/intel_extension_for_transformers/llm/runtime/graph/__init__.py b/intel_extension_for_transformers/llm/runtime/graph/__init__.py +diff --git a/neural_speed/__init__.py b/neural_speed/__init__.py index aaeab8d16a..12a835e652 100644 ---- a/intel_extension_for_transformers/llm/runtime/graph/__init__.py -+++ b/intel_extension_for_transformers/llm/runtime/graph/__init__.py +--- a/neural_speed/__init__.py ++++ b/neural_speed/__init__.py @@ -57,6 +57,8 @@ class Model: import neural_speed.baichuan_cpp as cpp_model elif model_name == "polyglot": @@ -364,10 +364,10 @@ index aaeab8d16a..12a835e652 100644 else: raise TypeError("Unspported model type {}!".format(model_name)) self.module = cpp_model -diff --git a/intel_extension_for_transformers/llm/runtime/graph/application/CMakeLists.txt b/intel_extension_for_transformers/llm/runtime/graph/application/CMakeLists.txt +diff --git a/neural_speed/application/CMakeLists.txt b/neural_speed/application/CMakeLists.txt index d86107d26e..36d30cabe3 100644 ---- a/intel_extension_for_transformers/llm/runtime/graph/application/CMakeLists.txt -+++ b/intel_extension_for_transformers/llm/runtime/graph/application/CMakeLists.txt +--- a/neural_speed/application/CMakeLists.txt ++++ b/neural_speed/application/CMakeLists.txt @@ -67,6 +67,7 @@ compile_quant(quant_chatglm quant_model.cpp chatglm chatglm) compile_quant(quant_chatglm2 quant_model.cpp chatglm2 chatglm2) compile_quant(quant_baichuan quant_model.cpp baichuan baichuan) @@ -389,10 +389,10 @@ index d86107d26e..36d30cabe3 100644 compile_run(run_baichuan main_run.cpp baichuan baichuan) compile_run(run_mistral main_run.cpp mistral llama) +compile_run(run_new_model main_run.cpp new_model new_model) -diff --git a/intel_extension_for_transformers/llm/runtime/graph/application/main_pybind.cpp b/intel_extension_for_transformers/llm/runtime/graph/application/main_pybind.cpp +diff --git a/neural_speed/application/main_pybind.cpp b/neural_speed/application/main_pybind.cpp index 894be0134d..a9a57c0a9e 100644 ---- a/intel_extension_for_transformers/llm/runtime/graph/application/main_pybind.cpp -+++ b/intel_extension_for_transformers/llm/runtime/graph/application/main_pybind.cpp +--- a/neural_speed/application/main_pybind.cpp ++++ b/neural_speed/application/main_pybind.cpp @@ -471,6 +471,10 @@ PYBIND11_MODULE(polyglot_cpp, m) PYBIND11_MODULE(mistral_cpp, m) @@ -412,14 +412,14 @@ Quantize model and use the jblas library for inference can lead to better perfor ```bash # convert the model directly use model path -python scripts/convert_new_model.py --outtype f32 --outfile ne-f32.bin new_model_path +python neural_speed/scripts/convert_new_model.py --outtype f32 --outfile ne-f32.bin new_model_path # optimized INT4 model with group size 128 (recommended) ./build/bin/quant_new_model --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 128 --compute_dtype int8 ``` -Then you can use the model to inference according to the process in the 
[README](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/llm/runtime/graph). +Then you can use the model to inference according to the process in the [README](https://github.com/intel/intel-extension-for-transformers/tree/main/neural_speed). ## 3.2. MHA fusion We can improve the performance by fusion the multihead attention process. -- [MHA-Fusion Introduction](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph/fused_attention.md) +- [MHA-Fusion Introduction](neural_speed/fused_attention.md) - [MHA-Fusion example](https://github.com/intel/intel-extension-for-transformers/pull/567) ## 3.3. FFN fusion We can improve the performance by fusion the FFN process. diff --git a/docs/fused_attention.md b/docs/fused_attention.md index 34b17648e..67b260aed 100644 --- a/docs/fused_attention.md +++ b/docs/fused_attention.md @@ -97,7 +97,7 @@ Fused attention is designed to be able to easily support various models: > ✅: Supported; 🚧: WIP ### Limitations -Currently the fused attention is only enabled when compiling the llm runtime with GCC11+. +Currently the fused attention is only enabled when compiling the Neural Speed with GCC11+. ## Tips for parallelism Thanks to the mathematical nature of attention, one can simply parallel the whole kv-cache operations and fused attention on commonly-parallelizable dimensions. Just pass each part to every KV-cache operations (and merge them together if needed). diff --git a/docs/infinite_inference.md b/docs/infinite_inference.md index 92f6f84cc..4d2051a58 100644 --- a/docs/infinite_inference.md +++ b/docs/infinite_inference.md @@ -1,10 +1,10 @@ Infinite Inference ================== -As a key feature to many LLM applications like ChatBot, the [StreamingLLM paper](https://arxiv.org/abs/2309.17453) discussed infinite inference and proposed their solution which preserves first `n_keep` tokens as "attention sink". Based on their work, LLM Runtime supports infinite inference with two optimized implementations: re-evaluate and shift-RoPE-K. The discard and re-evaluate is available to all models, while the more efficient shift-RoPE-K method required certain models design and needs graph-level support to enable (but it only adds less than 10% overhead comparing to our optimized fix-length generation). +As a key feature to many LLM applications like ChatBot, the [StreamingLLM paper](https://arxiv.org/abs/2309.17453) discussed infinite inference and proposed their solution which preserves first `n_keep` tokens as "attention sink". Based on their work, Neural Speed supports infinite inference with two optimized implementations: re-evaluate and shift-RoPE-K. The discard and re-evaluate is available to all models, while the more efficient shift-RoPE-K method required certain models design and needs graph-level support to enable (but it only adds less than 10% overhead comparing to our optimized fix-length generation). ## Discard and Re-evaluate -By default, the LLM Runtime discards half of the recent tokens and re-evaluates the left sequence to rebuild the KV-cache if no space left in the KV-cache. Obviously, no extra cost is introduced before the KV-cache context is full. The overhead of re-evaluation can be amortized until the context is full again which results in competitive average latency. This method avoids the copying (e.g. `torch.cat`) of the entire KV-cache in the original implement of StreamingLLM. 
However, the re-evaluation is triggered constantly if only one token is dropped at a time according to the StreamingLLM paper. +By default, the Neural Speed discards half of the recent tokens and re-evaluates the left sequence to rebuild the KV-cache if no space left in the KV-cache. Obviously, no extra cost is introduced before the KV-cache context is full. The overhead of re-evaluation can be amortized until the context is full again which results in competitive average latency. This method avoids the copying (e.g. `torch.cat`) of the entire KV-cache in the original implement of StreamingLLM. However, the re-evaluation is triggered constantly if only one token is dropped at a time according to the StreamingLLM paper. ## Shift-RoPE-K and Ring-Buffer If the model implements its positional embedding with [the Rotary Positional Encoding (RoPE)](https://arxiv.org/abs/2104.09864), a "shift operation" can be applied to existing K-Cache, avoiding re-computation for all previous tokens that are not discarded. This method makes use of the full context size in the generation of long text and it introduces no overhead before the KV-cache context is fully filled. @@ -21,7 +21,7 @@ Notice that the [fused-attention](./fused_attention.md) layer does not need to b The shifting-RoPE operation can be viewed as a vector-matrix element-wise complex multiplication, where the complex vector is consist of the cosine/sine value of $-N \times \theta_i \text{ for } i \in \left[0, d/2\right)$ (where $N$ is the length of current tokens / number of discarded cached tokens), and the complex matrix is of shape `d/2 x n_ctx`. The complex vector is precomputed and is been broadcasted in the dimension of `n_ctx` to multiply to the matrix. Therefore, it is straightforward to accelerate this operation with the `VFMULCPH` instruction which performs 16 complex multiplications to 16 pairs of fp16 values (and `VPBROADCASTD` for broadcasting). ### Supported Models -The following models supports shift-RoPE-K method by the LLM Runtime: +The following models supports shift-RoPE-K method by the Neural Speed: | Model name | Status (Challenges) | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------: | | [LLaMA2-7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [LLaMA2-13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf), [LLaMA2-70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | ✅ | diff --git a/docs/tensor_parallelism.md b/docs/tensor_parallelism.md index e3b5f342f..37e45c16c 100644 --- a/docs/tensor_parallelism.md +++ b/docs/tensor_parallelism.md @@ -91,7 +91,7 @@ make -j First you should download and convert the model to f32 format. You can also quantize the model to q4_0 format, but it is optional. ```shell -python scripts/convert.py --outtype f32 --outfile EleutherAI/gpt-j-6b +python neural_speed/scripts/convert.py --outtype f32 --outfile EleutherAI/gpt-j-6b ``` Then quantize the model to q4_0 format(optional). diff --git a/neural_speed/core/README.md b/neural_speed/core/README.md index 0b049d706..3aea65b04 100644 --- a/neural_speed/core/README.md +++ b/neural_speed/core/README.md @@ -1,5 +1,5 @@ # Highly Optimized Low Precision Kernels -Our kernels are based on x64 template library [jblas](../../../library/jblas). 
+Our kernels are based on the x64 template library [BESTLA](../../bestla/README.md).
## Support Matrix
Limited by the graph framework, we only add kernels which accept float tensor as input and output tensor.