add docs for Qwen 2 release (#2961)

intel · Jun 6, 2024 · e5478dd
1 parent 3ccd4f7
commit e5478dd
Showing 44 changed files with 5,709 additions and 0 deletions.
diff --git a/llm/qwen2/cpu/_sources/index.md.txt b/llm/qwen2/cpu/_sources/index.md.txt
@@ -0,0 +1,150 @@
+# Intel® Extension for PyTorch\* Large Language Model (LLM) Feature: Get Started with Qwen2 Models
+
+Intel® Extension for PyTorch\* provides dedicated optimizations that speed up Qwen2 models, including paged attention and ROPE fusion. It also supports a set of data types for various scenarios, including BF16 and weight-only quantization.
+
+# 1. Environment Setup
+
+Two environment setup methods are provided. Choose either one according to your usage scenario. The Docker-based setup is recommended.
+
+## 1.1 [RECOMMENDED] Docker-based environment setup with pre-built wheels
+
+```bash
+# Get the Intel® Extension for PyTorch* source code
+git clone https://github.com/intel/intel-extension-for-pytorch.git
+cd intel-extension-for-pytorch
+git checkout 2.3-qwen-2
+git submodule sync
+git submodule update --init --recursive
+
+# Build an image with the provided Dockerfile by installing from Intel® Extension for PyTorch* prebuilt wheel files
+DOCKER_BUILDKIT=1 docker build -f examples/cpu/inference/python/llm/Dockerfile -t ipex-llm:qwen2 .
+
+# Run the container with command below
+docker run --rm -it --privileged ipex-llm:qwen2 bash
+
+# When the command prompt shows inside the docker container, enter llm examples directory
+cd llm
+
+# Activate environment variables
+source ./tools/env_activate.sh
+```
+
+## 1.2 Conda-based environment setup with pre-built wheels
+
+```bash
+# Get the Intel® Extension for PyTorch* source code
+git clone https://github.com/intel/intel-extension-for-pytorch.git
+cd intel-extension-for-pytorch
+git checkout 2.3-qwen-2
+git submodule sync
+git submodule update --init --recursive
+
+# Create a conda environment (pre-built wheel only available with python=3.10)
+conda create -n llm python=3.10 -y
+conda activate llm
+
+# Setup the environment with the provided script
+# A sample "prompt.json" file for benchmarking is also downloaded
+cd examples/cpu/inference/python/llm
+bash ./tools/env_setup.sh 7
+
+# Activate environment variables
+source ./tools/env_activate.sh
+```
+<br>
+
+# 2. How To Run Qwen2 with ipex.llm
+
+**ipex.llm provides a single script to facilitate running generation tasks as below:**
+
+```bash
+# if you are using a docker container built from commands above in Sec. 1.1, the placeholder LLM_DIR below is /home/ubuntu/llm
+# if you are using a conda env created with commands above in Sec. 1.2, the placeholder LLM_DIR below is intel-extension-for-pytorch/examples/cpu/inference/python/llm
+cd <LLM_DIR>
+python run.py --help # for more detailed usages
+```
+
+| Key args of run.py | Notes |
+|---|---|
+| model id | `--model-name-or-path` or `-m` specifies the &lt;QWEN2_MODEL_ID_OR_LOCAL_PATH&gt;, which is either a model id from Hugging Face or a local path to downloaded model files |
+| generation | default: beam search (beam size = 4), `--greedy` for greedy search |
+| input tokens | fixed input prompt sizes: use `--input-tokens` with &lt;INPUT_LENGTH&gt; in [1024, 2048, 4096, 8192, 16384, 32768]; if `--input-tokens` is not used, use `--prompt` to supply another string as the prompt input |
+| output tokens | default: 32, use `--max-new-tokens` to choose any other size |
+| batch size | default: 1, use `--batch-size` to choose any other size |
+| token latency | enable `--token-latency` to print out the first and next token latency |
+| generation iterations | use `--num-iter` and `--num-warmup` to control the repeated iterations of generation, default: 100-iter/10-warmup |
+| streaming mode output | greedy search only (works with `--greedy`), use `--streaming` to enable the streaming generation output |
+
+*Note:* You may need to log in to your Hugging Face account to access the model files. Please refer to [HuggingFace login](https://huggingface.co/docs/huggingface_hub/quick-start#login).
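
To illustrate how the flags in the table combine, the following dry-run sketch assembles one possible `run.py` command line and prints it instead of executing it. The helper name `build_run_cmd` and its defaults are hypothetical and not part of the repository. The flags are taken from the table and from the commands in this guide; whether every combination is accepted should be confirmed with `python run.py --help`.

```bash
# Dry-run sketch: prints a run.py command line instead of executing it.
# MODEL is a placeholder, exactly as in the commands of this guide.
MODEL="<QWEN2_MODEL_ID_OR_LOCAL_PATH>"

# build_run_cmd is a hypothetical helper, not part of the repository.
# $1: input length, $2: max new tokens (default 32), $3: batch size (default 1)
build_run_cmd() {
    local input_tokens="$1"
    local max_new_tokens="${2:-32}"
    local batch_size="${3:-1}"

    # Only the fixed prompt sizes listed in the table are accepted.
    case "$input_tokens" in
        1024|2048|4096|8192|16384|32768) ;;
        *) echo "unsupported --input-tokens value: $input_tokens" >&2
           return 1 ;;
    esac

    echo "python run.py --benchmark -m ${MODEL} --dtype bfloat16 --ipex --greedy" \
         "--input-tokens ${input_tokens}" \
         "--max-new-tokens ${max_new_tokens}" \
         "--batch-size ${batch_size}" \
         "--token-latency"
}

build_run_cmd 1024
```

Rejecting unsupported prompt sizes before launching avoids loading the model only to fail on argument validation.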
+
+## 2.1 Usage of running Qwen2 models
+
+The *&lt;QWEN2_MODEL_ID_OR_LOCAL_PATH&gt;* in the commands below specifies the Qwen2 model you will run, which can be found on [HuggingFace Models](https://huggingface.co/models).
+
+### 2.1.1 Run generation with multiple instances on multiple CPU numa nodes
+
+#### 2.1.1.1 Prepare:
+
+```bash
+unset KMP_AFFINITY
+```
+
+In the DeepSpeed cases below, we recommend `--shard-model`, which shards the model weights more evenly for better memory usage when running with DeepSpeed.
+
+If `--shard-model` is used, a copy of the sharded model weights is saved under the `--output-dir` path (default: `./saved_results`).
+If you have already used `--shard-model` and generated such a sharded model path (or your model weight files are already well sharded), remove `--shard-model` in subsequent benchmark runs and replace `-m <QWEN2_MODEL_ID_OR_LOCAL_PATH>` with `-m <shard model path>` to skip the repeated sharding step.
+
+A standalone sharding script is also provided in section 2.1.1.4, in case you would like to generate the sharded model weight files before running distributed inference.
+
+#### 2.1.1.2 BF16:
+
+- Command:
+```bash
+deepspeed --bind_cores_to_rank  run.py --benchmark -m <QWEN2_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex  --greedy --input-tokens <INPUT_LENGTH> --autotp --shard-model
+```
+
+#### 2.1.1.3 Weight-only quantization (INT8):
+
+By default, for weight-only quantization, we use quantization with [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html) inference (`--quant-with-amp`) to get peak performance and fair accuracy.
+For weight-only quantization with DeepSpeed, we quantize the model and then run the benchmark. The quantized model is not saved.
+
+- Command:
+```bash
+deepspeed --bind_cores_to_rank run.py  --benchmark -m <QWEN2_MODEL_ID_OR_LOCAL_PATH> --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --greedy --input-tokens <INPUT_LENGTH>  --autotp --shard-model
+```
+
+#### 2.1.1.4 How to Shard Model weight files for Distributed Inference with DeepSpeed
+
+To reduce memory usage, you can shard the model weight files under a local path before launching distributed tests with DeepSpeed.
+
+```bash
+cd ./utils
+# general command:
+python create_shard_model.py -m <QWEN2_MODEL_ID_OR_LOCAL_PATH>  --save-path ./local_qwen2_model_shard
+# After sharding the model, use "-m ./local_qwen2_model_shard" in later tests
+```
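
The shard-once, reuse-later workflow can be sketched as a dry run that only prints the commands it would issue. The helper name `plan_sharded_run` is hypothetical, and the sketch assumes the shard directory is given as an absolute path so that the sharding step (run from `./utils`) and the benchmark step (run from the llm directory) resolve the same location. The input length of 1024 is an arbitrary choice from the supported sizes.

```bash
# Dry-run sketch: prints the commands instead of executing them.
MODEL="<QWEN2_MODEL_ID_OR_LOCAL_PATH>"
# Assumption: an absolute path, so both steps refer to the same directory.
SHARD_DIR="$PWD/local_qwen2_model_shard"

# plan_sharded_run is a hypothetical helper, not part of the repository.
# $1: directory that holds (or will hold) the sharded weights
plan_sharded_run() {
    local shard_dir="$1"

    if [ ! -d "$shard_dir" ]; then
        # No shards yet: create them once with the standalone script.
        echo "(cd ./utils && python create_shard_model.py -m ${MODEL} --save-path ${shard_dir})"
    fi

    # Point -m at the sharded copy and leave out --shard-model,
    # so the sharding step is not repeated.
    echo "deepspeed --bind_cores_to_rank run.py --benchmark -m ${shard_dir} --dtype bfloat16 --ipex --greedy --input-tokens 1024 --autotp"
}

plan_sharded_run "$SHARD_DIR"
```

On the first call the sketch prints both the sharding and the benchmark command; once the directory exists it prints only the benchmark command.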
+
+### 2.1.2 Run generation with single instance on a single numa node
+#### 2.1.2.1 BF16:
+
+- Command:
+```bash
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <QWEN2_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex --greedy --input-tokens <INPUT_LENGTH> 
+```
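
One way to fill in the `<physical cores num>` and `<physical cores list>` placeholders is to parse the output of `lscpu -p=CPU,CORE,NODE` and keep one logical CPU per physical core on the chosen NUMA node. The sketch below does this on a made-up sample topology so that it runs anywhere. The helper names `physical_cores_for_node` and `sample_topology` are hypothetical, and the final command is only printed, not executed.

```bash
# physical_cores_for_node is a hypothetical helper, not part of the repository.
# $1: NUMA node id; stdin: output of `lscpu -p=CPU,CORE,NODE`
physical_cores_for_node() {
    awk -F, -v node="$1" '
        /^#/ { next }                        # skip the comment header
        $3 == node && !seen[$2]++ {          # keep the first logical CPU of each core
            list = (list == "" ? $1 : list "," $1)
        }
        END { print list }
    '
}

# Made-up sample: 4 physical cores, 2 NUMA nodes, hyper-threading enabled.
# On a real machine, replace sample_topology with: lscpu -p=CPU,CORE,NODE
sample_topology() {
    printf '%s\n' '# CPU,Core,Node' 0,0,0 1,1,0 2,2,1 3,3,1 4,0,0 5,1,0 6,2,1 7,3,1
}

NODE=0
CORES="$(sample_topology | physical_cores_for_node "$NODE")"
NUM_CORES="$(echo "$CORES" | tr ',' '\n' | wc -l | tr -d ' ')"

# Print the resulting command instead of running it.
echo "OMP_NUM_THREADS=${NUM_CORES} numactl -m ${NODE} -C ${CORES} python run.py --benchmark -m <QWEN2_MODEL_ID_OR_LOCAL_PATH> --dtype bfloat16 --ipex --greedy --input-tokens 1024"
```

Keeping only the first logical CPU of each core skips hyper-thread siblings, which matches the guide's use of physical cores for `-C` and `OMP_NUM_THREADS`.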
+
+#### 2.1.2.2 Weight-only quantization (INT8):
+
+By default, for weight-only quantization, we use quantization with [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html) inference (`--quant-with-amp`) to get peak performance and fair accuracy.
+
+- Command:
+```bash
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list>  python run.py  --benchmark -m <QWEN2_MODEL_ID_OR_LOCAL_PATH> --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results"  --greedy --input-tokens <INPUT_LENGTH>
+```
+
+#### 2.1.2.3 Notes:
+
+(1) [`numactl`](https://linux.die.net/man/8/numactl) is used to bind the process to specific memory and cores of your hardware for better performance. *&lt;node N&gt;* specifies the [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) node id (e.g., 0 to use the memory from the first NUMA node). *&lt;physical cores list&gt;* specifies the physical cores you are using from the *&lt;node N&gt;* NUMA node. You can use the [`lscpu`](https://man7.org/linux/man-pages/man1/lscpu.1.html) command in Linux to check the NUMA node information.
+
+(2) For all quantization benchmarks, both the quantization and inference stages are triggered by default. The quantization stage auto-generates a quantized model named `best_model.pt` in the `--output-dir` path, and the inference stage launches inference with that quantized model. For inference-only benchmarks (to avoid repeating the quantization stage), you can reuse the quantized model by adding `--quantized-model-path <output_dir + "best_model.pt">`.
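
The reuse described in note (2) can be sketched as a dry run that adds `--quantized-model-path` only when `best_model.pt` already exists in the output directory. The helper name `quant_cmd` is hypothetical, the input length of 1024 is an arbitrary supported size, and the command is printed rather than executed.

```bash
# Dry-run sketch: prints the command instead of executing it.
MODEL="<QWEN2_MODEL_ID_OR_LOCAL_PATH>"

# quant_cmd is a hypothetical helper, not part of the repository.
# $1: the directory passed to --output-dir
quant_cmd() {
    local output_dir="$1"
    local base="python run.py --benchmark -m ${MODEL} --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir ${output_dir} --greedy --input-tokens 1024"

    if [ -f "${output_dir}/best_model.pt" ]; then
        # A quantized model from an earlier run exists: reuse it.
        echo "${base} --quantized-model-path ${output_dir}/best_model.pt"
    else
        # First run: quantization and inference are both triggered.
        echo "${base}"
    fi
}

quant_cmd "saved_results"
```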
+
+## Miscellaneous Tips
+Intel® Extension for PyTorch\* also provides dedicated optimization for many other Large Language Models (LLM), which cover a set of data types that are supported for various scenarios. For more details, please check this [Intel® Extension for PyTorch\* doc](https://github.com/intel/intel-extension-for-pytorch/blob/release/2.3/README.md).
diff --git a/llm/qwen2/cpu/_static/_sphinx_javascript_frameworks_compat.js b/llm/qwen2/cpu/_static/_sphinx_javascript_frameworks_compat.js
@@ -0,0 +1,123 @@
+/* Compatibility shim for jQuery and underscores.js.
+ *
+ * Copyright Sphinx contributors
+ * Released under the two clause BSD licence
+ */
+
+/**
+ * small helper function to urldecode strings
+ *
+ * See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/decodeURIComponent#Decoding_query_parameters_from_a_URL
+ */
+jQuery.urldecode = function(x) {
+    if (!x) {
+        return x
+    }
+    return decodeURIComponent(x.replace(/\+/g, ' '));
+};
+
+/**
+ * small helper function to urlencode strings
+ */
+jQuery.urlencode = encodeURIComponent;
+
+/**
+ * This function returns the parsed url parameters of the
+ * current request. Multiple values per key are supported,
+ * it will always return arrays of strings for the value parts.
+ */
+jQuery.getQueryParameters = function(s) {
+    if (typeof s === 'undefined')
+        s = document.location.search;
+    var parts = s.substr(s.indexOf('?') + 1).split('&');
+    var result = {};
+    for (var i = 0; i < parts.length; i++) {
+        var tmp = parts[i].split('=', 2);
+        var key = jQuery.urldecode(tmp[0]);
+        var value = jQuery.urldecode(tmp[1]);
+        if (key in result)
+            result[key].push(value);
+        else
+            result[key] = [value];
+    }
+    return result;
+};
+
+/**
+ * highlight a given string on a jquery object by wrapping it in
+ * span elements with the given class name.
+ */
+jQuery.fn.highlightText = function(text, className) {
+    function highlight(node, addItems) {
+        if (node.nodeType === 3) {
+            var val = node.nodeValue;
+            var pos = val.toLowerCase().indexOf(text);
+            if (pos >= 0 &&
+                !jQuery(node.parentNode).hasClass(className) &&
+                !jQuery(node.parentNode).hasClass("nohighlight")) {
+                var span;
+                var isInSVG = jQuery(node).closest("body, svg, foreignObject").is("svg");
+                if (isInSVG) {
+                    span = document.createElementNS("http://www.w3.org/2000/svg", "tspan");
+                } else {
+                    span = document.createElement("span");
+                    span.className = className;
+                }
+                span.appendChild(document.createTextNode(val.substr(pos, text.length)));
+                node.parentNode.insertBefore(span, node.parentNode.insertBefore(
+                    document.createTextNode(val.substr(pos + text.length)),
+                    node.nextSibling));
+                node.nodeValue = val.substr(0, pos);
+                if (isInSVG) {
+                    var rect = document.createElementNS("http://www.w3.org/2000/svg", "rect");
+                    var bbox = node.parentElement.getBBox();
+                    rect.x.baseVal.value = bbox.x;
+                    rect.y.baseVal.value = bbox.y;
+                    rect.width.baseVal.value = bbox.width;
+                    rect.height.baseVal.value = bbox.height;
+                    rect.setAttribute('class', className);
+                    addItems.push({
+                        "parent": node.parentNode,
+                        "target": rect});
+                }
+            }
+        }
+        else if (!jQuery(node).is("button, select, textarea")) {
+            jQuery.each(node.childNodes, function() {
+                highlight(this, addItems);
+            });
+        }
+    }
+    var addItems = [];
+    var result = this.each(function() {
+        highlight(this, addItems);
+    });
+    for (var i = 0; i < addItems.length; ++i) {
+        jQuery(addItems[i].parent).before(addItems[i].target);
+    }
+    return result;
+};
+
+/*
+ * backward compatibility for jQuery.browser
+ * This will be supported until firefox bug is fixed.
+ */
+if (!jQuery.browser) {
+    jQuery.uaMatch = function(ua) {
+        ua = ua.toLowerCase();
+
+        var match = /(chrome)[ \/]([\w.]+)/.exec(ua) ||
+            /(webkit)[ \/]([\w.]+)/.exec(ua) ||
+            /(opera)(?:.*version|)[ \/]([\w.]+)/.exec(ua) ||
+            /(msie) ([\w.]+)/.exec(ua) ||
+            ua.indexOf("compatible") < 0 && /(mozilla)(?:.*? rv:([\w.]+)|)/.exec(ua) ||
+            [];
+
+        return {
+            browser: match[ 1 ] || "",
+            version: match[ 2 ] || "0"
+        };
+    };
+    jQuery.browser = {};
+    jQuery.browser[jQuery.uaMatch(navigator.userAgent).browser] = true;
+}