
Commit

Init ns doc (#9)
airMeng authored Dec 22, 2023
1 parent 725508f commit 2fc8644
Showing 7 changed files with 92 additions and 235 deletions.
249 changes: 53 additions & 196 deletions README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion neural_speed/scripts/clang-format.py → clang-format.py
@@ -60,7 +60,7 @@ def parse_args(argv=None):

if __name__ == '__main__':
    if len(sys.argv) == 1:
-        args = parse_args(['', '--dirs', 'core', 'models', 'vectors', 'application'])
+        args = parse_args(['', '--dirs', 'neural_speed', 'bestla'])
    else:
        args = parse_args()
    clang_format_dir(args)
64 changes: 32 additions & 32 deletions developer_document.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/fused_attention.md
@@ -97,7 +97,7 @@ Fused attention is designed to be able to easily support various models:
> ✅: Supported; 🚧: WIP
### Limitations
-Currently the fused attention is only enabled when compiling the llm runtime with GCC11+.
+Currently the fused attention is only enabled when compiling Neural Speed with GCC11+.

## Tips for parallelism
Thanks to the mathematical nature of attention, one can simply parallelize the KV-cache operations and the fused attention along commonly parallelizable dimensions: just pass each part to the corresponding KV-cache operation and merge the outputs together if needed.
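As an illustration only (plain NumPy with hypothetical function names, not the Neural Speed kernels), partitioning the KV-cache and the attention along the head dimension and merging the per-head outputs can be sketched as follows:

```python
import numpy as np

# Illustrative sketch only: attention is independent per head, so the KV-cache and
# the attention computation can be partitioned along the head dimension and the
# per-head outputs merged back together; all names here are hypothetical.
def attention_one_head(q, k, v):                     # q, k, v: (seq, head_dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v

def parallel_fused_attention(q, k_cache, v_cache):   # all: (n_head, seq, head_dim)
    # Each head slice could be handled by a different thread or worker,
    # then the partial results are simply stacked (merged) again.
    return np.stack([attention_one_head(q[h], k_cache[h], v_cache[h])
                     for h in range(q.shape[0])])
```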
6 changes: 3 additions & 3 deletions docs/infinite_inference.md
@@ -1,10 +1,10 @@
Infinite Inference
==================

-As a key feature to many LLM applications like ChatBot, the [StreamingLLM paper](https://arxiv.org/abs/2309.17453) discussed infinite inference and proposed their solution which preserves first `n_keep` tokens as "attention sink". Based on their work, LLM Runtime supports infinite inference with two optimized implementations: re-evaluate and shift-RoPE-K. The discard and re-evaluate is available to all models, while the more efficient shift-RoPE-K method required certain models design and needs graph-level support to enable (but it only adds less than 10% overhead comparing to our optimized fix-length generation).
+As a key feature of many LLM applications such as chatbots, the [StreamingLLM paper](https://arxiv.org/abs/2309.17453) discussed infinite inference and proposed a solution which preserves the first `n_keep` tokens as an "attention sink". Based on their work, Neural Speed supports infinite inference with two optimized implementations: re-evaluate and shift-RoPE-K. Discard-and-re-evaluate is available to all models, while the more efficient shift-RoPE-K method requires a certain model design and graph-level support to enable (but it adds less than 10% overhead compared to our optimized fixed-length generation).

## Discard and Re-evaluate
-By default, the LLM Runtime discards half of the recent tokens and re-evaluates the left sequence to rebuild the KV-cache if no space left in the KV-cache. Obviously, no extra cost is introduced before the KV-cache context is full. The overhead of re-evaluation can be amortized until the context is full again which results in competitive average latency. This method avoids the copying (e.g. `torch.cat`) of the entire KV-cache in the original implement of StreamingLLM. However, the re-evaluation is triggered constantly if only one token is dropped at a time according to the StreamingLLM paper.
+By default, Neural Speed discards half of the recent tokens and re-evaluates the remaining sequence to rebuild the KV-cache when no space is left in the KV-cache. Obviously, no extra cost is introduced before the KV-cache context is full. The overhead of re-evaluation can be amortized until the context is full again, which results in competitive average latency. This method avoids copying the entire KV-cache (e.g. with `torch.cat`) as in the original implementation of StreamingLLM. However, re-evaluation would be triggered constantly if only one token were dropped at a time as in the StreamingLLM paper.
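A minimal sketch of the discard-and-re-evaluate policy described above (hypothetical names and cache handling, not Neural Speed's actual API):

```python
# Hypothetical sketch of discard-and-re-evaluate; `n_keep`, `evaluate` and the
# cache object are illustrative, not Neural Speed's real interfaces.
def maybe_discard_and_reeval(tokens, kv_cache, n_ctx, n_keep, evaluate):
    if kv_cache.n_cached >= n_ctx:              # no space left in the KV-cache
        sink = tokens[:n_keep]                  # preserved "attention sink" tokens
        recent = tokens[n_keep:]
        recent = recent[len(recent) // 2:]      # discard half of the recent tokens
        tokens = sink + recent
        kv_cache.clear()                        # drop the stale cache ...
        evaluate(tokens, kv_cache)              # ... and rebuild it in a single pass
    return tokens
```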

## Shift-RoPE-K and Ring-Buffer
If the model implements its positional embedding with [the Rotary Positional Encoding (RoPE)](https://arxiv.org/abs/2104.09864), a "shift operation" can be applied to the existing K-cache, avoiding re-computation for all previous tokens that are not discarded. This method makes use of the full context size in the generation of long text, and it introduces no overhead before the KV-cache context is fully filled.
@@ -21,7 +21,7 @@ Notice that the [fused-attention](./fused_attention.md) layer does not need to b
The shifting-RoPE operation can be viewed as a vector-matrix element-wise complex multiplication, where the complex vector consists of the cosine/sine values of $-N \times \theta_i \text{ for } i \in \left[0, d/2\right)$ (where $N$ is the length of the current tokens, i.e. the number of discarded cached tokens), and the complex matrix is of shape `d/2 x n_ctx`. The complex vector is precomputed and broadcast along the `n_ctx` dimension before being multiplied with the matrix. Therefore, it is straightforward to accelerate this operation with the `VFMULCPH` instruction, which performs 16 complex multiplications on 16 pairs of fp16 values (together with `VPBROADCASTD` for the broadcasting).
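As an illustration only (plain NumPy on float data, adjacent-pair complex convention assumed), the shift amounts to one broadcasted element-wise complex multiplication; the optimized kernel performs the same math with `VFMULCPH`/`VPBROADCASTD` on fp16:

```python
import numpy as np

# Illustrative sketch of shift-RoPE-K; not the fused fp16 kernel.
def shift_rope_k(k_cache, n_shift, theta):
    # k_cache: (n_ctx, d) float array, adjacent channel pairs form one complex value;
    # theta:   (d // 2,) RoPE frequencies; n_shift: number of discarded tokens N.
    n_ctx, d = k_cache.shape
    rot = np.exp(-1j * n_shift * theta)              # cos/sin of -N * theta_i, shape (d/2,)
    pairs = k_cache.reshape(n_ctx, d // 2, 2)
    k_complex = pairs[..., 0] + 1j * pairs[..., 1]   # complex K matrix, stored here as (n_ctx, d/2)
    shifted = k_complex * rot                        # rot broadcast along n_ctx, element-wise multiply
    return np.stack([shifted.real, shifted.imag], axis=-1).reshape(n_ctx, d)
```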

### Supported Models
-The following models supports shift-RoPE-K method by the LLM Runtime:
+The following models support the shift-RoPE-K method in Neural Speed:
| Model name | Status (Challenges) |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------: |
| [LLaMA2-7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [LLaMA2-13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf), [LLaMA2-70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | ✅ |
2 changes: 1 addition & 1 deletion docs/tensor_parallelism.md
@@ -91,7 +91,7 @@ make -j
First you should download and convert the model to f32 format. You can also quantize the model to q4_0 format, but it is optional.

```shell
-python scripts/convert.py --outtype f32 --outfile EleutherAI/gpt-j-6b
+python neural_speed/scripts/convert.py --outtype f32 --outfile EleutherAI/gpt-j-6b
```
Then quantize the model to the q4_0 format (optional).

2 changes: 1 addition & 1 deletion neural_speed/core/README.md
@@ -1,5 +1,5 @@
# Highly Optimized Low Precision Kernels
-Our kernels are based on x64 template library [jblas](../../../library/jblas).
+Our kernels are based on the x64 template library [BESTLA](../../bestla/README.md).
## Support Matrix
Limited by the graph framework, we only add kernels that accept float tensors as input and output.

