
Commit

Init ns doc (#9)
airMeng authored Dec 22, 2023
1 parent 725508f commit 2fc8644
Showing 7 changed files with 92 additions and 235 deletions.
249 changes: 53 additions & 196 deletions README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion neural_speed/scripts/clang-format.py → clang-format.py
@@ -60,7 +60,7 @@ def parse_args(argv=None):

if __name__ == '__main__':
    if len(sys.argv) == 1:
-        args = parse_args(['', '--dirs', 'core', 'models', 'vectors', 'application'])
+        args = parse_args(['', '--dirs', 'neural_speed', 'bestla'])
    else:
        args = parse_args()
    clang_format_dir(args)
64 changes: 32 additions & 32 deletions developer_document.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/fused_attention.md
@@ -97,7 +97,7 @@ Fused attention is designed to be able to easily support various models:
> ✅: Supported; 🚧: WIP
### Limitations
-Currently the fused attention is only enabled when compiling the llm runtime with GCC11+.
+Currently the fused attention is only enabled when compiling Neural Speed with GCC11+.

## Tips for parallelism
Thanks to the mathematical nature of attention, one can simply parallelize the KV-cache operations and the fused attention along commonly parallelizable dimensions: just pass each part to the corresponding KV-cache operation and merge the outputs together if needed.
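As an illustration only (plain NumPy with hypothetical function names, not the Neural Speed kernels), partitioning the KV-cache and the attention along the head dimension and merging the per-head outputs can be sketched as follows:

```python
import numpy as np

# Illustrative sketch only: attention is independent per head, so the KV-cache and
# the attention computation can be partitioned along the head dimension and the
# per-head outputs merged back together; all names here are hypothetical.
def attention_one_head(q, k, v):                     # q, k, v: (seq, head_dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v

def parallel_fused_attention(q, k_cache, v_cache):   # all: (n_head, seq, head_dim)
    # Each head slice could be handled by a different thread or worker,
    # then the partial results are simply stacked (merged) again.
    return np.stack([attention_one_head(q[h], k_cache[h], v_cache[h])
                     for h in range(q.shape[0])])
```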
6 changes: 3 additions & 3 deletions docs/infinite_inference.md
@@ -1,10 +1,10 @@
Infinite Inference
==================

-As a key feature to many LLM applications like ChatBot, the [StreamingLLM paper](https://arxiv.org/abs/2309.17453) discussed infinite inference and proposed their solution which preserves first `n_keep` tokens as "attention sink". Based on their work, LLM Runtime supports infinite inference with two optimized implementations: re-evaluate and shift-RoPE-K. The discard and re-evaluate is available to all models, while the more efficient shift-RoPE-K method required certain models design and needs graph-level support to enable (but it only adds less than 10% overhead comparing to our optimized fix-length generation).
+As a key feature of many LLM applications such as chatbots, the [StreamingLLM paper](https://arxiv.org/abs/2309.17453) discussed infinite inference and proposed a solution which preserves the first `n_keep` tokens as an "attention sink". Based on their work, Neural Speed supports infinite inference with two optimized implementations: re-evaluate and shift-RoPE-K. Discard-and-re-evaluate is available to all models, while the more efficient shift-RoPE-K method requires a certain model design and graph-level support to enable (but it adds less than 10% overhead compared to our optimized fixed-length generation).

## Discard and Re-evaluate
-By default, the LLM Runtime discards half of the recent tokens and re-evaluates the left sequence to rebuild the KV-cache if no space left in the KV-cache. Obviously, no extra cost is introduced before the KV-cache context is full. The overhead of re-evaluation can be amortized until the context is full again which results in competitive average latency. This method avoids the copying (e.g. `torch.cat`) of the entire KV-cache in the original implement of StreamingLLM. However, the re-evaluation is triggered constantly if only one token is dropped at a time according to the StreamingLLM paper.
+By default, Neural Speed discards half of the recent tokens and re-evaluates the remaining sequence to rebuild the KV-cache when no space is left in the KV-cache. Obviously, no extra cost is introduced before the KV-cache context is full. The overhead of re-evaluation can be amortized until the context is full again, which results in competitive average latency. This method avoids copying the entire KV-cache (e.g. with `torch.cat`) as in the original implementation of StreamingLLM. However, re-evaluation would be triggered constantly if only one token were dropped at a time as in the StreamingLLM paper.
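A minimal sketch of the discard-and-re-evaluate policy described above (hypothetical names and cache handling, not Neural Speed's actual API):

```python
# Hypothetical sketch of discard-and-re-evaluate; `n_keep`, `evaluate` and the
# cache object are illustrative, not Neural Speed's real interfaces.
def maybe_discard_and_reeval(tokens, kv_cache, n_ctx, n_keep, evaluate):
    if kv_cache.n_cached >= n_ctx:              # no space left in the KV-cache
        sink = tokens[:n_keep]                  # preserved "attention sink" tokens
        recent = tokens[n_keep:]
        recent = recent[len(recent) // 2:]      # discard half of the recent tokens
        tokens = sink + recent
        kv_cache.clear()                        # drop the stale cache ...
        evaluate(tokens, kv_cache)              # ... and rebuild it in a single pass
    return tokens
```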

## Shift-RoPE-K and Ring-Buffer
If the model implements its positional embedding with [the Rotary Positional Encoding (RoPE)](https://arxiv.org/abs/2104.09864), a "shift operation" can be applied to the existing K-cache, avoiding re-computation for all previous tokens that are not discarded. This method makes use of the full context size in the generation of long text, and it introduces no overhead before the KV-cache context is fully filled.
@@ -21,7 +21,7 @@ Notice that the [fused-attention](./fused_attention.md) layer does not need to b
The shifting-RoPE operation can be viewed as a vector-matrix element-wise complex multiplication, where the complex vector consists of the cosine/sine values of $-N \times \theta_i \text{ for } i \in \left[0, d/2\right)$ (where $N$ is the length of the current tokens, i.e. the number of discarded cached tokens), and the complex matrix is of shape `d/2 x n_ctx`. The complex vector is precomputed and broadcast along the `n_ctx` dimension before being multiplied with the matrix. Therefore, it is straightforward to accelerate this operation with the `VFMULCPH` instruction, which performs 16 complex multiplications on 16 pairs of fp16 values (together with `VPBROADCASTD` for the broadcasting).
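As an illustration only (plain NumPy on float data, adjacent-pair complex convention assumed), the shift amounts to one broadcasted element-wise complex multiplication; the optimized kernel performs the same math with `VFMULCPH`/`VPBROADCASTD` on fp16:

```python
import numpy as np

# Illustrative sketch of shift-RoPE-K; not the fused fp16 kernel.
def shift_rope_k(k_cache, n_shift, theta):
    # k_cache: (n_ctx, d) float array, adjacent channel pairs form one complex value;
    # theta:   (d // 2,) RoPE frequencies; n_shift: number of discarded tokens N.
    n_ctx, d = k_cache.shape
    rot = np.exp(-1j * n_shift * theta)              # cos/sin of -N * theta_i, shape (d/2,)
    pairs = k_cache.reshape(n_ctx, d // 2, 2)
    k_complex = pairs[..., 0] + 1j * pairs[..., 1]   # complex K matrix, stored here as (n_ctx, d/2)
    shifted = k_complex * rot                        # rot broadcast along n_ctx, element-wise multiply
    return np.stack([shifted.real, shifted.imag], axis=-1).reshape(n_ctx, d)
```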

### Supported Models
-The following models supports shift-RoPE-K method by the LLM Runtime:
+The following models support the shift-RoPE-K method in Neural Speed:
| Model name | Status (Challenges) |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------: |
| [LLaMA2-7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [LLaMA2-13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf), [LLaMA2-70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | ✅ |
2 changes: 1 addition & 1 deletion docs/tensor_parallelism.md
@@ -91,7 +91,7 @@ make -j
First you should download and convert the model to f32 format. You can also quantize the model to q4_0 format, but it is optional.

```shell
-python scripts/convert.py --outtype f32 --outfile EleutherAI/gpt-j-6b
+python neural_speed/scripts/convert.py --outtype f32 --outfile EleutherAI/gpt-j-6b
```
Then quantize the model to the q4_0 format (optional).

2 changes: 1 addition & 1 deletion neural_speed/core/README.md
@@ -1,5 +1,5 @@
# Highly Optimized Low Precision Kernels
-Our kernels are based on x64 template library [jblas](../../../library/jblas).
+Our kernels are based on the x64 template library [BESTLA](../../bestla/README.md).
## Support Matrix
Limited by the graph framework, we only add kernels that accept float tensors as input and output.

