Commit f392c83 — 黄宇扬 committed Aug 13, 2024 — 1 parent 7d1973a — 7 changed files with 214 additions and 16 deletions.
# Fastllm Python Demo Parameter Reference

## General Parameters

These model-related options are shared by the OpenAI API server, the WebUI, and the chat demo.

- **Model path (`-p, --path`)**: Path to the model, either a fastllm model file or a Hugging Face model directory. For example:
  ```bash
  --path ~/Qwen2-7B-Instruct/ # Load the model from ~/Qwen2-7B-Instruct/. This must be a standard Hugging Face-format model downloaded from HuggingFace, ModelScope, or a similar site; pre-quantized formats such as AWQ and GPTQ are not supported yet.
  --path ~/model.flm          # Load the model from ~/model.flm, a model file in Fastllm format.
  ```
- **Inference type (`--atype`)**: Intermediate computation type; `float16` or `float32`.
- **Weight type (`--dtype`)**: Weight type of the model, used when loading a Hugging Face model. One of `float16`, `int8`, `int4`, or `int4g` (int4 grouped quantization). For example:
  ```bash
  --dtype float16  # Use float16 weights (no quantization)
  --dtype int8     # Quantize weights to int8 on load
  --dtype int4g128 # Quantize weights to grouped int4 on load (128 weights per group)
  --dtype int4g256 # Quantize weights to grouped int4 on load (256 weights per group)
  --dtype int4     # Quantize weights to int4 on load
  ```
- **Device (`--device`)**: Device the server runs on; `cpu`, `cuda`, or any additionally compiled device type.
- **CUDA embedding (`--cuda_embedding`)**: If this flag is given and the device is `cuda`, the embedding operation also runs on the CUDA device. This is slightly faster but uses more GPU memory; recommended only when GPU memory is plentiful.
- **KV cache limit (`--kv_cache_limit`)**: Maximum memory the KV cache may use. If omitted or set to `auto`, the framework manages it automatically. Manual examples:
  ```bash
  --kv_cache_limit 5G   # Limit the KV cache to 5G
  --kv_cache_limit 100M # Limit the KV cache to 100M
  --kv_cache_limit 168K # Limit the KV cache to 168K
  ```
- **Maximum batch size (`--max_batch`)**: Number of requests processed at the same time. If omitted, the framework chooses automatically.
- **Thread count (`-t, --threads`)**: Number of CPU threads. This strongly affects speed when the device is `cpu`; with `cuda` the impact is small and mainly concerns how fast the model loads.
- **Custom model description file (`--custom`)**: Python file describing a custom model. See [Custom Models](custom.md) for details.

## OpenAI API Server Parameters

- **Model name (`--model_name`)**: Name of the deployed model; API calls are verified against this name.
- **API server host (`--host`)**: Host address of the API server.
- **API server port (`--port`)**: Port number of the API server.

## Web UI Parameters

- **WebUI port (`--port`)**: Port number of the WebUI.
- **Page title (`--title`)**: Page title of the WebUI.
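Putting the general and server parameters together, a typical launch of the OpenAI API server could look like the following sketch (the flag values are illustrative, not defaults):

```bash
# Serve a Hugging Face model with online int4 quantization on the GPU,
# cap the KV cache at 5G, and expose the model under the name "qwen2".
python3 -m ftllm.server -p ~/Qwen2-7B-Instruct/ --dtype int4 --device cuda \
    --kv_cache_limit 5G --model_name qwen2 --host 127.0.0.1 --port 8080
```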
### Custom Models

For models that the Fastllm framework does not support out of the box, you can add support by describing the model structure yourself.

A custom Python model requires only a single Python file that describes the model structure. You can refer to the implementation in [QWEN](../example/python/qwen2.py).

### Using Python Custom Models

When using `ftllm.chat`, `ftllm.webui`, or `ftllm.server`, add the `--custom` parameter to specify the custom model file.

Assuming our model is located in the `~/Qwen2-7B-Instruct/` directory and the custom model file is `~/qwen2.py`, you can use the command:

```sh
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --custom ~/qwen2.py
```

to load the Qwen2 model through the custom model file. The usage for `server` and `webui` is similar.
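For instance, the same custom model file can be passed to the server entry point in exactly the same way (the port value below is illustrative):

```sh
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --custom ~/qwen2.py --port 8080
```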
### Writing Python Custom Models

When creating a custom model, you need to implement a model description class that inherits from `ftllm.llm.ComputeGraph`.

Refer to the code in [QWEN](../example/python/qwen2.py):

```python
from ftllm.llm import ComputeGraph

class Qwen2Model(ComputeGraph):
```

At the end of the file, define the `__model__` variable to specify which class describes the custom model structure:

```python
__model__ = Qwen2Model
```
The model description class must implement the `build` method, which obtains the model parameters and describes the computation flow.

Here is an example based on the sample code:

```python
class Qwen2Model(ComputeGraph):
    def build(self):
        # 1. Get weight, data, config
        weight, data, config = self.weight, self.data, self.config

        # 2. Set some config values
        config["max_positions"] = 128000

        # 3. Describe the computation flow
        head_dim = config["hidden_size"] // config["num_attention_heads"]
        self.Embedding(data["inputIds"], weight["model.embed_tokens.weight"], data["hiddenStates"])
        # The rest of the computation flow follows; see the example code for details.
```
#### `self.config`

The model configuration, read by default from the `config.json` file in the model folder.

You can modify `config` entries inside the `build` method, for example changing `max_positions` to adjust the context length.

For some models, `config.json` uses different variable names, and the values must be assigned manually during `build`. For example, the TeleChat7B model configuration has no `max_positions` variable and uses `seq_length` for the length instead, so `build` needs the assignment:

```python
self.config["max_positions"] = self.config["seq_length"]
```

The following `config` variables must be assigned (if the variable name in `config.json` already matches, nothing needs to be done):

```python
self.config["max_positions"] # Maximum context length
```
#### `self.weight`

Represents the weight data.

`self.weight[weightName]` is the parameter named `weightName` in the model file (corresponding to the parameter names in the `.safetensors` files in the HF model folder).
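For instance, the `Embedding` call in the `build` example above consumes the checkpoint tensor named `model.embed_tokens.weight` directly by that key; any other tensor in the `.safetensors` files is addressed the same way. The layer-norm key below is a typical Hugging Face name, used purely as an illustration:

```python
weight["model.embed_tokens.weight"]              # token embedding table (used in the build example)
weight["model.layers.0.input_layernorm.weight"]  # hypothetical key for layer 0's input LayerNorm
```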
#### `self.data`

Represents the intermediate variables and input variables of the computation flow.

`self.data[dataName]` is the intermediate variable named `dataName`. `dataName` can be any string except the following input variable names:

```python
data["inputIds"]      # Input tokens
data["positionIds"]   # Position information
data["attentionMask"] # Mask information
data["sin"]           # Sin for rotary embedding
data["cos"]           # Cos for rotary embedding
data["atype"]         # Data type used during inference
data["pastKey."][i]   # Key cache of the i-th block
data["pastValue."][i] # Value cache of the i-th block
```
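For example, in the `build` sketch above, `hiddenStates` is such an intermediate variable: it is not a reserved input name, so the `Embedding` call simply creates it as its output. Any other unreserved string works the same way:

```python
data["hiddenStates"] # Intermediate variable created by the Embedding call above
data["attnOutput"]   # Hypothetical name; any unreserved string is allowed
data["inputIds"]     # Reserved: always refers to the input tokens
```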
#### Computation Flow and Operators

Use the methods of the base class `ComputeGraph` to describe the computation flow.

The currently supported operators are documented in [Custom Model Operators](./custom_op.md).
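As a rough sketch of what such a flow can look like, the snippet below continues the `build` example. Only `Embedding` is confirmed above; the other operator names and signatures are assumptions and should be checked against [Custom Model Operators](./custom_op.md) and the Qwen2 example:

```python
# Inside build(), after the Embedding call; the operator names below are hypothetical.
for i in range(config["num_hidden_layers"]):
    prefix = "model.layers." + str(i) + "."
    # Normalize, project, attend, and so on, each op writing into a data[...] variable:
    self.RMSNorm(data["hiddenStates"], weight[prefix + "input_layernorm.weight"],
                 config["rms_norm_eps"], data["attnInput"])
    self.Linear(data["attnInput"], weight[prefix + "self_attn.q_proj.weight"],
                weight[prefix + "self_attn.q_proj.bias"], data["q"])
    # ... attention over data["pastKey."][i] / data["pastValue."][i], residual adds, MLP ...
```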
### Custom Models in C++

(The interface for custom models in C++ is still being revised...)
# Fastllm Python Demo Parameter Explanation

## General Parameters

These model-related parameters are shared by the OpenAI API Server, the WebUI, and the conversation demo.

- **Model Path (`-p, --path`)**: Specifies the path to the model, which can be a fastllm model file or a Hugging Face model directory. For example:
  ```bash
  --path ~/Qwen2-7B-Instruct/ # Reads the model from ~/Qwen2-7B-Instruct/. The model must be a standard Hugging Face-format model downloaded from HuggingFace, ModelScope, or another site; formats such as AWQ and GPTQ are currently not supported.
  --path ~/model.flm          # Reads the model from ~/model.flm, a Fastllm-format model file.
  ```
- **Inference Type (`--atype`)**: Sets the intermediate computation type; `float16` or `float32`.
- **Weight Type (`--dtype`)**: Specifies the weight type of the model, applicable when reading Hugging Face models. One of `float16`, `int8`, `int4`, or `int4g` (int4 grouped quantization). For example:
  ```bash
  --dtype float16  # Uses float16 weights (no quantization)
  --dtype int8     # Quantizes to int8 weights online
  --dtype int4g128 # Quantizes to grouped int4 weights online (128 weights per group)
  --dtype int4g256 # Quantizes to grouped int4 weights online (256 weights per group)
  --dtype int4     # Quantizes to int4 weights online
  ```
- **Device to Use (`--device`)**: Specifies the device used by the server; `cpu`, `cuda`, or other additionally compiled device types.
- **CUDA Embedding (`--cuda_embedding`)**: If this flag is given and the device is set to `cuda`, embedding operations are performed on the CUDA device, slightly increasing speed as well as GPU memory usage. Recommended only when GPU memory is ample.
- **KV Cache Maximum Usage (`--kv_cache_limit`)**: Sets the maximum memory the KV cache may use. If this parameter is omitted or set to `auto`, the framework handles it automatically. Manual examples:
  ```bash
  --kv_cache_limit 5G   # Limits the KV cache to 5G
  --kv_cache_limit 100M # Limits the KV cache to 100M
  --kv_cache_limit 168K # Limits the KV cache to 168K
  ```
- **Maximum Batch Size (`--max_batch`)**: Sets the number of requests processed simultaneously. If this parameter is omitted, the framework handles it automatically.
- **Number of Threads (`-t, --threads`)**: Sets the number of CPU threads. This significantly affects speed when the device is `cpu`; with `cuda` the impact is smaller and mainly concerns model-loading speed.
- **Custom Model Description File (`--custom`)**: Specifies the Python file describing the custom model. See [Custom Model](custom.md) for details.

## OpenAI API Server Configuration Parameters

- **Model Name (`--model_name`)**: Specifies the name of the deployed model; API calls are verified against this name.
- **API Server Host Address (`--host`)**: Sets the host address of the API server.
- **API Server Port Number (`--port`)**: Sets the port number of the API server.

## Web UI Configuration Parameters

- **WebUI Port Number (`--port`)**: Sets the port number of the WebUI.
- **Page Title (`--title`)**: Sets the page title of the WebUI.
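As with the server, the general parameters combine with the WebUI-specific ones. An illustrative launch (the values are examples, not defaults):

```bash
# Hypothetical invocation: float16 weights on GPU, WebUI served on port 8080.
python3 -m ftllm.webui -p ~/Qwen2-7B-Instruct/ --dtype float16 --device cuda \
    --port 8080 --title "Qwen2 Chat"
```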