
Commit

Update the parameter documentation
黄宇扬 committed Aug 13, 2024
1 parent 7d1973a commit f392c83
Showing 7 changed files with 214 additions and 16 deletions.
16 changes: 10 additions & 6 deletions README.md
@@ -41,30 +41,34 @@ bash install.sh -DUSE_CUDA=ON # build the GPU version
For compilation on other platforms, refer to the documentation:
[TFACC platform](docs/tfacc.md)

### Running the OpenAI API Server

After compilation and installation,

### Running the demo program (python)

Assuming our model is located in the "~/Qwen2-7B-Instruct/" directory

After compilation, you can use the following demos:

``` sh
# openai api server
# Requires dependencies: pip install -r requirements-server.txt
# Opens a server named 'qwen' on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen

# Chat with the model in float16 precision
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/

# Chat with the model quantized online to int8
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --dtype int8

# openai api server (currently in testing and tuning phase)
# Requires dependencies: pip install -r requirements-server.txt
# Opens a server named 'qwen' on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen

# webui
# Requires dependencies: pip install streamlit-chat
python3 -m ftllm.webui -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080
```
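
Once the API server from the block above is running, any OpenAI-compatible client can talk to it. A minimal curl check might look like this (a sketch assuming the standard OpenAI-style `/v1/chat/completions` route and the `qwen` model name used above):

``` sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```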

All of the above demos support the --help argument to view detailed parameters
All of the above demos support the --help argument to view detailed parameters; for detailed explanations, see [Parameter Documentation](docs/demo_arguments.md)

Current model support can be found at: [Model List](docs/models.md)

14 changes: 9 additions & 5 deletions README_EN.md
@@ -44,28 +44,32 @@ Assuming our model is located in the "~/Qwen2-7B-Instruct/" directory:
After compilation, you can use the following demos:

``` sh
# OpenAI API server (currently in testing and tuning phase)
# Requires dependencies: pip install -r requirements-server.txt
# Opens a server named 'qwen' on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen

# Use a model with float16 precision for conversation
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/

# Online quantization to int8 model for conversation
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --dtype int8

# OpenAI API server (currently in testing and tuning phase)
# Requires dependencies: pip install -r requirements-server.txt
# Opens a server named 'qwen' on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen

# webui
# Requires dependencies: pip install streamlit-chat
python3 -m ftllm.webui -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080
```
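
Because the server exposes an OpenAI-compatible API, the official `openai` Python client can also be pointed at it. The snippet below is only a sketch, assuming the standard `/v1` routes and the `qwen` model name configured above:

``` python
from openai import OpenAI

# Point the client at the local fastllm server started above
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```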

Detailed parameters can be viewed using the --help argument for all demos.

For detailed parameter explanations, please refer to [Parameter Documentation](docs/english_demo_arguments.md).

Current model support can be found at: [Model List](docs/models.md)

For architectures that cannot directly read Hugging Face models, refer to [Model Conversion Documentation](docs/convert_model.md) to convert models to fastllm format.

If you need to customize the model structure, you can refer to the detailed instructions in [Custom Model](docs/english_custom.md).

### Running the demo program (c++)

```
41 changes: 41 additions & 0 deletions docs/demo_arguments.md
@@ -0,0 +1,41 @@
# Fastllm Python Demo Parameter Explanation

## General Parameters

Model-related configuration; the OpenAI API Server, WebUI, and chat demo can all use these parameters.

- **Model Path (`-p, --path`)**: Specifies the path to the model, which can be a fastllm model file or a Hugging Face model directory. For example:
```bash
--path ~/Qwen2-7B-Instruct/ # Reads the model from ~/Qwen2-7B-Instruct/; the model needs to be a standard Hugging Face format model downloaded from HuggingFace, ModelScope, or other websites. Formats such as AWQ and GPTQ are currently not supported.
--path ~/model.flm # Reads the model from ~/model.flm; the model is a Fastllm format model file
```
- **Inference Type (`--atype`)**: Sets the intermediate computation type; can be specified as `float16` or `float32`.
- **Weight Type (`--dtype`)**: Specifies the weight type of the model, applicable when reading Hugging Face models. Can be specified as `float16`, `int8`, `int4`, or `int4g` (int4 grouped quantization), for example:
```bash
--dtype float16 # Use float16 weights (no quantization)
--dtype int8 # Quantize to int8 weights online
--dtype int4g128 # Quantize to int4 grouped weights online (128 weights per group)
--dtype int4g256 # Quantize to int4 grouped weights online (256 weights per group)
--dtype int4 # Quantize to int4 weights online
```
- **Device to Use (`--device`)**: Specifies the device used by the server. Can be specified as `cpu`, `cuda`, or other additionally compiled device types.
- **CUDA Embedding (`--cuda_embedding`)**: If this flag is set and the device is `cuda`, embedding operations are performed on the cuda device, which slightly increases speed as well as GPU memory usage; recommended only when GPU memory is plentiful.
- **KV Cache Maximum Usage (`--kv_cache_limit`)**: Sets the maximum usage of the KV cache. If this parameter is omitted or set to `auto`, the framework handles it automatically. Manual settings look like:
```bash
--kv_cache_limit 5G # Set to 5G
--kv_cache_limit 100M # Set to 100M
--kv_cache_limit 168K # Set to 168K
```
- **Maximum Batch Size (`--max_batch`)**: Sets the number of requests processed simultaneously. If omitted, the framework handles it automatically.
- **Number of Threads (`-t, --threads`)**: Sets the number of CPU threads. This has a large impact on speed when the device is `cpu`; with `cuda` the impact is small and mainly affects model loading speed.
- **Custom Model Description File (`--custom`)**: Specifies the Python file describing a custom model. See [Custom Model](custom.md) for details.

## OpenAI API Server Configuration Parameters
- **Model Name (`--model_name`)**: Specifies the name of the deployed model; the name is verified during API calls.
- **API Server Host Address (`--host`)**: Sets the host address of the API server.
- **API Server Port Number (`--port`)**: Sets the port number of the API server.


## Web UI Configuration Parameters
- **Web UI Port Number (`--port`)**: Sets the port number of the WebUI.
- **Page Title (`--title`)**: Sets the page title of the WebUI.
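
For illustration, the general and server parameters above can be combined into a single launch command; the specific values below are placeholders rather than recommendations:

```bash
# Illustrative only: GPU inference with online int4 grouped quantization,
# a 4G KV cache budget, and an OpenAI API server named "qwen" on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ \
    --device cuda --dtype int4g128 --atype float16 \
    --kv_cache_limit 4G --max_batch 32 \
    --port 8080 --model_name qwen
```
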
108 changes: 108 additions & 0 deletions docs/english_custom.md
@@ -0,0 +1,108 @@
### Custom Models

For models not natively supported by the Fastllm framework, you can add support by customizing the model structure.

A custom Python model requires only a Python file to describe the model structure. You can refer to the implementation in [QWEN](../example/python/qwen2.py).

### Using Python Custom Models

When using `ftllm.chat`, `ftllm.webui`, or `ftllm.server`, you can add the `--custom` parameter to specify the custom model file.

Assuming our model is located in the `~/Qwen2-7B-Instruct/` directory and the custom model is located in `~/qwen2.py`, you can use the command:

```sh
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --custom ~/qwen2.py
```

to load the Qwen2 model using the custom model file. The usage for `server` and `webui` is similar.
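
For reference, the corresponding server and WebUI invocations are sketched below; the port and model name are placeholders:

```sh
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --custom ~/qwen2.py --port 8080 --model_name qwen
python3 -m ftllm.webui -t 16 -p ~/Qwen2-7B-Instruct/ --custom ~/qwen2.py --port 8080
```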

### Writing Python Custom Models

When creating a custom model, you need to implement a model description class that inherits from `ftllm.llm.ComputeGraph`.

Refer to the code in [QWEN](../example/python/qwen2.py):

```python
from ftllm.llm import ComputeGraph
class Qwen2Model(ComputeGraph):
```

At the end of the file, you need to define the `__model__` variable to specify the class corresponding to the custom model structure, with the corresponding code:

```python
__model__ = Qwen2Model
```

The model description class needs to implement the `build` method to obtain model parameters and describe the computation flow.

Here is an example based on the sample code:

```python
class Qwen2Model(ComputeGraph):
def build(self):
# 1. Get weight, data, config
weight, data, config = self.weight, self.data, self.config

# 2. Set some config
config["max_positions"] = 128000

# 3. Describe the computation flow
head_dim = config["hidden_size"] // config["num_attention_heads"]
self.Embedding(data["inputIds"], weight["model.embed_tokens.weight"], data["hiddenStates"]);
# The following is the computation flow, see the example code for details
```

#### `self.config`

The model configuration, which by default is read from the `config.json` file in the model folder.

You can modify parameters in the `config` within the `build` method, such as changing `max_positions` to modify the context length.

For some models, the variable names used in `config.json` may differ and need to be manually assigned during the `build` process.

For example, the TeleChat7B model configuration has no `max_positions` variable; it uses `seq_length` to represent the length instead. In the `build` method, you need to assign it with the following code:

```python
self.config["max_positions"] = self.config["seq_length"]
```

In the `config`, the following variables must be assigned (if the variable names in `config.json` are consistent, no action is needed):

```python
self.config["max_positions"] # Represents the maximum context length
```

#### `self.weight`

Represents the weight data.

`self.weight[weightName]` represents the parameter named `weightName` in the model file (corresponding to the parameter names in the `.safetensors` file in the HF model folder).

#### `self.data`

Represents the intermediate variables and input variables of the computation flow.

`self.data[dataName]` represents the intermediate variable named `dataName`. `dataName` can be any string except for the following input variable names:

Input variables:

```python
data["inputIds"] # Input tokens
data["positionIds"] # Position information
data["attentionMask"] # Mask information
data["sin"] # Sin for rotary encoding
data["cos"] # Cos for rotary encoding
data["atype"] # Data type in inference
data["pastKey."][i] # Key cache for the i-th block
data["pastValue."][i] # Value cache for the i-th block
```

#### Computation Flow and Operators

Use the functions of the base class `ComputeGraph` to describe the computation flow.

The currently supported operators are documented in [Custom Model Operators](./custom_op.md).

### Custom Models in C++

(The interface for custom models in C++ is still under modification...)
40 changes: 40 additions & 0 deletions docs/english_demo_arguments.md
@@ -0,0 +1,40 @@
# Fastllm Python Demo Parameter Explanation

## General Parameters

Model-related configuration; the OpenAI API Server, WebUI, and conversation demo can all use these parameters.

- **Model Path (`-p, --path`)**: Specifies the path to the model, which can be a fastllm model file or a Hugging Face model directory. For example:
```bash
--path ~/Qwen2-7B-Instruct/ # Reads the model from ~/Qwen2-7B-Instruct/, where the model needs to be a standard Hugging Face format model downloaded from HuggingFace, ModelScope, or other websites. Formats like AWQ, GPTQ, etc., are currently not supported.
--path ~/model.flm # Reads the model from ~/model.flm, where the model is a Fastllm format model file
```
- **Inference Type (`--atype`)**: Sets the intermediate computation type, which can be specified as `float16` or `float32`.
- **Weight Type (`--dtype`)**: Specifies the weight type of the model, applicable when reading Hugging Face models. It can be specified as `float16`, `int8`, `int4`, `int4g` (int4 grouped quantization), for example:
```bash
--dtype float16 # Uses float16 weights (no quantization)
--dtype int8 # Quantizes to int8 weights online
--dtype int4g128 # Quantizes to int4 grouped weights online (128 weights per group)
--dtype int4g256 # Quantizes to int4 grouped weights online (256 weights per group)
--dtype int4 # Quantizes to int4 weights online
```
- **Device to Use (`--device`)**: Specifies the device used by the server. It can be specified as `cpu`, `cuda`, or other device types compiled additionally.
- **CUDA Embedding (`--cuda_embedding`)**: If this configuration is included and the device is set to `cuda`, embedding operations will be performed on the cuda device, slightly increasing speed and GPU memory usage. It is recommended to use this when there is ample GPU memory.
- **KV Cache Maximum Usage (`--kv_cache_limit`)**: Sets the maximum usage for the KV cache. If this parameter is not used or set to `auto`, the framework will handle it automatically. Manual settings examples are as follows:
```bash
--kv_cache_limit 5G # Sets to 5G
--kv_cache_limit 100M # Sets to 100M
--kv_cache_limit 168K # Sets to 168K
```
- **Maximum Batch Size (`--max_batch`)**: Sets the number of requests processed simultaneously each time. If this parameter is not used, the framework will handle it automatically.
- **Number of Threads (`-t, --threads`)**: Sets the number of CPU threads, which significantly affects speed when the device is set to `cpu`, and has a smaller impact when set to `cuda`, mainly affecting the speed of model loading.
- **Custom Model Description File (`--custom`)**: Specifies the Python file describing the custom model. See [Custom Model](custom.md) for details.

## OpenAI API Server Configuration Parameters
- **Model Name (`--model_name`)**: Specifies the name of the deployed model, which will be verified during API calls.
- **API Server Host Address (`--host`)**: Sets the host address of the API server.
- **API Server Port Number (`--port`)**: Sets the port number of the API server.

## Web UI Configuration Parameters
- **Web UI Port Number (`--port`)**: Sets the port number for the WebUI.
- **Page Title (`--title`)**: Sets the page title for the WebUI.
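
For illustration, a WebUI launch that combines several of the parameters above might look like the following; the values are placeholders:

```bash
# Illustrative only: CPU inference with online int8 quantization and a custom page title
python3 -m ftllm.webui -t 16 -p ~/Qwen2-7B-Instruct/ --device cpu --dtype int8 \
    --port 8080 --title "Qwen2 demo"
```
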
9 changes: 5 additions & 4 deletions tools/fastllm_pytools/llm.py
@@ -1058,16 +1058,17 @@ def set_kv_cache_limit(self, limit: str):
limit_bytes = 0
try:
if (limit.endswith('k') or limit.endswith('K')):
limit_bytes = int(limit[:-1]) * 1024
limit_bytes = float(limit[:-1]) * 1e3
elif (limit.endswith('m') or limit.endswith('M')):
limit_bytes = int(limit[:-1]) * 1024 * 1024
limit_bytes = float(limit[:-1]) * 1e6
elif (limit.endswith('g') or limit.endswith('G')):
limit_bytes = int(limit[:-1]) * 1024 * 1024 * 1024
limit_bytes = float(limit[:-1]) * 1e9
else:
limit_bytes = int(limit[:-1])
limit_bytes = float(limit[:-1])
except:
print('set_kv_cache_limit error, param should be like "10k" or "10m" or "1g"')
exit(0)
limit_bytes = int(limit_bytes)
fastllm_lib.set_kv_cache_limit_llm_model(self.model, ctypes.c_int64(limit_bytes))

def set_max_batch(self, batch: int):
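
For context, the `set_kv_cache_limit` change above switches the size-string suffixes from binary to decimal units (so `5G` now means 5e9 bytes) and allows fractional values. A standalone sketch of that kind of parsing, using a hypothetical helper that is not part of fastllm, might look like:

```python
def parse_size(limit: str) -> int:
    """Parse a size string such as '5G', '100M', or '168K' into a byte count (decimal units)."""
    units = {"k": 1e3, "m": 1e6, "g": 1e9}
    suffix = limit[-1].lower()
    if suffix not in units:
        raise ValueError('expected a value like "10k", "10m" or "1g"')
    return int(float(limit[:-1]) * units[suffix])

# parse_size("5G")   -> 5000000000
# parse_size("100M") -> 100000000
# parse_size("168K") -> 168000
```
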
2 changes: 1 addition & 1 deletion tools/scripts/web_demo.py
@@ -17,7 +17,7 @@ def make_normal_parser(des: str) -> argparse.ArgumentParser:

def parse_args():
parser = make_normal_parser("fastllm webui")
parser.add_argument("--port", type = int, default = 8080, help = "API server port")
parser.add_argument("--port", type = int, default = 8080, help = "网页端口")
parser.add_argument("--title", type = str, default = "fastllm webui", help = "页面标题")
return parser.parse_args()

