
Commit

Update the parameter documentation
黄宇扬 committed Aug 13, 2024
1 parent 7d1973a commit f392c83
Showing 7 changed files with 214 additions and 16 deletions.
16 changes: 10 additions & 6 deletions README.md
@@ -41,30 +41,34 @@ bash install.sh -DUSE_CUDA=ON # build the GPU version
For compilation on other platforms, refer to the documentation:
[TFACC platform](docs/tfacc.md)

### Running the OpenAI API Server

After compilation and installation,

### Running the demo program (python)

Assuming our model is located in the "~/Qwen2-7B-Instruct/" directory

After compilation, you can use the following demos:

``` sh
# openai api server
# Requires dependencies: pip install -r requirements-server.txt
# Opens a server named 'qwen' on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen

# Chat with the model in float16 precision
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/

# Chat with the model quantized online to int8
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --dtype int8

# openai api server (currently in testing and tuning phase)
# Requires dependencies: pip install -r requirements-server.txt
# Opens a server named 'qwen' on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen

# webui
# Requires dependencies: pip install streamlit-chat
python3 -m ftllm.webui -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080
```
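
Once the API server from the block above is running, any OpenAI-compatible client can talk to it. A minimal curl check might look like this (a sketch assuming the standard OpenAI-style `/v1/chat/completions` route and the `qwen` model name used above):

``` sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```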

All of the above demos support the --help argument to view detailed parameters
All of the above demos support the --help argument to view detailed parameters; for detailed explanations, see [Parameter Documentation](docs/demo_arguments.md)

Current model support can be found at: [Model List](docs/models.md)

14 changes: 9 additions & 5 deletions README_EN.md
@@ -44,28 +44,32 @@ Assuming our model is located in the "~/Qwen2-7B-Instruct/" directory:
After compilation, you can use the following demos:

``` sh
# OpenAI API server (currently in testing and tuning phase)
# Requires dependencies: pip install -r requirements-server.txt
# Opens a server named 'qwen' on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen

# Use a model with float16 precision for conversation
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/

# Online quantization to int8 model for conversation
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --dtype int8

# OpenAI API server (currently in testing and tuning phase)
# Requires dependencies: pip install -r requirements-server.txt
# Opens a server named 'qwen' on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen

# webui
# Requires dependencies: pip install streamlit-chat
python3 -m ftllm.webui -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080
```
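
Because the server exposes an OpenAI-compatible API, the official `openai` Python client can also be pointed at it. The snippet below is only a sketch, assuming the standard `/v1` routes and the `qwen` model name configured above:

``` python
from openai import OpenAI

# Point the client at the local fastllm server started above
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```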

Detailed parameters can be viewed using the --help argument for all demos.

For detailed parameter explanations, please refer to [Parameter Documentation](docs/english_demo_arguments.md).

Current model support can be found at: [Model List](docs/models.md)

For architectures that cannot directly read Hugging Face models, refer to [Model Conversion Documentation](docs/convert_model.md) to convert models to fastllm format.

If you need to customize the model structure, you can refer to the detailed instructions in [Custom Model](docs/english_custom.md).

### Running the demo program (c++)

```
41 changes: 41 additions & 0 deletions docs/demo_arguments.md
@@ -0,0 +1,41 @@
# Fastllm Python Demo Parameter Explanation

## General Parameters

Model-related configuration; the OpenAI API Server, WebUI, and chat demo can all use these parameters.

- **Model Path (`-p, --path`)**: Specifies the path to the model, which can be a fastllm model file or a Hugging Face model directory. For example:
```bash
--path ~/Qwen2-7B-Instruct/ # Reads the model from ~/Qwen2-7B-Instruct/; the model needs to be a standard Hugging Face format model downloaded from HuggingFace, ModelScope, or other websites. Formats such as AWQ and GPTQ are currently not supported.
--path ~/model.flm # Reads the model from ~/model.flm; the model is a Fastllm format model file
```
- **Inference Type (`--atype`)**: Sets the intermediate computation type; can be specified as `float16` or `float32`.
- **Weight Type (`--dtype`)**: Specifies the weight type of the model, applicable when reading Hugging Face models. Can be specified as `float16`, `int8`, `int4`, or `int4g` (int4 grouped quantization), for example:
```bash
--dtype float16 # Use float16 weights (no quantization)
--dtype int8 # Quantize to int8 weights online
--dtype int4g128 # Quantize to int4 grouped weights online (128 weights per group)
--dtype int4g256 # Quantize to int4 grouped weights online (256 weights per group)
--dtype int4 # Quantize to int4 weights online
```
- **Device to Use (`--device`)**: Specifies the device used by the server. Can be specified as `cpu`, `cuda`, or other additionally compiled device types.
- **CUDA Embedding (`--cuda_embedding`)**: If this flag is set and the device is `cuda`, embedding operations are performed on the cuda device, which slightly increases speed as well as GPU memory usage; recommended only when GPU memory is plentiful.
- **KV Cache Maximum Usage (`--kv_cache_limit`)**: Sets the maximum usage of the KV cache. If this parameter is omitted or set to `auto`, the framework handles it automatically. Manual settings look like:
```bash
--kv_cache_limit 5G # Set to 5G
--kv_cache_limit 100M # Set to 100M
--kv_cache_limit 168K # Set to 168K
```
- **Maximum Batch Size (`--max_batch`)**: Sets the number of requests processed simultaneously. If omitted, the framework handles it automatically.
- **Number of Threads (`-t, --threads`)**: Sets the number of CPU threads. This has a large impact on speed when the device is `cpu`; with `cuda` the impact is small and mainly affects model loading speed.
- **Custom Model Description File (`--custom`)**: Specifies the Python file describing a custom model. See [Custom Model](custom.md) for details.

## OpenAI API Server Configuration Parameters
- **Model Name (`--model_name`)**: Specifies the name of the deployed model; the name is verified during API calls.
- **API Server Host Address (`--host`)**: Sets the host address of the API server.
- **API Server Port Number (`--port`)**: Sets the port number of the API server.


## Web UI Configuration Parameters
- **Web UI Port Number (`--port`)**: Sets the port number of the WebUI.
- **Page Title (`--title`)**: Sets the page title of the WebUI.
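
For illustration, the general and server parameters above can be combined into a single launch command; the specific values below are placeholders rather than recommendations:

```bash
# Illustrative only: GPU inference with online int4 grouped quantization,
# a 4G KV cache budget, and an OpenAI API server named "qwen" on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ \
    --device cuda --dtype int4g128 --atype float16 \
    --kv_cache_limit 4G --max_batch 32 \
    --port 8080 --model_name qwen
```
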
108 changes: 108 additions & 0 deletions docs/english_custom.md
@@ -0,0 +1,108 @@
### Custom Models

For models not natively supported by the Fastllm framework, you can add support by customizing the model structure.

A custom Python model requires only a Python file to describe the model structure. You can refer to the implementation in [QWEN](../example/python/qwen2.py).

### Using Python Custom Models

When using `ftllm.chat`, `ftllm.webui`, or `ftllm.server`, you can add the `--custom` parameter to specify the custom model file.

Assuming our model is located in the `~/Qwen2-7B-Instruct/` directory and the custom model is located in `~/qwen2.py`, you can use the command:

```sh
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --custom ~/qwen2.py
```

to load the Qwen2 model using the custom model file. The usage for `server` and `webui` is similar.
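
For reference, the corresponding server and WebUI invocations are sketched below; the port and model name are placeholders:

```sh
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --custom ~/qwen2.py --port 8080 --model_name qwen
python3 -m ftllm.webui -t 16 -p ~/Qwen2-7B-Instruct/ --custom ~/qwen2.py --port 8080
```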

### Writing Python Custom Models

When creating a custom model, you need to implement a model description class that inherits from `ftllm.llm.ComputeGraph`.

Refer to the code in [QWEN](../example/python/qwen2.py):

```python
from ftllm.llm import ComputeGraph
class Qwen2Model(ComputeGraph):
```

At the end of the file, you need to define the `__model__` variable to specify the class corresponding to the custom model structure, with the corresponding code:

```python
__model__ = Qwen2Model
```

The model description class needs to implement the `build` method to obtain model parameters and describe the computation flow.

Here is an example based on the sample code:

```python
class Qwen2Model(ComputeGraph):
def build(self):
# 1. Get weight, data, config
weight, data, config = self.weight, self.data, self.config

# 2. Set some config
config["max_positions"] = 128000

# 3. Describe the computation flow
head_dim = config["hidden_size"] // config["num_attention_heads"]
self.Embedding(data["inputIds"], weight["model.embed_tokens.weight"], data["hiddenStates"]);
# The following is the computation flow, see the example code for details
```

#### `self.config`

The model configuration, which by default is read from the `config.json` file in the model folder.

You can modify parameters in the `config` within the `build` method, such as changing `max_positions` to modify the context length.

For some models, the variable names used in `config.json` may differ and need to be manually assigned during the `build` process.

For example, the TeleChat7B model configuration has no `max_positions` variable; it uses `seq_length` to represent the length instead. In the `build` method, you need to assign it with the following code:

```python
self.config["max_positions"] = self.config["seq_length"]
```

In the `config`, the following variables must be assigned (if the variable names in `config.json` are consistent, no action is needed):

```python
self.config["max_positions"] # Represents the maximum context length
```

#### `self.weight`

Represents the weight data.

`self.weight[weightName]` represents the parameter named `weightName` in the model file (corresponding to the parameter names in the `.safetensors` file in the HF model folder).

#### `self.data`

Represents the intermediate variables and input variables of the computation flow.

`self.data[dataName]` represents the intermediate variable named `dataName`. `dataName` can be any string except for the following input variable names:

Input variables:

```python
data["inputIds"] # Input tokens
data["positionIds"] # Position information
data["attentionMask"] # Mask information
data["sin"] # Sin for rotary encoding
data["cos"] # Cos for rotary encoding
data["atype"] # Data type in inference
data["pastKey."][i] # Key cache for the i-th block
data["pastValue."][i] # Value cache for the i-th block
```

#### Computation Flow and Operators

Use the functions of the base class `ComputeGraph` to describe the computation flow.

The currently supported operators are documented in [Custom Model Operators](./custom_op.md).

### Custom Models in C++

(The interface for custom models in C++ is still under modification...)
40 changes: 40 additions & 0 deletions docs/english_demo_arguments.md
@@ -0,0 +1,40 @@
# Fastllm Python Demo Parameter Explanation

## General Parameters

Model-related configuration; the OpenAI API Server, WebUI, and conversation demo can all use these parameters.

- **Model Path (`-p, --path`)**: Specifies the path to the model, which can be a fastllm model file or a Hugging Face model directory. For example:
```bash
--path ~/Qwen2-7B-Instruct/ # Reads the model from ~/Qwen2-7B-Instruct/, where the model needs to be a standard Hugging Face format model downloaded from HuggingFace, ModelScope, or other websites. Formats like AWQ, GPTQ, etc., are currently not supported.
--path ~/model.flm # Reads the model from ~/model.flm, where the model is a Fastllm format model file
```
- **Inference Type (`--atype`)**: Sets the intermediate computation type, which can be specified as `float16` or `float32`.
- **Weight Type (`--dtype`)**: Specifies the weight type of the model, applicable when reading Hugging Face models. It can be specified as `float16`, `int8`, `int4`, `int4g` (int4 grouped quantization), for example:
```bash
--dtype float16 # Uses float16 weights (no quantization)
--dtype int8 # Quantizes to int8 weights online
--dtype int4g128 # Quantizes to int4 grouped weights online (128 weights per group)
--dtype int4g256 # Quantizes to int4 grouped weights online (256 weights per group)
--dtype int4 # Quantizes to int4 weights online
```
- **Device to Use (`--device`)**: Specifies the device used by the server. It can be specified as `cpu`, `cuda`, or other device types compiled additionally.
- **CUDA Embedding (`--cuda_embedding`)**: If this configuration is included and the device is set to `cuda`, embedding operations will be performed on the cuda device, slightly increasing speed and GPU memory usage. It is recommended to use this when there is ample GPU memory.
- **KV Cache Maximum Usage (`--kv_cache_limit`)**: Sets the maximum usage for the KV cache. If this parameter is not used or set to `auto`, the framework will handle it automatically. Manual settings examples are as follows:
```bash
--kv_cache_limit 5G # Sets to 5G
--kv_cache_limit 100M # Sets to 100M
--kv_cache_limit 168K # Sets to 168K
```
- **Maximum Batch Size (`--max_batch`)**: Sets the number of requests processed simultaneously each time. If this parameter is not used, the framework will handle it automatically.
- **Number of Threads (`-t, --threads`)**: Sets the number of CPU threads, which significantly affects speed when the device is set to `cpu`, and has a smaller impact when set to `cuda`, mainly affecting the speed of model loading.
- **Custom Model Description File (`--custom`)**: Specifies the Python file describing the custom model. See [Custom Model](custom.md) for details.

## OpenAI API Server Configuration Parameters
- **Model Name (`--model_name`)**: Specifies the name of the deployed model, which will be verified during API calls.
- **API Server Host Address (`--host`)**: Sets the host address of the API server.
- **API Server Port Number (`--port`)**: Sets the port number of the API server.

## Web UI Configuration Parameters
- **Web UI Port Number (`--port`)**: Sets the port number for the WebUI.
- **Page Title (`--title`)**: Sets the page title for the WebUI.
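
For illustration, a WebUI launch that combines several of the parameters above might look like the following; the values are placeholders:

```bash
# Illustrative only: CPU inference with online int8 quantization and a custom page title
python3 -m ftllm.webui -t 16 -p ~/Qwen2-7B-Instruct/ --device cpu --dtype int8 \
    --port 8080 --title "Qwen2 demo"
```
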
9 changes: 5 additions & 4 deletions tools/fastllm_pytools/llm.py
@@ -1058,16 +1058,17 @@ def set_kv_cache_limit(self, limit: str):
limit_bytes = 0
try:
if (limit.endswith('k') or limit.endswith('K')):
limit_bytes = int(limit[:-1]) * 1024
limit_bytes = float(limit[:-1]) * 1e3
elif (limit.endswith('m') or limit.endswith('M')):
limit_bytes = int(limit[:-1]) * 1024 * 1024
limit_bytes = float(limit[:-1]) * 1e6
elif (limit.endswith('g') or limit.endswith('G')):
limit_bytes = int(limit[:-1]) * 1024 * 1024 * 1024
limit_bytes = float(limit[:-1]) * 1e9
else:
limit_bytes = int(limit[:-1])
limit_bytes = float(limit[:-1])
except:
print('set_kv_cache_limit error, param should be like "10k" or "10m" or "1g"')
exit(0)
limit_bytes = int(limit_bytes)
fastllm_lib.set_kv_cache_limit_llm_model(self.model, ctypes.c_int64(limit_bytes))

def set_max_batch(self, batch: int):
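
For context, the `set_kv_cache_limit` change above switches the size-string suffixes from binary to decimal units (so `5G` now means 5e9 bytes) and allows fractional values. A standalone sketch of that kind of parsing, using a hypothetical helper that is not part of fastllm, might look like:

```python
def parse_size(limit: str) -> int:
    """Parse a size string such as '5G', '100M', or '168K' into a byte count (decimal units)."""
    units = {"k": 1e3, "m": 1e6, "g": 1e9}
    suffix = limit[-1].lower()
    if suffix not in units:
        raise ValueError('expected a value like "10k", "10m" or "1g"')
    return int(float(limit[:-1]) * units[suffix])

# parse_size("5G")   -> 5000000000
# parse_size("100M") -> 100000000
# parse_size("168K") -> 168000
```
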
2 changes: 1 addition & 1 deletion tools/scripts/web_demo.py
@@ -17,7 +17,7 @@ def make_normal_parser(des: str) -> argparse.ArgumentParser:

def parse_args():
parser = make_normal_parser("fastllm webui")
parser.add_argument("--port", type = int, default = 8080, help = "API server port")
parser.add_argument("--port", type = int, default = 8080, help = "网页端口")
parser.add_argument("--title", type = str, default = "fastllm webui", help = "页面标题")
return parser.parse_args()

