fastllm is a high-performance large model inference library implemented purely in C++ with no third-party dependencies, supporting multiple platforms.
Deployment and communication QQ group: 831641348
| Quick Start | Model Acquisition |
- 🚀 Pure C++ implementation, facilitating cross-platform移植, directly compilable on Android
- 🚀 Supports reading Hugging Face raw models and direct quantization
- 🚀 Supports deploying OpenAI API server
- 🚀 Supports multi-card deployment, supports GPU + CPU hybrid deployment
- 🚀 Supports dynamic batching, streaming output
- 🚀 Front-end and back-end separation design, easy to support new computing devices
- 🚀 Currently supports ChatGLM series models, Qwen2 series models, various LLAMA models (ALPACA, VICUNA, etc.), BAICHUAN models, MOSS models, MINICPM models, etc.
It is recommended to use cmake for compilation, requiring pre-installed gcc, g++ (recommended 9.4 or above), make, cmake (recommended 3.23 or above).
GPU compilation requires a pre-installed CUDA compilation environment, using the latest CUDA version is recommended.
Compile using the following commands:
bash install.sh -DUSE_CUDA=ON # Compile GPU version
# bash install.sh -DUSE_CUDA=ON -DCUDA_ARCH=89 # Specify CUDA architecture, e.g., 4090 uses architecture 89
# bash install.sh # Compile CPU version only
For compilation on other platforms, refer to the documentation: TFACC Platform
Assuming our model is located in the "/mnt/hfmodels/Qwen/Qwen2-7B-Instruct/" directory:
After compilation, you can use the following demos:
# Use a model with float16 precision for conversation
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/
# Online quantization to int8 model for conversation
python3 -m ftllm.chat -t 16 -p ~/Qwen2-7B-Instruct/ --dtype int8
# OpenAI API server (currently in testing and tuning phase)
# Requires dependencies: pip install -r requirements-server.txt
# Opens a server named 'qwen' on port 8080
python3 -m ftllm.server -t 16 -p ~/Qwen2-7B-Instruct/ --port 8080 --model_name qwen
Detailed parameters can be viewed using the --help argument for all demos.
Current model support can be found at: Model List
For architectures that cannot directly read Hugging Face models, refer to Model Conversion Documentation to convert models to fastllm format.
# Enter the fastllm/build-fastllm directory
# Command line chat program, supports typewriter effect (Linux only)
./main -p model.flm
# Simple webui, uses streaming output + dynamic batch, supports concurrent access
./webui -p model.flm --port 1234
Compilation on Windows is recommended using Cmake GUI + Visual Studio, completed in the graphical interface.
For compilation issues, especially on Windows, refer to FAQ.
# Model creation
from ftllm import llm
model = llm.model("model.flm")
# Generate response
print(model.response("你好"))
# Stream generate response
for response in model.stream_response("你好"):
print(response, flush = True, end = "")
Additional settings such as CPU thread count can be found in the detailed API documentation: ftllm
This package does not include low-level APIs. For deeper functionalities, refer to Python Binding API.
# Use the --device parameter to set multi-card calls
#--device cuda:1 # Set single device
#--device "['cuda:0', 'cuda:1']" # Deploy model evenly across multiple devices
#--device "{'cuda:0': 10, 'cuda:1': 5, 'cpu': 1} # Deploy model proportionally across multiple devices
from ftllm import llm
# Supports the following three methods, must be called before model creation
llm.set_device_map("cuda:0") # Deploy model on a single device
llm.set_device_map(["cuda:0", "cuda:1"]) # Deploy model evenly across multiple devices
llm.set_device_map({"cuda:0" : 10, "cuda:1" : 5, "cpu": 1}) # Deploy model proportionally across multiple devices
import pyfastllm as llm
# Supports the following method, must be called before model creation
llm.set_device_map({"cuda:0" : 10, "cuda:1" : 5, "cpu": 1}) # Deploy model proportionally across multiple devices
// Supports the following method, must be called before model creation
fastllm::SetDeviceMap({{"cuda:0", 10}, {"cuda:1", 5}, {"cpu", 1}}); // Deploy model proportionally across multiple devices
Running docker requires the local installation of NVIDIA Runtime and modification of the default runtime to nvidia.
- Install nvidia-container-runtime
sudo apt-get install nvidia-container-runtime
- Modify docker default runtime to nvidia
/etc/docker/daemon.json
{
"registry-mirrors": [
"https://hub-mirror.c.163.com",
"https://mirror.baidubce.com"
],
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia" // This line is required
}
- Download the converted models to the models directory
models
chatglm2-6b-fp16.flm
chatglm2-6b-int8.flm
- Compile and start webui
DOCKER_BUILDKIT=0 docker compose up -d --build
# Compilation on PC requires downloading NDK tools
# You can also try compiling on the phone, using cmake and gcc in termux (no need for NDK)
mkdir build-android
cd build-android
export NDK=<your_ndk_directory>
# If the phone does not support, remove "-DCMAKE_CXX_FLAGS=-march=armv8.2a+dotprod" (most new phones support this)
cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_CXX_FLAGS=-march=armv8.2a+dotprod ..
make -j
- Install the termux app on the Android device.
- Execute termux-setup-storage in termux to gain permission to read phone files.
- Copy the main file and model file compiled with NDK into the phone and into the termux root directory.
- Use the command
chmod 777 main
to grant permissions. - Run the main file, refer to
./main --help
for parameter format.