GPU devices: A10, 3090, V100, and A100 are all acceptable. For GPUs with <=24 GB of memory, at least a dual-card environment is required: human alignment training loads two models onto one card (the policy model being trained plus a reference model used only for inference), so it occupies more memory than fine-tuning.
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
# Environment alignment (usually unnecessary; run the following commands if you encounter errors, as the repository is tested against the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
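After installation, you can optionally sanity-check the environment (a minimal sketch; it only confirms that the package is installed and importable, nothing more):

# Confirm the ms-swift package is installed and report where it was imported from
pip show ms-swift
python -c "import swift; print(swift.__file__)"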
The following shell script runs human alignment (DPO) training. First, switch to the runtime directory:
cd examples/pytorch/llm
Run the following command:
# Experimental environment: 4*A100
# Memory usage: 4 * 20GB (2 DDP processes, each splitting the model across 2 cards via device_map)
nproc_per_node=2
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift rlhf \
--rlhf_type dpo \
--model_type yi-6b-chat \
--ref_model_type yi-6b-chat \
--model_revision master \
--sft_type lora \
--tuner_backend swift \
--dtype AUTO \
--output_dir output \
--dataset hh-rlhf-cn:harmless_base_cn \
--num_train_epochs 3 \
--max_length 1024 \
--max_prompt_length 512 \
--check_dataset_strategy none \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0.05 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0.1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--max_grad_norm 1.0 \
--warmup_ratio 0.03 \
--eval_steps 2000 \
--save_steps 2000 \
--save_total_limit 2 \
--logging_steps 10
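The --gradient_accumulation_steps expression keeps the effective global batch size fixed at 16 regardless of how many DDP processes are launched. A quick check of the arithmetic (a sketch mirroring the values in the command above):

# effective global batch = per-device batch * DDP processes * accumulation steps
nproc_per_node=2
batch_size=1
grad_accum=$(expr 16 / $nproc_per_node)                      # 8
echo $(expr $batch_size \* $nproc_per_node \* $grad_accum)   # prints 16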
The shell script can be viewed here.
# The following script needs to be executed in this directory
cd examples/pytorch/llm
Tips:
- We default to --gradient_checkpointing true during training to save memory; this slightly reduces training speed.
- If you are using older GPUs such as V100, you need to set --dtype AUTO or --dtype fp16, because they do not support bf16.
- If your machine has high-performance GPUs like A100 and you are using the qwen series models, we recommend installing flash-attn, which speeds up training and inference and reduces memory usage (3090, V100, and similar GPUs do not support training with flash-attn). Models that support flash-attn are listed in LLM Supported Models.
- If you need to train offline, please use --model_id_or_path <model_dir> and set --check_model_is_latest false (see the sketch after this list). For specific parameter meanings, please see Command Line Arguments.
- If you want to push weights to the ModelScope Hub during training, you need to set --push_to_hub true.
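For offline training, a minimal sketch of the adjusted invocation is shown below; /path/to/yi-6b-chat is a placeholder for your local model directory, and the remaining hyperparameters are assumed to match the full command above.

# Offline variant (sketch): point --model_id_or_path at a local model directory
# and skip the hub freshness check
CUDA_VISIBLE_DEVICES=0,1 \
swift rlhf \
--rlhf_type dpo \
--model_type yi-6b-chat \
--model_id_or_path /path/to/yi-6b-chat \
--check_model_is_latest false \
--sft_type lora \
--dataset hh-rlhf-cn:harmless_base_cn \
--output_dir output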
# DPO training for mistral-7b, max_length=1024, bs=1
# Recommended experimental environment: V100, A10, or 3090; 2, 4, or 8 cards
bash scripts/dpo/lora_ddp_mp/dpo.sh
bash scripts/dpo/lora_ddp_mp/infer.sh
DPO training produces either full model weights or LoRA adapter weights, so the subsequent LoRA merging and inference steps are the same as for fine-tuning; please refer to the corresponding steps in the Fine-tuning Documentation.
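As a sketch of that workflow (the checkpoint directory below is a placeholder for the one produced by your training run, and the flags mirror the fine-tuning documentation rather than anything DPO-specific), you can merge the LoRA adapter and run inference with:

# Merge the LoRA adapter into the base model and run inference on the merged weights
swift infer \
--ckpt_dir output/yi-6b-chat/vx-xxx/checkpoint-xxx \
--merge_lora true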