We use vLLM to enable batched generation. First, install dependencies:
```bash
pip install vllm openai
```
```bash
python -m vllm.entrypoints.openai.api_server \
    --model YOUR_MODEL_NAME --port 8000
```
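Once a server is up, any OpenAI-compatible client can query it. As a quick sanity check (a minimal sketch using the `openai>=1.0` client; the model name must match what you passed to `--model`, and vLLM does not validate the API key):

```python
# Minimal sanity check against a running vLLM server.
from openai import OpenAI

# vLLM's OpenAI-compatible server ignores the API key; any string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="YOUR_MODEL_NAME",  # must match the --model passed to the server
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```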
You can also start multiple servers on different ports to enable parallel generation. In `generate.py`, we scan ports 8000 to 8009 to find available servers; you can modify the code to use other ports.
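Conceptually, the scan does something like the following (a sketch, not the repo's exact code, assuming a plain TCP connectivity check):

```python
import socket

def find_available_servers(host="localhost", ports=range(8000, 8010)):
    """Return the ports in `ports` that accept a TCP connection."""
    available = []
    for port in ports:
        try:
            # A successful connect means a server is listening on this port.
            with socket.create_connection((host, port), timeout=1):
                available.append(port)
        except OSError:
            pass
    return available
```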
The following command lets the model continue the first prompt from each sample in `DATA_PATH`; this is suitable for models that can play both roles in a conversation (e.g., Zephyr 7B). If you instead want to feed all prompts in each sample to the model one turn at a time, pass `--chat`. `--chat` mode works for more models but may take longer to generate due to repeated computation (contributions of a better implementation are welcome); both modes are sketched below.
```bash
python generate.py --data_path YOUR_DATA_PATH --output_path YOUR_OUTPUT_PATH \
    --num_threads NUM_THREADS --max_tokens YOUR_MAX_TOKENS --temperature YOUR_TEMPERATURE
```
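To make the two modes concrete, here is a rough sketch of the per-sample logic (the helper `query_model` is hypothetical and stands in for the actual server call; field names are illustrative):

```python
def generate_default(sample, query_model):
    # Default mode: send only the first prompt and let the model
    # continue the conversation on its own, playing both roles.
    return query_model(sample["prompts"][0])

def generate_chat(sample, query_model):
    # --chat mode: replay every prompt in the sample as a user turn.
    # Each turn re-sends the growing history, hence the repeated computation.
    messages = []
    replies = []
    for prompt in sample["prompts"]:
        messages.append({"role": "user", "content": prompt})
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```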
When generating with `--chat`, the output file follows the ShareGPT format (example).
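For reference, a ShareGPT-style record looks roughly like this (field names follow the common convention; the linked example is authoritative):

```json
[
  {
    "id": "sample_0",
    "conversations": [
      {"from": "human", "value": "First prompt from the sample"},
      {"from": "gpt", "value": "Model-generated reply"}
    ]
  }
]
```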
You can use the following command to convert text generated without `--chat` to the same format:
```bash
python convert_to_sharegpt.py --input_path YOUR_INPUT_PATH --model_name YOUR_MODEL_NAME --output_path YOUR_OUTPUT_PATH
```
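At its core, the conversion just re-keys each prompt/continuation pair into the structure above. A minimal sketch, assuming a JSONL input where each record stores `prompt`/`response` fields (the actual file layout may differ, and the real script also takes `--model_name`, which this sketch ignores):

```python
import json

def convert_record(record):
    """Hypothetical converter: exact input keys depend on generate.py's output."""
    return {
        "id": record.get("id"),
        "conversations": [
            {"from": "human", "value": record["prompt"]},
            {"from": "gpt", "value": record["response"]},
        ],
    }

with open("YOUR_INPUT_PATH") as f:
    records = [convert_record(json.loads(line)) for line in f]

with open("YOUR_OUTPUT_PATH", "w") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```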