
KoRAE

We introduce KoRAE, a model finetuned on a filtered, high-quality Korean dataset.

KoRAE combines high-quality data, selected with a dedicated filtering method, with Korean Llama-2, a Llama-2 variant extended with Korean vocabulary. We applied the data filtering method introduced in AlpaGasus to select high-quality examples from a mixture of several Korean datasets (OpenOrca-KO, KOpen-Platypus, KoCoT_2000, databricks-dolly-15k-ko), and finetuned the Korean Llama-2 model released by @beomi on the filtered dataset. Flash-Attention 2 and LoRA were used for efficient finetuning.

The findings of KoRAE are as follows:

  1. When finetuning for several epochs, high-quality filtered data has a positive effect on the model's performance. When finetuning for only a few epochs, however, data quantity matters more than quality. This seems to be due to the limited capability of the Korean base model, so research on improving Korean base models must continue.
  2. The model trained with DPO showed the best performance among the KoRAE variants, suggesting that DPO is also effective for Korean LLMs.
  3. The model finetuned on the filtered, high-quality KoRAE dataset outperformed the model finetuned on the unfiltered data. Therefore, finetuning on high-quality data is worthwhile for building better LLMs.

You can also check the performance of KoRAE on the Open Ko-LLM Leaderboard!

The model and dataset are available via HuggingFace: Cartinoe5930

News

[2023.12] HAE-RAE Benchmark results of KoRAE and its variants have been uploaded. Thanks to @HAETAE-project for introducing an awesome Korean LLM benchmark, and to @guijinSON for the advice!

[2023.12] KoRAE and its variants were uploaded to the HuggingFace Hub and the Open Ko-LLM Leaderboard. Thanks to @beomi, @kyujinpy, @nlp-ai, and @maywell for providing the base model and training datasets!

Setup

This repository mainly uses Transformers and TRL provided by HuggingFace🤗. In addition, Flash Attention 2 and LoRA are used for Parameter-Efficient Fine-Tuning (PEFT).

cd KoRAE
pip install -r requirements.txt

Dataset

As mentioned above, we used a filtered, high-quality Korean dataset for finetuning. First, we gathered Korean data and combined them into a single mixture. Then we filtered high-quality examples from the mixture with the filtering method introduced by AlpaGasus. An overview of the data processing procedure is as follows:

  1. Collect various Korean datasets from HuggingFace Hub.
  2. Rate the data quality using gpt-3.5-turbo.
  3. Process the rated data and filter the high-scored data.

Let's go deeper into data processing!

1. Korean dataset mixture

We investigated several sources of high-quality Korean data and collected data from the most suitable ones, creating a new dataset of about 64K examples. The composition of the dataset is as follows:

Dataset # Examples
OpenOrca-ko 21.6k
KOpen-Platypus 24.9k
KoCoT_2000 2.1k
databricks-dolly-15k-ko 15k
Total 63.7k

Thanks to @kyujinpy and @nlp-ai for providing Korean datasets.
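
For reference, the sketch below shows roughly how such a mixture can be rebuilt with the HuggingFace datasets library. The exact dataset repository IDs and the unified instruction/input/output schema are assumptions for illustration and may need adjusting:

# Rough sketch: build a Korean instruction mixture with HF `datasets`.
# Repository IDs and the unified schema are assumptions, not the exact
# ones used in this repo.
from datasets import load_dataset, concatenate_datasets

sources = [
    "kyujinpy/OpenOrca-KO",
    "kyujinpy/KOpen-platypus",
    "kyujinpy/KoCoT_2000",
    "nlpai-lab/databricks-dolly-15k-ko",
]

def to_common_schema(example):
    # Normalize every source to instruction / input / output fields.
    return {
        "instruction": example.get("instruction", ""),
        "input": example.get("input", ""),
        "output": example.get("output", ""),
    }

parts = []
for name in sources:
    ds = load_dataset(name, split="train")
    parts.append(ds.map(to_common_schema, remove_columns=ds.column_names))

mixture = concatenate_datasets(parts)
print(len(mixture))  # roughly 64k examples in total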

2. Rating

We used ChatGPT (gpt-3.5-turbo) as the rater to score the quality of the dataset. We considered writing the evaluation prompt in either Korean or English, and since evaluating Korean data with a prompt in a different language seemed undesirable, we conducted the evaluation with the Korean prompt. However, since the rating code rating/rating.py also supports an English rating prompt, you can choose the rating mode according to your preference.

Korean version

python rating/rating.py \
    --i 0 \
    --rating_type ko \
    --api_key YOUR_OPENAI_KEY

English version

python rating/rating.py \
    --i 0 \
    --rating_type en \
    --api_key YOUR_OPENAI_KEY

The rating code rating/rating.py and the rating prompt template templates/rating_template.json were adapted from AlpaGasus.
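
At its core, the rating step is a single ChatGPT call per example followed by score extraction. The sketch below illustrates the idea, assuming the openai>=1.0 Python client; the prompt wording and the score-parsing regex are placeholders, and the actual prompt lives in templates/rating_template.json:

# Illustrative sketch of the rating step (not the exact rating/rating.py code).
# The prompt text and score regex here are placeholders.
import re
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")

def rate_example(instruction: str, output: str) -> float | None:
    prompt = (
        "다음 지시문과 응답의 품질을 0점에서 10점 사이로 평가하고 "
        "'점수: X' 형식으로만 답하세요.\n\n"
        f"지시문: {instruction}\n응답: {output}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    text = resp.choices[0].message.content
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else None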

3. Processing & Filtering

We post-processed the rated dataset after the rating. The main postprocessing steps were as follows:

  • Correction of wrongly extracted scores
  • Exclusion of examples with an incorrect format

After postprocessing, we analyzed the score distribution of the rated dataset. As shown in the following figure, examples scored 8 were the most common, which confirms that the KoRAE dataset consisted of high-quality data from the start.

However, for better performance, we kept only the data with a score of 8.5 or higher and used it to finetune KoRAE. As a result, the dataset was filtered from 64k down to 12k examples! The following figure shows the shift in the KoRAE data distribution after filtering.

python rating/filtering.py \
    --score_criteria 8.5 \
    --output_dir PATH_TO_UPLOAD_DATASET \
    --hf_token YOUR_HF_ACCESS_TOKEN

The original and filtered datasets are uploaded on HuggingFace Hub, so you can check them!
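
The filtering itself boils down to a simple score threshold. Here is a minimal sketch with the datasets library; the dataset ID and the "score" column name are assumptions, and the released script is rating/filtering.py:

# Minimal sketch of score-based filtering (a stand-in for rating/filtering.py).
# The dataset ID and the "score" column name are assumptions.
from datasets import load_dataset

rated = load_dataset("Cartinoe5930/KoRAE_original", split="train")
filtered = rated.filter(lambda ex: ex["score"] >= 8.5)

print(len(rated), "->", len(filtered))  # roughly 64k -> 12k
# filtered.push_to_hub("PATH_TO_UPLOAD_DATASET", token="YOUR_HF_ACCESS_TOKEN")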

Finetuning(SFT)

We finetuned KoRAE with Flash-Attention 2 and LoRA for efficient finetuning on a single A100 80GB GPU. KoRAE was finetuned with LoRA, a Parameter-Efficient Fine-Tuning (PEFT) method. In addition, since the filtered high-quality dataset is much smaller than the original dataset, finetuning is even more efficient: it took only 5 GPU hours to finetune the model for 3 epochs! The hyperparameters used for finetuning KoRAE are as follows:

Training Hyperparameters

Hyperparameters Value
Base model beomi/llama-2-koen-13b
Dataset Cartinoe5930/KoRAE_filtered_12k
Batch size 16
Micro batch size 1
Gradient accumulation steps 16
Epochs 3
Learning rate 1e-5
lr_scheduler cosine
Max length 4096
Warmup ratio 0.03
Weight decay 0
bf16 True
Gradient checkpointing True

LoRA Hyperparameters

Hyperparameters Value
lora_r 8
lora_alpha 16
lora_dropout 0.05

The finetuning code of KoRAE is as follows:

python finetuning/finetune.py \
    --model_path beomi/llama-2-koen-13b \
    --data_path Cartinoe5930/KoRAE_filtered_12k \
    --output_dir finetuning/result/ \
    --wandb_project KoRAE_sft \
    --wandb_run_name KoRAE_sft \
    --hf_hub_path HUB_PATH_TO_UPLOAD_MODEL \
    --hf_token YOUR_HF_ACCESS_TOKEN
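
If you prefer to see how the hyperparameters above map onto code, here is a simplified sketch using peft and TRL's SFTTrainer. It is a stand-in for finetuning/finetune.py, not the exact script; the prompt-formatting helper is an assumption, and TRL/Transformers argument names vary slightly between versions:

# Simplified LoRA SFT sketch mirroring the hyperparameters above.
# Not the exact finetuning/finetune.py; TRL argument names vary by version.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "beomi/llama-2-koen-13b"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn installed
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def to_text(example):
    # Assumes instruction/input/output columns; the real prompt template is
    # templates/KoRAE_template.json (see "Prompting Format" below).
    user = example["instruction"] + ("\n" + example["input"] if example.get("input") else "")
    return {"text": f"### User:\n{user}\n\n### Assistant:\n{example['output']}"}

dataset = load_dataset("Cartinoe5930/KoRAE_filtered_12k", split="train").map(to_text)

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")

args = TrainingArguments(
    output_dir="finetuning/result/",
    per_device_train_batch_size=1,    # micro batch size
    gradient_accumulation_steps=16,   # effective batch size 16
    num_train_epochs=3,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.0,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    peft_config=peft_config,
    tokenizer=tokenizer,
)
trainer.train()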

DPO

We additionally trained KoRAE with DPO to further improve the model. Since DPO requires binarized feedback, we used ko_Ultrafeedback_binarized, a Korean-translated version of Ultrafeedback_binarized provided by @maywell. The hyperparameters used for DPO training of KoRAE are as follows; the LoRA hyperparameters are the same as above:

DPO Hyperparameters

Hyperparameters Value
Beta 0.1
Batch size 8
Micro batch size 2
Gradient accumulation steps 4
Epochs 3
Learning rate 5e-7
lr_scheduler linear
Max prompt length 2048
Max length 4096
Warmup ratio 0.1
Weight decay 0
Gradient checkpointing True

The DPO training code of KoRAE is as follows:

python DPO/dpo.py \
    --model_path Cartinoe5930/KoRAE-13b \
    --data_path maywell/ko_Ultrafeedback_binarized \
    --output_dir DPO/result/ \
    --wandb_project KoRAE_dpo \
    --wandb_run_name KoRAE_dpo \
    --hf_hub_path HUB_PATH_TO_UPLOAD_MODEL \
    --hf_token YOUR_HF_ACCESS_TOKEN
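
Similarly, the DPO hyperparameters above translate into TRL's DPOTrainer roughly as in the sketch below. This is a simplified stand-in for DPO/dpo.py, not the exact script, and DPOTrainer argument names also differ between TRL versions:

# Simplified DPO sketch mirroring the hyperparameters above.
# Not the exact DPO/dpo.py; DPOTrainer arguments differ between TRL versions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "Cartinoe5930/KoRAE-13b"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# DPO expects prompt / chosen / rejected columns; ko_Ultrafeedback_binarized
# follows the Ultrafeedback_binarized layout, but check the column names.
dataset = load_dataset("maywell/ko_Ultrafeedback_binarized", split="train")

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")

args = TrainingArguments(
    output_dir="DPO/result/",
    per_device_train_batch_size=2,   # micro batch size
    gradient_accumulation_steps=4,   # effective batch size 8
    num_train_epochs=3,
    learning_rate=5e-7,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.0,
    gradient_checkpointing=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,    # with a PEFT adapter, the frozen base acts as the reference
    args=args,
    beta=0.1,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_prompt_length=2048,
    max_length=4096,
)
trainer.train()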

Prompting Format

We used the following prompt format for KoRAE. We chose it to follow the prompting format of popular models while preserving the important information in the instruction. You can check the prompting format of KoRAE in templates/KoRAE_template.json or in the following example:

### System:
{system_prompt}

### User:
{instruction + input}

### Assistant:
{output}

Since the KoRAE prompt format is implemented in the model's tokenizer, you can apply it with apply_chat_template. For more details, please refer to the model card of KoRAE!
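
As a quick illustration of building a prompt this way (the system prompt below is only an example, not necessarily the default from the model card):

# Minimal sketch of building a KoRAE prompt with the tokenizer's chat template.
# The system prompt here is only an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cartinoe5930/KoRAE-13b")

messages = [
    {"role": "system", "content": "당신은 유용한 인공지능 비서입니다."},
    {"role": "user", "content": "한국의 수도는 어디인가요?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant header for generation
)
print(prompt)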

Weights & Biases Results

The finetuning and DPO training results of KoRAE can be checked via the Weights & Biases link.

Open Ko-LLM Leaderboard

We uploaded several variants of the KoRAE model to the Open Ko-LLM Leaderboard and checked their performance. The best score in each column is shown in bold. The results are as follows:

Model Average Ko-ARC Ko-HellaSwag Ko-MMLU Ko-TruthfulQA Ko-CommonGen V2
KoRAE-filtered-1ep 48.1 45.22 56.79 42 40.4 56.08
KoRAE-filtered-3ep 48.64 46.33 57.25 42.8 41.08 55.73
KoRAE-original-1ep 48.5 45.56 57.04 42.2 40.67 57.02
KoRAE-original-3ep 48.16 44.37 56.97 43.27 41.75 54.43
KoRAE-DPO 48.71 46.5 57.54 42.87 41.28 55.37

Through the analysis of the performance table, we were able to confirm the following:

  1. The model finetuned on the filtered KoRAE dataset improved as it was finetuned for more epochs, whereas the model finetuned on the original KoRAE dataset showed the opposite trend.
  2. The model trained with DPO showed better overall and average performance than the model without DPO.
  3. At higher epoch counts, filtering for high-quality data has a positive effect on the model's performance.

HAE-RAE Benchmark

We additionally evaluated KoRAE and its variants on the HAE-RAE Benchmark. The evaluation results are as follows:

Models correct_definition_matching csat_geo csat_law csat_socio date_understanding general_knowledge history loan_words reading_comprehension rare_words standard_nomenclature
Cartinoe5930/KoRAE-13b 0.5011 0.1933 0.1659 0.1812 0.6105 0.3693 0.7500 0.7456 0.2244 0.7506 0.7647
Cartinoe5930/original-KoRAE-13b-3ep 0.5285 0.1867 0.1889 0.2081 0.5284 0.3920 0.7074 0.8047 0.2361 0.7235 0.8105
Cartinoe5930/KoRAE-13b-DPO 0.5057 0.1933 0.1659 0.1812 0.6063 0.3807 0.7500 0.7456 0.2190 0.7556 0.7712

The HAE-RAE Benchmark results show a somewhat different trend from the Open Ko-LLM Leaderboard results. On the Open Ko-LLM Leaderboard, the high-quality dataset and DPO had a positive effect, whereas on HAE-RAE Bench, finetuning on more data had a better effect on model performance. We think the reason for this difference is the gap between what each benchmark targets: the Open Ko-LLM Leaderboard focuses on evaluating the model's academic abilities, while the HAE-RAE Benchmark attempts to evaluate Korean cultural and contextual nuances.

Discussion

Through the KoRAE project, we learned some interesting things that could help Korean LLM research. Most Korean open-source models uploaded to the Open Ko-LLM Leaderboard are based on the beomi/llama-2-koen-13b model uploaded by @beomi. This model is trained on a Korean + English mixed corpus with the Llama 2 architecture and is widely used as an open-source LLM. However, due to the lack of Korean data, it could only be trained on about 60B tokens. The KoRAE project let us see how this affects the model's performance.

When finetuning for a small number of epochs, finetuning on a larger amount of data performed better than finetuning on the high-quality subset. However, when finetuning for more epochs, finetuning on high-quality data improves performance over the original data, consistent with the AlpaGasus experiment results.

We confirmed that it is important to keep working toward further improved Korean base models in future research. In addition, high-quality data is important for finetuning, and DPO had a positive effect on model performance.

Nevertheless, KoRAE still shows poor performance compared to other models on the Open Ko-LLM Leaderboard. We leave this as future work. Stay tuned for updates to KoRAE!

Citation

@inproceedings{lee2023kullm,
  title={KULLM: Learning to Construct Korean Instruction-following Large Language Models},
  author={Lee, SeungJun and Lee, Taemin and Lee, Jeongwoo and Jang, Yoona and Lim, Heuiseok},
  booktitle={Annual Conference on Human and Language Technology},
  pages={196--202},
  year={2023},
  organization={Human and Language Technology}
}
@misc{chen2023alpagasus,
      title={AlpaGasus: Training A Better Alpaca with Fewer Data}, 
      author={Lichang Chen and Shiyang Li and Jun Yan and Hai Wang and Kalpa Gunaratna and Vikas Yadav and Zheng Tang and Vijay Srinivasan and Tianyi Zhou and Heng Huang and Hongxia Jin},
      year={2023},
      eprint={2307.08701},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc {l._junbum_2023,
    author       = { {L. Junbum, Taekyoon Choi} },
    title        = { llama-2-koen-13b },
    year         = 2023,
    url          = { https://huggingface.co/beomi/llama-2-koen-13b },
    doi          = { 10.57967/hf/1280 },
    publisher    = { Hugging Face }
}
