Hi,

I found that this repo focuses only on fine-tuning LLaMA with LoRA for Chinese. However, LLaMA was trained mostly on an English corpus, with a vocabulary of only about 30,000 tokens, which is very small and heavily skewed toward English.
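For reference, here is a rough sketch of how one could check this with the Hugging Face `transformers` tokenizer (the checkpoint path is a placeholder for a locally converted LLaMA directory):

```python
# Rough check of how the stock LLaMA tokenizer handles non-English text.
# "path/to/llama-7b" is a placeholder for a locally converted LLaMA checkpoint.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b")
print(tokenizer.vocab_size)  # 32000 for the original LLaMA tokenizer

samples = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "vi": "Tôi muốn huấn luyện mô hình cho tiếng Việt.",
    "zh": "我想用中文训练这个模型。",
}
for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    # A higher tokens-per-character ratio means the text is fragmented into more
    # pieces, so it consumes more of the context window during training/inference.
    print(lang, len(tokens), round(len(tokens) / len(text), 2))
```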
How would you describe the quality / perplexity of the results (7B or 13B) with LoRA alone, without expanding the Chinese vocabulary before fine-tuning? Would you suggest that full fine-tuning, or LoRA fine-tuning on a large non-instruct corpus, is the better way to go?
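To make the comparison concrete, this is roughly how I would estimate perplexity on a held-out corpus (a sketch only; the model path and texts are placeholders, assuming a Hugging Face causal LM checkpoint):

```python
# Sketch of a rough perplexity estimate on held-out text; paths and texts are
# placeholders, and the per-sample average is not a token-weighted perplexity.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/finetuned-llama"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

held_out = [
    "Một đoạn văn tiếng Việt dùng để đánh giá mô hình.",
    "Another held-out passage in the target language.",
]

losses = []
with torch.no_grad():
    for text in held_out:
        enc = tokenizer(text, return_tensors="pt")
        # Passing labels == input_ids makes the model return the mean cross-entropy loss.
        out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())

print("approx. perplexity:", math.exp(sum(losses) / len(losses)))
```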
I am about to train LLaMA for Vietnamese, so I would like to learn from your experience. I am also referring to https://github.com/ymcui/Chinese-LLaMA-Alpaca, which says that LoRA pre-training on a large corpus plus vocabulary expansion should be done first, so I am a bit confused.
Thanks for any input.
Steve
Thank you for your interest in our project. LLaMA is a multilingual model and does have some proficiency in Chinese. Since there is no strong open Chinese base model available, we chose LLaMA as the foundation.
Given sufficient hardware resources, full fine-tuning would certainly yield better results than LoRA; FastChat's Vicuna is an example of that approach.
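For readers comparing the two options, here is a minimal sketch of attaching LoRA adapters with the PEFT library; the checkpoint path and hyperparameters are illustrative rather than this repo's exact configuration:

```python
# Minimal LoRA setup with PEFT; path and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")  # placeholder

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension of the adapters
    lora_alpha=16,                        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections commonly adapted
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
# Only the small adapter matrices are trainable; the base weights stay frozen,
# which is why LoRA fits on far less hardware than full fine-tuning.
model.print_trainable_parameters()
```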
The vocabulary-expansion approach used by Chinese-LLaMA-Alpaca also requires extensive pre-training, which is worth doing if your hardware allows. LLaMA's own tokenizer can encode many Chinese characters through its byte-level fallback, but relatively few of them map to a single token, hence the need for vocabulary expansion.
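As a simplified illustration of what vocabulary expansion involves (not the actual Chinese-LLaMA-Alpaca pipeline, which trains a new SentencePiece model on a large corpus and merges it with the original tokenizer; the added tokens below are placeholders):

```python
# Simplified illustration of vocabulary expansion; the added tokens are placeholders.
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b")  # placeholder path
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")

# Characters missing from the 32k vocabulary fall back to byte-level pieces,
# so a single Chinese character can cost several tokens.
print(tokenizer.tokenize("我想用中文训练这个模型。"))

# Add target-language tokens, then grow the embedding matrix to match.
tokenizer.add_tokens(["机器", "学习", "模型"])  # placeholder tokens
model.resize_token_embeddings(len(tokenizer))

# The new embedding rows are randomly initialized, which is why extensive
# pre-training on unlabeled target-language text is needed before instruction tuning.
```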