
Continue pretraining an instruction-fine-tuned LLM like Qwen2.5-7B-Instruct #1405

Open
geo47 opened this issue Dec 9, 2024 · 3 comments

Comments


geo47 commented Dec 9, 2024

Hello,

I would like to know whether it's possible to continue pretraining an LLM on raw text when the model has already been instruction-fine-tuned, like Qwen2.5-7B-Instruct.

Would this degrade its ability to understand and follow instructions?

The best strategy I'm considering is to continue pretraining the instruction-fine-tuned version of the LLM on raw text, then fine-tune it again on an instruction task to refresh its instruction-following ability.

Please guide! Thanks

@omarbadran

Not sure if I understand this correctly, but I have fine-tuned a lot of models, both base and instruct versions, with no problems. The quality is actually better than what I got when tuning Gemini Flash in Vertex AI for my use case. The only concern is that your goal is to teach the model new information, which would require a lot of data and a high LoRA rank to avoid overfitting. Still much, much better than a full fine-tune.

If your dataset is not HUGE, you can use a larger model to turn the "raw text" you have into an instruction dataset and then train on that directly.

I have done something like this before. I wanted my model to learn Deno 2, since it's new and current LLMs don't know about it. I scraped the documentation, the blog posts, and some files from their GitHub, then used Claude 3.5 Haiku to generate a list of prompts and Sonnet to answer them, both with context caching to reduce cost and latency. The whole process cost less than $5.
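A rough sketch of that pipeline in Python, assuming the standard `anthropic` SDK. The chunk size, prompt wording, and model aliases are placeholders, and the prompt-caching options mentioned above are omitted for brevity:

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split scraped documentation into roughly max_chars-sized chunks,
    breaking on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def build_dataset(doc_text: str) -> list[dict]:
    """For each chunk: a small model generates prompts, a larger one answers."""
    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    dataset = []
    for chunk in chunk_text(doc_text):
        prompts = client.messages.create(
            model="claude-3-5-haiku-latest",
            max_tokens=1024,
            messages=[{"role": "user", "content":
                       f"Write 5 user questions answerable from this doc:\n{chunk}"}],
        ).content[0].text.splitlines()
        for prompt in prompts:
            answer = client.messages.create(
                model="claude-3-5-sonnet-latest",
                max_tokens=1024,
                messages=[{"role": "user", "content": f"{chunk}\n\nQ: {prompt}"}],
            ).content[0].text
            dataset.append({"instruction": prompt, "output": answer})
    return dataset
```

Each `{"instruction": ..., "output": ...}` pair can then be formatted with the target model's chat template for fine-tuning.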

If the text is larger than 200k tokens and won't fit in Claude's context window, you can use Gemini 1.5 Pro, which supports up to two million tokens and also supports caching.

It's much cheaper to use a good model with context caching than running your own. There are even simpler methods with fewer steps that don't require a huge model like Sonnet or Gemini, but the resulting dataset quality and the time saved weren't worth the extra code I would have needed to write.


Tejaswgupta commented Dec 10, 2024

@omarbadran what's the metric you use to check whether the model is learning correctly and not overfitting?
I've tried continued pretraining of the Qwen-14B-Instruct model on a legal dataset of 6M tokens; the loss does converge to 0.7, but the model answers pretty much all questions incorrectly.
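(For reference, a mean token-level cross-entropy loss can be read as perplexity via `exp(loss)`, and comparing training perplexity against a held-out split is one simple overfitting check. A minimal sketch, not specific to any framework:)

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity is exp(loss) for a mean token-level cross-entropy."""
    return math.exp(cross_entropy_loss)

# A training loss of 0.7 is a perplexity of about 2.0 -- quite low for raw
# legal text. A large gap between this and held-out perplexity would
# suggest memorization rather than learning.
train_ppl = perplexity(0.7)
```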

I fine-tuned it on another curated dataset of 30k samples which did improve the accuracy but it still wasn't great.

(Two screenshots of training results attached, dated Dec 10, 2024.)

This was with both Unsloth and Llamafactory.

Did you pretrain your models, or fine-tune on the labelled data?

@danielhanchen
Contributor

@geo47 You can do it on instruct models, but I would advise against it if it's raw text. A trick is, at the end, to average the weights: (original instruct weights) / 2 + (finetuned instruct weights) / 2.
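That 50/50 merge is an element-wise average over the two models' parameters. A minimal sketch (the helper name is illustrative; in practice the dicts would come from `model.state_dict()` and the result loaded back with `model.load_state_dict(...)`):

```python
def average_state_dicts(original: dict, finetuned: dict, alpha: float = 0.5) -> dict:
    """Element-wise interpolation of two state dicts with identical keys:
    alpha * original + (1 - alpha) * finetuned.
    Works on torch tensors or plain floats; alpha=0.5 is the 50/50 merge."""
    assert original.keys() == finetuned.keys(), "models must share architecture"
    return {k: alpha * original[k] + (1 - alpha) * finetuned[k] for k in original}
```

With `alpha` exposed, you can also bias the merge toward the original instruct weights if instruction-following degrades too much.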

@omarbadran Fair points - if the dataset is small, generally the best advice is to merge datasets from the open source world, or create some synthetic data. Large datasets are generally better (>10K)

@Tejaswgupta Did you use train_on_responses_only in the conversational notebook (https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing)? It should help.
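Conceptually, train_on_responses_only masks the prompt tokens so the loss is computed only on the assistant's reply. A minimal illustration of the idea (not Unsloth's actual implementation; the function name and the way the response boundary is found are simplified):

```python
IGNORE_INDEX = -100  # HF-style convention: labels of -100 are excluded from the loss

def mask_prompt_labels(input_ids: list[int], response_start: int) -> list[int]:
    """Copy input_ids into labels, masking everything before response_start
    so gradient only flows through the assistant's response tokens."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```

Without this masking, the model also spends capacity predicting the user's prompt text, which often hurts downstream answer accuracy.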
