
Continue pretraining an instruction-fine-tuned LLM like Qwen2.5-7B-Instruct #1405

Open
geo47 opened this issue Dec 9, 2024 · 3 comments

Comments


geo47 commented Dec 9, 2024

Hello,

I would like to know whether it's possible to continue pretraining an LLM on raw text when the model has already been instruction-fine-tuned, like Qwen2.5-7B-Instruct.

Would this degrade its ability to understand and follow instructions?

The best strategy I'm considering is to continue pretraining the instruction-fine-tuned version of the LLM on raw text, then fine-tune it again on an instruction task to refresh its instruction-following ability.

Please guide! Thanks

@omarbadran

Not sure if I understand this correctly, but I have fine-tuned a lot of models, both base and instruct versions, with no problems. The quality is actually better than what I got when tuning Gemini Flash in Vertex AI for my use case. The only concern is that your goal is to teach the model new information, which would require a lot of data and a high LoRA rank to avoid overfitting. Still much, much better than a full fine-tune.

If your dataset is not HUGE, you can use a larger model to turn the "raw text" you have into an instruction dataset and then train on that directly.

I have done something like this before. I wanted my model to learn Deno 2, since it's new and current LLMs don't know about it. I scraped the documentation, the blog posts, and some files from their GitHub, then used Claude 3.5 Haiku to generate a list of prompts and Sonnet to answer them, both with context caching to reduce cost and latency. The whole process cost less than $5.
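A rough sketch of that pipeline in Python, assuming the standard `anthropic` SDK. The chunk size, prompt wording, and model aliases are placeholders, and the prompt-caching options mentioned above are omitted for brevity:

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split scraped documentation into roughly max_chars-sized chunks,
    breaking on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def build_dataset(doc_text: str) -> list[dict]:
    """For each chunk: a small model generates prompts, a larger one answers."""
    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    dataset = []
    for chunk in chunk_text(doc_text):
        prompts = client.messages.create(
            model="claude-3-5-haiku-latest",
            max_tokens=1024,
            messages=[{"role": "user", "content":
                       f"Write 5 user questions answerable from this doc:\n{chunk}"}],
        ).content[0].text.splitlines()
        for prompt in prompts:
            answer = client.messages.create(
                model="claude-3-5-sonnet-latest",
                max_tokens=1024,
                messages=[{"role": "user", "content": f"{chunk}\n\nQ: {prompt}"}],
            ).content[0].text
            dataset.append({"instruction": prompt, "output": answer})
    return dataset
```

Each `{"instruction": ..., "output": ...}` pair can then be formatted with the target model's chat template for fine-tuning.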

If the text is larger than 200k tokens and won't fit in Claude's context window, you can use Gemini 1.5 Pro, which supports up to two million tokens and also supports caching.

It's much cheaper to use a good model with context caching than running your own. There are even simpler methods with fewer steps that don't require a huge model like Sonnet or Gemini, but the resulting dataset quality and the time saved weren't worth the extra code I would have needed to write.


Tejaswgupta commented Dec 10, 2024

@omarbadran what's the metric you use to check whether the model is learning correctly and not overfitting?
I've tried continued pretraining of the Qwen-14B-Instruct model on a legal dataset of 6M tokens; the loss does converge to 0.7, but the model answers pretty much all questions incorrectly.
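(For reference, a mean token-level cross-entropy loss can be read as perplexity via `exp(loss)`, and comparing training perplexity against a held-out split is one simple overfitting check. A minimal sketch, not specific to any framework:)

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity is exp(loss) for a mean token-level cross-entropy."""
    return math.exp(cross_entropy_loss)

# A training loss of 0.7 is a perplexity of about 2.0 -- quite low for raw
# legal text. A large gap between this and held-out perplexity would
# suggest memorization rather than learning.
train_ppl = perplexity(0.7)
```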

I fine-tuned it on another curated dataset of 30k samples which did improve the accuracy but it still wasn't great.

(Two screenshots of training results attached, dated Dec 10, 2024.)

This was with both Unsloth and Llamafactory.

Did you pretrain your models, or fine-tune on the labelled data?

@danielhanchen
Contributor

@geo47 You can do it on instruct models, but I would advise against it if it's raw text. A trick is, at the end, to average the weights: (original instruct weights) / 2 + (finetuned instruct weights) / 2.
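That 50/50 merge is an element-wise average over the two models' parameters. A minimal sketch (the helper name is illustrative; in practice the dicts would come from `model.state_dict()` and the result loaded back with `model.load_state_dict(...)`):

```python
def average_state_dicts(original: dict, finetuned: dict, alpha: float = 0.5) -> dict:
    """Element-wise interpolation of two state dicts with identical keys:
    alpha * original + (1 - alpha) * finetuned.
    Works on torch tensors or plain floats; alpha=0.5 is the 50/50 merge."""
    assert original.keys() == finetuned.keys(), "models must share architecture"
    return {k: alpha * original[k] + (1 - alpha) * finetuned[k] for k in original}
```

With `alpha` exposed, you can also bias the merge toward the original instruct weights if instruction-following degrades too much.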

@omarbadran Fair points - if the dataset is small, generally the best advice is to merge datasets from the open source world, or create some synthetic data. Large datasets are generally better (>10K)

@Tejaswgupta Did you use train_on_responses_only in the conversational notebook (https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing)? It should help.
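Conceptually, train_on_responses_only masks the prompt tokens so the loss is computed only on the assistant's reply. A minimal illustration of the idea (not Unsloth's actual implementation; the function name and the way the response boundary is found are simplified):

```python
IGNORE_INDEX = -100  # HF-style convention: labels of -100 are excluded from the loss

def mask_prompt_labels(input_ids: list[int], response_start: int) -> list[int]:
    """Copy input_ids into labels, masking everything before response_start
    so gradient only flows through the assistant's response tokens."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```

Without this masking, the model also spends capacity predicting the user's prompt text, which often hurts downstream answer accuracy.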
