[WIP] Codebase Refactor #22

Status: Draft. Wants to merge 120 commits into base: main.

Changes from 102 commits (of 120 total)

Commits
52099b6
Restructure as python package (#19)
shawseanyang Aug 16, 2024
053dc09
Add Ollama inference
ArmandNM Aug 16, 2024
0e8eb89
Panza Web Server (#12)
shawseanyang Aug 20, 2024
498fdc1
Add black line length configuration
ArmandNM Aug 20, 2024
aec8d50
Add new Panza src path
ArmandNM Aug 20, 2024
39d37e2
Create class interfaces
ArmandNM Aug 22, 2024
769277c
Set up unit testing
ArmandNM Aug 23, 2024
42acc78
web hosting
shawseanyang Aug 23, 2024
081bf94
implement ollama llm
shawseanyang Aug 23, 2024
de2fe12
Add FAISS retriever and update Document interface to preepare for ind…
ArmandNM Aug 24, 2024
32a8a39
Fix missing method in retriever interface
ArmandNM Aug 24, 2024
094fa24
Make thread and past_messages optional in EmailInstruction
ArmandNM Aug 24, 2024
c9ea456
Add email prompt builder
ArmandNM Aug 24, 2024
6a9897e
Add local transformers inference
ArmandNM Aug 25, 2024
30b3dc1
Add Peft models and conditional imports
ArmandNM Aug 26, 2024
1d73aff
Add Panza Writer
ArmandNM Aug 26, 2024
c648f98
Add support to return full prompt from writer
ArmandNM Aug 27, 2024
09f08e4
Set corresponding retriever document type in prompt builder
ArmandNM Aug 27, 2024
a3c6ac6
Remove debugging print from prompting utils
ArmandNM Aug 27, 2024
6b67156
Add Hydra config-based runner for Panza writer
ArmandNM Aug 27, 2024
cd9b041
rename ollama_llm.py to ollama.py
shawseanyang Aug 28, 2024
7e3ee43
add type annotations to OllamaLLM
shawseanyang Aug 28, 2024
8b1ef60
add some more type annotations to OllamaLLM
shawseanyang Aug 28, 2024
73b37be
check installation for OllamaLLM
shawseanyang Aug 28, 2024
90d59d1
rename test_llm.py to test_local_llm.py
shawseanyang Aug 28, 2024
f9ddb8f
add pytest to dev dependencies
shawseanyang Aug 28, 2024
af88a28
add sampling_params to super() init call
shawseanyang Aug 28, 2024
28dce17
add unit tests for ollama_llm.py
shawseanyang Aug 28, 2024
b8fba39
black formatting
shawseanyang Aug 28, 2024
5f05228
fix types
shawseanyang Aug 28, 2024
91ef882
add to gitignore
shawseanyang Aug 28, 2024
d4d0a95
black formatting
shawseanyang Aug 28, 2024
a9f9152
add omegaconf to dependencies
shawseanyang Aug 28, 2024
49dfb1d
Add FFT runner
ArmandNM Sep 1, 2024
c1775ca
Merge branch 'refactor' of github.com:IST-DASLab/PanzaMail into refactor
shawseanyang Sep 2, 2024
48ba916
comment out the running examples
shawseanyang Sep 2, 2024
b662f80
split dependencies into base and training and add documentation in RE…
shawseanyang Sep 2, 2024
4b2fa76
add hydra to dependencies
shawseanyang Sep 2, 2024
d4bd1aa
Fix unused num_thread_emails parameter
ArmandNM Sep 3, 2024
8df8f11
Add RoSA runner
ArmandNM Sep 3, 2024
2c221cc
Temporarily rename serialized document in vector db metadata
ArmandNM Sep 3, 2024
e72d13c
move some dependencies from training into base
shawseanyang Sep 3, 2024
7447aac
add outputs folder to gitignore
shawseanyang Sep 3, 2024
b151fc3
add writer to the constructor arguments of the web service
shawseanyang Sep 3, 2024
77c1867
delete run_panza.py bc its just a test file
shawseanyang Sep 3, 2024
fc7db3f
rename constructor argument for the Ollama LLM class to match the loc…
shawseanyang Sep 3, 2024
b6f9921
add none retriever to allow running without RAG
shawseanyang Sep 3, 2024
d70f512
add config for Ollama LLM
shawseanyang Sep 3, 2024
5d79649
Merge branch 'refactor' of github.com:IST-DASLab/PanzaMail into refactor
shawseanyang Sep 4, 2024
01b4052
remove DEFAULT_PORT and add integer type hint
shawseanyang Sep 4, 2024
1d05864
add interfaces
shawseanyang Sep 5, 2024
a522768
add hydra yaml overrides to training script (works for full training …
Sep 5, 2024
3a2c70b
remove my user from configs and add comments to show how to enable ot…
shawseanyang Sep 5, 2024
3801351
add temporary inference instructions
shawseanyang Sep 5, 2024
f1bf461
use command line config specifications in the inference instructions
shawseanyang Sep 5, 2024
9c821b3
add example to inference instructions
shawseanyang Sep 5, 2024
46f222d
update training script for RoSA
Sep 6, 2024
a4a38d5
Merge branch 'refactor' into jen/train-refactor
Sep 6, 2024
62d3b27
remove redundant, unnecessary and problematic use_rag and use_thread …
Sep 6, 2024
bd42248
minor training cleanups
Sep 6, 2024
7e8f343
add config for peft writer
Sep 10, 2024
090c8f1
deprecate panza_finetuning.yaml
Sep 11, 2024
aa6a586
small config fixes
Sep 11, 2024
19e5b54
Refactor data summarization
ArmandNM Sep 11, 2024
4fb5c8d
add sampling parameters to ollama LLM
shawseanyang Sep 11, 2024
422a2a5
Merge branch 'refactor' of github.com:IST-DASLab/PanzaMail into refactor
shawseanyang Sep 11, 2024
21a3fcf
refactor configs to get unnecessary params out of configs/base
Sep 12, 2024
9a1c9c2
allow code execution during model loading to allow phi3.5
Sep 12, 2024
3c44172
greatly simplify the .sh training script to take advantage of the con…
Sep 12, 2024
6849903
add streaming to cli
shawseanyang Sep 12, 2024
764f980
add streaming to transformers
shawseanyang Sep 12, 2024
50dcde2
update training scripts to process arguments correctly
Sep 13, 2024
6937018
minor fix for train_fft
Sep 13, 2024
0dc314a
write the full checkpoint to the expected location
Sep 13, 2024
9fa93f0
add sampling parameters to ollama LLM
shawseanyang Sep 11, 2024
ddd6041
Refactor data summarization
ArmandNM Sep 11, 2024
91a64f0
add streaming to cli
shawseanyang Sep 12, 2024
41d27c3
add streaming to transformers
shawseanyang Sep 12, 2024
66cf2d4
update web.py to match LLM interface
shawseanyang Sep 13, 2024
098ef32
Merge branch 'refactor' of github.com:IST-DASLab/PanzaMail into refactor
Sep 16, 2024
0c456b5
first pass at porting evaluation to new framework
Sep 16, 2024
2e20f12
do NOT split test and train data by default
Sep 16, 2024
298d023
make the json interface more robust to json file format
Sep 17, 2024
02e0322
fix bug where RoSA FFT still tries to move folder over
Sep 19, 2024
d4b70f8
emergency bugfix for streaming to cli
Sep 19, 2024
226082e
Merge branch 'refactor' into jen/eval-refactor
Oct 18, 2024
3314abe
add creating RAG DB to data preparation script
Oct 18, 2024
ab96fea
add test-train splitting to data preparation
Oct 18, 2024
e3f12e0
undo accidental commenting out
Oct 18, 2024
dbcc5d0
bug fix
Oct 18, 2024
dfbde00
add tqdm to json interface
Oct 18, 2024
eb720d1
let panza writer load latest checkpoint
Oct 18, 2024
d7b298f
add email extraction to data preparation
Oct 21, 2024
924033e
update panza readme
Oct 21, 2024
ea5ffea
update env preparation script
Oct 21, 2024
ca4b690
slight refactor of runner.py
Oct 28, 2024
233083e
remove some unused .sh files
Oct 28, 2024
6f94379
qq
Oct 28, 2024
8ce3c4a
make the first part of the data preparation script optional (in case …
Oct 29, 2024
c1f36a8
Edits and Bug Fixes
maddox-j Oct 31, 2024
bacbdf7
Add additional clarification on username importance
maddox-j Nov 4, 2024
6be5813
update data preparation
Nov 4, 2024
302bae6
Miscellanous updates
Nov 8, 2024
c13aa07
Fix function address
Nov 8, 2024
fee7cbd
Clean up code TODOs and revert to defaults
Nov 12, 2024
99f5c7c
move top-level README to default location
Nov 12, 2024
1238d39
update the scripts/ readme
Nov 12, 2024
c0a94a3
remove useless assert
Nov 12, 2024
99a0b68
Merge branch 'jen/eval-refactor' of github.com:IST-DASLab/PanzaMail i…
Nov 12, 2024
3e2203c
Merge changes.
Nov 12, 2024
1e62597
Once again, try to centralize the main README.
Nov 12, 2024
71a85c9
Update the README
Nov 12, 2024
73308a0
Update README.md remove resolved TODO
ohaijen Nov 13, 2024
e5a9e44
update hyperparameter tuning guide
Nov 13, 2024
49258b9
Merge branch 'jen/eval-refactor' of github.com:IST-DASLab/PanzaMail i…
Nov 13, 2024
b3bc00f
Refactor panza3 -> panza
Nov 13, 2024
be76c39
Clear ollama and web use-case
Nov 14, 2024
677dd8a
Update README.md remove confusing period.
ohaijen Nov 18, 2024
e1e8e6d
Update README.md Add instructions for quantized training
ohaijen Nov 18, 2024
525d0e3
correct README for quantized training
Nov 18, 2024
Files changed
2 changes: 2 additions & 0 deletions .env
@@ -0,0 +1,2 @@
# Store API keys here
API_KEYS=apikey1,apikey2,apikey3
6 changes: 5 additions & 1 deletion .gitignore
@@ -6,5 +6,9 @@ __pycache__/
checkpoints/
results/
wandb/
outputs/

*.log
*.log
*.egg-info
.vscode
build/
205 changes: 205 additions & 0 deletions README_panza3.md
@@ -0,0 +1,205 @@
<div align="center">
<img src="panza_logo.png" alt="panza demo" width="200"/>
</div>

# Panza: A personal email assistant, trained and running on-device



## What is Panza?




Panza is an automated email assistant customized to your writing style and past email history. \
Its main features are as follows:
* Panza produces a fine-tuned LLM that matches your writing style, pairing it with a Retrieval-Augmented Generation (RAG) component which helps it produce relevant emails.
* Panza **can be trained and run entirely locally**. Currently, it requires a single GPU with
16-24 GiB of memory, but we also plan to release a CPU-only version. **At no point in training or execution is your data shared with the entities that trained the original LLMs, with LLM distribution services such as Huggingface, or with us.**
* Training and execution are also quick - for a dataset on the order of 1000 emails, training Panza takes well under an hour, and generating a new email takes a few seconds at most.

<div align="center">
<img src="panza_demo.gif" alt="panza logo" width="500"/>
</div>


## TODO: Prerequisites
> **Contributor Author:** Clean TODO

- Your emails, exported to `mbox` format (see tutorial below).
- A computer, preferably with an NVIDIA GPU with at least 24 GiB of memory (alternatively, check out [running in Google Colab](#cloud-try-out-panza-in-google-colab)).
- A Hugging Face [account](https://huggingface.co/login) to download the models (free of charge).
- [Optional] A Weights & Biases [account](https://wandb.ai/login) to log metrics during training (free of charge).
- Basic Python and Unix knowledge, such as building environments and running Python scripts.
- *No prior LLM experience is needed*.


## How it works

### :film_projector: Step 1: Data playback

For most email clients, it is possible to download a user's past emails in a machine-friendly .mbox format. For example, Gmail allows you to do this via [Google Takeout](https://takeout.google.com), and Thunderbird supports it via various plugins.

One key part of Panza is a dataset-generation technique we call **data playback**: Given some of your past emails in .mbox format, we automatically create a training set for Panza by using a pretrained LLM to summarize the emails in instruction form; each email becomes a `(synthetic instruction, real email)` pair.
Given a dataset consisting of all pairs, we use these pairs to "play back" your sent emails: the LLM receives only the instruction, and has to generate the "ground truth" email as a training target.

We find that this approach is very useful for the LLM to "learn" the user's writing style.
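
To make the idea concrete, here is a minimal sketch of data playback, assuming the cleaned emails sit in a JSONL file with an `email` field; the file names and the `summarize_email` placeholder are illustrative, not the exact ones Panza's scripts use:

```python
import json
from pathlib import Path


def summarize_email(body: str) -> str:
    """Placeholder for the pretrained LLM that turns an email into a short instruction."""
    first_line = body.strip().splitlines()[0] if body.strip() else ""
    return f"Write an email along the lines of: {first_line[:80]}"


def build_playback_pairs(emails_path: Path, out_path: Path) -> None:
    # Each output line is a (synthetic instruction, real email) pair; at training
    # time the model sees only the instruction and must reproduce the real email.
    with emails_path.open() as src, out_path.open("w") as dst:
        for line in src:
            email = json.loads(line)["email"]
            pair = {"summary": summarize_email(email), "email": email}
            dst.write(json.dumps(pair) + "\n")


if __name__ == "__main__":
    build_playback_pairs(Path("data/emails_clean.jsonl"),
                         Path("data/emails_clean_summarized.jsonl"))
```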


### :weight_lifting: Step 2: Local Fine-Tuning via Robust Adaptation (RoSA)

We then use parameter-efficient finetuning to train the LLM on this dataset, locally. We found that we get the best results with the [RoSA method](https://arxiv.org/pdf/2401.04679.pdf), which combines low-rank (LoRA) and sparse finetuning. If parameter efficiency is not a concern (that is, you have a more powerful GPU), regular full-rank/full-parameter finetuning can also be used. We find that a moderate amount of further training strikes the right balance, matching the writer's style without memorizing irrelevant details from past emails.
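
As a rough illustration of why parameter-efficient finetuning fits on a single GPU, the sketch below attaches plain LoRA adapters via the `peft` library and counts the trainable parameters. This is only an analogy: Panza's training scripts use RoSA through their own configs, and the model name here is just an example.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Example base model; Panza defaults to a Llama-3-8B-Instruct checkpoint.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Train only low-rank adapters on the linear layers; the base weights stay frozen.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, target_modules="all-linear")
model = get_peft_model(model, lora_cfg)

# Typically well under 1% of the parameters end up trainable.
model.print_trainable_parameters()
```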


### :owl: Step 3: Serving via RAG

Once we have a custom user model, Panza can be run locally together with a Retrieval-Augmented Generation (RAG) module. Specifically, this functionality stores past emails in a database and provides a few relevant emails as context for each new query. This allows Panza to better insert specific details, such as a writer's contact information or frequently used Zoom links.
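
A minimal sketch of the retrieval step, using the embedding model named in `configs/base.yaml` and a flat FAISS index; Panza wraps this in its own retriever classes, so the variable names and toy emails here are purely illustrative:

```python
import faiss
from sentence_transformers import SentenceTransformer

# Embedding model from configs/base.yaml.
encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

past_emails = [
    "Hi Bob, the Zoom link for our weekly sync is https://example.zoom.us/j/123.",
    "Dear Prof. Smith, please find the signed form attached.",
]

# Index the user's past emails once, at data-preparation time.
embeddings = encoder.encode(past_emails, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on normalized vectors
index.add(embeddings)

# At serving time, fetch the most relevant past emails for the new instruction.
query = encoder.encode(["Send Bob the link for tomorrow's call"], normalize_embeddings=True)
_, ids = index.search(query, 1)
context = [past_emails[i] for i in ids[0]]  # passed to the LLM alongside the prompt
```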

The overall structure of Panza is as follows:
<div align="center">
<img src="panza_diagram.png" alt="panza logo" width="703" style="max-width: 100%; height: auto;"/>
</div>

## Installation

### Conda
1. Make sure you have a version of [conda](https://docs.anaconda.com/free/miniconda/miniconda-install/) installed.
2. Create a new conda environment named 'panza' (or something else) and activate it:
``` bash
conda create -n panza python=3.10 -y
conda activate panza
```
3. Install the required packages:
``` bash
pip install .
```
4. If you want to also finetune models using Panza, you will need to install the additional packages:
``` bash
pip install .[training]
```

## TODO: :rocket: Getting started

To quickly get started with building your own personalized email assistant, follow the steps below:

<!-- To train your personalized email assistant, follow the three steps below. -->

<!-- TODO: Replace steps with #### heading? -->
### Step 0: Download your sent emails
<!-- **Step 1: Download your sent emails** -->
<details>
<summary> Expand for detailed download instructions.</summary>

Below we describe how to do this for Gmail via Google Takeout.

1. Go to [https://takeout.google.com/](https://takeout.google.com/).
2. Click `Deselect all`.
3. Find the `Mail` section (search for the phrase `Messages and attachments in your Gmail account in MBOX format`).
4. Select it.
5. Click on `All Mail data included` and deselect everything except `Sent`.
6. Scroll to the bottom of the page and click `Next step`.
7. Click on `Create export`.
8. Wait for the download link to arrive in your inbox.
9. Download `Sent.mbox` and place it in the `data/` directory.

For Outlook accounts, we suggest doing this via a Thunderbird plugin that can export a subset of your email in MBOX format, such as [this add-on](https://addons.thunderbird.net/en-us/thunderbird/addon/importexporttools-ng/).
</details>

At the end of this step you should have the downloaded emails placed inside `data/Sent.mbox`.

<!-- **Step 0: Environment configuration** -->
### Step 1: Environment configuration

<!-- 🎛️ -->
Panza is configured through a set of yaml configurations defined in `configs/`. There is a single high-level config under `configs/base.yaml`, and the rest are organized under the main functionalities of the code.
Note that these task-specific configs can, in some cases, be used to override base configs.
Specific use cases, such as hyperparameter tuning, are covered in more detail in `scripts/README.md`. (TODO jen: write this up.)

1. Data preparation: `configs/data_preparation.yaml`. Additionally, a custom user config must be added under `configs/user/` (see below).
1. Finetuning: the main config is in `configs/panza_finetuning.yaml`, and the method-specific ones are in `configs/finetuning/`.
1. Serving: this consists of two parts. A serving infrastructure (which we call the 'writer') runs the LLM and converts prompts into Panza outputs; an `interface` then presents those outputs in a useful form, through a command-line interface, a web interface, a Gmail client (TODO: Sean), or a bulk `.json` format (useful for evaluation). The configs for the writer are in `panza_writer.yaml`, and those for the interfaces are under `configs/interfaces`. (A small sketch of how these configs compose follows below.)
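
The following sketch uses Hydra's compose API to materialize the merged configuration without running anything; run it from inside `scripts/` (hence the relative `config_path`), and treat the override value as just an example:

```python
from hydra import initialize, compose
from omegaconf import OmegaConf

# Compose configs/base.yaml together with a user config.
with initialize(version_base=None, config_path="../configs"):
    cfg = compose(config_name="base", overrides=["user=default"])

print(OmegaConf.to_yaml(cfg))  # inspect the merged configuration
print(cfg.embedding_model)     # e.g. "sentence-transformers/all-mpnet-base-v2"
```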

<!-- 💬 -->
These scripts are described in more detail in `scripts/README.md`, but a few customizations need to happen immediately.
:warning: Before continuing, make sure you complete the following setup:
- Optionally, copy `configs/user/default.yaml` to `configs/user/[YOURNAME].yaml`. If you skip this, make the following modifications directly in `default.yaml`. A convenient choice for `[YOURNAME]` is the output of `whoami`.
- In the user config, set the email address and username. The email address should be the sender address in the exported emails (Panza uses it to filter out responses and other emails written by a different author in the `.mbox` dump). The username does not have to match the email address; it is simply used to name the various data files produced by the data preparation process. Again, the output of `whoami` is a handy choice.
- Modify the personal prompt in `prompt_preambles/user_preamble.txt` to include some basic information about yourself that Panza can use to customize your emails with your correct full name, address, phone number, etc.


Additionally, please perform the following login steps to be able to download the base model.
- Log in to Hugging Face to be able to download pretrained models: `huggingface-cli login`.
- [Optional] Log in to Weights & Biases to log metrics during training: `wandb login`. Then, set `wandb_disabled=false` in `configs/finetuning/base.yaml`.


You are now ready to move to `scripts`.
``` bash
cd scripts
```

### Step 2: Extract emails
<!-- **Step 2: Extract emails** -->

1. Run `CUDA_VISIBLE_DEVICES=X python ./prepare_data.py`.<details>
<summary> This script takes care of all the prerequisites before training (expand for details). </summary>

- Extracts your emails in text format to `data/<username>_clean.jsonl` which you can manually inspect.
- Creates synthetic prompts for your emails as described in the [data playback](#film_projector-step-1-data-playback) section. The results are stored in `data/<username>_clean_summarized.jsonl` and you can inspect the `"summary"` field.
- Splits data into training and test subsets. See `data/train.jsonl` and `data/test.jsonl`.
- Creates a vector database from the embeddings of the training emails which will later be used for *Retrieval-Augmented Generation (RAG)*. See `data/<username>.pkl` and `data/<username>.faiss`.
</details>
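
If you want a quick look at the prepared data, here is a small sketch; only the `"summary"` field is documented above, so the `"email"` field name and the example username are assumptions:

```python
import json
from itertools import islice
from pathlib import Path

username = "jen"  # whatever username you set in your user config (example value)
path = Path("../data") / f"{username}_clean_summarized.jsonl"  # relative to scripts/

with path.open() as f:
    for line in islice(f, 3):
        record = json.loads(line)
        print("INSTRUCTION:", record["summary"])
        print("EMAIL:", record.get("email", "")[:200])
        print()
```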

TODO Jen: This doesn't work anymore, because we make the RAG database right away. If you wish to eliminate any emails from the training set (e.g. those containing certain personal information), you can simply remove the corresponding rows.

### Step 3: Train an LLM on your emails
<!-- **Step 3: Train an LLM on your emails** -->

We currently support `LLaMA3-8B-Instruct` and `Mistral-Instruct-v0.2` LLMs as base models; the former is the default, but we obtained good results with either model.

1. [Recommended] For parameter efficient fine-tuning, run `./train_rosa.sh`.
If a larger GPU is available and full-parameter fine-tuning is possible, run `./train_fft.sh`.

2. We have prepopulated the training configs with parameter values that worked best for us. We recommend you try those first, but you can also experiment with different hyper-parameters by passing extra arguments to the training script, such as `finetuning.lr`, `finetuning.rosa_lr`, or `finetuning.max_duration` (see the examples below). All the trained models are saved in the `checkpoints` directory.

Examples:
``` bash
./train_rosa.sh # Will use the default parameters.

./train_rosa.sh finetuning.lr=1e-6 finetuning.rosa_lr=1e-6 finetuning.max_duration=7ep  # Override selected hyper-parameters.
```
<details>
<summary> FAQs. </summary>
The bash scripts that are used to execute the finetuning procedure assume by default that your username is what is returned by the <code>whoami</code> command. This is used to locate the name of the user configs inside the <code>configs/user</code> directory as above. If you directly modified <code>default.yaml</code>, or created another yaml file where the name of that file does not match with the output of <code>whoami</code>, there will be an error. This is an easy fix. You can either:
<ol>
<li> Change the name of the yaml file to be the output of <code>whoami</code>.
<li> Override the username manually when you launch the bash script by adding <code>user=x</code>, where <code>x</code> is the name of the yaml file you created. For example: <code>./train_rosa.sh user=alonso</code>
</ol>
<br>
If you wish to add <code>CUDA_VISIBLE_DEVICES</code> to specify a specific GPU, please add this in the shell script directly by <code>export CUDA_VISIBLE_DEVICES=x</code> where <code>x</code> is the ID of the GPU you wish to use.
</details>


### Step 4: Launch Panza!
<!-- **Step 4: Launch Panza!** -->

- To run Panza after a full training run, try something like `CUDA_VISIBLE_DEVICES=0 python3 runner.py user=USERNAME interfaces=cli writer/llm=transformers`.
- To run Panza after a RoSA or LoRA training run, replace `writer/llm=transformers` with `writer/llm=peft` TODO Armand: can we fix this?

> **@maddox-j** (Nov 6, 2024): Integrate with the inference markdown + resolve TODO


:email: **Have fun with your new email writing assistant!** :email:

<!-- For in depth customization of each step of the pipeline, refer to ... -->


## :microscope: Advanced usage
- [Data Preparation Guide](./scripts/README.md#data-guide)
- [Hyper-Parameter Tuning Guide](./scripts/README.md#hyper-parameter-tuning-guide)
- [Prompt Preambles Tutorial](prompt_preambles/README.md)

## Authors

Panza was conceived by Nir Shavit and Dan Alistarh and built by the [Distributed Algorithms and Systems group](https://ist.ac.at/en/research/alistarh-group/) at IST Austria. The contributors are (in alphabetical order):

Dan Alistarh, Eugenia Iofinova, Eldar Kurtic, Ilya Markov, Armand Nicolicioiu, Mahdi Nikdan, Andrei Panferov, and Nir Shavit.

Contact: [email protected]

We thank our collaborators Michael Goin and Tony Wang at NeuralMagic and MIT for their helpful testing and feedback.
46 changes: 46 additions & 0 deletions TEMP_HOW_TO_RUN_INFERENCE.md
> **Contributor Author:** Need to rename, and to link back to the original README

@@ -0,0 +1,46 @@
# How to run inference in Panza3

There are two backend options: Ollama (no GPU) or Local (with GPU). The dependencies necessary for each backend are different.

## Step 1: Install Dependencies for Panza

For Ollama, simply run:
```bash
pip install -e .
```

For Local, run:
```bash
pip install -e .
```
and
```bash
pip install -e .[training]
```

## Step 2a: Ollama Prerequisites

If running with Ollama, then Ollama needs to be installed from the [web page](https://ollama.com/).

Then, you will need to convert your model into a GGUF file.
> **Contributor Author:** Is it beneficial to add more support for this?


## Step 2b: Local Prerequisites

If running locally, the Panza model needs to be located in the `data/` directory.

## Step 3: Set configurations

In the `configs/` folder, add a user YAML file for yourself under `user/`.

If running with Ollama, edit the `name` and `gguf` fields in `writer/llm/ollama.yaml`, giving a name of your choice and the path to the GGUF file.

## Step 4: Run Panza

To run Panza, cd into the `scripts` directory and run:
```bash
python3 runner.py user=<your name> interfaces=<cli/gui/web> writer/llm=<ollama/peft/transformers>
```
For example, to run with Ollama and the CLI interface with the user `test`, run:
```bash
python3 runner.py user=test interfaces=cli writer/llm=ollama
```
9 changes: 9 additions & 0 deletions configs/base.yaml
@@ -0,0 +1,9 @@
defaults:
- user: default

panza_workspace: ${hydra:runtime.cwd}/../
checkpoint_dir: ${panza_workspace}/checkpoints
seed: 41

embedding_model: "sentence-transformers/all-mpnet-base-v2"
model_precision: bf16 # bf16 or fp32
configs/finetuning/base.yaml
@@ -1,60 +1,43 @@
wandb_disabled: true # We assume that wandb is disabled unless the user has logged on.

max_seq_len: 512
global_seed: 17
model_name_or_path: #TODO
global_seed: ${seed}

load_path: # set via bash script to be absolute path to your sparse checkpoint
precision: amp_bf16
hf_save_path: ./checkpoints
hf_save_path: ${checkpoint_dir}/models

max_duration: # TODO
eval_interval: 1
seed: ${global_seed}

global_train_batch_size: #TODO
device_train_microbatch_size: 16
device_eval_batch_size: 16
global_train_batch_size: 8
device_train_microbatch_size: 1
device_eval_batch_size: 1

run_name: # If left blank, will be read from env var $RUN_NAME
run_name: # If left blank, it will be generated based on configs

model:
name: hf_causal_lm
pretrained: true
pretrained_model_name_or_path: ${model_name_or_path}
max_seq_len: ${max_seq_len}
pretrained_model_name_or_path: ${finetuning.model_name_or_path}
max_seq_len: ${finetuning.max_seq_len}
output_hidden_states: true
weight_bias_dtype: #TODO
weight_bias_dtype: ${model_precision}
compute_dtype: bf16

rosa:
lora_r: #TODO
spa_d: #TODO
lora_alpha: 16
target_modules: 'all-linear'
lora_dropout: 0.05
impl: auto
spa_store_transpose: true
rosa_dtype: bf16
spa_num_grads: 1
grad_acc_mode: mean_squared
mask_load_path: #TODO
mask_save_path: #TODO
terminate_after_mask_generation: #TODO
schedule: #TODO

tokenizer:
name: ${model_name_or_path}
name: ${finetuning.model_name_or_path}
kwargs:
model_max_length: ${max_seq_len}
model_max_length: ${finetuning.max_seq_len}

train_loader:
name: finetuning
dataset:
hf_name: json
split: train
hf_kwargs:
data_files: #TODO
preprocessing_fn: preprocessing:panza_preprocessing_function
max_seq_len: ${max_seq_len}
data_files: ${user.data_dir}/train.jsonl
preprocessing_fn: panza3.finetuning.preprocessing:panza_preprocessing_function
max_seq_len: ${finetuning.max_seq_len}
allow_pad_trimming: false
decoder_only_format: true
shuffle: true
@@ -72,7 +55,7 @@ scheduler:

optimizer:
name: decoupled_adamw
lr: # TODO
lr: 1e-5
betas:
- 0.9
- 0.999
29 changes: 29 additions & 0 deletions configs/finetuning/full.yaml
@@ -0,0 +1,29 @@
defaults:
- base


max_duration: 3ep
lr: 1e-5
batch_size: 8
eval_interval: 1
seed: ${seed}
model_name_or_path: "ISTA-DASLab/Meta-Llama-3-8B-Instruct"

fsdp_config:
sharding_strategy: FULL_SHARD
mixed_precision: FULL
activation_checkpointing: true
activation_checkpointing_reentrant: false
activation_cpu_offload: false
limit_all_gathers: true
verbose: false

callbacks:
hf_checkpointer:
overwrite: true
precision: # TODO
save_folder: ${finetuning.hf_save_path}/${finetuning.run_name}
save_interval: 1dur

scheduler:
t_warmup: 20ba