[WIP] Codebase Refactor #22
base: main
Changes from 102 commits
@@ -0,0 +1,2 @@
# Store API keys here
API_KEYS=apikey1,apikey2,apikey3
@@ -6,5 +6,9 @@ __pycache__/
checkpoints/
results/
wandb/
outputs/
*.log
*.egg-info
.vscode
build/
@@ -0,0 +1,205 @@
<div align="center">
  <img src="panza_logo.png" alt="panza logo" width="200"/>
</div>

# Panza: A personal email assistant, trained and running on-device

## What is Panza?

Panza is an automated email assistant customized to your writing style and past email history.
Its main features are as follows:
* Panza produces a fine-tuned LLM that matches your writing style, paired with a Retrieval-Augmented Generation (RAG) component that helps it produce relevant emails.
* Panza **can be trained and run entirely locally**. Currently, it requires a single GPU with 16-24 GiB of memory, but we also plan to release a CPU-only version. **At no point in training or execution is your data shared with the entities that trained the original LLMs, with LLM distribution services such as Hugging Face, or with us.**
* Training and execution are also quick: for a dataset on the order of 1000 emails, training Panza takes well under an hour, and generating a new email takes a few seconds at most.

<div align="center">
  <img src="panza_demo.gif" alt="panza demo" width="500"/>
</div>

## TODO: Prerequisites
- Your emails, exported to `mbox` format (see tutorial below).
- A computer, preferably with an NVIDIA GPU with at least 24 GiB of memory (alternatively, check out [running in Google Colab](#cloud-try-out-panza-in-google-colab)).
- A Hugging Face [account](https://huggingface.co/login) to download the models (free of charge).
- [Optional] A Weights & Biases [account](https://wandb.ai/login) to log metrics during training (free of charge).
- Basic Python and Unix knowledge, such as building environments and running Python scripts.
- *No prior LLM experience is needed.*

## How it works

### :film_projector: Step 1: Data playback

For most email clients, it is possible to download a user's past emails in a machine-friendly .mbox format. For example, Gmail allows you to do this via [Google Takeout](https://takeout.google.com), whereas Thunderbird allows you to do this via various plugins.

One key part of Panza is a dataset-generation technique we call **data playback**: given some of your past emails in .mbox format, we automatically create a training set for Panza by using a pretrained LLM to summarize the emails in instruction form; each email becomes a `(synthetic instruction, real email)` pair.
Given a dataset consisting of all such pairs, we use them to "play back" your sent emails: the LLM receives only the instruction and has to generate the "ground truth" email as a training target.

We find that this approach is very effective at teaching the LLM the user's writing style.
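
As a concrete illustration, once data preparation (Step 2 of the Getting started guide below) has run, each line of the summarized dataset holds one such pair, with the synthetic instruction stored in the `"summary"` field. A quick way to eyeball one pair (the username in the path is a placeholder):

``` bash
# Pretty-print the first (synthetic instruction, real email) pair.
# "jen" is a placeholder username; substitute your own.
head -n 1 data/jen_clean_summarized.jsonl | python3 -m json.tool
```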

### :weight_lifting: Step 2: Local Fine-Tuning via Robust Adaptation (RoSA)

We then use parameter-efficient finetuning to train the LLM on this dataset, locally. We found that we get the best results with the [RoSA method](https://arxiv.org/pdf/2401.04679.pdf), which combines low-rank (LoRA) and sparse finetuning. If parameter efficiency is not a concern, that is, you have a more powerful GPU, then regular full-rank/full-parameter finetuning can also be used. We find that a moderate amount of further training strikes the right balance between matching the writer's style and not memorizing irrelevant details from past emails.

### :owl: Step 3: Serving via RAG

Once we have a custom user model, Panza can be run locally together with a Retrieval-Augmented Generation (RAG) module. Specifically, this functionality stores past emails in a database and provides a few relevant emails as context for each new query. This allows Panza to better insert specific details, such as a writer's contact information or frequently used Zoom links.
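
Once data preparation (Step 2 of the Getting started guide below) has built the vector store, a quick sanity check might look like the following sketch. It assumes `data/<username>.faiss` (named later in this README) is a raw FAISS index and uses `jen` as a placeholder username:

``` bash
# Count how many training emails were indexed for retrieval.
python3 -c "import faiss; ix = faiss.read_index('data/jen.faiss'); print(ix.ntotal, 'emails indexed')"
```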

The overall structure of Panza is as follows:
<div align="center">
  <img src="panza_diagram.png" alt="panza diagram" width="703" style="max-width: 100%; height: auto;"/>
</div>

## Installation

### Conda
1. Make sure you have a version of [conda](https://docs.anaconda.com/free/miniconda/miniconda-install/) installed.
2. Create a new conda environment named 'panza' (or something else) and activate it:
``` bash
conda create -n panza python=3.10 -y
conda activate panza
```
3. Install the required packages:
``` bash
pip install .
```
4. If you also want to finetune models using Panza, install the additional packages:
``` bash
pip install .[training]
```

## TODO: :rocket: Getting started

To quickly get started with building your own personalized email assistant, follow the steps below:

<!-- TODO: Replace steps with #### heading? -->
### Step 0: Download your sent emails
<details>
<summary>Expand for detailed download instructions.</summary>

We provide instructions for doing this for Gmail via Google Takeout.

1. Go to [https://takeout.google.com/](https://takeout.google.com/).
2. Click `Deselect all`.
3. Find the `Mail` section (search for the phrase `Messages and attachments in your Gmail account in MBOX format`).
4. Select it.
5. Click on `All Mail data included` and deselect everything except `Sent`.
6. Scroll to the bottom of the page and click `Next step`.
7. Click on `Create export`.
8. Wait for the download link to arrive in your inbox.
9. Download `Sent.mbox` and place it in the `data/` directory.

For Outlook accounts, we suggest exporting a subset of your email in MBOX format via a Thunderbird add-on such as [ImportExportTools NG](https://addons.thunderbird.net/en-us/thunderbird/addon/importexporttools-ng/).
</details>

At the end of this step you should have your downloaded emails placed inside `data/Sent.mbox`.

### Step 1: Environment configuration

Panza is configured through a set of YAML files defined in `configs/`. There is a single high-level config under `configs/base.yaml`, and the rest are organized under the main functionalities of the code.
Note that these task-specific configs can, in some cases, be used to override base configs.
Specific use cases, such as hyperparameter tuning, are covered in more detail in `scripts/README.md`. (TODO jen: write this up.)

1. Data preparation: `configs/data_preparation.yaml`. Additionally, a custom user config must be added under `configs/users/` (see below).
1. Finetuning: the main config is in `configs/panza_finetuning.yaml` and the method-specific ones are in `configs/finetuning/`.
1. Serving: serving consists of two parts - a serving infrastructure (which we call the 'writer') that runs the LLM and converts prompts to Panza outputs, and an `interface`, which presents the outputs in a useful form - through a command-line interface, a web interface, a Gmail client (TODO: Sean), or in a bulk `.json` format (useful for evaluation). The configs for serving are in `panza_writer.yaml`, and for the interfaces, under `configs/interfaces`. Any value in these configs can also be overridden at the command line, as shown below.
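
For instance, assuming a user config named `jen` (a placeholder) and running from the `scripts/` directory introduced below, an override might look like:

``` bash
# Launch the CLI writer while overriding a value from configs/base.yaml.
python3 runner.py user=jen interfaces=cli writer/llm=transformers seed=42
```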

These settings are described in more detail in `scripts/README.md`, but a few customizations need to happen immediately.
:warning: Before continuing, make sure you complete the following setup:
- Optionally, copy `users/default.yaml` to `users/[YOURNAME].yaml` (see the snippet below); if you skip this, make the following modifications in `users/default.yaml` directly. A useful convention is to set `[YOURNAME]` to the output of `whoami`.
- In the user config, set the email address and username. The email address should be the sender address in the exported emails (Panza uses it to filter out responses and other emails written by a different author in the `.mbox` dump). The username does not have to be linked to the email itself; it is simply used to name the various data files produced by the data preparation process.
- Modify the personal prompt in `prompt_preambles/user_preamble.txt` to include some basic information about yourself, so that Panza can customize your emails with your correct full name, address, phone number, etc.
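
A minimal way to do the copy, assuming the `configs/users/` layout described above:

``` bash
# Create a user config named after your shell username.
cp configs/users/default.yaml configs/users/$(whoami).yaml
```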

Additionally, please perform the following login steps to be able to download the base model.
- Log in to Hugging Face to be able to download pretrained models: `huggingface-cli login`.
- [Optional] Log in to Weights & Biases to log metrics during training: `wandb login`. Then, set `wandb_disabled=false` in `configs/finetuning/base.yaml`.

You are now ready to move to `scripts`.
``` bash
cd scripts
```

### Step 2: Extract emails

1. Run `CUDA_VISIBLE_DEVICES=X python ./prepare_data.py`.<details>
    <summary>This script takes care of all the prerequisites before training (expand for details).</summary>

    - Extracts your emails in text format to `data/<username>_clean.jsonl`, which you can manually inspect.
    - Creates synthetic prompts for your emails as described in the [data playback](#film_projector-step-1-data-playback) section. The results are stored in `data/<username>_clean_summarized.jsonl`; you can inspect the `"summary"` field.
    - Splits the data into training and test subsets. See `data/train.jsonl` and `data/test.jsonl`.
    - Creates a vector database from the embeddings of the training emails, which will later be used for *Retrieval-Augmented Generation (RAG)*. See `data/<username>.pkl` and `data/<username>.faiss`.
    </details>

TODO Jen: This doesn't work anymore, because we make the RAG database right away. If you wish to eliminate any emails from the training set (e.g., those containing certain personal information), you can simply remove the corresponding rows.
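
One way to drop such rows, assuming the training set remains a line-delimited `data/train.jsonl` (and keeping in mind the TODO above about rebuilding the RAG database afterwards):

``` bash
# Remove every training example that mentions a private string, keeping a backup.
cp data/train.jsonl data/train.jsonl.bak
grep -v 'SOME_PRIVATE_STRING' data/train.jsonl.bak > data/train.jsonl
```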

### Step 3: Train an LLM on your emails

We currently support `LLaMA3-8B-Instruct` and `Mistral-Instruct-v0.2` as base models; the former is the default, but we obtained good results with either model.

1. [Recommended] For parameter-efficient fine-tuning, run `./train_rosa.sh`.
If a larger GPU is available and full-parameter fine-tuning is possible, run `./train_fft.sh`.

2. We have prepopulated the training configs with the parameter values that worked best for us. We recommend you try those first, but you can also experiment with different hyperparameters by passing extra arguments to the training script, such as `lr`, `lora_lr`, or `num_epochs`. All trained models are saved in the `checkpoints` directory.

Examples:
``` bash
./train_rosa.sh  # Uses the default parameters.

./train_rosa.sh finetuning.lr=1e-6 finetuning.rosa_lr=1e-6 finetuning.max_duration=7ep
```
<details>
<summary>FAQs.</summary>
The bash scripts that execute the finetuning procedure assume by default that your username is what is returned by the <code>whoami</code> command. This is used to locate your user config inside the <code>configs/users</code> directory, as described above. If you directly modified <code>default.yaml</code>, or created another yaml file whose name does not match the output of <code>whoami</code>, you will see an error. This is an easy fix. You can either:
<ol>
<li> Change the name of the yaml file to match the output of <code>whoami</code>.
<li> Override the username manually when you launch the bash script by adding <code>user=x</code>, where <code>x</code> is the name of the yaml file you created. For example: <code>./train_rosa.sh user=alonso</code>
</ol>
<br>
If you wish to use <code>CUDA_VISIBLE_DEVICES</code> to specify a particular GPU, add <code>export CUDA_VISIBLE_DEVICES=x</code> directly to the shell script, where <code>x</code> is the ID of the GPU you wish to use.
</details>

### Step 4: Launch Panza!

- To run Panza after a full training run, try something like `CUDA_VISIBLE_DEVICES=0 python3 runner.py user=USERNAME interfaces=cli writer/llm=transformers`.
- To run Panza after a RoSA or LoRA training run, replace `writer/llm=transformers` with `writer/llm=peft`. TODO Armand: can we fix this?

Review comment: Integrate with the inference markdown + resolve TODO.

:email: **Have fun with your new email writing assistant!** :email:

<!-- For in depth customization of each step of the pipeline, refer to ... -->

## :microscope: Advanced usage
- [Data Preparation Guide](./scripts/README.md#data-guide)
- [Hyper-Parameter Tuning Guide](./scripts/README.md#hyper-parameter-tuning-guide)
- [Prompt Preambles Tutorial](prompt_preambles/README.md)

## Authors

Panza was conceived by Nir Shavit and Dan Alistarh and built by the [Distributed Algorithms and Systems group](https://ist.ac.at/en/research/alistarh-group/) at IST Austria. The contributors are (in alphabetical order):

Dan Alistarh, Eugenia Iofinova, Eldar Kurtic, Ilya Markov, Armand Nicolicioiu, Mahdi Nikdan, Andrei Panferov, and Nir Shavit.

Contact: [email protected]

We thank our collaborators Michael Goin and Tony Wang at Neural Magic and MIT for their helpful testing and feedback.

Review comment: Need to rename, and to link back to the original README.
@@ -0,0 +1,46 @@
# How to run inference in Panza3

There are two backend options: Ollama (no GPU required) or Local (GPU required). The dependencies necessary for each backend are different.

## Step 1: Install Dependencies for Panza

For Ollama, simply run:
```bash
pip install -e .
```

For Local, run:
```bash
pip install -e .
```
and
```bash
pip install panza_mail[training]
```

## Step 2a: Ollama Prerequisites

If running with Ollama, Ollama needs to be installed from its [web page](https://ollama.com/).

Then, you will need to convert your model into a GGUF file.
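
One possible route (an assumption on our part, not tooling that Panza ships) is llama.cpp's converter, run against a merged Hugging Face checkpoint:

```bash
# Hypothetical sketch: convert a merged HF checkpoint to GGUF.
# The checkpoint directory and output path are placeholders.
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt
python3 llama.cpp/convert_hf_to_gguf.py checkpoints/<your-model-dir> --outfile panza.gguf
```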

Review comment: is it beneficial to add more support for this?
## Step 2b: Local Prerequisites | ||
|
||
If running locally, then the Panza model needs to be located in `data`. | ||
|
||
## Step 3: Set configurations | ||
|
||
In the `configs folder` add a user YAML file for yourself in `/user`. | ||
|
||
If running with Ollama, edit the `name` and `gguf` fields in `/writer/llm/ollama.yaml` with a name of your choice and the path to the GGUF file. | ||
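
A sketch of one way to make that edit, assuming the file lives at `configs/writer/llm/ollama.yaml` and that `name` and `gguf` are top-level keys (both assumptions; the values are placeholders):

```bash
# Point the Ollama writer config at your model name and GGUF file.
sed -i 's|^name:.*|name: panza-jen|' configs/writer/llm/ollama.yaml
sed -i 's|^gguf:.*|gguf: /path/to/panza.gguf|' configs/writer/llm/ollama.yaml
```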

## Step 4: Run Panza

To run Panza, `cd` into the `scripts` directory and run:
```bash
python3 runner.py user=<your name> interfaces=<cli/gui/web> writer/llm=<ollama/peft/transformers>
```
For example, to run with Ollama and the CLI interface as the user `test`, run:
```bash
python3 runner.py user=test interfaces=cli writer/llm=ollama
```
@@ -0,0 +1,9 @@
defaults:
  - user: default

panza_workspace: ${hydra:runtime.cwd}/../
checkpoint_dir: ${panza_workspace}/checkpoints
seed: 41

embedding_model: "sentence-transformers/all-mpnet-base-v2"
model_precision: bf16 # bf16 or fp32
@@ -0,0 +1,29 @@
defaults:
  - base

max_duration: 3ep
lr: 1e-5
batch_size: 8
eval_interval: 1
seed: ${seed}
model_name_or_path: "ISTA-DASLab/Meta-Llama-3-8B-Instruct"

fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

callbacks:
  hf_checkpointer:
    overwrite: true
    precision: # TODO
    save_folder: ${finetuning.hf_save_path}/${finetuning.run_name}
    save_interval: 1dur

scheduler:
  t_warmup: 20ba

Review comment: Clean TODO.