[WIP] Codebase Refactor #22

Draft: wants to merge 120 commits into `main`.

Commits (120)
52099b6
Restructure as python package (#19)
shawseanyang Aug 16, 2024
053dc09
Add Ollama inference
ArmandNM Aug 16, 2024
0e8eb89
Panza Web Server (#12)
shawseanyang Aug 20, 2024
498fdc1
Add black line length configuration
ArmandNM Aug 20, 2024
aec8d50
Add new Panza src path
ArmandNM Aug 20, 2024
39d37e2
Create class interfaces
ArmandNM Aug 22, 2024
769277c
Set up unit testing
ArmandNM Aug 23, 2024
42acc78
web hosting
shawseanyang Aug 23, 2024
081bf94
implement ollama llm
shawseanyang Aug 23, 2024
de2fe12
Add FAISS retriever and update Document interface to preepare for ind…
ArmandNM Aug 24, 2024
32a8a39
Fix missing method in retriever interface
ArmandNM Aug 24, 2024
094fa24
Make thread and past_messages optional in EmailInstruction
ArmandNM Aug 24, 2024
c9ea456
Add email prompt builder
ArmandNM Aug 24, 2024
6a9897e
Add local transformers inference
ArmandNM Aug 25, 2024
30b3dc1
Add Peft models and conditional imports
ArmandNM Aug 26, 2024
1d73aff
Add Panza Writer
ArmandNM Aug 26, 2024
c648f98
Add support to return full prompt from writer
ArmandNM Aug 27, 2024
09f08e4
Set corresponding retriever document type in prompt builder
ArmandNM Aug 27, 2024
a3c6ac6
Remove debugging print from prompting utils
ArmandNM Aug 27, 2024
6b67156
Add Hydra config-based runner for Panza writer
ArmandNM Aug 27, 2024
cd9b041
rename ollama_llm.py to ollama.py
shawseanyang Aug 28, 2024
7e3ee43
add type annotations to OllamaLLM
shawseanyang Aug 28, 2024
8b1ef60
add some more type annotations to OllamaLLM
shawseanyang Aug 28, 2024
73b37be
check installation for OllamaLLM
shawseanyang Aug 28, 2024
90d59d1
rename test_llm.py to test_local_llm.py
shawseanyang Aug 28, 2024
f9ddb8f
add pytest to dev dependencies
shawseanyang Aug 28, 2024
af88a28
add sampling_params to super() init call
shawseanyang Aug 28, 2024
28dce17
add unit tests for ollama_llm.py
shawseanyang Aug 28, 2024
b8fba39
black formatting
shawseanyang Aug 28, 2024
5f05228
fix types
shawseanyang Aug 28, 2024
91ef882
add to gitignore
shawseanyang Aug 28, 2024
d4d0a95
black formatting
shawseanyang Aug 28, 2024
a9f9152
add omegaconf to dependencies
shawseanyang Aug 28, 2024
49dfb1d
Add FFT runner
ArmandNM Sep 1, 2024
c1775ca
Merge branch 'refactor' of github.com:IST-DASLab/PanzaMail into refactor
shawseanyang Sep 2, 2024
48ba916
comment out the running examples
shawseanyang Sep 2, 2024
b662f80
split dependencies into base and training and add documentation in RE…
shawseanyang Sep 2, 2024
4b2fa76
add hydra to dependencies
shawseanyang Sep 2, 2024
d4bd1aa
Fix unused num_thread_emails parameter
ArmandNM Sep 3, 2024
8df8f11
Add RoSA runner
ArmandNM Sep 3, 2024
2c221cc
Temporarily rename serialized document in vector db metadata
ArmandNM Sep 3, 2024
e72d13c
move some dependencies from training into base
shawseanyang Sep 3, 2024
7447aac
add outputs folder to gitignore
shawseanyang Sep 3, 2024
b151fc3
add writer to the constructor arguments of the web service
shawseanyang Sep 3, 2024
77c1867
delete run_panza.py bc its just a test file
shawseanyang Sep 3, 2024
fc7db3f
rename constructor argument for the Ollama LLM class to match the loc…
shawseanyang Sep 3, 2024
b6f9921
add none retriever to allow running without RAG
shawseanyang Sep 3, 2024
d70f512
add config for Ollama LLM
shawseanyang Sep 3, 2024
5d79649
Merge branch 'refactor' of github.com:IST-DASLab/PanzaMail into refactor
shawseanyang Sep 4, 2024
01b4052
remove DEFAULT_PORT and add integer type hint
shawseanyang Sep 4, 2024
1d05864
add interfaces
shawseanyang Sep 5, 2024
a522768
add hydra yaml overrides to training script (works for full training …
Sep 5, 2024
3a2c70b
remove my user from configs and add comments to show how to enable ot…
shawseanyang Sep 5, 2024
3801351
add temporary inference instructions
shawseanyang Sep 5, 2024
f1bf461
use command line config specifications in the inference instructions
shawseanyang Sep 5, 2024
9c821b3
add example to inference instructions
shawseanyang Sep 5, 2024
46f222d
update training script for RoSA
Sep 6, 2024
a4a38d5
Merge branch 'refactor' into jen/train-refactor
Sep 6, 2024
62d3b27
remove redundant, unnecessary and problematic use_rag and use_thread …
Sep 6, 2024
bd42248
minor training cleanups
Sep 6, 2024
7e8f343
add config for peft writer
Sep 10, 2024
090c8f1
deprecate panza_finetuning.yaml
Sep 11, 2024
aa6a586
small config fixes
Sep 11, 2024
19e5b54
Refactor data summarization
ArmandNM Sep 11, 2024
4fb5c8d
add sampling parameters to ollama LLM
shawseanyang Sep 11, 2024
422a2a5
Merge branch 'refactor' of github.com:IST-DASLab/PanzaMail into refactor
shawseanyang Sep 11, 2024
21a3fcf
refactor configs to get unnecessary params out of configs/base
Sep 12, 2024
9a1c9c2
allow code execution during model loading to allow phi3.5
Sep 12, 2024
3c44172
greatly simplify the .sh training script to take advantage of the con…
Sep 12, 2024
6849903
add streaming to cli
shawseanyang Sep 12, 2024
764f980
add streaming to transformers
shawseanyang Sep 12, 2024
50dcde2
update training scripts to process arguments correctly
Sep 13, 2024
6937018
minor fix for train_fft
Sep 13, 2024
0dc314a
write the full checkpoint to the expected location
Sep 13, 2024
9fa93f0
add sampling parameters to ollama LLM
shawseanyang Sep 11, 2024
ddd6041
Refactor data summarization
ArmandNM Sep 11, 2024
91a64f0
add streaming to cli
shawseanyang Sep 12, 2024
41d27c3
add streaming to transformers
shawseanyang Sep 12, 2024
66cf2d4
update web.py to match LLM interface
shawseanyang Sep 13, 2024
098ef32
Merge branch 'refactor' of github.com:IST-DASLab/PanzaMail into refactor
Sep 16, 2024
0c456b5
first pass at porting evaluation to new framework
Sep 16, 2024
2e20f12
do NOT split test and train data by default
Sep 16, 2024
298d023
make the json interface more robust to json file format
Sep 17, 2024
02e0322
fix bug where RoSA FFT still tries to move folder over
Sep 19, 2024
d4b70f8
emergency bugfix for streaming to cli
Sep 19, 2024
226082e
Merge branch 'refactor' into jen/eval-refactor
Oct 18, 2024
3314abe
add creating RAG DB to data preparation script
Oct 18, 2024
ab96fea
add test-train splitting to data preparation
Oct 18, 2024
e3f12e0
undo accidental commenting out
Oct 18, 2024
dbcc5d0
bug fix
Oct 18, 2024
dfbde00
add tqdm to json interface
Oct 18, 2024
eb720d1
let panza writer load latest checkpoint
Oct 18, 2024
d7b298f
add email extraction to data preparation
Oct 21, 2024
924033e
update panza readme
Oct 21, 2024
ea5ffea
update env preparation script
Oct 21, 2024
ca4b690
slight refactor of runner.py
Oct 28, 2024
233083e
remove some unused .sh files
Oct 28, 2024
6f94379
qq
Oct 28, 2024
8ce3c4a
make the first part of the data preparation script optional (in case …
Oct 29, 2024
c1f36a8
Edits and Bug Fixes
maddox-j Oct 31, 2024
bacbdf7
Add additional clarification on username importance
maddox-j Nov 4, 2024
6be5813
update data preparation
Nov 4, 2024
302bae6
Miscellanous updates
Nov 8, 2024
c13aa07
Fix function address
Nov 8, 2024
fee7cbd
Clean up code TODOs and revert to defaults
Nov 12, 2024
99f5c7c
move top-level README to default location
Nov 12, 2024
1238d39
update the scripts/ readme
Nov 12, 2024
c0a94a3
remove useless assert
Nov 12, 2024
99a0b68
Merge branch 'jen/eval-refactor' of github.com:IST-DASLab/PanzaMail i…
Nov 12, 2024
3e2203c
Merge changes.
Nov 12, 2024
1e62597
Once again, try to centralize the main README.
Nov 12, 2024
71a85c9
Update the README
Nov 12, 2024
73308a0
Update README.md remove resolved TODO
ohaijen Nov 13, 2024
e5a9e44
update hyperparameter tuning guide
Nov 13, 2024
49258b9
Merge branch 'jen/eval-refactor' of github.com:IST-DASLab/PanzaMail i…
Nov 13, 2024
b3bc00f
Refactor panza3 -> panza
Nov 13, 2024
be76c39
Clear ollama and web use-case
Nov 14, 2024
677dd8a
Update README.md remove confusing period.
ohaijen Nov 18, 2024
e1e8e6d
Update README.md Add instructions for quantized training
ohaijen Nov 18, 2024
525d0e3
correct README for quantized training
Nov 18, 2024
2 changes: 2 additions & 0 deletions .env
@@ -0,0 +1,2 @@
# Store API keys here
API_KEYS=apikey1,apikey2,apikey3
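
As a rough sketch of how a server might consume this file (the `API_KEYS` variable name comes from the `.env` above; the loading and checking logic here is hypothetical, not necessarily what the Panza web server does), the comma-separated value can be split into individual keys:

``` python
import os

# Hypothetical sketch: API_KEYS is the comma-separated value from .env,
# e.g. exported into the environment before the server starts.
api_keys = {key.strip() for key in os.environ.get("API_KEYS", "").split(",") if key.strip()}

def is_authorized(request_key: str) -> bool:
    """Accept a request only if its key matches one of the configured keys."""
    return request_key in api_keys
```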
6 changes: 5 additions & 1 deletion .gitignore
@@ -6,5 +6,9 @@ __pycache__/
checkpoints/
results/
wandb/
outputs/

*.log
*.log
*.egg-info
.vscode
build/
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,6 @@
repos:
- repo: https://github.com/psf/black
rev: 22.10.0
hooks:
- id: black
language_version: python3.10
143 changes: 86 additions & 57 deletions README.md
@@ -23,7 +23,7 @@ Its main features are as follows:
</div>


## Prerequisites
## TODO: Prerequisites
- Your emails, exported to `mbox` format (see tutorial below).
- A computer, preferably with a NVIDIA GPU with at least 24 GiB of memory (alternatively, check out [running in Google Colab](#cloud-try-out-panza-in-google-colab)).
- A Hugging Face [account](https://huggingface.co/login) to download the models (free of charge).
@@ -62,30 +62,21 @@ The overall structure of Panza is as follows:

### Conda
1. Make sure you have a version of [conda](https://docs.anaconda.com/free/miniconda/miniconda-install/) installed.
2. Run `source prepare_env.sh`. This script will create a conda environment named `panza` and install the required packages.

2. Create a new conda environment named 'panza' (or something else) and activate it:
``` bash
conda create -n panza python=3.10 -y
conda activate panza
```
3. Install the required packages:
``` bash
pip install .
```
4. If you want to also finetune models using Panza, you will need to install the additional packages:
``` bash
pip install .[training]
```

### Docker
As an alternative to the conda option above, you can run the following commands to pull a docker image with all the dependencies installed.
```
docker pull istdaslab/panzamail
```

or alternatively, you can build the image yourself:
```
docker build . -f Dockerfile -t istdaslab/panzamail
```

Then run it with:
```
docker run -it --gpus all istdaslab/panzamail /bin/bash
```

In the docker you can activate the `panza` environment with:
```
micromamba activate panza
```

## :rocket: Getting started
## TODO: :rocket: Getting started

To quickly get started with building your own personalized email assistant, follow the steps below:

@@ -118,16 +109,26 @@ At the end of this step you should have the downloaded emails placed inside `dat
### Step 1: Environment configuration

<!-- 🎛️ -->
Panza is configured through a set of environment variables defined in `scripts/config.sh` and shared along all running scripts.
Panza is configured through a set of yaml configurations defined in `configs/`. There is a single high-level config under `configs/base.yaml`, and the rest are organized under the main functionalities of the code.
Note that these task-specific configs can, in some cases, be used to override base configs.
Specific use cases, such as hyperparameter tuning, are covered in more detail in `scripts/README.md`. (TODO jen: write this up.) A minimal sketch of how these configs compose with Hydra is shown after the list below.

<!-- 💬 -->
The LLM prompt is controlled by a set of `prompt_preambles` that give the model more insight about its role, the user and how to reuse existing emails for *Retrieval-Augmented Generation (RAG)*. See more details in the [prompting section](prompt_preambles/README.md).
1. Data preparation: `configs/data_preparation.yaml`. Additionally, a custom user config must be added under `config/users/` (see below).
1. Finetuning: the main config is in `configs/panza_finetuning.yaml` and the method-specific ones are in `configs/finetuning/`
1. Serving: Serving consists of two parts: a serving infrastructure (that we call the 'writer'), which runs the LLM and converts prompts into Panza outputs, and an `interface`, which presents the outputs in a useful form - through a command-line interface, a web interface, a gmail client (TODO:Sean), or in a bulk `.json` format (useful for evaluation). The configs for serving are in `panza_writer.yaml`, and the interface configs are under `configs/interfaces`.
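
For readers new to Hydra, the following is a minimal sketch of how such a config tree composes and how command-line-style overrides apply. It assumes the `configs/` layout described above (a top-level `base.yaml` plus groups such as `user` and `interfaces`); the group and key names are illustrative, not a definitive schema:

``` python
from hydra import compose, initialize
from omegaconf import OmegaConf

# Minimal Hydra composition sketch, assuming configs/base.yaml plus config
# groups such as configs/user/ and configs/interfaces/ (names follow the
# layout described above; treat them as assumptions, not the exact schema).
with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="base",
        overrides=["user=default", "interfaces=cli"],  # same syntax as the CLI flags used later
    )

print(OmegaConf.to_yaml(cfg))  # inspect the fully resolved configuration
```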

<!-- 💬 -->
These scripts are described in more detail in `scripts/README.md`, but a few customizations need to happen immediately.
:warning: Before continuing, make sure you complete the following setup:
- Modify the environment variable `PANZA_EMAIL_ADDRESS` inside `scripts/config.sh` with your own email address.
- Modify `prompt_preambles/user_preamble.txt` with your own information. If you choose, this can even be empty.
- Copy `users/default.yaml` to `users/[YOURNAME].yaml`. If this is skipped, perform the following modifications on `users/default.yaml` directly. A useful tip for choosing `[YOURNAME]` is to set it to the output of `whoami`. If you modify the default yaml instead, you will need to specify `user=default` as an extra flag in the succeeding steps.
- In the user config, set the email address and username. The email address should be the sender address in the exported emails (Panza uses this to filter out responses and other emails sent by a different author in the `.mbox` dump; see the illustrative sketch after this list). The username does not have to match the email address itself - it is simply used to name the various data files produced by the data preparation process. A handy choice is the output of the `whoami` command in your shell.
- Modify the personal prompt in `prompt_preambles/user_preamble.txt` to include some basic information about yourself that Panza can use to customize your emails with your correct full name, address, phone number, etc.
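
To make the role of the sender address concrete, here is a small illustrative sketch (using Python's standard `mailbox` module, not the project's actual extraction code) of filtering an `.mbox` dump down to messages written from your own address; the path and address below are placeholders:

``` python
import mailbox

MY_ADDRESS = "you@example.com"  # placeholder: the sender address from your user config

# Illustrative only: keep messages whose From header contains your address,
# dropping replies and other emails authored by someone else.
box = mailbox.mbox("data/your_export.mbox")  # placeholder path to the exported dump
own_emails = [msg for msg in box if MY_ADDRESS.lower() in (msg.get("From") or "").lower()]

print(f"Kept {len(own_emails)} of {len(box)} messages written by {MY_ADDRESS}")
```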


Additionally, please perform the following login steps to be able to download the base model.
- Login to Hugging Face to be able to download pretrained models: `huggingface-cli login`.
- [Optional] Login to Weights & Biases to log metrics during training: `wandb login`. Then, set `PANZA_WANDB_DISABLED=False` in `scripts/config.sh`.
- [Optional] Login to Weights & Biases to log metrics during training: `wandb login`. Then, set `wandb_disabled=false` in `configs/finetuning/base.yaml`.


You are now ready to move to `scripts`.
``` bash
@@ -137,74 +138,102 @@ cd scripts
### Step 2: Extract emails
<!-- **Step 2: Extract emails** -->

1. Run `./extract_emails.sh`. This extracts your emails in text format to `data/<username>_clean.jsonl` which you can manually inspect.

2. If you wish to eliminate any emails from the training set (e.g. containing certain personal information), you can simply remove the corresponding rows.

### Step 3: Prepare dataset
<!-- **Step 3: Prepare dataset** -->

1. Simply run `./prepare_dataset.sh`.<details>
1. Run `CUDA_VISIBLE_DEVICES=X ./prepare_data.sh`.<details>
<summary> This script takes care of all the prerequisites before training (expand for details). </summary>

- Extracts your emails in text format to `data/<username>_clean.jsonl` which you can manually inspect.
- Creates synthetic prompts for your emails as described in the [data playback](#film_projector-step-1-data-playback) section. The results are stored in `data/<username>_clean_summarized.jsonl` and you can inspect the `"summary"` field.
- Splits data into training and test subsets. See `data/train.jsonl` and `data/test.jsonl`.
- Creates a vector database from the embeddings of the training emails which will later be used for *Retrieval-Augmented Generation (RAG)*. See `data/<username>.pkl` and `data/<username>.faiss`. (An illustrative retrieval sketch follows the FAQs below.)
</details>
**NB**: if you did not change the default configuration in `user/default.yaml` to reflect your particulars but rather created a new file, you need to add the flag `user=x` to the above command, where `x.yaml` is the name of your config file.

<details>
<summary> FAQs. </summary>
When running the above script, you may encounter an <code>OutOfMemoryError</code>. If this is the case, you can either:
<ol>
<li> Reduce the batch size for the data processing step. This can be found in <code>configs/panza_preparation.yaml</code>.
<li> Move to a machine that has more memory.
</ol>
</details>
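
To give a concrete picture of what the vector-database step above produces conceptually, below is a minimal, self-contained retrieval sketch using FAISS and a sentence-embedding model. The example emails, the embedding model name, and the index type are placeholders chosen for illustration; they are not necessarily what Panza uses internally:

``` python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder emails standing in for the training split (e.g. data/train.jsonl).
emails = [
    "Hi team, attached are the slides for Friday's review meeting.",
    "Thanks for the invite, I can make the 3pm slot on Tuesday.",
    "Could you resend the invoice? The previous link has expired.",
]

# Embed the emails; the model name here is an arbitrary example.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(emails, convert_to_numpy=True).astype(np.float32)

# Build the simplest FAISS index (exact L2 search) over the email embeddings.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# At writing time, retrieve the most similar past emails for a new prompt.
query = encoder.encode(
    ["Ask the vendor to resend the billing document"], convert_to_numpy=True
).astype(np.float32)
distances, ids = index.search(query, 2)
print([emails[i] for i in ids[0]])
```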


### Step 4: Train a LLM on your emails
<!-- **Step 4: Train a LLM on your emails** -->
### Step 3: Train a LLM on your emails
<!-- **Step 3: Train a LLM on your emails** -->

We currently support `LLaMA3-8B-Instruct` and `Mistral-Instruct-v0.2` LLMs as base models; the former is the default, but we obtained good results with either model.

1. [Recommended] For parameter-efficient fine-tuning, run `./train_rosa.sh`.
If a larger GPU is available and full-parameter fine-tuning is possible, run `./train_fft.sh`.

2. We have prepopulated the training scripts with parameter values that worked best for us. We recommend you try those first, but you can also experiment with different hyper-parameters by passing extra arguments to the training script, such as `LR`, `LORA_LR`, `NUM_EPOCHS`. All the trained models are saved in the `checkpoints` directory.
2. We have prepopulated the training configs with parameter values that worked best for us. We recommend you try those first, but you can also experiment with different hyper-parameters by passing extra arguments to the training script, such as `lr`, `lora_lr`, `num_epochs`. All the trained models are saved in the `checkpoints` directory.

Examples:
``` bash
./train_rosa.sh # Will use the default parameters.
CUDA_VISIBLE_DEVICES=X ./train_rosa.sh # Will use the default parameters.

./train_rosa.sh LR=1e-6 LORA_LR=1e-6 NUM_EPOCHS=7 # Will override LR, LORA_LR, and NUM_EPOCHS.
CUDA_VISIBLE_DEVICES=X ./train_rosa.sh finetuning.lr=1e-6 finetuning.rosa_lr=1e-6 finetuning.max_duration=7ep
```
<details>
<summary> FAQs. </summary>
The bash scripts that are used to execute the finetuning procedure assume by default that your username is what is returned by the <code>whoami</code> command. This is used to locate the user config inside the <code>configs/user</code> directory, as above. If you directly modified <code>default.yaml</code>, or created another yaml file whose name does not match the output of <code>whoami</code>, there will be an error. This is an easy fix; you can either:
<ol>
<li> Change the name of the yaml file to be the output of <code>whoami</code>.
<li> You can override the username manually when you launch the bash script by adding <code>user=x</code> where <code>x</code> is the name of the yaml file you created. For example: <code>./train_rosa.sh user=alonso</code>
</ol>
<br>
If you wish to add <code>CUDA_VISIBLE_DEVICES</code> to specify a specific GPU, please add this in the shell script directly by <code>export CUDA_VISIBLE_DEVICES=x</code> where <code>x</code> is the ID of the GPU you wish to use.
<br><br>
A known issue is that when you fine-tune your model with RAG, there can be a case when the tokenization of the dataset seemingly hangs. This is due to a known bug in HF's <code>map</code> function when <code>n_proc>1</code>. To alleviate this issue, you can set <code>torch.set_num_threads(1)</code> in <code>src/panza/finetuning/train.py</code> or set the equivalent parameter in <code>configs/finetuning/rosa.yaml</code>.
</details>

On a smaller GPU, it may be necessary to further train in lower precision (QRoSA). This can be run as follows:
``` bash
./train_rosa.sh finetuning.precision=amp_bf16 finetuning.model.weight_bias_dtype=4bit
```

### Step 5: Launch Panza!
<!-- **Step 5: Launch Panza!** -->

1. Run `./run_panza_gui.sh MODEL=<path-to-your-trained-model>` to serve the trained model in a friendly GUI.
Alternatively, if you prefer using the CLI to interact with Panza, run `./run_panza_cli.sh` instead.

You can experiment with the following arguments:
- If `MODEL` is not specified, it will use a pretrained `Meta-Llama-3-8B-Instruct` model by default, although Panza also works with `Mistral-7B-Instruct-v2`. Try it out to compare the style difference!
- To disable RAG, run with `PANZA_DISABLE_RAG_INFERENCE=1`.

Example:
``` bash
./run_panza_gui.sh \
  MODEL=/local/path/to/this/repo/checkpoints/models/panza-rosa_1e-6-seed42_7908 \
  PANZA_DISABLE_RAG_INFERENCE=0 # this is the default behaviour, so you can omit it
```

- To run Panza after a full training run, run a command like `CUDA_VISIBLE_DEVICES=0 ./runner.sh user=USERNAME interfaces=cli writer/llm=transformers model=latest`.
- To run Panza after a RoSA or LoRA training run, replace `writer/llm=transformers` with `writer/llm=peft`.

:email: **Have fun with your new email writing assistant!** :email:

<!-- For in depth customization of each step of the pipeline, refer to ... -->

## :cloud: Try out Panza in Google Colab

- You can run Panza in a Google Colab instance [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IST-DASLab/PanzaMail/blob/main/notebooks/panza_colab.ipynb).


## :microscope: Advanced usage
- [Data Preparation Guide](./scripts/README.md#data-guide)
- [Hyper-Parameter Tuning Guide](./scripts/README.md#hyper-parameter-tuning-guide)
- [Prompt Preambles Tutorial](prompt_preambles/README.md)

## :woman_technologist: Contributing
If you liked our work and want to contribute to improving the system, please feel free to do so! Make a _fork_ of our repository and, once you have made your changes, submit a pull request so that we can review it!

One thing to mention: we want to make sure that we all adhere to the same coding standards, so we have added Black, a code formatter, as a pre-commit hook. To ensure that all your files are formatted with Black, do the following:

1. Install the necessary dependencies
```
pip install .[contributing]
```

2. Install the pre-commit hook
```
pre-commit install
```

3. Continue adding code as usual. All your code will be formatted by Black before committing!

## Authors

Panza was conceived by Nir Shavit and Dan Alistarh and built by the [Distributed Algorithms and Systems group](https://ist.ac.at/en/research/alistarh-group/) at IST Austria. The contributors are (in alphabetical order):

Dan Alistarh, Eugenia Iofinova, Eldar Kurtic, Ilya Markov, Armand Nicolicioiu, Mahdi Nikdan, Andrei Panferov, and Nir Shavit.
Dan Alistarh, Eugenia Iofinova, Andrej Jovanovic, Eldar Kurtic, Ilya Markov, Armand Nicolicioiu, Mahdi Nikdan, Andrei Panferov, Nir Shavit, and Sean Yang.

Contact: [email protected]
