This repo contains the code for the following paper:
Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer
Bowen Tan, Yun Zhu, Lijuan Liu, Eric Xing, Zhiting Hu, Jindong Chen
NeurIPS 2023
[arXiv] [Model Card (btan2/cappy-large)]
- Cappy is a pretrained small scorer designed to enhance the performance and efficiency of multi-task LLMs.
- Cappy takes in an instruction and a candidate response as input, and produces a score between 0 and 1, indicating an estimated correctness of the response with respect to the instruction.
- With merely 360 million parameters, Cappy functions either independently on classification tasks or serves as an auxiliary component for LLMs, boosting their performance.
- Also, Cappy enables efficiently integrating downstream supervision without requiring LLM finetuning or access to the LLM's parameters.
- Furthermore, Cappy is flexible enough to cooperate with other LLM adaptations, such as finetuning, in-context learning, and prompt tuning, offering additional performance enhancement.
Cappy can now be loaded with `transformers`, either as a Jax/Flax model or a PyTorch model.
Jax/Flax:

```python
from transformers import AutoTokenizer, FlaxAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('btan2/cappy-large')
cappy = FlaxAutoModelForSequenceClassification.from_pretrained('btan2/cappy-large')

instruction = """
What label best describes this news article?
Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
"""
response = 'Business'

# Flax models expect NumPy/JAX arrays, not PyTorch tensors
inputs = tokenizer([(instruction, response), ], return_tensors='np')
score = cappy(**inputs).logits[0][0].item()
```
PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('btan2/cappy-large')
cappy = AutoModelForSequenceClassification.from_pretrained('btan2/cappy-large')

instruction = """
What label best describes this news article?
Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
"""
response = 'Business'

inputs = tokenizer([(instruction, response), ], return_tensors='pt')
score = cappy(**inputs).logits[0][0].item()
```
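Since Cappy emits one scalar score per (instruction, response) pair, using it as a standalone classifier reduces to scoring every candidate answer and taking the highest-scored one. A minimal sketch; `score_fn` is a hypothetical wrapper around the model call above, and the stand-in scores below are made up for illustration:

```python
def rank_candidates(score_fn, instruction, candidates):
    """Score each (instruction, candidate) pair and return the candidates
    sorted from highest to lowest estimated correctness."""
    return sorted(candidates, key=lambda c: score_fn(instruction, c), reverse=True)

# Stand-in scorer for illustration; in practice, replace with a call to Cappy
dummy_scores = {'World': 0.1, 'Business': 0.9, 'Sports': 0.2, 'Sci/Tech': 0.4}
best = rank_candidates(lambda _, c: dummy_scores[c],
                       'What label best describes this news article? ...',
                       list(dummy_scores))[0]
# best is the candidate with the highest score ('Business' here)
```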
Below are the scripts to reproduce the experiments in the paper. Cappy's pretraining and finetuning are both based on Redco, a lightweight tool that automates distributed training on both GPUs and TPUs.
To install Redco:

```shell
pip install redco==0.4.13
```
Sometimes the Jax version needs to be adjusted based on your device & environment. Here are some instructions.
To install the other requirements:

```shell
pip install -r requirements.txt
```
Cappy's pretraining uses the code from this example in Redco. We will release Cappy's pretraining data soon.
Following the setting of the OPT-IML paper (Section 5.2), we conduct zero-shot evaluation on 11 held-out classification tasks from PromptSource.
```shell
bash scripts/download_promptsource_test_data.sh
python cappy_promptsource.py --model_name_or_path btan2/cappy-large
```
| | OPT 30B | OPT-IML 30B | OPT 175B | OPT-IML 175B | T0 11B | Cappy (ours, 0.36B) |
|---|---|---|---|---|---|---|
| ANLI R1 | 33.7 | 37.1 | 34.1 | 42.2 | 42.1 | 34.3 |
| ANLI R2 | 34.1 | 35.4 | 34.1 | 38.5 | 37.9 | 33.9 |
| ANLI R3 | 34.7 | 36.6 | 34.7 | 39.6 | 39.7 | 34.7 |
| CB | 24.6 | 43.2 | 38.9 | 56.4 | 58.5 | 59.4 |
| RTE | 56.4 | 67.8 | 54.0 | 73.4 | 80.2 | 71.9 |
| StoryCloze | 55.5 | 90.7 | 57.0 | 95.0 | 96.7 | 93.7 |
| WSC | 43.5 | 58.2 | 51.0 | 59.2 | 58.6 | 63.8 |
| WiC | 50.8 | 54.7 | 49.7 | 53.6 | 56.0 | 51.9 |
| Winogrande | 50.2 | 53.4 | 50.1 | 56.6 | 62.5 | 51.7 |
| WinoGender | 54.9 | 64.6 | 53.9 | 72.7 | 83.8 | 68.9 |
| Crows-Pairs | 85.5 | 22.3 | 85.5 | 34.4 | 24.0 | 57.8 |
| Average | 47.6 | 51.3 | 49.3 | 56.5 | 58.2 | 56.6 |
Baseline results come from the OPT-IML paper (Section 5.2).
We take all 45 generative tasks from Big-Bench in our experiment. The command below processes the tasks into `.jsonl` format.

```shell
python scripts/get_bigbench_data.py
```
The processed datasets can be found in `./bigbench_data`, where `./bigbench_data/subset_names.json` records all the task names.
We collect generated outputs (as well as log-likelihoods on evaluation sets) from FLAN-T5 models (from `-small` to `-xxl`). They can be downloaded with

```shell
bash scripts/download_bigbench_flan_gens.sh
```
If you want to generate the outputs yourself and/or adjust the generation settings, we provide generation code below that supports distributed inference across multiple GPUs (in case the model is too large to accommodate on a single GPU, e.g., FLAN-T5-XXL (11B)).
```shell
python scripts/bigbench_flan_generate.py \
    --model_name_or_path google/flan-t5-xl \
    --n_model_shards 4
```

where `--n_model_shards` refers to the number of shards to split the large model into (usually the number of GPUs on your device when it is greater than 1).
```shell
XLA_PYTHON_CLIENT_MEM_FRACTION=.95 python cappy_bigbench.py \
    --model_name_or_path btan2/cappy-large \
    --bigbench_subset_name auto_categorization \
    --bigbench_gen_model flan-t5-xxl \
    --train_size 102400
```
- `XLA_PYTHON_CLIENT_MEM_FRACTION=.95`: (in case GPU memory is exceeded) adjusts the GPU memory pre-allocation to Jax; see here for more details.
- `--bigbench_subset_name`: the name of the subset from Big-Bench (see `./bigbench_data/subset_names.json` for all of them).
- `--bigbench_gen_model`: the FLAN model to be boosted.
- `--train_size`: the target data size to construct for Cappy's finetuning on the task (collect FLAN outputs, then truncate or repeat).
See `def main(...)` in `cappy_bigbench.py` for all the arguments.
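The truncate-or-repeat construction behind `--train_size` can be sketched as follows. This is an assumption about the behavior described above, not the script's actual code; `build_train_set` is a hypothetical helper name:

```python
import itertools

def build_train_set(examples, target_size):
    """Cycle through the collected FLAN outputs, repeating or truncating
    them so the finetuning set has exactly target_size examples."""
    if not examples:
        return []
    return list(itertools.islice(itertools.cycle(examples), target_size))
```

For instance, `build_train_set(['a', 'b', 'c'], 5)` yields `['a', 'b', 'c', 'a', 'b']`, while a target size smaller than the collected set simply truncates it.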
Every sub-task takes 40 mins to run on a single A10G GPU. The result will be logged in `./bigbench_cappy_results/{flan_model}/{subset_name}.json`.
To run all the Big-Bench subsets at once:

```shell
python scripts/run_cappy_bigbench.py --cuda_idx 0
```
To present baseline results:

```shell
python scripts/present_bigbench_baselines.py
```
To present Cappy results on all 45 Big-Bench subtasks:

```shell
python scripts/present_cappy_bigbench_results.py --gen_model_name flan-t5-xxl
```
The numbers reported in the paper were produced on TPU machines. Here we provide our reproduction results on A10G GPUs in `./bigbench_cappy_results`. The gap between them is slight (ΔrougeL <= 0.8).
| | flan-t5-small | flan-t5-base | flan-t5-large | flan-t5-xl | flan-t5-xxl |
|---|---|---|---|---|---|
| Beam Search (beam=4) | 16.4025 | 19.8594 | 23.4802 | 26.1177 | 29.6608 |
| Sampling | 11.4317 | 15.7909 | 19.6248 | 23.2191 | 25.7273 |
| Temperature (t=0.9) | 12.0126 | 17.0571 | 20.0481 | 24.2702 | 27.0985 |
| Topk (k=40) | 11.5157 | 15.7481 | 19.7634 | 22.6692 | 25.8226 |
| Nucleus (p=0.95) | 11.9171 | 16.6174 | 20.1986 | 24.1654 | 26.9036 |
| Self-Score (sum) | 15.0806 | 20.711 | 24.1224 | 28.4665 | 32.0156 |
| Self-Score (mean) | 16.4223 | 20.1317 | 23.7828 | 26.7694 | 30.246 |
| Cappy (ours) | 23.6543 | 27.6178 | 30.3802 | 33.2775 | 37.1678 |
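The TPU-vs-GPU gap noted above can be checked with a small helper, assuming each result set is loaded as a `{subset_name: rougeL}` dict (a hypothetical layout for illustration; the actual JSON structure in `./bigbench_cappy_results` may differ):

```python
def max_metric_gap(results_a, results_b):
    """Largest absolute rougeL difference over the subsets present in both runs."""
    shared = results_a.keys() & results_b.keys()
    return max(abs(results_a[k] - results_b[k]) for k in shared)

# e.g. max_metric_gap(tpu_results, gpu_results) should stay <= 0.8
```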
Cappy is Mario's ally throughout Super Mario Odyssey and assists him in various ways. We thank Nintendo for the nice game!