Commit 13d30f8
docs: update the fine-tuning examples
LongxingTan authored Sep 17, 2024
1 parent d922bf8 commit 13d30f8
Showing 14 changed files with 381 additions and 217 deletions.
117 changes: 98 additions & 19 deletions README.md
@@ -36,9 +36,9 @@
![structure](./docs/source/_static/structure.png)

**Open-retrievals** unifies text embedding, retrieval, reranking and RAG. It's easy, flexible and scalable.
- Embedding fine-tuned through point-wise, pairwise, listwise, contrastive learning, and LLM.
- Reranking fine-tuned with Cross Encoder, ColBERT, and LLM.
- Easily build enhanced modular RAG, integrated with Transformers, Langchain, and LlamaIndex.
- Embedding fine-tuned through point-wise, pairwise, listwise, contrastive learning and LLM.
- Reranking fine-tuned with Cross-Encoder, ColBERT and LLM.
- Easily build enhanced modular RAG, integrated with Transformers, Langchain and LlamaIndex.

| Experiment | Model | Original | Finetuned | Demo |
|-------------------------------|------------------------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -48,7 +48,7 @@
| **rerank** colbert | bge-m3 | 0.657 | **0.695** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QVtqhQ080ZMltXoJyODMmvEQYI6oo5kO?usp=sharing) |
| **rerank** LLM (LoRA) | bge-reranker-v2-gemma | 0.637 | **0.706** | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fzq1iV7-f8hNKFnjMmpVhVxadqPb9IXk?usp=sharing) |

* The metrics is MAP in 10% eval [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking).
* The eval metric is MAP on a 10% split of [t2-reranking data](https://huggingface.co/datasets/C-MTEB/T2Reranking).
* Read [more examples](./examples)
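For reference, the MAP figures reported above follow the standard mean-average-precision definition. The sketch below is a generic helper under that definition, not the exact C-MTEB evaluation script:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: mean of precision@k at each relevant hit."""
    relevant = set(relevant_ids)
    hits, score = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# A perfect ranking gives MAP of 1.0
print(mean_average_precision([(["d1", "d2", "d3"], ["d1", "d2"])]))  # -> 1.0
```

A fine-tuned model improves this number by pushing relevant documents toward the top of the ranking, which raises the precision at each hit.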


@@ -76,7 +76,7 @@
python -m pip install -U git+https://github.com/LongxingTan/open-retrievals.git

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-WBMisdWLeHUKlzJ2DrREXY_kSV8vjP3?usp=sharing)

<details><summary> Embeddings from pretrained weights </summary>
<details><summary> Embedding from pretrained weights </summary>

```python
from retrievals import AutoModelForEmbedding
@@ -89,7 +89,7 @@
sentences = [
]
model_name_or_path = 'intfloat/e5-base-v2'
model = AutoModelForEmbedding.from_pretrained(model_name_or_path, pooling_method="mean")
embeddings = model.encode(sentences, normalize_embeddings=True, convert_to_tensor=True)
embeddings = model.encode(sentences, normalize_embeddings=True)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```
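Because `normalize_embeddings=True` L2-normalizes the output, the dot products above are cosine similarities. A minimal NumPy sketch of the same arithmetic, with made-up toy vectors standing in for model output:

```python
import numpy as np

# Toy embeddings in place of model.encode output (hypothetical values)
emb = np.array([[3.0, 4.0], [1.0, 0.0]])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows

# Dot product of unit vectors == cosine similarity; diagonal is 1.0
scores = emb @ emb.T
print(scores.round(2))
```

This is why the README snippet can score sentence pairs with a plain matrix product instead of an explicit cosine function.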
@@ -103,7 +103,7 @@
from retrievals import AutoModelForEmbedding, AutoModelForRetrieval
sentences = ['A dog is chasing a car.', 'A man is playing a guitar.']
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
index_path = './database/faiss/faiss.index'
model = AutoModelForEmbedding.from_pretrained(model_name_or_path)
model = AutoModelForEmbedding.from_pretrained(model_name_or_path, pooling_method='mean')
model.build_index(sentences, index_path=index_path)

query_embed = model.encode("He plays guitar.")
@@ -216,7 +216,7 @@
epochs: int = 3
train_dataset = load_dataset('shibing624/nli_zh', 'STS-B')['train']
train_dataset = train_dataset.rename_columns({'sentence1': 'query', 'sentence2': 'positive'})
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
model = AutoModelForEmbedding.from_pretrained(model_name_or_path, pooling_method="cls")
model = AutoModelForEmbedding.from_pretrained(model_name_or_path, pooling_method="mean")
model = model.set_train_type('pairwise')

optimizer = AdamW(model.parameters(), lr=5e-5)
@@ -252,14 +252,22 @@
import torch.nn as nn
from datasets import load_dataset
from transformers import AutoTokenizer, AdamW, get_linear_schedule_with_warmup, TrainingArguments
from retrievals import AutoModelForEmbedding, RetrievalTrainer, PairCollator, TripletCollator
from retrievals.losses import ArcFaceAdaptiveMarginLoss, InfoNCE, SimCSE, TripletLoss
from retrievals.losses import InfoNCE, SimCSE, TripletLoss

def add_instructions(example):
example['query'] = query_instruction + example['query']
example['positive'] = document_instruction + example['positive']
return example

model_name_or_path: str = "Qwen/Qwen2-1.5B-Instruct"
batch_size: int = 8
epochs: int = 3
query_instruction = "Retrieve relevant passages that answer the query\nQuery: "
document_instruction = "Document: "

train_dataset = load_dataset('shibing624/nli_zh', 'STS-B')['train']
train_dataset = train_dataset.rename_columns({'sentence1': 'query', 'sentence2': 'positive'})
train_dataset = train_dataset.map(add_instructions)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
model = AutoModelForEmbedding.from_pretrained(model_name_or_path, pooling_method="last", use_lora=True)
model = model.set_train_type('pairwise', loss_fn=InfoNCE(nn.CrossEntropyLoss(label_smoothing=0.05)))
@@ -272,6 +280,7 @@
training_arguments = TrainingArguments(
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
remove_unused_columns=False,
logging_steps=100,
)
trainer = RetrievalTrainer(
model=model,
@@ -291,25 +300,32 @@
trainer.train()
from transformers import AutoTokenizer, TrainingArguments, get_cosine_schedule_with_warmup, AdamW
from retrievals import RerankCollator, AutoModelForRanking, RerankTrainer, RerankTrainDataset

model_name_or_path: str = "microsoft/deberta-v3-base"
model_name_or_path: str = "BAAI/bge-reranker-base"
max_length: int = 128
learning_rate: float = 3e-5
batch_size: int = 4
epochs: int = 3
output_dir: str = "./checkpoints"

train_dataset = RerankTrainDataset('./t2rank.json', positive_key='pos', negative_key='neg')
train_dataset = RerankTrainDataset("C-MTEB/T2Reranking", positive_key="positive", negative_key="negative", dataset_split='dev')
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
model = AutoModelForRanking.from_pretrained(model_name_or_path)
optimizer = AdamW(model.parameters(), lr=learning_rate)
num_train_steps = int(len(train_dataset) / batch_size * epochs)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0.05 * num_train_steps, num_training_steps=num_train_steps)
scheduler = get_cosine_schedule_with_warmup(
optimizer,
num_warmup_steps=0.05 * num_train_steps,
num_training_steps=num_train_steps,
)

training_args = TrainingArguments(
learning_rate=learning_rate,
per_device_train_batch_size=batch_size,
num_train_epochs=epochs,
output_dir='./checkpoints',
output_dir=output_dir,
remove_unused_columns=False,
logging_steps=100,
report_to="none",
)
trainer = RerankTrainer(
model=model,
@@ -348,9 +364,7 @@
epochs: int = 3
colbert_dim: int = 1024
output_dir: str = './checkpoints'

train_dataset = RetrievalTrainDataset(
'C-MTEB/T2Reranking', positive_key='positive', negative_key='negative', dataset_split='dev'
)
train_dataset = RetrievalTrainDataset('C-MTEB/T2Reranking', positive_key='positive', negative_key='negative', dataset_split='dev')
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
data_collator = ColBertCollator(
tokenizer,
@@ -367,9 +381,7 @@
model = ColBERT.from_pretrained(

optimizer = AdamW(model.parameters(), lr=learning_rate)
num_train_steps = int(len(train_dataset) / batch_size * epochs)
scheduler = get_cosine_schedule_with_warmup(
optimizer, num_warmup_steps=0.05 * num_train_steps, num_training_steps=num_train_steps
)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0.05 * num_train_steps, num_training_steps=num_train_steps)

training_args = TrainingArguments(
learning_rate=learning_rate,
@@ -394,7 +406,74 @@
trainer.train()
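ColBERT scores a query-document pair by late interaction: each query token embedding keeps only its best-matching document token, and those per-token maxima are summed (the MaxSim operator). A minimal NumPy sketch of that scoring rule, with toy unit-norm token embeddings:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for each query token, take the best
    similarity over all document tokens, then sum (embeddings unit-norm)."""
    sim = query_tokens @ doc_tokens.T   # (n_query_tokens, n_doc_tokens)
    return sim.max(axis=1).sum()

# Toy unit-norm token embeddings (hypothetical, dim=2 for readability)
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.7071, 0.7071]])
print(maxsim_score(q, d))  # 1.0 + 0.7071
```

Keeping one vector per token is what lets ColBERT stay cheaper than a full cross-encoder while capturing finer matches than a single-vector embedding.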
<details><summary> Fine-tune LLM reranking </summary>

```python
from transformers import (
AdamW,
AutoTokenizer,
TrainingArguments,
get_cosine_schedule_with_warmup,
)

from retrievals import (
LLMRanker,
LLMRerankCollator,
RerankTrainer,
RetrievalTrainDataset,
)
from retrievals.losses import TokenLoss

model_name_or_path: str = "Qwen/Qwen2-1.5B-Instruct"
max_length: int = 512
learning_rate: float = 3e-5
batch_size: int = 8
epochs: int = 3
task_prompt: str = (
    """Given a query A and a passage B, determine whether the passage contains an answer to the query """
    """by providing a prediction of either 'Yes' or 'No'."""
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
train_dataset = RetrievalTrainDataset(
data_name_or_path='C-MTEB/T2Reranking',
positive_key='positive',
negative_key='negative',
query_instruction='A: ',
document_instruction='B: ',
dataset_split='dev',
)
data_collator = LLMRerankCollator(tokenizer=tokenizer, max_length=max_length, prompt=task_prompt, add_target_token='Yes')
token_index = tokenizer('Yes', add_special_tokens=False)['input_ids'][-1]
model = LLMRanker.from_pretrained(
model_name_or_path,
causal_lm=True,
use_fp16=True,
loss_fn=TokenLoss(token_index=token_index),
use_lora=True,
)

optimizer = AdamW(model.parameters(), lr=learning_rate)
num_train_steps = int(len(train_dataset) / batch_size * epochs)
scheduler = get_cosine_schedule_with_warmup(
optimizer,
num_warmup_steps=0.05 * num_train_steps,
num_training_steps=num_train_steps,
)

training_args = TrainingArguments(
learning_rate=learning_rate,
per_device_train_batch_size=batch_size,
num_train_epochs=epochs,
output_dir="./checkpoints",
remove_unused_columns=False,
)
trainer = RerankTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=data_collator,
)
trainer.optimizer = optimizer
trainer.scheduler = scheduler
trainer.train()
```
</details>
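With `causal_lm=True` and `TokenLoss(token_index=...)`, the reranker is trained so that the logit of the 'Yes' token at the final position serves as the relevance signal. One plausible reading of that loss, sketched in NumPy with hypothetical logits (a conceptual sketch, not the library's internal implementation):

```python
import numpy as np

def token_loss(final_logits, token_index):
    """Cross-entropy over the vocabulary at the last position, with the
    'Yes' token as the target (conceptual sketch, not retrievals' code)."""
    z = final_logits - final_logits.max()      # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax over the vocab
    return -log_probs[token_index]

# Toy 3-token vocabulary; pretend index 0 is the 'Yes' token
logits = np.array([2.0, 0.5, -1.0])
loss = token_loss(logits, token_index=0)

# At inference, the raw 'Yes' logit (or its probability) ranks the documents
score = logits[0]
print(loss, score)
```

The more the model raises the 'Yes' logit for relevant pairs, the smaller this loss, which is why the same logit can be reused directly as a ranking score.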

26 changes: 18 additions & 8 deletions README_ja-JP.md
@@ -79,7 +79,7 @@
sentences = [
]
model_name_or_path = 'intfloat/e5-base-v2'
model = AutoModelForEmbedding.from_pretrained(model_name_or_path, pooling_method="mean")
embeddings = model.encode(sentences, normalize_embeddings=True, convert_to_tensor=True)
embeddings = model.encode(sentences, normalize_embeddings=True)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```
@@ -91,7 +91,7 @@
from retrievals import AutoModelForEmbedding, AutoModelForRetrieval
sentences = ['A dog is chasing a car.', 'A man is playing a guitar.']
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
index_path = './database/faiss/faiss.index'
model = AutoModelForEmbedding.from_pretrained(model_name_or_path)
model = AutoModelForEmbedding.from_pretrained(model_name_or_path, pooling_method='mean')
model.build_index(sentences, index_path=index_path)

query_embed = model.encode("He plays guitar.")
@@ -199,8 +199,9 @@
epochs: int = 3
train_dataset = load_dataset('shibing624/nli_zh', 'STS-B')['train']
train_dataset = train_dataset.rename_columns({'sentence1': 'query', 'sentence2': 'document'})
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
model = AutoModelForEmbedding.from_pretrained(model_name_or_path, pooling_method="cls")
# model = model.set_train_type('pointwise') # 'pointwise', 'pairwise', 'listwise'
model = AutoModelForEmbedding.from_pretrained(model_name_or_path, pooling_method="mean")
model = model.set_train_type('pairwise')

optimizer = AdamW(model.parameters(), lr=5e-5)
num_train_steps = int(len(train_dataset) / batch_size * epochs)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0.05 * num_train_steps, num_training_steps=num_train_steps)
@@ -240,25 +241,34 @@
model = AutoModelForEmbedding.from_pretrained(
from transformers import AutoTokenizer, TrainingArguments, get_cosine_schedule_with_warmup, AdamW
from retrievals import RerankCollator, AutoModelForRanking, RerankTrainer, RerankTrainDataset

model_name_or_path: str = "microsoft/deberta-v3-base"
model_name_or_path: str = "BAAI/bge-reranker-base"
max_length: int = 128
learning_rate: float = 3e-5
batch_size: int = 4
epochs: int = 3
output_dir: str = "./checkpoints"

train_dataset = RerankTrainDataset('./t2rank.json', positive_key='pos', negative_key='neg')
train_dataset = RerankTrainDataset(
"C-MTEB/T2Reranking", positive_key="positive", negative_key="negative", dataset_split='dev'
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
model = AutoModelForRanking.from_pretrained(model_name_or_path)
optimizer = AdamW(model.parameters(), lr=learning_rate)
num_train_steps = int(len(train_dataset) / batch_size * epochs)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0.05 * num_train_steps, num_training_steps=num_train_steps)
scheduler = get_cosine_schedule_with_warmup(
optimizer,
num_warmup_steps=0.05 * num_train_steps,
num_training_steps=num_train_steps,
)

training_args = TrainingArguments(
learning_rate=learning_rate,
per_device_train_batch_size=batch_size,
num_train_epochs=epochs,
output_dir='./checkpoints',
output_dir=output_dir,
remove_unused_columns=False,
logging_steps=100,
report_to="none",
)
trainer = RerankTrainer(
model=model,
