diff --git a/finetuning/README.md b/finetuning/README.md
new file mode 100644
index 0000000..5dba376
--- /dev/null
+++ b/finetuning/README.md
@@ -0,0 +1,78 @@
+# Train Models with Jina's Finetuner
+
+This folder contains some example scripts to fine-tune embedding (vector search) models.
+More details on the models produced by these scripts and on fine-tuning in general are provided in the 13th kata.
+
+All the commands displayed below are meant to be executed from within the folder `finetuning`.
+
+## Setup the Environment
+
+Finetuner runs with Python.
+Accordingly, you need a Python environment to run it.
+We also recommend creating a virtual environment by running:
+``python3 -m venv venv``
+Afterward, you can activate the environment by running:
+``source venv/bin/activate``
+To install the dependencies needed to run the fine-tuning examples, run:
+``pip install -r requirements.txt``
+
+## Run Unsupervised SBERT Fine-Tuning (Text-to-Text Search)
+
+To automatically create training data, Finetuner needs a corpus of documents and a set of queries.
+For the corpus we use the product data in `data-encoder/ecommerce/vectors/data`.
+Since this dataset does not come with a set of queries (which can usually be extracted from a query log), we provide pre-generated queries.
+To download them, execute the following command:
+```
+wget https://finetuner-ecommerce-experiment.s3.eu-central-1.amazonaws.com/generated-queries.jsonl -P sbert_unsupervised/
+```
+To extract queries and documents and transform them into Finetuner's preferred format, execute the following command*:
+```
+python3 sbert_unsupervised/data_preparation.py
+```
+*While running the script, a browser window will open and ask you to log into your Jina account. If you don't have one, you need to sign up to run the script.
+After you have logged into your account, a session token is sent back to the script. This token is then used to authenticate the script to the Jina AI Cloud in order to download or upload models and datasets.
+
+After that, you can run a job that automatically generates training data for a text-to-text embedding model:
+```
+python3 sbert_unsupervised/data_synthesis.py
+```
+
+If you terminated the script or the connection to the log stream got interrupted (which happens from time to time),
+you can execute the following code in a script or the Python interpreter to get the current status and the logs of the job:
+
+```python
+import finetuner
+
+finetuner.login()
+
+run = finetuner.get_run('ecommerce-synthesis')
+print(f'Status: {run.status()["status"]}')
+print('Logs:\n', run.logs())
+```
+
+Finally, you can run the fine-tuning job:
+```
+python3 sbert_unsupervised/finetune.py
+```
+This will save a fine-tuned model named `finetuned_sbert_model` in the `finetuning` folder.
+
+## Run Unsupervised CLIP Fine-Tuning (Text-to-Image Search)
+
+For fine-tuning a CLIP model, we directly create a dataset of text-image pairs from the product dataset.
+This can be done with the following command:
+```
+python3 clip_unsupervised/data_preparation.py
+```
+After that, the fine-tuning job can be executed with the following command:
+```
+python3 clip_unsupervised/finetune.py
+```
+
+## Integrate the Fine-Tuned Models into Chorus
+
+After running the scripts mentioned above, you obtain fine-tuned embedding models which can be integrated into Chorus.
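+As a quick sanity check before the integration, you can load a model artifact locally and embed a few example texts.
+The snippet below is a minimal sketch; `ARTIFACT_PATH` is a placeholder for the path of the artifact downloaded by `save_artifact` (e.g. the file saved under `finetuned_sbert_model`), so adjust it to your machine:
+```python
+import finetuner
+from docarray import Document, DocumentArray
+
+# placeholder: point this at the artifact downloaded by save_artifact
+ARTIFACT_PATH = 'finetuned_sbert_model'
+
+# load the fine-tuned model from the local artifact
+model = finetuner.get_model(ARTIFACT_PATH)
+
+# embed an example query; finetuner.encode stores the embeddings on the documents
+docs = DocumentArray([Document(text='gaming mouse pad')])
+finetuner.encode(model=model, data=docs)
+print(docs.embeddings.shape)
+```
+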
+If you are only interested in the integration, you can download the final models from the following links:
+- [sbert_unsupervised](https://finetuner-ecommerce-experiment.s3.eu-central-1.amazonaws.com/fine-tuned-sbert-model.zip)
+- [clip_unsupervised](https://finetuner-ecommerce-experiment.s3.eu-central-1.amazonaws.com/fine-tuned-clip-model.zip)
diff --git a/finetuning/clip_unsupervised/data_preparation.py b/finetuning/clip_unsupervised/data_preparation.py
new file mode 100644
index 0000000..02e9db5
--- /dev/null
+++ b/finetuning/clip_unsupervised/data_preparation.py
@@ -0,0 +1,62 @@
+import json
+import click
+from tqdm import tqdm
+from zipfile import ZipFile
+
+from docarray import Document, DocumentArray
+
+
+def prepare_training_data(filepaths: list[str]) -> DocumentArray:
+    """Builds text-image pair documents from zipped json files with product data."""
+    dataset = DocumentArray()
+    for filepath in filepaths:
+        with ZipFile(filepath, 'r') as archive:
+            for filename in archive.namelist():
+                with archive.open(filename, 'r') as f:
+                    objs = json.load(f)
+                for obj in tqdm(objs, desc=f'Load data from "{filename}"'):
+                    title = obj.get('title', '')
+                    supplier = obj.get('supplier', '')
+                    product_type = obj.get('attr_t_product_type')
+                    if 'img_500x500' in obj:
+                        # skip products that lack a title or a supplier
+                        if not (title and supplier):
+                            continue
+                        text_value = (
+                            f'{title} {supplier} {product_type}'
+                            if product_type
+                            else f'{title} {supplier}'
+                        )
+                        try:
+                            img_doc = Document(
+                                uri=obj.get('img_500x500')
+                            ).load_uri_to_blob()
+                            text_doc = Document(text=text_value)
+                            dataset.append(
+                                Document(chunks=DocumentArray([text_doc, img_doc]))
+                            )
+                        except Exception:
+                            # skip products whose image cannot be loaded
+                            pass
+    return dataset
+
+
+@click.command()
+@click.option(
+    '--input',
+    '-i',
+    default=[
+        '../data-encoder/ecommerce/vectors/data/1.json.zip',
+        '../data-encoder/ecommerce/vectors/data/2.json.zip',
+        '../data-encoder/ecommerce/vectors/data/3.json.zip',
+        '../data-encoder/ecommerce/vectors/data/4.json.zip',
+    ],
+    multiple=True,
+    help='Input filepaths',
+)
+def main(input):
+    train_dataset = prepare_training_data(input)
+    train_dataset.summary()
+    train_dataset.save_binary('clip_unsupervised/clip_train_dataset.da')
+
+
+if __name__ == '__main__':
+    main()
diff --git a/finetuning/clip_unsupervised/finetune.py b/finetuning/clip_unsupervised/finetune.py
new file mode 100644
index 0000000..2cd42fd
--- /dev/null
+++ b/finetuning/clip_unsupervised/finetune.py
@@ -0,0 +1,32 @@
+import finetuner
+from docarray import DocumentArray
+
+# To use Finetuner, an account for the Jina AI Cloud is required.
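+# login() opens a browser window asking you to log into your Jina account;
+# the returned session token authenticates this script to the Jina AI Cloud.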
+finetuner.login()
+
+training_data = DocumentArray.load_binary('clip_unsupervised/clip_train_dataset.da')
+
+training_run = finetuner.fit(
+    model='clip-large-en',
+    # name of the pre-trained model
+    train_data=training_data,
+    # the prepared training dataset
+    loss='CLIPLoss',
+    # contrastive loss function for text-image pairs
+    optimizer='AdamW',
+    learning_rate=1e-7,
+    # choose a small learning rate since the model is already pre-trained
+    batch_size=8,
+    # too high batch sizes can lead to memory issues
+    epochs=1,
+    # for fine-tuning, a low number of epochs is usually enough
+)
+
+print(training_run.name)
+
+# print logs
+for entry in training_run.stream_logs():
+    print(entry)
+
+# download model
+training_run.save_artifact('finetuned_clip_model')
diff --git a/finetuning/requirements.txt b/finetuning/requirements.txt
new file mode 100644
index 0000000..82d8e9e
--- /dev/null
+++ b/finetuning/requirements.txt
@@ -0,0 +1,4 @@
+finetuner[full]==0.7.8
+docarray[commons]==0.21.0
+click==8.1.3
+
diff --git a/finetuning/sbert_unsupervised/data_preparation.py b/finetuning/sbert_unsupervised/data_preparation.py
new file mode 100644
index 0000000..dccb4ef
--- /dev/null
+++ b/finetuning/sbert_unsupervised/data_preparation.py
@@ -0,0 +1,84 @@
+import json
+import click
+from zipfile import ZipFile
+
+from typing import List
+
+from docarray import Document, DocumentArray
+
+QUERIES_DA_PATH = 'sbert_unsupervised/queries.da'
+CORPUS_DA_PATH = 'sbert_unsupervised/corpus.da'
+
+
+def load_queries(filename: str) -> List[str]:
+    """Loads queries from a jsonl file."""
+    queries = list()
+    with open(filename, 'r') as f:
+        for line in f:
+            obj = json.loads(line)
+            queries.append(obj['text'])
+    return queries
+
+
+def load_corpus(filepaths: List[str]) -> List[str]:
+    """Loads the corpus from a list of zipped json files with product properties.
+    For the training, we only consider the properties `title` and `supplier`.
+    """
+    corpus = list()
+    for filepath in filepaths:
+        with ZipFile(filepath, 'r') as archive:
+            for filename in archive.namelist():
+                with archive.open(filename, 'r') as f:
+                    objs = json.load(f)
+                for obj in objs:
+                    title = obj.get('title', '')
+                    supplier = obj.get('supplier', '')
+                    if title and supplier:
+                        corpus.append(f'{title} {supplier}')
+    return corpus
+
+
+def prepare_queries(queries: List[str]) -> DocumentArray:
+    """Transforms queries into Finetuner's dataset format (DocumentArray)."""
+    return DocumentArray([Document(text=q) for q in queries])
+
+
+def prepare_corpus(corpus: List[str]) -> DocumentArray:
+    """Transforms product data into Finetuner's dataset format (DocumentArray)."""
+    return DocumentArray([Document(text=p) for p in corpus])
+
+
+@click.command()
+@click.option(
+    '--queries',
+    default='sbert_unsupervised/generated-queries.jsonl',
+    help='Path to queries file',
+)
+@click.option(
+    '--corpus',
+    default=[
+        '../data-encoder/ecommerce/vectors/data/1.json.zip',
+        '../data-encoder/ecommerce/vectors/data/2.json.zip',
+        '../data-encoder/ecommerce/vectors/data/3.json.zip',
+        '../data-encoder/ecommerce/vectors/data/4.json.zip',
+    ],
+    help='Path to corpus files',
+    multiple=True,
+)
+def main(queries: str, corpus: List[str]):
+    queries = load_queries(queries)
+    corpus = load_corpus(corpus)
+    queries = prepare_queries(queries)
+    corpus = prepare_corpus(corpus)
+    queries.summary()
+    corpus.summary()
+    queries.save_binary(QUERIES_DA_PATH)
+    corpus.save_binary(CORPUS_DA_PATH)
+    print(
+        'Prepared queries and corpus for unsupervised training of the SBERT model:',
+        f'{QUERIES_DA_PATH}, {CORPUS_DA_PATH}',
+    )
+
+
+if __name__ == '__main__':
+    main()
diff --git a/finetuning/sbert_unsupervised/data_synthesis.py b/finetuning/sbert_unsupervised/data_synthesis.py
new file mode 100644
index 0000000..7a751da
--- /dev/null
+++ b/finetuning/sbert_unsupervised/data_synthesis.py
@@ -0,0 +1,37 @@
+import finetuner
+from finetuner.model import synthesis_model_en
+
+from docarray import DocumentArray
+
+DATASET_RUN_NAME = 'ecommerce-synthesis'
+
+finetuner.login(force=True)
+
+# load the datasets prepared for Finetuner by data_preparation.py
+queries = DocumentArray.load_binary('sbert_unsupervised/queries.da')
+corpus = DocumentArray.load_binary('sbert_unsupervised/corpus.da')
+
+# upload the data to the Jina AI Cloud
+queries.push('ecommerce-queries')
+corpus.push('ecommerce-corpus')
+
+# start the data synthesis cloud job
+synthesis_run = finetuner.synthesize(
+    query_data='ecommerce-queries',
+    corpus_data='ecommerce-corpus',
+    run_name=DATASET_RUN_NAME,
+    models=synthesis_model_en,
+    num_relations=10,
+)
+
+# print the name of the run (should be the same as DATASET_RUN_NAME)
+print(synthesis_run.name)
+
+# print logs
+for entry in synthesis_run.stream_logs():
+    print(entry)
+
+# download results
+train_data_name = synthesis_run.train_data
+train_data = DocumentArray.pull(train_data_name)
+train_data.summary()
diff --git a/finetuning/sbert_unsupervised/finetune.py b/finetuning/sbert_unsupervised/finetune.py
new file mode 100644
index 0000000..da36399
--- /dev/null
+++ b/finetuning/sbert_unsupervised/finetune.py
@@ -0,0 +1,35 @@
+import finetuner
+
+# To use Finetuner, an account for the Jina AI Cloud is required.
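+# Note: the data synthesis run started by data_synthesis.py must have
+# finished before this script can fetch its generated training data.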
+finetuner.login()
+
+DATASET_RUN_NAME = 'ecommerce-synthesis'  # should be the same as in data_synthesis.py
+
+train_dataset_name = finetuner.get_run(DATASET_RUN_NAME).train_data
+
+training_run = finetuner.fit(
+    model='sbert-base-en',
+    train_data=train_dataset_name,
+    loss='MarginMSELoss',
+    # This loss function is specific to the unsupervised training
+    # on generated data. It expects triplets of a query and two
+    # documents associated with a margin relevance score.
+    optimizer='Adam',
+    learning_rate=1e-5,
+    # choose a small learning rate since the model is already pre-trained
+    epochs=1,
+    # for fine-tuning, one epoch is usually enough
+    batch_size=16,
+    # too high batch sizes can lead to memory issues
+)
+
+print(training_run.name)
+
+# print logs
+for entry in training_run.stream_logs():
+    print(entry)
+
+# download model
+training_run.save_artifact('finetuned_sbert_model')
diff --git a/katas/013_finetuning_embedding_models.md b/katas/013_finetuning_embedding_models.md
new file mode 100644
index 0000000..2d80c9c
--- /dev/null
+++ b/katas/013_finetuning_embedding_models.md
@@ -0,0 +1,69 @@
+# 13th Kata: Fine-Tuning Embedding Models
+
+In the previous kata, we explored the use of pre-trained embedding models to create a search index.
+However, these models are trained on generic data and are not specifically tailored to the domain of our dataset.
+In this kata, we will delve into the process of fine-tuning these models to make them more attuned to the specific domain.
+This will enhance the accuracy of our vector search and improve the overall search experience for our users.
+
+## Understanding Fine-Tuning
+
+Pre-trained models excel at assessing semantic similarity in text values and images.
+However, they lack optimization for specific use cases and datasets.
+Consequently, they often struggle to capture the relevant notion of similarity within a particular domain.
+For instance, when we used a pre-trained text model to evaluate the query "gaming mouse pad", it returned results for
+game pads, as both phrases contain tokens derived from the words "game" and "pad".
+Although one could argue for a semantic connection between these items, given their shared purpose in playing PC games,
+recommending a game pad to a user searching for a gaming mouse pad on an e-commerce website would not be ideal.
+Additionally, pre-trained models frequently fail to capture the specific semantics of domain-specific vocabulary.
+As an example, when retrieving products for the query "microsoft office" using a pre-trained model,
+we received results for printers named "OfficeJet".
+To address these issues, we can employ fine-tuning to adapt pre-trained models to domain-specific terms and improve their grasp of the relevant notion of similarity.
+
+## Using Jina's Finetuner to Create Tuned Embedding Models
+
+Let's look at the steps we need to perform to obtain a fine-tuned model with the [Finetuner framework](https://github.com/jina-ai/finetuner):
+
+1. Data Preparation: Extract the training data from your corpus of documents. This process varies depending on whether you are performing unsupervised or supervised training and whether you are using a text-to-text or text-to-image embedding model.
+
+2. Configure and Submit the Fine-Tuning Job: Set up a fine-tuning job by configuring the desired model and its training parameters in Finetuner's `fit` function. After execution, Finetuner will initiate a cloud computing job that performs the training on high-end GPU machines.
+
+3. Monitor Your Job: Fine-tuning can be time-consuming. While the job is in progress, you can monitor it using either a Python script or Finetuner's web interface. This allows you to retrieve logs, track evaluation metrics, and ensure everything is running smoothly.
+
+4. Integrate the Model: Once the fine-tuning is complete, you can download the model and integrate it into Chorus as a replacement for the pre-trained one.
+
+### Unsupervised Training of Text-to-Text Embedding Models
+
+We begin by fine-tuning a pure text embedding model using an unsupervised approach.
+In this approach, data annotation is not required.
+Instead, the Finetuner framework generates a dataset from a given corpus, such as our product database, together with a set of queries, which can usually be extracted from a query log.
+Since there is no query log for the demo database, we provide generated [queries](https://finetuner-ecommerce-experiment.s3.eu-central-1.amazonaws.com/generated-queries.jsonl) for download to run the demo code.
+Those queries were generated with a [T5 language model trained to generate questions](https://huggingface.co/BeIR/query-gen-msmarco-t5-base-v1).
+
+To establish relationships between queries and documents, we employ a pseudo-labeling approach similar to the one described in [this paper](https://arxiv.org/abs/2112.07577).
+Specifically, for each query, we select two related documents using a pre-trained embedding model, as explained in the twelfth kata.
+Subsequently, a cross-encoder model is employed to obtain a more precise estimate of the relevance of both documents.
+It is important to note that the cross-encoder model is more accurate but computationally more expensive, as it operates on the raw text inputs of the query and document instead of relying on condensed vector representations.
+Once the training data is generated, a fine-tuning job can be submitted to the Finetuner cloud.
+The code for this process can be found in `finetuning/sbert_unsupervised`.
+
+### Supervised Training of Text-to-Text Embedding Models
+
+...
+
+### Unsupervised Training of CLIP Models
+
+For text-to-image retrieval, CLIP has emerged as one of the most popular retrieval models (learn more about CLIP [here](https://openai.com/research/clip)).
+While there is no counterpart to the cross-encoder that significantly outperforms CLIP, we can still fine-tune CLIP for our specific domain by leveraging the existing connections between text and image attributes in our corpus.
+To achieve this, we extract pairs of text attributes and product images from our dataset and use them as training data.
+During fine-tuning, the model learns to map a product's text attributes to embeddings similar to those of its product images.
+In contrast, text that relates to the attributes of other products should be mapped to more distant embeddings.
+The implementation of this approach can be found in the `finetuning/clip_unsupervised` directory.
+
+### Supervised Training of CLIP Models
+
+...
+
+
+## Integrating the Fine-Tuned Models
+
+...
diff --git a/smui/Dockerfile b/smui/Dockerfile
index 9719094..f6f367d 100644
--- a/smui/Dockerfile
+++ b/smui/Dockerfile
@@ -1,12 +1,10 @@
-FROM querqy/smui:3.14.0
+FROM querqy/smui:4.0.11
 
-COPY conf/smui2solrcloud.sh /smui/conf/smui2solrcloud.sh
-COPY conf/*.py /smui/conf/
-COPY conf/predefined_tags.json /smui/conf/predefined_tags.json
+COPY --chown=smui:smui conf/* conf/
 
 USER root
-RUN apt-get update -y
-RUN apt-get install -y python3
-RUN apt-get install -y python3-pip
+RUN chmod +x conf/smui2*.sh
+RUN apt-get update -y && apt-get install -y python3 python3-pip
+
 USER smui
 RUN python3 -m pip install requests