InformationRetrieval

This repository implements unsupervised information retrieval by generating questions with large language models: questions are automatically generated for a document corpus, and a retriever is trained on the resulting question-document pairs. It covers the following steps: question generation, training, and evaluation on BEIR.

Further information about the project can be found at: https://drive.google.com/file/d/1JGQH_SvFMhNlcYqrM3beMTCaWJznO7j8/view?usp=sharing

How to use the code

Note: The code makes use of Slurm to simplify the workflow of the project.

Question Generation

To get started, download the datasets by executing the following command:

python3 utils/download_datasets.py

You can generate the dataset by executing the following command inside the 'DatasetGeneration' folder:

python generate_unsupervised_dataset.py

Important parameters in the script include:

  • method: either "cropping" or "LLM". "cropping" is the Contriever-style cropping generation method; "LLM" generates questions with a large language model.
  • dataset_name: name of the dataset to generate questions for.
  • create_dev: whether to create a dev split.
  • dev_ratio: fraction of the corpus documents used to create the dev split (float; 0.2 -> 20% of the documents).
  • question_per_document: how many questions to create per document.

These parameters are only relevant when using the LLM method:

  • gpus: number of GPUs to use.
  • prompt_name: prompt used to generate the questions (see prompts.json for the available options).
  • model_name: name of the model to use (from HuggingFace).
  • shard_size: size of each dataset shard (used to parallelize the generation process across different GPUs).
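As an illustrative sketch only (the variable names mirror the parameter list above, but the exact names, defaults, and model choice in generate_unsupervised_dataset.py may differ; treat all values below as assumptions), a configuration for the LLM method might look like:

# Hypothetical parameter values; check generate_unsupervised_dataset.py for the real names and defaults
method = "LLM"
dataset_name = "my_dataset"
create_dev = True
dev_ratio = 0.2                     # 20% of the corpus documents form the dev split
question_per_document = 3
gpus = 4
prompt_name = "prompt"              # see prompts.json for the available prompts
model_name = "google/flan-t5-xl"    # any HuggingFace model name
shard_size = 1000                   # documents per shard, for multi-GPU generation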

If the system stops during dataset generation with the LLM method, you can run the script again to resume from where it left off.

After generating the dataset, run the following command (updating the necessary parameters):

python3 postprocessing.py

(OTHER) Using your own dataset

To use your own dataset, follow the BEIR format. You will need to create a file called "corpus.jsonl" inside the "datasets/my_dataset" folder.

The format is one JSON dictionary per line with the following keys:

  • "_id": unique ID of the document
  • "title": title of the document
  • "text": body text of the document

Training

Note: Before training the model, update BEIR and Sentence Transformers with the files inside the libraries_to_update folder. More details can be found in the README inside that folder.

Training is as simple as running:

python train.py

Make sure the parameters match what you used during dataset generation, and update other settings such as the batch size.
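For orientation, a minimal sketch of this kind of training with Sentence Transformers is shown below. This is not the repository's train.py: the base model, loss, and hyperparameters are assumptions.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (generated question, source document) pairs from the unsupervised dataset
train_examples = [
    InputExample(texts=["what is information retrieval?", "Information retrieval is the task of ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

model = SentenceTransformer("distilbert-base-uncased")  # assumed base encoder
# In-batch negatives loss, commonly used to train dense retrievers on question-document pairs
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)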

You can also evaluate on BEIR by updating the 'beir_datasets' variable.
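For example (the list format is an assumption; the identifiers are standard BEIR dataset names):

beir_datasets = ["scifact", "nfcorpus", "fiqa"]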

Multi-GPU training is not supported due to Sentence Transformers.

Evaluation (BEIR)

Ideally, you should set up the datasets you want to evaluate while training. If you forget to add a dataset, you can evaluate it by running the following command in the BEIR folder:

python evaluate_beir.py
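For reference, a minimal BEIR evaluation with the beir library looks roughly like the sketch below. This is not the repository's evaluate_beir.py; the model path and dataset are placeholders.

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load a BEIR dataset (scifact as an example)
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap the trained Sentence Transformers model for dense retrieval
model = DRES(models.SentenceBERT("path/to/trained_model"), batch_size=64)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)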

Pretrained models and datasets

Pretrained models and datasets are available upon request.

Information about the project

  • Author: Carlos Domínguez Becerril.
  • Supervisors: Eneko Agirre Bengoa, Jon Ander Campos, Gorka Azkune Galparsoro.
