Merging ReQue and RePair #42

Open
DelaramRajaei opened this issue Oct 30, 2023 · 10 comments
@DelaramRajaei
Member

This issue logs my progress on adding ReQue's expanders to RePair.

@DelaramRajaei DelaramRajaei added the enhancement New feature or request label Oct 30, 2023
@DelaramRajaei DelaramRajaei self-assigned this Oct 30, 2023
@DelaramRajaei
Member Author

Hello @hosseinfani,

I have integrated all the refiners and merged the ReQue project into the RePair project. As part of this, I introduced the query_refinement setting in the parameter file. When it is set to true, the selected expanders are invoked and the refined queries are stored in a file named refiner.#name_of_the_refiner.

Example:

refiner.backtranslation_pes_arab
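
A minimal sketch of how the setting and the file naming above could fit together. Only query_refinement and the refiner.#name pattern come from this thread; the `expanders` key and the `refiner_output_names` helper are hypothetical names used for illustration.

```python
# Hypothetical parameter fragment; only `query_refinement` is named in this thread.
settings = {
    'query_refinement': True,
    'expanders': ['backtranslation_pes_arab'],  # assumed key: the selected refiners
}

def refiner_output_names(settings):
    """Return one output file name per selected refiner, or nothing if refinement is off."""
    if not settings.get('query_refinement'):
        return []
    return [f'refiner.{name}' for name in settings['expanders']]

print(refiner_output_names(settings))
```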

Currently, this refinement step is separate from T5, but I plan to include T5 as a refiner alongside the others. The refiner module now has the AbstractQRefiner class, which generates the original query. I observed that the main code treats the original query separately from the generated refined queries. I propose treating AbstractQRefiner as a refiner and calling it along with the others.

Moreover, I incorporated the semsim (Semantic Similarity) score as a mandatory score for all the refiners. It has been relocated from Backtranslation to the AbstractQRefiner class. After generating q', semsim is calculated and stored.
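
The proposal above could look roughly like the sketch below, where the base class acts as the identity refiner and scores every q' in one place. The class name AbstractQRefiner and the method name get_refined_query appear in this thread; `get_refined_query_with_score`, `AppendTermRefiner`, and the token-overlap score are assumptions. The score is only a stand-in for semsim so the sketch runs without a sentence-embedding model.

```python
import re

class AbstractQRefiner:
    """Base refiner: returns the original query unchanged."""

    def get_refined_query(self, q):
        return q

    def get_refined_query_with_score(self, q):
        # Scoring q' against q lives in the base class, so every subclass inherits it.
        q_ = self.get_refined_query(q)
        return q_, self.similarity(q, q_)

    @staticmethod
    def similarity(q, q_):
        # Stand-in for semsim (assumption): Jaccard overlap of lowercase tokens.
        a = set(re.findall(r'\w+', q.lower()))
        b = set(re.findall(r'\w+', q_.lower()))
        return len(a & b) / len(a | b) if a | b else 1.0


class AppendTermRefiner(AbstractQRefiner):
    """Toy refiner for illustration only: appends a fixed term."""

    def get_refined_query(self, q):
        return f'{q} retrieval'


# The original query is handled like any other refiner in the same loop.
refiners = [AbstractQRefiner(), AppendTermRefiner()]
results = [r.get_refined_query_with_score('query expansion') for r in refiners]
```

With this layout the main pipeline needs no special case for the original query.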

I introduced preprocess_query_batch to accommodate refiners like the backtranslation model that can work with batches. However, I haven't had the time to debug it yet.

After incorporating the refiners, I added the Query class. In the Dataset class, I implemented a function that reads all the queries from the dataset's path, creates a query object, and stores them in a list. The msmarco and aol child classes override this function according to their datasets. With this addition, RePair can now work with datasets like robust04, gov2, and others that were part of ReQue.
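
As a rough illustration of the Query and Dataset design described above, the sketch below uses a tab-separated reader as the default and a child class that overrides it. The field names, the `read_queries` name, the file formats, and `PipeDelimitedDataset` are all assumptions; the real msmarco and aol readers are not shown in this thread.

```python
import csv
import io
from dataclasses import dataclass, field

@dataclass
class Query:
    qid: str
    q: str
    refined: dict = field(default_factory=dict)  # refiner name -> refined query

class Dataset:
    @classmethod
    def read_queries(cls, stream):
        """Default reader (assumed format): one `qid<TAB>query` pair per line."""
        rows = csv.reader(stream, delimiter='\t', quoting=csv.QUOTE_NONE)
        return [Query(qid, q) for qid, q in rows]

class PipeDelimitedDataset(Dataset):
    """Hypothetical child class that overrides the reader for another format."""

    @classmethod
    def read_queries(cls, stream):
        return [Query(*line.rstrip('\n').split('|', 1)) for line in stream if line.strip()]

queries = Dataset.read_queries(io.StringIO('301\tinternational organized crime\n'))
```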

The main pipeline structure has been modified according to the Query class. Although it can still be optimized, I anticipate it will eventually transition to using the Query class exclusively.

I initially planned to include the search, eval, and other pipeline commands in the Query class as we discussed. However, I realized that keeping these functions in the Dataset might be more practical for accessing all queries, running with batches, and other functionalities. I am still deliberating on the most suitable architecture.

Tasks for the future:

  • Rename files: Storing everything in one output is inefficient. I will revise this so that each refiner creates its own output and stores its results there, similar to T5.
  • Incorporate T5 as a refiner. The structure is yet to be finalized, but I am considering adding the pairing step in the refiner_factory.
  • Refactor the code and eliminate snippets that read original queries separately. (In progress)
  • Adjust the pipeline to exclusively use the Query class and not involve working with files. (In progress)

@hosseinfani
Member

@DelaramRajaei Awesome! Thanks.

@DelaramRajaei
Member Author

Hey @hosseinfani,

I wanted to provide you with a project update. Currently, the pipeline is operational, although I'm addressing some minor bugs related to reading different datasets. I've initiated backtranslation on two datasets, robust04 and dbpedia, across 10 languages. Below are the logs.
robust04_dbpedia_backtranslation.zip

@DelaramRajaei
Member Author

Hey @hosseinfani,

I wanted to give you an update on the project.

The merge of ReQue and RePair is now complete. I have run backtranslation on all five datasets, using two IR rankers (BM25, QLD) and two evaluation metrics (MAP, MRR).

I encountered challenges in loading different datasets, particularly clueweb09b and gov2, whose queries are split across multiple TREC files. Currently, the code reads all the files together, but I plan to modify it to run each TREC file separately and aggregate the results, following the approach used in the ReQue project.
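
One possible way to aggregate after running each file separately is sketched below: merge the per-query scores from every file, then average over all queries. This is an assumption; the thread does not say how ReQue aggregates. The file names and numbers are made up for illustration.

```python
from statistics import mean

def aggregate(per_file_scores):
    """Merge per-query scores from each topic file, then average over all queries."""
    merged = {}
    for scores in per_file_scores.values():
        merged.update(scores)
    return mean(merged.values())

# Hypothetical per-query metric values keyed by topic file name.
per_file = {
    'topics.a': {'1': 0.5, '2': 0.25},
    'topics.b': {'3': 0.75},
}
print(aggregate(per_file))
```

Averaging over queries rather than over files keeps a small topic file from carrying as much weight as a large one.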

Presently, the project is running for all expanders for gov2 across various IR rankers and evaluation metrics. The log of the ongoing run is provided.

logs.zip

The log file contains records for Backtranslation, Conceptnet, Thesaurus, Wordnet, and Tagme refiners. I have also updated the RePair_StoryBoard in the Query Refinement channel on Teams.

In parallel, I am working on the query class and rag fusion, though there hasn't been significant success in those areas yet. I am ensuring the expanders run flawlessly and addressing other bugs.

Additionally, a minor change has been made in the output structure. After creating a folder for each dataset, it will store the refined data there and subsequently store the results of the ranker and metric in a new folder within the dataset folder. Below is an overview of the file storage:

├── output
│   ├── gov2 [Dataset's name]
│   │   ├── refined_queries_files
│   │   └── ranker.metric [such as bm25.map]
│   │       └── [This is where all the results from the search, eval, aggregate, and boxing are stored]
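
The layout above can be expressed with a couple of path helpers. This is a sketch; the function names are hypothetical, and the refiner.#name file pattern is taken from the earlier comment in this thread.

```python
from pathlib import Path

def refined_file(output, dataset, refiner_name):
    """Refined-query file stored directly under the dataset folder."""
    return Path(output) / dataset / f'refiner.{refiner_name}'

def result_dir(output, dataset, ranker, metric):
    """Folder holding the search, eval, aggregate, and boxing results for one ranker.metric pair."""
    return Path(output) / dataset / f'{ranker}.{metric}'

print(result_dir('output', 'gov2', 'bm25', 'map').as_posix())
```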

@hosseinfani
Member

Hi @DelaramRajaei
Thanks for the update. This is great.
We need a meeting to demo a sample run for me.

@DelaramRajaei
Member Author

Hello, @hosseinfani

I am currently facing issues with the RelevanceFeedback refiner. Since we have moved from Anserini to Pyserini only, one potential solution is to use SimpleSearcher from Pyserini. However, SimpleSearcher is deprecated (the library suggests its Lucene-based searcher instead), and this approach also runs into problems with multiprocessing (multiprocessing as mp). Unfortunately, I couldn't find a similar method in the library.

While exploring slides on RelevanceFeedback and the Rocchio Algorithm, I am contemplating implementing the algorithm myself. This refiner holds significance and serves as the parent for other important refiners like RM3, BertQE, Termluster, and more.
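
For reference, a self-implemented version could start from something like the sketch below. It is a simplified, positive-feedback-only Rocchio update (q' = alpha * q + beta * centroid of the relevant documents) over raw term frequencies. The function name, the default weights, and the use of raw counts instead of TF-IDF are assumptions, and the negative-feedback term is left out.

```python
from collections import Counter

def rocchio(query_terms, relevant_docs, alpha=1.0, beta=0.75, top_n=3):
    """Expand a query with the top-weighted terms from (pseudo-)relevant documents."""
    weights = Counter({t: alpha * c for t, c in Counter(query_terms).items()})
    for doc in relevant_docs:
        for term, tf in Counter(doc).items():
            # Each document contributes its term frequency to the centroid.
            weights[term] += beta * tf / len(relevant_docs)
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    new_terms = [t for t, _ in ranked if t not in query_terms][:top_n]
    return list(query_terms) + new_terms

docs = [['oil', 'tanker', 'spill', 'tanker'], ['tanker', 'cleanup', 'oil']]
print(rocchio(['oil', 'spill'], docs, top_n=1))
```

Each document here is a list of tokens; in the real refiner these would come from the top-ranked documents for the query.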

All other refiners are functioning well and produce results for the gov2 dataset. While fixing bugs, I ran into a minor issue with the Anchor and Wiki refiners: they had trouble accessing their parent class's variables. Additionally, gensim 4.x removed the vocab attribute from the Word2Vec model's vectors in favor of key_to_index and index_to_key. I found a helpful resource here.
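
A small compatibility helper along these lines can hide the gensim change from the refiners. The helper name `vocabulary` is hypothetical, and the stand-in objects below only mimic the two attribute layouts so the sketch runs without gensim installed.

```python
from types import SimpleNamespace

def vocabulary(keyed_vectors):
    """Return the vocabulary words for either gensim 4.x or 3.x word vectors."""
    if hasattr(keyed_vectors, 'index_to_key'):   # gensim >= 4.0: list of words
        return list(keyed_vectors.index_to_key)
    return list(keyed_vectors.vocab)             # gensim < 4.0: dict keyed by word

# Stand-ins for model.wv under each gensim version (illustration only).
new_style = SimpleNamespace(index_to_key=['query', 'refine'])
old_style = SimpleNamespace(vocab={'query': None, 'refine': None})
print(vocabulary(new_style), vocabulary(old_style))
```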

Currently, my focus is on resolving the issue with RelevanceFeedback along with working on RAG-fusion.

@DelaramRajaei
Member Author

Hello @hosseinfani,

I looked into a few more solutions to address the problem with the RelevanceFeedback refiner, but unfortunately, I couldn't find a successful fix. As a temporary measure, I'll stick to using only Anserini for this refiner until I come across a better solution.

Here's the code snippet that utilizes Anserini:

    import os

    def get_tfidf(self, docid):
        """Return the TF-IDF document vector that Anserini's IndexUtils dumps for docid."""
        # Example: target/appassembler/bin/IndexUtils -index lucene-index.robust04.pos+docvectors+rawdocs -dumpDocVector FBIS4-40260 -docVectorWeight TF_IDF
        cli_cmd = f'"./src/anserini/target/appassembler/bin/IndexUtils" -index "{self.index}" -dumpDocVector "{docid}" -docVectorWeight TF_IDF'
        stream = os.popen(cli_cmd)
        return stream.read()

In the meantime, I found some resources that might be useful in resolving the issue.

Anserini
Extraction of TF-IDF vectors

Pyserini
Pyserini: Reproducing Vector PRF Results
To Interpolate or not to Interpolate: PRF, Dense and Sparse Retrievers
Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls
Pseudo-Relevance Feedback with Dense Retrievers in Pyserini

Keywords Extraction Using TF-IDF Method
sklearn.feature_extraction.text
sklearn.feature_extraction.text
gensim.models.TfidfModel

I came across this tool called Spacerini (link), which combines features from Pyserini and the Hugging Face ecosystem. It provides a simple and user-friendly method for researchers to explore and analyze large text datasets through interactive search applications. I'm not certain if we'll use it, but it could be helpful down the line.

@hosseinfani
Member

@DelaramRajaei
Thanks for the update. That's fine for the time being, but please create an issue page for it as a bug so we can fix it in the future.

For code references, you can paste the permanent link to the line on GitHub, like this:

def get_tfidf(self, docid):

@DelaramRajaei
Member Author

DelaramRajaei commented Jan 12, 2024

Hello @hosseinfani,

I've fixed the issues with RM3 and BertQE. Here's a brief overview of the changes:

RM3:
I noticed that RM3 in Pyserini is only used for document reranking, whereas RelevanceFeedback uses a similar approach to select the top words. To address this, I updated the get_topn_relevant_docids function. The refiner now calls get_refined_query from its parent, RelevanceFeedback.

BertQE:
Dealing with BertQE was challenging because it relies on pygaggle to import transformers, which conflicts with other libraries. After reviewing their paper, I followed this link and the BERT documentation to implement the code according to their guidelines.

Both refiners are now working, and I've stored their results. Two other refiners (adaponfields and onfields) are still pending, but my focus is currently on implementing RAG fusion and creating dense indexes to compare results with the existing refiners.

Other helpful links:

@DelaramRajaei
Member Author

Hello @hosseinfani,
I wanted to let you know about the work I've accomplished in the past weeks.

I've stored the outcomes of the refinement process applied to RePair across all five datasets (robust04, gov2, antique, dbpedia, clueweb09b) for which sparse indices were available. Additionally, I've updated RePair's storyboard on Teams.

There have been changes to the pipeline, with the addition of more commands:

  1. query_refinement: This command runs the refiners selected in refiner.param, including the original query. If the output files already exist, they are read and stored in the list of Query class objects for efficiency. If no refiners are selected, only the original query refiner is called.

  2. similarity: This command computes ROUGE, BLEU, and semsim for all refined queries along with the original query. All results are stored in the similarity folder. The output is structured as follows:

├── output
│   └── dataset_name
│       ├── similarity
│       └── refined_queries_files
  3. rag_fusion: This command gathers the results of the selected ranker for either all the refiners or just backtranslation, then calculates reciprocal_rank_fusion (RRF). Initial results are promising, although I'm still refining this step.
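
For readers unfamiliar with reciprocal rank fusion, the sketch below shows the standard formula, score(d) = sum over ranked lists of 1 / (k + rank of d), applied to one ranked list per refiner. It is not the project's implementation; the function signature and the commonly used constant k=60 are assumptions, and the docids are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked docid lists into one list ordered by RRF score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] += 1.0 / (k + rank)
    # Higher fused score first; ties broken by docid for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# One ranked list per refined query (hypothetical docids).
rankings = [['d1', 'd2', 'd3'], ['d2', 'd1', 'd4'], ['d2', 'd3']]
print(reciprocal_rank_fusion(rankings))
```

Documents that appear near the top of several lists accumulate the largest scores, which is why d2 ends up first here.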

Additionally, several minor updates have been made to the RePair project:

  • The search function has been updated, and search_df has been removed. This change was made because Lucene doesn't allow multiple processes to modify an index simultaneously, as discussed here.
  • set_index has been added to accommodate different index formats for msmarco and aol.
  • The T5 model has been added as a refiner, and commands have been removed from the main pipeline. However, there have been some library conflicts between T5 and other refiners, leading to bugs during project execution.
  • Efforts have been made to minimize redundancy in the code and to separate T5 settings from general settings.

Currently, my focus is on working on rag-fusion.
