Create Question-Relevant Content Pairs for Retrieval Testing #302

Open
Gautam-Rajeev opened this issue Mar 27, 2024 · 1 comment
@Gautam-Rajeev
Collaborator

Gautam-Rajeev commented Mar 27, 2024

Goal:

Develop a method to generate pairs of questions and relevant content strings from a given dataset, for use in retrieval testing. The relevant content should be a collection of sentences from the source material that are needed to derive the final answer, rather than the direct answer or a whole chunk. This makes it possible to test retrieval more rigorously, including comparing chunking strategies.

Description

For effective retrieval testing, you need question-content pairs where the content is not simply the answer or a chunk of text directly related to the question. Instead, the content should be a curated collection of sentences from various parts of the provided text. These sentences should collectively contain the necessary information to answer the question, making the retrieval challenge more complex and representative of real-world scenarios. The goal is to assess retrieval performance by determining if these critical sentences are included in the chunks retrieved by the system.

Key requirements include:

  • Ability to input free text (such as a page from a book) and generate question-relevant content pairs from it.
  • Ensuring the relevant content is sourced from disparate parts of the 'free text', to simulate more challenging retrieval scenarios.
  • Evaluation of retrieval performance based on whether all these critical sentences are part of the retrieved chunks.
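As a minimal sketch of the last requirement, the check below measures how many of the critical sentences appear verbatim inside the retrieved chunks. The function names (`sentence_coverage`, `all_sentences_retrieved`) and the example data are illustrative assumptions, not part of any existing codebase, and exact substring matching is only one possible matching rule.

```python
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps do not affect matching."""
    return " ".join(text.lower().split())


def sentence_coverage(relevant_sentences: List[str], retrieved_chunks: List[str]) -> float:
    """Fraction of relevant sentences found verbatim in at least one retrieved chunk."""
    sentences = [_normalize(s) for s in relevant_sentences]
    chunks = [_normalize(c) for c in retrieved_chunks]
    if not sentences:
        return 0.0
    found = sum(1 for s in sentences if any(s in c for c in chunks))
    return found / len(sentences)


def all_sentences_retrieved(relevant_sentences: List[str], retrieved_chunks: List[str]) -> bool:
    """Strict pass/fail: every relevant sentence must be present in the retrieved chunks."""
    return sentence_coverage(relevant_sentences, retrieved_chunks) == 1.0


# Hypothetical example data.
relevant = [
    "The dam was completed in 1963.",
    "It supplies water to three districts.",
]
retrieved = [
    "The dam was completed in 1963. Construction took nine years.",
    "Rainfall is seasonal.",
]
coverage = sentence_coverage(relevant, retrieved)
```

Here only one of the two critical sentences is retrieved, so the strict check fails even though the coverage is non-zero. A fuzzy matcher could replace the substring test if chunks are rewritten or cleaned during ingestion.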

Implementation Details

The project will involve:

  • Designing an algorithm or model that can analyze free text and identify segments of text that, together, can form the basis of a question-answer pair, with the emphasis on the answer being a coherent collection of information from across the text.
  • Developing a mechanism for automatically crafting questions based on the identified relevant content, ensuring the questions are clear, concise, and accurately represent the information contained within the content strings.
  • Implementing a test suite that uses these question-content pairs to evaluate the performance of retrieval systems, specifically looking at the system's ability to fetch chunks containing all parts of the relevant content.
  • Creating documentation and examples demonstrating how to use the generated pairs for retrieval testing effectively.
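One way the test suite could look is sketched below, assuming a hypothetical `QuestionContentPair` record, a fixed-size word chunker, and a toy word-overlap retriever standing in for the real retrieval system. None of these names come from the project; they only illustrate how the same pairs can be reused to compare chunking strategies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuestionContentPair:
    question: str
    relevant_sentences: List[str]  # sentences drawn from different parts of the source


def chunk_by_words(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_retriever(question: str, chunks: List[str], k: int) -> List[str]:
    """Toy retriever (assumption): rank chunks by distinct words shared with the question."""
    q = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]


def evaluate(
    pairs: List[QuestionContentPair],
    text: str,
    chunk_size: int,
    k: int,
    retriever: Callable[[str, List[str], int], List[str]] = overlap_retriever,
) -> float:
    """Share of pairs for which every relevant sentence lies inside some retrieved chunk."""
    chunks = chunk_by_words(text, chunk_size)
    hits = 0
    for pair in pairs:
        retrieved = retriever(pair.question, chunks, k)
        if all(any(" ".join(s.split()) in c for c in retrieved) for s in pair.relevant_sentences):
            hits += 1
    return hits / len(pairs) if pairs else 0.0


# Hypothetical source text and pair.
text = (
    "The dam was completed in 1963. Rainfall is seasonal here. "
    "The dam supplies water to three districts."
)
pairs = [
    QuestionContentPair(
        question="When was the dam built and what does it supply?",
        relevant_sentences=[
            "The dam was completed in 1963.",
            "The dam supplies water to three districts.",
        ],
    )
]
score_large_chunks = evaluate(pairs, text, chunk_size=100, k=1)
score_small_chunks = evaluate(pairs, text, chunk_size=6, k=3)
```

With a single large chunk both sentences are retrieved together, while six-word chunks cut the second sentence in half, so the pair fails even when every chunk is returned. That is the kind of chunking failure these pairs are meant to expose.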

Open for collaboration: This project is initially unassigned and open to anyone interested. Discussion and solution proposals can be exchanged in comments. Contributors with impactful pull requests may be considered for assignment.

Product Name

retrieval testing

Organization Name

Samagra

Domain

Data Science / Machine Learning

Tech Skills Needed

  • Python
  • Natural Language Processing (NLP)
  • Information Retrieval

Category

Feature

Question-Content Pair Generation

Mentor(s)

@ChakshuGautam

Complexity

Medium

@kabirrajsingh

Hi @ChakshuGautam. There are two possible approaches to this. The first involves LLMs, LangChain, and OpenAI-based agents, but I feel this might be overkill for now.
The second, which I think is more suitable at the moment, involves the following steps:

  1. Preprocess the dataset to extract individual sentences or segments of text.
  2. Use named entity recognition (NER), keyword extraction, or topic modeling to identify important elements in the text.
  3. Segment the text into smaller portions based on the identified key concepts.
  4. Use a transformer-based model to generate a question from each small text portion.
  5. Use paraphrasing techniques to vary the wording of the generated questions.
  6. Compute validation metrics such as overlap and sentence coverage.
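A rough sketch of steps 1 to 3, using only the Python standard library. The stopword list, the regex-based sentence splitter, and the `min_gap` parameter are simplifying assumptions; a real version would likely swap in an NER or keyword-extraction library for `keywords`. Question generation and paraphrasing (steps 4 and 5) are left out because they depend on the chosen model.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Small illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "of", "to", "in", "is", "was", "and", "it",
    "for", "on", "by", "with", "that", "this", "are", "as", "at",
}


def split_sentences(text: str) -> List[str]:
    """Step 1: naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keywords(sentence: str) -> Set[str]:
    """Step 2 stand-in: lowercase content words, in place of NER or topic modeling."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {w for w in tokens if w not in STOPWORDS and len(w) > 2}


def group_by_keyword(sentences: List[str], min_gap: int = 2) -> Dict[str, List[int]]:
    """Step 3: map each keyword to the sentence indices mentioning it.

    Only keywords whose first and last mentions are at least `min_gap`
    sentences apart are kept, so the grouped content comes from disparate
    parts of the text.
    """
    index: Dict[str, List[int]] = defaultdict(list)
    for i, sentence in enumerate(sentences):
        for kw in keywords(sentence):
            index[kw].append(i)
    return {
        kw: ids
        for kw, ids in index.items()
        if len(ids) >= 2 and ids[-1] - ids[0] >= min_gap
    }


# Hypothetical input text.
text = (
    "The dam was completed in 1963. Rainfall is seasonal in the valley. "
    "Farmers grow rice and wheat. The dam supplies water to three districts."
)
sentences = split_sentences(text)
groups = group_by_keyword(sentences)
```

For this text, `groups` links the first and last sentences through the shared keyword "dam". Those two sentences would form the relevant content, and a question would then be generated from them in step 4.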

Would like to discuss more and get your opinions.
