
Creating a test set of questions and answers for testing capabilities #323

Open · xorsuyash opened this issue Aug 6, 2024 · 7 comments

@xorsuyash (Collaborator) commented Aug 6, 2024

cc @GautamR-Samagra

Tasks

  • Generating question-answer chunks (Global) from the agri PDFs.
  • Using RAPTOR to cluster the chunks and GPT (autotune) to create more context-rich question-answer pairs; a rough sketch of this flow follows below.
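
This is only a minimal sketch of what that flow could look like, assuming OpenAI-style embedding/chat APIs; KMeans stands in here for RAPTOR's UMAP+GMM clustering, and the model names, prompt, and parameters are illustrative assumptions rather than the actual implementation:

```python
# Hypothetical sketch: RAPTOR-style clustering of PDF chunks + LLM QA generation.
# Model names, prompts, and parameters are assumptions, not this project's code.
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def qa_pairs_for(context):
    prompt = "Generate 3 question-answer pairs grounded only in this context:\n\n" + context
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def raptor_like_qa(chunks, n_clusters=8):
    # Cluster chunk embeddings so each QA prompt sees related chunks together,
    # which should yield more context-rich pairs than per-chunk generation.
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embed(chunks))
    for k in range(n_clusters):
        context = "\n".join(c for c, label in zip(chunks, labels) if label == k)
        yield qa_pairs_for(context)
```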
@harshaharod21

Hello, I can contribute!

@Gautam-Rajeev (Collaborator) commented Aug 9, 2024

@xorsuyash please populate a new sheet with the English PDFs that you'll use for KG creation.

@harshaharod21 Please find the older documentation links from Suyash here:

  • LlamaIndex implementation here

@dev-SARDAR

@xorsuyash @GautamR-Samagra Hi, can I work on these tasks? I have worked on identifying similar questions in a dataset of Quora queries, so this issue is aligned with my interests.

@harshaharod21

@xorsuyash Link to the repo where this issue is implemented: https://github.com/harshaharod21/qa_raptor

Note that for now I have used Llama as the LLM rather than OpenAI.

@Gautam-Rajeev (Collaborator)

@xorsuyash

Can we create a question-answer set on these PDFs first?

Listing out PDFs to start with here:

@Gautam-Rajeev (Collaborator)

Next steps:

  • Figure out how to extract and visualize the created KG from the parquet files
  • Figure out whether GraphRAG supports providing an initial ontology while creating the graph
  • Figure out how the querying engine works for global and local search:
    • Are they creating Cypher queries?
    • Are they doing some vector search on the entities/nodes?
  • Use the Kharif book (first 247 pages) to test once the code is clear.

@harshaharod21 commented Aug 29, 2024

Updates on the next steps listed above:

  1. I have figured out how to visualize the KG; there are three ways given in the issue raised (a rough sketch of the first follows below):
  • Enable umap and graphml in the init_content.py file; with this we get GraphML files in the output, which we can visualize with the Gephi software.
  • Use the notebook to get the visuals.
  • Use the graphrag visualizer.
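
As a minimal sketch of the first option, assuming the GraphML snapshot has been enabled and written to the run's output directory (the file path below is an assumption; actual names depend on the GraphRAG version and run):

```python
# Hypothetical sketch: load a GraphML snapshot produced by GraphRAG and draw it.
# The path is an assumption; adjust it to your run's output directory.
import networkx as nx
import matplotlib.pyplot as plt

graph = nx.read_graphml("output/artifacts/summarized_graph.graphml")

plt.figure(figsize=(12, 12))
pos = nx.spring_layout(graph, seed=42)  # force-directed layout
nx.draw_networkx(graph, pos, node_size=20, font_size=6)
plt.axis("off")
plt.show()
```

For larger graphs, exporting to Gephi (as mentioned above) scales better than matplotlib.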
  2. I looked at the source code, but cannot find a way to include base entities except for prompt auto-tuning, where we can provide the domain for entity extraction. A similar issue has also been raised for the same: [Feature Request]: Prompt Tuning with given entities (microsoft/graphrag#1010)

Update: This is the response I got on the issue where I commented about base entities: "In the settings.yaml there is the entity_extraction part that contains the entity_types field where you can specify the types of entities you want the LLM to extract, but they are only taken into consideration more as a suggestion when indexing and completely ignored when prompt tuning."
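
For reference, that part of the generated settings.yaml looks roughly like this; the exact keys and defaults vary by GraphRAG version, so treat this as an approximation rather than a verbatim excerpt:

```yaml
# Approximate excerpt from GraphRAG's settings.yaml; check your generated file.
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]  # treated as suggestions when indexing
  max_gleanings: 1
```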

  3. Indexing and querying:
    The indexing pipeline is configurable; it is composed of workflows, standard and custom steps, prompt templates, and input/output adapters. The pipeline is designed to:
  • extract entities, relationships, and claims from raw text
  • perform community detection over the entities
  • generate community summaries and reports at multiple levels of granularity
  • embed entities into a graph vector space
  • embed text chunks into a textual vector space
    The output of the pipeline is JSON and parquet files.
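
A quick, minimal sketch of inspecting those parquet outputs with pandas (the artifact file names below are assumptions and vary by GraphRAG version):

```python
# Hypothetical sketch: inspect GraphRAG's parquet artifacts with pandas.
# File names are assumptions; list your output directory for the actual ones.
import pandas as pd

entities = pd.read_parquet("output/artifacts/create_final_entities.parquet")
relationships = pd.read_parquet("output/artifacts/create_final_relationships.parquet")

print(entities.columns.tolist())  # inspect the schema of the extracted entities
print(relationships.head())       # source/target edges of the knowledge graph
```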

Querying:
For local search, they have a vector store (LanceDB), so it likely uses vector-based similarity search:
microsoft.github.io/graphrag/posts/query/notebooks/local_search_nb/
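
A minimal sketch of what such a similarity lookup could look like against LanceDB directly; the table name, path, and embedding model are assumptions for illustration, not GraphRAG's internals:

```python
# Hypothetical sketch: vector similarity search on a LanceDB table, roughly
# what local search does under the hood. Path and table names are assumptions.
import lancedb
from openai import OpenAI

client = OpenAI()

def embed(text):
    resp = client.embeddings.create(model="text-embedding-3-small", input=[text])
    return resp.data[0].embedding

db = lancedb.connect("output/lancedb")                  # assumed path
table = db.open_table("entity_description_embeddings")  # assumed table name

hits = table.search(embed("What pests affect kharif paddy?")).limit(5).to_pandas()
print(hits)  # nearest entities by embedding distance
```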

For global search:
microsoft.github.io/graphrag/posts/query/notebooks/global_search_nb/

The search here is not vector- or Cypher-query-based, as no database is created to store the embeddings.

Instead, they use a map-reduce approach here (sketched below):

  • Map phase: divides the data into manageable chunks and processes them independently.
  • Reduce phase: aggregates the intermediate results from the map phase to produce the final output.
    Both steps are done by the LLM, which is why they describe the process as: "This is a resource-intensive method, but often gives good responses for questions that require an understanding of the dataset as a whole".
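
A minimal sketch of that map-reduce flow, assuming an OpenAI-style chat API; in GraphRAG the map phase runs over community reports, but the prompts and helper names here are illustrative assumptions:

```python
# Hypothetical sketch of LLM-driven map-reduce, as in global search.
# ask(), the prompts, and batch_size are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def ask(prompt):
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def global_search(question, community_reports, batch_size=5):
    # Map phase: answer the question independently from each batch of reports.
    partial_answers = []
    for i in range(0, len(community_reports), batch_size):
        batch = "\n\n".join(community_reports[i : i + batch_size])
        partial_answers.append(
            ask(f"Using only these summaries:\n{batch}\n\nAnswer: {question}")
        )
    # Reduce phase: aggregate the partial answers into one final answer.
    combined = "\n\n".join(partial_answers)
    return ask(f"Combine these partial answers into one answer to '{question}':\n\n{combined}")
```

This also explains the quoted warning: every map batch and the final reduce each cost an LLM call.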
