QAing on a larger corpus? #11

Open
bigrig2212 opened this issue May 9, 2020 · 1 comment

Comments

@bigrig2212

Hi. Awesome project. So fun.
I'm wondering what technique to use for asking a question of a larger corpus: hundreds of documents rather than a short text snippet as in the sample code. It seems to take longer and get less accurate the more text I supply, so is there another way to work with a larger corpus? For example, filter first using TF-IDF and then run this QA only on the returned documents?
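
To make the TF-IDF idea concrete, here is a minimal sketch of the filtering step in plain Python. It is only an illustration under assumptions: the function names (`tokenize`, `tfidf_top_k`), the smoothed IDF formula, and the sample documents are all made up for this example and are not part of this project. The returned shortlist is what would then be handed to the QA model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_top_k(question, documents, k=3):
    """Return the indices of the k documents most similar to the question."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    # Smoothed inverse document frequency (one common variant, an assumption here).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    def vectorize(tokens):
        # Terms never seen in the corpus are ignored.
        counts = Counter(t for t in tokens if t in idf)
        return {t: c * idf[t] for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q_vec = vectorize(tokenize(question))
    scores = [cosine(q_vec, vectorize(tokens)) for tokens in doc_tokens]
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Made-up sample corpus, for illustration only.
documents = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Python is a programming language created by Guido van Rossum.",
    "The Great Wall of China stretches across northern China.",
]
shortlist = tfidf_top_k("Who created the Python language?", documents, k=1)
```

Only the documents in `shortlist` would need to go through the slower QA step.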

Thx.

@martinnormark

You could divide your corpus into sections of a reasonable size (whatever that size turns out to be for your model), run QnA on every section, perhaps in parallel, then sort all the answers by the score the model returns and keep the 10 highest-scoring answers.

You will potentially end up running QnA on lots of irrelevant text.
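
A rough sketch of that split, run and rank loop, assuming Python. `answer_question` is a hypothetical stand-in for the real model call, not this project's API; here it scores by word overlap purely so the sketch runs without a model. The section size of 200 words and the worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_sections(text, max_words=200):
    """Split text into consecutive sections of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def answer_question(question, section):
    """Hypothetical placeholder for the QA model: returns (answer, score).

    The score is the fraction of section words that appear in the question.
    A real model would return its own answer span and confidence score.
    """
    q_words = set(question.lower().split())
    s_words = section.lower().split()
    score = sum(1 for w in s_words if w in q_words) / max(len(s_words), 1)
    return section[:60], score


def best_answers(question, corpus, top_n=10, workers=4):
    """Run QA over every section and keep the top_n answers by score."""
    sections = [s for doc in corpus for s in split_into_sections(doc)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda s: answer_question(question, s), sections))
    # Highest score first.
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:top_n]
```

Calling `best_answers(question, corpus)` returns up to 10 `(answer, score)` pairs. Threads only help if the model call releases the GIL or runs out of process; otherwise a process pool or batching may be the better fit.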

Is there a way your corpus is structured so that you can filter it down?

You could also label the sections of your corpus with a set of categories, then build a set of questions labelled with the same categories. Train a classifier on those questions, run it on each incoming question to get its category, filter the corpus down to that category, and then run QnA on what remains.

That would require a labelled dataset of corpus sections as well as a good corpus of labelled questions.
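
As a minimal sketch of the category routing described above, assuming Python: the classifier below is a deliberately naive word-count matcher standing in for a real text classifier, and every name and sample row (`train_question_classifier`, `classify`, the "account" and "shipping" categories) is invented for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_question_classifier(labelled_questions):
    """Build per-category word counts from (question, category) pairs."""
    counts = defaultdict(Counter)
    for question, category in labelled_questions:
        counts[category].update(tokenize(question))
    return counts


def classify(question, counts):
    """Pick the category whose training questions share the most words."""
    tokens = tokenize(question)
    return max(counts, key=lambda cat: sum(counts[cat][t] for t in tokens))


def sections_for_question(question, counts, labelled_sections):
    """Keep only the corpus sections labelled with the question's category."""
    category = classify(question, counts)
    return [text for text, label in labelled_sections if label == category]


# Made-up labelled data, for illustration only.
labelled_questions = [
    ("How do I reset my password?", "account"),
    ("How do I change my email address?", "account"),
    ("When will my order ship?", "shipping"),
    ("How much does delivery cost?", "shipping"),
]
labelled_sections = [
    ("Passwords can be reset from the login page.", "account"),
    ("Orders ship within two business days.", "shipping"),
]

counts = train_question_classifier(labelled_questions)
candidates = sections_for_question(
    "When does my order arrive?", counts, labelled_sections
)
```

QnA would then run only on `candidates` instead of the whole corpus.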
