SQUAD - Document Retrieval test

Objective of this repo :

This project was my first interaction with Natural Language Processing. I was asked to write code that, given a question and a corpus, returns the text most likely to answer the question. The link to the original dataset is here.

File structure :

The context_class.py file creates our corpus object and parses the texts with their corresponding contexts, questions, and answers.
The explo.py file explores and visualizes the data distribution in the given corpus.
The NN.py file offers a first solution to our problem. Using NN classification and a TF-IDF embedding, it returns the text closest to each given question.
The validation.py file lets the user test the classification on a random or a selected question, including questions in the validation set that have no target value.
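
To make the retrieval idea concrete, here is a minimal sketch of matching a question to its closest context. It is not the code from NN.py: it assumes "NN" means a nearest-neighbour search, rebuilds TF-IDF and cosine similarity with the Python standard library only, and every name in it (`tokenize`, `fit_idf`, `tfidf_vector`, `rank_contexts`) as well as the toy contexts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for each corpus term."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(text, idf):
    """Return an L2-normalised sparse TF-IDF vector (terms unseen in the corpus are dropped)."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two normalised sparse vectors, i.e. their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def rank_contexts(question, contexts, idf=None):
    """Return context indices sorted from most to least similar to the question."""
    idf = idf or fit_idf(contexts)
    q_vec = tfidf_vector(question, idf)
    scores = [(cosine(q_vec, tfidf_vector(c, idf)), i) for i, c in enumerate(contexts)]
    # Sort by descending similarity; ties are broken by the lower index.
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Toy data for illustration only (not taken from the dataset).
contexts = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The Amazon River flows through South America.",
]
question = "Where is the Eiffel Tower located?"

ranking = rank_contexts(question, contexts)
print("Closest context:", contexts[ranking[0]])
```

The first index of the ranking is the single returned context; keeping the whole ranking makes it easy to return several candidates later.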

Performance and next steps :

The current results are the following :

An accuracy of 0.53 based on the returned context
An accuracy of 0.63 based on the returned text title

If I had more time to work on this, here are the first couple of things I would try :

I would have the classification return not just one context but 3 (or whatever number proves optimal). The goal is to improve my accuracy, or to change the performance metric altogether into a top-k accuracy. As with a search engine, the user wants the right document to show up, but doesn't mind if it isn't the first result.
I would try switching embeddings, perhaps going the word2vec route.
It would be interesting to use a BERT model or another well-established NLP tool.
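
The top-k accuracy mentioned in the first item could be computed along these lines. This is only a sketch under assumptions: the function name `top_k_accuracy`, the input layout (one ranked list of context indices per question) and the toy numbers are all hypothetical, and none of it comes from the repo or reproduces the scores above.

```python
def top_k_accuracy(ranked_predictions, targets, k=3):
    """Fraction of questions whose target context is among the first k predictions.

    ranked_predictions: one list of context indices per question, best first.
    targets: the index of the correct context for each question.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("ranked_predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1
        for ranked, target in zip(ranked_predictions, targets)
        if target in ranked[:k]
    )
    return hits / len(targets)


# Toy rankings for four questions over three contexts (illustration only).
toy_rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [2, 1, 0]]
toy_targets = [0, 1, 2, 0]

for k in (1, 2, 3):
    print(f"top-{k} accuracy:", top_k_accuracy(toy_rankings, toy_targets, k=k))
```

With k=1 this reduces to the plain accuracy reported above, so the two metrics can be compared on the same predictions.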
