InformationRetrieval

The project is divided into three parts:

  1. Tokenization & Stemming
  2. Indexing
  3. Ranked Retrieval

The Cranfield text document collection, which contains 1,400 documents, is used.

Tokenization & Stemming

As part of preprocessing, the documents are first parsed to extract the relevant text. They are then tokenized and stemmed using an open-source implementation of the Porter Stemmer.
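
The sketch below illustrates this preprocessing step. It assumes NLTK supplies the tokenizer and Porter stemmer; the repository may rely on a different open-source Porter implementation and parsing logic.

```python
# Hedged sketch of tokenization and stemming, assuming NLTK's tokenizer
# and PorterStemmer stand in for the project's actual implementation.
import nltk
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)

stemmer = PorterStemmer()

def preprocess(raw_text: str) -> list[str]:
    """Lowercase, tokenize, drop non-alphabetic tokens, and stem."""
    tokens = nltk.word_tokenize(raw_text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok.isalpha()]

print(preprocess("Experimental investigation of the aeroelastic models"))
```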

Indexing

The documents are stemmed using the Porter Stemmer and lemmatized using NLTK's WordNetLemmatizer. Two indexes are created: one over stemmed terms and another over lemmatized terms. Both indexes are then compressed using blocked compression and front coding.
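
A minimal sketch of this step is shown below: two inverted indexes are built, one per normalization scheme, and the dictionary is front-coded. The `build_index` and `front_code` helpers are illustrative simplifications rather than the repository's exact code, and the blocked-compression step is omitted.

```python
# Hedged sketch of indexing with stemming vs. lemmatization, plus a
# simplified front-coding of the sorted term dictionary.
from collections import defaultdict

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def build_index(docs: dict[int, list[str]], normalize) -> dict[str, list[int]]:
    """Map each normalized term to the sorted list of doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            postings[normalize(tok)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def front_code(sorted_terms: list[str]) -> list[tuple[int, str]]:
    """Store each term as (shared-prefix length with previous term, suffix)."""
    coded, prev = [], ""
    for term in sorted_terms:
        k = 0
        while k < min(len(prev), len(term)) and prev[k] == term[k]:
            k += 1
        coded.append((k, term[k:]))
        prev = term
    return coded

docs = {1: ["flying", "planes"], 2: ["plane", "flew"]}
stem_index = build_index(docs, stemmer.stem)
lemma_index = build_index(docs, lemmatizer.lemmatize)
print(front_code(sorted(stem_index)))
```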

Ranked Retrieval

The indexed documents are then converted into document vectors using two different weighting schemes: max-tf term weighting and Okapi term weighting. The model is then evaluated on a list of queries to assess the relevance of the ranked results.
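
The sketch below illustrates the two weighting schemes and cosine scoring, assuming the textbook forms: augmented maximum-tf weighting, w = (0.4 + 0.6 * tf / max_tf) * idf, and the Okapi BM25-style within-document weight with k1 = 1.2 and b = 0.75. The constants and exact formulas used in the project may differ.

```python
# Hedged sketch of the two term-weighting schemes and cosine ranking.
import math
from collections import Counter

def max_tf_vector(tokens: list[str], df: dict[str, int], n_docs: int) -> dict[str, float]:
    """Augmented max-tf weighting scaled by idf (assumed textbook form)."""
    tf = Counter(tokens)
    max_tf = max(tf.values())
    return {
        term: (0.4 + 0.6 * freq / max_tf) * math.log(n_docs / df[term])
        for term, freq in tf.items()
    }

def okapi_vector(tokens, df, n_docs, avg_dl, k1=1.2, b=0.75):
    """Okapi BM25-style within-document weight (assumed textbook form)."""
    tf = Counter(tokens)
    dl = len(tokens)
    return {
        term: (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * dl / avg_dl))
        * math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
        for term, freq in tf.items()
    }

def cosine_score(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    """Cosine similarity between sparse query and document vectors."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    norm = (math.sqrt(sum(w * w for w in doc_vec.values()))
            * math.sqrt(sum(w * w for w in query_vec.values())))
    return dot / norm if norm else 0.0
```

Documents would be ranked for each query by scoring every document vector against the query vector and sorting by score, with the better-performing scheme chosen based on the relevance judgments supplied with the Cranfield collection.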