Preprocessing Text

Cleaning and preprocessing text is a prerequisite for all IR and NLP tasks. The text was cleaned by removing tags and punctuation, followed by stopword removal, stemming, and lemmatization.
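A minimal sketch of the cleaning steps described above, using only the standard library. The stopword list and the `naive_stem` suffix stripper are small illustrative stand-ins (assumptions, not the repository's code); a real pipeline would use a full stopword list and a proper stemmer or lemmatizer.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; real projects use a fuller list).
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "was", "were"}


def naive_stem(token):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Remove tags and punctuation, lowercase, drop stopwords, then stem."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML/XML tags
    text = text.lower()
    # Replace every punctuation character with a space.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("<p>The cats were chasing the mice, and the dogs barked!</p>"))
# → ['cat', 'chas', 'mice', 'dog', 'bark']
```

The output shows why suffix stripping is only a rough approximation: "chasing" becomes "chas", and the irregular plural "mice" is left untouched, which is the kind of case lemmatization handles.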

TF-IDF

Representing text is an important step in all IR and NLP tasks. A TF-IDF representation was implemented from scratch on a set of documents and compared with the scikit-learn implementation.
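A from-scratch TF-IDF sketch, assuming whitespace tokenization and the smoothed IDF `ln((1 + n) / (1 + df)) + 1` with L2 row normalization, which is the variant scikit-learn documents as its default. The function name `tfidf` is hypothetical, and because the tokenization here is simpler than a library's, exact values may differ from a library's output.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) where rows are L2-normalized TF-IDF vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf(["the cat sat", "the dog sat", "the cat ran"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

A term that occurs in every document ("the") receives the lowest weight in each row, while a term unique to one document ("dog", "ran") receives the highest.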

Word2Vec Representation

Document retrieval using Skip-gram and CBOW word representations, evaluated using precision, recall, and F1 score.
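The evaluation metrics can be sketched as set-based precision, recall, and F1 over retrieved and relevant document ids. The function name and the document ids are hypothetical, and treating the retrieved list as an unordered set is an assumption.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean; define it as 0 when nothing relevant was retrieved.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Four documents retrieved, two of which are among the three relevant ones.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```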

LSI

Implementation of LSI on a set of documents using SVD, with document retrieval tested using the cosine similarity measure.
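For reference, the standard LSI formulation can be written as follows, assuming a term-by-document matrix $A$ with document columns $d_j$, a query vector $q$ in the same term space, and a chosen rank $k$ (the symbols are notation introduced here, not taken from the repository):

```latex
\begin{aligned}
A &\approx A_k = U_k \Sigma_k V_k^{\top}
  && \text{(rank-$k$ truncated SVD)} \\
\hat{d}_j &= \Sigma_k^{-1} U_k^{\top} d_j
  && \text{(document $j$ in the latent space; equals column $j$ of $V_k^{\top}$)} \\
\hat{q} &= \Sigma_k^{-1} U_k^{\top} q
  && \text{(query folded into the same space)} \\
\operatorname{sim}(q, d_j) &=
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
  && \text{(cosine similarity used for ranking)}
\end{aligned}
```

Documents are then ranked for a query by decreasing $\operatorname{sim}(q, d_j)$.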

YASS Stemmer

Stemming is implemented by agglomerative clustering of words under various string distance measures. https://dl.acm.org/doi/10.1145/1281485.1281489
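A rough sketch of the idea: compute a prefix-sensitive string distance, then group words with complete-linkage agglomerative clustering so that each cluster acts as one stem class. The distance below is the D3 measure as understood from the linked paper; the formula as written, the function names, and the threshold of 1.5 are assumptions for illustration and should be checked against the paper.

```python
import math


def yass_d3(x, y):
    """D3-style distance (assumed form): small when words share a long prefix."""
    if x == y:
        return 0.0
    length = max(len(x), len(y))  # n + 1 in the paper's notation
    m = 0                         # position of the first mismatch
    for a, b in zip(x, y):
        if a != b:
            break
        m += 1
    if m == 0:
        return math.inf           # no common prefix at all
    tail = sum(0.5 ** (i - m) for i in range(m, length))
    return (length - m) / m * tail


def complete_linkage(words, distance, threshold):
    """Merge the closest clusters until the smallest gap exceeds `threshold`."""
    clusters = [[w] for w in words]

    def gap(a, b):
        # Complete linkage: cluster distance is the largest pairwise distance.
        return max(distance(u, v) for u in a for v in b)

    while len(clusters) > 1:
        best, i, j = min(
            (gap(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if best > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


words = ["astronomer", "astronomy", "astronomical", "walk", "walks", "walking"]
print(complete_linkage(words, yass_d3, threshold=1.5))
```

With this threshold the two word families end up in separate clusters, because words with no common prefix are infinitely far apart.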

Query Expansion and Relevance Feedback

Document retrieval for a query was evaluated with query expansion (synonyms of the query words) and relevance feedback (Rocchio algorithm).
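Both techniques can be sketched in a few lines. The synonym table is hypothetical (a real system might draw synonyms from a thesaurus), and the Rocchio weights `alpha`, `beta`, and `gamma` are commonly quoted defaults assumed here, not values taken from this project.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {"car": ["automobile"], "fast": ["quick", "rapid"]}


def expand_query(terms, synonyms=SYNONYMS):
    """Append synonyms of each query term, keeping order and skipping duplicates."""
    expanded = list(terms)
    for term in terms:
        for synonym in synonyms.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones."""

    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel, non = centroid(relevant), centroid(non_relevant)
    updated = [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel, non)]
    # Negative term weights are commonly clipped to zero.
    return [max(0.0, w) for w in updated]


print(expand_query(["fast", "car"]))
print(rocchio([1, 0, 0], relevant=[[1, 1, 0], [1, 1, 0]], non_relevant=[[0, 0, 1]]))
```

In the Rocchio call, the second term gains weight because it appears in the relevant documents, while the third term, which appears only in the non-relevant document, is pushed below zero and clipped.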

Question Answering

Question answering using an unsupervised approach based on word2vec representations, evaluated using Exact Match and F1 score.
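The two metrics can be sketched as below. The normalization steps (lowercasing, stripping punctuation and English articles) follow a common convention for extractive QA evaluation and are an assumption here; the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(answer):
    """Lowercase, strip punctuation and articles, and split into tokens."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return answer.split()


def exact_match(prediction, truth):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred, gold = normalize(prediction), normalize(truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("in Paris France", "Paris"))
```

Exact Match rewards only a full match after normalization, while token F1 gives partial credit to a prediction that contains the gold answer along with extra words.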

Text Summarization

Extractive text summarization using TextRank and LexRank, evaluated using ROUGE-1 and ROUGE-2 scores.
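A minimal sketch of the evaluation side: ROUGE-N computed as clipped n-gram recall of a candidate summary against a single reference. Whitespace tokenization, the single reference, and the recall-only variant are simplifying assumptions; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate (counts clipped)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(round(rouge_n_recall(candidate, reference, 1), 3))  # ROUGE-1
print(round(rouge_n_recall(candidate, reference, 2), 3))  # ROUGE-2
```

ROUGE-2 is lower than ROUGE-1 for this pair because the single differing word breaks two of the five reference bigrams.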

Text Classification

Multiclass text classification with an SVM on TF-IDF and word2vec representations.

Text Classification using an Ensemble-Based Approach

Multiclass text classification using stacking and voting classifiers over an ensemble of Multinomial Naive Bayes, Logistic Regression, and Random Forest models.
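The voting part can be illustrated without any library: hard voting takes the per-sample majority label across the base classifiers. The prediction lists below are made-up placeholders standing in for the outputs of the three models, and the tie rule (first label seen wins) is an assumption of this sketch.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority label per sample; expects one equal-length label list per classifier."""
    voted = []
    for labels in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest count;
        # on a tie, the label encountered first is returned.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Placeholder predictions for three samples from three hypothetical models.
naive_bayes = ["sports", "politics", "tech"]
log_reg = ["sports", "tech", "tech"]
random_forest = ["politics", "tech", "sports"]
print(hard_vote([naive_bayes, log_reg, random_forest]))
# → ['sports', 'tech', 'tech']
```

Stacking differs in that the base models' predictions become input features for a second-level classifier, rather than being combined by a fixed rule.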