Cleaning and preprocessing text is a prerequisite for all IR and NLP tasks. The text was cleaned by removing tags and punctuation, removing stopwords, and applying stemming and lemmatization.
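
As a rough illustration of the cleaning step, here is a minimal sketch using only the Python standard library. The `STOPWORDS` set and the `clean_text` name are assumptions made for this example, and stemming and lemmatization are left out because they normally rely on an external library.

```python
import re
import string

# Tiny illustrative stopword list (an assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to", "in"}

def clean_text(text):
    """Strip HTML-like tags and punctuation, lowercase, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)  # remove tags such as <p> or </p>
    # Replace every punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.translate(table).lower().split()
    return [t for t in tokens if t not in STOPWORDS]

print(clean_text("<p>The cat, and the dog!</p>"))  # ['cat', 'dog']
```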
Text representation is an important step in all IR and NLP tasks. A TF-IDF representation was implemented from scratch on a set of documents and compared with the scikit-learn (sklearn) implementation.
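
A from-scratch TF-IDF could look roughly like the sketch below. It assumes whitespace tokenization, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows, which is the weighting scikit-learn's `TfidfVectorizer` applies by default. scikit-learn tokenizes differently, so results are only expected to agree on simple inputs. The `tfidf` function name and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows): one L2-normalised TF-IDF row per document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid division by zero
        rows.append([x / norm for x in row])
    return vocab, rows

vocab, rows = tfidf(["the cat sat", "the dog sat", "the cat ran"])
```

In this toy corpus "the" occurs in every document, so it receives the lowest IDF, while "dog" occurs in only one and is weighted highest within its document.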
Document retrieval using Skip-gram and CBOW word representations, evaluated using precision, recall, and F1 score.
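
The evaluation side of this task can be sketched as set-based precision, recall, and F1 over retrieved versus relevant document IDs. The function name and the IDs below are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based retrieval metrics for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# 2 of the 4 retrieved documents are relevant; 2 of the 3 relevant ones were found.
p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```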
Implementation of LSI on a set of documents using SVD, with document retrieval tested using the cosine similarity measure.
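
A minimal LSI sketch with NumPy is shown below, assuming a small term-document count matrix. The `lsi_rank` name, the toy matrix, and the choice to project both documents and the query with the transpose of `U_k` are assumptions made for illustration; other folding-in conventions exist.

```python
import numpy as np

def lsi_rank(term_doc, query_vec, k=2):
    """Rank documents by cosine similarity to the query in a k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    doc_vecs = Vt[:k, :].T * s_k   # U_k^T applied to each document column, shape (n_docs, k)
    q = query_vec @ U_k            # U_k^T applied to the query, shape (k,)
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12
    sims = doc_vecs @ q / denom
    return np.argsort(-sims), sims

# Rows: cat, dog, pet, car, engine. Columns: four toy documents.
A = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0]], dtype=float)
query = np.array([1, 0, 0, 0, 0], dtype=float)  # the single term "cat"
order, sims = lsi_rank(A, query, k=2)
```

In this toy example the second document does not contain "cat", yet it scores as highly as the first because both load on the same latent "pet" dimension, while the two car documents score near zero.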
Stemming implemented via agglomerative clustering of words, using various string distance measures.
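
One way to picture clustering-based stemming is the sketch below. It assumes a length-normalised Levenshtein distance, single-linkage merging, and a cut-off of 0.45, and it takes the shortest word in each cluster as the stem. All of these choices, and the function names, are illustrative assumptions rather than the settings used in the project.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    return edit_distance(a, b) / max(len(a), len(b))

def cluster_words(words, threshold=0.45):
    """Single-linkage agglomerative clustering, stopping once the closest pair exceeds threshold."""
    clusters = [[w] for w in words]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(normalized_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

words = ["connect", "connected", "connecting", "play", "played", "playing"]
clusters = cluster_words(words)
# Use the shortest member of each cluster as its stem.
stems = {w: min(c, key=len) for c in clusters for w in c}
```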
Document retrieval for a query was evaluated with query expansion (adding synonyms of the query words) and relevance feedback (the Rocchio algorithm).
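
The Rocchio update moves the query vector toward the centroid of relevant documents and away from the centroid of non-relevant ones. The sketch below assumes sparse term-weight dictionaries and the commonly quoted weights alpha = 1.0, beta = 0.75, gamma = 0.15. The weights, the function name, and the toy vectors are assumptions, not values taken from the project.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*mean(relevant) - gamma*mean(non_relevant), dropping non-positive weights."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if non_relevant:
            w -= gamma * sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant)
        if w > 0:  # negative weights are conventionally clipped to zero
            new_query[t] = w
    return new_query

new_q = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}],
    non_relevant=[{"jaguar": 1.0, "car": 1.0}],
)
```

Here the expanded query gains the term "cat" from the relevant document, while "car" ends up with a negative weight and is dropped.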
Question answering using an unsupervised approach based on word2vec representations, evaluated using Exact Match and F1 score.
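
The two metrics can be sketched as follows, assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and token-overlap F1. Whether the project normalized answers in exactly this way is not stated, so treat these details as assumptions.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, remove punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred, ref = normalize(prediction), normalize(gold)
    common = sum((Counter(pred) & Counter(ref)).values())  # overlapping tokens
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("in Paris France", "Paris")
```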
Extractive text summarization using TextRank and LexRank, evaluated using ROUGE-1 and ROUGE-2 scores.
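
ROUGE-N measures n-gram overlap between a candidate summary and a reference. The sketch below assumes whitespace tokenization and a single reference, and it returns precision, recall, and F1. Official ROUGE tooling adds options such as stemming that are not reproduced here, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped counts of matching n-grams
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

candidate = "the cat sat on the mat"
reference = "the cat was on the mat"
rouge1 = rouge_n(candidate, reference, n=1)
rouge2 = rouge_n(candidate, reference, n=2)
```

For this pair, 5 of the 6 reference unigrams and 3 of the 5 reference bigrams are matched.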
Multiclass text classification with an SVM, using TF-IDF and word2vec representations.
Multiclass text classification using stacking and voting classifiers: an ensemble of Multinomial Naive Bayes, Logistic Regression, and Random Forest.