In this NLP project we will perform the following steps:
-> Preprocessing and cleaning the data
-> Train/test split
-> Applying BOW, TF-IDF, and Word2Vec (BOW and TF-IDF are sketched right after this list)
-> ML algorithms
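A minimal sketch of the BOW and TF-IDF steps with scikit-learn, using a tiny hypothetical corpus (the documents and variable names here are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['good movie', 'bad movie', 'good acting bad story']  # hypothetical cleaned documents

# Bag of Words: each document becomes a vector of raw token counts
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())   # vocabulary learned from the corpus
print(X_bow.toarray())               # one count vector per document

# TF-IDF: the same counts reweighted by inverse document frequency
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray())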
For converting a dataset to Word2Vec features:
- Data cleaning
- Train a W2V model using the gensim library
- Before training W2V, the training data should be tokenized
- We should convert each document to a list of words
- W2V provides an embedding for each individual word, but ML models need a single numerical feature vector for the entire sentence
- So we should write a function that converts each sentence to one vector (see the sketch after this list)
- e.g. words = [word for word in doc if word in model.wv] keeps only the words present in the model's vocabulary
- Each word vector has size 100 (gensim's default vector_size), and we average the word vectors to get one sentence vector
- Training the ML model
- Prediction with the model
- For a better prediction score we should perform data preprocessing and cleaning first
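A minimal sketch of this pipeline, assuming a pandas DataFrame df with a cleaned 'text' column and a 'label' column (hypothetical names), and using LogisticRegression as a stand-in for whichever ML algorithm you pick:

import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Tokenize: convert each document to a list of words
docs = [simple_preprocess(doc) for doc in df['text']]
labels = df['label'].values

# Split before fitting W2V, so the test documents stay unseen
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42)

# Train Word2Vec on the training documents (vector_size=100 is the gensim default)
w2v = Word2Vec(sentences=docs_train, vector_size=100, window=5, min_count=2, workers=4)

def avg_word2vec(doc, model):
    # Keep only words present in the model's vocabulary
    words = [word for word in doc if word in model.wv]
    if not words:                        # all-OOV document: fall back to a zero vector
        return np.zeros(model.vector_size)
    # Average the word vectors into one fixed-size sentence vector
    return np.mean([model.wv[word] for word in words], axis=0)

X_train = np.array([avg_word2vec(doc, w2v) for doc in docs_train])
X_test = np.array([avg_word2vec(doc, w2v) for doc in docs_test])

# Train the ML model and evaluate its predictions
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print('accuracy:', accuracy_score(y_test, clf.predict(X_test)))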
Basic code for preprocessing and cleaning the dataset:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # build the set once instead of on every call

def cleanData(txt):
    cleanTxt = re.sub(r'http\S+', ' ', txt)              # remove URLs
    cleanTxt = re.sub(r'@\S+', ' ', cleanTxt)            # remove @mentions
    cleanTxt = re.sub(r'#\S+', ' ', cleanTxt)            # remove hashtags
    cleanTxt = re.sub(r'[^A-Za-z0-9\s]', ' ', cleanTxt)  # keep only letters, digits, whitespace
    cleanTxt = cleanTxt.split()
    cleanTxt = [word for word in cleanTxt if word.lower() not in stop_words]  # drop stopwords
    lemma = WordNetLemmatizer()
    cleanTxt = [lemma.lemmatize(word.lower(), pos='v') for word in cleanTxt]  # lowercase + verb lemmatization
    cleanTxt = ' '.join(cleanTxt)
    return cleanTxt
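For example, on a hypothetical raw tweet:

print(cleanData('Loved the movie! http://example.com @user #mustwatch'))
# 'love movie'  (URL, mention, hashtag and punctuation removed; stopword dropped; verb lemmatized)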