Welcome to our NLP Translation Detection Project! In this thrilling (and sometimes confusing) adventure, we embarked on a quest to differentiate between Spanish sentences translated by machines and those translated by professional translators. Think of it as our way of saying, "We know your secret, Google Translate!"
We made our own preprocessing choices, trying different features and embeddings to crack the code. Spoiler alert: the Hugging Face transformer BERT model in Spanish was our hero.
- Data Preprocessing: Cleaned the text data by removing unnecessary characters and tokenizing it.
- Feature Engineering: Tried different features such as word count, punctuation frequency, and POS tags.
- Embeddings: Tested various embeddings: TF-IDF, word2vec, and BERT.
- Model Training: Trained several machine learning models, including:
  - Logistic Regression
  - Support Vector Machine (SVM)
  - Random Forest
  - Naive Bayes
- Model Evaluation: Compared the models and selected the SVM as the best performer.
- Prediction: Used the selected model to make predictions on REAL_DATA.txt.
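The steps above can be sketched end to end with scikit-learn. This is a minimal illustration on toy stand-in sentences (the real project worked on REAL_DATA.txt with richer features), pairing a TF-IDF embedding with the SVM that was ultimately selected:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-in data: label 1 = machine translation, 0 = human translation.
# The real dataset is not reproduced here.
texts = ["la casa es azul", "el gato duerme en la silla",
         "ella fue a el mercado ayer", "los niños juegan en el parque"] * 10
labels = [0, 1, 1, 0] * 10

# TF-IDF embedding over unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

# Train the model type the project settled on
model = SVC(kernel="linear")
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Swapping the vectorizer or the classifier here is a one-line change, which is what made it easy to compare Logistic Regression, SVM, Random Forest, and Naive Bayes on the same features.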
As a scientist with feature engineering embedded in my DNA 🥼, I found this aspect of machine learning the most intriguing. I love trying and testing different approaches to see if there's an improvement (or, as is often the case, not) 🤓.
During this project, I experimented with several feature extraction techniques, such as counting male and female pronouns. This was because automatic translations can sometimes introduce errors, like using a male pronoun regardless of the actual gender.
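A minimal sketch of that kind of feature, assuming a simple regex tokenizer and small hand-picked pronoun lists (the lists actually used in the project may have been longer):

```python
import re

# Small illustrative pronoun lists; an assumption, not the project's exact lists
MALE_PRONOUNS = {"él", "ellos", "suyo", "suyos"}
FEMALE_PRONOUNS = {"ella", "ellas", "suya", "suyas"}

def pronoun_counts(text: str) -> dict:
    """Count masculine and feminine Spanish pronouns in a sentence."""
    tokens = re.findall(r"\w+", text.lower())
    return {
        "male": sum(t in MALE_PRONOUNS for t in tokens),
        "female": sum(t in FEMALE_PRONOUNS for t in tokens),
    }

print(pronoun_counts("Ella dijo que él llegaría tarde"))  # {'male': 1, 'female': 1}
```

The two counts (or their ratio) can then be appended to the feature vector alongside word count and punctuation frequency.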
After trying different ways to process text, I realized that in this particular case, less is more—we decided not to lemmatize the text.
When it came to embeddings, the Hugging Face transformer BERT model in Spanish proved to be a game-changer in this adventure.
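Extracting sentence embeddings from a Spanish BERT via Hugging Face can look like the sketch below. The checkpoint name is an assumption (BETO, a widely used Spanish BERT; the write-up doesn't name the exact model), and mean pooling over the last hidden states is one common choice among several:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint: BETO, a popular Spanish BERT. Illustrative choice only.
MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one fixed-size vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = sentence_embedding("la casa es azul")
print(vec.shape)  # a 768-dimensional vector for bert-base models
```

These vectors can be fed to the same downstream classifiers as the TF-IDF features, which is what makes the comparison between embeddings straightforward.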