Welcome to our NLP Translation Detection Project! In this thrilling (and sometimes confusing) adventure, we embarked on a quest to differentiate between Spanish sentences translated by machines and those translated by professional translators. Think of it as our way of saying, "We know your secret, Google Translate!"
We made our own preprocessing choices, trying different features and embeddings to crack the code. Spoiler alert: the Hugging Face transformer BERT model in Spanish was our hero.
- Data Preprocessing: Cleaned the text data by removing unnecessary characters and tokenizing it.
- Feature Engineering: Tried different features such as word count, punctuation frequency, and POS tags.
- Embeddings: Tested various embeddings: TF-IDF, word2vec, and BERT.
- Model Training: Trained several machine learning models, including:
  - Logistic Regression
  - Support Vector Machine (SVM)
  - Random Forest
  - Naive Bayes
- Model Evaluation: Compared the models and selected the SVM as the best performer.
- Prediction: Used the selected model to make predictions on REAL_DATA.txt.
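The steps above can be sketched end to end with scikit-learn. This is a minimal illustration on toy stand-in sentences (the real project worked on REAL_DATA.txt with richer features), pairing a TF-IDF embedding with the SVM that was ultimately selected:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-in data: label 1 = machine translation, 0 = human translation.
# The real dataset is not reproduced here.
texts = ["la casa es azul", "el gato duerme en la silla",
         "ella fue a el mercado ayer", "los niños juegan en el parque"] * 10
labels = [0, 1, 1, 0] * 10

# TF-IDF embedding over unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

# Train the model type the project settled on
model = SVC(kernel="linear")
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Swapping the vectorizer or the classifier here is a one-line change, which is what made it easy to compare Logistic Regression, SVM, Random Forest, and Naive Bayes on the same features.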
As a scientist with feature engineering embedded in my DNA 🥼, I found this aspect of machine learning the most intriguing. I love trying and testing different approaches to see if there's an improvement (or, as is often the case, not) 🤓.
During this project, I experimented with several feature extraction techniques, such as counting male and female pronouns. This was because automatic translations can sometimes introduce errors, like using a male pronoun regardless of the actual gender.
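A minimal sketch of that kind of feature, assuming a simple regex tokenizer and small hand-picked pronoun lists (the lists actually used in the project may have been longer):

```python
import re

# Small illustrative pronoun lists; an assumption, not the project's exact lists
MALE_PRONOUNS = {"él", "ellos", "suyo", "suyos"}
FEMALE_PRONOUNS = {"ella", "ellas", "suya", "suyas"}

def pronoun_counts(text: str) -> dict:
    """Count masculine and feminine Spanish pronouns in a sentence."""
    tokens = re.findall(r"\w+", text.lower())
    return {
        "male": sum(t in MALE_PRONOUNS for t in tokens),
        "female": sum(t in FEMALE_PRONOUNS for t in tokens),
    }

print(pronoun_counts("Ella dijo que él llegaría tarde"))  # {'male': 1, 'female': 1}
```

The two counts (or their ratio) can then be appended to the feature vector alongside word count and punctuation frequency.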
After trying different ways to process text, I realized that in this particular case, less is more—we decided not to lemmatize the text.
When it came to embeddings, the Hugging Face transformer BERT model in Spanish proved to be a game-changer in this adventure.
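Extracting sentence embeddings from a Spanish BERT via Hugging Face can look like the sketch below. The checkpoint name is an assumption (BETO, a widely used Spanish BERT; the write-up doesn't name the exact model), and mean pooling over the last hidden states is one common choice among several:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint: BETO, a popular Spanish BERT. Illustrative choice only.
MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one fixed-size vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = sentence_embedding("la casa es azul")
print(vec.shape)  # a 768-dimensional vector for bert-base models
```

These vectors can be fed to the same downstream classifiers as the TF-IDF features, which is what makes the comparison between embeddings straightforward.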