Welcome to my NLP Translation Detection Project! We embarked on a quest to differentiate between Spanish sentences translated by machines and those translated by professional translators. The Hugging Face transformer BERT was a game-changer in this adventure.

Lylrg/natural-language-processing

NLP Translation Detection Project

Project Description

Welcome to our NLP Translation Detection Project! In this thrilling (and sometimes confusing) adventure, we embarked on a quest to differentiate between Spanish sentences translated by machines and those translated by professional translators. Think of it as our way of saying, "We know your secret, Google Translate!"

We made our own choices and text treatments, trying different features and embeddings to crack the code. Spoiler alert: The Hugging Face transformer BERT model in Spanish was our hero.

Methodology

  1. Data Preprocessing: Cleaned the text data by removing unnecessary characters, and performed tokenization.
  2. Feature Engineering: Tried different features such as word count, punctuation frequency, and POS tagging.
  3. Embeddings: Tested various text representations: TF-IDF, word2vec, and the BERT model.
  4. Model Training: Trained several machine learning models, including:
    • Logistic Regression
    • Support Vector Machine (SVM)
    • Random Forest
    • Naive Bayes classifier
  5. Model Evaluation: Compared the trained models and selected the SVM as the best performer.
  6. Prediction: Used the best model to make predictions on REAL_DATA.txt.
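
To make steps 1 and 2 more concrete, here is a minimal sketch of the cleaning, tokenization, and hand-crafted features (word count and punctuation frequency). It is not the project's actual code: the function names `clean`, `tokenize`, and `extract_features`, the exact cleaning rule, and the choice of punctuation marks are assumptions made for illustration. POS tagging is left out because it needs an external tagger.

```python
import re
import string

# Assumption: the punctuation set to count. string.punctuation is ASCII-only,
# so the inverted Spanish marks are added by hand.
PUNCTUATION = set(string.punctuation) | {"¿", "¡"}


def clean(text: str) -> str:
    """Keep letters, digits, whitespace, and punctuation; collapse whitespace."""
    kept = [ch for ch in text if ch.isalnum() or ch.isspace() or ch in PUNCTUATION]
    return " ".join("".join(kept).split())


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (accents are preserved)."""
    return re.findall(r"\w+", text.lower())


def extract_features(sentence: str) -> dict[str, float]:
    """Hypothetical feature set: word count and punctuation frequency."""
    cleaned = clean(sentence)
    tokens = tokenize(cleaned)
    punctuation_count = sum(ch in PUNCTUATION for ch in cleaned)
    return {
        "word_count": len(tokens),
        "punctuation_count": punctuation_count,
        # Guard against empty sentences to avoid dividing by zero.
        "punctuation_per_word": punctuation_count / max(len(tokens), 1),
    }


if __name__ == "__main__":
    print(extract_features("¿Dónde   está el\tlibro? ©"))
```

Features like these can be stacked next to TF-IDF or embedding vectors before training a classifier. Whether they help is an empirical question for each dataset.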

Insights on this Project

As a scientist with feature engineering embedded in my DNA 🥼, I found this aspect of machine learning the most intriguing. I love trying and testing different approaches to see if there's an improvement (or, as is often the case, not) 🤓.

During this project, I experimented with several feature extraction techniques, such as counting male and female pronouns. This was because automatic translations can sometimes introduce errors, like using a male pronoun regardless of the actual gender.
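
As a rough illustration of the pronoun feature, the sketch below counts masculine and feminine subject pronouns in a sentence. The pronoun lists and the `pronoun_counts` name are assumptions, not the project's code. Object pronouns such as "lo" and "la" are deliberately left out because they double as articles, and accents are preserved so that the pronoun "él" is not confused with the article "el".

```python
import re
import unicodedata
from collections import Counter

# Assumption: a small, unambiguous list of gendered subject pronouns.
MASCULINE = {"él", "ellos", "nosotros", "vosotros"}
FEMININE = {"ella", "ellas", "nosotras", "vosotras"}


def pronoun_counts(sentence: str) -> dict[str, int]:
    """Count masculine and feminine subject pronouns in a Spanish sentence."""
    # Normalize to NFC so an accented letter is a single code point,
    # whichever way the input encoded it.
    normalized = unicodedata.normalize("NFC", sentence).lower()
    counts = Counter(re.findall(r"\w+", normalized))
    return {
        "masculine": sum(counts[p] for p in MASCULINE),
        "feminine": sum(counts[p] for p in FEMININE),
    }


if __name__ == "__main__":
    print(pronoun_counts("Ella dijo que él y ellos vendrían con el libro."))
```

A strong skew toward masculine pronouns across a text could then serve as one signal among many, but it is not a reliable indicator on its own.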

After trying different ways to process text, I realized that in this particular case, less is more: we decided not to lemmatize the text.

When it came to embeddings, the Hugging Face transformer BERT model in Spanish proved to be a game-changer in this adventure.
