Given a query and a result list of products retrieved for this query, classify each product as being an Exact, Substitute, Complement, or Irrelevant match for the query.
- Arya Mhaiskar (Lead)
- Rayyan Ashraf
- Lucy Yin
- Isabella Qian
- Lily Zhou
Challenge Advisor: Chen Luo, Sr. Applied Scientist at Amazon Search
Teaching Assistant: Vaibhav Tiwari
Dataset 1: Labeled dataset with user search query to product ID mappings
Dataset 2: Product metadata including the product ID, title, description, brand, and color
Languages: User search queries and product metadata in English, Spanish, and Japanese
Source: Task 2 from Amazon KDD Cup'22 Challenge
- Data cleaning (sketch below)
  - Merged datasets, removed unnecessary columns
  - Converted remaining columns to lowercase
  - Removed HTML tags and non-alphanumeric characters
  - Removed stopwords
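A minimal sketch of these cleaning steps, assuming pandas and nltk. The file names, column names (`query`, `product_title`, `product_description`), and the list of dropped columns are illustrative stand-ins, not the exact KDD Cup schema.

```python
import re

import pandas as pd
from nltk.corpus import stopwords  # assumes the nltk stopwords corpus is downloaded

# Illustrative file and column names, not the exact KDD Cup ones
labels = pd.read_csv("query_product_labels.csv")   # query, product_id, esci_label
products = pd.read_csv("product_catalogue.csv")    # product_id, title, description, brand, color

# Merge the label file with the product metadata and drop columns we do not model on
merged = labels.merge(products, on="product_id", how="inner")
merged = merged.drop(columns=["product_brand", "product_color"], errors="ignore")

stop_en = set(stopwords.words("english"))  # repeated per language in the real pipeline

def clean_text(text: str) -> str:
    text = str(text).lower()
    text = re.sub(r"<[^>]+>", " ", text)    # strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)    # drop punctuation/symbols (\w is Unicode-aware)
    return " ".join(t for t in text.split() if t not in stop_en)

for col in ["query", "product_title", "product_description"]:
    merged[col] = merged[col].map(clean_text)
```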
- Data preprocessing (sketch below)
  - Stemming
    - nltk PorterStemmer for English, SnowballStemmer for Spanish
    - MeCab for Japanese
  - Lemmatizing
    - nltk WordNetLemmatizer for English, spaCy for Spanish
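A sketch of the per-language stemming and lemmatization step. The locale codes (`us`/`es`/`jp`), the spaCy model name (`es_core_news_sm`), and the MeCab flags are assumptions for illustration; running it requires the nltk wordnet data, a spaCy Spanish model, and mecab-python3.

```python
import MeCab                                  # mecab-python3
import spacy
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

porter = PorterStemmer()
snowball_es = SnowballStemmer("spanish")
wordnet = WordNetLemmatizer()                 # needs the nltk "wordnet" corpus
nlp_es = spacy.load("es_core_news_sm")        # assumed Spanish spaCy model
mecab = MeCab.Tagger("-Owakati")              # whitespace-separated morphemes

def stem(text: str, locale: str) -> str:
    if locale == "us":                        # English
        return " ".join(porter.stem(t) for t in text.split())
    if locale == "es":                        # Spanish
        return " ".join(snowball_es.stem(t) for t in text.split())
    if locale == "jp":                        # Japanese: MeCab handles segmentation
        return mecab.parse(text).strip()
    return text

def lemmatize(text: str, locale: str) -> str:
    if locale == "us":
        return " ".join(wordnet.lemmatize(t) for t in text.split())
    if locale == "es":
        return " ".join(tok.lemma_ for tok in nlp_es(text))
    return text                               # Japanese already segmented by MeCab
```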
- BERT Base Multilingual Cased (sketch below)
  - Used the same model's tokenizer for language-independent tokenization
  - Created TensorFlow Dataset objects for training and testing
  - Fine-tuned the model on the training data
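A condensed sketch of the fine-tuning setup, assuming Hugging Face transformers with the TensorFlow backend. The dataframe and column names, sequence length, batch size, learning rate, and epoch count are illustrative, not the exact values used.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = TFBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)  # E/S/C/I

def to_tf_dataset(queries, titles, labels, batch_size=16):
    # Tokenize query/title pairs with the same multilingual tokenizer as the model
    enc = tokenizer(list(queries), list(titles), truncation=True,
                    padding="max_length", max_length=128, return_tensors="tf")
    ds = tf.data.Dataset.from_tensor_slices((dict(enc), labels))
    return ds.shuffle(10_000).batch(batch_size)

# train_df is the preprocessed, merged dataframe; "label" holds integer-encoded ESCI classes
train_ds = to_tf_dataset(train_df["query"], train_df["product_title"],
                         train_df["label"].values)

# With no loss argument, recent transformers versions fall back to the model's
# built-in classification loss during fit()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5))
model.fit(train_ds, epochs=2)
```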
- Logistic Regression (example below)
  - TF-IDF and CountVectorizer for language-independent tokenization
  - Trained the model on the tokenized training data
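A sketch of the classical pipeline, assuming scikit-learn; concatenating the query with the product title is one illustrative feature choice, and CountVectorizer can be swapped in for TfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer  # or CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative feature: one text field per (query, product) pair.
# Japanese text is already whitespace-segmented by MeCab upstream,
# so the default word tokenizer works across all three languages.
text_train = train_df["query"] + " " + train_df["product_title"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=100_000)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(text_train, train_df["label"])
```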
Used the micro-averaged F1 score to account for the imbalance across the four label classes (computed as shown below)
F1-scores:
Baseline BERT scores are from bert-base-multilingual-cased prior to fine-tuning
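The metric itself comes straight from scikit-learn; `y_true` and `y_pred` below are placeholders for the held-out labels and either model's predictions.

```python
from sklearn.metrics import f1_score

# y_true: gold ESCI labels for the test split; y_pred: predictions from either model
micro_f1 = f1_score(y_true, y_pred, average="micro")
print(f"Micro-averaged F1: {micro_f1:.4f}")
```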
- Logistic Regression (a purely classical ML approach) and fine-tuned BERT perform equally well on this task, which was a novel, unexpected finding
- Logistic Regression takes considerably less training time and memory than BERT
- Google Colab GPU unit limits in the free tier
- We ran into several RAM overflow issues during the preprocessing and training phases. We mitigated these by randomly sampling from our large-scale datasets and by splitting preprocessing and modeling into separate notebooks (see the snippet below)
  - Saved the preprocessed, merged dataset as a separate file we could load in the modeling notebook without wasting available RAM
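A sketch of that workaround, assuming the `merged` dataframe from the cleaning sketch above; the sample size and file name are illustrative.

```python
import pandas as pd

# Downsample the cleaned, merged dataframe so it fits in Colab free-tier RAM
sample = merged.sample(n=200_000, random_state=42)   # illustrative sample size

# Persist the result so the modeling notebook never re-runs the merge/cleaning steps
sample.to_csv("preprocessed_sample.csv", index=False)

# In the modeling notebook:
train_df = pd.read_csv("preprocessed_sample.csv")
```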
- With a more powerful local environment, or by purchasing more GPU units on Colab, we can further improve our models' performance by training on more data