- Angela Detweiler
- Hee Kang
- Alexander Lam
- Behesteh Mostaghni
Dataset link: Yelp Dataset on Kaggle, with a focus on restaurants: https://www.kaggle.com/yelp-dataset/yelp-dataset
Problem: When you are researching restaurants on Yelp, do you look at the star rating or do you read the review? Do you look at both? Given that reviews are highly subjective, and star ratings can be influenced by various aspects of business performance, can we use machine learning to standardize the interpretation of reviews?
Goal: Our goal is to feed Natural Language Processing (NLP) features and other features from Yelp reviews into a model that outputs a new 5-star rating, so that there is less discrepancy between review text and star ratings. To make the model more robust, we will also incorporate new star ratings from users who read a review but did not write it. Because these ratings are based on the review text alone, they should help the model better reflect review sentiment.
Hypothesis: We hypothesize that automating star ratings based on NLP of restaurant reviews will improve the Yelp review experience by normalizing reviewer sentiment.
ML algorithms and NLP techniques:
- Naive Bayes
- k-NN
- k-means
- LSTM
- N-grams
- TF-IDF
- Linear Regression
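As a rough illustration of the two text-feature techniques listed above, the sketch below extracts n-grams and computes TF-IDF weights using only the Python standard library. The function names `ngrams` and `tfidf` and the three review snippets are hypothetical, and the unsmoothed formula `idf = log(N / df)` is an assumption; in the project itself a library such as scikit-learn would likely do this work, usually with smoothing applied.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf is the term count divided by document length, and
    idf = log(N / df) with no smoothing (an assumption of this sketch).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical review snippets, not taken from the Yelp dataset.
reviews = [
    "the food was great",
    "the service was slow",
    "great food great service",
]
tokenized = [r.split() for r in reviews]
bigrams = ngrams(tokenized[0], 2)
weights = tfidf(tokenized)
```

In the second snippet, "slow" appears in only one document, so it receives a higher weight than "the", which appears in two.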
Libraries:
- NumPy
- SciPy
- scikit-learn
- Pandas
- Matplotlib
- NLTK
- PySpark
- Keras
- HTML/CSS/Bootstrap
- Tableau
Sentiment analysis lexicons:
- AFINN
- VADER
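Lexicon-based sentiment analysis assigns each word a valence and aggregates those values over a text. The sketch below shows the idea in its simplest form, summing integer valences in the AFINN style. The `TOY_LEXICON` values are illustrative assumptions and are not copied from the real AFINN list, and the code does not attempt VADER's handling of negation or intensifiers.

```python
import re

# Hypothetical mini-lexicon in the AFINN style (integer valence per word).
# These values are illustrative and are NOT taken from the real AFINN list.
TOY_LEXICON = {
    "great": 3, "good": 2, "delicious": 3, "friendly": 2,
    "slow": -1, "bad": -2, "rude": -3, "terrible": -3,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every lexicon word found in the text."""
    return sum(lexicon.get(tok, 0) for tok in tokenize(text))


positive = lexicon_score("Great food and friendly staff!")
negative = lexicon_score("Terrible, rude service.")
mixed = lexicon_score("Delicious food but slow, rude service.")
```

A plain sum like this treats a mixed review as mildly negative or positive depending on word counts, which is one reason to combine lexicon scores with a trained model.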
Project components, steps, analyses, and final products:
Components and final products:
- ML algorithms
- Game (user rates reviews)/HTML page
- Database with game data to be reincorporated into model
- Model output/visualizations in Jupyter Notebook
Steps and analyses:
- Select and clean restaurant/food category data from Yelp
- Cluster reviews into 5 categories (one per star rating)
- Use NLP to train model
- Test on Yelp rating/review data (where the user supplies both the rating and the review)
- Incorporate new user star-rating from game into the model
- Other...
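The train, test, and retrain steps above could look roughly like the following minimal sketch. It uses a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The class name `NaiveBayesStars`, the training texts, and the game ratings are all hypothetical, and the real project would presumably use a library model trained on the cleaned Yelp data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesStars:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, stars):
        self.class_counts = Counter(stars)
        self.word_counts = defaultdict(Counter)
        for text, star in zip(texts, stars):
            self.word_counts[star].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_star, best_logp = None, -math.inf
        for star, n_docs in self.class_counts.items():
            counts = self.word_counts[star]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(n_docs / total_docs)  # class prior
            for word in text.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    logp += math.log((counts[word] + 1) / denom)
            if logp > best_logp:
                best_star, best_logp = star, logp
        return best_star


# Hypothetical training data standing in for cleaned Yelp reviews.
train_texts = [
    "terrible food rude staff",
    "bad service slow kitchen",
    "okay food average service",
    "good food friendly staff",
    "great food amazing service",
]
train_stars = [1, 2, 3, 4, 5]
model = NaiveBayesStars().fit(train_texts, train_stars)
first_prediction = model.predict("great amazing food")

# Hypothetical ratings collected from the game, folded back into training.
game_texts = ["slow kitchen cold food"]
game_stars = [2]
model.fit(train_texts + game_texts, train_stars + game_stars)
second_prediction = model.predict("slow cold kitchen")
```

Refitting on the combined data is the simplest way to reincorporate game ratings. Weighting game ratings differently from the original Yelp stars would be a separate design decision.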
Questions/Topics of Interest:
- (ML) Are Yelp reviews highly correlated with restaurant quality (based on star rating)? In other words, are the reviews useful?
- What percentage of reviews talk about the quality of the food versus the quality of the service?
- Do photo captions correlate with reviews?
- (ML) Is there consistency in review style for a particular user?
- Distribution of ratings (stars): Is it a bell curve, or does it peak at one or both extremes (1-star and/or 5-star ratings)?
- (ML) Is there a pattern to Yelp Elite status? Elite vs non-elite.
- Patterns in ratings/review sentiment correlated to business attributes? (Outdoor seating, live music, etc.)
- Patterns in 'useful' reviews?
- Use NLP to train and test a model, then have humans rate the same reviews and compare the differences
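Two of the questions above, the shape of the star distribution and the gap between model and human ratings, reduce to simple summary statistics. The sketch below is one possible way to compute them. The function names and both rating lists are hypothetical, and mean absolute difference and exact agreement are assumed metrics rather than ones the project has settled on.

```python
from collections import Counter


def star_distribution(stars):
    """Return the share of ratings at each star level from 1 to 5."""
    counts = Counter(stars)
    return {s: counts[s] / len(stars) for s in range(1, 6)}


def mean_absolute_difference(a, b):
    """Average absolute gap, in stars, between two paired rating lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def exact_agreement(a, b):
    """Fraction of paired ratings that match exactly."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical paired ratings for the same eight reviews.
model_stars = [5, 4, 1, 3, 2, 5, 1, 4]
human_stars = [5, 3, 1, 3, 1, 4, 1, 4]

distribution = star_distribution(human_stars)
gap = mean_absolute_difference(model_stars, human_stars)
agreement = exact_agreement(model_stars, human_stars)
```

A distribution with most of its mass at 1 and 5 stars would point to the two-peaked shape raised above, while a peak near 3 would look closer to a bell curve.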