course-dprep · lucia-ramos-dominguez · Sep 2, 2024 · Sep 3, 2024 · Sep 3, 2024 · Sep 3, 2024
diff --git a/Data Preparation.Rmd b/Data Preparation.Rmd
@@ -0,0 +1,110 @@
+---
+title: "Data preparation"
+output: pdf_document
+date: "2024-09-12"
+---
+
+
+```{r}
+library(googledrive)
+
+# Authenticate with Google Drive
+drive_auth()
+
+# Folder ID from your Google Drive link
+folder_id <- "1ioJVCsr5pJ5tAa2dPJ9yxIvL6rYmDSl1"
+
+# List all files in the folder
+files_in_folder <- drive_ls(as_id(folder_id))
+
+# Filter for the file named 'yelp_academic_dataset_business.csv'
+csv_file <- files_in_folder[files_in_folder$name == "yelp_academic_dataset_business.csv", ]
+
+# Check if the file exists before attempting download
+if (nrow(csv_file) > 0) {
+  # Download the CSV file to the working directory
+  drive_download(as_id(csv_file$id), path = file.path(getwd(), "yelp_academic_dataset_business.csv"), overwrite = TRUE)
+  cat("File downloaded successfully.")
+} else {
+  cat("The file 'yelp_academic_dataset_business.csv' was not found in the folder.")
+}
+
+business <- read.csv("yelp_academic_dataset_business.csv")
+```
+
+
+```{r explore-classes}
+
+# Quick view of the data
+head(business$hours)
+tail(business)
+colnames(business)
+
+#Checking the classes of our variables
+
+class(business$business_id) #character
+class(business$hours)       #character
+class(business$postal_code) #character
+class(business$is_open)     #integer
+class(business$address)     #character
+class(business$categories)  #character
+class(business$latitude)    #numeric
+class(business$city)        #character
+class(business$longitude)   #numeric
+class(business$state)       #character
+class(business$review_count)#integer
+class(business$name)        #character
+class(business$stars)       #numeric
+class(business$attributes)  #character
+
+```
+```{r}
+# Install dplyr only when it is missing, so knitting does not reinstall it every time
+if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
+library(dplyr)
+
+# Remove the columns that are not needed for the research
+
+business <- business %>% select(-latitude, -longitude, -hours)
+
+# Rearrange the remaining columns in a logical order
+
+business <- business %>% select(business_id, name, is_open, address, postal_code, city, state, categories, review_count, stars, attributes)
+
+```
+
+```{r}
+# Create a logical dummy variable: TRUE if the business is open, FALSE if it is closed
+
+business$dummy_open <- business$is_open == 1
+
+# Keep only the open businesses, i.e. drop the rows where the dummy is FALSE
+
+business_filtered <- business[business$dummy_open == TRUE,]
+
+# The helper column is no longer needed
+business_filtered$dummy_open <- NULL
+
+```
+
+```{r}
+# We have to decide when a business has enough reviews for its rating to be considered valid,
+# so we need to set a benchmark for review_count in our dataset.
+
+sum(business_filtered$review_count >= 50, na.rm = TRUE)
+# output: 23396
+
+sum(business_filtered$review_count < 50, na.rm = TRUE)
+
+# output: 96302
+
+# TODO: filter the dataset on this benchmark
+
+
+```
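+
+A minimal sketch of how the benchmark could be applied once it is chosen. It assumes a cutoff of 50 reviews, the value used in the counts above, which may still change. The data frame `toy` and the names `min_reviews` and `toy_valid` are hypothetical and only stand in for `business_filtered`.
+
+```{r}
+# Hypothetical toy data standing in for business_filtered
+toy <- data.frame(
+  business_id  = c("a", "b", "c", "d"),
+  review_count = c(12L, 50L, 230L, 49L),
+  stars        = c(4.5, 3.0, 4.0, 2.5)
+)
+
+# Assumed benchmark: a business needs at least this many reviews
+min_reviews <- 50
+
+# Keep only the businesses that reach the benchmark
+toy_valid <- toy[toy$review_count >= min_reviews, ]
+nrow(toy_valid)
+```
+
+Storing the cutoff in a single variable makes it easy to rerun the analysis with a different benchmark.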
+
+```{r}
+
+business_filtered$categories[4]
+business_filtered$attributes[4]
+
+```
+
diff --git a/DataExploration.Rmd b/DataExploration.Rmd
@@ -0,0 +1,142 @@
+---
+title: "Data Exploration"
+output:
+  html_document: default
+  pdf_document: default
+date: "2024-09-12"
+author: "Team 4"
+---
+
+```{r setup, include=FALSE, results='hide'}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+
+Load the data. For now it is read from a local file; this will later be replaced by the download from Google Drive.
+
+```{r, echo = FALSE, results='hide', message=FALSE, warning=FALSE}
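+# Expects yelp_academic_dataset_business.csv in the same folder as this .Rmd file.
+# A machine-specific setwd() is avoided so that the document knits on any computer.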
+library(readr)
+data_business <- read_csv("yelp_academic_dataset_business.csv")
+
+```
+# Summary Statistics 
+### __Stars__ 
+
+This section reports the key statistics for the stars column and plots its distribution. Because our research focuses on what influences these ratings, it is important to understand this variable well.
+
+```{r, echo = FALSE, results = 'hide'}
+library(ggplot2)
+summary(data_business$stars)
+mean_stars <- mean(data_business$stars, na.rm = TRUE)
+median_star <- median(data_business$stars, na.rm = TRUE)
+min_starvalue <- min(data_business$stars, na.rm = TRUE)
+max_starvalue <- max(data_business$stars, na.rm = TRUE)
+rounded_mean_star <- round(mean_stars,2)
+
+print(paste("Mean Star Rating:", mean_stars))
+print(paste("Rounded Mean Star Rating:", rounded_mean_star))
+print(paste("Median Star Rating:", median_star))
+print(paste("Maximum Star Rating:", max_starvalue))
+print(paste("Minimum Star Rating:",min_starvalue))
+```
+```{r, echo = FALSE}
+# Create a data frame for structured output
+summary_df <- data.frame(
+  Statistic = c("Mean Star Rating", "Rounded Mean Star Rating", "Median Star Rating", "Maximum Star Rating", "Minimum Star Rating"),
+  Value = c(mean_stars, rounded_mean_star, median_star, max_starvalue, min_starvalue)
+)
+
+# Print the summary table with better visuals
+knitr::kable(summary_df, format = "markdown", caption = "Summary of Star Ratings")
+```
+```{r, echo=FALSE}
+ggplot(data_business, aes(x = stars)) +
+  geom_histogram(binwidth = 0.5, fill = "blue", color = "black", alpha = 0.7) +  # ratings come in half-star steps
+  labs(title = "Distribution of Star Ratings", x = "Star Ratings", y = "Frequency") +
+  theme_minimal()
+```
+
+As the graph shows, the most common rating received by businesses on Yelp is 4 stars, whereas 1-star ratings are the least common.
+
+
+### __States__ 
+
+This section shows how the businesses are distributed across the different states in the USA, since the Yelp reviews come from these locations.
+
+```{r, echo=FALSE, message=FALSE, warning=FALSE}
+library(ggplot2)
+data_business$state <- as.factor(data_business$state)
+ggplot(data_business, aes(x = state)) +
+  geom_bar(fill = "blue", color = "black", alpha = 0.7) +
+  labs(title = "Count of Businesses by State", 
+       x = "State", 
+       y = "Count") +
+  theme_minimal()
+```
+
+This figure allows us to better understand the geographical distribution of the businesses, which might be of interest when assessing the reviews and ratings.
+
+
+### __Categories__ 
+
+
+Only the top 20 categories are depicted in the following table for illustrative purposes. 
+
+```{r, echo=FALSE, warning=FALSE, message=FALSE}
+library(dplyr)
+library(tidyr)
+library(knitr)
+
+# Split the categories and count the occurrences
+category_counts <- data_business %>%
+  mutate(categories = strsplit(as.character(categories), ", ")) %>% # Split by comma and space
+  unnest(categories) %>% # Transform the list column into rows
+  count(categories, sort = TRUE) 
+
+# Keep the 20 most frequent categories
+top_categories <- category_counts %>%
+  top_n(20, n) %>%  
+  arrange(desc(n))  
+
+kable(top_categories, col.names = c("Category", "Count"),
+      caption = "Count of Top 20 Categories", 
+      format = "markdown")
+```
+
+
+The code to plot all business categories is included below but hidden from the rendered output.
+
+```{r, echo=FALSE, results='hide', include=FALSE}
+library(ggplot2)
+
+ggplot(category_counts, aes(x = reorder(categories, n), y = n)) +
+  geom_bar(stat = "identity", fill = "blue", color = "black", alpha = 0.7) +
+  labs(title = "Count of Categories", 
+       x = "Categories", 
+       y = "Count") +
+  coord_flip() + 
+  theme_minimal()
+```
+
+To keep the plot readable, only the __Top 20 categories__, i.e. the business categories that appear most often on Yelp, are shown in the figure below.
+
+```{r, echo=FALSE}
+top_categories <- category_counts %>%
+  top_n(20, n) # You can also use slice_max(n, n = 20) from dplyr 1.0.0 or higher
+
+ggplot(top_categories, aes(x = reorder(categories, n), y = n)) +
+  geom_bar(stat = "identity", fill = "blue", color = "black", alpha = 0.7) +
+  labs(title = "Top 20 Categories", 
+       x = "Categories", 
+       y = "Count") +
+  coord_flip() + 
+  theme_minimal()
+```
+
diff --git a/README.md b/README.md
@@ -0,0 +1,59 @@
+# Yelp - What influences consumer ratings?
+This project examines the influence of specific business attributes on the consumer ratings of businesses. We created a model to predict the relative impact of the business attributes listed on Yelp on the Yelp consumer ratings of these businesses.
+
+## Research Motivation
+In the digital age, consumer ratings and online reviews have become one of the most influential sources of information shaping consumer expectations and purchase decisions. In an increasingly competitive digital market, it is therefore critical to understand which business attributes drive positive and negative consumer ratings. To identify these attributes, we use a large open dataset created by Yelp.
+
+Yelp serves as a public forum where consumers can share their experiences and evaluate businesses. Because Yelp is widely used in recreation industries such as restaurants, retail, and services, it is one of the top websites consumers visit for trustworthy reviews. The Yelp dataset provides us with various variables (including name, location, categories, attributes, and ratings) for over 150,000 businesses across the United States of America. By analyzing this data, we hope to find managerial insights that help us understand the importance of specific business attributes for online consumer ratings.
+
+This research emphasizes the importance of using data to drive business decisions, rather than relying on anecdotal feedback or industry trends. This aligns with the growing trend toward data-driven marketing, where decisions are based on hard evidence rather than intuition.
+
+### Research Question
+*Which business attributes have the strongest impact on Yelp's consumer ratings within the United States of America?*
+
+## Data
+The data used in this research is an open dataset provided by Yelp. It includes various business variables, such as names, opening hours, addresses, star ratings, and attributes, for more than 150,000 recreational businesses in the United States of America. These businesses cover an extremely broad range of categories, from massage salons to casinos, and from cheese tastings to tattoo shops. This wide variety of business types allows us to analyse the influence of specific business attributes on consumer ratings at a very broad level, which means that our findings are applicable to a broad range of businesses and industries.
+
+The table below lists the variables in the raw dataset with a brief description of each. The variables appear in the same order as in the original dataset:
+
+|Variable      |Description                                                                                                                 |
+|--------------|----------------------------------------------------------------------------------------------------------------------------|
+|business_id   |The unique Yelp ID of each registered business                                                                              |
+|hours         |The opening hours of the business, using a 24-hour clock                                                                    |
+|postal_code   |The postal code of the business                                                                                             |
+|is_open       |A dummy variable that indicates whether the business is still open: 0 means permanently closed, 1 means open for business   |
+|address       |The address of the business                                                                                                 |
+|categories    |A list of Yelp business categories the business is classified as                                                            |
+|latitude      |The geographical latitude of the business                                                                                   |
+|city          |The city the business is located in                                                                                         |
+|longitude     |The geographical longitude of the business                                                                                  |
+|state         |The American state the business is located in                                                                               |
+|review_count  |The number of Yelp reviews the business has                                                                                 |
+|name          |The name of the business                                                                                                    |
+|stars         |The business rating in stars, ranging from 1 to 5 and rounded to half-stars                                                 |
+|attributes    |A list of Yelp business attributes and whether the business has each attribute or not                                       |
+
+The Yelp dataset used for this research can be downloaded [here](https://drive.google.com/drive/folders/1ioJVCsr5pJ5tAa2dPJ9yxIvL6rYmDSl1?usp=sharing) (file: `yelp_academic_dataset_business.csv`).
+
+## Research Method
+To analyse the influence of specific business attributes on consumer ratings on Yelp, we will estimate a linear regression model. First, we will recode the variable `attributes` into separate dummy variables, one per attribute, indicating whether each business possesses that attribute (1 if possessed, 0 if not). We will then regress the dependent variable `stars` on all of these attribute dummies to find the relative impact of each attribute on consumer ratings.
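+
+The recoding step could look roughly like the sketch below. The text format of the `attributes` column is an assumption here (a Python-style dictionary string such as `{'WiFi': 'True'}`), and `toy` and `has_attribute()` are hypothetical names used for illustration only; they are not part of the dataset or of any package.
+
+```r
+# Toy data; the string format of `attributes` is an assumption
+toy <- data.frame(
+  stars = c(4.5, 3.0, 4.0),
+  attributes = c(
+    "{'WiFi': 'True', 'OutdoorSeating': 'False'}",
+    "{'WiFi': 'False'}",
+    "{'OutdoorSeating': 'True', 'WiFi': 'True'}"
+  ),
+  stringsAsFactors = FALSE
+)
+
+# Hypothetical helper: 1 if the attribute is listed as 'True', otherwise 0
+has_attribute <- function(attr_string, key) {
+  pattern <- paste0("'", key, "': 'True'")
+  as.integer(grepl(pattern, attr_string, fixed = TRUE))
+}
+
+# One dummy column per attribute of interest
+toy$wifi <- has_attribute(toy$attributes, "WiFi")
+toy$outdoor_seating <- has_attribute(toy$attributes, "OutdoorSeating")
+toy[, c("stars", "wifi", "outdoor_seating")]
+```
+
+In this sketch an attribute that is not listed for a business is coded as 0, the same as an attribute listed as `'False'`. Whether that is appropriate for the real data is still an open choice.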
+
+## Relevance
+The findings of this research have several important implications for marketing strategies. First of all, they could enhance customer experience by identifying key drivers of high ratings. Businesses could focus on improving the attributes that are found to have a strong effect on consumer ratings, which allows for targeted improvements rather than broad, inefficient changes. The insights of this research could also strengthen a company's branding and communication strategy. If specific attributes are found to be significantly important for higher consumer ratings, businesses can highlight these attributes in their promotional material to attract more consumers.
+
+## Prediction Model
+
+## Repository Overview
+
+## Dependencies
+
+## Running the Code
+
+## Authors
+This repository is produced by group 4 of the course Data Preparation & Workflow Management, taught by Hannes Datta at Tilburg University. This course is part of the Master's program Marketing Analytics. The group members and authors of this repository are:
+- Claudia van Hoof
+- Beste Özyürekoğlu
+- Lucía Ramos Dominguez
+- Ashley Saarloos
+- Renske Vincken
diff --git a/Team Project dPrep.rmd b/Team Project dPrep.rmd
@@ -0,0 +1,50 @@
+---
+title: "Team Project Dprep"
+author: "Team 4"
+date: "2024-09-05"
+output: pdf_document
+---
+
+## Team Project Week 2
+
+This is the week 2 deliverable of the dPrep team project by group 4.
+
+```{r}
+library(readr)
+# Use readr's read_csv(), since readr is the package loaded above
+data_business <- read_csv("data/yelp_academic_dataset_business.csv")
+summary(data_business)
+```
+This dataset contains data from the website Yelp on a wide variety of businesses. It includes the following variables, each with a short description:
+
+* __Business_id__: The unique id of each business
+* __Hours__: The opening hours of the business
+* __Postal_code__: The postal code of the business
+* __Is_open__: If the business is currently open
+* __Address__: The address of the business
+* __Categories__: The categories the business is classified as
+* __Latitude__: The geographical latitude of the business
+* __City__: The city the business is located in
+* __Longitude__: The geographical longitude of the business
+* __State__: The state the business is located in
+* __Review_count__: The number of reviews the business has on Yelp
+* __Name__: The name of the business
+* __Stars__: The star rating of the business on Yelp
+* __Attributes__: The attributes of the business
+
+# MOTIVATION
+Our team decided to use the dataset from Yelp. After downloading the data, and scanning through it, it was decided that the subset "yelp_academic_dataset_business.csv" was the most interesting to use for our research. 
+
+__Possible research question__ 
+Which business attributes lead to a higher star rating? 
+
+We found it interesting to see whether specific attributes lead to higher ratings for businesses, for example whether businesses that offer “Free Wifi” obtain higher ratings than those that do not. This is what we want to study.
+
+__Research Method__
+Regression analysis:
+
+* Converting each attribute into a dummy variable that indicates whether it is present or not 
+* Seeing what effect the presence of these attributes has on the ratings 
+* We will use the column "Stars" as well as "Attributes" from the dataset to conduct this analysis. 
+* The attributes of a business will be the independent variables (e.g., “Free Wi-Fi,” “Parking,” “Take out”), and the dependent variable will be the business rating (star rating from 1 to 5) 
+* By using a regression analysis we will be able to quantify the impact of each attribute on the rating.
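+
+The steps above can be sketched on a small made-up example. The values in `toy` are invented for illustration and are not results from the Yelp data; `wifi` and `parking` are hypothetical dummy columns.
+
+```{r}
+# Invented toy data: one row per business, dummies are 1 if the attribute is present
+toy <- data.frame(
+  stars   = c(4.5, 4.0, 3.0, 3.5, 5.0, 2.5),
+  wifi    = c(1, 1, 0, 0, 1, 0),
+  parking = c(1, 0, 1, 0, 1, 0)
+)
+
+# Regress the star rating on the attribute dummies
+model <- lm(stars ~ wifi + parking, data = toy)
+
+# One coefficient per dummy, plus the intercept
+coef(model)
+```
+
+Each coefficient estimates the difference in average star rating between businesses with and without that attribute, holding the other attribute fixed.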
+