First addition to the data exploration #11

Open
wants to merge 43 commits into
base: main
43 commits
7f6e772
Initial commit
github-classroom[bot] Sep 2, 2024
c80d5d8
add online IDE url
github-classroom[bot] Sep 3, 2024
19f5d4c
Add files via upload
lucia-ramos-dominguez Sep 3, 2024
1df3de3
json_to_csv_converter.py
lucia-ramos-dominguez Sep 3, 2024
3c639ae
Add files via upload
besteozy Sep 5, 2024
fa3a14a
Update Team Project dPrep.rmd
lucia-ramos-dominguez Sep 5, 2024
cc51158
Update Team Project dPrep.rmd
lucia-ramos-dominguez Sep 5, 2024
9cb1545
Update Team Project dPrep.rmd
lucia-ramos-dominguez Sep 5, 2024
02d6374
Update Team Project dPrep.rmd
asaarloos Sep 6, 2024
147d023
Update Team Project dPrep.rmd
claudiavhoof Sep 6, 2024
d1792db
Update Team Project dPrep.rmd
claudiavhoof Sep 6, 2024
8f65f7c
Update Team Project dPrep.rmd
claudiavhoof Sep 6, 2024
cdbe118
Delete b6f10b8ec9c7b318acb706a9189d0f68-afb941a00eb59e9d0b7a4cb8122ee…
lucia-ramos-dominguez Sep 6, 2024
7a44f30
Changes in README document
claudiavhoof Sep 10, 2024
db0d1b3
Merge pull request #6 from course-dprep/Claudia
claudiavhoof Sep 10, 2024
d24f5e4
Update README.md
lucia-ramos-dominguez Sep 10, 2024
eb340c8
Update README.md
lucia-ramos-dominguez Sep 10, 2024
eceb59c
Update README.md
lucia-ramos-dominguez Sep 10, 2024
6a7189b
Adding drive link
lucia-ramos-dominguez Sep 10, 2024
593fc22
Merge pull request #12 from course-dprep/DataExploration
lucia-ramos-dominguez Sep 10, 2024
3066e70
README updated
claudiavhoof Sep 10, 2024
24591cd
email adresses updated in README
claudiavhoof Sep 10, 2024
36f703e
updated email adresses in README
claudiavhoof Sep 10, 2024
9a4f1f1
README research motivation updated
claudiavhoof Sep 10, 2024
978df81
README almost finished, only table of variables has to be added
claudiavhoof Sep 10, 2024
d27539e
Merge pull request #13 from course-dprep/README
claudiavhoof Sep 10, 2024
7c22fd0
README update
claudiavhoof Sep 10, 2024
ac90dc6
README update
claudiavhoof Sep 10, 2024
469b739
README updated
claudiavhoof Sep 10, 2024
6157eca
README updated
claudiavhoof Sep 10, 2024
c7dcffd
Merge pull request #14 from course-dprep/README
claudiavhoof Sep 10, 2024
36bf16a
Merge branch 'course-dprep:main' into main
lucia-ramos-dominguez Sep 10, 2024
aed2304
Update README.md
lucia-ramos-dominguez Sep 10, 2024
d55446e
Update README.md
besteozy Sep 10, 2024
6b0eb4a
Update README.md
lucia-ramos-dominguez Sep 10, 2024
cf51334
Add files via upload
asaarloos Sep 11, 2024
1f126d6
README with added table of variables
claudiavhoof Sep 11, 2024
20dd34f
Merge pull request #15 from course-dprep/README
claudiavhoof Sep 11, 2024
932e95f
Update and rename Data.preperation (2).Rmd to Data.preparation (2).Rmd
lucia-ramos-dominguez Sep 12, 2024
4c46e58
Add files via upload
asaarloos Sep 12, 2024
2e54ef9
Merge pull request #16 from course-dprep/11-downloading-data-through-…
asaarloos Sep 12, 2024
6a06aa8
Delete Data.preparation (2).Rmd
lucia-ramos-dominguez Sep 12, 2024
86419f0
adding data exploration file first draft
lucia-ramos-dominguez Sep 12, 2024
110 changes: 110 additions & 0 deletions Data Preparation.Rmd
@@ -0,0 +1,110 @@
---
title: "Data preparation"
output: pdf_document
date: "2024-09-12"
---


```{r}
library(googledrive)

# Authenticate with Google Drive
drive_auth()

# Folder ID from your Google Drive link
folder_id <- "1ioJVCsr5pJ5tAa2dPJ9yxIvL6rYmDSl1"

# List all files in the folder
files_in_folder <- drive_ls(as_id(folder_id))

# Filter for the file named 'yelp_academic_dataset_business.csv'
csv_file <- files_in_folder[files_in_folder$name == "yelp_academic_dataset_business.csv", ]

# Check if the file exists before attempting download
if (nrow(csv_file) > 0) {
  # Download the CSV file to the working directory
  drive_download(
    as_id(csv_file$id),
    path = file.path(getwd(), "yelp_academic_dataset_business.csv"),
    overwrite = TRUE
  )
  cat("File downloaded successfully.")
} else {
  cat("The file 'yelp_academic_dataset_business.csv' was not found in the folder.")
}

business <- read.csv("yelp_academic_dataset_business.csv")
```


```{r explore-classes}

# Quick view of the data
head(business$hours)
tail(business)
colnames(business)

#Checking the classes of our variables

class(business$business_id) #character
class(business$hours) #character
class(business$postal_code) #character
class(business$is_open) #integer
class(business$address) #character
class(business$categories) #character
class(business$latitude) #numeric
class(business$city) #character
class(business$longitude) #numeric
class(business$state) #character
class(business$review_count)#integer
class(business$name) #character
class(business$stars) #numeric
class(business$attributes) #character

```
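
The per-column class checks above can also be done in a single call. The block below is a minimal sketch on a toy data frame; `business_toy` and its values are hypothetical stand-ins for the real `business` data.

```r
# Toy stand-in for the `business` data frame (hypothetical values)
business_toy <- data.frame(
  business_id = c("a1", "b2"),
  is_open     = c(1L, 0L),
  stars       = c(4.5, 3.0),
  stringsAsFactors = FALSE
)

# One call returns the class of every column as a named character vector
column_classes <- sapply(business_toy, class)
print(column_classes)
```

Running `sapply(business, class)` on the real data should give the same information as the individual `class()` calls, in one overview.
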
```{r}
# Install dplyr only if it is missing, so the document can be knitted repeatedly
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(dplyr)

# Remove the columns that are not needed for the research

business <- business %>% select(-latitude, -longitude, -hours)

# Rearrange the columns in a logical order

business <- business %>% select(business_id, name, is_open, address, postal_code, city, state, categories, review_count, stars, attributes)

```

```{r}
# Keep only the businesses that are still open (is_open == 1);
# rows for permanently closed businesses (is_open == 0) are dropped

business_filtered <- business[business$is_open == 1, ]

```

```{r}
# We have to decide when a rating counts as valid, so we set a benchmark
# for the minimum number of reviews a business needs to be included.

length(which(business_filtered$review_count >= 50))
# output: 23,396 businesses

length(which(business_filtered$review_count < 50))
# output: 96,302 businesses

# Next step: filter the dataset on this benchmark


```
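
The filtering step left open above could look roughly like the sketch below. It runs on a toy data frame; `business_toy`, `min_reviews`, and `business_valid` are hypothetical names, and the cut-off of 50 is only the value explored in the counts above, not a settled choice.

```r
# Toy stand-in for `business_filtered` (hypothetical values)
business_toy <- data.frame(
  business_id  = c("a1", "b2", "c3", "d4"),
  review_count = c(12L, 50L, 230L, 49L),
  stringsAsFactors = FALSE
)

# Benchmark explored above; the exact cut-off is a judgment call
min_reviews <- 50

# Keep only businesses with at least `min_reviews` reviews
business_valid <- business_toy[business_toy$review_count >= min_reviews, ]
print(business_valid$business_id)
```

Keeping the threshold in one variable makes it easy to rerun the analysis with a different benchmark later.
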

```{r}

business_filtered$categories[4]
business_filtered$attributes[4]

```

142 changes: 142 additions & 0 deletions DataExploration.Rmd
@@ -0,0 +1,142 @@
---
title: "Data Exploration"
output:
  html_document: default
  pdf_document: default
date: "2024-09-12"
author: "Team 4"
---

```{r setup, include=FALSE, results='hide'}
knitr::opts_chunk$set(echo = TRUE)
```


```{r cars, echo=FALSE, results='hide'}
summary(cars)
```

Loading the data - will be changed to the data from the drive

```{r, echo = FALSE, results='hide', message=FALSE, warning=FALSE}
library(readr)
# Read the file relative to this document's folder rather than a machine-specific setwd() path
data_business <- read_csv("yelp_academic_dataset_business.csv")

```
# Summary Statistics
### __Stars__

This section reports the key statistics for the stars column and plots its distribution. Because our research focuses on what drives these ratings, it is important to have a good understanding of this variable.

```{r, echo = FALSE, results = 'hide'}
library(ggplot2)
summary(data_business$stars)
mean_stars <- mean(data_business$stars, na.rm = TRUE)
median_star <- median(data_business$stars, na.rm = TRUE)
min_starvalue <- min(data_business$stars, na.rm = TRUE)
max_starvalue <- max(data_business$stars, na.rm = TRUE)
rounded_mean_star <- round(mean_stars,2)

print(paste("Mean Star Rating:", mean_stars))
print(paste("Rounded Mean Star Rating:", rounded_mean_star))
print(paste("Median Star Rating:", median_star))
print(paste("Maximum Star Rating:", max_starvalue))
print(paste("Minimum Star Rating:",min_starvalue))
```
```{r, echo = FALSE}
# Create a data frame for structured output
summary_df <- data.frame(
Statistic = c("Mean Star Rating", "Rounded Mean Star Rating", "Median Star Rating", "Maximum Star Rating", "Minimum Star Rating"),
Value = c(mean_stars, rounded_mean_star, median_star, max_starvalue, min_starvalue)
)

# Print the summary table with better visuals
knitr::kable(summary_df, format = "markdown", caption = "Summary of Star Ratings")
```
```{r, echo=FALSE}
# Ratings are rounded to half-stars, so use a bin width of 0.5
ggplot(data_business, aes(x = stars)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Star Ratings", x = "Star Ratings", y = "Frequency") +
  theme_minimal()
```

As the graph shows, the most common rating obtained by businesses on Yelp is 4 stars, whereas 1-star ratings are the least common.


### __States__

This section shows how the businesses are distributed across the different states in the USA, as the Yelp reviews come from these locations.

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
data_business$state <- as.factor(data_business$state)
# geom_bar() counts rows per state; it has no binwidth argument
ggplot(data_business, aes(x = state)) +
  geom_bar(fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Count of Businesses by State",
       x = "State",
       y = "Count") +
  theme_minimal()
```

This figure allows us to better understand the geographical distribution of the businesses, which might be of interest when assessing the reviews and ratings.


### __Categories__


Only the top 20 categories are depicted in the following table for illustrative purposes.

```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(dplyr)
library(tidyr)
library(knitr)

# Split the categories and count the occurrences
category_counts <- data_business %>%
  mutate(categories = strsplit(as.character(categories), ", ")) %>% # Split by comma and space
  unnest(categories) %>% # Transform the list column into rows
  count(categories, sort = TRUE)

# View the summary statistics
top_categories <- category_counts %>%
  top_n(20, n) %>%
  arrange(desc(n))

kable(top_categories, col.names = c("Category", "Count"),
      caption = "Count of Top 20 Categories",
      format = "markdown")
```


The code that plots all business categories is included below but hidden from the rendered output.

```{r, echo=FALSE, results='hide', include=FALSE}
library(ggplot2)

ggplot(category_counts, aes(x = reorder(categories, n), y = n)) +
  geom_bar(stat = "identity", fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Count of Categories",
       x = "Categories",
       y = "Count") +
  coord_flip() +
  theme_minimal()
```

The following figure shows the __top 20 categories__ of businesses that appear most often on Yelp. To keep the plot readable, only these 20 categories are depicted.

```{r, echo=FALSE}
top_categories <- category_counts %>%
  top_n(20, n) # You can also use slice_max(n, n = 20) from dplyr 1.0.0 or higher

ggplot(top_categories, aes(x = reorder(categories, n), y = n)) +
  geom_bar(stat = "identity", fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Top 20 Categories",
       x = "Categories",
       y = "Count") +
  coord_flip() +
  theme_minimal()
```

59 changes: 59 additions & 0 deletions README.md
@@ -0,0 +1,59 @@
# Yelp - What influences consumer ratings?
This project examines the influence of specific business attributes on the consumer ratings of businesses. We created a model to estimate the relative impact of the business attributes listed on Yelp on the consumer ratings of these businesses.

## Research Motivation
In the digital age, consumer ratings and online reviews have become one of the most influential sources of information in shaping consumer expectations and purchase decisions. This means that in the increasingly competitive digital market it is critical to understand which business attributes are drivers for positive and negative consumer ratings. In order to find these attributes that are drivers of consumer rating, we exploit a large open dataset created by Yelp.

Yelp serves as a public forum where consumers can share their experiences and evaluate businesses. With its widespread use in recreation industries such as restaurants, retail, and services, Yelp is one of the top websites consumers visit for trustworthy reviews. The Yelp dataset provides us with various variables (including the name, location, category, attributes, and ratings) of over 150,000 businesses across the United States of America. By analyzing this data we hope to find managerial insights that help us understand the importance of specific business attributes for online consumer reviews.

This research emphasizes the importance of using data to drive business decisions, rather than relying on anecdotal feedback or industry trends. This aligns with the growing trend toward data-driven marketing, where decisions are based on hard evidence rather than intuition.

### Research Question
*Which business attributes have the strongest impact on Yelp's consumer ratings within the United States of America?*

## Data
The data used in this research is an open dataset provided by Yelp. This dataset includes various business variables, such as names, opening hours, addresses, star ratings, and attributes, of more than 150,000 recreational businesses in the United States of America. These businesses span an extremely broad range of categories, from massage salons to casinos, and from cheese tastings to tattoo shops. This wide variety of business types allows us to analyse the influence of specific business attributes on consumer ratings at a very broad level, which means that our findings can be applied by a broad range of businesses and industries.

The table below lists the variables in the raw dataset with a brief description of each. The variables appear in the same order as in the original dataset:

|Variable      |Description                                                                                   |
|--------------|----------------------------------------------------------------------------------------------|
|business_id   |The unique Yelp id for each registered business                                               |
|hours         |The opening hours of the business, using a 24hr clock                                         |
|postal_code   |The postal code of the business                                                               |
|is_open       |A dummy variable that indicates whether the business is open: 0 = permanently closed, 1 = open |
|address       |The address of the business                                                                   |
|categories    |A list of Yelp business categories the business is classified as                              |
|latitude      |The geographical latitude of the business                                                     |
|city          |The city the business is located in                                                           |
|longitude     |The geographical longitude of the business                                                    |
|state         |The American state the business is located in                                                 |
|review_count  |The number of Yelp reviews the business has                                                   |
|name          |The name of the business                                                                      |
|stars         |The business rating in stars, ranges between 1-5 and is rounded to half-stars                 |
|attributes    |A list of Yelp business attributes and whether the business has each attribute or not         |

The Yelp dataset used for this research (yelp_academic_dataset_business.csv) can be downloaded [here](https://drive.google.com/drive/folders/1ioJVCsr5pJ5tAa2dPJ9yxIvL6rYmDSl1?usp=sharing).

## Research Method
To analyse the influence of specific business attributes on consumer ratings on Yelp, we will estimate a linear regression model. First, we will recode the variable 'attributes' into separate dummy variables, one per attribute, indicating whether each business possesses that attribute (1 if possessed, 0 if not). Then we will regress the dependent variable 'stars' on all of these attribute dummy variables to find the relative impact of each attribute on the consumer ratings.
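
The recoding step can be sketched as follows. This is a minimal illustration on toy data, not the project's final code: the data frame `toy`, the attribute names, and the comma-separated format of its `attributes` column are all assumptions, and the real Yelp column may be stored differently and need its own parsing.

```r
# Toy data (hypothetical): each string lists the attributes a business has
toy <- data.frame(
  business_id = c("a1", "b2", "c3"),
  attributes  = c("WiFi, Parking", "Parking", "WiFi, TakeOut"),
  stringsAsFactors = FALSE
)

# Split each string into a character vector of attribute names
attr_list <- strsplit(toy$attributes, ", ", fixed = TRUE)

# All distinct attribute names across businesses
all_attrs <- sort(unique(unlist(attr_list)))

# One 0/1 column per attribute: 1 if the business lists it, 0 otherwise
dummies <- sapply(all_attrs, function(a) {
  as.integer(vapply(attr_list, function(x) a %in% x, logical(1)))
})

# Combine the id column with the dummy columns
toy_dummies <- cbind(toy["business_id"], dummies)
print(toy_dummies)
```

The resulting data frame has one row per business and one dummy column per attribute, which is the shape a regression of 'stars' on the attribute dummies needs.
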

## Relevance
The findings of this research have several important implications for marketing strategies. First of all it could enhance customer experience by identifying key drivers of high ratings. Businesses could focus on improving the attributes that are found to have a high effect on consumer ratings, which will allow for more targeted improvements rather than broad inefficient changes. The insights of this research could also strengthen a company's branding and communication strategy. If specific attributes are found to be significantly important for higher consumer ratings, businesses can highlight these attributes in their promotional material to attract more consumers.

## Prediction Model

## Repository Overview

## Dependencies

## Running the Code

## Authors
This repository is produced by group 4 of the course Data Preparation & Workflow Management, taught by Hannes Datta at Tilburg University. This course is part of the Master's program Marketing Analytics. The group members and authors of this repository are:
- Claudia van Hoof ([[email protected]](mailto:[email protected]))
- Beste Özyürekoğlu ([[email protected]](mailto:[email protected]))
- Lucía Ramos Dominguez ([[email protected]](mailto:[email protected]))
- Ashley Saarloos ([[email protected]](mailto:[email protected]))
- Renske Vincken ([[email protected]](mailto:[email protected]))
50 changes: 50 additions & 0 deletions Team Project dPrep.rmd
@@ -0,0 +1,50 @@
---
title: "Team Project Dprep"
author: "Team 4"
date: "2024-09-05"
output: pdf_document
---

## Team Project Week 2

This is the week 2 deliverable of the dPrep team project, group 4.

```{r}
library(readr)
data_business <- read.csv("data/yelp_academic_dataset_business.csv")
summary(data_business)
```
This dataset contains data from the website Yelp on a wide variety of businesses. The variables included are the following, each with a short explanation:

* __Business_id__: The unique id of each business
* __Hours__: The opening hours of the business
* __Postal_code__: The postal code of the business
* __Is_open__: Whether the business is still open or permanently closed
* __Address__: The address of the business
* __Categories__: The categories the business is classified as
* __Latitude__: The geographical latitude of the business
* __City__: The city the business is located in
* __Longitude__: The geographical longitude of the business
* __State__: The state the business is located in
* __Review_count__: The number of reviews the business had on Yelp
* __Name__: The name of the business
* __Stars__: The star rating of the business on Yelp
* __Attributes__: The attributes of the business

# MOTIVATION
Our team decided to use the dataset from Yelp. After downloading the data, and scanning through it, it was decided that the subset "yelp_academic_dataset_business.csv" was the most interesting to use for our research.

__Possible research question__
Which business attributes lead to a higher star rating?

We are interested in whether specific attributes lead to higher ratings for businesses; for example, whether businesses that offer “Free Wifi” obtain higher ratings than those that do not.

__Research Method__
Regression analysis:

* Converting the attributes into a new dummy variable, indicating whether it is present or not
* Seeing what effect the presence of these dummies has on the ratings
* We will use the column "Stars" as well as "Attributes" from the dataset to conduct this analysis.
* The attributes for a business will be the independent variables (e.g., “Free Wi-Fi,” “Parking,” “Take out”), and the dependent variable would be the business rating (star ranking from 1-5)
* By using a regression analysis we will be able to quantify the impact of each attribute on the rating.
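
The regression step described in the list above can be sketched as follows. This is only an illustration under stated assumptions: the data frame `toy`, its values, and the dummy columns `wifi` and `parking` are made up, and the real analysis would use the dummies derived from the "Attributes" column.

```r
# Toy data (hypothetical): star ratings with two 0/1 attribute dummies
toy <- data.frame(
  stars   = c(4.5, 4.0, 3.0, 2.5, 5.0, 3.5),
  wifi    = c(1, 1, 0, 0, 1, 0),
  parking = c(1, 0, 1, 0, 1, 1)
)

# Regress the star rating on the attribute dummies
model <- lm(stars ~ wifi + parking, data = toy)

# Each coefficient is the estimated change in stars when the attribute
# is present, holding the other attribute fixed
print(coef(model))
```

Comparing the sizes of the coefficients gives the relative impact of each attribute, and `summary(model)` adds standard errors and p-values for judging whether an effect is distinguishable from zero.
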
