Merge pull request #14 from imtvwy/review
Update the dead link on Readme and Make the EDA conclusion more prominent
shivajena authored Nov 21, 2021
2 parents 156f6e7 + be3e05e commit 3bec51c
Showing 3 changed files with 13 additions and 14 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -45,7 +45,7 @@ The raw data comprises most of the features as character type where some of the

The desired outputs are processed data set in form of .csv file., RMD file/ Notebook for reproducible codes.

The initial EDA can be viewed and explored [here](https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/tree/main/src/pumpkin_eda.ipynb).
The initial EDA can be viewed and explored [here](/src/eda/pumpkin_eda.pdf).

3. **Predictive Modelling**

@@ -59,4 +59,4 @@ The raw data comprises most of the features as character type where some of the

4. **Report**

Results of the analysis can be found [here](https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/tree/main/doc) (folder link for WIP, report to be generated once analysis is completed).
Results of the analysis can be found [here](/doc) (folder link for WIP, report to be generated once analysis is completed).
23 changes: 11 additions & 12 deletions src/eda/pumpkin_eda.Rmd
@@ -1,5 +1,6 @@
---
title: "Giant Pumpkins EDA"
author: Mahsa Sarafrazi, Rowan Sivanandam, Shiva Jena, and Vanessa Yuen
date: November 20, 2021
output:
pdf_document:
@@ -9,7 +10,6 @@ output:
df_print: kable
---


## Exploratory Data Analysis of the Giant Pumpkins data set

## Summary of the data set
@@ -18,19 +18,23 @@ The data set of this project is from [BigPumpkins.com](http://www.bigpumpkins.co

Each row of the data set represents a GPC weighoff result. Each row includes the id (year-type), place/ranking, grower name, city, state, country, gpc site and variety of the giant pumpkin. It also contains genetic info such as the seed mother and pollinator father. Measurements taken include the weight in lbs and ott in inches (Over the top measurement to estimate weight).

In the raw data, there are rows of 'seperator' inserted after the records of the same 'id'. We have removed these separator records and saved the processed data in the `processed_pumpkins.csv`. A script file for this data processing can be found in ????.
In the raw data, there are rows of 'seperator' inserted after the records of the same 'id'. We have removed these separator records and saved the processed data in the `processed_pumpkins.csv`. A script file for this data processing can be found in [here](https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/blob/main/src/script/download_data.py).

There are in total 28,011 observations and 14 features. Some null values found in the city, seed mother, pollinator father, ott, estimated weight, pct_chart and variety features.

### **Partition the data set into Training and Test sets**
## **Partition the data set into Training and Test sets**

We will split the data with 70% training data and 30% test data. After splitting, the number of observations in the training set and test set are 19,607 and 8,404 respectively.

### Exploratory Data Analysis on the Training set
## Exploratory Data Analysis on the Training set

We have plotted distribution of the target 'weight (Lbs)' and some features in the training set to explore if the features will be useful to predict the weight of the giant pumpkins.

The plot shows most of the observations are from the United States. The distribution of the GPC sites, city and state/province are more evenly distributed. We consider these columns are all good features to be used. Plots of the mean weight of giant pumpkins against different features (ott, country, city, state, gpc site) also suggest these features relates to the target (weight).
The plots in the section [Plots of Data Distribution and Relationship] show that most of the observations are from the United States. The distribution of the GPC sites, city and state/province are more evenly distributed. We consider these columns are all good features to be used. Plots of the mean weight of giant pumpkins against different features (ott, country, city, state, gpc site) also suggest that these features relates to the target (weight).

From the [Summary Statistics] of the training set, it is noticed that the grower name, seed mother and pollinator father are free-text columns. We think this genetic information might be useful for the prediction of the weight. It is found from the [GPC website](https://gpc1.org/about/resources/) that there is a naming convention for the seed / pollinator (Parent Weight : Grower Name: Year). We may consider to transform this data to separate features at later stage.

The number of non-null values in the variety column is very low. We will drop this column for training as the information may not be useful when there are so many null values.

\newpage

@@ -40,19 +44,14 @@ The plot shows most of the observations are from the United States. The distribu

### ![Distribution of Country](files/country_dist_plot.png "Distribution of Country")


### ![Distribution of ott, est_weight and pct_chart](files/numeric_dist_plot.png "Distribution of ott, est_weight and pct_chart")

\newpage

### ![""](files/dist_combined.png)

![""](files/correlation.png)
\newpage
![""](files/correlation.png) \newpage

### Summary Statistics

From the describe summary of the training set (see below), it is noticed that the grower name, seed mother and pollinator father are free-text columns. We think this genetic information might be useful for the prediction of the weight. It is found from the [GPC website](https://gpc1.org/about/resources/) that there is a naming convention for the seed / pollinator (Parent Weight : Grower Name: Year). We may consider to transform this data to separate features at later stage.

The number of non-null values in the variety column is very low. We will drop this column for training as the information may not be useful when there are so many null values.

![Output of Descriptive Summary of the Training Set from Jupyter Notebook](files/train_df_describe_summary.PNG)
Binary file modified src/eda/pumpkin_eda.pdf
Binary file not shown.
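
Review note on the 70/30 partition described in `src/eda/pumpkin_eda.Rmd`: the reported sizes (19,607 training and 8,404 test rows out of 28,011) are consistent with rounding the test share up and giving the remainder to training. The sketch below is only an illustration of that arithmetic using the Python standard library. The function name `split_indices`, the seed, and the shuffling approach are assumptions made for this note; the commit does not show the project's actual splitting code.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.3, seed=123):
    """Shuffle row indices and split them into (train, test) lists.

    Hypothetical helper for illustration only. The test size is rounded up
    and the training set receives the remaining rows.
    """
    n_test = math.ceil(n_rows * test_fraction)
    indices = list(range(n_rows))
    # A seeded generator keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(28011)
print(len(train_idx), len(test_idx))  # prints "19607 8404"
```

The two index lists are disjoint and together cover every row, so no observation appears in both the training and the test set.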
