Merge pull request #14 from imtvwy/review

Update the dead link on Readme and Make the EDA conclusion more prominent
UBC-MDS · Nov 21, 2021 · 3bec51c · 3bec51c
2 parents 156f6e7 + be3e05e
commit 3bec51c
Show file tree

Hide file tree

Showing 3 changed files with 13 additions and 14 deletions.
diff --git a/README.md b/README.md
@@ -45,7 +45,7 @@ The raw data comprises most of the features as character type where some of the
 
     The desired outputs are processed data set in form of .csv file., RMD file/ Notebook for reproducible codes.
 
-    The initial EDA can be viewed and explored [here](https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/tree/main/src/pumpkin_eda.ipynb).
+    The initial EDA can be viewed and explored [here](/src/eda/pumpkin_eda.pdf).
 
 3.  **Predictive Modelling**
 
@@ -59,4 +59,4 @@ The raw data comprises most of the features as character type where some of the
 
 4.  **Report**
 
-    Results of the analysis can be found [here](https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/tree/main/doc) (folder link for WIP, report to be generated once analysis is completed).
+    Results of the analysis can be found [here](/doc) (folder link for WIP, report to be generated once analysis is completed).
diff --git a/src/eda/pumpkin_eda.Rmd b/src/eda/pumpkin_eda.Rmd
@@ -1,5 +1,6 @@
 ---
 title: "Giant Pumpkins EDA"
+author: Mahsa Sarafrazi, Rowan Sivanandam, Shiva Jena, and Vanessa Yuen
 date: November 20, 2021
 output: 
   pdf_document:
@@ -9,7 +10,6 @@ output:
     df_print: kable
 ---
 
-
 ## Exploratory Data Analysis of the Giant Pumpkins data set
 
 ## Summary of the data set
@@ -18,19 +18,23 @@ The data set of this project is from [BigPumpkins.com](http://www.bigpumpkins.co
 
 Each row of the data set represents a GPC weighoff result. Each row includes the id (year-type), place/ranking, grower name, city, state, country, gpc site and variety of the giant pumpkin. It also contains genetic info such as the seed mother and pollinator father. Measurements taken include the weight in lbs and ott in inches (Over the top measurement to estimate weight).
 
-In the raw data, there are rows of 'seperator' inserted after the records of the same 'id'. We have removed these separator records and saved the processed data in the `processed_pumpkins.csv`. A script file for this data processing can be found in ????.
+In the raw data, there are rows of 'seperator' inserted after the records of the same 'id'. We have removed these separator records and saved the processed data in the `processed_pumpkins.csv`. A script file for this data processing can be found in [here](https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/blob/main/src/script/download_data.py).
 
 There are in total 28,011 observations and 14 features. Some null values found in the city, seed mother, pollinator father, ott, estimated weight, pct_chart and variety features.
 
-### **Partition the data set into Training and Test sets**
+## **Partition the data set into Training and Test sets**
 
 We will split the data with 70% training data and 30% test data. After splitting, the number of observations in the training set and test set are 19,607 and 8,404 respectively.
 
-### Exploratory Data Analysis on the Training set
+## Exploratory Data Analysis on the Training set
 
 We have plotted distribution of the target 'weight (Lbs)' and some features in the training set to explore if the features will be useful to predict the weight of the giant pumpkins.
 
-The plot shows most of the observations are from the United States. The distribution of the GPC sites, city and state/province are more evenly distributed. We consider these columns are all good features to be used. Plots of the mean weight of giant pumpkins against different features (ott, country, city, state, gpc site) also suggest these features relates to the target (weight).
+The plots in the section [Plots of Data Distribution and Relationship] show that most of the observations are from the United States. The distribution of the GPC sites, city and state/province are more evenly distributed. We consider these columns are all good features to be used. Plots of the mean weight of giant pumpkins against different features (ott, country, city, state, gpc site) also suggest that these features relates to the target (weight).
+
+From the [Summary Statistics] of the training set, it is noticed that the grower name, seed mother and pollinator father are free-text columns. We think this genetic information might be useful for the prediction of the weight. It is found from the [GPC website](https://gpc1.org/about/resources/) that there is a naming convention for the seed / pollinator (Parent Weight : Grower Name: Year). We may consider to transform this data to separate features at later stage.
+
+The number of non-null values in the variety column is very low. We will drop this column for training as the information may not be useful when there are so many null values.
 
 \newpage
 
@@ -40,19 +44,14 @@ The plot shows most of the observations are from the United States. The distribu
 
 ### ![Distribution of Country](files/country_dist_plot.png "Distribution of Country")
 
-
 ### ![Distribution of ott, est_weight and pct_chart](files/numeric_dist_plot.png "Distribution of ott, est_weight and pct_chart")
+
 \newpage
 
 ### ![""](files/dist_combined.png)
 
-![""](files/correlation.png)
-\newpage
+![""](files/correlation.png) \newpage
 
 ### Summary Statistics
 
-From the describe summary of the training set (see below), it is noticed that the grower name, seed mother and pollinator father are free-text columns. We think this genetic information might be useful for the prediction of the weight. It is found from the [GPC website](https://gpc1.org/about/resources/) that there is a naming convention for the seed / pollinator (Parent Weight : Grower Name: Year). We may consider to transform this data to separate features at later stage.
-
-The number of non-null values in the variety column is very low. We will drop this column for training as the information may not be useful when there are so many null values.
-
 ![Output of Descriptive Summary of the Training Set from Jupyter Notebook](files/train_df_describe_summary.PNG)
diff --git a/src/eda/pumpkin_eda.pdf b/src/eda/pumpkin_eda.pdf