Submission: GROUP 17: Giant_Pumpkins_Weight_Prediction #15

Open · imtvwy opened this issue Nov 30, 2021 · 7 comments

@imtvwy
imtvwy commented Nov 30, 2021

Submitting authors: @mahsasarafrazi, @shivajena, @Rowansiv, @imtvwy

Repository: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction
Report link: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/blob/main/doc/pumpkin.html

Abstract/executive summary:
This project builds regression-based machine learning models to estimate the weight of giant pumpkins from features such as year of cultivation, place, and over-the-top (OTT) size, with the aim of predicting the next year's winner of the GP competition. Several regression models (Linear, Ridge, and Random Forest) were trained and cross-validated on the training data. For the Ridge model, the hyperparameter (α) was optimised to return the best cross-validation score. This model performed fairly well on the test data, which led us to finalise it as our prediction model: the best cross-validation score is 0.6666134 and the mean test score is 0.6619808. The Random Forest model had similar cross-validation and test scores, but because of its high fit times it was not chosen for this report. Therefore, for the purpose of reproducibility, we decided to use the Ridge model as our prediction model. For better performance and precision, other models could also be tried on the data.

The data used for this project comes from BigPumpkins.com.
The dataset is a public-domain resource describing the attributes of giant pumpkins grown in about 20 countries across different regions of the world. The raw data used for the analysis in this project can be found here: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-19/pumpkins.csv
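
To make the modelling pipeline concrete, here is a minimal sketch (not the authors' script) of the Linear/Ridge/Random Forest comparison described in the abstract, reading the raw CSV from the URL above. The cleaning steps and feature choice are assumptions: only the numeric features are used here (the categorical ones would need encoding first), and the raw file appears to mix several crop categories (the letter suffix in the id column), which the real analysis presumably filters.

# Minimal sketch, not the authors' script: compare the three models from the
# abstract on the numeric features only.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_validate, train_test_split

URL = ("https://raw.githubusercontent.com/rfordatascience/tidytuesday/"
       "master/data/2021/2021-10-19/pumpkins.csv")
pumpkins = pd.read_csv(URL)

# weight_lbs and ott are stored as text in the raw file (weights include
# thousands separators), and the year is the prefix of the id column.
for col in ["weight_lbs", "ott"]:
    pumpkins[col] = pd.to_numeric(
        pumpkins[col].astype(str).str.replace(",", "", regex=False),
        errors="coerce")
pumpkins["year"] = pd.to_numeric(pumpkins["id"].str[:4], errors="coerce")
pumpkins = pumpkins.dropna(subset=["weight_lbs", "ott", "year"])

X_train, X_test, y_train, y_test = train_test_split(
    pumpkins[["year", "ott"]], pumpkins["weight_lbs"], random_state=123)

for name, model in {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random forest": RandomForestRegressor(random_state=123),
}.items():
    cv = cross_validate(model, X_train, y_train, cv=5)
    print(f"{name}: mean CV score={cv['test_score'].mean():.3f}, "
          f"mean fit time={cv['fit_time'].mean():.2f}s")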

Editor: @mahsasarafrazi, @shivajena, @Rowansiv, @imtvwy
Reviewer: @RamiroMejia, @riddhisansare, @stevenleung2018, @ruben1dlg

  • I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.
@stevenleung2018

stevenleung2018 commented Dec 1, 2021

(Work in progress)

Data analysis review checklist

Reviewer: @stevenleung2018

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • [1/2] Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • [1/3] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • [1/2] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • [1/2] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

About 1.5 hours.

Review Comments:

  1. I like the fact that you tried multiple models and chose the best one in the report, together with a well-thought-out rationale for the choice.
  2. The whole data pipeline is present, and the code is written and organized in a readable fashion, allowing me to follow the whole analysis from beginning to end.

There are, however, a few issues I have spotted:
3. Installation instructions:
a. The list of dependencies is not available.
b. Creating the environment fails on my computer. Here are the command and the error message:

(base) stevenprivate@StevenMac ~/mds/522/Giant_Pumpkins_Weight_Prediction (main)
$ conda env create -f environment.yaml 
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound: 
  - pyqt5-sip==4.19.18=py39h415ef7b_8
  - graphite2==1.3.13=1000
  - openjpeg==2.4.0=hb211442_1
  - libffi==3.4.2=h8ffe710_5
  - selenium==3.141.0=py39hb82d6ee_1003
  - intel-openmp==2021.4.0=h57928b3_3556
  - pandas==1.3.4=py39h2e25243_1
  - win_inet_pton==1.1.0=py39hcbf5309_3
  - ucrt==10.0.20348.0=h57928b3_0
  - libxgboost==1.3.0=h0e60522_3
  - xorg-libx11==1.7.2=hcd874cb_0
  - sqlite==3.36.0=h8ffe710_2
  - setuptools==59.2.0=py39hcbf5309_0
  - zeromq==4.3.4=h0e60522_1
  - nodejs==14.17.4=h57928b3_0
  - preshed==3.0.6=py39h415ef7b_1
  - libsodium==1.0.18=h8d14728_1
  - xorg-libice==1.0.10=hcd874cb_0
  - murmurhash==1.0.6=py39h415ef7b_2
  - cairo==1.16.0=hb19e0ff_1008
  - cffi==1.15.0=py39h0878f49_0
  - jpeg==9d=h8ffe710_0
  - xorg-libxau==1.0.9=hcd874cb_0
  - libpng==1.6.37=h1d00b33_2
  - libwebp==1.2.1=h57928b3_0
  - pyqt==5.12.3=py39hcbf5309_8
  - statsmodels==0.13.1=py39h5d4886f_0
  - scikit-learn==1.0.1=py39he931e04_2
  - zlib==1.2.11=h8ffe710_1013
  - gettext==0.19.8.1=ha2e2712_1008
  - py-xgboost==1.3.0=py39hcbf5309_3
  - pyqtwebengine==5.12.1=py39h415ef7b_8
  - tornado==6.1=py39hb82d6ee_2
  - psutil==5.8.0=py39hb82d6ee_2
  - libclang==11.1.0=default_h5c34c98_1
  - libbrotlicommon==1.0.9=h8ffe710_6
  - xorg-libxt==1.2.1=hcd874cb_2
  - certifi==2021.10.8=py39hcbf5309_1
  - m2w64-gcc-libs-core==5.3.0=7
  - graphviz==2.49.3=hefbd956_0
  - libbrotlienc==1.0.9=h8ffe710_6
  - catalogue==2.0.6=py39hcbf5309_0
  - libxcb==1.13=hcd874cb_1004
  - harfbuzz==3.1.1=hc601d6f_0
  - pyzmq==22.3.0=py39he46f08e_1
  - zstd==1.5.0=h6255e5f_0
  - regex==2021.11.10=py39hb82d6ee_0
  - xorg-libxpm==3.5.13=hcd874cb_0
  - vega-cli==5.17.0=h0e60522_4
  - srsly==2.4.2=py39h415ef7b_0
  - pcre==8.45=h0e60522_0
  - pyrsistent==0.18.0=py39hb82d6ee_0
  - scipy==1.7.2=py39hc0c34ad_0
  - ipykernel==6.5.0=py39h832f523_1
  - xorg-libsm==1.2.3=hcd874cb_1000
  - cython-blis==0.7.5=py39h5d4886f_1
  - pthread-stubs==0.4=hcd874cb_1001
  - jbig==2.1=h8d14728_2003
  - lcms2==2.12=h2a16943_0
  - vc==14.2=hb210afc_5
  - matplotlib==3.5.0=py39hcbf5309_0
  - libwebp-base==1.2.1=h8ffe710_0
  - fonttools==4.28.1=py39hb82d6ee_0
  - debugpy==1.5.1=py39h415ef7b_0
  - importlib-metadata==4.8.2=py39hcbf5309_0
  - libbrotlidec==1.0.9=h8ffe710_6
  - chardet==4.0.0=py39hcbf5309_2
  - python==3.9.7=h7840368_3_cpython
  - libxml2==2.9.12=hf5bbc77_1
  - liblapack==3.9.0=12_win64_mkl
  - libblas==3.9.0=12_win64_mkl
  - llvmlite==0.36.0=py39ha0cd8c8_0
  - numpy==1.21.4=py39h6635163_0
  - libzlib==1.2.11=h8ffe710_1013
  - m2w64-gmp==6.1.0=2
  - pysocks==1.7.1=py39hcbf5309_4
  - tk==8.6.11=h8ffe710_1
  - libglib==2.70.1=h3be07f2_0
  - lerc==3.0=h0e60522_0
  - brotli==1.0.9=h8ffe710_6
  - msys2-conda-epoch==20160418=1
  - cryptography==36.0.0=py39h7bc7c5c_0
  - pywin32==302=py39hb82d6ee_2
  - libdeflate==1.8=h8ffe710_0
  - numba==0.53.0=py39h69f9ab1_0
  - vega-lite-cli==4.17.0=h57928b3_2
  - fribidi==1.0.10=h8d14728_0
  - catboost==1.0.3=py39hcbf5309_1
  - click==8.0.3=py39hcbf5309_1
  - m2w64-gcc-libs==5.3.0=7
  - xz==5.2.5=h62dcd97_1
  - jupyter_core==4.9.1=py39hcbf5309_1
  - matplotlib-base==3.5.0=py39h581301d_0
  - gts==0.7.6=h7c369d9_2
  - jedi==0.18.1=py39hcbf5309_0
  - ca-certificates==2021.10.8=h5b45459_0
  - brotli-bin==1.0.9=h8ffe710_6
  - libgd==2.3.3=h8bb91b0_0
  - markupsafe==2.0.1=py39hb82d6ee_1
  - lightgbm==3.3.1=py39h415ef7b_1
  - libtiff==4.3.0=hd413186_2
  - xorg-libxdmcp==1.1.3=hcd874cb_0
  - kiwisolver==1.3.2=py39h2e07f2f_1
  - m2w64-gcc-libgfortran==5.3.0=6
  - qt==5.12.9=h5909a2a_4
  - m2w64-libwinpthread-git==5.0.0.4634.697f757=2
  - xorg-xextproto==7.3.0=hcd874cb_1002
  - pyqt-impl==5.12.3=py39h415ef7b_8
  - fontconfig==2.13.1=h1989441_1005
  - pango==1.48.10=h33e4779_2
  - vs2015_runtime==14.29.30037=h902a5da_5
  - spacy==3.2.0=py39hefe7e4c_0
  - freetype==2.10.4=h546665d_1
  - lz4-c==1.9.3=h8ffe710_1
  - xorg-kbproto==1.0.7=hcd874cb_1002
  - mkl==2021.4.0=h0e2418a_729
  - brotlipy==0.7.0=py39hb82d6ee_1003
  - xorg-xproto==7.0.31=hcd874cb_1007
  - thinc==8.0.12=py39hefe7e4c_0
  - tbb==2021.4.0=h2d74725_1
  - pyqtchart==5.12=py39h415ef7b_8
  - cymem==2.0.6=py39h415ef7b_2
  - libiconv==1.16=he774522_0
  - pandoc==2.16.2=h8ffe710_0
  - icu==68.2=h0e60522_0
  - ipython==7.29.0=py39h832f523_2
  - expat==2.4.1=h39d44d4_0
  - pixman==0.40.0=h8ffe710_0
  - libcblas==3.9.0=12_win64_mkl
  - pydantic==1.8.2=py39hb82d6ee_2
  - pillow==8.4.0=py39h916092e_0
  - shap==0.40.0=py39h2e25243_0
  - xorg-libxext==1.3.4=hcd874cb_1
  - getopt-win32==0.1=h8ffe710_0

Thus the subsequent conda activate pumpkin also fails.

bash run_all.sh runs fine after I manually conda install scikit-learn.
I have two suggestions for improving this:
i. You should include a dependency list like the example given by Tiffany (https://github.com/ttimbers/breast_cancer_predictor#dependencies); and
ii. The list should come from the libraries you actually import in your Python scripts and .Rmd files. Looking at your environment.yaml file, I think it is too long. Note that you only need to include the packages you explicitly load in your code: run conda env export -f environment.yaml --from-history (the --from-history flag is the key), and conda will resolve the dependencies of those packages and upgrade them as necessary. After that, you should change the name of the environment to something meaningful and remove the last prefix line.

  4. Community guidelines: The following are missing: "2) Report issues or problems with the software 3) Seek support".
  5. Final report: The block quotation under the "Data" section about the Great Pumpkin Commonwealth (GPC) is a bit out of place and redundant, since the GPC is already mentioned in the "Introduction" section immediately before it.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@RamiroMejia

RamiroMejia commented Dec 1, 2021

Data analysis review checklist

Reviewer: @RamiroMejia

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5

Review Comments:


  1. It seems like you have problems with missing values in some of the features. It would be interesting to know how many NAs there are in the data and to include this in the EDA.

  2. You mentioned:

For the numeric features, we used a simple imputer to insert the ‘median’ value for any missing or Null values as well as a standard scaler via a pipeline. For categorical features, we similarly used a simple imputer but instead of filling in values with the mean, we filled them in with the value ‘missing’. We then used one hot encoding to encode the categorical features.

  • Why did you decide to fill the NAs with 'missing' rather than using another imputation strategy such as "most_frequent"?

  3. It seems like your Random Forest model takes a lot of time to train. Did you try DecisionTreeRegressor?

    • Use RandomizedSearchCV to find the best parameters (see the sketch after this list).
    • Also, it would be interesting to include a table of results for the models with the training and test scores.
    • How much did your base model improve after hyperparameter tuning?

  4. It would be great to know the most important features in the model; you could add a table with the coefficients that most affect your target (also shown in the sketch below).

  5. You could add the question of your analysis to the introduction text as part of your motivation.

  6. Add a flow chart of how the scripts are executed to the README.md.
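
To make comments 2 to 4 concrete, here is a hedged sketch combining the preprocessing quoted above (median imputation plus scaling for numeric features; 'missing' imputation plus one-hot encoding for categorical ones) with the RandomizedSearchCV and coefficient-table suggestions. The column lists and search ranges are assumptions, not the repository's actual code.

# Hedged sketch, not the repository's code: preprocessing as described in the
# report, plus a randomized search over Ridge's alpha and a coefficient table.
import pandas as pd
from scipy.stats import loguniform
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["year", "ott"]                 # assumed numeric columns
categorical_features = ["country", "state_prov"]   # assumed categorical columns

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_features),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="constant",
                                               fill_value="missing")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_features),
])

pipe = Pipeline([("prep", preprocessor), ("ridge", Ridge())])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"ridge__alpha": loguniform(1e-3, 1e3)},
    n_iter=20, cv=5, n_jobs=-1, random_state=123,
)
# search.fit(X_train, y_train) with the analysis's training split, then a
# coefficient table like the one suggested in comment 4 (feature-name support
# depends on the scikit-learn version):
# names = search.best_estimator_["prep"].get_feature_names_out()
# coefs = search.best_estimator_["ridge"].coef_
# table = pd.DataFrame({"feature": names, "coefficient": coefs})
# print(table.reindex(table["coefficient"].abs()
#                     .sort_values(ascending=False).index).head(10))

Sampling alpha on a log scale usually finds a good value in far fewer fits than an exhaustive grid, which would also help with the fit-time concern if the Random Forest were tuned the same way.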

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@shivajena

Thanks Steven for the detailed review and feedback. We plan to implement your feedback in phases; here is a tentative list of some of the immediate changes we are working on:

  1. Documentation: On dependencies, we really appreciate your suggestions. We have improved our environment file and are finalising a clear, shorter list of dependencies. For example, we have changed the environment file to make it platform independent and to solve some of the issues that came up in the first release. Here is a link to the specific PR. We will also append a list of dependencies to the README, matching the environment file.
    We are also improving the community guidelines by amending our contributing file.
  2. Code quality: We plan to build and improve the tests, which should be ready by next week.
  3. Reproducibility: We understand your input regarding the environment issue and are working on it as stated above.

@riddhisansare

Data analysis review checklist

Reviewer: @riddhisansare

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

1.7 hours.

Review Comments:

  1. I appreciate that the code is legible and understandable throughout the project.
  2. The README file gives a brief description of the project; however, including a flowchart of the executed scripts and pipelines would give users a better overall view of the process.
  3. It would be nice to add the analysis question to the introduction text as part of your motivation.
  4. Adding a line to download the data in the EDA file would avoid trouble if the user has not downloaded the data, so that when other people run your notebook they still have the data (a sketch of this follows the list).
  5. The report gives a good overall understanding of the process.
  6. Adding a file that describes the step-by-step process of project setup would be beneficial for a completely new user with no coding background.
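
A small sketch of the download-if-missing idea from comment 4, using the raw URL from the submission; the local path is a placeholder:

# Download-if-missing sketch for the EDA notebook; the local path is
# hypothetical, and the URL is the raw data link from the submission.
import os
import pandas as pd

URL = ("https://raw.githubusercontent.com/rfordatascience/tidytuesday/"
       "master/data/2021/2021-10-19/pumpkins.csv")
LOCAL = "data/raw/pumpkins.csv"  # placeholder location within the repo

if not os.path.exists(LOCAL):
    os.makedirs(os.path.dirname(LOCAL), exist_ok=True)
    pd.read_csv(URL).to_csv(LOCAL, index=False)

pumpkins = pd.read_csv(LOCAL)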

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@shivajena

shivajena commented Dec 4, 2021


Thanks Ramiro for the inputs; we truly appreciate your detailed observations. Here are our immediate plans for implementing your feedback and suggestions:

  1. Code quality: We are targeting next week to improve the tests.
  2. Reproducibility: We have improved the environment file and made it OS independent (see PR), and we are working on glitch-free dependency installation.
  3. Analysis report: We are improving it as per your and the TA's feedback and will post an update with next week's release.
  4. EDA observations: Yes, we did have problems with missing values; we used the median strategy for numerical features and 'missing' for categorical features, as you can see in our report. Our categorical features, such as place/country/state of origin, are very context specific, so imputing with most_frequent would be misleading. That is why we didn't use it.
    We will get back to you on the rest of the comments soon.

@ruben1dlg

Data analysis review checklist

Reviewer: @ruben1dlg

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • [1/2] Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • [1/2] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5

Review Comments:


  1. In general, I think the structure of the repo is well done! I would recommend having an .md version of the final report in the repo for visualization purposes. Also, I would probably change the name of the final report so it is clearer that it contains your final results.

  2. I think it would be interesting to dig deeper into the conclusions drawn from the model's results. The report shows that the model performs well and gives its score, but we do not get any conclusions about the features or whether your assumptions held true.

  3. I feel the list of dependencies to install is too long; it should be limited to the packages actually needed to run the analysis.

  4. I guess this will be added later, but it would be a good idea to have tests in your code as well, just to make sure that any manual parts work properly (see the sketch after this list).

  5. I liked the introduction of the report. It was fun to read, and it gives a very good idea of what the purpose of the analysis is.
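
On comment 4, here is a minimal sketch of what such a test could look like with pytest; clean_weight is a hypothetical helper standing in for the repository's own preprocessing functions:

# Minimal pytest sketch; clean_weight is hypothetical, standing in for the
# repository's own preprocessing functions.
import pandas as pd


def clean_weight(raw: pd.Series) -> pd.Series:
    """Strip thousands separators and convert weight strings to floats."""
    return pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")


def test_clean_weight():
    raw = pd.Series(["2,032.0", "850.5", "not a number"])
    cleaned = clean_weight(raw)
    assert cleaned.iloc[0] == 2032.0
    assert cleaned.iloc[1] == 850.5
    assert pd.isna(cleaned.iloc[2])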

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@imtvwy

imtvwy commented Dec 10, 2021

Thanks everyone for your valuable feedback, which helps us improve our project. We have made the following changes in response to your comments:

  1. Regarding comment 3 in this issue, we have
  2. Regarding comment 5 in this issue, we have updated the EDA document to add figure captions and reposition the plots below the reasoning:
     UBC-MDS/Giant_Pumpkins_Weight_Prediction@01f0083
  3. Regarding the comments in this issue about the final report, we have
  4. Regarding comment 2 in this issue, we have added a "Critique, Limitations and Future Improvements" section to the final report:
     UBC-MDS/Giant_Pumpkins_Weight_Prediction@dfafe4f
  5. Regarding the comments about adding a process flow in this issue as well as in the other issue, we have added a Makefile dependency diagram to the README file:
     UBC-MDS/Giant_Pumpkins_Weight_Prediction@6650914
