Incorporate documentation edits from Schloss Lab meeting #232

Merged
50 commits merged on Nov 20, 2020
Changes from 2 commits
Commits
88b66ee
Fix gh link
kelly-sovacool Nov 16, 2020
3ef722d
Remove DOI to not-yet-published paper
kelly-sovacool Nov 16, 2020
4ddd27f
Update man/
kelly-sovacool Nov 16, 2020
8feddf3
Incorporate feedback from @tomkoset & @sbrifkin
kelly-sovacool Nov 16, 2020
5a7e1eb
Merge 8feddf3c08a338bb166022f6a86461f0869fa77a into 190af06bcc7592c7d…
kelly-sovacool Nov 16, 2020
2cade0f
🎨 Style R code
github-actions[bot] Nov 16, 2020
2333a1f
📑 Build docs site
github-actions[bot] Nov 16, 2020
5f1ee62
Define hyperparameter
BTopcuoglu Nov 16, 2020
296e24a
Incorporate changes
BTopcuoglu Nov 16, 2020
7c26557
Don't run long example
kelly-sovacool Nov 16, 2020
91d36b1
Update date
kelly-sovacool Nov 16, 2020
d428d55
Update comments
kelly-sovacool Nov 16, 2020
ef59919
Remove "inclusive" word
BTopcuoglu Nov 16, 2020
a835362
Finish adding links
BTopcuoglu Nov 16, 2020
e9a1d35
Merge a8353629b8db343e94543f0c8c51ee94678e2a34 into 190af06bcc7592c7d…
kelly-sovacool Nov 16, 2020
f4358dc
build_site()
kelly-sovacool Nov 16, 2020
1a0ccbd
Re-submit to CRAN
kelly-sovacool Nov 16, 2020
cf90ad7
📚 Render Roxygen documentation
github-actions[bot] Nov 17, 2020
ad44f77
📑 Build docs site
github-actions[bot] Nov 17, 2020
7104ba6
Ignore tuning vignette from R build
kelly-sovacool Nov 18, 2020
017212d
Skip some tests on CRAN to reduce checktime
kelly-sovacool Nov 18, 2020
6f04bc3
Don't run long examples
kelly-sovacool Nov 18, 2020
da21ca5
Only include preprocessing vignette on website
kelly-sovacool Nov 18, 2020
9add369
Skip evaluation of a few chunks
kelly-sovacool Nov 18, 2020
d376034
Keep the shortest test option
kelly-sovacool Nov 18, 2020
dfd9a49
Use more precomputed data for the intro vignette
kelly-sovacool Nov 18, 2020
e45a490
Fix dataset name
kelly-sovacool Nov 18, 2020
5a9658c
build_site()
kelly-sovacool Nov 18, 2020
2cd5d20
Update cran comments
kelly-sovacool Nov 18, 2020
e7ab715
Cran release in review
kelly-sovacool Nov 18, 2020
35a32d7
Merge branch 'master' of https://github.com/SchlossLab/mikropml
kelly-sovacool Nov 18, 2020
104cb23
Incorporate JW feedback
kelly-sovacool Nov 18, 2020
1d10b8a
Minor changes
Nov 19, 2020
50154ee
More minor edits with ZL
kelly-sovacool Nov 19, 2020
6a8012a
Minor minor edits
kelly-sovacool Nov 19, 2020
b25c598
Add more edits
zenalapp Nov 19, 2020
318f265
Merge branch 'paper' of https://github.com/SchlossLab/mikropml into p…
kelly-sovacool Nov 19, 2020
3e6ea98
Merge branch 'doc-edits' of https://github.com/SchlossLab/mikropml in…
kelly-sovacool Nov 19, 2020
7ec8e7d
Add KLS contribution
kelly-sovacool Nov 19, 2020
649a95a
Explain co-first author order
kelly-sovacool Nov 19, 2020
117f1fc
Include Begüm's middle initial
kelly-sovacool Nov 19, 2020
f982bca
Resolve #228
kelly-sovacool Nov 20, 2020
dbcd809
Resolve #229
kelly-sovacool Nov 20, 2020
96889b2
Reflow comment everything
kelly-sovacool Nov 20, 2020
f2a76ed
Merge branch 'cran-release' into doc-edits
kelly-sovacool Nov 20, 2020
32d81e4
Rebuild vignette
kelly-sovacool Nov 20, 2020
44d9e08
Merge 32d81e4ca35a86e6d462300d6c4f5d7349067f21 into 117f1fc69df014799…
kelly-sovacool Nov 20, 2020
63ee701
📚 Render Roxygen documentation
github-actions[bot] Nov 20, 2020
482fcfe
🎨 Style R code
github-actions[bot] Nov 20, 2020
50970f0
📑 Build docs site
github-actions[bot] Nov 20, 2020
31 changes: 17 additions & 14 deletions vignettes/paper.Rmd
@@ -96,7 +96,7 @@ begin to perform ML analyses.

To enable a broader range of researchers to apply ML to their problem domains,
we created [`mikropml`](https://github.com/SchlossLab/mikropml/), an easy-to-use
-package in R [@r_core_team_r_2020] that implements the ML pipeline created by
+R package [@r_core_team_r_2020] that implements the ML pipeline created by
Topçuoğlu _et al._ [@topcuoglu_framework_2020] in a single function that returns
the best model performance metrics and feature importances.
`mikropml` leverages the `caret` package to support several ML algorithms:
@@ -120,31 +120,32 @@ and to predict gender-based biases in academic publishing [@hagan_women_2020].
# mikropml package

The `mikropml` package includes functionality to preprocess the data, train ML
-models, and quantify feature importance (Figure 1). We also provide
-[vignettes](http://www.schlosslab.org/mikropml/articles/index.html) and an
+models, evaluate model performance, and quantify feature importance (Figure 1).
+We also provide [vignettes](http://www.schlosslab.org/mikropml/articles/index.html)
+and an
[example Snakemake workflow](https://github.com/SchlossLab/mikropml-snakemake-workflow) [@koster_snakemakescalable_2012]
to showcase how to run an ideal ML pipeline with multiple different train/test
data splits. The results can be visualized using helper functions that use
`ggplot2` [@wickham_ggplot2_2016].

## Preprocessing data

-We provide a function `preprocess_data()` to preprocess features using several
-different functions from the `caret` package. The `preprocess_data()` function
+We provide the function `preprocess_data()` to preprocess features using several
+different functions from the `caret` package. `preprocess_data()`
takes continuous and categorical data, re-factors categorical data into binary
features, and provides options to normalize continuous data, remove features
with near-zero variance, and keep only one instance of perfectly correlated
-features. We set the default options based on best practices implemented in
+features. We set the default options based on those implemented in
FIDDLE [@tang_democratizing_2020]. More details on how to use
-`preprocess_data()` can be found in the accompanying
+`preprocess_data()` can be found in the accompanying
[vignette](http://www.schlosslab.org/mikropml/articles/preprocess.html).
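
As a rough illustration, here is a minimal sketch of the call described above,
assuming the interface documented in the preprocessing vignette at the time
(the `otu_mini_bin` example dataset and the `dat_transformed` element of the
return value are assumptions drawn from the package docs, not part of the diff):

```r
library(mikropml)

# preprocess the example OTU data, with the diagnosis column "dx" as the outcome;
# by default, continuous features are centered/scaled and near-zero-variance
# features are removed
prep <- preprocess_data(dataset = otu_mini_bin, outcome_colname = "dx")

# the transformed data frame is what gets passed on to run_ml()
head(prep$dat_transformed)
```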

## Running ML

The main function in mikropml, `run_ml()`, minimally takes in the model choice
-and a data frame with an outcome column and remaining columns as categorical
-or continuous features. For model choice, `mikropml` currently supports logistic
-and linear regression [`glm`: @friedman_regularization_2010], support vector
+and a data frame with an outcome column and feature columns.
+For model choice, `mikropml` currently supports logistic
+and linear regression [`glmnet`: @friedman_regularization_2010], support vector
machines with a radial basis kernel [`kernlab`: @karatzoglou_kernlab_2004],
decision trees [`rpart`: @therneau_rpart_2019],
random forest [`randomForest`: @liaw_classication_2002],
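
For reference, a hedged sketch of the minimal `run_ml()` call this paragraph
describes, using the interface as documented around this release; the
`otu_mini_bin` dataset, the `"glmnet"` method string, and the names of the
returned list elements are assumptions from the package docs, not part of the diff:

```r
library(mikropml)

# train and evaluate a regularized logistic regression model:
# run_ml() splits the data into train/test sets, tunes hyperparameters by
# cross-validation on the training set, and evaluates the final model on the
# held-out test set
result <- run_ml(otu_mini_bin, "glmnet", outcome_colname = "dx", seed = 2019)

result$performance   # cross-validation and test-set performance metrics (e.g. AUC)
result$trained_model # the underlying caret model object
```
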
@@ -173,7 +174,7 @@ contains a comprehensive tutorial on how to use `run_ml()`.

To investigate the variation in model performance depending on the train and
test set used [@topcuoglu_framework_2020; @lapp_machine_2020],
-we provide examples of how to run the `run_ml()` function many times with
+we provide examples of how to `run_ml()` many times with
different train/test splits and how to get summary information about model
performance on
[a local computer](http://www.schlosslab.org/mikropml/articles/parallel.html)
@@ -183,10 +184,12 @@ or on a high-performance computing cluster using a
## Tuning & visualization

One particularly important aspect of ML is hyperparameter tuning.
-Practitioners must explore a range of hyperparameter possibilities.
+We provide a reasonable range of default hyperparameters for each model type.
+However practitioners should explore whether that range is appropriate for their data,
+or if they should customize the hyperparameter range.
Therefore, we provide a function `plot_hp_performance()` to plot the
-cross-validation performance metric of models built using different train/test
-splits. This helps evaluate if the hyperparameter range is being searched
+cross-validation performance metric of a single model or models built using
+different train/test splits. This helps evaluate if the hyperparameter range is being searched
exhaustively and allows the user to pick the ideal set. We also provide summary
plots of test performance metrics for the many train/test splits with different
models using `plot_model_performance()`. Examples are described in the
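
To show how these two plotting helpers fit together, here is a hedged sketch
assuming the interfaces shown in the package documentation at the time; the
seed loop, the `lambda`/`AUC` column names, and the use of
`trained_model$results` are assumptions for illustration, not part of the diff:

```r
library(mikropml)

# run the pipeline over several train/test splits by varying the seed
results <- lapply(c(1, 2, 3), function(seed) {
  run_ml(otu_mini_bin, "glmnet", outcome_colname = "dx", seed = seed)
})

# check whether the hyperparameter range was searched adequately, using the
# cross-validation results stored in the caret model object
plot_hp_performance(results[[1]]$trained_model$results, lambda, AUC)

# summarize test-set performance across the different train/test splits
perf <- do.call(rbind, lapply(results, function(res) res$performance))
plot_model_performance(perf)
```
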
31 changes: 17 additions & 14 deletions vignettes/paper.md
@@ -88,7 +88,7 @@ begin to perform ML analyses.

To enable a broader range of researchers to apply ML to their problem domains,
we created [`mikropml`](https://github.com/SchlossLab/mikropml/), an easy-to-use
-package in R [@r_core_team_r_2020] that implements the ML pipeline created by
+R package [@r_core_team_r_2020] that implements the ML pipeline created by
Topçuoğlu _et al._ [@topcuoglu_framework_2020] in a single function that returns
the best model performance metrics and feature importances.
`mikropml` leverages the `caret` package to support several ML algorithms:
@@ -112,31 +112,32 @@ and to predict gender-based biases in academic publishing [@hagan_women_2020].
# mikropml package

The `mikropml` package includes functionality to preprocess the data, train ML
-models, and quantify feature importance (Figure 1). We also provide
-[vignettes](http://www.schlosslab.org/mikropml/articles/index.html) and an
+models, evaluate model performance, and quantify feature importance (Figure 1).
+We also provide [vignettes](http://www.schlosslab.org/mikropml/articles/index.html)
+and an
[example Snakemake workflow](https://github.com/SchlossLab/mikropml-snakemake-workflow) [@koster_snakemakescalable_2012]
to showcase how to run an ideal ML pipeline with multiple different train/test
data splits. The results can be visualized using helper functions that use
`ggplot2` [@wickham_ggplot2_2016].

## Preprocessing data

-We provide a function `preprocess_data()` to preprocess features using several
-different functions from the `caret` package. The `preprocess_data()` function
+We provide the function `preprocess_data()` to preprocess features using several
+different functions from the `caret` package. `preprocess_data()`
takes continuous and categorical data, re-factors categorical data into binary
features, and provides options to normalize continuous data, remove features
with near-zero variance, and keep only one instance of perfectly correlated
-features. We set the default options based on best practices implemented in
+features. We set the default options based on those implemented in
FIDDLE [@tang_democratizing_2020]. More details on how to use
-`preprocess_data()` can be found in the accompanying
+`preprocess_data()` can be found in the accompanying
[vignette](http://www.schlosslab.org/mikropml/articles/preprocess.html).

## Running ML

The main function in mikropml, `run_ml()`, minimally takes in the model choice
-and a data frame with an outcome column and remaining columns as categorical
-or continuous features. For model choice, `mikropml` currently supports logistic
-and linear regression [`glm`: @friedman_regularization_2010], support vector
+and a data frame with an outcome column and feature columns.
+For model choice, `mikropml` currently supports logistic
+and linear regression [`glmnet`: @friedman_regularization_2010], support vector
machines with a radial basis kernel [`kernlab`: @karatzoglou_kernlab_2004],
decision trees [`rpart`: @therneau_rpart_2019],
random forest [`randomForest`: @liaw_classication_2002],
@@ -165,7 +166,7 @@ contains a comprehensive tutorial on how to use `run_ml()`.

To investigate the variation in model performance depending on the train and
test set used [@topcuoglu_framework_2020; @lapp_machine_2020],
-we provide examples of how to run the `run_ml()` function many times with
+we provide examples of how to `run_ml()` many times with
different train/test splits and how to get summary information about model
performance on
[a local computer](http://www.schlosslab.org/mikropml/articles/parallel.html)
@@ -175,10 +176,12 @@ or on a high-performance computing cluster using a
## Tuning & visualization

One particularly important aspect of ML is hyperparameter tuning.
-Practitioners must explore a range of hyperparameter possibilities.
+We provide a reasonable range of default hyperparameters for each model type.
+However practitioners should explore whether that range is appropriate for their data,
+or if they should customize the hyperparameter range.
Therefore, we provide a function `plot_hp_performance()` to plot the
-cross-validation performance metric of models built using different train/test
-splits. This helps evaluate if the hyperparameter range is being searched
+cross-validation performance metric of a single model or models built using
+different train/test splits. This helps evaluate if the hyperparameter range is being searched
exhaustively and allows the user to pick the ideal set. We also provide summary
plots of test performance metrics for the many train/test splits with different
models using `plot_model_performance()`. Examples are described in the