diff --git a/docs/articles/introduction.html b/docs/articles/introduction.html
index ceb61f90..769488ae 100644
--- a/docs/articles/introduction.html
+++ b/docs/articles/introduction.html
@@ -102,7 +102,7 @@
-
+

In the above case, all observations from A & B will be used for training, all from C & D will be used for testing, and the remaining groups will be randomly assigned to one or the other to satisfy the training_frac as closely as possible.
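
A hedged sketch of such a call, reusing the otu_mini_bin dataset and glmnet method from earlier in this vignette (sample_groups is a hypothetical character vector with one group label, "A" through "H", per row):

# Sketch only: pin groups A & B to training and C & D to testing;
# the remaining groups are split to approximate training_frac.
results_grouped <- run_ml(otu_mini_bin,
                          "glmnet",
                          outcome_colname = "dx",
                          groups = sample_groups,
                          group_partitions = list(train = c("A", "B"),
                                                  test = c("C", "D")),
                          training_frac = 0.8,
                          seed = 2019)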

@@ -414,6 +416,7 @@

#> Fraction of data in the training set: 0.5
#> Groups in the training set: A B C D E F
#> Groups in the testing set: A B C D E F G H
+#> Groups will be kept together in CV partitions
#> Training the model...
#> Training complete.

If you need even more control than this, take a look at setting custom training indices. You might also prefer to provide your own train control scheme with the cross_val parameter in run_ml().
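
For instance, a minimal sketch of a custom scheme (values are illustrative; cross_val accepts a caret::trainControl() object):

library(caret)
# Illustrative scheme: 5-fold cross-validation repeated 10 times.
custom_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 10)
results_custom <- run_ml(otu_mini_bin,
                         "glmnet",
                         outcome_colname = "dx",
                         cross_val = custom_cv,
                         seed = 2019)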

@@ -435,17 +438,17 @@

Now, we can check out the feature importances:

 results_imp$feature_importance
-#>    perf_metric perf_metric_diff    names method perf_metric_name seed
-#> 1    0.5542375        0.0082625 Otu00001     rf              AUC 2019
-#> 2    0.5731750       -0.0106750 Otu00002     rf              AUC 2019
-#> 3    0.5548750        0.0076250 Otu00003     rf              AUC 2019
-#> 4    0.6414750       -0.0789750 Otu00004     rf              AUC 2019
-#> 5    0.5049625        0.0575375 Otu00005     rf              AUC 2019
-#> 6    0.5444500        0.0180500 Otu00006     rf              AUC 2019
-#> 7    0.5417125        0.0207875 Otu00007     rf              AUC 2019
-#> 8    0.5257750        0.0367250 Otu00008     rf              AUC 2019
-#> 9    0.5395750        0.0229250 Otu00009     rf              AUC 2019
-#> 10   0.4977625        0.0647375 Otu00010     rf              AUC 2019
+#>    perf_metric perf_metric_diff pvalue    names method perf_metric_name seed
+#> 1    0.5542375        0.0082625   0.37 Otu00001     rf              AUC 2019
+#> 2    0.5731750       -0.0106750   0.57 Otu00002     rf              AUC 2019
+#> 3    0.5548750        0.0076250   0.38 Otu00003     rf              AUC 2019
+#> 4    0.6414750       -0.0789750   0.99 Otu00004     rf              AUC 2019
+#> 5    0.5049625        0.0575375   0.05 Otu00005     rf              AUC 2019
+#> 6    0.5444500        0.0180500   0.18 Otu00006     rf              AUC 2019
+#> 7    0.5417125        0.0207875   0.21 Otu00007     rf              AUC 2019
+#> 8    0.5257750        0.0367250   0.05 Otu00008     rf              AUC 2019
+#> 9    0.5395750        0.0229250   0.02 Otu00009     rf              AUC 2019
+#> 10   0.4977625        0.0647375   0.05 Otu00010     rf              AUC 2019

There are several columns:

@@ -483,10 +486,10 @@

#> Finding feature importance...
#> Feature importance complete.
results_imp_corr$feature_importance
-#>   perf_metric perf_metric_diff
-#> 1   0.5502105       0.09715789
-#> 2   0.6369474       0.01042105
-#> 3   0.5951316       0.05223684
+#>   perf_metric perf_metric_diff pvalue
+#> 1   0.5502105       0.09715789   0.08
+#> 2   0.6369474       0.01042105   0.40
+#> 3   0.5951316       0.05223684   0.08
#>                                                                      names
#> 1 Otu00001|Otu00002|Otu00003|Otu00005|Otu00006|Otu00007|Otu00009|Otu00010
#> 2                                                                 Otu00004
@@ -541,7 +544,7 @@

    'svmRadial', cv_times = 5, seed = 2019)
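
(The head of this call is elided by the hunk; a hedged reconstruction, assuming the otu_mini_bin dataset used throughout the vignette, would be:)

# Assumed reconstruction of the elided call head:
results_svm <- run_ml(otu_mini_bin,
                      'svmRadial',
                      cv_times = 5,
                      seed = 2019)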

-

If you get a message “maximum number of iterations reached,” see this issue in caret.

+

If you get a message “maximum number of iterations reached”, see this issue in caret.

@@ -552,7 +555,7 @@

Multiclass data

We provide otu_mini_multi with a multiclass outcome (i.e. three or more outcome classes):

-otu_mini_multi %>% dplyr::pull('dx') %>% unique()
+otu_mini_multi %>% dplyr::pull('dx') %>% unique()
 #> [1] "adenoma"   "carcinoma" "normal"

Here’s an example of running multiclass data:
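
(The example itself falls outside this hunk; a hedged sketch of what such a call looks like, reusing the glmnet method and dx outcome column from earlier:)

# Sketch: multiclass outcomes use the same run_ml() interface.
results_multi <- run_ml(otu_mini_multi,
                        "glmnet",
                        outcome_colname = "dx",
                        seed = 2019)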

@@ -593,12 +596,12 @@ 

References

-
-
-Tang, Shengpu, Parmida Davarmanesh, Yanmeng Song, Danai Koutra, Michael W. Sjoding, and Jenna Wiens. 2020. “Democratizing EHR Analyses with FIDDLE: A Flexible Data-Driven Preprocessing Pipeline for Structured Clinical Data.” J Am Med Inform Assoc, October. https://doi.org/10.1093/jamia/ocaa139.
+
+
+

Tang, Shengpu, Parmida Davarmanesh, Yanmeng Song, Danai Koutra, Michael W. Sjoding, and Jenna Wiens. 2020. “Democratizing EHR Analyses with FIDDLE: A Flexible Data-Driven Preprocessing Pipeline for Structured Clinical Data.” J Am Med Inform Assoc, October. https://doi.org/10.1093/jamia/ocaa139.

-
-Topçuoğlu, Begüm D., Nicholas A. Lesniak, Mack T. Ruffin, Jenna Wiens, and Patrick D. Schloss. 2020. “A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems.” mBio 11 (3). https://doi.org/10.1128/mBio.00434-20.
+
+

Topçuoğlu, Begüm D., Nicholas A. Lesniak, Mack T. Ruffin, Jenna Wiens, and Patrick D. Schloss. 2020. “A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems.” mBio 11 (3). https://doi.org/10.1128/mBio.00434-20.

diff --git a/docs/pkgdown.yml b/docs/pkgdown.yml
index 7a3cb23d..18bf8145 100644
--- a/docs/pkgdown.yml
+++ b/docs/pkgdown.yml
@@ -7,7 +7,7 @@ articles:
  parallel: parallel.html
  preprocess: preprocess.html
  tuning: tuning.html
-last_built: 2021-11-28T03:01Z
+last_built: 2021-11-30T18:45Z
urls:
  reference: http://www.schlosslab.org/mikropml/reference
  article: http://www.schlosslab.org/mikropml/articles
diff --git a/docs/reference/get_feature_importance.html b/docs/reference/get_feature_importance.html
index 2f93efb4..759ea311 100644
--- a/docs/reference/get_feature_importance.html
+++ b/docs/reference/get_feature_importance.html
@@ -256,16 +256,31 @@

Arguments

Value

-Dataframe with performance metrics for when each feature (or group of
-correlated features; names) is permuted (perf_metric), and differences
-between test performance metric and permuted performance metric
-(perf_metric_diff; test minus permuted performance). Features with a
-larger perf_metric_diff are more important. The performance metric name
-(perf_metric_name) and seed (seed) are also returned.

+Data frame with performance metrics for when each feature (or group
+of correlated features; names) is permuted (perf_metric), differences
+between the actual test performance metric and the permuted performance
+metric (perf_metric_diff; test minus permuted performance), and the
+p-value (pvalue: the probability of obtaining the actual performance
+value under the null hypothesis). Features with a larger perf_metric_diff
+are more important. The performance metric name (perf_metric_name) and
+seed (seed) are also returned.

+

Details

+
+For permutation tests, the p-value is the number of permutation statistics
+that are greater than the test statistic, divided by the number of
+permutations. In our case, the permutation statistic is the model performance
+(e.g. AUROC) after randomizing the order of observations for one feature, and
+the test statistic is the actual performance on the test data. By default we
+perform 100 permutations per feature; increasing this will increase the
+precision of estimating the null distribution, but also increases runtime.
+The p-value represents the probability of obtaining the actual performance in
+the event that the null hypothesis is true, where the null hypothesis is that
+the feature is not important for model performance.
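
To make that formula concrete, here is a toy sketch in R (not mikropml's internal code; all numbers are made up):

# Toy illustration of the permutation p-value described above.
set.seed(2019)
test_stat <- 0.65                     # actual test AUROC
perm_stats <- rnorm(100, 0.60, 0.03)  # stand-in for 100 permuted AUROCs
pvalue <- sum(perm_stats > test_stat) / length(perm_stats)
pvalue                                # small when the feature matters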

Author

Begüm Topçuoğlu, topcuoglu.begum@gmail.com

Zena Lapp, zenalapp@umich.edu

+

Kelly Sovacool, sovacool@umich.edu

Examples

if (FALSE) {
diff --git a/docs/reference/get_perf_metric_fn.html b/docs/reference/get_perf_metric_fn.html
index f5d033fd..82c5d2d0 100644
--- a/docs/reference/get_perf_metric_fn.html
+++ b/docs/reference/get_perf_metric_fn.html
@@ -182,7 +182,7 @@ Examp
#> data$obs <- factor(data$obs, levels = lev)
#> postResample(data[, "pred"], data[, "obs"])
#> }
-#> <bytecode: 0x7f9380617c08>
+#> <bytecode: 0x7fc428237898>
#> <environment: namespace:caret>
get_perf_metric_fn("binary")
#> function (data, lev = NULL, model = NULL)
@@ -240,7 +240,7 @@ Examp
#> stats <- stats[c(stat_list)]
#> return(stats)
#> }
-#> <bytecode: 0x7f939d1362b0>
+#> <bytecode: 0x7fc445fd1c50>
#> <environment: namespace:caret>
get_perf_metric_fn("multiclass")
#> function (data, lev = NULL, model = NULL)
@@ -298,7 +298,7 @@ Examp
#> stats <- stats[c(stat_list)]
#> return(stats)
#> }
-#> <bytecode: 0x7f939d1362b0>
+#> <bytecode: 0x7fc445fd1c50>
#> <environment: namespace:caret>