diff --git a/docs/articles/introduction.html b/docs/articles/introduction.html
index ceb61f90..769488ae 100644
--- a/docs/articles/introduction.html
+++ b/docs/articles/introduction.html
@@ -102,7 +102,7 @@
When training_frac is a fraction between 0 and 1, a random sample of observations in the dataset is chosen for the training set to satisfy the training_frac. However, in some cases you might wish to control exactly which observations are in the training set. You can instead assign training_frac a vector of indices that correspond to which rows of the dataset should go in the training set (all remaining observations will go in the testing set).
n_obs <- otu_mini_bin %>% nrow()
training_size <- 0.8 * n_obs
training_rows <- sample(n_obs, training_size)
results_custom_train <- run_ml(otu_mini_bin,
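As a hedged sketch of the completed call (the "glmnet" method and seed shown here are assumptions, not from the original):

# Sketch: pass the custom row indices as training_frac so exactly these
# rows form the training set.
results_custom_train <- run_ml(otu_mini_bin,
  "glmnet", # assumed method
  training_frac = training_rows,
  seed = 2019
)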
@@ -373,6 +373,7 @@
#> Fraction of data in the training set: 0.795
#> Groups in the training set: A B D F G H
#> Groups in the testing set: C E
+#> Groups will be kept together in CV partitions
#> Training the model...
#> Training complete.
The one difference here is that run_ml() will report how much of the data is in the training set if you run the above code chunk. This can be a little finicky depending on how many samples and groups you have, because the fraction won't be exactly what you specify with training_frac: all observations from a given group must go into either the training set or the test set. In the above case, all observations from A & B will be used for training, all from C & D will be used for testing, and the remaining groups will be randomly assigned to one or the other to satisfy the training_frac as closely as possible. If you need even more control than this, take a look at setting custom training indices. You might also prefer to provide your own train control scheme with the cross_val parameter in run_ml().
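As a hedged sketch of how such a split might be requested (the grps vector, the "glmnet" method, and all other argument values are illustrative assumptions, not taken from the original):

# Sketch: assign each row a hypothetical group label, then ask run_ml() to
# send groups A & B to training and C & D to testing; remaining groups are
# assigned randomly to approximate training_frac.
set.seed(2019)
grps <- sample(LETTERS[1:8], nrow(otu_mini_bin), replace = TRUE)
results_grp_part <- run_ml(otu_mini_bin,
  "glmnet", # assumed method
  groups = grps,
  group_partitions = list(train = c("A", "B"), test = c("C", "D")),
  training_frac = 0.8,
  seed = 2019
)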
Now, we can check out the feature importances:
results_imp$feature_importance
-#> perf_metric perf_metric_diff names method perf_metric_name seed
-#> 1 0.5542375 0.0082625 Otu00001 rf AUC 2019
-#> 2 0.5731750 -0.0106750 Otu00002 rf AUC 2019
-#> 3 0.5548750 0.0076250 Otu00003 rf AUC 2019
-#> 4 0.6414750 -0.0789750 Otu00004 rf AUC 2019
-#> 5 0.5049625 0.0575375 Otu00005 rf AUC 2019
-#> 6 0.5444500 0.0180500 Otu00006 rf AUC 2019
-#> 7 0.5417125 0.0207875 Otu00007 rf AUC 2019
-#> 8 0.5257750 0.0367250 Otu00008 rf AUC 2019
-#> 9 0.5395750 0.0229250 Otu00009 rf AUC 2019
-#> 10 0.4977625 0.0647375 Otu00010 rf AUC 2019
There are several columns:
perf_metric: the performance value when the feature is permuted.
perf_metric_diff: the difference between the test performance and the permuted performance (test minus permuted). Features with a larger perf_metric_diff are more important.
names: the feature (or group of correlated features) that was permuted.
method: the ML method used.
perf_metric_name: the name of the performance metric (here, AUC).
seed: the seed used for the permutation.
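As a small usage sketch (assuming the results_imp object from above), you can rank features by importance:

# Sketch: order features so those whose permutation hurt test performance
# the most (largest perf_metric_diff) come first.
results_imp$feature_importance %>%
  dplyr::arrange(dplyr::desc(perf_metric_diff))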
-If you get a message “maximum number of iterations reached,” see this issue in caret.
+If you get a message “maximum number of iterations reached”, see this issue in caret.
We provide otu_mini_multi with a multiclass outcome (three or more outcomes):
otu_mini_multi %>% dplyr::pull('dx') %>% unique()
#> [1] "adenoma" "carcinoma" "normal"
Here’s an example of running multiclass data:
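As a hedged sketch (the "glmnet" method and seed here are assumptions, not from the original), such a call might look like:

# Sketch: run_ml() detects the multiclass outcome from the three values of
# the dx column and reports multiclass performance metrics.
results_multi <- run_ml(otu_mini_multi,
  "glmnet", # assumed method
  outcome_colname = "dx",
  seed = 2019
)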
@@ -593,12 +596,12 @@
References
Tang, Shengpu, Parmida Davarmanesh, Yanmeng Song, Danai Koutra, Michael W. Sjoding, and Jenna Wiens. 2020. “Democratizing EHR Analyses with FIDDLE: A Flexible Data-Driven Preprocessing Pipeline for Structured Clinical Data.” J Am Med Inform Assoc, October. https://doi.org/10.1093/jamia/ocaa139.
Topçuoğlu, Begüm D., Nicholas A. Lesniak, Mack T. Ruffin, Jenna Wiens, and Patrick D. Schloss. 2020. “A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems.” mBio 11 (3). https://doi.org/10.1128/mBio.00434-20.
diff --git a/docs/pkgdown.yml b/docs/pkgdown.yml
index 7a3cb23d..18bf8145 100644
--- a/docs/pkgdown.yml
+++ b/docs/pkgdown.yml
@@ -7,7 +7,7 @@ articles:
   parallel: parallel.html
   preprocess: preprocess.html
   tuning: tuning.html
-last_built: 2021-11-28T03:01Z
+last_built: 2021-11-30T18:45Z
 urls:
   reference: http://www.schlosslab.org/mikropml/reference
   article: http://www.schlosslab.org/mikropml/articles
diff --git a/docs/reference/get_feature_importance.html b/docs/reference/get_feature_importance.html
index 2f93efb4..759ea311 100644
--- a/docs/reference/get_feature_importance.html
+++ b/docs/reference/get_feature_importance.html
@@ -256,16 +256,31 @@
Arguments
Value
-Dataframe with performance metrics for when each feature (or group of
-correlated features; names) is permuted (perf_metric), and differences
-between test performance metric and permuted performance metric
-(perf_metric_diff; test minus permuted performance). Features with a
-larger perf_metric_diff are more important. The performance metric name
-(perf_metric_name) and seed (seed) are also returned.
+Data frame with performance metrics for when each feature (or group
+of correlated features; names) is permuted (perf_metric), differences
+between the actual test performance metric and the permuted performance
+metric (perf_metric_diff; test minus permuted performance), and the
+p-value (pvalue: the probability of obtaining the actual performance
+value under the null hypothesis). Features with a larger perf_metric_diff
+are more important. The performance metric name (perf_metric_name) and
+seed (seed) are also returned.
+ +For permutation tests, the p-value is the number of permutation statistics +that are greater than the test statistic, divided by the number of +permutations. In our case, the permutation statistic is the model performance +(e.g. AUROC) after randomizing the order of observations for one feature, and +the test statistic is the actual performance on the test data. By default we +perform 100 permutations per feature; increasing this will increase the +precision of estimating the null distribution, but also increases runtime. +The p-value represents the probability of obtaining the actual performance in +the event that the null hypothesis is true, where the null hypothesis is that +the feature is not important for model performance.
Author
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Zena Lapp, zenalapp@umich.edu
+Kelly Sovacool, sovacool@umich.edu
Examples
if (FALSE) {
diff --git a/docs/reference/get_perf_metric_fn.html b/docs/reference/get_perf_metric_fn.html
index f5d033fd..82c5d2d0 100644
--- a/docs/reference/get_perf_metric_fn.html
+++ b/docs/reference/get_perf_metric_fn.html
@@ -182,7 +182,7 @@ Examples
 #> data$obs <- factor(data$obs, levels = lev)
 #> postResample(data[, "pred"], data[, "obs"])
 #> }
-#> <bytecode: 0x7f9380617c08>
+#> <bytecode: 0x7fc428237898>
 #> <environment: namespace:caret>
 get_perf_metric_fn("binary")
 #> function (data, lev = NULL, model = NULL)
@@ -240,7 +240,7 @@ Examples
 #> stats <- stats[c(stat_list)]
 #> return(stats)
 #> }
-#> <bytecode: 0x7f939d1362b0>
+#> <bytecode: 0x7fc445fd1c50>
 #> <environment: namespace:caret>
 get_perf_metric_fn("multiclass")
 #> function (data, lev = NULL, model = NULL)
@@ -298,7 +298,7 @@ Examples
 #> stats <- stats[c(stat_list)]
 #> return(stats)
 #> }
-#> <bytecode: 0x7f939d1362b0>
+#> <bytecode: 0x7fc445fd1c50>
 #> <environment: namespace:caret>