bips-hb/bsscomparison
Supplementary information / R code for the manuscript
"Variable selection in linear regression models: choosing the best subset is not always the best choice"

Authors: Hanke, M., Dijkstra, L., Foraita, R. and Didelez, V. (2023)

Code was written by Hanke, M. and Dijkstra, L.
In case of questions or comments please contact [email protected]

OVERVIEW/INSTRUCTIONS:

To run the R code, set the working directory to the folder of the local copy of this repository

> setwd("<file>/<path>/<to>/bsscomparison")

and source masterscript.R

> source("masterscript.R")

The master script calls two types of scripts:
  1. interactive scripts asking which simulations to run
  2. the actual scripts to run the selected simulations

You can also run each of the scripts manually via the master script.

Since our simulation study is computationally demanding, we also give the user the option either to run examples of the simulation study (including a comparison of the results to the original ones) or to generate just the plots based on the raw results provided by us (see below for the links).

If you run the interactive version of this code (which we highly recommend), the R console will give information about the simulation status, errors and options.

NOTE: If you only want to generate the plots, please skip the first four steps of the interactive prompts (i.e. select options 5, 2, 2, 2). Please see also the explanation for Chapter IV below. Alternatively, you can start the master script from chapter 'IV. Generate plots'.

The master script contains 7+1 chapters:

  0.: Set-Up
      A script asking the user to install the necessary packages. If you run into trouble installing the packages via this script, please install them manually. See our session info at the end of this README.
  I.: Run the simulation for the semi-synthetic setting

  II.: Run the simulation for the synthetic setting

  III.: Apply Best Subset Selection with different time limits

  IV.: Generate all plots for the results of Chapters I and II (semi-synthetic and synthetic settings, different subset sizes and correlation vs dimensionality; corresponds to figures 2-8 & 10 of the paper and figures 1-27 of the appendix)
      NOTE: you can run this script without running Chapters I & II, but in this case you will need to download the raw results and save them under "./results". See below for the links and download options.
      ! ! ! IMPORTANT ! ! !
      If you use our built-in R script "download-intermediate results-medium-high.R" and get a timeout error, please make sure to set an appropriate timeout limit (e.g. options(timeout=1000)).

  V.: Generate all plots for the results of Chapter III (time limits; corresponds to figure 11 of the paper and figures 30-57 of the appendix)
      NOTE: you can run this script without running Chapter III, based on the raw results in ./results. However, you will need to download the results of the medium- and high-dimensional settings to generate all certification plots of the appendix, i.e. figures 30-57.

  VI.: Simulation for Stability Selection and BIC, mBIC2 and HQC (with the option of performing only an example simulation)
      Please see also the default parameter values in the corresponding chapter.

  VII.: Plot the results of Stability Selection, BIC, mBIC2 and HQC (figure 9 of the paper and figures 58-81 of the appendix)

PLOTS/FIGURES:

This repository generates all plots of our simulation study (shown in the paper and the appendix/supplement) and saves them into "./plots". The plots are named "Figure_02", "Figure_03", etc. according to the figure numbers in the paper and "Appendix_Figure_01", "Appendix_Figure_02", etc. according to the numbers in the appendix.
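The naming scheme for the plots can be used to check which figures have already been generated. The following is a minimal sketch, not part of the repository: the helper names `expected_plot_names` and `missing_plots` are hypothetical, and the default figure ranges are an assumption assembled from the ranges listed in the chapter overview (paper figures 2-11, appendix figures 1-27 and 30-81). The file extension of the plots is not stated here, so only the base names are compared.

```r
# Hypothetical helper: build the base names described in PLOTS/FIGURES.
# The default ranges are an assumption taken from the chapter overview.
expected_plot_names <- function(paper = 2:11, appendix = c(1:27, 30:81)) {
  c(sprintf("Figure_%02d", paper),
    sprintf("Appendix_Figure_%02d", appendix))
}

# Hypothetical helper: return the expected base names that have no
# matching file in plot_dir, whatever the file extension is.
missing_plots <- function(plot_dir = "plots",
                          expected = expected_plot_names()) {
  found <- tools::file_path_sans_ext(list.files(plot_dir))
  setdiff(expected, found)
}
```

Calling `missing_plots()` from the repository root after running Chapters IV, V and VII would then list any figure names that were not written to "./plots".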
Note: Figure 1 in the paper is just a schematic representation of the different correlation structures and the positioning of the direct predictors. Hence, we do not provide any code for generating this figure.

To generate the plots you have two options: you can run all simulations yourself (please see above) or you can use the raw data of our simulation runs. These are provided in this repository and via an additional file repository (please see below).

DATASETS & AVAILABILITY:

A TCGA dataset is needed for the semi-synthetic data generation and is stored in the subfolder ./data.

If you do not want to re-run the simulation for the synthetic data, you can download the results for the medium- and high-dimensional settings from
https://zenodo.org/record/8139859/files/BestSubsetResults.zip?download=1
or alternatively from
https://figshare.com/articles/dataset/Simualtion_Results/23578647
You have to unzip the files and save the RDS files into ./results
This can also be done automatically by our interactive master script (just follow its instructions).

If you do not want to re-run the time-limit simulation, you can find the raw results under ./results.

If you do not want to re-run the Stability Selection and BIC/mBIC2/HQC simulation, our raw results are stored in ./results.

FURTHER INFORMATION:

The code was written in R, run on a Linux High Performance Cluster, and used Gurobi Optimizer version 8.1 (linux64), which is mandatory to run the simulation study including best subset selection. However, we also implemented examples to re-run the code without best subset selection (masterscript.R will ask what kind of simulation to re-run; see above).
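If you prefer to fetch the raw results by hand rather than through the interactive master script, the download and unzip steps can be sketched as follows. This is an illustration under assumptions, not the repository's own download script: the function names `rds_entries` and `fetch_results` are hypothetical, and the sketch assumes the archive contains files ending in ".rds" (in any letter case). It raises the download timeout in the spirit of the options(timeout=1000) hint given for Chapter IV.

```r
# Hypothetical helper: keep only the archive entries that look like RDS files.
rds_entries <- function(entry_names) {
  entry_names[grepl("\\.rds$", entry_names, ignore.case = TRUE)]
}

# Hypothetical helper: download a zip archive and copy its RDS files
# into results_dir. Nothing is downloaded until the function is called.
fetch_results <- function(url, results_dir = "results", timeout_sec = 1000) {
  # Raise the timeout for this call only and restore it afterwards.
  old <- options(timeout = max(timeout_sec, getOption("timeout")))
  on.exit(options(old), add = TRUE)

  dir.create(results_dir, showWarnings = FALSE, recursive = TRUE)
  zip_path <- tempfile(fileext = ".zip")
  utils::download.file(url, destfile = zip_path, mode = "wb")

  # List the archive, then extract only the RDS files, dropping any
  # folder structure inside the archive (junkpaths = TRUE).
  entries <- rds_entries(utils::unzip(zip_path, list = TRUE)$Name)
  utils::unzip(zip_path, files = entries, exdir = results_dir, junkpaths = TRUE)
  invisible(file.path(results_dir, basename(entries)))
}

# Example call (run from the repository root; large download):
# fetch_results("https://zenodo.org/record/8139859/files/BestSubsetResults.zip?download=1")
```

Dropping the folder structure is a design choice of this sketch, so that the RDS files end up directly in ./results as the instructions require.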
The following R output shows the session info on our cluster:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /home/local/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /home/local/R/4.0.2/lib64/R/lib/libRlapack.so

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] caret_6.0-92      lattice_0.20-45   simsham_0.1.0     batchtools_0.9.15
 [5] mvtnorm_1.1-3     forcats_0.5.1     stringr_1.4.0     dplyr_1.0.9
 [9] purrr_0.3.4       readr_2.1.2       tidyr_1.2.0       tidyverse_1.3.1
[13] tibble_3.1.7      bestsubset_1.0.10 ggplot2_3.3.6     glmnet_4.0-2
[17] Matrix_1.4-1      Rmpi_0.6-9.2      snow_0.4-4

loaded via a namespace (and not attached):
 [1] nlme_3.1-158         fs_1.5.2          lubridate_1.8.0
 [4] progress_1.2.2       httr_1.4.3        tools_4.0.2
 [7] backports_1.4.1      utf8_1.2.2        R6_2.5.1
[10] rpart_4.1.16         DBI_1.1.3         colorspace_2.0-3
[13] nnet_7.3-17          withr_2.5.0       tidyselect_1.1.2
[16] prettyunits_1.1.1    compiler_4.0.2    cli_3.3.0
[19] rvest_1.0.2          xml2_1.3.3        scales_1.2.0
[22] checkmate_2.1.0      rappdirs_0.3.3    digest_0.6.29
[25] pkgconfig_2.0.3      parallelly_1.32.0 dbplyr_2.2.1
[28] rlang_1.0.3          readxl_1.4.0      rstudioapi_0.13
[31] shape_1.4.6          generics_0.1.2    jsonlite_1.8.0
[34] ModelMetrics_1.2.2.2 magrittr_2.0.3    Rcpp_1.0.8.3
[37] munsell_0.5.0        fansi_1.0.3       lifecycle_1.0.1
[40] pROC_1.18.0          stringi_1.7.6     MASS_7.3-57
[43] plyr_1.8.7           recipes_0.2.0     grid_4.0.2
[46] listenv_0.8.0        crayon_1.5.1      haven_2.5.0
[49] splines_4.0.2        hms_1.1.1         pillar_1.7.0
[52] base64url_1.4        stats4_4.0.2      reshape2_1.4.4
[55] future.apply_1.9.0   codetools_0.2-18  reprex_2.0.1
[58] glue_1.6.2           data.table_1.14.2 modelr_0.1.8
[61] vctrs_0.4.1          tzdb_0.3.0        foreach_1.5.2
[64] cellranger_1.1.0     gtable_0.3.0      future_1.26.1
[67] assertthat_0.2.1     gower_1.0.0       prodlim_2019.11.13
[70] broom_0.8.0          class_7.3-20      survival_3.3-1
[73] timeDate_3043.102    iterators_1.0.14  hardhat_1.1.0
[76] lava_1.6.10          globals_0.15.1    ellipsis_0.3.2
[79] brew_1.0-7           ipred_0.9-13
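
Before installing packages manually, it can help to compare a local installation with the attached packages listed in the session info above. The following is a minimal sketch under assumptions: `session_pkgs` simply copies the "other attached packages" entries, the helper name `compare_versions` is hypothetical, and versions are compared as plain strings. A differing version does not necessarily mean the code will fail.

```r
# Versions copied from the "other attached packages" block of the session info.
session_pkgs <- c(
  caret = "6.0-92", lattice = "0.20-45", simsham = "0.1.0",
  batchtools = "0.9.15", mvtnorm = "1.1-3", forcats = "0.5.1",
  stringr = "1.4.0", dplyr = "1.0.9", purrr = "0.3.4", readr = "2.1.2",
  tidyr = "1.2.0", tidyverse = "1.3.1", tibble = "3.1.7",
  bestsubset = "1.0.10", ggplot2 = "3.3.6", glmnet = "4.0-2",
  Matrix = "1.4-1", Rmpi = "0.6-9.2", snow = "0.4-4"
)

# Hypothetical helper: report for each wanted package whether it is
# missing, installed in the same version, or installed in another version.
compare_versions <- function(wanted,
                             installed = installed.packages()[, "Version"]) {
  have <- unname(installed[names(wanted)])  # NA where a package is absent
  status <- ifelse(is.na(have), "missing",
                   ifelse(have == unname(wanted), "same", "differs"))
  data.frame(package = names(wanted), wanted = unname(wanted),
             installed = have, status = status, stringsAsFactors = FALSE)
}

# Example call:
# compare_versions(session_pkgs)
```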
About
Generates all the results from the paper "When choosing the best subset is not the best choice” by M. Hanke, L. Dijkstra, R. Foraita and V. Didelez