GitHub - bips-hb/bsscomparison: Generates all the results from the paper "When choosing the best subset is not the best choice” by M. Hanke, L. Dijkstra, R. Foraita and V. Didelez

Supplementary information / R code for the manuscript  
"Variable selection in linear regression models: choosing the best 
subset is not always the best choice"

Authors: Hanke, M., Dijkstra, L., Foraita, R. and Didelez, V. (2023)
Code was written by Hanke, M. and Dijkstra, L.
In case of questions or comments please contact [email protected]


OVERVIEW/INSTRUCTIONS: 

    In order to run the R-code, you need to set the working directory to 
    the folder of the local copy of this repository 
    
    > setwd("<file>/<path>/<to>/bsscomparison")

    and source the masterscript.R
    
    > source("masterscript.R")
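    A minimal sketch of a sanity check you could run before sourcing the
    master script. The helper is_repo_root() is hypothetical and not part of
    this repository; it only tests whether masterscript.R is present in the
    given folder.

```r
# Hypothetical helper (not part of the repository): check that a folder
# looks like the local copy of the repository, i.e. contains masterscript.R.
is_repo_root <- function(dir = getwd()) {
  file.exists(file.path(dir, "masterscript.R"))
}

# Report whether the current working directory is suitable.
if (is_repo_root()) {
  message("masterscript.R found; you can now run source(\"masterscript.R\").")
} else {
  message("masterscript.R not found in ", getwd(),
          "; please call setwd() on the repository folder first.")
}
```

    The check does not source anything itself, so it is safe to run from any
    directory.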

    The master script calls two types of scripts: 
    1. interactive scripts asking which simulations to run
    2. the actual scripts to run the selected simulations
    
    You can also run each of the scripts manually via the master script. 
    Since our simulation study is computationally demanding, we also 
    give the user the option either to run examples of the simulation 
    study (including a comparison of the results to the original ones) 
    or to generate just the plots based on the raw results provided by us
    (see below for the links).
    If you run the interactive version of this code (which we highly 
    recommend), the R console will give information about the simulation 
    status, errors and options.

    NOTE: If you only want to generate the plots, please skip the first
    four steps of the interactive prompts (i.e. by selecting options 5, 2, 2, 2).
    Please see also the explanation for Chapter IV below.

    Alternatively, you can start the master script from chapter 'IV. Generate plots'.

    The master script contains 7+1 chapters:

    0.: Set-Up
    A script asking the user to install the necessary packages. 
    If you run into trouble installing the packages via this script please 
    install them manually. See our session info at the end of this README.

    I.: Run the simulation for the semi-synthetic setting 

    II.: Run the simulation for the synthetic setting

    III.: Apply Best Subset Selection with different time limits

    IV.: Generate all plots for the results of Chapter I and II (Semi-
    synthetic and synthetic settings, different subset sizes and 
    correlation vs dimensionality; corresponds to figures 2-8 & 10 of the 
    paper and figures 1-27 of the appendix)
    NOTE: you can run this script without running Chapter I & II
    but in this case you will need to download the raw results and save 
    them under "./results". See below for the links and download options.
    ! ! ! IMPORTANT ! ! ! If you use our built-in R script 
    "download-intermediate results-medium-high.R" and get a timeout error, 
    please make sure to set an appropriate timeout limit (e.g. 
    options(timeout=1000)). 

    V.: Generate all plots for the results of Chapter III (time limits; 
    corresponds to figure 11 of the paper and figures 30-57 of the appendix)
    NOTE: you can run this script without running Chapter III based
    on the raw results in ./results. However, you will need to download the
    results of the medium- and high-dimensional settings to generate all
    certification plots of the appendix, i.e. figures 30-57

    VI.: Simulation for Stability Selection and BIC, mBIC2 and HQC (with
    the option of performing only an example simulation)
    Please see also the default parameter values in the corresponding 
    chapter. 

    VII.: Plot the results of Stability Selection, BIC, mBIC2 and HQC 
    (figure 9 of the paper and figures 58-81 of the appendix) 
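    If the set-up script of Chapter 0 fails, the following sketch may help to
    see which packages are still missing before installing them manually. The
    package list is read off the "other attached packages" block of the
    session info at the end of this README and may be incomplete; the helper
    missing_packages() is hypothetical and not part of this repository.

```r
# Package names taken from the session info in this README (assumption:
# the set-up script may require more or fewer packages than listed here).
needed <- c("caret", "lattice", "simsham", "batchtools", "mvtnorm",
            "tidyverse", "bestsubset", "ggplot2", "glmnet", "Matrix",
            "Rmpi", "snow")

# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(utils::installed.packages()))
}

# Print the packages that still need to be installed by hand.
print(missing_packages(needed))
```

    Some of the listed packages (e.g. simsham and bestsubset) may not be 
    available from CRAN and might have to be installed from their own 
    sources; please treat this as an assumption to verify.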


PLOTS/FIGURES: 

    This repository generates all plots of our simulation study (shown in 
    the paper and the appendix/supplement) and saves them into "./plots".
    The plots are named "Figure_02", "Figure_03", etc. according to figure 
    numbers in the paper and "Appendix_Figure_01", "Appendix_Figure_02", etc.
    according to the numbers in the appendix.
    Note: Figure 1 in the paper is just a schematic representation of the 
    different correlation structures and the positioning of the direct 
    predictors. Hence, we do not provide any code for generating this figure.
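    The naming scheme described above can be sketched as follows. The helper
    figure_name() is hypothetical and only illustrates the zero-padded 
    numbering; the file extension of the saved plots is not stated here, so
    the sketch deliberately leaves it out.

```r
# Hypothetical helper: build plot names following the scheme
# "Figure_02", "Appendix_Figure_01", ... described in this README.
figure_name <- function(number, appendix = FALSE) {
  prefix <- if (appendix) "Appendix_Figure_" else "Figure_"
  sprintf("%s%02d", prefix, as.integer(number))
}

# List the plots that have already been generated (empty if ./plots does not
# exist or contains no matching files).
existing_plots <- list.files("plots", pattern = "^(Appendix_)?Figure_[0-9]+")
```

    For example, figure_name(2) gives "Figure_02" and 
    figure_name(1, appendix = TRUE) gives "Appendix_Figure_01".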

    To generate the plots you have two options: you can run all simulations
    by yourself (please see above) or you can use the raw data of our simulation
    runs. These are provided in this repository and via an additional file
    repository (please see below).


DATASETS & AVAILABILITY:

    A TCGA dataset is needed for the semi-synthetic data generation and 
    is stored in the subfolder ./data. 

    If you do not want to re-run the simulation for the synthetic data you
    can download the result for the medium- and high-dimensional settings
    under 
    https://zenodo.org/record/8139859/files/BestSubsetResults.zip?download=1
    or alternatively under 
    https://figshare.com/articles/dataset/Simualtion_Results/23578647
    You have to unzip the files and save the RDS files into ./results.
    This can also be done automatically by our interactive master script 
    (just follow its instructions).
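    For a manual download, the steps above could look roughly like the sketch
    below. It is not the repository's own download script: the helpers
    collect_rds() and fetch_results() are hypothetical, and the layout of the
    zip archive is assumed to be unknown, so all RDS files are searched
    recursively. The timeout is raised temporarily, as recommended above, and
    restored afterwards.

```r
# Zenodo link as given in this README.
results_url <- "https://zenodo.org/record/8139859/files/BestSubsetResults.zip?download=1"

# Hypothetical helper: copy every RDS file found (recursively) under
# `from_dir` into `results_dir` and return the copied file names.
collect_rds <- function(from_dir, results_dir = "results") {
  dir.create(results_dir, showWarnings = FALSE, recursive = TRUE)
  rds_files <- list.files(from_dir, pattern = "\\.[Rr][Dd][Ss]$",
                          recursive = TRUE, full.names = TRUE)
  file.copy(rds_files, results_dir, overwrite = TRUE)
  basename(rds_files)
}

# Hypothetical helper: download the archive with a raised timeout,
# unzip it into a temporary folder and collect the RDS files.
fetch_results <- function(url = results_url, results_dir = "results",
                          timeout = 1000) {
  old <- options(timeout = timeout)      # raise the download timeout
  on.exit(options(old), add = TRUE)      # restore the previous setting
  zip_file  <- tempfile(fileext = ".zip")
  unzip_dir <- tempfile("bss_results_")
  utils::download.file(url, destfile = zip_file, mode = "wb")
  utils::unzip(zip_file, exdir = unzip_dir)
  collect_rds(unzip_dir, results_dir)
}
```

    Calling fetch_results() from the repository folder would then place the 
    RDS files in ./results. The download is large, so the function is only 
    defined here and not called.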

    If you do not want to re-run the time-limit simulation, you can find the 
    raw results under ./results.

    If you do not want to re-run the Stability Selection and BIC/mBIC2/HQC 
    simulation, our raw results are stored in ./results.


FURTHER INFORMATION:

    The code was written in R, run on a Linux High Performance Cluster and used 
    Gurobi Optimizer version 8.1 (linux64), which is mandatory to run the 
    simulation study including best subset selection. However, we also 
    implemented examples to re-run the code without best subset selection 
    (the masterscript.R will ask what kind of simulation to re-run; see above).
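    Before choosing which simulation to re-run, you may want to check whether
    Gurobi's R interface is available. The sketch below assumes that the
    interface is installed as an R package named "gurobi"; the helper
    has_gurobi() is hypothetical and not part of this repository. It does not
    check the Gurobi version or the licence.

```r
# Hypothetical helper: TRUE if an R package called "gurobi" is installed
# (assumption: this is the package name of Gurobi's R interface).
has_gurobi <- function() {
  nzchar(system.file(package = "gurobi"))
}

if (has_gurobi()) {
  message("gurobi package found; simulations with best subset selection may be possible.")
} else {
  message("gurobi package not found; consider the examples without best subset selection.")
}
```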

    The following R output shows the session info on our cluster: 

    > sessionInfo()
    R version 4.0.2 (2020-06-22)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: CentOS Linux 7 (Core)

    Matrix products: default
    BLAS:   /home/local/R/4.0.2/lib64/R/lib/libRblas.so
    LAPACK: /home/local/R/4.0.2/lib64/R/lib/libRlapack.so

    Random number generation:
    RNG:     Mersenne-Twister 
    Normal:  Inversion 
    Sample:  Rounding 
    
    locale:
    [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
    [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
    [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
    [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
    [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods  
    [8] base     

    other attached packages:
    [1] caret_6.0-92      lattice_0.20-45   simsham_0.1.0     batchtools_0.9.15
    [5] mvtnorm_1.1-3     forcats_0.5.1     stringr_1.4.0     dplyr_1.0.9      
    [9] purrr_0.3.4       readr_2.1.2       tidyr_1.2.0       tidyverse_1.3.1  
    [13] tibble_3.1.7      bestsubset_1.0.10 ggplot2_3.3.6     glmnet_4.0-2     
    [17] Matrix_1.4-1      Rmpi_0.6-9.2      snow_0.4-4       

    loaded via a namespace (and not attached):
    [1] nlme_3.1-158         fs_1.5.2             lubridate_1.8.0     
    [4] progress_1.2.2       httr_1.4.3           tools_4.0.2         
    [7] backports_1.4.1      utf8_1.2.2           R6_2.5.1            
    [10] rpart_4.1.16         DBI_1.1.3            colorspace_2.0-3    
    [13] nnet_7.3-17          withr_2.5.0          tidyselect_1.1.2    
    [16] prettyunits_1.1.1    compiler_4.0.2       cli_3.3.0           
    [19] rvest_1.0.2          xml2_1.3.3           scales_1.2.0        
    [22] checkmate_2.1.0      rappdirs_0.3.3       digest_0.6.29       
    [25] pkgconfig_2.0.3      parallelly_1.32.0    dbplyr_2.2.1        
    [28] rlang_1.0.3          readxl_1.4.0         rstudioapi_0.13     
    [31] shape_1.4.6          generics_0.1.2       jsonlite_1.8.0      
    [34] ModelMetrics_1.2.2.2 magrittr_2.0.3       Rcpp_1.0.8.3        
    [37] munsell_0.5.0        fansi_1.0.3          lifecycle_1.0.1     
    [40] pROC_1.18.0          stringi_1.7.6        MASS_7.3-57         
    [43] plyr_1.8.7           recipes_0.2.0        grid_4.0.2          
    [46] listenv_0.8.0        crayon_1.5.1         haven_2.5.0         
    [49] splines_4.0.2        hms_1.1.1            pillar_1.7.0        
    [52] base64url_1.4        stats4_4.0.2         reshape2_1.4.4      
    [55] future.apply_1.9.0   codetools_0.2-18     reprex_2.0.1        
    [58] glue_1.6.2           data.table_1.14.2    modelr_0.1.8        
    [61] vctrs_0.4.1          tzdb_0.3.0           foreach_1.5.2       
    [64] cellranger_1.1.0     gtable_0.3.0         future_1.26.1       
    [67] assertthat_0.2.1     gower_1.0.0          prodlim_2019.11.13  
    [70] broom_0.8.0          class_7.3-20         survival_3.3-1      
    [73] timeDate_3043.102    iterators_1.0.14     hardhat_1.1.0       
    [76] lava_1.6.10          globals_0.15.1       ellipsis_0.3.2      
    [79] brew_1.0-7           ipred_0.9-13
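
    To compare a local installation with the session info above, something
    like the following sketch could be used. Only a few package versions are
    copied from the listing; compare_versions() and use_reference_rng() are
    hypothetical helpers and not part of this repository. The second helper
    mirrors the "Random number generation" block above (in particular
    "Sample: Rounding"); whether this is needed to reproduce our results
    exactly on a newer R version is an assumption.

```r
# A few reference versions copied from the session info in this README.
reference_versions <- c(glmnet = "4.0-2", ggplot2 = "3.3.6",
                        bestsubset = "1.0.10", batchtools = "0.9.15",
                        mvtnorm = "1.1-3", caret = "6.0-92")

# Hypothetical helper: one row per package with the reference version, the
# installed version (NA if not installed) and whether both agree.
compare_versions <- function(reference) {
  rows <- lapply(names(reference), function(pkg) {
    is_installed <- nzchar(system.file(package = pkg))
    installed <- if (is_installed) utils::packageVersion(pkg) else NULL
    data.frame(
      package   = pkg,
      reference = reference[[pkg]],
      installed = if (is_installed) as.character(installed) else NA_character_,
      matches   = if (is_installed) {
        installed == package_version(reference[[pkg]])
      } else {
        NA
      },
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Hypothetical helper: switch to the RNG settings shown in the session info.
# R warns that the "Rounding" sampler is non-uniform; the warning is
# suppressed here on purpose.
use_reference_rng <- function() {
  suppressWarnings(RNGkind(kind = "Mersenne-Twister",
                           normal.kind = "Inversion",
                           sample.kind = "Rounding"))
}

print(compare_versions(reference_versions))
```

    Note that R prints installed versions with dots (e.g. 4.0.2 for 4.0-2),
    which is why the comparison is done on version objects rather than on 
    strings.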