diff --git a/_pkgdown.yml b/_pkgdown.yml index de4bb9f36..763c23504 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -23,6 +23,7 @@ navbar: - benchmarks - predictors - bestpractice + - clinicalmodels - news right: [hades, github] components: @@ -41,6 +42,9 @@ navbar: bestpractice: text: Best Practices href: articles/BestPractices.html + clinicalmodels: + text: Clinical Models + href: articles/ClinicalModels.html benchmarks: text: Benchmarks href: articles/BenchmarkTasks.html diff --git a/docs/404.html b/docs/404.html new file mode 100644 index 000000000..7dbf2ed28 --- /dev/null +++ b/docs/404.html @@ -0,0 +1,184 @@ + + +
+ + + + +vignettes/AddingCustomFeatureEngineering.Rmd
+ AddingCustomFeatureEngineering.Rmd
This vignette describes how you can add your own custom function for
+feature engineering in the Observational Health Data Sciences and
+Informatics (OHDSI) PatientLevelPrediction
+package. This vignette assumes you have read and are comfortable with
+building single patient level prediction models as described in the BuildingPredictiveModels
+vignette.
We invite you to share your new feature engineering functions +with the OHDSI community through our GitHub +repository.
+To make a custom feature engineering function that can be used within +PatientLevelPrediction you need to write two different functions. The +‘create’ function and the ‘implement’ function.
+The ‘create’ function, e.g., +create<FeatureEngineeringFunctionName>, takes the parameters of +the feature engineering ‘implement’ function as input, checks these are +valid and outputs these as a list of class ‘featureEngineeringSettings’ +with the ‘fun’ attribute specifying the ‘implement’ function to +call.
+The ‘implement’ function, e.g., +implement<FeatureEngineeringFunctionName>, must take as input:
+trainData
- a list containing:
covariateData
: the
+plpData$covariateData
restricted to the training
+patients
labels
: a data frame that contains
+rowId
(patient identifier) and outcomeCount
+(the class labels)
folds
: a data.frame that contains rowId
+(patient identifier) and index
(the cross validation
+fold)
featureEngineeringSettings
- the output of your
+create<FeatureEngineeringFunctionName>
The ‘implement’ function can then do any manipulation of the
+trainData
(adding new features or removing features) but
+must output a trainData
object containing the new
+covariateData
, labels
and folds
+for the training data patients.
Let’s consider the situation where we wish to create an age spline +feature. To make this custom feature engineering function we need to +write the ‘create’ and ‘implement’ R functions.
+Our age spline feature function will create a new feature using the
+plpData$cohorts$ageYear
column. We will implement a
+restricted cubic spline that requires specifying the number of knots.
+Therefore, the inputs for this are: knots
- an
+integer/double specifying the number of knots.
+createAgeSpline <- function(
+ knots = 5
+ ){
+
+ # create list of inputs to implement function
+ featureEngineeringSettings <- list(
+ knots = knots
+ )
+
+  # specify the function that will implement the feature engineering
+ attr(featureEngineeringSettings, "fun") <- "implementAgeSplines"
+
+  # make sure the object returned is of class "featureEngineeringSettings"
+ class(featureEngineeringSettings) <- "featureEngineeringSettings"
+ return(featureEngineeringSettings)
+
+}
We now need to create the ‘implement’ function
+implementAgeSplines()
All ‘implement’ functions must take as input the
+trainData
and the featureEngineeringSettings
+(this is the output of the ‘create’ function). They must return a
+trainData
object containing the new
+covariateData
, labels
and
+folds
.
In our example, the createAgeSpline()
will return a list
+with ‘knots’. The featureEngineeringSettings
therefore
+contains this.
+implementAgeSplines <- function(trainData, featureEngineeringSettings, model=NULL) {
+  # if there is a model, it means this function is called through applyFeatureEngineering,
+  # meaning it should apply the model fitted on the training data to the test data
+ if (is.null(model)) {
+ knots <- featureEngineeringSettings$knots
+ ageData <- trainData$labels
+ y <- ageData$outcomeCount
+ X <- ageData$ageYear
+ model <- mgcv::gam(
+ y ~ s(X, bs='cr', k=knots, m=2)
+ )
+ newData <- data.frame(
+ rowId = ageData$rowId,
+ covariateId = 2002,
+ covariateValue = model$fitted.values
+ )
+ }
+ else {
+ ageData <- trainData$labels
+ X <- trainData$labels$ageYear
+ y <- ageData$outcomeCount
+ newData <- data.frame(y=y, X=X)
+ yHat <- predict(model, newData)
+ newData <- data.frame(
+ rowId = trainData$labels$rowId,
+ covariateId = 2002,
+ covariateValue = yHat
+ )
+ }
+
+ # remove existing age if in covariates
+ trainData$covariateData$covariates <- trainData$covariateData$covariates |>
+ dplyr::filter(!covariateId %in% c(1002))
+
+ # update covRef
+ Andromeda::appendToTable(trainData$covariateData$covariateRef,
+ data.frame(covariateId=2002,
+ covariateName='Cubic restricted age splines',
+ analysisId=2,
+ conceptId=2002))
+
+ # update covariates
+ Andromeda::appendToTable(trainData$covariateData$covariates, newData)
+
+ featureEngineering <- list(
+ funct = 'implementAgeSplines',
+ settings = list(
+ featureEngineeringSettings = featureEngineeringSettings,
+ model = model
+ )
+ )
+
+ attr(trainData$covariateData, 'metaData')$featureEngineering = listAppend(
+ attr(trainData$covariateData, 'metaData')$featureEngineering,
+ featureEngineering
+ )
+
+ return(trainData)
+}
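With both functions written, the custom feature engineering can be supplied to the model development pipeline in the same way as the built-in settings, for example via createModelDesign. A minimal usage sketch; the cohort and outcome ids and the other settings shown here are illustrative assumptions:

ageSplineSettings <- createAgeSpline(knots = 5)

modelDesign <- createModelDesign(
  targetId = 1,  # illustrative target cohort id
  outcomeId = 2, # illustrative outcome cohort id
  restrictPlpDataSettings = createRestrictPlpDataSettings(),
  populationSettings = createStudyPopulationSettings(),
  covariateSettings = FeatureExtraction::createDefaultCovariateSettings(),
  featureEngineeringSettings = ageSplineSettings, # the output of our custom 'create' function
  sampleSettings = createSampleSettings(),
  splitSettings = createDefaultSplitSetting(),
  preprocessSettings = createPreprocessSettings(),
  modelSettings = setLassoLogisticRegression()
)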
Considerable work has been dedicated to provide the
+PatientLevelPrediction
package.
+citation("PatientLevelPrediction")
##
+## To cite PatientLevelPrediction in publications use:
+##
+## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018). "Design
+## and implementation of a standardized framework to generate and
+## evaluate patient-level prediction models using observational
+## healthcare data." _Journal of the American Medical Informatics
+## Association_, *25*(8), 969-975.
+## <https://doi.org/10.1093/jamia/ocy032>.
+##
+## A BibTeX entry for LaTeX users is
+##
+## @Article{,
+## author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
+## title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
+## journal = {Journal of the American Medical Informatics Association},
+## volume = {25},
+## number = {8},
+## pages = {969-975},
+## year = {2018},
+## url = {https://doi.org/10.1093/jamia/ocy032},
+## }
+Please reference this paper if you use the PLP Package in +your work:
+ +This work is supported in part through the National Science +Foundation grant IIS 1251151.
+vignettes/AddingCustomModels.Rmd
+ AddingCustomModels.Rmd
This vignette describes how you can add your own custom algorithms in
+the Observational Health Data Sciences and Informatics (OHDSI) PatientLevelPrediction
+package. This allows you to fully leverage the OHDSI
+PatientLevelPrediction framework for model development and validation.
+This vignette assumes you have read and are comfortable with building
+single patient level prediction models as described in the BuildingPredictiveModels
+vignette.
We invite you to share your new algorithms with the OHDSI +community through our GitHub +repository.
+Each algorithm in the package should be implemented in its own +<Name>.R file, e.g. KNN.R, containing a set<Name> function, +a fit<Name> function and a predict<Name> function. +Occasionally the fit and prediction functions may be reused (if using an +R classifier see RClassifier.R or if using a scikit-learn classifier see +SklearnClassifier.R). We will now describe each of these functions in +more detail below.
+The set<Name> is a function that takes as input the different
+hyper-parameter values to do a grid search when training. The output of
+the function needs to be a list of class modelSettings
+containing:
The param object can have a settings attribute containing any extra settings. For example, to specify the model name and the seed used for reproducibility:
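A minimal sketch of such an attribute, as it would be attached to param inside the set&lt;Name&gt; function (this mirrors the settings used in the setMadeUp example below):

attr(param, 'settings') <- list(
  modelName = "Made Up",
  seed = seed
)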
For example, if you were adding a model called madeUp that has two hyper-parameters, then the set function should be:
+
+setMadeUp <- function(a=c(1,4,10), b=2, seed=NULL){
+ # add input checks here...
+
+ param <- split(
+ expand.grid(
+ a=a,
+ b=b
+ ),
+ 1:(length(a)*length(b))
+ )
+
+ attr(param, 'settings') <- list(
+ modelName = "Made Up",
+ requiresDenseMatrix = TRUE,
+ seed = seed
+ )
+
+ # now create list of all combinations:
+ result <- list(
+ fitFunction = 'fitMadeUp', # this will be called to train the made up model
+ param = param
+ )
+ class(result) <- 'modelSettings'
+
+ return(result)
+}
This function should train your custom model for each parameter +entry, pick the best parameters and train a final model for that +setting.
+The fit<Model> should have as inputs:
+The fit function should return a list of class plpModel
+with the following objects:
In addition, the plpModel requires two attributes:
+For example
+attr(result, 'predictionFunction') <- 'madeupPrediction'
+means when the model is applied to new data, the ‘madeupPrediction’
+function is called to make predictions. If this doesn't exist, then the
+model will fail. The other attribute is the modelType
+attr(result, 'modelType') <- 'binary'
this is needed
+when evaluating the model to ensure the correct evaluation is applied.
+Currently the evaluation supports ‘binary’ and ‘survival’ modelType.
Note: If a new modelType is desired, then the evaluation code within PatientLevelPrediction must be updated to specify how the new type is evaluated. This requires making edits to PatientLevelPrediction and then making a pull request to the PatientLevelPrediction GitHub. The evaluation cannot have one-off customization because the evaluation must be standardized to enable comparison across similar models.
+A full example of a custom ‘binary’ classifier fit function is:
+
+fitMadeUp <- function(trainData, modelSettings, search, analysisId){
+
+ param <- modelSettings$param
+
+ # **************** code to train the model here
+ # trainedModel <- this code should apply each hyper-parameter combination
+ # (param[[i]]) using the specified search (e.g., cross validation)
+ # then pick out the best hyper-parameter setting
+ # and finally fit a model on the whole train data using the
+ # optimal hyper-parameter settings
+ # ****************
+
+ # **************** code to apply the model to trainData
+ # prediction <- code to apply trainedModel to trainData
+ # ****************
+
+ # **************** code to get variable importance (if possible)
+ # varImp <- code to get importance of each variable in trainedModel
+ # ****************
+
+
+ # construct the standard output for a model:
+ result <- list(model = trainedModel,
+ prediction = prediction, # the train and maybe the cross validation predictions for the trainData
+ preprocessing = list(
+ featureEngineering = attr(trainData$covariateData, "metaData")$featureEngineering,
+ tidyCovariates = attr(trainData$covariateData, "metaData")$tidyCovariateDataSettings,
+                 requireDenseMatrix = attr(param, 'settings')$requiresDenseMatrix
+                 ),
+ modelDesign = list(
+ outcomeId = attr(trainData, "metaData")$outcomeId,
+ targetId = attr(trainData, "metaData")$targetId,
+ plpDataSettings = attr(trainData, "metaData")$plpDataSettings,
+ covariateSettings = attr(trainData, "metaData")$covariateSettings,
+ populationSettings = attr(trainData, "metaData")$populationSettings,
+ featureEngineeringSettings = attr(trainData$covariateData, "metaData")$featureEngineeringSettings,
+                   preprocessSettings = attr(trainData$covariateData, "metaData")$preprocessSettings,
+ modelSettings = list(
+ model = attr(param, 'settings')$modelName, # the model name
+ param = param,
+ finalModelParameters = param[[bestInd]], # best hyper-parameters
+ extraSettings = attr(param, 'settings')
+ ),
+ splitSettings = attr(trainData, "metaData")$splitSettings,
+ sampleSettings = attr(trainData, "metaData")$sampleSettings
+ ),
+
+ trainDetails = list(
+ analysisId = analysisId,
+ developmentDatabase = attr(trainData, "metaData")$cdmDatabaseSchema,
+ attrition = attr(trainData, "metaData")$attrition,
+ trainingTime = timeToTrain, # how long it took to train the model
+ trainingDate = Sys.Date(),
+ hyperParamSearch = hyperSummary # the hyper-parameters and performance data.frame
+ ),
+ covariateImportance = merge(trainData$covariateData$covariateRef, varImp, by='covariateId') # add variable importance to covariateRef if possible
+ )
+ class(result) <- 'plpModel'
+ attr(result, 'predictionFunction') <- 'madeupPrediction'
+ attr(result, 'modelType') <- 'binary'
+ return(result)
+
+}
You could make the fitMadeUp function cleaner by adding helper
+functions in the MadeUp.R file that are called by the fit function (for
+example a function to run cross validation). It is important to ensure
+there is a valid prediction function (the one specified by
+attr(result, 'predictionFunction') <- 'madeupPrediction'
+is madeupPrediction()
) as specified below.
The prediction function takes as input the plpModel returned by fit, +new data and a corresponding cohort. It returns a data.frame with the +same columns as cohort but with an additional column:
+For example:
+
+madeupPrediction <- function(plpModel, data, cohort){
+
+ # ************* code to do prediction for each rowId in cohort
+ # predictionValues <- code to do prediction here returning the predicted risk
+ # (value) for each rowId in cohort
+ #**************
+
+ prediction <- merge(cohort, predictionValues, by='rowId')
+ attr(prediction, "metaData") <- list(modelType = attr(plpModel, 'modelType'))
+ return(prediction)
+
+}
Below, a fully functional algorithm example is given; however, we highly recommend having a look at the available algorithms in the package (see GradientBoostingMachine.R for the set function and RClassifier.R for the fit and prediction functions for R classifiers).
+
+setMadeUp <- function(a=c(1,4,6), b=2, seed=NULL){
+ # add input checks here...
+
+ if(is.null(seed)){
+ seed <- sample(100000,1)
+ }
+
+ param <- split(
+ expand.grid(
+ a=a,
+ b=b
+ ),
+ 1:(length(a)*length(b))
+ )
+
+ attr(param, 'settings') <- list(
+ modelName = "Made Up",
+ requiresDenseMatrix = TRUE,
+ seed = seed
+ )
+
+ # now create list of all combinations:
+ result <- list(
+ fitFunction = 'fitMadeUp', # this will be called to train the made up model
+ param = param
+ )
+ class(result) <- 'modelSettings'
+
+ return(result)
+}
fitMadeUp <- function(trainData, modelSettings, search, analysisId){
+
+ # set the seed for reproducibility
+ param <- modelSettings$param
+  set.seed(attr(param, 'settings')$seed)
+
+  # record the start time so the training time can be reported
+  start <- Sys.time()
+
+ # add folds to labels:
+ trainData$labels <- merge(trainData$labels, trainData$folds, by= 'rowId')
+ # convert data into sparse R Matrix:
+ mappedData <- toSparseM(trainData,map=NULL)
+ matrixData <- mappedData$dataMatrix
+ labels <- mappedData$labels
+ covariateRef <- mappedData$covariateRef
+
+ #============= STEP 1 ======================================
+ # pick the best hyper-params and then do final training on all data...
+ writeLines('Cross validation')
+ param.sel <- lapply(
+ param,
+ function(x){
+ do.call(
+ made_up_model,
+ list(
+ param = x,
+ final = F,
+ data = matrixData,
+ labels = labels
+ )
+ )
+ }
+ )
+ hyperSummary <- do.call(rbind, lapply(param.sel, function(x) x$hyperSum))
+ hyperSummary <- as.data.frame(hyperSummary)
+ hyperSummary$auc <- unlist(lapply(param.sel, function(x) x$auc))
+  bestInd <- which.max(hyperSummary$auc)
+
+  #get cross val prediction for best hyper-parameters
+  prediction <- param.sel[[bestInd]]$prediction
+ prediction$evaluationType <- 'CV'
+
+ writeLines('final train')
+ finalResult <- do.call(
+ made_up_model,
+ list(
+ param = param[[bestInd]],
+ final = T,
+ data = matrixData,
+ labels = labels
+ )
+ )
+
+ trainedModel <- finalResult$model
+
+ # prediction risk on training data:
+ finalResult$prediction$evaluationType <- 'Train'
+
+ # get CV and train prediction
+ prediction <- rbind(prediction, finalResult$prediction)
+
+ varImp <- covariateRef %>% dplyr::collect()
+ # no feature importance available
+  varImp$covariateValue <- 0
+
+ timeToTrain <- Sys.time() - start
+
+ # construct the standard output for a model:
+ result <- list(model = trainedModel,
+ prediction = prediction,
+ preprocessing = list(
+ featureEngineering = attr(trainData$covariateData, "metaData")$featureEngineering,
+ tidyCovariates = attr(trainData$covariateData, "metaData")$tidyCovariateDataSettings,
+                 requireDenseMatrix = attr(param, 'settings')$requiresDenseMatrix
+                 ),
+ modelDesign = list(
+ outcomeId = attr(trainData, "metaData")$outcomeId,
+ targetId = attr(trainData, "metaData")$targetId,
+ plpDataSettings = attr(trainData, "metaData")$plpDataSettings,
+ covariateSettings = attr(trainData, "metaData")$covariateSettings,
+ populationSettings = attr(trainData, "metaData")$populationSettings,
+ featureEngineeringSettings = attr(trainData$covariateData, "metaData")$featureEngineeringSettings,
+                   preprocessSettings = attr(trainData$covariateData, "metaData")$preprocessSettings,
+ modelSettings = list(
+ model = attr(param, 'settings')$modelName, # the model name
+ param = param,
+ finalModelParameters = param[[bestInd]], # best hyper-parameters
+ extraSettings = attr(param, 'settings')
+ ),
+ splitSettings = attr(trainData, "metaData")$splitSettings,
+ sampleSettings = attr(trainData, "metaData")$sampleSettings
+ ),
+
+ trainDetails = list(
+ analysisId = analysisId,
+ developmentDatabase = attr(trainData, "metaData")$cdmDatabaseSchema,
+ attrition = attr(trainData, "metaData")$attrition,
+ trainingTime = timeToTrain, # how long it took to train the model
+ trainingDate = Sys.Date(),
+ hyperParamSearch = hyperSummary # the hyper-parameters and performance data.frame
+ ),
+                 covariateImportance = varImp # the collected covariateRef with a covariateValue column (no importance available here)
+  )
+ class(result) <- 'plpModel'
+ attr(result, 'predictionFunction') <- 'madeupPrediction'
+ attr(result, 'modelType') <- 'binary'
+ return(result)
+
+}
In the fit model a helper function made_up_model
is
+called, this is the function that trains a model given the data, labels
+and hyper-parameters.
+made_up_model <- function(param, data, final=F, labels){
+
+ if(final==F){
+ # add value column to store all predictions
+ labels$value <- rep(0, nrow(labels))
+ attr(labels, "metaData") <- list(modelType = "binary")
+
+ foldPerm <- c() # this holds CV aucs
+ for(index in 1:max(labels$index)){
+ model <- madeup::model(
+ x = data[labels$index!=index,], # remove left out fold
+ y = labels$outcomeCount[labels$index!=index],
+ a = param$a,
+ b = param$b
+ )
+
+ # predict on left out fold
+ pred <- stats::predict(model, data[labels$index==index,])
+ labels$value[labels$index==index] <- pred
+
+      # calculate auc on held out fold
+ aucVal <- computeAuc(labels[labels$index==index,])
+ foldPerm<- c(foldPerm,aucVal)
+ }
+   auc <- computeAuc(labels) # overall AUC
+
+ } else {
+ model <- madeup::model(
+ x = data,
+ y = labels$outcomeCount,
+ a = param$a,
+ b = param$b
+ )
+
+ pred <- stats::predict(model, data)
+ labels$value <- pred
+ attr(labels, "metaData") <- list(modelType = "binary")
+ auc <- computeAuc(labels)
+ foldPerm <- auc
+ }
+
+ result <- list(
+ model = model,
+ auc = auc,
+ prediction = labels,
+    hyperSum = c(a = param$a, b = param$b, fold_auc = foldPerm)
+ )
+
+ return(result)
+}
The final step is to create a predict function for the model. In the
+example above the prediction function
+attr(result, 'predictionFunction') <- 'madeupPrediction'
+was madeupPrediction, so a madeupPrediction
function is
+required when applying the model. The predict function needs to take as
+input the plpModel returned by the fit function, new data to apply the
+model on and the cohort specifying the patients of interest to make the
+prediction for.
+madeupPrediction <- function(plpModel, data , cohort){
+
+ if(class(data) == 'plpData'){
+ # convert
+ matrixObjects <- toSparseM(
+ plpData = data,
+ cohort = cohort,
+ map = plpModel$covariateImportance %>%
+ dplyr::select("columnId", "covariateId")
+ )
+
+ newData <- matrixObjects$dataMatrix
+ cohort <- matrixObjects$labels
+
+ }else{
+ newData <- data
+ }
+
+ if(class(plpModel) == 'plpModel'){
+ model <- plpModel$model
+ } else{
+ model <- plpModel
+ }
+
+  cohort$value <- stats::predict(model, newData)
+
+ # fix the rowIds to be the old ones
+ # now use the originalRowId and remove the matrix rowId
+ cohort <- cohort %>%
+ dplyr::select(-"rowId") %>%
+ dplyr::rename(rowId = "originalRowId")
+
+ attr(cohort, "metaData") <- list(modelType = attr(plpModel, 'modelType'))
+ return(cohort)
+
+}
As the madeup model uses the standard R predict function, it has the same prediction function as xgboost, so instead of adding a new prediction function we could have set the predictionFunction attribute of the result returned by fitMadeUp to attr(result, 'predictionFunction') <- 'predictXgboost'.
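Once the set, fit and prediction functions exist, the custom classifier can be used like any of the built-in models. A usage sketch (the ids are illustrative and, for brevity, only a few createModelDesign arguments are shown; in practice the remaining design settings are supplied as described in the BuildingMultiplePredictiveModels vignette):

# specify the custom model and its hyper-parameter grid
madeUpModelSettings <- setMadeUp(a = c(1, 4, 6), b = 2, seed = 42)

# pass it wherever a modelSettings object is expected, for example:
modelDesign <- createModelDesign(
  targetId = 1,  # illustrative target cohort id
  outcomeId = 2, # illustrative outcome cohort id
  modelSettings = madeUpModelSettings
)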
Considerable work has been dedicated to provide the
+PatientLevelPrediction
package.
+citation("PatientLevelPrediction")
##
+## To cite PatientLevelPrediction in publications use:
+##
+## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018). "Design
+## and implementation of a standardized framework to generate and
+## evaluate patient-level prediction models using observational
+## healthcare data." _Journal of the American Medical Informatics
+## Association_, *25*(8), 969-975.
+## <https://doi.org/10.1093/jamia/ocy032>.
+##
+## A BibTeX entry for LaTeX users is
+##
+## @Article{,
+## author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
+## title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
+## journal = {Journal of the American Medical Informatics Association},
+## volume = {25},
+## number = {8},
+## pages = {969-975},
+## year = {2018},
+## url = {https://doi.org/10.1093/jamia/ocy032},
+## }
+Please reference this paper if you use the PLP Package in +your work:
+ +This work is supported in part through the National Science +Foundation grant IIS 1251151.
+vignettes/AddingCustomSamples.Rmd
+ AddingCustomSamples.Rmd
This vignette describes how you can add your own custom function for
+sampling the target population in the Observational Health Data Sciences
+and Informatics (OHDSI) PatientLevelPrediction
+package. This vignette assumes you have read and are comfortable with
+building single patient level prediction models as described in the BuildingPredictiveModels
+vignette.
We invite you to share your new sample functions with the +OHDSI community through our GitHub +repository.
+To make a sampling function that can be used within +PatientLevelPrediction you need to write two different functions. The +‘create’ function and the ‘implement’ function.
+The ‘create’ function, e.g., create<SampleFunctionName>, takes +the parameters of the sample ‘implement’ function as input, checks these +are valid and outputs these as a list of class ‘sampleSettings’ with the +‘fun’ attribute specifying the ‘implement’ function to call.
The 'implement' function, e.g., implement<SampleFunctionName>, must take as input:
* trainData - a list containing:
  - covariateData: the plpData$covariateData restricted to the training patients
  - labels: a data frame that contains rowId (patient identifier) and outcomeCount (the class labels)
  - folds: a data.frame that contains rowId (patient identifier) and index (the cross validation fold)
* sampleSettings - the output of your create<SampleFunctionName>
+The ‘implement’ function can then do any manipulation of the +trainData (such as undersampling or oversampling) but must output a +trainData object containing the covariateData, labels and folds for the +new training data sample.
+Let’s consider the situation where we wish to take a random sample of +the training data population. To make this custom sampling function we +need to write the ‘create’ and ‘implement’ R functions.
+Our random sampling function will randomly sample n
+patients from the trainData. Therefore, the inputs for this are: *
+n
an integer/double specifying the number of patients to
+sample * sampleSeed
an integer/double specifying the seed
+for reproducibility
+createRandomSampleSettings <- function(
+ n = 10000,
+ sampleSeed = sample(10000,1)
+ ){
+
+ # add input checks
+ checkIsClass(n, c('numeric','integer'))
+ checkHigher(n,0)
+ checkIsClass(sampleSeed, c('numeric','integer'))
+
+ # create list of inputs to implement function
+ sampleSettings <- list(
+ n = n,
+ sampleSeed = sampleSeed
+ )
+
+ # specify the function that will implement the sampling
+ attr(sampleSettings, "fun") <- "implementRandomSampleSettings"
+
+ # make sure the object returned is of class "sampleSettings"
+ class(sampleSettings) <- "sampleSettings"
+ return(sampleSettings)
+
+}
We now need to create the ‘implement’ function
+implementRandomSampleSettings()
All ‘implement’ functions must take as input the trainData and the +sampleSettings (this is the output of the ‘create’ function). They must +return a trainData object containing the covariateData, labels and +folds.
+In our example, the createRandomSampleSettings()
will
+return a list with ‘n’ and ‘sampleSeed’. The sampleSettings therefore
+contains these.
+implementRandomSampleSettings <- function(trainData, sampleSettings){
+
+  n <- sampleSettings$n
+  sampleSeed <- sampleSettings$sampleSeed
+
+ if(n > nrow(trainData$labels)){
+ stop('Sample n bigger than training population')
+ }
+
+ # set the seed for the randomization
+ set.seed(sampleSeed)
+
+ # now implement the code to do your desired sampling
+
+ sampleRowIds <- sample(trainData$labels$rowId, n)
+
+ sampleTrainData <- list()
+
+ sampleTrainData$labels <- trainData$labels %>%
+ dplyr::filter(.data$rowId %in% sampleRowIds) %>%
+ dplyr::collect()
+
+ sampleTrainData$folds <- trainData$folds %>%
+ dplyr::filter(.data$rowId %in% sampleRowIds) %>%
+ dplyr::collect()
+
+ sampleTrainData$covariateData <- Andromeda::andromeda()
+ sampleTrainData$covariateData$covariateRef <-trainData$covariateData$covariateRef
+ sampleTrainData$covariateData$covariates <- trainData$covariateData$covariates %>% dplyr::filter(.data$rowId %in% sampleRowIds)
+
+ #update metaData$populationSize
+ metaData <- attr(trainData$covariateData, 'metaData')
+ metaData$populationSize = n
+ attr(sampleTrainData$covariateData, 'metaData') <- metaData
+
+  # make the covariateData the correct class
+ class(sampleTrainData$covariateData) <- 'CovariateData'
+
+ # return the updated trainData
+ return(sampleTrainData)
+}
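The custom sampling can then be used when developing a model by passing the 'create' output as the sampleSettings, for example (the values are illustrative):

# randomly sample 5000 patients from the training data with a fixed seed
sampleSettings <- createRandomSampleSettings(n = 5000, sampleSeed = 42)

# sampleSettings can now be supplied as the sampleSettings argument of
# createModelDesign() or runPlp(), in place of the default createSampleSettings()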
Considerable work has been dedicated to provide the
+PatientLevelPrediction
package.
+citation("PatientLevelPrediction")
##
+## To cite PatientLevelPrediction in publications use:
+##
+## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018). "Design
+## and implementation of a standardized framework to generate and
+## evaluate patient-level prediction models using observational
+## healthcare data." _Journal of the American Medical Informatics
+## Association_, *25*(8), 969-975.
+## <https://doi.org/10.1093/jamia/ocy032>.
+##
+## A BibTeX entry for LaTeX users is
+##
+## @Article{,
+## author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
+## title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
+## journal = {Journal of the American Medical Informatics Association},
+## volume = {25},
+## number = {8},
+## pages = {969-975},
+## year = {2018},
+## url = {https://doi.org/10.1093/jamia/ocy032},
+## }
+Please reference this paper if you use the PLP Package in +your work:
+ +This work is supported in part through the National Science +Foundation grant IIS 1251151.
+vignettes/AddingCustomSplitting.Rmd
+ AddingCustomSplitting.Rmd
This vignette describes how you can add your own custom function for
+splitting the labelled data into training data and validation data in
+the Observational Health Data Sciences and Informatics (OHDSI) PatientLevelPrediction
+package. This vignette assumes you have read and are comfortable with
+building single patient level prediction models as described in the BuildingPredictiveModels
+vignette.
We invite you to share your new data splitting functions with +the OHDSI community through our GitHub +repository.
+To make a custom data splitting function that can be used within +PatientLevelPrediction you need to write two different functions. The +‘create’ function and the ‘implement’ function.
+The ‘create’ function, e.g., create<DataSplittingFunction>, +takes the parameters of the data splitting ‘implement’ function as +input, checks these are valid and outputs these as a list of class +‘splitSettings’ with the ‘fun’ attribute specifying the ‘implement’ +function to call.
The 'implement' function, e.g., implement<DataSplittingFunction>, must take as input:
* population: a data frame that contains rowId (patient identifier), ageYear, gender and outcomeCount (the class labels)
* splitSettings - the output of your create<DataSplittingFunction>
The 'implement' function then needs to implement code to assign each rowId in the population to a splitId (< 0 means the patient is in the test data, 0 means the patient is not used, and > 0 means the patient is in the training data, with the value defining the cross validation fold).
+Let’s consider the situation where we wish to create a split where +females are used to train a model but males are used to evaluate the +model.
+Our gender split function requires a single parameter, the number of +folds used in cross validation. Therefore create a function with a +single nfold input that returns a list of class ‘splitSettings’ with the +‘fun’ attribute specifying the ‘implement’ function we will use.
+
+createGenderSplit <- function(nfold)
+ {
+
+ # create list of inputs to implement function
+ splitSettings <- list(nfold = nfold)
+
+  # specify the function that will implement the split
+ attr(splitSettings, "fun") <- "implementGenderSplit"
+
+  # make sure the object returned is of class "splitSettings"
+ class(splitSettings) <- "splitSettings"
+ return(splitSettings)
+
+}
We now need to create the ‘implement’ function
+implementGenderSplit()
All ‘implement’ functions for data splitting must take as input the +population and the splitSettings (this is the output of the ‘create’ +function). They must return a data.frame containing columns: rowId and +index.
The index is used to determine whether the patient (identified by the rowId) is in the test set (index = -1) or train set (index > 0). In the train set, the value corresponds to the cross validation fold. For example, if rowId 2 is assigned index 5, then it means the patient with the rowId 2 is used to train the model and is in fold 5.
+
+implementGenderSplit <- function(population, splitSettings){
+
+ # find the people who are male:
+ males <- population$rowId[population$gender == 8507]
+ females <- population$rowId[population$gender == 8532]
+
+ splitIds <- data.frame(
+ rowId = c(males, females),
+ index = c(
+ rep(-1, length(males)),
+ sample(1:splitSettings$nfold, length(females), replace = T)
+ )
+ )
+
+  # return the data.frame with the rowId and index (the split assignment)
+ return(splitIds)
+}
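The custom split can then be used by passing the 'create' output as the splitSettings, for example (the number of folds is illustrative):

# train on females, evaluate on males, with 3-fold cross validation in the training data
splitSettings <- createGenderSplit(nfold = 3)

# splitSettings can now be supplied as the splitSettings argument of
# createModelDesign() or runPlp(), in place of createDefaultSplitSetting()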
Considerable work has been dedicated to provide the
+PatientLevelPrediction
package.
+citation("PatientLevelPrediction")
##
+## To cite PatientLevelPrediction in publications use:
+##
+## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018). "Design
+## and implementation of a standardized framework to generate and
+## evaluate patient-level prediction models using observational
+## healthcare data." _Journal of the American Medical Informatics
+## Association_, *25*(8), 969-975.
+## <https://doi.org/10.1093/jamia/ocy032>.
+##
+## A BibTeX entry for LaTeX users is
+##
+## @Article{,
+## author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
+## title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
+## journal = {Journal of the American Medical Informatics Association},
+## volume = {25},
+## number = {8},
+## pages = {969-975},
+## year = {2018},
+## url = {https://doi.org/10.1093/jamia/ocy032},
+## }
+Please reference this paper if you use the PLP Package in +your work:
+ +This work is supported in part through the National Science +Foundation grant IIS 1251151.
+vignettes/BenchmarkTasks.Rmd
+ BenchmarkTasks.Rmd
Here we provide a set of diverse prediction tasks that can be used +when evaluating the impact of the model design choice when developing +models using observational data.
+Target Cohort (index) | +Outcome | +Time-at-risk | +Link | +
---|---|---|---|
Patients with an outpatient visit in 2017 with no prior cancer +(first visit in 2017) | +Lung cancer | +1 day - 3 years after index | ++ |
Patients newly diagnosed with major depressive disorder (date of +first record) | +Bipolar | +1 day - 365 day after index | ++ |
Patients with an outpatient visit in 2019 | +Dementia | +1 day - 3 years after index | ++ |
Patients with an outpatient visit and a positive COVID test | +Hospitalization with pneumonia | +1 day - 30 days after index | ++ |
Patients with an outpatient visit and a positive COVID test | +Hospitalization with pneumonia that required intensive services +(ventilation, intubation, tracheotomy, or extracorporeal membrane +oxygenation) or death | +1 day - 30 days after index | ++ |
Patients with an outpatient visit and a positive COVID test | +Death | +1 day - 30 days after index | ++ |
Patients with T2DM who were treated with metformin and who became +new adult users of one of sulfonylureas, thiazolidinediones, dipeptidyl +peptidase-4 inhibitors, glucagon-like peptide-1 receptor agonists, or +sodium-glucose co-transporter-2 inhibitors (date of secondary drug). +Patients with HF or patients treated with insulin on or prior to the +index date were excluded from the analysis. Patients were required to +have been enrolled for at least 365 days before cohort entry. | +Heart Failure | +1 to 365 days | ++ |
Patients newly diagnosed with atrial fibrillation (date of initial +afib record) | +Ischemic stroke | +1 to 365 days | + |
Patients undergoing elective major non-cardiac surgery (date of +surgery). Patients were required to have been enrolled for at least 365 +days before cohort entry. | +Earliest of AMI, cardiac arrest or death (MACE) | +0 to 30 days | + |
Patients starting intravitreal Anti-VEGF (date of +administration) | +Kidney Failure | +1 to 365 days | ++ |
Pregnant women (start of pregnancy) | +Preeclampsia | +During pregnancy | + |
Pregnant women (start of pregnancy) | +Stillbirth | +During pregnancy | + |
Patients with COPD (first record) | +Cardiovascular event and death | +1-30 days and 1-90 days | ++ |
Patients starting menopause (first record) | +Depression | +1 day - 3-years | ++ |
Patients with anemia (date of first anemia record) | +Colorectal cancer | +1 day - 1-year | ++ |
Patients with quadriplegia (date of first quadriplegia record) | +Death | +1 day - 1-year | ++ |
Patient undergoing | ++ | + | + |
+ | + | + | + |
vignettes/BestPractices.rmd
+ BestPractices.rmd
+Topic + | ++Research Summary + | ++Link + | +
---|---|---|
+Problem Specification + | ++When is prediction suitable in observational data? + | ++Guidelines needed + | +
+Data Creation + | ++Comparison of cohort vs case-control design + | ++Journal +of Big Data + | +
+Data Creation + | ++Addressing loss to follow-up (right censoring) + | ++BMC +medical informatics and decision makingk + | +
+Data Creation + | ++Investigating how to address left censoring in features construction + | ++BMC +Medical Research Methodology + | +
+Data Creation + | ++Impact of over/under-sampling + | ++ +Journal of big data + | +
+Data Creation + | ++Impact of phenotypes + | ++Study Done - Paper submitted + | +
+Model development + | ++How much data do we need for prediction - Learning curves at scale + | ++International +Journal of Medical Informatics + | +
+Model development + | ++What impact does test/train/validation design have on model performance + | ++BMJ Open + | +
+Model development + | ++What is the impact of the classifier + | ++JAMIA + | +
+Model development + | ++Can we find hyper-parameter combinations per classifier that +consistently lead to good performing models when using claims/EHR data? + | ++Study needs to be done + | +
+Model development + | ++Can we use ensembles to combine different algorithm models within a +database to improve models transportability? + | ++ Caring is +Sharing–Exploiting the Value in Data for Health and Innovation + | +
+Model development + | ++Can we use ensembles to combine models developed using different +databases to improve models transportability? + | ++ +BMC Medical Informatics and Decision Making + | +
+Model development + | ++Impact of regularization method + | ++ +JAMIA + | +
+Evaluation + | ++Why prediction is not suitable for risk factor identification + | ++ Machine +Learning for Healthcare Conference + | +
+Evaluation + | ++Iterative pairwise external validation to put validation into context + | ++ +Drug Safety + | +
+Evaluation + | ++A novel method to estimate external validation using aggregate +statistics + | ++ Study under review + | +
+Evaluation + | ++How should we present model performance? (e.g., new visualizations) + | ++JAMIA +Open + | +
+Evaluation + | ++How to interpret external validation performance (can we figure out why +the performance drops or stays consistent)? + | ++Study needs to be done + | +
+Evaluation + | ++Recalibration methods + | ++Study needs to be done + | +
+Evaluation + | ++Is there a way to automatically simplify models? + | ++Study +protocol under development + | +
vignettes/BuildingMultiplePredictiveModels.Rmd
+ BuildingMultiplePredictiveModels.Rmd
In our paper
,
+we propose a standardised framework for patient-level prediction that
+utilizes the OMOP CDM and standardized vocabularies, and describe the
+open-source software that we developed implementing the framework’s
+pipeline. The framework is the first to enforce existing best practice
+guidelines and will enable open dissemination of models that can be
+extensively validated across the network of OHDSI collaborators.
One of our best practices is that we see the selection of models and all study settings as an empirical question, i.e. we should use a data-driven approach in which we try many settings. This vignette describes how you can use the Observational Health Data Sciences and Informatics (OHDSI) PatientLevelPrediction package to automatically build multiple patient-level predictive models, e.g. different population settings, covariate settings, and model settings. This vignette assumes you have read and are comfortable with building single patient level prediction models as described in the BuildingPredictiveModels vignette.
Note that it is also possible to generate a Study Package directly in Atlas that allows for multiple patient-level prediction analyses; however, this is out of scope for this vignette.
+The first step is to specify each model you wish to develop by using
+the createModelDesign
function. This function requires the
+following:
input | +Description | +
---|---|
targetId | +The id for the target cohort | +
outcomeId | +The id for the outcome | +
restrictPlpDataSettings | +The settings used to restrict the target population, +created with createRestrictPlpDataSettings() | +
populationSettings | +The settings used to restrict the target population and +create the outcome labels, created with +createStudyPopulationSettings() | +
covariateSettings | +The settings used to define the covariates, created +with FeatureExtraction::createDefaultCovariateSettings() | +
sampleSettings | +The settings used to define any under/over sampling, +created with createSampleSettings() | +
featureEngineeringSettings | +The settings used to define any feature engineering, +created with createFeatureEngineeringSettings() | +
preprocessSettings | +The settings used to define any preprocessing, created +with createPreprocessSettings() | +
modelSettings | +The settings used to define the model fitting settings, +such as setLassoLogisticRegression() | +
For example, suppose we want to predict the outcome (id 2) occurring for the first time within 180 days of the target population index date (id 1). We are only interested in index dates between 2018 and 2020. Finally, we only want to use age, gender in 5 year buckets and conditions as features. The model can be specified by:
+
+# Model 1 is only using data between 2018-2020:
+restrictPlpDataSettings <- createRestrictPlpDataSettings(
+ studyStartDate = '20180101',
+ studyEndDate = '20191231'
+ )
+
+# predict outcome within 1 to 180 days after index
+# remove people with outcome prior and with < 365 days observation
+populationSettings <- createStudyPopulationSettings(
+ binary = T,
+ firstExposureOnly = T,
+ washoutPeriod = 365,
+ removeSubjectsWithPriorOutcome = T,
+ priorOutcomeLookback = 9999,
+ requireTimeAtRisk = F,
+ riskWindowStart = 1,
+ riskWindowEnd = 180
+)
+
+# use age/gender in groups and condition groups as features
+covariateSettings <- FeatureExtraction::createCovariateSettings(
+ useDemographicsGender = T,
+ useDemographicsAgeGroup = T,
+ useConditionGroupEraAnyTimePrior = T
+)
+
+modelDesign1 <- createModelDesign(
+ targetId = 1,
+ outcomeId = 2,
+ restrictPlpDataSettings = restrictPlpDataSettings,
+ populationSettings = populationSettings,
+ covariateSettings = covariateSettings,
+ featureEngineeringSettings = createFeatureEngineeringSettings(),
+ sampleSettings = createSampleSettings(),
+ splitSettings = createDefaultSplitSetting(),
+ preprocessSettings = createPreprocessSettings(),
+ modelSettings = setLassoLogisticRegression()
+ )
For the second example, we want to predict the outcome (id 2) occurring for the first time within 730 days of the target population index date (id 1). We want to train a random forest classifier. Finally, we want to use age, gender in 5 year buckets, drug ingredients (and groups) and conditions as features. The model can be specified by:
+
+# Model 2 has no restrictions when extracting data
+restrictPlpDataSettings <- createRestrictPlpDataSettings(
+ )
+
+# predict outcome within 1 to 730 days after index
+# remove people with outcome prior and with < 365 days observation
+populationSettings <- createStudyPopulationSettings(
+ binary = T,
+ firstExposureOnly = T,
+ washoutPeriod = 365,
+ removeSubjectsWithPriorOutcome = T,
+ priorOutcomeLookback = 9999,
+ requireTimeAtRisk = F,
+ riskWindowStart = 1,
+ riskWindowEnd = 730
+)
+
+# use age/gender in groups and condition/drug groups as features
+covariateSettings <- FeatureExtraction::createCovariateSettings(
+ useDemographicsGender = T,
+ useDemographicsAgeGroup = T,
+ useConditionGroupEraAnyTimePrior = T,
+ useDrugGroupEraAnyTimePrior = T
+)
+
+modelDesign2 <- createModelDesign(
+ targetId = 1,
+ outcomeId = 2,
+ restrictPlpDataSettings = restrictPlpDataSettings,
+ populationSettings = populationSettings,
+ covariateSettings = covariateSettings,
+ featureEngineeringSettings = createRandomForestFeatureSelection(ntrees = 500, maxDepth = 7),
+ sampleSettings = createSampleSettings(),
+ splitSettings = createDefaultSplitSetting(),
+ preprocessSettings = createPreprocessSettings(),
+ modelSettings = setRandomForest()
+ )
For the third example, we want to predict the outcome (id 5) occurring during the cohort exposure of the target population (id 1). We want to train a gradient boosting machine. Finally, we want to use age, gender in 5 year buckets and indicators of measurements taken as features. The model can be specified by:
+
+# Model 3 has no restrictions when extracting data
+restrictPlpDataSettings <- createRestrictPlpDataSettings(
+ )
+
+# predict outcome during target cohort start/end
+# remove people with < 365 days observation
+populationSettings <- createStudyPopulationSettings(
+ binary = T,
+ firstExposureOnly = T,
+ washoutPeriod = 365,
+ removeSubjectsWithPriorOutcome = F,
+ requireTimeAtRisk = F,
+ riskWindowStart = 0,
+ startAnchor = 'cohort start',
+ riskWindowEnd = 0,
+ endAnchor = 'cohort end'
+)
+
+# use age/gender in groups and measurement indicators as features
+covariateSettings <- FeatureExtraction::createCovariateSettings(
+ useDemographicsGender = T,
+ useDemographicsAgeGroup = T,
+ useMeasurementAnyTimePrior = T,
+ endDays = -1
+)
+
+modelDesign3 <- createModelDesign(
+ targetId = 1,
+ outcomeId = 5,
+ restrictPlpDataSettings = restrictPlpDataSettings,
+ populationSettings = populationSettings,
+ covariateSettings = covariateSettings,
+ featureEngineeringSettings = createFeatureEngineeringSettings(),
+ sampleSettings = createSampleSettings(),
+ splitSettings = createDefaultSplitSetting(),
+ preprocessSettings = createPreprocessSettings(),
+ modelSettings = setGradientBoostingMachine()
+ )
As we will be downloading loads of data in the multiple plp analysis
+it is useful to set the Andromeda temp folder to a directory with write
+access and plenty of space.
+options(andromedaTempFolder = "c:/andromedaTemp")
To run the study requires setting up a connectionDetails object
+
+dbms <- "your dbms"
+user <- "your username"
+pw <- "your password"
+server <- "your server"
+port <- "your port"
+
+connectionDetails <- DatabaseConnector::createConnectionDetails(dbms = dbms,
+ server = server,
+ user = user,
+ password = pw,
+ port = port)
Next you need to specify the cdmDatabaseSchema where your cdm +database is found and workDatabaseSchema where your target population +and outcome cohorts are and you need to specify a label for the database +name: a string with a shareable name of the database (this will be shown +to OHDSI researchers if the results get transported).
+cdmDatabaseSchema <- "your cdmDatabaseSchema"
+workDatabaseSchema <- "your workDatabaseSchema"
+cdmDatabaseName <- "your cdmDatabaseName"
+cohortTable <- "your cohort table"
+
+databaseDetails <- createDatabaseDetails(
+ connectionDetails = connectionDetails,
+ cdmDatabaseSchema = cdmDatabaseSchema,
+ cdmDatabaseName = cdmDatabaseName ,
+ cohortDatabaseSchema = workDatabaseSchema,
+ cohortTable = cohortTable,
+ outcomeDatabaseSchema = workDatabaseSchema,
+    outcomeTable = cohortTable,
+ cdmVersion = 5
+ )
Now you can run the multiple patient-level prediction analysis:
+
+results <- runMultiplePlp(
+ databaseDetails = databaseDetails,
+ modelDesignList = list(
+ modelDesign1,
+ modelDesign2,
+ modelDesign3
+ ),
+ onlyFetchData = F,
+ logSettings = createLogSettings(),
+ saveDirectory = "./PlpMultiOutput"
+ )
This will then save all the plpData objects from the study into "./PlpMultiOutput/plpData_T1_L" and the results into "./PlpMultiOutput/Analysis_". The csv file named settings.csv found in "./PlpMultiOutput" has a row for each prediction model developed and points to the plpData and settings used for the model development; it also has descriptions of the cohorts if these are input by the user.
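For example, you could open this file to see which analysis folder corresponds to which model design (a small sketch assuming the save directory used above):

# list the model designs that were run and where their results are stored
settings <- utils::read.csv(file.path("./PlpMultiOutput", "settings.csv"))
head(settings)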
Note that if for some reason the run is interrupted, e.g. because of an error, a new call to runMultiplePlp will continue from where it stopped rather than start again from scratch, unless you remove the output folder.
If you have access to multiple databases on the same server in different schemas, you could validate the models across these using this call:
+
+validationDatabaseDetails <- createDatabaseDetails(
+ connectionDetails = connectionDetails,
+ cdmDatabaseSchema = 'new cdm schema',
+ cdmDatabaseName = 'validation database',
+ cohortDatabaseSchema = workDatabaseSchema,
+ cohortTable = cohortTable,
+ outcomeDatabaseSchema = workDatabaseSchema,
+ outcomeTable = cohortTable,
+ cdmVersion = 5
+ )
+
+val <- validateMultiplePlp(
+ analysesLocation = "./PlpMultiOutput",
+  validationDatabaseDetails = validationDatabaseDetails,
+ validationRestrictPlpDataSettings = createRestrictPlpDataSettings(),
+ recalibrate = NULL,
+ saveDirectory = "./PlpMultiOutput/Validation"
+ )
This then saves the external validation results in the Validation folder of the main study (the saveDirectory you used in runMultiplePlp).
To view the results for the multiple prediction analysis:
+
+viewMultiplePlp(analysesLocation="./PlpMultiOutput")
If the validation directory in “./PlpMultiOutput” has a sqlite +results database, the external validation will also be displayed.
+Considerable work has been dedicated to provide the
+PatientLevelPrediction
package.
+citation("PatientLevelPrediction")
##
+## To cite PatientLevelPrediction in publications use:
+##
+## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018). "Design
+## and implementation of a standardized framework to generate and
+## evaluate patient-level prediction models using observational
+## healthcare data." _Journal of the American Medical Informatics
+## Association_, *25*(8), 969-975.
+## <https://doi.org/10.1093/jamia/ocy032>.
+##
+## A BibTeX entry for LaTeX users is
+##
+## @Article{,
+## author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
+## title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
+## journal = {Journal of the American Medical Informatics Association},
+## volume = {25},
+## number = {8},
+## pages = {969-975},
+## year = {2018},
+## url = {https://doi.org/10.1093/jamia/ocy032},
+## }
+Please reference this paper if you use the PLP Package in +your work:
+ +vignettes/BuildingPredictiveModels.Rmd
+ BuildingPredictiveModels.Rmd
Observational healthcare data, such as administrative claims and electronic health records, are increasingly used for clinical characterization of disease progression, quality improvement, and population-level effect estimation for medical product safety surveillance and comparative effectiveness. Advances in machine learning for large dataset analysis have led to increased interest in applying patient-level prediction on this type of data. Patient-level prediction offers the potential for medical practice to move beyond average treatment effects and to consider personalized risks as part of clinical decision-making. However, many published efforts in patient-level prediction do not follow the model development guidelines, fail to perform extensive external validation, or provide insufficient model details, which limits the ability of independent researchers to reproduce the models and perform external validation. This makes it hard to fairly evaluate the predictive performance of the models and reduces the likelihood of the model being used appropriately in clinical practice. To improve standards, several papers have been written detailing guidelines for best practices in developing and reporting prediction models.
+The Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement
+provides clear recommendations for reporting prediction model
+development and validation and addresses some of the concerns related to
+transparency. However, data structure heterogeneity and inconsistent
+terminologies still make collaboration and model sharing difficult as
+different researchers are often required to write new code to extract
+the data from their databases and may define variables differently.
In our paper
,
+we propose a standardised framework for patient-level prediction that
+utilizes the OMOP Common Data Model (CDM) and standardized vocabularies,
+and describe the open-source software that we developed implementing the
+framework’s pipeline. The framework is the first to support existing
+best practice guidelines and will enable open dissemination of models
+that can be extensively validated across the network of OHDSI
+collaborators.
Figure 1, illustrates the prediction problem we address. Among a +population at risk, we aim to predict which patients at a defined moment +in time (t = 0) will experience some outcome during a time-at-risk. +Prediction is done using only information about the patients in an +observation window prior to that moment in time.
+As shown in Figure 2, to define a prediction problem we have to +define t=0 by a Target Cohort (T), the outcome we like to predict by an +outcome cohort (O), and the time-at-risk (TAR). Furthermore, we have to +make design choices for the model we like to develop, and determine the +observational datasets to perform internal and external validation. This +conceptual framework works for all type of prediction problems, for +example those presented in Figure 3.
+This vignette describes how you can use the
+PatientLevelPrediction
package to build patient-level
+predictive models. The package enables data extraction, model building,
+and model evaluation using data from databases that are translated into
+the OMOP CDM. In this vignette we assume you have installed the package
+correctly using the InstallationGuide
.
We have to clearly specify our study upfront to be able to implement +it. This means we need to define the prediction problem we like to +address, in which population we will build the model, which model we +will build and how we will evaluate its performance. To guide you +through this process we will use a “Disease onset and progression” +prediction type as an example.
+Atrial fibrillation is a disease characterized by an irregular heart +rate that can cause poor blood flow. Patients with atrial fibrillation +are at increased risk of ischemic stroke. Anticoagulation is a +recommended prophylaxis treatment strategy for patients at high risk of +stroke, though the underuse of anticoagulants and persistent severity of +ischemic stroke represents a substantial unmet medical need. Various +strategies have been developed to predict risk of ischemic stroke in +patients with atrial fibrillation. CHADS2 (Gage JAMA 2001) was developed +as a risk score based on history of congestive heart failure, +hypertension, age>=75, diabetes and stroke. CHADS2 was initially +derived using Medicare claims data, where it achieved good +discrimination (AUC=0.82). However, subsequent external validation +studies revealed the CHADS2 had substantially lower predictive accuracy +(Keogh Thromb Haemost 2011). Subsequent stroke risk calculators have +been developed and evaluated, including the extension of CHADS2Vasc. The +management of atrial fibrillation has evolved substantially over the +last decade, for various reasons that include the introduction of novel +oral anticoagulants. With these innovations has come a renewed interest +in greater precision medicine for stroke prevention.
+We will apply the PatientLevelPrediction package to observational +healthcare data to address the following patient-level prediction +question:
+Amongst patients who are newly diagnosed with Atrial Fibrillation, +which patients will go on to have Ischemic Stroke within 1 year?
+We will define ‘patients who are newly diagnosed with Atrial +Fibrillation’ as the first condition record of cardiac arrhythmia, which +is followed by another cardiac arrhythmia condition record, at least two +drug records for a drug used to treat arrhythmias, or a procedure to +treat arrhythmias. We will define ‘Ischemic stroke events’ as ischemic +stroke condition records during an inpatient or ER visit; successive +records with > 180 day gap are considered independent episodes.
+Angiotensin converting enzyme inhibitors (ACE inhibitors) are +medications used by patients with hypertension that widen the blood +vessles and therefore increse the amount of blood pumped by the heart +and decreases blood pressure. Ace inhibitors reduce a patients risk of +cardiovasular disease but can lead to drug-induced angioedema.
+We will apply the PatientLevelPrediction package to observational +healthcare data to address the following patient-level prediction +question:
+Amongst patients who are newly dispensed an ACE inhibitor, which +patients will go on to have angioedema within 1 year?
+We will define ‘patients who are newly dispensed an ACE inhibitor’ as +the first drug record of sny ACE inhibitor, […]which is followed by +another cardiac arrhythmia condition record, at least two drug records +for a drug used to treat arrhythmias, or a procedure to treat +arrhythmias. We will define ‘angioedema’ as an angioedema condition +record.
+The final study population in which we will develop our model is +often a subset of the Target population, because we will e.g. apply +criteria that are dependent on T and O or we want to do sensitivity +analyses with subpopulations of T. For this we have to answer the +following questions:
What is the minimum amount of observation time we require before the start of the target cohort? This choice could depend on the available patient time in your training data, but also on the time you expect to be available in the data sources you want to apply the model on in the future. The longer the minimum observation time, the more baseline history time is available for each person to use for feature extraction, but the fewer patients will qualify for analysis. Moreover, there could be clinical reasons to choose a shorter or longer lookback period. For our example, we will use a prior history of 1095 days (three years) as the lookback (washout) period.
Can patients enter the target cohort multiple times? In +the target cohort definition, a person may qualify for the cohort +multiple times during different spans of time, for example if they had +different episodes of a disease or separate periods of exposure to a +medical product. The cohort definition does not necessarily apply a +restriction to only let the patients enter once, but in the context of a +particular patient-level prediction problem, a user may want to restrict +the cohort to the first qualifying episode. In our example, a person +could only enter the target cohort once since our criteria was based on +first occurrence of atrial fibrillation.
Do we allow persons to enter the cohort if they experienced +the outcome before? Do we allow persons to enter the target cohort +if they experienced the outcome before qualifying for the target cohort? +Depending on the particular patient-level prediction problem, there may +be a desire to predict ‘incident’ first occurrence of an outcome, in +which case patients who have previously experienced the outcome are not +‘at-risk’ for having a first occurrence and therefore should be excluded +from the target cohort. In other circumstances, there may be a desire to +predict ‘prevalent’ episodes, whereby patients with prior outcomes can +be included in the analysis and the prior outcome itself can be a +predictor of future outcomes. For our prediction example, the answer to +this question is ‘Yes, allow persons with prior outcomes’ because we +know from the CHADS2 score that prior strokes are very predictive of +future strokes. If this answer would have been ‘No’ we also have to +decide how long we would look back for previous occurrences of the +outcome.
How do we define the period in which we will predict our +outcome relative to the target cohort start? We actually have to +make two decisions to answer that question. First, does the time-at-risk +window start at the date of the start of the target cohort or later? +Arguments to make it start later could be that you want to avoid +outcomes that were entered late in the record that actually occurred +before the start of the target cohort or you want to leave a gap where +interventions to prevent the outcome could theoretically be implemented. +Second, you need to define the time-at-risk by setting the risk window +end, as some specification of days offset relative to the target cohort +start or end dates. For our problem we will predict in a ‘time-at-risk’ +window starting 1 day after the start of the target cohort up to 365 +days later (to look for 1-year risk following atrial fibrillation +diagnosis).
Do we require a minimum amount of time-at-risk? We have to decide if we want to include patients that did not experience the outcome but did leave the database earlier than the end of our time-at-risk period. These patients may experience the outcome when we do not observe them. For our prediction problem we decide to answer this question with ‘Yes, require a minimum time-at-risk’ for that reason. Furthermore, we have to decide if this constraint also applies to persons who experienced the outcome or we will include all persons with the outcome irrespective of their total time at risk. For example, if the outcome is death, then persons with the outcome are likely censored before the full time-at-risk period is complete.
To develop the model we have to decide which algorithm(s) we like to
+train. We see the selection of the best algorithm for a certain
+prediction problem as an empirical question, i.e. you need to let the
+data speak for itself and try different approaches to find the best one.
+There is no algorithm that will work best for all problems (no free
+lunch). In our package we therefore aim to implement many algorithms.
+Furthermore, we made the system modular so you can add your own custom
+algorithms as described in more detail in the AddingCustomModels
+vignette.
Our package currently contains the following algorithms to choose +from:
Algorithm | Description | Hyper-parameters |
---|---|---|
Regularized Logistic Regression | +Lasso logistic regression belongs to the family of generalized +linear models, where a linear combination of the variables is learned +and finally a logistic function maps the linear combination to a value +between 0 and 1. The lasso regularization adds a cost based on model +complexity to the objective function when training the model. This cost +is the sum of the absolute values of the linear combination of the +coefficients. The model automatically performs feature selection by +minimizing this cost. We use the Cyclic coordinate descent for logistic, +Poisson and survival analysis (Cyclops) package to perform large-scale +regularized logistic regression: https://github.com/OHDSI/Cyclops + | +var (starting variance), seed | +
Gradient boosting machines | Gradient boosting machines is a boosting ensemble technique and in our framework it combines multiple decision trees. Boosting works by iteratively adding decision trees but adds more weight to the data-points that are misclassified by prior decision trees in the cost function when training the next tree. We use Extreme Gradient Boosting, which is an efficient implementation of the gradient boosting framework implemented in the xgboost R package available from CRAN. | ntree (number of trees), max depth (max levels in tree), min rows (minimum data points in a node), learning rate, balance (balance class labels), seed |
Random forest | Random forest is a bagging ensemble technique that combines multiple decision trees. The idea behind bagging is to reduce the likelihood of overfitting by using weak classifiers, but combining multiple diverse weak classifiers into a strong classifier. Random forest accomplishes this by training multiple decision trees but only using a subset of the variables in each tree, and the subset of variables differs between trees. Our package uses the scikit-learn implementation of Random Forest in Python. | mtry (number of features in each tree), ntree (number of trees), maxDepth (max levels in tree), minRows (minimum data points in a node), balance (balance class labels), seed |
K-nearest neighbors | K-nearest neighbors (KNN) is an algorithm that uses some metric to find the K closest labelled data-points, given the specified metric, to a new unlabelled data-point. The prediction for the new data-point is then the most prevalent class of the K-nearest labelled data-points. There is a sharing limitation of KNN, as the model requires labelled data to perform the prediction on new data, and it is often not possible to share this data across data sites. We included the BigKnn classifier developed in OHDSI which is a large scale k-nearest neighbor classifier using the Lucene search engine: https://github.com/OHDSI/BigKnn | k (number of neighbours), weighted (weight by inverse frequency) |
Naive Bayes | +The Naive Bayes algorithm applies the Bayes theorem with the ‘naive’ +assumption of conditional independence between every pair of features +given the value of the class variable. Based on the likelihood the data +belongs to a class and the prior distribution of the class, a posterior +distribution is obtained. | +none | +
AdaBoost | +AdaBoost is a boosting ensemble technique. Boosting works by +iteratively adding classifiers but adds more weight to the data-points +that are misclassified by prior classifiers in the cost function when +training the next classifier. We use the sklearn ‘AdaboostClassifier’ +implementation in Python. | +nEstimators (the maximum number of estimators at which boosting is +terminated), learningRate (learning rate shrinks the contribution of +each classifier by learning_rate. There is a trade-off between +learningRate and nEstimators) | +
Decision Tree | +A decision tree is a classifier that partitions the variable space +using individual tests selected using a greedy approach. It aims to find +partitions that have the highest information gain to separate the +classes. The decision tree can easily overfit by enabling a large number +of partitions (tree depth) and often needs some regularization (e.g., +pruning or specifying hyper-parameters that limit the complexity of the +model). We use the sklearn ‘DecisionTreeClassifier’ implementation in +Python. | +maxDepth (the maximum depth of the tree), +minSamplesSplit,minSamplesLeaf, minImpuritySplit (threshold for early +stopping in tree growth. A node will split if its impurity is above the +threshold, otherwise it is a leaf.), seed,classWeight (‘Balance’ or +‘None’) | +
Multilayer Perceptron | Neural networks contain multiple layers that weight their inputs using a non-linear function. The first layer is the input layer, the last layer is the output layer, and in between are the hidden layers. Neural networks are generally trained using feed-forward back-propagation. This is when you go through the network with a data-point and calculate the error between the true label and predicted label, then go backwards through the network and update the linear function weights based on the error. This can also be performed as a batch, where multiple data-points are fed through the network before the weights are updated. | size (the number of hidden nodes), alpha (the l2 regularisation), seed |
Deep Learning (now in separate DeepPatientLevelPrediction R package) | Deep learning approaches such as deep nets, convolutional neural networks or recurrent neural networks are similar to a neural network but have multiple hidden layers that aim to learn latent representations useful for prediction. In the separate BuildingDeepLearningModels vignette we describe these models and hyper-parameters in more detail. | see OHDSI/DeepPatientLevelPrediction |
Furthermore, we have to decide on the covariates that we will use to train our model. This choice can be driven by domain knowledge and by the available computational resources. In our example, we would like to add gender, age, conditions, drug groups, and visit count. We also have to specify in which time windows we will look, and we decide to look in the year before the index date and any time prior.
Finally, we have to define how we will train and test our model on our data, i.e. how we perform internal validation. For this we have to decide how we divide our dataset into a training and a testing dataset and how we assign patients to these two sets. Depending on the size of the dataset we can decide how much data we would like to use for training; typically this is a 75%/25% split. If you have very large datasets you can use more data for training. To assign patients to the training and testing set, two commonly used approaches are to split randomly by person or to split by time (using the most recent data as the test set).
We have now completely defined our studies and can implement them:
+ +For our first prediction model we decide to start with a Regularized +Logistic Regression and will use the default parameters. We will do a +75%-25% split by person.
+Definition | +Value | +
---|---|
Problem Definition | ++ |
Target Cohort (T) | +‘Patients who are newly diagnosed with Atrial Fibrillation’ defined +as the first condition record of cardiac arrhythmia, which is followed +by another cardiac arrhythmia condition record, at least two drug +records for a drug used to treat arrhythmias, or a procedure to treat +arrhythmias. | +
Outcome Cohort (O) | +‘Ischemic stroke events’ defined as ischemic stroke condition +records during an inpatient or ER visit; successive records with > +180 day gap are considered independent episodes. | +
Time-at-risk (TAR) | +1 day till 365 days from cohort start | +
+ | + |
Population Definition | ++ |
Washout Period | +1095 | +
Enter the target cohort multiple times? | +No | +
Allow prior outcomes? | +Yes | +
Start of time-at-risk | +1 day | +
End of time-at-risk | +365 days | +
Require a minimum amount of time-at-risk? | +Yes (364 days) | +
+ | + |
Model Development | ++ |
Algorithm | +Regularized Logistic Regression | +
Hyper-parameters | +variance = 0.01 (Default) | +
Covariates | +Gender, Age, Conditions (ever before, <365), Drugs Groups (ever +before, <365), and Visit Count | +
Data split | +75% train, 25% test. Randomly assigned by person | +
According to best practices, we need to create a protocol that completely specifies how we plan to execute our study. This protocol will be assessed by the governance boards of the participating data sources in your network study. For this a template could be used, but we prefer to automate this process as much as possible by adding functionality to automatically generate the study protocol from a study specification. We will discuss this in more detail later.
Now that we have completely designed our study, we have to implement it. We have to generate the target and outcome cohorts and we need to develop the R code to run against our CDM that will execute the full study.
+For our study we need to know when a person enters the target and +outcome cohorts. This is stored in a table on the server that contains +the cohort start date and cohort end date for all subjects for a +specific cohort definition. This cohort table has a very simple +structure as shown below:
+cohort_definition_id
, a unique identifier for
+distinguishing between different types of cohorts, e.g. cohorts of
+interest and outcome cohorts.subject_id
, a unique identifier corresponding to the
+person_id
in the CDM.cohort_start_date
, the date the subject enters the
+cohort.cohort_end_date
, the date the subject leaves the
+cohort.How do we fill this table according to our cohort definitions? There +are two options for this:
+use the interactive cohort builder tool in ATLAS which can be used to create +cohorts based on inclusion criteria and will automatically populate this +cohort table.
write your own custom SQL statements to fill the cohort +table.
Both methods are described below for our example prediction +problem.
+ATLAS allows you to define cohorts interactively by specifying cohort +entry and cohort exit criteria. Cohort entry criteria involve selecting +one or more initial events, which determine the start date for cohort +entry, and optionally specifying additional inclusion criteria which +filter to the qualifying events. Cohort exit criteria are applied to +each cohort entry record to determine the end date when the person’s +episode no longer qualifies for the cohort. For the outcome cohort the +end date is less relevant. As an example, Figure 4 shows how we created +the Atrial Fibrillation cohort and Figure 5 shows how we created the +stroke cohort in ATLAS.
+The T and O cohorts can be found here:
+In depth explanation of cohort creation in ATLAS is out of scope of +this vignette but can be found on the OHDSI wiki pages (link).
Note that when a cohort is created in ATLAS the cohort id is needed to extract the data in R. The cohort id can be found at the top of the ATLAS screen, e.g. 1769447 in Figure 4.
+It is also possible to create cohorts without the use of ATLAS. Using +custom cohort code (SQL) you can make more advanced cohorts if +needed.
For our example study, we need to create a table to hold the cohort data and we need to create SQL code to instantiate this table for both the AF and Stroke cohorts. Therefore, we create a file called AfStrokeCohorts.sql with the following contents:
+/***********************************
+File AfStrokeCohorts.sql
+***********************************/
+/*
+Create a table to store the persons in the T and C cohort
+*/
+
+IF OBJECT_ID('@resultsDatabaseSchema.AFibStrokeCohort', 'U') IS NOT NULL
+DROP TABLE @resultsDatabaseSchema.AFibStrokeCohort;
+
+CREATE TABLE @resultsDatabaseSchema.AFibStrokeCohort
+(
+cohort_definition_id INT,
+subject_id BIGINT,
+cohort_start_date DATE,
+cohort_end_date DATE
+);
+
+
+/*
+T cohort: [PatientLevelPrediction vignette]: T : patients who are newly
+diagnosed with Atrial fibrillation
+- persons with a condition occurrence record of 'Atrial fibrillation' or
+any descendants, indexed at the first diagnosis
+- who have >1095 days of prior observation before their first diagnosis
+- and have no warfarin exposure any time prior to first AFib diagnosis
+*/
+INSERT INTO @resultsDatabaseSchema.AFibStrokeCohort (cohort_definition_id,
+subject_id,
+cohort_start_date,
+cohort_end_date)
+SELECT 1 AS cohort_definition_id,
+AFib.person_id AS subject_id,
+AFib.condition_start_date AS cohort_start_date,
+observation_period.observation_period_end_date AS cohort_end_date
+FROM
+(
+ SELECT person_id, min(condition_start_date) as condition_start_date
+ FROM @cdmDatabaseSchema.condition_occurrence
+ WHERE condition_concept_id IN (SELECT descendant_concept_id FROM
+ @cdmDatabaseSchema.concept_ancestor WHERE ancestor_concept_id IN
+ (313217 /*atrial fibrillation*/))
+ GROUP BY person_id
+) AFib
+ INNER JOIN @cdmDatabaseSchema.observation_period
+ ON AFib.person_id = observation_period.person_id
+ AND AFib.condition_start_date >= dateadd(dd,1095,
+ observation_period.observation_period_start_date)
+ AND AFib.condition_start_date <= observation_period.observation_period_end_date
+ LEFT JOIN
+ (
+ SELECT person_id, min(drug_exposure_start_date) as drug_exposure_start_date
+ FROM @cdmDatabaseSchema.drug_exposure
+ WHERE drug_concept_id IN (SELECT descendant_concept_id FROM
+ @cdmDatabaseSchema.concept_ancestor WHERE ancestor_concept_id IN
+ (1310149 /*warfarin*/))
+ GROUP BY person_id
+ ) warfarin
+ ON Afib.person_id = warfarin.person_id
+ AND Afib.condition_start_date > warfarin.drug_exposure_start_date
+ WHERE warfarin.person_id IS NULL
+ ;
+
+ /*
+ C cohort: [PatientLevelPrediction vignette]: O: Ischemic stroke events
+ - inpatient visits that include a condition occurrence record for
+ 'cerebral infarction' and descendants, 'cerebral thrombosis',
+ 'cerebral embolism', 'cerebral artery occlusion'
+ */
+ INSERT INTO @resultsDatabaseSchema.AFibStrokeCohort (cohort_definition_id,
+ subject_id,
+ cohort_start_date,
+ cohort_end_date)
+ SELECT 2 AS cohort_definition_id,
+ visit_occurrence.person_id AS subject_id,
+ visit_occurrence.visit_start_date AS cohort_start_date,
+ visit_occurrence.visit_end_date AS cohort_end_date
+ FROM
+ (
+ SELECT person_id, condition_start_date
+ FROM @cdmDatabaseSchema.condition_occurrence
+ WHERE condition_concept_id IN (SELECT DISTINCT descendant_concept_id FROM
+ @cdmDatabaseSchema.concept_ancestor WHERE ancestor_concept_id IN
+ (443454 /*cerebral infarction*/) OR descendant_concept_id IN
+ (441874 /*cerebral thrombosis*/, 375557 /*cerebral embolism*/,
+ 372924 /*cerebral artery occlusion*/))
+ ) stroke
+ INNER JOIN @cdmDatabaseSchema.visit_occurrence
+ ON stroke.person_id = visit_occurrence.person_id
+ AND stroke.condition_start_date >= visit_occurrence.visit_start_date
+ AND stroke.condition_start_date <= visit_occurrence.visit_end_date
+ AND visit_occurrence.visit_concept_id IN (9201, 262 /*'Inpatient Visit' or
+ 'Emergency Room and Inpatient Visit'*/)
+ GROUP BY visit_occurrence.person_id, visit_occurrence.visit_start_date,
+ visit_occurrence.visit_end_date
+ ;
+
This is parameterized SQL which can be used by the SqlRender
+package. We use parameterized SQL so we do not have to pre-specify the
+names of the CDM and result schemas. That way, if we want to run the SQL
+on a different schema, we only need to change the parameter values; we
+do not have to change the SQL code. By also making use of translation
+functionality in SqlRender
, we can make sure the SQL code
+can be run in many different environments.
To execute this sql against our CDM we first need to tell R how to
+connect to the server. PatientLevelPrediction
uses the DatabaseConnector
+package, which provides a function called
+createConnectionDetails
. Type
+?createConnectionDetails
for the specific settings required
+for the various database management systems (DBMS). For example, one
+might connect to a PostgreSQL database using this code:
+ connectionDetails <- createConnectionDetails(dbms = "postgresql",
+ server = "localhost/ohdsi",
+ user = "joe",
+ password = "supersecret")
+
+ cdmDatabaseSchema <- "my_cdm_data"
+ cohortsDatabaseSchema <- "my_results"
+ cdmVersion <- "5"
The last three lines define the cdmDatabaseSchema
and
+cohortsDatabaseSchema
variables, as well as the CDM
+version. We will use these later to tell R where the data in CDM format
+live, where we want to create the cohorts of interest, and what version
+CDM is used. Note that for Microsoft SQL Server, databaseschemas need to
+specify both the database and the schema, so for example
+cdmDatabaseSchema <- "my_cdm_data.dbo"
.
+ library(SqlRender)
+ sql <- readSql("AfStrokeCohorts.sql")
+ sql <- renderSql(sql,
+ cdmDatabaseSchema = cdmDatabaseSchema,
+                  resultsDatabaseSchema = cohortsDatabaseSchema)$sql
+ sql <- translateSql(sql, targetDialect = connectionDetails$dbms)$sql
+
+ connection <- connect(connectionDetails)
+ executeSql(connection, sql)
In this code, we first read the SQL from the file into memory. In the
next line, we replace the schema parameter names with the actual values. We
+then translate the SQL into the dialect appropriate for the DBMS we
+already specified in the connectionDetails
. Next, we
+connect to the server, and submit the rendered and translated SQL.
If all went well, we now have a table with the events of interest. We +can see how many events per type:
+
+ sql <- paste("SELECT cohort_definition_id, COUNT(*) AS count",
+ "FROM @cohortsDatabaseSchema.AFibStrokeCohort",
+ "GROUP BY cohort_definition_id")
+ sql <- renderSql(sql, cohortsDatabaseSchema = cohortsDatabaseSchema)$sql
+ sql <- translateSql(sql, targetDialect = connectionDetails$dbms)$sql
+
+ querySql(connection, sql)
## cohort_definition_id count
+## 1 1 527616
+## 2 2 221555
+In this section we assume that our cohorts have been created either +by using ATLAS or a custom SQL script. We will first explain how to +create an R script yourself that will execute our study as we have +defined earlier.
+Now we can tell PatientLevelPrediction
to extract all
necessary data for our analysis. This is done using the FeatureExtraction package.
In short, the FeatureExtraction package allows you to specify which
+features (covariates) need to be extracted, e.g. all conditions and drug
+exposures. It also supports the creation of custom covariates. For more
+detailed information on the FeatureExtraction package see its vignettes. For our
+example study we decided to use these settings:
+ covariateSettings <- createCovariateSettings(useDemographicsGender = TRUE,
+ useDemographicsAge = TRUE,
+ useConditionGroupEraLongTerm = TRUE,
+ useConditionGroupEraAnyTimePrior = TRUE,
+ useDrugGroupEraLongTerm = TRUE,
+ useDrugGroupEraAnyTimePrior = TRUE,
+ useVisitConceptCountLongTerm = TRUE,
+ longTermStartDays = -365,
+ endDays = -1)
The final step for extracting the data is to run the
+getPlpData
function. Its inputs are the database details (connection details, the database schema where the cohorts are stored, and the cohort definition ids for the target and outcome cohorts), the restriction settings (including the washoutPeriod, which is the minimum number of days of observation required prior to the cohort index date for a person to be included), and the previously constructed covariate settings.
+databaseDetails <- createDatabaseDetails(
+ connectionDetails = connectionDetails,
+ cdmDatabaseSchema = cdmDatabaseSchema,
+ cdmDatabaseName = '',
+ cohortDatabaseSchema = resultsDatabaseSchema,
+ cohortTable = 'AFibStrokeCohort',
+ cohortId = 1,
+ outcomeDatabaseSchema = resultsDatabaseSchema,
+ outcomeTable = 'AFibStrokeCohort',
+ outcomeIds = 2,
+ cdmVersion = 5
+ )
+
+# here you can define whether you want to sample the target cohort and add any
+# restrictions based on minimum prior observation, index date restrictions
+# or restricting to first index date (if people can be in target cohort multiple times)
+restrictPlpDataSettings <- createRestrictPlpDataSettings(sampleSize = 10000)
+
+ plpData <- getPlpData(
+ databaseDetails = databaseDetails,
+ covariateSettings = covariateSettings,
+ restrictPlpDataSettings = restrictPlpDataSettings
+ )
Note that if the cohorts are created in ATLAS its corresponding
+cohort database schema needs to be selected. There are many additional
+parameters for the createRestrictPlpDataSettings
function
+which are all documented in the PatientLevelPrediction
+manual. The resulting plpData
object uses the package
+Andromeda
(which uses SQLite) to store
+information in a way that ensures R does not run out of memory, even
+when the data are large.
Creating the plpData
object can take considerable
+computing time, and it is probably a good idea to save it for future
+sessions. Because plpData
uses Andromeda
, we
+cannot use R’s regular save function. Instead, we’ll have to use the
+savePlpData()
function:
+savePlpData(plpData, "stroke_in_af_data")
We can use the loadPlpData()
function to load the data
+in a future session.
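For example, to reload the object saved above in a later session:

plpData <- loadPlpData("stroke_in_af_data")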
To completely define the prediction problem the final study
+population is obtained by applying additional constraints on the two
earlier defined cohorts, e.g., a minimum time at risk can be enforced
+(requireTimeAtRisk, minTimeAtRisk
) and we can specify if
+this also applies to patients with the outcome
+(includeAllOutcomes
). Here we also specify the start and
+end of the risk window relative to target cohort start. For example, if
+we like the risk window to start 30 days after the at-risk cohort start
+and end a year later we can set riskWindowStart = 30
and
+riskWindowEnd = 365
. In some cases the risk window needs to
start at the cohort end date. This can be achieved by setting startAnchor = 'cohort end', which anchors the start of the risk window to the cohort (exposure) end date.
In Appendix 1, we demonstrate the effect of these settings on the +subset of the persons in the target cohort that end up in the final +study population.
+In the example below all the settings we defined for our study are +imposed:
+
+ populationSettings <- createStudyPopulationSettings(
+ washoutPeriod = 1095,
+ firstExposureOnly = FALSE,
+ removeSubjectsWithPriorOutcome = FALSE,
+ priorOutcomeLookback = 1,
+ riskWindowStart = 1,
+ riskWindowEnd = 365,
+ startAnchor = 'cohort start',
+ endAnchor = 'cohort start',
+ minTimeAtRisk = 364,
+ requireTimeAtRisk = TRUE,
+ includeAllOutcomes = TRUE
+ )
When developing a prediction model using supervised learning (when +you have features paired with labels for a set of patients), the first +step is to design the development/internal validation process. This +requires specifying how to select the model hyper-parameters, how to +learn the model parameters and how to fairly evaluate the model. In +general, the validation set is used to pick hyper-parameters, the +training set is used to learn the model parameters and the test set is +used to perform fair internal validation. However, cross-validation can +be implemented to pick the hyper-parameters on the training data (so a +validation data set is not required). Cross validation can also be used +to estimate internal validation (so a testing data set is not +required).
In small data the best approach for internal validation has been shown to be bootstrapping. However, in big data (many patients and many features) bootstrapping is generally not feasible. In big data our research has shown that what matters most is having some form of fair evaluation (a test set or cross validation). For full details see our BMJ Open paper.
+In the PatientLevelPrediction package, the splitSettings define how +the plpData are partitioned into training/validation/testing data. Cross +validation is always done, but using a test set is optional (when the +data are small, it may be optimal to not use a test set). For the +splitSettings we can use the type (stratified/time/subject) and +testFraction parameters to split the data in a 75%-25% split and run the +patient-level prediction pipeline:
+
+ splitSettings <- createDefaultSplitSetting(
+ trainFraction = 0.75,
+ testFraction = 0.25,
+ type = 'stratified',
+ nfold = 2,
+ splitSeed = 1234
+ )
Note: it is possible to add a custom method to specify how the +plpData are partitioned into training/validation/testing data, see vignette +for custom splitting.
There are numerous data processing settings that a user must specify when developing a prediction model. These are:
* Whether to under-sample or over-sample the training data (this may be useful when there is class imbalance, e.g., the outcome is very rare or very common)
* Whether to perform feature engineering or feature selection (e.g., create latent variables that are not observed in the data or reduce the dimensionality of the data)
* Whether to remove redundant features and normalize the data (this is required for some models)
The default sample setting does nothing; it simply returns the trainData as input, see below:
+
+ sampleSettings <- createSampleSettings()
However, the current package contains methods of under-sampling the
+non-outcome patients. To perform undersampling, the type
+input should be ‘underSample’ and
+numberOutcomestoNonOutcomes
must be specified (an integer
+specifying the number of non-outcomes per outcome). It is possible to
+add any custom function for over/under sampling, see vignette
+for custom sampling.
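As an illustration only (the sampleSeed argument is an assumption here; check ?createSampleSettings for the exact argument names in your package version), an under-sampling setting that keeps one non-outcome per outcome could look like:

sampleSettings <- createSampleSettings(
  type = "underSample",
  numberOutcomestoNonOutcomes = 1,
  sampleSeed = 1  # assumed argument for reproducibility; verify against the package documentation
)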
It is possible to specify a combination of feature engineering +functions that take as input the trainData and output a new trainData +with different features. The default feature engineering setting does +nothing:
+
+ featureEngineeringSettings <- createFeatureEngineeringSettings()
However, it is possible to add custom feature engineering functions +into the pipeline, see vignette +for custom feature engineering.
+Finally, the preprocessing setting is required. For this setting the
user can define minFraction, which removes any feature observed in the training data in less than the specified fraction of patients. So, if minFraction = 0.01, any feature seen in less than 1 percent of the target population is removed. The input normalize specifies whether the features are scaled between 0 and 1; this is required for certain models (e.g., LASSO logistic regression). The input removeRedundancy specifies whether features that are observed in all of the target population are removed.
+ preprocessSettings <- createPreprocessSettings(
+ minFraction = 0.01,
+ normalize = T,
+ removeRedundancy = T
+ )
In the set function of an algorithm the user can specify a list of +eligible values for each hyper-parameter. All possible combinations of +the hyper-parameters are included in a so-called grid search using +cross-validation on the training set. If a user does not specify any +value then the default value is used instead.
For example, if we use the following settings for the gradientBoostingMachine: ntrees = c(100, 200), maxDepth = 4, the grid search will apply the gradient boosting machine algorithm with ntrees = 100 and maxDepth = 4 plus the default settings for the other hyper-parameters, and with ntrees = 200 and maxDepth = 4 plus the default settings for the other hyper-parameters. The hyper-parameters that lead to the best cross-validation performance will then be chosen for the final model. For our problem we choose to build a logistic regression model with the default hyper-parameters (an illustrative specification of the gradient boosting grid above is shown after the code block below):
+
+lrModel <- setLassoLogisticRegression()
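For completeness, the illustrative gradient boosting grid described above (not used for this study) could be specified with the same set function that is used for the second example later in this vignette:

# illustrative grid: two ntrees values combined with one maxDepth value
gbmGridExample <- setGradientBoostingMachine(ntrees = c(100, 200), maxDepth = 4)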
The runPlp
function requires the plpData
,
+the outcomeId
specifying the outcome being predicted and
+the settings: populationSettings
,
+splitSettings
, sampleSettings
,
+featureEngineeringSettings
, preprocessSettings
+and modelSettings
to train and evaluate the model.
+ lrResults <- runPlp(
+ plpData = plpData,
+ outcomeId = 2,
+ analysisId = 'singleDemo',
+ analysisName = 'Demonstration of runPlp for training single PLP models',
+ populationSettings = populationSettings,
+ splitSettings = splitSettings,
+ sampleSettings = sampleSettings,
+ featureEngineeringSettings = featureEngineeringSettings,
+ preprocessSettings = preprocessSettings,
+ modelSettings = lrModel,
+ logSettings = createLogSettings(),
+ executeSettings = createExecuteSettings(
+ runSplitData = T,
+ runSampleData = T,
+ runfeatureEngineering = T,
+ runPreprocessData = T,
+ runModelDevelopment = T,
+ runCovariateSummary = T
+ ),
+ saveDirectory = file.path(getwd(), 'singlePlp')
+ )
Under the hood the package will now use the Cyclops
package to
+fit a large-scale regularized regression using 75% of the data and will
+evaluate the model on the remaining 25%. A results data structure is
+returned containing information about the model, its performance
+etc.
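To get a quick overview of what the returned results object contains, plain base R is enough (element names may differ slightly between package versions):

str(lrResults, max.level = 1)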
You can save the model using:
+
+savePlpModel(lrResults$model, dirPath = file.path(getwd(), "model"))
You can load the model using:
+
+plpModel <- loadPlpModel(file.path(getwd(), "model"))
You can also save the full results structure using:
+
+savePlpResult(lrResults, location = file.path(getwd(), "lr"))
To load the full results structure use:
+
+lrResults <- loadPlpResult(file.path(getwd(), "lr"))
Definition | +Value | +
---|---|
Problem Definition | ++ |
Target Cohort (T) | +‘Patients who are newly dispensed an ACE inhibitor’ defined as the +first drug record of any ACE inhibitor | +
Outcome Cohort (O) | +‘Angioedema’ defined as an angioedema condition record during an +inpatient or ER visit | +
Time-at-risk (TAR) | +1 day till 365 days from cohort start | +
+ | + |
Population Definition | ++ |
Washout Period | +365 | +
Enter the target cohort multiple times? | +No | +
Allow prior outcomes? | +No | +
Start of time-at-risk | +1 day | +
End of time-at-risk | +365 days | +
Require a minimum amount of time-at-risk? | +Yes (364 days) | +
+ | + |
Model Development | ++ |
Algorithm | +Gradient Boosting Machine | +
Hyper-parameters | +ntree:5000, max depth:4 or 7 or 10 and learning rate: 0.001 or 0.01 +or 0.1 or 0.9 | +
Covariates | +Gender, Age, Conditions (ever before, <365), Drugs Groups (ever +before, <365), and Visit Count | +
Data split | +75% train, 25% test. Randomly assigned by person | +
According to best practices, we need to create a protocol that completely specifies how we plan to execute our study. This protocol will be assessed by the governance boards of the participating data sources in your network study. For this a template could be used, but we prefer to automate this process as much as possible by adding functionality to automatically generate the study protocol from a study specification. We will discuss this in more detail later.
Now that we have completely designed our study, we have to implement it. We have to generate the target and outcome cohorts and we need to develop the R code to run against our CDM that will execute the full study.
+For our study we need to know when a person enters the target and +outcome cohorts. This is stored in a table on the server that contains +the cohort start date and cohort end date for all subjects for a +specific cohort definition. This cohort table has a very simple +structure as shown below:
+cohort_definition_id
, a unique identifier for
+distinguishing between different types of cohorts, e.g. cohorts of
+interest and outcome cohorts.subject_id
, a unique identifier corresponding to the
+person_id
in the CDM.cohort_start_date
, the date the subject enters the
+cohort.cohort_end_date
, the date the subject leaves the
+cohort.How do we fill this table according to our cohort definitions? There +are two options for this:
+use the interactive cohort builder tool in ATLAS which can be used to create +cohorts based on inclusion criteria and will automatically populate this +cohort table.
write your own custom SQL statements to fill the cohort +table.
Both methods are described below for our example prediction +problem.
+ATLAS allows you to define cohorts interactively by specifying cohort +entry and cohort exit criteria. Cohort entry criteria involve selecting +one or more initial events, which determine the start date for cohort +entry, and optionally specifying additional inclusion criteria which +filter to the qualifying events. Cohort exit criteria are applied to +each cohort entry record to determine the end date when the person’s +episode no longer qualifies for the cohort. For the outcome cohort the +end date is less relevant. As an example, Figure 6 shows how we created +the ACE inhibitors cohort and Figure 7 shows how we created the +angioedema cohort in ATLAS.
+The T and O cohorts can be found here:
+In depth explanation of cohort creation in ATLAS is out of scope of +this vignette but can be found on the OHDSI wiki pages (link).
Note that when a cohort is created in ATLAS the cohort id is needed to extract the data in R. The cohort id can be found at the top of the ATLAS screen, e.g. 1770617 in Figure 6.
+It is also possible to create cohorts without the use of ATLAS. Using +custom cohort code (SQL) you can make more advanced cohorts if +needed.
For our example study, we need to create a table to hold the cohort data and we need to create SQL code to instantiate this table for both the ACE inhibitor and angioedema cohorts. Therefore, we create a file called AceAngioCohorts.sql with the following contents:
+ /***********************************
+ File AceAngioCohorts.sql
+ ***********************************/
+ /*
+ Create a table to store the persons in the T and C cohort
+ */
+
+ IF OBJECT_ID('@resultsDatabaseSchema.AceAngioCohort', 'U') IS NOT NULL
+ DROP TABLE @resultsDatabaseSchema.AceAngioCohort;
+
+ CREATE TABLE @resultsDatabaseSchema.AceAngioCohort
+ (
+ cohort_definition_id INT,
+ subject_id BIGINT,
+ cohort_start_date DATE,
+ cohort_end_date DATE
+ );
+
+
+ /*
+ T cohort: [PatientLevelPrediction vignette]: T : patients who are newly
+ dispensed an ACE inhibitor
+ - persons with a drug exposure record of any 'ACE inhibitor' or
+ any descendants, indexed at the first diagnosis
+ - who have >364 days of prior observation before their first dispensing
+ */
+ INSERT INTO @resultsDatabaseSchema.AceAngioCohort (cohort_definition_id,
+ subject_id,
+ cohort_start_date,
+ cohort_end_date)
+ SELECT 1 AS cohort_definition_id,
+ Ace.person_id AS subject_id,
+ Ace.drug_start_date AS cohort_start_date,
+ observation_period.observation_period_end_date AS cohort_end_date
+ FROM
+ (
+ SELECT person_id, min(drug_exposure_date) as drug_start_date
+ FROM @cdmDatabaseSchema.drug_exposure
+ WHERE drug_concept_id IN (SELECT descendant_concept_id FROM
+ @cdmDatabaseSchema.concept_ancestor WHERE ancestor_concept_id IN
+ (1342439,1334456, 1331235, 1373225, 1310756, 1308216, 1363749, 1341927, 1340128, 1335471 /*ace inhibitors*/))
+ GROUP BY person_id
+ ) Ace
+ INNER JOIN @cdmDatabaseSchema.observation_period
+ ON Ace.person_id = observation_period.person_id
+ AND Ace.drug_start_date >= dateadd(dd,364,
+ observation_period.observation_period_start_date)
+ AND Ace.drug_start_date <= observation_period.observation_period_end_date
+ ;
+
+ /*
+ C cohort: [PatientLevelPrediction vignette]: O: Angioedema
+ */
+ INSERT INTO @resultsDatabaseSchema.AceAngioCohort (cohort_definition_id,
+ subject_id,
+ cohort_start_date,
+ cohort_end_date)
+ SELECT 2 AS cohort_definition_id,
+ angioedema.person_id AS subject_id,
+ angioedema.condition_start_date AS cohort_start_date,
+ angioedema.condition_start_date AS cohort_end_date
+ FROM
+ (
+ SELECT person_id, condition_start_date
+ FROM @cdmDatabaseSchema.condition_occurrence
+ WHERE condition_concept_id IN (SELECT DISTINCT descendant_concept_id FROM
+ @cdmDatabaseSchema.concept_ancestor WHERE ancestor_concept_id IN
+ (432791 /*angioedema*/) OR descendant_concept_id IN
+ (432791 /*angioedema*/))
+ ) angioedema
+
+ ;
+
This is parameterized SQL which can be used by the SqlRender
+package. We use parameterized SQL so we do not have to pre-specify the
+names of the CDM and result schemas. That way, if we want to run the SQL
+on a different schema, we only need to change the parameter values; we
+do not have to change the SQL code. By also making use of translation
+functionality in SqlRender
, we can make sure the SQL code
+can be run in many different environments.
To execute this sql against our CDM we first need to tell R how to
+connect to the server. PatientLevelPrediction
uses the DatabaseConnector
+package, which provides a function called
+createConnectionDetails
. Type
+?createConnectionDetails
for the specific settings required
+for the various database management systems (DBMS). For example, one
+might connect to a PostgreSQL database using this code:
+ connectionDetails <- createConnectionDetails(dbms = "postgresql",
+ server = "localhost/ohdsi",
+ user = "joe",
+ password = "supersecret")
+
+ cdmDatabaseSchema <- "my_cdm_data"
+ cohortsDatabaseSchema <- "my_results"
+ cdmVersion <- "5"
The last three lines define the cdmDatabaseSchema
and
+cohortsDatabaseSchema
variables, as well as the CDM
+version. We will use these later to tell R where the data in CDM format
+live, where we want to create the cohorts of interest, and what version
+CDM is used. Note that for Microsoft SQL Server, databaseschemas need to
+specify both the database and the schema, so for example
+cdmDatabaseSchema <- "my_cdm_data.dbo"
.
+ library(SqlRender)
+ sql <- readSql("AceAngioCohorts.sql")
+ sql <- render(sql,
+ cdmDatabaseSchema = cdmDatabaseSchema,
+               resultsDatabaseSchema = cohortsDatabaseSchema)
+ sql <- translate(sql, targetDialect = connectionDetails$dbms)
+
+ connection <- connect(connectionDetails)
+ executeSql(connection, sql)
In this code, we first read the SQL from the file into memory. In the
next line, we replace the schema parameter names with the actual values. We
+then translate the SQL into the dialect appropriate for the DBMS we
+already specified in the connectionDetails
. Next, we
+connect to the server, and submit the rendered and translated SQL.
If all went well, we now have a table with the events of interest. We +can see how many events per type:
+
+ sql <- paste("SELECT cohort_definition_id, COUNT(*) AS count",
+ "FROM @cohortsDatabaseSchema.AceAngioCohort",
+ "GROUP BY cohort_definition_id")
+ sql <- render(sql, cohortsDatabaseSchema = cohortsDatabaseSchema)
+ sql <- translate(sql, targetDialect = connectionDetails$dbms)
+
+ querySql(connection, sql)
## cohort_definition_id count
+## 1 1 0
+## 2 2 0
+In this section we assume that our cohorts have been created either +by using ATLAS or a custom SQL script. We will first explain how to +create an R script yourself that will execute our study as we have +defined earlier.
+Now we can tell PatientLevelPrediction
to extract all
necessary data for our analysis. This is done using the FeatureExtraction package.
In short, the FeatureExtraction package allows you to specify which
+features (covariates) need to be extracted, e.g. all conditions and drug
+exposures. It also supports the creation of custom covariates. For more
+detailed information on the FeatureExtraction package see its vignettes. For our
+example study we decided to use these settings:
+ covariateSettings <- createCovariateSettings(useDemographicsGender = TRUE,
+ useDemographicsAge = TRUE,
+ useConditionGroupEraLongTerm = TRUE,
+ useConditionGroupEraAnyTimePrior = TRUE,
+ useDrugGroupEraLongTerm = TRUE,
+ useDrugGroupEraAnyTimePrior = TRUE,
+ useVisitConceptCountLongTerm = TRUE,
+ longTermStartDays = -365,
+ endDays = -1)
The final step for extracting the data is to run the
+getPlpData
function. Its inputs are the database details (connection details, the database schema where the cohorts are stored, and the cohort definition ids for the target and outcome cohorts), the restriction settings (including the washoutPeriod, which is the minimum number of days of observation required prior to the cohort index date for a person to be included), and the previously constructed covariate settings.
+databaseDetails <- createDatabaseDetails(
+ connectionDetails = connectionDetails,
+ cdmDatabaseSchema = cdmDatabaseSchema,
+ cohortDatabaseSchema = resultsDatabaseSchema,
+ cohortTable = 'AceAngioCohort',
+ cohortId = 1,
+ outcomeDatabaseSchema = resultsDatabaseSchema,
+ outcomeTable = 'AceAngioCohort',
+ outcomeIds = 2
+ )
+
+restrictPlpDataSettings <- createRestrictPlpDataSettings(
+ sampleSize = 10000
+ )
+
+plpData <- getPlpData(
+ databaseDetails = databaseDetails,
+ covariateSettings = covariateSettings,
+ restrictPlpDataSettings = restrictPlpDataSettings
+ )
Note that if the cohorts are created in ATLAS its corresponding
+cohort database schema needs to be selected. There are many additional
+parameters for the getPlpData
function which are all
+documented in the PatientLevelPrediction
manual. The
+resulting plpData
object uses the package Andromeda (which uses SQLite) to store information in a way that ensures R does not run out of memory,
+even when the data are large.
Creating the plpData
object can take considerable
+computing time, and it is probably a good idea to save it for future
+sessions. Because plpData
uses Andromeda
, we cannot
+use R’s regular save function. Instead, we’ll have to use the
+savePlpData()
function:
+savePlpData(plpData, "angio_in_ace_data")
We can use the loadPlpData()
function to load the data
+in a future session.
To completely define the prediction problem the final study
+population is obtained by applying additional constraints on the two
earlier defined cohorts, e.g., a minimum time at risk can be enforced
+(requireTimeAtRisk, minTimeAtRisk
) and we can specify if
+this also applies to patients with the outcome
+(includeAllOutcomes
). Here we also specify the start and
+end of the risk window relative to target cohort start. For example, if
+we like the risk window to start 30 days after the at-risk cohort start
+and end a year later we can set riskWindowStart = 30
and
+riskWindowEnd = 365
. In some cases the risk window needs to
start at the cohort end date. This can be achieved by setting startAnchor = 'cohort end', which anchors the start of the risk window to the cohort (exposure) end date.
In Appendix 1, we demonstrate the effect of these settings on the +subset of the persons in the target cohort that end up in the final +study population.
+In the example below all the settings we defined for our study are +imposed:
+
+ populationSettings <- createStudyPopulationSettings(
+ washoutPeriod = 364,
+ firstExposureOnly = FALSE,
+ removeSubjectsWithPriorOutcome = TRUE,
+ priorOutcomeLookback = 9999,
+ riskWindowStart = 1,
+ riskWindowEnd = 365,
+ minTimeAtRisk = 364,
+ startAnchor = 'cohort start',
+ endAnchor = 'cohort start',
+ requireTimeAtRisk = TRUE,
+ includeAllOutcomes = TRUE
+ )
When developing a prediction model using supervised learning (when +you have features paired with labels for a set of patients), the first +step is to design the development/internal validation process. This +requires specifying how to select the model hyper-parameters, how to +learn the model parameters and how to fairly evaluate the model. In +general, the validation set is used to pick hyper-parameters, the +training set is used to learn the model parameters and the test set is +used to perform fair internal validation. However, cross-validation can +be implemented to pick the hyper-parameters on the training data (so a +validation data set is not required). Cross validation can also be used +to estimate internal validation (so a testing data set is not +required).
In small data the best approach for internal validation has been shown to be bootstrapping. However, in big data (many patients and many features) bootstrapping is generally not feasible. In big data our research has shown that what matters most is having some form of fair evaluation (a test set or cross validation). For full details see our BMJ Open paper.
+In the PatientLevelPrediction package, the splitSettings define how +the plpData are partitioned into training/validation/testing data. Cross +validation is always done, but using a test set is optional (when the +data are small, it may be optimal to not use a test set). For the +splitSettings we can use the type (stratified/time/subject) and +testFraction parameters to split the data in a 75%-25% split and run the +patient-level prediction pipeline:
+
+ splitSettings <- createDefaultSplitSetting(
+ trainFraction = 0.75,
+ testFraction = 0.25,
+ type = 'stratified',
+ nfold = 2,
+ splitSeed = 1234
+ )
Note: it is possible to add a custom method to specify how the +plpData are partitioned into training/validation/testing data, see vignette +for custom splitting.
There are numerous data processing settings that a user must specify when developing a prediction model. These are:
* Whether to under-sample or over-sample the training data (this may be useful when there is class imbalance, e.g., the outcome is very rare or very common)
* Whether to perform feature engineering or feature selection (e.g., create latent variables that are not observed in the data or reduce the dimensionality of the data)
* Whether to remove redundant features and normalize the data (this is required for some models)
The default sample setting does nothing; it simply returns the trainData as input, see below:
+
+ sampleSettings <- createSampleSettings()
However, the current package contains methods of under-sampling the
+non-outcome patients. To perform undersampling, the type
+input should be ‘underSample’ and
+numberOutcomestoNonOutcomes
must be specified (an integer
+specifying the number of non-outcomes per outcome). It is possible to
+add any custom function for over/under sampling, see vignette
+for custom sampling.
It is possible to specify a combination of feature engineering +functions that take as input the trainData and output a new trainData +with different features. The default feature engineering setting does +nothing:
+
+ featureEngineeringSettings <- createFeatureEngineeringSettings()
However, it is possible to add custom feature engineering functions +into the pipeline, see vignette +for custom feature engineering.
+Finally, the preprocessing setting is required. For this setting the
user can define minFraction, which removes any feature observed in the training data in less than the specified fraction of patients. So, if minFraction = 0.01, any feature seen in less than 1 percent of the target population is removed. The input normalize specifies whether the features are scaled between 0 and 1; this is required for certain models (e.g., LASSO logistic regression). The input removeRedundancy specifies whether features that are observed in all of the target population are removed.
+ preprocessSettings <- createPreprocessSettings(
+ minFraction = 0.01,
+ normalize = T,
+ removeRedundancy = T
+ )
In the set function of an algorithm the user can specify a list of +eligible values for each hyper-parameter. All possible combinations of +the hyper-parameters are included in a so-called grid search using +cross-validation on the training set. If a user does not specify any +value then the default value is used instead.
For example, if we use the following settings for the gradientBoostingMachine: ntrees = c(100, 200), maxDepth = 4, the grid search will apply the gradient boosting machine algorithm with ntrees = 100 and maxDepth = 4 plus the default settings for the other hyper-parameters, and with ntrees = 200 and maxDepth = 4 plus the default settings for the other hyper-parameters. The hyper-parameters that lead to the best cross-validation performance will then be chosen for the final model. For this problem we choose to build a gradient boosting machine and search over several maxDepth and learnRate values:
+
+gbmModel <- setGradientBoostingMachine(ntrees = 5000, maxDepth = c(4, 7, 10), learnRate = c(0.001,
+ 0.01, 0.1, 0.9))
The runPlp
function requires the plpData
,
+the outcomeId
specifying the outcome being predicted and
+the settings: populationSettings
,
+splitSettings
, sampleSettings
,
+featureEngineeringSettings
, preprocessSettings
+and modelSettings
to train and evaluate the model.
+ gbmResults <- runPlp(
+ plpData = plpData,
+ outcomeId = 2,
+ analysisId = 'singleDemo2',
+ analysisName = 'Demonstration of runPlp for training single PLP models',
+ populationSettings = populationSettings,
+ splitSettings = splitSettings,
+ sampleSettings = sampleSettings,
+ featureEngineeringSettings = featureEngineeringSettings,
+ preprocessSettings = preprocessSettings,
+ modelSettings = gbmModel,
+ logSettings = createLogSettings(),
+ executeSettings = createExecuteSettings(
+ runSplitData = T,
+ runSampleData = T,
+ runfeatureEngineering = T,
+ runPreprocessData = T,
+ runModelDevelopment = T,
+ runCovariateSummary = T
+ ),
+ saveDirectory = file.path(getwd(), 'singlePlpExample2')
+ )
Under the hood the package will now use the R xgboost package to fit a gradient boosting machine model using 75% of the data and will evaluate the model on the remaining 25%. A results data structure is returned containing information about the model, its performance, etc.
+You can save the model using:
+
+savePlpModel(gbmResults$model, dirPath = file.path(getwd(), "model"))
You can load the model using:
+
+plpModel <- loadPlpModel(file.path(getwd(), "model"))
You can also save the full results structure using:
+
+savePlpResult(gbmResults, location = file.path(getwd(), "gbm"))
To load the full results structure use:
+
+gbmResults <- loadPlpResult(file.path(getwd(), "gbm"))
The script we created manually above can also be created automatically using a powerful feature in ATLAS. By creating a new prediction study (left menu) you can select the Target and Outcome as created in ATLAS, set all the study parameters, and then download an R package that you can use to execute your study. What is really powerful is that you can add multiple Ts, Os, covariate settings etc. The package will then run all of the combinations automatically as separate analyses. The screenshots below explain this process.
ATLAS can build an R package for you that will execute the full study against your CDM. The steps below explain how to do this in ATLAS.
Under utilities you can find the download option. Click on the button to review the full study specification.
You now have to confirm that you indeed want to run all these analyses (the Cartesian product of all the settings for each T and O combination).
+If you agree, you give the package a name, and download the +package as a zipfile.
By opening the R package in R studio and building the package you
+can run the study using the execute
function. There is
+also an example CodeToRun.R script available in the extras folder of the
+package with extra instructions.
Once we execute the study, the runPlp() function returns the trained +model and the evaluation of the model on the train/test sets.
+You can interactively view the results by running:
+viewPlp(runPlp=lrResults)
. This will generate a Shiny App
+in your browser in which you can view all performance measures created
+by the framework as shown in the figure below.
Furthermore, many interactive plots are available in the Shiny App, +for example the ROC curve in which you can move over the plot to see the +threshold and the corresponding sensitivity and specificity values.
+To generate and save all the evaluation plots to a folder run the +following code:
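A minimal sketch, assuming the plotPlp() helper and that the output folder is passed via a saveLocation argument (check ?plotPlp for the exact argument name in your package version):

# saveLocation is an assumed argument name; see ?plotPlp
plotPlp(lrResults, saveLocation = file.path(getwd(), "plots"))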
+ +The plots are described in more detail in the next sections.
+ +The Receiver Operating Characteristics (ROC) plot shows the +sensitivity against 1-specificity on the test set. The plot illustrates +how well the model is able to discriminate between the people with the +outcome and those without. The dashed diagonal line is the performance +of a model that randomly assigns predictions. The higher the area under +the ROC plot the better the discrimination of the model. The plot is +created by changing the probability threshold to assign the positive +class.
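If you only want the ROC plot rather than the full set of plots, the package also exports individual plotting functions; as a sketch (verify the exact function name against the package manual):

# plotSparseRoc plots the ROC curve from a runPlp result object
plotSparseRoc(lrResults)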
+## Calibration
The calibration plot shows how close the predicted risk is to the observed risk. The diagonal dashed line thus indicates a perfectly calibrated model. The ten (or fewer) dots represent the mean predicted values for each quantile plotted against the observed fraction of people in that quantile who had the outcome (observed fraction). The straight black line is the linear regression using these 10 plotted quantile mean predicted vs observed fraction points. The straight vertical lines represent the 95% lower and upper confidence intervals of the slope of the fitted line.
Similar to the traditional calibration shown above, the Smooth Calibration plot shows the relationship between predicted and observed risk. The major difference is that the smooth fit allows for a more fine-grained examination of this relationship. Whereas the traditional plot will be heavily influenced by the areas with the highest density of data, the smooth plot will provide the same information for this region as well as a more accurate interpretation of areas with lower density. The plot also contains information on the distribution of the outcomes relative to predicted risk.
However, the increased information gain comes at a computational cost. It is recommended to use the traditional plot for examination and then to produce the smooth plot for final versions. To create the smooth calibration plot you have to run the following command:
+
+plotSmoothCalibration(lrResults)
See the help function for more information on how to set the smoothing method, etc.
The example below is from another study that better demonstrates the impact of using a smooth calibration plot. The default line fit would not highlight the miscalibration at the lower predicted probability levels that well.
+## Preference distribution
+The preference distribution plots are the preference score +distributions corresponding to i) people in the test set with the +outcome (red) and ii) people in the test set without the outcome +(blue).
+## Predicted probability distribution
+The prediction distribution box plots are for the predicted risks of +the people in the test set with the outcome (class 1: blue) and without +the outcome (class 0: red).
The box plots in the Figure show that the predicted probability of the outcome is indeed higher for those with the outcome, but there is also overlap between the two distributions, which leads to imperfect discrimination.
+## Test-Train similarity
+The test-train similarity is assessed by plotting the mean covariate +values in the train set against those in the test set for people with +and without the outcome.
The results for our example look very promising since the mean values of the covariates are on the diagonal.
+## Variable scatter plot
The variable scatter plot shows the mean covariate value for the people with the outcome against the mean covariate value for the people without the outcome. The color of the dots corresponds to inclusion (green) or exclusion (blue) in the model, respectively. It is highly recommended to use the Shiny App since this allows you to hover over a covariate to show more details (name, value, etc.).
+The plot shows that the mean of most of the covariates is higher for +subjects with the outcome compared to those without.
+## Precision recall
+Precision (P) is defined as the number of true positives (Tp) over +the number of true positives plus the number of false positives +(Fp).
+
+P <- Tp/(Tp + Fp)
Recall (R) is defined as the number of true positives (Tp) over the +number of true positives plus the number of false negatives (Fn).
+
+R <- Tp/(Tp + Fn)
These quantities are also related to the (F1) score, which is defined +as the harmonic mean of precision and recall.
+
+F1 <- 2 * P * R/(P + R)
Note that the precision can either decrease or increase if the +threshold is lowered. Lowering the threshold of a classifier may +increase the denominator, by increasing the number of results returned. +If the threshold was previously set too high, the new results may all be +true positives, which will increase precision. If the previous threshold +was about right or too low, further lowering the threshold will +introduce false positives, decreasing precision.
+For Recall the denominator does not depend on the classifier +threshold (Tp+Fn is a constant). This means that lowering the classifier +threshold may increase recall, by increasing the number of true positive +results. It is also possible that lowering the threshold may leave +recall unchanged, while the precision fluctuates.
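To make this threshold behaviour concrete, the toy sketch below (illustrative only, not a package function) computes precision and recall for the same predicted risks at two thresholds:

# toy example: lowering the threshold here increases recall and lowers precision
predictedRisk <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
trueOutcome   <- c(1,   1,   0,   1,   0,   0)

metricsAtThreshold <- function(threshold) {
  predictedPositive <- predictedRisk >= threshold
  tp <- sum(predictedPositive & trueOutcome == 1)
  fp <- sum(predictedPositive & trueOutcome == 0)
  fn <- sum(!predictedPositive & trueOutcome == 1)
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

metricsAtThreshold(0.75) # precision = 1.00, recall = 0.67
metricsAtThreshold(0.35) # precision = 0.75, recall = 1.00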
+## Demographic summary
+This plot shows for females and males the expected and observed risk +in different age groups together with a confidence area.
+The results show that our model is well calibrated across gender and +age groups.
+# External validation
We recommend always performing external validation, i.e., applying the final model to as many new datasets as feasible and evaluating its performance.
+
+# load the trained model
plpModel <- loadPlpModel(file.path(getwd(), 'model'))
+
+# add details of new database
+validationDatabaseDetails <- createDatabaseDetails()
+
+# to externally validate the model and perform recalibration run:
+externalValidateDbPlp(
+ plpModel = plpModel,
+ validationDatabaseDetails = validationDatabaseDetails,
+ validationRestrictPlpDataSettings = plpModel$settings$plpDataSettings,
+ settings = createValidationSettings(
+ recalibrate = 'weakRecalibration'
+ ),
+ outputFolder = getwd()
+)
This will extract the new plpData from the specified schemas and
+cohort tables. It will then apply the same population settings and the
+trained plp model. Finally, it will evaluate the performance and return
+the standard output as validation$performanceEvaluation
and
+it will also return the prediction on the population as
+validation$prediction
. These can be viewed in the Shiny app, together with the model, by running: viewPlp(runPlp = plpResult, validatePlp = validation).
The package has much more functionality than described in this vignette and contributions have been made by many people in the OHDSI community. The table below provides an overview:
+Functionality | +Description | +Vignette | +
---|---|---|
Building Multiple Models | This vignette describes how you can run multiple models automatically | Vignette |
+
Custom Models | +This vignette describes how you can add your own custom algorithms +in the framework | +Vignette |
+
Custom Splitting Functions | +This vignette describes how you can add your own custom +training/validation/testing splitting functions in the framework | +Vignette |
+
Custom Sampling Functions | +This vignette describes how you can add your own custom sampling +functions in the framework | +Vignette |
+
Custom Feature Engineering/Selection | +This vignette describes how you can add your own custom feature +engineering and selection functions in the framework | +Vignette |
+
Ensemble models | This vignette describes how you can use the framework to build ensemble models, i.e., combine multiple models in a super learner | Vignette |
+
Learning curves | +Learning curves assess the effect of training set size on model +performance by training a sequence of prediction models on successively +larger subsets of the training set. A learning curve plot can also help +in diagnosing a bias or variance problem as explained below. | +Vignette |
+
Considerable work has been dedicated to providing the PatientLevelPrediction package.
+citation("PatientLevelPrediction")
##
+## To cite PatientLevelPrediction in publications use:
+##
+## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018). "Design
+## and implementation of a standardized framework to generate and
+## evaluate patient-level prediction models using observational
+## healthcare data." _Journal of the American Medical Informatics
+## Association_, *25*(8), 969-975.
+## <https://doi.org/10.1093/jamia/ocy032>.
+##
+## A BibTeX entry for LaTeX users is
+##
+## @Article{,
+## author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
+## title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
+## journal = {Journal of the American Medical Informatics Association},
+## volume = {25},
+## number = {8},
+## pages = {969-975},
+## year = {2018},
+## url = {https://doi.org/10.1093/jamia/ocy032},
+## }
+Further, PatientLevelPrediction
makes extensive use of
+the Cyclops
package.
+citation("Cyclops")
##
+## To cite Cyclops in publications use:
+##
+## Suchard MA, Simpson SE, Zorych I, Ryan P, Madigan D (2013). "Massive
+## parallelization of serial inference algorithms for complex
+## generalized linear models." _ACM Transactions on Modeling and
+## Computer Simulation_, *23*, 10.
+## <https://dl.acm.org/doi/10.1145/2414416.2414791>.
+##
+## A BibTeX entry for LaTeX users is
+##
+## @Article{,
+## author = {M. A. Suchard and S. E. Simpson and I. Zorych and P. Ryan and D. Madigan},
+## title = {Massive parallelization of serial inference algorithms for complex generalized linear models},
+## journal = {ACM Transactions on Modeling and Computer Simulation},
+## volume = {23},
+## pages = {10},
+## year = {2013},
+## url = {https://dl.acm.org/doi/10.1145/2414416.2414791},
+## }
+Please reference this paper if you use the PLP Package in +your work:
+ +This work is supported in part through the National Science +Foundation grant IIS 1251151.
In the figures below the effect is shown of the removeSubjectsWithPriorOutcome, requireTimeAtRisk, and includeAllOutcomes booleans on the final study population. We start with a Target Cohort with firstExposureOnly = false and we require a washout period = 1095. We then subset the target cohort based on additional constraints. The final study population in the Venn diagrams below is colored green.
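As an illustrative sketch (not the exact settings used to generate the figures), the constraints described above map onto createStudyPopulationSettings arguments along these lines:

# sketch only: population settings corresponding to the constraints described above
populationSettings <- createStudyPopulationSettings(
  firstExposureOnly = FALSE,
  washoutPeriod = 1095,
  removeSubjectsWithPriorOutcome = TRUE, # toggled in the figures below
  requireTimeAtRisk = TRUE,              # toggled in the figures below
  includeAllOutcomes = TRUE,             # toggled in the figures below
  riskWindowStart = 0,
  riskWindowEnd = 365
)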
+vignettes/ClinicalModels.rmd
+ ClinicalModels.rmd
Title | +Link | +
---|---|
Using Machine Learning Applied to Real-World Healthcare Data for +Predictive Analytics: An Applied Example in Bariatric Surgery | +Value +in Health | +
Development and validation of a prognostic model predicting +symptomatic hemorrhagic transformation in acute ischemic stroke at scale +in the OHDSI network | +PLoS +One | +
Wisdom of the CROUD: development and validation of a patient-level +prediction model for opioid use disorder using population-level claims +data | +PLoS +One | +
Developing predictive models to determine Patients in End-of-life +Care in Administrative datasets | +Drug +Safety | +
Predictors of diagnostic transition from major depressive disorder +to bipolar disorder: a retrospective observational network study | +Translational +psychiatry | +
Seek COVER: using a disease proxy to rapidly develop and validate a +personalized risk calculator for COVID-19 outcomes in an international +network | +BMC +Medical Research Methodology | +
90-Day all-cause mortality can be predicted following a total knee +replacement: an international, network study to develop and validate a +prediction model | +Knee +Surgery, Sports Traumatology, Arthroscopy | +
Machine learning and real-world data to predict lung cancer risk in +routine care | +Cancer +Epidemiology, Biomarkers & Prevention | +
Development and validation of a patient-level model to predict +dementia across a network of observational databases | +BMC +medicine | +
vignettes/ConstrainedPredictors.Rmd
+ ConstrainedPredictors.Rmd
Here we provide a set of phenotypes that can be used as predictors in +prediction models or best practice research.
+These phenotypes can be extracted from the PhenotypeLibrary R +package. To install the R package run:
+
+remotes::install_github('ohdsi/PhenotypeLibrary')
To extract the cohort definition for Alcoholism with an id of 1165, +just run:
+
+PhenotypeLibrary::getPlCohortDefinitionSet(1165)
In general, you can extract all the cohorts by running:
+
+phenotypeDefinitions <- PhenotypeLibrary::getPlCohortDefinitionSet(1152:1215)
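Once a phenotype cohort has been instantiated in a cohort table, it can be used as a candidate predictor by defining a cohort covariate. The sketch below is illustrative only and assumes the Alcoholism cohort (id 1165) has been generated into a placeholder cohort schema/table; see the createCohortCovariateSettings reference later in this document for the arguments:

# illustrative sketch: Alcoholism phenotype as a binary predictor in the year before index
alcoholismCovariate <- createCohortCovariateSettings(
  cohortName = "Alcoholism",
  settingId = 1,
  cohortDatabaseSchema = "my_cohort_schema", # placeholder schema
  cohortTable = "cohort",                    # placeholder table
  cohortId = 1165,
  startDay = -365,
  endDay = 0,
  count = FALSE,
  analysisId = 456
)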
Phenotype Name | +Disorder classification | +OHDSI Phenotype library ID | +
---|---|---|
Alcoholism | +Behavioral | +1165 | +
Smoking | +Behavioral | +1166 | +
Anemia | +Blood | +1188 | +
Osteoarthritis | +Bone | +1184 | +
Osteoporosis | +Bone | +1185 | +
Cancer | +Cancer | +1215 | +
Atrial fibrillation | +Cardiovascular | +1160 | +
Congestive heart failure | +Cardiovascular | +1154 | +
Coronary artery disease | +Cardiovascular | +1162 | +
Heart valve disorder | +Cardiovascular | +1172 | +
Hyperlipidemia | +Cardiovascular | +1170 | +
Hypertension | +Cardiovascular | +1198 | +
Angina | +Cardiovascular | +1159 | +
Skin Ulcer | +Debility | +1168 | +
Diabetes type 1 | +Endocrine | +1193 | +
Diabetes type 2 | +Endocrine | +1194 | +
Hypothyroidism | +Endocrine | +1171 | +
Obesity | +Endocrine | +1179 | +
Gastroesophageal reflux disease (GERD) | +GI | +1178 | +
Gastrointestinal (GI) bleed | +GI | +1197 | +
Inflammatory bowel disorder (IBD) | +GI/Rheumatology | +1180 | +
Hormonal contraceptives | +Gynecologic | +1190 | +
Antibiotics Aminoglycosides | +Infection | +1201 | +
Antibiotics Carbapenems | +Infection | +1202 | +
Antibiotics Cephalosporins | +Infection | +1203 | +
Antibiotics Fluoroquinolones | +Infection | +1204 | +
Antibiotics Glycopeptides and lipoglycopeptides | +Infection | +1205 | +
Antibiotics Macrolides | +Infection | +1206 | +
Antibiotics Monobactams | +Infection | +1207 | +
Antibiotics Oxazolidinones | +Infection | +1208 | +
Antibiotics Penicillins | +Infection | +1209 | +
Antibiotics Polypeptides | +Infection | +1210 | +
Antibiotics Rifamycins | +Infection | +1211 | +
Antibiotics Sulfonamides | +Infection | +1212 | +
Antibiotics Streptogramins | +Infection | +1213 | +
Antibiotics Tetracyclines | +Infection | +1214 | +
Pneumonia | +Infection/Respiratory | +1199 | +
Sepsis | +Infection | +1176 | +
Urinary tract infection (UTI) | +Infection | +1186 | +
Hepatitis | +Liver | +1169 | +
Anxiety | +Mood | +1189 | +
Depression (MDD) | +Mood | +1161 | +
Psychotic disorder | +Mood | +1175 | +
Antiepileptics (pain) | +Neurology/Pain | +1183 | +
Seizure | +Neurology | +1153 | +
Hemorrhagic stroke | +Neurology/Vascular | +1156 | +
Non-hemorrhagic stroke | +Neurology/Vascular | +1155 | +
Acetaminophen prescription | +Pain/Infection | +1187 | +
Low back pain | +Pain | +1173 | +
Neuropathy | +Pain/Neurology | +1174 | +
Opioids | +Pain | +1182 | +
Acute kidney injury | +Kidney | +1163 | +
Chronic kidney disease | +Kidney | +1191 | +
Asthma | +Respiratory | +1164 | +
Chronic obstructive pulmonary disorder (COPD) | +Respiratory | +1192 | +
Dyspnea | +Respiratory | +1195 | +
Respiratory failure | +Respiratory | +1177 | +
Sleep apnea | +Respiratory | +1167 | +
Rheumatoid arthritis | +Rheumatology | +1200 | +
Steroids | +Rheumatology/Pain/Pulmonary | +1181 | +
Peripheral vascular disease | +Vascular | +1157 | +
Aspirin | +Vascular | +1158 | +
Deep vein thrombosis (DVT) | +Vascular | +1152 | +
Edema | +Vascular | +1196 | +
Inpatient visit | +NA | +NA | +
vignettes/CreatingLearningCurves.Rmd
+ CreatingLearningCurves.Rmd
This vignette describes how you can use the Observational Health Data
+Sciences and Informatics (OHDSI) PatientLevelPrediction
+package to create learning curves. This vignette assumes you have read
+and are comfortable with building patient level prediction models as
+described in the BuildingPredictiveModels
+vignette.
Prediction models will show overly optimistic performance when predicting on the same data as used for training. Therefore, best practice is to partition our data into a training set and testing set. We then train our prediction model on the training set portion and assess its ability to generalize to unseen data by measuring its performance on the testing set.
+Learning curves assess the effect of training set size on model +performance by training a sequence of prediction models on successively +larger subsets of the training set. A learning curve plot can also help +in diagnosing a bias or variance problem as explained below.
Figure 1 shows an example of a learning curve plot in which the vertical axis represents the model performance and the horizontal axis the training set size. If the training set size is small, the performance on the training set is high, because a model can often be fitted well to a limited number of training examples. At the same time, the performance on the testing set will be poor, because the model trained on such a limited number of training examples will not generalize well to unseen data in the testing set. As the training set size increases, the performance of the model on the training set will decrease. It becomes more difficult for the model to find a good fit through all the training examples. Also, the model will be trained on a more representative portion of training examples, making it generalize better to unseen data. This can be observed by the increasing testing set performance.
The learning curve can help us in diagnosing bias and variance problems with our classifier, which will provide guidance on how to further improve our model. We can observe high variance (overfitting) in a prediction model if it performs well on the training set, but poorly on the testing set (Figure 2). Adding additional data is a common approach to counteract high variance. From the learning curve it becomes apparent that adding additional data may improve performance on the testing set a little further, as the learning curve has not yet plateaued and, thus, the model is not saturated yet. Therefore, adding more data will decrease the gap between training set and testing set, which is the main indicator for a high variance problem.
+Furthermore, we can observe high bias (underfitting) if a prediction +model performs poorly on the training set as well as on the testing set +(Figure 3). The learning curves of training set and testing set have +flattened on a low performance with only a small gap in between them. +Adding additional data will in this case have little to no impact on the +model performance. Choosing another prediction algorithm that can find +more complex (for example non-linear) relationships in the data may be +an alternative approach to consider in this high bias situation.
Use the PatientLevelPrediction package to create a plpData object. Alternatively, you can make use of the data simulator. The following code snippet creates data for 12000 patients.
+set.seed(1234)
+data(plpDataSimulationProfile)
+sampleSize <- 12000
+plpData <- simulatePlpData(
+ plpDataSimulationProfile,
+ n = sampleSize
+)
Specify the population settings (this does additional exclusions such +as requiring minimum prior observation or no prior outcome as well as +specifying the time-at-risk period to enable labels to be created):
+
+populationSettings <- createStudyPopulationSettings(
+ binary = TRUE,
+ firstExposureOnly = FALSE,
+ washoutPeriod = 0,
+ removeSubjectsWithPriorOutcome = FALSE,
+ priorOutcomeLookback = 99999,
+ requireTimeAtRisk = FALSE,
+ minTimeAtRisk = 0,
+ riskWindowStart = 0,
+ riskWindowEnd = 365,
+ verbosity = "INFO"
+)
Specify the prediction algorithm to be used.
+
+# Use LASSO logistic regression
+modelSettings <- setLassoLogisticRegression()
Specify the split settings and a sequence of training set fractions (these override the trainFraction in the split settings). Alternatively, you can provide a sequence of training events (trainEvents) instead of training set fractions. This is recommended, because our research has shown that the number of events is the key determinant of model performance. Make sure that your training set contains the number of events specified.
+splitSettings = createDefaultSplitSetting(
+ testFraction = 0.2,
+ type = 'stratified',
+ splitSeed = 1000
+ )
+
+trainFractions <- seq(0.1, 0.8, 0.1) # Create eight training set fractions
+
+# alternatively use a sequence of training events by uncommenting the line below.
+# trainEvents <- seq(100, 5000, 100)
Create the learning curve object.
+
+learningCurve <- createLearningCurve(
+ plpData = plpData,
+ outcomeId = 2,
+ parallel = T,
+ cores = 4,
+ modelSettings = modelSettings,
+ saveDirectory = getwd(),
+ analysisId = 'learningCurve',
+ populationSettings = populationSettings,
+ splitSettings = splitSettings,
+ trainFractions = trainFractions,
+ trainEvents = NULL,
+ preprocessSettings = createPreprocessSettings(
+ minFraction = 0.001,
+ normalize = T
+ ),
+ executeSettings = createExecuteSettings(
+ runSplitData = T,
+ runSampleData = F,
+ runfeatureEngineering = F,
+ runPreprocessData = T,
+ runModelDevelopment = T,
+ runCovariateSummary = F
+ )
+)
Plot the learning curve object (Figure 4). Specify one of the
+available metrics: AUROC
, AUPRC
,
+sBrier
. Moreover, you can specify what metric to put on the
+abscissa, number of observations
or number of
+events
. We recommend the latter, because
+events
are determinant of model performance and allow you
+to better compare learning curves across different prediction problems
+and databases.
+plotLearningCurve(
+ learningCurve,
+ metric = 'AUROC',
+ abscissa = 'events',
+ plotTitle = 'Learning Curve',
+ plotSubtitle = 'AUROC performance'
+)
The learning curve object can be created in parallel, which can reduce computation time significantly. Whether to run the code in parallel or not is specified using the parallel input. Currently this functionality is only available for LASSO logistic regression and gradient boosting machines. Depending on the number of parallel workers it may require a significant amount of memory. We advise using the parallelized learning curve function for parameter search and exploratory data analysis.
When running in parallel, R will find the number of available processing cores automatically and register the required parallel backend. Alternatively, you can provide the number of cores you wish to use via the cores input.
We have added a demo of the learning curve:
+
+# Show all demos in our package:
+ demo(package = "PatientLevelPrediction")
+
+# Run the learning curve
+ demo("LearningCurveDemo", package = "PatientLevelPrediction")
Do note that running this demo can take a considerable amount of time (15 minutes on a quad-core machine running in parallel)!
A publication titled ‘How little data do we need for patient-level prediction?’ uses the learning curve functionality in this package and can be accessed as a preprint in the arXiv archives at https://arxiv.org/abs/2008.07361.
Considerable work has been dedicated to providing the PatientLevelPrediction package.
+citation("PatientLevelPrediction")
##
+## To cite PatientLevelPrediction in publications use:
+##
+## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018). "Design
+## and implementation of a standardized framework to generate and
+## evaluate patient-level prediction models using observational
+## healthcare data." _Journal of the American Medical Informatics
+## Association_, *25*(8), 969-975.
+## <https://doi.org/10.1093/jamia/ocy032>.
+##
+## A BibTeX entry for LaTeX users is
+##
+## @Article{,
+## author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
+## title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
+## journal = {Journal of the American Medical Informatics Association},
+## volume = {25},
+## number = {8},
+## pages = {969-975},
+## year = {2018},
+## url = {https://doi.org/10.1093/jamia/ocy032},
+## }
+Please reference this paper if you use the PLP Package in +your work:
+ +vignettes/CreatingNetworkStudies.Rmd
+ CreatingNetworkStudies.Rmd
The OHDSI Patient Level Prediction (PLP) package provides the
+framework to implement prediction models at scale. This can range from
+developing a large number of models across sites (methodology and study
+design insight) to extensive external validation of existing models in
+the OHDSI PLP framework (model insight). This vignette describes how you
+can use the PatientLevelPrediction
package to create a
+network study package.
The open access publication A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data details the process used to develop and validate prediction models using the OHDSI prediction framework and tools. This publication describes each of the steps and then demonstrates these by focusing on predicting death in those who have COVID-19.
The study creator has the first option to be first author. If they do not wish to be first author, they can pick the most suitable person from the contributors. All contributors will be listed as authors on the paper. The last author will be the person who led/managed the study; if this was the first author, then the first author can pick the most suitable last author. All authors between the first and last author will be ordered alphabetically by last name.
+vignettes/InstallationGuide.Rmd
+ InstallationGuide.Rmd
This vignette describes how to install the Observational Health Data Sciences and Informatics (OHDSI) PatientLevelPrediction package under Windows, Mac, and Linux.
Under Windows the OHDSI Patient Level Prediction (PLP) package +requires installing:
+Under Mac and Linux the OHDSI Patient Level Prediction (PLP) package +requires installing:
+The preferred way to install the package is by using
+remotes
, which will automatically install the latest
+release and all the latest dependencies.
If you do not want the official release you could install the +bleeding edge version of the package (latest develop branch).
+Note that the latest develop branch could contain bugs, please report +them to us if you experience problems.
+To install using remotes
run:
+install.packages("remotes")
+remotes::install_github("OHDSI/PatientLevelPrediction")
When installing, make sure to close any other RStudio sessions that are using PatientLevelPrediction or any dependency. Keeping RStudio sessions open can cause locks that prevent the package from installing.
Many of the classifiers in PatientLevelPrediction use a Python backend. To set up a Python environment run:
+library(PatientLevelPrediction)
+reticulate::install_miniconda()
+configurePython(envname='r-reticulate', envtype='conda')
Installation issues need to be posted in our issue tracker: http://github.com/OHDSI/PatientLevelPrediction/issues
+The list below provides solutions for some common issues:
If you have an error when trying to install a package in R saying ‘Dependency X not available …’ then this can sometimes be fixed by running install.packages('X') and then, once that completes, trying to reinstall the package that had the error.
I have found that using the GitHub remotes approach to install packages can be impacted if you have multiple R sessions open, as one session with a library loaded can cause the library to be locked, and this can prevent an install of a package that depends on that library.
To make sure R uses the r-reticulate Python environment you may need to edit your .Rprofile with the location of the Python binary for the PLP environment. Edit the .Rprofile by running:
+
+usethis::edit_r_profile()
and add
+
+Sys.setenv(PATH = paste("your python bin location", Sys.getenv("PATH"), sep=":"))
to the file and then save, where your python bin location is the location returned by
+
+reticulate::conda_list()
e.g., My PLP virtual environment location was /anaconda3/envs/PLP/bin/python so I added:
Sys.setenv(PATH = paste("/anaconda3/envs/PLP/bin", Sys.getenv("PATH"), sep=":"))
Considerable work has been dedicated to providing the PatientLevelPrediction package.
+citation("PatientLevelPrediction")
##
+## To cite PatientLevelPrediction in publications use:
+##
+## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018). "Design
+## and implementation of a standardized framework to generate and
+## evaluate patient-level prediction models using observational
+## healthcare data." _Journal of the American Medical Informatics
+## Association_, *25*(8), 969-975.
+## <https://doi.org/10.1093/jamia/ocy032>.
+##
+## A BibTeX entry for LaTeX users is
+##
+## @Article{,
+## author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
+## title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
+## journal = {Journal of the American Medical Informatics Association},
+## volume = {25},
+## number = {8},
+## pages = {969-975},
+## year = {2018},
+## url = {https://doi.org/10.1093/jamia/ocy032},
+## }
+Please reference this paper if you use the PLP Package in +your work:
+ +This work is supported in part through the National Science +Foundation grant IIS 1251151.
+vignettes/Videos.rmd
+ Videos.rmd
Click To Launch | +Description of Demo | +
---|---|
+ | Learn what a cohort table looks like and what columns are +required. | +
Click To Launch | +Description of Demo | +
---|---|
+ | Learn how to configure the connection to your OMOP CDM data from R +using the OHDSI DatabaseConnector package. | +
Click To Launch | +Description of Demo | +
---|---|
+ | Learn how to develop and validate a single PatientLevelPrediction +model. | +
Click To Launch | +Description of Demo | +
---|---|
+ | Learn how to develop and validate multiple PatientLevelPrediction +models. | +
Click To Launch | +Description of Demo | +
---|---|
+ | Learn how to interactively explore the model performance and model +via the shiny apps viewPlp() and viewMultiplePlp() | +
PatientLevelPrediction is an R package for building and validating patient-level predictive models using data in the OMOP Common Data Model format.
+Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. J Am Med Inform Assoc. 2018;25(8):969-975.
+The figure below illustrates the prediction problem we address. Among a population at risk, we aim to predict which patients at a defined moment in time (t = 0) will experience some outcome during a time-at-risk. Prediction is done using only information about the patients in an observation window prior to that moment in time.
To define a prediction problem we have to define t=0 by a Target Cohort (T), the outcome we would like to predict by an outcome cohort (O), and the time-at-risk (TAR). Furthermore, we have to make design choices for the model we would like to develop, and determine the observational datasets to perform internal and external validation. This conceptual framework works for all types of prediction problems, for example those presented below (T=green, O=red).
Example figures: Calibration Plot | ROC Plot
Demo of the Shiny Apps can be found here:
+ +PatientLevelPrediction is an R package, with some functions using python through reticulate.
+Requires R (version 4.0 or higher). Installation on Windows requires RTools. Libraries used in PatientLevelPrediction require Java and Python.
The Python installation is required for some of the machine learning algorithms. We advise installing Python 3.8 or higher using Anaconda (https://www.continuum.io/downloads).
+To install the package please read the Package Installation guide
Have a look at the video below for an extensive demo of the package.
Please read the main vignette for the package:
+ +In addition we have created vignettes that describe advanced functionality in more detail:
+Package manual: PatientLevelPrediction.pdf
+Documentation can be found on the package website.
+PDF versions of the documentation are also available, as mentioned above.
+Read here how you can contribute to this package.
+NEWS.md
Small bug fixes:
- added analysisId into model saving/loading
- made external validation saving recursive
- added removal of patients with negative TAR when creating population
- added option to apply model without preprocessing settings (make them NULL)
- updated create study population to remove patients with negative time-at-risk

Changes:
- merged in bug fix from Martijn
- fixed AUC bug causing crash with big data
- update SQL code to be compatible with v6.0 OMOP CDM
- added save option to external validate PLP

Changes:
- Updated splitting functions to include a splitby subject and renamed personSplitter to randomSplitter
- Cast indices to integer in python functions to fix bug with non integer sparse matrix indices

Changes:
- Added GLM status to log (will now inform about any fitting issue in log)
- Added GBM survival model (still under development)
- Added RF quantile regression (still under development)
- Updated viewMultiplePlp() to match PLP skeleton package app
- Updated single plp vignette with additional example
- Merge in deep learning updates from Chan
This function takes covariate data and a cohort/population and remaps the covariate and row ids, restricts to the population, and saves/creates the mapping
+MapIds(covariateData, cohort = NULL, mapping = NULL)
a covariateData object
If specified, rowIds are restricted to the ones in the cohort
A predefined mapping to use
A package for running predictions using data in the OMOP CDM
+Calculate the accuracy
+accuracy(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
accuracy value
+Calculate the accuracy
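For example, using the standard definition accuracy = (TP + TN) / (TP + TN + FP + FN), a toy call would be:

# toy counts: (80 + 90) / (80 + 90 + 20 + 10) = 0.85
accuracy(TP = 80, TN = 90, FN = 10, FP = 20)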
+R/uploadToDatabaseDiagnostics.R
+ addDiagnosePlpToDatabase.Rd
This function inserts a diagnostic result into the result schema
+addDiagnosePlpToDatabase(
+ diagnosePlp,
+ connectionDetails,
+ databaseSchemaSettings,
+ cohortDefinitions,
+ databaseList = NULL,
+ overWriteIfExists = T
+)
An object of class diagnosePlp
A connection details created by using the
+function createConnectionDetails
in the
+DatabaseConnector
package.
An object created by createDatabaseSchemaSettings
with all the settings specifying the result tables
A set of one or more cohorts extracted using ROhdsiWebApi::exportCohortDefinitionSet()
(Optional) If you wish to overwrite the settings in the plp object use createDatabaseList
to specify the databases
(default: T) Whether to delete existing results and overwrite them
Returns NULL but uploads the diagnostic into the database schema specified in databaseSchemaSettings
+This function can be used to upload a diagnostic result into a database
+R/uploadToDatabaseDiagnostics.R
+ addMultipleDiagnosePlpToDatabase.Rd
This function inserts diagnosePlp results into the result schema
+addMultipleDiagnosePlpToDatabase(
+ connectionDetails,
+ databaseSchemaSettings,
+ cohortDefinitions,
+ databaseList = NULL,
+ resultLocation
+)
A connection details created by using the
+function createConnectionDetails
in the
+DatabaseConnector
package.
An object created by createDatabaseSchemaSettings
with all the settings specifying the result tables
(list) A list of cohortDefinitions (each list must contain: name, id)
(Optional) ...
The location of the diagnostic results
Returns NULL but uploads multiple diagnosePlp results into the database schema specified in databaseSchemaSettings
+This function can be used to upload diagnosePlp results into a database
+R/uploadToDatabase.R
+ addMultipleRunPlpToDatabase.Rd
This function formats and uploads results that have been generated via an ATLAS prediction package into a database
+addMultipleRunPlpToDatabase(
+ connectionDetails,
+ databaseSchemaSettings = createDatabaseSchemaSettings(resultSchema = "main"),
+ cohortDefinitions,
+ databaseList = NULL,
+ resultLocation = NULL,
+ resultLocationVector,
+ modelSaveLocation
+)
A connection details created by using the
+function createConnectionDetails
in the
+DatabaseConnector
package.
An object created by createDatabaseSchemaSettings
with all the settings specifying the result tables
A set of one or more cohorts extracted using ROhdsiWebApi::exportCohortDefinitionSet()
(Optional) A list created by createDatabaseList
to specify the databases
(string) location of directory where the main package results were saved
(only used when resultLocation is missing) a vector of locations with development or validation results
The location of the file system for saving the models in a subdirectory
Returns NULL but uploads all the results in resultLocation to the PatientLevelPrediction result tables in resultSchema
This function can be used to upload PatientLevelPrediction results into a database
+R/uploadToDatabase.R
+ addRunPlpToDatabase.Rd
This function adds a runPlp or external validation result into a database
+addRunPlpToDatabase(
+ runPlp,
+ connectionDetails,
+ databaseSchemaSettings,
+ cohortDefinitions,
+ modelSaveLocation,
+ databaseList = NULL
+)
An object of class runPlp
or class externalValidatePlp
A connection details created by using the
+function createConnectionDetails
in the
+DatabaseConnector
package.
An object created by createDatabaseSchemaSettings
with all the settings specifying the result tables
A set of one or more cohorts extracted using ROhdsiWebApi::exportCohortDefinitionSet()
The location of the directory that models will be saved to
(Optional) If you want to change the database name then use createDatabaseList
to specify the database settings, but use the same cdmDatabaseId as was used for model development/validation
Returns a data.frame with the database details
+This function is used when inserting results into the PatientLevelPrediction database results schema
+Calculate the average precision
+averagePrecision(prediction)
A prediction object
The average precision
Calculates the average precision from a prediction object
+brierScore
+brierScore(prediction)
A prediction object
A list containing the brier score and the scaled brier score
Calculates the brierScore from a prediction object
+calibrationLine
+calibrationLine(prediction, numberOfStrata = 10)
A prediction object
The number of groups to split the prediction into
Calculates the calibration from a prediction object
+Compute the area under the ROC curve
+computeAuc(prediction, confidenceInterval = FALSE)
A prediction object as generated using the
+predict
functions.
Should 95 percent confidence intervals be computed?
Computes the area under the ROC curve for the predicted probabilities, given the true observed +outcomes.
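A toy sketch; the column names used here (value for the predicted probability and outcomeCount for the observed label) are an assumption about the layout of the prediction object, so check the output of the predict functions in your version:

# toy prediction object with assumed columns
prediction <- data.frame(
  value = c(0.1, 0.2, 0.6, 0.8),
  outcomeCount = c(0, 0, 1, 1)
)
computeAuc(prediction, confidenceInterval = FALSE)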
+R/SklearnClassifier.R
+ computeGridPerformance.Rd
Computes grid performance with a specified performance function
+computeGridPerformance(prediction, param, performanceFunct = "computeAuc")
a dataframe with predictions and outcomeCount per rowId
a list of hyperparameters
A string specifying which performance function to use. Default ``'computeAuc'``
A list with overview of the performance
+R/HelperFunctions.R
+ configurePython.Rd
Sets up a virtual environment to use for PLP (can be conda or python)
+configurePython(envname = "PLP", envtype = NULL, condaPythonVersion = "3.11")
A string for the name of the virtual environment (default is 'PLP')
An option for specifying the environment as 'conda' or 'python'. If NULL then the default is 'conda' for Windows users and 'python' for non-Windows users
String, Python version to use when creating a conda environment
This function creates a virtual environment that can be used by PatientLevelPrediction and installs all the required package dependencies. If using python, pip must be set up.
Summarises the covariateData to calculate the mean and standard deviation per covariate. If the labels are input it also stratifies this by class label, and if the trainRowIds and testRowIds specifying the patients in the train/test sets respectively are input, these values are also stratified by train and test set
+covariateSummary(
+ covariateData,
+ cohort,
+ labels = NULL,
+ strata = NULL,
+ variableImportance = NULL,
+ featureEngineering = NULL
+)
The covariateData part of the plpData that is
+extracted using getPlpData
The patient cohort to calculate the summary
A data.frame with the columns rowId and outcomeCount
A data.frame containing the columns rowId, strataName
A data.frame with the columns covariateId and +value (the variable importance value)
(currently not used ) +A function or list of functions specifying any feature engineering +to create covariates before summarising
A data.frame containing: CovariateCount CovariateMean and CovariateStDev plus these values +for any specified stratification
+The function calculates various metrics to measure the performance of the model
+R/AdditionalCovariates.R
+ createCohortCovariateSettings.Rd
Extracts covariates based on cohorts
+createCohortCovariateSettings(
+ cohortName,
+ settingId,
+ cohortDatabaseSchema,
+ cohortTable,
+ cohortId,
+ startDay = -30,
+ endDay = 0,
+ count = F,
+ ageInteraction = F,
+ lnAgeInteraction = F,
+ analysisId = 456
+)
Name for the cohort
A unique id for the covariate time and
The schema of the database with the cohort
the table name that contains the covariate cohort
cohort id for the covariate cohort
The number of days prior to index to start observing the cohort
The number of days prior to index to stop observing the cohort
If FALSE the covariate value is binary (1 means cohort occurred between index+startDay and index+endDay, 0 means it did not) +If TRUE then the covariate value is the number of unique cohort_start_dates between index+startDay and index+endDay
If TRUE multiply the covariate value by the patient's age in years
If TRUE multiply the covariate value by the log of the patient's age in years
The analysisId for the covariate
An object of class covariateSettings specifying how to create the cohort covariate, with covariateId = cohortId x 100000 + settingId x 1000 + analysisId
+The user specifies a cohort and time period and then a covariate is constructed whether they are in the +cohort during the time periods relative to target population cohort index
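For example, with cohortId = 1165, settingId = 1 and analysisId = 456, the resulting covariateId would be 1165 x 100000 + 1 x 1000 + 456 = 116501456.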
+R/ExtractData.R
+ createDatabaseDetails.Rd
Create a setting that holds the details about the cdmDatabase connection for data extraction
+createDatabaseDetails(
+ connectionDetails,
+ cdmDatabaseSchema,
+ cdmDatabaseName,
+ cdmDatabaseId,
+ tempEmulationSchema = cdmDatabaseSchema,
+ cohortDatabaseSchema = cdmDatabaseSchema,
+ cohortTable = "cohort",
+ outcomeDatabaseSchema = cdmDatabaseSchema,
+ outcomeTable = "cohort",
+ targetId = NULL,
+ outcomeIds = NULL,
+ cdmVersion = 5,
+ cohortId = NULL
+)
An R object of type connectionDetails
created using the
+function createConnectionDetails
in the
+DatabaseConnector
package.
The name of the database schema that contains the OMOP CDM instance. Requires read permissions to this database. On SQL Server, this should specify both the database and the schema, so for example 'cdm_instance.dbo'.
A string with the name of the database - this is used in the shiny app and when externally validating models to name the result list and to specify the folder name when saving validation results (defaults to cdmDatabaseSchema if not specified)
A string with a unique identifier for the database and version - this is stored in the plp object for future reference and used by the shiny app (defaults to cdmDatabaseSchema if not specified)
For dbms like Oracle only: the name of the database schema where you want all temporary tables to be managed. Requires create/insert permissions to this database.
The name of the database schema that is the location where the +target cohorts are available. Requires read +permissions to this database.
The tablename that contains the target cohorts. Expectation is cohortTable +has format of COHORT table: COHORT_DEFINITION_ID, SUBJECT_ID, +COHORT_START_DATE, COHORT_END_DATE.
The name of the database schema that is the location where the +data used to define the outcome cohorts is available. Requires read permissions to +this database.
The tablename that contains the outcome cohorts. Expectation is +outcomeTable has format of COHORT table: COHORT_DEFINITION_ID, +SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE.
An integer specifying the cohort id for the target cohort
A single integer or vector of integers specifying the cohort ids for the outcome cohorts
Define the OMOP CDM version used: currently support "4" and "5".
(deprecated: use targetId) old input for the target cohort id
A list with the database specific settings (this is used by the runMultiplePlp function and the skeleton packages)
+This function simply stores the settings for communicating with the cdmDatabase when extracting +the target cohort and outcomes
+R/uploadToDatabase.R
+ createDatabaseList.Rd
This function creates a list with the database details and database meta data entries used in the study
+createDatabaseList(cdmDatabaseSchemas, cdmDatabaseNames, databaseRefIds = NULL)
(string vector) A vector of the cdmDatabaseSchemas used in the study - if the schemas are not unique per database please also specify databaseRefId
Sharable names for the databases
(string vector) Unique database identifiers - what you specified as cdmDatabaseId in PatientLevelPrediction::createDatabaseDetails()
when developing the models
Returns a data.frame with the database details
+This function is used when inserting database details into the PatientLevelPrediction database results schema
+R/uploadToDatabase.R
+ createDatabaseSchemaSettings.Rd
This function specifies where the results schema is and lets you pick a different schema for the cohorts and databases
+createDatabaseSchemaSettings(
+ resultSchema = "main",
+ tablePrefix = "",
+ targetDialect = "sqlite",
+ tempEmulationSchema = getOption("sqlRenderTempEmulationSchema"),
+ cohortDefinitionSchema = resultSchema,
+ tablePrefixCohortDefinitionTables = tablePrefix,
+ databaseDefinitionSchema = resultSchema,
+ tablePrefixDatabaseDefinitionTables = tablePrefix
+)
(string) The name of the database schema with the result tables.
(string) A string that appends to the PatientLevelPrediction result tables
(string) The database management system being used
(string) The temp schema used when the database management system is oracle
(string) The name of the database schema with the cohort definition tables (defaults to resultSchema).
(string) A string that appends to the cohort definition tables
(string) The name of the database schema with the database definition tables (defaults to resultSchema).
(string) A string that appends to the database definition tables
Returns a list of class 'plpDatabaseResultSchema' with all the database settings
+This function can be used to specify the database settings used to upload PatientLevelPrediction results into a database
+R/RunPlpHelpers.R
+ createDefaultExecuteSettings.Rd
Creates default list of settings specifying what parts of runPlp to execute
+createDefaultExecuteSettings()
list with TRUE for split, preprocess, model development and covariate summary
+runs split, preprocess, model development and covariate summary
+R/DataSplitting.R
+ createDefaultSplitSetting.Rd
Create the settings for defining how the plpData are split into test/validation/train sets using +default splitting functions (either random stratified by outcome, time or subject splitting)
+createDefaultSplitSetting(
+ testFraction = 0.25,
+ trainFraction = 0.75,
+ splitSeed = sample(1e+05, 1),
+ nfold = 3,
+ type = "stratified"
+)
(numeric) A real number between 0 and 1 indicating the test set fraction of the data
(numeric) A real number between 0 and 1 indicating the train set fraction of the data. +If not set train is equal to 1 - test
(numeric) A seed to use when splitting the data for reproducibility (if not set a random number will be generated)
(numeric) An integer > 1 specifying the number of folds used in cross validation
(character) Choice of:
'stratified' Each data point is randomly assigned into the test or a train fold set but this is done stratified such that the outcome rate is consistent in each partition
'time' Older data are assigned into the training set and newer data are assigned into the test set
'subject' Data are partitioned by subject, if a subject is in the data more than once, all the data points for the subject are assigned either into the test data or into the train data (not both).
An object of class splitSettings
Returns an object of class splitSettings
that specifies the splitting function that will be called and the settings
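A usage sketch based on the arguments above (a stratified 75/25 split with 3-fold cross validation):

splitSettings <- createDefaultSplitSetting(
  testFraction = 0.25,
  trainFraction = 0.75,
  splitSeed = 42,
  nfold = 3,
  type = "stratified"
)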
R/RunPlpHelpers.R
+ createExecuteSettings.Rd
Creates list of settings specifying what parts of runPlp to execute
+createExecuteSettings(
+ runSplitData = F,
+ runSampleData = F,
+ runfeatureEngineering = F,
+ runPreprocessData = F,
+ runModelDevelopment = F,
+ runCovariateSummary = F
+)
TRUE or FALSE whether to split data into train/test
TRUE or FALSE whether to over or under sample
TRUE or FALSE whether to do feature engineering
TRUE or FALSE whether to do preprocessing
TRUE or FALSE whether to develop the model
TRUE or FALSE whether to create covariate summary
list with TRUE/FALSE for each part of runPlp
+define what parts of runPlp to execute
+R/FeatureEngineering.R
+ createFeatureEngineeringSettings.Rd
Create the settings for defining any feature engineering that will be done
+createFeatureEngineeringSettings(type = "none")
(character) Choice of:
'none' No feature engineering - this is the default
An object of class featureEngineeringSettings
Returns an object of class featureEngineeringSettings
that specifies the feature engineering function that will be called and the settings
Creates a learning curve object, which can be plotted using the
+ plotLearningCurve()
function.
createLearningCurve(
+ plpData,
+ outcomeId,
+ parallel = T,
+ cores = 4,
+ modelSettings,
+ saveDirectory = getwd(),
+ analysisId = "learningCurve",
+ populationSettings = createStudyPopulationSettings(),
+ splitSettings = createDefaultSplitSetting(),
+ trainFractions = c(0.25, 0.5, 0.75),
+ trainEvents = NULL,
+ sampleSettings = createSampleSettings(),
+ featureEngineeringSettings = createFeatureEngineeringSettings(),
+ preprocessSettings = createPreprocessSettings(minFraction = 0.001, normalize = T),
+ logSettings = createLogSettings(),
+ executeSettings = createExecuteSettings(runSplitData = T, runSampleData = F,
+ runfeatureEngineering = F, runPreprocessData = T, runModelDevelopment = T,
+ runCovariateSummary = F)
+)
An object of type plpData
- the patient level prediction
+data extracted from the CDM.
(integer) The ID of the outcome.
Whether to run the code in parallel
The number of computer cores to use if running in parallel
An object of class modelSettings
created using one of the function:
setLassoLogisticRegression()
A lasso logistic regression model
setGradientBoostingMachine()
A gradient boosting machine
setAdaBoost()
An ada boost model
setRandomForest()
A random forest model
setDecisionTree()
A decision tree model
setKNN()
A KNN model
The path to the directory where the results will be saved (if NULL uses working directory)
(integer) Identifier for the analysis. It is used to create, e.g., the result folder. Default is a timestamp.
An object of type populationSettings
created using createStudyPopulationSettings
that
+specifies how the data class labels are defined and addition any exclusions to apply to the
+plpData cohort
An object of type splitSettings
that specifies how to split the data into train/validation/test.
+The default settings can be created using createDefaultSplitSetting
.
A list of training fractions to create models for.
+Note, providing trainEvents
will override your input to
+trainFractions
.
Events have been shown to be a determinant of model performance.
+Therefore, it is recommended to provide trainEvents
rather than
+trainFractions
. Note, providing trainEvents
will override
+your input to trainFractions
. The format should be as follows:
c(500, 1000, 1500)
- a list of training events
An object of type sampleSettings
that specifies any under/over sampling to be done.
+The default is none.
An object of featureEngineeringSettings
specifying any feature engineering to be learned (using the train data)
An object of preprocessSettings
. This setting specifies the minimum fraction of
+target population who must have a covariate for it to be included in the model training
+and whether to normalise the covariates before training
An object of logSettings
created using createLogSettings
+specifying how the logging is done
An object of executeSettings
specifying which parts of the analysis to run
A learning curve object containing the various performance measures
+ obtained by the model for each training set fraction. It can be plotted
+ using plotLearningCurve
.
if (FALSE) {
+# define model
+modelSettings = PatientLevelPrediction::setLassoLogisticRegression()
+
+# create learning curve
+learningCurve <- PatientLevelPrediction::createLearningCurve(population,
+ plpData,
+ modelSettings)
+# plot learning curve
+PatientLevelPrediction::plotLearningCurve(learningCurve)
+}
+
+
R/Logging.R
+ createLogSettings.Rd
Create the settings for logging the progression of the analysis
+createLogSettings(verbosity = "DEBUG", timeStamp = T, logName = "runPlp Log")
Sets the level of the verbosity. If the log level is at or higher in priority than the logger threshold, a message will print. The levels are:
DEBUG Highest verbosity showing all debug statements
TRACE Showing information about start and end of steps
INFO Show informative information (Default)
WARN Show warning messages
ERROR Show error messages
FATAL Be silent except for fatal errors
If TRUE a timestamp will be added to each logging statement. Automatically switched on for TRACE level.
A string reference for the logger
An object of class logSettings
Returns an object of class logSettings
that specifies the logger settings
R/RunMultiplePlp.R
+ createModelDesign.Rd
Specify settings for developing a single model
+createModelDesign(
+ targetId,
+ outcomeId,
+ restrictPlpDataSettings = createRestrictPlpDataSettings(),
+ populationSettings = createStudyPopulationSettings(),
+ covariateSettings = FeatureExtraction::createDefaultCovariateSettings(),
+ featureEngineeringSettings = NULL,
+ sampleSettings = NULL,
+ preprocessSettings = NULL,
+ modelSettings = NULL,
+ splitSettings = createDefaultSplitSetting(type = "stratified", testFraction = 0.25,
+ trainFraction = 0.75, splitSeed = 123, nfold = 3),
+ runCovariateSummary = T
+)
The id of the target cohort that will be used for data extraction (e.g., the ATLAS id)
The id of the outcome that will be used for data extraction (e.g., the ATLAS id)
The settings specifying the extra restriction settings when extracting the data created using createRestrictPlpDataSettings()
.
The population settings specified by createStudyPopulationSettings()
The covariate settings, this can be a list or a single 'covariateSetting'
object.
Either NULL or an object of class featureEngineeringSettings
specifying any feature engineering used during model development
Either NULL or an object of class sampleSettings
with the over/under sampling settings used for model development
Either NULL or an object of class preprocessSettings
created using createPreprocessSettings()
The model settings such as setLassoLogisticRegression()
The train/validation/test splitting used by all analyses created using createDefaultSplitSetting()
Whether to run the covariateSummary
A list with analysis settings used to develop a single prediction model
This specifies a single analysis for developing a single model
+R/uploadToDatabase.R
+ createPlpResultTables.Rd
This function executes a large set of SQL statements to create tables that can store models and results
+createPlpResultTables(
+ connectionDetails,
+ targetDialect = "postgresql",
+ resultSchema,
+ deleteTables = T,
+ createTables = T,
+ tablePrefix = "",
+ tempEmulationSchema = getOption("sqlRenderTempEmulationSchema"),
+ testFile = NULL
+)
The database connection details
The database management system being used
The name of the database schema where the result tables will be created.
If true any existing tables matching the PatientLevelPrediction result tables names will be deleted
If true the PatientLevelPrediction result tables will be created
A string that appends to the PatientLevelPrediction result tables
The temp schema used when the database management system is oracle
(used for testing) The location of an sql file with the table creation code
Returns NULL but creates the required tables into the specified database schema(s).
+This function can be used to create (or delete) PatientLevelPrediction result tables
+R/PreprocessingData.R
+ createPreprocessSettings.Rd
Create the settings for preprocessing the trainData.
+createPreprocessSettings(
+ minFraction = 0.001,
+ normalize = TRUE,
+ removeRedundancy = TRUE
+)
The minimum fraction of target population who must have a covariate for it to be included in the model training
Whether to normalise the covariates before training (Default: TRUE)
Whether to remove redundant features (Default: TRUE)
An object of class preprocessingSettings
Returns an object of class preprocessingSettings
that specifies how to preprocess the training data
R/FeatureEngineering.R
+ createRandomForestFeatureSelection.Rd
Create the settings for random forest based feature selection
+createRandomForestFeatureSelection(ntrees = 2000, maxDepth = 17)
Number of trees in the forest
Max depth of each tree
An object of class featureEngineeringSettings
Returns an object of class featureEngineeringSettings
that specifies the feature selection function that will be called and its settings
R/ExtractData.R
+ createRestrictPlpDataSettings.Rd
This function creates the settings used to restrict the target cohort when calling getPlpData
+createRestrictPlpDataSettings(
+ studyStartDate = "",
+ studyEndDate = "",
+ firstExposureOnly = F,
+ washoutPeriod = 0,
+ sampleSize = NULL
+)
A calendar date specifying the minimum date that a cohort index +date can appear. Date format is 'yyyymmdd'.
A calendar date specifying the maximum date that a cohort index date can appear. Date format is 'yyyymmdd'. Important: the study end date is also used to truncate risk windows, meaning no outcomes beyond the study end date will be considered.
Should only the first exposure per subject be included? Note that
+this is typically done in the createStudyPopulation
function,
+but can already be done here for efficiency reasons.
The minimum required continuous observation time prior to index
+date for a person to be included in the at risk cohort. Note that
+this is typically done in the createStudyPopulation
function,
+but can already be done here for efficiency reasons.
If not NULL, the number of people to sample from the target cohort
A setting object of class restrictPlpDataSettings
containing the extra settings to apply when calling getPlpData
Users need to specify the extra restrictions to apply when downloading the target cohort
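As a sketch (all values illustrative), the following restricts extraction to index dates in 2015-2019, keeps only the first exposure, requires 365 days of prior observation and samples at most 100,000 people:
restrictPlpDataSettings <- PatientLevelPrediction::createRestrictPlpDataSettings(
  studyStartDate = "20150101",
  studyEndDate = "20191231",
  firstExposureOnly = TRUE,
  washoutPeriod = 365,
  sampleSize = 100000
)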
R/Sampling.R
+ createSampleSettings.Rd
Create the settings for defining how the trainData from splitData
are sampled using
+default sample functions.
createSampleSettings(
+ type = "none",
+ numberOutcomestoNonOutcomes = 1,
+ sampleSeed = sample(10000, 1)
+)
(character) Choice of:
'none' No sampling is applied - this is the default
'underSample' Undersample the non-outcome class to make the data more balanced
'overSample' Oversample the outcome class by adding in each outcome multiple times
(numeric) A numeric specifying the required number of non-outcomes per outcome
(numeric) A seed to use when splitting the data for reproducibility (if not set a random number will be generated)
An object of class sampleSettings
Returns an object of class sampleSettings
that specifies the sampling function that will be called and the settings
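For instance, a minimal sketch that undersamples the non-outcome class to one non-outcome per outcome (a fixed seed is used for reproducibility; values are illustrative):
sampleSettings <- PatientLevelPrediction::createSampleSettings(
  type = "underSample",
  numberOutcomestoNonOutcomes = 1,
  sampleSeed = 42
)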
R/FeatureEngineering.R
+ createSplineSettings.Rd
Create the settings for adding a spline for continuous variables
+createSplineSettings(continousCovariateId, knots, analysisId = 683)
The covariateId to apply splines to
Either the number of knots or a vector of split values
The analysisId to use for the spline covariates
An object of class featureEngineeringSettings
Returns an object of class featureEngineeringSettings
that specifies the feature engineering function that will be called and its settings
R/FeatureEngineering.R
+ createStratifiedImputationSettings.Rd
Create the settings for stratified imputation of continuous variables
+createStratifiedImputationSettings(covariateId, ageSplits = NULL)
The covariateId that needs imputed values
A vector of age splits in years to create age groups
An object of class featureEngineeringSettings
Returns an object of class featureEngineeringSettings
that specifies how to do stratified imputation
Create a study population
+createStudyPopulation(
+ plpData,
+ outcomeId,
+ populationSettings,
+ population = NULL
+)
An object of type plpData
as generated using
+getPlpData
.
The ID of the outcome.
An object of class populationSettings created using createPopulationSettings
If specified, this population will be used as the starting point instead of the
+cohorts in the plpData
object.
A data frame specifying the study population. This data frame will have the following columns:
A unique identifier for an exposure
The person ID of the subject
The index date
The number of outcomes observed during the risk window
The number of days in the risk window
The number of days until either the outcome or the end of the risk window
Create a study population by enforcing certain inclusion and exclusion criteria, defining +a risk window, and determining which outcomes fall inside the risk window.
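A minimal sketch of building a study population; plpData is assumed to be the object returned by getPlpData() and outcome cohort id 3 is hypothetical:
population <- PatientLevelPrediction::createStudyPopulation(
  plpData = plpData,                     # assumed output of getPlpData()
  outcomeId = 3,                         # hypothetical outcome cohort id
  populationSettings = PatientLevelPrediction::createStudyPopulationSettings()
)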
+R/PopulationSettings.R
+ createStudyPopulationSettings.Rd
create the study population settings
+createStudyPopulationSettings(
+ binary = T,
+ includeAllOutcomes = T,
+ firstExposureOnly = FALSE,
+ washoutPeriod = 0,
+ removeSubjectsWithPriorOutcome = TRUE,
+ priorOutcomeLookback = 99999,
+ requireTimeAtRisk = T,
+ minTimeAtRisk = 364,
+ riskWindowStart = 1,
+ startAnchor = "cohort start",
+ riskWindowEnd = 365,
+ endAnchor = "cohort start",
+ restrictTarToCohortEnd = F
+)
Forces the outcomeCount to be 0 or 1 (use for binary prediction problems)
(binary) indicating whether to include people with outcomes who are not observed for the whole at risk period
Should only the first exposure per subject be included? Note that
+this is typically done in the createStudyPopulation
function,
The minimum required continuous observation time prior to index date for a person to be included in the cohort.
Remove subjects that have the outcome prior to the risk window start?
How many days should we look back when identifying prior outcomes?
Should subject without time at risk be removed?
The minimum number of days at risk required to be included
The start of the risk window (in days) relative to the index date (+
+days of exposure if the addExposureDaysToStart
parameter is
+specified).
The anchor point for the start of the risk window. Can be "cohort start" or "cohort end".
The end of the risk window (in days) relative to the index date (+
+days of exposure if the addExposureDaysToEnd
parameter is
+specified).
The anchor point for the end of the risk window. Can be "cohort start" or "cohort end".
If using a survival model and you want the time-at-risk to end at the cohort end date set this to T
A list containing all the settings required for creating the study population
+Collects the inputs required to create a study population into a single settings object
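For example, a sketch of settings for a one-year time-at-risk starting the day after index, removing people who have the outcome before index (values are illustrative):
populationSettings <- PatientLevelPrediction::createStudyPopulationSettings(
  removeSubjectsWithPriorOutcome = TRUE,
  requireTimeAtRisk = TRUE,
  minTimeAtRisk = 364,
  riskWindowStart = 1,
  startAnchor = "cohort start",
  riskWindowEnd = 365,
  endAnchor = "cohort start"
)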
+Create a temporary model location
+createTempModelLoc()
R/FeatureEngineering.R
+ createUnivariateFeatureSelection.Rd
Create the settings for defining any feature selection that will be done
+createUnivariateFeatureSelection(k = 100)
The number of features to keep: the k features most associated (univariately) with the outcome are retained
An object of class featureEngineeringSettings
Returns an object of class featureEngineeringSettings
that specifies the feature selection function that will be called and its settings
R/ExternalValidatePlp.R
+ createValidationDesign.Rd
createValidationDesign - Define the validation design for external validation
+createValidationDesign(
+ targetId,
+ outcomeId,
+ populationSettings,
+ restrictPlpDataSettings,
+ plpModelList,
+ recalibrate = NULL,
+ runCovariateSummary = TRUE
+)
The targetId of the target cohort to validate on
The outcomeId of the outcome cohort to validate on
A list of population restriction settings created by createPopulationSettings
A list of plpData restriction settings created by createRestrictPlpDataSettings
A list of plpModels objects created by runPlp
or a path to such objects
A vector of characters specifying the recalibration method to apply,
whether to run the covariate summary for the validation data
R/ExternalValidatePlp.R
+ createValidationSettings.Rd
This function creates the settings required by externalValidatePlp
+createValidationSettings(recalibrate = NULL, runCovariateSummary = T)
A vector of characters specifying the recalibration method to apply
Whether to run the covariate summary for the validation data
A setting object of class validationSettings
containing a list of settings for externalValidatePlp
Users need to specify whether they want to sample or recalibrate when performing external validation
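For example, a sketch that applies weak recalibration during external validation and skips the covariate summary:
validationSettings <- PatientLevelPrediction::createValidationSettings(
  recalibrate = "weakRecalibration",
  runCovariateSummary = FALSE
)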
+Run a list of predictions diagnoses
+diagnoseMultiplePlp(
+ databaseDetails = createDatabaseDetails(),
+ modelDesignList = list(createModelDesign(targetId = 1, outcomeId = 2, modelSettings =
+ setLassoLogisticRegression()), createModelDesign(targetId = 1, outcomeId = 3,
+ modelSettings = setLassoLogisticRegression())),
+ cohortDefinitions = NULL,
+ logSettings = createLogSettings(verbosity = "DEBUG", timeStamp = T, logName =
+ "diagnosePlp Log"),
+ saveDirectory = getwd()
+)
The database settings created using createDatabaseDetails()
A list of model designs created using createModelDesign()
A list of cohort definitions for the target and outcome cohorts
The settings specifying the logging for the analyses, created using createLogSettings()
Name of the folder where all the outputs will be written to.
A data frame with the following columns:
analysisId | The unique identifier +for a set of analysis choices. |
targetId | The ID of the target cohort populations. |
outcomeId | The ID of the outcomeId. |
dataLocation | The location where the plpData was saved |
the settings ids | The ids for all other settings used for model development. |
This function will run all specified prediction design diagnoses as defined in the modelDesignList.
+R/DiagnosePlp.R
+ diagnosePlp.Rd
This function runs a set of prediction diagnoses to help pick a suitable T, O, TAR and determine +whether the prediction problem is worth executing.
+diagnosePlp(
+ plpData = NULL,
+ outcomeId,
+ analysisId,
+ populationSettings,
+ splitSettings = createDefaultSplitSetting(),
+ sampleSettings = createSampleSettings(),
+ saveDirectory = NULL,
+ featureEngineeringSettings = createFeatureEngineeringSettings(),
+ modelSettings = setLassoLogisticRegression(),
+ logSettings = createLogSettings(verbosity = "DEBUG", timeStamp = T, logName =
+ "diagnosePlp Log"),
+ preprocessSettings = createPreprocessSettings()
+)
An object of type plpData
- the patient level prediction
+data extracted from the CDM. Can also include an initial population as
+plpData$population.
(integer) The ID of the outcome.
(integer) Identifier for the analysis. It is used to create, e.g., the result folder. Default is a timestamp.
An object of type populationSettings
created using createStudyPopulationSettings
that
+specifies how the data class labels are defined and any additional exclusions to apply to the
+plpData cohort
An object of type splitSettings
that specifies how to split the data into train/validation/test.
+The default settings can be created using createDefaultSplitSetting
.
An object of type sampleSettings
that specifies any under/over sampling to be done.
+The default is none.
The path to the directory where the results will be saved (if NULL uses working directory)
An object of featureEngineeringSettings
specifying any feature engineering to be learned (using the train data)
An object of class modelSettings
created using one of the functions:
setLassoLogisticRegression() A lasso logistic regression model
setGradientBoostingMachine() A gradient boosting machine
setAdaBoost() An ada boost model
setRandomForest() A random forest model
setDecisionTree() A decision tree model
setKNN() A KNN model
An object of logSettings
created using createLogSettings
+specifying how the logging is done
An object of preprocessSettings
. This setting specifies the minimum fraction of
+target population who must have a covariate for it to be included in the model training
+and whether to normalise the covariates before training
An object containing the model or the location where the model is saved, the data selection settings, the preprocessing and training settings, as well as various performance measures obtained by the model.
+list for each O of a data.frame containing: i) Time to observation end distribution, ii) Time from observation start distribution, iii) Time to event distribution and iv) Time from last prior event to index distribution (only for patients in T who have O before index)
list for each O of incidence of O in T during TAR
list for each O of Characterization of T, TnO, Tn~O
Users can define a set of Ts, Os, databases and population settings. A list of data.frames containing details such as follow-up time distribution, time-to-event information, characterization details, time from last prior event, and observation time distribution.
+if (FALSE) {
+#******** EXAMPLE 1 *********
+}
+
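To fill in the empty example above, a minimal sketch of a diagnostic run; plpData is assumed to be the object returned by getPlpData(), outcome cohort id 3 is hypothetical and the output folder is illustrative:
diagnosis <- PatientLevelPrediction::diagnosePlp(
  plpData = plpData,                     # assumed output of getPlpData()
  outcomeId = 3,                         # hypothetical outcome cohort id
  analysisId = "diagnostic1",
  populationSettings = PatientLevelPrediction::createStudyPopulationSettings(),
  saveDirectory = file.path(tempdir(), "plpDiagnostics")
)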
Calculate the diagnostic odds ratio
+diagnosticOddsRatio(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
diagnosticOddsRatio value
+Calculate the diagnostic odds ratio
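Assuming the function implements the standard definition, DOR = (TP / FN) / (FP / TN), a worked example with illustrative confusion-matrix counts:
# (80 / 20) / (100 / 900) = 4 / 0.111... = 36 under the standard definition
PatientLevelPrediction::diagnosticOddsRatio(TP = 80, TN = 900, FN = 20, FP = 100)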
+Evaluates the performance of the patient level prediction model
+evaluatePlp(prediction, typeColumn = "evaluationType")
The patient level prediction model's prediction
The column name in the prediction object that is used to +stratify the evaluation
A list containing the performance values
+The function calculates various metrics to measure the performance of the model
+R/ExternalValidatePlp.R
+ externalValidateDbPlp.Rd
This function extracts data using a user specified connection and cdm_schema, applies the model and then calculates the performance
+externalValidateDbPlp(
+ plpModel,
+ validationDatabaseDetails = createDatabaseDetails(),
+ validationRestrictPlpDataSettings = createRestrictPlpDataSettings(),
+ settings = createValidationSettings(recalibrate = "weakRecalibration"),
+ logSettings = createLogSettings(verbosity = "INFO", logName = "validatePLP"),
+ outputFolder = getwd()
+)
The model object returned by runPlp() containing the trained model
A list of objects of class databaseDetails
created using createDatabaseDetails
A list of population restriction settings created by createRestrictPlpDataSettings()
A settings object of class validationSettings
created using createValidationSettings
An object of logSettings
created using createLogSettings
+specifying how the logging is done
The directory to save the validation results to (subfolders are created per database in validationDatabaseDetails)
A list containing the performance for each validation_schema
+Users need to input a trained model (the output of runPlp()) and new database connections. The function will return a list of length equal to the +number of cdm_schemas input with the performance on the new data
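A minimal sketch of external validation; plpResult is assumed to be the output of runPlp() and validationDatabaseDetails an object created earlier with createDatabaseDetails() for the validation database:
validation <- PatientLevelPrediction::externalValidateDbPlp(
  plpModel = plpResult$model,            # assumed trained model from runPlp()
  validationDatabaseDetails = validationDatabaseDetails,
  validationRestrictPlpDataSettings = PatientLevelPrediction::createRestrictPlpDataSettings(),
  settings = PatientLevelPrediction::createValidationSettings(recalibrate = "weakRecalibration"),
  outputFolder = file.path(tempdir(), "plpValidation")
)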
+R/SaveLoadPlp.R
+ extractDatabaseToCsv.Rd
Exports all the results from a database into csv files
+extractDatabaseToCsv(
+ conn = NULL,
+ connectionDetails,
+ databaseSchemaSettings = createDatabaseSchemaSettings(resultSchema = "main"),
+ csvFolder,
+ minCellCount = 5,
+ sensitiveColumns = getPlpSensitiveColumns(),
+ fileAppend = NULL
+)
The connection to the database with the results
The connectionDetails for the result database
The result database schema settings
Location to save the csv files
The min value to show in cells that are sensitive (values less than this value will be replaced with -1)
A named list (named by the table the columns belong to) with a list of columns to apply the minCellCount to.
If set to a string this will be prefixed to the csv file names
Extracts the results from a database into a set of csv files
+Calculate the f1Score
+f1Score(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
f1Score value
+Calculate the f1Score
+Calculate the falseDiscoveryRate
+falseDiscoveryRate(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
falseDiscoveryRate value
+Calculate the falseDiscoveryRate
+Calculate the falseNegativeRate
+falseNegativeRate(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
falseNegativeRate value
+Calculate the falseNegativeRate
+Calculate the falseOmissionRate
+falseOmissionRate(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
falseOmissionRate value
+Calculate the falseOmissionRate
+Calculate the falsePositiveRate
+falsePositiveRate(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
falsePositiveRate value
+Calculate the falsePositiveRate
+Train various models using a default parameter grid search or user specified parameters
+fitPlp(trainData, modelSettings, search = "grid", analysisId, analysisPath)
An object of type TrainData
created using splitData
+data extracted from the CDM.
An object of class modelSettings
created using one of the functions:
setLassoLogisticRegression() A lasso logistic regression model
setGradientBoostingMachine() A gradient boosting machine
setRandomForest() A random forest model
setKNN() A KNN model
The search strategy for the hyper-parameter selection (currently not used)
The id of the analysis
The path of the analysis
An object of class plpModel
containing:
The trained prediction model
The preprocessing required when applying the model
The cohort data.frame with the predicted risk column added
A list specifying the modelDesign settings used to fit the model
The model meta data
The covariate importance for the model
The user can define the machine learning model to train (e.g., regularised logistic regression, random forest, gradient boosting machine or neural network)
+R/CalibrationSummary.R
+ getCalibrationSummary.Rd
Get a sparse summary of the calibration
+getCalibrationSummary(
+ prediction,
+ predictionType,
+ typeColumn = "evaluation",
+ numberOfStrata = 100,
+ truncateFraction = 0.05
+)
A prediction object as generated using the
+predict
functions.
The type of prediction (binary or survival)
A column that is used to stratify the results
The number of strata in the plot.
This fraction of probability values will be ignored when plotting, to +avoid the x-axis scale being dominated by a few outliers.
A dataframe with the calibration summary
+Generates a sparse summary showing the predicted probabilities and the observed fractions. Predictions are stratified into equally sized bins of predicted probabilities.
+R/AdditionalCovariates.R
+ getCohortCovariateData.Rd
Extracts covariates based on cohorts
+getCohortCovariateData(
+ connection,
+ oracleTempSchema = NULL,
+ cdmDatabaseSchema,
+ cdmVersion = "5",
+ cohortTable = "#cohort_person",
+ rowIdField = "row_id",
+ aggregated,
+ cohortIds,
+ covariateSettings,
+ ...
+)
The database connection
The temp schema if using oracle
The schema of the OMOP CDM data
version of the OMOP CDM data
the table name that contains the target population cohort
string representing the unique identifier in the target population cohort
whether the covariate should be aggregated
cohort id for the target cohort
settings for the covariate cohorts and time periods
additional arguments from FeatureExtraction
The covariate data (cohort-based covariates) for the target population
+The user specifies a cohort and time period, and a covariate is then constructed indicating whether each person is in that cohort during the time period relative to the target population cohort index
+R/DemographicSummary.R
+ getDemographicSummary.Rd
Get a calibration per age/gender groups
+getDemographicSummary(prediction, predictionType, typeColumn = "evaluation")
A prediction object
The type of prediction (binary or survival)
A column that is used to stratify the results
A dataframe with the calibration summary
+Generates a data.frame with the calibration per each 5 year age group and gender group
+This function executes a large set of SQL statements against the database in OMOP CDM format to +extract the data needed to perform the analysis.
+getPlpData(databaseDetails, covariateSettings, restrictPlpDataSettings)
The cdm database details created using createDatabaseDetails()
An object of type covariateSettings
as created using the
+createCovariateSettings
function in the
+FeatureExtraction
package.
Extra settings to apply to the target population while extracting data. Created using createRestrictPlpDataSettings()
.
Returns an object of type plpData
, containing information on the cohorts, their
+outcomes, and baseline covariates. Information about multiple outcomes can be captured at once for
+efficiency reasons. This object is a list with the following components:
A data frame listing the outcomes per person, including the time to event, and +the outcome id. Outcomes are not yet filtered based on risk window, since this is done at +a later stage.
A data frame listing the persons in each cohort, listing their +exposure status as well as the time to the end of the observation period and time to the end of the +cohort (usually the end of the exposure era).
An ffdf object listing the +baseline covariates per person in the two cohorts. This is done using a sparse representation: +covariates with a value of 0 are omitted to save space.
An ffdf object describing the covariates that have been extracted.
A list of objects with information on how the cohortMethodData object was +constructed.
The generic print() and summary() functions have been implemented for this object.
Based on the arguments, the at risk cohort data is retrieved, as well as outcomes
+occurring in these subjects. The at risk cohort is identified through
+user-defined cohorts in a cohort table either inside the CDM instance or in a separate schema.
+Similarly, outcomes are identified
+through user-defined cohorts in a cohort table either inside the CDM instance or in a separate
+schema. Covariates are automatically extracted from the appropriate tables within the CDM.
+If you wish to exclude concepts from covariates you will need to
+manually add the concept_ids and descendants to the excludedCovariateConceptIds
of the
+covariateSettings
argument.
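A minimal sketch of data extraction; databaseDetails is assumed to be an object created with createDatabaseDetails() pointing at your CDM and cohort tables:
plpData <- PatientLevelPrediction::getPlpData(
  databaseDetails = databaseDetails,      # assumed output of createDatabaseDetails()
  covariateSettings = FeatureExtraction::createDefaultCovariateSettings(),
  restrictPlpDataSettings = PatientLevelPrediction::createRestrictPlpDataSettings(washoutPeriod = 365)
)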
R/PredictionDistribution.R
+ getPredictionDistribution.Rd
Calculates the prediction distribution
+getPredictionDistribution(
+ prediction,
+ predictionType,
+ typeColumn = "evaluation"
+)
A prediction object
The type of prediction (binary or survival)
A column that is used to stratify the results
The 0.00, 0.1, 0.25, 0.5, 0.75, 0.9, 1.00 quantiles of the prediction, the mean and standard deviation per class
+Calculates the quantiles from a prediction object
+R/PredictionDistribution.R
+ getPredictionDistribution_binary.Rd
Calculates the prediction distribution
+getPredictionDistribution_binary(prediction, evalColumn, ...)
A prediction object
A column that is used to stratify the results
Other inputs
The 0.00, 0.1, 0.25, 0.5, 0.75, 0.9, 1.00 quantiles of the prediction, the mean and standard deviation per class
+Calculates the quantiles from a prediction object
+Calculate all measures for sparse ROC
+getThresholdSummary(prediction, predictionType, typeColumn = "evaluation")
A prediction object
The type of prediction (binary or survival)
A column that is used to stratify the results
A data.frame with all the measures
+Calculates the TP, FP, TN, FN, TPR, FPR, accuracy, PPF, FOR and Fmeasure +from a prediction object
+R/ThresholdSummary.R
+ getThresholdSummary_binary.Rd
Calculate all measures for sparse ROC when the prediction is a binary classification
+getThresholdSummary_binary(prediction, evalColumn, ...)
A prediction object
A column that is used to stratify the results
Other inputs
A data.frame with all the measures
+Calculates the TP, FP, TN, FN, TPR, FPR, accuracy, PPF, FOR and Fmeasure +from a prediction object
+R/EvaluationSummary.R
+ ici.Rd
Calculate the Integrated Calibration Information from Austin and Steyerberg +https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8281
+ici(prediction)
the prediction object found in the plpResult object
Integrated Calibration Information
+Calculate the Integrated Calibration Information
+
+ Extracting data from the OMOP CDM database+Functions for getting the necessary data from the database in Common Data Model and saving/loading. + |
+ |
---|---|
+ + | +Create a setting that holds the details about the cdmDatabase connection for data extraction |
+
+ + | +createRestrictPlpDataSettings define extra restriction settings when calling getPlpData |
+
+ + | +Get the patient level prediction data from the server |
+
+ + | +Save the cohort data to folder |
+
+ + | +Load the cohort data from a folder |
+
+ + | +Extracts covariates based on cohorts |
+
+ Settings for designing a prediction model+Design settings required when developing a model. + |
+ |
+ + | +create the study population settings |
+
+ + | +Create the settings for defining how the plpData are split into test/validation/train sets using +default splitting functions (either random stratified by outcome, time or subject splitting) |
+
+ + | +Create the settings for defining how the trainData from |
+
+ + | +Create the settings for defining any feature engineering that will be done |
+
+ + | +Create the settings for preprocessing the trainData. |
+
+ Optional design settings+Settings for optional steps that can be used in the PLP pipeline + |
+ |
+ + | +Extracts covariates based on cohorts |
+
+Create the settings for random forest based feature selection |
+
+ + | +Create the settings for defining any feature selection that will be done |
+
+ + | +Create the settings for adding a spline for continuous variables |
+
+Create the settings for stratified imputation of continuous variables |
+
+ External validation+ + |
+ |
+ + | +createValidationDesign - Define the validation design for external validation |
+
+ + | +externalValidatePlp - Validate model performance on new data |
+
+ + | +createValidationSettings define optional settings for performing external validation |
+
+ + | +recalibratePlp |
+
+ + | +recalibratePlpRefit |
+
+ Execution settings when developing a model+Execution settings required when developing a model. + |
+ |
+ + | +Create the settings for logging the progression of the analysis |
+
+ + | +Creates list of settings specifying what parts of runPlp to execute |
+
+ + | +Creates default list of settings specifying what parts of runPlp to execute |
+
+ Binary Classification Models+Functions for setting binary classifiers and their hyper-parameter search. + |
+ |
+ + | +Create setting for AdaBoost with python DecisionTreeClassifier base estimator |
+
+ + | +Create setting for the scikit-learn 1.0.1 DecisionTree with python |
+
+ + | +Create setting for gradient boosting machine model using gbm_xgboost implementation |
+
+ + | +Create setting for knn model |
+
+ + | +Create setting for lasso logistic regression |
+
+ + | +Create setting for neural network model with python |
+
+ + | +Create setting for naive bayes model with python |
+
+ + | +Create setting for random forest model with python (very fast) |
+
+ + | +Create setting for the python sklearn SVM (SVC function) |
+
+ + | +Create setting for lasso logistic regression |
+
+ + | +Create setting for gradient boosting machine model using lightGBM (https://github.com/microsoft/LightGBM/tree/master/R-package). |
+
+ Survival Models+Functions for setting survival models and their hyper-parameter search. + |
+ |
+ + | +Create setting for lasso Cox model |
+
+ Single Patient-Level Prediction Model+Functions for training/evaluating/applying a single patient-level-prediction model + |
+ |
+ + | +runPlp - Develop and internally evaluate a model using specified settings |
+
+ + | +externalValidateDbPlp - Validate a model on new databases |
+
+ + | +Saves the plp model |
+
+ + | +loads the plp model |
+
+ + | +Saves the result from runPlp into the location directory |
+
+Loads the evaluation dataframe |
+
+ + | +diagnostic - Investigates the prediction problem settings - use before training a model |
+
+ Multiple Patient-Level Prediction Models+Functions for training multiple patient-level prediction models in an efficient way. + |
+ |
+Specify settings for developing a single model |
+
+ + | +Run a list of predictions analyses |
+
+ + | +externally validate the multiple plp models across new datasets |
+
+ + | +Save the modelDesignList to a json file |
+
+ + | +Load the multiple prediction json settings from a file |
+
+ + | +Run a list of predictions diagnoses |
+
+ Individual pipeline functions+Functions for running parts of the PLP workflow + |
+ |
+ + | +Create a study population |
+
+ + | +Split the plpData into test/train sets using a splitting settings of class |
+
+ + | +A function that wraps around FeatureExtraction::tidyCovariateData to normalise the data +and remove rare or redundant features |
+
+ + | +fitPlp |
+
+ + | +predictPlp |
+
+ + | +evaluatePlp |
+
+ + | +covariateSummary |
+
+ Saving results into database+Functions for saving the prediction model and performances into a database. + |
+ |
+ + | +Create sqlite database with the results |
+
+ + | +Create the results tables to store PatientLevelPrediction models and results into a database |
+
+ + | +Populate the PatientLevelPrediction results tables |
+
+ + | +Function to add the run plp (development or validation) to database |
+
+ + | +Create the PatientLevelPrediction database result schema settings |
+
+ + | +Create a list with the database details and database meta data entries |
+
+ + | +Insert a diagnostic result into a PLP result schema database |
+
+Insert multiple diagnosePlp results saved to a directory into a PLP result schema database |
+
+ + | +Exports all the results from a database into csv files |
+
+ + | +Function to insert results into a database from csvs |
+
+ + | +Insert a model design into a PLP result schema database |
+
+ + | +Migrate Data model |
+
+ Shiny Viewers+Functions for viewing results via a shiny app + |
+ |
+ + | +viewPlp - Interactively view the performance and model settings |
+
+ + | +open a local shiny app for viewing the result of a multiple PLP analyses |
+
+ + | +open a local shiny app for viewing the result of a PLP analyses from a database |
+
+ Plotting+Functions for various performance plots + |
+ |
+ + | +Plot all the PatientLevelPrediction plots |
+
+ + | +Plot the ROC curve using the sparse thresholdSummary data frame |
+
+Plot the smooth calibration as detailed in Calster et al. "A calibration hierarchy for risk models
+
+ + | +Plot the calibration |
+
+ + | +Plot the conventional calibration |
+
+ + | +Plot the Observed vs. expected incidence, by age and gender |
+
+ + | +Plot the F1 measure efficiency frontier using the sparse thresholdSummary data frame |
+
+ + | +Plot the train/test generalizability diagnostic |
+
+ + | +Plot the precision-recall curve using the sparse thresholdSummary data frame |
+
+ + | +Plot the Predicted probability density function, showing prediction overlap between true and false cases |
+
+Plot the preference score probability density function, showing prediction overlap between true and false cases |
+
+Plot the side-by-side boxplots of prediction distribution, by class |
+
+ + | +Plot the variable importance scatterplot |
+
+ + | +Plot the outcome incidence over time |
+
+ Learning Curves+Functions for creating and plotting learning curves + |
+ |
+ + | +createLearningCurve |
+
+ + | +plotLearningCurve |
+
+ Simulation+Functions for simulating cohort method data objects. + |
+ |
+ + | +Generate simulated data |
+
+ + | +A simulation profile |
+
+ Data manipulation functions+Functions for manipulating data + |
+ |
+ + | +Convert the plpData in COO format into a sparse R matrix |
+
+ + | +Map covariate and row Ids so they start from 1 |
+
+ Helper/utility functions+ + |
+ |
+ + | +join two lists |
+
+ + | +Cartesian product |
+
+ + | +Create a temporary model location |
+
+ + | +Sets up a virtual environment to use for PLP (can be conda or python) |
+
+ + | +Use the virtual environment created using configurePython() |
+
+ Evaluation measures+ + |
+ |
+ + | +Calculate the accuracy |
+
+ + | +Calculate the average precision |
+
+ + | +brierScore |
+
+ + | +calibrationLine |
+
+ + | +Compute the area under the ROC curve |
+
+ + | +Calculate the f1Score |
+
+ + | +Calculate the falseDiscoveryRate |
+
+ + | +Calculate the falseNegativeRate |
+
+ + | +Calculate the falseOmissionRate |
+
+ + | +Calculate the falsePositiveRate |
+
+ + | +Calculate the Integrated Calibration Information from Austin and Steyerberg +https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8281 |
+
+ + | +Calculate the model-based concordance, which is a calculation of the expected discrimination performance of a model under the assumption the model predicts the "TRUE" outcome +as detailed in van Klaveren et al. https://pubmed.ncbi.nlm.nih.gov/27251001/ |
+
+ + | +Calculate the negativeLikelihoodRatio |
+
+ + | +Calculate the negativePredictiveValue |
+
+ + | +Calculate the positiveLikelihoodRatio |
+
+ + | +Calculate the positivePredictiveValue |
+
+ + | +Calculate the sensitivity |
+
+ + | +Calculate the specificity |
+
+ + | +Computes grid performance with a specified performance function |
+
+ + | +Calculate the diagnostic odds ratio |
+
+ + | +Get a sparse summary of the calibration |
+
+ + | +Get a calibration per age/gender groups |
+
+ + | +Calculate all measures for sparse ROC |
+
+Calculate all measures for sparse ROC when the prediction is a binary classification |
+
+ + | +Calculates the prediction distribution |
+
+ + | +Calculates the prediction distribution |
+
+ Saving/loading models as json+Functions for saving or loading models as json + |
+ |
+ + | +Loads sklearn python model from json |
+
+ + | +Saves sklearn python model object to json in path |
+
+ Load/save for sharing+Functions for loading/saving objects for sharing + |
+ |
+ + | +Save the plp result as json files and csv files for transparent sharing |
+
+ + | +Loads the plp result saved as json/csv files for transparent sharing |
+
+Loads the prediction dataframe |
+
+ + | +Saves the prediction dataframe to RDS |
+
+ Feature importance+ + |
+ |
+ + | +pfi |
+
+ Other functions+ + |
+ |
+ + | +Create predictive probabilities |
+
R/ImportFromCsv.R
+ insertCsvToDatabase.Rd
This function converts a folder with csv results into plp objects and loads them into a plp result database
+insertCsvToDatabase(
+ csvFolder,
+ connectionDetails,
+ databaseSchemaSettings,
+ modelSaveLocation,
+ csvTableAppend = ""
+)
The location to the csv folder with the plp results
A connection details for the plp results database that the csv results will be inserted into
A object created by createDatabaseSchemaSettings
with all the settings specifying the result tables to insert the csv results into
The location to save any models from the csv folder - this should be the same location you picked when inserting other models into the database
A string that appends the csv file names
Returns a data.frame indicating whether the results were imported into the database
+The user needs to have plp csv results in a single folder and an existing plp result database
+R/uploadToDatabaseModelDesign.R
+ insertModelDesignInDatabase.Rd
This function inserts a model design and all the settings into the result schema
+insertModelDesignInDatabase(
+ object,
+ conn,
+ databaseSchemaSettings,
+ cohortDefinitions
+)
An object of class modelDesign, runPlp or externalValidatePlp
A connection to a database created by using the
+function connect
in the
+DatabaseConnector
package.
A object created by createDatabaseSchemaSettings
with all the settings specifying the result tables
A set of one or more cohorts extracted using ROhdsiWebApi::exportCohortDefinitionSet()
Returns NULL but uploads the model design into the database schema specified in databaseSchemaSettings
+This function can be used to upload a model design into a database
+R/uploadToDatabase.R
+ insertResultsToSqlite.Rd
This function creates an sqlite database with the PLP result schema and inserts all results
+insertResultsToSqlite(
+ resultLocation,
+ cohortDefinitions,
+ databaseList = NULL,
+ sqliteLocation = file.path(resultLocation, "sqlite")
+)
(string) location of directory where the main package results were saved
A set of one or more cohorts extracted using ROhdsiWebApi::exportCohortDefinitionSet()
A list created by createDatabaseList
to specify the databases
(string) location of directory where the sqlite database will be saved
Returns the location of the sqlite database file
+This function can be used to upload PatientLevelPrediction results into an sqlite database
+join two lists
+listAppend(a, b)
A list
Another list
This function joins two lists
+Computes the Cartesian product of all the combinations of elements in a list
+listCartesian(allList)
a list of lists
A list with all possible combinations from the input list of lists
+R/RunMultiplePlp.R
+ loadPlpAnalysesJson.Rd
Load the multiple prediction json settings from a file
+loadPlpAnalysesJson(jsonFileLocation)
The location of the file 'predictionAnalysisList.json' with the modelDesignList
This function interprets a json with the multiple prediction settings and creates a list +that can be combined with connection settings to run a multiple prediction study
+if (FALSE) {
+modelDesignList <- loadPlpAnalysesJson('location of json settings')$analysis
+}
+
+
loadPlpData
loads an object of type plpData from a folder in the file
+system.
loadPlpData(file, readOnly = TRUE)
The name of the folder containing the data.
If true, the data is opened read only.
An object of class plpData.
+The data are read from the set of files in the folder specified by the user.
+# todo
+
+
+
loads the plp model
+loadPlpModel(dirPath)
The location of the model
Loads a plp model that was saved using savePlpModel()
Loads the evaluation dataframe
+loadPlpResult(dirPath)
The directory where the evaluation was saved
Loads the evaluation
+R/SaveLoadPlp.R
+ loadPlpShareable.Rd
Loads the plp result saved as json/csv files for transparent sharing
+loadPlpShareable(loadDirectory)
The directory with the results as json/csv files
Load the main results from json/csv files into a runPlp object
+Loads the prediction dataframe
+loadPrediction(fileLocation)
The location with the saved prediction
Loads the prediction RDS file
+Migrate data from current state to next state
+It is strongly advised that you have a backup of all data (either sqlite files, a backup database in the case you are using a postgres backend, or the csv/zip files kept from your data generation) before migrating.
+migrateDataModel(connectionDetails, databaseSchema, tablePrefix = "")
DatabaseConnector connection details object
String schema where database schema lives
(Optional) Use if a table prefix is used before table names (e.g. "cd_")
R/EvaluatePlp.R
+ modelBasedConcordance.Rd
Calculate the model-based concordance, which is a calculation of the expected discrimination performance of a model under the assumption the model predicts the "TRUE" outcome +as detailed in van Klaveren et al. https://pubmed.ncbi.nlm.nih.gov/27251001/
+modelBasedConcordance(prediction)
the prediction object found in the plpResult object
model-based concordance value
+Calculate the model-based concordance
+R/ThresholdSummary.R
+ negativeLikelihoodRatio.Rd
Calculate the negativeLikelihoodRatio
+negativeLikelihoodRatio(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
negativeLikelihoodRatio value
+Calculate the negativeLikelihoodRatio
+R/ThresholdSummary.R
+ negativePredictiveValue.Rd
Calculate the negativePredictiveValue
+negativePredictiveValue(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
negativePredictiveValue value
+Calculate the negativePredictiveValue
+Plot the outcome incidence over time
+outcomeSurvivalPlot(
+ plpData,
+ outcomeId,
+ populationSettings = createStudyPopulationSettings(binary = T, includeAllOutcomes = T,
+ firstExposureOnly = FALSE, washoutPeriod = 0, removeSubjectsWithPriorOutcome = TRUE,
+ priorOutcomeLookback = 99999, requireTimeAtRisk = F, riskWindowStart = 1, startAnchor
+ = "cohort start", riskWindowEnd = 3650, endAnchor = "cohort start"),
+ riskTable = T,
+ confInt = T,
+ yLabel = "Fraction of those who are outcome free in target population"
+)
The plpData object returned by running getPlpData()
The cohort id corresponding to the outcome
The population settings created using createStudyPopulationSettings
(binary) Whether to include a table at the bottom of the plot showing the number of people at risk over time
(binary) Whether to include a confidence interval
(string) The label for the y-axis
TRUE if it ran
+This creates a survival plot that can be used to pick a suitable time-at-risk period
+Calculate the permutation feature importance for a PLP model.
+pfi(
+ plpResult,
+ population,
+ plpData,
+ repeats = 1,
+ covariates = NULL,
+ cores = NULL,
+ log = NULL,
+ logthreshold = "INFO"
+)
An object of type runPlp
The population created using createStudyPopulation() who will have their risks predicted
An object of type plpData
- the patient level prediction
+data extracted from the CDM.
The number of times to permute each covariate
A vector of covariates to calculate the pfi for. If NULL it uses all covariates included in the model.
Number of cores to use when running this (it runs in parallel)
A location to save the log for running pfi
The log threshold (e.g., INFO, TRACE, ...)
A dataframe with the covariateIds and the pfi (change in AUC caused by permuting the covariate) value
+The function permutes each covariate/feature <repeats> times and calculates the mean AUC change caused by the permutation.
+R/Plotting.R
+ plotDemographicSummary.Rd
Plot the Observed vs. expected incidence, by age and gender
+plotDemographicSummary(
+ plpResult,
+ typeColumn = "evaluation",
+ saveLocation = NULL,
+ fileName = "roc.png"
+)
A plp result object as generated using the runPlp
function.
The name of the column specifying the evaluation type
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the Observed vs. expected incidence, by age and gender
+R/Plotting.R
+ plotF1Measure.Rd
Plot the F1 measure efficiency frontier using the sparse thresholdSummary data frame
+plotF1Measure(
+ plpResult,
+ typeColumn = "evaluation",
+ saveLocation = NULL,
+ fileName = "roc.png"
+)
A plp result object as generated using the runPlp
function.
The name of the column specifying the evaluation type
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the F1 measure efficiency frontier using the sparse thresholdSummary data frame
+R/Plotting.R
+ plotGeneralizability.Rd
Plot the train/test generalizability diagnostic
+plotGeneralizability(
+ covariateSummary,
+ saveLocation = NULL,
+ fileName = "Generalizability.png"
+)
A prediction object as generated using the
+runPlp
function.
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the train/test generalizability diagnostic
+Create a plot of the learning curve using the object returned
+from createLearningCurve
.
plotLearningCurve(
+ learningCurve,
+ metric = "AUROC",
+ abscissa = "events",
+ plotTitle = "Learning Curve",
+ plotSubtitle = NULL,
+ fileName = NULL
+)
An object returned by createLearningCurve
+function.
Specifies the metric to be plotted:
'AUROC'
- use the area under the Receiver Operating
+ Characteristic curve
'AUPRC'
- use the area under the Precision-Recall curve
'sBrier'
- use the scaled Brier score
Specify the abscissa metric to be plotted:
'events'
- use number of events
'observations'
- use number of observations
Title of the learning curve plot.
Subtitle of the learning curve plot.
Filename of plot to be saved, for example 'plot.png'
.
+See the function ggsave
in the ggplot2 package for supported file
+formats.
A ggplot object. Use the ggsave
function to save to
+file in a different format.
if (FALSE) {
+# create learning curve object
+learningCurve <- createLearningCurve(population,
+ plpData,
+ modelSettings)
+# plot the learning curve
+plotLearningCurve(learningCurve)
+}
+
+
Plot all the PatientLevelPrediction plots
+plotPlp(plpResult, saveLocation = NULL, typeColumn = "evaluation")
Object returned by the runPlp() function
Name of the directory where the plots should be saved (NULL means no saving)
The name of the column specifying the evaluation type +(to stratify the plots)
TRUE if it ran
+Create a directory with all the plots
+R/Plotting.R
+ plotPrecisionRecall.Rd
Plot the precision-recall curve using the sparse thresholdSummary data frame
+plotPrecisionRecall(
+ plpResult,
+ typeColumn = "evaluation",
+ saveLocation = NULL,
+ fileName = "roc.png"
+)
A plp result object as generated using the runPlp
function.
The name of the column specifying the evaluation type
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the precision-recall curve using the sparse thresholdSummary data frame
+R/Plotting.R
+ plotPredictedPDF.Rd
Plot the Predicted probability density function, showing prediction overlap between true and false cases
+plotPredictedPDF(
+ plpResult,
+ typeColumn = "evaluation",
+ saveLocation = NULL,
+ fileName = "PredictedPDF.png"
+)
A plp result object as generated using the runPlp
function.
The name of the column specifying the evaluation type
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the predicted probability density function, showing prediction overlap between true and false cases
+R/Plotting.R
+ plotPredictionDistribution.Rd
Plot the side-by-side boxplots of prediction distribution, by class
+plotPredictionDistribution(
+ plpResult,
+ typeColumn = "evaluation",
+ saveLocation = NULL,
+ fileName = "PredictionDistribution.png"
+)
A plp result object as generated using the runPlp
function.
The name of the column specifying the evaluation type
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the side-by-side boxplots of prediction distribution, by class
+R/Plotting.R
+ plotPreferencePDF.Rd
Plot the preference score probability density function, showing prediction overlap between true and false cases
+plotPreferencePDF(
+ plpResult,
+ typeColumn = "evaluation",
+ saveLocation = NULL,
+ fileName = "plotPreferencePDF.png"
+)
A plp result object as generated using the runPlp
function.
The name of the column specifying the evaluation type
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the preference score probability density function, showing prediction overlap between true and false cases
+R/Plotting.R
+ plotSmoothCalibration.Rd
Plot the smooth calibration as detailed in Calster et al. "A calibration hierarchy for risk models +was defined: from utopia to empirical data" (2016)
+plotSmoothCalibration(
+ plpResult,
+ smooth = "loess",
+ span = 0.75,
+ nKnots = 5,
+ scatter = FALSE,
+ bins = 20,
+ sample = TRUE,
+ typeColumn = "evaluation",
+ saveLocation = NULL,
+ fileName = "smoothCalibration.pdf"
+)
The result of running runPlp
function. An object containing the
+model or the location where the model is saved, the data selection settings, the
+preprocessing and training settings as well as various performance measures
+obtained by the model.
options: 'loess' or 'rcs'
This specifies the width of span used for loess. This will allow for faster +computing and lower memory usage.
The number of knots to be used by the rcs evaluation. Default is 5
Plot the decile calibrations as points on the graph. Default is FALSE
The number of bins for the histogram. Default is 20.
If using loess then by default 20,000 patients will be sampled to save time
The name of the column specifying the evaluation type
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object.
Create a plot showing the smoothed calibration
+Plot the calibration
+plotSparseCalibration(
+ plpResult,
+ typeColumn = "evaluation",
+ saveLocation = NULL,
+ fileName = "roc.png"
+)
A plp result object as generated using the runPlp
function.
The name of the column specifying the evaluation type
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the calibration
+Plot the conventional calibration
+plotSparseCalibration2(
+ plpResult,
+ typeColumn = "evaluation",
+ saveLocation = NULL,
+ fileName = "roc.png"
+)
A plp result object as generated using the runPlp
function.
The name of the column specifying the evaluation type
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the calibration
+R/Plotting.R
+ plotSparseRoc.Rd
Plot the ROC curve using the sparse thresholdSummary data frame
+plotSparseRoc(
+ plpResult,
+ typeColumn = "evaluation",
+ saveLocation = NULL,
+ fileName = "roc.png"
+)
A plp result object as generated using the runPlp
function.
The name of the column specifying the evaluation type
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the Receiver Operator Characteristics (ROC) curve.
+Plot the variable importance scatterplot
+plotVariableScatterplot(
+ covariateSummary,
+ saveLocation = NULL,
+ fileName = "VariableScatterplot.png"
+)
A prediction object as generated using the
+runPlp
function.
Directory to save plot (if NULL plot is not saved)
Name of the file to save to plot, for example
+'plot.png'. See the function ggsave
in the ggplot2 package for
+supported file formats.
A ggplot object. Use the ggsave
function to save to file in a different
+format.
Create a plot showing the variable importance scatterplot
+A simulation profile
+data(plpDataSimulationProfile)
A data frame containing the following elements:
prevalence of all covariates
regression model parameters to simulate outcomes
settings used to simulate the profile
covariateIds and covariateNames
time window
prevalence of exclusion of covariates
R/ThresholdSummary.R
+ positiveLikelihoodRatio.Rd
Calculate the positiveLikelihoodRatio
+positiveLikelihoodRatio(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
positiveLikelihoodRatio value
+Calculate the positiveLikelihoodRatio
+R/ThresholdSummary.R
+ positivePredictiveValue.Rd
Calculate the positivePredictiveValue
+positivePredictiveValue(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
positivePredictiveValue value
+Calculate the positivePredictiveValue
+Create predictive probabilities
+predictCyclops(plpModel, data, cohort)
An object of type predictiveModel
as generated using
+fitPlp
.
The new plpData containing the covariateData for the new population
The cohort to calculate the prediction for
The value column in the result data.frame is: logistic: probabilities of the outcome, poisson: Poisson rate (per day) of the outcome, survival: hazard rate (per day) of the outcome.
+Generates predictions for the population specified in plpData given the model.
+Predict the risk of the outcome using the input plpModel for the input plpData
+predictPlp(plpModel, plpData, population, timepoint)
An object of type plpModel
- a patient level prediction model
An object of type plpData
- the patient level prediction
+data extracted from the CDM.
The population created using createStudyPopulation() who will have their risks predicted or a cohort without the outcome known
The timepoint to predict risk (survival models only)
A dataframe containing the prediction for each person in the population with an attribute metaData containing prediction details.
+The function applies the trained model to the plpData to make predictions
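A minimal sketch of applying a trained model; plpResult, plpData and population are assumed to be the outputs of runPlp(), getPlpData() and createStudyPopulation() respectively:
prediction <- PatientLevelPrediction::predictPlp(
  plpModel = plpResult$model,
  plpData = plpData,
  population = population
)
head(prediction)   # one row per person, with the predicted risk in the value column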
+R/PreprocessingData.R
+ preprocessData.Rd
A function that wraps around FeatureExtraction::tidyCovariateData to normalise the data +and remove rare or redundant features
+preprocessData(covariateData, preprocessSettings)
The covariate part of the training data created by splitData
after being sampled and having
+any required feature engineering
The settings for the preprocessing created by createPreprocessSettings
The data processed
+Returns an object of class covariateData
that has been processed
Recalibrate a trained model's predictions using the specified recalibration method
+recalibratePlp(
+ prediction,
+ analysisId,
+ typeColumn = "evaluationType",
+ method = c("recalibrationInTheLarge", "weakRecalibration")
+)
A prediction dataframe
The model analysisId
The column name where the strata types are specified
Method used to recalibrate ('recalibrationInTheLarge' or 'weakRecalibration' )
An object of class runPlp
that is recalibrated on the new data
+Recalibrates the model predictions on the new data using either recalibration in the large or weak recalibration
+Recalibrate a model by refitting it to the new population and data
+recalibratePlpRefit(plpModel, newPopulation, newData)
The trained plpModel (runPlp$model)
The population created using createStudyPopulation() who will have their risks predicted
An object of type plpData
- the patient level prediction
+data extracted from the CDM.
An object of class runPlp
that is recalibrated on the new data
+Recalibrates the model by refitting it using the new population and plpData
+Run a list of predictions analyses
+runMultiplePlp(
+ databaseDetails = createDatabaseDetails(),
+ modelDesignList = list(createModelDesign(targetId = 1, outcomeId = 2, modelSettings =
+ setLassoLogisticRegression()), createModelDesign(targetId = 1, outcomeId = 3,
+ modelSettings = setLassoLogisticRegression())),
+ onlyFetchData = F,
+ cohortDefinitions = NULL,
+ logSettings = createLogSettings(verbosity = "DEBUG", timeStamp = T, logName =
+ "runPlp Log"),
+ saveDirectory = getwd(),
+ sqliteLocation = file.path(saveDirectory, "sqlite")
+)
The database settings created using createDatabaseDetails()
A list of model designs created using createModelDesign()
Only fetches and saves the data object to the output folder without running the analysis.
A list of cohort definitions for the target and outcome cohorts
The setting specifying the logging for the analyses created using createLogSettings()
Name of the folder where all the outputs will be written to.
(optional) The location of the sqlite database with the results
A data frame with the following columns:
analysisId | The unique identifier +for a set of analysis choices. |
targetId | The ID of the target cohort populations. |
outcomeId | The ID of the outcomeId. |
dataLocation | The location where the plpData was saved |
the settings ids | The ids for all other settings used for model development. |
This function will run all specified predictions as defined in the modelDesignList.
+R/RunPlp.R
+ runPlp.Rd
This provides a general framework for training patient level prediction models. The user can select various default feature selection methods or incorporate their own, and can also select from a range of default classifiers or incorporate their own. There are three types of evaluation splits for the model: patient (randomly splits people into train/validation sets), year (randomly splits data into train/validation sets based on index year - older in training, newer in validation) or both (same as year splitting but checks there are no overlaps in patients between the training and validation sets - any overlaps are removed from the validation set)
+runPlp(
+ plpData,
+ outcomeId = plpData$metaData$call$outcomeIds[1],
+ analysisId = paste(Sys.Date(), plpData$metaData$call$outcomeIds[1], sep = "-"),
+ analysisName = "Study details",
+ populationSettings = createStudyPopulationSettings(),
+ splitSettings = createDefaultSplitSetting(type = "stratified", testFraction = 0.25,
+ trainFraction = 0.75, splitSeed = 123, nfold = 3),
+ sampleSettings = createSampleSettings(type = "none"),
+ featureEngineeringSettings = createFeatureEngineeringSettings(type = "none"),
+ preprocessSettings = createPreprocessSettings(minFraction = 0.001, normalize = T),
+ modelSettings = setLassoLogisticRegression(),
+ logSettings = createLogSettings(verbosity = "DEBUG", timeStamp = T, logName =
+ "runPlp Log"),
+ executeSettings = createDefaultExecuteSettings(),
+ saveDirectory = getwd()
+)
An object of type plpData
- the patient level prediction
+data extracted from the CDM. Can also include an initial population as
+plpData$population.
(integer) The ID of the outcome.
(integer) Identifier for the analysis. It is used to create, e.g., the result folder. Default is a timestamp.
(character) Name for the analysis
An object of type populationSettings
created using createStudyPopulationSettings
that
+specifies how the data class labels are defined and, in addition, any exclusions to apply to the
+plpData cohort
An object of type splitSettings
that specifies how to split the data into train/validation/test.
+The default settings can be created using createDefaultSplitSetting
.
An object of type sampleSettings
that specifies any under/over sampling to be done.
+The default is none.
An object of featureEngineeringSettings
specifying any feature engineering to be learned (using the train data)
An object of preprocessSettings
. This setting specifies the minimum fraction of
+target population who must have a covariate for it to be included in the model training
+and whether to normalise the covariates before training
An object of class modelSettings
created using one of the functions:
setLassoLogisticRegression() A lasso logistic regression model
setGradientBoostingMachine() A gradient boosting machine
setAdaBoost() An ada boost model
setRandomForest() A random forest model
setDecisionTree() A decision tree model
setKNN() A KNN model
An object of logSettings
created using createLogSettings
+specifying how the logging is done
An object of executeSettings
specifying which parts of the analysis to run
The path to the directory where the results will be saved (if NULL uses working directory)
An object containing the following:
model The developed model of class plpModel
executionSummary A list containing the hardware details, R package details and execution time
performanceEvaluation Various internal performance metrics in sparse format
prediction The plpData cohort table with the predicted risks added as a column (named value)
covariateSummary A characterization of the features for patients with and without the outcome during the time at risk
analysisRef A list with details about the analysis
This function takes as input the plpData extracted from an OMOP CDM database and follows the specified settings to develop and internally validate a model for the specified outcomeId.
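A minimal sketch of a development run, assuming plpData has already been extracted with getPlpData(); IDs and paths are placeholders and most settings are left at the defaults shown above:
if (FALSE) {
plpResult <- runPlp(
  plpData = plpData,
  outcomeId = 2,
  analysisId = "Analysis_1",
  modelSettings = setLassoLogisticRegression(),
  saveDirectory = "./plpResults"
)
}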
+Save the modelDesignList to a json file
+savePlpAnalysesJson(
+ modelDesignList = list(createModelDesign(targetId = 1, outcomeId = 2, modelSettings =
+ setLassoLogisticRegression()), createModelDesign(targetId = 1, outcomeId = 3,
+ modelSettings = setLassoLogisticRegression())),
+ cohortDefinitions = NULL,
+ saveDirectory = NULL
+)
A list of modelDesigns created using createModelDesign()
A list of the cohortDefinitions (generally extracted from ATLAS)
The directory to save the modelDesignList settings
This function creates a json file with the modelDesignList saved
+if (FALSE) {
+savePlpAnalysesJson(
+modelDesignList = list(
+createModelDesign(targetId = 1, outcomeId = 2, modelSettings = setLassoLogisticRegression()),
+createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression())
+),
+saveDirectory = 'C:/bestModels'
+)
+}
+
+
savePlpData
saves an object of type plpData to a folder.
savePlpData(plpData, file, envir = NULL, overwrite = F)
An object of type plpData
as generated using
+getPlpData
.
The name of the folder where the data will be written. The folder should not yet exist.
The environment in which to evaluate variables when saving
Whether to force overwrite an existing file
The data will be written to a set of files in the folder specified by the user.
+if (FALSE) {
+# illustrative sketch: plpData as returned by getPlpData(); the folder must not yet exist
+savePlpData(plpData, file = "./plpData")
+}
+
Saves the plp model
+savePlpModel(plpModel, dirPath)
A trained classifier returned by running runPlp()$model
A location to save the model to
Saves the plp model to a user specified folder
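For example (the plpResult object and path are illustrative):
if (FALSE) {
savePlpModel(plpResult$model, dirPath = "./savedModel")
}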
+R/SaveLoadPlp.R
+ savePlpResult.Rd
Saves the result from runPlp into the location directory
+savePlpResult(result, dirPath)
The result of running runPlp()
The directory to save the csv
Saves the result from runPlp into the location directory
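For example (the plpResult object and path are illustrative):
if (FALSE) {
savePlpResult(plpResult, dirPath = "./plpResult")
}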
+R/SaveLoadPlp.R
+ savePlpShareable.Rd
Save the plp result as json files and csv files for transparent sharing
+savePlpShareable(result, saveDirectory, minCellCount = 10)
An object of class runPlp with development or validation results
The directory to save the results as csv files
Minimum cell count for the covariateSummary and certain evaluation results
Saves the main results json/csv files (these files can be read by the shiny app)
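For example (object and path are illustrative):
if (FALSE) {
savePlpShareable(plpResult, saveDirectory = "./shareableResult", minCellCount = 10)
}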
+Saves the prediction dataframe to RDS
+savePrediction(prediction, dirPath, fileName = "prediction.rds")
The prediction data.frame
The directory to save the prediction RDS
The name of the RDS file that will be saved in dirPath
Saves the prediction data frame returned by predict.R to an RDS file and returns the fileLocation where the prediction is saved
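For example (the plpResult object and path are illustrative):
if (FALSE) {
savePrediction(plpResult$prediction, dirPath = "./output", fileName = "prediction.rds")
}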
+Calculate the sensitivity
+sensitivity(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
sensitivity value
+Calculate the sensitivity
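For example, with a small confusion matrix:
sensitivity(TP = 80, TN = 900, FN = 20, FP = 100) # sensitivity = TP / (TP + FN) = 0.8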
+R/SklearnClassifierSettings.R
+ setAdaBoost.Rd
Create setting for AdaBoost with python DecisionTreeClassifier base estimator
+(list) The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.
(list) Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier. There is a trade-off between the learningRate and nEstimators parameters.
(list) If ‘SAMME.R’ then use the SAMME.R real boosting algorithm. base_estimator must support calculation of class probabilities. If ‘SAMME’ then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.
A seed for the model
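A hedged sketch; the argument names (nEstimators, learningRate, seed) are inferred from the parameter descriptions above and the hyper-parameter values are illustrative:
if (FALSE) {
model.ada <- setAdaBoost(nEstimators = list(50), learningRate = list(1), seed = 42)
}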
Create setting for lasso Cox model
+Numeric: prior distribution starting variance
An option to add a seed when training the model
a set of covariate IDs to limit the analysis to
a set of covariates which are to be forced to be included in the final model. The default is the intercept
An option to set number of threads when training model
Numeric: Upper prior variance limit for grid-search
Numeric: Lower prior variance limit for grid-search
Numeric: maximum relative change in convergence criterion from successive iterations to achieve convergence
Integer: maximum iterations of Cyclops to attempt before returning a failed-to-converge error
model.lr <- setCoxModel()
+
R/SklearnClassifierSettings.R
+ setDecisionTree.Rd
Create setting for the scikit-learn 1.0.1 DecisionTree with python
+setDecisionTree(
+ criterion = list("gini"),
+ splitter = list("best"),
+ maxDepth = list(as.integer(4), as.integer(10), NULL),
+ minSamplesSplit = list(2, 10),
+ minSamplesLeaf = list(10, 50),
+ minWeightFractionLeaf = list(0),
+ maxFeatures = list(100, "sqrt", NULL),
+ maxLeafNodes = list(NULL),
+ minImpurityDecrease = list(10^-7),
+ classWeight = list(NULL),
+ seed = sample(1e+06, 1)
+)
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
(list) The maximum depth of the tree. If NULL, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
The minimum number of samples required to split an internal node
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least minSamplesLeaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sampleWeight is not provided.
(list) The number of features to consider when looking for the best split (int/'sqrt'/NULL)
(list) Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. (int/NULL)
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
(list) Weights associated with classes 'balance' or NULL
The random state seed
if (FALSE) {
+model.decisionTree <- setDecisionTree(maxDepth = 10, minSamplesLeaf = 10, seed = NULL)
+}
+
R/GradientBoostingMachine.R
+ setGradientBoostingMachine.Rd
Create setting for gradient boosting machine model using gbm_xgboost implementation
+The number of trees to build
The number of computer threads to use (how many cores do you have?)
If the performance does not increase over earlyStopRound number of trees then training stops (this prevents overfitting)
Maximum depth of each tree - a large value will lead to slow model training
Minimum sum of instance weight in a child node - larger values are more conservative
The boosting learn rate
Controls weight of positive class in loss - useful for imbalanced classes
L2 regularization on weights - larger is more conservative
L1 regularization on weights - larger is more conservative
An option to add a seed when training the final model
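A hedged sketch; the argument names (ntrees, maxDepth, learnRate, seed) are assumptions based on the descriptions above and the values are illustrative:
if (FALSE) {
model.gbm <- setGradientBoostingMachine(ntrees = 100, maxDepth = 4, learnRate = 0.1, seed = 42)
}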
R/CyclopsSettings.R
+ setIterativeHardThresholding.Rd
Create setting for an iterative hard thresholding model
+The maximum number of non-zero predictors
Specifies the IHT penalty; possible values are `BIC` or `AIC` or a numeric value
An option to add a seed when training the model
A vector of numbers or covariateId names to exclude from prior
Logical: Force intercept coefficient into regularization
Logical: Fit final subset with no regularization
integer
numeric
integer
numeric
numeric
model.iht <- setIterativeHardThresholding(K = 10)
+
Create setting for knn model
+The number of neighbors to consider
The directory where the results and intermediate steps are output
The number of threads to use when applying big knn
if (FALSE) {
+model.knn <- setKNN(k=10000)
+}
+
R/CyclopsSettings.R
+ setLassoLogisticRegression.Rd
Create setting for lasso logistic regression
+Numeric: prior distribution starting variance
An option to add a seed when training the model
a set of covariate IDS to limit the analysis to
a set of covariates whcih are to be forced to be included in the final model. default is the intercept
An option to set number of threads when training model
Logical: Force intercept coefficient into prior
Numeric: Upper prior variance limit for grid-search
Numeric: Lower prior variance limit for grid-search
Numeric: maximum relative change in convergence criterion from successive iterations to achieve convergence
Integer: maximum iterations of Cyclops to attempt before returning a failed-to-converge error
Use coefficients from a previous model as starting points for model fit (transfer learning)
model.lr <- setLassoLogisticRegression()
+
R/LightGBM.R
+ setLightGBM.Rd
Create setting for gradient boosting machine model using lightGBM (https://github.com/microsoft/LightGBM/tree/master/R-package).
+The number of computer threads to use (how many cores do you have?)
If the performance does not increase over earlyStopRound number of trees then training stops (this prevents overfitting)
Number of boosting iterations.
This hyperparameter sets the maximum number of leaves. Increasing this parameter can lead to higher model complexity and potential overfitting.
This hyperparameter sets the maximum tree depth. Increasing this parameter can also lead to higher model complexity and potential overfitting.
This hyperparameter sets the minimum number of data points that must be present in a leaf node. Increasing this parameter can help to reduce overfitting
This hyperparameter controls the step size at each iteration of the gradient descent algorithm. Lower values can lead to slower convergence but may result in better performance.
This hyperparameter controls L1 regularization, which can help to reduce overfitting by encouraging sparse models.
This hyperparameter controls L2 regularization, which can also help to reduce overfitting by discouraging large weights in the model.
Controls weight of positive class in loss - useful for imbalanced classes
This parameter cannot be used at the same time as scalePosWeight; choose only one of them. While enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities.
An option to add a seed when training the final model
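A hedged sketch; the argument names (numIterations, maxDepth, learningRate, seed) are assumptions based on the descriptions above and the values are illustrative:
if (FALSE) {
model.lgbm <- setLightGBM(numIterations = 100, maxDepth = 5, learningRate = 0.05, seed = 42)
}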
R/SklearnClassifierSettings.R
+ setMLP.Rd
Create setting for neural network model with python
+setMLP(
+ hiddenLayerSizes = list(c(100), c(20)),
+ activation = list("relu"),
+ solver = list("adam"),
+ alpha = list(0.3, 0.01, 1e-04, 1e-06),
+ batchSize = list("auto"),
+ learningRate = list("constant"),
+ learningRateInit = list(0.001),
+ powerT = list(0.5),
+ maxIter = list(200, 100),
+ shuffle = list(TRUE),
+ tol = list(1e-04),
+ warmStart = list(TRUE),
+ momentum = list(0.9),
+ nesterovsMomentum = list(TRUE),
+ earlyStopping = list(FALSE),
+ validationFraction = list(0.1),
+ beta1 = list(0.9),
+ beta2 = list(0.999),
+ epsilon = list(1e-08),
+ nIterNoChange = list(10),
+ seed = sample(1e+05, 1)
+)
(list of vectors) The ith element represents the number of neurons in the ith hidden layer.
(list) Activation function for the hidden layer.
"identity": no-op activation, useful to implement linear bottleneck, returns f(x) = x
"logistic": the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).
"tanh": the hyperbolic tan function, returns f(x) = tanh(x).
"relu": the rectified linear unit function, returns f(x) = max(0, x)
(list) The solver for weight optimization. (‘lbfgs’, ‘sgd’, ‘adam’)
(list) L2 penalty (regularization term) parameter.
(list) Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatch. When set to “auto”, batchSize=min(200, n_samples).
(list) Only used when solver='sgd' Learning rate schedule for weight updates. ‘constant’, ‘invscaling’, ‘adaptive’, default=’constant’
(list) Only used when solver=’sgd’ or ‘adam’. The initial learning rate used. It controls the step-size in updating the weights.
(list) Only used when solver=’sgd’. The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’.
(list) Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations. For stochastic solvers (‘sgd’, ‘adam’), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.
(list) boolean: Whether to shuffle samples in each iteration. Only used when solver=’sgd’ or ‘adam’.
(list) Tolerance for the optimization. When the loss or score is not improving by at least tol for nIterNoChange consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence is considered to be reached and training stops.
(list) When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
(list) Momentum for gradient descent update. Should be between 0 and 1. Only used when solver=’sgd’.
(list) Whether to use Nesterov’s momentum. Only used when solver=’sgd’ and momentum > 0.
(list) boolean Whether to use early stopping to terminate training when validation score is not improving. If set to true, it will automatically set aside 10 percent of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.
(list) The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if earlyStopping is True.
(list) Exponential decay rate for estimates of first moment vector in adam, should be in 0 to 1.
(list) Exponential decay rate for estimates of second moment vector in adam, should be in 0 to 1.
(list) Value for numerical stability in adam.
(list) Maximum number of epochs to not meet tol improvement. Only effective when solver=’sgd’ or ‘adam’.
A seed for the model
if (FALSE) {
+model.mlp <- setMLP()
+}
+
R/SklearnClassifierSettings.R
+ setNaiveBayes.Rd
Create setting for naive bayes model with python
+setNaiveBayes()
if (FALSE) {
+model.nb <- setNaiveBayes()
+}
+
R/HelperFunctions.R
+ setPythonEnvironment.Rd
Use the virtual environment created using configurePython()
+setPythonEnvironment(envname = "PLP", envtype = NULL)
A string for the name of the virtual environment (default is 'PLP')
An option for specifying the environment as 'conda' or 'python'. If NULL then the default is 'conda' for windows users and 'python' for non-windows users
This function sets PatientLevelPrediction to use a virtual environment
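For example, to point PatientLevelPrediction at a conda environment named 'PLP':
if (FALSE) {
setPythonEnvironment(envname = "PLP", envtype = "conda")
}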
+R/SklearnClassifierSettings.R
+ setRandomForest.Rd
Create setting for random forest model with python (very fast)
+setRandomForest(
+ ntrees = list(100, 500),
+ criterion = list("gini"),
+ maxDepth = list(4, 10, 17),
+ minSamplesSplit = list(2, 5),
+ minSamplesLeaf = list(1, 10),
+ minWeightFractionLeaf = list(0),
+ mtries = list("sqrt", "log2"),
+ maxLeafNodes = list(NULL),
+ minImpurityDecrease = list(0),
+ bootstrap = list(TRUE),
+ maxSamples = list(NULL, 0.9),
+ oobScore = list(FALSE),
+ nJobs = list(NULL),
+ classWeight = list(NULL),
+ seed = sample(1e+05, 1)
+)
(list) The number of trees to build
(list) The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.
(list) The maximum depth of the tree. If NULL, then nodes are expanded until all leaves are pure or until all leaves contain less than minSamplesSplit samples.
(list) The minimum number of samples required to split an internal node
(list) The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least minSamplesLeaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
(list) The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sampleWeight is not provided.
(list) The number of features to consider when looking for the best split:
int then consider max_features features at each split.
float then max_features is a fraction and round(max_features * n_features) features are considered at each split
'sqrt' then max_features=sqrt(n_features)
'log2' then max_features=log2(n_features)
NULL then max_features=n_features
(list) Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
(list) A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
(list) Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
(list) If bootstrap is True, the number of samples to draw from X to train each base estimator.
(list) Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.
The number of jobs to run in parallel.
(list) Weights associated with classes. If not given, all classes are supposed to have weight one. NULL, “balanced”, “balanced_subsample”
A seed when training the final model
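An illustrative sketch using a small hyper-parameter grid (the values are placeholders):
if (FALSE) {
model.rf <- setRandomForest(
  ntrees = list(500),
  maxDepth = list(10),
  mtries = list("sqrt"),
  seed = 42
)
}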
R/SklearnClassifierSettings.R
+ setSVM.Rd
Create setting for the python sklearn SVM (SVC function)
+(list) Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
(list) Specifies the kernel type to be used in the algorithm. one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’. If none is given ‘rbf’ will be used.
(list) degree of the kernel function; only significant for the 'poly' kernel
(list) kernel coefficient for rbf and poly, by default 1/n_features will be taken. ‘scale’, ‘auto’ or float, default=’scale’
(list) independent term in kernel function. It is only significant in poly/sigmoid.
(list) whether to use the shrinking heuristic.
(list) Tolerance for stopping criterion.
(list) Class weight based on imbalance either 'balanced' or NULL
Specify the size of the kernel cache (in MB).
A seed for the model
if (FALSE) {
+model.svm <- setSVM(kernel='rbf', seed = NULL)
+}
+
simulatePlpData
creates a plpData object with simulated data.
simulatePlpData(plpDataSimulationProfile, n = 10000)
An object of type plpDataSimulationProfile
as generated
+using the createplpDataSimulationProfile
function.
The size of the population to be generated.
An object of type plpData
.
This function generates simulated data that is in many ways similar to the original data on which the simulation profile is based. It contains the same outcome, comparator, and outcome concept IDs, and the covariates and their first-order statistics should be comparable.
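A hedged sketch, assuming a simulation profile object (here called simulationProfile) is already available:
if (FALSE) {
plpData <- simulatePlpData(plpDataSimulationProfile = simulationProfile, n = 1000)
}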
+Loads sklearn python model from json
+sklearnFromJson(path)
path to the model json file
R/SklearnToJson.R
+ sklearnToJson.Rd
Saves sklearn python model object to json in path
+sklearnToJson(model, path)
a fitted sklearn python model object
path to the saved model file
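For example, round-tripping a fitted sklearn model (the fittedSklearnModel object is an assumption):
if (FALSE) {
sklearnToJson(model = fittedSklearnModel, path = "model.json")
model2 <- sklearnFromJson(path = "model.json")
}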
Calculate the specificity
+specificity(TP, TN, FN, FP)
Number of true positives
Number of true negatives
Number of false negatives
Number of false positives
specificity value
+Calculate the specificity
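For example, with a small confusion matrix:
specificity(TP = 80, TN = 900, FN = 20, FP = 100) # specificity = TN / (TN + FP) = 0.9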
+splitSettings
R/DataSplitting.R
+ splitData.Rd
Split the plpData into test/train sets using splitting settings of class splitSettings
splitData(
+ plpData = plpData,
+ population = population,
+ splitSettings = splitSettings
+)
An object of type plpData
- the patient level prediction
+data extracted from the CDM.
The population created using createStudyPopulation
that defines who will be used to develop the model
An object of type splitSettings
specifying the split - the default can be created using createDefaultSplitSetting
An object of class splitSettings
Returns a list containing the training data (Train) and optionally the test data (Test). Train is an Andromeda object containing
covariateRef: a table with the covariate information
labels: a table (rowId, outcomeCount, ...) for each data point in the train data (outcomeCount is the class label)
folds: a table (rowId, index) specifying which training fold each data point is in.
Test is an Andromeda object containing
covariateRef: a table with the covariate information
labels: a table (rowId, outcomeCount, ...) for each data point in the test data (outcomeCount is the class label)
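A minimal sketch assuming plpData and population already exist; the split settings mirror the defaults shown for runPlp above:
if (FALSE) {
splitSettings <- createDefaultSplitSetting(
  type = "stratified", testFraction = 0.25,
  trainFraction = 0.75, splitSeed = 123, nfold = 3
)
data <- splitData(plpData = plpData, population = population, splitSettings = splitSettings)
}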
R/Formatting.R
+ toSparseM.Rd
Converts the standard plpData to a sparse matrix
+toSparseM(plpData, cohort = NULL, map = NULL)
An object of type plpData
with covariates in COO format - the patient level prediction
+data extracted from the CDM.
If specified the plpData is restricted to the rowIds in the cohort (otherwise plpData$labels is used)
A covariate map (telling us the column number for covariates)
Returns a list containing the data as a sparse matrix, the plpData covariateRef, and a data.frame named map that tells us which covariate corresponds to each column. This object is a list with the following components:
A sparse matrix with the rows corresponding to each person in the plpData and the columns corresponding to the covariates.
The plpData covariateRef.
A data.frame containing the data column ids and the corresponding covariateId from covariateRef.
This function converts the covariate file from ffdf in COO format into a sparse matrix from +the package Matrix
+if (FALSE) {
+# illustrative sketch: plpData as returned by getPlpData()
+sparseData <- toSparseM(plpData, cohort = NULL)
+}
+
R/ExternalValidatePlp.R
+ validateExternal.Rd
externalValidatePlp - Validate model performance on new data
+validateExternal(
+ validationDesignList,
+ databaseDetails,
+ logSettings,
+ outputFolder
+)
A list of objects created with createValidationDesign
A list of objects of class
+databaseDetails
created using createDatabaseDetails
An object of logSettings
created
+using createLogSettings
The directory to save the validation results to +(subfolders are created per database in validationDatabaseDetails)
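A hedged sketch; the validation design, database details and output path are placeholders:
if (FALSE) {
# validationDesign created with createValidationDesign(); database details as elsewhere
validateExternal(
  validationDesignList = list(validationDesign),
  databaseDetails = validationDatabaseDetails,
  logSettings = createLogSettings(),
  outputFolder = "./validation"
)
}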
R/RunMultiplePlp.R
+ validateMultiplePlp.Rd
This function loads all the models in a multiple plp analysis folder and validates the models on new data
+validateMultiplePlp(
+ analysesLocation,
+ validationDatabaseDetails,
+ validationRestrictPlpDataSettings = createRestrictPlpDataSettings(),
+ recalibrate = NULL,
+ cohortDefinitions = NULL,
+ saveDirectory = NULL
+)
The location where the multiple plp analyses are
A single or list of validation database settings created using createDatabaseDetails()
The settings specifying the extra restriction settings when extracting the data created using createRestrictPlpDataSettings()
.
A vector of recalibration methods (currently supports 'recalibrationInTheLarge' and/or 'weakRecalibration')
A list of cohortDefinitions
The location to save the validation results
Users need to input a location where the results of the multiple plp analyses +are found and the connection and database settings for the new data
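An illustrative sketch; the analyses location and validation database details are placeholders:
if (FALSE) {
validateMultiplePlp(
  analysesLocation = "./multiplePlpResults",
  validationDatabaseDetails = createDatabaseDetails(), # fill in the new database connection
  recalibrate = "weakRecalibration"
)
}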
+R/ViewShinyPlp.R
+ viewDatabaseResultPlp.Rd
open a local shiny app for viewing the results of PLP analyses from a database
+viewDatabaseResultPlp(
+ mySchema,
+ myServer,
+ myUser,
+ myPassword,
+ myDbms,
+ myPort = NULL,
+ myTableAppend
+)
Database result schema containing the result tables
server with the result database
Username for the connection to the result database
Password for the connection to the result database
database management system for the result database
Port for the connection to the result database
A string appended to the results tables (optional)
Opens a shiny app for viewing the results of the models from a database
+R/ViewShinyPlp.R
+ viewMultiplePlp.Rd
open a local shiny app for viewing the results of multiple PLP analyses
+viewMultiplePlp(analysesLocation)
The directory containing the results (with the analysis_x folders)
Opens a shiny app for viewing the results of the models developed with various target, outcome, time-at-risk and model settings.
+R/ViewShinyPlp.R
+ viewPlp.Rd
This is a shiny app for viewing interactive plots of the performance and the settings
+viewPlp(runPlp, validatePlp = NULL, diagnosePlp = NULL)
The output of runPlp() (an object of class 'runPlp')
The output of externalValidatePlp (an object of class 'validatePlp')
The output of diagnosePlp()
Opens a shiny app for interactively viewing the results
+Opens the result of runPlp (and, optionally, validation and diagnostic results) so the plots can be viewed interactively.
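For example (the plpResult object is assumed to come from a previous runPlp() call):
if (FALSE) {
viewPlp(runPlp = plpResult)
}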
Topic | Research Summary | Link
--- | --- | ---
Problem Specification | When is prediction suitable in observational data? | Guidelines needed
Data Creation | Comparison of cohort vs case-control design | Journal of Big Data
Data Creation | Addressing loss to follow-up (right censoring) | BMC Medical Informatics and Decision Making
Data Creation | Investigating how to address left censoring in features construction | BMC Medical Research Methodology
Data Creation | Impact of over/under-sampling | Journal of Big Data
Data Creation | Impact of phenotypes | Study Done - Paper submitted
Model development | How much data do we need for prediction - Learning curves at scale | International Journal of Medical Informatics
Model development | What impact does test/train/validation design have on model performance | BMJ Open
Model development | What is the impact of the classifier | JAMIA
Model development | Can we find hyper-parameter combinations per classifier that consistently lead to good performing models when using claims/EHR data? | Study needs to be done
Model development | Can we use ensembles to combine different algorithm models within a database to improve models transportability? | Caring is Sharing – Exploiting the Value in Data for Health and Innovation
Model development | Can we use ensembles to combine models developed using different databases to improve models transportability? | BMC Medical Informatics and Decision Making
Model development | Impact of regularization method | JAMIA
Evaluation | Why prediction is not suitable for risk factor identification | Machine Learning for Healthcare Conference
Evaluation | Iterative pairwise external validation to put validation into context | Drug Safety
Evaluation | A novel method to estimate external validation using aggregate statistics | Study under review
Evaluation | How should we present model performance? (e.g., new visualizations) | JAMIA Open
Evaluation | How to interpret external validation performance (can we figure out why the performance drops or stays consistent)? | Study needs to be done
Evaluation | Recalibration methods | Study needs to be done
Evaluation | Is there a way to automatically simplify models? | Study protocol under development