caret ml cv
Cross-validation (CV) is a possible cure for overfitting. Overfitting refers to the situation where a model fits the data it was built on (the training set) very well, but fails completely when an unknown external validation set is applied. The best way to avoid overfitting is always to perform a three-way split (a sketch using caret's createDataPartition follows the list below):
- Training set (70% of data)
- Test set (30% of data)
- Validation set (+30% of new data)
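A minimal sketch of such a split, assuming caret's createDataPartition and the BloodBrain data used later on this page; the extra validation set would come from genuinely new data:
# 70/30 train/test split with createDataPartition
require(caret); data(BloodBrain); set.seed(123);
inTrain <- createDataPartition(logBBB, p=0.7, list=FALSE)
trainX <- bbbDescr[inTrain, ];  trainY <- logBBB[inTrain]
testX  <- bbbDescr[-inTrain, ]; testY  <- logBBB[-inTrain]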
Some people merge the training and test set, just perform cross-validation on the merged data, and then use the validation set for the actual "validation". Caret provides several different CV methods; explanations can be found here.
Num | Method | Speed | Accuracy | Description |
---|---|---|---|---|
1 | boot632 | fast | best | the .632+ Bootstrap |
2 | LGOCV | fast | good | leave-group-out cross-validation |
3 | LOOCV | slooow | good | leave-one-out cross-validation |
4 | cv | fast | good | k-fold cross-validation |
5 | repeatedcv | fast | good | repeated 10-fold cross-validation |
6 | boot | fast | ok | bootstrap |
7 | none | fastest | no | none |
8 | oob | NA | NA | out-of-bag (only for tree models) |
Regarding speed: using no CV (method none) is of course the fastest; all other methods are good options and should always be enabled. Be aware that, depending on your settings, method LOOCV may be 40-fold slower than the other methods. There is also method oob, but only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models (a small sketch follows after the simple examples below).
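Most of these methods accept additional settings via trainControl, e.g. the number of folds or resamples and the number of repeats; a minimal sketch with purely illustrative values (not used in the examples below):
# repeated 10-fold CV with 5 repeats; leave-group-out CV with 25 resamples of 75% training data
require(caret)
tc1 <- trainControl(method="repeatedcv", number=10, repeats=5)
tc2 <- trainControl(method="LGOCV", p=0.75, number=25)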
Simple examples of caret cross-validation methods
# Single example, no cross-validation
require(caret); data(BloodBrain); set.seed(123);
fit1 <- train(bbbDescr, logBBB, "knn"); fit1
# cross-validation example with method boot
require(caret); data(BloodBrain); set.seed(123);
tc <- trainControl(method="boot")
fit1 <- train(bbbDescr, logBBB, trControl=tc, method="knn"); fit1
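The oob method mentioned above works only for certain tree-based models; a minimal sketch, assuming a random forest (method "rf", which requires the randomForest package) on the same data:
# out-of-bag estimation example, only valid for tree-based models such as "rf"
require(caret); require(randomForest); data(BloodBrain); set.seed(123);
tc <- trainControl(method="oob")
fitOob <- train(bbbDescr, logBBB, trControl=tc, method="rf"); fitOob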
Apply all six CV-methods in caret at once
Now it may be interesting to see which CV method performs best, or to use them as benchmarks. One can do that sequentially, one-by-one, which is easier to understand, or in a loop, which is more compact. In R we can also use lapply, which returns a list, or sapply, which returns a matrix. Because the results are rather complicated, I prefer loops or sequential code.
# All available six cross-validation methods applied
require(caret); data(BloodBrain);
cvMethods <- c("boot632","LGOCV","LOOCV","cv","repeatedcv", "boot" );
all <- sapply(cvMethods, function(x) {set.seed(123); print(x); tc <- trainControl(method=x)
fit1 <- train(bbbDescr, logBBB, trControl=tc, method="knn") }); all
all[4, ]   # one row of the result matrix (the same component from each train object)
# All caret cross-validation methods applied using lapply (list result)
require(caret); data(BloodBrain);
cvMethods <- c("boot632","LGOCV","LOOCV","cv","repeatedcv", "boot");
all <- lapply(cvMethods, function(x) {set.seed(123); print(x); tc <- trainControl(method=x)
fit1 <- train(bbbDescr, logBBB, trControl=tc, method="knn") })
all
The lapply and sapply examples above give us a nice view of the different cross-validations at once. So we can see which method performs best, extract the times needed for the CVs, and do more. Of course such complicated matrices are hard to handle because they are multi-dimensional; here, assigning single model names and looping through them may be easier.
# Print all caret cv methods for a given dataset and method
require(caret); data(BloodBrain);
cvMethods <- c("boot632","LGOCV","LOOCV","cv","repeatedcv", "boot");
all <- lapply(cvMethods, function(x) {set.seed(123); print(x); tc <- trainControl(method=x)
fit1 <- train(bbbDescr, logBBB, trControl=tc, method="knn") }); all;
# extract the used CV method names (redundant because they are already in cvMethods)
myNames <- lapply(1:6, function(x) all[[x]]$control$method)
# save results
results <- sapply(all,getTrainPerf)
# change column Names to cv methods
colnames(results) <- myNames;
# get the results
results
# boot632 LGOCV LOOCV cv repeatedcv boot
# TrainRMSE 0.619778 0.6275048 0.6309407 0.6192086 0.6192086 0.66943
# TrainRsquared 0.4009745 0.3554037 0.3429081 0.3831812 0.3831812 0.3140373
# method "knn" "knn" "knn" "knn" "knn" "knn"
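As mentioned above, the list of fitted models also lets us extract how long each CV method took; a minimal sketch, assuming the all object and cvMethods vector from the chunk above (caret train objects store their timings in $times):
# extract the elapsed time in seconds for each CV method
cvTimes <- sapply(all, function(m) m$times$everything["elapsed"])
names(cvTimes) <- cvMethods; round(cvTimes, 2)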
So here we have it: essentially a way to run and compare all CV methods in a few lines of code. Of course it could be condensed further, but it comes in quite handy for all kinds of modifications, including choosing the correct data-splitting method, resampling, and checking validation performance.
Links
- Overfitting examples - some discussions about overfitting
- caret CV examples - simple and useful examples to perform caret CVs, each method is explained
- 04_Over_Fitting.R - chapter 4 from the caret book, beware the example is rather large
- lapply, sapply - it's a loop (in C/C++); explanations of apply, lapply, sapply, mapply, tapply and so on
- trControl - documentation of caret CV methods
- lapply, sapply, mapply - covers apply, eapply, lapply, mapply, rapply, tapply
Source code
- caret-cv-examples - examples from the page above