caret ml datasets
The caret library includes a number of data sets, but many other examples use traditional R data sets such as the iris data. A vast resource of machine learning data sets is the UCI ML archive, which covers over 350 different data sets. First, it is important to know whether a data set is geared towards classification or regression. Second, the input (x) and output (y) variables need to be known, plus whether they are categorical (A, B, C) or continuous (1, 2, 3). Finally, it is important to know the dimension/size of the data set, to avoid extremely long-running examples in the beginning. This is not intended as a comprehensive collection, but rather as a quick pointer for practitioners.
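The checks described above can be sketched in a few lines of R. The helper `describe_target()` is a hypothetical name introduced only for illustration and is not part of caret. It assumes the data set is a data frame and that the name of the target column is already known. Treating `Volume` as the target of `trees` is likewise an assumption made for this example.

```r
# Hypothetical helper (not part of caret): summarise a data frame and its target
describe_target <- function(df, target) {
  y <- df[[target]]
  list(
    rows = nrow(df),
    cols = ncol(df),
    # a factor or character target suggests classification, a numeric one regression
    task = if (is.factor(y) || is.character(y)) "classification" else "regression",
    # number of classes, only defined for factor targets
    n_classes = if (is.factor(y)) nlevels(y) else NA_integer_
  )
}

describe_target(iris, "Species")  # categorical target
describe_target(trees, "Volume")  # continuous target (assumed for this example)
```

This heuristic is deliberately simple: a class label stored as a number (1, 2, 3) would be reported as regression, so the result should be checked against the data set's documentation.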
Example data sets and their dimensions and use View/Download
I recommend the View/Download button above for viewing the *.csv datasets; the markdown table below is sub-par. Anyway, here it goes: check the dimensions to pick fast or slow datasets.
Num | Data set | library | Reg/Class | Y-target | X-input | Dimension | Design |
---|---|---|---|---|---|---|---|
1 | data(iris) | R | Class | Species | remaining data | 150 x 5 | multi class |
2 | data(trees) | R | NA | NA | remaining data | 31 x 3 | multi class |
3 | data(Glass) | mlbench | Class | Type | remaining data | 214 x 10 | multi class |
4 | data(cox2) | caret | Class | cox2Class | cox2Descr,cox2IC50 | 462 x 255 | Two class |
5 | data(oil) | caret | Class | oilType | fattyAcids | 96 x 7 | multi class |
6 | data(dhfr) | caret | Reg | Y | remaining data | 325 x 229 | multi param |
7 | data(GermanCredit) | caret | Reg/Class | Class | remaining data | 1000 x 62 | multi param |
8 | data(BostonHousing) | mlbench | Reg | medv | remaining data | 506 x 14 | multi param |
9 | data(BloodBrain) | caret | Reg | logBBB | bbbDescr | 208 x 134 | multi param |
10 | data(mdrr) | caret | Class | mdrrClass | mdrrDescr | 528 x 342 | Two class |
11 | data(Satellite) | mlbench | Class | classes | remaining data | 6435 x 37 | multi class |
12 | data(cars) | caret | Reg/Class | any | any | 804 x 15 | mixed |
13 | data(dhfr) | caret | Class | Y | remaining data | 325 x 229 | Two class |
14 | data(pottery) | caret | Class | potteryClass | pottery | NA | Two class |
15 | data(segmentationData) | caret | Class | Class | remaining data | 2019 x 61 | Two class |
16 | data(tecator) | caret | Reg | endpoints | absorp | 215 x 100 | multi param |
17 | data(abalone) | APM | Class | Type | remaining data | 4177 x 9 | multi class |
18 | data(AlzheimerDisease) | APM | Class | diagnosis | predictors | 333 x 130 | Two class |
19 | data(ChemicalManufacturingProcess) | APM | Reg | Yield | remaining data | 176 x 58 | multi param |
20 | data(concrete) | APM | Reg | CompressiveStrength | remaining data | 1030 x 9 | multi param |
21 | data(FuelEconomy) | APM | Reg/Class | | | 3 sets | multi param |
22 | data(hepatic) | APM | Class | injury | remaining data | 2 sets | multi param |
23 | data(solubility) | APM | Reg | solTestY | | 1267 x 228 | multi param |
24 | data(permeability) | APM | Reg | permeability | fingerprints | 165 x 1107 | multi param |
25 | data(schedulingData) | APM | Class | Class | remaining data | 4331 x 8 | multi class |
26 | data(segmentationOriginal) | APM | Class | Class | remaining data | 2019 x 119 | |
27 | data(twoClassData) | APM | Class | classes | predictors | 208 x 2 | Two class |
28 | data(BreastCancer) | mlbench | Class | Class | remaining data | 699 x 11 | Two class |
29 | data(PimaIndiansDiabetes) | mlbench | Class | diabetes | remaining data | 768 x 9 | Two class |
30 | data(Sonar) | mlbench | Class | Class | remaining data | 208 x 61 | Two class |
31 | Human Activity Recognition (HAR) | puc-rio.br | Class | “class” | remaining data | 165634 x 21 | multi class |
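
Where the table lists "remaining data" as the X-input, the data set is a single data frame and the inputs are simply all columns except the Y-target. A minimal sketch of that split, using iris because it ships with base R:

```r
# Split a data frame into the target (y) and the inputs (x = "remaining data")
target <- "Species"
y <- iris[[target]]
x <- iris[, setdiff(names(iris), target), drop = FALSE]

# inputs: all columns except the target
dim(x)
## [1] 150   4
# one target value per row
length(y)
## [1] 150
```

Data sets such as mdrr or BloodBrain need no such split, since the table shows that they already come as separate objects (for example mdrrDescr and mdrrClass).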
Show all available data sets loaded
# attach the packages, then list the data sets of all attached packages
library(caret)
library(datasets)
library(AppliedPredictiveModeling)
library(mlbench)
data()
# Data sets in package ‘AppliedPredictiveModeling’:
# ChemicalManufacturingProcess Chemical Manufacturing Process Data
# abalone Abalone Data
# bio (hepatic) Hepatic Injury Data
# cars2012 (FuelEconomy) Fuel Economy Data
# chem (hepatic) Hepatic Injury Data
# classes (twoClassData) Two Class Example Data
# concrete Compressive Strength of Concrete from Yeh (1998)
# ...
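To restrict the listing to one package, or to check programmatically whether a data set exists, the return value of `data()` can be inspected. The sketch below uses the base `datasets` package so that it runs without extra installs. Swapping in `"caret"`, `"mlbench"` or `"AppliedPredictiveModeling"` should work the same way, assuming those packages are installed.

```r
# List the data sets of a single package without attaching it
res <- data(package = "datasets")$results

# 'results' is a character matrix; keep the name and the title columns
head(res[, c("Item", "Title")])

# check whether a given data set is listed for that package
"iris" %in% res[, "Item"]
```

Entries such as `classes (twoClassData)` in the listing above mean that the object `classes` is loaded by calling `data(twoClassData)`.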
Quick instructions to work with the data sets
# load the dataset
data(iris)
# get the dimension of the dataset
dim(iris)
## [1] 150 5
# get the class name (here data frame) to choose correct operators
class(iris)
## [1] "data.frame"
# invoke simple data viewer
View(iris)
# invoke the useless editor
edit(iris)
# get the data structure
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1
# load the package that contains the data
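library(caret)
# load the mdrr classification data set
data(mdrr)
# view the y factor (View(mdrr) does not work: data(mdrr) creates mdrrDescr and mdrrClass, not mdrr)
View(mdrrClass)
# get the length of mdrrClass (a factor has no dimension)
length(mdrrClass)
## [1] 528
# get the dimension of the descriptor data mdrrDescr
dim(mdrrDescr)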
## [1] 528 342
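Because a single `data()` call can create several objects whose names differ from the data set name (as with mdrr above), it helps to check what was actually loaded. The following is a small sketch: `objects_created_by()` is a hypothetical helper name, not a caret function. It loads the data set into a scratch environment so that the workspace stays untouched.

```r
# Hypothetical helper: list the objects that a data() call creates
objects_created_by <- function(name, package) {
  e <- new.env()
  # load into a temporary environment instead of the global workspace
  data(list = name, package = package, envir = e)
  sort(ls(e))
}

objects_created_by("iris", "datasets")
## [1] "iris"

# only runs if caret is installed; per the example above this is
# expected to list mdrrClass and mdrrDescr
if (requireNamespace("caret", quietly = TRUE)) {
  objects_created_by("mdrr", "caret")
}
```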
Code
- ML data sets - view some important ML datasets and their dimensions.
LINKS
- mlbench data sets - R mlbench package which covers many data sets
- caret data sets - R caret PDF contains detailed info about the caret specific sets
- caret WIKI - quick description of caret sets
- ApplPredMod data sets - from the compendium book Applied Predictive Modeling
- UCI sets - general collection (many already converted to R)