Tobias Kind edited this page Nov 9, 2015 · 19 revisions

A number of data sets ship with the caret library, and many other examples revolve around traditional R data sets such as the iris data. A vast resource of machine learning data sets is the UCI ML archive, which covers over 350 different data sets. First, it is important to know whether a data set is geared towards classification or regression. Second, the input (x) and output (y) variables need to be known, plus whether they are categorical (A, B, C) or continuous (1, 2, 3). Finally, it helps to know the dimension/size of the data to avoid extremely long-running examples in the beginning. This is not intended as a comprehensive collection, but rather as a quick pointer for practitioners.
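The checks described above can be sketched in a few lines of R. The helper `describe_dataset()` is a hypothetical name introduced here for illustration; it is not part of caret or base R. It assumes the data set is a single data frame and that the name of the target column is already known.

```r
# Hypothetical helper (not part of caret): summarise a data frame before modelling.
describe_dataset <- function(df, target) {
  stopifnot(is.data.frame(df), target %in% names(df))
  y <- df[[target]]
  predictors <- df[, setdiff(names(df), target), drop = FALSE]
  list(
    # factor or character targets suggest classification, numeric ones regression
    task = if (is.factor(y) || is.character(y)) "classification" else "regression",
    dimension = dim(df),
    n_classes = if (is.factor(y)) nlevels(y) else NA_integer_,
    categorical_inputs = names(predictors)[vapply(predictors, is.factor, logical(1))],
    continuous_inputs = names(predictors)[vapply(predictors, is.numeric, logical(1))]
  )
}

data(iris)
info <- describe_dataset(iris, "Species")
str(info)
```

Note that this heuristic treats any numeric target as regression, so class labels coded as numbers (0/1) would need to be converted to a factor first.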


Example data sets and their dimensions and use View/Download

I recommend the View/Download button above to see the *.csv data sets; the markdown table below is sub-par. Anyway, here it goes. Check the dimensions to select fast or slow data sets.

| Num | Data set | Library | Reg/Class | Y-target | X-input | Dimension | Design |
|-----|----------|---------|-----------|----------|---------|-----------|--------|
| 1 | data(iris) | R | Class | Species | remaining data | 150 x 5 | multi class |
| 2 | data(trees) | R | NA | NA | remaining data | 31 x 3 | multi class |
| 3 | data(Glass) | R | Class | Type | remaining data | 214 x 10 | multi class |
| 4 | data(cox2) | caret | Class | cox2Class | cox2Descr, cox2IC50 | 462 x 255 | Two class |
| 5 | data(oil) | caret | Class | oilType | fattyAcids | 96 x 7 | multi class |
| 6 | data(dhfr) | caret | Reg | Y | remaining data | 325 x 229 | multi param |
| 7 | data(GermanCredit) | caret | Reg/Class | Class | remaining data | 1000 x 62 | multi param |
| 8 | data(BostonHousing) | mlbench | Reg | medv | remaining data | 506 x 14 | multi param |
| 9 | data(BloodBrain) | caret | Reg | logBBB | bbbDescr | 208 x 134 | multi param |
| 10 | data(mdrr) | caret | Class | mdrrClass | mdrrDescr | 528 x 342 | Two class |
| 11 | data(Satellite) | mlbench | Class | classes | remaining data | 6435 x 37 | multi class |
| 12 | data(cars) | caret | Reg/Class | any | any | 804 x 15 | mixed |
| 13 | data(dhfr) | caret | Class | Y | remaining data | 325 x 229 | Two class |
| 14 | data(pottery) | caret | Class | potteryClass | pottery | NA | Two class |
| 15 | data(segmentationData) | caret | Class | Class | remaining data | 2019 x 61 | Two class |
| 16 | data(tecator) | caret | Reg | endpoints | absorp | 215 x 100 | multi param |
| 17 | data(abalone) | APM | Class | Type | remaining data | 4177 x 9 | multi class |
| 18 | data(AlzheimerDisease) | APM | Class | diagnosis | predictors | 333 x 130 | Two class |
| 19 | data(ChemicalManufacturingProcess) | APM | Reg | Yield | remaining data | 176 x 58 | multi param |
| 20 | data(concrete) | APM | Reg | CompressiveStrength | remaining data | 1030 x 9 | multi param |
| 21 | data(FuelEconomy) | APM | Reg/Class | | | 3 sets | multi param |
| 22 | data(hepatic) | APM | Class | injury | remaining data | 2 sets | multi param |
| 23 | data(solubility) | APM | Reg | solTestY | | 1267 x 228 | multi param |
| 24 | data(permeability) | APM | Reg | permeability | fingerprints | 165 x 1107 | multi param |
| 25 | data(schedulingData) | APM | Class | Class | remaining data | 4331 x 8 | multi class |
| 26 | data(segmentationOriginal) | APM | Class | Class | remaining data | 2019 x 119 | |
| 27 | data(twoClassData) | APM | Class | classes | predictors | 208 x 2 | Two class |
| 28 | data(BreastCancer) | mlbench | Class | Class | remaining data | 699 x 11 | Two class |
| 29 | data(PimaIndiansDiabetes) | mlbench | Class | diabetes | remaining data | 768 x 9 | Two class |
| 30 | data(Sonar) | mlbench | Class | Class | remaining data | 208 x 61 | Two class |
| 31 | Human Activity Recognition (HAR) | puc-rio.br | Class | “class” | remaining data | 165634 x 21 | multi class |
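Where the table lists "remaining data" as the X-input, the predictors are simply all columns except the Y-target. A minimal sketch of that split, using iris from the table; `split_xy()` is a hypothetical helper name introduced here, not a caret function:

```r
# Hypothetical helper: split a data frame into target (y) and "remaining data" (x).
split_xy <- function(df, target) {
  stopifnot(is.data.frame(df), target %in% names(df))
  list(
    x = df[, setdiff(names(df), target), drop = FALSE],  # all columns but the target
    y = df[[target]]                                     # the target column as a vector
  )
}

data(iris)
parts <- split_xy(iris, "Species")
dim(parts$x)     # predictors only, one column fewer than the full data set
length(parts$y)  # one target value per row
```

Data sets that load separate objects for inputs and outputs (for example mdrrDescr and mdrrClass) do not need this step.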

Show all available data sets loaded

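# show all data sets available in the attached packages
library(caret)
library(datasets)
library(AppliedPredictiveModeling)
library(mlbench)
data()
# to list the data sets of every installed package instead:
# data(package = .packages(all.available = TRUE))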

# Data sets in package ‘AppliedPredictiveModeling’:
# ChemicalManufacturingProcess                          Chemical Manufacturing Process Data
# abalone                                               Abalone Data
# bio (hepatic)                                         Hepatic Injury Data
# cars2012 (FuelEconomy)                                Fuel Economy Data
# chem (hepatic)                                        Hepatic Injury Data
# classes (twoClassData)                                Two Class Example Data
# concrete                                              Compressive Strength of Concrete from Yeh (1998)
# ...
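The listing above can also be captured programmatically, which may help when searching for a data set by name. The sketch below relies on the `results` matrix returned by `data(package = ...)`; the wrapper `list_datasets()` is a hypothetical name, and the base `datasets` package is used so the example runs without extra installs.

```r
# List the data sets of a single installed package as a data frame.
list_datasets <- function(package) {
  res <- data(package = package)$results  # character matrix with Item and Title columns
  data.frame(item = res[, "Item"], title = res[, "Title"], stringsAsFactors = FALSE)
}

ds <- list_datasets("datasets")
head(ds)

# search the listing, e.g. for everything whose name starts with "iris"
ds[grepl("^iris", ds$item), ]
```

Passing "caret", "mlbench", or "AppliedPredictiveModeling" should work the same way, provided those packages are installed.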

Quick instructions to work with the data sets

# load the dataset
data(iris)
# get the dimension of the dataset
dim(iris)
## [1] 150   5
# get the class (here a data frame) to choose the correct operators
class(iris)
## [1] "data.frame"

# invoke the simple data viewer (interactive sessions only)
View(iris)
# invoke the (rather useless) data editor; it returns the edited copy
edit(iris)
# get the data structure
str(iris)
## 'data.frame':   150 obs. of  5 variables:
## $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 ...
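# load the package that contains the data
library(caret)
# load the mdrr classification data set (creates mdrrDescr and mdrrClass)
data(mdrr)
# view the y-factor; View(mdrr) does not work because no object named mdrr exists
View(mdrrClass)
# get the length of mdrrClass (a factor has no dimension)
length(mdrrClass)
## [1] 528
# get the dimension of the descriptor data
dim(mdrrDescr)
## [1] 528 342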
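Because some data sets create several objects under names that differ from the data set name, it can help to check what `data()` actually loads. A minimal sketch: load the data set into a fresh environment and list its contents. `loaded_objects()` is a hypothetical helper name; the caret call is guarded because caret may not be installed.

```r
# Hypothetical helper: report which objects a data set creates when loaded.
loaded_objects <- function(name, package) {
  env <- new.env()                                   # keep the global workspace clean
  data(list = name, package = package, envir = env)  # load into the scratch environment
  ls(env)                                            # names of the objects created
}

loaded_objects("iris", "datasets")

# For mdrr this should list the descriptor and class objects used above.
if (requireNamespace("caret", quietly = TRUE)) {
  print(loaded_objects("mdrr", "caret"))
}
```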

Code

  • ML data sets - view some important ML datasets and their dimensions.
