-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathREADME.Rmd
134 lines (88 loc) · 6.13 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# sptidy
<!-- badges: start -->
[![R-CMD-check](https://github.com/UBC-MDS/sptidy/workflows/R-CMD-check/badge.svg)](https://github.com/UBC-MDS/sptidy/actions)
<!-- badges: end -->
An R package that produces a tidy output for tidymodels model evaluation!
## Introduction
Sptidy implements a `tidy` and `augment` function for base R's linear regression and Tidymodel's kmeans clustering to ease model selection and assessment tasks. This package is a simplified reimplementation of the existing `tidy` and `augment` functions in the Broom package. Sptidy’s family of tidy functions returns a dataframe that summarizes important model information, while the augment function expands the original dataframe to include additional model specific information by observation. This package is meant to complement [Sktidy](https://github.com/UBC-MDS/sktidy), a Python package that was created to tidy up the scikit-learn package.
## Features
The functions that this package currently support include:
- `tidy_kmeans()`: Returns inertia, cluster location, and number of associated points at the level of clusters in a tidy format.
- `tidy_lr()`: Returns coefficients and corresponding feature names in a tidy format.
- `augment_lr()` : Returns predictions and residuals for each point in the training data set in a tidy format.
- `augment_kmeans()` : Returns assigned cluster and distance from cluster center for the data the kmeans algorithm was fitted with in a tidy format.
## How sptidy fits into the R ecosystem
[Tidymodels](https://github.com/tidymodels) is a “meta-package” for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse. One of the packages it includes is [broom](https://broom.tidymodels.org/) which takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames. The tidy data refers to outputting the results in a `data.frame` where each variable has its own column, each observation has its own row, and each value has its own cell. In `sptidy`, we implement the functions `tidy()` and `augment()` for the linear regression model from base R using the function [`lm()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm) and the KMeans model from R `stats` package using the function [`kmeans()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans).
## Installation
This package is not yet available on CRAN, but you can install the development version from [GitHub](https://github.com/) with:
``` r
devtools::install_github("UBC-MDS/sptidy")
```
## Example
This is a basic example which shows you how to solve a common problem:
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
When working with linear regression and kmeans clustering in R, it's nice to:
* have a one-stop shop summary data frame on model statistics
* have a data frame with prediction, residual or cluster assignment with respect to each observation
The sptidy package provides tools that make these tasks swift and easy.
## Overview
This vignette is the package-wide documentation for R package sptidy. This helps those who want to use this package better understand what sptidy does. A full usage demonstration of all functions in this package is included in this vignette.
## Available functions
sptidy provides functions for model evaluation and analysis. These functions work with 2 types of models: lm() and kmeans().
- linear regression
- `tidy_lr` outputs summary on lm()
- `augment_lr` augments a data frame with predictions and residuals
- kmeans clustering
- `tidy_kmeans` outputs summary on kmeans()
- `augment_kmeans` augments a data frame with cluster assignments
## Data
First, let us load sptidy and 2 data set: `longley`, `iris` for function usage demonstration.
```{r setup}
library(sptidy)
data(longley)
data(iris)
```
## Function Usage Demonstration
### tidy_lr(): clean summary output for lm()
sptidy::tidy_lr() provides a tidy data frame that summarizes a fitted linear regression object lm(). The argument in the function needs to be a fitted lm() object. The output data frame has 4 columns, describing coefficient estimates, standard error, t-statistics and p-values.
```{r}
my_lr <- lm(Employed~., data = longley)
sptidy::tidy_lr(my_lr)
```
### augment_lr(): augment data frame from lm()
sptidy::augment_lr() augments the data frame with predictions and residuals from a fitted linear regression object lm(). The first argument is the fitted lm() object. The second and third argument refer to the feature data frame and target data frame that are fitted to the lm() object. The output data frame has additional 2 columns, describing predictions and residuals with respect to each observation.
```{r}
my_lr <- lm(Employed~., data = longley)
sptidy::augment_lr(my_lr, longley[1:6], as.data.frame(longley$Employed))
```
### tidy_kmeans(): clean summary output for kmeans()
sptidy::tidy_kmeans() provides a tidy data frame that summarizes a fitted kmeans clustering object kmenas(). The first argument is the fitted kmeans() object. The second argument refers to the data that was fitted to the kmeans() object. The output data frame has 3 columns, describing cluster number, cluster center and number of points within each cluster.
```{r}
data <- iris[,1:3]
kclust <- kmeans(data, centers = 3)
tidy_kmeans(kclust, data)
```
### augment_kmeans(): augment data frame from kmeans()
sptidy::augment_kmeans() augments the data frame with cluster assignment from a fitted kmeans clustering object kmeans(). The first argument is the fitted kmeans() object. The second argument refers to the data that was fitted to the kmeans() object. The output data frame has additional 1 column, describing cluster assignment with respect to each observation.
```{r}
data <- iris[,1:3]
kclust <- kmeans(data, centers = 3)
augment_kmeans(kclust, data)
```