The tidyverse is a user-friendly suite of R packages designed
to make data analysis simpler, less error-prone and more enjoyable. This
package contains a host of helper functions designed to minimize
keystrokes, and to overcome some common pain-points in plotting, data
summarization and environment management.
To install the development version, if required first install devtools
install.packages('devtools')
To install tidyExt:
devtools::install_github('bansell/tidyExt')
Load tidyverse and tidyExt:
library(tidyverse)
library(tidyExt)
To manage source code files across multiple systems and folders, it can be useful to quickly print the entire path for the source file you are working in, without quotes for quick copy/paste.
printScriptDir()
## [1] "~/Git/tidyExt/tidyExt_vignette.Rmd"
NB run from your .R or .Rmd file, this function will not return the characters “## [1]” in the console.
Certain tidyverse functions like rename() and select() often conflict
with function names from other packages. If many packages are loaded, to
reset the tidyverse functions as the default, after use
fix_tidyverse_conflicts()
. Thanks to Jacob Munro for this one.
fix_tidyverse_conflicts()
Make boxplots with overlaid datapoints. There is no jitter in the y axis in order to accurately represent data values.
mpg |> ggplot2::ggplot(aes(x=class, y=cty)) +
geom_boxjitter( point_size = 2, point_col='dodger blue')
Make nested boxplots with overlaid datapoints. There is no jitter in the y axis in order to accurately represent data values.
mpg |> ggplot2::ggplot(aes(x=class, y=cty, col=interaction(drv,cyl))) +
geom_boxdodge()
…sorry. Here are some useful statistics shortcuts:
Adds a linear regression line to scatter plot and calls ggpubr to print the line equation and p value
mpg |> ggplot2::ggplot(aes(cty,hwy)) + geom_point() + geom_smooth_lm()
## `geom_smooth()` using formula 'y ~ x'
A wrapper for scale() that returns a single vector to use within
dplyr::mutate()
etc. This function is copied from
here.
scale() output:
diamonds |> mutate(table_scale = scale(table)) |> select(table_scale) |> str()
## tibble [53,940 × 1] (S3: tbl_df/tbl/data.frame)
## $ table_scale: num [1:53940, 1] -1.1 1.586 3.376 0.243 0.243 ...
## ..- attr(*, "scaled:center")= num 57.5
## ..- attr(*, "scaled:scale")= num 2.23
scale_this() output:
diamonds |> mutate(table_scale = scale_this(table)) |> select(table_scale) |> str()
## tibble [53,940 × 1] (S3: tbl_df/tbl/data.frame)
## $ table_scale: num [1:53940] -1.1 1.586 3.376 0.243 0.243 ...
A way to simultaneously count and sort the relative proportion of character data in descending order. Simple example:
diamonds |> sort_pct(cut)
## # A tibble: 5 x 3
## cut n pct
## <ord> <int> <dbl>
## 1 Ideal 21551 0.400
## 2 Premium 13791 0.256
## 3 Very Good 12082 0.224
## 4 Good 4906 0.0910
## 5 Fair 1610 0.0298
More complex:
diamonds |> sort_pct(cut,color)
## # A tibble: 35 x 4
## cut color n pct
## <ord> <ord> <int> <dbl>
## 1 Ideal G 4884 0.0905
## 2 Ideal E 3903 0.0724
## 3 Ideal F 3826 0.0709
## 4 Ideal H 3115 0.0577
## 5 Premium G 2924 0.0542
## 6 Ideal D 2834 0.0525
## 7 Very Good E 2400 0.0445
## 8 Premium H 2360 0.0438
## 9 Premium E 2337 0.0433
## 10 Premium F 2331 0.0432
## # … with 25 more rows
Minimize keystrokes for common plot label and legend modifications
sample_n(diamonds,1000) |> ggplot2::ggplot(aes(x=carat,y=price, col=clarity)) + geom_point() + bottom_legend()
sample_n(diamonds,1000) |> ggplot2::ggplot(aes(x=carat,y=price, col=clarity)) + geom_point() + no_legend()
Set x labels at any angle. 30° by default.
mpg |> ggplot2::ggplot(aes(x=manufacturer,y=hwy)) + geom_boxjitter() + x_angle()
This function is useful when you want to make scatterplots (for example, PCA plots) coloured by multiple different factors. The colour space is rapidly exhausted and important plotting information is lost. For example:
my_df <- mpg |> mutate(year=factor(year), cyl=factor(cyl))
my_df |> gather(key,value,year,cyl,drv,manufacturer) |>
ggplot2::ggplot(aes(cty,hwy,col=value)) + geom_point() + facet_wrap(~key,ncol=2) +
bottom_legend()
It is very hard to distinguish the data from years 1999 vs 2008, and the figure legend is a jumble of labels from all facets.
To handle this, we recycling the default ggcolour scale to maximize the contrast in each facet. Caution: be sure to check the legend under each plot to avoid confusing the colour encodings between facets.
First create a vector containing the column names of interest for colouring points in the scatterplot
my_features <- c('year','drv','cyl','manufacturer')
my_df <- mpg |> mutate(year=factor(year), cyl=factor(cyl))
plot_cycle_cols(df = my_df, X='cty',Y='hwy', myLabel = 'manufacturer', colour_vec = my_features)
Creating and modifying colour scales can be hard work in ggplot2. These functions help to print the HEX codes and display the swatch for the selected colours, from default ggplot2 or RColorBrewer palettes.
default_GG_col(12)
## [1] "#F8766D" "#DE8C00" "#B79F00" "#7CAE00" "#00BA38" "#00C08B" "#00BFC4"
## [8] "#00B4F0" "#619CFF" "#C77CFF" "#F564E3" "#FF64B0"
First check out the palette information to see all of the available Brewer palettes.
RColorBrewer::brewer.pal.info
## maxcolors category colorblind
## BrBG 11 div TRUE
## PiYG 11 div TRUE
## PRGn 11 div TRUE
## PuOr 11 div TRUE
## RdBu 11 div TRUE
## RdGy 11 div FALSE
## RdYlBu 11 div TRUE
## RdYlGn 11 div FALSE
## Spectral 11 div FALSE
## Accent 8 qual FALSE
## Dark2 8 qual TRUE
## Paired 12 qual TRUE
## Pastel1 9 qual FALSE
## Pastel2 8 qual FALSE
## Set1 9 qual FALSE
## Set2 8 qual TRUE
## Set3 12 qual FALSE
## Blues 9 seq TRUE
## BuGn 9 seq TRUE
## BuPu 9 seq TRUE
## GnBu 9 seq TRUE
## Greens 9 seq TRUE
## Greys 9 seq TRUE
## Oranges 9 seq TRUE
## OrRd 9 seq TRUE
## PuBu 9 seq TRUE
## PuBuGn 9 seq TRUE
## PuRd 9 seq TRUE
## Purples 9 seq TRUE
## RdPu 9 seq TRUE
## Reds 9 seq TRUE
## YlGn 9 seq TRUE
## YlGnBu 9 seq TRUE
## YlOrBr 9 seq TRUE
## YlOrRd 9 seq TRUE
brewer_GG_col(6,'Blues')
## [1] "#EFF3FF" "#C6DBEF" "#9ECAE1" "#6BAED6" "#3182BD" "#08519C"
brewer_GG_col(4,'Paired')
## [1] "#A6CEE3" "#1F78B4" "#B2DF8A" "#33A02C"
brewer_GG_col(4,'RdYlBu')
## [1] "#D7191C" "#FDAE61" "#ABD9E9" "#2C7BB6"
The utils::head()
function will print all column names which can flood
the console. For large matrices in particular, its often useful to check
the top left corner of the matrix. bighead(n)
prints a square data
frame of dimensions n X n.
diamond_mat <- as.matrix(diamonds[sample(1000), ])
diamond_mat |> bighead()
## # A tibble: 6 x 6
## carat cut color clarity depth table
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 0.64 Ideal G VVS1 61.9 56.0
## 2 0.76 Premium E SI1 61.8 58.0
## 3 0.70 Ideal D SI1 59.7 58.0
## 4 0.25 Very Good E VS2 63.3 60.0
## 5 0.32 Ideal I VVS1 62.0 55.3
## 6 0.70 Ideal D SI1 61.0 59.0
NB this will return an 8x8 data frame as default when called from the console, an .R or .Rmd file.
The default console output for tidyverse tables is to display 6 rows of
data. Use print_all()
to output the entire table in the console. This
is useful for data frames of intermediate size (7-100 rows) instead of
modifying print()
or using View()
.
mpg |> print_all()
NB Not run here. This will print the entire table to the console when called from the console, an .R or .Rmd file.
We hope these functions are useful for making your daily R coding work quicker and easier! Please don’t hesitate to modify for your own use or suggest updates through github.