R’s rich ecosystem of packages, statistical analysis capabilities, reproducibility features, and active community make it a powerful tool for data manipulation and visualization. Its versatility and flexibility make it suitable for a wide range of data analysis tasks, from exploratory data analysis to advanced statistical modeling.
R offers a wide range of packages and functions for data manipulation,
transformation, and cleaning. Apart from the multitude of base-r
functions such as apply
and loop constructs, The dplyr
and
tidyverse
packages also provide a powerful set of tools for filtering,
sorting, aggregating, and joining datasets. These packages make it easy
to perform complex data manipulations efficiently.
In R, you can use a for
loop to iterate over a sequence of values or
elements and perform a set of operations repeatedly.
# Iterate over a sequence of numbers
for (i in 1:5) {
# Print the current value of i
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
You can also use a for loop to iterate over elements of a vector, list, or any other iterable object. Here’s an example:
# Iterate over a vector samples
samples <- c("control-1", "control-2", "treatment-1", "treatment-2")
for (s in samples) {
#print each sample name
print(s)
}
## [1] "control-1"
## [1] "control-2"
## [1] "treatment-1"
## [1] "treatment-2"
To illustrate a nested loop example on clinical data using a data frame in R, let’s assume you have a dataset with clinical information about patients, including their ID, blood pressure, and cholesterol levels. You want to perform a specific operation on each patient’s data using a nested loop structure. Here’s an example:
# Sample clinical data
clinical_data <- data.frame(id = c(1, 2, 3),
blood_pressure = c(120, 130, 140),
cholesterol = c(180, 200, 220) )
print(clinical_data)
## id blood_pressure cholesterol
## 1 1 120 180
## 2 2 130 200
## 3 3 140 220
#Nested loop to iterate over each patient's data
for (i in 1:nrow(clinical_data)) {
# Print patient ID
cat("Patient ID:", clinical_data$id[i], "\n")
# Inner loop for specific operations on each patient's data
for (j in 2:ncol(clinical_data)) {
# Perform a specific operation (e.g., print blood pressure and cholesterol)
cat(colnames(clinical_data)[j], ":", clinical_data[i, j], "\n")
}
cat("\n")
}
## Patient ID: 1
## blood_pressure : 120
## cholesterol : 180
##
## Patient ID: 2
## blood_pressure : 130
## cholesterol : 200
##
## Patient ID: 3
## blood_pressure : 140
## cholesterol : 220
Next we will use IF/ELSE statements to check if any patient has high blood pressure and to prescibe a specific medication in the case of high BP.
# For loop with a conditional statement to perform a specific operation on each patient's data
for (i in 1:nrow(clinical_data)) {
# Check if blood pressure is above
if (clinical_data$blood_pressure[i] > 130) {
# If blood pressure is above 130, print patient ID and blood pressure level
cat("Patient ID:", clinical_data$id[i], "\n")
cat("Cholesterol level:", clinical_data$cholesterol[i], "\n")
cat("High blood pressure:", clinical_data$blood_pressure[i], "\n")
cat("Prescription:", "Zestril", "\n")
} else {
# If blood pressure is not above 130, print patient ID and cholesterol level cat("Patient ID:", clinical_data$id[i], "\n")
cat("Patient ID:", clinical_data$id[i], "\n")
cat("Cholesterol level:", clinical_data$cholesterol[i], "\n")
}
cat("\n")
}
## Patient ID: 1
## Cholesterol level: 180
##
## Patient ID: 2
## Cholesterol level: 200
##
## Patient ID: 3
## Cholesterol level: 220
## High blood pressure: 140
## Prescription: Zestril
The apply()
function in R allows you to apply a specified function to
each row or column of a data frame or matrix. It usually runs much
faster than for loops and is more concise to code, but may limited
flexibility when dealing with more complicated data manipulation
procedures.
Here we will apply the R mean() function to calculate the average cholesterol and blood pressure for the patients
# Apply function to calculate the mean for each column column_means
column_means <- apply(clinical_data, 2, mean)
print(column_means)
## id blood_pressure cholesterol
## 2 130 200
#> id blood_pressure cholesterol
#> 2 130 200
One of the main advantages of R is the extensive data visualization functionality. ggplot2 is the most widely used package and follows a layered approach, where you can add different layers to a plot to build up the desired visualization, making it very flexible and easy to modify plots
#example dataframe showing blood pressure increases with patient age
df <- data.frame( age = c(45, 56, 34, 67, 51, 42, 59, 38, 48),
BP = c(120, 130, 125, 145, 135, 115, 140, 130, 125))
#intall and load ggplot2
#install.packages('ggplot2')
library('ggplot2')
#generate scatterplot
ggplot(data = df, aes(x = age, y = BP)) +
geom_point(size=4, col='blue') +
labs(x = "age", y = "BP") +
ggtitle("Scatter Plot: age vs BP")
For barplots use the ggplot()
function to set up the plot and the
geom_bar()
function to add the bars.
# Create a dataframe
df <- data.frame( condition = c("Condition A", "Condition B", "Condition C"),
count = c(20, 30, 15))
# Create the bar plot
ggplot(data = df, aes(x = condition, y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(x = "Condition", y = "Count") +
ggtitle("Bar Plot")
#####. 2.3. Basic boxplots
ggplot2 can also be used to generate boxplots. In this example, we
create a dataframe named df
with two columns: group
and value
. The
group
column represents the different groups in the clinical study
(e.g., treatment groups or control groups), and the value
column
represents the corresponding measurements or observations.
# Create a dataframe
df <- data.frame( group = rep(c("Group A", "Group B", "Group C"), each = 50),
value = c(rnorm(50, mean = 10, sd = 2),
rnorm(50, mean = 15, sd = 3), rnorm(50, mean = 12, sd = 1.5)) )
#generate boxplots
ggplot(data = df, aes(x = group, y = value)) +
geom_boxplot(fill = "steelblue", color = "black") +
labs(x = "Group", y = "Value") +
ggtitle("Boxplot of treatment groups")
The geom_boxplot()
function adds the boxplots to the plot. By default,
the boxplots show the median, interquartile range, and whiskers. You can
customize the appearance of the boxplots by using additional arguments
in geom_boxplot()
, such as fill
for the color of the boxes and
color
for the color of the outlines.
#####. 2.4. Adding trendlines
Using geom_smooth() it is relatively straightforward to add trendlines to line plot. Here a linear model fit is used, but others such as local polynormal regression fitting (loess) can be used.
patients <- c("Patient 1", "Patient 2", "Patient 3", "Patient 4", "Patient 5")
values1 <- c(10, 25, 30, 40, 50)
values2 <- c(20, 39, 25, 43, 60)
#Create a scatter plot with trendline using ggplot2
data <- data.frame(patients, values1, values2)
ggplot(data, aes(x = values1, y = values2)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("TPM Gene A") +
ylab("TPM Gene B") +
ggtitle("Gene Expression Scatter Plot with Trendline")
## `geom_smooth()` using formula = 'y ~ x'
We can also add standard error layer to the graph
ggplot(data, aes(x = values1, y = values2)) +
geom_point() +
geom_smooth(method = "lm", se=TRUE) +
xlab("TPM Gene A") +
ylab("TPM Gene B") +
ggtitle("Gene Expression Scatter Plot with Trendline and SE")
## `geom_smooth()` using formula = 'y ~ x'
#####. 2.5. Advanced Boxplots
scale_fill_brewer() from the ColorBrewer package contains predefined color schemes which can be used to colour the boxplots plots. The theme() function is used for changing plot aesthetics such as gridlines and background with predefined styles. In the case below the ‘minimal_theme’ was used
library('reshape2')
library('readr')
library('dplyr')
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
num_genes <- 100
num_samples <- 10
#Generate random gene expression data
gene_expression <- matrix(rnorm(num_genes * num_samples), nrow = num_genes, ncol = num_samples)
#Convert gene expression data to a data frame
gene_expression_df <- as.data.frame(gene_expression)
gene_expression_melted <- melt(gene_expression_df)
## No id variables; using all as measure variables
#Create colored boxplots using ggplot2
boxplot_colored <- ggplot(gene_expression_melted,
aes(x = variable, y = value, fill=variable)) +
geom_boxplot() +
scale_fill_brewer(palette = 'RdPu', n=10) +
labs(x = "Samples", y = "Gene Expression", title = "Colored Boxplots of Gene Expression Data") +
theme_minimal()
boxplot_colored
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette RdPu is 9
## Returning the palette you asked for with that many colors
#####. 2.6. Advanced plotting: Dot plots for scRNA-seq data.
Gene expression dotplots are often used to visualize Single cell RNA-seq data. Colour intensity = log2(count+1) Dot size = proportion of cells expressing gene
Apart from ggplot2, other packages are used to generate the dotplot: Cowplot for improved aesthetics Viridis for colour schemes dplyr for modifying columns
library('dplyr')
library('cowplot')
gene_cluster <- read_tsv('https://github.com/davemcg/davemcg.github.io/raw/master/content/post/scRNA_dotplot_data.tsv.gz')
markers <- gene_cluster$Gene %>% unique()
gene_cluster %>% filter(Gene %in% markers) %>%
mutate(`% Expressing` = (cell_exp_ct/cell_ct) * 100) %>%
filter(count > 0, `% Expressing` > 1) %>%
ggplot(aes(x=cluster, y = Gene, color = count, size = `% Expressing`)) +
geom_point() +
cowplot::theme_cowplot() +
theme(axis.line = element_blank()) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylab('') +
theme(axis.ticks = element_blank()) +
scale_color_gradientn(colours = viridis:: viridis (20), limits = c(0,4), oob = scales::squish, name = 'log2 (count + 12')