01-intro.Rmd

# Introduction to R {#intro}

It is often said that the majority of time is spent on data cleaning and 20% on analysis. A common trap when start using R is that the library have not been installed or the data are in different folder to the working directory. 
Installing library is easy with the command _install.packages_. The command _getwd()_ will tell you which directory you are in.

Sometimes, the user encounter issues with R. This is not an uncommon problem and often not due to serious errors but forgetting that R is case-sensitive. Also 
when copying codes from the net across, R does not recognise the “inverted 
comma” but has its own "inverted comma".  Unlike Python, R does not tolerate 
space between variables. R also does not like variables to be named starting 
with a number.

R treats the variables in certain ways depend on their class. Some function requires that the variables are numeric. The Breast Cancer data from mlbench has 11 columns with the 9 covariates being factors and ordered factors. These issues will be dealt further in the chapter on data wrangling. One way to get an 
overview of the structure of the data is to use the _glimpse_ function from _dplyr_.

Another issue is that some libraries have assign certain names to their function and which is also used by other libraries. In the course of opening multiple libraries, R may be confused and use the last library open and which may lead to error result. We will illustrate this with the _forest_ function in the 
Statistics chapter.

A selling point of R is that there is a large community of users who often post their solutions online. it's worth spending time on _stack overflow_ to look at similar problems that others have and the approaches to solving them. A clue to pose questions is to look at the error. If you wish to post questions for others to help then it's encouraged that the question include some data so that the person providing the solution can understand the nature of the errors. Useful blogs are also available at https://www.r-bloggers.com/.

This chapter focusses on the _ggplot2_ and its related libraries for plotting [@Wickham2016]. Base R also has plotting function but lacks the flexibility of _ggplot2_. Plotting is introduced early to enage clinicians who may not have the patience read the following chapter on data wrangling prior. The book is not intended to be used in a sequential fashion as the reader may find elements of this chapter relevant to them and jump to another chapter such as chapter 3 on statistics. 

## Plot using base R

Below we illustrate bar plot using base R.

```{r 01-intro-1}
data("Leukemia", package="Stat2Data")
#AML dataset-treatment
colnames(Leukemia)
#base R
hist(Leukemia$Age, 8, xlab = "Age",main="Acute Myeloid Leukemia Dataset")
```

Line plot can be performed in base R using _abline_ argument. The font is 
defined by _cex_ argument and colour defined by _col_ argument. The title is defined by _main_ argument. The X and Y axes can be labelled using _xlab_ and _ylab_ arguments within the _plot_ function. Below we illustrate example of drawing a line to a point and in the segment below we illustrate example of drawing a line.

```{r 01-intro-1-1}
#set parameters of plot
X=seq(0,1,by=.1) #define a set of point from 0 to 1, separated by 0.1.
Y=seq(0,1,by=.1)

#define A and B
A=.065; B=.44

#location of point
plot(X,Y, main="ROC curve")
points(A,B,pch=8,col="red",cex=2) #add point
abline(coef = c(0,1)) #add diagonal line

#draw line to a point
segments(x0=0,y0=0,x1=A,y1=B,col="blue")
segments(x0=A,y0=B,x1=1,y=1,col="blue")
```

This is an illustration of using base R to plot regression line. Later, we will illustrate using _geom_smooth_ call from _ggplot2_.

```{r 01-intro-1-2}
TPR_1=.44; FPR_1=.065

plot(X,Y,main="Likelihood ratio graph", xlab="1-Specificity",ylab="Sensitivity",cex=.25)
  points(.02,.8,pch=8,col="red",cex=2) #add point
  df1<-data.frame(c1=c(0,TPR_1),c2=c(0,FPR_1))
  reg1<-lm(c1~c2,data=df1)
  df2<-data.frame(c1=c(TPR_1,1),c2=c(FPR_1,1))
  reg2<-lm(c1~c2,data=df2)
  abline(reg1) #draw line using coefficient reg1
  abline(reg2) #draw line using coefficient reg2
  text(x=FPR_1,y=TPR_1+.3,label="Superior",cex=.7)
  text(x=FPR_1+.2,y=TPR_1+.2,label="Absence",cex=.7)
  text(x=.0125,y=TPR_1-.1,label="Presence",cex=.7)
  text(x=FPR_1+.1,y=TPR_1,label="Inferior",cex=.7)
  
```

Function can be plotted using variations of the above. This requires a formula 
to describe variable _Y_. Shading in base R is performed with the _polygon_ function.

```{r 01-intro-2}

AUC_Logistic<-function (A,B,C,D){
  #binary data
  #A=True pos %B=False positive   %C=False negative   %D=True negative
  
  TPR=A/(A+C)
  FPR=1-(D/(D+B))
  
  #binomial distribution sqrt(np(1-p))
  #Statist. Med. 2002; 21:1237-1256 (DOI: 10.1002/sim.1099)
  STDa=sqrt((A+C)*TPR*(1-TPR));
  STDn=sqrt((B+D)*FPR*(1-FPR));
  a=STDn/STDa;
  theta=log((TPR/(1-TPR))/((FPR/(1-FPR))^a));
  
  #define a set of point from 0 to 1, separated by 0.001.
  X=seq(0,1,by=0.001)  
  
  #logistic regression model
  Y1=(X^a)/(X^a+(((1-X)^a)*exp(-1*theta)));
  AUC=round(pracma::trapz(X,Y1),2)
  AUC

#SE using Hanley & McNeil
#Preferred method if more than one TPR,FPR data point known
#Hanley is less conservative than Bamber
  Nn=B+D;
  Na=A+C;
  Q1=AUC/(2-AUC);
  Q2=2*AUC^2/(1+AUC);
  
SEhanley=sqrt(((AUC*(1-AUC))+((Na-1)*(Q1-AUC^2))+((Nn-1)*(Q2-AUC^2)))/(Na*Nn))

#SE using Bamber
#Ns is the number of patients in the smallest group
if (A+C>B+D) {
  Ns=B+D
  } else {
   Ns=A+C
  }
SEbamber=sqrt((AUC*(1-AUC))/(Ns-1))

# plot smoothed ROC
plot(X,Y1,main="ROC curve", xlab="1-Specificity",ylab="Sensitivity",cex=.25)
points(FPR,TPR,pch=8,col="red",cex=2) #add point

Y2=0
polygon(c(X[X >= 0 & X <= 1],
          rev(X[X >= 0 & X <= 1])),
        c(Y1[X >= 0 & X <= 1],
          rev(Y2[X >= 0 & X <= 1])),
        col = "#6BD7AF")


print(paste("The Area under the ROC curve using the logistic function is", 
            AUC,". The Area under the ROC curve using rank sum method is", round(.5*(TPR+(1-FPR)),2)))
}

AUC_Logistic(10,20,2,68)
```

## ggplot2

The plot below uses _ggplot2_ or grammar of graphics. The plot is built layer by layer like constructing a sentence. Plotting is a distinct advantage of R over commercial software with GUI (graphical user interface) like _SPSS_. A wide variety of media organisations (BBC, Economist) are using _ggplot2_ with their 
own stylised theme. The plot has a certain structure such the name of the data 
and aesthetics for x and y axes. For illustration purpose the aesthetics are labelled with x and y statements. The fill argument in the aesthetics indicate 
the variables for coloring. The colors chosen for this graph were imposed by the journal _Epilepsia_ [@pmid30577087]. To run the examples, check that you have install the libraries. an error can occurred if you don't have the required library. The meta-character _#_ is used to signal that the line is meant to comment the code ie R will not read it. The _install.packages_ command only need to be run once.   

### Histogram

The flexibility of _ggplot2_ is shown here in this histogram. The legend can be altered using the _scale_fill_manual_ function. If other colours are preferred then under _values_ add the preferred colours. 


There are different ways to use _ggplot2_: quick plot or _qplot_ with limited options and full _ggplot2_ with all the options. The choice of the method 
depends on individual preference and as well as reason for plotting.

```{r 01-intro-3}
library(ggplot2)
#qplot
qplot(Age, data=Leukemia, bins=8)
```

Now a more complex version of histogram with ggplot with color added.

```{r 01-intro-3-1}
#ggplot2
ggplot(data=Leukemia,aes(Age,fill=as.factor(Resp)))+
  geom_histogram(bins=8)
```

Note the legend is a bit untidy. The legend can be changed using _scale_fill_manual_. The color can be specified using rgb argument. This is important as some journals prefer certain color.

```{r 01-intro-3-1-2}
#adding legend, changing the values 0 and 1 to treatment response
ggplot(data=Leukemia,aes(Age,fill=as.factor(Resp)))+
  geom_histogram(bins=8)+
  scale_fill_manual(name="Response",values=c("#999999","#56B4E9"),
                    breaks=c(0,1),labels=c("No Treatment","Treatment"))

```

Adding title is easy with _ggtitle_.

```{r 01-intro-3-3}
#adding title
ggplot(data=Leukemia,aes(Age,fill=as.factor(Resp)))+
  geom_histogram(bins=8)+
  scale_fill_manual(name="Response",
                    values=c("#555555","#56B4E9"),
                    breaks=c(0,1),
                    labels=c("No Treatment","Treatment"))+
  ggtitle("Acute Myeloid Leukemia Treatment dataset")
```

### Bar plot

Previously, we had used base R for bar plot. Here we use the _geom_bar_ argument in ggplot.

```{r  01-intro-4}
library(ggplot2)
library(extrafont)

#this data came from a paper published in Epilespia 2019 on cost of looking 
#after patients with pseudoseizure 

##                              Lower  Upper
##                    percentage 0.95   0.95  
## Duration_PNES.lmg  0.0974     0.0057 0.2906
## pseudostatus.lmg   0.2283     0.0687 0.3753
## Hx_anxiety.lmg     0.0471     0.0057 0.1457
## Hx_depression.lmg  0.0059     0.0041 0.1082
## DSP.lmg            0.0582     0.0071 0.1500
## seizure_burden.lmg 0.0179     0.0041 0.1058
## sex.lmg            0.0413     0.0030 0.1519
df<-data.frame("Predictors"=c("Duration_PNES","pseudostatus","Hx_anxiety",
"Hx_depression","DSP","seizure_burden","sex"),
    "Importance"=c(0.09737,0.22825, 0.047137,0.00487,0.058153,0.01786,0.04131),
    "Lower.95"=c(0.0057,0.0687,0.0057,0.0041,0.0071,0.0041,0.0030),
    "Upper.95"=c(0.2906,0.3753,0.1457,0.1082,0.1500,0.1058,0.1519))

#check dimensions of data frame
dim(df)

#check variables in data frame
colnames(df)

#bar plot uses geom_bar 
ggplot(df, aes(x=Predictors,y=Importance))+
  geom_bar(stat="identity")
```

This bar plot may be considered as untidy as the variables have not been sorted. Reordering the data requires ordering the factors. The colors were requested by the journal Epilesia in order to avoid recognitin of the bar from color 
blindness.

```{r 01-intro-4-1}
#reordering the data
df3<-df2<-df
df3$Predictors<-factor(df2$Predictors, levels=df2[order(df$Importance),"Predictors"])
#adding color
p<-ggplot(df3, aes(x=Predictors,y=Importance,fill=Predictors))+
  geom_bar(colour="black",
    stat="identity",
    fill=
      c("#e4b84b","#ce8080","#511c23","#e37c1d","#ffde75","#abb47d","#a1c5cb"))
p
```

This bar plot is now ordered but the labels on the axis seem to run into each other. One solution is to tile the axis title using _element_text_. Note that 
the text size can also be specified within this argument.

```{r 01-intro-4-2}
#rotate legend on x axis label by 45
p+theme(axis.title.y = element_text(face="bold",  size=12),
        axis.title.x = element_text(face="bold", size=12), 
        axis.text.x  = element_text(angle=45, vjust=0.5, size=10))
```

The title can be broken up using the backward slash.

```{r 01-intro-4-3}
#adding break in title
p1<-p+geom_errorbar(aes(ymin=Lower.95,ymax=Upper.95,width=0.2))+
  labs(y="R2 exaplained (%)")+
  theme(text=element_text(size=10))+
  ggtitle(" Relative Importance of Regressors \n Cost among patients with non-epileptic seizure")
p1
```


### Pie chart

This example uses the data above on contribution of non-epileptic seizure variables to hospitalised cost.


```{r 01-intro-5}
library(ggplot2)
df3$Percentage=round(df3$Importance/sum(df3$Importance)*100,0)
ggplot(df3, aes(x="" ,y=Percentage,fill=Predictors))+
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() 
```


### Scatter plot

The above data is used here to illustrate scatter plot. We can denote the color difference among the Predictors by adding _color_ argument in the _aesthetics_. 

```{r 01-intro-6}
#color in qplot
qplot(data=df3, Predictors, Importance,color=Predictors)
```

Adding color in ggplot is the same as in qplot.

```{r 01-intro-6-1}
#color ggplot
ggplot(df3, aes(x=Predictors,y=Importance,color=Predictors))+
  geom_point()+
  theme(axis.title.y = element_text(face="bold",  size=10),
        axis.title.x = element_text(face="bold", size=10), 
        axis.text.x  = element_text(angle=45, vjust=0.5, size=10))
```

The size argument within _aes_ can be used like color.  In this case, it's used 
to denote the importance of the predictors.

```{r 01-intro-6-2}
#size
ggplot(df3, aes(x=Predictors,y=Importance,color=Predictors,size=Predictors))+
  geom_point()+
  theme(axis.title.y = element_text(face="bold",  size=12),
        axis.title.x = element_text(face="bold", size=12), 
        axis.text.x  = element_text(angle=45, vjust=0.5, size=12))
```

This is a more complicated example of scatter plot combined with formula of regression line. The _paste0_ function is used to add the equation to the plot. The data comes from GBD 2016 publication on lifetime risk of stroke. A 
comparison with plotting from base R is also provided. 

```{r 01-intro-6-3, warning=F}

library(tidyverse)
load("./Data-Use/world_stroke.Rda")
#fitting a regression model
fit<-lm(MeanLifetimeRisk~LifeExpectancy,data=world_sfdf)
fitsum<-summary(fit)
#base R scatter plot with fitted line
x=world_sfdf$LifeExpectancy #define x
y=world_sfdf$MeanLifetimeRisk #define y
plot(x,y, data=world_sfdf, main = "Lifetime Stroke Risk",
     xlab = "Life Expectancy", ylab = "Life time Risk",
     pch = 19)
abline(lm(y ~ x, data = world_sfdf), col = "blue")
```
The ggplot version is now provided. Note that the line is fitted in the _geom_smooth_ argument. An interesting aspect of this plot is that the data can 
be describe as heterosecadic in which the variance changes throughout the plot. 

```{r 01-intro-6-4, warning=F}
#ggplot2 scatter plot with fitted line
SR<-ggplot(world_sfdf,  aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
  geom_smooth(method="lm")+geom_point()+
  xlab("Life Expectancy")+
  ggtitle(paste0("Life time Risk", "=", 
                 round(fitsum$coefficients[1],2),"+",
                 round(fitsum$coefficients[2],2)," x ","Life Expectancy"))
SR
```

To use the name of the country as label, we use _geom_text_. The world_sfdf data is now partitioned to show only data from Europe. An interesting pattern emerge. There is clumping of the data around Belgium and Portugal. The _nudge_x_ and _nudge_y_ function are used to adjust the labels and the _size_ argument adjust the label.

```{r 01-intro-6-4-1, warning =F}
library(tidyverse)
library(sf)

world_sfdf %>% filter(Continent=="EUROPE") %>%
ggplot( aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
  geom_smooth(method="lm")+
  xlab("Life Expectancy")+
  geom_text(aes(label=Country, nudge_x=.35, nudge_y=.5, avoid_overlap=T), 
            size=2)

```


In order to understand the reason for deviations from the fitted line above, it 
is possible possible to add additional step to explore the relationship for each income group. This graph illustrates that the high income countries have a 
ceiling in the relationship between lifetime risk and life expectancy from age 
of 70 onward.

```{r 01-intro-6-4-2, warning=F}
SRIncome<-ggplot(world_sfdf, aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
  geom_smooth(method="lm", 
              aes(group=Income, linetype=Income, colour= Income))+ 
  geom_point()+
  xlab("Life Expectancy")

SRIncome
```

### arrange plot in grids


Plots can be arrange in tabular format for presentation or journal submission.In base R multiple plots can be combined using _par_ function and specify the number of columns by _mfrow_. The number of columns can be specified with _ncol_ call when using _gridExtra_ library.

```{r 01-intro-6-5, warning=F}
#Leukemia data
par(mfrow=c(1,2)) #row of 1 and 2 columns

x=Leukemia$Age #define x
y=Leukemia$Blasts #define y

plot(x,y, data=Leukemia, main = "Leukemia data",
     xlab = "Age", ylab = "Blasts",
     pch = 19)
abline(lm(y ~ x, data = Leukemia), col = "blue")

y1=Leukemia$Smear
plot(x,y1, data=Leukemia, main = "Leukemia data",
     xlab = "Age", ylab = "Smear",
     pch = 19)
abline(lm(y1 ~ x, data = Leukemia), col = "blue")

```

```{r 01-intro-6-6, warning=F}
library(gridExtra)

SRContinent<-ggplot(world_sfdf, aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
  geom_smooth(method="lm", 
              aes(group=Continent, linetype=Continent, colour= Continent))+
  geom_point()+
  xlab("Life Expectancy")+
  ylim(c(0,50))

grid.arrange(SRIncome, SRContinent, ncol=1) 
```

An alternative way to display multiple plots is to use _patchwork_ library.

```{r 01-intro-6-6-1, warning=F}
library(patchwork)
SRIncome/SRContinent
```

### Line plot

The data below is generated in the section on data simulation. The data were simulated using summary data from recent clot retrieval trials in stroke [@pmid25517348,@pmid25671797]

```{r 01-intro-7}
library(ggplot2)
library(dplyr)
dtTime<-read.csv("./Data-Use/dtTime_simulated.csv")
dtTrial<-read.csv("./Data-Use/dtTrial_simulated.csv")

#summarise data using group_by
dtTime2<-dtTime %>%
  group_by(period, T) %>%
  summarise(meanY=mean(Y),
            sdY=sd(Y),
            upperY=meanY+sdY,
            lowerY=meanY-sdY)

#individual line plot
ggplot(dtTime,aes(x=as.factor(period),y=Y))+
  geom_line(aes(color=as.factor(T),group=id))+
  scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")+ylab("NIHSS")
```

The line plot can also be represented as boxplot without the connecting lines.

```{r 01-intro-8}
#box plot
gg<-ggplot(dtTime,aes(x=as.factor(period),y=Y))+
  geom_boxplot(aes(color=as.factor(T)))+  
  xlab("Time")+ylab("NIHSS")
gg+scale_fill_discrete(name="Treatment")

```

To perform line plot on the grouped data, first fit the regression line to the grouped data.

```{r 01-intro-8-1}

#linear regression Y1 predict Y2 where Y1 and Y2 are grouped data 
#from the simulated data above.
fit<-lm(Y1~Y0+T, data=dtTrial)

dtTrial2<-filter(dtTrial, T==1)

fit2<-lm(Y2~Y1, data=dtTrial2)

#line plot by group
pd <- position_dodge2(width = 0.2) # move them .2 to the left and right
gbase  = ggplot(dtTime2, aes(y=meanY, colour=as.factor(T))) + geom_errorbar(aes(ymin=lowerY, ymax=upperY), width=.3, position=pd)+
geom_point(position=pd) 
gline = gbase + geom_line(position=pd) 

dtTime2$period=as.numeric(dtTime2$period)
unique((dtTime2$period))

gline = gline %+% dtTime2

print(gline + aes(x=period))
```

### Facet wrap

Facet wrap is a good way to visually explore different aspects pf the data. 
Using the dtTime data above, the plots are separated by trial assignment.

```{r 01-intro-9}
ggplot(dtTime,aes(x=as.factor(period),y=Y))+
  geom_line(aes(color=as.factor(T),group=id))+
  scale_color_manual(values = c("#e38e17", "#8e17e3"))+ 
  facet_wrap(~T)+ 
  xlab("Time")+ylab("NIHSS")
```

### Polygons

The _geom_polygon_ is often used in thematic plot of maps. It can be used to 
show polygons outside of map. It requires one data frame for coordinate and another for the values. 

```{r 01-intro-10}

#simulate data
ids <- factor(c("1", "2", "3", "4","5","6"))

values <- data.frame(
  id = ids,
  value = c(3, 3.1, 3.2, 3.3,3.4,3.5)
)

a=seq(0,1.5,by=0.5)
x=c(a,a-.5,a+.5,a+.2, a-.3,a+.1)

positions <- data.frame(
  id = rep(ids, each = 4),
  x=c(a,a-.5,a+.5,a+.2, a-.3,a+.1),
  y=sample(x)
  )

# Currently we need to manually merge the two together
comb <- merge(values, positions, by = c("id"))

p <- ggplot(comb, aes(x = x, y = y)) +
  geom_polygon(aes(fill = value, group = id))
  
p



```

### Gantt chart

Gantt chart can be used to illustrate project timeline. It needs a minimum of 4 data columns: Activity, Project, a start date and end date. This example below 
is meant as a template. If you have 6 rows of data then add 2 extra rows of data including colours.

```{r 01-intro-11}
library(tidyverse)
gantt_df<-data.frame(item=seq(1:4), 
    activity=c("Ethics submission","Length","Recruitment","Follow up"),
    category=c("Ethics","Duration","Recruitment","Follow up"),
    Start=c("2020-06-01","2021-01-01","2021-01-01","2022-01-01"),
    End=c("2021-01-01","2023-01-01","2022-01-01","2023-01-01"))

#Set factor level to order the activities on the plot
gantt_df <- mutate(gantt_df,
      activity=factor(activity,levels=activity[1:nrow(gantt_df)]),
      category=factor(category,levels=category[1:nrow(gantt_df)]))  
plot_gantt <- qplot(ymin = Start,
                    ymax = End,
                    x = activity,
                    colour = category,
                    geom = "linerange",
                    data = gantt_df,
                    size = I(10)) +    #width of line
    scale_colour_manual(values = c("blue", "red", "purple", "yellow")) +
    coord_flip() +
    xlab("") +
    ylab("") +
    ggtitle("Project plan")
plot_gantt

```

### Heatmap

The _ggplot2_ library can also be used for creating heatmap. This plot uses the _geom_tile_ function.

```{r 01-intro-12}
library(ggplot2)
library(plyr)
library(reshape)
library(scales)

#swiss fertility dataset from 1888
data(swiss, package = "datasets") 

swiss$swiss_canton<-row.names(swiss) #assign column name to row name
rownames(swiss)<-NULL #remove row name
data.m <- melt(swiss)
 data.m <- ddply(data.m, .(variable), transform, rescale = rescale(value))
 q <- ggplot(data.m, aes(variable, swiss_canton)) + 
         geom_tile(aes(fill = rescale), colour = "white")+
        scale_fill_gradient(low = "white", high = "steelblue")+
   ggtitle("Swiss Fertility Data 1888")
 q
```

## ggplot2 extra

The following plots retains the framework of ggplot2. Their uses require installing additional libraries.

### Alluvial and Sankey diagram

The Sankey flow diagram uses the width of the arrow used to indicate the flow rate. It is often used to indicate energy dissipation in the system. There are several libraries providing Sankey plot such as _networkD3_ library. Alluvial 
plot is a subset of Sankey diagram but differs in having a structured workflow. The _ggalluvial_ library is chosen here for demonstration as it forms part of the _ggplot2_ framework. 


```{r 01-intro-13, warning=F}
library(ggalluvial)
library(Stat2Data)
data("Leukemia") #treatment of leukemia
#partition Age into 8 ordered factors
Leukemia$AgeCat<-cut_interval(Leukemia$Age, n=8, ordered_result=TRUE)
#axis1
ggplot(data=Leukemia, aes (y=Smear,axis1=AgeCat, axis2=Resp,axis3=Status))+
  geom_alluvium(aes(fill=AgeCat),width = 1/12)+
  geom_label(stat = "stratum", infer.label = TRUE) +
  geom_label(stat = "stratum", infer.label = TRUE)+
  scale_x_discrete(limits = c("AgeCat","Resp", "Status"),label=c("Age Category","Treatment Response","Mortality"), expand = c(.1, .1)) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  ggtitle("Outcome after Leukemia Treatment")
```

### Survival plot

The _survminer_ library extends _ggplot2_ style to survival plot. It requires several libraries such as _survival_ for survival analysis and _lubridate_ to parse time. A description of [survival analysis](Survival analysis) is provided 
in the Statistics section.

```{r 01-intro-14, warning=F}
library(survminer)
library(lubridate)
library(survival)
 
data(cancer, package="survival")

#data from survival package on NCCTG lung cancer trial
#https://stat.ethz.ch/R-manual/R-devel/library/survival/html/lung.html
#time in days
#status cesored=1, dead=2sex:
#sex:Male=1 Female=2
sfit<- survfit(Surv(time, status) ~ sex, data = cancer)
ggsurvplot(sfit, data=cancer,
           surv.median.line = "hv",
           pval=TRUE,
           pval.size=3, 
           conf.int = TRUE,
           legend.labs=c("Male","Female"),xlab="Time (Days)", 
           break.time.by=50,
           font.x=5,
           font.y=5,
           ggtheme = theme_bw(),  
           risk.table=T,
           risk.table.font=2, #adjust font
           risk.table.height=.3  #adjust table height
           )
```

### ggraph and tidygraph

The _igraph_ library does the heavy lifting in graph theory analysis. This 
aspect will be expanded on in the chapter on Graph Theory. However, the plotting function with _igraph_ is still not optimal. The _ggraph_ and _tidygraph_ libraries extend the _ggplot2_ style to graph.

```{r 01-intro-15}
library(tidygraph)
library(ggraph) 

#relationship among members of acute stroke team
tpa<-readr::read_csv("./Data-Use/TPA_edge010817.csv")%>%
  rename(from=V1, to=V2)

#node by degree centrality
graph<-as_tbl_graph(tpa) %>%  mutate(degree = centrality_degree())
ggraph(graph, layout = 'fr') + 
 geom_edge_link() + 
  #label size by degree centrality
  geom_node_point(aes(size=degree))+
  #label node
  geom_node_text(aes(label=name),repel=T)+
  ggtitle("Acute Stroke Team Network")
```

### ggparty-decision tree

Decision tree can be plotted using _ggplot2_ and _ggparty_ framework. This 
example uses data from the Breast Cancer dataset. It explores the effect of different histological features on class of breast tissue type. 

```{r 01-intro-16}
library(ggparty)
library(partykit)
library(tidyverse)

data("BreastCancer",package = "mlbench")

#check data type
str(BreastCancer)

BC<- BreastCancer %>% select(-Id)

treeBC<-ctree(Class~., data=BC, control = ctree_control(testtype = "Teststatistic"))

#plot tree using plot from base R
plot(treeBC)
#plot tree using ggparty

ggparty(treeBC) +
  geom_edge() +
  geom_edge_label() +
  geom_node_label(aes(label = splitvar),
                  ids = "inner") +
  geom_node_label(aes(label = info),
                  ids = "terminal")

ggparty(treeBC) +
  geom_edge() +
  geom_edge_label() +
  # map color to level and size to nodesize for all nodes
  geom_node_splitvar(aes(col = factor(level),
                         size = nodesize)) +
  geom_node_info(aes(col = factor(level),
                     size = nodesize))

```



### Map 

Several simple examples are provided here. They illustrate the different 
plotting methods used according to the type of data. It is important to check 
the structure of the data using _class()_ function.

#### Thematic map

The _ggplot2_ library can also be used to generate thematic (choropleth) map. 
Here we use _map_data_ function from _ggplot2_ to obtain a map of USA. Geocoded data are contained in the _long_ and _lat_ columns. The US map data is in 
standard dataframe format. In this case, the _geom_map_ function is used for mapping. The _USArrests_ data contains a column for murder, assault, rape and urban population. The assault data presented here is normalised by the 
population data. This section will be expanded further in the _Geospatial Analysis_ chapter.

```{r 01-intro-18}
library(dplyr)
library(ggplot2)
arrest<-data("USArrests") 
arrest<-USArrests%>% add_rownames("region") %>% 
  mutate(region=tolower(region))
US <- map_data("state") 
map_assault<-ggplot()+ 
  geom_map(data=US, map=US,
                    aes(x=long, y=lat, map_id=region),
                    fill="#ffffff", color="#ffffff", size=0.15)+
  #add USArrests data here
  geom_map(data=arrest, map=US,
              aes(fill=Assault/UrbanPop, map_id=region),
                    color="#ffffff", size=0.15)+
  scale_fill_continuous(low='thistle2', high='darkred', 
                                 guide='colorbar')
map_assault
```

This is a more complex example and uses state by state COVID-19 data CDC 
website. Steps to extract the COVID-10 is shown in the next chapter. The 
shapefile for USA can also be extracted from _tigris_ library. A challenge with  plotting a map of US is that the country extends far North to Alaska and East to pacific islands.

```{r 01-intro-18-1}
library(ggplot2)
library(dplyr)
covid<-read.csv("./Data-Use/Covid_bystate_Table130420.csv") %>% mutate(region=str_to_lower(Jurisdiction))

map_covid<-ggplot()+ 
  geom_map(data=US, map=US,
                    aes(x=long, y=lat, map_id=region),
                    fill="#ffffff", color="#ffffff", size=0.15)+
  #add covid data here
  geom_map(data=covid, map=US,
              aes(fill=CumulativeIncidence31.03.20, map_id=region),
                    color="#ffffff", size=0.15)+
  scale_fill_continuous(low='thistle2', high='darkred', 
                                 guide='colorbar')
map_covid
```

In the simple example below we will generate a map of Australian State 
territories color by size of area. The _ggplot2_ combines with _sf_ library and uses the shape file data in the _geom_sf_ call. 

```{r 01-intro-18-2}
library(ggplot2)
library(sf)
#shape file
Aust<-st_read("./Data-Use/GCCSA_2016_AUST.shp") 
colnames(Aust) #find out column variables
ggplot() + 
  geom_sf(data=Aust,aes(fill=AREASQKM16))+
  labs(x="Longitude (WGS84)", y="Latitude", title="Map of Australia") + 
  theme_bw() 
```

#### tmap

The _tmap_ library works in conjunction with _ggplot2_ and _sf_. The _tm_shape_ function takes in the shape data. The _tm_polygon_ function color the shape file with the column data of interest.

```{r 01-intro-19}
library(tmap)
load("./Data-Use/world_stroke.Rda")
#data from GBD 2016 investigators
colnames(world_sfdf) 
class(world_sfdf) #contains simple features
#map of country income
m<-tm_shape(world_sfdf, projection = "+proj=eck4")+tm_polygons("Income")
m

#save object as png
#tmap_save(m,file="world_income.png"")

#save as leaflet object
#tmap_save(m,file="world_income.html"")
```


map of lifetime  stroke risk

```{r 01-intro-19-1}
n<-tm_shape(world_sfdf, projection = "+proj=eck4")+tm_polygons("MeanLifetimeRisk")
n
```

#### Voronoi

The _ggplot2_ and _sf_ libraries are extended to include drawing of voronoi. Voronoi is a special type of polygon. It can be seen as a mathematical approach 
to partition regions such that all points within the polygons are closest (depending on distance metrics) to the seed points. Voronoi has been used in disease mapping (John Snow mapping of Broad Street Cholera outbreak) and meteorology (Alfred Thiessen polygon method for measuring rainfall in basin). 
This is a more complex coding task. It uses the _geom_voronoi_ call from _ggvoronoi_ library. Some libraries have vignettes to help you implement the codes. The vignette in the _ggvoronoi_ library can be called using _vignette("ggvoronoi")_. The _osmdata_ library will be used to provide map from OpenStreetMap. A related library is _OpenStreetMap_. The latter library uses raster file whereas _osmdata_ provides vectorised map data. In the chapter on Geospatial Analysis we will expand on this theme with interactive map. One issue with the use of voronoi is that there are infinite sets and so a boundary needs to set. In the example below, the boundary was set for Greater Melbourne.

```{r 01-intro-20}
library(dplyr)
library(ggvoronoi)
library(ggplot2)
library(sf)

#subset data with dplyr for metropolitan melbourne
msclinic<-read.csv("./Data-Use/msclinic.csv") %>% filter(clinic==1, metropolitan==1)

Victoria<-st_read("./Data-Use/GCCSA_2016_AUST.shp") %>% filter(STE_NAME16=="Victoria") 

m<-ggplot(msclinic)+
  geom_point(aes(x=lon,y=lat))+ 
  #add hospital location
  geom_voronoi(aes(x=lon,y=lat,fill=distance),fill=NA, color="black")+ 
  #create voronoi from hospital location
  geom_sf(data=Victoria,aes(fill=AREASQKM16)) +
  labs(x="Longitude (WGS84)", y="Latitude", 
       title="Voronoi Map of MS Clinics in Melbourne")
m  
```
This map is not so useful as the features of Victoria overwhelm the features of Greater Melbourne. 


### ggwordcloud

The _ggwordcloud_ library extend the _ggplot2_ family to creating wordcloud. The following is an illustration of wordcloud created from comments on Youtube video "Stroke Heroes Act Fast".

```{r 01-intro-21}
library(ggwordcloud)
library(tm)
library(tidyverse)
heroes_df<-read.csv("./Data-Use/Feb_01_1_49_59 PM_2018_AEDT_YoutubeData.csv",stringsAsFactors = FALSE)
#cleaning data
keywords <- heroes_df$Comment 
keywords <- iconv(keywords, to = 'utf-8')
#create corpus
myCorpus <- VCorpus(VectorSource(keywords))
#lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
#remove numer
myCorpus <- tm_map(myCorpus, removeNumbers)
#remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
#remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"),lazy=TRUE) 
#remove white space
myCorpus <- tm_map(myCorpus, stripWhitespace, lazy=TRUE)
#term document matrix
dtm <- DocumentTermMatrix(myCorpus,control = list(wordLengths=c(3, 20)))
#remove sparse terms
dtm<-removeSparseTerms(dtm, 0.95)
#remove words of length <=3
tdm=TermDocumentMatrix(myCorpus,
                       control = list(minWordLength=4,maxWordLength=20) )
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
#remove words with frequency <=1
d <- data.frame(word = names(v),freq=v) %>% filter(freq>1)
#wordcloud
ggplot(data = d, 
       aes(label = word, size = freq, col = as.character(freq))) + 
  geom_text_wordcloud(rm_outside = TRUE, max_steps = 1,
                      grid_size = 1, eccentricity = .8)+
    scale_size_area(max_size = 12)+
    scale_color_brewer(palette = "Paired", direction = -1)+
  theme_void()
```

### gganimate

The library _gganimate_ can be used to generate videos in the form of gif file. The data needs to be collated from wide to long format. For the purpose of this book, the code has been turned off.

```{r 01-intro-22, eval=FALSE}
library(ggplot2)
library(gganimate)

#ensure data in long format, Y =NIHSS
u <- ggplot(dtTime2, aes(period, meanY , color = T, frame = period)) +
  geom_bar(stat="identity") +
  geom_text(aes(x = 11.9, label = period), hjust = 0) +   xlim(0,13)+
  coord_cartesian(clip = 'off') + 
  facet_wrap(~meanY)+
  labs(title = 'ECR Trials', y = 'Number') +
transition_reveal(period)
u
#create gif
#animate(u,fps=15,duration=15)
#anim_save(u,file="./Data-Use/simulated_ECR_trial.gif", width=800, height=800)
```

### ggneuro

There are several ways of plotting mri scans. The _ggneuro_ library is 
illustrated here as it relates to the _ggplot2_ family. The Colin $T_1$ scan is 
a high resolution scan from MNI.

```{r 01-intro-23}
#devtools::install_github("muschellij2/ggneuro")
library(ggneuro)
library(neurobase)
colin_1mm<-untar("./Data-Use/colin_1mm.tgz")
colinIm<-readNIfTI("colin_1mm") 
ggortho(colinIm)
```

## plotly

Plotly has its own API and uses _Dash_ to upload the figure on the web. It has additional ability for interaction as well as create a video. Examples are provided with calling plotly directly or via _ggplot2_. In the examples below, 
the plots are performed using _ggplot2_ and then pass onto _plotly_ using _ggplotly_ function. The example uses the Leukemia dataset from _Stat2Data_ library. 

### Scatter plot with plotly

The plotly syntax uses a _~_ after the _=_ symbol to identify a variable for plotting.

```{r 01-intro-24}
library(plotly)
library(Stat2Data)
data("Leukemia") #treatment of leukemia
#scatter plot directly from plotly
plot_ly(x=~Age,y=~Smear, #percentage of blast
        color=~as.factor(Status),  #dead or alive
        symbol=~Resp, 
        symbols=c('circle' ,'square'), #Response to treatment
        data=Leukemia) 

```

### Bar plot with plotly

Plotly can uses a ggplot object directly.

```{r 01-intro-24-1}
#using the epilepsy data in p generated above for bar plot
ggplotly(p)
```

### Heatmap

```{r 01-intro-24-2}

#using swiss data above to create heatmap
ggplotly(q) 
```

### map

```{r 01-intro-24-3}
library(ggplot2)
library(plotly)
library(dplyr)

covid<-read.csv("./Data-Use/Covid_bystate_Table130420.csv") %>% mutate(region=str_to_lower(Jurisdiction)) #consistent

ggplotly(
  ggplot()+ 
  geom_map(data=US, map=US,
                    aes(x=long, y=lat, map_id=region),
                    fill="#ffffff", color="#ffffff", size=0.15)+
  #add covid data here
  geom_map(data=covid, map=US,
              aes(fill=NumberCases31.03.20, map_id=region),
                    color="#ffffff", size=0.15)+
  scale_fill_continuous(low='thistle2', high='darkred', 
                                 guide='colorbar')
)
```

You can write citations, too. For example, we are using the **bookdown** package [@R-bookdown] in this sample book, which was built on top of R Markdown and **knitr** [@xie2015].