-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy path01-intro.Rmd
991 lines (768 loc) · 36.2 KB
/
01-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
# Introduction to R {#intro}
It is often said that the majority of time is spent on data cleaning and 20% on analysis. A common trap when start using R is that the library have not been installed or the data are in different folder to the working directory.
Installing library is easy with the command _install.packages_. The command _getwd()_ will tell you which directory you are in.
Sometimes, the user encounter issues with R. This is not an uncommon problem and often not due to serious errors but forgetting that R is case-sensitive. Also
when copying codes from the net across, R does not recognise the “inverted
comma” but has its own "inverted comma". Unlike Python, R does not tolerate
space between variables. R also does not like variables to be named starting
with a number.
R treats the variables in certain ways depend on their class. Some function requires that the variables are numeric. The Breast Cancer data from mlbench has 11 columns with the 9 covariates being factors and ordered factors. These issues will be dealt further in the chapter on data wrangling. One way to get an
overview of the structure of the data is to use the _glimpse_ function from _dplyr_.
Another issue is that some libraries have assign certain names to their function and which is also used by other libraries. In the course of opening multiple libraries, R may be confused and use the last library open and which may lead to error result. We will illustrate this with the _forest_ function in the
Statistics chapter.
A selling point of R is that there is a large community of users who often post their solutions online. it's worth spending time on _stack overflow_ to look at similar problems that others have and the approaches to solving them. A clue to pose questions is to look at the error. If you wish to post questions for others to help then it's encouraged that the question include some data so that the person providing the solution can understand the nature of the errors. Useful blogs are also available at https://www.r-bloggers.com/.
This chapter focusses on the _ggplot2_ and its related libraries for plotting [@Wickham2016]. Base R also has plotting function but lacks the flexibility of _ggplot2_. Plotting is introduced early to enage clinicians who may not have the patience read the following chapter on data wrangling prior. The book is not intended to be used in a sequential fashion as the reader may find elements of this chapter relevant to them and jump to another chapter such as chapter 3 on statistics.
## Plot using base R
Below we illustrate bar plot using base R.
```{r 01-intro-1}
data("Leukemia", package="Stat2Data")
#AML dataset-treatment
colnames(Leukemia)
#base R
hist(Leukemia$Age, 8, xlab = "Age",main="Acute Myeloid Leukemia Dataset")
```
Line plot can be performed in base R using _abline_ argument. The font is
defined by _cex_ argument and colour defined by _col_ argument. The title is defined by _main_ argument. The X and Y axes can be labelled using _xlab_ and _ylab_ arguments within the _plot_ function. Below we illustrate example of drawing a line to a point and in the segment below we illustrate example of drawing a line.
```{r 01-intro-1-1}
#set parameters of plot
X=seq(0,1,by=.1) #define a set of point from 0 to 1, separated by 0.1.
Y=seq(0,1,by=.1)
#define A and B
A=.065; B=.44
#location of point
plot(X,Y, main="ROC curve")
points(A,B,pch=8,col="red",cex=2) #add point
abline(coef = c(0,1)) #add diagonal line
#draw line to a point
segments(x0=0,y0=0,x1=A,y1=B,col="blue")
segments(x0=A,y0=B,x1=1,y=1,col="blue")
```
This is an illustration of using base R to plot regression line. Later, we will illustrate using _geom_smooth_ call from _ggplot2_.
```{r 01-intro-1-2}
TPR_1=.44; FPR_1=.065
plot(X,Y,main="Likelihood ratio graph", xlab="1-Specificity",ylab="Sensitivity",cex=.25)
points(.02,.8,pch=8,col="red",cex=2) #add point
df1<-data.frame(c1=c(0,TPR_1),c2=c(0,FPR_1))
reg1<-lm(c1~c2,data=df1)
df2<-data.frame(c1=c(TPR_1,1),c2=c(FPR_1,1))
reg2<-lm(c1~c2,data=df2)
abline(reg1) #draw line using coefficient reg1
abline(reg2) #draw line using coefficient reg2
text(x=FPR_1,y=TPR_1+.3,label="Superior",cex=.7)
text(x=FPR_1+.2,y=TPR_1+.2,label="Absence",cex=.7)
text(x=.0125,y=TPR_1-.1,label="Presence",cex=.7)
text(x=FPR_1+.1,y=TPR_1,label="Inferior",cex=.7)
```
Function can be plotted using variations of the above. This requires a formula
to describe variable _Y_. Shading in base R is performed with the _polygon_ function.
```{r 01-intro-2}
AUC_Logistic<-function (A,B,C,D){
#binary data
#A=True pos %B=False positive %C=False negative %D=True negative
TPR=A/(A+C)
FPR=1-(D/(D+B))
#binomial distribution sqrt(np(1-p))
#Statist. Med. 2002; 21:1237-1256 (DOI: 10.1002/sim.1099)
STDa=sqrt((A+C)*TPR*(1-TPR));
STDn=sqrt((B+D)*FPR*(1-FPR));
a=STDn/STDa;
theta=log((TPR/(1-TPR))/((FPR/(1-FPR))^a));
#define a set of point from 0 to 1, separated by 0.001.
X=seq(0,1,by=0.001)
#logistic regression model
Y1=(X^a)/(X^a+(((1-X)^a)*exp(-1*theta)));
AUC=round(pracma::trapz(X,Y1),2)
AUC
#SE using Hanley & McNeil
#Preferred method if more than one TPR,FPR data point known
#Hanley is less conservative than Bamber
Nn=B+D;
Na=A+C;
Q1=AUC/(2-AUC);
Q2=2*AUC^2/(1+AUC);
SEhanley=sqrt(((AUC*(1-AUC))+((Na-1)*(Q1-AUC^2))+((Nn-1)*(Q2-AUC^2)))/(Na*Nn))
#SE using Bamber
#Ns is the number of patients in the smallest group
if (A+C>B+D) {
Ns=B+D
} else {
Ns=A+C
}
SEbamber=sqrt((AUC*(1-AUC))/(Ns-1))
# plot smoothed ROC
plot(X,Y1,main="ROC curve", xlab="1-Specificity",ylab="Sensitivity",cex=.25)
points(FPR,TPR,pch=8,col="red",cex=2) #add point
Y2=0
polygon(c(X[X >= 0 & X <= 1],
rev(X[X >= 0 & X <= 1])),
c(Y1[X >= 0 & X <= 1],
rev(Y2[X >= 0 & X <= 1])),
col = "#6BD7AF")
print(paste("The Area under the ROC curve using the logistic function is",
AUC,". The Area under the ROC curve using rank sum method is", round(.5*(TPR+(1-FPR)),2)))
}
AUC_Logistic(10,20,2,68)
```
## ggplot2
The plot below uses _ggplot2_ or grammar of graphics. The plot is built layer by layer like constructing a sentence. Plotting is a distinct advantage of R over commercial software with GUI (graphical user interface) like _SPSS_. A wide variety of media organisations (BBC, Economist) are using _ggplot2_ with their
own stylised theme. The plot has a certain structure such the name of the data
and aesthetics for x and y axes. For illustration purpose the aesthetics are labelled with x and y statements. The fill argument in the aesthetics indicate
the variables for coloring. The colors chosen for this graph were imposed by the journal _Epilepsia_ [@pmid30577087]. To run the examples, check that you have install the libraries. an error can occurred if you don't have the required library. The meta-character _#_ is used to signal that the line is meant to comment the code ie R will not read it. The _install.packages_ command only need to be run once.
### Histogram
The flexibility of _ggplot2_ is shown here in this histogram. The legend can be altered using the _scale_fill_manual_ function. If other colours are preferred then under _values_ add the preferred colours.
There are different ways to use _ggplot2_: quick plot or _qplot_ with limited options and full _ggplot2_ with all the options. The choice of the method
depends on individual preference and as well as reason for plotting.
```{r 01-intro-3}
library(ggplot2)
#qplot
qplot(Age, data=Leukemia, bins=8)
```
Now a more complex version of histogram with ggplot with color added.
```{r 01-intro-3-1}
#ggplot2
ggplot(data=Leukemia,aes(Age,fill=as.factor(Resp)))+
geom_histogram(bins=8)
```
Note the legend is a bit untidy. The legend can be changed using _scale_fill_manual_. The color can be specified using rgb argument. This is important as some journals prefer certain color.
```{r 01-intro-3-1-2}
#adding legend, changing the values 0 and 1 to treatment response
ggplot(data=Leukemia,aes(Age,fill=as.factor(Resp)))+
geom_histogram(bins=8)+
scale_fill_manual(name="Response",values=c("#999999","#56B4E9"),
breaks=c(0,1),labels=c("No Treatment","Treatment"))
```
Adding title is easy with _ggtitle_.
```{r 01-intro-3-3}
#adding title
ggplot(data=Leukemia,aes(Age,fill=as.factor(Resp)))+
geom_histogram(bins=8)+
scale_fill_manual(name="Response",
values=c("#555555","#56B4E9"),
breaks=c(0,1),
labels=c("No Treatment","Treatment"))+
ggtitle("Acute Myeloid Leukemia Treatment dataset")
```
### Bar plot
Previously, we had used base R for bar plot. Here we use the _geom_bar_ argument in ggplot.
```{r 01-intro-4}
library(ggplot2)
library(extrafont)
#this data came from a paper published in Epilespia 2019 on cost of looking
#after patients with pseudoseizure
## Lower Upper
## percentage 0.95 0.95
## Duration_PNES.lmg 0.0974 0.0057 0.2906
## pseudostatus.lmg 0.2283 0.0687 0.3753
## Hx_anxiety.lmg 0.0471 0.0057 0.1457
## Hx_depression.lmg 0.0059 0.0041 0.1082
## DSP.lmg 0.0582 0.0071 0.1500
## seizure_burden.lmg 0.0179 0.0041 0.1058
## sex.lmg 0.0413 0.0030 0.1519
df<-data.frame("Predictors"=c("Duration_PNES","pseudostatus","Hx_anxiety",
"Hx_depression","DSP","seizure_burden","sex"),
"Importance"=c(0.09737,0.22825, 0.047137,0.00487,0.058153,0.01786,0.04131),
"Lower.95"=c(0.0057,0.0687,0.0057,0.0041,0.0071,0.0041,0.0030),
"Upper.95"=c(0.2906,0.3753,0.1457,0.1082,0.1500,0.1058,0.1519))
#check dimensions of data frame
dim(df)
#check variables in data frame
colnames(df)
#bar plot uses geom_bar
ggplot(df, aes(x=Predictors,y=Importance))+
geom_bar(stat="identity")
```
This bar plot may be considered as untidy as the variables have not been sorted. Reordering the data requires ordering the factors. The colors were requested by the journal Epilesia in order to avoid recognitin of the bar from color
blindness.
```{r 01-intro-4-1}
#reordering the data
df3<-df2<-df
df3$Predictors<-factor(df2$Predictors, levels=df2[order(df$Importance),"Predictors"])
#adding color
p<-ggplot(df3, aes(x=Predictors,y=Importance,fill=Predictors))+
geom_bar(colour="black",
stat="identity",
fill=
c("#e4b84b","#ce8080","#511c23","#e37c1d","#ffde75","#abb47d","#a1c5cb"))
p
```
This bar plot is now ordered but the labels on the axis seem to run into each other. One solution is to tile the axis title using _element_text_. Note that
the text size can also be specified within this argument.
```{r 01-intro-4-2}
#rotate legend on x axis label by 45
p+theme(axis.title.y = element_text(face="bold", size=12),
axis.title.x = element_text(face="bold", size=12),
axis.text.x = element_text(angle=45, vjust=0.5, size=10))
```
The title can be broken up using the backward slash.
```{r 01-intro-4-3}
#adding break in title
p1<-p+geom_errorbar(aes(ymin=Lower.95,ymax=Upper.95,width=0.2))+
labs(y="R2 exaplained (%)")+
theme(text=element_text(size=10))+
ggtitle(" Relative Importance of Regressors \n Cost among patients with non-epileptic seizure")
p1
```
### Pie chart
This example uses the data above on contribution of non-epileptic seizure variables to hospitalised cost.
```{r 01-intro-5}
library(ggplot2)
df3$Percentage=round(df3$Importance/sum(df3$Importance)*100,0)
ggplot(df3, aes(x="" ,y=Percentage,fill=Predictors))+
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0) +
theme_void()
```
### Scatter plot
The above data is used here to illustrate scatter plot. We can denote the color difference among the Predictors by adding _color_ argument in the _aesthetics_.
```{r 01-intro-6}
#color in qplot
qplot(data=df3, Predictors, Importance,color=Predictors)
```
Adding color in ggplot is the same as in qplot.
```{r 01-intro-6-1}
#color ggplot
ggplot(df3, aes(x=Predictors,y=Importance,color=Predictors))+
geom_point()+
theme(axis.title.y = element_text(face="bold", size=10),
axis.title.x = element_text(face="bold", size=10),
axis.text.x = element_text(angle=45, vjust=0.5, size=10))
```
The size argument within _aes_ can be used like color. In this case, it's used
to denote the importance of the predictors.
```{r 01-intro-6-2}
#size
ggplot(df3, aes(x=Predictors,y=Importance,color=Predictors,size=Predictors))+
geom_point()+
theme(axis.title.y = element_text(face="bold", size=12),
axis.title.x = element_text(face="bold", size=12),
axis.text.x = element_text(angle=45, vjust=0.5, size=12))
```
This is a more complicated example of scatter plot combined with formula of regression line. The _paste0_ function is used to add the equation to the plot. The data comes from GBD 2016 publication on lifetime risk of stroke. A
comparison with plotting from base R is also provided.
```{r 01-intro-6-3, warning=F}
library(tidyverse)
load("./Data-Use/world_stroke.Rda")
#fitting a regression model
fit<-lm(MeanLifetimeRisk~LifeExpectancy,data=world_sfdf)
fitsum<-summary(fit)
#base R scatter plot with fitted line
x=world_sfdf$LifeExpectancy #define x
y=world_sfdf$MeanLifetimeRisk #define y
plot(x,y, data=world_sfdf, main = "Lifetime Stroke Risk",
xlab = "Life Expectancy", ylab = "Life time Risk",
pch = 19)
abline(lm(y ~ x, data = world_sfdf), col = "blue")
```
The ggplot version is now provided. Note that the line is fitted in the _geom_smooth_ argument. An interesting aspect of this plot is that the data can
be describe as heterosecadic in which the variance changes throughout the plot.
```{r 01-intro-6-4, warning=F}
#ggplot2 scatter plot with fitted line
SR<-ggplot(world_sfdf, aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
geom_smooth(method="lm")+geom_point()+
xlab("Life Expectancy")+
ggtitle(paste0("Life time Risk", "=",
round(fitsum$coefficients[1],2),"+",
round(fitsum$coefficients[2],2)," x ","Life Expectancy"))
SR
```
To use the name of the country as label, we use _geom_text_. The world_sfdf data is now partitioned to show only data from Europe. An interesting pattern emerge. There is clumping of the data around Belgium and Portugal. The _nudge_x_ and _nudge_y_ function are used to adjust the labels and the _size_ argument adjust the label.
```{r 01-intro-6-4-1, warning =F}
library(tidyverse)
library(sf)
world_sfdf %>% filter(Continent=="EUROPE") %>%
ggplot( aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
geom_smooth(method="lm")+
xlab("Life Expectancy")+
geom_text(aes(label=Country, nudge_x=.35, nudge_y=.5, avoid_overlap=T),
size=2)
```
In order to understand the reason for deviations from the fitted line above, it
is possible possible to add additional step to explore the relationship for each income group. This graph illustrates that the high income countries have a
ceiling in the relationship between lifetime risk and life expectancy from age
of 70 onward.
```{r 01-intro-6-4-2, warning=F}
SRIncome<-ggplot(world_sfdf, aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
geom_smooth(method="lm",
aes(group=Income, linetype=Income, colour= Income))+
geom_point()+
xlab("Life Expectancy")
SRIncome
```
### arrange plot in grids
Plots can be arrange in tabular format for presentation or journal submission.In base R multiple plots can be combined using _par_ function and specify the number of columns by _mfrow_. The number of columns can be specified with _ncol_ call when using _gridExtra_ library.
```{r 01-intro-6-5, warning=F}
#Leukemia data
par(mfrow=c(1,2)) #row of 1 and 2 columns
x=Leukemia$Age #define x
y=Leukemia$Blasts #define y
plot(x,y, data=Leukemia, main = "Leukemia data",
xlab = "Age", ylab = "Blasts",
pch = 19)
abline(lm(y ~ x, data = Leukemia), col = "blue")
y1=Leukemia$Smear
plot(x,y1, data=Leukemia, main = "Leukemia data",
xlab = "Age", ylab = "Smear",
pch = 19)
abline(lm(y1 ~ x, data = Leukemia), col = "blue")
```
```{r 01-intro-6-6, warning=F}
library(gridExtra)
SRContinent<-ggplot(world_sfdf, aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
geom_smooth(method="lm",
aes(group=Continent, linetype=Continent, colour= Continent))+
geom_point()+
xlab("Life Expectancy")+
ylim(c(0,50))
grid.arrange(SRIncome, SRContinent, ncol=1)
```
An alternative way to display multiple plots is to use _patchwork_ library.
```{r 01-intro-6-6-1, warning=F}
library(patchwork)
SRIncome/SRContinent
```
### Line plot
The data below is generated in the section on data simulation. The data were simulated using summary data from recent clot retrieval trials in stroke [@pmid25517348,@pmid25671797]
```{r 01-intro-7}
library(ggplot2)
library(dplyr)
dtTime<-read.csv("./Data-Use/dtTime_simulated.csv")
dtTrial<-read.csv("./Data-Use/dtTrial_simulated.csv")
#summarise data using group_by
dtTime2<-dtTime %>%
group_by(period, T) %>%
summarise(meanY=mean(Y),
sdY=sd(Y),
upperY=meanY+sdY,
lowerY=meanY-sdY)
#individual line plot
ggplot(dtTime,aes(x=as.factor(period),y=Y))+
geom_line(aes(color=as.factor(T),group=id))+
scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")+ylab("NIHSS")
```
The line plot can also be represented as boxplot without the connecting lines.
```{r 01-intro-8}
#box plot
gg<-ggplot(dtTime,aes(x=as.factor(period),y=Y))+
geom_boxplot(aes(color=as.factor(T)))+
xlab("Time")+ylab("NIHSS")
gg+scale_fill_discrete(name="Treatment")
```
To perform line plot on the grouped data, first fit the regression line to the grouped data.
```{r 01-intro-8-1}
#linear regression Y1 predict Y2 where Y1 and Y2 are grouped data
#from the simulated data above.
fit<-lm(Y1~Y0+T, data=dtTrial)
dtTrial2<-filter(dtTrial, T==1)
fit2<-lm(Y2~Y1, data=dtTrial2)
#line plot by group
pd <- position_dodge2(width = 0.2) # move them .2 to the left and right
gbase = ggplot(dtTime2, aes(y=meanY, colour=as.factor(T))) + geom_errorbar(aes(ymin=lowerY, ymax=upperY), width=.3, position=pd)+
geom_point(position=pd)
gline = gbase + geom_line(position=pd)
dtTime2$period=as.numeric(dtTime2$period)
unique((dtTime2$period))
gline = gline %+% dtTime2
print(gline + aes(x=period))
```
### Facet wrap
Facet wrap is a good way to visually explore different aspects pf the data.
Using the dtTime data above, the plots are separated by trial assignment.
```{r 01-intro-9}
ggplot(dtTime,aes(x=as.factor(period),y=Y))+
geom_line(aes(color=as.factor(T),group=id))+
scale_color_manual(values = c("#e38e17", "#8e17e3"))+
facet_wrap(~T)+
xlab("Time")+ylab("NIHSS")
```
### Polygons
The _geom_polygon_ is often used in thematic plot of maps. It can be used to
show polygons outside of map. It requires one data frame for coordinate and another for the values.
```{r 01-intro-10}
#simulate data
ids <- factor(c("1", "2", "3", "4","5","6"))
values <- data.frame(
id = ids,
value = c(3, 3.1, 3.2, 3.3,3.4,3.5)
)
a=seq(0,1.5,by=0.5)
x=c(a,a-.5,a+.5,a+.2, a-.3,a+.1)
positions <- data.frame(
id = rep(ids, each = 4),
x=c(a,a-.5,a+.5,a+.2, a-.3,a+.1),
y=sample(x)
)
# Currently we need to manually merge the two together
comb <- merge(values, positions, by = c("id"))
p <- ggplot(comb, aes(x = x, y = y)) +
geom_polygon(aes(fill = value, group = id))
p
```
### Gantt chart
Gantt chart can be used to illustrate project timeline. It needs a minimum of 4 data columns: Activity, Project, a start date and end date. This example below
is meant as a template. If you have 6 rows of data then add 2 extra rows of data including colours.
```{r 01-intro-11}
library(tidyverse)
gantt_df<-data.frame(item=seq(1:4),
activity=c("Ethics submission","Length","Recruitment","Follow up"),
category=c("Ethics","Duration","Recruitment","Follow up"),
Start=c("2020-06-01","2021-01-01","2021-01-01","2022-01-01"),
End=c("2021-01-01","2023-01-01","2022-01-01","2023-01-01"))
#Set factor level to order the activities on the plot
gantt_df <- mutate(gantt_df,
activity=factor(activity,levels=activity[1:nrow(gantt_df)]),
category=factor(category,levels=category[1:nrow(gantt_df)]))
plot_gantt <- qplot(ymin = Start,
ymax = End,
x = activity,
colour = category,
geom = "linerange",
data = gantt_df,
size = I(10)) + #width of line
scale_colour_manual(values = c("blue", "red", "purple", "yellow")) +
coord_flip() +
xlab("") +
ylab("") +
ggtitle("Project plan")
plot_gantt
```
### Heatmap
The _ggplot2_ library can also be used for creating heatmap. This plot uses the _geom_tile_ function.
```{r 01-intro-12}
library(ggplot2)
library(plyr)
library(reshape)
library(scales)
#swiss fertility dataset from 1888
data(swiss, package = "datasets")
swiss$swiss_canton<-row.names(swiss) #assign column name to row name
rownames(swiss)<-NULL #remove row name
data.m <- melt(swiss)
data.m <- ddply(data.m, .(variable), transform, rescale = rescale(value))
q <- ggplot(data.m, aes(variable, swiss_canton)) +
geom_tile(aes(fill = rescale), colour = "white")+
scale_fill_gradient(low = "white", high = "steelblue")+
ggtitle("Swiss Fertility Data 1888")
q
```
## ggplot2 extra
The following plots retains the framework of ggplot2. Their uses require installing additional libraries.
### Alluvial and Sankey diagram
The Sankey flow diagram uses the width of the arrow used to indicate the flow rate. It is often used to indicate energy dissipation in the system. There are several libraries providing Sankey plot such as _networkD3_ library. Alluvial
plot is a subset of Sankey diagram but differs in having a structured workflow. The _ggalluvial_ library is chosen here for demonstration as it forms part of the _ggplot2_ framework.
```{r 01-intro-13, warning=F}
library(ggalluvial)
library(Stat2Data)
data("Leukemia") #treatment of leukemia
#partition Age into 8 ordered factors
Leukemia$AgeCat<-cut_interval(Leukemia$Age, n=8, ordered_result=TRUE)
#axis1
ggplot(data=Leukemia, aes (y=Smear,axis1=AgeCat, axis2=Resp,axis3=Status))+
geom_alluvium(aes(fill=AgeCat),width = 1/12)+
geom_label(stat = "stratum", infer.label = TRUE) +
geom_label(stat = "stratum", infer.label = TRUE)+
scale_x_discrete(limits = c("AgeCat","Resp", "Status"),label=c("Age Category","Treatment Response","Mortality"), expand = c(.1, .1)) +
scale_fill_brewer(type = "qual", palette = "Set1") +
ggtitle("Outcome after Leukemia Treatment")
```
### Survival plot
The _survminer_ library extends _ggplot2_ style to survival plot. It requires several libraries such as _survival_ for survival analysis and _lubridate_ to parse time. A description of [survival analysis](Survival analysis) is provided
in the Statistics section.
```{r 01-intro-14, warning=F}
library(survminer)
library(lubridate)
library(survival)
data(cancer, package="survival")
#data from survival package on NCCTG lung cancer trial
#https://stat.ethz.ch/R-manual/R-devel/library/survival/html/lung.html
#time in days
#status cesored=1, dead=2sex:
#sex:Male=1 Female=2
sfit<- survfit(Surv(time, status) ~ sex, data = cancer)
ggsurvplot(sfit, data=cancer,
surv.median.line = "hv",
pval=TRUE,
pval.size=3,
conf.int = TRUE,
legend.labs=c("Male","Female"),xlab="Time (Days)",
break.time.by=50,
font.x=5,
font.y=5,
ggtheme = theme_bw(),
risk.table=T,
risk.table.font=2, #adjust font
risk.table.height=.3 #adjust table height
)
```
### ggraph and tidygraph
The _igraph_ library does the heavy lifting in graph theory analysis. This
aspect will be expanded on in the chapter on Graph Theory. However, the plotting function with _igraph_ is still not optimal. The _ggraph_ and _tidygraph_ libraries extend the _ggplot2_ style to graph.
```{r 01-intro-15}
library(tidygraph)
library(ggraph)
#relationship among members of acute stroke team
tpa<-readr::read_csv("./Data-Use/TPA_edge010817.csv")%>%
rename(from=V1, to=V2)
#node by degree centrality
graph<-as_tbl_graph(tpa) %>% mutate(degree = centrality_degree())
ggraph(graph, layout = 'fr') +
geom_edge_link() +
#label size by degree centrality
geom_node_point(aes(size=degree))+
#label node
geom_node_text(aes(label=name),repel=T)+
ggtitle("Acute Stroke Team Network")
```
### ggparty-decision tree
Decision tree can be plotted using _ggplot2_ and _ggparty_ framework. This
example uses data from the Breast Cancer dataset. It explores the effect of different histological features on class of breast tissue type.
```{r 01-intro-16}
library(ggparty)
library(partykit)
library(tidyverse)
data("BreastCancer",package = "mlbench")
#check data type
str(BreastCancer)
BC<- BreastCancer %>% select(-Id)
treeBC<-ctree(Class~., data=BC, control = ctree_control(testtype = "Teststatistic"))
#plot tree using plot from base R
plot(treeBC)
#plot tree using ggparty
ggparty(treeBC) +
geom_edge() +
geom_edge_label() +
geom_node_label(aes(label = splitvar),
ids = "inner") +
geom_node_label(aes(label = info),
ids = "terminal")
ggparty(treeBC) +
geom_edge() +
geom_edge_label() +
# map color to level and size to nodesize for all nodes
geom_node_splitvar(aes(col = factor(level),
size = nodesize)) +
geom_node_info(aes(col = factor(level),
size = nodesize))
```
### Map
Several simple examples are provided here. They illustrate the different
plotting methods used according to the type of data. It is important to check
the structure of the data using _class()_ function.
#### Thematic map
The _ggplot2_ library can also be used to generate thematic (choropleth) map.
Here we use _map_data_ function from _ggplot2_ to obtain a map of USA. Geocoded data are contained in the _long_ and _lat_ columns. The US map data is in
standard dataframe format. In this case, the _geom_map_ function is used for mapping. The _USArrests_ data contains a column for murder, assault, rape and urban population. The assault data presented here is normalised by the
population data. This section will be expanded further in the _Geospatial Analysis_ chapter.
```{r 01-intro-18}
library(dplyr)
library(ggplot2)
arrest<-data("USArrests")
arrest<-USArrests%>% add_rownames("region") %>%
mutate(region=tolower(region))
US <- map_data("state")
map_assault<-ggplot()+
geom_map(data=US, map=US,
aes(x=long, y=lat, map_id=region),
fill="#ffffff", color="#ffffff", size=0.15)+
#add USArrests data here
geom_map(data=arrest, map=US,
aes(fill=Assault/UrbanPop, map_id=region),
color="#ffffff", size=0.15)+
scale_fill_continuous(low='thistle2', high='darkred',
guide='colorbar')
map_assault
```
This is a more complex example and uses state by state COVID-19 data CDC
website. Steps to extract the COVID-10 is shown in the next chapter. The
shapefile for USA can also be extracted from _tigris_ library. A challenge with plotting a map of US is that the country extends far North to Alaska and East to pacific islands.
```{r 01-intro-18-1}
library(ggplot2)
library(dplyr)
covid<-read.csv("./Data-Use/Covid_bystate_Table130420.csv") %>% mutate(region=str_to_lower(Jurisdiction))
map_covid<-ggplot()+
geom_map(data=US, map=US,
aes(x=long, y=lat, map_id=region),
fill="#ffffff", color="#ffffff", size=0.15)+
#add covid data here
geom_map(data=covid, map=US,
aes(fill=CumulativeIncidence31.03.20, map_id=region),
color="#ffffff", size=0.15)+
scale_fill_continuous(low='thistle2', high='darkred',
guide='colorbar')
map_covid
```
In the simple example below we will generate a map of Australian State
territories color by size of area. The _ggplot2_ combines with _sf_ library and uses the shape file data in the _geom_sf_ call.
```{r 01-intro-18-2}
library(ggplot2)
library(sf)
#shape file
Aust<-st_read("./Data-Use/GCCSA_2016_AUST.shp")
colnames(Aust) #find out column variables
ggplot() +
geom_sf(data=Aust,aes(fill=AREASQKM16))+
labs(x="Longitude (WGS84)", y="Latitude", title="Map of Australia") +
theme_bw()
```
#### tmap
The _tmap_ library works in conjunction with _ggplot2_ and _sf_. The _tm_shape_ function takes in the shape data. The _tm_polygon_ function color the shape file with the column data of interest.
```{r 01-intro-19}
library(tmap)
load("./Data-Use/world_stroke.Rda")
#data from GBD 2016 investigators
colnames(world_sfdf)
class(world_sfdf) #contains simple features
#map of country income
m<-tm_shape(world_sfdf, projection = "+proj=eck4")+tm_polygons("Income")
m
#save object as png
#tmap_save(m,file="world_income.png"")
#save as leaflet object
#tmap_save(m,file="world_income.html"")
```
map of lifetime stroke risk
```{r 01-intro-19-1}
n<-tm_shape(world_sfdf, projection = "+proj=eck4")+tm_polygons("MeanLifetimeRisk")
n
```
#### Voronoi
The _ggplot2_ and _sf_ libraries are extended to include drawing of voronoi. Voronoi is a special type of polygon. It can be seen as a mathematical approach
to partition regions such that all points within the polygons are closest (depending on distance metrics) to the seed points. Voronoi has been used in disease mapping (John Snow mapping of Broad Street Cholera outbreak) and meteorology (Alfred Thiessen polygon method for measuring rainfall in basin).
This is a more complex coding task. It uses the _geom_voronoi_ call from _ggvoronoi_ library. Some libraries have vignettes to help you implement the codes. The vignette in the _ggvoronoi_ library can be called using _vignette("ggvoronoi")_. The _osmdata_ library will be used to provide map from OpenStreetMap. A related library is _OpenStreetMap_. The latter library uses raster file whereas _osmdata_ provides vectorised map data. In the chapter on Geospatial Analysis we will expand on this theme with interactive map. One issue with the use of voronoi is that there are infinite sets and so a boundary needs to set. In the example below, the boundary was set for Greater Melbourne.
```{r 01-intro-20}
library(dplyr)
library(ggvoronoi)
library(ggplot2)
library(sf)
#subset data with dplyr for metropolitan melbourne
msclinic<-read.csv("./Data-Use/msclinic.csv") %>% filter(clinic==1, metropolitan==1)
Victoria<-st_read("./Data-Use/GCCSA_2016_AUST.shp") %>% filter(STE_NAME16=="Victoria")
m<-ggplot(msclinic)+
geom_point(aes(x=lon,y=lat))+
#add hospital location
geom_voronoi(aes(x=lon,y=lat,fill=distance),fill=NA, color="black")+
#create voronoi from hospital location
geom_sf(data=Victoria,aes(fill=AREASQKM16)) +
labs(x="Longitude (WGS84)", y="Latitude",
title="Voronoi Map of MS Clinics in Melbourne")
m
```
This map is not so useful as the features of Victoria overwhelm the features of Greater Melbourne.
### ggwordcloud
The _ggwordcloud_ library extend the _ggplot2_ family to creating wordcloud. The following is an illustration of wordcloud created from comments on Youtube video "Stroke Heroes Act Fast".
```{r 01-intro-21}
library(ggwordcloud)
library(tm)
library(tidyverse)
heroes_df<-read.csv("./Data-Use/Feb_01_1_49_59 PM_2018_AEDT_YoutubeData.csv",stringsAsFactors = FALSE)
#cleaning data
keywords <- heroes_df$Comment
keywords <- iconv(keywords, to = 'utf-8')
#create corpus
myCorpus <- VCorpus(VectorSource(keywords))
#lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
#remove numer
myCorpus <- tm_map(myCorpus, removeNumbers)
#remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
#remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"),lazy=TRUE)
#remove white space
myCorpus <- tm_map(myCorpus, stripWhitespace, lazy=TRUE)
#term document matrix
dtm <- DocumentTermMatrix(myCorpus,control = list(wordLengths=c(3, 20)))
#remove sparse terms
dtm<-removeSparseTerms(dtm, 0.95)
#remove words of length <=3
tdm=TermDocumentMatrix(myCorpus,
control = list(minWordLength=4,maxWordLength=20) )
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
#remove words with frequency <=1
d <- data.frame(word = names(v),freq=v) %>% filter(freq>1)
#wordcloud
ggplot(data = d,
aes(label = word, size = freq, col = as.character(freq))) +
geom_text_wordcloud(rm_outside = TRUE, max_steps = 1,
grid_size = 1, eccentricity = .8)+
scale_size_area(max_size = 12)+
scale_color_brewer(palette = "Paired", direction = -1)+
theme_void()
```
### gganimate
The library _gganimate_ can be used to generate videos in the form of gif file. The data needs to be collated from wide to long format. For the purpose of this book, the code has been turned off.
```{r 01-intro-22, eval=FALSE}
library(ggplot2)
library(gganimate)
#ensure data in long format, Y =NIHSS
u <- ggplot(dtTime2, aes(period, meanY , color = T, frame = period)) +
geom_bar(stat="identity") +
geom_text(aes(x = 11.9, label = period), hjust = 0) + xlim(0,13)+
coord_cartesian(clip = 'off') +
facet_wrap(~meanY)+
labs(title = 'ECR Trials', y = 'Number') +
transition_reveal(period)
u
#create gif
#animate(u,fps=15,duration=15)
#anim_save(u,file="./Data-Use/simulated_ECR_trial.gif", width=800, height=800)
```
### ggneuro
There are several ways of plotting mri scans. The _ggneuro_ library is
illustrated here as it relates to the _ggplot2_ family. The Colin $T_1$ scan is
a high resolution scan from MNI.
```{r 01-intro-23}
#devtools::install_github("muschellij2/ggneuro")
library(ggneuro)
library(neurobase)
colin_1mm<-untar("./Data-Use/colin_1mm.tgz")
colinIm<-readNIfTI("colin_1mm")
ggortho(colinIm)
```
## plotly
Plotly has its own API and uses _Dash_ to upload the figure on the web. It has additional ability for interaction as well as create a video. Examples are provided with calling plotly directly or via _ggplot2_. In the examples below,
the plots are performed using _ggplot2_ and then pass onto _plotly_ using _ggplotly_ function. The example uses the Leukemia dataset from _Stat2Data_ library.
### Scatter plot with plotly
The plotly syntax uses a _~_ after the _=_ symbol to identify a variable for plotting.
```{r 01-intro-24}
library(plotly)
library(Stat2Data)
data("Leukemia") #treatment of leukemia
#scatter plot directly from plotly
plot_ly(x=~Age,y=~Smear, #percentage of blast
color=~as.factor(Status), #dead or alive
symbol=~Resp,
symbols=c('circle' ,'square'), #Response to treatment
data=Leukemia)
```
### Bar plot with plotly
Plotly can uses a ggplot object directly.
```{r 01-intro-24-1}
#using the epilepsy data in p generated above for bar plot
ggplotly(p)
```
### Heatmap
```{r 01-intro-24-2}
#using swiss data above to create heatmap
ggplotly(q)
```
### map
```{r 01-intro-24-3}
library(ggplot2)
library(plotly)
library(dplyr)
covid<-read.csv("./Data-Use/Covid_bystate_Table130420.csv") %>% mutate(region=str_to_lower(Jurisdiction)) #consistent
ggplotly(
ggplot()+
geom_map(data=US, map=US,
aes(x=long, y=lat, map_id=region),
fill="#ffffff", color="#ffffff", size=0.15)+
#add covid data here
geom_map(data=covid, map=US,
aes(fill=NumberCases31.03.20, map_id=region),
color="#ffffff", size=0.15)+
scale_fill_continuous(low='thistle2', high='darkred',
guide='colorbar')
)
```
You can write citations, too. For example, we are using the **bookdown** package [@R-bookdown] in this sample book, which was built on top of R Markdown and **knitr** [@xie2015].