forked from GreenhamLab/Brapa_R500_Circadian_Transcriptome
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Brapa_CircadianTranscriptome_Markdown.Rmd
executable file
·1580 lines (1209 loc) · 80.4 KB
/
Brapa_CircadianTranscriptome_Markdown.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: Expansion of the circadian transcriptome in _Brassica_ _rapa_ and genome wide diversification
of paralog expression patterns
author: "Ryan Sartor, Stevan Zorich and Kathleen Greenham"
date: "3/10/2020"
output:
html_document: default
pdf_document: default
---
```{r setup}
knitr::opts_chunk$set(fig.width=6, fig.height=3, warning=FALSE, message=FALSE, echo = TRUE)
```
Load the required libraries.
```{r loadLibs}
require(edgeR)
require(stringr)
require(ggplot2)
require(DiPALM)
require(rain)
require(WGCNA)
require(circlize)
```
## Getting Started
First, set some path variables. You will need to define the path where raw RNA-seq and annotation data files are stored (rawDataPath) and the path where you want outputs to go (outputPath).
```{r dataPathsHidden, include=FALSE}
rawDataPath = "/Users/greenham/Documents/UMN/Manuscripts/R500_Genome_Polyploidy_Clock/diPALM/Markdown/Markdown_GitHub/"
outputPath= "/Users/greenham/Documents/UMN/Manuscripts/R500_Genome_Polyploidy_Clock/diPALM/Markdown/Markdown_GitHub/Output/"
```
```{r load, include=FALSE}
# This block loads in some pre-run data objects to make this thing knit faster
load(file.path(rawDataPath,"combined.RData"))
```
```{r dataPaths, eval=FALSE}
rawDataPath = "~/Path/To/Input/Files/"
outputPath= "~/Path/To/Output/Directory/"
```
A collection of functions is included as a supplemental file. Put this code into the 'rawDataPath' and source this code.
```{r addFunctions}
source(file.path(rawDataPath,"Rfunctions.R"))
```
### Raw Data
The first dataset contains time-course transcript expression from the two different entrainment conditions; photocycles (LD) and thermocycles (HC). Following entrainment, plants were shifted to constant conditions (LLHH), 24 hours later, leaf tissue was sampled every 2 hours for 48 hours.
Raw counts from this data can be found on the NCBI GEO Accession: GSE123654 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123654).
Download the file: **'GSE123654_R500_LDHH_LLHC_rawCounts.csv'**.
Place the file in the 'rawDataPath'.
```{r loadLDHC}
LDHC<-read.csv(file.path(rawDataPath,"GSE123654_R500_LDHH_LLHC_rawCounts.csv"),stringsAsFactors = F, row.names = 1)
colnames(LDHC)<-gsub("ZT","ZT_",colnames(LDHC))
```
Load in the gene lengths (used for normalization). The .csv file is included as a supplemental table with the publication.
```{r geneLen}
BrLengths = read.csv(file.path(rawDataPath,"Brapa_GeneLengths.csv"),stringsAsFactors = F, row.names=1)
BrLengths<-setNames(object = as.numeric(BrLengths[[1]]), nm = rownames(BrLengths))
```
Normalize with edgeR and log transform.
```{r NormLDHC}
cntsDge<- DGEList(counts = LDHC)
cntsDge<- calcNormFactors(cntsDge)
LDHCLog<- rpkm(cntsDge, log=T, gene.length = BrLengths[rownames(LDHC)], prior.count=0.1)
```
Look at LD and HC separately and filter genes based on expression level.
Only retain genes with at least one sample with log10(FPKM) > 0.
```{r FilterLDHC}
LD<-LDHCLog[,grep("LD_",colnames(LDHCLog),fixed = T)]
HC<-LDHCLog[,grep("HC_",colnames(LDHCLog),fixed = T)]
LDexpressed<-apply(LD,1,function(x) max(x)>0)
HCexpressed<-apply(HC,1,function(x) max(x)>0)
LDandHCexpressed<-names(LDexpressed)[which(LDexpressed & HCexpressed)]
LDfiltered<-LD[LDexpressed,]
HCfiltered<-HC[HCexpressed,]
LDandHCExpfiltered<-cbind(LD[LDandHCexpressed,],HC[LDandHCexpressed,])
```
Use the 'RAIN' R package to find the subset of cycling genes for each dataset
(this will take 5-20 minutes for each ```rain()``` function call).
```{r CallCycling, eval=F}
LDRainRes<-rain(x = t(LDfiltered), deltat = 2, period = 24, period.delta = 4, nr.series = 2)
HCRainRes<-rain(x = t(HCfiltered), deltat = 2, period = 24, period.delta = 4, nr.series = 2)
repOrder<-as.numeric(rbind(1:48,49:96))
LDandHCRainRes<-rain(x = t(LDandHCExpfiltered[,repOrder]), deltat = 2, period = 24, period.delta = 4, nr.series = 4)
#****Save if you want to avoid re-running it in the future
#save(LDRainRes,HCRainRes,LDandHCRainRes,file=file.path(outputPath,"RainResults.RData"))
LDCycling<-row.names(LDRainRes)[which(LDRainRes$pVal<0.01)]
HCCycling<-row.names(HCRainRes)[which(HCRainRes$pVal<0.01)]
LDandHCCycling<-row.names(LDandHCRainRes)[which(LDandHCRainRes$pVal<0.01)]
LDfiltered<-LDfiltered[LDCycling,]
HCfiltered<-HCfiltered[HCCycling,]
LDandHCfiltered<-LDandHCExpfiltered[LDandHCCycling,]
```
## Build Coexpression Networks
In order to construct coexpression networks, the replicates are averaged.
```{r LDHCAvgReps}
# Remove the replicate labels
LDAvgCols<-colnames(LDfiltered)
LDAvgCols<-gsub("\\_R[[:digit:]]","",LDAvgCols)
HCAvgCols<-colnames(HCfiltered)
HCAvgCols<-gsub("\\_R[[:digit:]]","",HCAvgCols)
LDandHCAvgCols<-colnames(LDandHCfiltered)
LDandHCAvgCols<-str_extract(LDandHCAvgCols,"ZT_[[:digit:]]*")
# Average replicates together
LDAvg<-tapply(colnames(LDfiltered),INDEX = LDAvgCols, function(x) rowSums(as.data.frame(LDfiltered[,x]),na.rm = T)/length(x))
HCAvg<-tapply(colnames(HCfiltered),INDEX = HCAvgCols, function(x) rowSums(as.data.frame(HCfiltered[,x]),na.rm = T)/length(x))
LDandHCAvg<-tapply(colnames(LDandHCfiltered),INDEX = LDandHCAvgCols, function(x) rowSums(as.data.frame(LDandHCfiltered[,x]),na.rm = T)/length(x))
# Since the timepoints get out of order with the tapply, we re-order the columns to be back in chronological order
LDAvg<-do.call(cbind,LDAvg[order(as.numeric(sapply(strsplit(names(LDAvg),split = "_"),function(x) x[3])))])
HCAvg<-do.call(cbind,HCAvg[order(as.numeric(sapply(strsplit(names(HCAvg),split = "_"),function(x) x[3])))])
LDandHCAvg<-do.call(cbind,LDandHCAvg[order(as.numeric(sapply(strsplit(names(LDandHCAvg),split = "_"),function(x) x[2])))])
```
Gene coexpression networks are constructed using the WGCNA package run using batch mode to get a set of eigengenes.
This first part can be run if desired. It is used to determine a good power value to use in network construction.
```{r LDHC_WGCNA_Power, eval=F}
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sftLD = pickSoftThreshold(t(LDAvg), powerVector = powers, verbose = 5)
sftHC = pickSoftThreshold(t(HCAvg), powerVector = powers, verbose = 5)
sftLDHC = pickSoftThreshold(t(LDandHCAvg), powerVector = powers, verbose = 5)
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sftLD$fitIndices[,1],(sftLD$fitIndices[,2]*-sign(sftLD$fitIndices[,3])),xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",main = paste("LD Scale independence"))
plot(sftHC$fitIndices[,1],(sftHC$fitIndices[,2]*-sign(sftHC$fitIndices[,3])),xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",main = paste("HC Scale independence"))
plot(sftLDHC$fitIndices[,1],(sftLDHC$fitIndices[,2]*-sign(sftLDHC$fitIndices[,3])),xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",main = paste("LDHC Scale independence"))
```
We've decided on a power of 10.
Now construct the networks.
```{r LDHC_WGCNA, eval=F}
BlockModsLD<- blockwiseModules(datExpr = t(LDAvg), power = 10, networkType = "signed", corType="bicor", TOMType="signed", minModuleSize=100, mergeCutHeight=0.2, deepSplit=2, pamRespectsDendro = F, nThreads = 4, verbose=7)
BlockModsHC<- blockwiseModules(datExpr = t(HCAvg), power = 10, networkType = "signed", corType="bicor", TOMType="signed", minModuleSize=100, mergeCutHeight=0.2, deepSplit=2, pamRespectsDendro = F, nThreads = 4, verbose=7)
BlockModsLDHC<- blockwiseModules(datExpr = t(LDandHCAvg), power = 10, networkType = "signed", corType="bicor", TOMType="signed", minModuleSize=100, mergeCutHeight=0.2, deepSplit=2, pamRespectsDendro = F, nThreads = 4, verbose=7)
#**** Save if you want to avoid re-running it in the future
#save(BlockModsLD,BlockModsHC,BlockModsLDHC,file=file.path(outputPath,"WGCNAMods_LDHC.RData"))
# Rename the eigengenes to just colors
colnames(BlockModsLD[[3]])<-gsub("^ME","",colnames(BlockModsLD[[3]]))
colnames(BlockModsHC[[3]])<-gsub("^ME","",colnames(BlockModsHC[[3]]))
colnames(BlockModsLDHC[[3]])<-gsub("^ME","",colnames(BlockModsLDHC[[3]]))
# Only compare common genes
intCols<-intersect(names(BlockModsLD$colors),names(BlockModsHC$colors))
```
### Visualize Network Modules
Plot the clustered heatmaps to display expression patterns of gene modules.
First make some color palettes.
```{r ColorPalettes, fig.width=6, fig.height=2}
# Blue - Orange Palette (used for the heatmap expression intensity)
boCols<-c(rgb(0,0,0.6,1),"white","darkorange")
boFunc<-colorRampPalette(colors = boCols, bias = 0.7)
blueOrange<-boFunc(50)
barplot(rep(1,50),col=blueOrange, main="blueOrange Heatmap")
# Blue-Green Palette (used to distinguish between different modules)
bgCols<-c(rgb(0,0,0.3,1),rgb(0.9,0,0.9,1),"white",rgb(0.9,0.9,0,1),rgb(0,0.3,0,1))
bgFunc<-colorRampPalette(colors = bgCols, bias = 1)
blueGreenLD<-bgFunc(14)
names(blueGreenLD)<-paste("LD_",str_pad(string = c(1:14),width = 2,pad = "0"),sep="")
barplot(rep(1,14),col=blueGreenLD, main="blueGreenLD")
blueGreenHC<-bgFunc(10)
names(blueGreenHC)<-paste("HC_",str_pad(string = c(1:10),width = 2,pad = "0"),sep="")
barplot(rep(1,10),col=blueGreenHC, main="blueGreenHC")
blueGreenLDHC<-bgFunc(12)
names(blueGreenLDHC)<-paste("LDHC_",str_pad(string = c(1:12),width = 2,pad = "0"),sep="")
barplot(rep(1,12),col=blueGreenLDHC, main="blueGreenLDHC")
```
Reorganize the modules so they are arranged by phase and relabel them.
```{r OrganizeHeatmaps}
# Manually rearrange modules based on phase (peak expression time)
LDcOrd<-c("turquoise","greenyellow","brown","magenta","purple", "black", "blue", "tan", "salmon", "pink" ,"red","green", "cyan", "yellow")
HCcOrd<-c("turquoise","yellow","pink","black","brown", "purple","blue","magenta","green","red")
LDHCcOrd<-c("pink","brown","black","purple","magenta","greenyellow","green","blue","turquoise","tan","red","yellow")
# Re-map the module names assigned by WGCNA to our LD/HC numbers
LDcMap<-names(blueGreenLD)
names(LDcMap)<-LDcOrd
HCcMap<-names(blueGreenHC)
names(HCcMap)<-HCcOrd
LDHCcMap<-names(blueGreenLDHC)
names(LDHCcMap)<-LDHCcOrd
LDcolors<-setNames(LDcMap[BlockModsLD$colors],nm = names(BlockModsLD$colors))
HCcolors<-setNames(HCcMap[BlockModsHC$colors],nm = names(BlockModsHC$colors))
LDHCcolors<-setNames(LDHCcMap[BlockModsLDHC$colors],nm = names(BlockModsLDHC$colors))
# Reorder the genes in the phase order
LDdf<-data.frame(Module=sort(LDcolors[intCols]))
HCdf<-data.frame(Module=sort(HCcolors[intCols]))
LDHCdf<-data.frame(Module=sort(LDHCcolors))
```
Finally, plot the heatmaps using the 'pheatmap' package.
```{r LHDCheatmaps, fig.width=5, fig.height=8}
require(pheatmap)
#pdf(file.path(outputPath,"Figure1_LDheatmap.pdf"),width = 5,height = 6)
pheatmap(LDAvg[row.names(LDdf),],cluster_rows = F, cluster_cols = F, scale = "row", color= blueOrange, annotation_row = LDdf, annotation_colors = list(Module=blueGreenLD), show_rownames = F, labels_col = gsub("LD_","",colnames(LDAvg),fixed = T))
#dev.off()
#pdf(file.path(outputPath,"Figure1_HCheatmap.pdf"),width = 5,height = 6)
pheatmap(HCAvg[row.names(HCdf),],cluster_rows = F, cluster_cols = F, scale = "row",color = blueOrange, annotation_row = HCdf, annotation_colors = list(Module=blueGreenHC), show_rownames = F, labels_col = gsub("HC_","",colnames(HCAvg),fixed = T))
#dev.off()
```
To examine the overlap between LD and HC derived modules, we compare the correlation of expression patterns of module eigengenes as well as the overlap of genes contained within the modules.
```{r LDvsHC_Modules}
# Calculate eigengene correlations
cTable<-cor(BlockModsLD$MEs,BlockModsHC$MEs)
cTable<-cTable[LDcOrd,HCcOrd]
rownames(cTable)<-LDcMap[rownames(cTable)]
colnames(cTable)<-HCcMap[colnames(cTable)]
# Calculate gene overlap statistics
pTable<-overlapTable(labels1 = LDcolors[intCols],labels2 = HCcolors[intCols])
pTableCnt<-pTable$countTable
pTable<-pTable$pTable
pTable[pTable==0]<-min(pTable[pTable>0])
pTable<--log10(pTable)
pTable<-pTable[row.names(cTable),colnames(cTable)]
pTableCnt<-pTableCnt[row.names(cTable),colnames(cTable)]
# Generate dataframes to be used for plotting the modules (size and color) on a circos plot
LDtab<-table(LDdf)[LDcMap]
circDFLD<-data.frame(module=names(LDtab),color=blueGreenLD[names(LDtab)],Min=1,Max=as.numeric(LDtab), stringsAsFactors = F)
HCtab<-table(HCdf)[HCcMap]
circDFHC<-data.frame(module=names(HCtab),color=blueGreenHC[names(HCtab)],Min=1,Max=as.numeric(HCtab), stringsAsFactors = F)
circDF<-rbind(circDFLD[nrow(circDFLD):1,],circDFHC)
```
Generate a circos plot to visualize the relationships between LD and HC modules.
```{r LDvsHC_CircosPlot, fig.width=6, fig.height=6}
# Build color map for circos links
colTable<-cTable
colTable[1:length(cTable)]<-adjustcolor(ReturnColorMap(as.numeric(cTable),ColVec = c("blue","yellow","red")),alpha.f = 0.5)
# Heights of the labels
labOff<-c(rep(2,14),1.4,rep(2,9))
#pdf("file.path(outputPath,"Figure1_Circos.pdf"),width = 6,height = 6)
circos.clear()
circos.par(cell.padding = c(0, 0, 0, 0), start.degree=270, track.height=0.1)
circos.initialize(factors=factor(circDF$module,levels = circDF$module),xlim=cbind(circDF$Min, circDF$Max))
circos.track(factors = factor(circDF$module,levels = circDF$module), ylim=c(0,1), track.margin=c(0,0.1), panel.fun = function(x, y){
i = get.cell.meta.data("sector.numeric.index")
xlim = get.cell.meta.data("xlim")
#print(paste(labOff[i],circDF$color[i]))
circos.text(CELL_META$xcenter,labOff[i], cex=0.7, font=2, circDF$module[i], facing="downward")
circos.rect(xleft = xlim[1],xright = xlim[2], ybottom = 0, ytop = 1, track.index = 1, col=circDF$color[i])
})
# Loop through the pValue table comparing modules and plot links for the significant ones
# Keep track of total links for each module
rSum<-setNames(rep(0,nrow(pTable)),nm = rownames(pTable))
cSum<-setNames(rep(0,nrow(pTable)),nm = colnames(pTable))
for(r in row.names(pTable))
{
for(c in colnames(pTable))
{
if(pTable[r,c]>=2)
{
newR<-rSum[r]+pTableCnt[r,c]
newC<-cSum[c]+pTableCnt[r,c]
#print(paste(r,(rSum[r]+1),newR,c,(cSum[c]+1),newC))
circos.link(sector.index1 = r, point1 = c((rSum[r]+1),newR), sector.index2 = c, point2 = c((cSum[c]+1),newC), col=colTable[r,c])
rSum[r]<-newR
cSum[c]<-newC
}
}
}
#dev.off()
# Plot a legend for the circos plot
#pdf(file.path(outputPath,"Figure1_CircosLegend.pdf"),width = 2,height = 6)
#s<-seq(0,1,length.out = 20)
#c<-ReturnColorMap(s,ColVec = c("yellow","red"))
#par(mar=c(1,4,1,1),cex=1.5)
#BlankPlot(xrng = c(0,1),yrng = c(0,1.1), ylab="Module Expression Pearson Correlation")
#rect(xleft = 0,xright = 1,ybottom = s,ytop = s+(s[2]-s[1]),col = adjustcolor(c,alpha.f = 0.5))
#axis(side = 2, at=pretty(s)+(0.5*(s[2]-s[1])), labels = pretty(s),font=2,tick = F,line = -1)
#dev.off()
```
GO Biological Process categorical enrichment of LD and HC modules. This block uses the supplemental files **'GO_BP_Annotations.csv'** and **'GO_Term_Descriptions.csv'**. Make sure these files are placed in your 'rawDataPath' folder.
```{r LDHC_CategoricalEnrichment, eval=F}
# Temp load("~/Documents/McClungLab/Brassica/GOAnnotate/R500_CategoricalEnrich.RData")
# First load the Brassica R500 GO BP annotations
BrEnrichGO_BP = read.csv(file.path(rawDataPath,"GO_BP_Annotations.csv"),stringsAsFactors = F)
# This function is used to put the annotations in a useful format to calculate categorical enrichment
BrEnrichGO_BP = BuildEnrichMaps(inDF = BrEnrichGO_BP)
# Load the GO term descriptions
GODescriptions<-read.csv(file.path(rawDataPath,"GO_Term_Descriptions.csv"),stringsAsFactors = F)
GODescriptions<-setNames(object = GODescriptions$Description, nm = GODescriptions$Goterm)
# Group the genes into a list of modules
LDmods<-tapply(X = names(LDcolors), INDEX = LDcolors, function(x) x)
HCmods<-tapply(X = names(HCcolors), INDEX = HCcolors, function(x) x)
# Calculate categorical enrichment for each module
EnrichLstLD<-lapply(LDmods,function(x) CalcEnrich(MapBuild = BrEnrichGO_BP, testvec = x, codedesc = GODescriptions, refvec = LDCycling))
EnrichLstHC<-lapply(HCmods,function(x) CalcEnrich(MapBuild = BrEnrichGO_BP, testvec = x, codedesc = GODescriptions, refvec = HCCycling))
# Filter out all non-significant categories
EnrichLstLD<-lapply(EnrichLstLD,function(x) x@CatList[which(as.numeric(x@CatList[,"Adj_P-Value"]) <= 0.01),,drop=F])
EnrichLstHC<-lapply(EnrichLstHC,function(x) x@CatList[which(as.numeric(x@CatList[,"Adj_P-Value"]) <= 0.01),,drop=F])
# If desired, write the output to the output folder
dir.create(path = file.path(outputPath,"LDandHC_GOenrich"))
sapply(names(EnrichLstLD), function(x) write.csv(EnrichLstLD[[x]],file=file.path(outputPath,"LDandHC_GOenrich",paste(x,".csv",sep=""))))
sapply(names(EnrichLstHC), function(x) write.csv(EnrichLstHC[[x]],file=file.path(outputPath,"LDandHC_GOenrich",paste(x,".csv",sep=""))))
```
### LD and HC Expression Analysis Using DiPALM
The LD and HC entrainments definitely result in some differences in gene expression. Next we run the DiPALM analysis in order to find genes with significantly different expression patterns.
```{r LDvsHC_DiPALM, eval=F}
# First split up the data matrix into separate time-courses for comparison
LDHCfiltered = LDandHCfiltered
spNms<-strsplit(x = colnames(LDHCfiltered), split = "_")
tnms<-sapply(spNms,function(x) paste(x[c(1,4)],collapse = "."))
colnames(LDHCfiltered)<-sapply(spNms,function(x) paste(x[2:3],collapse = ""))
TCsLDvHC<- tapply(X = 1:length(tnms),INDEX = tnms, function(x) LDHCfiltered[,x])
# Now re-combine them in order to calculate eigengenes
TCsAll<-do.call(rbind,TCsLDvHC)
BlockModsLDvHC<- blockwiseModules(datExpr = t(TCsAll), power = 10, networkType = "signed", corType="bicor", TOMType="signed", minModuleSize=100, mergeCutHeight=0.2, deepSplit=1, pamRespectsDendro = F, nThreads = 4, verbose=3)
#**** Save if you want to avoid re-running it in the future
#save(BlockModsLDvHC,file=file.path(outputPath,"LDvsHCDipalmBlockMods.RData"))
# Pull out the module eigengenes
MEs<-BlockModsLDvHC[[3]]
# Calculate kMEs and kMeds
kMEs<-BuildModMembership(MeMat = MEs, TCsLst = TCsLDvHC)
kMed<-sapply(TCsLDvHC,function(x) apply(x,1,function(y) median(y,na.rm = T)))
# Build a permuted expression set
TCsLDvHCPerm<-lapply(TCsLDvHC,function(x) x[sample(1:nrow(x),100000,replace = T),])
kMEsPerm<-BuildModMembership(MeMat = MEs, TCsLst = TCsLDvHCPerm)
kMedPerm<-sapply(TCsLDvHCPerm,function(x) apply(x,1,function(y) median(y,na.rm = T)))
# Define models
tFact<-as.factor(rep(c("HC","LD"),each=2))
design<-model.matrix(~0+tFact)
# Define the contrast
contr<-"tFactHC-tFactLD"
# Fit the models
LimmaModsMEs<-lapply(kMEs, function(x) BuildLimmaLM(dataMat = x, designMat = design, contrastStr = contr))
LimmaModsMed<-BuildLimmaLM(dataMat = kMed, designMat = design, contrastStr = contr)
# Pull out the t-statistics
LimmaModsMEs<-do.call(cbind,lapply(LimmaModsMEs,function(x) x$t))
LimmaModsMed<-LimmaModsMed$t
gc()
# Fit the permuted data
LimmaModsMEsPerm<-lapply(kMEsPerm, function(x) BuildLimmaLM(dataMat = x, designMat = design, contrastStr = contr))
LimmaModsMedPerm<-BuildLimmaLM(dataMat = kMedPerm, designMat = design, contrastStr = contr)
LimmaModsMEsPerm<-do.call(cbind,lapply(LimmaModsMEsPerm,function(x) x$t))
LimmaModsMedPerm<-LimmaModsMedPerm$t
gc()
# Sum the test statistics
TestSumsMEs<-apply(LimmaModsMEs,1, function(x) sum(abs(x),na.rm = T))
TestSumsMed<-abs(LimmaModsMed[,1])
PermSumsMEs<-apply(LimmaModsMEsPerm,1, function(x) sum(abs(x),na.rm = T))
PermSumsMed<-abs(LimmaModsMedPerm[,1])
# It can be informative to plot the distributions of the test genes vs. the permuted distribution
ggPlotMultiDensities(denslist = list(Test=TestSumsMEs,Permuted=PermSumsMEs), main = "Pattern Change Scores", xlab = "Differential Pattern Score",lwidth = 1, scale = F)
ggPlotMultiDensities(denslist = list(Test=TestSumsMed,Permuted=PermSumsMed), main = "Pattern Change Scores", xlab = "Differential Median Expression Score",lwidth = 1, scale = F)
AdjMEsLDvHC<-sapply(TestSumsMEs,function(x) AdjustPvalue(tVal = x, tVec = TestSumsMEs, pVec = PermSumsMEs))
AdjMedLDvHC<-sapply(TestSumsMed,function(x) AdjustPvalue(tVal = x, tVec = TestSumsMed, pVec = PermSumsMed))
SigMEsLDvHC<-AdjMEsLDvHC[which(AdjMEsLDvHC<0.01)]
SigMedLDvHC<-AdjMedLDvHC[which(AdjMedLDvHC<0.01)]
```
All genes in this DiPALM analysis are cycling. The set of genes that have different cycling patterns dependent upon entrainment conditions is interesting but we do not pursue these further in this study. We will however, look at GO categorical enrichment of this set.
```{r GO_Cyc_DiffEntrain, eval=F}
diffEntrain_Enrich<-CalcEnrich(MapBuild = BrEnrichGO_BP, testvec = names(SigMEsLDvHC), codedesc = GODescriptions, refvec = LDandHCCycling)
diffEntrain_Enrich<-diffEntrain_Enrich@CatList[which(as.numeric(diffEntrain_Enrich@CatList[,"Adj_P-Value"]) <= 0.01),,drop=F]
write.csv(diffEntrain_Enrich,file=file.path(outputPath,"LDandHC_GOenrich","Differential_Pattern.csv"))
```
## Expression Level Analysis of Retained Paralogs
Now we begin to look at 2 and 3-copy _B.rapa_ paralogs.
We have defined two types of homologous relationships:
\newline
1) There exists one Arabidopsis ortholog and two _B. rapa_ paralogs. These sets are defined in the supplemental file: **"At_Br_Orthologs2.csv"**.
\newline
2) There exists one Arabidopsis ortholog and three _B. rapa_ paralogs. These sets are defined in the supplemental file: **"At_Br_Orthologs3.csv"**.
\newline
Both of these files should be placed in your 'rawDataPath' folder.
First, we look for a relationship between retained copies of genes and expression level.
```{r CopyVsNoncopy}
# Cycling of Br and At paralogs
# First load the homolog data
BrOrtho2<-read.csv(file.path(rawDataPath,"At_Br_Orthologs2.csv"),stringsAsFactors = F)
BrOrtho3<-read.csv(file.path(rawDataPath,"At_Br_Orthologs3.csv"),stringsAsFactors = F)
# Convert 3-copy sets to 2-copy comparisons and combine all
Br2<-lapply(1:nrow(BrOrtho2),function(x) as.character(BrOrtho2[x,2:3]))
Br3<-lapply(1:nrow(BrOrtho3),function(x) as.character(BrOrtho3[x,2:4]))
Br23<-c(Br2,Br3)
# Only retain pairs with at least 2 cycling copies
BrCyc<-lapply(Br23,function(x) intersect(x,LDandHCCycling))
tLen<-sapply(BrCyc,length)
Br2Cyc<-BrCyc[which(tLen>1)]
# Define set of genes that are cycling copies
BrCopied<-unlist(Br2Cyc)
BrAll<-LDandHCCycling
# Define set of genes that are cycling and not copies
BrNoCopy<-setdiff(BrAll,BrCopied)
```
We observe higher expression (in general) in the set of copied genes. Is this an effect of the higher or lower expressed copy being generally higher?
To test this, we generate random pairs of single copy genes and compare to our paralogous gene pairs.
```{r CopyPairsVsNoncopyPairs}
# Make random pairs of single copy genes
Br2Rand1<-sample(BrNoCopy, floor(length(BrNoCopy)/2), replace = F)
Br2Rand2<-setdiff(BrNoCopy,Br2Rand1)[1:length(Br2Rand1)]
Br2Rand<-lapply(1:length(Br2Rand1),function(x) c(Br2Rand1[x],Br2Rand2[x]))
# Split paralogous pairs based on expresssion
BrAvgExpr<-apply(LDandHCExpfiltered,1,function(x) mean(x,na.rm=T))
BrCopyHighExp<-sapply(Br2Cyc,function(x) x[which.max(BrAvgExpr[x])])
BrCopyLowExp<-sapply(Br2Cyc,function(x) x[which.min(BrAvgExpr[x])])
RandCopyHighExp<-sapply(Br2Rand,function(x) x[which.max(BrAvgExpr[x])])
RandCopyLowExp<-sapply(Br2Rand,function(x) x[which.min(BrAvgExpr[x])])
# Put all the results together in a list
BrListExp<-list("Br_NoCopy"=BrNoCopy,"Br_Copied"=BrCopied,"RanPairs_HiExp"=RandCopyHighExp, "RanPairs_LoExp"=RandCopyLowExp, "BrCopy_HiExp"=BrCopyHighExp,"BrCopy_LoExp"=BrCopyLowExp)
ExpList<-lapply(BrListExp,function(x) BrAvgExpr[x])
names(ExpList)<-paste(names(ExpList)," [",sapply(ExpList,length),"]",sep="")
# Anova test for significance
ExpDf<-data.frame(Exp=unlist(ExpList),Group=rep(names(ExpList),times=sapply(ExpList,length)))
ExpAnov<-aov(formula = Exp ~ Group, data = ExpDf)
ExpTuk<-TukeyHSD(x = ExpAnov)
```
Plot the results.
```{r CopyVsNonCopyPlots, fig.width=4, fig.height=6}
#pdf(file.path(outputPath,"Figure2_A.pdf"),width = 3,height = 5)
ggp<-ggboxplot(dList = ExpList[1:2], cols = c("red","blue"), notch = T, notchLines = F, ledge = F, ylab = "Expression Log2(FPKM)", ylim = c(-5,10))
bracketDf<-data.frame(x1=c(1,1,2),y1=c(10,9.5,9.5),x2=c(2,1,2),y2=c(10,10,10))
ggp<-ggp+geom_segment(data=bracketDf, mapping=aes(x=x1,y=y1,xend=x2,yend=y2), lty=1, lwd=0.6, color=rep("grey40",3))
pValDF<-data.frame(x=c(1.5),y=c(9.5),lab=c("P-value: 0.000"))
ggp<-ggp+geom_text(mapping = aes(x=x,y=y,label=lab), data = pValDF)
ggp
#dev.off()
#pdf(file.path(outputPath,"Figure2_B.pdf"),width = 6,height = 5)
ggp<-ggboxplot(dList = ExpList[3:6], cols = c(rgb(0.5,0,0),rgb(1,0.5,0.5),rgb(0,0,0.5),rgb(0.5,0.5,1)), notch = T, notchLines = F, ledge = F, ylab = "Expression Log2(FPKM)", ylim = c(-5,10))
# Manually annotate p-values
bracketDf<-data.frame(x1=c(1,2,1,3,2,4),y1=c(10,-5,9.5,9.5,-4.5,-4.5),x2=c(3,4,1,3,2,4),y2=c(10,-5,10,10,-5,-5))
ggp<-ggp+geom_segment(data=bracketDf, mapping=aes(x=x1,y=y1,xend=x2,yend=y2), lty=1, lwd=0.6, color=rep("grey40",6))
pValDF<-data.frame(x=c(2,3),y=c(9.5,-4.5),lab=c("P-value: 0.981","P-value: 0.000"))
ggp<-ggp+geom_text(mapping = aes(x=x,y=y,label=lab), data = pValDF)
ggp
#dev.off()
```
### Phase Distribution of Paralogs in the LDHC Network
Plot the module eigengenes for the combined LDHC expression modules.
```{r LDHC_ModuleExpression, fig.width=6, fig.height=6}
# Look at the LDHC modules
LDHC_MEs<-BlockModsLDHC$MEs[,names(LDHCcMap)]
colnames(LDHC_MEs)<-LDHCcMap
#pdf(file.path(outputPath,"Figure2_C.pdf"),width = 6,height = 6)
pheatmap(t(LDHC_MEs),cluster_rows = F, cluster_cols = F, scale = "row",color = blueOrange, labels_col = gsub("_","",colnames(LDandHCAvg),fixed = T))
#dev.off()
```
Next we ask if paralogous genes are enriched or depleted for any specific peak expresssion times (phase).
```{r LDHCPhase_Enrich}
# Group all the LDHC modules
LDHCmods<-tapply(X = names(LDHCcolors), INDEX = LDHCcolors, function(x) x)
# Test each module for enrichment and depletion of multi-copy genes
LDHCcycleEnrich<-sapply(LDHCmods,function(x) PhyperOverlap(set1 = x,set2 = unique(BrCopied),backset = BrAll,nlog10 = T))
LDHCcycleDeplete<-sapply(LDHCmods,function(x) PhyperOverlap(set1 = x,set2 = unique(BrCopied),backset = BrAll,nlog10 = T,lowerTail = T))
phypRes<-cbind(LDHCcycleEnrich,LDHCcycleDeplete)
# Impose a negative sign on depleted modules and leave enriched modules positive
phypRes<-ifelse(test = LDHCcycleEnrich>LDHCcycleDeplete, yes = LDHCcycleEnrich, no = (LDHCcycleDeplete*-1))
ngeneRes<-sapply(LDHCmods,length)
# Generate dataframe for ggplot2
ggdf<-data.frame(pVal=phypRes,numGene=ngeneRes,module=factor(names(phypRes),levels = rev(names(phypRes))))
```
Generate plots.
```{r PlotPhaseCopyData, fig.width=4, fig.height=6}
#pdf(file.path(outputPath,"Fig2D_1.pdf"),width = 4,height = 6)
ggplot(data = ggdf, mapping = aes(x=module, y = ggdf$pVal)) + geom_bar(stat = "identity", aes(fill=module)) + coord_flip() + labs(title="Enrichment", y="-log10(depletion/enrichment pValue)") + scale_fill_manual(values = blueGreenLDHC) + theme_bw() + theme(legend.position="none")
#dev.off()
#pdf(file.path(outputPath,"Fig2D_2.pdf"),width = 4,height = 6)
ggplot(data = ggdf, mapping = aes(x=module, y = ggdf$numGene)) + geom_bar(stat = "identity", aes(fill=module)) + coord_flip() + labs(title="Number of Genes", y="Number of Genes/Module") + scale_fill_manual(values = blueGreenLDHC) + theme_bw() + theme(legend.position="none")
#dev.off()
```
Enrichment analysis of genes represented in the morning and evening phased modules that are over-represented for multi-copy genes.
```{r PhaseEnrich, eval=F}
# Pull out all the multi-copy genes from enriched and depleted modules
LDHCcopyEnriched<-unlist(LDHCmods[which(phypRes>(-log10(0.05)))])
LDHCcopiedMorning<-intersect(LDHCcopyEnriched,BrCopied)
LDHCcopyDepleted<-unlist(LDHCmods[which(phypRes<(-(-log10(0.05))))])
LDHCcopiedEvening<-intersect(LDHCcopyDepleted,BrCopied)
# Calculate GO categorical enrichment
EnrichLstLDHC<-lapply(list(MorningCopies=LDHCcopiedMorning, EveningCopies=LDHCcopiedEvening),function(x) CalcEnrich(MapBuild = BrEnrichGO_BP, testvec = x, codedesc = GODescriptions, refvec = LDandHCCycling))
# Filter out all non-significant categories
EnrichLstLDHC<-lapply(EnrichLstLDHC,function(x) x@CatList[which(as.numeric(x@CatList[,"Adj_P-Value"]) <= 0.01),])
# Output the results
dir.create(path = file.path(outputPath,"LDHC_MorningEveningGOEnrich"))
sapply(names(EnrichLstLDHC), function(x) write.csv(EnrichLstLDHC[[x]],file=file.path(outputPath,"LDHC_MorningEveningGOEnrich",paste(x,".csv",sep=""))))
```
### Expression Pattern Analysis of Paralog Pairs
Next we want to find out how many _B. rapa_ paralog pairs have divergent expression patterns.
For this, we use DiPALM. We treat the LD and HC data sets like replicates in order to increase our statistical power to detect divergent expression patterns. DiPALM operates on a linear model framework using the limma package. Thus, we can include LD and HC as potential covariates into our model in order to account for the effects between these two conditions.
```{r LDHC_DiPALM_Br1VsBr2, eval=F}
# For simplicity, convert all of the 3-copy B. rapa paralog sets into 3, 2-copy sets and combine with the actual 2-copy pairs
LDHCfiltered<-LDandHCExpfiltered
Br2a<-setNames(BrOrtho2,nm = c("At","Br1","Br2"))
Br2b<-setNames(BrOrtho3[,-2],nm = c("At","Br1","Br2"))
Br2c<-setNames(BrOrtho3[,-3],nm = c("At","Br1","Br2"))
Br2d<-setNames(BrOrtho3[,-4],nm = c("At","Br1","Br2"))
Br2s<-rbind(Br2a,Br2b,Br2c,Br2d)
# Retain only the pairs where both copies have expression data
Br2s<-Br2s[which(Br2s[,2]%in%rownames(LDHCfiltered) & Br2s[,3]%in%rownames(LDHCfiltered)),-1]
# Assign some row names
rNms<-paste(Br2s[,1],Br2s[,2],sep="_")
rownames(Br2s)<-rNms
# Split up orthologs
cntsLogPara2<-lapply(colnames(Br2s),function(x) LDHCfiltered[Br2s[,x],])
for(n in 1:ncol(Br2s))
{
colnames(cntsLogPara2[[n]])<-paste(colnames(Br2s)[n],colnames(cntsLogPara2[[n]]),sep = "_")
}
cntsLogPara2<-do.call(cbind,cntsLogPara2)
row.names(cntsLogPara2)<-rNms
# Convert expression data into a list of the different time-courses
spNms<-strsplit(x = colnames(cntsLogPara2), split = "_")
tnms<-sapply(spNms,function(x) paste(x[c(1,2,5)],collapse = "."))
colnames(cntsLogPara2)<-sapply(spNms,function(x) paste(x[3:4],collapse = ""))
TCsPara<- tapply(X = 1:length(tnms),INDEX = tnms, function(x) cntsLogPara2[,x])
TCsAll<-do.call(rbind,TCsPara)
# Run WGCNA to get eigengene vectors
BlockMods_LDHCBr1vBr2<- blockwiseModules(datExpr = t(TCsAll), power = 10, networkType = "signed", corType="bicor", TOMType="signed", minModuleSize=100, mergeCutHeight=0.2, deepSplit=2, pamRespectsDendro = F, nThreads = 4, verbose=7)
#****Save if you want to avoid re-running it in the future
#save(BlockMods_LDHCBr1vBr2,file=file.path(outputPath,"DiPALM_LDHC_Br1vBr2_BlockMods.RData"))
# Pull out MEs
MEs<-BlockMods_LDHCBr1vBr2[[3]]
# Define models
BrFact<-as.factor(rep(c("BR1","BR2"),each=4))
# Include the LD/HC as a factor in the model
LDHCFact<-as.factor(rep(c("HC","HC","LD","LD"),times=2))
design<-model.matrix(~0+BrFact+LDHCFact)
contr<-"BrFactBR1-BrFactBR2"
# Run DiPALM analysis
kMEs<-BuildModMembership(MeMat = MEs, TCsLst = TCsPara)
Med<-sapply(TCsPara,function(x) apply(x,1,function(y) median(y,na.rm = T)))
TCsParaPerm<-lapply(TCsPara,function(x) x[sample(1:nrow(x),100000,replace = T),])
kMEsPerm<-BuildModMembership(MeMat = MEs, TCsLst = TCsParaPerm)
MedPerm<-sapply(TCsParaPerm,function(x) apply(x,1,function(y) median(y,na.rm = T)))
LimmaModskMEs<-lapply(kMEs, function(x) BuildLimmaLM(dataMat = x, designMat = design, contrastStr = contr))
LimmaModsMed<-BuildLimmaLM(dataMat = Med, designMat = design, contrastStr = contr)
LimmaModskMEs<-do.call(cbind,lapply(LimmaModskMEs,function(x) x$t))
LimmaModsMed<-LimmaModsMed$t
gc()
LimmaModskMEsPerm<-lapply(kMEsPerm, function(x) BuildLimmaLM(dataMat = x, designMat = design, contrastStr = contr))
LimmaModsMedPerm<-BuildLimmaLM(dataMat = MedPerm, designMat = design, contrastStr = contr)
LimmaModskMEsPerm<-do.call(cbind,lapply(LimmaModskMEsPerm,function(x) x$t))
LimmaModsMedPerm<-LimmaModsMedPerm$t
gc()
TestSumskMEs<-apply(LimmaModskMEs,1, function(x) sum(abs(x)))
TestSumsMed<-abs(LimmaModsMed[,1])
PermSumskMEs<-apply(LimmaModskMEsPerm,1, function(x) sum(abs(x)))
PermSumsMed<-abs(LimmaModsMedPerm[,1])
# If desired, plot the score distributions of tested genes vs. the permuted genes
ggPlotMultiDensities(denslist = list(Test=TestSumskMEs,Permuted=PermSumskMEs), main = "Pattern Change Scores", xlab = "Differential Pattern Score",lwidth = 1, scale = F)
ggPlotMultiDensities(denslist = list(Test=TestSumsMed,Permuted=PermSumsMed), main = "Expression Change Scores", xlab = "Differential Expression Score",lwidth = 1, scale = F)
AdjkMEsLDHCBr1vBr2<-sapply(TestSumskMEs,function(x) AdjustPvalue(tVal = x, tVec = TestSumskMEs, pVec = PermSumskMEs))
AdjMedLDHCBr1vBr2<-sapply(TestSumsMed,function(x) AdjustPvalue(tVal = x, tVec = TestSumsMed, pVec = PermSumsMed))
SigkMEsLDHCBr1vBr2<-AdjkMEsLDHCBr1vBr2[which(AdjkMEsLDHCBr1vBr2<0.01)]
SigMedLDHCBr1vBr2<-AdjMedLDHCBr1vBr2[which(AdjMedLDHCBr1vBr2<0.01)]
```
Now we see that over 1800 _B. rapa_ paralog pairs show divergent expression patterns. Is this divergence random or can we cluster pairs based on ones that diverge in a similar way?
```{r LDHC_Br1vBr2_Clusters, fig.width=6, fig.height=10}
# Cluster the paralogs based on pattern difference and also on expression difference
# The figure displays the expression levels averaged across replicates
LDandHCAvgCols<-colnames(LDandHCExpfiltered)
LDandHCAvgCols<-str_extract(LDandHCAvgCols,"ZT_[[:digit:]]*")
# Calculate the means
LDHCExprAvg<-tapply(colnames(LDandHCExpfiltered),INDEX = LDandHCAvgCols, function(x) rowSums(as.data.frame(LDandHCExpfiltered[,x]),na.rm = T)/length(x))
# Becomes disordered with tapply, re-ordering the columns to be back in chronological order
LDHCExprAvg<-do.call(cbind,LDHCExprAvg[order(as.numeric(sapply(strsplit(names(LDHCExprAvg),split = "_"),function(x) x[2])))])
# Build a matrix of average expression across LDHC for the paralog pairs
MEsMat<-do.call(rbind,strsplit(names(SigkMEsLDHCBr1vBr2),split = "_",fixed=T))
row.names(MEsMat)<-names(SigkMEsLDHCBr1vBr2)
# Currently, Br1 and Br2 are defined arbitrarily. Using the sign of the summed t-statistics from the limma output, we can define Br1 and Br2 non-arbitrarily
MEsSign<-apply(LimmaModskMEs,1, function(x) sign(sum(x)))
MedSign<-apply(LimmaModsMed,1, function(x) sign(sum(x)))
MEsMat<-t(sapply(rownames(MEsMat),function(x) if(MEsSign[x]==1) MEsMat[x,1:2] else MEsMat[x,c(2,1)]))
MEsMat<-cbind(LDHCExprAvg[MEsMat[,1],],LDHCExprAvg[MEsMat[,2],])
# Normalize so the heatmap will display patterns
m1<-t(apply(MEsMat,1,function(x) (x[1:24]-mean(x[1:24],na.rm=T))/sd(x[1:24],na.rm=T)))
m2<-t(apply(MEsMat,1,function(x) (x[25:48]-mean(x[25:48],na.rm=T))/sd(x[25:48],na.rm=T)))
MEsMat<-cbind(m1,m2)
row.names(MEsMat)<-names(SigkMEsLDHCBr1vBr2)
# Cluster the paralog pairs based on expression across Br1 and Br2
MEsCor<-cor(t(MEsMat))
MEsTree<-hclust(as.dist(1-MEsCor),method = "complete")
# Clusters
MEsClust<-cutreeDynamic(dendro = MEsTree, method = "tree", minClusterSize = 50, cutHeight = 1.5)
# Re-assign cluster names
MEsClustOrd<-setNames(paste("pDif_Cluster",1:length(unique(MEsClust)),sep=""),nm=unique(MEsClust[MEsTree$order]))
MEsClust<-MEsClustOrd[as.character(MEsClust)]
# Re-order the plotting of each paralog set based on the clusters
MEsPlotOrd<-unlist(lapply(MEsClustOrd,function(x) which(MEsClust%in%x)))
# Build a dataframe to input info into the pheatmap function to tell it how to annotate the clusters on the side of the heatmap
MEsAnnot<-data.frame(row.names=MEsTree$labels,Cluster=MEsClust,stringsAsFactors = F)
# Plot the heatmap
plotMat<-NULL
breakVec<-0
for(cl in MEsClustOrd)
{
tmpInds<-which(MEsClust==cl)
#print(length(tmpInds))
tmpMat<-rbind(MEsMat[tmpInds,1:24],MEsMat[tmpInds,25:48])
plotMat<-rbind(plotMat,tmpMat)
breakVec<-c(breakVec,length(tmpInds)+max(breakVec),(length(tmpInds)*2)+max(breakVec))
#breakVec<-c(breakVec,(length(tmpInds)*2)+max(breakVec))
}
#pdf(file.path(outputPath,"Figure3_A.pdf"),width = 4,height =20)
pheatmap(mat = plotMat, scale = "row", cluster_cols = FALSE, treeheight_row = 0 , cluster_rows = FALSE, color = blueOrange[5:50], show_rownames = F, annotation_legend = T, gaps_row = breakVec, annotation_row = MEsAnnot, annotation_colors = list(Cluster=setNames(bgFunc(length(MEsClustOrd)),nm = MEsClustOrd)))
#dev.off()
# Repeat for the set of paralog pairs that have significantly different expression
MedMat<-do.call(rbind,strsplit(names(SigMedLDHCBr1vBr2),split = "_",fixed=T))
row.names(MedMat)<-names(SigMedLDHCBr1vBr2)
# Re-order Br1 and Br2 so that the higher expressing paralog is always first
MedMat<-t(sapply(rownames(MedMat),function(x) if(MedSign[x]==1) MedMat[x,1:2] else MedMat[x,c(2,1)]))
MedMat<-cbind(LDHCExprAvg[MedMat[,1],],LDHCExprAvg[MedMat[,2],])
row.names(MedMat)<-names(SigMedLDHCBr1vBr2)
MedCor<-cor(t(MedMat))
MedTree<-hclust(as.dist(1-MedCor),method = "complete")
# Cluster
MedClust<-cutreeDynamic(dendro = MedTree, method = "tree", minClusterSize = 75, cutHeight = 1.25)
MedClustOrd<-setNames(paste("exDif_Cluster",1:length(unique(MedClust)),sep=""),nm=unique(MedClust[MedTree$order]))
MedClust<-MedClustOrd[as.character(MedClust)]
MedPlotOrd<-unlist(lapply(MedClustOrd,function(x) which(MedClust%in%x)))
MedAnnot<-data.frame(row.names=MedTree$labels,Cluster=MedClust,stringsAsFactors = F)
# Manually scale the data so that expression changes are evident
MedMat<-t(apply(MedMat,1,function(x) (x-mean(x,na.rm=T))/sd(x,na.rm = T)))
plotMat<-NULL
breakVec<-0
for(cl in MedClustOrd)
{
tmpInds<-which(MedClust==cl)
#print(length(tmpInds))
tmpMat<-rbind(MedMat[tmpInds,1:24],MedMat[tmpInds,25:48])
plotMat<-rbind(plotMat,tmpMat)
breakVec<-c(breakVec,length(tmpInds)+max(breakVec),(length(tmpInds)*2)+max(breakVec))
#breakVec<-c(breakVec,(length(tmpInds)*2)+max(breakVec))
}
#pdf(file.path(outputPath,"Figure3_B.pdf"),width = 4,height =20)
pheatmap(mat = plotMat, scale = "none", cluster_cols = FALSE, treeheight_row = 0 , cluster_rows = FALSE, color = blueOrange[5:50], show_rownames = F, annotation_legend = T, gaps_row = breakVec, annotation_row = MedAnnot, annotation_colors = list(Cluster=setNames(bgFunc(length(MedClustOrd)),nm = MedClustOrd)))
#dev.off()
```
We can also look at these clusters individually using a ribbon plot in order to easily see pattern differences.
```{r LDHC_RibbonPlots, fig.width=8, fig.height=4}
MEsClusters<-tapply(names(SigkMEsLDHCBr1vBr2),INDEX = MEsClust,function(x) x)
TCsMEs<-list(Br1=MEsMat[,1:24],Br2=MEsMat[,25:48])
#pdf(file.path(outputPath,"Fig3_Ribbons.pdf"),width = 10,height = 5)
for(i in 1:length(MEsClusters))
{
PlotTCsRibbon(TClst = TCsMEs, tgenes = MEsClusters[[i]],scale = T, tcols=c("red","blue"), tltys = c(1,2),main = paste(names(MEsClusters)[i]," Gene#:",length(MEsClusters[[i]]),sep=" "))
}
#dev.off()
```
## GRN Construction in _B. rapa_ and Arabidopsis Using GENIE3
We are interested in comparing _B. rapa_ paralog pairs with the corresponding single ortholog in Arabidopsis. To do this, we have decided that the most robust method is to compare the predicted targets of a gene regulatory network.
Here we introduce Arabidopsis data. This data is a collection of publicly available microarray data sets compiled by Todd Mockler's group. The data can be found here:
ftp://www.mocklerlab.org/diurnal/expression_data/Arabidopsis_thaliana_data.tab.gz
We have extracted 4 of these data sets and re-organized them into matrix format .csv files. The supplemental file is called: **'At_ExpressionData.csv'**. This file should be placed in the 'rawDataPath' folder.
```{r AtDataLoad}
# Load the Arabidopsis data sets
AtData<-read.csv(file.path(rawDataPath,"At_ExpressionData.csv"),stringsAsFactors = F)
AtDataSet<-AtData$X.1
AtDataRows<-AtData$X
AtData<-AtData[,-c(1,2)]
AtData<-as.matrix(AtData)
rownames(AtData)<-AtDataRows
# Split it up into separate data sets
AtExp<-tapply(X = 1:nrow(AtData),INDEX = AtDataSet,function(x) as.matrix(AtData[x,]))
# Make unique colnames
for(exp in 1:length(AtExp))
{
colnames(AtExp[[exp]])<-paste(names(AtExp)[exp],colnames(AtExp[[exp]]),sep="_")
}
AtExpLog<-lapply(AtExp,function(x) log2(x))
# Filter the genes so that they must have at least one value > 2 in all 4 runs
AtExpressed<-lapply(AtExpLog,function(x) apply(x,1,function(y) max(y)>2))
AtExpressed<-names(which(AtExpressed[[1]] & AtExpressed[[2]] & AtExpressed[[3]] & AtExpressed[[4]]))
AtExpFiltered<-do.call(cbind,lapply(AtExpLog,function(x) x[AtExpressed,]))
```
Now we load in the mapping between _B. rapa_ and Arabidopsis orthologs. The supplemental file is called **'Brapa_Ath_Orthologues.csv'**. This file should be placed in the 'rawDataPath' folder.
```{r OrthologLoad}
# Load the ortholog data
LDHCfiltered<-LDandHCExpfiltered
AtBr<-read.csv(file.path(rawDataPath,"Brapa_Ath_Orthologues.csv"),stringsAsFactors = F)
AtBr<-AtBr[,5:6]
# Remove all BRs with no At ortholog
AtBrOrthoTargets<-AtBr[which((AtBr$BRA %in% rownames(LDHCfiltered))&(AtBr$Ath_ortho %in% rownames(AtExpFiltered))),]
# Define all possible (detected) target genes for both networks
AtTargets<-unique(AtBrOrthoTargets$Ath_ortho)
BrTargets<-unique(AtBrOrthoTargets$BRA)
```
Now we need to define the set of transcription factors (TFs) for each of the networks.
For arabidopsis, we selected TFs from the Arabidopsis TF database (https://agris-knowledgebase.org/AtTFDB/). We have included a supplemental file of the TFs called:
**'At_TFList.csv'**.
For _B. rapa_, we selected TFs using the MapMan automated gene annotation software, MERCATOR (https://mapman.gabipd.org/app/mercator), and taking all genes with the annotation "Transcriptional Control: 27.3". We have included a supplemental file called: **'Brapa_TFList.csv'**.
The last TF file is a set of genes known to be part of the central plant oscillator (clock genes). We add these as potential regulators (nearly all of them are classified as TFs). This file is called: **'Brapa_At_ClockIDs.csv'**.
All three of these files should be placed in the 'rawDataPath' folder.
```{r TFLoad}
# Load TF Data
AtTFs<-read.csv(file.path(rawDataPath,"At_TFList.csv"),stringsAsFactors = F)
AtTFs<-unique(AtTFs$Gene_ID)
BrTFs<-read.csv(file.path(rawDataPath,"Brapa_TFList.csv"),stringsAsFactors = F)
BrTFs<-BrTFs$Accession
# Load clock genes
clock<-read.csv(file.path(rawDataPath,"Brapa_At_ClockIDs.csv"),stringsAsFactors = F)
AtClock<-unique(clock$ATG)
BrClock<-unique(clock$BRA)
AtTFs<-union(AtTFs,AtClock)
BrTFs<-union(BrTFs,BrClock)
# Only consider the TFs that are expressed
AtTFs<-intersect(AtTFs,row.names(AtExpFiltered))
BrTFs<-intersect(BrTFs,rownames(LDHCfiltered))
```
Next, we construct our Gene Regulator Networks (GRNs) using the GENIE3 framework (https://github.com/aertslab/GENIE3).
```{r Genie3, eval=F}
# First, define the target and TF expression matrices
AtTFsMat<-AtExpFiltered[AtTFs,]
AtTargetMat<-AtExpFiltered[AtTargets,]
BrTFsMat<-LDHCfiltered[BrTFs,]
BrTargetMat<-LDHCfiltered[BrTargets,]
# Permute Target matrices to give 10,000 permuted targets
AtTargetMatPerm<-matrix(data = sample(as.numeric(AtTargetMat),480000,replace = T),nrow = 10000, dimnames = list(row=paste("Decoy",1:10000,sep="_"),col=colnames(AtTargetMat)))
BrTargetMatPerm<-matrix(data = sample(as.numeric(BrTargetMat),960000,replace = F),nrow = 10000, dimnames = list(row=paste("Decoy",1:10000,sep="_"),col=colnames(BrTargetMat)))
# Build the weight matrices
# ****If the R 'parallel' package is installed, I recommend increasing the 'nThr' parameter below ****
AtWeights<-RS.Get.Weigth.Matrix(target.matrix = t(AtTargetMat), input.matrix = t(AtTFsMat),nThr = 1)
BrWeights<-RS.Get.Weigth.Matrix(target.matrix = t(BrTargetMat), input.matrix = t(BrTFsMat),nThr = 1)
AtWeightsPerm<-RS.Get.Weigth.Matrix(target.matrix = t(AtTargetMatPerm), input.matrix = t(AtTFsMat), nThr=1)
BrWeightsPerm<-RS.Get.Weigth.Matrix(target.matrix = t(BrTargetMatPerm), input.matrix = t(BrTFsMat), nThr=1)
#****Save if you want to avoid re-running it in the future
#save(AtWeights,BrWeights,AtWeightsPerm,BrWeightsPerm,file=file.path(outputPath,"GRN_WeightMatrix.RData"))
# Impose an FDR edge weight cutoff to determine significant edge scores
AtWeightsCut<-CutGRNMat(tMat = AtWeights, pMat = AtWeightsPerm, cutPv = 0.05)
BrWeightsCut<-CutGRNMat(tMat = BrWeights, pMat = BrWeightsPerm, cutPv = 0.05)
# Organize target groups
AtTargetGroups<-apply(AtWeightsCut,2,function(x) rownames(AtWeightsCut)[which(x>0)])
BrTargetGroups<-apply(BrWeightsCut,2,function(x) rownames(BrWeightsCut)[which(x>0)])
```
Once again, load the ortholog/paralog information and filter on groups that are present in the corresponding GRNs.
```{r GRNHomologs}
# Create the total set of 2-copy pairs
Br2a<-setNames(BrOrtho2,nm = c("At","Br1","Br2"))
Br2b<-setNames(BrOrtho3[,-2],nm = c("At","Br1","Br2"))
Br2c<-setNames(BrOrtho3[,-3],nm = c("At","Br1","Br2"))
Br2d<-setNames(BrOrtho3[,-4],nm = c("At","Br1","Br2"))
Br2s<-rbind(Br2a,Br2b,Br2c,Br2d)
colnames(Br2s)<-colnames(BrOrtho2)
Br2s_At<-(Br2s$ATG %in% names(AtTargetGroups))
Br2s_Br1<-(Br2s$BrOr1 %in% names(BrTargetGroups))
Br2s_Br2<-(Br2s$BrOr2 %in% names(BrTargetGroups))
# Find the At-Br1-Br2 sets where all three genes are TFs
CommonTFs<-which(Br2s_At & Br2s_Br1 & Br2s_Br2)
CommonGroups<-lapply(CommonTFs,function(x) list(Ath=AtTargetGroups[[Br2s[x,1]]],Br1=BrTargetGroups[[Br2s[x,2]]],Br2=BrTargetGroups[[Br2s[x,3]]]))
names(CommonGroups)<-apply(Br2s[CommonTFs,],1,function(x) paste(x,collapse = "_"))
# Only retain groups with at least 10 predicted targets for each TF
ret<-which(sapply(CommonGroups,function(x) length(x[[1]])>=10 & length(x[[2]])>=10 & length(x[[3]])>=10))
CommonGroupsCut<-CommonGroups[ret]
```
### Identifying the Arabidopsis-like _B. rapa_ TF Based on Network Connectivity
Now we have defined our 256 sets of At-Br1-Br2 TF homologs. We want to know, based on predicted target groups, if one of the Br TFs is more like the At TF.
We cannot do a simple hypergeometric test between the target groups because the mapping of At - Br orthologs is not one-to-one. So in all cases, we map the At targets back to every possible Br ortholog, calculate the overlap between target groups and then compare this to a null distribution of overlap numbers that were generated separately for each At-Br1-Br2 triplet. The null distributions are generated by randomly re-shuffling the members of the Br1 and Br2 groups to end up with the same group sizes as the original and then running the same overlap test and repeating this 10,000 times.
```{r GRN_TargetAtOverlap, eval=F}
# Enrichment test for overlap between Br paralogs of At group and each Br Group
CommonGroupsEnrich<-lapply(CommonGroupsCut,function(x) TargetOverlapStats(TargGroups = x,OrthoMat = AtBrOrthoTargets))
# Run the same enrichment test 10,000 times after randomly swapping targets between the two Br groups
CommonGroupsEnrichPerm<-lapply(CommonGroupsCut,function(x) sapply(1:10000,function(y) TargetOverlapStats(TargGroups = TargetGroupPerm(x), OrthoMat = AtBrOrthoTargets)))
#****Save if you want to avoid re-running it in the future
#save(CommonGroupsEnrichPerm,file=file.path(outputPath,"GRN_TargetOverlap_Permuted.RData"))
# The hypergeometric test is invalid since we have converted the At genes back to all possible versions of the Br genes. We use the mean and sd of the permuted distributions and assume these are normal distributions (they are), then calculate p-values.
for(grp in names(CommonGroupsEnrich))
{
# Calculate test pvalues
CommonGroupsEnrich[[grp]][1]<-pnorm(q = CommonGroupsEnrich[[grp]][6]-1, mean = mean(CommonGroupsEnrichPerm[[grp]][6,]), sd = sd(CommonGroupsEnrichPerm[[grp]][6,]), lower.tail = F)
CommonGroupsEnrich[[grp]][2]<-pnorm(q = CommonGroupsEnrich[[grp]][7]-1, mean = mean(CommonGroupsEnrichPerm[[grp]][7,]), sd = sd(CommonGroupsEnrichPerm[[grp]][7,]), lower.tail = F)
# Calculate permutation pvalues
CommonGroupsEnrichPerm[[grp]][1,]<-pnorm(q = CommonGroupsEnrichPerm[[grp]][6,]-1, mean = mean(CommonGroupsEnrichPerm[[grp]][6,]), sd = sd(CommonGroupsEnrichPerm[[grp]][6,]), lower.tail = F)
CommonGroupsEnrichPerm[[grp]][2,]<-pnorm(q = CommonGroupsEnrichPerm[[grp]][7,]-1, mean = mean(CommonGroupsEnrichPerm[[grp]][7,]), sd = sd(CommonGroupsEnrichPerm[[grp]][7,]), lower.tail = F)
}
# In order to make plots easier to understand, we fix these results so that the more significant Br is always labeled as Br1 and the less significant one is always Br2.
# ID the groups where Br2 is more significant
Br2Sigs<-which(sapply(CommonGroupsEnrich,function(x) x[2]<x[1]))
# Now switch Br1 and B2 in these groups for the test and the permuted
for(flp in Br2Sigs)
{
tmpVec<-setNames(object = CommonGroupsEnrich[[flp]][c(2,1,3,5,4,7,6)], nm = names(CommonGroupsEnrich[[flp]]))
tmpNm<-rownames(CommonGroupsEnrichPerm[[flp]])
tmpMat<-CommonGroupsEnrichPerm[[flp]][c(2,1,3,5,4,7,6),]
rownames(tmpMat)<-tmpNm
CommonGroupsEnrich[[flp]]<-tmpVec
CommonGroupsEnrichPerm[[flp]]<-tmpMat
tmpNm<-strsplit(x = names(CommonGroupsEnrich)[flp],split = "_",fixed=T)[[1]]
tmpNm<-paste(tmpNm[c(1,3,2)],collapse = "_")
names(CommonGroupsEnrich)[flp]<-tmpNm
names(CommonGroupsEnrichPerm)[flp]<-tmpNm
}
```
Now that we have calculated pvalues for the significance of Br1 and Br2 overlaps with the At target genes, we want to know which of the 256 At-Br1-Br2 TF triplets represents a case where one of the Br TFs has maintained At-like behavior while the other Br TF has diverged. In other words do we see cases where one Br is significantly more At-like than the other. To run this test, we can use the same permuted distributions from the previous block. This time, we simply caclulate the difference in the Br1 and Br2 pvalues (using log10 transformation), then simply repeat this for the 10,000 permuted sets and identify cases where the difference is larger than expected.
```{r GRN_TargetOverlapDiff}
# Take the difference of the Group1 vs. Group2
CommonGroupsEnrichDiffPerm<-lapply(CommonGroupsEnrichPerm,function(x) -log10(x[1,])--log10(x[2,]))
CommonGroupsEnrichDiff<-sapply(CommonGroupsEnrich,function(x) -log10(x[1])--log10(x[2]))
# Calculate P-value of enrichment difference
# When Group1 is more enriched (right tail)
# Right tail is all we care about now since the Br genes were re-aranged so that the more enriched one is first