gene-expression.Rmd

---
title: "Emergent transcriptional adaption facilitates the convergent succession within a synthetic community"
author: "Chun-Hui Gao, Hui Cao, Peng Cai, et.al."
date: "`r Sys.Date()`"
output: 
  bookdown::html_document2:
    self_contained: no
    toc: yes
    number_sections: no
    toc_float: yes
    toc_depth: 3
bibliography: references.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      include = TRUE)
knitr::opts_chunk$set(fig.width = 8,
                      out.width = "75%",
                      fig.asp=0.618,
                      fig.align = "center",
                      message = F,
                      dev = c("png","pdf"))


# 作图
library(tidyverse)
library(ggplot2)
library(cowplot)
library(ggpubr)
library(corrplot)
library(pheatmap)
library(RColorBrewer)
library(vegan)
library(DESeq2)
library(openxlsx)
library(clusterProfiler)
theme_set(theme_bw())

 
```

## Growth of *E. coli* and *P. putida* in monoculture and cocultures

`qPCR_data` contains the result of species-specific quantitative PCR. It has five columns:

-   sample: Sample name
-   condition: monoculture and coculture names. The two-species coculture system has five different initial abundance, two of which were monocultures, and the other three had the initial ratio (EC/PP) of 1:1000, 1:1, 1000:1, respectively. For convenience, we named the *P. putida* monoculture, 1:1000, 1:1, 1000:1 cocultures and *E. coli* monoculture as related to the proportion of *E. coli* to five groups of "none", "less", "equal", "more" and "all", respectively.
-   time: sample time.
-   species: species name. In cocultures, the quantity was related to two species, respectively.
-   quantity: the quantity of species, as calculated from the qPCR CT value and standard curves.

For example:

```{r qPCR-data}
qPCR_data <- read.csv("data/qPCR-data.csv")
qPCR_data$condition <- factor(qPCR_data$condition, 
                              levels = c("none","less","equal","more","all"))

head(qPCR_data)

organism <- c("EC","PP")
organsim_fullname <- c("EC"="E. coli","PP"="P. putida")
```

### Comparision of coculture and monoculture quantities

Figure \@ref(fig:monoculture-vs-coculture) shows Growth curves of *E. coli* and *P. putida* in monoculture (A) and the "1:1000", "1:1", "1000:1" cocultures (B-D). The quantities were determined using species-specific qPCR as described in methods. In B-D subplots, the growth curves of monocultures were placed on the background layer (dashed lines), shows the variance analysis of the species abundances between monoculture and the "1:1000" (b), "1:1" (d), and "1000:1" (d) cocultures. The significance of p-values were showed by “*” (p<0.05), “**” (p<0.01) or “ns” (p>0.05). 


```{r}

# stats
growth_data <- qPCR_data %>% group_by(species,time,condition)  %>%
  summarise(y=median(log10(quantity)),std=sd(log10(quantity),na.rm = T))

# separate monoculture and coculture
monoculture <-  growth_data %>%
  filter(condition %in% c("none","all"))
coculture <- growth_data %>%
  filter(condition %in% c("less","equal","more"))


# plot
growth_curve_mono <- ggplot(monoculture,aes(time,y,color=species)) +
  geom_line(lty="dashed",size=1) +
    geom_errorbar(aes(ymin=y-std,ymax=y+std),size=.5) + 
    geom_point(data = monoculture)


```


```{r monoculture-vs-coculture,fig.asp=0.618, fig.cap="Analysis the variance of the species abundances between monoculture and cocultures"}
library(rstatix)

qPCR_stat <- qPCR_data %>% mutate(
  condition = fct_other(condition,
                        keep=c("less","equal","more"),
                        other_level =  "mono")) %>%
  group_by(time,species) %>%
  wilcox_test(.,quantity ~ condition,ref.group = "mono") %>% 
  mutate(condition = group2) %>%
  select(time,species,condition,p.adj,p.adj.signif)


# significance
variance <- qPCR_data %>%
  filter(condition %in% c("less","equal","more")) %>%
  group_by(time,condition,species) %>%
  summarise(y=max(log10(quantity))) %>%
  ungroup() %>%
  left_join(qPCR_stat)

variance_plot <- function(cond){
  qPCR_stat %>% filter(condition == cond) %>%
    left_join(filter(coculture, condition==cond)) %>%
  ggplot(aes(time,y,color=species)) +
  geom_line(size=1,alpha=1/2) +
  geom_point(size=2,alpha=1/2) +
        geom_line(lty="dashed",size=1,alpha=1/3,data = monoculture) +
    geom_errorbar(aes(ymin=y-std,ymax=y+std),size=.5,data = monoculture,alpha=1/3) + 
  ggrepel::geom_text_repel(aes(label=p.adj.signif),
                           max.overlaps = 20,
                           segment.colour = "black",
                           show.legend = F)
}

plots <- lapply(c("less","equal","more"), variance_plot)
plots = lapply(c(list(growth_curve_mono),plots), function(p){
  p + 
    ylab("Log(quantity)") + xlab("Time (h)") +
        scale_y_continuous(limits = c(5,11),breaks = 5:10) +
        scale_color_discrete(labels=organsim_fullname,name="Species") +
        theme_bw() +
        theme(legend.position = c(0.7,0.3),
              legend.text = element_text(face = "italic"))
})
plot_grid(plotlist = plots,labels = "auto", ncol = 3)
export::graph2ppt(append  = TRUE)

```


### Ratio (EC/PP) changes in cocultures

Ratios of EC/PP in cocultures were calculated, and ratio changes in cocultures were compared in Figure \@ref(fig:ratio-change).

```{r}
# calculate ratio in cocultures
ratio <- qPCR_data %>% 
  filter(condition %in% c("less","equal","more")) %>% 
  group_by(sample, condition, species, time) %>% 
  mutate(rep = row_number()) %>% 
  ungroup() %>% 
  pivot_wider(names_from = species, values_from = quantity) %>%
  mutate(ratio = EC/PP) %>%
  filter(!is.na(ratio))
```

Figure \@ref(fig:ratio-change) shows the deterministic assembly of *E. coli* and *P. putida* cocultures. (A) the real-time ratio of EC: PP after 0, 0.5, 1, 2, 4, 8 and 24 h cultivation. (B) Analysis of the variances of EC/PP ratios (24h) between cocultures.

(ref:ratio-change-figcap) Deterministic assembly of *E. coli* and *P. putida* cocultures. (A) the real-time ratio of EC: PP after 0, 0.5, 1, 2, 4, 8 and 24 h cultivation. (B) Analysis of the variances of EC/PP ratios (24h) between cocultures.

```{r ratio-change, fig.asp=0.32, fig.cap="(ref:ratio-change-figcap)"}
ratio.sum <- ratio %>% 
  group_by(sample) %>%
  mutate(y=mean(ratio,na.rm=TRUE),std=sd(ratio,na.rm=TRUE)) %>% 
  dplyr::select(sample,condition, time, y, std) %>%
  unique()
ratio.sum$condition <- factor(ratio.sum$condition,
                        levels = c("less","equal","more"),
                        labels = c("1:1000","1:1","1000:1")) 
plot_ratio <- ggplot(ratio.sum, aes(time,y,shape=condition,color=condition)) + 
  geom_rect(aes(xmin=23,xmax=25,ymin=0.02,ymax=0.3),
            fill="lightyellow",color="black",alpha=0.1) +
  geom_line(size=1,show.legend = F) +
  geom_point(size=2,show.legend = F) +
  geom_errorbar(aes(ymin=y-std,ymax=y+std),show.legend = F) +
  geom_text(aes(x=9,label=condition),hjust=0,vjust=c(0,0,1),
            data = filter(ratio.sum,time==8),
            show.legend = F) +
  scale_y_log10(labels=formatC,breaks=10^(-3:3),
                expand = expansion(0.1)) +
  geom_bracket(xmin = 0, xmax = 2,y.position=3.5,label = "ns",inherit.aes = FALSE) +
  geom_bracket(xmin = 0, xmax = 4,y.position=1,label = "ns",inherit.aes = FALSE) +
  geom_bracket(xmin = 0, xmax = 4,y.position=0.001,label = "ns",inherit.aes = FALSE) +
  labs(x="Time (h)", y="Ratio (EC/PP)") +
  theme(legend.position = c(0.8,0.75))

ratio_24h <- ratio %>% filter(time==24)
plot_ratio_stats <- ggplot(ratio_24h,aes(condition,ratio,color=condition)) +
  geom_boxplot(fill="lightyellow") + 
  geom_jitter() + 
  stat_compare_means(
    comparisons = list(c("less","equal"),c("less","more"),c("equal","more")),
    label="p.format") +
   scale_x_discrete(breaks=c("less","equal","more"),
                    labels=c("1:1000","1:1","1000:1"))+
  xlab("Condition") + ylab("Ratio (EC/PP)") +
  scale_y_continuous(expand = expansion(0.2)) +
  theme(legend.position = "none",
        panel.background = element_rect(fill="lightyellow"))

plot_grid(plot_ratio,plot_ratio_stats,labels = "auto", ncol = 3, rel_widths = c(2,1.2,2))
export::graph2ppt(append  = TRUE)

```

Notably, the ratio differences were non-significant by time in the logarithmic phase, but were all significant in the stationary phase for every coculture (Figure \@ref(fig:comparison-by-time)).

For each coculture, significance of variances between each sample were as follows:

-   the 1:1000 coculture

```{r}
### stats
ratio1 <- ratio %>% filter(condition=="less")
(p1 <- pairwise.wilcox.test(ratio1$ratio,ratio1$time,p.adjust.method = "BH"))
```

-   the 1:1 coculture

```{r}
ratio2 <- ratio %>% filter(condition=="equal")
(p2 <- pairwise.wilcox.test(ratio2$ratio,ratio2$time,p.adjust.method = "BH"))
```

-   and the 1000:1 coculture

```{r}
ratio3 <- ratio %>% filter(condition=="more")
(p3 <- pairwise.wilcox.test(ratio3$ratio,ratio3$time,p.adjust.method = "BH"))
```

Figure \@ref(fig:comparison-by-time) shows P-values of pairwise comparison of real-time EC/PP ratios in the "1:1000" (A), "1:1" (B) and "1000:1" (C) cocultures. Circles showed the adjusted p-value for the comparison of the EC/PP ratios between two samples, which were indicated on the top and left (large circle mean large p-value). On this plot, a cross mark was given if the p-value is non-significant (P \> 0.05).

```{r comparison-by-time, fig.cap="P-values of pairwise comparison of real-time EC/PP ratios in the “1:1000” (A), “1:1” (B) and “1000:1” (C) cocultures.", fig.asp=0.45}
par(mfrow=c(1,3))
pvalues <- list(p1,p2,p3)
line = 0
cex = 1.2
side = 3
adj=0.15
plots <- lapply(seq_along(pvalues), function(i){
  x <- pvalues[[i]]
  corrplot(x$p.value, 
           type = "lower",
           col = "grey",
           cl.pos = "n",
           is.corr = FALSE, 
           method = "circle",
           p.mat = x$p.value,
           sig.level = 0.05)
  mtext(LETTERS[[i]], side = side, line = line, cex = cex, adj = adj)

})

```


## Gene expression analysis

To reveal the mechanism of community assembly in cocultures, we investigated the transcriptomic changes in cocultures using RNA-seq analysis.

Totally 60 samples were sequences. After sequencing quality control, each sample has 2.6 -- 3.9 M paired reads, and 3.9 -- 5.9 G base pairs, having an overall coverage of 300 X at least. After filtration, high-quality reads were aligned against the P. putida (<https://www.ncbi.nlm.nih.gov/genome/?term=pseudomonas+putida+kt2440>) and E.coli reference genome (<https://www.ncbi.nlm.nih.gov/genome/?term=Escherichia+coli+K-12>) using hisat2 and gene expression changes were quantified using DESeq2 software [@herbergDiagnosticTestAccuracy2016; @loveModeratedEstimationFold2014]. While aligning to reference genomes, the overall aligned rates are ranging from 97.23% to 98.43%.

```{r}
tableS1 <- read.xlsx("./data/tableS1.xlsx")
summary(tableS1)
```

Figure \@ref(fig:reads-count) shows the reads count of *E. coli* and *P. putida* in each RNA-seq library. Samples were taken with each condition at indicated times (at 0, 4, 8 and 24h). (A-D) *P. putida* monoculture samples; (E-H) *E. coli* and *P. putida* "1:1000" coculture; (I-L) the "1:1" coculture; (M-P) the "1000:1" coculture; and (Q-T) the *E. coli* monoculture samples. Each sample has three replicates. Only aligned reads to either *E. coli* or *P. putida* genome were used in this calculation. Plots showed the proportion of reads which have aligned to corresponding genome by different colors, as indicated on the top.

```{r reads-count, fig.asp=1.5,width=6, fig.cap="Reads count of *E. coli* and *P. putida* in each RNA-seq library"}
ht_counts <- readRDS(file = "./data/ht_counts.rds")
ht_counts$group <- factor(ht_counts$group,
                          levels = c("none_0h","none_4h","none_8h","none_24h","less_0h","less_4h","less_8h","less_24h","equal_0h","equal_4h","equal_8h","equal_24h","more_0h","more_4h","more_8h","more_24h","all_0h","all_4h","all_8h","all_24h"),
                          labels = c("P. putida_0h","P. putida_4h","P. putida_8h","P. putida_24h","1:1000_0h","1:1000_4h","1:1000_8h","1:1000_24h","1:1_0h","1:1_4h","1:1_8h","1:1_24h","1000:1_0h","1000:1_4h","1000:1_8h","1000:1_24h","E. coli_0h","E. coli_4h","E. coli_8h","E. coli_24h"))
ht_counts_total <- ht_counts %>% group_by(sample_id, group, organism) %>%
  summarise(sum_of_reads=sum(count)) %>%
  group_by(sample_id) %>% 
  mutate(proportion=sum_of_reads/sum(sum_of_reads))
samples <- levels(ht_counts_total$group)
plots <- lapply(1:length(samples),function(i){
  sample_group <- samples[[i]]
  df <- filter(ht_counts_total,group==sample_group)
  ggplot(df,aes(x=sample_id, y=proportion,fill=organism)) + 
    geom_bar(stat = "identity",position = "stack") + 
    scale_y_continuous(labels = function(l)paste(format(l*100,digits = 2),"%",sep="")) +
    scale_x_discrete(labels=c("Rep1","Rep2","Rep3")) +
    scale_fill_discrete(name="Organism: ",labels=c("EC"="E. coli","PP"="P. putida")) +
    labs(title = sample_group) +
    theme(legend.text = element_text(face = "italic"),
          legend.position = "none",
          axis.title = element_blank())
})
legend <- get_legend(plots[[1]] + theme(legend.position = "top",legend.direction = "horizontal"))
plot_grid(legend, plot_grid(plotlist = plots,labels = "auto",ncol=4),rel_heights = c(1,15),ncol=1)

```

### Identify gene expression changes

This step is a time consuming step. Use the precalculated DEG if possible.

```{r}
# load precalculated DEG results.
dds.EC <- readRDS("data/dds.EC.2.rds")
dds.PP <- readRDS("data/dds.PP.2.rds")
DEG_results.EC <- readRDS("data/DEG_results.EC.rds")
DEG_results.PP <- readRDS("data/DEG_results.PP.rds")
```

### RNA-seq clustering

First of all, we compared the gene expression profiles between monoculture and cocultures at genome level. Figure \@ref(fig:RNA-seq-PCA) shows the Principle coordinate analysis (PCA) of the gene expression profiles of *E. coli* (A) and *P. putida* (B) in different samples.

```{r}
list_of_vsd <- lapply(list(dds.EC,dds.PP),function(dds){
  vst(dds,blind = F)
})
list_of_vsd[[1]]$ratio0 <- factor(list_of_vsd[[1]]$ratio0,
                                  levels = c("none","less","equal","more","all"),
                        labels = c("P. putida","1:1000","1:1","1000:1","E. coli"))
list_of_vsd[[2]]$ratio0 <- factor(list_of_vsd[[2]]$ratio0,
                                  levels = c("none","less","equal","more","all"),
                        labels = c("P. putida","1:1000","1:1","1000:1","E. coli"))

```

The results showed that the gene expression profiles were different between monoculture and cocultures in the beginning, and became more consistent in the later (Fig. \@ref(fig:RNA-seq-PCA)A). Differences in gene expression appeared immediately after mixing of *E. coli* and *P. putida*, which has been described by 0h data (due to the time required for manual operation, this is of course not the real 0h). The expression profiles (0h) of *E. coli* monoculture, "1:1" and "1000:1" cocultures are close, but the "1:1000" coculture is distinct from the others. At 4h and 8h, we can still visually separate the "less" samples and the others on the plots. However, the expression profiles of *E. coli* became closer by time. After 24h cultivation, there was no obvious boundary between *E. coli* monoculture and three cocultures. Likewise, the expression profiles of *P. putida* has a similar pattern, except that the distance between "1000:1" coculture and other three samples was more obvious (Fig. \@ref(fig:RNA-seq-PCA)B).

```{r}
myPlotPCA <- function(object,
                      intgroup = c("time","ratio0"), 
                      show.label = FALSE,
                      return_data = FALSE) {
  require(dplyr,quietly = T)
  require(DESeq2,quietly = T)
  require(vegan,quietly = T)
  pca <- rda(t(assay(object))) 
  
  percent_var <- pca$CA$eig/pca$tot.chi  
  
  if (!all(intgroup %in% names(colData(object)))) {
    stop("the argument 'intgroup' should specify columns of colData(dds)")
  }
  intgroup_df <- as.data.frame(colData(object)[, intgroup, 
                                               drop = FALSE]) %>%
    tibble::rownames_to_column(var = "sample_id")
  
  df <- scores(pca)$sites %>% 
    as.data.frame() %>%
    tibble::rownames_to_column(var="sample_id") %>%
    left_join(intgroup_df,by="sample_id") %>% 
    mutate(time=factor(time,levels = sort(unique(as.numeric(time)))))
  
  if (return_data){
    attr(df, "percentvar") <- percent_var
    return(df)
  } 
  
  mapping <- aes(PC1, PC2, color=ratio0, label=sample_id)
  
  p <- ggplot(df,mapping) +
    geom_point(size=2)  +
    xlab(paste0("PC1: ", round(percent_var[1] * 100), "% variance")) + 
    ylab(paste0("PC2: ", round(percent_var[2] * 100), "% variance"))
  
  if (show.label) {
    return(p + geom_text_repel(show.legend = F))
  } else {
    return(p)
  }
  
}
```


```{r RNA-seq-PCA, fig.width=8,fig.asp=0.618,fig.cap="Principle coordinate analysis (PCA) of the gene expression profiles of *E. coli* (A) and *P. putida* (B) in different samples."}
list_of_PCA_plot <- lapply(list_of_vsd, function(vsd) {
  myPlotPCA(vsd) + 
    facet_wrap(~time,ncol=4) +
    directlabels::geom_dl(aes(label=ratio0),method = "smart.grid",size=2) +  #文本代替标签 位置标注的不好,改size没用
    scale_color_manual(limits=c("E. coli","1:1000","1:1","1000:1","P. putida"),
                       values = brewer.pal(5,"Dark2")) +
    theme(legend.position = "none")
  })

plot_grid(plotlist = list_of_PCA_plot,labels = "auto",ncol=1)
```


### Beta-dispersion

According to one of the reviewers' suggestions, we **use the beta-dispersion** to compare the significance of gene expression change within different time points.

```{r}

models = lapply(list_of_vsd, function(object){
  dist = vegdist(t(assay(object)), method = "euclidean")
  group_data = colData(object) %>% 
    as_tibble() %>%
    dplyr::mutate(group=paste0(ratio0," (",time,"h)")) 
  mod = with(group_data, betadisper(dist, group = group))
  return(mod)
})
```

We define a function `ggplot.betadisper()` to plot *betadisper* object with ggplot2 methods.

```{r}
# group meta data
group_meta = tableS1 %>%
  dplyr::rename_all(tolower) %>%
  dplyr::select(sample, condition, time) %>%
  dplyr::mutate(condition = factor(condition,
                               levels = c("none","less","equal","more","all"),
                               labels = c("P. putida","1:1000","1:1","1000:1","E. coli"))) %>%
  dplyr::mutate(group =  paste0(condition," (",time,"h)"))

ggplot.betadisper = function(x){
  
  percent_var = x$eig/sum(x$eig)
  xylab = paste0("PCoA ", 1:2, ": ", round(percent_var[1:2] * 100), "% variance") 
  
  sites = scores(x, display = "sites", choices = c(1,2)) %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column(var="sample")
  sites$group = x$group
  sites = left_join(sites, group_meta)
  
  centroids = scores(x, display = "centroids", choices = 1:2) %>%
    as.data.frame() %>%
    tibble::rownames_to_column(var="group")
  centroids = left_join(centroids, group_meta %>% select(-sample) %>% unique())
  
  ggplot() +
    aes_string(x="PCoA1",y="PCoA2") +
    geom_point(aes(color = condition), shape = 21, alpha = 1/2,data = sites) +
    geom_polygon(aes(fill=condition, group=group), data=sites, alpha=1/2) +
    geom_point(aes(color=condition, fill=condition), shape=21, size = 3, data=centroids ) +
    ggrepel::geom_label_repel(aes(label = group, color = condition), 
                              data=centroids, 
                              alpha=1/2,
                              label.size = NA,
                              segment.color = "grey",
                              fontface = "bold.italic") + 
    scale_color_manual(limits=c("E. coli","1:1000","1:1","1000:1","P. putida"),
                       values = brewer.pal(5,"Dark2")) +
    scale_fill_manual(limits=c("E. coli","1:1000","1:1","1000:1","P. putida"),
                       values = brewer.pal(5,"Dark2")) +
    labs(x = xylab[[1]], 
         y = xylab[[2]]) +
    theme(legend.position = "none")
  

}
```

(ref:cap-betadisper-ec) Beta-dispersion shows the gene expression trajectory of all *E. coli* containing cultures in the plane of PCoA 1 and PCoA 2 over time. Open circles are the coordinates of samples. There are three points for each condition and time, and they formed a color-filled triangle. Centroid for each triangle was highlighted with larger size point with filled color.

```{r betadisper-ec, fig.cap="(ref:cap-betadisper-ec)",fig.cap=.5}
p1 = ggplot.betadisper(models[[1]])
p1
export::graph2ppt(append  = TRUE)
```

(ref:cap-betadisper-pp) Beta-dispersion shows the gene expression trajectory of all *P. putida* containing cultures in the plane of PCoA 1 and PCoA 2 over time. Open circles are the coordinates of samples. There are three points for each condition and time, and they formed a color-filled triangle. Centroid for each triangle was showed as a filled circle.

```{r betadisper-pp, fig.cap="(ref:cap-betadisper-pp)", fig.cap=.5}
p2 = ggplot.betadisper(models[[2]])
p2
export::graph2ppt(append  = TRUE)
```

```{r include=FALSE,fig.asp=0.5}
plot_grid(p1,p2,ncol = 2)
```


#### Comparison of dispersion 

- For the dispersion of *E. coli*

The gene expression dispersion of 1:1000 is the highest at 0 and 4 h.

```{r}
par(mar = c(7,5,2,1))
boxplot(models[[1]], las = 2, xlab = NA, 
        main = expression( italic(E.~coli )))
```

- For the dispersion of *P. putida*

The gene expression dispersion of 1000:1 is the highest at 0, 4 and 8 h.


```{r}
par(mar = c(7,5,2,1))
boxplot(models[[2]], las = 2, xlab = NA, 
        main = expression(italic(P.~putida)))
```


### Number of DEGs

Figure \@ref(fig:number-of-deg) shows the numbers of differentially expressed genes (DEGs) in three cocultures, in *E. coli* (A) and *P. putida* (B). The DEGs were identified by comparison with the corresponding monoculture at the same time. Up- and down-regulation of genes were colored by red and cyan, respectively.

```{r number-of-deg, fig.asp=0.8,fig.cap="numbers of differentially expressed genes (DEGs) in three cocultures, in *E. coli* (A) and *P. putida* (B).",fig.width=5}
## count DEG
deg_count <- function(data){
  do.call("rbind",lapply(data, function(x) table(x$expression))) %>%
    as.data.frame() %>%
    rownames_to_column(var = "name") %>%
    separate(name, into = c("ratio","time"), sep = "_", extra = "drop") %>%
    mutate(time = as.numeric(str_extract(time, "[0-9]+"))) %>%
    pivot_longer(cols = c("dn","up"), names_to = "type", values_to = "count") %>%
    mutate(count = ifelse(type == "dn", -count, count)) %>%
    complete(ratio, time, type, fill = list(count = 0)) %>%
    mutate(ratio = factor(ratio, 
                          levels = c("less","equal","more"),
                          labels = c("1:1000","1:1","1000:1")),
           type = factor(type,
                         levels = c("up","dn"),
                         labels = c("Up","Down")))
}


deg_count_EC <- deg_count(DEG_results.EC)
deg_count_PP <- deg_count(DEG_results.PP)
count <- list(deg_count_EC, deg_count_PP)
library(ggtext)
deg_count_plots <- lapply(seq_along(count), function(i){
  x <- count[[i]]
  ggplot(x, aes(x = time, y = count, color = type)) +
    geom_point() +
    geom_line(size = 1) +
    scale_y_continuous(labels = function(x){abs(x)}) +
    facet_wrap(~ratio) +
    labs(x="Time(h)",y="Number of DEGs",
         color = "Gene expression:",
         title= paste0("DEGs in *",organsim_fullname[[i]],"*")) +
    theme(legend.position = c(0.618,1),
        legend.justification = c(0.5,-0.65),
        legend.direction = "horizontal",
        plot.title = element_markdown())
}) 


plot_grid(plotlist = deg_count_plots,labels = "auto",ncol = 1)
```

### Specific DEGs in *E. coli* and *P. putida*

Figure \@ref(fig:deg-venn) shows the comparision of differentially expressed genes by time in *E. coli* (A-C) and *P. putida* (D-F). Although DEGs overlapped for different time, specific genes are the majority in almost every time point. For instance, 57 out of 93 DEGs are specific in 1:1000 coculture in *E. coli* (Fig. \@ref(fig:deg-venn)A).

```{r deg-venn, fig.width=10, fig.cap="Comparision of differentially expressed genes by time"}
library(ggVennDiagram)

ratio0 <- c("1:1000","1:1","1000:1")
deg_Venn_plot_EC <- lapply(seq_along(ratio0), function(i){
  gene_list <- lapply(DEG_results.EC[(i*4-3):(i*4)], function(x){x$gene})
  ggVennDiagram(gene_list,label = "count",
                category.names = c("0h","4h","8h","24h")) +
    scale_fill_gradient(low="white",high="red",limits=c(0,310)) +
    labs(title=paste0(ratio0[[i]]," - *E. coli*")) +
    theme(legend.position = "none",
          plot.title = element_markdown(hjust=0.5))
})

deg_Venn_plot_PP <- lapply(seq_along(ratio0), function(i){
  gene_list <- lapply(DEG_results.PP[(i*4-3):(i*4)], function(x){x$gene})
  ggVennDiagram(gene_list,label = "count",
                category.names = c("0h","4h","8h","24h")) +
    scale_fill_gradient(low="white",high="red",limits=c(0,310)) +
    labs(title=paste0(ratio0[[i]]," - *P. putida*")) +
    theme(legend.position = "none",
          plot.title = element_markdown(hjust=0.5))
})

plot_grid(plotlist = c(deg_Venn_plot_EC,deg_Venn_plot_PP),
          labels = "auto")

```

## Enrichment analysis of DEGs

We use **clusterProfiler** to perform KEGG enrichment analysis.

```{r}

kegg_path_tree <- function(df, pathway_name, gene_id, sep="/"){
  pathway <- df %>% 
    separate_rows(all_of(gene_id), sep = sep) %>%
    dplyr::select(all_of(c(pathway_name, gene_id))) %>%
    unique() %>%
    mutate(value = 1) %>%
    pivot_wider(id_cols = pathway_name,
                names_from = gene_id,
                values_from = value,
                values_fill = list(value = 0))
  library(vegan)
  matrix <- as.matrix(column_to_rownames(pathway, pathway_name))
  dist <- vegdist(matrix, method = "jaccard")
  
  library(ape)
  library(ggtree)
  tree <- bionj(dist)
  p <- ggtree(tree,branch.length = "none")
  p$data %>%
    filter(isTip) %>%
    arrange(y) %>%
    pull(label)
}

ck_plot <- function(ck, 
                    mapping = aes(time, Description, size = GeneRatio)){
  df <- data.frame(ck) %>%
    mutate(ratio = factor(ratio, levels = c("less","equal","more"),
                          labels = c("1:1000","1:1","1000:1")),
           time = factor(time, levels = c("0h","4h","8h","24h"))) 
  df$Description <- factor(df$Description, 
                           levels = 
                             kegg_path_tree(df,
                                            pathway_name = "Description",
                                            gene_id = "geneID"))
  df$GeneRatio <- sapply(df$GeneRatio, function(x) eval(parse(text = x)))
  ggplot(df, mapping = mapping) +
    geom_point() +
    facet_grid(~ ratio, scales = "free_y") +
    labs(y="KEGG pathway")
}


grid_panel_autoheight <- function(p){
  # 调整panel的高度
  require(gtable)
  gp <- ggplotGrob(p)
  # gtable_show_layout(gp)
  facet.rows <- gp$layout$t[grepl("panel",gp$layout$name)]
  y.var <- sapply(ggplot_build(p)$layout$panel_scales_y,
                  function(l) length(l$range$range))
  gp$heights[facet.rows] <- gp$heights[facet.rows] * y.var
  return(gp)
}

reset_facet_width <- function(p, rel_width = c(1)){
  gp <- ggplotGrob(p)
  facet.cols <- gp$layout$l[grepl("panel",gp$layout$name)]
  gp$widths[facet.cols] <- gp$widths[facet.cols] * rel_width
  return(gp)
}

```

### KEGG enrichment in *E. coli*


```{r}
deg1 <- do.call("rbind", DEG_results.EC) %>% 
  separate(comparison, into = c("ratio","time"), extra = "drop")

ck1 <- compareCluster(gene ~ ratio + time, 
                      data = deg1, 
                      fun = "enrichKEGG", 
                      organism = "eco",
                      use_internal_data = TRUE) 

p1 <- ck_plot(ck1)
p1 <- grid_panel_autoheight(p1)
```

### KEGG enrichment in *P. putida*

```{r}
deg2 <- do.call("rbind", DEG_results.PP) %>% 
  separate(comparison, into = c("ratio","time"), extra = "drop")

ck2 <- compareCluster(gene ~ ratio + time, 
                      data = deg2, 
                      fun = "enrichKEGG", 
                      organism = "ppu",
                      use_internal_data = TRUE) 

p2 <- ck_plot(ck2)
p2 <- grid_panel_autoheight(p2)
```

### Dotplot of KEGG enrichment results

(ref:kegg-ora-cap) A dot plot shows the KEGG enrichment result of *E. coli* (A) and *P. putida* (B) in the three coculture as a function of time.

```{r kegg-ora, fig.width=8, fig.asp=1.1, fig.cap="(ref:kegg-ora-cap) "}
plot_grid(p1,p2, rel_heights = c(1,0.3), ncol = 1, labels = "auto",align = "v")

```

## Get set enrichment analysis (GSEA) of gene expression profile

Enrichment analysis of DEGs is a common approach in analyzing gene expression profiles. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA) [@subramanianGeneSetEnrichment2005] directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.


```{r}
# load pre-calculated result
gene_expression.EC <- readRDS("data/gene_expression.EC.rds")
gene_expression.PP <- readRDS("data/gene_expression.PP.rds")
```

```{r warning=FALSE}
# define a function to extract and prepare formatted data for GSEA
get_genelist <- function(x){
  if (nrow(x) < 1) return(NULL)
  geneList <- x$log2FoldChange
  names(geneList) <- x$gene
  geneList <- sort(geneList,decreasing = T) 
  return(geneList)
}

set.seed(1234)

```

### *E. coli* GSEA KEGG result

```{r gseKEGG-result-EC, warning=FALSE}
gseKEGG_results.EC <- lapply(gene_expression.EC, function(x){
  geneList <- get_genelist(x)
  tryCatch(gseKEGG(geneList, 
                   organism = "eco",
                   eps = 1e-20,
                   pvalueCutoff = 1, # all results
                   use_internal_data = TRUE),
           error=function(e) NULL)
})
count.EC.pathways = nrow(data.frame(gseKEGG_results.EC[[1]]))

```

The GSEA analysis of *E. coli* included `r count.EC.pathways` pathways.

### *P. putida* GSEA KEGG result

```{r gseKEGG-result-PP, warning=FALSE}
gseKEGG_results.PP <- lapply(gene_expression.PP, function(x){
  geneList <- get_genelist(x)
  tryCatch(gseKEGG(geneList, 
                   organism = "ppu",
                   eps = 1e-20,
                   pvalueCutoff = 1, # all results
                   use_internal_data = TRUE),
           error=function(e) NULL)
})
count.PP.pathways = nrow(data.frame(gseKEGG_results.PP[[1]]))
```

The GSEA analysis of *P. putida* included `r count.PP.pathways` pathways.

### Dotplot of GSEA results

```{r}
# combine and reform GSEA result to a data frame
gse_result <- function(result){
  name <- names(result)
  l <- lapply(seq_along(result), function(i){
    data.frame(result[[i]]) %>%
      mutate(comparison = name[[i]])
  })
  do.call("rbind", l) %>%
    separate(comparison, into = c("ratio","time"), extra = "drop") %>%
    mutate(ratio = factor(ratio, levels = c("less","equal","more"),
                          labels = c("1:1000","1:1","1000:1")),
           time = factor(time, levels = c("0h","4h","8h","24h")),
           type = ifelse(p.adjust > 0.05, "unchanged",
                         ifelse(enrichmentScore >0, "activated", "suppressed")),
           enrichScore = abs(enrichmentScore)) 
}

# dotplot GSEA result
gse_dotplot <- function(df){
  ggplot(df, aes(time, Description, size = enrichScore,color=type)) +
    geom_point() +
    facet_grid(~ ratio, scales = "free_y") +
    labs(y="KEGG pathway") +
    scale_size(limits = c(0.2,1.0))
}
```

```{r include=FALSE}
## GSEA dotplot of E. coli
df1 <- gse_result(gseKEGG_results.EC) %>%
  filter(type %in% c("activated","suppressed"))
eco_gsea_pathway <- kegg_path_tree(df1,
                                  pathway_name = "Description",
                                  gene_id = "core_enrichment")
df1$Description <- factor(df1$Description,
                          levels = eco_gsea_pathway)
gsea_plot_EC <- gse_dotplot(df1)

## GSEA dotplot of P. putida
df2 <- gse_result(gseKEGG_results.PP) %>%
  filter(type %in% c("activated","suppressed"))
ppu_gsea_pathway <- kegg_path_tree(df2,
                              pathway_name = "Description",
                              gene_id = "core_enrichment")
df2$Description <- factor(df2$Description,
                          levels = ppu_gsea_pathway)
gsea_plot_PP <- gse_dotplot(df2)
```

(ref:gsea-dotplot-figcap) GSEA result of cocultures in *E. coli* (A) and *P. putida* (B)

```{r gsea-dotplot, fig.asp=1.5, fig.width = 8, fig.cap="(ref:gsea-dotplot-figcap)"}
plot_grid(gsea_plot_EC,gsea_plot_PP,align = "v", ncol = 1,labels = "auto",rel_heights = c(1.5,1))
export::graph2ppt(append  = TRUE)
```


Figure \@ref(fig:gsea-dotplot) shows the GSEA analysis result for *E. coli* (A) and *P. putida* (B).

### Overlap of GSEA pathway in *E. coli* and *P. putida*

```{r}
# todo
merged_gsea <- rbind(mutate(df1, organism = "eco"),
                     mutate(df2, organism = "ppu"))

```

```{r eval=FALSE}
upset_gsea = merged_gsea %>% group_by(organism,type,time) %>% nest() 
upset_gsea$pathway <- sapply(upset_gsea$data, function(x) pull(x, Description) %>% as.character())
l <- upset_gsea$pathway
names(l) = paste(upset_gsea$organism, upset_gsea$type, upset_gsea$time,sep = "-")
upset(fromList(l), nsets=23,keep.order = T, order.by = "freq")

```


Figure \@ref(fig:gsea-overlap-vennplot) shows the Overlap of GSEA pathway in *E. coli* and *P. putida*. 


(ref:gsea-overlap-vennplot-figcap) Overlap of GSEA pathway in *E. coli* and *P. putida*. Pathways were distinguished by their descriptions, and separated to activated and suppressed ones. 

```{r gsea-overlap-vennplot, fig.asp=0.5, fig.cap="(ref:gsea-overlap-vennplot-figcap)"}
# changed pathways
venn_gsea <- merged_gsea %>%
  filter(type %in% c("suppressed","activated")) %>% 
  group_by(organism, type) %>% nest()
venn_gsea$pathway <- sapply(venn_gsea$data, function(x) pull(x, Description) %>% as.character())

l <- venn_gsea$pathway
names(l) <- paste(venn_gsea$organism, venn_gsea$type,sep = "-")
gsea_pathway_venn1 <- ggVennDiagram(l,label = "count") +
    scale_x_continuous(expand = c(0.1,0.1))+ 
  scale_fill_gradient(low="white",high="red",limits=c(0,100)) +
  theme(legend.position = "none")

l2 <- list(eco=c(l[[1]],l[[2]]), ppu=c(l[[3]],l[[4]]))
gsea_pathway_venn2 <- ggVennDiagram(l2, label = "count")+  
  scale_fill_gradient(low="white",high="red",limits=c(0,50)) +
  theme(legend.position = "none")

l3 <- list(activated=c(l[[1]],l[[3]]), suppressed=c(l[[2]],l[[4]]))
gsea_pathway_venn3 <- ggVennDiagram(l3, label = "count") + 
  scale_fill_gradient(low="white",high="red",limits=c(0,50)) +
 theme(legend.position = "none")

plot_grid(plot_grid(gsea_pathway_venn2,gsea_pathway_venn3,ncol = 1,labels = c("A","B")),gsea_pathway_venn1,labels = c("","C"),ncol = 2,rel_widths = c(.5,1))
export::graph2ppt(append  = TRUE)
```


View intersect in Venn diagram interactively (Figure \@ref(fig:overlap-plotly)).

```{r overlap-plotly, warning=FALSE, out.width="100%", fig.cap="Venn plot"}
p <- ggVennDiagram(l, label = NULL, show_intersect = TRUE)
plotly::ggplotly(p)
```

## Session info

```{r}
sessioninfo::session_info()
```


## Reference