diff --git a/Private and shared alleles/Early diverging species/dfoil.rmd b/Private and shared alleles/Early diverging species/dfoil.rmd index a1d151f..0b9fcbd 100644 --- a/Private and shared alleles/Early diverging species/dfoil.rmd +++ b/Private and shared alleles/Early diverging species/dfoil.rmd @@ -1,7 +1,7 @@ --- -title: "Applyinf the dfoil method to study the introgression of Sb" +title: "Comparing ILS and introgression scenarios using SNPs" author: "Rodrigo Pracana, QMUL" -date: "21/10/2021" +date: "04/10/2021" output: html_document --- @@ -51,7 +51,7 @@ dir.create(recursive = T, "results/fig") read_dfoil <- function(path, species) { dfoil <- read.table(path, comment.char = '&', header = T) dfoil$species <- species - + dfoil %>% rename(gene = X.chrom) %>% mutate(gene = gsub(".*\\/", "", gene)) %>% @@ -134,7 +134,7 @@ We chose a single sample to represent each species. For each gene in the superge read_allele_counts <- function(path, species) { dfoil <- read.table(path, comment.char = '&', header = T) dfoil$species <- species - + dfoil %>% rename(gene = X.chrom) %>% mutate(gene = gsub(".*\\/", "", gene)) %>% @@ -216,9 +216,297 @@ all_dfoil %>% From these patterns, it is clear that the "parallel origin" hypothesis, where alleles follow the species tree and ILS, is not correct. -At face value, the DFO and DOL values in our results could indicate that the SB have a more recent divergence than the separation between SB and Sb, which in turn would support the ancestral polymorphism hypothesis. However, both our alternative scenarios (ancestral polymorphism or introgression) are extremely discordant from the species tree. The measures of DFO and DOL are heavily skewed by alleles that are private to either SB (BAABA) or Sb (ABBAA), which could occur under either scenario. +At face value, the DFO and DOL values in our results could indicate that the SB have a more recent divergence than the separation between SB and Sb, which in turn would support the ancestral polymorphism hypothesis. However, both our alternative scenarios (ancestral polymorphism or introgression) are extremely discordant from the species tree. The measures of DFO and DOL are heavily skewed by alleles that are private to either SB (BAABA) or Sb (ABBAA), which could occur under either scenario. In the following, I examine specific allele combinations that give evidence for one or the other hypothesis. + +## Private alleles support the introgression hypothesis + +Under the introgression hypothesis, the terminal branch lengths of Sb are expected to be smaller than those of SB, as the separation between SB variants represents a relative old speciation event, whereas the separation between Sb variants represent more recent introgression events. + +As a proxy for branch length, we counted the number of alleles that are private to each terminal branch. Indeed, the number of alleles that are private to the SB supergene terminal branches are greater than those private to each Sb terminal branch (two-tail paired t-test, Bonferroni corrected p < 0.05 for each species). This pattern could support a younger age for species separation in Sb than in SB, supporting the introgression hypothesis. + +```{r, fig.height = 5, fig.width = 5} + +ggplot(all_counts) + + geom_point(aes(x = ABAAA + AABAA, y = BAAAA + AAABA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, yintercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("Sb (ABAAA + AABAA)") + + ylab("SB (BAAAA + AAABA)") + + ggtitle("Private to terminal branches") -> p1 + +p1 + +all_counts %>% + group_by(species) %>% + summarise(p = t.test(ABAAA + AABAA, BAAAA + AAABA, paired = TRUE)$p.value, + t = t.test(ABAAA + AABAA, BAAAA + AAABA, paired = TRUE)$statistic, + df = t.test(ABAAA + AABAA, BAAAA + AAABA, paired = TRUE)$parameter, + mean_invicta = mean(ABAAA + AABAA), + mean_focal = mean(BAAAA + AAABA), + mean_invicta_b = mean(AABAA), + mean_focal_b = mean(ABAAA), + mean_invicta_B = mean(AAABA), + mean_focal_B = mean(BAAAA)) %>% + mutate(corrected_p = p * 4) %>% + mutate(significant = ifelse(corrected_p < 0.05,"*","")) %>% + as.data.frame() + + +``` + +However, it is important to note that effective population size is an important determinant of branch length. With higher effective population size, branch lengths can get "stretched" to before the speciation event occured. Thus, the pattern above could be consistent with the ancestral polymorphism hypothesis, since a lower effective population size in Sb relative to SB would cause Sb to have shorter terminal branches (despite equal divergence age). + +![](results/tree_models.png) + +Indeed, assuming the introgression hypothesis (the topology on the right) measured the branch lengths going from the supergene origin to Sb in _S. invicta_ and each focal species, and compared it to the branch length of SB in _S. invicta_ and in each focal species. In all but one comparison, the SB branch was smaller than the Sb branch, unlike the expectation that it should be equal or larger. + +```{r, fig.height = 12, fig.width = 12} + +ggplot(all_counts) + + geom_point(aes(x = ABBAA + AABAA, y = AAABA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("invicta b") + + ylab("invicta B") + + ggtitle("invicta b VS invicta B") -> p1 + +ggplot(all_counts) + + geom_point(aes(x = ABBAA + ABAAA, y = BAAAA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("focal b") + + ylab("focal B") + + ggtitle("focal b VS focal B") -> p2 + +ggplot(all_counts) + + geom_point(aes(x = ABBAA + AABAA, y = BAAAA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("invicta b") + + ylab("focal B") + + ggtitle("invicta b VS focal B") -> p3 + +ggplot(all_counts) + + geom_point(aes(x = ABBAA + AABAA, y = AAABA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("focal b") + + ylab("invicta B") + + ggtitle("focal b VS invicta B") -> p4 + +(p1 + p2) / (p3 + p4) + +all_counts %>% + group_by(species) %>% + summarise("invicta b VS invicta B_p" = t.test(ABBAA + AABAA, AAABA, paired = TRUE)$p.value, + "invicta b VS invicta B_t" = t.test(ABBAA + AABAA, AAABA, paired = TRUE)$statistic, + "invicta b VS invicta B_df" = t.test(ABBAA + AABAA, AAABA, paired = TRUE)$parameter, + "invicta b VS invicta B_mean b" = mean(ABBAA + AABAA), + "invicta b VS invicta B_mean B" = mean(AAABA), + "focal b VS focal B_p" = t.test(ABBAA + ABAAA, BAAAA, paired = TRUE)$p.value, + "focal b VS focal B_t" = t.test(ABBAA + ABAAA, BAAAA, paired = TRUE)$statistic, + "focal b VS focal B_df" = t.test(ABBAA + ABAAA, BAAAA, paired = TRUE)$parameter, + "focal b VS focal B_mean b" = mean(ABBAA + ABAAA), + "focal b VS focal B_mean B" = mean(BAAAA), + "invicta b VS focal B_p" = t.test(ABBAA + AABAA, BAAAA, paired = TRUE)$p.value, + "invicta b VS focal B_t" = t.test(ABBAA + AABAA, BAAAA, paired = TRUE)$statistic, + "invicta b VS focal B_df" = t.test(ABBAA + AABAA, BAAAA, paired = TRUE)$parameter, + "invicta b VS focal B_mean b" = mean(ABBAA + AABAA), + "invicta b VS focal B_mean B" = mean(BAAAA), + "focal b VS invicta B_p" = t.test(ABBAA + ABAAA, AAABA, paired = TRUE)$p.value, + "focal b VS invicta B_t" = t.test(ABBAA + ABAAA, AAABA, paired = TRUE)$statistic, + "focal b VS invicta B_df" = t.test(ABBAA + ABAAA, AAABA, paired = TRUE)$parameter, + "focal b VS invicta B_mean b" = mean(ABBAA + ABAAA), + "focal b VS invicta B_mean B" = mean(AAABA)) %>% + ungroup %>% + pivot_longer(-species) %>% + separate(name, c("comparison", "stat"), sep = "_") %>% + pivot_wider(names_from = stat, values_from = value) %>% + mutate(corrected_p = p * 4) %>% + mutate(significant = ifelse(corrected_p < 0.05,"*","")) %>% + arrange(comparison, species) %>% + as.data.frame() + +``` + +Nevertheless, our analysis in based on alleles in genic regions - some of which likely to be slightly deleterious. Slightly deleterious alleles have higher subtitution rates under low Ne, which could be an alternative explanation for the slightly higher number of private alleles in Sb. + +Furthermore, there are many more alleles fixed in Sb than in SB (two-tail paired t-test, Bonferroni corrected p < 0.05 for each species) - indeed, almost no alleles are fixed among SB of different species. This could suggest that the SB variants do not share a common branch before speciation. + +```{r, fig.height = 5, fig.width = 5} + +ggplot(all_counts) + + geom_point(aes(x = ABBAA, y = BAABA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("Sb (ABBAA)") + + ylab("SB (BAABA)") + + ggtitle("Fixed private alleles") -> p1 + +p1 + +all_counts %>% + group_by(species) %>% + summarise(p = t.test(ABBAA, BAABA, paired = TRUE)$p.value, + t = t.test(ABBAA, BAABA, paired = TRUE)$statistic, + df = t.test(ABBAA, BAABA, paired = TRUE)$parameter, + mean_invicta = mean(ABBAA), + mean_focal = mean(BAABA)) %>% + mutate(corrected_p = p * 4) %>% + mutate(significant = ifelse(corrected_p < 0.05,"*","")) %>% + as.data.frame() + +``` + +## Evidence that Sb introgressed from _S. invicta_ to the remaining species + +Additionally, we tested whether the SB branch is longer in _S. invicta_ relative to each focal species. A difference in branch length would support the introgression hypothesis, and inform us on the direction of introgression: + +((richteri_b, invicta_b),invicta_B),richteri_B)),saevissima) + +((richteri_b, invicta_b),richteri_B),invicta_B)),saevissima) + +We tested the hypothesis that the species in which Sb evolved will have fewer alleles private to SB, given that it will share part of its branch with Sb. We found that the SB of _S. invicta_ had fewer private alleles than the SB of each focal species. This difference was small but significant in all species (two-tail paired t-test, Bonferonni-corrected p < 0.05) except for _S. interrupta_, where there was no significant difference. + +```{r, fig.height = 6, fig.width = 6} + +all_counts %>% + mutate(species = paste0("S. ", species)) %>% + ggplot() + + geom_point(aes(x = AAABA, y = BAAAA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 50), ylim = c(0,50)) + + xlab("Private to S. invicta SB") + + ylab("Private to focal SB") -> p1 +p1 + +all_counts %>% + group_by(species) %>% + summarise(p = t.test(AAABA, BAAAA, paired = TRUE)$p.value, + t = t.test(AAABA, BAAAA, paired = TRUE)$statistic, + estimate = t.test(AAABA, AABAA, paired = TRUE)$estimate, + mean_invicta = mean(AAABA), + mean_focal = mean(BAAAA)) %>% + ungroup %>% + mutate(corrected_p = p * 4) %>% + mutate(significant = ifelse(corrected_p < 0.05,"*","")) %>% + as.data.frame() + +``` + +We then tested the hypothesis that the species in which Sb evolved will have more private alleles in common with Sb than the other species. We look specifically at alleles that are present in the Sb variant of both species and the SB of either species. Most genes had a small number of such alleles. Nevertheless, on average _S. invicta_ had more private alleles in common with Sb than each of the focal species (two-tail paired t-test, Bonferonni-corrected p < 0.05). + +```{r, fig.height = 6, fig.width = 6} + +# Alleles shared between Sb and invicta SB +# ((richteri_B, richteri_b),(invicta_b, invicta_B)),saevissima). +# ABBBA +# AABBA +# ABABA +# Alleles shared between Sb and richteri SB +# ((richteri_B, richteri_b),(invicta_b, invicta_B)),saevissima). +# BBBAA +# BABAA +# BBAAA + +all_counts %>% + mutate(shared_lb_invicta_bb = ABBBA, + shared_lb_focal_bb = BBBAA) %>% + mutate(species = as.factor(paste0("S. ", species))) -> + shared_lb_each_bb + +shared_lb_each_bb %>% + group_by(species) %>% + nest() %>% + mutate(model = map(data, + ~ t.test(.$shared_lb_invicta_bb, + .$shared_lb_focal_bb, + paired = TRUE)), + results = map(model, tidy), + mean_invicta = map(data, ~ mean(.$shared_lb_invicta_bb)), + mean_focal = map(data, ~ mean(.$shared_lb_focal_bb)), + sum_invicta = map(data, ~ sum(.$shared_lb_invicta_bb)), + sum_focal = map(data, ~ sum(.$shared_lb_focal_bb))) %>% + unnest(c(results, mean_invicta, mean_focal, sum_invicta, sum_focal)) %>% + select(-data, -model, -method, -alternative) %>% + mutate(perc_diff = round(100 * (mean_invicta - mean_focal) / mean_focal), + corrected_p = 4 * p.value) -> + shared_lb_each_bb_summary + +as.data.frame(shared_lb_each_bb_summary) + +``` + + +```{r} + + +# Add P-value to plot for printing + parse_p <- function(p_value, facetting_var) { + + if (is.na(p_value)) { + p_value <- paste0("atop('", expr(!!facetting_var), "',", + ")") + } else if (p_value >= 0.05) { + p_value <- paste0("atop('", expr(!!facetting_var), "',", + "'corrected p =", expr(!!round(p_value, 2)),"')") + } else if (p_value < 0.05 & p_value >= 0.001) { + p_value <- paste0("atop('", expr(!!facetting_var), "',", + "'corrected p = ", expr(!!round(p_value, 3)),"')") + } else { + p_value <- paste0("atop('", expr(!!facetting_var), "',", + "'corrected p < 10'", "^", expr(!!(floor(log10(abs(p_value))) + 1)),")") + } + return(p_value) +} + +species_levels <- levels(shared_lb_each_bb$species) +p_vector <- shared_lb_each_bb_summary$corrected_p[match(species_levels, shared_lb_each_bb_summary$species)] + +# Make P value expression +p_vector <- sapply(1:length(p_vector), function(i) parse_p(p_vector[i], + species_levels[i])) + +shared_lb_each_bb$species <- factor(shared_lb_each_bb$species, + labels = p_vector) + +shared_lb_each_bb %>% + ggplot() + + geom_point(aes(x = shared_lb_invicta_bb, + y = shared_lb_focal_bb), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2, labeller = labeller(species = label_parsed)) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 21.5), ylim = c(0, 21.5)) + + xlab("Shared between Sb and the S. invicta SB") + + ylab("Shared between Sb and the focal SB") -> p2 + +p2 + +ggsave(p2, file = "results/lb_shared_each_bb.pdf", width = 6, height = 6) + +``` + +## Conclusion + +The higher number of private alleles in Sb than in the SB of each species could be constistent with the ancestral polymorphism hypothesis, but could also be consistent with introgression and lower Ne in Sb. + +The fact that SB had fewer private alleles in _S. invicta_ than in any of the other species is consistent with an introgression from _S. invicta_ to each of the other species, as is the fact that SB of _S. invicta_ shared more alleles with the Sb branch than the SB of any other species. -We conclude that we cannot use dfoil to study the introgression of Sb between species. ## Appendix @@ -263,7 +551,7 @@ DIL is composed of the following four components. A right-skewed distribution su theme_bw() + xlim(-35,35) + ylim(0,60) + - geom_vline(xintercept = 0) -> + geom_vline(xintercept = 0) -> p1 ggplot(all_counts) + @@ -306,7 +594,7 @@ DFI is composed of the following four components. A left-skewed distribution sup theme_bw() + xlim(-35,35) + ylim(0,60) + - geom_vline(xintercept = 0) -> + geom_vline(xintercept = 0) -> p1 ggplot(all_counts) + @@ -339,7 +627,7 @@ p1 + p2 + p3 + p4 + plot_layout(ncol = 4) ### Components for DFO -DFO is composed of the comparisons below. We see that the the negative skew is seen only for the first pannel and third pannel, which take into account the alleles that are private to each of the supergene variants (ABBAA for Sb and BAABA for SB). +DFO is composed of the comparisons below. We see that the the negative skew is seen only for the first pannel and third pannel, which take into account the alleles that are private to each of the supergene variants (ABBAA for Sb and BAABA for SB). ```{r, fig.width=12, fig.height = 6} @@ -349,7 +637,7 @@ DFO is composed of the comparisons below. We see that the the negative skew is s theme_bw() + xlim(-35,35) + ylim(0,60) + - geom_vline(xintercept = 0) -> + geom_vline(xintercept = 0) -> p1 ggplot(all_counts) + @@ -382,7 +670,7 @@ p1 + p2 + p3 + p4 + plot_layout(ncol = 4) ### Components for DOL -A similar pattern is seen for DOL, where the positive skew is also seen only for the first pannel and third pannel, relating to the alleles that are private to each of the supergene variants (ABBAA for Sb and BAABA for SB). +A similar pattern is seen for DOL, where the positive skew is also seen only for the first pannel and third pannel, relating to the alleles that are private to each of the supergene variants (ABBAA for Sb and BAABA for SB). ```{r, fig.width=12, fig.height = 6} @@ -392,7 +680,7 @@ A similar pattern is seen for DOL, where the positive skew is also seen only for theme_bw() + xlim(-35,35) + ylim(0,60) + - geom_vline(xintercept = 0) -> + geom_vline(xintercept = 0) -> p1 ggplot(all_counts) +