diff --git a/Private and shared alleles/Early diverging species/dfoil.rmd b/Private and shared alleles/Early diverging species/dfoil.rmd index dede1da..0b9fcbd 100644 --- a/Private and shared alleles/Early diverging species/dfoil.rmd +++ b/Private and shared alleles/Early diverging species/dfoil.rmd @@ -123,7 +123,7 @@ The introgression hypothesis follows one of the following trees: ((richteri_b, invicta_b),richteri_B),invicta_B)),saevissima) -We used the DFOIL software to charecterise allele distributions across the species. DFOIL considers symmetrical five-taxon the "species tree" topology, with the introgressed species on the left two branches, _S. invicta_ in the two middle branches and the outgroup _S. saevissima_ on the right branch. As an example, the _S. richeteri_ tree we considered is: +We use the DFOIL approach charecterise allele distributions across the species. DFOIL considers symmetrical five-taxon the "species tree" topology, with the introgressed species on the left two branches, _S. invicta_ in the two middle branches and the outgroup _S. saevissima_ on the right branch. As an example, the _S. richeteri_ tree we considered is: ((richteri_B, richteri_b),(invicta_b, invicta_B)),saevissima). @@ -164,15 +164,251 @@ all_counts %>% ``` +## No support for the parallel origin hypothesis + +Under the parallel origin hypothesis, alleles would be shared between species only by ILS, meaning that a supergene variant in a species would be equally distant to both supergene variants in the other species. + +The DFOIL approach gives us four measures. We expect all measures to be zero. + +DFO tests whether the first taxon (SB in the focal species) is closer to the third (invicta Sb) or fourth (invicta SB) taxon. For all species, we get a negative value on average, indicating that focal SB variant is closer to the SB variant of _S. invicta_ than the Sb variant of _S. invicta_. + +DIL measures whether the second taxon (focal Sb) is closer to the third (invicta Sb) or the fourth (invicta SB). We have a positive value on average for all species, indicating that Sb variants share more alleles than expected under ILS. + +DFI measures whether the third taxon (invicta Sb) is closer to the first (focal SB) or the second (focal Sb). Wehave a negative value on average for all focal species, again indicating that Sb variants share more alleles than expected under ILS. + +DOL measures whether the fourth taxon (invicta SB) is closer to the first (focal SB) or second (focal Sb). For all species except _S. AdRX_ , we get a slightly positive value, indicating that the SB variant of _S. invicta_ is closer to the SB variant of the focal species than to the Sb variant of the focal species. + +```{r, fig.height = 7, fig.width = 10} + +all_long %>% + mutate(significant = ifelse(P_value < 0.05, "p < 0.05", "p > 0.05")) %>% + ggplot(aes(x = D_type, y = D_value)) + + # geom_jitter(aes(colour = significant)) + + geom_boxplot(width=0.5, fill ="transparent") + + geom_hline(yintercept = 0) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + +``` + +These patterns can be seen below, where I add the proportion of genes that have a significantly positive of negative D for each D type and for each species (chi-square goodness-of-fit test, uncorrected p < 0.05). I also plot the P-value distributions. + +```{r} + +all_long %>% + mutate(significant = ifelse(P_value < 0.05, "p < 0.05", "p > 0.05")) %>% + group_by(D_type, species) %>% + summarise(mean_D = mean(D_value), + proportion_positive = round(sum(P_value < 0.05 & D_value > 0) / n(), 2), + proportion_negative = round(sum(P_value < 0.05 & D_value < 0) / n(), 2)) + +all_dfoil %>% + select(species, gene, DFO_Pvalue, DIL_Pvalue, DFI_Pvalue, DOL_Pvalue) %>% + pivot_longer(-c(species, gene)) %>% + mutate(D = gsub("_.*", "", name)) %>% + mutate(D = fct_relevel(D, "DFO", "DIL", "DFI", "DOL")) %>% + ggplot() + + geom_histogram(aes(x = value), colour = "black", fill = "white") + + facet_grid(cols = vars(D), rows = vars(species)) + + theme_bw() + +``` + +From these patterns, it is clear that the "parallel origin" hypothesis, where alleles follow the species tree and ILS, is not correct. + +At face value, the DFO and DOL values in our results could indicate that the SB have a more recent divergence than the separation between SB and Sb, which in turn would support the ancestral polymorphism hypothesis. However, both our alternative scenarios (ancestral polymorphism or introgression) are extremely discordant from the species tree. The measures of DFO and DOL are heavily skewed by alleles that are private to either SB (BAABA) or Sb (ABBAA), which could occur under either scenario. In the following, I examine specific allele combinations that give evidence for one or the other hypothesis. + +## Private alleles support the introgression hypothesis + +Under the introgression hypothesis, the terminal branch lengths of Sb are expected to be smaller than those of SB, as the separation between SB variants represents a relative old speciation event, whereas the separation between Sb variants represent more recent introgression events. + +As a proxy for branch length, we counted the number of alleles that are private to each terminal branch. Indeed, the number of alleles that are private to the SB supergene terminal branches are greater than those private to each Sb terminal branch (two-tail paired t-test, Bonferroni corrected p < 0.05 for each species). This pattern could support a younger age for species separation in Sb than in SB, supporting the introgression hypothesis. + +```{r, fig.height = 5, fig.width = 5} + +ggplot(all_counts) + + geom_point(aes(x = ABAAA + AABAA, y = BAAAA + AAABA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, yintercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("Sb (ABAAA + AABAA)") + + ylab("SB (BAAAA + AAABA)") + + ggtitle("Private to terminal branches") -> p1 + +p1 + +all_counts %>% + group_by(species) %>% + summarise(p = t.test(ABAAA + AABAA, BAAAA + AAABA, paired = TRUE)$p.value, + t = t.test(ABAAA + AABAA, BAAAA + AAABA, paired = TRUE)$statistic, + df = t.test(ABAAA + AABAA, BAAAA + AAABA, paired = TRUE)$parameter, + mean_invicta = mean(ABAAA + AABAA), + mean_focal = mean(BAAAA + AAABA), + mean_invicta_b = mean(AABAA), + mean_focal_b = mean(ABAAA), + mean_invicta_B = mean(AAABA), + mean_focal_B = mean(BAAAA)) %>% + mutate(corrected_p = p * 4) %>% + mutate(significant = ifelse(corrected_p < 0.05,"*","")) %>% + as.data.frame() + + +``` + +However, it is important to note that effective population size is an important determinant of branch length. With higher effective population size, branch lengths can get "stretched" to before the speciation event occured. Thus, the pattern above could be consistent with the ancestral polymorphism hypothesis, since a lower effective population size in Sb relative to SB would cause Sb to have shorter terminal branches (despite equal divergence age). + +![](results/tree_models.png) + +Indeed, assuming the introgression hypothesis (the topology on the right) measured the branch lengths going from the supergene origin to Sb in _S. invicta_ and each focal species, and compared it to the branch length of SB in _S. invicta_ and in each focal species. In all but one comparison, the SB branch was smaller than the Sb branch, unlike the expectation that it should be equal or larger. + +```{r, fig.height = 12, fig.width = 12} + +ggplot(all_counts) + + geom_point(aes(x = ABBAA + AABAA, y = AAABA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("invicta b") + + ylab("invicta B") + + ggtitle("invicta b VS invicta B") -> p1 + +ggplot(all_counts) + + geom_point(aes(x = ABBAA + ABAAA, y = BAAAA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("focal b") + + ylab("focal B") + + ggtitle("focal b VS focal B") -> p2 + +ggplot(all_counts) + + geom_point(aes(x = ABBAA + AABAA, y = BAAAA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("invicta b") + + ylab("focal B") + + ggtitle("invicta b VS focal B") -> p3 + +ggplot(all_counts) + + geom_point(aes(x = ABBAA + AABAA, y = AAABA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("focal b") + + ylab("invicta B") + + ggtitle("focal b VS invicta B") -> p4 + +(p1 + p2) / (p3 + p4) + +all_counts %>% + group_by(species) %>% + summarise("invicta b VS invicta B_p" = t.test(ABBAA + AABAA, AAABA, paired = TRUE)$p.value, + "invicta b VS invicta B_t" = t.test(ABBAA + AABAA, AAABA, paired = TRUE)$statistic, + "invicta b VS invicta B_df" = t.test(ABBAA + AABAA, AAABA, paired = TRUE)$parameter, + "invicta b VS invicta B_mean b" = mean(ABBAA + AABAA), + "invicta b VS invicta B_mean B" = mean(AAABA), + "focal b VS focal B_p" = t.test(ABBAA + ABAAA, BAAAA, paired = TRUE)$p.value, + "focal b VS focal B_t" = t.test(ABBAA + ABAAA, BAAAA, paired = TRUE)$statistic, + "focal b VS focal B_df" = t.test(ABBAA + ABAAA, BAAAA, paired = TRUE)$parameter, + "focal b VS focal B_mean b" = mean(ABBAA + ABAAA), + "focal b VS focal B_mean B" = mean(BAAAA), + "invicta b VS focal B_p" = t.test(ABBAA + AABAA, BAAAA, paired = TRUE)$p.value, + "invicta b VS focal B_t" = t.test(ABBAA + AABAA, BAAAA, paired = TRUE)$statistic, + "invicta b VS focal B_df" = t.test(ABBAA + AABAA, BAAAA, paired = TRUE)$parameter, + "invicta b VS focal B_mean b" = mean(ABBAA + AABAA), + "invicta b VS focal B_mean B" = mean(BAAAA), + "focal b VS invicta B_p" = t.test(ABBAA + ABAAA, AAABA, paired = TRUE)$p.value, + "focal b VS invicta B_t" = t.test(ABBAA + ABAAA, AAABA, paired = TRUE)$statistic, + "focal b VS invicta B_df" = t.test(ABBAA + ABAAA, AAABA, paired = TRUE)$parameter, + "focal b VS invicta B_mean b" = mean(ABBAA + ABAAA), + "focal b VS invicta B_mean B" = mean(AAABA)) %>% + ungroup %>% + pivot_longer(-species) %>% + separate(name, c("comparison", "stat"), sep = "_") %>% + pivot_wider(names_from = stat, values_from = value) %>% + mutate(corrected_p = p * 4) %>% + mutate(significant = ifelse(corrected_p < 0.05,"*","")) %>% + arrange(comparison, species) %>% + as.data.frame() + +``` + +Nevertheless, our analysis in based on alleles in genic regions - some of which likely to be slightly deleterious. Slightly deleterious alleles have higher subtitution rates under low Ne, which could be an alternative explanation for the slightly higher number of private alleles in Sb. + +Furthermore, there are many more alleles fixed in Sb than in SB (two-tail paired t-test, Bonferroni corrected p < 0.05 for each species) - indeed, almost no alleles are fixed among SB of different species. This could suggest that the SB variants do not share a common branch before speciation. + +```{r, fig.height = 5, fig.width = 5} + +ggplot(all_counts) + + geom_point(aes(x = ABBAA, y = BAABA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 75), ylim = c(0,75)) + + xlab("Sb (ABBAA)") + + ylab("SB (BAABA)") + + ggtitle("Fixed private alleles") -> p1 + +p1 + +all_counts %>% + group_by(species) %>% + summarise(p = t.test(ABBAA, BAABA, paired = TRUE)$p.value, + t = t.test(ABBAA, BAABA, paired = TRUE)$statistic, + df = t.test(ABBAA, BAABA, paired = TRUE)$parameter, + mean_invicta = mean(ABBAA), + mean_focal = mean(BAABA)) %>% + mutate(corrected_p = p * 4) %>% + mutate(significant = ifelse(corrected_p < 0.05,"*","")) %>% + as.data.frame() + +``` + ## Evidence that Sb introgressed from _S. invicta_ to the remaining species -We then tested the hypothesis that SB in the species in which Sb evolved will have more private alleles in common with Sb than the other species: +Additionally, we tested whether the SB branch is longer in _S. invicta_ relative to each focal species. A difference in branch length would support the introgression hypothesis, and inform us on the direction of introgression: ((richteri_b, invicta_b),invicta_B),richteri_B)),saevissima) ((richteri_b, invicta_b),richteri_B),invicta_B)),saevissima) -We look specifically at alleles that are present in the Sb variant of both species and the SB of either species. Most genes had a small number of such alleles. Nevertheless, on average _S. invicta_ had more private alleles in common with Sb than each of the focal species (two-tail paired t-test, Bonferonni-corrected p < 0.05). +We tested the hypothesis that the species in which Sb evolved will have fewer alleles private to SB, given that it will share part of its branch with Sb. We found that the SB of _S. invicta_ had fewer private alleles than the SB of each focal species. This difference was small but significant in all species (two-tail paired t-test, Bonferonni-corrected p < 0.05) except for _S. interrupta_, where there was no significant difference. + +```{r, fig.height = 6, fig.width = 6} + +all_counts %>% + mutate(species = paste0("S. ", species)) %>% + ggplot() + + geom_point(aes(x = AAABA, y = BAAAA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 50), ylim = c(0,50)) + + xlab("Private to S. invicta SB") + + ylab("Private to focal SB") -> p1 +p1 + +all_counts %>% + group_by(species) %>% + summarise(p = t.test(AAABA, BAAAA, paired = TRUE)$p.value, + t = t.test(AAABA, BAAAA, paired = TRUE)$statistic, + estimate = t.test(AAABA, AABAA, paired = TRUE)$estimate, + mean_invicta = mean(AAABA), + mean_focal = mean(BAAAA)) %>% + ungroup %>% + mutate(corrected_p = p * 4) %>% + mutate(significant = ifelse(corrected_p < 0.05,"*","")) %>% + as.data.frame() + +``` + +We then tested the hypothesis that the species in which Sb evolved will have more private alleles in common with Sb than the other species. We look specifically at alleles that are present in the Sb variant of both species and the SB of either species. Most genes had a small number of such alleles. Nevertheless, on average _S. invicta_ had more private alleles in common with Sb than each of the focal species (two-tail paired t-test, Bonferonni-corrected p < 0.05). ```{r, fig.height = 6, fig.width = 6} @@ -267,5 +503,210 @@ ggsave(p2, file = "results/lb_shared_each_bb.pdf", width = 6, height = 6) ## Conclusion -The fact that SB of _S. invicta_ shared more alleles with the Sb branch than the SB of any other species is consistent with an introgression from _S. invicta_ to each of the other species. +The higher number of private alleles in Sb than in the SB of each species could be constistent with the ancestral polymorphism hypothesis, but could also be consistent with introgression and lower Ne in Sb. + +The fact that SB had fewer private alleles in _S. invicta_ than in any of the other species is consistent with an introgression from _S. invicta_ to each of the other species, as is the fact that SB of _S. invicta_ shared more alleles with the Sb branch than the SB of any other species. + + +## Appendix + +### Alleles shared between SB and Sb + +```{r, fig.height = 5, fig.width = 13} + +ggplot(all_counts) + + geom_point(aes(x = ABBBA, y = BBBAA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 20), ylim = c(0, 20)) -> p1 + +ggplot(all_counts) + + geom_point(aes(x = AABBA, y = BABAA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 20), ylim = c(0, 20)) -> p2 + +ggplot(all_counts) + + geom_point(aes(x = ABABA, y = BBABA), alpha = 0.5) + + facet_wrap(vars(species), ncol = 2) + + theme_bw() + + geom_abline(slope=1, intercept=0) + + coord_fixed(xlim = c(0, 20), ylim = c(0, 20)) -> p3 + +p1 + p2 + p3 + +``` + +### Components for DIL + +DIL is composed of the following four components. A right-skewed distribution supports the introgression hypothesis. + +```{r, fig.width=12, fig.height = 6} + + ggplot(all_counts) + + geom_histogram(aes(x = ABBAA - ABABA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> + p1 + +ggplot(all_counts) + + geom_histogram(aes(x = BBBAA - BBABA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> p2 + +ggplot(all_counts) + + geom_histogram(aes(x = BAABA - BABAA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> +p3 + +ggplot(all_counts) + + geom_histogram(aes(x = AAABA - AABAA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + geom_vline(xintercept = 0) -> p4 + +p1 + p2 + p3 + p4 + plot_layout(ncol = 4) + +``` + +### Components for DFI + +DFI is composed of the following four components. A left-skewed distribution supports the introgression hypothesis. + +```{r, fig.width=12, fig.height = 6} + + ggplot(all_counts) + + geom_histogram(aes(x = BABAA - ABBAA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> + p1 + +ggplot(all_counts) + + geom_histogram(aes(x = BABBA - ABBBA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> p2 + +ggplot(all_counts) + + geom_histogram(aes(x = ABABA - BAABA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> +p3 + +ggplot(all_counts) + + geom_histogram(aes(x = ABAAA - BAAAA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + geom_vline(xintercept = 0) -> p4 + +p1 + p2 + p3 + p4 + plot_layout(ncol = 4) + +``` + +### Components for DFO + +DFO is composed of the comparisons below. We see that the the negative skew is seen only for the first pannel and third pannel, which take into account the alleles that are private to each of the supergene variants (ABBAA for Sb and BAABA for SB). + +```{r, fig.width=12, fig.height = 6} + + ggplot(all_counts) + + geom_histogram(aes(x = BABAA - BAABA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> + p1 + +ggplot(all_counts) + + geom_histogram(aes(x = BBBAA - BBABA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> p2 +ggplot(all_counts) + + geom_histogram(aes(x = ABABA - ABBAA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> +p3 + +ggplot(all_counts) + + geom_histogram(aes(x = AAABA - AABAA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + geom_vline(xintercept = 0) -> p4 + +p1 + p2 + p3 + p4 + plot_layout(ncol = 4) + +``` + +### Components for DOL + +A similar pattern is seen for DOL, where the positive skew is also seen only for the first pannel and third pannel, relating to the alleles that are private to each of the supergene variants (ABBAA for Sb and BAABA for SB). + +```{r, fig.width=12, fig.height = 6} + + ggplot(all_counts) + + geom_histogram(aes(x = BAABA - ABABA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> + p1 + +ggplot(all_counts) + + geom_histogram(aes(x = BABBA - ABBBA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> p2 + +ggplot(all_counts) + + geom_histogram(aes(x = ABBAA - BABAA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + + geom_vline(xintercept = 0) -> +p3 + +ggplot(all_counts) + + geom_histogram(aes(x = ABAAA - BAAAA), fill = "white", colour = "black") + + facet_grid(rows = vars(species)) + + theme_bw() + + xlim(-35,35) + + ylim(0,60) + geom_vline(xintercept = 0) -> p4 + +p1 + p2 + p3 + p4 + plot_layout(ncol = 4) + +```