Note
- Master Thesis Bioinformatics at the University of Tübingen
- Thesis period: 01.12.2023 - 01.06.2024
Horizontal gene transfer (HGT) plays a significant role in shaping the genetic landscape of bacterial populations. In contrast to the more common vertical gene transfer from parent to offspring, HGT allows the lateral exchange of genes between organisms. To study the impact of HGT on bacterial gene frequency spectra (GFS), we extend the existing mutation models of the open-source software msprime [1, 2] with a gene gain and loss model based on the Infinitely Many Genes model [3]. We then extend the ancestry and mutation simulation to support HGT events. In addition, the otherwise random ancestry simulation can be fixed to specified trees, which is essential for parameter estimation and for fitting the simulation to real data. Building on this, we develop a simulation-based testing framework to determine whether a gene frequency spectrum is consistent with neutral evolution. Finally, we validate this framework and estimate real-world parameters from pangenome data.
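For readers unfamiliar with the term, a gene frequency spectrum counts, for each k from 1 to n, how many genes are present in exactly k of the n sampled genomes. The following is a minimal sketch of that idea in plain Python. It does not use the code in this repository; the function name `gene_frequency_spectrum` and the toy presence/absence matrix are assumptions made for illustration only.

```python
from collections import Counter


def gene_frequency_spectrum(presence):
    """Return the GFS of a presence/absence matrix.

    `presence` is a list of genomes; each genome is a list of 0/1 flags,
    one per gene (hypothetical input format, not the repository's own).
    Entry k-1 of the result is the number of genes found in exactly k genomes.
    """
    n = len(presence)
    if n == 0:
        return []
    num_genes = len(presence[0])
    counts = Counter()
    for gene in range(num_genes):
        # Number of genomes that carry this gene.
        k = sum(genome[gene] for genome in presence)
        if k > 0:  # genes absent from every genome are not part of the GFS
            counts[k] += 1
    return [counts[k] for k in range(1, n + 1)]


# Toy example: 3 genomes (rows), 5 genes (columns).
presence = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 1, 0],
]
gfs = gene_frequency_spectrum(presence)
print(gfs)  # [3, 0, 2]: 3 singleton genes, 0 in two genomes, 2 core genes
```

In this toy matrix, three genes occur in a single genome, none in exactly two, and two in all three genomes.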
Tip
A ready-to-use Jupyter Notebook with examples can be found here: example_usage.ipynb
The repository is structured as follows:
Filename | Description |
---|---|
conda_env.yml | Conda environment with all required software packages. |
gene_model.py | Main code for the gene gain/loss simulation. |
gfs.py | Utility functions for analysing and modifying the GFS. |
hgt_mutations.py | Extension of the msprime mutation simulation to support HGT. |
hgt_sim_args.py | Default simulation parameters. |
hgt_simulation.py | Extension of the msprime ancestry simulation to support HGT. |
neutrality_test.py | Neutrality test based on a |
optimisation.py | Algorithm to fit the simulation to a real-world GFS. |
example_usage.ipynb | Jupyter Notebook with examples. |
pangenome-gene-transfer-simulation.pdf | Thesis |
Dirname | Description |
---|---|
data | Simulated data and measurements. |
gfs_analysis | Impact of HGT and GC on the GFS of fixed trees. |
minimal_site_count | Impact of double gene gain events on the GFS. |
panX | Files generated by panX. |
tex | LaTeX source files. |
Unless otherwise labelled, this software is published under the GNU General Public License v3.0.
Permissions | Conditions | Limitations |
---|---|---|
✓ Commercial use | Disclose source | ✕ Liability |
✓ Distribution | License and copyright notice | ✕ Warranty |
✓ Modification | Same license | |
✓ Patent use | State changes | |
✓ Private use | | |
Go to LICENSE.md to see the full version.
The logo is partially based on the output of tskit_arg_visualizer.
Footnotes
- Franz Baumdicker et al. "Efficient ancestry and mutation simulation with msprime 1.0". In: Genetics 220.3 (Dec. 2021). Ed. by S. Browning. ISSN: 1943-2631. doi: 10.1093/genetics/iyab229. url: http://dx.doi.org/10.1093/genetics/iyab229
- Franz Baumdicker, Wolfgang R. Hess and Peter Pfaffelhuber. "The Infinitely Many Genes Model for the Distributed Genome of Bacteria". In: Genome Biology and Evolution 4.4 (2012), pp. 443–456. doi: 10.1093/gbe/evs016. url: http://dx.doi.org/10.1093/gbe/evs016