A toolkit for evolutionary analyses of linkage groups
Dependencies are best installed via conda:
$ conda install pandas docopt ete3 pygraphviz matplotlib tqdm networkx=2.4 numpy=1.20.3=py38h9894fe3_0 python=3.8 more-itertools
Usage: syngraph <module> [<args>...] [-D -V -h]
[Modules]
build Build graph from orthology data (e.g. BUSCO *.full_table.tsv)
infer Model rearrangements over a tree
tabulate Get table of extant and ancestral genomes
viz Visualise graph/data [Under development]
[Options]
-h, --help Show this screen.
-D, --debug Print debug information [TBI]
-v, --version Show version
[Dependencies]
---------------------------------------------------------------------------------------------
| $ conda install -c conda-forge networkx=2.4 pandas docopt tqdm ete3 pygraphviz matplotlib |
---------------------------------------------------------------------------------------------
syngraph build -d directory_of_tsv_files -m -o test
syngraph infer -g test.pickle -t newick.txt -r 2 -s taxon_1 -o test
syngraph infer -g test.pickle -t newick.txt -r 3 -s taxon_1 -o test
syngraph tabulate -g test.with_ancestors.pickle -o test
Input data should only contain markers from chromosome-scale sequences, as unscaffolded contigs will result in excess fission events being inferred.
If using BUSCO data, tsv files should be named My_taxon.\*.tsv where My_taxon is also a leaf in the newick tree. Each row should contain the BUSCO_ID, sequence, start position, and end position. These can be grepped from the *full_table.tsv file generated by BUSCO (Busco_id, Sequence, Gene_Start, Gene_End). E.g.:
0at7088 HG995313.1 5723272 5863707
1at7088 HG995286.1 19966914 20084934
2at7088 HG995296.1 11128843 11215510
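As an alternative to grep, the four columns can be pulled out with a short script. This is a minimal sketch, not part of syngraph: the function names are hypothetical, it assumes the usual BUSCO full_table.tsv layout (tab-separated, `#` comment lines, columns in the order Busco id, Status, Sequence, Gene Start, Gene End), and keeping only rows with status `Complete` is an assumption you may want to change.

```python
def busco_table_to_markers(lines, keep_status=("Complete",)):
    """Yield (busco_id, sequence, start, end) from BUSCO full_table.tsv lines.

    Assumes tab-separated rows with the status in the second column.
    Rows for missing BUSCOs have fewer than five fields and are skipped.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and BUSCO header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[1] not in keep_status:
            continue
        busco_id, _status, sequence, start, end = fields[:5]
        yield busco_id, sequence, start, end


def write_markers(full_table_path, out_path):
    """Write the four marker columns, tab-separated, one marker per line."""
    with open(full_table_path) as src, open(out_path, "w") as dst:
        for row in busco_table_to_markers(src):
            dst.write("\t".join(row) + "\n")
```

For example, `write_markers("full_table.tsv", "My_taxon.busco.tsv")` would produce a file whose name starts with the taxon name used in the tree (both file names here are placeholders).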
After building a syngraph, inter-chromosomal rearrangements can be inferred with syngraph infer. This requires a newick tree relating the taxa in the analysis. Branch lengths are used by syngraph, but they only influence how the tree is traversed, so approximate branch lengths are fine.
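Because each tsv file name must start with a taxon name that is also a leaf in the tree, it can help to compare the two before running the inference step. The sketch below is not part of syngraph and its function names are hypothetical. It assumes unquoted leaf names without spaces, and it takes the taxon name to be everything before the first dot in the file name.

```python
import re
from pathlib import Path


def newick_leaf_names(newick):
    """Return leaf names from a newick string (simple, unquoted names only).

    A leaf name is taken to be any label that directly follows '(' or ','.
    Internal node labels follow ')' and are therefore not matched.
    """
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def taxa_from_tsv_names(filenames):
    """Return the taxon prefix (text before the first dot) of each file name."""
    return {Path(name).name.split(".")[0] for name in filenames}


def compare_taxa(newick, filenames):
    """Return (leaves without a tsv file, tsv taxa missing from the tree)."""
    leaves = newick_leaf_names(newick)
    taxa = taxa_from_tsv_names(filenames)
    return sorted(leaves - taxa), sorted(taxa - leaves)
```

Both returned lists should be empty when the tree and the input directory agree.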
The -r option sets the inference mode: 2 for fissions and fusions, and 3 for fissions, fusions, and reciprocal translocations (which is currently experimental).
The -m option sets the minimum number of markers that can be involved in a rearrangement. Setting -m 1 means that a rearrangement will be reported when a single marker 'moves' between chromosomes. By contrast, setting a higher value, e.g. -m 100, means that chromosome fissions or sets of complex rearrangements involving fewer markers than this will be missed. A reasonable starting point is -m 5, although this may need to be adjusted given the density of markers, size of chromosomes, and accuracy of marker orthology.
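The effect of a minimum marker threshold can be illustrated with a toy example. This is only a sketch of the general idea and is not syngraph's inference algorithm; the function, the marker names, and the chromosome names are all hypothetical.

```python
from collections import Counter


def supported_origins(chrom_markers, ancestral_chrom_of, minimum):
    """Return ancestral chromosomes contributing at least `minimum` markers.

    chrom_markers: markers found on one extant chromosome.
    ancestral_chrom_of: dict mapping marker -> ancestral chromosome.
    """
    counts = Counter(
        ancestral_chrom_of[m] for m in chrom_markers if m in ancestral_chrom_of
    )
    return {chrom: n for chrom, n in counts.items() if n >= minimum}


# Toy data: six markers share one ancestral chromosome, one marker is a stray.
ancestral_chrom_of = {f"m{i}": "anc_1" for i in range(6)}
ancestral_chrom_of["m6"] = "anc_2"
extant_chromosome = list(ancestral_chrom_of)

# With a threshold of 1 the stray marker makes the chromosome look like a
# fusion of anc_1 and anc_2; with a threshold of 5 only anc_1 is supported.
loose = supported_origins(extant_chromosome, ancestral_chrom_of, minimum=1)
strict = supported_origins(extant_chromosome, ancestral_chrom_of, minimum=5)
```

In this toy case a single misassigned ortholog is enough to suggest a fusion at the lowest threshold, which is why the threshold should reflect the accuracy of marker orthology.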
The most useful output file is *.rearrangements.tsv. This lists rearrangements inferred over the tree. The branch of the tree where a rearrangement happened is denoted by its parent and child nodes. The event is reported as fission/fusion/translocation. Multiplicity is the number of events. This is normally 1, but can be more if a chromosome has fissioned into multiple fragments. The last column is ref_seqs, and shows which chromosomes are involved in the rearrangement given an extant genome, an inferred ancestral genome, or a predefined list of marker --> chromosome relationships.
#parent child event multiplicity ref_seqs
n7 Brenthis_ino fusion 1 [['n5_2', 'n5_17'], ['n5_20']]
n5 n7 fusion 1 [['n5_6'], ['n5_19']]
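A table like the one above can be summarised per branch with a few lines of Python. This is a minimal sketch, not part of syngraph, with hypothetical function names. It assumes the five columns shown above, that the first four contain no whitespace, and that ref_seqs is written as a Python-style nested list.

```python
import ast
from collections import Counter


def parse_rearrangements(lines):
    """Parse rows of a *.rearrangements.tsv file into dictionaries."""
    events = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the '#parent ...' header
        # Split on any whitespace, keeping ref_seqs (which has spaces) whole.
        parent, child, event, multiplicity, ref_seqs = line.split(None, 4)
        events.append({
            "parent": parent,
            "child": child,
            "event": event,
            "multiplicity": int(multiplicity),
            "ref_seqs": ast.literal_eval(ref_seqs.strip()),
        })
    return events


def events_per_branch(events):
    """Sum multiplicities for each (parent, child, event type) combination."""
    totals = Counter()
    for e in events:
        totals[(e["parent"], e["child"], e["event"])] += e["multiplicity"]
    return totals
```

To use it on a real file, pass an open file handle, e.g. `parse_rearrangements(open("test.rearrangements.tsv"))`, where the file name is a placeholder based on the `-o test` prefix used above.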
Syngraph is still under active development. Please open an issue if you have any questions about running the software or interpreting your results.
If you use syngraph in your research then please cite this preprint.