The two previous demos (1st run and 2nd run) are small-scale, limited in either input protein set or reference database, respectively. Now let's look at a realistic case. In this page we will replicate the analysis described in the original HGTector paper (Fig. 3 of Zhu et al., 2014).
This analysis involves seven Rickettsia species under the spotted fever group (TaxID: 114277). In the 2014 paper, we searched against the NCBI nr database as of 2012. Here we will see what we get using the default database constructed in Oct 2019.
This page won't have much text, since most ingredients of HGTector were already discussed in the two previous tutorials. I will let the figures explain themselves.
Species | TaxID | Genome |
---|---|---|
R. africae | 35788 | GCF_000023005.1 |
R. akari | 786 | GCF_000018205.1 |
R. conorii | 781 | GCF_000007025.1 |
R. felis | 42862 | GCF_000012145.1 |
R. massiliae | 35791 | GCF_000016625.1 |
R. rickettsii | 783 | GCF_000018225.1 |
R. slovaca | 35794 | GCF_000237845.1 |
Just run the default workflow:
hgtector search -i input -o search -m diamond -p 16 -d hgtdb/diamond/db -t hgtdb/taxdump
hgtector analyze -i search -o analyze -t hgtdb/taxdump
The automated workflow correctly figured that all seven genomes belong to the spotted fever group (even though it is not a standard taxonomic rank!), and let it be the "self" group. It then moves up one level to genus Rickettsia, and let it be the "close" group.
In this realistic analysis, both "close" and "distal" groups are sufficiently sampled, and the distributions of scores are apparently bimodal.
(Left panel is "close", right panel is "distal", same below)
The default method for kernel bandwidth selection quickly finds the "atypical" cluster in each group.
Finally it predicts a total of 357 putatively HGT-derived genes, out of 8,559 genes in the seven genomes. As the scatter plot (left) shows, it is a pretty conservative prediction. If one wants more candidates, the two parameters --noise
and --silhouette
may be set to zero to include genes in the "grey zone".
The predicted HGT activity is differential across species (right). For example Rickettsia felis appears to have many more HGT-derived genes than Rickettsia akari.
Now we will customize a few parameters so as to mimick the 2014 analysis.
hgtector analyze -i search -o analyze -t hgtdb/taxdump \
--bandwidth grid --self-tax 114277 --close-tax 766
Here the "self" group is still the spotted fever group (the same as HGTector's guess), but the "close" group is raised to order Rickettsiales (TaxID: 766). The enlarged "close" group allows for more robust clustering. Meanwhile, the kernel bandwidth is to be optimized using grid search with cross validation (this is new).
This setting predicts 268 HGT-derived genes.