Skip to content

Latest commit

 

History

History
73 lines (44 loc) · 4.33 KB

realrun.md

File metadata and controls

73 lines (44 loc) · 4.33 KB

Real runs

The two previous demos (1st run and 2nd run) are small-scale, limited in either input protein set or reference database, respectively. Now let's look at a realistic case. In this page we will replicate the analysis described in the original HGTector paper (Fig. 3 of Zhu et al., 2014).

zhu.2014.fig3

This analysis involves seven Rickettsia species under the spotted fever group (TaxID: 114277). In the 2014 paper, we searched against the NCBI nr database as of 2012. Here we will see what we get using the default database constructed in Oct 2019.

This page won't have much text, since most ingredients of HGTector were already discussed in the two previous tutorials. I will let the figures explain themselves.

Input data

Species TaxID Genome
R. africae 35788 GCF_000023005.1
R. akari 786 GCF_000018205.1
R. conorii 781 GCF_000007025.1
R. felis 42862 GCF_000012145.1
R. massiliae 35791 GCF_000016625.1
R. rickettsii 783 GCF_000018225.1
R. slovaca 35794 GCF_000237845.1

Default workflow

Just run the default workflow:

hgtector search -i input -o search -m diamond -p 16 -d hgtdb/diamond/db -t hgtdb/taxdump
hgtector analyze -i search -o analyze -t hgtdb/taxdump

The automated workflow correctly figured that all seven genomes belong to the spotted fever group (even though it is not a standard taxonomic rank!), and let it be the "self" group. It then moves up one level to genus Rickettsia, and let it be the "close" group.

In this realistic analysis, both "close" and "distal" groups are sufficiently sampled, and the distributions of scores are apparently bimodal.

(Left panel is "close", right panel is "distal", same below)

ricket.def.hist

The default method for kernel bandwidth selection quickly finds the "atypical" cluster in each group.

ricket.def.kde

Finally it predicts a total of 357 putatively HGT-derived genes, out of 8,559 genes in the seven genomes. As the scatter plot (left) shows, it is a pretty conservative prediction. If one wants more candidates, the two parameters --noise and --silhouette may be set to zero to include genes in the "grey zone".

The predicted HGT activity is differential across species (right). For example Rickettsia felis appears to have many more HGT-derived genes than Rickettsia akari.

ricket.def.plot

Customization

Now we will customize a few parameters so as to mimick the 2014 analysis.

hgtector analyze -i search -o analyze -t hgtdb/taxdump \
  --bandwidth grid --self-tax 114277 --close-tax 766

Here the "self" group is still the spotted fever group (the same as HGTector's guess), but the "close" group is raised to order Rickettsiales (TaxID: 766). The enlarged "close" group allows for more robust clustering. Meanwhile, the kernel bandwidth is to be optimized using grid search with cross validation (this is new).

This setting predicts 268 HGT-derived genes.

ricket.ori.hist

ricket.ori.kde

ricket.def.plot