Numts (pronounced "new mights") are mitochondrial DNA segments that have been
transposed into the nuclear genome. If you work with mitochondrial genome data
they're a problem as there's no good way to identify them.
Numts have different properties than their mitochondrial counterparts; the nuclear genome has a different (generally lower) rate of molecular evolution
and a different mode of inheritance (not just matrilineal). In general if you know that your data are single
source then Numts don't tend to be a problem; most of the time the majorty basecall is correct.
However, if you're trying to call heteroplasmies or you
think you may have a mixture of DNAs it would be good to remove such sequences from your analysis.
One way to remove Numts is to use an alignment-based approach-- map your reads to, say, the GRCh38 reference genome, as it contains (some) Numts, reads that look more similar to the Numts in this reference will be mapped there instead of to the mitochondrial genome. This has two major short-comings. GRCh38 only has some Numts; not all. In fact, some Numt insertions are polymorphic, which means that some individuals will have them, others will not, and further, that the set of all Numts will always be incomplete (as you haven't sampled all individuals in the population). The second shortcoming to reference-based approaches that when you map reads you're mapping an allele, an allele that may be fairly different from the reference genome(s). E.g., what if your read has 4 differences to the rCRS, and 3 differences to some Numt? What does that mean? And does the answer change if we know that the read is an exact hit to some known mitochondrial sequence, just not the rCRS (hint, it does)?
The premise behind RtN! is simple: only keep reads that map well to some known mitochondrial sequence. The simplicitly of this approach is that Numts don't have to be annotated; if a read is very far from all known mitochondrial sequences it very likely harbors a lot of sequencing errors (better to ignore it), or it's an off-target alignment (which we don't want). RTN uses annotated genomes from HmtDB. If a read is similar to some known sequence it is kept, otherwise it's mapping quality is set to 0. RTN also maps reads to a database of annotated Numt alleles. The database includes alleles from: Dayama et al, Calabrese et al and Smart et al. The reads are then decorated with two tags: ZH, which gives the minimum distance of the read to some annotated Human sequence, and ZN, which gives the same distance but to the Numt alleles. The minimum distance is defined in two ways: either ignoring indels (just using the matched bases in the CIGAR string), or it can include the number of indels as well-- I would guess that the former is better for Ion sequencing, while the latter would be better for Illumina.
As we describe in our paper, most reads are ambiguous and exactly match both a human sequence and a Numt sequence. The principle of RtN! is to keep ambiguous reads and flag them for manual review. There are processing steps that can be done to make RtN! more aggressive in these cases, but to be clear, some types of Numts are intractable.
Clone this repo (NO recursive downloads SeqLib, htslib, bwa, fermi-kit, ... )
git clone https://github.com/Ahhgust/RtN.git
cd RtN
Change directories. Uncompress and index humans.fa.bz2 a la:
bunzip2 humans.fa.bz2 && bwa index humans.fa
(This will take a while)
The Numt databases need no decompression/indexing
Select the appropriate static binary
*nix
or
Windows
And run it (help):
./rtn -h
Recommendations for single-end reads (likely Ion sequencing)
./rtn -h humans.fa -n Calabrese_Dayama_Smart_Numts.fa -b yourBam.bam
In words:
-h: Compute near-neighbor distances to the human sequences (and remove reads that align poorly to all known human haplotypes)
-n: And compute the near-neighbor distance to an (always incomplete) database of Numt alleles
-b: And do so using the bam files specified...
Recommendations for paired-end reads (likely Illumina sequencing)
./rtn -p -h humans.fa -n Calabrese_Dayama_Smart_Numts.fa -b yourBam.bam
-p: The same as above, but run in paired-end mode
If you know that your data are single source there's an additional trick that can be used.
Rather than aligning the reads to HmtDB it suffices to align the reads to just the single source haplotype.
We call this running a bootstrap (as in the "pull yourself up by" as opposed to the resampling technique).
To do a bootstrap, you can either use a fasta sequence (for the given individual) and then index it with bwa
(aka,
bwa index yourFasta.fa
)
As an added convenience we've added a simply python script (empop2fa.py)
that reads in an Empop-style encoded haplotype, and produces an indexed fasta.
Use it as:
fgrep -v '#' empopFile | python3 empop2fa.py rCRS.fa
By Empop-style file we mean a file that describes sequence differences to the rCRS sequence.
The file is presumed to be tab-separated, with the variants starting at column 4 onwards.
Click here for an example.
RtN! can then be run similar to the above way. In practice we have had excellent luck using a more stringent likelihood threshold in this case (1e-3),
./rtn -L 1e-3 -p -h mySingleSource.fa -n Calabrese_Dayama_Smart_Numts.fa -b yourBam.bam
(somewhat unusual) Requirements: c++ json library (install as necessary)
sudo apt-get install libjsoncpp-dev
git clone --recursive https://github.com/Ahhgust/RtN.git
cd RtN/SeqLib/htslib # Make htslib
autoconf
autoheader
./configure --enable-libcurl
make
cd ../bwa # Make BWA
make
cd ../fermi-lite # and Fermi-lite
make
cd .. # and now the SeqLib library
./configure LDFLAGS='-lcurl -lcrypto'
make
cd ..
make # and make RtN
git clone --recursive https://github.com/Ahhgust/RtN.git
cd RtN/SeqLib/htslib # Make htslib
autoconf
autoheader
./configure --disable-libcurl
make
cd ../bwa # Make BWA
make
cd ../fermi-lite # and Fermi-lite
make
cd .. # and now the SeqLib library
./configure
make
cd ..
make static # and make RtN