arcnumt

Scripts used to detect archaic NUMTs in NGS data

Description

A pipeline and a collection of scripts for analysing NUMTs discovered in whole-genome paired-read data. flanking_region_analysis.py calculates match ratios between a specific genomic region and archaic genomes. numt_stats.py calculates various statistics for the discovered NUMTs. mito_variance.py calculates pairwise differences between all sequences of an alignment.
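To make "pairwise differences between all sequences of an alignment" concrete, here is a minimal sketch. It is not the code of mito_variance.py: the function name, the dictionary input and the choice to skip columns containing a gap or N are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_differences(alignment, ignore="-N"):
    """Count differing columns for every pair of aligned sequences.

    `alignment` maps a sequence name to its aligned sequence.
    Columns where either sequence has a character from `ignore`
    are skipped (an assumption, not necessarily what mito_variance.py does).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all aligned sequences must have the same length")

    skip = set(ignore.upper())
    result = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        diffs = 0
        for a, b in zip(seq_a.upper(), seq_b.upper()):
            if a in skip or b in skip:
                continue  # gap or ambiguous base: column not compared
            if a != b:
                diffs += 1
        result[(name_a, name_b)] = diffs
    return result


# Toy alignment; the sequences are invented for this example.
toy = {"RSRS": "ACGTACGT", "numt": "ACGAACGT", "chimp": "ACGA-CTT"}
diffs = pairwise_differences(toy)
for pair, count in diffs.items():
    print(pair, count)
```

For a real alignment the dictionary could be filled from a FASTA file, for example with Biopython, which is already among the required modules.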

Required resources

This workflow is based on the output of dinumt (https://github.com/mills-lab/dinumt) including the supplementary files obtained with the option --output_support.

Some steps require third-party software. The tools I used are listed below; each can be replaced by other software that does the same job:

bam2fastx: https://github.com/PacificBiosciences/bam2fastx
bwa mem: http://bio-bwa.sourceforge.net/
samtools: http://www.htslib.org/doc/samtools.html
bam-rewrap: https://bitbucket.org/ustenzel/biohazard-tools
GATK 4.0 HaplotypeCaller: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php
GATK 3.8 FastaAlternateReferenceMaker: https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php

Python modules needed:

pysam
argparse
collections
multiprocessing
vcf (provided by the PyVCF package)
Bio (provided by the Biopython package)
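A quick way to see which of these modules are importable in the current environment is sketched below. The helper name missing_modules is made up for this example; the module names are the ones listed above.

```python
from importlib.util import find_spec

# Import names as listed in this README.
REQUIRED_MODULES = ["pysam", "argparse", "collections", "multiprocessing", "vcf", "Bio"]


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

argparse, collections and multiprocessing are part of the Python standard library, so only pysam, vcf and Bio should ever show up as missing.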

In addition you will need:

RSRS reference sequence: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322232/
alignment file with chimp and human mitochondrial genomes used for phylogenetic analysis; it must contain RSRS as a reference

For flanking region analysis you will need:

phased genotypes for worldwide populations (e.g. 1000 GP or SGDP dataset)
phased genotypes for analysed samples
archaic vcf-file
file with phase information for archaic NUMT (example: phases.txt)
file with sample information (example: all_studies_samples.txt)
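This README does not define how flanking_region_analysis.py computes its match ratio. As an illustration only, the sketch below assumes the ratio is the fraction of comparable sites at which a phased haplotype allele is also present in the archaic genotype. The function name, the 0/1 allele coding and the use of None for missing calls are all assumptions.

```python
def match_ratio(haplotype, archaic_genotypes):
    """Fraction of comparable sites where the haplotype allele occurs
    in the archaic genotype.

    `haplotype` is a list of alleles (0 = reference, 1 = alternative,
    None = missing); `archaic_genotypes` is a list of allele tuples
    (or None where the archaic genome has no call) for the same sites.
    Returns None if no site can be compared.
    """
    compared = 0
    matched = 0
    for allele, genotype in zip(haplotype, archaic_genotypes):
        if allele is None or genotype is None:
            continue  # site missing in one of the two inputs
        compared += 1
        if allele in genotype:
            matched += 1
    return matched / compared if compared else None


# Invented data: five sites in a flanking region.
haplotype = [0, 1, 1, 0, None]
archaic = [(0, 0), (0, 1), (0, 0), None, (1, 1)]
ratio = match_ratio(haplotype, archaic)
print(ratio)
```

In this toy case three sites are comparable and two of them match. In practice the alleles would come from the phased and archaic VCF files listed above.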

Workflow

Split the NUMT read files into separate BAM files, one per NUMT:

split_sam.py -s sample_support.sam

Further process each individual NUMT to obtain its sequence and combine it with corresponding sequences from various mitochondrial genomes:

bam2fastx -q -A -o numt.fq -N numt.bam
bwa mem RSRS.fasta numt.fq | samtools view -b | bam-rewrap RSRS:16569 | samtools sort > numt.sorted.bam; samtools index numt.sorted.bam
fixbam.py -s numt.sorted.bam -o numt.sorted.fixed.bam; samtools index numt.sorted.fixed.bam
getbed.py -s numt.sorted.fixed.bam -o numt.bed

gatk4.0 HaplotypeCaller -L numt.bed -R RSRS.fasta -I numt.sorted.fixed.bam -O numt.vcf
java -jar gatk3.8 -T FastaAlternateReferenceMaker -R RSRS.fasta -o numt.fasta -L numt.bed -V numt.vcf

extract_mito.py -n numt.fasta -a aligned_mt_genomes.fasta -b numt.bed -o numt_mito.fasta
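Because these steps are repeated for every NUMT, it can help to generate the commands from a file prefix. The sketch below only rebuilds the command lines shown above; the function names are made up, and gatk4.0 and gatk3.8 are kept as the placeholders used in this README, so adjust them to your local installation.

```python
import shlex
import subprocess


def numt_commands(prefix, reference="RSRS.fasta", alignment="aligned_mt_genomes.fasta"):
    """Return the shell commands of the workflow for one NUMT.

    `prefix` is the file name stem of the per-NUMT BAM file (e.g. "numt").
    """
    q = shlex.quote
    ref, aln = q(reference), q(alignment)
    bam, fq = q(f"{prefix}.bam"), q(f"{prefix}.fq")
    sorted_bam = q(f"{prefix}.sorted.bam")
    fixed_bam = q(f"{prefix}.sorted.fixed.bam")
    bed, vcf_file = q(f"{prefix}.bed"), q(f"{prefix}.vcf")
    fasta, out = q(f"{prefix}.fasta"), q(f"{prefix}_mito.fasta")
    return [
        f"bam2fastx -q -A -o {fq} -N {bam}",
        f"bwa mem {ref} {fq} | samtools view -b | bam-rewrap RSRS:16569 "
        f"| samtools sort > {sorted_bam}",
        f"samtools index {sorted_bam}",
        f"fixbam.py -s {sorted_bam} -o {fixed_bam}",
        f"samtools index {fixed_bam}",
        f"getbed.py -s {fixed_bam} -o {bed}",
        # "gatk4.0" and "java -jar gatk3.8" are placeholders taken from the README.
        f"gatk4.0 HaplotypeCaller -L {bed} -R {ref} -I {fixed_bam} -O {vcf_file}",
        f"java -jar gatk3.8 -T FastaAlternateReferenceMaker -R {ref} -o {fasta} "
        f"-L {bed} -V {vcf_file}",
        f"extract_mito.py -n {fasta} -a {aln} -b {bed} -o {out}",
    ]


def run_numt(prefix):
    """Run all steps in order and stop at the first failing command."""
    for command in numt_commands(prefix):
        # shell=True is needed for the pipes and the redirect; with a pipe,
        # check=True only sees the exit status of the last command.
        subprocess.run(command, shell=True, check=True)


# Print the commands instead of running them.
commands = numt_commands("numt")
for command in commands:
    print(command)
```

Calling run_numt("numt") would execute the steps, provided all tools and scripts are on the PATH.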

robbueck/arcnumt

Folders and files

Latest commit

History

Repository files navigation

arcnumt

Scripts used to detect archaic NUMTs in NGS data

Description

Required resources

Workflow

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages