changelog v1.0 --> v1.1
- Output vcf:
- Remove irrelevant INFO fields from the header of the output vcfs
- Reference genotypes are now printed in the traditional (REF/ALT) format, with REF = TE present = 0, and ALT = TE absent (deletion) = 1.
- Hard code python2.7 in assembly script to match Spades requirements
- Improve Non-Reference allele reconstruction script at TSD
- Clean bugs and silence non-threatening error messages
- Change parameterfile_NoRef.init to parameterfile_NRef.init to match regular script naming
- Create tutorial section (upcoming manuscript)
see the TypeTE paper in NAR (2020)
TypeTE is a pipeline dedicated to genotyping segregating Mobile Element Insertions (MEI) previously detected with a MEI detection tool such as MELT (Mobile Element Locator Tool, Gardner et al., 2017). TypeTE extracts reads from each detected polymorphic MEI and accurately reconstructs both the presence and absence alleles. The reads are then remapped at the individual level to score the genotype of each MEI using a modified version of the genotype likelihood of Li et al. This method dramatically improves the quality of the reported MEI genotypes and can be used directly after a MELT run on both non-reference and reference insertions.
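To illustrate what a read-based genotype likelihood looks like, here is a minimal sketch of a generic diploid model. It is not TypeTE's modified version, and it is an assumption that this generic form resembles the model TypeTE builds on. The function name `genotype_likelihoods` and its inputs (per-read error probabilities for reads supporting each allele) are hypothetical.

```python
def genotype_likelihoods(ref_read_errors, alt_read_errors, ploidy=2):
    """Return [L(g=0), ..., L(g=ploidy)], where g is the number of REF alleles.

    ref_read_errors: error probabilities of reads supporting the REF allele.
    alt_read_errors: error probabilities of reads supporting the ALT allele.
    Both inputs are hypothetical; in practice they could be derived from
    mapping or base qualities.
    """
    m = ploidy
    likelihoods = []
    for g in range(m + 1):
        lik = 1.0
        # A REF-supporting read comes from a REF allele (prob. g/m) without
        # error, or from an ALT allele (prob. (m-g)/m) with an error.
        for err in ref_read_errors:
            lik *= ((m - g) * err + g * (1.0 - err)) / m
        # Symmetric reasoning for ALT-supporting reads.
        for err in alt_read_errors:
            lik *= ((m - g) * (1.0 - err) + g * err) / m
        likelihoods.append(lik)
    return likelihoods
```

With two REF-supporting reads and one ALT-supporting read, all with a 1% error probability, the heterozygous genotype (g = 1) receives the highest likelihood under this model.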
TypeTE is divided into two modules: "Non-reference", to genotype insertions absent from the reference genome, and "Reference", to genotype TE copies present in the reference genome.
TypeTE currently works only with Alu insertions in the human genome, but will soon be available for L1, SVA, and virtually any retrotransposon in any organism with a reference genome.
This pipeline is developed by Jainy Thomas (University of Utah) and Clement Goubert (Cornell University), in collaboration with Jeffrey M. Kidd (University of Michigan).
Please address all your questions and comments using the "issue" tab of the repository. This allows the community to search for and directly find answers to their issues. If help is not coming, you can email your questions to goubert.clement[at]gmail.com
A docker container is coming for TypeTE! Stay tuned to get the latest version as soon as it comes out!
TypeTE relies on popular software often already in the toolbox of computational biologists! The following programs need to be installed and their paths reported in the file "parameterfile_[N]Ref.init"
The Perl executable must be in the user path
- PERL https://www.perl.org/
- BioPerl https://bioperl.org/INSTALL.html
- PYTHON 2.7 https://www.python.org/download/releases/2.7/ (Not compatible with Python 3)
- PARALLEL https://www.gnu.org/software/parallel/
- PICARD https://broadinstitute.github.io/picard/
- BEDTOOLS http://bedtools.readthedocs.io/en/latest/
- SEQTK https://github.com/lh3/seqtk
- BAMUTILS https://genome.sph.umich.edu/wiki/BamUtil
- SPADES http://cab.spbu.ru/software/spades/
- MINIA http://minia.genouest.org/
- CAP3 http://seq.cs.iastate.edu/cap3.html
- BLAST ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
- BWA http://bio-bwa.sourceforge.net/bwa.shtml
- BGZIP http://www.htslib.org/doc/bgzip.html
- TABIX http://www.htslib.org/doc/tabix.html
- Clone from git repository:
git clone --recurse-submodules https://github.com/clemgoub/TypeTE.git
cd TypeTE
- Complete the fields associated with the path of each dependency in the files "parameterfile_Ref.init" and "parameterfile_NRef.init"
- And that's it!
You will need:
- A vcf/vcf.gz file (VCF) such as generated by the MELT discovery workflow. Examples are available in the folder "test_data". The vcf file must contain only Reference or only Non-reference loci, according to the module chosen. Loci/individuals must be sampled from the original vcf/vcf.gz using the --recode-INFO-all flag in vcftools so that the subsetted vcf is compatible with TypeTE. If a new vcf is created specifically for TypeTE, the following tags must be present in the "INFO" field (column) for non-reference loci only:
- MEINFO= with predicted subfamily (Repbase name) and orientation of the TE (ex: MEINFO=AluYa5,.,.,+ | if the subfamily is unknown: MEINFO=AluUndef,.,.,+)
- TSD= to indicate the predicted TSD (ex: TSD=AATAGAATTAGCAATTTTG | if no TSD detected TSD=null)
example:
##fileformat=VCFv4.1
##<HEADER OF THE VCF FILE>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA07056 NA11830 NA12144
1 72639020 ALU_umary_ALU_244 C <INS:ME:ALU> . . MEINFO=AluUndef,4,281,-;TSD=AGCAATCTTATTTTC GT 0|1 0|0 0|1
10 69994906 ALU_umary_ALU_8067 G <INS:ME:ALU> . . MEINFO=AluUndef,8,280,+;TSD=AATAGAATTAGCAATTTTG GT 0|0 0|1 0|1
The "TSD=" and "MEINFO=" tags may appear in any order in the "INFO" column (8) of the vcf. These fields are not required for the Reference module, where they are extracted from the reference genome.
- bam files for each individual found in the vcf file
- A two-column tab-separated table with the sample name and corresponding bam name (BAMFILE):
sample1 sample1-xxx-file.bam
sample2 sample2-yyy-file.bam
sample3 sample3-zzz-file.bam
- Reference genome (GENOME) in fasta format (to date tested with hg19 and hg38). If another reference genome is used, you will need to update the RepeatMasker track corresponding to your reference as well as the repeat you want to genotype.
- RepeatMasker track (RM_TRACK): a .bed file reporting each reference MEI insertion masked by RepeatMasker for the reference sequence provided. The family names must match the names of the consensus sequences given in the RM_FASTA field (provided by default for Alu on hg19 and hg38).
- RepeatMasker consensus (RM_FASTA): a .fasta file with the consensus sequences of the repeats analysed (provided by default for Alu).
- Edit the file "parameterfile_NRef.init" or "parameterfile_Ref.init" following the indications within:
### MAIN PARAMETERS
# user data
VCF="/workdir/cg629/bin/TypeTE/test_data/test_data_nonref.vcf" #Path to MELT vcf (.vcf or .vcf.gz) must contain INFO field with TSD and MEI type
BAMPATH="/workdir/cg629/Projects/TypeTE_tutorial/test_data/" # Path to the bams folder
BAMFILE="/workdir/cg629/bin/TypeTE/test_data/input_table.txt" # <indiv_name> <bam_name> (2 fields tab separated table)
# genome data
RM_TRACK="/workdir/cg629/bin/TypeTE/Ressources/RepeatMasker_Alu_hg19.bed" # set by default for hg19
RM_FASTA="/workdir/cg629/bin/TypeTE/Ressources/refinelib" # set by default to be compatible with the Repeat Masker track included in the package
GENOME="/workdir/cg629/Projects/testTypeTE/hs37d5.fa" # Path to the reference genome sequence
# output
OUTDIR="/workdir/cg629/Projects/TypeTE_tutorial" # Path to place the output directory (will be named after PROJECT); OUTDIR must exist
PROJECT="OUTPUTS_NRef_testdata" # Name of the project (name of the folder)
# multi-threading
individual_nb="1" # number of individuals per job (try to minimize that number)
CPU="3" # number of CPU (try to maximize that number) # CPU x individual_nb >= total # of individuals
## non-mandatory parameters
MAP="NO" #OR NO (experimental)
### DEPENDENCIES PATH
# /!\ PERL MUST BE IN PATH /!\
PARALLEL="/programs/parallel/bin/parallel" #Path to the GNU Parallel program
PICARD="/programs/picard-tools-2.9.0" #Path to Picard Tools
BEDTOOLS="/programs/bedtools-2.27.1/bin/bedtools" #Path to bedtools executable
SEQTK="/programs/seqtk" #Path to seqtk executable
BAMUTILS="/programs/bamUtil" #Path to bamUtil
SPADES="/programs/spades-3.5.0/bin" #Path to spades bin directory (to locate spades.py and dispades.py)
MINIA="/workdir/cg629/bin/minia/build/bin" #Path to minia bin directory
CAP3="/workdir/cg629/bin/CAP3" #Path to CAP3 directory
BLAST="/programs/ncbi-blast-2.7.1+/bin" #Path to blast bin directory
BWA="/programs/bwa-0.5.9/bwa" #Path to bwa executable
BGZIP="bgzip" #Path to bgzip executable
TABIX="tabix" #Path to tabix executable
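The multi-threading comment in the parameter file asks for CPU x individual_nb to be at least the total number of individuals. As a minimal sketch (the function name `min_individuals_per_job` is hypothetical and not part of TypeTE), the smallest individual_nb that satisfies this constraint for a given number of CPUs can be computed as follows:

```python
def min_individuals_per_job(total_individuals, cpu):
    """Smallest individual_nb such that cpu * individual_nb >= total_individuals."""
    if total_individuals < 1 or cpu < 1:
        raise ValueError("total_individuals and cpu must be positive integers")
    # Integer ceiling division, avoids floating point rounding.
    return (total_individuals + cpu - 1) // cpu
```

For the test data (3 individuals, CPU="3") this gives individual_nb="1", which matches the values shown in the parameter file above.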
- Fill in the appropriate parameterfile_[N]Ref.init according to your local paths and files
- Run the following command in the TypeTE folder:
nohup ./run_TypeTE_[N]Ref.sh &> TypeTE.log &
Use ./run_TypeTE_Ref.sh for reference insertions and ./run_TypeTE_NRef.sh for non-reference insertions.
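Before launching a run, it can be useful to sanity-check the inputs described earlier. The following is a minimal sketch and not part of TypeTE: the helper names `missing_info_tags` and `parse_bam_table` are hypothetical, and the code assumes a tab-separated VCF data line and a tab-separated BAMFILE table.

```python
REQUIRED_INFO_TAGS = ("MEINFO", "TSD")  # required for non-reference loci only


def missing_info_tags(vcf_line):
    """Return the required INFO tags that are absent from one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th column of a VCF data line
    present = {entry.split("=", 1)[0] for entry in info.split(";")}
    return [tag for tag in REQUIRED_INFO_TAGS if tag not in present]


def parse_bam_table(text):
    """Parse the two-column BAMFILE table into a {sample: bam_name} dict."""
    table = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError("line %d: expected 2 tab-separated fields" % number)
        sample, bam_name = columns
        table[sample] = bam_name
    return table
```

A locus for which `missing_info_tags` returns a non-empty list would need its INFO column completed before being passed to the Non-reference module.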
TypeTE outputs a vcf.gz file containing all individual genotypes with their genotype likelihoods. The vcf convention reports genotypes relative to the allele present in the reference genome. TypeTE therefore reports Reference insertions as 0/0 (homozygous for the presence of the TE) or 0/1 (heterozygous), with 1/1 genotypes being homozygous for the absence of the TE. This pattern is reversed for Non-Reference insertions.
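To make this convention concrete, here is a small sketch (the function `te_allele_count` is hypothetical, not a TypeTE function) that converts a biallelic genotype into the number of alleles carrying the TE, depending on the module. It assumes the 0 = REF, 1 = ALT coding described above.

```python
def te_allele_count(genotype, module):
    """Number of alleles carrying the TE, or None if the genotype is missing.

    genotype: a biallelic GT string such as "0/1" or "1|1".
    module:   "Ref" (REF allele = TE present) or "NRef" (ALT allele = TE present).
    """
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None  # missing call
    alt_count = sum(1 for allele in alleles if allele == "1")
    if module == "Ref":
        return len(alleles) - alt_count
    if module == "NRef":
        return alt_count
    raise ValueError("module must be 'Ref' or 'NRef'")
```

For example, `te_allele_count("1/1", "Ref")` gives 0 (TE absent on both alleles), while `te_allele_count("1/1", "NRef")` gives 2.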
We have prepared a small tutorial/test run to check that all the components of TypeTE work properly.
We are going to run the pipeline on 2 loci in 3 individuals from the 1000 Genomes Project.
- Download the bam and bam.bai files
Within the TypeTE folder, type:
cd test_data
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA07056/alignment/NA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA07056/alignment/NA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam.bai
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA11830/alignment/NA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA11830/alignment/NA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam.bai
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12144/alignment/NA12144.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12144/alignment/NA12144.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam.bai
The corresponding bam/bam.bai files will be downloaded into /TypeTE/test_data
- Copy the parameterfile_NRef.init template present in /TypeTE/test_data to the main folder
cp parameterfile_NRef.init ../
cd ../
- Edit the parameterfile_NRef.init according to your dependencies and local paths.
- Run TypeTE
nohup ./run_TypeTE_NRef.sh &> TypeTE_TESTRUN.log &
- Expected results
The genotypes from the original vcf (<>/TypeTE/test_data/test_data_nonref.vcf) are the following
Locus | NA07056 | NA11830 | NA12144 |
---|---|---|---|
1_72639020 | 0/1 | 0/0 | 0/1 |
10_69994906 | 0/0 | 0/1 | 0/1 |
The new genotypes should be
Locus | NA07056 | NA11830 | NA12144 |
---|---|---|---|
1_72639020 | 1/1 | 0/1 | 0/1 |
10_69994906 | 0/0 | 1/1 | 0/1 |
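To compare a test run against the table above, the genotypes can be pulled out of the (decompressed) output vcf and checked programmatically. This is only a sketch: `genotype_table`, `differences` and `EXPECTED_NREF` are hypothetical names, the vcf is assumed to be tab-separated with GT as the first FORMAT subfield, and phasing is ignored.

```python
# Expected genotypes for the non-reference test run, copied from the table above.
EXPECTED_NREF = {
    "1_72639020": {"NA07056": "1/1", "NA11830": "0/1", "NA12144": "0/1"},
    "10_69994906": {"NA07056": "0/0", "NA11830": "1/1", "NA12144": "0/1"},
}


def genotype_table(vcf_text):
    """Return {"CHROM_POS": {sample: "a/b"}} from uncompressed VCF text."""
    samples = []
    table = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        locus = fields[0] + "_" + fields[1]
        calls = {}
        for sample, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[0].replace("|", "/")
            calls[sample] = "/".join(sorted(gt.split("/")))  # unphased form
        table[locus] = calls
    return table


def differences(observed, expected):
    """List (locus, sample, observed_gt, expected_gt) for every mismatch."""
    diffs = []
    for locus in sorted(expected):
        for sample in sorted(expected[locus]):
            seen = observed.get(locus, {}).get(sample)
            if seen != expected[locus][sample]:
                diffs.append((locus, sample, seen, expected[locus][sample]))
    return diffs
```

An empty list returned by `differences(genotype_table(text), EXPECTED_NREF)` would indicate that the run reproduced the expected genotypes. Applied to the original input vcf instead, the same call lists the calls that TypeTE is expected to change.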
Here we will genotype two reference loci in the same three individuals:
- Copy the parameterfile_Ref.init present in /TypeTE/test_data to the main folder
cp test_data/parameterfile_Ref.init .
- Edit the parameterfile_Ref.init according to your dependencies and local paths (but do not change anything else!)
- Run TypeTE
nohup ./run_TypeTE_Ref.sh &> TypeTE_TESTRUN_ref.log &
- Expected results
The genotypes from the original vcf (<>/TypeTE/test_data/test_data_ref.vcf) are the following
Locus | NA07056 | NA11830 | NA12144 |
---|---|---|---|
5_88043130 | 0/1 | 1/1 | 0/1 |
6_7717368 | 0/1 | 0/1 | 0/1 |
The new genotypes should be
Locus | NA07056 | NA11830 | NA12144 |
---|---|---|---|
5_88043130 | 1/1 | 0/1 | 0/1 |
6_7717368 | 1/1 | 1/1 | 0/1 |