-
Notifications
You must be signed in to change notification settings - Fork 2
BYO_transcriptome.py: adding transcriptome sequences to a target file
The BYO_transcriptome.py
script can be used to add sequences from any transcriptome to any protein-coding nucleotide target file.
-
Generate an alignment for each gene in your existing target file.
-
Create a Hidden Markov Model (HMM) profile from each alignment, using HMMER.
-
For each HMM, search each transcriptome in the user-provided folder.
-
Add transcriptome hits to each gene alignment.
-
Trim the 5' and 3' termini of transcriptome hits to the longest of a given set of reference taxa sequences (defaults to Arabidopsis thaliana, Amborella trichopoda, or Oryza sativa for Angiosperm353 target files. The user can supply alternative taxa using the
-trim_to_refs
option). -
In cases where a transcriptome hit sequence is <100% (user-adjustable value using the
-length_percentage
option) the length of the longest original target file sequence for a given gene, the closest related target file sequence is identified using a distance matrix, and the transcriptome hit sequence is extended by grafting with the 5' and/or 3' termini of the closest related sequence. If the resulting grafted sequence is still shorter than the longest original target file sequence, it is grafted again with the longest original sequence. IMPORTANT: the resulting target file sequence is therefore a chimeric construct, and these cases are flagged in the sequence name (e.g.>JOIS_grafted_with_Arath-4893
). This grafting process is necessary as HybPiper translates a single chosen target file sequence for each gene and sample, and the resulting protein sequence is used as a query in Exonerate to search against assembled nucleotide contigs, using the protein2genome model. Consequently, short protein queries recover truncated nucleotide loci sequences, even if longer contigs have been successfully assembled.Optionally, a flag can be specified to discard short transcriptome hits rather than grafting them (
-discard_short
), and this option can be applied with a user-specified length threshold (-length_percentage
). This allows recovery of non-chimeric homologous sequences from transcriptomes that can be used in downstream phylogenetic analyses, but will likely negatively effect the length of recovered loci sequences. -
After trimming and grafting, all sequences are checked for stop codons by translating them in the first reading frame. If any are found, the second and third reading frames are checked. If an open reading frame is successfully found, the 5' end of the sequence is altered so that it starts in the correct reading frame. A single stop codon can occur in the final 10 codons of any given sequence, to allow for cases where some existing target genes end with a stop codon.
-
If no open reading frame can be found, a second correction step is performed using Exonerate. For each gene, the sequence corresponding to the longest of the taxon sequences provided for alignment trimming (defaults to Arabidopsis thaliana, Amborella trichopoda, or Oryza sativa for Angiosperm353 target files. The user can supply alternative taxa using the
-trim_to_refs
option) is translated, and used as a protein query against the nucleotide sequence with frameshifts using the protein2genome model. This model is aware of both introns and frameshifts. If more than one hit region is found, hits are ordered, redundant hit ranges (which can occur if your protein has repeats) are removed, and any overlaps are trimmed. If a single hit is found, frameshifts and introns are removed. In both cases, a frameshift- and intron-free sequence is produced. As the Exonerate process results in a sequence that is shorter than the longest reference sequence, it is grafted with the 5' and/or 3' termini of the longest reference sequence.The resulting sequence is then checked for stop codons. If one is found (which can happen when in-frame stop codons are present in the transcript, perhaps due to sequencing/assembly errors, or perhaps these codons are subject to RNA-editing etc) the sequence is removed from the final target file and written to folder
13_gene_sequences_with_uncorrected_frameshifts
.Note: If your transcript hit diverges from the protein reference in some regions of the protein, Exonerate might not find a match for that region. In such cases, the region(s) between transcript Exonerate hits are padded with Ns, corresponding to the number of codons present in the reference query at these positions (x3).
-
Finally, sequences are extracted from each gene alignment, gap positions are removed, and all sequences are concatenated to create a new target file.
To reiterate: if transcriptome hits are grafted (as noted in the fasta header e.g. >JOIS_grafted_with_Arath-4893
), they represent chimeric sequences and should NOT be used for phylogenetic analyses!
-
In your target file, gene sequences should be grouped and differentiated by a suffix in the fasta header, consisting of a dash followed by an ID unique to each gene, e.g.:
>AJFN-4471 AATGTTATACAGGATGAAGAGAAACTGAAT... >Ambtr-4471 AGTGTTATTCAAGATGAAGATGTATTGTCG... >Ambtr-4527 GAGGAGCGGGTGATTGCCTTGGTCGTTGGT... >Arath-4691 GAGCTTGGATCTATCGCTTGCGCAGCTCTC... >BHYC-4691 GAAGTGAACTGTGTTGCTTGTGGGTTTCTT... etc.
Note: It is expected that the input target file contains protein-coding nucleotide sequences that translate correctly in frame 1. This is particularly important if the target file will be used with HybPiper. However,
BYO_transcriptome.py
allows for input target file sequences that translate without stop codons in one of the forwards frames; it will correct the starting nucleotide of sequences in the output target file to ensure that they begin at frame 1. -
Your transcriptomes should be named with a unique identifier; this can optionally be followed by a dash and additional text, e.g.:
AVJK.fa PPPZ.fa etc. or AVJK-SOAPdenovo-Trans-assembly.fa PPPZ-SOAPdenovo-Trans-assembly.fa etc.
Note: Your transcriptomes can be zipped in
.zip
,.gz
or.bz2
format. They will be unzipped when the script runs.
-
In the summary report, the sum of new sequences added might differ slightly from the difference between the total number of sequences in the original vs new target files, if the transcriptomes searched already have sequences present in the target file provided.
-
If you are providing your own target taxon names for use in transcriptome hit trimming and frameshift correction (using the
-trim_to_refs <taxon_name>
option), you need to provide each taxon name separately e.g.-trim_to_refs species1_name -trim_to_refs species2_name
-
The script option
-python_threads
effectively allows the user to specify the number of processes/programs (e.g. MAFFT alignments) that the script will launch concurrently. For example, providing the option-python_threads 4
will result in four MAFFT gene alignments being launched at the same time, each running on a single thread by default (see below). -
The script option
-external_program_threads
specifies the number of threads used by each concurrently running process/program (default is 1), for programs that support the use of multiple threads. For example, providing the options-external_program_threads 4 -python_threads 4
will result in the launch of four MAFFT alignments, each using four threads (and thus 16 threads in total).
python BYO_transcriptome.py [-h] [-num_hits_to_recover <integer>]
[-python_threads <integer>]
[-external_program_threads <integer>]
[-length_percentage <float>]
[-hmmsearch_evalue <number in scientific notation; default is 1e-50>]
[-trim_to_refs <taxon_name>] [-no_n]
[-skip_exonerate_frameshift_fix]
[-discard_short]
target_file transcriptomes_folder
positional arguments:
target_file target fasta file containing nucleotide sequences
named following the convention e.g. >AJFN-4471
transcriptomes_folder
Folder containing transcriptome files. These can
optionally be zipped in .zip, .gz or .bz2 format
optional arguments:
-h, --help show this help message and exit
-num_hits_to_recover <integer>
Number of HMM hits to recover from each transcriptome,
default is 1
-python_threads <integer>
number of threads for multiprocessing pools, default
is 1
-external_program_threads <integer>
number of threads for HMMer, mafft, default is 1
-length_percentage <float>
length percentage cut-off for grafting or discarding,
default is 1.00
-hmmsearch_evalue <number in scientific notation; default is 1e-50>
Evalue threshold for searching transcriptomes with
HMMprofiles, default is 1e-50
-trim_to_refs <taxon_name>
Name corresponding to a taxon present in target file,
used for trimming and correcting frameshifts in
transcriptome hits
-no_n Remove n characters from transcriptomes hits
-skip_exonerate_frameshift_fix
Do not use Exonerate to fix frameshifts; discard such
sequences instead
-discard_short Discard transcriptome hits beneath the
-length_percentage value rather than grafting
If everything goes according to plan, the two main output folders of interest will be:
-
17_mega_target_file
. A folder containing the final target fileBYO_target.fasta
. -
18_reports
. A folder containing the general report filesummary_report.csv
, and the filereport_per_gene.csv
containing a presence/absence matrix of transcriptome hits for each gene/transcriptome.
Most steps in the script pipeline write results to a separate folder, which can be useful for troubleshooting. These are:
-
00_transcriptomes_renamed
Renamed transcriptome fasta files.
-
01_target_gene_fasta_files
A multi-fasta file for each gene in the target file provided.
-
02_target_gene_alignment
Alignments of the single-gene fasta files, generated with MAAFT.
-
03_target_gene_hmms
Hidden Markov Model (HMM) profiles derived from the single gene alignments using HMMER.
-
04_target_gene_hmm_hits
A folder for each gene, containing transcriptome hit fasta files for each transcriptome provided (if a hit was found).
-
05_target_gene_hmm_logs
HMMER log files from HMM searches of each transcriptome.
-
06_align_extractions
Per-gene alignments of transcriptome hit sequences with original target-file sequences, generated with MAAFT.
-
07_gene_sequences_with_Ns
A folder for each gene that had transcriptome hits containing 'N' characters, containing the sequences in a single fasta file.
-
08_trimmed_alignments
Per-gene alignments from folder
06_align_extractions
, with the 5' and 3' termini of any transcriptome hits trimmed to the longest of the reference sequences provided (using the-trim_to_refs
option) or one of Arabidopsis thaliana, Amborella trichopoda, or Oryza sativa (default taxa for Angiosperms353 datasets). -
09_grafting_sequence_alignments
A folder for each gene that had transcriptome hits requiring grafting. Each folder contains a fasta file with an alignment of the transcriptome hit and the closest related sequence from the original target file. A fasta file with an alignment of the transcriptome hit and the longest sequence from the original target file may also be present.
-
10_seqs_to_add
Per-gene fasta files containing transcriptome hit sequences (post-grafting) to be added to the target file.
-
11_final_seqs
Per-gene fasta files containing both transcriptome hits (post-grafting and frameshift correction) and original target file sequences. These fasta files are combined to produce the final target file
BYO_target.fasta
in folder17_mega_target_file
. Note that if the script has not run to completion, these fasta files might still contain sequences with frameshifts. -
12_gene_sequences_with_frameshifts
A folder for each gene that had transcriptome hits containing frameshifts/internal stop-codons. Each folder contains a fasta files with the corresponding transcriptome hits sequences.
-
13_gene_sequences_with_uncorrected_frameshifts
A folder for each gene where sequence frameshift/internal stop-codon corrections were unsuccessfully attempted using Exonerate. Each folder contains a fasta file with transcriptome hits that could not be corrected.
-
14_gene_exonerate_results
A folder for each gene where sequence frameshift/internal stop-codon corrections were attempted using Exonerate. Each folder contains input and output files from the Exonerate search, as well as a fasta file of corrected sequences if the process was successful.
-
15_final_seqs_alignments
Single gene alignments from fasta files in folder
11_final_seqs
. These are provided for troubleshooting and diagnostic purposes. -
16_final_seqs_alignments_with_warnings
Alignment fasta files for genes where at least one newly added sequence is longer than expected (as defined by 1.1x the length of the longest original target file sequence). In most cases this is likely due to the presence of an in-frame intron in the transcriptome sequence.
-
17_mega_target_file
A folder containing the final target file
BYO_target.fasta
. -
18_reports
A folder containing the general report file
summary_report.csv
, and the filereport_per_gene.csv
containing a presence/absence matrix of transcriptome hits for each gene/transcriptome.