Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop codons in the final assembly #106

Open
junchen-deng opened this issue Dec 26, 2022 · 8 comments
Open

Stop codons in the final assembly #106

junchen-deng opened this issue Dec 26, 2022 · 8 comments

Comments

@junchen-deng
Copy link

Hello,

Recently we have been using hybpiper to assemble thousands of BUSCO single-copy genes from metagenomic reads (2x150bp). The target file is amino acids, and we set the coverage cutoff for spades to 4 since the sequencing depth is low.

We notice that there are stop codons in many assembled markers, which is not ideal considering that we used amino acid targets. Then, we looked for some stop codons in the output file exonerate_results.fasta. Here are some examples:
Screenshot from 2022-12-26 17-00-25
Screenshot from 2022-12-26 17-03-00

It seems that many of them appear at the beginning or the end of an alignment, although some of them can follow right after the annotated intron. In the wiki (https://github.com/mossmatters/HybPiper/wiki/Introns), it said:

As Exonerate hits begin and end with exon hits against an exon-only target file sequence, this means that only intron sequences that occur between identified exons will be recovered. That is, sequence upstream of the first exon and downstream of the last exon will not be annotated, and will not appear in the inronerate.gff file found within each gene directory. These unannotated regions might be exonic (i.e. if the target protein used in Exonerate searches was shorter than the exonic regions assembled in SPAdes contigs) or intronic (but not detected for reasons described above).

My understanding is that exonerate cannot detect incomplete introns at the beginning or the end of a contig. Am I correct? Do you think this could be the reason why we get stop codons in the final assembly?

If this is the issue, do you have any suggestions on how to remove these intron leftovers? Thanks!
We are going to use these markers for phylogenomics. But the region around the stop codon often does not align well with the rest of the sequences.

Sincerely,
Junchen

@chrisjackson-pellicle
Copy link
Collaborator

chrisjackson-pellicle commented Jan 9, 2023

Hi Junchen,

Yes, it's also my understanding that Exonerate can't detect incomplete introns at the beginning or end of target contigs, as it's searching for 5' and 3' splice sites (I'm not sure if the intron model is more sophisticated than that, and searches for other intron motifs as well).

Can you upload the target file query sequence for Hemiptera3-395at7524, along with the 395at7524_contigs.fasta file from the gene directory? I'd like to do some testing to check for possible causes of your issue. At this stage it seems possible to me that it could (at least in some cases) be caused by incomplete introns, perhaps in combination with incorrect extension of the Exonerate alignment due to the use of the Exonerate option -refine (as used by HybPiper). I might need to add a --no_refine flag to hybpiper assemble, and/or an option to ignore Exonerate target hits that contain stop codons (it'll have to be optional, I guess, as HybPiper will need to accommodate organisms that use alternative genetic codes where a canonical stop is translated as a standard amino acid).

Cheers,

Chris

@junchen-deng
Copy link
Author

Hi Chris,

Thanks for the reply! Here are the files you asked for.

Cheers,
Junchen
query.txt
395at7524_contigs.txt

@chrisjackson-pellicle
Copy link
Collaborator

chrisjackson-pellicle commented Jan 16, 2023

Hi Junchen,

Thanks for those files. The 395at7524_contigs.txt file doesn't contain the sequence NODE_1_length_2012_cov_8.473173, though. Is that from a different sample? If so, can you send me the *_contigs.txt file for that sample too?

With the first example you provided above (NODE_15_length_371_cov_2.302721), it does indeed appear that the stop codons at the beginning of the alignment originate from incomplete intron sequence in the SPAdes assembly. Using the Hemiptera3-395at7524 protein as a query, I ran a tBLASTn search against the NCBI nucleotide database - the top hit was XM_039430175, PREDICTED: Nilaparvata lugens mediator of RNA polymerase II transcription subunit 23 (LOC111051234), mRNA.

I downloaded the corresponding region of the annotated genome (https://www.ncbi.nlm.nih.gov/nuccore/NC_052509.1?report=genbank&from=14296457&to=14358256&strand=true), and aligned it with the 'good' region of the Exonerate hit for NODE_15_length_371_cov_2.302721, (i.e., the region immediately after the second stop codon, onward). Looking at the annotated genomic sequence, the CDS of this particular gene appears to be 4,170 bases, and the gene comprises 23 exons across 61.8 kb. As you can see in the alignments below (exons are annotated in yellow, Nilaparvata lugens genomic sequence top, NODE_15 bottom), the 'good' region of the NODE_15_length_371_cov_2.302721 Exonerate alignment corresponds to a single exon.

Wide view:
zoom_out

Zoomed view:
zoom_in

As mentioned, HybPiper 2 runs Exonerate with the flag --refine by default, as this can produce much longer, correct alignments in some cases. However, in this case removing the --refine flag did prevent Exonerate from including the incomplete intron sequence in the alignment, i.e., the NODE_15_length_371_cov_2.302721 hit now looks like this:

exonerate -m protein2dna --showalignment yes --showvulgar no -V 0 query.txt NODE_15.fasta --refine n
one

C4 Alignment:
------------
         Query: Hemiptera3-395at7524
        Target: NODE_15_length_371_cov_2.302721
         Model: protein2dna:local
     Raw score: 178
   Query range: 953 -> 988
  Target range: 174 -> 279

 954 : ValPheAspIleValIleHisArgTyrLeuGluIleProProValThrLysSerLeuGluThrL : 975
       ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       ValPheAspIleValIleHisArgTyrLeuGluIleProProValThrLysSerLeuGluThrL
 175 : GTGTTCGACATAGTTATACACAGGTACTTGGAGATACCACCAGTCACAAAATCACTGGAGACAT : 238

 976 : euLeuGluHisLeuGlyCysLeuTyrLysPheHisAspArg : 988
       |||||||||||||||||||||||||||||||||||! !!:!
       euLeuGluHisLeuGlyCysLeuTyrLysPheHisGlyLys
 239 : TACTGGAACATTTAGGCTGTCTCTATAAATTTCATGGTAAG : 279

I'll have to have a think about the best way to deal with this issue. An uncomplicated approach might be to add a few flags to hybpiper assemble - something like --no_exonerate_refine and --no_exonerate_hit_stop_codons. Ideally, though, I think it'd be good to parse the Exonerate alignment and apply a few thresholds and simple heuristics to pull out the 'good' regions from an alignment that's been produced using the --refine flag.

I'll wait to receive your other *_contigs.txt file, so I can investigate your second example properly. In the meantime, I aligned the Hemiptera3-395at7524 protein sequence against a translation of the CDS from the annotated Nilaparvata lugens genome mentioned above. As you can see in the alignment below, the two sequences align well overall, but there's a dubious region around the location of the putative intron in your Exonerate alignment. I'm wondering if there's something amiss with the Hemiptera3-395at7524 sequence, and if this has resulted in Exonerate failing to correctly identify the intron boundaries in SPAdes contig NODE_1_length_2012_cov_8.473173. Perhaps the intron actually starts a few codons upstream, after the ...AlaLeuHisAspLys region, which is the exon/intron boundary in the Nilaparvata lugens gene as shown below? Can you tell me how your target file was produced?

Wide view:
wide

Zoomed view:
narrow

Cheers,

Chris

@junchen-deng
Copy link
Author

Hi Chris,

Thanks for all the investigations!
Sorry for the confusion from NODE_1_length_2012_cov_8.473173. Yes, it's from a different sample. Here is the contig file:
395at7524_contigs.sample2.txt

As you can see, we are working with planthoppers (Hemiptera: Fulgoromorpha). Nilaparvata lugens is one of the references we used. In the beginning, we tried to extract single-copy BUSCO genes directly from the metagenomic assembly by running BUSCO with the database hemiptera_odb10. However, this method only produced ~400 complete single-copy genes out of 2500. We assumed that it was because of the low sequencing depth (<10) for the host genome. The host contigs are too short that the complete BUSCO markers cannot be recovered.

Therefore, we turned to Hybpiper. This is how we built our target file: 1) we downloaded the complete genomes of a few close refs of planthoppers and run BUSCO on them with the database '''hemiptera_odb10''' to extract single-copy genes. Here are some refs we used:

Acyrthosiphon pisum ---- Hemiptera1 (https://www.ncbi.nlm.nih.gov/genome/448)
Homalodisca vitripennis ---- Hemiptera2 (https://www.ncbi.nlm.nih.gov/genome/13454)
Nilaparvata lugens ---- Hemiptera3 (https://www.ncbi.nlm.nih.gov/genome/2941; this is exactly the genome you found! )
Zanna intricata ---- Hemiptera4 (https://www.ncbi.nlm.nih.gov/genome/87039)

Then, 2) we combined them with busco_ancestral (Hemiptera0) to produce our target file. And we asked hybpiper to assemble the markers in this target file with metagenomic reads. In this way, we are able to assemble most markers with a low spades coverage cutoff.

It's very interesting to see that Hemiptera3-395at7524 protein, which was extracted from the same Nilaparvata lugens genome, is different in some regions. My intepretation is that BUSCO decided the exon/intron boundary differently, which causes some intron regions to be retained in the final busco single-copy genes. When we use these genes as targets for hybpiper, they cause more troubles during exonerate step.

It's also interesting to see that the putative intron region was trimmed after removing --refine flag. But I am not sure how this action will affect the overal results as you said that --refine by default can produce much longer, correct alignments in some cases.

I really want to hear more thoughts from you. We are working with phylogenomics. Although a few untrimmed introns in our final alignment may not affect the final phylogeny, they still create erroneous alignment for a few samples in these regions of many genes. It would be nice if we can find a solution to this so that we can have a clean alignment to work with at the end.

Sincerely,
Junchen

@chrisjackson-pellicle
Copy link
Collaborator

Hi Junchen,

I realised I made a mistake when testing the Exonerate --refine flag with your data, and that I was using the protein2dna model rather than the protein2genome model. When using the protein2dna model, Exonerate behaves as described above (i.e., the stop-codon-containing intron sequence isn't recovered in the Exonerate hit when --refine isn't used), but the protein2genome model recovers the intron sequence in the alignment regardless of whether --refine is used or not.

So, I'm in the process of adding an additional filter to HybPiper that processes the Exonerate alignments and trims hit boundaries as necessary. Hopefully it'll deal with your issue. It should be ready later this week, but I'll let you know when I've pushed the update.

Cheers,

Chris

@junchen-deng
Copy link
Author

Hi Chris,

Thanks very much!
My understanding is that hybpiper uses protein2genome model by default in Exonerate for the detection of introns. So basically the flag --refine cannot help with this issue. I am looking forward to the additional filter you are working on!

Cheers,
Junchen

@chrisjackson-pellicle
Copy link
Collaborator

Hi Junchen,

I've just pushed HybPiper version 2.1.2 and updated the conda packages - see the changelog here.

It includes a new filter for Exonerate alignments. The filter doesn't aim to remove stop codons explicitly, but rather it should trim any poorly aligned 5' and 3' ends from an Exonerate alignment before the corresponding SPAdes contig hit sequence is potentially incorporated into an output *.FNA sequence. In my testing, the filter correctly removed the dodgy end of your alignment for NODE_15_length_371_cov_2.302721.

In addition, HybPiper version 2.1.2 now runs a check for any 'internal' (i.e., non-terminal) stop codons in all final output *.FNA sequences - it provides a warning if any are found, and writes the corresponding gene names to a text file in the sample directory called <prefix>_genes_with_non_terminal_stop_codons.txt. I've added a section to the wiki that discusses this in more detail, see here.

I suspect this is something that'll need to be further developed for future versions of HybPiper, but hopefully the current fix deals with some of the issues you saw. Let me know how you go, and if you have any ideas about how this might be done better, please don't hesitate to let me know!

Cheers,

Chris

@junchen-deng
Copy link
Author

Hi Chris,

Thanks for the updates!
I re-run my samples, and the final alignment looks better!

I did some investigations on one sample and here are a few things that I am curious about:

  1. In some cases, the bad-quality regions at the beginning/end were successfully trimmed, but new 'correct' sequences were extracted to fill the gap. This is nice because the tool not only trims the wrong region but also tries to fill the gap by looking for the potentially correct region from other contigs. But if the correct region has already been assembled, why the old version would take the bad-quality region as the final output?

Here is an example from the sample OECSP2:
The AA alignment looks like this ('new' is the output from v2.1.2):

OECSP2-old      PTTPLSVDVSTIFQVHSKM
                ||||||        |||||
OECSP2-new      PTTPLSMSILDSLTVHSKM

The problematic contig is the following and was taken by the previous version as the final output:
Screenshot from 2023-02-07 15-40-36

But the correct contig also exists and was taken by the new version after trimming the potential intron region:
Screenshot from 2023-02-07 15-43-00

  1. In some cases, the incomplete intron regions are located in the middle of the contig, which failed to be detected by Exonerate and trimmed by the new filter.

Here is an example from the same sample:
Screenshot from 2023-02-07 16-10-28
The poorly aligned regions in the last three lines are potential intron regions. In this case, the query Hemiptera3-395at7524 was perfectly extracted by BUSCO and contains only exons (I compared it with the NCBI record). And the poorly aligned region starts right at the border of the exon/intron, indicating that it's part of the intron. But somehow, the intron is incomplete, and the assembler captures part of the next exon, making the incomplete intron region in the middle.

I am not sure how these internal incomplete introns could be detected. One way could be that the script will detect regions with the reduced alignment score and then decide if this region is an incomplete intron by its size or its sequences at two ends (to my knowledge, introns are often characterized by 'AG---------GT').

Thanks again for the quick updates. I am looking forward to hearing more thoughts from you!

If you want to look at this sample (OECSP2), here is the contig file.
395at7524_contigs_OECSP2.txt
The query file Hemiptera3-395at7524 is the same as the previous one.

Cheers,
Junchen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants