Handling of partial genes in PPanGGOLiN #290
Merged
Previously, PPanGGOLiN treated partial genes as pseudogenes and ignored them unless the `--use_pseudo` flag was used in the annotate command. However, partial genes are still valid and should be included, which is exactly what this PR addresses.

To translate these genes correctly, it's important to know where the first complete codon starts.
Here's how we deal with sequence shifts at the start:
- In GBFF files, the shift is taken from the `codon_start` field.
- In GFF files, the shift is taken from the `frame` (column 8).

If a gene is partial at the end, we trim the last 1 or 2 nucleotides to ensure that the sequence length is a multiple of 3 for proper translation.
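
The start shift and end trimming described above can be sketched as follows. This is only an illustration, not the code from this PR: the function names are hypothetical, and it assumes column 8 follows the GFF3 phase convention (0, 1 or 2 bases to skip), which corresponds to `codon_start` minus 1.

```python
def gff_phase_to_codon_start(phase: str) -> int:
    """Convert GFF column 8 (0, 1 or 2 bases to skip) to a codon_start value.

    Assumption: '.' means the column is not set, so no shift is applied.
    """
    return 1 if phase == "." else int(phase) + 1


def translatable_part(seq: str, codon_start: int = 1) -> str:
    """Return the part of a CDS that can be translated codon by codon.

    codon_start follows the GBFF convention (1, 2 or 3): the 1-based
    position of the first base of the first complete codon.
    """
    if codon_start not in (1, 2, 3):
        raise ValueError(f"codon_start must be 1, 2 or 3, got {codon_start}")
    trimmed = seq[codon_start - 1:]   # drop the incomplete leading codon
    extra = len(trimmed) % 3          # 1 or 2 dangling bases at a partial end
    return trimmed[:len(trimmed) - extra] if extra else trimmed


# A gene partial at both ends: one leading base and two trailing bases are removed.
print(translatable_part("AATGAAACC", codon_start=2))  # ATGAAA
```

With this approach the returned string is exactly the stretch that gets translated, which is also what ends up stored and exported.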
The gene sequences stored in the pangenome file, and output by the `fasta` command, correspond to the exact part of the gene that will be translated.

🛠️ Additional Changes:
`transl_except` Handling:
Previously, genes with the `transl_except` tag were also treated as pseudogenes. This tag marks a codon that is translated differently from the genetic code table, typically a 'stop' codon that should be translated as selenocysteine instead of a true stop. These genes are fully valid and shouldn't be ignored.

This PR updates the behavior to include these genes. The only consequence is that a potential stop codon might be incorrectly translated in the protein sequence because MMseqs doesn’t handle this exception. However, this doesn't affect the clustering steps.
Gene Identifier Handling in GFF vs GBFF:
There was a difference in how gene local identifiers were parsed between GFF and GBFF files. In PPanGGOLiN, gene IDs are taken from the local identifier if it's unique across all genomes. Otherwise, an internal PPanGGOLiN ID is used.
- In GBFF files, the local identifier is the `locus_tag`.
- In GFF files, the local identifier was the `ID` attribute. However, in RefSeq genomes, this `ID` often corresponds to non-unique protein accessions (`WP_`), forcing PPanGGOLiN to use its internal ID instead.

To harmonize this, `locus_tag` is now retrieved in GFF files (if present). If not found, the `ID` attribute is used as the gene's local identifier.