Handling of partial genes in PPanGGOLiN #290

Merged: 20 commits, Oct 18, 2024

@JeanMainguy (Member) commented Oct 15, 2024

Previously, PPanGGOLiN treated partial genes as pseudogenes and ignored them unless the --use_pseudo flag was used in the annotate command. However, partial genes are still valid and should be included, which is exactly what this PR addresses.

To translate these genes correctly, it's important to know where the first complete codon starts.

Here's how we deal with sequence shifts at the start:

  • For GBFF files, we use the /codon_start qualifier.
  • For GFF files, we rely on the frame (column 8, called "phase" in GFF3).

If a gene is partial at the end, we trim the last 1 or 2 nucleotides to ensure that the sequence length is a multiple of 3 for proper translation.
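The trimming described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names are hypothetical, and it assumes the nucleotide sequence is already given on the gene's own strand. It relies on the fact that /codon_start is 1-based (1 means the first base starts a codon), while the GFF frame/phase value directly counts the bases to skip (0, 1 or 2).

```python
def offset_from_codon_start(codon_start: int) -> int:
    """Convert a GBFF /codon_start value (1, 2 or 3) to a number of bases to skip."""
    return codon_start - 1


def offset_from_gff_phase(phase: str) -> int:
    """Convert GFF column 8 to a number of bases to skip.

    Assumption for this sketch: "." (unspecified) is treated as 0.
    """
    return 0 if phase == "." else int(phase)


def translatable_region(seq: str, start_offset: int = 0) -> str:
    """Return the part of `seq` that consists only of complete codons."""
    if start_offset not in (0, 1, 2):
        raise ValueError("start_offset must be 0, 1 or 2")
    # Skip the partial codon at the start, if any.
    trimmed = seq[start_offset:]
    # Drop the 1 or 2 trailing bases that do not form a complete codon.
    usable_length = len(trimmed) - len(trimmed) % 3
    return trimmed[:usable_length]
```

For example, a gene with codon_start=2 and one extra trailing base would lose its first base and its last base, leaving a sequence whose length is a multiple of 3.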

⚠️ Impact on Gene Sequences:

The gene sequences stored in the pangenome file, and output by the fasta command, correspond to the exact part of the gene that will be translated.

  • If a partial codon is found at the start or end of the gene, those nucleotides are removed.
  • The stored sequence begins at the first base of the first complete codon and ends at the last base of the final complete codon.

🛠️ Additional Changes:

transl_except Handling:

Previously, genes with the transl_except tag were also treated as pseudogenes. This tag marks a codon that is translated differently from the genetic code table, most commonly a 'stop' codon that is read as selenocysteine instead of a true stop. These genes are fully valid and shouldn't be ignored.

This PR updates the behavior to include these genes. The only consequence is that the affected codon may be translated as a stop in the protein sequence, because MMseqs doesn't handle this exception. However, this doesn't affect the clustering steps.
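To illustrate the consequence, here is a small standalone sketch of a translation with the standard genetic code (NCBI table 1). It is not how MMseqs or PPanGGOLiN translate sequences; the `translate` function and its `exceptions` parameter (codon index mapped to an amino acid letter) are hypothetical and only show what a transl_except-aware translation would change.

```python
from itertools import product
from typing import Dict, Optional

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str, exceptions: Optional[Dict[int, str]] = None) -> str:
    """Translate complete codons of `cds`.

    `exceptions` maps a 0-based codon index to the amino acid to use there
    (hypothetical stand-in for a parsed transl_except tag).
    """
    exceptions = exceptions or {}
    protein = []
    for index in range(len(cds) // 3):
        codon = cds[3 * index : 3 * index + 3].upper()
        # Unknown or ambiguous codons are reported as "X" in this sketch.
        protein.append(exceptions.get(index, CODON_TABLE.get(codon, "X")))
    return "".join(protein)


naive = translate("ATGTGAAAATAA")            # internal TGA read as a stop
aware = translate("ATGTGAAAATAA", {1: "U"})  # TGA read as selenocysteine (U)
```

Without the exception, the internal TGA shows up as `*` in the protein (`M*K*`); with it, the same position becomes `U` (`MUK*`).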

Gene Identifier Handling in GFF vs GBFF:

There was a difference in how gene local identifiers were parsed between GFF and GBFF files. In PPanGGOLiN, gene IDs are taken from the local identifier if it's unique across all genomes. Otherwise, an internal PPanGGOLiN ID is used.

  • In GBFF files, the local ID comes from locus_tag.
  • In GFF files, it comes from the ID attribute. However, in RefSeq genomes, this ID often corresponds to non-unique protein accessions (WP_), forcing PPanGGOLiN to use its internal ID instead.

To harmonize this, locus_tag is now retrieved in GFF files (if present). If not found, the ID attribute is used as the gene's local identifier.
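The lookup order described above could look roughly like this. It is a simplified sketch with hypothetical function names, assuming GFF3-style `key=value` attributes separated by semicolons in column 9; it does not reproduce the parser in ppanggolin.

```python
from typing import Dict, Iterable, Optional


def parse_gff_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column into a dictionary."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def local_identifier(attributes: Dict[str, str]) -> Optional[str]:
    """Prefer locus_tag; fall back to the ID attribute when it is absent."""
    return attributes.get("locus_tag") or attributes.get("ID")


def local_ids_are_unique(identifiers: Iterable[str]) -> bool:
    """True if no local identifier is repeated across all genomes."""
    identifiers = list(identifiers)
    return len(identifiers) == len(set(identifiers))
```

If `local_ids_are_unique` returned False for the identifiers collected from all genomes, an internal ID would be used instead, as described in the text.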

@jpjarnoux jpjarnoux self-requested a review October 18, 2024 08:11
3 review threads on ppanggolin/genome.py (resolved)
@jpjarnoux jpjarnoux merged commit 561d81b into dev Oct 18, 2024
4 checks passed
@axbazin axbazin deleted the handle_partial_genes branch October 29, 2024 16:36