Handling of partial genes in PPanGGOLiN #290

JeanMainguy · 2024-10-15T08:56:03Z

Previously, PPanGGOLiN treated partial genes as pseudogenes and ignored them unless the --use_pseudo flag was used in the annotate command. However, partial genes are still valid and should be included, which is exactly what this PR addresses.

To translate these genes correctly, it's important to know where the first complete codon starts.

Here's how we deal with sequence shifts at the start:

For GBFF files, we use the /codon_start qualifier.
For GFF files, we rely on the frame (column 8, named "phase" in the GFF3 specification).
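As a rough illustration of how the two fields map to the same quantity, here is a minimal sketch. The function names are hypothetical and are not taken from the PPanGGOLiN code base. It assumes the usual conventions: /codon_start is 1-based (1, 2 or 3), while the GFF column 8 value is the number of bases to skip (0, 1 or 2) and may be "." when absent.

```python
def start_offset_from_codon_start(codon_start: int) -> int:
    """Number of leading bases to skip, given a GBFF /codon_start value (1, 2 or 3)."""
    if codon_start not in (1, 2, 3):
        raise ValueError(f"Unexpected codon_start value: {codon_start}")
    return codon_start - 1


def start_offset_from_gff_phase(phase: str) -> int:
    """Number of leading bases to skip, given the GFF column 8 value ('0', '1', '2' or '.')."""
    if phase == ".":
        # Assumption: a missing value is treated as no shift.
        return 0
    value = int(phase)
    if value not in (0, 1, 2):
        raise ValueError(f"Unexpected phase value: {phase}")
    return value
```

With these conventions, /codon_start=2 in a GBFF file and a value of 1 in column 8 of a GFF file both mean that one base must be skipped before the first complete codon.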

If a gene is partial at the end, we trim the last 1 or 2 nucleotides to ensure that the sequence length is a multiple of 3 for proper translation.
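The end trimming can be sketched as follows. This is only an illustration with a hypothetical helper name, and it assumes the sequence passed in already starts at the first base of a complete codon.

```python
def trim_partial_end(sequence: str) -> str:
    """Drop the trailing 1 or 2 nucleotides so that the length is a multiple of 3."""
    remainder = len(sequence) % 3
    if remainder == 0:
        return sequence
    return sequence[:-remainder]
```

For example, an 8-nucleotide sequence such as ATGAAACC loses its last two bases and becomes ATGAAA.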

⚠️ Impact on Gene Sequences:

The gene sequences stored in the pangenome file, and output by the fasta command, correspond to the exact part of the gene that will be translated.

If a partial codon is found at the start or end of the gene, those nucleotides are removed.
The stored sequence begins at the first base of the first complete codon and ends at the last base of the final complete codon.
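To make the effect on coordinates concrete, here is a hedged sketch of how start and stop positions could be adjusted. The function name and signature are hypothetical and this is not the `fix_partial_gene_coordinates` implementation from the PR. It assumes 1-based inclusive coordinates on a single contiguous interval, and that on the minus strand the 5' end of the gene is at the stop coordinate.

```python
from typing import Tuple


def adjust_partial_coordinates(
    start: int, stop: int, strand: str, start_offset: int, partial_at_end: bool
) -> Tuple[int, int]:
    """Return (start, stop) restricted to complete codons.

    start_offset is the number of bases to skip at the 5' end of the gene.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"Unexpected strand: {strand}")

    # Length left once the partial codon at the 5' end has been removed.
    length = stop - start + 1 - start_offset
    # Only trim the 3' end when the gene is flagged as partial there.
    end_trim = length % 3 if partial_at_end else 0

    if strand == "+":
        return start + start_offset, stop - end_trim
    # Minus strand: the 5' end is at `stop`, the 3' end is at `start`.
    return start + end_trim, stop - start_offset
```

For a minus-strand gene spanning 10..21 with one base to skip at its 5' end, this sketch yields 12..20, which covers exactly three complete codons.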

🛠️ Additional Changes:

`transl_except` Handling:

Previously, genes with the transl_except qualifier were also treated as pseudogenes. This qualifier indicates that a codon is translated differently from the standard genetic code, typically a 'stop' codon that is read as selenocysteine instead of a true stop. These genes are fully valid and shouldn't be ignored.

This PR updates the behavior to include these genes. The only consequence is that such a codon may appear as a stop in the translated protein sequence, because MMseqs2 doesn't handle this exception. However, this doesn't affect the clustering steps.

Gene Identifier Handling in GFF vs GBFF:

There was a difference in how gene local identifiers were parsed between GFF and GBFF files. In PPanGGOLiN, gene IDs are taken from the local identifier if it's unique across all genomes. Otherwise, an internal PPanGGOLiN ID is used.

In GBFF files, the local ID comes from locus_tag.
In GFF files, it comes from the ID attribute. However, in RefSeq genomes, this ID often corresponds to non-unique protein accessions (WP_), forcing PPanGGOLiN to use its internal ID instead.

To harmonize this, locus_tag is now retrieved in GFF files (if present). If not found, the ID attribute is used as the gene's local identifier.
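The fallback logic can be sketched like this. The helper names are hypothetical and the parsing is deliberately simplified: it assumes standard `key=value` pairs separated by semicolons in column 9 and does not handle URL-escaped characters.

```python
from typing import Dict, Optional


def parse_gff_attributes(column9: str) -> Dict[str, str]:
    """Parse the GFF attributes column into a dictionary."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def local_identifier(column9: str) -> Optional[str]:
    """Prefer locus_tag and fall back to the ID attribute when it is missing."""
    attributes = parse_gff_attributes(column9)
    return attributes.get("locus_tag") or attributes.get("ID")
```

With a RefSeq-style line such as `ID=cds-WP_000001.1;locus_tag=ABC_0001`, the sketch returns `ABC_0001` rather than the protein accession.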

JeanMainguy added 18 commits October 10, 2024 14:11

identify starting and ending partiality of gene

7bb83ce

fix coordinates for partial genes

ad44264

add pytest for fixing coordinates fct

b8b1f42

refactor shifting function to handle tricky coordinates and reuse code

2a62ee8

update tests to match refactoring and tricky cases

aa4e988

do not ignore CDS with trans_except instead give a warning

0507a4b

simplify fct fix_partial_gene_coordinates

1e99584

rm old args from fct

9816e93

ignore NaturalNameWarning from tables as it is not really necessary

01c5cfe

handle partial genes from GFF files

520a84a

rm debug log

a79076b

rm debug

d99404c

use locus_tag in GFF parsing

e802d5e

update CI

7cb227c

update expected files with new genes and cluster

6507c57

Revert "update CI"

498b033

This reverts commit 7cb227c.

update CI

0ed382a

rm cluster file not used in CI anymore

760b0f0

jpjarnoux self-requested a review October 18, 2024 08:11

jpjarnoux added 2 commits October 18, 2024 12:08

Fix how frame getter/setter attribute is handled

a9a251f

Update docstring and typing

6940253

jpjarnoux requested changes Oct 18, 2024

jpjarnoux merged commit 561d81b into dev Oct 18, 2024
4 checks passed

JeanMainguy mentioned this pull request Oct 29, 2024

Merge dev branch into master to release version 2.2.0 #298

Merged

axbazin deleted the handle_partial_genes branch October 29, 2024 16:36
