Skip to content

Commit

Permalink
Merge pull request #119 from labgem/projection
Browse files Browse the repository at this point in the history
Add projection command
  • Loading branch information
axbazin authored Oct 10, 2023
2 parents 93a8468 + 3437391 commit f08819d
Show file tree
Hide file tree
Showing 39 changed files with 2,395 additions and 484 deletions.
21 changes: 18 additions & 3 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -71,7 +71,7 @@ jobs:
ppanggolin fasta -p stepbystep/pangenome.h5 --output stepbystep -f --prot_families all --gene_families shell --regions all --fasta organisms.fasta.list
ppanggolin draw -p stepbystep/pangenome.h5 --draw_spots --spots all -o stepbystep -f
ppanggolin metrics -p stepbystep/pangenome.h5 --genome_fluidity --info_modules --no_print_info -f --log metrics.log
cd -
cd -
- name: gbff parsing and MSA computing
shell: bash -l {0}
run: |
Expand Down Expand Up @@ -100,13 +100,14 @@ jobs:
shell: bash -l {0}
run: |
cd testingDataset
ppanggolin align --pangenome mybasicpangenome/pangenome.h5 --sequences some_chlam_proteins.fasta --output test_align --draw_related --getinfo
ppanggolin align --pangenome mybasicpangenome/pangenome.h5 --sequences some_chlam_proteins.fasta \
--output test_align --draw_related --getinfo --fast
cd -
- name: testing context command
shell: bash -l {0}
run: |
cd testingDataset
ppanggolin context --pangenome myannopang/pangenome.h5 --sequences some_chlam_proteins.fasta --output test_context
ppanggolin context --pangenome myannopang/pangenome.h5 --sequences some_chlam_proteins.fasta --output test_context --fast
ppanggolin context --pangenome readclusterpang/pangenome.h5 --family some_chlam_families.txt --output test_context -f
cd -
- name: testing metadata command
Expand All @@ -124,3 +125,17 @@ jobs:
cd testingDataset
ppanggolin utils --default_config panrgp -o panrgp_default_config.yaml
ppanggolin panrgp --anno organisms.gbff.list --cluster clusters.tsv -o test_config --config panrgp_default_config.yaml
cd -
- name: testing projection cmd
shell: bash -l {0}
run: |
cd testingDataset
head organisms.gbff.list | sed 's/^/input_org_/g' > organisms.gbff.head.list
ppanggolin projection --pangenome stepbystep/pangenome.h5 -o projection_from_lisy_of_gbff \
--anno organisms.gbff.head.list
ppanggolin projection --pangenome mybasicpangenome/pangenome.h5 -o projection_from_single_fasta \
--organism_name chlam_A --fasta FASTA/GCF_002776845.1_ASM277684v1_genomic.fna.gz \
--spot_graph --graph_formats graphml --fast --keep_tmp -f
2 changes: 1 addition & 1 deletion VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
1.2.189
1.2.191
1 change: 1 addition & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,7 @@ user/Regions-of-Genome-Plasticity
user/Conserved-modules
user/Align
user/Genomic-context
user/projection
user/metadata
user/Outputs
```
Expand Down
2 changes: 1 addition & 1 deletion docs/user/Flat/orgStat.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ This file is made of 15 columns described in the following table
| nb_cloud_genes | The number of genes whose family is cloud in that genome |
| nb_exact_core_genes | The number of genes whose family is exact core in that genome |
| nb_soft_core_genes | The number of genes whose family is soft core in that genome |
| completeness | This is an indicator of the proportion of single copy markers in the persistent that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes as this provide a *de novo* estimation of the completess based on the expectation that single copy markers within the persistent should be mostly present in all individuals of the studied taxonomic group |
| completeness | This is an indicator of the proportion of single copy markers in the persistent that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes as this provide a *de novo* estimation of the completeness based on the expectation that single copy markers within the persistent should be mostly present in all individuals of the studied taxonomic group |
| nb_single_copy_markers | This indicates the number of present single copy markers in the genomes. They are computed using the parameter duplication_margin indicated at the beginning of the file. They correspond to all of the persistent gene families that are not present in more than one copy in 5% (or more) of the genomes by default. |

It can be generated using the 'write' subcommand as such :
Expand Down
64 changes: 64 additions & 0 deletions docs/user/projection.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
# Projection
The ppanggolin projection command allows you to annotate external genomes using an existing pangenome. This process eliminates the need to recompute all components, streamlining the annotation process. Input genomes are expected to belong to the same species.

Genes within the input genome are aligned with genes in the pangenome to determine their gene families and partitions. Genes that do not align with any existing gene in the pangenome are considered specific to the input genome and are assigned to the "Cloud" partition. Based on the alignment and partition assignment, Regions of Plasticity (RGPs) within the input genome are predicted. Each RGP that is not located on a contig border is assigned to a spot of insertion. Finally, conserved modules of the pangenome found in the input genome are reported in the output files.

## Input files:

This command supports two input modes depending on whether you want to project a single genome or multiple genomes at once:

Multiple Files in One TSV:
- **Options**: `--fasta` or `--anno`
- **Description**: You can provide a tab-separated file listing organism names alongside their respective FASTA genomic sequences or annotation filepaths, with one line per organism. This mode is suitable when you want to annotate multiple genomes in a single operation. The format of this file is identical to the format used in the annotate and workflow commands; for more details, refer here.

Single File:
- **Options**: `--organism_name` with `--fasta` or `--anno` and `--circular_contigs` (optional)
- **Description**: When annotating a single genome, you can directly provide a single FASTA genomic sequence file or an annotation file in GFF/GBFF format. Additionally, specify the name of the organism using the `--organism_name` option. You can also indicate circular contigs using the `--circular_contigs` option when necessary.


## Output files:

The Output directory contains `summary_projection.tsv` giving an overview of the projection. one line per organism.


| Column | Description|
|--------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Organism name | This column contains name or identifier of the organisms being analyzed.|
| Pangenome file | The path to the pangenome file (pangenome.h5) used for the analysis.|
| Contigs | The number of contigs in the projected genome.|
| Genes | The total number of genes identified in the input genome.|
| Families | The total number of gene families to which genes in the genome of the input organism are assigned.|
| Persistent genes | The number of genes in the "Persistent" partition.|
| Persistent families | The number of gene families in the "Persistent" partition.|
| Shell genes | The number of genes in the "Shell" partition.|
| Shell families | The number of gene families in the "Shell" partition.|
| Cloud genes | The number of genes in the "Cloud" partition.|
| Cloud families | The number of gene families in the "Cloud" parition.|
| Cloud specific families | The number of gene families that are specific to the input organism. These families are unique to the input organism and do not have homologs in any other genomes within the pangenome and have been assigned to the "Cloud" partition.|
| completeness | This indicates the proportion of single copy markers from the persistent partition that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes as this provide a *de novo* estimation of the completeness based on the expectation that single copy markers within the persistent should be mostly present in all individuals of the studied taxonomic group. |
| RGPs (Regions of Genomic Plasticity) | The number of Regions of Genomic Plasticity (RGPs) predicted within the input genome.|
| Spots | The total number of spots of insertion associated with RGPs in the input genome.|
| New spots | The number of new insertion spots that have been identified in the input genome. These spots represent novel genomic regions compared to other genomes in the pangenome.|
| Modules | The number of modules that have been projected onto the input genome.|


Additionally, within the Output directory, there is a subdirectory for each input genome, named after the input genome itself. Each of these subdirectories contains several files:

For Gene Family and Partition of Input Genes:

- `cds_sequences.fasta`: This file contains the sequences of coding regions (CDS) from the input genome.
- `gene_to_gene_family.tsv`: It provides the mapping of genes to gene families of the pangenome. its format follows [this output](Outputs.md#gene-families-and-genes)
- `sequences_partition_projection.tsv`: This file maps the input genes to its partition (Persistent, Shell or Cloud).
- `specific_genes.tsv`: This file list the gene of the input genomes that do not align to any gene of the pangenome. These genes are assigned to Cloud parititon.

For RGPs and Spots:

- `plastic_regions.tsv`: This file contains information about Regions of Genomic Plasticity (RGPs) within the input genome. Its format follows [this output](Outputs.md#plastic-regions).
- `input_organism_rgp_to_spot.tsv`: It provides information about the association between RGPs and insertion spots in the input genome. Its format follows [this ouput](Outputs.md#spots).

Optionally, you can produce a graph of the RGPs using the `--spot_graph` option. This graph is similar as the one produce by the `ppanggolin spot` command.

For Modules:

- `modules_in_input_organism.tsv`: This file lists the modules that have been found in the input genome. Its format follows [this ouput](Outputs.md#modules-in-organisms).

67 changes: 48 additions & 19 deletions ppanggolin/RGP/genomicIsland.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
import logging
import argparse
from pathlib import Path
from typing import Set
from typing import Set, Iterable
# installed libraries
from tqdm import tqdm

Expand Down Expand Up @@ -191,27 +191,56 @@ def max_index_node(lst):
return contig_regions


def compute_org_rgp(organism: Organism, multigenics: set, persistent_penalty: int = 3, variable_gain: int = 1,
min_length: int = 3000, min_score: int = 4, naming: str = "contig") -> Set[Region]:
def compute_org_rgp( organism: Organism, multigenics: set,
persistent_penalty: int = 3, variable_gain: int = 1, min_length: int = 3000, min_score: int = 4,
naming: str = "contig", disable_bar: bool = True ) -> set:
"""
Compute regions of genomic plasticity (RGP) on the given organism based on the provided parameters.
:param organism: The Organism object representing the organism.
:param multigenics: A set of multigenic persistent families of the pangenome graph.
:param persistent_penalty: Penalty score to apply to persistent multigenic families (default: 3).
:param variable_gain: Gain score to apply to variable multigenic families (default: 1).
:param min_length: Minimum length threshold (in base pairs) for the regions to be considered RGP (default: 3000).
:param min_score: Minimum score threshold for considering a region as RGP (default: 4).
:param naming: Naming scheme for the regions, either "contig" or "organism" (default: "contig").
:param disable_bar: Whether to disable the progress bar. It is recommended to disable it when calling this function in a loop on multiple organisms (default: True).
:return: A set of RGPs of the provided organism.
"""
org_regions = set()
for contig in organism.contigs:
for contig in tqdm(organism.contigs, total=organism.number_of_contigs, unit="contig", disable=disable_bar):
if contig.number_of_genes != 0: # some contigs have no coding genes...
# can definitely multiprocess this part, as not THAT much information is needed...
matrix = init_matrices(contig, multigenics, persistent_penalty, variable_gain)
org_regions |= mk_regions(contig, matrix, multigenics, min_length, min_score, persistent_penalty,
variable_gain, naming=naming)
org_regions |= mk_regions(
contig,
matrix,
multigenics,
min_length,
min_score,
persistent_penalty,
variable_gain,
naming=naming
)
return org_regions


def naming_scheme(pangenome: Pangenome):
def naming_scheme(organisms: Iterable[Organism]) -> str:
"""
Determine the naming scheme for the contigs in the pangenome.
:param organisms: Iterable of organims objects
:return: Naming scheme for the contigs ("contig" or "organism").
"""
contigsids = set()
for org in pangenome.organisms:
for org in organisms:
for contig in org.contigs:
oldlen = len(contigsids)
contigsids.add(contig.name)
if oldlen == len(contigsids):
logging.getLogger("PPanGGOLiN").warning("You have contigs with identical identifiers in your "
"assemblies. identifiers will be supplemented with your "
"assemblies. Identifiers will be supplemented with your "
"provided organism names.")
return "organism"
return "contig"
Expand Down Expand Up @@ -248,25 +277,25 @@ def predict_rgp(pangenome: Pangenome, persistent_penalty: int = 3, variable_gain
# check statuses and load info
check_pangenome_former_rgp(pangenome, force)
check_pangenome_info(pangenome, need_annotations=True, need_families=True, need_graph=False, need_partitions=True,
disable_bar=disable_bar)
disable_bar=disable_bar)

logging.getLogger("PPanGGOLiN").info("Detecting multigenic families...")
multigenics = pangenome.get_multigenics(dup_margin)
logging.getLogger("PPanGGOLiN").info("Compute Regions of Genomic Plasticity ...")
name_scheme = naming_scheme(pangenome)
name_scheme = naming_scheme(pangenome.organisms)
for org in tqdm(pangenome.organisms, total=pangenome.number_of_organisms, unit="genomes", disable=disable_bar):
for region in compute_org_rgp(org, multigenics, persistent_penalty, variable_gain, min_length,
min_score, naming=name_scheme):
pangenome.add_region(region)
min_score, naming=name_scheme):
pangenome.add_region(region)
logging.getLogger("PPanGGOLiN").info(f"Predicted {pangenome.number_of_rgp} RGP")

# save parameters and save status
pangenome.parameters["RGP"] = {}
pangenome.parameters["RGP"]["persistent_penalty"] = persistent_penalty
pangenome.parameters["RGP"]["variable_gain"] = variable_gain
pangenome.parameters["RGP"]["min_length"] = min_length
pangenome.parameters["RGP"]["min_score"] = min_score
pangenome.parameters["RGP"]["dup_margin"] = dup_margin
pangenome.parameters["rgp"] = {}
pangenome.parameters["rgp"]["persistent_penalty"] = persistent_penalty
pangenome.parameters["rgp"]["variable_gain"] = variable_gain
pangenome.parameters["rgp"]["min_length"] = min_length
pangenome.parameters["rgp"]["min_score"] = min_score
pangenome.parameters["rgp"]["dup_margin"] = dup_margin
pangenome.status['predictedRGP'] = "Computed"


Expand Down
Loading

0 comments on commit f08819d

Please sign in to comment.