Merge branch 'dev' into context
jpjarnoux authored Oct 12, 2023
2 parents cbb38c8 + f08819d commit c7bc585
Showing 4 changed files with 21 additions and 17 deletions.
1 change: 1 addition & 0 deletions docs/index.md
@@ -163,6 +163,7 @@ user/Regions-of-Genome-Plasticity
user/Conserved-modules
user/Align
user/Genomic-context
user/projection
user/metadata
user/Outputs
```
2 changes: 1 addition & 1 deletion docs/user/Flat/orgStat.md
@@ -19,7 +19,7 @@ This file is made of 15 columns described in the following table
| nb_cloud_genes | The number of genes whose family is cloud in that genome |
| nb_exact_core_genes | The number of genes whose family is exact core in that genome |
| nb_soft_core_genes | The number of genes whose family is soft core in that genome |
| completeness | This is an indicator of the proportion of single copy markers in the persistent that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes as this provide a *de novo* estimation of the completess based on the expectation that single copy markers within the persistent should be mostly present in all individuals of the studied taxonomic group |
| completeness | This is an indicator of the proportion of single copy markers in the persistent that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes as this provides a *de novo* estimation of the completeness based on the expectation that single copy markers within the persistent should be mostly present in all individuals of the studied taxonomic group |
| nb_single_copy_markers | This indicates the number of present single copy markers in the genomes. They are computed using the parameter duplication_margin indicated at the beginning of the file. They correspond to all of the persistent gene families that are not present in more than one copy in 5% (or more) of the genomes by default. |

It can be generated using the 'write' subcommand as such :
3 changes: 2 additions & 1 deletion docs/user/projection.md
@@ -1,4 +1,4 @@
# Projection command
# Projection
The ppanggolin projection command allows you to annotate external genomes using an existing pangenome. This process eliminates the need to recompute all components, streamlining the annotation process. Input genomes are expected to belong to the same species.

Genes within the input genome are aligned with genes in the pangenome to determine their gene families and partitions. Genes that do not align with any existing gene in the pangenome are considered specific to the input genome and are assigned to the "Cloud" partition. Based on the alignment and partition assignment, Regions of Genomic Plasticity (RGPs) within the input genome are predicted. Each RGP that is not located on a contig border is assigned to a spot of insertion. Finally, conserved modules of the pangenome found in the input genome are reported in the output files.
@@ -35,6 +35,7 @@ The Output directory contains `summary_projection.tsv` giving an overview of the
| Cloud genes | The number of genes in the "Cloud" partition.|
| Cloud families | The number of gene families in the "Cloud" partition.|
| Cloud specific families | The number of gene families that are specific to the input organism. These families are unique to the input organism and do not have homologs in any other genomes within the pangenome and have been assigned to the "Cloud" partition.|
| completeness | This indicates the proportion of single copy markers from the persistent partition that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes as this provides a *de novo* estimation of the completeness based on the expectation that single copy markers within the persistent should be mostly present in all individuals of the studied taxonomic group. |
| RGPs (Regions of Genomic Plasticity) | The number of Regions of Genomic Plasticity (RGPs) predicted within the input genome.|
| Spots | The total number of spots of insertion associated with RGPs in the input genome.|
| New spots | The number of new insertion spots that have been identified in the input genome. These spots represent novel genomic regions compared to other genomes in the pangenome.|
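
The `completeness` column described above can be illustrated with a small sketch. This is not the project's implementation: the function name, the use of plain family-name strings, the rounding to two decimals, and returning `None` when there are no markers are all assumptions made for illustration. It only shows the idea of reporting the percentage of single copy markers that are present in a genome.

```python
from typing import Iterable, Optional, Set


def completeness(genome_families: Iterable[str], single_copy_markers: Set[str]) -> Optional[float]:
    """Percentage of single copy marker families found in a genome.

    Hypothetical helper: returns None when there are no markers,
    because the ratio is undefined in that case.
    """
    if not single_copy_markers:
        return None
    # Only families that are both in the genome and in the marker set count.
    present = set(genome_families) & single_copy_markers
    return round(100 * len(present) / len(single_copy_markers), 2)


markers = {"famA", "famB", "famC", "famD"}
# Three of the four markers are present; "famX" is not a marker and is ignored.
print(completeness(["famA", "famB", "famC", "famX"], markers))
```

A value well below 100 for an isolate would suggest a fragmented or incomplete assembly, which is the use case the table entry highlights.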
32 changes: 17 additions & 15 deletions ppanggolin/projection/projection.py
@@ -10,7 +10,7 @@
import time
from pathlib import Path
import tempfile
from typing import Tuple, Set, Dict, Iterator, Optional, List, Iterable, Any
from typing import Tuple, Set, Dict, Optional, List, Iterable, Any
from collections import defaultdict
import csv
from itertools import chain
@@ -102,12 +102,26 @@ def launch(args: argparse.Namespace):
    check_pangenome_info(pangenome, need_annotations=True, need_families=True, disable_bar=args.disable_prog_bar,
                         need_rgp=predict_rgp, need_modules=project_modules, need_gene_sequences=False,
                         need_spots=project_spots)


    print("number_of_organisms", pangenome.number_of_organisms)
    logging.getLogger('PPanGGOLiN').info('Retrieving parameters from the provided pangenome file.')
    pangenome_params = argparse.Namespace(
        **{step: argparse.Namespace(**k_v) for step, k_v in pangenome.parameters.items()})

    # The dup margin value used here is given as an argument and is used to compute completeness.
    # That means it can differ from the dup margin used for spots and RGPs.

    # TODO make this single_copy_fams a method of class Pangenome that should be used in write --stats
    single_copy_fams = set()

    for fam in pangenome.gene_families:
        if fam.named_partition == "persistent":
            dup = len([genes for genes in fam.get_org_dict().values() if
                       len([gene for gene in genes if not gene.is_fragment]) > 1])

            if (dup / fam.number_of_organisms) < args.dup_margin:
                single_copy_fams.add(fam)


    genome_name_to_fasta_path, genome_name_to_annot_path = None, None

@@ -196,17 +210,6 @@ def launch(args: argparse.Namespace):
input_orgs_to_modules = project_and_write_modules(pangenome, organisms, output_dir)

    organism_2_summary = {}
    # The dup margin value used here is given as an argument and is used to compute completeness.
    # That means it can differ from the dup margin used for spots and RGPs.
    single_copy_fams = set()

    for fam in pangenome.gene_families:
        if fam.named_partition == "persistent":
            dup = len([genes for genes in fam.get_org_dict().values() if
                       len([gene for gene in genes if not gene.is_fragment]) > 1])

            if (dup / fam.number_of_organisms) < args.dup_margin:
                single_copy_fams.add(fam)

    for organism in organisms:
        # summarize projection for all input organisms
@@ -494,7 +497,6 @@ def summarize_projection(input_organism:Organism, pangenome:Pangenome, single_c
"Shell": {"genes":shell_gene_count, "families":shell_family_count},
"Cloud": {"genes":cloud_gene_count, "families":cloud_family_count - singleton_gene_count, "specific families":singleton_gene_count},
"Completeness":completeness,
"Single copy markers":single_copy_markers_count,
"RGPs": rgp_count,
"Spots": spot_count,
"New spots": new_spot_count,
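
The single copy family selection that this commit moves earlier in `launch` carries a TODO about turning it into a reusable method. A minimal standalone sketch of that logic follows, under stated assumptions: the `FamilyGenes` mapping is a hypothetical stand-in for the pangenome objects (it is not PPanGGOLiN's data model), the function name is invented, and the `0.05` default mirrors the 5% default mentioned in the orgStat documentation. The input is assumed to hold persistent families only.

```python
from typing import Dict, List, Set

# Hypothetical stand-in: family name -> genome name -> one bool per gene copy,
# True when that copy is a fragment. Not PPanGGOLiN's real data model.
FamilyGenes = Dict[str, Dict[str, List[bool]]]


def single_copy_families(persistent: FamilyGenes, dup_margin: float = 0.05) -> Set[str]:
    """Return persistent families duplicated in fewer than dup_margin of their genomes."""
    selected = set()
    for name, genes_by_genome in persistent.items():
        if not genes_by_genome:
            continue  # no genome carries this family, so the ratio is undefined
        # A genome counts as duplicated when it holds more than one non-fragment copy.
        duplicated = sum(
            1 for copies in genes_by_genome.values()
            if sum(not is_fragment for is_fragment in copies) > 1
        )
        if duplicated / len(genes_by_genome) < dup_margin:
            selected.add(name)
    return selected


families = {
    "famA": {"g1": [False], "g2": [False], "g3": [False, True]},   # extra copy is a fragment
    "famB": {"g1": [False, False], "g2": [False], "g3": [False]},  # duplicated in 1 of 3 genomes
}
print(single_copy_families(families))
```

Keeping the selection in one function would let the projection summary and the `write` statistics share the same marker set for a given duplication margin.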