Merge branch 'dev' into context

labgem · Oct 12, 2023 · c7bc585
2 parents cbb38c8 + f08819d

Showing 4 changed files with 21 additions and 17 deletions.
diff --git a/docs/index.md b/docs/index.md
@@ -163,6 +163,7 @@ user/Regions-of-Genome-Plasticity
 user/Conserved-modules
 user/Align
 user/Genomic-context
+user/projection
 user/metadata
 user/Outputs
 ```

diff --git a/docs/user/Flat/orgStat.md b/docs/user/Flat/orgStat.md
@@ -19,7 +19,7 @@ This file is made of 15 columns described in the following table
 | nb_cloud_genes         | The number of genes whose family is cloud in that genome                                                                                                                                                                                                                                                                                                                                                                                                                                |
 | nb_exact_core_genes    | The number of genes whose family is exact core in that genome                                                                                                                                                                                                                                                                                                                                                                                                                           |
 | nb_soft_core_genes     | The number of genes whose family is soft core in that genome                                                                                                                                                                                                                                                                                                                                                                                                                            |
-| completeness           | This is an indicator of the proportion of single copy markers in the persistent that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes as this provide a *de novo* estimation of the completess based on the expectation that single copy markers within the persistent should be mostly present in all individuals of the studied taxonomic group |
+| completeness           | This is an indicator of the proportion of single copy markers in the persistent that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes as this provide a *de novo* estimation of the completeness based on the expectation that single copy markers within the persistent should be mostly present in all individuals of the studied taxonomic group |
 | nb_single_copy_markers | This indicates the number of present single copy markers in the genomes. They are computed using the parameter duplication_margin indicated at the beginning of the file. They correspond to all of the persistent gene families that are not present in more than one copy in 5% (or more) of the genomes by default.                                                                                                                                                                  |
 
 It can be generated using the 'write' subcommand as such : 

diff --git a/docs/user/projection.md b/docs/user/projection.md
@@ -1,4 +1,4 @@
-# Projection command
+# Projection
 The ppanggolin projection command allows you to annotate external genomes using an existing pangenome. This process eliminates the need to recompute all components, streamlining the annotation process. Input genomes are expected to belong to the same species.
 
 Genes within the input genome are aligned with genes in the pangenome to determine their gene families and partitions. Genes that do not align with any existing gene in the pangenome are considered specific to the input genome and are assigned to the "Cloud" partition. Based on the alignment and partition assignment, Regions of Plasticity (RGPs) within the input genome are predicted. Each RGP that is not located on a contig border is assigned to a spot of insertion. Finally, conserved modules of the pangenome found in the input genome are reported in the output files.
@@ -35,6 +35,7 @@ The Output directory contains `summary_projection.tsv` giving an overview of the
 | Cloud genes                          | The number of genes in the "Cloud" partition.|
 | Cloud families                       | The number of gene families in the "Cloud" parition.|
 | Cloud specific families              | The number of gene families that are specific to the input organism. These families are unique to the input organism and do not have homologs in any other genomes within the pangenome and have been assigned to the "Cloud" partition.|
+| completeness           | This indicates the proportion of single copy markers from the persistent partition that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes as this provide a *de novo* estimation of the completeness based on the expectation that single copy markers within the persistent should be mostly present in all individuals of the studied taxonomic group. |
 | RGPs (Regions of Genomic Plasticity) | The number of Regions of Genomic Plasticity (RGPs) predicted within the input genome.|
 | Spots                                | The total number of spots of insertion associated with RGPs in the input genome.|
 | New spots                            | The number of new insertion spots that have been identified in the input genome. These spots represent novel genomic regions compared to other genomes in the pangenome.|

diff --git a/ppanggolin/projection/projection.py b/ppanggolin/projection/projection.py
@@ -10,7 +10,7 @@
 import time
 from pathlib import Path
 import tempfile
-from typing import Tuple, Set, Dict, Iterator, Optional, List, Iterable, Any
+from typing import Tuple, Set, Dict, Optional, List, Iterable, Any
 from collections import defaultdict
 import csv
 from itertools import chain
@@ -102,12 +102,26 @@ def launch(args: argparse.Namespace):
     check_pangenome_info(pangenome, need_annotations=True, need_families=True, disable_bar=args.disable_prog_bar,
                          need_rgp=predict_rgp, need_modules=project_modules, need_gene_sequences=False,
                          need_spots=project_spots)
-
-
+    
+    print("number_of_organisms", pangenome.number_of_organisms)
     logging.getLogger('PPanGGOLiN').info('Retrieving parameters from the provided pangenome file.')
     pangenome_params = argparse.Namespace(
         **{step: argparse.Namespace(**k_v) for step, k_v in pangenome.parameters.items()})
 
+    # dup margin value here is specified in argument and is used to compute completeness. 
+    # Thats mean it can be different than dup margin used in spot and RGPS.
+
+    # TODO make this single_copy_fams a method of class Pangenome that should be used in write --stats 
+    single_copy_fams = set()
+
+    for fam in pangenome.gene_families:
+        if fam.named_partition == "persistent":
+            dup = len([genes for genes in fam.get_org_dict().values() if
+                        len([gene for gene in genes if not gene.is_fragment]) > 1])
+
+            if (dup / fam.number_of_organisms) < args.dup_margin:
+               single_copy_fams.add(fam)
+
 
     genome_name_to_fasta_path, genome_name_to_annot_path = None, None
 
@@ -196,17 +210,6 @@ def launch(args: argparse.Namespace):
         input_orgs_to_modules = project_and_write_modules(pangenome, organisms, output_dir)
 
     organism_2_summary = {}
-    # dup margin value here is specified in argument and is used to compute completeness. 
-    # Thats mean it can be different than dup margin used in spot and RGPS.
-    single_copy_fams = set()
-
-    for fam in pangenome.gene_families:
-        if fam.named_partition == "persistent":
-            dup = len([genes for genes in fam.get_org_dict().values() if
-                        len([gene for gene in genes if not gene.is_fragment]) > 1])
-
-            if (dup / fam.number_of_organisms) < args.dup_margin:
-               single_copy_fams.add(fam)
 
     for organism in organisms:
         # summarize projection for all input organisms
@@ -494,7 +497,6 @@ def summarize_projection(input_organism:Organism,  pangenome:Pangenome, single_c
         "Shell": {"genes":shell_gene_count, "families":shell_family_count},
         "Cloud": {"genes":cloud_gene_count, "families":cloud_family_count - singleton_gene_count, "specific families":singleton_gene_count},
         "Completeness":completeness,
-        "Single copy markers":single_copy_markers_count,
         "RGPs": rgp_count,
         "Spots": spot_count,
         "New spots": new_spot_count,
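
The single-copy marker selection that this commit moves earlier in `launch` can be tried in isolation. The sketch below is a minimal, self-contained approximation and not the PPanGGOLiN implementation: the dictionary-based data model and the names `select_single_copy_families` and `completeness` are hypothetical stand-ins for the `Pangenome` and `GeneFamily` objects. The completeness formula (percentage of single-copy markers found in the genome) is assumed from the documentation text, since the diff does not show it.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for a gene family: genome name -> one flag per gene copy,
# where True means that copy is a fragment.
FamilyCopies = Dict[str, List[bool]]


def select_single_copy_families(families: Dict[str, FamilyCopies],
                                partitions: Dict[str, str],
                                dup_margin: float = 0.05) -> Set[str]:
    """Return persistent families duplicated in fewer than dup_margin of their genomes."""
    single_copy = set()
    for name, copies_by_genome in families.items():
        if partitions.get(name) != "persistent" or not copies_by_genome:
            continue
        # Count genomes holding more than one non-fragmented copy of the family.
        duplicated = sum(
            1 for flags in copies_by_genome.values()
            if sum(1 for is_fragment in flags if not is_fragment) > 1
        )
        if duplicated / len(copies_by_genome) < dup_margin:
            single_copy.add(name)
    return single_copy


def completeness(genome_families: Set[str], single_copy: Set[str]) -> float:
    """Assumed formula: percentage of single-copy markers present in the genome."""
    if not single_copy:
        return 0.0
    return 100 * len(genome_families & single_copy) / len(single_copy)


# Toy data (invented for illustration only).
families = {
    "famA": {"g1": [False], "g2": [False], "g3": [False]},
    "famB": {"g1": [False, False], "g2": [False], "g3": [False]},  # truly duplicated in g1
    "famC": {"g1": [False, True], "g2": [False]},  # extra copy is a fragment, so not counted
    "famD": {"g1": [False]},
}
partitions = {"famA": "persistent", "famB": "persistent",
              "famC": "persistent", "famD": "shell"}

markers = select_single_copy_families(families, partitions)
score = completeness({"famA", "famD"}, markers)
```

With this toy data, `famB` is excluded because one of its three genomes carries two complete copies, while `famC` is kept because its second copy in `g1` is a fragment. Computing the marker set once, before the per-genome loop, mirrors the reason the block was moved up in `launch`.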