Merge pull request #137 from labgem/context

Improve context command
labgem · Oct 16, 2023 · b82dc7b
2 parents f08819d + b878a4c

Showing 13 changed files with 918 additions and 241 deletions.
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -108,7 +108,10 @@ jobs:
       run: |
         cd testingDataset
         ppanggolin context --pangenome myannopang/pangenome.h5 --sequences some_chlam_proteins.fasta --output test_context --fast
-        ppanggolin context --pangenome readclusterpang/pangenome.h5 --family some_chlam_families.txt --output test_context -f
+
+        # test from gene family ids. Test here with one family of module 1. The context should find all families of module 1
+        echo AP288_RS05055 > one_family_of_module_1.txt 
+        ppanggolin context --pangenome myannopang/pangenome.h5 --family one_family_of_module_1.txt  --output test_context_from_id
         cd -
     - name: testing metadata command
       shell: bash -l {0}
@@ -131,8 +134,7 @@ jobs:
       run: |
         cd testingDataset
         head organisms.gbff.list | sed 's/^/input_org_/g' > organisms.gbff.head.list
-        ppanggolin projection --pangenome stepbystep/pangenome.h5  -o projection_from_lisy_of_gbff \
-                            --anno organisms.gbff.head.list 
+        ppanggolin projection --pangenome stepbystep/pangenome.h5  -o projection_from_lisy_of_gbff --anno organisms.gbff.head.list 
 
 
         ppanggolin projection --pangenome mybasicpangenome/pangenome.h5  -o projection_from_single_fasta \

diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-1.2.191
+1.2.193
diff --git a/docs/user/Genomic-context.md b/docs/user/Genomic-context.md
@@ -30,21 +30,25 @@ In this case, you can give a pangenome without gene families representatives seq
 
 In case of you are using families ID, you will only have as output the `gene_context.tsv` file. In the other case, you use sequences, you will have another output file to report the alignment between sequences and pangenome families (see detail in align subcommand).
 
-There are 4 columns in `gene_context.tsv`. 
+There are 6 columns in `gene_context.tsv`. 
 
-1. **geneContext ID**: identifier of the found context. It is incrementally generated, beginning with 1
+1. **geneContext ID**: Identifier of the found context. It is incrementally generated, beginning with 1
 2. **Gene family name**: Identifier of the gene family, from the pangenome, correspond to the found context
 3. **Sequence ID**: Identifier of the searched sequence in the pangenome
 4. **Nb Genomes**: Number of genomes where the genomic context is found
 5. **Partition**: Partition of the gene family corresponding to the found context
+6. **Target family**: Whether the family is a target family, meaning it matches an input sequence, or a family provided as input.
 
 In **sequence Id**, it is possible to find a NA value. This case, correspond to another gene family found in the context.
 
 ## Detailed options
-| option name      | Description                                                                                                                                                                                                       |
-|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| --no_defrag      | Do not use the defragmentation step, to align sequences with MMseqs2                                                                                                                                              |
-| --identity       | Minimum identity percentage threshold                                                                                                                                                                             |
-| --coverage       | Minimum coverage percentage threshold                                                                                                                                                                             |
-| -t, --transitive | Size of the transitive closure used to build the graph. This indicates the number of non-related genes allowed in-between two related genes. Increasing it will improve precision but lower sensitivity a little. |
-| -s, --jaccard    | Minimum jaccard similarity used to filter edges between gene families. Increasing it will improve precision but lower sensitivity a lot.                                                                          |
+
+| option name | Description |
+|-----------------------------|---------------------------------------------------------------------------|
+| --fast | Use representative sequences of gene families for input gene alignment. This option is recommended for faster processing but may be less sensitive. By default, all pangenome genes are used for alignment. This argument makes sense only when --sequence is provided. (default: False) |
+| --no_defrag | Do not use the defragmentation step, to align sequences with MMseqs2 (default: False) |
+| --identity | Minimum identity percentage threshold (default: 0.8)|
+| --coverage | Minimum coverage percentage threshold (default: 0.8)|
+| -t, --transitive | Size of the transitive closure used to build the graph. This indicates the number of non-related genes allowed in-between two related genes. Increasing it will improve precision but lower sensitivity a little. (default: 4) |
+| -s, --jaccard | Minimum jaccard similarity used to filter edges between gene families. Increasing it will improve precision but lower sensitivity a lot. (default: 0.85) |
+| -w, --window_size | Number of neighboring genes that are considered on each side of a gene of interest when searching for conserved genomic contexts. (default: 5) |
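
As a reading aid for the documentation change above, the following is a minimal sketch of how the six-column `gene_context.tsv` output could be consumed downstream. It is not part of this pull request. It assumes the file is tab-separated with one header line, that columns appear in the order listed in the docs, and that the "Target family" column is written as `True`/`False`; the internal key names, the helper functions, and the example rows are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Column order follows the documentation in this diff. The short key names
# below are illustrative only; the real header strings are not relied upon.
COLUMNS = ["context_id", "family", "sequence_id",
           "nb_genomes", "partition", "is_target"]


def read_gene_contexts(handle):
    """Group the rows of a gene_context.tsv-like file by context ID."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line (assumed to be present)
    contexts = defaultdict(list)
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        row = dict(zip(COLUMNS, fields))
        row["nb_genomes"] = int(row["nb_genomes"])
        # Assumption: the target flag is serialized as "True" or "False".
        row["is_target"] = row["is_target"].strip().lower() == "true"
        contexts[row["context_id"]].append(row)
    return dict(contexts)


def target_families(contexts):
    """Return, per context, the families flagged as target families."""
    return {cid: [r["family"] for r in rows if r["is_target"]]
            for cid, rows in contexts.items()}


# Made-up rows, only to exercise the parser.
example = (
    "geneContext ID\tGene family name\tSequence ID\tNb Genomes\tPartition\tTarget family\n"
    "1\tfamA\tseq1\t10\tpersistent\tTrue\n"
    "1\tfamB\tNA\t8\tshell\tFalse\n"
    "2\tfamC\tseq2\t3\tcloud\tTrue\n"
)
contexts = read_gene_contexts(io.StringIO(example))
print(target_families(contexts))
```

Reading columns by position rather than by header name keeps the sketch independent of the exact header strings, which this diff does not show. Rows whose Sequence ID is `NA` are the non-target families found in a context, as described in the docs.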