Merge pull request #137 from labgem/context
Improve context command
jpjarnoux authored Oct 16, 2023
2 parents f08819d + b878a4c commit b82dc7b
Showing 13 changed files with 918 additions and 241 deletions.
8 changes: 5 additions & 3 deletions .github/workflows/main.yml
@@ -108,7 +108,10 @@ jobs:
run: |
cd testingDataset
ppanggolin context --pangenome myannopang/pangenome.h5 --sequences some_chlam_proteins.fasta --output test_context --fast
ppanggolin context --pangenome readclusterpang/pangenome.h5 --family some_chlam_families.txt --output test_context -f
+# test from gene family ids. Test here with one family of module 1. The context should find all families of module 1
+echo AP288_RS05055 > one_family_of_module_1.txt
+ppanggolin context --pangenome myannopang/pangenome.h5 --family one_family_of_module_1.txt --output test_context_from_id
cd -
- name: testing metadata command
shell: bash -l {0}
@@ -131,8 +134,7 @@ jobs:
run: |
cd testingDataset
head organisms.gbff.list | sed 's/^/input_org_/g' > organisms.gbff.head.list
-ppanggolin projection --pangenome stepbystep/pangenome.h5 -o projection_from_lisy_of_gbff \
-    --anno organisms.gbff.head.list
+ppanggolin projection --pangenome stepbystep/pangenome.h5 -o projection_from_lisy_of_gbff --anno organisms.gbff.head.list
ppanggolin projection --pangenome mybasicpangenome/pangenome.h5 -o projection_from_single_fasta \
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-1.2.191
+1.2.193
22 changes: 13 additions & 9 deletions docs/user/Genomic-context.md
@@ -30,21 +30,25 @@ In this case, you can give a pangenome without gene families representatives seq

In case of you are using families ID, you will only have as output the `gene_context.tsv` file. In the other case, you use sequences, you will have another output file to report the alignment between sequences and pangenome families (see detail in align subcommand).

-There are 4 columns in `gene_context.tsv`.
+There are 6 columns in `gene_context.tsv`.

-1. **geneContext ID**: identifier of the found context. It is incrementally generated, beginning with 1
+1. **geneContext ID**: Identifier of the found context. It is incrementally generated, beginning with 1
2. **Gene family name**: Identifier of the gene family, from the pangenome, correspond to the found context
3. **Sequence ID**: Identifier of the searched sequence in the pangenome
4. **Nb Genomes**: Number of genomes where the genomic context is found
+5. **Partition**: Partition of the gene family corresponding to the found context
+6. **Target family**: Whether the family is a target family, meaning it matches an input sequence, or a family provided as input.

In **sequence Id**, it is possible to find a NA value. This case, correspond to another gene family found in the context.

## Detailed options
-| option name | Description |
-|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| --no_defrag | Do not use the defragmentation step, to align sequences with MMseqs2 |
-| --identity | Minimum identity percentage threshold |
-| --coverage | Minimum coverage percentage threshold |
-| -t, --transitive | Size of the transitive closure used to build the graph. This indicates the number of non-related genes allowed in-between two related genes. Increasing it will improve precision but lower sensitivity a little. |
-| -s, --jaccard | Minimum jaccard similarity used to filter edges between gene families. Increasing it will improve precision but lower sensitivity a lot. |

+| option name | Description |
+|-----------------------------|---------------------------------------------------------------------------|
+| --fast | Use representative sequences of gene families for input gene alignment. This option is recommended for faster processing but may be less sensitive. By default, all pangenome genes are used for alignment. This argument makes sense only when --sequence is provided. (default: False) |
+| --no_defrag | Do not use the defragmentation step, to align sequences with MMseqs2 (default: False) |
+| --identity | Minimum identity percentage threshold (default: 0.8)|
+| --coverage | Minimum coverage percentage threshold (default: 0.8)|
+| -t, --transitive | Size of the transitive closure used to build the graph. This indicates the number of non-related genes allowed in-between two related genes. Increasing it will improve precision but lower sensitivity a little. (default: 4) |
+| -s, --jaccard | Minimum jaccard similarity used to filter edges between gene families. Increasing it will improve precision but lower sensitivity a lot. (default: 0.85) |
+| -w, --window_size | Number of neighboring genes that are considered on each side of a gene of interest when searching for conserved genomic contexts. (default: 5) |
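
Reviewer note: as a rough illustration of the six-column `gene_context.tsv` layout documented in this change, the sketch below groups rows by context ID and separates families that matched an input from the other families found in the same context (those with `NA` in the sequence column). The header strings in `COLUMNS`, the helper name `group_contexts`, and the sample rows are assumptions made for the example; the real file written by `ppanggolin context` may use different header names.

```python
import csv
import io
from collections import defaultdict

# Hypothetical header names: the documentation lists six columns, but the
# exact header strings in the real file may differ.
COLUMNS = ["geneContext ID", "Gene family name", "Sequence ID",
           "Nb Genomes", "Partition", "Target family"]

# Made-up rows for illustration only; this is not real PPanGGOLiN output.
SAMPLE = "\t".join(COLUMNS) + "\n" + "\n".join([
    "1\tfamA\tseq1\t10\tpersistent\tTrue",
    "1\tfamB\tNA\t10\tshell\tFalse",
    "2\tfamC\tseq2\t3\tcloud\tTrue",
]) + "\n"


def group_contexts(handle):
    """Group family names by context ID, split into matched targets and others."""
    contexts = defaultdict(lambda: {"targets": [], "others": []})
    for row in csv.DictReader(handle, delimiter="\t"):
        # "NA" in the sequence column marks a family that belongs to the
        # context but did not match any searched sequence.
        key = "others" if row["Sequence ID"] == "NA" else "targets"
        contexts[row["geneContext ID"]][key].append(row["Gene family name"])
    return dict(contexts)


contexts = group_contexts(io.StringIO(SAMPLE))
for context_id, families in contexts.items():
    print(context_id, families)
```

To try it on an actual output file, the `io.StringIO(SAMPLE)` argument could be swapped for an open file handle, after checking the header line and adjusting the column keys accordingly.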