Merge remote-tracking branch 'origin/projection' into context
Showing 72 changed files with 10,319 additions and 4,477 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
1.2.127 | ||
1.2.191 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
From version 2.0.0, PPanGGOLiN can attach metadata to pangenome elements. | ||
Metadata can be associated with genes, genomes, families, RGPs, spots and modules from a simple TSV file. | ||
To add metadata to your pangenome, run the following command: | ||
|
||
`ppanggolin metadata -p PANGENOME --metadata METADATA.TSV --source SOURCE --assign ASSIGN` | ||
|
||
- `--source` corresponds to the origin of the metadata and is used as the storage key in the pangenome. | ||
- `--assign` selects the type of pangenome element to which the metadata is added, from the following list: {families,genomes,genes,RGPs,spots,modules}. | ||
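The command above can also be assembled from a script. The following is a minimal sketch, not part of PPanGGOLiN: the helper `build_metadata_command` and the file names are hypothetical, and only the flags shown in the command line above are used. It builds the argument list and checks the `--assign` value before anything is run.

```python
import shlex

# Element types accepted by --assign, as listed in the documentation above.
ASSIGN_CHOICES = ("families", "genomes", "genes", "RGPs", "spots", "modules")


def build_metadata_command(pangenome, metadata_tsv, source, assign):
    """Return the `ppanggolin metadata` call as an argument list (hypothetical helper)."""
    if assign not in ASSIGN_CHOICES:
        raise ValueError(f"--assign must be one of {ASSIGN_CHOICES}, got {assign!r}")
    return [
        "ppanggolin", "metadata",
        "-p", str(pangenome),
        "--metadata", str(metadata_tsv),
        "--source", source,
        "--assign", assign,
    ]


# Hypothetical file names and source label, used only for illustration.
command = build_metadata_command("pangenome.h5", "annotations.tsv", "my_source", "families")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where `ppanggolin` is installed.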
|
||
# Metadata format | ||
|
||
PPanGGOLiN accepts a highly flexible metadata file. Only one column name is mandatory, and it must be identical to the | ||
value of the `--assign` argument chosen by the user. | ||
|
||
For example, a TSV file that assigns functional annotation metadata to gene families could look as follows: | ||
|
||
| families | Accession | Function | Description | | ||
|----------|----------|----------|-------------| | ||
| GF_1 | Acc_1 | Fn_1 | Desc_1 | | ||
| GF_2 | Acc_2 | Fn_2 | Desc_2 | | ||
| GF_2 | Acc_3 | Fn_3 | Desc_3 | | ||
| ... | ... | ... | ... | | ||
| GF_n | Acc_n | Fn_n | Desc_n | | ||
|
||
*Note: As the table above shows, one element (here GF_2) can be associated with more than one metadata entry.* | ||
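A file in this layout can be produced and checked with Python's standard `csv` module. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the rows mirror the placeholder values of the table, and the only rule enforced is the one stated above (a column named after the `--assign` value must exist). It also shows that several rows may share the same element ID.

```python
import csv
import io
from collections import defaultdict

ASSIGN = "families"  # must match the value given to --assign

# Placeholder rows mirroring the example table; GF_2 appears twice.
rows = [
    {"families": "GF_1", "Accession": "Acc_1", "Function": "Fn_1", "Description": "Desc_1"},
    {"families": "GF_2", "Accession": "Acc_2", "Function": "Fn_2", "Description": "Desc_2"},
    {"families": "GF_2", "Accession": "Acc_3", "Function": "Fn_3", "Description": "Desc_3"},
]


def write_metadata_tsv(rows, handle, assign=ASSIGN):
    """Write rows as TSV, placing the mandatory column first."""
    fieldnames = [assign] + [key for key in rows[0] if key != assign]
    writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def group_by_element(handle, assign=ASSIGN):
    """Read a metadata TSV and group its rows by element ID."""
    reader = csv.DictReader(handle, delimiter="\t")
    if not reader.fieldnames or assign not in reader.fieldnames:
        raise ValueError(f"mandatory column {assign!r} is missing")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row[assign]].append({k: v for k, v in row.items() if k != assign})
    return dict(grouped)


buffer = io.StringIO()
write_metadata_tsv(rows, buffer)
buffer.seek(0)
grouped = group_by_element(buffer)
print({family: len(entries) for family, entries in grouped.items()})
```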
|
||
## Command-specific option details | ||
|
||
### `--metadata` | ||
PPanGGOLiN accepts one TSV file at a time to add metadata. See [Metadata Format](<https://github.com/labgem/PPanGGOLiN/wiki/Metadata#Metadata Format>) for details. | ||
|
||
### `--source` | ||
The source is the key used to access the metadata in the pangenome. | ||
If the source name already exists in the pangenome, it can be overwritten only with the `--force` option. | ||
This system allows multiple metadata sources to be stored, read and used in PPanGGOLiN. | ||
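The behaviour described for `--source` and `--force` can be pictured with a small toy model. This is only an illustration of the semantics stated above, not PPanGGOLiN's internal storage; the class `MetadataStore` and its methods are hypothetical.

```python
class MetadataStore:
    """Toy model: metadata records are stored under a source name used as key."""

    def __init__(self):
        self._by_source = {}

    def add(self, source, records, force=False):
        # An existing source is only replaced when force is requested.
        if source in self._by_source and not force:
            raise ValueError(f"source {source!r} already exists; use force to overwrite")
        self._by_source[source] = list(records)

    def get(self, source):
        return self._by_source[source]

    def sources(self):
        return sorted(self._by_source)


store = MetadataStore()
store.add("source_A", [{"families": "GF_1", "Function": "Fn_1"}])
store.add("source_B", [{"families": "GF_1", "Function": "Fn_other"}])
store.add("source_A", [{"families": "GF_1", "Function": "Fn_updated"}], force=True)
print(store.sources())
```

Several sources coexist side by side, and re-adding `source_A` without `force=True` would raise an error instead of silently replacing the earlier records.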
|
||
### `--assign` | ||
PPanGGOLiN can add metadata to all pangenome elements: families, genomes, genes, RGPs, spots and modules. | ||
However, each run accepts a single metadata file, because only one source, and therefore only one type of pangenome element, can be given at a time. | ||
|
||
### `--omit` | ||
Use this option to skip the error raised when an ID from the metadata file is not found in the pangenome. | ||
This can be useful if you are using a general TSV that contains elements absent from the pangenome, but it must be used carefully. | ||
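Before relying on `--omit`, it can help to see which IDs would be skipped. The sketch below is an assumption-laden pre-check, not PPanGGOLiN code: `split_known_rows` is a hypothetical helper, and `known_ids` stands for a set of element IDs that you have obtained from your pangenome by other means.

```python
import csv
import io


def split_known_rows(handle, assign, known_ids, omit=False):
    """Separate metadata rows whose ID is known from those whose ID is not.

    With omit=False an unknown ID raises an error; with omit=True the
    unknown rows are returned separately so they can be reviewed.
    """
    kept, missing = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        (kept if row[assign] in known_ids else missing).append(row)
    if missing and not omit:
        ids = sorted({row[assign] for row in missing})
        raise KeyError(f"IDs not found in the pangenome: {ids}")
    return kept, missing


# Hypothetical inputs: GF_9 is not part of the known IDs.
tsv_text = "families\tFunction\nGF_1\tFn_1\nGF_9\tFn_9\n"
known_ids = {"GF_1", "GF_2"}

kept, missing = split_known_rows(io.StringIO(tsv_text), "families", known_ids, omit=True)
print(len(kept), "rows kept;", [row["families"] for row in missing], "skipped")
```

Reviewing the skipped IDs guards against the main risk of omitting errors: a typo in an ID would otherwise be dropped without notice.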
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# Projection command | ||
The `ppanggolin projection` command allows you to annotate external genomes using an existing pangenome. This process eliminates the need to recompute all components, streamlining the annotation process. Input genomes are expected to belong to the same species. | ||
|
||
Genes within the input genome are aligned with genes in the pangenome to determine their gene families and partitions. Genes that do not align with any existing gene in the pangenome are considered specific to the input genome and are assigned to the "Cloud" partition. Based on the alignment and partition assignment, Regions of Genomic Plasticity (RGPs) within the input genome are predicted. Each RGP that is not located on a contig border is assigned to a spot of insertion. Finally, conserved modules of the pangenome found in the input genome are reported in the output files. | ||
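The partition rule in the paragraph above can be sketched in a few lines. This is a simplified illustration, not the actual implementation: the alignment step is not shown, `assign_partitions` is a hypothetical helper, and the input dictionaries are made-up examples in which `None` marks a gene that aligned to nothing.

```python
CLOUD = "Cloud"


def assign_partitions(gene_to_family, family_to_partition):
    """Give each input gene the partition of its family, or Cloud if unaligned."""
    partitions = {}
    specific_genes = []
    for gene, family in gene_to_family.items():
        if family is None:
            # No alignment: the gene is specific to the input genome.
            partitions[gene] = CLOUD
            specific_genes.append(gene)
        else:
            partitions[gene] = family_to_partition[family]
    return partitions, specific_genes


# Made-up alignment result and pangenome partitions.
gene_to_family = {"gene_1": "GF_1", "gene_2": "GF_2", "gene_3": None}
family_to_partition = {"GF_1": "Persistent", "GF_2": "Shell"}

partitions, specific_genes = assign_partitions(gene_to_family, family_to_partition)
print(partitions, specific_genes)
```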
|
||
## Input files: | ||
|
||
This command supports two input modes depending on whether you want to project a single genome or multiple genomes at once: | ||
|
||
Multiple Files in One TSV: | ||
- **Options**: `--fasta` or `--anno` | ||
- **Description**: You can provide a tab-separated file listing organism names alongside their respective FASTA genomic sequences or annotation filepaths, with one line per organism. This mode is suitable when you want to annotate multiple genomes in a single operation. The format of this file is identical to the format used in the annotate and workflow commands; for more details, refer here. | ||
|
||
Single File: | ||
- **Options**: `--organism_name` with `--fasta` or `--anno` and `--circular_contigs` (optional) | ||
- **Description**: When annotating a single genome, you can directly provide a single FASTA genomic sequence file or an annotation file in GFF/GBFF format. Additionally, specify the name of the organism using the `--organism_name` option. You can also indicate circular contigs using the `--circular_contigs` option when necessary. | ||
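For the multi-genome mode, the list file can be generated with a short script. This is a minimal sketch that writes only the two columns described above (organism name and file path, one line per organism); the helper `write_genome_list`, the organism names and the paths are hypothetical, and any further columns supported by the annotate and workflow commands are not covered here.

```python
import csv
import io


def write_genome_list(genomes, handle):
    """Write one tab-separated line per organism: name, then file path."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for name, path in genomes.items():
        writer.writerow([name, path])


# Hypothetical organism names and FASTA paths.
genomes = {
    "genome_A": "genomes/genome_A.fasta",
    "genome_B": "genomes/genome_B.fasta",
}

buffer = io.StringIO()
write_genome_list(genomes, buffer)
print(buffer.getvalue(), end="")
```

In practice, replace the `io.StringIO` buffer with a file opened for writing and pass that file to `--fasta` or `--anno`.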
|
||
|
||
## Output files: | ||
|
||
The output directory contains `summary_projection.tsv`, which gives an overview of the projection with one line per organism. | ||
|
||
|
||
| Column | Description| | ||
|--------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| Organism name | The name or identifier of the organism being analyzed.| | ||
| Pangenome file | The path to the pangenome file (pangenome.h5) used for the analysis.| | ||
| Contigs | The number of contigs in the projected genome.| | ||
| Genes | The total number of genes identified in the input genome.| | ||
| Families | The total number of gene families to which genes in the genome of the input organism are assigned.| | ||
| Persistent genes | The number of genes in the "Persistent" partition.| | ||
| Persistent families | The number of gene families in the "Persistent" partition.| | ||
| Shell genes | The number of genes in the "Shell" partition.| | ||
| Shell families | The number of gene families in the "Shell" partition.| | ||
| Cloud genes | The number of genes in the "Cloud" partition.| | ||
| Cloud families | The number of gene families in the "Cloud" partition.| | ||
| Cloud specific families | The number of gene families that are specific to the input organism. These families have no homologs in any other genome of the pangenome and are assigned to the "Cloud" partition.| | ||
| RGPs (Regions of Genomic Plasticity) | The number of Regions of Genomic Plasticity (RGPs) predicted within the input genome.| | ||
| Spots | The total number of spots of insertion associated with RGPs in the input genome.| | ||
| New spots | The number of new insertion spots that have been identified in the input genome. These spots represent novel genomic regions compared to other genomes in the pangenome.| | ||
| Modules | The number of modules that have been projected onto the input genome.| | ||
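The summary table lends itself to quick filtering with the standard `csv` module. The sketch below rests on an assumption that should be checked against a real file: that the header names in `summary_projection.tsv` match the column labels in the table above exactly. The sample content and the helper `organisms_with_new_spots` are made up for illustration.

```python
import csv
import io

# Made-up excerpt; the header names are ASSUMED to match the table labels.
SAMPLE = (
    "Organism name\tGenes\tCloud specific families\tNew spots\n"
    "genome_A\t4200\t12\t0\n"
    "genome_B\t4350\t57\t3\n"
)


def organisms_with_new_spots(handle):
    """Return the organisms for which at least one new spot is reported."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Organism name"] for row in reader if int(row["New spots"]) > 0]


flagged = organisms_with_new_spots(io.StringIO(SAMPLE))
print(flagged)
```

To use it on real output, open `summary_projection.tsv` from the output directory and pass the file handle in place of the in-memory sample.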
|
||
|
||
Additionally, within the Output directory, there is a subdirectory for each input genome, named after the input genome itself. Each of these subdirectories contains several files: | ||
|
||
For Gene Family and Partition of Input Genes: | ||
|
||
- `cds_sequences.fasta`: This file contains the sequences of coding regions (CDS) from the input genome. | ||
- `gene_to_gene_family.tsv`: It provides the mapping of genes to gene families of the pangenome. Its format follows [this output](Outputs.md#gene-families-and-genes). | ||
- `sequences_partition_projection.tsv`: This file maps each input gene to its partition (Persistent, Shell or Cloud). | ||
- `specific_genes.tsv`: This file lists the genes of the input genome that do not align to any gene of the pangenome. These genes are assigned to the Cloud partition. | ||
|
||
For RGPs and Spots: | ||
|
||
- `plastic_regions.tsv`: This file contains information about Regions of Genomic Plasticity (RGPs) within the input genome. Its format follows [this output](Outputs.md#plastic-regions). | ||
- `input_organism_rgp_to_spot.tsv`: It provides information about the association between RGPs and insertion spots in the input genome. Its format follows [this output](Outputs.md#spots). | ||
|
||
Optionally, you can produce a graph of the RGPs using the `--spot_graph` option. This graph is similar to the one produced by the `ppanggolin spot` command. | ||
|
||
For Modules: | ||
|
||
- `modules_in_input_organism.tsv`: This file lists the modules that have been found in the input genome. Its format follows [this output](Outputs.md#modules-in-organisms). | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,3 @@ | ||
from .genomicIsland import subparser, launch | ||
from .spot import * | ||
from . import rgp_cluster |