add specific doc for gff, tables and proksee output
JeanMainguy committed Nov 9, 2023
1 parent d365c2a commit de63e72
Showing 3 changed files with 100 additions and 0 deletions.
48 changes: 48 additions & 0 deletions docs/user/Flat/gff.md
@@ -0,0 +1,48 @@

The `--gff` argument generates one GFF file per genome, each containing the pangenome annotations of that genome. GFF is a widely used standard format in bioinformatics and is supported by many downstream analysis tools.

To generate GFF files from a pangenome HDF5 file, you can use the following command:

```bash
ppanggolin write_genomes -p pangenome.h5 --gff -o output
```

This command will create a `gff` directory within the output directory, with one GFF file per genome.

Pangenome annotations are recorded in the attribute column (the ninth column) of the GFF file.

CDS features have the following attributes:

- **family:** ID of the gene family to which the gene belongs.
- **partition:** The partition of the gene family, categorized as persistent, shell, or cloud.
- **module:** If the gene family belongs to a module, the ID of that module.
- **rgp:** If the gene is part of a Region of Genomic Plasticity (RGP), the name of that RGP.

RGPs are recorded as features of type `region`.

RGP features have the following attributes:

- **spot:** ID of the spot where the RGP is inserted. When the RGP has no spot, the value `No_spot` is used.
- **Note:** Specifies that the feature is an RGP.


Here is an example showing the first lines of the GFF file for the Acinetobacter baumannii AYE genome:

```gff
##gff-version 3
##sequence-region NC_010401.1 1 5644
##sequence-region NC_010402.1 1 9661
##sequence-region NC_010403.1 1 2726
##sequence-region NC_010404.1 1 94413
##sequence-region NC_010410.1 1 3936291
NC_010401.1 . region 1 5644 . + . ID=NC_010401.1;Is_circular=true
NC_010401.1 ppanggolin region 629 5591 . . . Name=NC_010401.1_RGP_0;spot=No_spot;Note=Region of Genomic Plasticity (RGP)
NC_010401.1 external gene 629 1579 . + . ID=gene-ABAYE_RS00005
NC_010401.1 external CDS 629 1579 . + 0 ID=ABAYE_RS00005;Parent=gene-ABAYE_RS00005;product=replication initiation protein;family=ABAYE_RS00005;partition=cloud;rgp=NC_010401.1_RGP_0
NC_010401.1 external gene 1576 1863 . + . ID=gene-ABAYE_RS00010
NC_010401.1 external CDS 1576 1863 . + 0 ID=ABAYE_RS00010;Parent=gene-ABAYE_RS00010;product=hypothetical protein;family=ABAYE_RS00010;partition=cloud;rgp=NC_010401.1_RGP_0
NC_010401.1 external gene 2054 2572 . - . ID=gene-ABAYE_RS00015
NC_010401.1 external CDS 2054 2572 . - 0 ID=ABAYE_RS00015;Parent=gene-ABAYE_RS00015;product=tetratricopeptide repeat protein;family=HTZ92_RS18670;partition=shell;rgp=NC_010401.1_RGP_0
```
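
The attributes described above can be read back with a few lines of Python. The following is a minimal sketch, not part of PPanGGOLiN: the function names `parse_attributes`, `iter_cds` and `count_partitions` are hypothetical, and the code assumes that the GFF columns are tab-separated, as in the GFF3 standard. It counts the CDS features of each partition, using lines taken from the example above.

```python
from collections import Counter


def parse_attributes(column):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attributes = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def iter_cds(gff_lines):
    """Yield (seqid, start, end, attributes) for every CDS line."""
    for line in gff_lines:
        # Skip empty lines, directives and comments.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        # Assumption: feature lines have exactly nine tab-separated columns.
        if len(columns) != 9 or columns[2] != "CDS":
            continue
        yield columns[0], int(columns[3]), int(columns[4]), parse_attributes(columns[8])


def count_partitions(gff_lines):
    """Count CDS features per pangenome partition."""
    return Counter(
        attributes.get("partition", "unknown")
        for _, _, _, attributes in iter_cds(gff_lines)
    )


# Lines taken from the example above, rebuilt with explicit tabs.
example = [
    "##gff-version 3",
    "\t".join(["NC_010401.1", "external", "gene", "629", "1579", ".", "+", ".",
               "ID=gene-ABAYE_RS00005"]),
    "\t".join(["NC_010401.1", "external", "CDS", "629", "1579", ".", "+", "0",
               "ID=ABAYE_RS00005;Parent=gene-ABAYE_RS00005;"
               "product=replication initiation protein;family=ABAYE_RS00005;"
               "partition=cloud;rgp=NC_010401.1_RGP_0"]),
    "\t".join(["NC_010401.1", "external", "CDS", "2054", "2572", ".", "-", "0",
               "ID=ABAYE_RS00015;Parent=gene-ABAYE_RS00015;"
               "product=tetratricopeptide repeat protein;family=HTZ92_RS18670;"
               "partition=shell;rgp=NC_010401.1_RGP_0"]),
]

print(count_partitions(example))
```

To process a real file, pass an open file handle instead of the `example` list, for instance `count_partitions(open(path))`, where `path` points to one of the files in the `gff` directory.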
31 changes: 31 additions & 0 deletions docs/user/Flat/proksee.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
The `--proksee` argument generates JSON map files containing pangenome annotations, which can be visualized using Proksee at [https://proksee.ca/](https://proksee.ca/).

To generate JSON map files, you can use the following command:

```bash
ppanggolin write_genomes -p pangenome.h5 --proksee -o output
```

This command will create a proksee directory within the output directory, with one JSON file per genome.
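
Before uploading, it can be useful to check that every generated file is well-formed JSON. The sketch below is only an illustration: the function name `summarize_json_maps` is hypothetical, and it assumes that the files in the `proksee` directory carry a `.json` extension. It makes no assumption about the internal structure of the maps and only reports the top-level keys of each file.

```python
import json
from pathlib import Path


def summarize_json_maps(directory):
    """Return {file name: sorted top-level keys} for each JSON file found."""
    summary = {}
    # Assumption: the map files are named '*.json'.
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            # json.load raises json.JSONDecodeError if the file is malformed.
            content = json.load(handle)
        summary[path.name] = sorted(content) if isinstance(content, dict) else []
    return summary
```

With the command above, the function would be called as `summarize_json_maps("output/proksee")`.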


To load a JSON map file on Proksee, follow these steps:
1. Navigate to the "Map JSON" tab.
2. Upload your file using the browse button.
3. Click the "Create Map" button to generate the visualization.

A genome visualized by Proksee with PPanGGOLiN annotation appears as depicted below:


```{image} ../_static/proksee_exemple_A_baumannii_AYE.png
:align: center
```

*Image: Genome visualized by Proksee with PPanGGOLiN annotation.*


The visualization consists of three tracks:
- **Genes:** Color-coded by their gene family partition.
- **RGP (Region of Genomic Plasticity):** The spot associated with each RGP is specified in the annotation of the object.
- **Module:** Displaying modules within the genome. The completion of the module is specified in the annotation of the object.

21 changes: 21 additions & 0 deletions docs/user/Flat/tables.md
@@ -0,0 +1,21 @@
The `--tables` argument writes one TSV file per genome of the pangenome into a `tables` directory.
The columns of these files are described in the following table:

| Column               | Description                                                                                                       |
|----------------------|-------------------------------------------------------------------------------------------------------------------|
| gene                 | Unique identifier of the gene                                                                                     |
| contig               | Contig on which the gene is located                                                                               |
| start                | Start position of the gene                                                                                        |
| stop                 | Stop position of the gene                                                                                         |
| strand               | Strand on which the gene is located                                                                               |
| ori                  | T if the gene name is dnaA                                                                                        |
| family               | Identifier of the gene family to which the gene belongs                                                           |
| nb_copy_in_org       | Number of copies of the family in the organism (if 1, the gene has no closely related paralog in that organism)  |
| partition            | Partition of the gene family to which the gene belongs                                                            |
| persistent_neighbors | Number of neighbors classified as 'persistent' in the pangenome graph                                             |
| shell_neighbors      | Number of neighbors classified as 'shell' in the pangenome graph                                                  |
| cloud_neighbors      | Number of neighbors classified as 'cloud' in the pangenome graph                                                  |

These files can be generated with the following command:

`ppanggolin write_genomes -p pangenome.h5 --tables`
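
Because the tables are plain TSV, they can be read with the Python standard library. The sketch below is illustrative only: the function names `partition_counts` and `multicopy_genes` are hypothetical, the code assumes that each file starts with a header line using the column names listed above, and the sample rows are made up for the demonstration rather than taken from a real output.

```python
import csv
import io
from collections import Counter


def partition_counts(tsv_handle):
    """Count the genes of one genome table per partition."""
    # Assumption: the first line is a header with the documented column names.
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return Counter(row["partition"] for row in reader)


def multicopy_genes(tsv_handle):
    """List the genes whose family has more than one copy in the genome."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row["gene"] for row in reader if int(row["nb_copy_in_org"]) > 1]


# Hypothetical rows, restricted to the columns used here.
sample = (
    "gene\tfamily\tnb_copy_in_org\tpartition\n"
    "gene_1\tfam_A\t1\tpersistent\n"
    "gene_2\tfam_B\t2\tshell\n"
    "gene_3\tfam_B\t2\tshell\n"
)

print(partition_counts(io.StringIO(sample)))
print(multicopy_genes(io.StringIO(sample)))
```

For a real table, replace `io.StringIO(sample)` with an open file handle, for example `open(path, newline="")`, where `path` points to one of the files in the `tables` directory.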
