add fasta and metadata doc for write_genomes
JeanMainguy committed Nov 10, 2023
1 parent de63e72 commit 48078d5
Showing 5 changed files with 65 additions and 1 deletion.
Binary file added docs/_static/proksee_metadata_example.png
11 changes: 11 additions & 0 deletions docs/user/Flat/genomes_fasta.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
<!-- ### Adding Fasta Sequences into GFF and proksee JSON map Files -->

PPanGGOLiN can embed fasta sequences in GFF files and Proksee JSON map files. With the sequences included, Proksee can use tools that rely on DNA sequences, such as building GC% and GC skew profiles or running BLAST searches.


Since PPanGGOLiN does not retain genomic sequences, it is necessary to provide the original genomic files used to construct the pangenome through either the `--anno` or `--fasta` argument. These arguments mirror those used in workflow commands (`workflow`, `all`, `panrgp`, `panmodule`) and the `annotate` command.

- `--anno`: This option requires a tab-separated file containing organism names and the corresponding GFF/GBFF filepaths of their annotations. If `--anno` is utilized, GFF files should include fasta sequences.

- `--fasta`: Use this option with a tab-separated file that lists organism names alongside the filepaths of their genomic sequences in fasta format.
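As an illustration of the tab-separated list expected by `--fasta`, the following is a minimal sketch that builds such a file from a directory of fasta files. The directory layout, the genome names (`genomeA`, `genomeB`) and the list file name are assumptions made for the example; the names in the first column are expected to match the organism names used when the pangenome was built.

```bash
# Hypothetical layout: one fasta file per genome, named <genome>.fasta.
# A temporary directory with two dummy genomes keeps the sketch self-contained.
workdir=$(mktemp -d)
mkdir -p "$workdir/genomes"
printf '>contig1\nACGT\n' > "$workdir/genomes/genomeA.fasta"
printf '>contig1\nTTGA\n' > "$workdir/genomes/genomeB.fasta"

# Write one line per genome: <organism name><TAB><path to fasta file>.
list="$workdir/genomes.fasta.list"
: > "$list"
for fasta in "$workdir"/genomes/*.fasta; do
    name=$(basename "$fasta" .fasta)
    printf '%s\t%s\n' "$name" "$fasta" >> "$list"
done

cat "$list"
```

The resulting file could then be given to `write_genomes`, for example with `ppanggolin write_genomes -p pangenome.h5 --gff --fasta genomes.fasta.list -o output_dir`, where `output_dir` is a placeholder.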

46 changes: 46 additions & 0 deletions docs/user/Flat/genomes_metadata.md
@@ -0,0 +1,46 @@
<!-- ### Incorporating Metadata into Tables, GFF, and Proksee Files -->

Metadata previously added with the `metadata` command can be injected into genome outputs using the `--add_metadata` parameter. Each set of metadata is registered under a source name when it is added, and the `--metadata_sources` parameter restricts the output to selected sources. By default, all sources are included when the `--add_metadata` flag is specified.

#### Metadata in GFF Files

Metadata is integrated into the attributes column of the GFF file. The patterns for adding metadata are as follows:

- In CDS lines, metadata associated with genes follows this pattern: `gene_<source>_<column>=<value>`. Gene family metadata follows a similar pattern: `family_<source>_<column>=<value>`.
- In the contig lines of type `region` describing the contig, genome metadata is added with the pattern: `genome_<source>_<column>=<value>`, and contig metadata is added with: `contig_<source>_<column>=<value>`.
- In RGP lines, metadata is added using the pattern: `rgp_<source>_<column>=<value>`.

For example, suppose the following metadata is associated with the gene family DYB08_RS16060 under the source `pfam`:

```tsv
families accession type description
DYB08_RS16060 PF18894 domain This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.
```
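The columns of this file are tab-separated, which is easy to lose when copying the example. As a minimal sketch, the file can be written with explicit tabs and checked for a consistent column count. The use of a temporary directory is an assumption made for the example; the file name is the one used in the `metadata` command of this section.

```bash
# Write the example metadata with explicit tab separators.
tsv="$(mktemp -d)/family_pfam_annotation.tsv"
printf 'families\taccession\ttype\tdescription\n' > "$tsv"
printf '%s\t%s\t%s\t%s\n' \
    'DYB08_RS16060' \
    'PF18894' \
    'domain' \
    'This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.' >> "$tsv"

# Sanity check: every line should contain exactly 4 tab-separated fields.
awk -F'\t' 'NF != 4 { bad = 1 } END { exit bad + 0 }' "$tsv" \
    && echo "OK: 4 columns on every line"
```

The first column header, `families`, corresponds to the `--assign families` value passed to the `metadata` command.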

This metadata file can be added to the pangenome with the metadata command:

```bash
ppanggolin metadata -p pangenome.h5 --source pfam --metadata family_pfam_annotation.tsv --assign families
```

When writing GFF output with the `--add_metadata` flag:

```bash
ppanggolin write_genomes -p pangenome.h5 --proksee -o proksee_out --gff --add_metadata
```

A gene belonging to this family would have the following attribute in its GFF line: `family_pfam_accession=PF18894;family_pfam_description=This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.;family_pfam_type=domain`.

```gff
NC_010404.1 external CDS 77317 77958 . - 0 ID=ABAYE_RS00475;Parent=gene-ABAYE_RS00475;product=putative metallopeptidase;family=DYB08_RS16060;partition=persistent;rgp=NC_010404.1_RGP_0;family_pfam_accession=PF18894;family_pfam_description=This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.;family_pfam_type=domain
```
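To read these attributes back from a GFF line, the attributes column (column 9) can be split on `;` and filtered by prefix. The following is a minimal sketch using standard shell tools on the example line above; it assumes that no attribute value contains a literal `;`, which would need more careful parsing.

```bash
# The example GFF line, with tab-separated columns.
gff_line=$(printf 'NC_010404.1\texternal\tCDS\t77317\t77958\t.\t-\t0\tID=ABAYE_RS00475;Parent=gene-ABAYE_RS00475;product=putative metallopeptidase;family=DYB08_RS16060;partition=persistent;rgp=NC_010404.1_RGP_0;family_pfam_accession=PF18894;family_pfam_description=This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.;family_pfam_type=domain')

# Column 9 holds the attributes: split them on ';' and keep only the
# family metadata coming from the 'pfam' source.
metadata=$(printf '%s\n' "$gff_line" | cut -f9 | tr ';' '\n' | grep '^family_pfam_')
printf '%s\n' "$metadata"
```

Changing the `grep` prefix to `gene_`, `genome_` or `contig_` would select the other metadata patterns listed earlier in the same way.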

#### Metadata in Proksee Visualization

Metadata can also be written into Proksee JSON map files. In the Proksee visualization, the metadata of a feature is displayed when hovering the mouse over it.

For instance, with the metadata previously added to the DYB08_RS16060 gene family, the Proksee visualization would resemble the example below:

```{image} ../_static/proksee_metadata_example.png
:align: center
```
2 changes: 1 addition & 1 deletion docs/user/Flat/gff.md
@@ -28,7 +28,7 @@ RGPs have the following attributes:
- The 'Note' attribute specifies that this feature is an RGP.


- Here is an example showcasing the initial lines of the GFF file for the Acinetobacter baumannii AYE genomes:
+ Here is an example showcasing the initial lines of the GFF file for the Acinetobacter baumannii AYE genome:

```gff
##gff-version 3
7 changes: 7 additions & 0 deletions docs/user/Outputs.md
@@ -84,7 +84,14 @@ Writes 'flat' files that represent the genomes along with their associated pange
### proksee
```{include} Flat/proksee.md
```
### Adding Fasta Sequences into GFF and proksee JSON map Files

```{include} Flat/genomes_fasta.md
```

### Incorporating Metadata into Tables, GFF, and Proksee Files
```{include} Flat/genomes_metadata.md
```
## Fasta
```{include} sequence/fasta.md
```