add fasta and metadata doc for write_genomes
1 parent de63e72, commit 48078d5
Showing 5 changed files with 65 additions and 1 deletion.
@@ -0,0 +1,11 @@
<!-- ### Adding Fasta Sequences into GFF and proksee JSON map Files -->

PPanGGOLiN can embed fasta sequences in GFF files and Proksee JSON map files. With the sequences included, Proksee tools that rely on DNA sequences become available, for example GC% and GC skew profiles and BLAST searches.

Since PPanGGOLiN does not retain genomic sequences, you must provide the original genomic files used to construct the pangenome through either the `--anno` or `--fasta` argument. These arguments mirror those used in the workflow commands (`workflow`, `all`, `panrgp`, `panmodule`) and the `annotate` command.

- `--anno`: This option requires a tab-separated file containing organism names and the corresponding GFF/GBFF filepaths of their annotations. If `--anno` is used, the GFF files should include fasta sequences.

- `--fasta`: Use this option with a tab-separated file that lists organism names alongside the filepaths of their genomic sequences in fasta format.
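
The tab-separated list expected by `--fasta` can be generated with a short script. The sketch below is only an illustration: the helper name `write_genome_list`, the accepted file extensions, and the choice of using the file name (without extension) as the organism name are all assumptions, not part of PPanGGOLiN.

```python
from pathlib import Path


def write_genome_list(fasta_dir, out_tsv, extensions=(".fasta", ".fna", ".fa")):
    """Write one 'name<TAB>path' line per fasta file found in fasta_dir.

    Hypothetical helper: the organism name is derived from the file name
    without its extension, which is an assumption about the naming scheme.
    """
    rows = []
    for path in sorted(Path(fasta_dir).iterdir()):
        # Keep regular files whose extension looks like a fasta file.
        if path.is_file() and path.suffix in extensions:
            rows.append(f"{path.stem}\t{path.resolve()}")
    # One line per genome, each terminated by a newline.
    Path(out_tsv).write_text("".join(row + "\n" for row in rows))
    return rows
```

The resulting file can then be passed to the `--fasta` argument. Compressed files (for example `.fasta.gz`) are not matched by this sketch and would need an extra rule.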

@@ -0,0 +1,46 @@
<!-- ### Incorporating Metadata into Tables, GFF, and Proksee Files -->

You can inject metadata, previously added with the `metadata` command, into genome outputs using the `--add_metadata` parameter. Each set of metadata is registered under a source name when it is added, and the `--metadata_sources` parameter selects which sources to include. By default, all sources are added when the `--add_metadata` flag is specified.

#### Metadata in GFF Files

Metadata is integrated into the attributes column of the GFF file. The patterns for adding metadata are as follows:

- In CDS lines, metadata associated with genes follows the pattern `gene_<source>_<column>=<value>`. Gene family metadata follows a similar pattern: `family_<source>_<column>=<value>`.
- In the lines of type `region` describing a contig, genome metadata is added with the pattern `genome_<source>_<column>=<value>`, and contig metadata is added with `contig_<source>_<column>=<value>`.
- In RGP lines, metadata is added using the pattern `rgp_<source>_<column>=<value>`.
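
The naming patterns above can be illustrated with a few lines of Python. This is a minimal sketch, not PPanGGOLiN's implementation: `metadata_attributes` is a hypothetical helper, and sorting the columns alphabetically is an assumption made only to get a deterministic order.

```python
def metadata_attributes(element, source, columns):
    """Build attribute strings of the form <element>_<source>_<column>=<value>.

    Hypothetical helper for illustration; 'columns' maps a metadata
    column name to its value.
    """
    return [
        f"{element}_{source}_{column}={value}"
        for column, value in sorted(columns.items())
    ]


# Family metadata from a source named 'pfam' with two columns.
attrs = metadata_attributes(
    "family", "pfam", {"accession": "PF18894", "type": "domain"}
)
print(";".join(attrs))
# prints: family_pfam_accession=PF18894;family_pfam_type=domain
```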

For example, suppose the following metadata is associated with the gene family DYB08_RS16060 under the source `pfam`:

```tsv
families	accession	type	description
DYB08_RS16060	PF18894	domain	This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.
```

This metadata file can be added to the pangenome with the `metadata` command:

```bash
ppanggolin metadata -p pangenome.h5 --source pfam --metadata family_pfam_annotation.tsv --assign families
```

The GFF output can then be written with the `--add_metadata` flag:

```bash
ppanggolin write_genomes -p pangenome.h5 --proksee -o proksee_out --gff --add_metadata
```

A gene belonging to this family would then carry the following attributes in its GFF line: `family_pfam_accession=PF18894;family_pfam_description=This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.;family_pfam_type=domain`.

```gff
NC_010404.1	external	CDS	77317	77958	.	-	0	ID=ABAYE_RS00475;Parent=gene-ABAYE_RS00475;product=putative metallopeptidase;family=DYB08_RS16060;partition=persistent;rgp=NC_010404.1_RGP_0;family_pfam_accession=PF18894;family_pfam_description=This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.;family_pfam_type=domain
```
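
To read such metadata back from a GFF line, the attributes column can be split on `;` and filtered by prefix. The sketch below is one possible approach and not part of PPanGGOLiN: `extract_metadata` is a hypothetical helper, the example line is shortened, and the code assumes that attribute values contain neither `;` nor tab characters.

```python
def extract_metadata(gff_line, prefix="family_pfam_"):
    """Return {column: value} for attributes whose key starts with prefix.

    Hypothetical helper; assumes 9 tab-separated GFF columns and
    attribute values free of ';' characters.
    """
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GFF columns")
    metadata = {}
    for attribute in fields[8].split(";"):
        key, sep, value = attribute.partition("=")
        if sep and key.startswith(prefix):
            # Strip the '<element>_<source>_' prefix to recover the column name.
            metadata[key[len(prefix):]] = value
    return metadata


# Shortened version of a CDS line carrying family metadata from 'pfam'.
line = "\t".join([
    "NC_010404.1", "external", "CDS", "77317", "77958", ".", "-", "0",
    "ID=ABAYE_RS00475;family=DYB08_RS16060;partition=persistent;"
    "family_pfam_accession=PF18894;family_pfam_type=domain",
])
print(extract_metadata(line))
# prints: {'accession': 'PF18894', 'type': 'domain'}
```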

#### Metadata in Proksee Visualization

Metadata can also be incorporated into Proksee JSON map files. In Proksee, the metadata of a feature is displayed when hovering the mouse over it.

For instance, with the metadata previously added to the DYB08_RS16060 gene family, the Proksee visualization would resemble the example below:

```{image} ../_static/proksee_metadata_example.png
:align: center
```