From 5bffd1b42e32d3105ee2ffffb10f4a58cd83306f Mon Sep 17 00:00:00 2001 From: Adelme Bazin Date: Tue, 12 Mar 2024 15:39:36 +0100 Subject: [PATCH 1/8] Update pangenomeAnnotation.md to use genomes.* files instead of organisms.* --- docs/user/PangenomeAnalyses/pangenomeAnnotation.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/user/PangenomeAnalyses/pangenomeAnnotation.md b/docs/user/PangenomeAnalyses/pangenomeAnnotation.md index ad1d3fb7..16ad7990 100644 --- a/docs/user/PangenomeAnalyses/pangenomeAnnotation.md +++ b/docs/user/PangenomeAnalyses/pangenomeAnnotation.md @@ -8,7 +8,7 @@ If you do so, the provided genomes will be annotated using the following tools: - [ARAGORN](http://www.ansikte.se/ARAGORN/) to annotate tRNAs - [Infernal](http://eddylab.org/infernal/) coupled with HMM of the bacterial and archaeal rRNAs downloaded from [RFAM](https://rfam.xfam.org/) to annotate rRNAs. -To proceed with this stage of the pipeline, you need to create an **organisms.fasta.list** file. +To proceed with this stage of the pipeline, you need to create a **genomes.fasta.list** file. This file should be tab-separated with each line depicting an individual genome and its pertinent information with the following organization (only the first two columns are mandatory): @@ -17,12 +17,12 @@ its pertinent information with the following organization (only the first two co - The following columns contain Contig identifiers present in the associated FASTA file that should be analyzed as being circular. For the 'circular contig identifiers,' if you do not have access to this information, you can safely ignore this part as it does not have a big impact on the resulting pangenome. -You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.fasta.list). +You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.fasta.list). 
To run the annotation part, you can use this minimal command: ``` -ppanggolin annotate --fasta organisms.fasta.list +ppanggolin annotate --fasta genomes.fasta.list ``` #### Use a different genetic code in my annotation step @@ -48,7 +48,7 @@ to specify Infernal's RNA annotation model. ### Use annotation files for your pangenome -You can provide annotation files in either gff3 files or .gbk/.gbff files, or a mix of them. They should be provided through as a list in a tab-separated file that follows the same format as described for the fasta files. You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.gbff.list). +You can provide annotation files in either gff3 files or .gbk/.gbff files, or a mix of them. They should be provided through as a list in a tab-separated file that follows the same format as described for the fasta files. You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.gbff.list). ```{note} Use your own annotation for your genome is highly recommended, particularly if you already @@ -58,7 +58,7 @@ have functional annotations, as they can be added to the pangenome. 
You can provide them using the following command: ``` -ppanggolin annotate --anno organisms.gbff.list +ppanggolin annotate --anno genomes.gbff.list ``` #### How to deal with annotation files without sequences @@ -67,7 +67,7 @@ If your annotation files do not contain the genome sequence, you can use both options simultaneously to obtain the gene annotations and gene sequences, as follows: ``` -ppanggolin annotate --anno organisms.gbff.list --fasta organisms.fasta.list +ppanggolin annotate --anno genomes.gbff.list --fasta genomes.fasta.list ``` #### Take the pseudogenes into account for pangenome analyses From 005e213964bbe1e8f9f81f20f0f0031d755652b2 Mon Sep 17 00:00:00 2001 From: Adelme Bazin Date: Tue, 12 Mar 2024 15:43:16 +0100 Subject: [PATCH 2/8] Update practicalInformation.md to use genome instead of organism --- docs/user/practicalInformation.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/user/practicalInformation.md b/docs/user/practicalInformation.md index 15243320..84d2d8ab 100644 --- a/docs/user/practicalInformation.md +++ b/docs/user/practicalInformation.md @@ -85,12 +85,12 @@ ppanggolin utils --default_config panrgp ```yaml input_parameters: - # A tab-separated file listing the organism names, and the fasta filepath of its - # genomic sequence(s) (the fastas can be compressed with gzip). One line per organism. + # A tab-separated file listing the genome names, and the fasta filepath of its + # genomic sequence(s) (the fastas can be compressed with gzip). One line per genome. # fasta: - # A tab-separated file listing the organism names, and the gff/gbff filepath of + # A tab-separated file listing the genome names, and the gff/gbff filepath of # its annotations (the files can be compressed with gzip). One line - # per organism. If this is provided, those annotations will be used. + # per genome. If this is provided, those annotations will be used. 
# anno: general_parameters: From f6f28d607f3c545a481d4bc156186385a1f52451 Mon Sep 17 00:00:00 2001 From: Adelme Bazin Date: Tue, 12 Mar 2024 15:47:18 +0100 Subject: [PATCH 3/8] Update writeGenomes.md to use genome instead of organism --- docs/user/writeGenomes.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/user/writeGenomes.md b/docs/user/writeGenomes.md index eb6179c7..af301081 100644 --- a/docs/user/writeGenomes.md +++ b/docs/user/writeGenomes.md @@ -2,7 +2,7 @@ The `write_genomes` command creates 'flat' files representing genomes with their pangenome annotations. -To generate output for specific genomes, use the `--organisms` argument. This argument accepts a list of organism names, either directly entered in the command line (comma-separated) or referenced from a file where each line contains a single organism name. +To generate output for specific genomes, use the `--genomes` argument. This argument accepts a list of genome names, either directly entered in the command line (comma-separated) or referenced from a file where each line contains a single genome name. 
### Genes table with pangenome annotations @@ -20,7 +20,7 @@ The following table outlines the columns present in the generated files: | stop | Stop position of the gene | | strand | Gene location strand | | family | ID of the gene's associated family in the pangenome | -| nb_copy_in_org | Number of copies of a family present in the organism; 1 indicates no close paralogs | +| nb_copy_in_org | Number of copies of a family present in the genome; 1 indicates no close paralogs | | partition | Gene family partition in the pangenome | | persistent_neighbors | Number of neighbors classified as 'persistent' in the pangenome graph | | shell_neighbors | Number of neighbors classified as 'shell' in the pangenome graph | @@ -137,9 +137,9 @@ PPanGGOLiN allows the incorporation of fasta sequences into GFF files and prokse Since PPanGGOLiN does not retain genomic sequences, it is necessary to provide the original genomic files used to construct the pangenome through either the `--anno` or `--fasta` argument. These arguments mirror those used in workflow commands (`workflow`, `all`, `panrgp`, `panmodule`) and the `annotate` command. -- `--anno`: This option requires a tab-separated file containing organism names and the corresponding GFF/GBFF file paths of their annotations. If `--anno` is utilized, GFF files should include fasta sequences. +- `--anno`: This option requires a tab-separated file containing genome names and the corresponding GFF/GBFF file paths of their annotations. If `--anno` is utilized, GFF files should include fasta sequences. -- `--fasta`: Use this option with a tab-separated file that lists organism names alongside the filepaths of their genomic sequences in fasta format. +- `--fasta`: Use this option with a tab-separated file that lists genome names alongside the filepaths of their genomic sequences in fasta format. 
### Incorporating Metadata into Tables, GFF, and Proksee Files From ef6975104ad0c7c908e03ae2c21a226beea955ea Mon Sep 17 00:00:00 2001 From: Adelme Bazin Date: Tue, 12 Mar 2024 15:48:39 +0100 Subject: [PATCH 4/8] Update quickWorkflow.md to use genome instead of organism --- docs/user/QuickUsage/quickWorkflow.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/user/QuickUsage/quickWorkflow.md b/docs/user/QuickUsage/quickWorkflow.md index 26ce7111..5890751b 100644 --- a/docs/user/QuickUsage/quickWorkflow.md +++ b/docs/user/QuickUsage/quickWorkflow.md @@ -63,24 +63,24 @@ The minimal subcommand only need your own annotations files (using `.gff` or `.g as long as they include the genomic dna sequences, such as the ones provided by Prokka or Bakta. ```bash -ppanggolin all --anno organism.gbff.list +ppanggolin all --anno genome.gbff.list ``` It uses parameters that we found to be generally the best when working with species pangenomes. -The file **organism.gbff.list** is a tab-separated file with the following organisation : +The file **genome.gbff.list** is a tab-separated file with the following organisation : 1. The first column contains a unique genome name 2. The second column the path to the associated annotation file 3. Each line represents a genome -An example with 50 _Chlamydia trachomatis_ genomes can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.gbff.list) directory. +An example with 50 _Chlamydia trachomatis_ genomes can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.gbff.list) directory. 
[//]: # (### PPanGGOLiN: Pangenome analyses from list of fasta files) You can also give PPanGGOLiN `.fasta` files, such as: ``` -ppanggolin all --fasta organism.fasta.list +ppanggolin all --fasta genome.fasta.list ``` Again you must use a tab-separated file but this time with the following organisation: @@ -90,7 +90,7 @@ Again you must use a tab-separated file but this time with the following organis 3. Circular contig identifiers are indicated in the following columns 4. Each line represents a genome -Same, an example can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.fasta.list) directory. +Same, an example can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.fasta.list) directory. ```{tip} Downloading genomes from NCBI refseq or genbank for a species of interest can be easily accomplished using CLI tools like [ncbi-genome-download](https://github.com/kblin/ncbi-genome-download) or the [genome updater](https://github.com/pirovc/genome_updater) script. 
From f559a715ac8ad6137356d84ccc89bc7edb670d42 Mon Sep 17 00:00:00 2001 From: Adelme Bazin Date: Tue, 12 Mar 2024 15:50:22 +0100 Subject: [PATCH 5/8] Update pangenomeWorkflow.md to use genome instead of organism --- docs/user/PangenomeAnalyses/pangenomeWorkflow.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/user/PangenomeAnalyses/pangenomeWorkflow.md b/docs/user/PangenomeAnalyses/pangenomeWorkflow.md index 552f50d2..44865c45 100644 --- a/docs/user/PangenomeAnalyses/pangenomeWorkflow.md +++ b/docs/user/PangenomeAnalyses/pangenomeWorkflow.md @@ -45,12 +45,12 @@ To use this command, you need to provide a tab-separated list of either annotati You can use the workflow with annotation files as such: ``` -ppanggolin workflow --anno organism.gbff.list +ppanggolin workflow --anno genome.gbff.list ``` For fasta files, you have to change for: ``` -ppanggolin workflow --fasta organism.fasta.list +ppanggolin workflow --fasta genome.fasta.list ``` Moreover, as detailed [in the section about providing your gene families](./pangenomeAnalyses.md#read-clustering), @@ -58,7 +58,7 @@ if you wish to use different gene clustering methods than those provided by PPan it is also possible to provide your own clustering results with the workflow command as such: ``` -ppanggolin workflow --anno organism.gbff.list --clusters clusters.tsv +ppanggolin workflow --anno genome.gbff.list --clusters clusters.tsv ``` All the workflow parameters are obtained from the commands explained below, except for the `--no_flat_files` option, which solely pertains to it. This option prevents the automatic generation of the output files listed and described [in the pangenome output section](./pangenomeAnalyses.md#pangenome-outputs). 
From 906f3f7afd9f61d354109b2d67e675706e81945d Mon Sep 17 00:00:00 2001 From: Adelme Bazin Date: Tue, 12 Mar 2024 15:52:39 +0100 Subject: [PATCH 6/8] Update rgpPrediction.md to use genome instead of organism --- docs/user/RGP/rgpPrediction.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/user/RGP/rgpPrediction.md b/docs/user/RGP/rgpPrediction.md index 90c6623e..cd16dd7b 100644 --- a/docs/user/RGP/rgpPrediction.md +++ b/docs/user/RGP/rgpPrediction.md @@ -59,12 +59,12 @@ graph LR You can use the `panrgp` with annotation (gff3 or gbff) files with `--anno` option, as such: ```bash -ppanggolin panrgp --anno organism.gbff.list +ppanggolin panrgp --anno genome.gbff.list ``` For fasta files, you need to use the alternative `--fasta` option, as such: ```bash -ppanggolin panrgp --fasta organism.fasta.list +ppanggolin panrgp --fasta genome.fasta.list ``` Just like [workflow](../PangenomeAnalyses/pangenomeAnalyses.md#workflow), this command will deal with the [annotation](../PangenomeAnalyses/pangenomeAnalyses.md#annotation), [clustering](../PangenomeAnalyses/pangenomeAnalyses.md#compute-pangenome-gene-families), [graph](../PangenomeAnalyses/pangenomeAnalyses.md#graph) and [partition](../PangenomeAnalyses/pangenomeAnalyses.md#partition) commands by itself. 
From 9f4e1d2dd06f3003286ad5ddc1ad53122d7090c5 Mon Sep 17 00:00:00 2001 From: Adelme Bazin Date: Tue, 12 Mar 2024 15:54:18 +0100 Subject: [PATCH 7/8] Update pangenomeGraphOut.md to use genome instead of organism --- docs/user/PangenomeAnalyses/pangenomeGraphOut.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user/PangenomeAnalyses/pangenomeGraphOut.md b/docs/user/PangenomeAnalyses/pangenomeGraphOut.md index 3530afb6..0b792a3d 100644 --- a/docs/user/PangenomeAnalyses/pangenomeGraphOut.md +++ b/docs/user/PangenomeAnalyses/pangenomeGraphOut.md @@ -12,7 +12,7 @@ Using Gephi, the layout can be tuned as illustrated below: We advise the Gephi "Force Atlas 2" algorithm to compute the graph layout with "Stronger Gravity: on" and "scaling: 4000" but don't hesitate to tinker with the layout parameters. In the _light.gexf file : -The nodes will contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations in your GFF or GBFF input files), the partitions it belongs to, its average and median size in nucleotides, and the number of organisms that have this gene family. If spots or modules are computed, it also indicates if a node belongs to them. Finally, this file also outputs the imported metadata regarding each gene family. +The nodes will contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations in your GFF or GBFF input files), the partitions it belongs to, its average and median size in nucleotides, and the number of genomes that have this gene family. If spots or modules are computed, it also indicates if a node belongs to them. Finally, this file also outputs the imported metadata regarding each gene family. The edges contain the number of times they are present in the pangenome. 
From a68f0895644afee2d40a33b78a1c3ee3c31c8db9 Mon Sep 17 00:00:00 2001 From: JeanMainguy Date: Tue, 12 Mar 2024 17:34:14 +0100 Subject: [PATCH 8/8] homogenise genomes list file by adding a s to the file: genomes.fasta.list --- docs/user/PangenomeAnalyses/pangenomeWorkflow.md | 6 +++--- docs/user/QuickUsage/quickWorkflow.md | 6 +++--- docs/user/RGP/rgpPrediction.md | 4 ++-- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/user/PangenomeAnalyses/pangenomeWorkflow.md b/docs/user/PangenomeAnalyses/pangenomeWorkflow.md index 44865c45..5793229c 100644 --- a/docs/user/PangenomeAnalyses/pangenomeWorkflow.md +++ b/docs/user/PangenomeAnalyses/pangenomeWorkflow.md @@ -45,12 +45,12 @@ To use this command, you need to provide a tab-separated list of either annotati You can use the workflow with annotation files as such: ``` -ppanggolin workflow --anno genome.gbff.list +ppanggolin workflow --anno genomes.gbff.list ``` For fasta files, you have to change for: ``` -ppanggolin workflow --fasta genome.fasta.list +ppanggolin workflow --fasta genomes.fasta.list ``` Moreover, as detailed [in the section about providing your gene families](./pangenomeAnalyses.md#read-clustering), @@ -58,7 +58,7 @@ if you wish to use different gene clustering methods than those provided by PPan it is also possible to provide your own clustering results with the workflow command as such: ``` -ppanggolin workflow --anno genome.gbff.list --clusters clusters.tsv +ppanggolin workflow --anno genomes.gbff.list --clusters clusters.tsv ``` All the workflow parameters are obtained from the commands explained below, except for the `--no_flat_files` option, which solely pertains to it. This option prevents the automatic generation of the output files listed and described [in the pangenome output section](./pangenomeAnalyses.md#pangenome-outputs). 
diff --git a/docs/user/QuickUsage/quickWorkflow.md b/docs/user/QuickUsage/quickWorkflow.md index 5890751b..472593ae 100644 --- a/docs/user/QuickUsage/quickWorkflow.md +++ b/docs/user/QuickUsage/quickWorkflow.md @@ -63,12 +63,12 @@ The minimal subcommand only need your own annotations files (using `.gff` or `.g as long as they include the genomic dna sequences, such as the ones provided by Prokka or Bakta. ```bash -ppanggolin all --anno genome.gbff.list +ppanggolin all --anno genomes.gbff.list ``` It uses parameters that we found to be generally the best when working with species pangenomes. -The file **genome.gbff.list** is a tab-separated file with the following organisation : +The file **genomes.gbff.list** is a tab-separated file with the following organisation : 1. The first column contains a unique genome name 2. The second column the path to the associated annotation file @@ -80,7 +80,7 @@ An example with 50 _Chlamydia trachomatis_ genomes can be found in the [testingD You can also give PPanGGOLiN `.fasta` files, such as: ``` -ppanggolin all --fasta genome.fasta.list +ppanggolin all --fasta genomes.fasta.list ``` Again you must use a tab-separated file but this time with the following organisation: diff --git a/docs/user/RGP/rgpPrediction.md b/docs/user/RGP/rgpPrediction.md index cd16dd7b..08aa704b 100644 --- a/docs/user/RGP/rgpPrediction.md +++ b/docs/user/RGP/rgpPrediction.md @@ -59,12 +59,12 @@ graph LR You can use the `panrgp` with annotation (gff3 or gbff) files with `--anno` option, as such: ```bash -ppanggolin panrgp --anno genome.gbff.list +ppanggolin panrgp --anno genomes.gbff.list ``` For fasta files, you need to use the alternative `--fasta` option, as such: ```bash -ppanggolin panrgp --fasta genome.fasta.list +ppanggolin panrgp --fasta genomes.fasta.list ``` Just like [workflow](../PangenomeAnalyses/pangenomeAnalyses.md#workflow), this command will deal with the [annotation](../PangenomeAnalyses/pangenomeAnalyses.md#annotation), 
[clustering](../PangenomeAnalyses/pangenomeAnalyses.md#compute-pangenome-gene-families), [graph](../PangenomeAnalyses/pangenomeAnalyses.md#graph) and [partition](../PangenomeAnalyses/pangenomeAnalyses.md#partition) commands by itself.
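
The list files renamed throughout this series (`genomes.fasta.list`, `genomes.gbff.list`) share the layout described in the patched docs: tab-separated, one genome per line, a unique genome name in the first column, a file path in the second, and, for FASTA lists, optional circular contig identifiers in the following columns. A minimal sketch of a pre-flight check for such a file is given below. The function name `parse_genome_list` and the returned dictionary keys are hypothetical and are not part of PPanGGOLiN; the sketch only mirrors the layout stated in the docs and does not reproduce how `ppanggolin annotate` itself reads the file.

```python
import io


def parse_genome_list(handle):
    """Parse a tab-separated genome list (hypothetical helper, not PPanGGOLiN code).

    Expected columns, as described in the patched docs:
      1. unique genome name
      2. path to the FASTA or annotation file
      3+. optional circular contig identifiers
    """
    entries = []
    seen_names = set()
    for lineno, line in enumerate(handle, start=1):
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        # Skip blank lines rather than treating them as errors.
        if not any(fields):
            continue
        if len(fields) < 2 or not fields[0] or not fields[1]:
            raise ValueError(
                f"line {lineno}: expected at least a genome name and a file path"
            )
        name, path = fields[0], fields[1]
        if name in seen_names:
            raise ValueError(f"line {lineno}: duplicate genome name {name!r}")
        seen_names.add(name)
        entries.append(
            {
                "name": name,
                "path": path,
                # Empty trailing columns are ignored.
                "circular_contigs": [contig for contig in fields[2:] if contig],
            }
        )
    return entries


# Usage with an in-memory example; the genome names and paths are made up.
example = io.StringIO(
    "genomeA\tfasta/genomeA.fasta.gz\tcontig_1\n"
    "genomeB\tfasta/genomeB.fasta.gz\n"
)
genomes = parse_genome_list(example)
print([genome["name"] for genome in genomes])
```

Running a check of this kind before `ppanggolin annotate --fasta genomes.fasta.list` can catch a missing path column or a repeated genome name early; whether PPanGGOLiN reports the same conditions as errors is not stated in these patches.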