Basic usage and practical information
We tried to make PPanGGOLiN relatively easy to use by making this 'workflow' subcommand. It runs a pangenome analysis whose exact steps will depend on the input files you provide it with. In the end, you will end up with some files and figures that describe the pangenome of your taxonomic group of interest in different ways.
The minimal subcommand is as follows:
ppanggolin workflow --fasta ORGANISMS_FASTA_LIST
It uses parameters that we found to be generally the best when working with species pangenomes.
The file ORGANISMS_FASTA_LIST is a tab-separated (TSV) file organised as follows:
- The first column contains a unique organism name
- The second column contains the path to the associated FASTA file
- Circular contig identifiers are indicated in the following columns
- Each line represents an organism
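As a sketch, such a file could be built like this. All genome names and paths below are made-up placeholders (they are not files shipped with PPanGGOLiN); fields are separated by tabs, and the optional third and following columns list circular contigs:

```shell
# Build a hypothetical ORGANISMS_FASTA_LIST file; names and paths are placeholders.
printf 'genome_1\t/path/to/genome_1.fasta\n'                     >  organisms.fasta_list.tsv
printf 'genome_2\t/path/to/genome_2.fasta\tcontig_A\tcontig_B\n' >> organisms.fasta_list.tsv
printf 'genome_3\t/path/to/genome_3.fasta\n'                     >> organisms.fasta_list.tsv

# Sanity check: every line has at least a genome name and a FASTA path.
awk -F'\t' 'NF < 2 { exit 1 }' organisms.fasta_list.tsv && echo "list looks OK"
```

Here genome_2 has two circular contigs declared in columns 3 and 4; the other genomes have none.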
An example with 50 Chlamydia trachomatis genomes can be found in the testingDataset/ directory.
You can also give PPanGGOLiN your own annotations using .gff or .gbff/.gbk files instead of .fasta files, as long as they include the genomic DNA sequences (such as those provided by Prokka), using the following command:
ppanggolin workflow --anno ORGANISMS_ANNOTATION_LIST
Another example of such a file can be found in the testingDataset/ directory.
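The annotation list follows the same tab-separated layout, except that the second column points to an annotation file instead of a FASTA file. A minimal sketch (names and paths are hypothetical placeholders):

```shell
# Hypothetical ORGANISMS_ANNOTATION_LIST file; names and paths are placeholders.
printf 'genome_1\t/path/to/genome_1.gbff\n' >  organisms.annotation_list.tsv
printf 'genome_2\t/path/to/genome_2.gff\n'  >> organisms.annotation_list.tsv
```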
This command works exactly like 'workflow'; the difference is that it runs additional analyses related to Regions of Genome Plasticity (RGPs).
Most of PPanGGOLiN's commands should be run with as many CPUs as you can give them using the --cpu option, as PPanGGOLiN's speed scales well with the number of CPUs. While the 'smallest' pangenomes (up to a few hundred genomes) can easily be analyzed on a normal desktop computer, the biggest ones will require a good amount of RAM. For example, 40 strains of E. coli were analyzed in 3 minutes using 1.2 GB of RAM and 16 threads. 1000 strains were analyzed in 45 minutes with 14 GB of RAM using 16 threads, and as of writing these lines, 20,656 genomes is the biggest pangenome we have analyzed; it required about a day and 120 GB of RAM. The following graphic can give you an idea of the time a pangenome analysis takes given the number of input genomes:
As with most bioinformatics programs, you can always specify some utility options.
You can specify the number of CPUs to use (which is recommended! The default is to use just one) using the option --cpu.
You can specify the output directory (if not provided, one will be generated) using the option --output.
If you work in an environment that has little or no available disk space in '/tmp' (or your system's equivalent, as stored in TMPDIR), you can specify a new temporary directory using --tmp.
And if you want to redo an analysis from scratch and store it in a directory that already exists, you will have to use the --force option. Be wary, however, that any files in that directory named identically to PPanGGOLiN's output files will be overwritten.
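Putting the utility options together, a full invocation might look like the sketch below. The CPU count and directory paths are illustrative placeholders, not recommendations:

```shell
# Hypothetical run: 8 CPUs, results written to ./my_pangenome, temporary files
# in /scratch/tmp, and --force to overwrite identically named files if the
# output directory already exists. Paths and CPU count are illustrative.
ppanggolin workflow --fasta ORGANISMS_FASTA_LIST \
    --cpu 8 --output my_pangenome --tmp /scratch/tmp --force
```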