metashot/prok-quality is a comprehensive and easy-to-use pipeline for assessing the quality of prokaryotic genomes. metashot/prok-quality reports the quality measures recommended by the MIMAG standard (https://doi.org/10.1038/nbt.3893), including basic assembly statistics, completeness, contamination, and rRNA and tRNA genes. Moreover, it relies on GUNC (https://doi.org/10.1101/2020.12.16.422776) to detect chimerism (i.e. non-redundant contamination). Reproducibility is guaranteed by Nextflow and versioned Docker images.
Note: This workflow is not intended to classify "finished" SAGs or MAGs. The "finished" category is reserved for genomes that can be assembled with extensive manual review and editing.
Albanese D and Donati C. Large-scale quality assessment of prokaryotic genomes with metashot/prok-quality. F1000Research 2021, 10:822 (https://doi.org/10.12688/f1000research.54418.1)
- Input: genomes/metagenomic bins in FASTA format;
- Completeness, contamination and strain heterogeneity estimates using CheckM;
- Chimerism (non-redundant contamination) detection using GUNC;
- 5S, 23S and 16S rRNA gene prediction with Barrnap;
- Transfer RNA (tRNA) prediction with tRNAscan-SE;
- Filter genomes by their completeness, contamination and GUNC prediction;
- An extended summary of genome quality (including the rRNA and tRNA genes found) is reported;
- Dereplication (optional) using drep.
Software included:
Software | Version |
---|---|
CheckM | 1.1.2 |
GUNC | 1.0.5 |
Barrnap | 0.9 |
tRNAscan-SE | 2.0.6 |
drep | 2.6.2 |
- Install Docker (or Singularity) and Nextflow (see Dependencies);
- Start running the analysis:
nextflow run metashot/prok-quality \
--genomes '*.fa' \
--outdir results
The GUNC database will be downloaded automatically from the Internet. If you want to download the GUNC database before running the analysis, use the following lines:
GUNC_DB=/path/to/gunc_db
docker run --rm -v${GUNC_DB}:/guncdb -w /guncdb metashot/gunc:1.0.5-1 gunc download_db .
Later, run the workflow adding the parameter --gunc_db $GUNC_DB/gunc_db_name.dmnd
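The name of the downloaded database file depends on what `gunc download_db` writes, so it can be convenient to look it up rather than type it. The following is a minimal sketch, assuming the download left a single `.dmnd` file in `$GUNC_DB`; the helper name `find_gunc_db` is hypothetical and not part of the workflow.

```shell
# Print the path of the first .dmnd file found in a directory.
# Hypothetical helper: assumes the GUNC download placed one .dmnd file there.
find_gunc_db() {
  # $1 = directory passed to `gunc download_db`
  for f in "$1"/*.dmnd; do
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  echo "no .dmnd file found in $1" >&2
  return 1
}

# Usage (not executed here):
#   nextflow run metashot/prok-quality \
#     --genomes '*.fa' \
#     --outdir results \
#     --gunc_db "$(find_gunc_db "$GUNC_DB")"
```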
Options and default values are declared in nextflow.config.
- --genomes: input genomes/bins in FASTA format (default "data/*.fa")
- --ext: FASTA file extension; files with other extensions will be ignored (default "fa")
- --outdir: output directory (default "results")
- --gunc_db: GUNC database. If 'none', the database will be downloaded automatically and placed in the output folder (gunc_db directory) (default "none")
- --reduced_tree: reduce the memory requirements to approximately 14 GB; when enabled, also set --max_memory to 16.GB (default false)
- --checkm_batch_size: run CheckM on "checkm_batch_size" genomes at once; see Ecogenomics/CheckM#118 (default 1000)
- --gunc_batch_size: run GUNC on "gunc_batch_size" genomes at once (default 100)
- --min_completeness: discard genomes with less than min_completeness % completeness (default 50)
- --max_contamination: discard genomes with more than max_contamination % contamination (default 10)
- --gunc_filter: if true, discard genomes that do not pass the GUNC filter (default false)
- --skip_dereplication: skip the dereplication step (default false)
- --ani_thr: ANI threshold for dereplication (> 0.90) (default 0.95)
- --min_overlap: minimum required overlap in the alignment between genomes to compute ANI (default 0.30)
- --max_cpus: maximum number of CPUs for each process (default 8)
- --max_memory: maximum memory for each process (default 70.GB)
- --max_time: maximum time for each process (default 96.h)
See also System requirements.
The files and directories listed below will be created in the results
directory after the pipeline has finished.
- genome_info.tsv: summary of genome quality (including completeness, contamination, GUNC filter, N50, rRNA genes found, number of tRNA and tRNA types). This file contains:
  - Genome: the genome filename
  - Completeness, Contamination, Strain heterogeneity: CheckM estimates
  - GUNC pass: if a genome doesn't pass GUNC analysis it means it is likely to be chimeric
  - Genome size (bp), ... # predicted genes: basic genome statistics (see https://github.com/Ecogenomics/CheckM/wiki/Genome-Quality-Commands#qa)
  - 5S rRNA, 23S rRNA, 16S rRNA: Yes if the rRNA gene was found
  - # tRNA, # tRNA types: the number of tRNA and tRNA types found respectively
- filtered: this folder contains the genomes filtered according to the --min_completeness, --max_contamination and --gunc_filter options
- genome_info_filtered.tsv: same as genome_info.tsv, but only for the filtered genomes
- derep_info.tsv: dereplication summary (if --skip_dereplication=false). This file contains:
  - Genome: genome filename
  - Cluster: the cluster ID (from 0 to N-1)
  - Representative: is this genome the cluster representative?
- filtered_repr: this folder contains the representative genomes (if --skip_dereplication=false)
- checkm: contains the original CheckM qc file
- gunc: contains the original GUNC output file
- barrnap: GFF and FASTA files containing the predicted rRNA sequences for bacteria (.bac) and archaea (.arc) models
- trnascan_se: TSV and FASTA files containing the predicted tRNA sequences for bacteria (.bac) and archaea (.arc) models
- drep: original data tables, figures and log of drep.
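As an illustration of how the tab-separated outputs can be inspected, the sketch below counts the genomes assigned to each cluster in derep_info.tsv. It assumes the header line uses the column name `Cluster` exactly as listed above; the function name `cluster_sizes` is hypothetical.

```shell
# Count genomes per dereplication cluster.
# Assumption: derep_info.tsv is tab-separated and has a "Cluster" header column.
cluster_sizes() {
  # $1 = path to derep_info.tsv
  awk -F'\t' '
    # Locate the Cluster column from the header line, then skip the header.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Cluster") c = i; next }
    # Tally one genome for its cluster ID (only if the column was found).
    c { n[$c]++ }
    END { for (k in n) print k "\t" n[k] }
  ' "$1" | sort -n
}

# Usage (not executed here):
#   cluster_sizes results/derep_info.tsv
```

Each output line holds a cluster ID and the number of genomes in that cluster, sorted by cluster ID.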
Following MIMAG/MISAG standards, you can classify a prokaryotic genome as high-quality draft when:
- its completeness is >90% and the contamination is <5%;
- 23S, 16S, and 5S rRNA genes can be predicted;
- at least 18 tRNA types can be predicted.
A genome can be classified as medium-quality draft when its completeness is >=50% and the contamination is <10%.
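The criteria above can be applied to genome_info.tsv with a short awk script. This is a sketch under stated assumptions, not part of the workflow: it assumes the header uses the column names `Genome`, `Completeness`, `Contamination`, `5S rRNA`, `23S rRNA`, `16S rRNA` and `# tRNA types` as described in the output section, and that a found rRNA gene is reported as `Yes`. The function name `classify_mimag` and the fallback label `other` are hypothetical.

```shell
# Label each genome using the high-quality / medium-quality draft criteria.
# Assumption: tab-separated input whose header matches the names used below.
classify_mimag() {
  # $1 = path to genome_info.tsv
  awk -F'\t' '
    NR == 1 {
      # Map each header name to its column index.
      for (i = 1; i <= NF; i++) col[$i] = i
      print "Genome\tMIMAG class"
      next
    }
    {
      comp = $(col["Completeness"]) + 0
      cont = $(col["Contamination"]) + 0
      trna = $(col["# tRNA types"]) + 0
      rrna = ($(col["5S rRNA"]) == "Yes" && $(col["16S rRNA"]) == "Yes" && $(col["23S rRNA"]) == "Yes")
      if (comp > 90 && cont < 5 && rrna && trna >= 18) q = "high-quality draft"
      else if (comp >= 50 && cont < 10) q = "medium-quality draft"
      else q = "other"   # label chosen for this sketch only
      print $(col["Genome"]) "\t" q
    }
  ' "$1"
}

# Usage (not executed here):
#   classify_mimag results/genome_info.tsv
```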
SCG-based tools like CheckM can have very low sensitivity towards contamination
by fragments from unrelated organisms (non-redundant contamination). To
circumvent this problem, we suggest considering the GUNC analysis in addition to
the SCG-based estimation of contamination (default behaviour, see the
--gunc_filter option).
When --skip_dereplication=false, filtered genomes will be dereplicated. After
dereplication, the genome with the highest score in each cluster is selected as
representative. The score is computed using the following formula:
score = completeness - 5 x contamination + 0.5 x log(N50)
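To get a feel for how the three terms weigh against each other, the formula can be evaluated for a few hypothetical genomes. The sketch below assumes a base-10 logarithm, which the formula above does not state; the function name `derep_score` and the input values are illustrative only.

```shell
# Evaluate: score = completeness - 5 x contamination + 0.5 x log(N50)
# Assumption: log is base 10 (the base is not specified in the formula).
derep_score() {
  # $1 = completeness (%), $2 = contamination (%), $3 = N50 (bp)
  awk -v comp="$1" -v cont="$2" -v n50="$3" 'BEGIN {
    # awk log() is the natural logarithm, so divide by log(10).
    printf "%.2f\n", comp - 5 * cont + 0.5 * (log(n50) / log(10))
  }'
}

derep_score 98 1.5 100000   # prints 93.00
derep_score 95 2 50000      # prints 87.35
```

Under this assumption, each percentage point of contamination costs five points of score, while a tenfold increase in N50 adds only half a point, so completeness and contamination dominate the choice of representative.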
Common ANI thresholds are 95% for species-level dereplication or 99% as an
upper-bound limit. By default, dereplication is performed with the
species-level ANI threshold (0.95, parameter --ani_thr).
Please refer to System requirements for the complete list of system requirement options.
CheckM requires approximately 70 GB of memory. However, if you have only 16 GB
of RAM, a reduced genome tree (--reduced_tree option) can be used instead (see
https://github.com/Ecogenomics/CheckM/wiki/Installation#system-requirements).
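The choice between the full and the reduced tree can be expressed as a small helper that picks the memory-related options from the RAM available. This is only a sketch: the function name `memory_flags` is hypothetical, and the cut-offs (70 GB and 16 GB) are simply the figures quoted above.

```shell
# Suggest memory-related workflow options for a given amount of RAM.
# Thresholds are the approximate figures from the text, not measured values.
memory_flags() {
  # $1 = RAM available to the workflow, in whole GB
  if [ "$1" -ge 70 ]; then
    echo "--max_memory 70.GB"
  elif [ "$1" -ge 16 ]; then
    echo "--reduced_tree --max_memory 16.GB"
  else
    echo "less than 16 GB of RAM: CheckM is unlikely to run" >&2
    return 1
  fi
}

# Usage (not executed here; the substitution is left unquoted on purpose
# so that the options are split into separate arguments):
#   nextflow run metashot/prok-quality --genomes '*.fa' --outdir results $(memory_flags 32)
```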
For each GB of input data, the workflow requires approximately 0.5 to 1 GB for the final output and 2 to 3 GB for the working directory.
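As a rough planning aid, these per-GB figures can be scaled to the size of the input. The sketch below is a back-of-the-envelope estimate only; the function name `estimate_disk` is hypothetical.

```shell
# Estimate disk usage from the total input size, using the approximate
# per-GB ranges given above (output: 0.5 to 1 GB, working directory: 2 to 3 GB).
estimate_disk() {
  # $1 = total size of the input FASTA files, in GB
  awk -v gb="$1" 'BEGIN {
    printf "output: %.1f-%.1f GB\n", 0.5 * gb, 1 * gb
    printf "workdir: %.1f-%.1f GB\n", 2 * gb, 3 * gb
  }'
}

estimate_disk 20
# output: 10.0-20.0 GB
# workdir: 40.0-60.0 GB
```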