prok-quality

metashot/prok-quality is a comprehensive and easy-to-use pipeline for assessing the quality of prokaryotic genomes. It reports the quality measures recommended by the MIMAG standard (https://doi.org/10.1038/nbt.3893), including basic assembly statistics, completeness, contamination, and rRNA and tRNA genes. In addition, it relies on GUNC (https://doi.org/10.1101/2020.12.16.422776) to detect chimerism (i.e. non-redundant contamination). Reproducibility is guaranteed by Nextflow and versioned Docker images.

Note: This workflow is not intended to classify "finished" SAGs or MAGs. The "finished" category is reserved for genomes that can be assembled with extensive manual review and editing.

Please cite

Albanese D and Donati C. Large-scale quality assessment of prokaryotic genomes with metashot/prok-quality. F1000Research 2021, 10:822 (https://doi.org/10.12688/f1000research.54418.1)

MetaShot Home

Main features

Input: genomes/metagenomic bins in FASTA format;
Completeness, contamination and strain heterogeneity estimates using CheckM;
Chimerism (non-redundant contamination) detection using GUNC;
5S, 16S and 23S rRNA gene prediction with Barrnap;
Transfer RNA (tRNA) prediction with tRNAscan-SE;
Filter genomes by their completeness, contamination and GUNC prediction;
An extended summary of genome quality (including the rRNA and tRNA genes found) is reported;
Dereplication (optional) using drep.

Software included:

Software	Version
CheckM	1.1.2
GUNC	1.0.5
Barrnap	0.9
tRNAscan-SE	2.0.6
drep	2.6.2

Quick start

Install Docker (or Singularity) and Nextflow (see Dependencies);
Start running the analysis:

nextflow run metashot/prok-quality \
  --genomes '*.fa' \
  --outdir results

The GUNC database will be downloaded automatically from the Internet. If you want to download the GUNC database before running the analysis, use the following lines:

GUNC_DB=/path/to/gunc_db
docker run --rm -v${GUNC_DB}:/guncdb -w /guncdb metashot/gunc:1.0.5-1 gunc download_db .

Later, run the workflow adding the parameter --gunc_db $GUNC_DB/gunc_db_name.dmnd

Parameters

Options and default values are declared in nextflow.config.

Input and output

--genomes: input genomes/bins in FASTA format (default "data/*.fa")
--ext: FASTA files extension, files with different extensions will be ignored (default "fa")
--outdir: output directory (default "results")
--gunc_db: GUNC database. If 'none', the database will be downloaded automatically and placed in the output folder (gunc_db directory) (default "none")

CheckM:

--reduced_tree: reduce the memory requirements to approximately 14 GB; when enabled, set --max_memory to 16.GB (default false)
--checkm_batch_size: run CheckM on "checkm_batch_size" genomes at once, see Ecogenomics/CheckM#118 (default 1000)

GUNC:

--gunc_batch_size: run GUNC on "gunc_batch_size" genomes at once (default 100)

Genome filtering

--min_completeness: discard sequences with less than min_completeness% completeness (default 50)
--max_contamination: discard sequences with more than max_contamination% contamination (default 10)
--gunc_filter: if true, discard genomes that do not pass the GUNC filter (default false)
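
The three filtering options combine as a simple conjunction. The sketch below is not the pipeline's own code; it only restates the option descriptions above in Python. The GenomeQuality record and the keep_genome function are hypothetical names introduced for illustration, and keeping genomes that sit exactly on a threshold is an assumption read from the "less than" / "more than" wording.

```python
from dataclasses import dataclass


@dataclass
class GenomeQuality:
    """Hypothetical record holding the three values the filter looks at."""
    name: str
    completeness: float   # percent, CheckM estimate
    contamination: float  # percent, CheckM estimate
    gunc_pass: bool       # True if GUNC does not flag the genome as chimeric


def keep_genome(g, min_completeness=50.0, max_contamination=10.0, gunc_filter=False):
    """Return True if the genome survives the three filtering options."""
    if g.completeness < min_completeness:    # --min_completeness
        return False
    if g.contamination > max_contamination:  # --max_contamination
        return False
    if gunc_filter and not g.gunc_pass:      # --gunc_filter
        return False
    return True


genomes = [
    GenomeQuality("bin_1.fa", 95.2, 1.3, True),
    GenomeQuality("bin_2.fa", 48.0, 0.5, True),   # too incomplete
    GenomeQuality("bin_3.fa", 88.0, 4.0, False),  # does not pass GUNC
]

kept_default = [g.name for g in genomes if keep_genome(g)]
kept_strict = [g.name for g in genomes if keep_genome(g, gunc_filter=True)]
print(kept_default)
print(kept_strict)
```

With the default values, bin_3.fa is kept because the GUNC result is only applied when gunc_filter is true.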

Dereplication

--skip_dereplication: skip the dereplication step (default false)
--ani_thr: ANI threshold for dereplication (> 0.90) (default 0.95)
--min_overlap: minimum required overlap in the alignment between genomes to compute ANI (default 0.30)

Resource limits

--max_cpus: maximum number of CPUs for each process (default 8)
--max_memory: maximum memory for each process (default 70.GB)
--max_time: maximum time for each process (default 96.h)

Output

The files and directories listed below will be created in the results directory after the pipeline has finished.

Main

genome_info.tsv: summary of genomes quality (including completeness, contamination, GUNC filter, N50, rRNA genes found, number of tRNA and tRNA types). This file contains:
- Genome: the genome filename
- Completeness, Contamination, Strain heterogeneity: CheckM estimates
- GUNC pass: if a genome doesn't pass GUNC analysis it means it is likely to be chimeric
- Genome size (bp), ... # predicted genes: basic genome statistics (see https://github.com/Ecogenomics/CheckM/wiki/Genome-Quality-Commands#qa)
- 5S rRNA, 23S rRNA, 16S rRNA: Yes if the rRNA gene was found
- # tRNA, # tRNA types: the number of tRNA and tRNA types found respectively
filtered: this folder contains the genomes filtered according to --min_completeness, --max_contamination and --gunc_filter options
genome_info_filtered.tsv: same as genome_info.tsv, but only for the filtered genomes
derep_info.tsv: dereplication summary (if --skip_dereplication=false). This file contains:
- Genome: genome filename
- Cluster: the cluster ID (from 0 to N-1)
- Representative: is this genome the cluster representative?
filtered_repr: this folder contains the representative genomes (if --skip_dereplication=false)

Secondary

checkm: contains the original checkm's qc file
gunc: contains the original GUNC output file
barrnap: GFF and FASTA files containing the predicted rRNA sequences for bacteria (.bac) and archaea (.arc) models
trnascan_se: TSV and FASTA files containing the predicted tRNA sequences for bacteria (.bac) and archaea (.arc) models
drep: original data tables, figures and log of drep.

Documentation

A note on MIMAG/MISAG standards

Following MIMAG/MISAG standards, you can classify a prokaryotic genome as high-quality draft when:

its completeness is >90% and the contamination is <5%;
23S, 16S, and 5S rRNA genes can be predicted;
at least 18 tRNA types can be predicted.

A genome can be classified as medium-quality draft when its completeness is >=50% and the contamination is <10%.
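
The criteria above can be sketched as a small decision function. This is an illustration of the rules as stated in this note, not code taken from the pipeline; the function name, argument names and the label returned for genomes that meet neither set of criteria are all hypothetical.

```python
def mimag_category(completeness, contamination, has_5s, has_16s, has_23s, trna_types):
    """Classify a genome using the thresholds listed in this note.

    completeness and contamination are percentages (CheckM estimates);
    has_5s, has_16s and has_23s say whether each rRNA gene was predicted;
    trna_types is the number of distinct tRNA types predicted.
    """
    has_all_rrna = has_5s and has_16s and has_23s
    if completeness > 90 and contamination < 5 and has_all_rrna and trna_types >= 18:
        return "high-quality draft"
    if completeness >= 50 and contamination < 10:
        return "medium-quality draft"
    # Label chosen for this sketch only; the note does not name this case.
    return "below medium-quality"


print(mimag_category(96.5, 1.2, True, True, True, 20))
print(mimag_category(96.5, 1.2, True, False, True, 20))
print(mimag_category(42.0, 3.0, False, False, False, 9))
```

The second call shows a common situation: a nearly complete, clean genome that lacks a predicted 16S rRNA gene falls back to the medium-quality category.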

SCG-based tools like CheckM can have very low sensitivity towards contamination by fragments from unrelated organisms (non-redundant contamination). To circumvent this problem, we suggest considering the GUNC analysis in addition to the SCG-based estimate of contamination (default behaviour, see the --gunc_filter option).

A note on dereplication

When --skip_dereplication=false, filtered genomes will be dereplicated. After dereplication, the genome with the highest score in each cluster is selected as the representative. The score is computed using the following formula:

score = completeness - 5 x contamination + 0.5 x log(N50)
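
As a worked example of the formula, the sketch below scores two genomes of one cluster and picks the representative. The function names and the dictionary layout are hypothetical, and the base of the logarithm is an assumption: the formula does not state it, so base 10 is used here purely for illustration.

```python
import math


def derep_score(completeness, contamination, n50):
    """score = completeness - 5 x contamination + 0.5 x log(N50).

    Assumption: log is taken in base 10 (the formula does not specify the base).
    """
    return completeness - 5 * contamination + 0.5 * math.log10(n50)


def pick_representative(cluster):
    """Return the genome (a dict) with the highest score in the cluster."""
    return max(
        cluster,
        key=lambda g: derep_score(g["completeness"], g["contamination"], g["n50"]),
    )


cluster = [
    {"name": "a.fa", "completeness": 98.0, "contamination": 3.0, "n50": 100000},
    {"name": "b.fa", "completeness": 92.0, "contamination": 0.5, "n50": 10000},
]

score_a = derep_score(98.0, 3.0, 100000)  # 98 - 15 + 0.5 * 5
score_b = derep_score(92.0, 0.5, 10000)   # 92 - 2.5 + 0.5 * 4
representative = pick_representative(cluster)
print(score_a, score_b, representative["name"])
```

Because contamination is weighted five times more heavily than completeness, b.fa is selected even though a.fa is more complete and has a larger N50.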

Common ANI thresholds are 95% for species-level dereplication or 99% as upper-bound limit. By default the dereplication is performed with the species-level ANI threshold (0.95, parameter --ani_thr).

System requirements

Please refer to System requirements for the complete list of system requirements options.

Memory

CheckM requires approximately 70 GB of memory. However, if you have only 16 GB RAM, a reduced genome tree (--reduced_tree option) can also be used (see https://github.com/Ecogenomics/CheckM/wiki/Installation#system-requirements).

Disk

For each GB of input data, the workflow requires approximately 0.5 to 1 GB for the final output and 2 to 3 GB for the working directory.
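
A rough way to size a run before starting it is to apply these per-GB figures to the size of the input. The helper below is a hypothetical sketch: it reads the figures above as lower and upper bounds (an assumption) and simply scales them linearly.

```python
def estimate_disk_gb(input_gb):
    """Return (low, high) estimates in GB for output, working directory and total."""
    output = (0.5 * input_gb, 1.0 * input_gb)  # final output: 0.5 to 1 GB per input GB
    work = (2.0 * input_gb, 3.0 * input_gb)    # working directory: 2 to 3 GB per input GB
    total = (output[0] + work[0], output[1] + work[1])
    return {"output": output, "work": work, "total": total}


estimate = estimate_disk_gb(20)
print(estimate)
```

For 20 GB of input genomes this gives 10 to 20 GB for the results and 40 to 60 GB for the working directory.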
