-
Notifications
You must be signed in to change notification settings - Fork 37
User manual
With CAMISIM you can create simulated metagenome data sets out of taxonomic profiles or de novo from a list of genomes. If a taxonomic profile is used as input, the output data set is created from the NCBI complete genomes, reflecting the input profile as closely as possible and will contain the same number of samples as the input profile, if not specified otherwise. If the community is designed de novo, a user-defined number of the complete genomes is used for creating a community which maximizes genome novelty as well as phylogenetic spread.
CAMISIM can generate four types of simulated metagenome samples de novo:
- Individual simulate metagenome sample.
Uses a taxonomic profile sampled from a log-normal distribution - A time series of simulated metagenome samples.
Uses a taxonomic profile sampled from a log-normal distribution with Gaussian noise, a normal distribution added to successively generated samples. - A set of replicate simulated metagenome samples.
Uses a taxonomic profile sampled from a log-normal distribution, with Gaussian noise repeatedly added to the original log-normal values. - Differential abundance metagenome samples.
Uses a taxonomic profile sampled from a log-normal distribution.
The output of CAMISIM is the read files in fastq format, a mapping of these reads to the used genomes in BAM format, "gold standards" for assembly (fasta), binning and profiling. All files are described here.
Download CAMISIM from the github repository into any directory and make sure the following Dependencies are fullfilled. The metagenome simulation pipeline can be downloaded using git:
git clone https://github.com/CAMI-challenge/CAMISIM
Please make sure that a software is of the specified version, as outdated (or sadly sometimes too new) software can cause errors. The path to the binaries of those tools can be set in the configuration file.
The pipeline has only been tested on LINUX systems and might fail to run on any other systems (and no support for these will be provided).
The following software is required for the simulation pipeline to run:
Nextflow is the workflow framework used for this software. Nextflow can also be installed via conda (https://anaconda.org/bioconda/nextflow).
CAMISIM uses mamba to install and configure all needed software packages. Mamba can also be installed via conda (https://anaconda.org/conda-forge/mamba).
conda optional
CAMISIM can also use conda instead of mamba to install and configure all needed software packages. This is not recommended for performance reasons. If it is still needed to use conda, adjust it in the config files (configuration file).
The simulation will be conducted primarily in the work directory created by the pipeline. The results will then be copied to the specified output directory. Be sure to have enough space at both locations. Required hard drive space and RAM can vary a lot. Very small simulations can be run on a laptop with 4GB RAM, while realistic simulations of several hundred genomes, given a realistic metagenome size, can easily require several hundreds of gigabyte of both RAM and HD space, the chosen metagenome size being the relevant factor.
A database dump of the NCBI taxonomy is included, current versions can be downloaded from the NCBI FTP-Server.
If the community design should be performed de novo, genomes in fasta format to be sampled from are needed. Otherwise they will be downloaded from the NCBI complete genomes.
The de novo community design needs two files to run:
-
A file containing, tab separated, a genome identifier and that path to the file of the genome.
-
A [[meta data file|meta-data-file-format] that contains, tab separated and with header, genome identifier, novelty categorization, otu assignment and a taxonomic classification.
At minimum the following files are required: "nodes.dmp", "merged.dmp", "names.dmp"
de novo:
>> nextflow run main.nf
or from_profile:
>> nextflow run main.nf --params.biom_profile "${projectDir}/defaults/mini.biom"
Further use cases can be found here.
Multiple kinds of simulated metagenome data sets can be created, either using an existing 16S rRNA profile or de novo: Single sample, differential abundance or time series data sets with
different insert sizes and complexities.
All data sets can be simulated as paired-end sequencing from Illumina HiSeq or other machines as well as other technologies (like ONT or PacBio) and error rates.
For generation of data sets and their samples, the following steps are undertaken:
After genome data validation (step 1), the community composition is designed according to specified criteria or the given taxonomic profile (step 2) and the metagenome data sets simulated (step 3).
Creation of the gold standards (step 4) represents the last part of the pipeline.
For the de novo metagenome simulation, all sequences shorter than 1000 base pairs are removed from the provided genome assemblies and sequences validated to contain only valid characters. The input genomes can be draft genomes in fasta format and as such contain beside the bases ACGT also ambiguous DNA characters such as 'RYWSMKHBVDN'.
If the input is a taxonomic profile in BIOM format, the taxa appearing in it, are matched to closely related, complete genomes from the NCBI RefSeq or additional reference files with e.g. new genomes, such that the simulated metagenome (genomes and their abundances) reflects the input profile as closely as possible. This is done via the matching of scientific names in the input profile. Given the taxonomic names of the OTUs in the BIOM profile (this is required), a NCBI taxonomy ID is inferred. For all reference genomes, a taxonomy ID is also required. In the next step, the lineages between the OTUs and the genomes are compared and an OTU is mapped to a genome which has the smallest path between OTU taxonomy ID and genome taxonomy ID. Since the shortest path might be really long, a threshold is set to family level: An OTU will not get mapped to a genome, which is not in the same family, i.e. for every mapped genome it is guaranteed that they are at least of the same family. Usually this leaves some OTUs unassigned, either because their scientific name did not yield a NCBI taxonomy ID or because for the taxonomy ID no complete genomes are available. It is possible to assign "random" genomes from the reference collection(s) to the unmapped OTUs to increase complexity via an extra option.
For the de novo community design, a specified number of sequences (either circular elements or genomes) is selected according to certain criteria, see below, from all input sequences (e.g. circular elements or genomes) to generate a specific data set. The random selection of genomes for a particular data set is guided based on their membership in novelty category (Taxonomic annotation step 4.) and OTUs: To increase taxonomic spread, for a specified overall number of genomes to be included in a particular data set, the selection algorithm attempts to draw the same number of genomes from each novelty category (new strain, new species, new genus, new family, new order). If a category does not have enough genomes to thus enable coming up with the specified genome number, more genomes are drawn from the other categories, in equal amounts, if possible. This is repeated until the specified number of genomes has been sampled. To model strain level diversity within a species, for every OTU, all available genomes are included in one particular data set up to five genomes at most. More than five genomes per OTU are only included, if the specified data set size requires more genomes to be added, that are not available otherwise.
To increase strain level diversity in the data sets, a number of additional genomes can be simulated with sgEvolver12, representing related strains to the genomes selected above (see below). To this end, a genome is randomly chosen, and a strain number is drawn from a geometric distribution (p=0.3), which represented the number of related strains to be created for that particular genome. This is repeated, until the specified overall number strains to be generated is reached. For each of these genomes, sgEvolver simulates 40 strains, with an increasing genetic distance (in steps of 0.1%) to the preceding ones, up to a total distance of 4%. From the 40 strains, the specified strain number is randomly selected and added to the genome set.
For data set simulation, genome and circular element abundance distributions are required for every data set. For a single sample data set, the abundance distribution is created by sampling from a log-normal distribution with mu set to 1 and sigma to 2 by default. To reflect a differential abundance experiment, two or more abundance samples can be drawn independently from a log-normal distribution with the same parameter settings. For consecutive samples of a time series, abundances for the next sample are calculated by sampling new values from the log-normal distribution, adding these to the previous value and dividing by two. The absolute abundance of the circular elements can be set to be like 15 times as high as the abundance of the microbial genomes, to replicate high natural plasmid abundances. For each sample, a taxonomic profile for higher ranking taxa is calculated based on the genome abundance values and their taxonomic IDs.
Based on the genome abundance distribution, the genome sizes and the specified metagenome sample output size, for every genome, the coverage in the simulated samples is calculated. The chosen read simulator then generates reads, using the technology and properties (like read length, insert size or error rates if applicable) as specified by the user, sampling each genome proportionally to the specified abundance in the community. A read simulation of a dataset can be run a second time using a different mean insert size using the same genome abundances as before to replicate the use of a mix of technologies on the same samples. The size of every sample of each dataset can be set to any size.
Outputs of this step are for every sample of a data set, a FASTQ and a BAM file, which specifies the location of every read in the input sequences.
A gold standard assembly is made with SAMtools for each sample individually and for all samples of each dataset together (pooled gold standard), which included every sequence position with a coverage of at least one, respectively. Specifically, the SAM file obtained from the read simulation specifying the location of each read on the reference genomes is used with SAMtools for calculation of the per-site coverage. Sequences are then broken into gold standard contig sequences at zero coverage sites. The contig sequences are then labeled with a unique dataset and sample identifying prefix and an increasing index as suffix. In the last step, sequences of the simulated data sets are shuffled using the unix command 'shuf' and anonymized, thus creating the challenge data sets.
'{i}' is the index for each sample that is to be generated.
- Column 1: genome_ID
- Column 2: abundance
'genome_ID' is the identifier of the genomes used.
'abundance' is the relative abundance of a genome to be simulated. 'abundance' does not reflect the amount of genetic data of a genome, but the amount of genomes.
In a set of two genomes, with both having a abundance of 0.5 but one genome is double the size of the other, the bigger genome will be 66% of the genetic data in the simulated metagenome.
All given genomes will be copied and placed in this folder. Doing this, sequence names are made sure to be unique and renamed if required. Comments and descriptions of sequences are removed. Also index files *.fai will be present after simulation
This file contains a list of replaced sequence ids, may be empty if none where replaced.
- Column 1: genome_ID
- Column 2: original sequence id
- Column 3: new sequence id
If the metagenome is simulated from profile this file is present. It contains a list of genomes paths to the copies in the output directory in the 'genomes' folder.
- Column 1: genome_ID
- Column 2: file path
If the metagenome is simulated from profile this file is present. Merged meta data of genomes of each community that are actually used for the simulation.
If no Taxonomy dump from the NCBI was specified in the nextflow.config, the downloaded and used one by CAMISIM will be stored here.
The bam files generated based on reads generated from the read simulator.
The simulated reads.
A file containing all reads of this sample.
If anonymization is done, this will be the only fastq file.
Mapping of reads for evaluation
- Column 1: anonymous read id
- Column 2: genome id
- Column 3: taxonomic id
- Column 4: read id
Fasta file with perfect assembly of reads of this sample
Mapping of contigs for evaluation
- Column 1: anonymous contig id
- Column 2: genome id
- Column 3: taxonomic id
- Column 4: sequence id of the original genome (in 'source_genomes' folder)
- Column 5: number of reads used in the contig
- Column 6: start position
- Column 7: end position
Fasta file with perfect assembly of reads from all samples
Mapping of contigs from pooled reads fo_validate_raw_genomesr evaluation.
- Column 1: anonymous_contig_id
- Column 2: genome id
- Column 3: taxonomic id
- Column 4: sequence id of the original genome (in 'source_genomes' folder)
- Column 5: number of reads used in the contig
- Column 6: start position
- Column 7: end position
Taxonomic profile for each sample
The seeds used for the metagenomesimulation. If the results of the simulation need to be reproduced, use the used_initial_seed from the {out}/seed/seed.txt in the next run. The seed can be configured in the nextflow.config file.