Peter Hofmann edited this page Oct 27, 2015 · 42 revisions

MetagenomeSimulationPipeline is a pipeline for the simulation of a metagenome.

This repository contains all the individual steps to simulate metagenomic shotgun samples.

##Genome Annotation

Annotates genomes based on marker genes.

###Find and extract marker genes

Marker genes are located using HMMER and copied into a separate file.

###Alignment and clustering

Mothur 1.33.3 then finds the best alignment of the marker genes to a SILVA reference alignment.
After that, Mothur 1.33.3 calculates the genetic distances between all sequences and clusters the sequences based on these distances.

###Annotation

Based on the clustering of marker genes, genomes are taxonomically classified, organised into OTUs and put into novelty categories. Where possible, their average nucleotide identity to reference genomes is also calculated using MUMmer.

##Metagenome simulation

After genomes are annotated, a simulated metagenome can be generated.

###Validation

As preparation, all sequence files are validated: each must be in FASTA format and contain nucleotides, not amino acids. Since fully known genomes are hard to come by, not only 'GATC' but also the ambiguous nucleotide codes 'RYWSMKHBVDN' are accepted.
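
The check described above can be illustrated with a minimal sketch. It is not the pipeline's own validator: the function name `is_nucleotide_fasta` and the handling of blank lines are assumptions made for illustration. Only the accepted alphabet is taken from the text.

```python
from io import StringIO

# Standard bases plus the ambiguity codes listed in the text.
ALLOWED = set("GATC" + "RYWSMKHBVDN")


def is_nucleotide_fasta(handle):
    """Return True if the stream looks like FASTA with only accepted nucleotide codes.

    Hypothetical helper, not part of the pipeline.
    """
    seen_header = False
    seen_sequence = False
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are tolerated
        if line.startswith(">"):
            seen_header = True
            continue
        if not seen_header:
            return False  # sequence data before any header line
        if not set(line.upper()) <= ALLOWED:
            return False  # e.g. amino acid letters such as 'L' or 'E'
        seen_sequence = True
    return seen_header and seen_sequence


if __name__ == "__main__":
    print(is_nucleotide_fasta(StringIO(">contig1\nGATTACANNRY\n")))
    print(is_nucleotide_fasta(StringIO(">protein1\nMKLEQ\n")))
```

Note that a purely letter-based check cannot reject every protein sequence, since some amino acid letters are also valid nucleotide codes.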

###Community Design

To design a community for all samples, abundances are drawn from a chosen distribution. Strains are selected with the goal of diversity. Artificial strains are also generated if required.

####Strain-Simulation

In case more genomes are wanted than are available, artificial strains are generated to compensate.

####Selection of strains

After it is known how many strains are required, strains are selected.

####Distribution of strains

For every strain, including artificial strains, abundances are drawn for every sample.
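
As a rough illustration of drawing per-sample abundances, the sketch below assumes a log-normal distribution and normalises each sample to sum to 1. Both choices, the function name `draw_abundances` and the parameters `mu` and `sigma` are assumptions; the page only states that a chosen distribution is used.

```python
import random


def draw_abundances(strain_ids, n_samples, mu=1.0, sigma=2.0, seed=None):
    """Draw one relative abundance per strain for each sample.

    Hypothetical helper: the log-normal distribution and its parameters
    are assumptions, not the pipeline's documented defaults.
    """
    rng = random.Random(seed)
    profiles = []
    for _ in range(n_samples):
        # One independent draw per strain, artificial strains included.
        raw = {strain: rng.lognormvariate(mu, sigma) for strain in strain_ids}
        total = sum(raw.values())
        # Assumption: abundances are reported as fractions of the sample.
        profiles.append({strain: value / total for strain, value in raw.items()})
    return profiles


if __name__ == "__main__":
    for profile in draw_abundances(["strainA", "strainB", "simulated1"], 2, seed=42):
        print(profile)
```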

####Taxonomic Profile

Based on the abundances, files with the taxonomic profile are made.

###Prepare genomes

A copy of each genome is made and placed into the project folder, with the sequence descriptions removed. The descriptions are removed because they seem to cause ART Illumina to create corrupt sam files. More importantly, if a sequence name is not unique, it is renamed. A file listing all changed sequence names is created and also placed into the folder.
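
The header clean-up can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: the function name `clean_headers` and the `_1`, `_2`, ... renaming scheme are hypothetical, and the input is assumed to be FASTA lines without trailing newlines.

```python
def clean_headers(lines, seen=None):
    """Strip FASTA descriptions and rename duplicate sequence names.

    Returns the cleaned lines and a list of (old_name, new_name) pairs.
    Hypothetical helper; the suffix scheme is an assumption.
    """
    seen = set() if seen is None else seen
    cleaned, renamed = [], []
    for line in lines:
        if not line.startswith(">"):
            cleaned.append(line)  # sequence lines pass through unchanged
            continue
        fields = line[1:].split()
        # Keep only the sequence name; everything after it is the description.
        name = fields[0] if fields else "unnamed"
        new_name, suffix = name, 1
        while new_name in seen:
            new_name = f"{name}_{suffix}"
            suffix += 1
        seen.add(new_name)
        if new_name != name:
            renamed.append((name, new_name))
        cleaned.append(">" + new_name)
    return cleaned, renamed


if __name__ == "__main__":
    fasta = [">contig1 Escherichia coli plasmid", "ACGT", ">contig1 duplicate", "GGCC"]
    cleaned, renamed = clean_headers(fasta)
    print(cleaned)
    print(renamed)
```

Passing the same `seen` set across several genome files would keep names unique across the whole project folder rather than within a single file.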

###Read simulation

Currently only ART Illumina is fully implemented. It creates simulated (paired-end) reads based on error profiles, along with sam files.
The pIRS read simulator still requires manual steps and is not fully implemented. pIRS requires a configuration file that is generated dynamically for each strain. PBSIM generates maf files that need to be converted, in combination with the sequence files, to sam files. Those are created by a script afterwards.

###Assembly to contigs (Gold standard assembly)

A gold standard assembly refers to the ideal assembly of reads. Since the reads are simulated, the position of each read in relation to the original sequence is known. With that information, the simulated reads can be assembled into flawless contigs. This is done using SAMtools and its mpileup command on the generated sam/bam files.
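
The idea behind the gold standard can be illustrated without SAMtools. The sketch below is not the mpileup-based procedure itself; it assumes that a contig is a stretch of the source genome covered by at least one read, and it merges known read positions into such stretches. The function names and the half-open `(start, end)` interval convention are assumptions.

```python
def covered_regions(read_intervals):
    """Merge half-open (start, end) read intervals into covered stretches."""
    merged = []
    for start, end in sorted(read_intervals):
        if merged and start <= merged[-1][1]:
            # Overlapping or directly adjacent read: extend the current stretch.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


def gold_standard_contigs(genome, read_intervals):
    """Cut error-free contigs out of the genome at the covered stretches."""
    return [genome[start:end] for start, end in covered_regions(read_intervals)]


if __name__ == "__main__":
    genome = "ACGTACGTAC"
    reads = [(0, 4), (2, 6), (8, 10)]  # known origin of each simulated read
    print(gold_standard_contigs(genome, reads))
```

Because the contigs are cut from the source genome, they contain none of the sequencing errors present in the simulated reads.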

###Anonymization

First, all sequences are shuffled using the unix command shuf. Paired-end reads that belong to one another are guaranteed to stay together in that process. The sequences are then relabeled with new names that carry an increasing index.
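
A minimal sketch of this step is shown below. It uses Python's `random.shuffle` in place of shuf, treats each read pair as a single unit so mates stay together, and assumes a `read<index>/1` and `read<index>/2` naming scheme. The function name, the prefix and the returned mapping are all hypothetical.

```python
import random


def anonymize_pairs(pairs, prefix="read", seed=None):
    """Shuffle read pairs as units and relabel them with an increasing index.

    pairs: list of ((name1, seq1), (name2, seq2)) tuples.
    Returns the relabeled pairs and a mapping from new to original names.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)  # leave the caller's list untouched
    rng.shuffle(shuffled)   # each pair moves as one unit
    anonymized, mapping = [], {}
    for index, (first, second) in enumerate(shuffled):
        new_first = f"{prefix}{index}/1"
        new_second = f"{prefix}{index}/2"
        mapping[new_first] = first[0]
        mapping[new_second] = second[0]
        anonymized.append(((new_first, first[1]), (new_second, second[1])))
    return anonymized, mapping


if __name__ == "__main__":
    pairs = [
        (("genomeA-1/1", "ACGT"), ("genomeA-1/2", "TTGA")),
        (("genomeB-7/1", "GGCC"), ("genomeB-7/2", "CATG")),
    ]
    anonymized, mapping = anonymize_pairs(pairs, seed=3)
    print(anonymized)
    print(mapping)
```

Keeping the mapping from new to original names is what allows the anonymized reads to be traced back to their source genomes later.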
