File Formats

The following file format definitions are used in data exchange between stages of the CAMI pipeline.

##Genome annotation

####Input

####Configuration File
This file contains the options and values required for the pipeline to run and a path to this file is a required program argument.

####Output
Files available when the pipeline has finished.

####16S_rRNA.fna
Contains accepted marker genes. Fasta formatted file. Sequence ids are internal ids found in the 'id_mapping.tsv' file.

####16S_rRNA.fna.rejected.fna
Contains rejected marker genes. Fasta formatted file. Sequence ids are internal ids found in the 'id_mapping.tsv' file.

####id_mapping.tsv
Tab separated data table.

Column 1: Internal id
Column 2: Original id
Column 3: NCBI taxonomic ID (Silva), if known
Column 4: NCBI taxonomic ID (EMBL), if known
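
The four-column layout above could be loaded as in the following minimal sketch. The function name `read_id_mapping` and the example rows are hypothetical. The sketch also assumes the file has no header row and that an unknown taxonomic ID appears as an empty or missing column; neither is stated above.

```python
import csv
import io


def read_id_mapping(handle):
    """Return {internal_id: (original_id, silva_taxid, embl_taxid)}.

    Assumption: an unknown taxonomic ID is an empty or missing column
    and is returned as None.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        # Pad short rows so columns 3 and 4 are always addressable.
        row = row + [""] * (4 - len(row))
        internal_id, original_id, silva, embl = row[:4]
        mapping[internal_id] = (original_id, silva or None, embl or None)
    return mapping


# Made-up example rows, not taken from real pipeline output.
example = "SR_1\tgenome_a_16S\t562\t562\nSR_2\tgenome_b_16S\t\t\n"
id_map = read_id_mapping(io.StringIO(example))
print(id_map["SR_2"])
```

In practice the handle would come from `open("id_mapping.tsv", newline="")` rather than an in-memory string.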

####meta_data.tsv
Tab separated data table. Columns have no fixed order. First row must have column names.

genome_ID: Original genome id
prediction_threshold: The relative genome distance threshold at which the taxonomic classification was made
NCBI_ID: Taxonomic classification, given as an NCBI taxonomic id
SCIENTIFIC_NAME: Scientific name of taxonomic classification
novelty_category: Novelty category of a genome
OTU: Id of genomes that were clustered together
ANI: Average nucleotide identity to the closest reference genome
ANI_NOVELTY_CATEGORY: Novelty category based on ANI
ANI_TAXONOMIC_COMPARE: Taxonomic id of closest reference genome
ANI_SCIENTIFIC_NAME: Scientific name of closest reference genome
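
Because the columns have no fixed order, a reader should look fields up by name rather than by position. The sketch below shows one possible way to do that with `csv.DictReader`. The function name `read_meta_data` and the example values are hypothetical.

```python
import csv
import io


def read_meta_data(handle):
    """Return {genome_ID: row_dict}, addressing columns by header name."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Accessing fieldnames reads the header row.
    if reader.fieldnames is None or "genome_ID" not in reader.fieldnames:
        raise ValueError("a header row with a 'genome_ID' column is required")
    return {row["genome_ID"]: row for row in reader}


# Made-up example; the column order is deliberately shuffled.
example = (
    "OTU\tgenome_ID\tNCBI_ID\tnovelty_category\n"
    "7\tgenome_a\t562\tnew_strain\n"
)
meta = read_meta_data(io.StringIO(example))
print(meta["genome_a"]["NCBI_ID"])
```

All values are kept as strings; converting fields such as ANI to numbers is left to the caller.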

####mothur_cluster_16S_rRNA.list
Tab separated data table. First row must have column names.

Column 1 'label': Relative genome distance thresholds. Example: unique, 0.01, 0.02, 0.03
Column 2 'numOtus': Number of groups (otu)
Column 3+ 'Otu<index>': Comma separated lists of internal ids

Example:
label numOtus Otu001
unique 518 SR_517,SR_518,SR_462
0.001 469 SR_517,SR_518,SR_462
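
A list file with this layout could be parsed as sketched below. The function name `read_mothur_list` and the example data are hypothetical. The sketch skips the header row and does not check 'numOtus' against the number of OTU columns.

```python
import io


def read_mothur_list(handle):
    """Return {label: [ids of OTU 1, ids of OTU 2, ...]}.

    Each OTU is a list of internal ids; the header row is skipped.
    """
    clusters = {}
    next(handle, None)  # first row holds the column names
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        label = fields[0]
        # Column 3 onwards: one comma separated id list per OTU.
        clusters[label] = [f.split(",") for f in fields[2:] if f]
    return clusters


# Made-up example with three sequences.
example = (
    "label\tnumOtus\tOtu1\tOtu2\n"
    "unique\t2\tSR_1,SR_2\tSR_3\n"
    "0.01\t1\tSR_1,SR_2,SR_3\n"
)
clusters = read_mothur_list(io.StringIO(example))
print(clusters["unique"])
```

The internal ids returned here can be translated back to original ids through 'id_mapping.tsv'.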

##Metagenome Simulation

####Input

####Configuration File
This file contains the options and values required for the pipeline to run and a path to this file is a required program argument.

####Output

####{out}/distributions/distribution_{i}.txt
'{i}' is the index for each sample that is to be generated.

Column 1: genome_ID
Column 2: abundance

'genome_ID' is the identifier of the genomes used.
'abundance' is the relative abundance of a genome to be simulated. It reflects the number of genome copies, not the amount of genetic data contributed by a genome.
For example, in a set of two genomes that both have an abundance of 0.5, where one genome is twice the size of the other, the larger genome will account for about 67% (two thirds) of the genetic data in the simulated metagenome.
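
The relation between abundance and share of genetic data can be checked with a short calculation. The sketch assumes that a genome's share of the data is proportional to abundance times genome size, which is what the two-genome example implies. The function name `data_fractions` and the genome sizes are hypothetical.

```python
def data_fractions(abundances, genome_sizes):
    """Fraction of simulated sequence data contributed by each genome.

    Assumption: data share is proportional to abundance * genome size.
    """
    weighted = {g: abundances[g] * genome_sizes[g] for g in abundances}
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}


# Two genomes with equal abundance; 'large' is twice the size of 'small'.
fractions = data_fractions(
    {"small": 0.5, "large": 0.5},
    {"small": 1_000_000, "large": 2_000_000},
)
print(round(fractions["large"], 3))  # 0.667
```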

####{out}/source_genomes/*.fna
All given genomes will be copied and placed in this folder. In the process, sequence names are checked for uniqueness and renamed if required. Comments and descriptions of sequences are removed.

####{out}/source_genomes/sequence_id_map.txt
This file contains a list of replaced sequence ids.

Column 1: genome_ID
Column 2: original sequence id
Column 3: new sequence id

####{out}/internal/genome_locations.tsv
List of file paths to the genome copies in the 'source_genomes' folder of the output directory.

Column 1: genome_ID
Column 2: file path

####{out}/internal/meta_data.tsv
Merged meta data of genomes of each community that are actually used for the simulation.

####{out}/internal/unused_c{i}_{original_name}.tsv
Unused meta data of genomes of every community.

####{out}/sample_{i}/bam/{genome_id}.bam
BAM files generated from the reads produced by the read simulator.

####{out}/sample_{i}/fastq/*.fq
If no anonymization is done, the original fastq files will be here.

####{out}/sample_{i}/fastq/anonymous_reads.fq
If anonymization is done, this will be the only fastq file.

####{out}/sample_{i}/fastq/reads_mapping.tsv
Mapping of reads for evaluation

Column 1: anonymous read id
Column 2: genome id
Column 3: taxonomic id
Column 4: read id
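
As an illustration of how this mapping might be used in an evaluation, the sketch below counts the simulated reads per genome. The function name `reads_per_genome` and the example rows are hypothetical. Skipping lines that start with '#' is an assumption about how a header, if any, would be marked.

```python
import csv
import io
from collections import Counter


def reads_per_genome(handle):
    """Count reads per genome id (column 2) in a reads_mapping.tsv table."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: header or comment lines, if present, start with '#'.
        if len(row) < 4 or row[0].startswith("#"):
            continue
        anonymous_read_id, genome_id, tax_id, read_id = row[:4]
        counts[genome_id] += 1
    return counts


# Made-up example rows.
example = (
    "S0R0\tgenome_a\t562\tseq_1-10\n"
    "S0R1\tgenome_b\t1280\tseq_7-3\n"
    "S0R2\tgenome_a\t562\tseq_1-11\n"
)
counts = reads_per_genome(io.StringIO(example))
print(counts.most_common())
```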

####{out}/sample_{i}/anonymous_gsa.fasta
Fasta file with perfect assembly of reads of this sample

####{out}/sample_{i}/gsa_mapping.tsv
Mapping of contigs for evaluation

Column 1: anonymous contig id
Column 2: genome id
Column 3: taxonomic id
Column 4: sequence id of the original genome (in 'source_genomes' folder)
Column 5: number of reads used in the contig
Column 6: start position
Column 7: end position
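
The seven columns can be read into typed records, for example to list the contigs of each source sequence in positional order. The names `ContigRecord` and `read_gsa_mapping` and the example rows are hypothetical. The sketch makes no assumption about whether positions are 0-based or 1-based, and it skips lines that start with '#', which is an assumption about how a header might be marked.

```python
import csv
import io
from typing import NamedTuple


class ContigRecord(NamedTuple):
    contig_id: str
    genome_id: str
    tax_id: str
    sequence_id: str
    read_count: int
    start: int
    end: int


def read_gsa_mapping(handle):
    """Parse a gsa_mapping.tsv table into a list of ContigRecord."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: header or comment lines, if present, start with '#'.
        if len(row) < 7 or row[0].startswith("#"):
            continue
        records.append(
            ContigRecord(row[0], row[1], row[2], row[3],
                         int(row[4]), int(row[5]), int(row[6]))
        )
    return records


# Made-up example rows, intentionally out of positional order.
example = (
    "contig_2\tgenome_a\t562\tseq_1\t40\t5000\t7500\n"
    "contig_1\tgenome_a\t562\tseq_1\t12\t100\t900\n"
)
records = sorted(read_gsa_mapping(io.StringIO(example)),
                 key=lambda r: (r.sequence_id, r.start))
print([r.contig_id for r in records])
```

The same reader should also work for 'gsa_pooled_mapping.tsv', which lists the same seven columns.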

####{out}/anonymous_gsa_pooled.fasta
Fasta file with perfect assembly of reads from all samples

####{out}/gsa_pooled_mapping.tsv
Mapping of contigs from pooled reads for evaluation.

Column 1: anonymous_contig_id
Column 2: genome id
Column 3: taxonomic id
Column 4: sequence id of the original genome (in 'source_genomes' folder)
Column 5: number of reads used in the contig
Column 6: start position
Column 7: end position

####{out}/taxonomic_profile_{i}.txt
Taxonomic profile for each sample
