File Formats

The following file format definitions are used in data exchange between stages of the CAMI pipeline.

Input

BIOM file
Running the from_profile mode requires a BIOM file. For details on this format, please consult the link given above. Make sure that the id column of your OTUs does not contain special characters (for example &, | or ;) and that the taxonomy field is in the "greengenes taxonomy format", e.g. ["g__Escherichia"," s__Escherichia coli"]
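The two requirements above can be checked before a run. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, only the three example characters named above are tested, and the rank prefixes (k, p, c, o, f, g, s followed by two underscores) are assumed from the usual Greengenes convention.

```python
import re

# Assumption: only the three characters named in the text are checked here;
# the pipeline itself may be stricter.
SPECIAL_CHARACTERS = set("&|;")

# Greengenes-style rank prefixes: kingdom, phylum, class, order, family,
# genus, species, each followed by two underscores.
RANK_PREFIX = re.compile(r"^[kpcofgs]__")


def has_special_characters(otu_id):
    """Return True if the OTU id contains one of the example characters."""
    return any(char in SPECIAL_CHARACTERS for char in otu_id)


def looks_like_greengenes(taxonomy):
    """Return True if the list is non-empty and every entry has a rank prefix."""
    return bool(taxonomy) and all(
        RANK_PREFIX.match(entry.strip()) is not None for entry in taxonomy
    )


print(has_special_characters("OTU_1"))  # False
print(has_special_characters("OTU|1"))  # True
print(looks_like_greengenes(["g__Escherichia", " s__Escherichia coli"]))  # True
```

Entries are stripped before matching, so the leading space in the second entry of the example above does not cause a failure in this sketch.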
Configuration File
This file contains the options and values required for the pipeline to run. A path to this file is a required program argument for the de novo simulation and, unless the default config is used, also for the from_profile mode. The following arguments from the config file expect a file path:
id_to_genome_file
Tab separated data table. It maps genome ids to the file paths of the genomes.
- Column 1: Genome id
- Column 2: file path
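A two-column table like this one can be read with Python's standard csv module. The sketch below is illustrative only: the function name, the example ids, and the paths are made up, and the duplicate check is an assumption about what a careful reader would want rather than documented pipeline behaviour.

```python
import csv
import io


def read_id_to_path(handle):
    """Read a two-column, tab separated table into a dict: genome id -> path."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected two columns, got: {row!r}")
        genome_id, file_path = row[0].strip(), row[1].strip()
        if genome_id in mapping:
            raise ValueError(f"duplicate genome id: {genome_id}")
        mapping[genome_id] = file_path
    return mapping


# Hypothetical example content; ids and paths are made up.
example = io.StringIO(
    "Genome1\t/data/genomes/genome1.fna\n"
    "Genome2\t/data/genomes/genome2.fna\n"
)
genome_paths = read_id_to_path(example)
print(genome_paths["Genome2"])  # /data/genomes/genome2.fna
```

With a real file, `open(path, newline="")` would take the place of the `io.StringIO` object.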
metadata
Tab separated data table. It maps genome ids to additional information about their classification.
- Row 1 (header): genome_ID\tOTU\tNCBI_ID\tnovelty_category
- Column 1: Genome id
- Column 2: Operational taxonomic unit (OTU) - membership in some taxonomic unit
- Column 3: NCBI taxonomy identifier
- Column 4: novelty category - if a genome is not in the database, how "new" is it in comparison to genomes in the NCBI (new_strain, new_species, new_genus, ...)
  For more information on this file, see: Genome selection
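Because the metadata table has a header row, it can be read by column name. This is a rough sketch using csv.DictReader: the function name is hypothetical, the example row is invented for illustration, and the strict header comparison is an assumption, not a documented requirement.

```python
import csv
import io

# Header as listed above.
EXPECTED_HEADER = ["genome_ID", "OTU", "NCBI_ID", "novelty_category"]


def read_metadata(handle):
    """Read the metadata table into a dict: genome id -> row (as a dict)."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames!r}")
    return {row["genome_ID"]: row for row in reader}


# Hypothetical example content; the genome id and OTU value are made up.
example = io.StringIO(
    "genome_ID\tOTU\tNCBI_ID\tnovelty_category\n"
    "Genome1\t1\t562\tnew_strain\n"
)
metadata = read_metadata(example)
print(metadata["Genome1"]["novelty_category"])  # new_strain
```

All values are kept as strings here; converting NCBI_ID to an integer is left to the caller.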
id_to_gff_file
Optional, tab separated data table. It maps genome ids to the file path of the gene annotation of a genome.
- Column 1: Genome id
- Column 2: file path
distributions_file_path
Path to optional, tab separated data tables. If this option is not blank, no distribution will be drawn; the abundance values provided in these files are used instead. Each file maps genome ids to their abundance in one sample, and one file is required for each sample. The paths of the individual per-sample files should be given as a comma-separated list.
- Column 1: Genome id
- Column 2: Abundance (float)

Output

{out}/distributions/distribution_{i}.txt

'{i}' is the index for each sample that is to be generated.

Column 1: genome_ID
Column 2: abundance

'genome_ID' is the identifier of the genomes used.
'abundance' is the relative abundance of a genome to be simulated. It reflects the number of genome copies, not the amount of genetic data contributed by a genome.
For example, in a set of two genomes that both have an abundance of 0.5, where one genome is twice the size of the other, the larger genome will make up about 67% (two thirds) of the genetic data in the simulated metagenome.
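The two-genome example can be reproduced with a few lines of arithmetic: each genome's share of the genetic data is its abundance multiplied by its size, divided by the sum of that product over all genomes. The function name and the genome sizes below are hypothetical and serve only to illustrate the calculation.

```python
def genetic_data_share(abundances, genome_sizes):
    """Return each genome's share of the simulated sequence data.

    The share is abundance * genome size, normalised over all genomes.
    """
    weighted = {g: abundances[g] * genome_sizes[g] for g in abundances}
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}


# Two genomes with equal abundance; "large" is twice the size of "small".
shares = genetic_data_share(
    {"small": 0.5, "large": 0.5},
    {"small": 1_000_000, "large": 2_000_000},
)
print(round(shares["large"], 3))  # 0.667
print(round(shares["small"], 3))  # 0.333
```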

{out}/source_genomes/*.fna

All given genomes are copied into this folder. During copying, sequence names are checked for uniqueness and renamed if required. Comments and descriptions of sequences are removed.

{out}/source_genomes/sequence_id_map.txt

This file contains a list of replaced sequence ids.

Column 1: genome_ID
Column 2: original sequence id
Column 3: new sequence id

{out}/internal/genome_locations.tsv

List of paths to the genome copies in the 'source_genomes' folder of the output directory.

Column 1: genome_ID
Column 2: file path

{out}/internal/meta_data.tsv

Merged metadata of the genomes from each community that are actually used in the simulation.

{out}/internal/unused_c{i}_{original_name}.tsv

Metadata of the unused genomes of each community.

{out}/sample_{i}/bam/{genome_id}.bam

BAM files generated from the reads produced by the read simulator.

{out}/sample_{i}/fastq/*.fq

If anonymization is not done, the original fastq files will be here.

{out}/sample_{i}/fastq/anonymous_reads.fq

If anonymization is done, this will be the only fastq file.

{out}/sample_{i}/fastq/reads_mapping.tsv

Mapping of reads for evaluation

Column 1: anonymous read id
Column 2: genome id
Column 3: taxonomic id
Column 4: read id
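For evaluation, the four columns above can be loaded into named records and grouped by genome. The following is a sketch under stated assumptions: the file is taken to be tab separated (suggested by the .tsv extension), any header or comment line is assumed to start with '#', and the function name and all example ids are made up.

```python
import csv
import io
from collections import Counter, namedtuple

ReadMapping = namedtuple(
    "ReadMapping", ["anonymous_read_id", "genome_id", "tax_id", "read_id"]
)


def read_reads_mapping(handle):
    """Read the four-column reads mapping into a list of ReadMapping records."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: a header or comment line, if present, starts with '#'.
        if not row or row[0].startswith("#"):
            continue
        if len(row) < 4:
            raise ValueError(f"expected four columns, got: {row!r}")
        records.append(ReadMapping(*row[:4]))
    return records


# Hypothetical example content; all ids are made up.
example = io.StringIO(
    "S0R0\tGenome1\t562\tread-0\n"
    "S0R1\tGenome2\t1280\tread-1\n"
    "S0R2\tGenome1\t562\tread-2\n"
)
records = read_reads_mapping(example)
reads_per_genome = Counter(record.genome_id for record in records)
print(reads_per_genome["Genome1"])  # 2
```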

{out}/sample_{i}/anonymous_gsa.fasta

Fasta file with perfect assembly of reads of this sample

{out}/sample_{i}/gsa_mapping.tsv

Mapping of contigs for evaluation

Column 1: anonymous contig id
Column 2: genome id
Column 3: taxonomic id
Column 4: sequence id of the original genome (in 'source_genomes' folder)
Column 5: number of reads used in the contig
Column 6: start position
Column 7: end position
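The seven columns above map naturally onto a small record type with integer fields for the read count and the positions. This sketch makes the same assumptions as any reader of an undocumented detail would have to: tab separation, '#' as the marker of an optional header line, and invented example values. The class and function names are hypothetical. The same layout is listed for the pooled mapping file on this page, so the reader should apply there as well.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ContigMapping:
    anonymous_contig_id: str
    genome_id: str
    tax_id: str
    sequence_id: str      # sequence id of the original genome
    number_of_reads: int  # number of reads used in the contig
    start: int
    end: int


def read_gsa_mapping(handle):
    """Read the seven-column contig mapping into ContigMapping records."""
    contigs = []
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: a header or comment line, if present, starts with '#'.
        if not row or row[0].startswith("#"):
            continue
        if len(row) < 7:
            raise ValueError(f"expected seven columns, got: {row!r}")
        contigs.append(
            ContigMapping(
                row[0], row[1], row[2], row[3],
                int(row[4]), int(row[5]), int(row[6]),
            )
        )
    return contigs


# Hypothetical example content; all ids and numbers are made up.
example = io.StringIO(
    "S0C0\tGenome1\t562\tseq_1\t120\t0\t5000\n"
    "S0C1\tGenome1\t562\tseq_1\t80\t7000\t9500\n"
)
contigs = read_gsa_mapping(example)
print(sum(contig.number_of_reads for contig in contigs))  # 200
```

Contig lengths are deliberately not computed here, because the text does not state whether the start and end positions are inclusive.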

{out}/anonymous_gsa_pooled.fasta

Fasta file with perfect assembly of reads from all samples

{out}/gsa_pooled_mapping.tsv

Mapping of contigs from pooled reads for evaluation.

Column 1: anonymous_contig_id
Column 2: genome id
Column 3: taxonomic id
Column 4: sequence id of the original genome (in 'source_genomes' folder)
Column 5: number of reads used in the contig
Column 6: start position
Column 7: end position

{out}/taxonomic_profile_{i}.txt

Taxonomic profile for each sample
