Peter Hofmann edited this page Dec 9, 2015 · 21 revisions

The following file format definitions are used in data exchange between stages of the CAMI pipeline.

##Genome annotation

####Input

####Configuration File

This file contains the options and values required for the pipeline to run. The path to this file is a required program argument. The following arguments in this file expect a file path.

  • reference_genomes_file
    Tab separated data table. It maps each genome id to the file path of a reference genome with a known taxonomic id.

    • Column 1: Genome id
    • Column 2: file path
  • reference_genomes_map_file
    Tab separated data table. It maps each genome id to a taxonomic id.

    • Column 1: Genome id
    • Column 2: NCBI taxonomic ID
  • input_reference_fna_file
    Fasta formatted sequence file. Contains marker genes from the reference data of known genomes. It is optional and usually left empty. If set to an empty file, marker genes of the reference genomes will not be extracted.

  • input_genomes_file
    Tab separated data table. It maps each genome id to the file path of a genome whose taxonomic id is unknown.

    • Column 1: Genome id
    • Column 2: file path
  • mothur_ref_distances
    Tab separated data table. It contains the pairwise distances between sequences, up to a maximum distance.

    • Column 1: Sequence id
    • Column 2: Sequence id
    • Column 3: Distance
  • mothur_alignment_ref.fasta
    Fasta formatted file generated from an arb-silva alignment. Contains aligned SSU sequences.

  • mothur_ref_names
    Generated by mothur using the unique() command

  • map.tsv
    Tab separated data table. It maps internal ids to original sequence ids and, if available, taxonomic ids.

    • Column 1: Internal sequence id
    • Column 2: Original sequence id
    • Column 3: NCBI taxonomic ID (Silva), if known
    • Column 4: NCBI taxonomic ID (EMBL), if known
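
The two-column tables listed above (for example reference_genomes_file or input_genomes_file) can be loaded with a few lines of Python. This is a minimal sketch, not the pipeline's own code: the function name `read_two_column_map`, the example paths, and the decisions to skip blank lines and reject duplicate ids are assumptions made for illustration.

```python
import io


def read_two_column_map(handle):
    """Read a headerless tab-separated table into {column 1: column 2}."""
    mapping = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected 2 tab-separated columns")
        key, value = fields[0], fields[1]
        if key in mapping:
            # assumption: a genome id may appear only once
            raise ValueError(f"line {line_number}: duplicate id {key!r}")
        mapping[key] = value
    return mapping


# Hypothetical content of a reference_genomes_file
example = io.StringIO(
    "genome_A\t/data/reference/genome_A.fna\n"
    "genome_B\t/data/reference/genome_B.fna\n"
)
genome_paths = read_two_column_map(example)
print(genome_paths["genome_B"])
```

The same function would also fit reference_genomes_map_file, where the second column holds an NCBI taxonomic ID instead of a path.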

####Output

Files available when the run has finished.

####meta_data.tsv

Tab separated data table with the annotation of genomes. It is the only file later used for the metagenome simulation.
Columns have no fixed order. The first row must contain the column names.

  • genome_ID: Original genome id
  • prediction_threshold: The relative genome distance threshold at which the taxonomic classification was made.
  • NCBI_ID: Taxonomic classification, given as an NCBI taxonomic id
  • SCIENTIFIC_NAME: Scientific name of taxonomic classification
  • novelty_category: Novelty category of a genome
  • OTU: Id of genomes that were clustered together
  • ANI: Average nucleotide identity to the closest reference genome
  • ANI_NOVELTY_CATEGORY: Novelty category based on ANI
  • ANI_TAXONOMIC_COMPARE: Taxonomic id of closest reference genome
  • ANI_SCIENTIFIC_NAME: Scientific name of closest reference genome
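
Because the columns have no fixed order, a reader of meta_data.tsv should look fields up by name rather than by position. The sketch below shows one way to do that with Python's `csv.DictReader`. The function name `read_meta_data`, the requirement that a `genome_ID` column is present, and all values in the example rows are assumptions for illustration only.

```python
import csv
import io


def read_meta_data(handle):
    """Index the rows of a meta_data.tsv-style table by genome_ID."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or "genome_ID" not in reader.fieldnames:
        raise ValueError("a header row with a 'genome_ID' column is required")
    return {row["genome_ID"]: row for row in reader}


# Made-up rows; the columns are deliberately not in the order listed above.
example = io.StringIO(
    "OTU\tgenome_ID\tNCBI_ID\tnovelty_category\n"
    "7\tgenome_A\t562\tknown_strain\n"
    "9\tgenome_B\t1280\tnew_species\n"
)
meta = read_meta_data(example)
print(meta["genome_B"]["NCBI_ID"])
```

All values are returned as strings; converting numeric columns such as ANI is left to the caller.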

####16S_rRNA.fna

Fasta formatted file. Contains accepted marker genes.
Sequence ids are internal ids found in the 'id_mapping.tsv' file.

####16S_rRNA.fna.rejected.fna

Fasta formatted file. Contains rejected marker genes.
Sequence ids are internal ids found in the 'id_mapping.tsv' file.

####id_mapping.tsv

Tab separated data table. It maps internal ids to original sequence ids and, if available, taxonomic ids.

  • Column 1: Internal id
  • Column 2: Original id
  • Column 3: NCBI taxonomic ID (Silva), if known
  • Column 4: NCBI taxonomic ID (EMBL), if known

####mothur_cluster_16S_rRNA.list

Tab separated data table. This is an output of mothur. The first row must contain the column names.

  • Column 1 'label': Relative genome distance thresholds. Example: unique, 0.01, 0.02, 0.03
  • Column 2 'numOtus': Number of groups (OTUs)
  • Column 3+ 'Otu<index>': Comma separated lists of internal ids

Example:
label numOtus Otu001
unique 518 SR_517,SR_518,SR_462
0.001 469 SR_517,SR_518,SR_462
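
The layout described above can be read into a per-threshold list of OTUs. The following is a minimal sketch under stated assumptions: the function name `read_mothur_list` is made up, columns are taken to be separated by tabs, and 'numOtus' is not checked against the number of OTU columns because the example above is abbreviated.

```python
import io


def read_mothur_list(handle):
    """Return {label: [[internal ids of one OTU], ...]} from a mothur list table."""
    header = handle.readline().rstrip("\r\n").split("\t")
    if header[:2] != ["label", "numOtus"]:
        raise ValueError("first row must start with 'label' and 'numOtus'")
    clusters = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # assumption: rows without any OTU column are skipped
        label = fields[0]
        # Every column from the third one on is one OTU: a comma separated id list.
        clusters[label] = [otu.split(",") for otu in fields[2:] if otu]
    return clusters


# The example rows from above, with tabs between the columns
example = io.StringIO(
    "label\tnumOtus\tOtu001\n"
    "unique\t518\tSR_517,SR_518,SR_462\n"
    "0.001\t469\tSR_517,SR_518,SR_462\n"
)
clusters = read_mothur_list(example)
print(clusters["0.001"][0])
```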

##Metagenome Simulation

####Input

####Configuration File

This file contains the options and values required for the pipeline to run. The path to this file is a required program argument. The following arguments in this file expect a file path.

  • id_to_genome_file
    Tab separated data table. It maps each genome id to the file path of a genome.

    • Column 1: Genome id
    • Column 2: file path
  • id_to_gff_file
    Tab separated data table. It maps each genome id to the file path of the gene annotation of that genome.

    • Column 1: Genome id
    • Column 2: file path

####Output

####{out}/distributions/distribution_{i}.txt

'{i}' is the index of each sample that is to be generated.

  • Column 1: genome_ID
  • Column 2: abundance

'genome_ID' is the identifier of the genome used.
'abundance' is the relative abundance of the genome to be simulated. It reflects the relative number of genome copies, not the amount of genetic data the genome contributes.
For example, in a set of two genomes that both have an abundance of 0.5, where one genome is twice the size of the other, the larger genome accounts for about 67% (two thirds) of the genetic data in the simulated metagenome.
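
The relationship between abundance and the share of genetic data can be checked with a short calculation: each genome's abundance is weighted by its size and then normalised. The sketch below reproduces the two-genome case; the function name `data_share` and the genome sizes are assumptions chosen for illustration.

```python
def data_share(abundance, genome_size):
    """Fraction of the genetic data contributed by each genome.

    abundance: {genome_ID: relative number of genome copies}
    genome_size: {genome_ID: genome length in base pairs}
    """
    weighted = {g: abundance[g] * genome_size[g] for g in abundance}
    total = sum(weighted.values())
    if total <= 0:
        raise ValueError("abundances and genome sizes must give a positive total")
    return {g: w / total for g, w in weighted.items()}


# Two genomes with equal abundance; genome_A is twice the size of genome_B.
abundance = {"genome_A": 0.5, "genome_B": 0.5}
genome_size = {"genome_A": 4_000_000, "genome_B": 2_000_000}
shares = data_share(abundance, genome_size)
print(round(shares["genome_A"], 3))  # 0.667
```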

####{out}/source_genomes/*.fna

All given genomes are copied into this folder. During copying, sequence names are checked for uniqueness and renamed if required. Comments and descriptions of sequences are removed.

####{out}/source_genomes/sequence_id_map.txt

This file contains a list of replaced sequence ids.

  • Column 1: genome_ID
  • Column 2: original sequence id
  • Column 3: new sequence id
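
To illustrate how unique sequence names and the three columns of sequence_id_map.txt relate, here is a hedged sketch. It is not the pipeline's actual renaming logic: the function name `assign_unique_ids` and the `_<n>` suffix scheme are hypothetical. It only shows the general idea of dropping descriptions, detecting id clashes across genomes, and recording the replacements.

```python
def assign_unique_ids(genome_id, headers, seen):
    """Return (genome_ID, original id, new id) for each FASTA header.

    headers: header lines without the leading '>'.
    seen: set of ids already in use; it is updated in place.
    """
    result = []
    for header in headers:
        parts = header.split(None, 1)
        if not parts:
            raise ValueError("empty FASTA header")
        original_id = parts[0]  # the description after the first whitespace is dropped
        new_id = original_id
        suffix = 1
        while new_id in seen:
            new_id = f"{original_id}_{suffix}"  # hypothetical naming scheme
            suffix += 1
        seen.add(new_id)
        result.append((genome_id, original_id, new_id))
    return result


seen = set()
rows = assign_unique_ids("genome_A", ["contig1 plasmid, complete", "contig2"], seen)
rows += assign_unique_ids("genome_B", ["contig1"], seen)
# Only ids that actually changed would be listed in the map file.
replaced = [row for row in rows if row[1] != row[2]]
print(replaced)
```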

####{out}/internal/genome_locations.tsv

Lists the file paths of the genome copies in the 'source_genomes' folder of the output directory.

  • Column 1: genome_ID
  • Column 2: file path

####{out}/internal/meta_data.tsv

Merged meta data of the genomes of each community that are actually used in the simulation.

####{out}/internal/unused_c{i}_{original_name}.tsv

Unused meta data of genomes of every community.

####{out}/sample_{i}/bam/{genome_id}.bam

BAM files built from the reads produced by the read simulator.

####{out}/sample_{i}/fastq/*.fq

If anonymization is not done, the original fastq files are placed here.

####{out}/sample_{i}/fastq/anonymous_reads.fq

If anonymization is done, this will be the only fastq file.

####{out}/sample_{i}/fastq/reads_mapping.tsv

Mapping of reads for evaluation.

  • Column 1: anonymous read id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: read id
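
For evaluation, the four columns above can be loaded into named records and summarised per genome. This is a minimal sketch, not part of the pipeline: the names `ReadMapping` and `read_reads_mapping` and the example ids are made up, and skipping blank lines and lines starting with '#' is an assumption, since this page does not state whether the file has a header row.

```python
import io
from collections import Counter, namedtuple

ReadMapping = namedtuple("ReadMapping", "anonymous_read_id genome_id tax_id read_id")


def read_reads_mapping(handle):
    """Parse a reads_mapping.tsv-style table into a list of ReadMapping records."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' lines carry no data
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"line {line_number}: expected 4 columns, got {len(fields)}")
        records.append(ReadMapping(*fields))
    return records


# Made-up rows for illustration
example = io.StringIO(
    "read_0001\tgenome_A\t562\tcontig1-15\n"
    "read_0002\tgenome_B\t1280\tcontig7-3\n"
    "read_0003\tgenome_A\t562\tcontig1-16\n"
)
records = read_reads_mapping(example)
reads_per_genome = Counter(record.genome_id for record in records)
print(reads_per_genome["genome_A"])
```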

####{out}/sample_{i}/anonymous_gsa.fasta

Fasta file with the perfect assembly of the reads of this sample.

####{out}/sample_{i}/gsa_mapping.tsv

Mapping of contigs for evaluation.

  • Column 1: anonymous contig id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: sequence id of the original genome (in 'source_genomes' folder)
  • Column 5: number of reads used in the contig
  • Column 6: start position
  • Column 7: end position

####{out}/anonymous_gsa_pooled.fasta

Fasta file with the perfect assembly of the reads from all samples.

####{out}/gsa_pooled_mapping.tsv

Mapping of contigs from pooled reads for evaluation.

  • Column 1: anonymous_contig_id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: sequence id of the original genome (in 'source_genomes' folder)
  • Column 5: number of reads used in the contig
  • Column 6: start position
  • Column 7: end position
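
Both contig mapping tables (per sample and pooled) share the seven-column layout above, so one reader can serve both. The sketch below sums column 5, the number of reads used in each contig, per genome. The function name `reads_per_genome_from_contigs` and the example rows are hypothetical, and skipping lines that start with '#' is an assumption about a possible header row.

```python
import io
from collections import defaultdict


def reads_per_genome_from_contigs(handle):
    """Sum the 'number of reads' column of a gsa mapping table per genome id."""
    totals = defaultdict(int)
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' lines carry no data
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"line {line_number}: expected 7 columns, got {len(fields)}")
        genome_id = fields[1]            # column 2: genome id
        totals[genome_id] += int(fields[4])  # column 5: number of reads in the contig
    return dict(totals)


# Made-up rows for illustration
example = io.StringIO(
    "contig_001\tgenome_A\t562\tseq1\t120\t1\t5000\n"
    "contig_002\tgenome_A\t562\tseq1\t30\t6000\t7500\n"
    "contig_003\tgenome_B\t1280\tseq9\t45\t10\t2100\n"
)
totals = reads_per_genome_from_contigs(example)
print(totals)
```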

####{out}/taxonomic_profile_{i}.txt

Taxonomic profile for each sample.

