Customizing the projects.json file

Art Poon edited this page Jun 9, 2017 · 3 revisions

Customizing projects.json

Summary

The projects.json file is a JSON file that controls the MiCall pipeline by:

  1. Defining the seed references, where a seed reference is a reference sequence against which reads are mapped in the preliminary mapping stage.
  2. Grouping seed references into seed groups, such that MiCall selects a single reference from each seed group to move forward from the preliminary mapping stage to the iterative remapping stage.
  3. Defining the coordinate references that are used to extract and interpret nucleotide and amino acid frequencies.
  4. Defining how each seed reference is partitioned into regions according to the coordinate reference system; typically, these regions are genes that we want to pull out of a whole genome sequence.

JSON Structure

The JSON file has the following structure:

  • projects
    • max_variants: integer, number of sequence variants for aln2counts:write_nuc_variants to output
    • regions
      • coordinate_region: string reference to a coordinate reference in regions
      • seed_region_names: string reference to a seed reference in regions
  • regions
    • is_nucleotide: boolean
    • reference: list of strings comprising a nucleotide or amino acid sequence
    • seed_group: string or null

Separating the definition of individual regions from the regions field within each entry in projects facilitates a many-to-one mapping, such that a defined region may be used in more than one project.
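The layout above can be sketched as a Python dictionary that mirrors the JSON. This is only an illustration under stated assumptions: the names demo-project, demo-seed, demo-coord and demo-group are hypothetical, the sequences are short placeholders, and the sketch assumes that projects and the top-level regions are objects keyed by name, that a project's regions field is a list of entries, and that seed_region_names holds a list of names. The helper find_missing_regions is not part of MiCall; it just shows how the name references could be checked.

```python
import json

# Hypothetical miniature of projects.json; names and sequences are placeholders.
config = {
    "projects": {
        "demo-project": {
            "max_variants": 5,
            "regions": [
                {
                    "coordinate_region": "demo-coord",
                    # Assumption: a list of seed names (the field name is plural).
                    "seed_region_names": ["demo-seed"],
                }
            ],
        }
    },
    "regions": {
        "demo-seed": {
            "is_nucleotide": True,
            "reference": ["ATGGCTAAAGGT"],
            "seed_group": "demo-group",
        },
        "demo-coord": {
            "is_nucleotide": False,
            "reference": ["MAKG"],
            "seed_group": None,  # serialized as JSON null
        },
    },
}


def find_missing_regions(config):
    """Return region names that projects refer to but regions does not define."""
    defined = set(config["regions"])
    missing = set()
    for project in config["projects"].values():
        for entry in project["regions"]:
            names = [entry["coordinate_region"], *entry["seed_region_names"]]
            missing.update(name for name in names if name not in defined)
    return sorted(missing)


print(json.dumps(config, indent=2))
print(find_missing_regions(config))  # an empty list when every name resolves
```

Because regions are defined once at the top level and referred to by name, a second project could list demo-coord again without repeating its sequence.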

Regions

Both seed and coordinate reference sequences are defined by region entries in the JSON file. Typically, a seed reference is a nucleotide sequence (is_nucleotide=true) and may be assigned to a seed group. In contrast, a coordinate reference is typically an amino acid sequence (is_nucleotide=false) and has no seed group assignment (null).
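As a rough illustration of how seed group assignments partition the region entries, the sketch below collects region names by their seed_group value and skips entries whose seed group is null. The region names, group names and sequences are hypothetical placeholders, and group_seeds is an illustrative helper, not a MiCall function.

```python
from collections import defaultdict

# Hypothetical region entries; only the fields described above are included.
regions = {
    "seed-a1": {"is_nucleotide": True, "reference": ["ATGGCT"], "seed_group": "group-a"},
    "seed-a2": {"is_nucleotide": True, "reference": ["ATGGCA"], "seed_group": "group-a"},
    "seed-b1": {"is_nucleotide": True, "reference": ["ATGAAA"], "seed_group": "group-b"},
    "coord-x": {"is_nucleotide": False, "reference": ["MA"], "seed_group": None},
}


def group_seeds(regions):
    """Map each seed group name to the sorted names of its member regions."""
    groups = defaultdict(list)
    for name, region in regions.items():
        if region["seed_group"] is not None:  # null means no seed group
            groups[region["seed_group"]].append(name)
    return {group: sorted(names) for group, names in groups.items()}


for group, names in sorted(group_seeds(regions).items()):
    print(group, names)
```

Each key in the result corresponds to one seed group, from which a single reference would be carried forward to the remapping stage.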

The nucleotide or amino acid sequence is specified as a double-quoted string. The convention in MiCall is to break the sequence up into comma-separated substrings of at most 65 characters each:

  "ATAGGACAAGGAATTTGTAGAGCTATTTTAAACATACCTAGAAGAATCAGACAGGGCCTCGAAAG",
  "AGCTTTGCTATAA"

These strings are the elements of a JSON array, which is delimited by square brackets [...].
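A minimal sketch of this convention, assuming the substrings are simply concatenated in order to recover the full sequence. The helpers split_reference and join_reference are illustrative names, not MiCall functions; the sequence is the one from the example above.

```python
import json

MAX_CHUNK = 65  # maximum substring length given in the convention above


def split_reference(sequence, width=MAX_CHUNK):
    """Split a sequence into consecutive substrings of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def join_reference(chunks):
    """Concatenate the substrings back into one sequence."""
    return "".join(chunks)


sequence = (
    "ATAGGACAAGGAATTTGTAGAGCTATTTTAAACATACCTAGAAGAATCAGACAGGGCCTCGAAAG"
    "AGCTTTGCTATAA"
)
chunks = split_reference(sequence)

# json.dumps renders the list as a JSON array of double-quoted strings.
print(json.dumps(chunks, indent=2))
```

Joining the chunks with join_reference gives back the original sequence, so the line breaks carry no meaning beyond readability.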

Projects

A project is defined by a set of seed references and a map of these references to regions within coordinate references (coordinate regions). A project may comprise more than one coordinate reference. For example,

Cookbook

Define a simple project with single seed and coordinate references