Customizing the projects.json file
The projects.json file is (obviously) a JSON file. This file controls the MiCall pipeline by:
- Defining the seed references, where a seed reference is a reference sequence to which we perform the preliminary mapping of reads
- Grouping seed references into seed groups, such that MiCall will select a single reference from a seed group to move forward from the preliminary mapping stage to the iterative remapping stage.
- Defining the coordinate references that are used to extract and interpret nucleotide and amino acid frequencies
- Defining how each seed reference is partitioned into regions according to the coordinate reference system; typically, these regions are genes that we want to pull out of a whole genome sequence.
The JSON file has the following structure:
- projects
  - max_variants: integer, number of sequence variants for aln2counts:write_nuc_variants to output
  - regions
    - coordinate_region: string reference to a coordinate reference in regions
    - seed_region_names: string reference to a seed reference in regions
- regions
  - is_nucleotide: boolean
  - reference: list of strings comprising a nucleotide or amino acid sequence
  - seed_group: string or null
Separating the definition of individual regions from the regions field within each entry in projects facilitates a many-to-one mapping, such that a defined region may be used in more than one project.
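The structure above can be sketched as a Python dictionary that serializes to JSON. This is only an illustration, not a copy of MiCall's real file: the project names (ProjectA, ProjectB), region names (SEED-1, COORD-1), seed group name, and sequences are hypothetical placeholders. It also assumes that projects and regions are objects keyed by name, that a project's regions field is a list, and that seed_region_names holds a list of names; the field list above describes it only as a string reference, so check an existing projects.json before relying on the container types. The helper undefined_region_names is likewise hypothetical.

```python
import json

# Hypothetical names and placeholder sequences, for illustration only.
# Assumptions: "projects" and "regions" are objects keyed by name, a project's
# "regions" is a list, and "seed_region_names" is a list of region names.
config = {
    "projects": {
        "ProjectA": {
            "max_variants": 5,
            "regions": [
                {"coordinate_region": "COORD-1", "seed_region_names": ["SEED-1"]}
            ],
        },
        "ProjectB": {
            "max_variants": 0,
            "regions": [
                {"coordinate_region": "COORD-1", "seed_region_names": ["SEED-1"]}
            ],
        },
    },
    "regions": {
        "SEED-1": {
            "is_nucleotide": True,
            "reference": ["ATGGCCAAA", "TTTGGG"],
            "seed_group": "GROUP-1",
        },
        "COORD-1": {
            "is_nucleotide": False,
            "reference": ["MAKFG"],
            "seed_group": None,  # serialized as JSON null
        },
    },
}


def undefined_region_names(config):
    """Return region names used by projects but missing from top-level regions."""
    defined = set(config["regions"])
    missing = set()
    for project in config["projects"].values():
        for entry in project["regions"]:
            names = [entry["coordinate_region"], *entry["seed_region_names"]]
            missing.update(name for name in names if name not in defined)
    return sorted(missing)


# Both projects refer to the same two region definitions by name.
text = json.dumps(config, indent=2)
```

Because both hypothetical projects point at the same region names, the sequences are written once under the top-level regions and reused, which is the many-to-one mapping described above.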
Both seed and coordinate reference sequences are defined by region entries in the JSON file. Typically, a seed reference is a nucleotide sequence (is_nucleotide=true) and may be assigned to a seed group. In contrast, a coordinate reference is typically an amino acid sequence (is_nucleotide=false) and has no seed group assignment (null).
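As a rough sketch of how these two fields could be used together, the function below collects seed references by their seed group. It relies on the typical convention described above (nucleotide sequence plus a non-null seed_group), which is a convention rather than a rule the file format is known to enforce. The function name group_seeds, the region names, and the sequences are hypothetical, and regions is assumed to be an object keyed by region name.

```python
from collections import defaultdict


def group_seeds(regions):
    """Map each seed group name to the names of the seed references in it.

    A region counts as a seed reference here if it is a nucleotide sequence
    with a non-null seed_group (the typical case, assumed for this sketch).
    """
    groups = defaultdict(list)
    for name, region in regions.items():
        if region["is_nucleotide"] and region["seed_group"] is not None:
            groups[region["seed_group"]].append(name)
    return dict(groups)


# Hypothetical region entries; None corresponds to JSON null.
regions = {
    "SEED-1a": {"is_nucleotide": True, "reference": ["ATGGCC"], "seed_group": "GROUP-1"},
    "SEED-1b": {"is_nucleotide": True, "reference": ["ATGGCA"], "seed_group": "GROUP-1"},
    "COORD-1": {"is_nucleotide": False, "reference": ["MA"], "seed_group": None},
}

seed_groups = group_seeds(regions)
```

Here the two hypothetical seeds land in the same group, and the coordinate reference is left out because its seed_group is null.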
The nucleotide or amino acid sequence is specified as one or more double-quoted strings. The convention in MiCall is to break a sequence up into comma-separated substrings of at most 65 characters each:
"ATAGGACAAGGAATTTGTAGAGCTATTTTAAACATACCTAGAAGAATCAGACAGGGCCTCGAAAG",
"AGCTTTGCTATAA"
These strings are contained in a JSON array, delimited by square brackets [...].
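The 65-character convention is easy to apply with a short script. The sketch below splits a sequence into chunks for the reference array and joins them back; the function names split_reference and join_reference are hypothetical helpers, not part of MiCall, and the sequence is the one from the example above.

```python
import json


def split_reference(sequence, width=65):
    """Break a sequence into substrings of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def join_reference(chunks):
    """Rebuild the full sequence from the list stored in the JSON file."""
    return "".join(chunks)


sequence = (
    "ATAGGACAAGGAATTTGTAGAGCTATTTTAAACATACCTAGAAGAATCAGACAGGGCCTCGAAAG"
    "AGCTTTGCTATAA"
)
chunks = split_reference(sequence)

# json.dumps writes the Python list as a JSON array in square brackets.
reference_json = json.dumps(chunks, indent=2)
```

Joining the chunks with an empty separator recovers the original sequence, so the line breaks are purely a formatting convention.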
A project is defined by a set of seed references and a map of these references to regions within coordinate references (coordinate regions). A project may comprise more than one coordinate reference. For example,