JSON output file
This page describes the contents of the gzipped JSON file log.json.gz, made when running viridian run_one_sample.
The file contains a dictionary of run details. The main entries (keys) are:
- run_summary - has high-level details on the run
- stages_completed - the progress of each main stage in the pipeline
- reads - high-level summary of read counts
- read_depth - genome coverage and read depth information
- amplicon_scheme_name - name of the identified amplicon scheme
- scheme_choice - details of the amplicon scheme scoring
- amplicons - details of the amplicon scheme that was used
- self_qc - details of read pileup information at each masked position
- sequences - consensus sequence and variations (for MSAs and tree building)
Please read on for more details about the contents of each of those entries.
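As a minimal sketch of how the file can be read, the following uses only the Python standard library to load the gzipped JSON and report which of the main entries listed above are missing. The function names load_viridian_log and missing_main_entries are hypothetical helpers for illustration, not part of Viridian, and the path to log.json.gz depends on where your run wrote its output.

```python
import gzip
import json

# Main entries (keys) listed on this page
MAIN_ENTRIES = [
    "run_summary",
    "stages_completed",
    "reads",
    "read_depth",
    "amplicon_scheme_name",
    "scheme_choice",
    "amplicons",
    "self_qc",
    "sequences",
]


def load_viridian_log(path):
    """Read a gzipped JSON log file and return its contents as a dictionary."""
    # "rt" opens the gzip stream in text mode so json.load can read it directly
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def missing_main_entries(log):
    """Return the main entries that are absent from the loaded log."""
    return [key for key in MAIN_ENTRIES if key not in log]
```

A log from a run that crashed early may lack some entries, so checking for missing keys before reading them avoids a KeyError.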
The "run_summary" section is a dictionary with a basic summary of the run. Here is an example (most of the key/value pairs in the options dictionary are omitted for brevity):
"last_stage_completed": "Finished",
"command": "viridian run_one_sample ... full command line used",
"options": {
"debug": false,
"outdir": "OUT",
"force": false,
},
"cwd": "/foo/bar/",
"version": "1.1.0",
"finished_running": true,
"start_time": "2023-09-08T13:37:59+00:00",
"end_time": "2023-09-08T13:39:28+00:00",
"hostname": "thehoff",
"result": "Success",
"errors": [],
"temp_processing_dir": "/tmp/viridian.rxs2ttki",
"total_amplicons": 98,
"successful_amplicons": 98,
"consensus_length": 29836,
"consensus_N_count": 96,
"consensus_N_percent": 0.32,
"consensus_ACGT_count": 29740,
"consensus_ACGT_percent": 99.68,
"consensus_het_count": 0,
"consensus_het_percent": 0.0,
"run_time": "0:01:29.384867"
The most important thing to check is "result": "Success", meaning that the run finished successfully. If it instead says "Fail", then something went wrong and the details will be in the stages_completed section.
The other entries should be self-explanatory.
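The check described above could be sketched as follows. The function check_run is a hypothetical helper, not part of Viridian. It assumes the run_summary dictionary and stages_completed list have already been loaded from the log, and it also reports the "errors" list from the example, on the assumption that it may be populated when a run fails.

```python
def check_run(run_summary, stages_completed):
    """Return (ok, message) based on the "result" entry of run_summary."""
    if run_summary.get("result") == "Success":
        return True, "Run finished successfully"
    # On failure, the last stage written to the log is the place to start looking
    last_stage = stages_completed[-1] if stages_completed else "no stage recorded"
    errors = run_summary.get("errors", [])
    return False, f"Run failed; last stage: {last_stage}; errors: {errors}"
```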
The "stages_completed" section is a list of the stages that were completed. Each time a stage finishes, the JSON file is written, so that if Viridian crashes or is killed, you can see the last stage that was run.
A successful run looks like this:
"stages_completed": [
"1/10 Start pipeline (0.0s)",
"2/10 Process amplicon scheme files (0.1s)",
"3/10 Map reads to reference (36.8s)",
"4/10 Detect amplicon scheme (2.7s)",
"5/10 Sample reads (23.3s)",
"6/10 Initial consensus sequence (6.1s)",
"7/10 Initial VCF and MSA of consensus/reference (0.4s)",
"8/10 QC using reads vs consensus sequence (17.9s)",
"9/10 Final QC checks (0.1s)",
"10/10 Tidy up final files and log (1.0s)",
"Finished"
]
The entries can vary depending on the command line options. For example, if a BAM file of mapped reads was provided, then the "Map reads to reference" stage would not be present. However, the final entry for a successful run is always "Finished".
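To see where the time went, the stage entries can be parsed for their timings. This is a rough sketch: the "N/M name (X.Xs)" layout is inferred from the example above rather than from a documented format, so entries that do not match are skipped. The name parse_stages is hypothetical.

```python
import re

# Matches entries such as "3/10 Map reads to reference (36.8s)"
STAGE_PATTERN = re.compile(r"^(\d+)/(\d+) (.+) \(([\d.]+)s\)$")


def parse_stages(stages_completed):
    """Return (finished, timings), where timings is a list of (stage name, seconds)."""
    # A successful run always ends with the entry "Finished"
    finished = bool(stages_completed) and stages_completed[-1] == "Finished"
    timings = []
    for entry in stages_completed:
        match = STAGE_PATTERN.match(entry)
        if match:
            timings.append((match.group(3), float(match.group(4))))
    return finished, timings
```

Sorting the returned timings by their second element shows the slowest stages first.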
The "reads" section is a dictionary of summary statistics of the reads. Here is an example for paired Illumina reads:
"reads": {
"unpaired_reads": 0,
"reads1": 337637,
"reads2": 337637,
"total_reads": 675274,
"mapped": 667394,
"match_any_amplicon": 328507,
"read_lengths": {
"250": 20,
"251": 675254
}
}
The meaning of these should be clear, except for match_any_amplicon. For unpaired reads, this is simply the number of reads that matched any amplicon in the chosen amplicon scheme (not all schemes under consideration). For paired reads, it is the number of read pairs that matched, since both reads within a pair must be considered together when matching to an amplicon: their order and orientation are important.
The "read_lengths" dictionary is a count of the number of reads of each given read length. In that example, there were 20 reads of length 250, and the remaining 675254 reads all had length 251.
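A few derived numbers can be useful when comparing samples. The sketch below computes the number of reads counted in "read_lengths", the mean read length, and the fraction of reads that mapped. The function summarise_reads and the keys of the dictionary it returns are hypothetical, chosen for illustration. Note that JSON keys are strings, so the read lengths are converted to integers first.

```python
def summarise_reads(reads):
    """Compute simple derived statistics from the "reads" dictionary."""
    # JSON object keys are strings, so convert read lengths to integers
    lengths = {int(length): count for length, count in reads["read_lengths"].items()}
    counted = sum(lengths.values())
    mean_length = (
        sum(length * count for length, count in lengths.items()) / counted
        if counted
        else 0.0
    )
    total = reads["total_reads"]
    mapped_fraction = reads["mapped"] / total if total else 0.0
    return {
        "reads_counted": counted,
        "mean_read_length": mean_length,
        "mapped_fraction": mapped_fraction,
    }
```

For the example above, the counts in "read_lengths" sum to 675274, which matches total_reads.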
The "read_depth" section has a summary of the read depth and genome coverage. Here is an example:
"read_depth": {
"depth_at_least": {
"1": 29865,
"2": 29862,
"5": 29862,
"10": 29836,
"15": 29836,
"20": 29794,
"50": 29600,
"100": 29600
},
"percent_at_least_x_depth": {
"1": 99.87,
"2": 99.86,
"5": 99.86,
"10": 99.78,
"15": 99.78,
"20": 99.64,
"50": 98.99,
"100": 98.99
},
"mean_depth": 5470.33,
"mode_depth": 7393,
"median_depth": 5051
}
These are all based on read mapping to the genome, without using any information on amplicon schemes. The mean, mode and median depths are calculated with respect to the entire genome (amplicon schemes do not cover the whole genome). In that example, 99.64% of the genome (29794bp) had at least 20X read depth. This is the value used during QC (the options --coverage_min_x and --coverage_min_pc), where by default Viridian requires at least 50 percent of the genome to have at least 20X read depth.
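The coverage check described above can be reproduced from the log. This sketch mirrors the description on this page, not Viridian's own code. The function name passes_coverage_check is hypothetical, its defaults follow the values stated above (20X and 50 percent), and treating "at least" as greater than or equal to is an assumption.

```python
def passes_coverage_check(read_depth, min_x=20, min_pc=50.0):
    """Return True if at least min_pc percent of the genome has depth >= min_x."""
    # Keys are strings in the JSON, and only the depths shown in the log are
    # available (1, 2, 5, 10, 15, 20, 50, 100 in the example); any other
    # value of min_x raises a KeyError
    percent = read_depth["percent_at_least_x_depth"][str(min_x)]
    return percent >= min_pc
```

With the example values, 99.64 percent of the genome has at least 20X depth, so the default check passes.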
The "amplicon_scheme_name" entry is simply a key/value pair with the chosen amplicon scheme, for example:
"amplicon_scheme_name": "COVID-ARTIC-V3"
The "scheme_choice" section has details of the amplicon scheme scores, and which scheme was chosen as best matching the reads. Example:
"scheme_choice": {
"scores": {
"COVID-ARTIC-V3": 4902,
"COVID-ARTIC-V4.1": 808,
"COVID-ARTIC-V5.0-5.3.2_400": 293,
"COVID-ARTIC-V5.0-5.2.0_1200": 184,
"COVID-MIDNIGHT-1200": 320,
"COVID-AMPLISEQ-V1": -193,
"COVID-VARSKIP-V1a-2b": 59
},
"best_schemes": [
"COVID-ARTIC-V3"
],
"best_score": 4902,
"best_scheme": "COVID-ARTIC-V3",
"score_ratio": 0.16
}
In that example, the best scheme was COVID-ARTIC-V3, with a score of 4902. The second-best scheme was COVID-ARTIC-V4.1 with a score of 808. The ratio of these (score_ratio) was 808/4902 = 0.16.
By default, the best score needs to be at least 250, and the ratio no more than 0.5 (options --min_scheme_score and --max_scheme_ratio).
The best_schemes entry is a list to allow for the extremely unlikely (and never seen!) case that two schemes score equally well. If this happened, then the score ratio would be 1 and the run would be halted.
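The arithmetic behind score_ratio and the two thresholds can be sketched from the "scores" dictionary. This illustrates the description on this page and is not Viridian's implementation. The function name evaluate_scheme_scores is hypothetical, the defaults are the values stated above, and rounding the ratio to two decimal places is an assumption based on the example output.

```python
def evaluate_scheme_scores(scores, min_scheme_score=250, max_scheme_ratio=0.5):
    """Rank schemes by score and apply the score and ratio thresholds."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_scheme, best_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0
    # Ratio of second-best to best score; undefined if the best score is not positive
    ratio = second_score / best_score if best_score > 0 else float("inf")
    return {
        "best_scheme": best_scheme,
        "best_score": best_score,
        "score_ratio": round(ratio, 2),
        "passes": best_score >= min_scheme_score and ratio <= max_scheme_ratio,
    }
```

For the example scores this gives COVID-ARTIC-V3 with a ratio of 808/4902, which rounds to 0.16, and both thresholds are met. Two schemes with equal top scores would give a ratio of 1 and fail the check.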
TBC
TBC