JSON output file

martinghunt edited this page Jan 3, 2024 · 28 revisions

This page describes the contents of the gzipped JSON file log.json.gz, made when running viridian run_one_sample.

The file contains a dictionary of run details. The main entries (keys) are:

  • run_summary - has high-level details on the run
  • stages_completed - the progress of each main stage in the pipeline
  • reads - high-level summary of read counts
  • read_depth - genome coverage and read depth information
  • amplicon_scheme_name - name of the identified amplicon scheme
  • scheme_choice - details of the amplicon scheme scoring
  • amplicons - details of the amplicon scheme that was used
  • self_qc - details of read pileup information at each masked position
  • sequences - consensus sequence and variations (for MSAs and tree building)

Please read on for more details about the contents of each of those entries.
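
As a starting point, here is a minimal sketch of reading the file with the Python standard library. The function names `load_viridian_log` and `missing_keys` are made up for this example and are not part of Viridian. The list of keys is copied from the list above.

```python
import gzip
import json

# Top-level keys described on this page
EXPECTED_KEYS = [
    "run_summary",
    "stages_completed",
    "reads",
    "read_depth",
    "amplicon_scheme_name",
    "scheme_choice",
    "amplicons",
    "self_qc",
    "sequences",
]


def load_viridian_log(path):
    """Read a gzipped JSON log file and return its contents as a dict."""
    # "rt" opens the gzip stream in text mode, which is what json.load expects
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def missing_keys(log):
    """Return the expected top-level keys that are absent from the log."""
    return [key for key in EXPECTED_KEYS if key not in log]
```

For example, `log = load_viridian_log("OUT/log.json.gz")` would load the log from an output directory called OUT (a placeholder path). A run that stopped early may not have written every key, so `missing_keys(log)` can be a quick first check.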

run_summary

This section is a dictionary with a basic summary of the run. Here is an example (most of the key/value pairs in the options dictionary are omitted for brevity):

"last_stage_completed": "Finished",
"command": "viridian run_one_sample ... full command line used",
"options": {
  "debug": false,
  "outdir": "OUT",
  "force": false
},
"cwd": "/foo/bar/",
"version": "1.1.0",
"finished_running": true,
"start_time": "2023-09-08T13:37:59+00:00",
"end_time": "2023-09-08T13:39:28+00:00",
"hostname": "thehoff",
"result": "Success",
"errors": [],
"temp_processing_dir": "/tmp/viridian.rxs2ttki",
"total_amplicons": 98,
"successful_amplicons": 98,
"consensus_length": 29836,
"consensus_N_count": 96,
"consensus_N_percent": 0.32,
"consensus_ACGT_count": 29740,
"consensus_ACGT_percent": 99.68,
"consensus_het_count": 0,
"consensus_het_percent": 0.0,
"run_time": "0:01:29.384867"

The most important thing to check is:

"result": "Success"

meaning that the run finished successfully. If instead it says "Fail", then something went wrong and the details will be in the stages_completed section. The other entries should be self-explanatory.
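
This check can be sketched in Python as follows, assuming `log` is the dictionary obtained by decompressing and parsing log.json.gz. The function name `check_run` is hypothetical, and the failure message format is only a suggestion. It uses the "result" and "errors" keys from the example above.

```python
def check_run(log):
    """Return (ok, message) describing whether the run succeeded."""
    summary = log.get("run_summary", {})
    result = summary.get("result")
    if result == "Success":
        return True, "Run succeeded"

    # On failure, gather what the log offers for diagnosis:
    # the last stage that completed and any recorded errors.
    stages = log.get("stages_completed", [])
    last_stage = stages[-1] if stages else "none"
    errors = summary.get("errors", [])
    message = f"result={result!r}; last stage completed: {last_stage}; errors: {errors}"
    return False, message
```

The function uses `.get()` with defaults throughout, so that it still returns a message rather than raising an exception if the log was written by a run that stopped early.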

stages_completed

This is a list of the stages that were completed. Each time a stage finishes, the JSON file is written, so that if Viridian crashes or is killed, you can see the last stage that was run.

A successful run looks like this:

"stages_completed": [
  "1/10 Start pipeline (0.0s)",
  "2/10 Process amplicon scheme files (0.1s)",
  "3/10 Map reads to reference (36.8s)",
  "4/10 Detect amplicon scheme (2.7s)",
  "5/10 Sample reads (23.3s)",
  "6/10 Initial consensus sequence (6.1s)",
  "7/10 Initial VCF and MSA of consensus/reference (0.4s)",
  "8/10 QC using reads vs consensus sequence (17.9s)",
  "9/10 Final QC checks (0.1s)",
  "10/10 Tidy up final files and log (1.0s)",
  "Finished"
]

The entries can vary depending on the command line options. For example, if a BAM file of mapped reads was provided, then the "Map reads to reference" stage would not be present. However, the final entry for a successful run is always "Finished".
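
The sketch below shows one way to check for "Finished" and to pull out the per-stage timings in Python. The pattern "N/M stage name (X.Xs)" is inferred from the example above and is an assumption, not a documented format, so entries that do not match it are skipped. The function name `parse_stages` is hypothetical.

```python
import re

# Assumed entry layout, inferred from the example: "3/10 Map reads to reference (36.8s)"
STAGE_RE = re.compile(r"^(\d+)/(\d+) (.+) \((\d+(?:\.\d+)?)s\)$")


def parse_stages(stages):
    """Return (finished, timings) where timings is a list of (name, seconds)."""
    # A successful run always ends with the entry "Finished"
    finished = bool(stages) and stages[-1] == "Finished"
    timings = []
    for entry in stages:
        match = STAGE_RE.match(entry)
        if match:  # skips "Finished" and anything in an unexpected format
            timings.append((match.group(3), float(match.group(4))))
    return finished, timings
```

Sorting the returned timings by the seconds value gives a quick view of which stages took the longest.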

reads

The "reads" section is a dictionary of summary statistics of the reads. Here is an example for paired Illumina reads:

"reads": {
  "unpaired_reads": 0,
  "reads1": 337637,
  "reads2": 337637,
  "total_reads": 675274,
  "mapped": 667394,
  "match_any_amplicon": 328507,
  "read_lengths": {
    "250": 20,
    "251": 675254
  }
}

The meaning of these should be clear, except for match_any_amplicon. For unpaired reads, this is simply the number of reads that matched to any amplicon in the chosen amplicon scheme (not all schemes under consideration). For paired reads, it is the number of read pairs that matched, since both reads within a pair must be considered together when matching to an amplicon: their order and orientation are important.

The "read_lengths" dictionary is a count of the number of reads of each given read length. In that example, there were 20 reads of length 250, and the remaining 675254 reads all had length 251.
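
Here is a short Python sketch that derives a couple of extra numbers from this section. Note that the keys of "read_lengths" are strings, because JSON object keys are always strings, so they need converting before doing arithmetic. The function name `read_stats` is hypothetical. The percent mapped figure is a derived value that is not in the file, and it assumes "mapped" counts individual reads in the same way as "total_reads".

```python
def read_stats(reads):
    """Derive simple statistics from the "reads" dictionary."""
    total = reads["total_reads"]
    # Convert the string keys ("250", "251", ...) to integer lengths
    lengths = {int(length): count for length, count in reads["read_lengths"].items()}
    counted = sum(lengths.values())
    total_bases = sum(length * count for length, count in lengths.items())
    return {
        # Assumes "mapped" is a count of individual reads
        "percent_mapped": round(100 * reads["mapped"] / total, 2) if total else 0.0,
        "mean_read_length": total_bases / counted if counted else 0.0,
        "reads_counted_by_length": counted,
    }
```

In the example above, the counts in "read_lengths" add up to the "total_reads" value (20 + 675254 = 675274).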

read_depth

This has a summary of the read depth and genome coverage. Here is an example:

"read_depth": {
  "depth_at_least": {
    "1": 29865,
    "2": 29862,
    "5": 29862,
    "10": 29836,
    "15": 29836,
    "20": 29794,
    "50": 29600,
    "100": 29600
  },
  "percent_at_least_x_depth": {
    "1": 99.87,
    "2": 99.86,
    "5": 99.86,
    "10": 99.78,
    "15": 99.78,
    "20": 99.64,
    "50": 98.99,
    "100": 98.99
  },
  "mean_depth": 5470.33,
  "mode_depth": 7393,
  "median_depth": 5051
}

These are all based on read mapping to the genome, without using any information on amplicon schemes. The mean, mode and median depths are calculated with respect to the entire genome (amplicon schemes do not cover the whole genome). In that example, 99.64% of the genome (29794bp) had at least 20X read depth. This is the value used during QC (the options --coverage_min_x and --coverage_min_pc), where by default Viridian requires at least 50 percent of the genome to have at least 20X read depth.
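
The same kind of check can be repeated on a finished log, for example with stricter thresholds. The Python sketch below is only an illustration of the rule described above and is not Viridian's own implementation. The function name `passes_coverage` is hypothetical, and its defaults mirror the defaults stated above (20X and 50 percent). It can only use depth cutoffs that appear as keys in "percent_at_least_x_depth".

```python
def passes_coverage(read_depth, min_x=20, min_pc=50.0):
    """Return True if at least min_pc percent of the genome has depth >= min_x."""
    percents = read_depth["percent_at_least_x_depth"]
    key = str(min_x)  # JSON object keys are strings
    if key not in percents:
        available = ", ".join(sorted(percents, key=int))
        raise KeyError(f"No entry for depth {min_x}; available depths: {available}")
    return percents[key] >= min_pc
```

With the example values above, the default check passes because 99.64 is at least 50. Asking for 99.7 percent of the genome at 20X would fail.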

amplicon_scheme_name

TBC

scheme_choice

TBC

amplicons

self_qc

TBC

sequences

TBC
