
JSON output file

martinghunt edited this page Dec 13, 2021 · 28 revisions

This page describes the contents of the file log.json, which is written when running viridian_workflow run_one_sample.

The main entries in the file are:

  • run_summary - has high-level details on the run
  • read_and_primer_stats - high-level read counts (total reads, mapped reads, etc.) and amplicon scheme identification details
  • amplicon_scheme_name - the name of the amplicon scheme that was used (more details are at the end of the section on read_and_primer_stats)
  • read_sampling - read depths and related information for each amplicon
  • viridian - details of the results of consensus calling using Viridian
  • self_qc - this is to be implemented.

Please read on below for more details about the contents of each of those entries.

run_summary

An example run_summary entry is:

"run_summary": {
    "last_stage_completed": "Finished",
    "command": "viridian_workflow run_one_sample --tech illumina --ref_fasta ref.fa --reads1 reads_1.fastq.gz --reads2 reads_2.fastq.gz --outdir OUT",
    "options": {
      "debug": false,
      ... etc listing all the command line options ...
    },
    "cwd": "/hps/nobackup/iqbal/mhunt/Covid_test_data_20210813.VWF.20211213.d1932ec1ea/Thielen",
    "version": "0.1.1",
    "finished_running": true,
    "start_time": "2021-12-13T10:51:03",
    "end_time": "2021-12-13T10:54:49",
    "hostname": "myhost",
    "result": "Success",
    "run_time": "0:03:46.060333"

This should be mostly self-explanatory.

The file is written at several stages during the pipeline. Initially, result is Unknown. The example above shows how the entry looks at the end of a successful run; the key thing is that result says Success. If the pipeline detects a problem during the run, result is instead a list of error messages. For example, if too many amplicons have too few reads to reliably call a consensus, the pipeline stops and the output contains:

"result": ["Too many amplicons are too low depth. STOPPING"]

read_and_primer_stats

This section contains information on mapping all the original input reads to the reference genome, and attempting to allocate them to amplicon(s). Here is an example for paired Illumina reads:

"read_and_primer_stats": {
    "unpaired_reads": 0,
    "reads1": 489949,
    "reads2": 489949,
    "total_reads": 979898,
    "mapped": 971562,
    "match_any_amplicon": 486271,
    "read_lengths": {
      "149": 1023,
      "150": 692551,
      ... etc. key=length, value=number of reads ...
    },
    "amplicon_scheme_set_matches": {
      "COVID-ARTIC-V3;COVID-ARTIC-V4;COVID-MIDNIGHT-1200": 83644,
      "COVID-ARTIC-V3;COVID-MIDNIGHT-1200": 298897,
      "COVID-ARTIC-V3": 84823,
      "COVID-ARTIC-V3;COVID-ARTIC-V4": 18654,
      "COVID-MIDNIGHT-1200": 252,
      "COVID-ARTIC-V4": 1
    },
    "amplicon_scheme_simple_counts": {
      "COVID-ARTIC-V3": 486018,
      "COVID-ARTIC-V4": 102299,
      "COVID-MIDNIGHT-1200": 382793
    },
    "chosen_amplicon_scheme": "COVID-ARTIC-V3"
  }

The first few entries contain read counts. These are paired reads, so there are counts for forward reads (reads1) and reverse reads (reads2), and total_reads is the sum of the forward and reverse reads. For unpaired nanopore reads, the read count would be in reads, the reads1 and reads2 values would be zero, and reads would equal total_reads. The mapped entry is simply the total number of mapped reads. The read_lengths entry is a histogram mapping read length to number of reads (it includes all reads, whether mapped or not).
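
As an illustration of using these counts, the sketch below computes the fraction of reads that mapped and the most common read length. The helper name read_summary is hypothetical, and the read_lengths histogram here contains only the two lengths shown in the example above, so it is truncated.

```python
def read_summary(stats):
    """Return (fraction of reads mapped, most common read length).

    Hypothetical helper operating on the read_and_primer_stats entry.
    """
    total = stats["total_reads"]
    mapped_fraction = stats["mapped"] / total if total else 0.0
    # JSON object keys are strings, so convert lengths to integers.
    lengths = {int(length): count for length, count in stats["read_lengths"].items()}
    modal_length = max(lengths, key=lengths.get) if lengths else None
    return mapped_fraction, modal_length


# Values copied from the example above; read_lengths is truncated there.
stats = {
    "total_reads": 979898,
    "mapped": 971562,
    "read_lengths": {"149": 1023, "150": 692551},
}
mapped_fraction, modal_length = read_summary(stats)
print(f"{mapped_fraction:.2%} of reads mapped; most common length {modal_length}")
```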

The match_any_amplicon entry counts reads if the reads are unpaired, and fragments (read pairs) if the reads are paired. It is the number of reads or read pairs that match any amplicon from any of the amplicon schemes under consideration. For read pairs, the entire fragment (ie from the start of the left read to the end of the right read) is considered, which is why the count is of read pairs and not individual reads.

Since amplicon positions can overlap between amplicon schemes, a read (pair) can be allocated to zero, one, or more than one amplicon. The entry amplicon_scheme_set_matches shows the number of reads matching different combinations of schemes. For example, a read could match amplicon 1 from ARTIC V3 and amplicon 1 from Midnight-1200, and in this case the counter for "COVID-ARTIC-V3;COVID-MIDNIGHT-1200" would be incremented.

The entry amplicon_scheme_simple_counts shows the number of reads allocated to each amplicon scheme, ignoring combinations. For example, a read matching both COVID-ARTIC-V3 and COVID-MIDNIGHT-1200 would result in the counters for both those schemes being incremented.

Finally, the entry chosen_amplicon_scheme shows the amplicon scheme that was chosen. Currently the naive method of taking the scheme with the most counts from amplicon_scheme_simple_counts is used. This may change in the future.
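
The relationship between the three entries can be checked against the example values above. The sketch below rebuilds amplicon_scheme_simple_counts from amplicon_scheme_set_matches by splitting each key on ";", then takes the scheme with the largest count. It is an illustration only, not the pipeline's own code; the function names are made up, and how the pipeline breaks ties is not documented here.

```python
from collections import Counter


def simple_counts(set_matches):
    """Add each combination's count to every scheme named in its key."""
    counts = Counter()
    for combination, n_reads in set_matches.items():
        for scheme in combination.split(";"):
            counts[scheme] += n_reads
    return dict(counts)


def choose_scheme(counts):
    """Naive choice: the scheme with the most counts.

    Ties are resolved by dictionary order here, which is an assumption.
    """
    return max(counts, key=counts.get)


# Values copied from the read_and_primer_stats example above.
set_matches = {
    "COVID-ARTIC-V3;COVID-ARTIC-V4;COVID-MIDNIGHT-1200": 83644,
    "COVID-ARTIC-V3;COVID-MIDNIGHT-1200": 298897,
    "COVID-ARTIC-V3": 84823,
    "COVID-ARTIC-V3;COVID-ARTIC-V4": 18654,
    "COVID-MIDNIGHT-1200": 252,
    "COVID-ARTIC-V4": 1,
}
counts = simple_counts(set_matches)
print(counts)
print(choose_scheme(counts))
```

With the example data this reproduces the amplicon_scheme_simple_counts values shown above (486018, 102299 and 382793) and selects COVID-ARTIC-V3.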

Note that there is a top-level entry in the JSON called amplicon_scheme_name. This is the scheme that was actually used. It will usually be the same as chosen_amplicon_scheme. However, if the option to force the scheme was used (--force_amp_scheme) then amplicon_scheme_name will be that forced choice, regardless of the result in chosen_amplicon_scheme.
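
When processing many samples it can be useful to flag runs where the scheme actually used differs from the automatic choice. A minimal sketch, assuming a parsed log.json dictionary; the helper name scheme_in_use is hypothetical and the input below is invented for illustration, not taken from a real run.

```python
def scheme_in_use(log):
    """Return (scheme used, automatically chosen scheme, whether they differ)."""
    used = log["amplicon_scheme_name"]
    chosen = log["read_and_primer_stats"]["chosen_amplicon_scheme"]
    return used, chosen, used != chosen


# Illustrative input only: a hypothetical run where the scheme was forced
# to COVID-ARTIC-V4 although COVID-ARTIC-V3 had the most counts.
example = {
    "amplicon_scheme_name": "COVID-ARTIC-V4",
    "read_and_primer_stats": {"chosen_amplicon_scheme": "COVID-ARTIC-V3"},
}
used, chosen, differs = scheme_in_use(example)
if differs:
    print(f"Scheme {used} was used, although {chosen} had the most counts")
```

Note that a forced scheme can also coincide with the automatic choice, so equal values do not prove that --force_amp_scheme was not used.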

read_sampling

This section contains details of mapped reads, depths, and sampled reads from each amplicon in the chosen amplicon scheme.

It looks like this:

"read_sampling": {
    "nCoV-2019_1_pool1": {
      "start": 31,
      "end": 410,
      "total_mapped_bases": 2627260,
      "total_depth": 6913.84,
      "sampled_bases": 380387,
      "sampled_depth": 1001.02,
      "pass": true
    },
    "nCoV-2019_2_pool2": {
    ... details for this amplicon ...
    },
    ... remaining amplicons ...
}

Each key is an amplicon name, and each value has the results for that amplicon. The start and end coordinates of the amplicon in the reference genome are in start and end. These are 1-based inclusive coordinates. total_mapped_bases is the total number of bases from the reads mapped to that amplicon (ie we exclude trimmed bases). total_depth is the mean depth of the amplicon from all of the reads. sampled_bases is the total length of the sampled reads, and sampled_depth is the mean depth of the sampled reads.

After sampling, the pipeline checks if each amplicon has at least 10X mean read depth (ie sampled_depth at least 10). The entry pass is set to true for amplicons that do have enough depth, and false otherwise.
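
The sketch below lists amplicons that failed the depth check and recomputes mean depth as bases divided by amplicon length. The depth formula is an assumption: it is not stated on this page, but it reproduces the example values above when the length is taken as end - start + 1 (1-based inclusive coordinates). The function names are made up for illustration.

```python
def amplicon_length(entry):
    """Length of an amplicon given 1-based inclusive start/end coordinates."""
    return entry["end"] - entry["start"] + 1


def mean_depth(n_bases, entry):
    """Assumed formula: number of bases divided by amplicon length."""
    return n_bases / amplicon_length(entry)


def failed_amplicons(read_sampling):
    """Names of amplicons whose pass entry is false."""
    return [name for name, entry in read_sampling.items() if not entry["pass"]]


# First amplicon copied from the read_sampling example above.
read_sampling = {
    "nCoV-2019_1_pool1": {
        "start": 31,
        "end": 410,
        "total_mapped_bases": 2627260,
        "total_depth": 6913.84,
        "sampled_bases": 380387,
        "sampled_depth": 1001.02,
        "pass": True,
    },
}
entry = read_sampling["nCoV-2019_1_pool1"]
print(round(mean_depth(entry["total_mapped_bases"], entry), 2))
print(round(mean_depth(entry["sampled_bases"], entry), 2))
print(failed_amplicons(read_sampling))
```

For this amplicon the recomputed depths round to 6913.84 and 1001.02, matching total_depth and sampled_depth, and no amplicon is reported as failed.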

viridian

self_qc

To be completed
