For each PacBio dataset (Movie ID), we compared yield at Q30 for `ccs` (baseline) and for DeepConsensus v0.2, v0.3, v1.0, v1.1, and v1.2.
Movie ID | Sample | Chemistry | Mean insert size |
---|---|---|---|
m64011_181218_235052 | HG002 | 1 | 11 kb |
m64008_201124_002822 | HG002 | 2.2 | 15 kb |
m64014_200920_132517 | HG002 | 2.2 | 24 kb |
version | movie | dataset | num_reads_ccs | num_reads | yield@emQ20 | yield@emQ20/ccs | yield@emQ30 | yield@emQ30/ccs | yield@emQ40 | yield@emQ40/ccs | hours |
---|---|---|---|---|---|---|---|---|---|---|---|
v1.2 | m64011_181218_235052 | chem1_11kb | 1,392,300 | 1,552,566 | 17.16 Gb | 111.72% | 12.17 Gb | 137.81% | 5.32 Gb | 217.55% | 219.39 |
v1.2 | m64008_201124_002822 | chem2.2_15kb | 2,687,977 | 2,894,238 | 43.00 Gb | 108.55% | 33.06 Gb | 129.70% | 10.35 Gb | 259.46% | 532.03 |
v1.2 | m64014_200920_132517 | chem2.2_24kb | 1,918,627 | 2,083,487 | 49.75 Gb | 109.96% | 32.92 Gb | 196.82% | 3.11 Gb | 1203.8% | 661.91 |
`yield@emQ30/ccs`, or "Yield at empirical Q30 relative to CCS", is calculated as follows:
- Filter DeepConsensus output to predicted Q20.
- For each read, align it to the truth and calculate identity from that alignment: identity = # matches / (# matches + # mismatches + # insertions + # deletions).
- Take all the reads that have identity >= 0.999 (this is Q30).
- Because longer reads are more useful than shorter reads, we count the total bases and not just the number of reads.
- Next, we repeat the above for the original CCS reads (run with default parameters, i.e. Q20-filtered), then subtract and divide the two totals to get a percentage. For example, 40% means that DeepConsensus increased the yield of high-quality reads, measured in bases, by 40% over CCS.
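The steps above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the function names (`identity`, `q_to_identity`, `yield_at_empirical_q`, `relative_increase`) and the `(length, identity)` read representation are assumptions made for the example, and the per-read alignment counts are assumed to come from an aligner run beforehand.

```python
def identity(matches, mismatches, insertions, deletions):
    """Alignment identity as defined in the list above."""
    aligned = matches + mismatches + insertions + deletions
    return matches / aligned if aligned else 0.0


def q_to_identity(q):
    """Convert a Phred-scaled quality to the matching identity (Q30 -> 0.999)."""
    return 1.0 - 10 ** (-q / 10)


def yield_at_empirical_q(reads, q):
    """Total bases in reads whose identity meets the threshold for `q`.

    `reads` is an iterable of (read_length, identity) pairs (hypothetical layout).
    Bases are summed rather than reads counted, so longer reads contribute more.
    """
    threshold = q_to_identity(q)
    return sum(length for length, ident in reads if ident >= threshold)


def relative_increase(dc_yield, ccs_yield):
    """Percentage increase of DeepConsensus yield over CCS (subtract, then divide)."""
    return 100.0 * (dc_yield - ccs_yield) / ccs_yield


# Toy data for illustration only, not real measurements.
dc_reads = [(14000, 0.9995), (10000, 0.9950)]
ccs_reads = [(10000, 0.9995), (14000, 0.9950)]
dc_q30 = yield_at_empirical_q(dc_reads, 30)
ccs_q30 = yield_at_empirical_q(ccs_reads, 30)
print(f"{relative_increase(dc_q30, ccs_q30):.0f}% increase over CCS")
```

Both read sets are assumed to have already been filtered to predicted Q20, as in the first step of the list.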
These were run on GCP `n1-standard-16` machines with no GPU (in 500 shards,
combined above), with `--batch_zmws=100 --batch_size=1024`. For recommendations
on the optimal runtime setting and compute setups, see the
runtime metrics page.
The `--skip_windows_above` option (introduced in v0.3) allows DeepConsensus to
skip windows whose average CCS base qualities are already above a certain
quality threshold. The windows that are skipped just adopt the CCS sequence
without correction. This saves runtime, but there is a yield tradeoff, shown in
this chart for m64014_200920_132517-chr20:
The default in v1.2 is Q45, but you can adjust this level using `--skip_windows_above`.
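To make the skipping rule concrete, here is a rough Python sketch of a per-window decision. It is not DeepConsensus code: the helper names `avg_phred` and `should_skip_window` are hypothetical, and both the averaging method (in error-probability space rather than a plain mean of Phred scores) and the strict `>` comparison are assumptions.

```python
import math


def avg_phred(quals):
    """Average Phred qualities via their error probabilities (assumed method)."""
    if not quals:
        return 0.0
    probs = [10 ** (-q / 10) for q in quals]
    return -10 * math.log10(sum(probs) / len(probs))


def should_skip_window(ccs_quals, skip_windows_above=45):
    """True if the window would keep the CCS sequence without correction."""
    return avg_phred(ccs_quals) > skip_windows_above


# A uniformly high-quality window is skipped; one low-quality base
# pulls the average down enough that the window is corrected instead.
print(should_skip_window([50] * 100))
print(should_skip_window([50] * 99 + [10]))
```

Under this assumed averaging, a single poor base dominates the window average, so lowering the threshold skips more windows and saves more runtime at the cost of yield.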