For each PacBio dataset (Movie ID), we compared yield at Q30 for ccs (baseline) and for DeepConsensus v0.2, v0.3, v1.0, and v1.1.
Movie ID | Sample | Chemistry | Mean insert size |
---|---|---|---|
m64011_181218_235052 | HG002 | 1 | 11 kb |
m64008_201124_002822 | HG002 | 2.2 | 15 kb |
m64014_200920_132517 | HG002 | 2.2 | 24 kb |
version | movie | dataset | num_reads_ccs | num_reads | yield@emQ20 | yield@emQ20/ccs | yield@emQ30 | yield@emQ30/ccs | yield@emQ40 | yield@emQ40/ccs | hours |
---|---|---|---|---|---|---|---|---|---|---|---|
v1.1 | m64011_181218_235052 | chem1_11kb | 1,392,300 | 1,557,424 | 17.18 Gb | 111.83% | 12.14 Gb | 137.40% | 5.10 Gb | 208.66% | 233.02 |
v1.1 | m64008_201124_002822 | chem2.2_15kb | 2,687,977 | 2,899,794 | 42.97 Gb | 108.49% | 32.74 Gb | 128.44% | 9.64 Gb | 241.85% | 567.41 |
v1.1 | m64014_200920_132517 | chem2.2_24kb | 1,918,627 | 2,087,945 | 49.74 Gb | 109.94% | 32.52 Gb | 194.41% | 2.73 Gb | 1058.0% | 724.48 |
`yield@emQ30/ccs`, or "Yield at empirical Q30 relative to CCS", is calculated as follows:
- Filter DeepConsensus output to predicted Q20.
- For each read, align it to the truth and calculate identity from that alignment: identity = # matches / (# matches + # mismatches + # insertions + # deletions).
- Take all the reads that have identity >= 0.999 (this is Q30).
- Because longer reads are more useful than shorter reads, we count the total bases and not just the number of reads.
- Next, we repeat the above for the original CCS reads (run with default params, i.e. Q20 filtered), then subtract and divide to get a percentage. For example, 40% means that DeepConsensus increased the yield of high-quality reads, measured in bases, by 40% over CCS.
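The steps above can be sketched in Python as follows. This is a minimal illustration, not the evaluation code used to produce the table: the `AlignedRead` container and the function names are hypothetical, the per-read alignment counts are assumed to be already available, and read length is assumed to be matches + mismatches + insertions. The relative yield follows the "subtract and divide" wording above.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    """Alignment counts for one read against the truth (hypothetical container)."""
    matches: int
    mismatches: int
    insertions: int
    deletions: int

    @property
    def length(self) -> int:
        # Assumption: bases in the read are those that consume the read sequence.
        return self.matches + self.mismatches + self.insertions

    @property
    def identity(self) -> float:
        # identity = matches / (matches + mismatches + insertions + deletions)
        aligned = self.matches + self.mismatches + self.insertions + self.deletions
        return self.matches / aligned if aligned else 0.0


def min_identity_for_q(q: float) -> float:
    """Convert a Phred-scaled quality to an identity threshold (Q30 -> 0.999)."""
    return 1.0 - 10 ** (-q / 10)


def yield_at_q(reads, q: float = 30) -> int:
    """Total bases in reads whose empirical identity meets the Q threshold."""
    threshold = min_identity_for_q(q)
    return sum(r.length for r in reads if r.identity >= threshold)


def relative_yield_percent(dc_bases: int, ccs_bases: int) -> float:
    """Percentage change in yield of DeepConsensus over CCS (subtract, then divide)."""
    return 100.0 * (dc_bases - ccs_bases) / ccs_bases


# Toy data, not real results: one DeepConsensus read passes Q30, one does not.
dc_reads = [AlignedRead(9999, 1, 0, 0), AlignedRead(4990, 5, 3, 2)]
ccs_reads = [AlignedRead(4999, 0, 1, 0)]
dc_yield = yield_at_q(dc_reads, q=30)
ccs_yield = yield_at_q(ccs_reads, q=30)
print(dc_yield, ccs_yield, relative_yield_percent(dc_yield, ccs_yield))
```

Both read sets are assumed to have been filtered to predicted Q20 before this step, as described in the first bullet.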
These were run on GCP `n1-standard-16` machines with no GPU (in 500 shards,
combined above), with `--batch_zmws=100 --batch_size=1024`, which is generally
what we recommend. For more information on compute setups, see the
runtime metrics page.
The `--skip_windows_above` option (introduced in v0.3) allows DeepConsensus to
skip windows whose average CCS base qualities are already above a certain
quality threshold. Skipped windows adopt the CCS sequence without correction.
This saves runtime, but there is a yield tradeoff, shown in
this chart for m64014_200920_132517-chr20:
The default in v1.1 is Q45, but you can adjust this level using
`--skip_windows_above`.
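To make the skipping rule concrete, here is a minimal Python sketch of the decision for a single window. It is not the DeepConsensus implementation: the function names are hypothetical, and it assumes the window average is taken over error probabilities and converted back to a Phred score (a plain arithmetic mean of the Phred values would be the other obvious choice). It also assumes "above" means strictly greater than the threshold.

```python
import math


def mean_phred(quals):
    """Average Phred qualities via mean error probability (assumed method)."""
    if not quals:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_error)


def should_skip_window(ccs_quals, skip_windows_above=45):
    """True if the window's average CCS quality already exceeds the threshold."""
    return mean_phred(ccs_quals) > skip_windows_above


def polish_window(ccs_bases, ccs_quals, correct_fn, skip_windows_above=45):
    """Keep the CCS sequence for skipped windows, otherwise run the model.

    correct_fn is a hypothetical stand-in for the model's correction step.
    """
    if should_skip_window(ccs_quals, skip_windows_above):
        return ccs_bases  # adopt CCS sequence without correction
    return correct_fn(ccs_bases)


# A uniformly high-quality window is skipped; one poor base pulls the average down.
print(should_skip_window([50] * 100))          # high-quality window
print(should_skip_window([50] * 99 + [10]))    # one low-quality base
```

Under this assumed averaging, a single low-quality base is enough to bring a window below the threshold, so lowering `--skip_windows_above` skips more windows and trades yield for runtime.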