
New subsampling algorithm ("maximum overlap") #2

Open · averagehat opened this issue Mar 25, 2015 · 1 comment

@averagehat

As stated in the TODO section, the current sampling method falls victim to picking additional alignments when the minimum depth has already been reached at that reference index. i.e. if your average read depth is 150 and you are subsampling to a depth of 10, the first position will have depth 10, but position 2 will be covered by those 10 reads plus potentially 10 more. Position 3 will then contain 10 + 10 + 10, and so on.

These reads can stack for the entire length of the reads selected at position 1, making the upper depth bound of the random-subsample approach min_depth * len(read).

An iterative solution which only picks a read when needed (and backtracks if necessary) can have an upper bound of 2 * min_depth; some overlap may be necessary in a case like: (insert diagram).
We avoid higher depth by always picking the read with the greatest end coordinate (POS + len(sequence)). This allows the user to more precisely define the desired depth. A sketch follows below.
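Here is a minimal Python sketch of that greedy selection, omitting the backtracking step. Reads are modeled as hypothetical (pos, length) tuples rather than real BAM alignments, and max_overlap_subsample is an illustrative name, not an existing function in this repo:

```python
import heapq

def max_overlap_subsample(reads, min_depth, ref_length):
    """Greedy "maximum overlap" subsampling sketch (no backtracking).

    reads      -- list of (pos, length) tuples (0-based start, read length)
    min_depth  -- target depth at every reference index
    ref_length -- length of the reference sequence

    Walks the reference left to right; whenever depth at the current
    index drops below min_depth, it selects the covering read with the
    greatest end coordinate (pos + length).
    """
    by_pos = sorted(reads, key=lambda r: r[0])
    depth = [0] * ref_length
    selected = []
    heap = []   # max-heap on end coordinate (stored negated)
    nxt = 0     # next unseen read in by_pos
    for idx in range(ref_length):
        # make all reads starting at or before idx available
        while nxt < len(by_pos) and by_pos[nxt][0] <= idx:
            pos, length = by_pos[nxt]
            heapq.heappush(heap, (-(pos + length), pos, length))
            nxt += 1
        # discard reads that end at or before idx (they cannot cover it)
        while heap and -heap[0][0] <= idx:
            heapq.heappop(heap)
        # top up depth at idx, always taking the longest-reaching read
        while depth[idx] < min_depth and heap:
            _, pos, length = heapq.heappop(heap)
            selected.append((pos, length))
            for j in range(pos, min(pos + length, ref_length)):
                depth[j] += 1
    return selected
```

Because a read is only selected when depth at the current index is deficient, each index can accumulate at most min_depth already-selected reads plus min_depth newly selected ones, which is where the 2 * min_depth bound comes from.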

NOTES:

  • Always picking the read with the greatest overlap (longest read), and almost never picking short reads, may affect variant calling.
  • The iterative algorithm (maybe?) lends itself to brief, tall spikes, because it always picks the read with the greatest overlap. This minimizes the sum of all overflow (?). However, it may be possible and desirable to use an algorithm which spreads the overflow more evenly, producing "hills" rather than "spikes."
@necrolyte2
Member

There may be cases where you have a bam that contains reads from multiple platforms such as:

  • Roche 454 (avg read length ~600bp, lots of reads)
  • IonTorrent (avg read length ~?bp, lots of reads)
  • MiSeq (avg read length ~250-300bp, lots of reads)
  • Sanger (longer read lengths >600bp, but single read)
  • PacBio (really long read lengths >1000bp? maybe single read?)

So, in these scenarios I suspect that everything would still work out, as the algorithm "should" pick the PacBio and Sanger reads first since they would be the longest.

I'm not sure if we really need to worry about the effect on variant calling, as we would assert that the data has already been quality filtered, so all bases are equal-ish.
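For illustration, reusing the hypothetical max_overlap_subsample sketch from above on mixed-platform read lengths shows the longest reads being selected first:

```python
# Hypothetical mixed-platform reads as (pos, length) pairs.
reads = [(0, 1200)]                          # PacBio-like single long read
reads += [(0, 650)]                          # Sanger-like single read
reads += [(i * 10, 250) for i in range(20)]  # MiSeq-like short reads

picked = max_overlap_subsample(reads, min_depth=2, ref_length=1400)
print(picked[:2])  # the two longest reads win first: [(0, 1200), (0, 650)]
```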
