Problem

The current implementation of the Bamboozle barcode function doesn't work

samtools phase appears to be giving good results based on the first test

K-mer table per strain per allele, built from the reads mapping to the site (barcodesearch_v2.py)
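
A minimal sketch of what such a table could look like, assuming the reads for each strain and allele have already been extracted as plain sequence strings. The function name `kmer_table`, the nested dict layout and the choice of k are hypothetical and are not taken from barcodesearch_v2.py.

```python
from collections import Counter


def kmer_table(reads, k):
    """Count every k-mer of length k across a list of read sequences."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing uncalled bases
                counts[kmer] += 1
    return counts


# Hypothetical input: reads grouped by strain, then by allele
reads_by_strain_allele = {
    "strain_A": {"allele_1": ["ACGTAC", "CGTACG"], "allele_2": ["ACGTTC"]},
}

# table[strain][allele] is a Counter mapping k-mer -> count
table = {
    strain: {allele: kmer_table(reads, k=4) for allele, reads in alleles.items()}
    for strain, alleles in reads_by_strain_allele.items()
}
```

Raw counts like these are exactly where PCR duplicates would inflate a k-mer, so counting after duplicate marking may be worth comparing.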

Will PCR introduce bias when counting?

Need to generate test data to verify that the pipeline works

~

Looking at the Live2Tell data, the barcoding function has failed to return the desired results
When viewed in IGV, many of the mapped reads are highlighted in red, meaning that the insert size is larger than expected

Is this a quirk of the view?
If the insert size is restricted, will this result in certain variants not being found?

When assembling regions identified as unique between strains in Bamboozle, the regions have turned out to be identical in some strains

Strains 112 and 113 in particular seem to be very similar
Use bowtie2's --maxins/--minins flags?
- Will certain alleles be unreported in this instance?
Use bcftools consensus's -I flag to report variants using ambiguity codes?
- Levenshtein distance couldn't be used in hom vs. het situations
  - Max/min number of differences could be done later if needed
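
A rough sketch of the max/min idea, assuming two consensus sequences of equal length that are already aligned position by position. This is a simple position-wise comparison, not Levenshtein distance, and `min_max_differences` is a hypothetical helper name.

```python
# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def min_max_differences(seq_a, seq_b):
    """Return (min, max) number of differing positions between two
    equal-length sequences that may contain IUPAC ambiguity codes."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    min_diff = max_diff = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        set_a, set_b = set(IUPAC[a]), set(IUPAC[b])
        if not set_a & set_b:
            # No base in common: a difference whichever allele is chosen
            min_diff += 1
            max_diff += 1
        elif len(set_a) > 1 or len(set_b) > 1:
            # Overlapping but ambiguous: may or may not differ
            max_diff += 1
    return min_diff, max_diff
```

For example, a het site coded R (A/G) against a hom A site counts towards the maximum only, since the two could still agree.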
How does a reference-based assembly look in this case? Would a longer repeat region be assembled as intended?

Solutions?

Map the reads to the reference, but limit the fragment size to avoid repeat region issues

If mapping has already been done, check the BAM header and flash a warning if the details don't add up
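
One possible way to do the header check, sketched under the assumption that the BAM was produced by bowtie2 and that the header text (e.g. from `samtools view -H`) still carries the @PG line with the original command line in its CL field. The function names are hypothetical.

```python
import shlex
import warnings


def maxins_from_header(header_text):
    """Return the -X/--maxins value recorded in a bowtie2 @PG line, or None."""
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        # Header fields are tab-separated TAG:value pairs
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "bowtie2" not in fields.get("PN", fields.get("ID", "")):
            continue
        args = shlex.split(fields.get("CL", "").strip('"'))
        for i, arg in enumerate(args):
            # Note: -X (maxins) is case-sensitive; -x is the index
            if arg in ("-X", "--maxins") and i + 1 < len(args):
                return int(args[i + 1])
            if arg.startswith("--maxins="):
                return int(arg.split("=", 1)[1])
    return None


def check_header(header_text, expected_maxins):
    """Warn and return False if the recorded maxins differs from the expected value."""
    found = maxins_from_header(header_text)
    if found != expected_maxins:
        warnings.warn(
            f"BAM header reports maxins={found}, expected {expected_maxins}"
        )
        return False
    return True
```

A return value of None means the flag was not on the command line, in which case bowtie2 would have used its documented default of 500.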

Map only concordant pairs, i.e. no singletons or discordant pairs?
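
At mapping time bowtie2's --no-mixed and --no-discordant options should cover this. For an existing alignment, a minimal sketch of the same test on the SAM FLAG field, with `is_concordant` as a hypothetical helper:

```python
# SAM FLAG bits (from the SAM specification)
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def is_concordant(flag):
    """True if the read is paired, mapped in a proper pair, and neither
    the read nor its mate is unmapped."""
    return bool(
        flag & PAIRED
        and flag & PROPER_PAIR
        and not flag & UNMAPPED
        and not flag & MATE_UNMAPPED
    )
```

This is roughly what `samtools view -f 2` selects, with the unmapped checks added as a safeguard.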

Get an average (median?) coverage for the genome, for use with the DP field later
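
A small sketch of the mean vs. median question, assuming per-position depths in the three-column, tab-separated format written by `samtools depth` (chromosome, position, depth). `coverage_summary` is a hypothetical name.

```python
import statistics


def coverage_summary(depth_lines):
    """Return (mean, median) depth from `samtools depth`-style lines."""
    depths = [int(line.split("\t")[2]) for line in depth_lines if line.strip()]
    return statistics.mean(depths), statistics.median(depths)


# Toy example: one position with a repeat-like pile-up of reads
lines = ["chr1\t1\t10", "chr1\t2\t12", "chr1\t3\t11", "chr1\t4\t200"]
mean_cov, median_cov = coverage_summary(lines)
```

In the toy example the single high-depth position drags the mean far above the typical depth while the median barely moves, which is the argument for using the median here.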
Run variant calling, but include a depth filter based on the genome coverage calculated above, in addition to the existing quality filter (does this need adjusting?)
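
A sketch of how the combined filter might look on a single VCF record, assuming DP is present in the INFO column. The thresholds (`min_qual`, and the `low`/`high` multiples of the median coverage) are placeholder assumptions that would need tuning, and `passes_filters` is a hypothetical name.

```python
def passes_filters(vcf_line, median_cov, min_qual=20.0, low=0.5, high=2.0):
    """Keep a record only if QUAL is high enough and INFO/DP lies within
    [low * median_cov, high * median_cov]."""
    fields = vcf_line.rstrip("\n").split("\t")
    if fields[5] == ".":  # missing QUAL: treat as failing
        return False
    qual = float(fields[5])
    # INFO is a semicolon-separated list of KEY=value items (flags have no "=")
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    if "DP" not in info:
        return False
    depth = int(info["DP"])
    return qual >= min_qual and low * median_cov <= depth <= high * median_cov
```

The upper bound is the part aimed at repeat regions, where collapsed copies show up as unusually deep sites.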
Assemble both alleles when calculating the consensus distances