Documentation #121
Hello Tim,

Thank you for all of your work on MCHap! It's a really great tool. I am planning to use it on some Flex-Seq data generated in alfalfa, but I had a question that I thought might make a good documentation topic: do you have any recommended practices for upstream processing of the VCF used to identify SNVs?

I use Freebayes for calling, typically followed by decomposition of primitive alleles using vcflib. However, I'm unsure of whether or not the input VCF should be filtered - e.g. is it best to filter by quality, missing data percentage, depth, MAF, etc.? The raw Freebayes output contains a lot of junk with very low allele frequencies or call rates - I'm not sure how MCHap would handle such variants.

Thank you!
Brian
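The kind of post-calling filtering described above can be sketched in a few lines. This is a minimal illustration, not a recommendation from this thread: the thresholds (QUAL >= 20, DP >= 10, AF >= 0.05) are placeholder values, and a real pipeline would more likely use `bcftools view` or `vcffilter` on the command line.

```python
# Illustrative sketch of filtering VCF records by QUAL, depth (DP), and
# allele frequency (AF). All thresholds are hypothetical placeholders.

def parse_info(info_field):
    """Parse a VCF INFO column into a dict (bare flags map to True)."""
    out = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            out[key] = value
        else:
            out[item] = True
    return out

def keep_record(line, min_qual=20.0, min_depth=10, min_af=0.05):
    """Return True if a VCF data line passes the example thresholds."""
    fields = line.rstrip("\n").split("\t")
    qual = float(fields[5])          # QUAL is column 6
    info = parse_info(fields[7])     # INFO is column 8
    depth = int(info.get("DP", 0))
    # AF may be comma-separated for multi-allelic records; keep the record
    # if any alternate allele passes the frequency threshold.
    afs = [float(x) for x in info.get("AF", "0").split(",")]
    return qual >= min_qual and depth >= min_depth and max(afs) >= min_af

def filter_vcf(text):
    """Pass header lines through untouched and filter data lines."""
    kept = []
    for line in text.splitlines():
        if line.startswith("#") or keep_record(line):
            kept.append(line)
    return "\n".join(kept)
```

Whether any of these filters are appropriate for MCHap input is exactly the question being asked here.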
Hi @etnite, thanks for the feedback! That's a good question and something that I need to investigate more thoroughly.

In my PhD work I used a similar process to what you've described: Freebayes -> decomposition of primitive alleles -> some filtering. However, these days I would suggest using minimal, if any, filtering. Then review the results and consider further filtering if necessary. If the results are very messy (lots of unique haplotypes, low posterior probabilities (GPH and PHPM)) then filtering the input SNVs is one option that may help clean up the result. MCHap assemble will automatically exclude SNVs at the sample level if they have a very high probability of being homozygous (this threshold is configurable).

My understanding is that the QUAL score from FreeBayes is essentially a probability of some variation being present, so a low QUAL score shouldn't be too different from a low MAF.

Low read depth can be a real issue for MCHap and results in low-confidence calls with many unique haplotypes. However, with targeted sequencing like Flex-seq, it's probably more robust to adjust the assembly windows (bed intervals) to exclude any low coverage areas.

I have come across sites with high QUAL scores and read depth, but low GQ scores. My assumption was that this may be indicative of some sort of copy number variation or misaligned/off-target sequences. MCHap won't handle either of these situations very well. Filtering these variants is an option, but it may just be masking the underlying issue. So, these days I'd recommend running MCHap without filtering.

Overall, I think filtering by depth is the best option if you do want to apply some filtering. Removing the low call rate and low MAF stuff shouldn't have a big impact on the results of MCHap, but it might speed up the analysis a bit.

Hopefully that's helpful!
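The suggestion to trim the assembly windows rather than filter variants can be sketched as pruning BED intervals whose mean coverage is too low. This is a hypothetical illustration: the depth lookup and the 10x cutoff are assumptions, and in practice the per-position depths might come from something like `samtools depth` output.

```python
# Sketch of dropping low-coverage assembly windows (BED intervals)
# instead of filtering individual variants. The 10x cutoff is illustrative.

def mean_depth(depths, chrom, start, end):
    """Mean read depth over a half-open [start, end) interval.

    `depths` maps (chrom, pos) -> read depth; missing positions count as 0.
    """
    total = sum(depths.get((chrom, pos), 0) for pos in range(start, end))
    return total / max(end - start, 1)

def filter_bed(intervals, depths, min_mean_depth=10.0):
    """Keep only BED intervals with adequate mean coverage."""
    return [
        (chrom, start, end)
        for chrom, start, end in intervals
        if mean_depth(depths, chrom, start, end) >= min_mean_depth
    ]
```

For targeted assays like Flex-seq this keeps the variant set intact while simply skipping target regions where the assay performed poorly.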
Many thanks @timothymillar. This all makes sense to me. You are correct that in our Flex-seq data depth doesn't tend to be much of an issue, though there are always a handful of targeted regions where the assay doesn't work very efficiently for whatever reason. As you mention, it might be easiest to simply exclude these from the .bed file. I watched your presentation for Polyploid Tools so I think I understand what the

I'm looking forward to testing MCHap on a few regions of particular interest - then I can try scaling up to the whole genome once I get some experience working with the output.
Just an update @etnite, I've released version 0.9.0 which includes a new tool
Need to write some more extensive documentation. `assemble`, `call`, etc.