Inconsistent alignments depending on whether reference file is gzipped or not #70
Thanks for posting the issue, that's definitely not right... I'll take a look and see if I can replicate it. In the meantime, please feel free to post the stderr output of the two commands.
Hi, Thanks for the quick response. I did some debugging on my end, and the duplication of exactly the same alignment was a bug on my side (the same contig was included multiple times in the query FASTA). However, even with a clean FASTA, I do get different alignments depending on whether the reference is gzipped or not. I have attached the stderr. Thanks again,
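For anyone who wants to rule out the same cause, here is a minimal sketch of checking a query FASTA for repeated record IDs. It is not part of MashMap; the function name `duplicated_fasta_ids` and the example records are made up for illustration, and it only compares IDs (the first token after `>`), not the sequences themselves.

```python
from collections import Counter


def duplicated_fasta_ids(lines):
    """Return the record IDs that occur more than once in FASTA-formatted lines."""
    counts = Counter(
        line[1:].split()[0]  # ID = first whitespace-delimited token after '>'
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical query with one contig included twice.
example = [">ctg1 len=4", "ACGT", ">ctg2", "GGCC", ">ctg1 len=4", "ACGT"]
print(duplicated_fasta_ids(example))
```

For a real file, the same function can be fed an open text handle, since it only iterates over lines.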
Ahh! Ok, that makes sense. As far as the different results based on the reference go, it looks like MashMap is finding twice as many unique k-mer seeds in the uncompressed reference as in the compressed reference (19,972,584 vs 10,660,334).
Hi, Re 1: Yes, I could verify that hs1.fa and hs1.fa.gz are identical. Re 2: I created copies of the reference files and re-ran mashmap, but the issue persists, so I don't think lingering index files are the issue. Unfortunately, I can't share my data yet, but I think it might be an issue specific to my input sequence (it's a primary assembly from hifiasm). I tried aligning the reference against itself, using the uncompressed version as the query and, in turn, the compressed and the uncompressed version as the reference. It still generates twice as many unique k-mer seeds when the reference is uncompressed, but the alignments are consistent. I don't know if it's relevant, but the k-mer complexities are different. I have attached stderr outputs and alignment outputs. And the reference sequence is available from here: Thanks,
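One way to do the "Re 1" check is to hash the decompressed content of both files. The following is only a sketch of that idea, not necessarily how the check was done in this thread; `content_digest` is a hypothetical helper, and the demo writes small temporary files rather than touching hs1.fa.

```python
import gzip
import hashlib
import os
import tempfile


def content_digest(path):
    """SHA-256 of a file's content, transparently decompressing '.gz' files."""
    opener = gzip.open if path.endswith(".gz") else open
    digest = hashlib.sha256()
    with opener(path, "rb") as handle:
        # Read in 1 MiB chunks so large references do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on throwaway files: the same bytes written plain and gzipped.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "demo.fa")
    packed = plain + ".gz"
    data = b">seq1\nACGTACGT\n"
    with open(plain, "wb") as out:
        out.write(data)
    with gzip.open(packed, "wb") as out:
        out.write(data)
    same_content = content_digest(plain) == content_digest(packed)

print(same_content)
```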
From looking at the output logs, the sketch size of the compressed version was 20, but the uncompressed was 40. I was able to replicate the issue and actually, it looks like this has been around since MashMap2! Basically, in Section 5 of the original MashMap paper, they show that the sketch size which satisfies their constraints depends on the reference size. Since MashMap2, the raw file size has been used to determine the reference size, so a gzipped reference looks smaller than it really is. The easiest hack would be to just decompress the file twice (once to compute the reference length, then again to actually index it), but with a large file that's an extra 30 seconds for no good reason. Perhaps we'll just warn users that without a In the meantime, you can set the sketch size to 40 manually to ensure consistent results.
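To make the mismatch concrete, here is a small sketch comparing the on-disk size of a FASTA file with the number of bases it actually contains, for a plain and a gzipped copy. This is not MashMap's code and does not reproduce its sketch-size calculation; `sequence_length` and the demo file names are hypothetical. It only illustrates why raw file size underestimates the reference length once the file is compressed.

```python
import gzip
import os
import tempfile


def sequence_length(path):
    """Count bases in a FASTA file, decompressing '.gz' files on the fly."""
    opener = gzip.open if path.endswith(".gz") else open
    total = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):  # skip header lines
                total += len(line.strip())
    return total


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "demo.fa")
    packed = plain + ".gz"
    # Toy reference: 1000 lines of 80 bases each (highly repetitive on purpose).
    record = ">seq1\n" + ("ACGT" * 20 + "\n") * 1000
    with open(plain, "w") as out:
        out.write(record)
    with gzip.open(packed, "wt") as out:
        out.write(record)
    # For each copy: (raw file size in bytes, number of bases).
    sizes = {
        "plain": (os.path.getsize(plain), sequence_length(plain)),
        "gzipped": (os.path.getsize(packed), sequence_length(packed)),
    }

for name, (raw_bytes, bases) in sizes.items():
    print(f"{name}: raw size = {raw_bytes} bytes, sequence length = {bases} bp")
```

Both copies hold the same number of bases, but the gzipped file is smaller on disk, so any parameter derived from raw file size will differ between the two. The toy data here compresses far better than a real genome would, so the ratio is not representative.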
As a side note, with
Thank you!
Hi,
I am using mashmap v3.1.3 and I noticed that the same alignment was reported multiple times in the output file. For example, here is the output for reference chromosome 21:
I found this odd so I re-ran it. This time I happened to use an uncompressed version of my reference sequence file and I didn't get duplicated alignments, but I got some new alignments and the positions of previously found alignments changed. Here is again the output for reference chr21:
I used these commands:
Any ideas what could trigger such a behavior?
Thanks,
Aaron