ValueError: Error while executing the Bisulfite bisulphite-mapping #90
Hey Paul. That is super weird. So you mentioned that it isn't an issue with the fastq. So you used something like vi to check the fastq and make sure that line 140016020119448 does have an @ for the next read? Or the line that varies from run to run, since you mentioned that the error can move around a bit depending on the fastq you are running. That seems like a super easy thing to do and I'm not doubting you did it, I'm actually more curious about how you check your fastqs before processing so I can add that to my own workflow haha. So I recently worked on updating the gemBS pipeline in a python environment and got the bioconda build up to v3.5.5_IHEC. I'm curious whether, if you set up gemBS using anaconda and ran your fastq files through that, you would also get the same error. I'm gonna guess it will still throw the error, but at least it would be a start to troubleshooting this issue. |
I meant the error line will change even when rerunning with the same fastq. From what I could tell manually inspecting it, it looked fine around the line given. We run gemBS using Docker. Maybe conda would help, but I'm not inclined to try it as I've only encountered this issue once, for this file specifically. There could be something pathological in the input that causes the mapper to error. |
Yeah, I really like the containerized approach and there shouldn't be an issue with the software that lives in the container. Oh, that's even weirder. Yeah, if you already checked the resources and they are set high enough, and if it's only happening on that file, then there has to be something weird about that input. It would make sense if there was a line missing and the total line count wasn't divisible by 4 anymore (haha, the equivalent of a frameshift mutation in a fastq). Then the error would come from there not being an @ in the right place to start the next read correctly. But if the error is moving around I don't think that makes sense: it would have to be at the same place every time and you could see it immediately. Is this file super valuable? Or can you throw it out and just decrease your overall replicates? I'd be interested to see the samtools flagstat output on the other BAM files. If you are getting good alignment, maybe that file just has some low-quality reads that are throwing GEM off? Again, that isn't likely: GEM is really good at keeping the MAPQ very high when mapping because it does those complete strata searches after grouping to find better matches than BWA-MEM. But it might give you a hint at the overall quality of the reads and why this file is acting so weird. Have you thought about trying the files with bwa-meth to see if that file acts weird with that aligner as well? (A file with that many lines will take a full week to run through bwa though haha.) I'd be interested to see if @heathsc has seen this before. |
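The "frameshift" idea above can be checked mechanically. The following is only a minimal sketch of such a check, not what validateFiles or gemBS actually do: the function name `check_fastq_structure` and the `max_problems` parameter are hypothetical, and it assumes the common four-line FASTQ layout (an `@` header, the sequence, a `+` separator, then a quality string of the same length as the sequence).

```python
import io


def check_fastq_structure(handle, max_problems=10):
    """Scan a text-mode FASTQ handle record by record.

    Returns (n_complete_records, problems), where problems is a list of
    (line_number_of_record_start, description) tuples.
    Hypothetical helper for illustration only.
    """
    problems = []
    n_records = 0
    line_no = 0
    while len(problems) < max_problems:
        header = handle.readline()
        if not header:
            break  # clean end of file
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        first_line = line_no + 1
        line_no += 4
        if not qual:
            # Fewer than 4 lines were left: line count not divisible by 4.
            problems.append((first_line, "truncated record (fewer than 4 lines)"))
            break
        n_records += 1
        if not header.startswith("@"):
            problems.append((first_line, "header does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((first_line, "third line does not start with '+'"))
        if len(seq.rstrip("\r\n")) != len(qual.rstrip("\r\n")):
            problems.append((first_line, "sequence and quality lengths differ"))
    return n_records, problems


# Example on in-memory data; a real file could be opened with
# gzip.open(path, "rt") instead.
good = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
print(check_fastq_structure(good))
```

A missing line would show up at a fixed position on every run, which is why an error location that changes between runs points away from this kind of corruption.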
We do have another replicate, but without a solid reason like severe quality/integrity issue it's unlikely we'd throw it out. We did process it back in the day with Bismark/Bowtie, and the quality metrics are very similar between the two replicates. |
Yeah, if you can process it with bowtie then something is going on. Well, there are a ton of options you can modify for GEM. Maybe Heath can give you some tips as to what you should try changing to see if you can get the file to run, and then you can dial it in a bit more. |
Hi @JakeLehle, I don't think the number here refers to a line number? Do you know what this could be? The file (https://www.encodeproject.org/files/ENCFF419MXX/) contains 930309473 reads. I have indexed the fastq and can randomly access it, so I could easily check if there is something curious about this particular spot in the file. |
Hello @ottojolanki. Hmm, well, with FASTQ files you probably already know that each read starts with an @ symbol followed by info about the read, then a new line with the sequence, then a + symbol on the third line, and on the fourth line the quality string for that sequence. So I interpreted this error message as saying that the read associated with that spot in the file was missing some of that syntax and that was what was throwing the error. But based on talking with @paul-sud, he said he had looked at the line the error message mentions and didn't see an issue with the syntax. In addition, he did some cool pre-processing with validateFiles that should catch that kind of stuff. You are right though: if the file only has 930309473 reads, then that times 4 is only 3721237892 lines total, so that doesn't make sense with the error message saying it's happening on 140016020119448. However, if you want to open up the file and jump to that line with vim and then take a screenshot, any missing syntax should be super obvious. As for the weirdness with the file line numbers, I'm not sure about that; you will have to talk with @paul-sud. I'm mainly focused on making sure this isn't a gemBS-related problem, which it sounds like it isn't, because other files were working fine. |
Which line or location in the file do you mean here? The file is 200GB; 140016020119448 is a number several orders of magnitude larger than either the number of bytes or the number of lines in the file. I was able to index the fastq using pyfastx (in addition to validateFiles), which leads me to believe that the format is correct. Now that I have the file indexed I can randomly access any read there. I am just unsure how to interpret the above-mentioned huge number 140016020119448 as a location/line in the fastq file. |
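The size argument in the last two comments can be written out as a quick sanity check. This is only a sketch of the arithmetic: the variable names are made up for illustration, "200GB" is taken as roughly 200 × 10^9 bytes, and four lines per read is assumed.

```python
# Figures quoted in the thread.
n_reads = 930_309_473
reported_position = 140_016_020_119_448

# Assumptions: 4 lines per FASTQ record, and "200GB" ~ 200 * 10**9 bytes.
total_lines = n_reads * 4
approx_file_bytes = 200 * 10**9

print(f"total lines: {total_lines}")
print(f"reported / total lines: {reported_position / total_lines:.0f}x")
print(f"reported / file bytes:  {reported_position / approx_file_bytes:.0f}x")

# The reported number cannot be a valid line number or byte offset.
exceeds_lines = reported_position > total_lines
exceeds_bytes = reported_position > approx_file_bytes
print(exceeds_lines, exceeds_bytes)
```

Since the reported value exceeds both bounds, it cannot index a line or a byte in this file, whatever it actually represents.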
@ottojolanki ah okay I found it. That number isn't a line in the input file. I don't know what it is, but it isn't a line. https://githubhot.com/repo/smarco/gem3-mapper/issues/29 Since gemBS has the gem3-mapper linked to the repository through submodules, all you should need to do is recursively re-clone the repository or fetch the upstream updates and then reinstall gemBS and hopefully, this should fix your issue. |
@JakeLehle Thanks for the nice words. I'm really glad that the gem3-mapper is useful. Reading this was very rewarding. Thanks. |
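Thanks for getting to the bottom of this. This repository points (see here: https://github.com/heathsc/gemBS/blob/master/.gitmodules) to the gembs branch on the https://github.com/smarco/gem3-mapper.git repo. And the gembs branch does not have the above mentioned commit. So, do we use the master branch instead? The two branches seem to be quite different? |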
No, I have to port the fix to the gembs branch.
Simon
…On Wed, 15 Jun 2022, 22:06 Ismail Moghul, ***@***.***> wrote:
Thanks for getting to the bottom of this.
This repository points (see here
<https://github.com/heathsc/gemBS/blob/master/.gitmodules>) to the gembs
branch on the https://github.com/smarco/gem3-mapper.git repo. And the
gembs branch does not have the above mentioned commit. So, do we use the
master branch instead? The two branches seem to be quite different?
|
Dear Simon, our team has the same issue. Do you have any idea of the timeline for the fix to be implemented? |
Hi @heathsc, Do we have a rough timeline for a fix for this? |
Hello,
We are seeing an odd error trying to map one particular fastq. I checked the resource utilization and didn't see anything excessive. I'm quite confident it's not a fastq issue as the error insinuates, as we validate all our fastqs before starting the pipeline. Furthermore, the exact location at which the parse error is thrown (in this case 140016020119448) varies from run to run.
I'd be curious to know if you have any thoughts on any potential fix.
Thanks,
Paul