ValueError: Error while executing the Bisulfite bisulphite-mapping #90

paul-sud · 2022-03-09T18:30:02Z

Hello,

We are seeing an odd error trying to map one particular fastq. I checked the resource utilization and didn't see anything excessive. I'm quite confident it's not a fastq issue as the error implies, as we validate all our fastqs before starting the pipeline. Furthermore, the exact location where the parse error is thrown (in this case 140016020119448) varies from run to run.

I'd be curious to know if you have any thoughts on any potential fix.

2022-03-01 22:49:57,314 ERROR: 2022/3/1 22:39:47 -- # 135900000 sequences processed
2022-03-01 22:49:57,314 ERROR: GEM::FatalError (input_fasta.c:299,input_fasta_parser_prompt_error)
2022-03-01 22:49:57,314 ERROR: Parsing FASTA/FASTQ error(./fastq/1/ENCFF419MXX.fastq.gz:140016020119448). Beginning Symbol ('>' or '@') not found. Bad syntax
2022-03-01 22:49:57,314 ERROR: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2022-03-01 22:49:57,314 ERROR: GEM::Unexpected error occurred. Sorry for the inconvenience
2022-03-01 22:49:57,314 ERROR: Feedback and bug reporting it's highly appreciated,
2022-03-01 22:49:57,314 ERROR: => Please report or email ([email protected])
2022-03-01 22:49:57,314 ERROR: GEM::Running-Thread (threadID = 4)
2022-03-01 22:49:57,314 ERROR: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2022-03-01 22:49:57,314 ERROR: GEM::Version v3.6.0-cnag-release
2022-03-01 22:49:57,314 ERROR: GEM::CMD /root/.local/lib/python3.7/site-packages/gemBS/gemBSbinaries/gem-mapper -I indexes/GRCh38_no_alt_analysis_set_GCA_000001405.15.BS.gem --i1 ./fastq/1/ENCFF419MXX.fastq.gz --i2 ./fastq/1/ENCFF007FYK.fastq.gz -p --bisulfite-conversion inferred-C2T-G2A -t 8 --report-file ./mapping/sample_1/1.json -r @RG\tID:1\tSM:1\tBC:sample_1\tPU:1 --underconversion-sequence chrL
2022-03-01 22:49:57,314 ERROR: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2022-03-01 22:49:57,314 ERROR: GEM::Input.State
2022-03-01 22:49:57,314 ERROR: Current sequence is <<Empty>>
2022-03-01 22:49:57,314 ERROR: GEM::Input.State
2022-03-01 22:49:57,314 ERROR: Current sequence is <<Empty>>
ValueError: Error while executing the Bisulfite bisulphite-mapping

Thanks,

Paul


JakeLehle · 2022-03-09T19:46:33Z

Hey Paul. That is super weird. You mentioned that it isn't an issue with the fastq. So you used something like vi to check the fastq and make sure that line 140016020119448 does have an @ for the next read? Or the line that varies from run to run, since you mentioned that the error can move around a bit depending on the fastq you are running. That seems like a super easy thing to do and I'm not doubting you did it, I'm actually more curious about how you check your fastqs before processing so I can add that to my own workflow haha.

So I recently worked on updating the gemBS pipeline in a python environment and got the bioconda build up to v3.5.5_IHEC. I'm curious whether you would get the same error if you set up gemBS using anaconda and ran your fastq files through that.

I'm gonna guess it will still throw the error but at least it would be a start to troubleshooting this issue.

paul-sud · 2022-03-09T20:15:26Z

I meant the error line will change even rerunning with the same fastq. From what I could tell manually inspecting it, it looked fine around the line given. We run validateFiles from UCSC which catches anomalies in a bunch of different file types, fastqs included.

We run gemBS using Docker. Maybe conda would help, but I'm not inclined to try it as I've only encountered this issue once for this file specifically. There could be something pathological in the input that causes the mapper to error.

JakeLehle · 2022-03-09T20:44:23Z

Yeah, I really like the containerized approach and there shouldn't be an issue with the software that lives in the container.
I get you about not wanting to get too much into the weeds with messing with your overall workflow. You could always try it with conda but I still think you would just end up getting the same error.

Oh that's even weirder. Yeah, if you already checked the resources and they are set high enough, and it's only happening on that file, then there has to be something weird about that input. It would make sense if there was a line missing and the total line count wasn't divisible by 4 anymore (haha, the equivalent of a frameshift mutation in a fastq). Then the error would come from there not being an @ in the right place to start the next read correctly. But if the error is moving around I don't think that makes sense: it would have to be at the same place every time and you could see it immediately.
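A minimal sketch of that "frameshift" check, assuming plain 4-line FASTQ records with no wrapped sequence lines. The function name `first_fastq_problem` and the toy records are made up for illustration; this is not what validateFiles or GEM do internally.

```python
import gzip
import io


def first_fastq_problem(handle):
    """Return (line_number, message) for the first structural problem, or None.

    Checks by position within each 4-line record, because a quality line
    may legitimately start with '@'.
    """
    n_lines = 0
    for n_lines, line in enumerate(handle, start=1):
        pos = (n_lines - 1) % 4  # 0: header, 1: sequence, 2: '+', 3: quality
        if pos == 0 and not line.startswith("@"):
            return n_lines, "header does not start with '@'"
        if pos == 2 and not line.startswith("+"):
            return n_lines, "separator does not start with '+'"
    if n_lines % 4 != 0:
        return n_lines, "line count is not a multiple of 4"
    return None


# Toy data: 'bad' is 'good' with the quality line of r1 dropped.
good = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n"
bad = "@r1\nACGT\n+\n@r2\nTTGA\n+\nIIII\n"

print(first_fastq_problem(io.StringIO(good)))
print(first_fastq_problem(io.StringIO(bad)))

# For a real gzipped file (path is a placeholder):
# with gzip.open("reads.fastq.gz", "rt") as fh:
#     print(first_fastq_problem(fh))
```

A dropped line shows up at the same line number on every run, which is why an error location that moves around doesn't fit this explanation.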

Is this file super valuable? or can you throw it out and just decrease your overall replicates?

I'd be interested to see the samtools flagstats on the other BAM files. If you are getting good alignment, maybe that file just has some low-quality reads that are throwing GEM off? Again that isn't likely, GEM is really good at keeping the MAPQ very high when mapping because it does those complete strata searches after grouping to find better matches than BWA-MEM. But it might give you a hint at the overall quality of the reads and why this file is acting so weird.

Have you thought about trying the files with bwa-meth to see if that file acts weird with that aligner as well? (A file with that many lines will take a full week to run through bwa though haha)

I'd be interested to see if @heathsc has seen this before.

paul-sud · 2022-03-09T23:08:03Z

We do have another replicate, but without a solid reason like severe quality/integrity issue it's unlikely we'd throw it out. We did process it back in the day with Bismark/Bowtie, and the quality metrics are very similar between the two replicates.

JakeLehle · 2022-03-10T12:45:46Z

Yeah, if you can process it with bowtie then something is going on. Well, there are a ton of options you can modify for GEM. Maybe heath can give you some tips as to what you should try and change to see if you can get the file to run, and then you can dial it in a bit more.

ottojolanki · 2022-04-13T21:41:38Z

140_016_020_119_448 line

Hi @JakeLehle, I don't think the number here refers to a line number? Do you know what this could be? The file (https://www.encodeproject.org/files/ENCFF419MXX/) contains 930309473 reads. I have indexed the fastq and can randomly access it, so I could easily check if there is something curious about this particular spot in the file.

JakeLehle · 2022-04-15T14:18:15Z

Hello @ottojolanki. Hmm, well, with FASTQ files you probably already know that each read starts with an @ symbol followed by info about the read, then a new line with the sequence, then a + symbol on the third line, and on the fourth line the quality scores. So I interpreted this error message as saying that the read associated with that spot in the file was missing some of that syntax and that was what was throwing the error. But based on talking with @paul-sud, he said he had looked at the line the error message mentions and didn't see an issue with the syntax. In addition, he did some cool pre-processing with validateFiles that should catch that kind of stuff. You are right though: if the file only has 930309473 reads then that times 4 is only 3721237892 lines total, so that doesn't make sense with the error message saying it's happening on 140016020119448.

However, if you want to open up the file and jump to that line with vim and then take a screenshot, any missing syntax should be super obvious.

As for the weirdness with the file line numbers, I'm not sure about that; you will have to talk with @paul-sud about it. I'm mainly focused on making sure this isn't a gemBS related problem. Which it sounds like it isn't, because other files were working fine.

ottojolanki · 2022-04-15T16:34:34Z

@JakeLehle

Hello @ottojolanki.
...
You are right though if the file only has 930309473 reads then that times 4 is only 3721237892 lines total so that doesn't make sense with the error message saying it's happening on 140016020119448
However, if you want to open up the file and jump to that line with vim and then take a screenshot any missing syntax should be super obvious.

Which line or location in the file do you mean here? The file is 200GB, and 140016020119448 is a number several orders of magnitude larger than either the number of bytes or lines in the file. I was able to index the fastq using pyfastx (in addition to validateFiles), which leads me to believe that the format is correct. Now that I have the file indexed I can randomly access any read there; I am just unsure how to interpret the above mentioned huge number 140016020119448 as a location/line in the fastq file.
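The size comparison can be made concrete with a few lines of arithmetic. This is only a sanity check using the figures quoted in this thread; the byte count takes "200GB" as roughly 200 × 10^9 bytes of compressed file, which is an approximation.

```python
reads = 930_309_473          # read count reported for ENCFF419MXX
lines = reads * 4            # 4 lines per FASTQ record
file_bytes = 200 * 10**9     # approximate compressed size ("200GB")
reported = 140_016_020_119_448  # number from the GEM error message

print(f"lines in file:       {lines}")
print(f"reported / lines:    {reported / lines:.1f}")
print(f"reported / bytes:    {reported / file_bytes:.1f}")
```

Both ratios come out far above 1, so the reported number cannot be a line number or a byte offset into the compressed file.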

JakeLehle · 2022-04-15T16:55:44Z

@ottojolanki ah okay I found it. That number isn't a line in the input file. I don't know what it is, but it isn't a line.
This issue has to do with very large input files (200GB sounds like it would be big enough to fit into that category). Luckily it was also fixed in the most recent update for gem3-mapper by @smarco. He is a gentleman and a scholar and we owe him our thanks.

https://githubhot.com/repo/smarco/gem3-mapper/issues/29
smarco/gem3-mapper@82cfcaa

Since gemBS has the gem3-mapper linked to the repository through submodules, all you should need to do is recursively re-clone the repository or fetch the upstream updates and then reinstall gemBS and hopefully, this should fix your issue.

smarco · 2022-04-17T14:30:37Z

@JakeLehle Thanks for the nice words. I'm really glad that the gem3-mapper is useful. Reading this was very rewarding. Thanks.

IsmailM · 2022-06-15T20:05:50Z

Thanks for getting to the bottom of this.

This repository points (see here: https://github.com/heathsc/gemBS/blob/master/.gitmodules) to the gembs branch on the https://github.com/smarco/gem3-mapper.git repo. And the gembs branch does not have the above mentioned commit. So, do we use the master branch instead? The two branches seem to be quite different.
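For anyone wanting to check which branch a submodule tracks without opening the file by hand, here is a rough sketch using Python's standard `configparser`. The `SAMPLE` text is hypothetical: the URL and branch follow the description above, but the submodule name and path are assumptions and are not copied from the gemBS repository.

```python
import configparser

# Hypothetical .gitmodules content; submodule name and path are assumptions.
SAMPLE = """
[submodule "gem3-mapper"]
    path = gem3-mapper
    url = https://github.com/smarco/gem3-mapper.git
    branch = gembs
"""


def submodule_branches(text):
    """Map each submodule name to (url, branch); branch is None if unset."""
    parser = configparser.ConfigParser(interpolation=None)
    # Strip indentation so no key line is parsed as a continuation line.
    parser.read_string("\n".join(line.strip() for line in text.splitlines()))
    result = {}
    for section in parser.sections():
        # Section names look like: submodule "name"
        name = section.split('"')[1] if '"' in section else section
        result[name] = (parser[section].get("url"), parser[section].get("branch"))
    return result


print(submodule_branches(SAMPLE))
```

To run it against a real checkout, read the repository's `.gitmodules` file and pass its text to `submodule_branches`.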

heathsc · 2022-06-16T07:45:01Z

No, I have to port the fix to the gembs branch. Simon


AndreiWCansor · 2022-06-28T09:27:15Z

Dear Simon, our team has the same issue. Do you have any idea of the timeline for the fix to be implemented?

IsmailM · 2022-07-28T08:39:56Z

Hi @heathsc,

Do we have a rough timeline for a fix for this?
