RA failed - unequal lengths in sequence and overlap file for sequence with id xxxxx #4

bbalog87 · 2018-11-22T15:59:38Z

Hi,
RA has failed with the following error message

unequal lengths in sequence and overlap file for sequence with id 508204 without any further log message.

Here are the last few lines of the log-file:

[M::main] CMD: /disk2/nguinkal/ra/vendor/minimap2/minimap2 -t 100 -x ava-pb /disk2/nguinkal/mnt/winpro/I3-NGS-Daten/NGS-Fisch/Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta /disk2/nguinkal/mnt/winpro/I3-NGS-Daten/NGS-Fisch/Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta
[M::main] Real time: 12065.368 sec; CPU: 868149.797 sec
[Ra::run] preconstruction stage
[rala::Graph::initialize] loaded sequences
[rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 508204!

I am running RA on Debian based linux Workstation.
I have no clue what is causing that bug. Thanks for any hints.
Cheers,
Julien

The text was updated successfully, but these errors were encountered:

rvaser · 2018-11-22T16:05:28Z

Hi Julien,
please run the following commands from ra root folder and paste the output here:

cd vendor/rala
git status
git log

Best regards,
Robert

bbalog87 · 2018-11-22T16:53:19Z

Hi @rvaser
Thanks for replying.

git status

On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)

	modified:   ../../../vendor/rala (modified content, untracked content)

no changes added to commit (use "git add" and/or "git commit -a")

git log

Author: rvaser <[email protected]>
Date:   Sat Nov 17 09:10:56 2018 +0100

   Fixed illumina polishing extension

commit 20a39156f7efb59de36616e6f0314202b287b804
Author: rvaser <[email protected]>
Date:   Sat Nov 17 00:50:22 2018 +0100

   Updated rala

commit ca71a013084ed41f3ecc2ca3918be5f5a2f482d9
Author: rvaser <[email protected]>
Date:   Thu Nov 8 11:28:39 2018 +0100

   Updated rala

rvaser · 2018-11-22T17:16:05Z

Please run head -n 1016410 zander_pooled.subreads.renamed.fasta | tail -n 2 and paste the sequence causing the problem here.

rvaser · 2018-11-23T10:23:38Z

Please also run git log from ra/vendor/rala.

bbalog87 · 2018-11-23T12:44:38Z

> head -n 1016410 zander_pooled.subreads.renamed.fasta | tail -n 2
just output this:
AGAGCTCTGTCCACGTCAATATCGCTCCTAATTTACATTTCACCTTATTTAAACGCTTTT
GGTGTTTTGTGTAGGTTATTCCCTCTTTCAGTAATGTAAGCATTAACTGCGATGCCGCTA

Why not -n 508204 instead of -n 1016410?
Best, Julien

> git log in ra/vendor/rala
outputs this:

Author: rvaser <[email protected]>
Date:   Fri Nov 16 23:29:36 2018 +0100

    Reverted overlap trimming

commit 9cfb0076a20e36591f4329512a8f9f16e0b9eca1
Author: rvaser <[email protected]>
Date:   Fri Nov 16 19:58:07 2018 +0100

    Reverted some changes

rvaser · 2018-11-23T14:27:33Z

As each sequence is usually represented with two lines in FASTA format (one for header and one for the data), we are using 2 * read_id + 2 to obtain the sequence with head and tail. Nonetheless, your file seems to be wrapped to 60 characters per line which makes it a bit harder to extract the sequence by id. The error you have encountered is that my parser has obtained a sequences with length that differs from the length stored in the .paf file generated from minimap2, which is quite odd as I have tested this feature. I'll check again.

P.S. I truncated the outputs in above messages so I do not have to scroll too much :D

rvaser · 2018-11-23T14:42:48Z

Wrapped files are forking for me. Do you mind sending me your file so I can investigate further?

bbalog87 · 2018-11-23T15:02:25Z

Okay, I understand the issue. But how can I fix it?
unfortunately I cannot send some sample data because the files are about ~950GB for illumina PE data and ~120GB for PacBio data, since RA expects two types of data: TSG and NGS data.
PS: My PE-data are in file.R1.fq and file.R2 data. I have poooled them in one file before passing them to RA. Maybe, minimap2 has problems in mapping such pooled reads?

rvaser · 2018-11-23T15:47:10Z

I understand, that is a lot of data. The error occurred in the layout step where only TGS data is used. Can you please provide me with the command you run?

bbalog87 · 2018-11-23T16:02:43Z

Ok, here is the command I ran

$: ./ra -t 100 -x pb /path/to/pacbioData/SMRT_Sequel-pooled/zander_pooled.subreads.renamed.fasta /path/to/pe-reads/pooled/in/one/singleFile/HiSeqX/pe.renamed.fastq

rvaser · 2018-11-23T16:37:58Z

That looks fine. I am not sure what could be wrong. By the extension .renamed.fasta, did you perhaps rename your reads with ids? If so can you grep the reported sequence from the file?

bbalog87 · 2018-11-26T14:10:30Z

Yes, I renamed my reads with Ids
like:

1
READ1
2
READ2
...

bbalog87 · 2018-11-26T14:16:00Z

Sorry. No reads were renamed, just the headers in NGS reads file

rvaser · 2018-11-26T18:45:07Z

Can you please replace this lines https://github.com/rvaser/rala/blob/iterative/src/overlap.cpp#L55-L57 (located at vendor/rala/src/overlap.cpp) with:

fprintf(stderr, "[rala::Overlap::transmute] error: "
    "unequal lengths in sequence and overlap file for sequence with id %lu (%u - %zu)!\n",
    a_id_, a_length_, piles[a_id_]->data().size());

and https://github.com/rvaser/rala/blob/iterative/src/overlap.cpp#L74-L76 (in the same file) with:

fprintf(stderr, "[rala::Overlap::transmute] error: "
    "unequal lengths in sequence and overlap file for sequence with id %lu (%u - %zu)!\n",
    b_id_, b_length_, piles[b_id_]->data().size());

Run make afterwards and paste the log here again please.

bbalog87 · 2018-11-27T16:53:38Z

Okay, I have just copied the whole log file, since I do not know in which part you are interested to.

Author: rvaser <[email protected]>
Date:   Fri Nov 16 23:29:36 2018 +0100

    Reverted overlap trimming

commit 9cfb0076a20e36591f4329512a8f9f16e0b9eca1
Author: rvaser <[email protected]>
Date:   Fri Nov 16 19:58:07 2018 +0100

rvaser · 2018-11-27T17:24:49Z

By log I meant that you run ra again with your data and paste here everything it outputs until the error occurs again :)

bbalog87 · 2018-11-27T17:44:26Z

ok, sorry. It will take a while to run. I will post here tomorrow.

bbalog87 · 2018-11-28T08:46:41Z

Hi @rvaser ,

can I send you the whole log file by email? The same error has occurred.

[M::main] CMD: /Zander_Project/pipelines/assembly/ra/vendor/minimap2/minimap2 -t 130 -x ava-pb /Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta /Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta
[M::main] Real time: 11905.419 sec; CPU: 882940.242 sec
[Ra::run] preconstruction stage
[rala::Graph::initialize] loaded sequences
[rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 508204 (23914 - 23317)!

rvaser · 2018-11-28T09:16:01Z

Please do so :)

rvaser · 2018-11-28T15:49:48Z

Please run awk 'NR==508206' RS='>' zander_pooled.subreads.renamed.fasta > read_of_interest.fasta and send me the file via email.

bbalog87 · 2018-11-28T18:17:07Z

@rvaser I have just sent the read of interest via email.

rvaser · 2018-11-29T10:28:23Z

Please run grep "m54229_180630_022602" zander_pooled.subreads.renamed.fasta and make sure there is only one match.

The read length that my parser obtains matches to the one in file (it even has the length in its name m54229_180630_022602 74908655 0_23317) . So I guess that there might be two sequences with the same name. Or something went wrong in minimap2 parsing which I'll look into if there is only one match in the above grep command.

bbalog87 · 2018-11-29T18:31:31Z

Hi @rvaser,

It can't be only one match, since all reads (headers) in zander_pooled.subreads.renamed.fasta start with "m54229_180630_022602"

rvaser · 2018-11-29T18:37:02Z

That is the problem. I only take the read name up to the first white space to "save" memory, nonetheless I think it is expected that this part of the name should be unique. You can easily circumvent this problem by replacing all spaces with underscores in your read file. Something like this sed 's/ /_/g' zander_pooled.subreads.renamed.fasta > zander_pooled.subreads.fixed.fasta.

bbalog87 · 2018-11-29T18:41:53Z

Ok Thank you. Your asembler (RA) doesn't get information from the reads name (fasta-headers)? Because some do, and renaming fasta-headers causes problems. I will rename as suggested and re-run RA-assembler. Thank you very much. I hope it fixes it.

rvaser · 2018-11-29T18:43:53Z

Headers are discarded immediately and replaced with ids. I hope that this fixes it as well :D

bbalog87 · 2018-11-29T18:46:13Z

Okay, then it doesn't matter whatever is in the fast-headers :) I would let you know.

bbalog87 · 2018-11-29T18:47:44Z

PS: Do I need to reset the part of the code in rala, that I edited?

rvaser · 2018-11-29T18:48:24Z

Nope, it will only trigger if the same error occurs.

bbalog87 · 2018-11-29T18:49:39Z

Okay. 👍

bbalog87 · 2018-12-03T15:00:03Z

Hi @rvaser
minimap2 stage finished without error. rala is now running. I think the error was triggered by non-unique reads identifiers in the fasta file.
PS: Do you have an estimation of the expected run time of Ra depending on the data?

rvaser · 2018-12-03T15:08:28Z

Hi Julien,
that depends on how repetitive is your genome. Depending on the size of your data and if you don't have anything plant related, I would say under a week. I hope you have enough RAM as you have 1TB Illumina reads :D

Best regards,
Robert

bbalog87 · 2018-12-03T15:23:02Z

Hi Robert,
I have 1.5 TB RAM and I'm using 40 cpus.
Illunima PE-data : ~ 950 GB
PacBio data : ~120 GB
I hope the RAM will not crash :)
The genome is not so repetitive. (~12% repeat content, 0.1% hetorozygosity)

Best,
Julien

rvaser · 2018-12-03T15:27:32Z

That will be a bit tight because overhead for Illumina data equals 0.5 times the size of the FASTQ file. You might want to downsample it a bit. How much is the coverage if you have freaking ~500GB of sequences (i.e. how large is the genome you are assembling)?

bbalog87 · 2018-12-03T15:39:25Z

Actuallly, the illumina data are in fastq. Thus, the raw reads size is ~50% of the fastq (= ~450 GB). The theoretical (k-mer profile) mean coverage is ~140X. I have not computed yet the real sequencing coverage, since the genome size is unknown. But I expect a genome size < 2 Gb

bbalog87 · 2018-12-03T15:42:26Z

What is the most RAM-consuming step in the RA pipeline?

I think, it should be racon-stage. Currently, rala has a peak RAM consumption of only ~140 GB.

rvaser · 2018-12-03T15:43:47Z

Racon, as it loads all reads into memory. The problem will be the Illumina polishing step. You can try running ra without Illumina data and try to polish the assembly afterwards with Illumina and racon. If the program crashes, all intermediate files will be deleted, including the PB polished assembly.

rvaser · 2018-12-03T15:48:31Z

Or you can let it run, be sneaky and copy the intermediate files while it polishes with Illumina data :)

bbalog87 · 2018-12-03T15:50:49Z

Okay, this is a great suggestion 👍 . If ra only uses illumina data for polishing (not for consensus), then it also makes sense to run ra without NGS data. I can polish the PB contig afterwards using racon, as you suggested.

bbalog87 · 2018-12-06T11:32:42Z

Hi Robert,

As expected (and feared) ra has failed because of RAM overload :( I think, our data load and computational resources are just not suitable to run ra in NGS+TGS setup. RA can run assembly with only TGS, right? Maybe I should try running ra just with PB-data.

Best,
Julien

rvaser · 2018-12-06T11:36:57Z

Hi Julien,
as I thought you have unusual amount of Illumina data. You can of course run ra with PB only data. Later if you want to you can subsample the Illumina reads for manual polishing with racon.

Best regards,
Robert

bbalog87 · 2018-12-06T11:49:24Z

But according to your workflow description in github, even, with only TGS data, racon is ran twice over the preliminary PB only-based assembly. By manual polishing, you probably mean, polishing the contigs created by ra using a subset o high quality (e.g. Q>35) illumina reads. I will definitively try this approach.

Thanks for the kind overall support for this issue.

rvaser · 2018-12-06T11:54:53Z

Racon is run twice with TGS data, and once with NGS data. The NGS polishing is the part you will do manually due to memory shortage, i.e. polish the outputted ra contigs with subset of Illumina reads.

claumer · 2019-01-10T13:27:11Z

Hi there, sorry to reawaken this, but it seems I'm having the same problem with ra. Minimap2 runs just fine, creates an overlaps.paf, rala begins and creates an empty preconstruction.fasta file... end then the job stops, the working directory is emptied, and this gets printed to stdout:

[M::main] Version: 2.14-r892-dirty
[M::main] CMD: /nfs/research1/marioni/claumer/ra/vendor/minimap2/minimap2 -t 40 -x ava-ont Lm10_151218_renamed.fastq Lm10_151218_renamed.fastq
[M::main] Real time: 3842.330 sec; CPU: 57219.102 sec; Peak RSS: 57.284 GB
[rala::Graph::initialize] loaded sequences
[rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 2345870!
[Ra::run] preconstruction overlap stage
[Ra::run] preconstruction stage
[Ra::exit] warning: unable to clean work directory!

I'm calling ra like this:

../ra/build/bin/ra -t 40 -u -x ont Lm10_151218_renamed.fastq

And I do not think that it's a problem with unique headers as some of the messages above seemed to indicate... I saw the same problem both on my raw dataset and on a version where I used fastx_renamer to give each read a number in sequence.

I am running this on an amplified dataset: a longish-insert library that I have used adapter-driven PCR to amplify, following an optimised custom protocol that should minimize coverage bias, although it's possible that some exists. Because of the nature of PCR which is selective for shorter inserts, the results tend to look like this:

6823466 reads with read length N50 of 2643bp and maximum of 14993bp
8191293 reads with read length N50 of 1513bp and maximum of 3497bp

I'm having a bear of a time getting canu to finish on these (lots of relatively short reads causes the all-by-all mhap steps to take forever with it), so I was really excited when I came across ra which seems like it should have some considerable advantages over e.g. miniasm. But it seems that as written there's a difficulty parsing the paf...

Happy to provide any materials you might need to help debug this, log files, raw reads, etc.

rvaser · 2019-01-10T13:40:16Z

Hi Christopher,
have you checked that the identifiers are all different? You can send me the data via mail if you don't mind so I can investigate locally.

Best regards,
Robert

claumer · 2019-01-10T17:05:51Z

Hi Robert,

I've just sent you a link by email to download these data and play around yourself. I find with this dataset on my local cluster that minimap2 needs a minimum of 60 GB to work with the 40 threads I've been assigning it.

Pretty sure the headers all should be different after running fastx_renamer. So I think it should be some other kind of problem...

Grazie mille,
Chris Laumer

RA failed - unequal lengths in sequence and overlap file for sequence with id xxxxx #4

RA failed - unequal lengths in sequence and overlap file for sequence with id xxxxx #4

Comments

bbalog87 commented Nov 22, 2018 • edited by rvaser Loading

rvaser commented Nov 22, 2018 • edited Loading

bbalog87 commented Nov 22, 2018 • edited by rvaser Loading

rvaser commented Nov 22, 2018

rvaser commented Nov 23, 2018

bbalog87 commented Nov 23, 2018 • edited by rvaser Loading

rvaser commented Nov 23, 2018 • edited Loading

rvaser commented Nov 23, 2018

bbalog87 commented Nov 23, 2018

rvaser commented Nov 23, 2018

bbalog87 commented Nov 23, 2018

rvaser commented Nov 23, 2018

bbalog87 commented Nov 26, 2018

bbalog87 commented Nov 26, 2018

rvaser commented Nov 26, 2018

bbalog87 commented Nov 27, 2018 • edited Loading

rvaser commented Nov 27, 2018

bbalog87 commented Nov 27, 2018

bbalog87 commented Nov 28, 2018

rvaser commented Nov 28, 2018

rvaser commented Nov 28, 2018

bbalog87 commented Nov 28, 2018

rvaser commented Nov 29, 2018

bbalog87 commented Nov 29, 2018

rvaser commented Nov 29, 2018 • edited Loading

bbalog87 commented Nov 29, 2018

rvaser commented Nov 29, 2018

bbalog87 commented Nov 29, 2018 • edited Loading

bbalog87 commented Nov 29, 2018

rvaser commented Nov 29, 2018

bbalog87 commented Nov 29, 2018

bbalog87 commented Dec 3, 2018 • edited Loading

rvaser commented Dec 3, 2018

bbalog87 commented Dec 3, 2018 • edited Loading

rvaser commented Dec 3, 2018

bbalog87 commented Dec 3, 2018

bbalog87 commented Dec 3, 2018 • edited Loading

rvaser commented Dec 3, 2018 • edited Loading

rvaser commented Dec 3, 2018

bbalog87 commented Dec 3, 2018 • edited Loading

bbalog87 commented Dec 6, 2018 • edited Loading

rvaser commented Dec 6, 2018

bbalog87 commented Dec 6, 2018

rvaser commented Dec 6, 2018

claumer commented Jan 10, 2019

rvaser commented Jan 10, 2019

claumer commented Jan 10, 2019

bbalog87 commented Nov 22, 2018 •

edited by rvaser

Loading

rvaser commented Nov 22, 2018 •

edited

Loading

bbalog87 commented Nov 22, 2018 •

edited by rvaser

Loading

bbalog87 commented Nov 23, 2018 •

edited by rvaser

Loading

rvaser commented Nov 23, 2018 •

edited

Loading

bbalog87 commented Nov 27, 2018 •

edited

Loading

rvaser commented Nov 29, 2018 •

edited

Loading

bbalog87 commented Nov 29, 2018 •

edited

Loading

bbalog87 commented Dec 3, 2018 •

edited

Loading

bbalog87 commented Dec 3, 2018 •

edited

Loading

bbalog87 commented Dec 3, 2018 •

edited

Loading

rvaser commented Dec 3, 2018 •

edited

Loading

bbalog87 commented Dec 3, 2018 •

edited

Loading

bbalog87 commented Dec 6, 2018 •

edited

Loading