RA failed - unequal lengths in sequence and overlap file for sequence with id xxxxx #4
Hi Julien, cd vendor/rala
git status
git log Best regards, |
Hi @rvaser
|
Please run |
Please also run |
Why not -n 508204 instead of -n 1016410?
|
As each sequence is usually represented with two lines in FASTA format (one for header and one for the data), we are using P.S. I truncated the outputs in above messages so I do not have to scroll too much :D |
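The line arithmetic behind the question above can be sketched as follows. This is a minimal illustration, not part of ra: it assumes an unwrapped FASTA file (exactly two lines per record) and zero-based sequence ids, which is inferred from the numbers in the thread, where id 508204 ends on line 1016410. The helper names `fasta_line_span` and `read_record` are hypothetical.

```python
from typing import Tuple


def fasta_line_span(seq_id: int) -> Tuple[int, int]:
    """Return the 1-based (header line, sequence line) of a zero-based record id.

    Assumes an unwrapped FASTA file, i.e. exactly two lines per record.
    """
    if seq_id < 0:
        raise ValueError("seq_id must be non-negative")
    header_line = 2 * seq_id + 1
    return header_line, header_line + 1


def read_record(path: str, seq_id: int) -> Tuple[str, str]:
    """Return (header, sequence) of the record with the given zero-based id."""
    header_line, _ = fasta_line_span(seq_id)
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            if line_number == header_line:
                header = line.rstrip("\n")
                # The sequence is the line directly after the header.
                sequence = next(handle, "").rstrip("\n")
                return header, sequence
    raise IndexError("file has fewer records than seq_id + 1")


# Record 508204 occupies lines 1016409 and 1016410, so taking the first
# 1016410 lines and keeping the last two isolates exactly that record.
print(fasta_line_span(508204))
```

If the file is wrapped (sequence data spread over several lines), the two-lines-per-record assumption no longer holds and the record has to be located by counting `>` headers instead.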
Wrapped files are working for me. Do you mind sending me your file so I can investigate further? |
Okay, I understand the issue. But how can I fix it? |
I understand, that is a lot of data. The error occurred in the layout step where only TGS data is used. Can you please provide me with the command you ran? |
Ok, here is the command I ran
|
That looks fine. I am not sure what could be wrong. By the extension .renamed.fasta, did you perhaps rename your reads with ids? If so can you |
Yes, I renamed my reads with Ids
|
Sorry. No reads were renamed, just the headers in the NGS reads file |
Can you please replace these lines https://github.com/rvaser/rala/blob/iterative/src/overlap.cpp#L55-L57 (located at fprintf(stderr, "[rala::Overlap::transmute] error: "
"unequal lengths in sequence and overlap file for sequence with id %lu (%u - %zu)!\n",
a_id_, a_length_, piles[a_id_]->data().size()); and https://github.com/rvaser/rala/blob/iterative/src/overlap.cpp#L74-L76 (in the same file) with: fprintf(stderr, "[rala::Overlap::transmute] error: "
"unequal lengths in sequence and overlap file for sequence with id %lu (%u - %zu)!\n",
b_id_, b_length_, piles[b_id_]->data().size()); Run |
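The mismatch that this error message reports can also be looked for outside of rala. The sketch below is an independent check, not rala's code: it compares the query and target lengths recorded in a PAF file (columns 2 and 7) with the lengths of the sequences in the read file, matching reads by the part of the header before the first whitespace. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each read name (header up to the first whitespace) to its length."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0  # a repeated name overwrites the earlier record
        elif name is not None:
            lengths[name] += len(line.strip())
    return lengths


def find_length_mismatches(
    paf_lines: Iterable[str], lengths: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (name, length in PAF, length in FASTA) for each disagreeing read."""
    mismatches: List[Tuple[str, int, int]] = []
    reported = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        # PAF columns: 1 = query name, 2 = query length,
        #              6 = target name, 7 = target length.
        for name, paf_length in ((fields[0], int(fields[1])),
                                 (fields[5], int(fields[6]))):
            fasta_length = lengths.get(name)
            if (fasta_length is not None and fasta_length != paf_length
                    and name not in reported):
                reported.add(name)
                mismatches.append((name, paf_length, fasta_length))
    return mismatches
```

Both functions take any iterable of lines, so they can be fed open file handles directly, for example `find_length_mismatches(open("overlaps.paf"), read_fasta_lengths(open("reads.fasta")))`, where both file names are placeholders.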
Okay, I have just copied the whole log file, since I do not know which part you are interested in.
|
By log I meant that you run ra again with your data and paste here everything it outputs until the error occurs again :) |
ok, sorry. It will take a while to run. I will post here tomorrow. |
Hi @rvaser , can I send you the whole log file by email? The same error has occurred.
|
Please do so :) |
Please run |
@rvaser I have just sent the read of interest via email. |
Please run The read length that my parser obtains matches the one in the file (it even has the length in its name m54229_180630_022602 74908655 0_23317). So I guess that there might be two sequences with the same name. Or something went wrong in minimap2 parsing, which I'll look into if there is only one match in the above grep command. |
Hi @rvaser, It can't be only one match, since all reads (headers) in zander_pooled.subreads.renamed.fasta start with "m54229_180630_022602" |
That is the problem. I only take the read name up to the first white space to "save" memory, nonetheless I think it is expected that this part of the name should be unique. You can easily circumvent this problem by replacing all spaces with underscores in your read file. Something like this |
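To illustrate the name collision described above, here is a minimal sketch that truncates headers at the first whitespace, reports names that are no longer unique, and shows that replacing whitespace with underscores removes the collision. It is not the command from the original comment, and the function names are hypothetical. The first header in the usage example is the one quoted in this thread; the second is made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def truncated_name(header: str) -> str:
    """Return the part of a FASTA/FASTQ header before the first whitespace."""
    fields = header.strip().lstrip(">@").split(None, 1)
    return fields[0] if fields else ""


def duplicated_names(headers: Iterable[str]) -> Dict[str, int]:
    """Return truncated names that occur more than once, with their counts."""
    counts = Counter(truncated_name(header) for header in headers)
    return {name: count for name, count in counts.items() if count > 1}


def underscore_header(header: str) -> str:
    """Replace every run of whitespace inside a header with one underscore."""
    return "_".join(header.split())


headers = [
    ">m54229_180630_022602 74908655 0_23317",  # header quoted in the thread
    ">m54229_180630_022602 74908656 0_1000",   # hypothetical second read
]
print(duplicated_names(headers))
print(duplicated_names(underscore_header(h) for h in headers))
```

The first call reports `m54229_180630_022602` as occurring twice, while the second returns an empty dictionary because the full header now forms the name.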
Ok, thank you. Doesn't your assembler (RA) use any information from the read names (FASTA headers)? Some assemblers do, and renaming FASTA headers causes problems with them. I will rename as suggested and re-run the RA assembler. Thank you very much. I hope it fixes it. |
Headers are discarded immediately and replaced with ids. I hope that this fixes it as well :D |
Okay, then it doesn't matter what is in the FASTA headers :) I will let you know. |
PS: Do I need to reset the part of the code in rala, that I edited? |
Nope, it will only trigger if the same error occurs. |
Okay. 👍 |
Hi @rvaser |
Hi Julien, Best regards, |
Hi Robert, Best, |
That will be a bit tight because overhead for Illumina data equals 0.5 times the size of the FASTQ file. You might want to downsample it a bit. How much is the coverage if you have freaking ~500GB of sequences (i.e. how large is the genome you are assembling)? |
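The rule of thumb above can be turned into a quick back-of-the-envelope calculation. The sketch below reads the comment as a memory estimate (overhead of roughly 0.5 times the FASTQ size) and derives the largest fraction of the Illumina data that would fit into a given amount of RAM. Only the 0.5 factor comes from the thread; the function names, the `reserved_gb` parameter and the RAM figures are assumptions for illustration.

```python
def illumina_overhead_gb(fastq_gb: float, factor: float = 0.5) -> float:
    """Estimated overhead for Illumina data, using the 0.5 factor from the thread."""
    return factor * fastq_gb


def max_fastq_fraction(fastq_gb: float, available_ram_gb: float,
                       reserved_gb: float = 0.0, factor: float = 0.5) -> float:
    """Largest fraction of the FASTQ data whose estimated overhead fits in RAM.

    reserved_gb is a hypothetical allowance for everything else that has to
    stay in memory (contigs, overlaps, the operating system).
    """
    budget_gb = available_ram_gb - reserved_gb
    if budget_gb <= 0:
        return 0.0
    overhead_gb = illumina_overhead_gb(fastq_gb, factor)
    if overhead_gb <= 0:
        return 1.0
    return min(1.0, budget_gb / overhead_gb)


# With ~500 GB of FASTQ and a hypothetical 125 GB machine, about half of the
# reads would fit according to this estimate.
print(illumina_overhead_gb(500))
print(max_fastq_fraction(500, 125))
```

This is only an estimate of one term; actual peak memory also depends on the assembly size and the overlaps held in memory at the same time.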
Actually, the Illumina data are in FASTQ. Thus, the raw reads size is ~50% of the FASTQ (= ~450 GB). The theoretical (k-mer profile) mean coverage is ~140X. I have not yet computed the real sequencing coverage, since the genome size is unknown. But I expect a genome size < 2 Gb |
What is the most RAM-consuming step in the RA pipeline? I think, it should be |
Racon, as it loads all reads into memory. The problem will be the Illumina polishing step. You can try running ra without Illumina data and try to polish the assembly afterwards with Illumina and racon. If the program crashes, all intermediate files will be deleted, including the PB polished assembly. |
Or you can let it run, be sneaky and copy the intermediate files while it polishes with Illumina data :) |
Okay, this is a great suggestion 👍 . If ra only uses illumina data for polishing (not for consensus), then it also makes sense to run ra without NGS data. I can polish the PB contig afterwards using racon, as you suggested. |
Hi Robert, As expected (and feared) ra has failed because of RAM overload :( I think, our data load and computational resources are just not suitable to run ra in NGS+TGS setup. RA can run assembly with only TGS, right? Maybe I should try running ra just with PB-data. Best, |
Hi Julien, Best regards, |
But according to your workflow description on GitHub, even with only TGS data, racon is run twice over the preliminary PB-only assembly. By manual polishing, you probably mean polishing the contigs created by ra using a subset of high-quality (e.g. Q>35) Illumina reads. I will definitely try this approach. Thanks for the kind overall support on this issue. |
Racon is run twice with TGS data, and once with NGS data. The NGS polishing is the part you will do manually due to memory shortage, i.e. polish the outputted ra contigs with a subset of Illumina reads. |
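One way to obtain such a subset is random downsampling of the FASTQ file. The sketch below is a generic approach, not something ra or racon provides: it keeps each read with a fixed probability, using a seeded random generator so the selection is reproducible. It assumes standard four-line FASTQ records; the function name and parameters are hypothetical.

```python
import random
from typing import Iterable, Iterator, List


def subsample_fastq(lines: Iterable[str], fraction: float,
                    seed: int = 0) -> Iterator[List[str]]:
    """Yield four-line FASTQ records, keeping each with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded, so repeated runs select the same reads
    record: List[str] = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            # Exactly one random draw per record, whether or not it is kept.
            if rng.random() < fraction:
                yield record
            record = []


def write_subsample(src_path: str, dst_path: str, fraction: float,
                    seed: int = 0) -> int:
    """Stream src_path into dst_path, returning the number of reads kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for record in subsample_fastq(src, fraction, seed):
            dst.writelines(record)
            kept += 1
    return kept
```

Because one random number is drawn per record, running this with the same seed over two paired-end files that list their reads in the same order keeps the same read pairs in both. Selecting reads by quality instead, as suggested earlier in the thread, would need a different filter.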
Hi there, sorry to reawaken this, but it seems I'm having the same problem with ra. Minimap2 runs just fine and creates an overlaps.paf, rala begins and creates an empty preconstruction.fasta file... and then the job stops, the working directory is emptied, and this gets printed to stdout:

[M::main] Version: 2.14-r892-dirty

I'm calling ra like this:

../ra/build/bin/ra -t 40 -u -x ont Lm10_151218_renamed.fastq

I do not think that it's a problem with unique headers, as some of the messages above seemed to indicate. I saw the same problem both on my raw dataset and on a version where I used fastx_renamer to give each read a number in sequence.

I am running this on an amplified dataset: a longish-insert library that I amplified with adapter-driven PCR, following an optimised custom protocol that should minimize coverage bias, although it's possible that some exists. Because PCR is selective for shorter inserts, the results tend to look like this: 6823466 reads with a read length N50 of 2643 bp and a maximum of 14993 bp.

I'm having a bear of a time getting canu to finish on these (lots of relatively short reads cause the all-by-all mhap steps to take forever with it), so I was really excited when I came across ra, which seems like it should have some considerable advantages over e.g. miniasm. But it seems that as written there's a difficulty parsing the paf... Happy to provide any materials you might need to help debug this: log files, raw reads, etc. |
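For reference, the read length N50 quoted above is the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch of that computation, with a hypothetical function name and toy input:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of read (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("need at least one read with positive length")
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the cumulative sum reaches half.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for a positive total")


# Toy example: the total is 20 bases, and reads of length >= 5 cover 11 of them.
print(n50([2, 3, 4, 5, 6]))
```

On a dataset dominated by short amplified inserts, the N50 sits far below the maximum read length, as in the figures quoted above.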
Hi Christopher, Best regards, |
Hi Robert, I've just sent you a link by email to download these data and play around yourself. I find with this dataset on my local cluster that minimap2 needs a minimum of 60 GB to work with the 40 threads I've been assigning it. Pretty sure the headers all should be different after running fastx_renamer. So I think it should be some other kind of problem... Grazie mille, |
Hi,
RA has failed with the following error message
unequal lengths in sequence and overlap file for sequence with id 508204
without any further log message. Here are the last few lines of the log file:
I am running RA on a Debian-based Linux workstation.
I have no clue what is causing that bug. Thanks for any hints.
Cheers,
Julien