Skip to content
This repository has been archived by the owner on May 30, 2019. It is now read-only.

RA failed - unequal lengths in sequence and overlap file for sequence with id xxxxx #4

Open
bbalog87 opened this issue Nov 22, 2018 · 46 comments

Comments

@bbalog87
Copy link

bbalog87 commented Nov 22, 2018

Hi,
RA has failed with the following error message

unequal lengths in sequence and overlap file for sequence with id 508204 without any further log message.

Here are the last few lines of the log-file:

[M::main] CMD: /disk2/nguinkal/ra/vendor/minimap2/minimap2 -t 100 -x ava-pb /disk2/nguinkal/mnt/winpro/I3-NGS-Daten/NGS-Fisch/Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta /disk2/nguinkal/mnt/winpro/I3-NGS-Daten/NGS-Fisch/Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta
[M::main] Real time: 12065.368 sec; CPU: 868149.797 sec
[Ra::run] preconstruction stage
[rala::Graph::initialize] loaded sequences
[rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 508204!

I am running RA on Debian based linux Workstation.
I have no clue what is causing that bug. Thanks for any hints.
Cheers,
Julien

@rvaser
Copy link
Owner

rvaser commented Nov 22, 2018

Hi Julien,
please run the following commands from ra root folder and paste the output here:

cd vendor/rala
git status
git log

Best regards,
Robert

@bbalog87
Copy link
Author

bbalog87 commented Nov 22, 2018

Hi @rvaser
Thanks for replying.

git status

On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)

	modified:   ../../../vendor/rala (modified content, untracked content)

no changes added to commit (use "git add" and/or "git commit -a")

git log

Author: rvaser <[email protected]>
Date:   Sat Nov 17 09:10:56 2018 +0100

   Fixed illumina polishing extension

commit 20a39156f7efb59de36616e6f0314202b287b804
Author: rvaser <[email protected]>
Date:   Sat Nov 17 00:50:22 2018 +0100

   Updated rala

commit ca71a013084ed41f3ecc2ca3918be5f5a2f482d9
Author: rvaser <[email protected]>
Date:   Thu Nov 8 11:28:39 2018 +0100

   Updated rala

@rvaser
Copy link
Owner

rvaser commented Nov 22, 2018

Please run head -n 1016410 zander_pooled.subreads.renamed.fasta | tail -n 2 and paste the sequence causing the problem here.

@rvaser
Copy link
Owner

rvaser commented Nov 23, 2018

Please also run git log from ra/vendor/rala.

@bbalog87
Copy link
Author

bbalog87 commented Nov 23, 2018

> head -n 1016410 zander_pooled.subreads.renamed.fasta | tail -n 2
just output this:
AGAGCTCTGTCCACGTCAATATCGCTCCTAATTTACATTTCACCTTATTTAAACGCTTTT
GGTGTTTTGTGTAGGTTATTCCCTCTTTCAGTAATGTAAGCATTAACTGCGATGCCGCTA

Why not -n 508204 instead of -n 1016410?
Best, Julien

> git log in ra/vendor/rala
outputs this:

Author: rvaser <[email protected]>
Date:   Fri Nov 16 23:29:36 2018 +0100

    Reverted overlap trimming

commit 9cfb0076a20e36591f4329512a8f9f16e0b9eca1
Author: rvaser <[email protected]>
Date:   Fri Nov 16 19:58:07 2018 +0100

    Reverted some changes

@rvaser
Copy link
Owner

rvaser commented Nov 23, 2018

As each sequence is usually represented with two lines in FASTA format (one for header and one for the data), we are using 2 * read_id + 2 to obtain the sequence with head and tail. Nonetheless, your file seems to be wrapped to 60 characters per line which makes it a bit harder to extract the sequence by id. The error you have encountered is that my parser has obtained a sequences with length that differs from the length stored in the .paf file generated from minimap2, which is quite odd as I have tested this feature. I'll check again.

P.S. I truncated the outputs in above messages so I do not have to scroll too much :D

@rvaser
Copy link
Owner

rvaser commented Nov 23, 2018

Wrapped files are forking for me. Do you mind sending me your file so I can investigate further?

@bbalog87
Copy link
Author

Okay, I understand the issue. But how can I fix it?
unfortunately I cannot send some sample data because the files are about ~950GB for illumina PE data and ~120GB for PacBio data, since RA expects two types of data: TSG and NGS data.
PS: My PE-data are in file.R1.fq and file.R2 data. I have poooled them in one file before passing them to RA. Maybe, minimap2 has problems in mapping such pooled reads?

@rvaser
Copy link
Owner

rvaser commented Nov 23, 2018

I understand, that is a lot of data. The error occurred in the layout step where only TGS data is used. Can you please provide me with the command you run?

@bbalog87
Copy link
Author

Ok, here is the command I ran

$: ./ra -t 100 -x pb /path/to/pacbioData/SMRT_Sequel-pooled/zander_pooled.subreads.renamed.fasta /path/to/pe-reads/pooled/in/one/singleFile/HiSeqX/pe.renamed.fastq

@rvaser
Copy link
Owner

rvaser commented Nov 23, 2018

That looks fine. I am not sure what could be wrong. By the extension .renamed.fasta, did you perhaps rename your reads with ids? If so can you grep the reported sequence from the file?

@bbalog87
Copy link
Author

Yes, I renamed my reads with Ids
like:

1
READ1
2
READ2
...

@bbalog87
Copy link
Author

Sorry. No reads were renamed, just the headers in NGS reads file

@rvaser
Copy link
Owner

rvaser commented Nov 26, 2018

Can you please replace this lines https://github.com/rvaser/rala/blob/iterative/src/overlap.cpp#L55-L57 (located at vendor/rala/src/overlap.cpp) with:

fprintf(stderr, "[rala::Overlap::transmute] error: "
    "unequal lengths in sequence and overlap file for sequence with id %lu (%u - %zu)!\n",
    a_id_, a_length_, piles[a_id_]->data().size());

and https://github.com/rvaser/rala/blob/iterative/src/overlap.cpp#L74-L76 (in the same file) with:

fprintf(stderr, "[rala::Overlap::transmute] error: "
    "unequal lengths in sequence and overlap file for sequence with id %lu (%u - %zu)!\n",
    b_id_, b_length_, piles[b_id_]->data().size());

Run make afterwards and paste the log here again please.

@bbalog87
Copy link
Author

bbalog87 commented Nov 27, 2018

Okay, I have just copied the whole log file, since I do not know in which part you are interested to.

Author: rvaser <[email protected]>
Date:   Fri Nov 16 23:29:36 2018 +0100

    Reverted overlap trimming

commit 9cfb0076a20e36591f4329512a8f9f16e0b9eca1
Author: rvaser <[email protected]>
Date:   Fri Nov 16 19:58:07 2018 +0100

@rvaser
Copy link
Owner

rvaser commented Nov 27, 2018

By log I meant that you run ra again with your data and paste here everything it outputs until the error occurs again :)

@bbalog87
Copy link
Author

ok, sorry. It will take a while to run. I will post here tomorrow.

@bbalog87
Copy link
Author

Hi @rvaser ,

can I send you the whole log file by email? The same error has occurred.

[M::main] CMD: /Zander_Project/pipelines/assembly/ra/vendor/minimap2/minimap2 -t 130 -x ava-pb /Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta /Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta
[M::main] Real time: 11905.419 sec; CPU: 882940.242 sec
[Ra::run] preconstruction stage
[rala::Graph::initialize] loaded sequences
[rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 508204 (23914 - 23317)!

@rvaser
Copy link
Owner

rvaser commented Nov 28, 2018

Please do so :)

@rvaser
Copy link
Owner

rvaser commented Nov 28, 2018

Please run awk 'NR==508206' RS='>' zander_pooled.subreads.renamed.fasta > read_of_interest.fasta and send me the file via email.

@bbalog87
Copy link
Author

@rvaser I have just sent the read of interest via email.

@rvaser
Copy link
Owner

rvaser commented Nov 29, 2018

Please run grep "m54229_180630_022602" zander_pooled.subreads.renamed.fasta and make sure there is only one match.

The read length that my parser obtains matches to the one in file (it even has the length in its name m54229_180630_022602 74908655 0_23317) . So I guess that there might be two sequences with the same name. Or something went wrong in minimap2 parsing which I'll look into if there is only one match in the above grep command.

@bbalog87
Copy link
Author

Hi @rvaser,

It can't be only one match, since all reads (headers) in zander_pooled.subreads.renamed.fasta start with "m54229_180630_022602"

@rvaser
Copy link
Owner

rvaser commented Nov 29, 2018

That is the problem. I only take the read name up to the first white space to "save" memory, nonetheless I think it is expected that this part of the name should be unique. You can easily circumvent this problem by replacing all spaces with underscores in your read file. Something like this sed 's/ /_/g' zander_pooled.subreads.renamed.fasta > zander_pooled.subreads.fixed.fasta.

@bbalog87
Copy link
Author

Ok Thank you. Your asembler (RA) doesn't get information from the reads name (fasta-headers)? Because some do, and renaming fasta-headers causes problems. I will rename as suggested and re-run RA-assembler. Thank you very much. I hope it fixes it.

@rvaser
Copy link
Owner

rvaser commented Nov 29, 2018

Headers are discarded immediately and replaced with ids. I hope that this fixes it as well :D

@bbalog87
Copy link
Author

bbalog87 commented Nov 29, 2018

Okay, then it doesn't matter whatever is in the fast-headers :) I would let you know.

@bbalog87
Copy link
Author

PS: Do I need to reset the part of the code in rala, that I edited?

@rvaser
Copy link
Owner

rvaser commented Nov 29, 2018

Nope, it will only trigger if the same error occurs.

@bbalog87
Copy link
Author

Okay. 👍

@bbalog87
Copy link
Author

bbalog87 commented Dec 3, 2018

Hi @rvaser
minimap2 stage finished without error. rala is now running. I think the error was triggered by non-unique reads identifiers in the fasta file.
PS: Do you have an estimation of the expected run time of Ra depending on the data?

@rvaser
Copy link
Owner

rvaser commented Dec 3, 2018

Hi Julien,
that depends on how repetitive is your genome. Depending on the size of your data and if you don't have anything plant related, I would say under a week. I hope you have enough RAM as you have 1TB Illumina reads :D

Best regards,
Robert

@bbalog87
Copy link
Author

bbalog87 commented Dec 3, 2018

Hi Robert,
I have 1.5 TB RAM and I'm using 40 cpus.
Illunima PE-data : ~ 950 GB
PacBio data : ~120 GB
I hope the RAM will not crash :)
The genome is not so repetitive. (~12% repeat content, 0.1% hetorozygosity)

Best,
Julien

@rvaser
Copy link
Owner

rvaser commented Dec 3, 2018

That will be a bit tight because overhead for Illumina data equals 0.5 times the size of the FASTQ file. You might want to downsample it a bit. How much is the coverage if you have freaking ~500GB of sequences (i.e. how large is the genome you are assembling)?

@bbalog87
Copy link
Author

bbalog87 commented Dec 3, 2018

Actuallly, the illumina data are in fastq. Thus, the raw reads size is ~50% of the fastq (= ~450 GB). The theoretical (k-mer profile) mean coverage is ~140X. I have not computed yet the real sequencing coverage, since the genome size is unknown. But I expect a genome size < 2 Gb

@bbalog87
Copy link
Author

bbalog87 commented Dec 3, 2018

What is the most RAM-consuming step in the RA pipeline?

I think, it should be racon-stage. Currently, rala has a peak RAM consumption of only ~140 GB.

@rvaser
Copy link
Owner

rvaser commented Dec 3, 2018

Racon, as it loads all reads into memory. The problem will be the Illumina polishing step. You can try running ra without Illumina data and try to polish the assembly afterwards with Illumina and racon. If the program crashes, all intermediate files will be deleted, including the PB polished assembly.

@rvaser
Copy link
Owner

rvaser commented Dec 3, 2018

Or you can let it run, be sneaky and copy the intermediate files while it polishes with Illumina data :)

@bbalog87
Copy link
Author

bbalog87 commented Dec 3, 2018

Okay, this is a great suggestion 👍 . If ra only uses illumina data for polishing (not for consensus), then it also makes sense to run ra without NGS data. I can polish the PB contig afterwards using racon, as you suggested.

@bbalog87
Copy link
Author

bbalog87 commented Dec 6, 2018

Hi Robert,

As expected (and feared) ra has failed because of RAM overload :( I think, our data load and computational resources are just not suitable to run ra in NGS+TGS setup. RA can run assembly with only TGS, right? Maybe I should try running ra just with PB-data.

Best,
Julien

@rvaser
Copy link
Owner

rvaser commented Dec 6, 2018

Hi Julien,
as I thought you have unusual amount of Illumina data. You can of course run ra with PB only data. Later if you want to you can subsample the Illumina reads for manual polishing with racon.

Best regards,
Robert

@bbalog87
Copy link
Author

bbalog87 commented Dec 6, 2018

But according to your workflow description in github, even, with only TGS data, racon is ran twice over the preliminary PB only-based assembly. By manual polishing, you probably mean, polishing the contigs created by ra using a subset o high quality (e.g. Q>35) illumina reads. I will definitively try this approach.

Thanks for the kind overall support for this issue.

@rvaser
Copy link
Owner

rvaser commented Dec 6, 2018

Racon is run twice with TGS data, and once with NGS data. The NGS polishing is the part you will do manually due to memory shortage, i.e. polish the outputted ra contigs with subset of Illumina reads.

@claumer
Copy link

claumer commented Jan 10, 2019

Hi there, sorry to reawaken this, but it seems I'm having the same problem with ra. Minimap2 runs just fine, creates an overlaps.paf, rala begins and creates an empty preconstruction.fasta file... end then the job stops, the working directory is emptied, and this gets printed to stdout:

[M::main] Version: 2.14-r892-dirty
[M::main] CMD: /nfs/research1/marioni/claumer/ra/vendor/minimap2/minimap2 -t 40 -x ava-ont Lm10_151218_renamed.fastq Lm10_151218_renamed.fastq
[M::main] Real time: 3842.330 sec; CPU: 57219.102 sec; Peak RSS: 57.284 GB
[rala::Graph::initialize] loaded sequences
[rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 2345870!
[Ra::run] preconstruction overlap stage
[Ra::run] preconstruction stage
[Ra::exit] warning: unable to clean work directory!

I'm calling ra like this:

../ra/build/bin/ra -t 40 -u -x ont Lm10_151218_renamed.fastq

And I do not think that it's a problem with unique headers as some of the messages above seemed to indicate... I saw the same problem both on my raw dataset and on a version where I used fastx_renamer to give each read a number in sequence.

I am running this on an amplified dataset: a longish-insert library that I have used adapter-driven PCR to amplify, following an optimised custom protocol that should minimize coverage bias, although it's possible that some exists. Because of the nature of PCR which is selective for shorter inserts, the results tend to look like this:

6823466 reads with read length N50 of 2643bp and maximum of 14993bp
8191293 reads with read length N50 of 1513bp and maximum of 3497bp

I'm having a bear of a time getting canu to finish on these (lots of relatively short reads causes the all-by-all mhap steps to take forever with it), so I was really excited when I came across ra which seems like it should have some considerable advantages over e.g. miniasm. But it seems that as written there's a difficulty parsing the paf...

Happy to provide any materials you might need to help debug this, log files, raw reads, etc.

@rvaser
Copy link
Owner

rvaser commented Jan 10, 2019

Hi Christopher,
have you checked that the identifiers are all different? You can send me the data via mail if you don't mind so I can investigate locally.

Best regards,
Robert

@claumer
Copy link

claumer commented Jan 10, 2019

Hi Robert,

I've just sent you a link by email to download these data and play around yourself. I find with this dataset on my local cluster that minimap2 needs a minimum of 60 GB to work with the 40 threads I've been assigning it.

Pretty sure the headers all should be different after running fastx_renamer. So I think it should be some other kind of problem...

Grazie mille,
Chris Laumer

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants