
Non-deterministic segfault and misc. issues #15

Open
Sebastien-Raguideau opened this issue Jan 21, 2022 · 2 comments


@Sebastien-Raguideau

Hello,

Thanks for FastK, it is truly useful.

I'm looking at the k-mer coverage of the contigs of a small assembly. This means that, to get a k-mer coverage/histogram for each single contig, I need to create one FASTA file per contig and apply FastK to it, which is quite involved.
I do trivial parallelisation over all the files. My pipeline stops randomly as a consequence of a segfault on random contigs. Running the offending step by itself, outside the pipeline, does not reproduce the issue. Restarting the pipeline from scratch makes it segfault again, but not on the same files. Could it be an issue with multiple FastK instances writing to the same temporary folder?
Here is an example of the error message:
/bin/bash: line 1: 193787 Segmentation fault (core dumped) FastK -t1 -T2 seqs/folder43/contig_12.fa -P'seqs/folder43'
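
For context, the parallel step is essentially equivalent to the sketch below (paths and the job count are illustrative; the per-instance mktemp directory is my guess at isolating the instances, not what the pipeline currently does):

    # One FastK run per single-contig FASTA, at most 8 at a time.
    # Each instance gets its own -P temp directory, on the assumption
    # that instances sharing a temp folder may collide.
    for fa in seqs/folder*/contig_*.fa; do
        tmp=$(mktemp -d)
        echo "FastK -t1 -T2 '$fa' -P'$tmp'; rm -rf '$tmp'"
    done | xargs -P 8 -I{} bash -c '{}'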

I am having another, similar problem with random segfaults, this time with Logex:
Logex -H 'result=A&.B' sample.ktab contigs.ktab
In all the examples I've looked at, the segfault happened when contigs.ktab was empty (a well-formed table with 0 k-mers). Yet running the same command line outside the pipeline works without issue and does produce a working .hist file (albeit one with no k-mers).
This second issue is less problematic in the sense that I can simply pre-filter out the empty contigs.ktab files, as sketched below.
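
The pre-filter I have in mind is just a length check before calling Logex (a minimal sketch; the assumption, which holds in my data, is that the empty tables are exactly the contigs shorter than k = 40, and contig.fa stands in for the per-contig file):

    # Skip Logex when the single contig is shorter than k, since such
    # a contig yields a well-formed but empty .ktab (the case that segfaults).
    k=40
    len=$(awk '!/^>/ {n += length($0)} END {print n+0}' contig.fa)
    if [ "$len" -ge "$k" ]; then
        Logex -H 'result=A&.B' sample.ktab contigs.ktab
    fi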

Additionally, here are a few miscellaneous issues I encountered. I'm mostly puzzled by the first one:

  • Looking at k-mers in small individual genes/contigs, for sequences of size 100+ bp and k = 40, I get the following error message:
    "FastK: Too much of the data is in reads less than the k-mer size".
    If I append "NN" to the end of the sequence, I obtain the expected results without failure, and some other, smaller sequences do not show the issue at all. I have attached an example (the workaround is sketched after this list).
  • Fastmerge doesn't handle empty ktab files: it segfaults.
  • Fastq.gz files are unzipped in the working directory but not removed afterwards.
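
For the record, the "NN" workaround from the first point amounts to the following (single-contig FASTA assumed; padded.fa is just an illustrative name, and that k-mers containing N are not counted is an assumption consistent with the padding leaving my histograms unchanged):

    # Append two Ns to the (single) sequence before running FastK;
    # in my tests this avoids the short-read error without changing counts.
    cp contig_12.fa padded.fa
    printf 'NN\n' >> padded.fa
    FastK -t1 -T2 padded.fa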

If it's useful, I'm working on Ubuntu 16.04.7 LTS with gcc version 9.4.0.

Best,
Seb

@thegenemyers
Owner

thegenemyers commented Jan 30, 2022 via email

@Sebastien-Raguideau
Author

Hi Gene,

Thanks a lot for all of this.

Yes, in my pipeline there is no possibility for two roots to be identical, so it is likely not related to the temporary files.
I am not on a distributed system, but I can see how overly frequent writing/reading could be an issue. I'll try to reduce the number of concurrent jobs, along the lines of the sketch below, and see if I can make the problem disappear.
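
Concretely, I can cap the job count with something like this, lowering -j step by step (GNU parallel syntax; the -j value is illustrative, and {//} expands to each input file's directory, matching the -P usage in my original command line):

    # At most 4 concurrent FastK instances over all per-contig files.
    parallel -j 4 "FastK -t1 -T2 {} -P{//}" ::: seqs/folder*/contig_*.fa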

I didn't realise that the remaining unzipped files were from me interrupting my pipeline... That makes sense, as it would not occur every time.

I am myself also unclear on the value of looking at such small reads/contigs. It is quite possible that in the future I won't trust the coverage of contigs under a certain size, though I am pretty sure it will always be under 10Mbp :)

Thanks again,
Seb
