Non-deterministic segfault and miscs #15
Comments
Hi Seb,
Thanks for your input. As best I recall, all the temporary file names have the form <temp_path>/<root>..., where <temp_path> is the temporary directory (seqs/folder43 in your example) and <root> is the root of the first file after stripping off any suffixes (contig_12 in your example). So if these two strings are the same for any two calls then yes, you are going to have a problem.
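A minimal sketch of one way to keep those two strings distinct when many FastK jobs run at once: give each invocation its own temporary directory via -P. The loop, paths, and -t/-T values below are illustrative only, not something prescribed in this thread.

    # Give every FastK run a private temp directory so no two concurrent
    # jobs can ever share the same <temp_path>/<root> prefix.
    for fa in seqs/folder43/*.fa; do
        tmp=$(mktemp -d)               # unique temporary directory per call
        FastK -t1 -T2 -P"$tmp" "$fa"
        rm -rf "$tmp"                  # clean up the per-run temp directory
    done

The same pattern applies whether the loop runs serially or is dispatched in parallel by a workflow manager; the only requirement is that <temp_path>/<root> differs across simultaneous runs.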
Beyond that it is hard to say anything, albeit I'll mention that, if on a distributed file system jobs are too small, then I have observed such jobs can crash due to I/O synchronization failures (of the distributed system). What you describe may be in that regime; one indicator would be that the jobs that crash on any given attempt vary.
I did check and fix problems involving empty tables, so now both Logex and Fastmerge should be fine on empty tables.
I also fixed it so any files FastK unzipped are cleaned up (the code should have done so for normal exits, but I arranged it so cleanup occurs in the event of an abnormal exit also).
FastK was not really designed for arbitrarily short reads, especially ones that are not much bigger than the k-mer size (40 by default). It expects a large corpus of data and actually "trains" on an initial 10Mbp or so (if it's available) to determine how to distribute k-mers for the sort proper. With a 100bp sequence it just doesn't have enough data to train on, and hence the error. But I thought about this and really it should just be a warning; FastK will work even if the training "fails", so I have changed the code appropriately. So your short example should now run to completion, albeit you will see a warning statement.
The changes above have been pushed to the GitHub master. Please let me know if there is anything else I can do.
Best, Gene
On 1/21/22, 3:02 PM, Sebastien-Raguideau wrote:
Hello,
Thanks for FastK, it is truly useful.
I'm looking at kmer coverage of contigs of a small assembly. It means that if I want to get the kmer coverage/histogram for each single contig, I need to create one fasta file per contig and apply FastK to it. It is quite involved.
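For concreteness, a minimal sketch of that split-and-run step. The awk one-liner, the assembly.fa and per_contig/ names, and the assumption that contig headers make valid file names are all illustrative, not from the report.

    # Split a multi-FASTA assembly into one file per contig, then run FastK
    # on each piece (k=40 is FastK's default k-mer size).
    mkdir -p per_contig
    awk '/^>/ { if (f) close(f); f = "per_contig/" substr($1, 2) ".fa" } { print > f }' assembly.fa

    for fa in per_contig/*.fa; do
        FastK -k40 -t1 -T2 "$fa"       # add -P<dir> per run as in the sketch above
    done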
I do trivial parallelisation over all files. My pipeline stops randomly as a consequence of a segfault on random contigs. Running the incriminated step by itself outside of the pipeline does not reproduce the issue. Restarting the pipeline from scratch does make it segfault again, but not on the same files. My intuition is that there might be an issue with multiple FastK instances writing in the same temporary folder?
Here is an example of an error message:

    /bin/bash: line 1: 193787 Segmentation fault (core dumped) FastK -t1 -T2 seqs/folder43/contig_12.fa -P'seqs/folder43'
I am having another similar problem with a random segfault, this time with Logex:

    Logex -H 'result=A&.B' sample.ktab contigs.ktab
In all the examples I've looked at, the segfault happened when contigs.ktab was empty (a well-formed table with 0 kmers). Though running the same command line outside of the pipeline works without issue and indeed produces a working .hist file (even though with no kmers).
This second issue is less problematic in the sense that I can just pre-filter for empty contigs.ktab.
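One possible shape for that pre-filter, as a sketch only: it assumes Tabex's LIST mode prints one line per table entry and nothing for an empty table, which is an assumption about the output format rather than something stated in this thread.

    # Hypothetical guard: run Logex only if contigs.ktab contains at least
    # one k-mer entry (assumes `Tabex ... LIST` emits one line per entry).
    if [ -n "$(Tabex contigs.ktab LIST | head -n 1)" ]; then
        Logex -H 'result=A&.B' sample.ktab contigs.ktab
    else
        echo "contigs.ktab is empty; skipping Logex" >&2
    fi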
Additionally, here are a few miscellaneous issues I encountered. I'm mostly puzzled by the first one:
* Looking at kmers in small individual genes/contigs, for sequences of size 100+bp and k=40, I get the following error message: "FastK: Too much of the data is in reads less than the k-mer size". If I append "NN" at the end of the sequence, I obtain the expected results without failure. And some other smaller sequences do not show that issue. I attached an example <https://github.com/thegenemyers/FASTK/files/7913702/example.txt>.
* Fastmerge doesn't handle empty ktab. It segfaults.
* Fastq.gz files are unzipped in the working directory but not removed afterwards.
If it's useful, I'm working on Ubuntu 16.04.7 LTS with gcc version 9.4.0.
Best,
Seb
Hi Gene,
Thanks a lot for all of this. Yes, from my pipeline there is no possibility for two roots to be identical, so it is likely not related to temporary files. I didn't realise that the remaining unzipped files were from me interrupting my pipeline... That makes sense, as it would not occur all the time. I am myself also unclear on the interest of looking at such small reads/contigs; it is quite possible that in the future I won't trust the coverage of contigs under a certain size, though I am pretty sure it will always be under 10Mbp :)
Thanks again,