
hawk.out run too long #20

Open
SC-Duan opened this issue May 8, 2020 · 18 comments
@SC-Duan

SC-Duan commented May 8, 2020

Hi,
I have 91 samples, and running "hawk.out 42 49" is taking too long (18 days so far); the hawk_out.txt file is still empty. I set noThread=10. What could be wrong? The read coverage of each sample is about 8x, and the genome size is 2.2 Gb.
Thank you!

@atifrahman
Owner

That is really surprising. For the datasets we have analyzed, hawk takes much less time compared to the time needed for counting k-mers. How long did k-mer counting take?

We ran hawk on ~200 human samples and it took about a day (with 32 threads).

@SC-Duan
Author

SC-Duan commented May 11, 2020

k-mer counting was very fast. Is there some way to diagnose the problem with hawk.out?

@atifrahman
Owner

Can you please share the 'hawk_out.txt' file?

@SC-Duan
Author

SC-Duan commented May 18, 2020

The hawk_out.txt file is empty.

@SonjaKersten

Hi, I have the same issue. I have been running runHawk on Reads_case_sorted.txt and Reads_control_sorted.txt (124 GB each) for 13 days now. The hawk_out.txt file is still empty. What could be wrong?

@robertwhbaldwin

Did someone figure out what the problem was? I may be having the same issue. Thanks - Robert

@robertwhbaldwin

I think that I may have a similar problem. I'm running runHawk with k-mer counts from 50 samples sequenced to ~9x coverage (genome size 2 Gb). Each k-mer file was 20-25 GB.
Anyway, I'm running HAWK on an AWS EC2 instance and have already racked up $200 in charges, so I'd like to know whether I should kill it or let it run. The k-mer counting step took about two days (one hour per sample). runHawk has been going for 23 hours; I expected it to have finished by now. The instance type is m5ad.4xlarge with 16 vCPUs and 64 GB RAM. Thanks - Robert

@atifrahman
Owner

Has anything been written to case_out_wo_bonf.kmerDiff and control_out_wo_bonf.kmerDiff? If not, you can probably kill it.
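A minimal shell check along these lines (file names taken from the comment above; `-s` tests that a file exists and is non-empty) might look like:

```shell
# Check whether runHawk appears to be making progress by testing
# whether its intermediate output files are non-empty.
for f in case_out_wo_bonf.kmerDiff control_out_wo_bonf.kmerDiff; do
  if [ -s "$f" ]; then
    echo "$f: has data ($(wc -c < "$f") bytes)"
  else
    echo "$f: empty or missing"
  fi
done
```

Running this periodically (or under `watch`) shows whether the files are growing.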

@robertwhbaldwin

No, neither file has anything in it. I'll have to stop the job. It would be good to know what the problem was (e.g., too few resources) and how to spot it early on. None of the files initially created by runHawk grew over the course of the run, so some way to check whether things are progressing would be helpful. Any recommendations on compute resources (more threads, faster CPU, more RAM, etc.) would also be welcome, since I could launch a new instance. I should point out that I had to keep the input files on EBS (remote) storage rather than local storage, which would hamper performance, but I don't think that's the issue here. Thanks - Robert

@atifrahman
Owner

Sorry about that!

We'll look into it. We never encountered this on any of the datasets we used. Since the datasets are so large, it's difficult for others to share them with us for debugging, but we'll give it another shot.

@robertwhbaldwin

I tried runHawk again on a different instance, starting over from the beginning and reinstalling all the software. I ran it with 35 threads and ~80 GB RAM, and moved the input sorted k-mer files to local SSD storage. After leaving it to run overnight (~8 hours), I found it still running with no change to the output files. I saved the AMI but will not attempt this again unless the issue is resolved. My input sorted k-mer files were 25 GB each (50 samples). Does that seem too large for a 2 Gb diploid genome at 9x coverage? Let me know if I can help in any way to resolve this issue.

@atifrahman
Owner

Can you please share one or two of the k-mer count files? We can check whether they are in the expected format. When I tried HAWK on ~200 human samples, the total size of the k-mer count files was >5 TB, so yours don't seem unreasonably large.

@robertwhbaldwin

Do you want the whole files?

@atifrahman
Owner

If possible, yes. You can upload them somewhere and share the link by emailing me at [email protected].

@SonjaKersten

I'm also still stuck on the same issue. However, I don't know whether it's the file size or the fact that I'm running it on only two pools from a bulked segregant experiment (two samples). I would appreciate it if you let me know how the issue gets resolved.
Thanks, Sonja

@robertwhbaldwin

For those still dealing with this problem: it turns out my k-mer files had the incorrect format. Check your k-mer files.
I ran the k-mer counting step with an unmodified version of Jellyfish 2 but applied the patch provided by HAWK. My k-mer files were incorrect because the first column contained the k-mer strings. The first column should be a number encoding the k-mer, not the k-mer string itself; the second column should be the count for that k-mer.
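A quick sanity check for this, sketched as a hypothetical helper in Python (assuming only the two-column layout described above: an integer encoding of the k-mer, then an integer count, whitespace-separated):

```python
def check_kmer_file(path, n_lines=1000):
    """Return True if the first n_lines of the file look like
    '<integer> <integer>' rows, i.e. an encoded k-mer and its count.

    A file produced by an unpatched/mismatched Jellyfish may instead
    contain literal k-mer strings like 'ACGTACGT 5', which fails here.
    """
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= n_lines:
                break
            fields = line.split()
            if len(fields) != 2 or not all(s.isdigit() for s in fields):
                return False
    return True
```

Running it on one of the sorted k-mer count files before launching runHawk would catch the format problem described above early, instead of days into the run.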

@robertwhbaldwin

And I'll add that if you install an unmodified version of Jellyfish, you may need to use version 2.2.10 as suggested in the HAWK documentation. I tried applying the patch to a more recent version of Jellyfish 2 and the output was not formatted properly; when I applied it to 2.2.10, the problem was fixed.

@SonjaKersten

Thanks Robert, I will check and try it out.
