Problem with dRep dereplication with large contigs list #247

Open
yeoeunyun opened this issue Dec 13, 2024 · 4 comments

@yeoeunyun

Hi, Matt. Thanks for developing such a useful tool.

I'm trying to dereplicate a dataset of 132,642 contigs with dRep.

Based on the manual and your comments on GitHub, I understand that for datasets with more than 5,000 contigs, it is recommended to pass a text file listing the contig paths and to use the options --multiround_primary_clustering and --primary_chunksize 3000.

Here is the command I used:
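For readers building such a list file, here is a minimal sketch in Python. It assumes each contig is stored as its own FASTA file in one directory and that the file passed to -g should contain one path per line; the directory name in the usage comment and the set of file extensions are hypothetical, not taken from this thread.

```python
from pathlib import Path


def write_genome_list(fasta_dir, list_path, suffixes=(".fa", ".fasta", ".fna")):
    """Write one absolute FASTA path per line; return how many paths were written."""
    paths = sorted(
        p.resolve()
        for p in Path(fasta_dir).iterdir()
        if p.is_file() and p.suffix in suffixes  # extension list is an assumption
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)


# Example call (hypothetical directory name):
# write_genome_list("ELB_contigs/", "ELB_contig_list.txt")
```

Absolute paths are written so that the list stays valid regardless of the directory dRep is launched from.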

dRep dereplicate ELB_drep_result_total/ \
   -g ELB_contig_list.txt \
   -pa 0.9 \
   -sa 0.95 \
   -nc 0.85 \
   -l 1 \
   --ignoreGenomeQuality \
   -p 50 \
   --multiround_primary_clustering \
   --primary_chunksize 3000 \
   -d

However, the run did not complete. It terminated with only the message "Killed" and did not produce an error log.

Below is the log content saved in the output file:

12-12 22:32 DEBUG    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
12-12 22:32 DEBUG    ***Logger started up at /data/ELB_drep_result_total/log/logger.log***
12-12 22:32 DEBUG    Command to run dRep was: /usr/local/anaconda3/envs/drep/bin/dRep dereplicate ELB_drep_result_total/ -g ELB_contig_list.txt ^Ca 0.9 -sa 0.95 -nc 0.85 -l 1 --ignoreGenomeQuality -p 50 --multiround_primary_clustering --primary_chunksize 3000 -d

12-12 22:32 DEBUG    dRep version 3.5.0 was run 

12-12 22:32 DEBUG    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

12-12 22:32 DEBUG    Namespace(operation='dereplicate', work_directory='ELB_drep_result_total/', processors=50, debug=True, genomes=['ELB_contig_list.txt', '^Ca', '0.9'], length=1.0, completeness=75, contamination=25, ignoreGenomeQuality=True, genomeInfo=None, checkM_method='lineage_wf', set_recursion='0', checkm_group_size=2000, S_algorithm='fastANI', MASH_sketch=1000, SkipMash=False, SkipSecondary=False, skani_extra='', n_PRESET='normal', P_ani=0.9, S_ani=0.95, cov_thresh=0.85, coverage_method='larger', clusterAlg='average', multiround_primary_clustering=True, primary_chunksize=3000, greedy_secondary_clustering=False, run_tertiary_clustering=False, completeness_weight=1, contamination_weight=5, strain_heterogeneity_weight=1, N50_weight=0.5, size_weight=0, centrality_weight=1, extra_weight_table=None, gen_warnings=False, warn_dist=0.25, warn_sim=0.98, warn_aln=0.25, skip_plots=False)
12-12 22:32 DEBUG    Starting the dereplicate operation
12-12 22:32 INFO     ***************************************************
    ..:: dRep dereplicate Step 1. Filter ::..
***************************************************
    
12-12 22:32 DEBUG    Loading work directory in filter
12-12 22:32 DEBUG    Located: /data/ELB_drep_result_total
Datatables: []
Cluster files: []
Arguments: []
12-12 22:32 DEBUG    Validating filter arguments
12-12 22:32 INFO     Will filter the genome list
12-12 22:37 DEBUG    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
12-12 22:37 DEBUG    ***Logger started up at /data/ELB_drep_result_total/log/logger.log***
12-12 22:37 DEBUG    Command to run dRep was: /usr/local/anaconda3/envs/drep/bin/dRep dereplicate ELB_drep_result_total/ -g ELB_contig_list.txt -pa 0.9 -sa 0.95 -nc 0.85 -l 1 --ignoreGenomeQuality -p 50 --multiround_primary_clustering --primary_chunksize 3000 -d

12-12 22:37 DEBUG    dRep version 3.5.0 was run 

12-12 22:37 DEBUG    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

12-12 22:37 DEBUG    Namespace(operation='dereplicate', work_directory='ELB_drep_result_total/', processors=50, debug=True, genomes=['ELB_contig_list.txt'], length=1.0, completeness=75, contamination=25, ignoreGenomeQuality=True, genomeInfo=None, checkM_method='lineage_wf', set_recursion='0', checkm_group_size=2000, S_algorithm='fastANI', MASH_sketch=1000, SkipMash=False, SkipSecondary=False, skani_extra='', n_PRESET='normal', P_ani=0.9, S_ani=0.95, cov_thresh=0.85, coverage_method='larger', clusterAlg='average', multiround_primary_clustering=True, primary_chunksize=3000, greedy_secondary_clustering=False, run_tertiary_clustering=False, completeness_weight=1, contamination_weight=5, strain_heterogeneity_weight=1, N50_weight=0.5, size_weight=0, centrality_weight=1, extra_weight_table=None, gen_warnings=False, warn_dist=0.25, warn_sim=0.98, warn_aln=0.25, skip_plots=False)
12-12 22:37 DEBUG    Starting the dereplicate operation
12-12 22:37 INFO     ***************************************************
    ..:: dRep dereplicate Step 1. Filter ::..
***************************************************
    
12-12 22:37 DEBUG    Loading work directory in filter
12-12 22:37 DEBUG    Located: /data/ELB_drep_result_total
Datatables: []
Cluster files: []
Arguments: []
12-12 22:37 DEBUG    Validating filter arguments
12-12 22:37 INFO     Will filter the genome list
12-12 22:37 INFO     Loading genomes from a list
12-12 22:37 INFO     132,642 genomes were input to dRep
12-12 22:37 INFO     Calculating genome info of genomes
12-12 22:37 DEBUG    Skipping all quality-based filtering
12-12 22:37 DEBUG    Storing resulting files
12-12 22:37 INFO     ***************************************************
    ..:: dRep dereplicate Step 2. Cluster ::..
***************************************************
    
12-12 22:37 INFO     Running primary clustering
12-12 22:37 INFO     Running pair-wise MASH clustering
12-12 22:37 INFO       Will split genomes into 45 groups for primary clustering
12-12 22:49 INFO       Comparing group 1 of 45
12-12 22:49 INFO       Comparing group 2 of 45
12-12 22:50 INFO       Comparing group 3 of 45
12-12 22:50 INFO       Comparing group 4 of 45
12-12 22:50 INFO       Comparing group 5 of 45
12-12 22:50 INFO       Comparing group 6 of 45
12-12 22:51 INFO       Comparing group 7 of 45
12-12 22:51 INFO       Comparing group 8 of 45
12-12 22:51 INFO       Comparing group 9 of 45
12-12 22:51 INFO       Comparing group 10 of 45
12-12 22:52 INFO       Comparing group 11 of 45
12-12 22:52 INFO       Comparing group 12 of 45
12-12 22:52 INFO       Comparing group 13 of 45
12-12 22:52 INFO       Comparing group 14 of 45
12-12 22:53 INFO       Comparing group 15 of 45
12-12 22:53 INFO       Comparing group 16 of 45
12-12 22:53 INFO       Comparing group 17 of 45
12-12 22:53 INFO       Comparing group 18 of 45
12-12 22:54 INFO       Comparing group 19 of 45
12-12 22:54 INFO       Comparing group 20 of 45
12-12 22:54 INFO       Comparing group 21 of 45
12-12 22:54 INFO       Comparing group 22 of 45
12-12 22:55 INFO       Comparing group 23 of 45
12-12 22:55 INFO       Comparing group 24 of 45
12-12 22:55 INFO       Comparing group 25 of 45
12-12 22:55 INFO       Comparing group 26 of 45
12-12 22:56 INFO       Comparing group 27 of 45
12-12 22:56 INFO       Comparing group 28 of 45
12-12 22:56 INFO       Comparing group 29 of 45
12-12 22:56 INFO       Comparing group 30 of 45
12-12 22:57 INFO       Comparing group 31 of 45
12-12 22:57 INFO       Comparing group 32 of 45
12-12 22:57 INFO       Comparing group 33 of 45
12-12 22:57 INFO       Comparing group 34 of 45
12-12 22:58 INFO       Comparing group 35 of 45
12-12 22:58 INFO       Comparing group 36 of 45
12-12 22:58 INFO       Comparing group 37 of 45
12-12 22:58 INFO       Comparing group 38 of 45
12-12 22:59 INFO       Comparing group 39 of 45
12-12 22:59 INFO       Comparing group 40 of 45
12-12 22:59 INFO       Comparing group 41 of 45
12-12 22:59 INFO       Comparing group 42 of 45
12-12 23:00 INFO       Comparing group 43 of 45
12-12 23:00 INFO       Comparing group 44 of 45
12-12 23:00 INFO       Comparing group 45 of 45
12-12 23:04 INFO       Final step: comparing between all groups
12-12 23:04 DEBUG    Clustering MASH database
12-12 23:04 DEBUG    Clustering MASH database
12-12 23:04 DEBUG    Clustering MASH database
12-12 23:04 DEBUG    Clustering MASH database
12-12 23:04 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:05 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 DEBUG    Clustering MASH database
12-12 23:06 INFO     Comparing 132,389 genomes

How can I solve this problem?

Thank you!

@MrOlm
Owner

MrOlm commented Dec 13, 2024

Hi @yeoeunyun ,

Sorry to hear this happened. A crash at that step is almost certainly caused by running out of RAM.

If you run the same command again, keeping the -d option, it should pick up where it left off instead of re-running everything.

Unfortunately, the only real solution here is to either make more RAM available or to decrease the number of genomes. I'll also just say that dRep uses genomes as inputs, not contigs (unless of course each contig is one genome).

Best,
Matt
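To give a feel for why RAM becomes the limit at the "Comparing 132,389 genomes" step, the following back-of-the-envelope sketch estimates the size of an all-vs-all comparison. It assumes one 8-byte value per ordered genome pair held in a dense table. That storage model is an illustration only and is not taken from dRep's source, so the real footprint may differ.

```python
def pairwise_count(n_genomes):
    """Number of unordered genome pairs in an all-vs-all comparison."""
    return n_genomes * (n_genomes - 1) // 2


def dense_table_gib(n_genomes, bytes_per_value=8):
    """Approximate size in GiB of a dense n x n table (assumed storage model)."""
    return n_genomes ** 2 * bytes_per_value / 2 ** 30


n = 132_389  # genome count reported in the last log line
print(f"{pairwise_count(n):,} unordered pairs")
print(f"{dense_table_gib(n):.1f} GiB for a dense float64 table")
```

Under these assumptions the table alone would be on the order of 130 GiB, and the cost grows quadratically with the number of genomes, which is why reducing the input size has such a large effect.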

@yeoeunyun
Author

Dear @MrOlm

Thank you for your kind response!

As you suggested, I'll try rerunning it with the -d option.

I have one additional question: would lowering the --primary_chunksize option (to around 1,000–2,000) help the process complete?

I'll explore ways to reduce the number of contigs, but my research does require using contigs as input.

Thanks a lot.

@MrOlm
Owner

MrOlm commented Dec 14, 2024

Hi @yeoeunyun ,

Great. Paradoxically, increasing the --primary_chunksize option (to around 10,000 or 20,000?) will help the process complete.

Best of luck,
Matt
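As a rough illustration of what changing --primary_chunksize does to the run, the sketch below assumes the number of primary-clustering groups is the input genome count divided by the chunk size, rounded up. That formula is an assumption, not taken from dRep's source, though it does reproduce the "45 groups" line in the log above for 132,642 genomes and a chunk size of 3,000.

```python
import math


def n_primary_groups(n_genomes, chunksize):
    """Assumed group count: genomes split into chunks of at most `chunksize`."""
    return math.ceil(n_genomes / chunksize)


n = 132_642  # genomes input to dRep, from the log
for chunksize in (3_000, 10_000, 20_000):
    print(chunksize, n_primary_groups(n, chunksize))
```

Under this assumption, the suggested values of 10,000 and 20,000 would give 14 and 7 groups respectively, instead of 45.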

@yeoeunyun
Author

Hi @MrOlm

Thank you so much. I’ll give it a try following your suggestions.

Have a nice day :)
