
dRep v3.5.0 spends >1 day on clustering MASH results #244

Open
PuziJiang opened this issue Nov 20, 2024 · 3 comments
Comments

@PuziJiang

PuziJiang commented Nov 20, 2024

Dear Sir,
When I used dRep v3.4 to compare 56,910 genomes with the parameters "--multiround_primary_clustering --greedy_secondary_clustering --debug", clustering stopped with the pandas error below:

..:: dRep compare Step 1. Cluster ::..

Running primary clustering
Running pair-wise MASH clustering
Will split genomes into 4 groups for primary clustering
Comparing group 1 of 4
Comparing group 2 of 4
Comparing group 3 of 4
Comparing group 4 of 4
Final step: comparing between all groups
Comparing 49,864 genomes
Traceback (most recent call last):
File "/share/home/bgi_wangj/Anaconda3/envs/drep/bin/dRep", line 32, in <module>
Controller().parseArguments(args)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/controller.py", line 102, in parseArguments
self.compare_operation(**vars(args))
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/controller.py", line 53, in compare_operation
drep.d_workflows.compare_wrapper(kwargs['work_directory'],**kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_workflows.py", line 97, in compare_wrapper
drep.d_cluster.controller.d_cluster_wrapper(wd, **kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 179, in d_cluster_wrapper
GenomeClusterController(workDirectory, **kwargs).main()
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 32, in main
self.run_primary_clustering()
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 100, in run_primary_clustering
Mdb, Cdb, cluster_ret = drep.d_cluster.compare_utils.all_vs_all_MASH(self.Bdb, self.wd.get_dir('MASH'), **self.kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 120, in all_vs_all_MASH
return run_second_round_clustering(Bdb, genome_chunks, data_folder, verbose=True, **kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 246, in run_second_round_clustering
Cdb2, cluster_ret = cluster_mash_database(mdb, **kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 279, in cluster_mash_database
linkage_db = db.pivot(index="genome1", columns="genome2", values="dist")
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/frame.py", line 7793, in pivot
return pivot(self, index=index, columns=columns, values=values)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/reshape/pivot.py", line 517, in pivot
return indexed.unstack(columns_listlike)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/series.py", line 4081, in unstack
return unstack(self, level, fill_value)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 461, in unstack
obj.index, level=level, constructor=obj._constructor_expanddim
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 131, in __init__
raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
ValueError: Unstacked DataFrame is too big, causing int32 overflow
Job 13251352 stderr output:
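
For context on the ValueError in the traceback: in the pandas version shown there, unstack() raises this error whenever the pivoted frame would hold more than 2**31 - 1 cells (the int32 maximum). A minimal sketch with the genome count from the log above (the threshold is pandas' internal check, not a dRep setting, and any all-vs-all pivot of more than roughly 46,340 genomes exceeds it):

```python
# Sketch of the size check that triggers the ValueError above.
# pandas' unstack() (used internally by DataFrame.pivot) raised
# "Unstacked DataFrame is too big, causing int32 overflow" when
# rows * columns of the result exceeded the int32 maximum.
n_genomes = 49_864                 # genome count reported in the log above
n_cells = n_genomes * n_genomes    # cells in the all-vs-all distance matrix
INT32_MAX = 2**31 - 1              # 2_147_483_647

print(n_cells)              # 2486418496
print(n_cells > INT32_MAX)  # True -> pivot()/unstack() fails at this size
```

Since the floor of sqrt(2**31 - 1) is 46,340, a square pivot of up to 46,340 genomes fits under the limit, which would explain why runs with fewer genomes completed without this error.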

So I updated dRep to v3.5.0 and installed Python 3.9.0.
However, the step of clustering MASH results has now run for more than a day with no log output at all, while other jobs comparing fewer than 50,000 genomes finished in under 30 minutes.

logger.log when comparing 56,910 genomes:
[screenshot of logger.log]

logger.log when comparing fewer than 50,000 genomes:
[screenshot of logger.log]

I have no idea what is happening.

Besides, I found an error in the source code that sets the fastANI parameters: in d_cluster/external.py it should be "minFrag", not "minFraction".
I also think it would be helpful if you documented the versions of the dependent software.

Looking forward to your kind reply!

@MrOlm
Owner

MrOlm commented Nov 20, 2024

Hi @PuziJiang -

Thanks for this. I have seen it take a while (over 24 hours) to cluster really big databases before. For your example with fewer than 50,000 genomes, I don't think the "saving Mdb" and "saving CdbF" steps necessarily indicate the clustering is done; I believe those are preparation for the clustering. They also only appear in --debug mode (in case you didn't run --debug on your first example).

Unfortunately I don't have a lot of good advice on how to handle this - the only thing I can recommend is to definitely use --debug if you're not already.

Thank you as well for the fastANI pointers - I'll look into that.

Best,
Matt

@PuziJiang
Author

Dear Matt,

I set the parameters "--multiround_primary_clustering --greedy_secondary_clustering --primary_chunksize 15000 --debug" in the first example.
Unfortunately, it has only recorded "Clustering MASH database" and produced no new output in the past 48 hours. The latest output is posted below:
[screenshot of the latest logger.log output]

Should I split these genomes into two parts and run dRep compare twice?

@MrOlm
Owner

MrOlm commented Nov 21, 2024

Splitting the genomes into two groups and trying again is a good idea. I would also maybe let it go for 24 more hours - with lots of similar genomes like that, it can just take a while to cluster.

Matt
