
dRep v3.5.0 spends >1 day on clustering MASH results #244

Open
PuziJiang opened this issue Nov 20, 2024 · 3 comments
Comments

@PuziJiang

PuziJiang commented Nov 20, 2024

Dear Sir,
When I used dRep v3.4 to compare 56,910 genomes with the parameters "--multiround_primary_clustering --greedy_secondary_clustering --debug", clustering stopped with the pandas error below:

..:: dRep compare Step 1. Cluster ::..

Running primary clustering
Running pair-wise MASH clustering
Will split genomes into 4 groups for primary clustering
Comparing group 1 of 4
Comparing group 2 of 4
Comparing group 3 of 4
Comparing group 4 of 4
Final step: comparing between all groups
Comparing 49,864 genomes
Traceback (most recent call last):
File "/share/home/bgi_wangj/Anaconda3/envs/drep/bin/dRep", line 32, in <module>
Controller().parseArguments(args)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/controller.py", line 102, in parseArguments
self.compare_operation(**vars(args))
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/controller.py", line 53, in compare_operation
drep.d_workflows.compare_wrapper(kwargs['work_directory'],**kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_workflows.py", line 97, in compare_wrapper
drep.d_cluster.controller.d_cluster_wrapper(wd, **kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 179, in d_cluster_wrapper
GenomeClusterController(workDirectory, **kwargs).main()
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 32, in main
self.run_primary_clustering()
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 100, in run_primary_clustering
Mdb, Cdb, cluster_ret = drep.d_cluster.compare_utils.all_vs_all_MASH(self.Bdb, self.wd.get_dir('MASH'), **self.kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 120, in all_vs_all_MASH
return run_second_round_clustering(Bdb, genome_chunks, data_folder, verbose=True, **kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 246, in run_second_round_clustering
Cdb2, cluster_ret = cluster_mash_database(mdb, **kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 279, in cluster_mash_database
linkage_db = db.pivot(index="genome1", columns="genome2", values="dist")
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/frame.py", line 7793, in pivot
return pivot(self, index=index, columns=columns, values=values)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/reshape/pivot.py", line 517, in pivot
return indexed.unstack(columns_listlike)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/series.py", line 4081, in unstack
return unstack(self, level, fill_value)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 461, in unstack
obj.index, level=level, constructor=obj._constructor_expanddim
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 131, in __init__
raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
ValueError: Unstacked DataFrame is too big, causing int32 overflow
Job 13251352 stderr output:
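
For context on the ValueError in the traceback: in the pandas version shown there, unstack() raises this error whenever the pivoted frame would hold more than 2**31 - 1 cells (the int32 maximum). A minimal sketch with the genome count from the log above (the threshold is pandas' internal check, not a dRep setting, and any all-vs-all pivot of more than roughly 46,340 genomes exceeds it):

```python
# Sketch of the size check that triggers the ValueError above.
# pandas' unstack() (used internally by DataFrame.pivot) raised
# "Unstacked DataFrame is too big, causing int32 overflow" when
# rows * columns of the result exceeded the int32 maximum.
n_genomes = 49_864                 # genome count reported in the log above
n_cells = n_genomes * n_genomes    # cells in the all-vs-all distance matrix
INT32_MAX = 2**31 - 1              # 2_147_483_647

print(n_cells)              # 2486418496
print(n_cells > INT32_MAX)  # True -> pivot()/unstack() fails at this size
```

Since the floor of sqrt(2**31 - 1) is 46,340, a square pivot of up to 46,340 genomes fits under the limit, which would explain why runs with fewer genomes completed without this error.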

So I updated dRep to v3.5.0 and installed Python 3.9.0.
However, the step of clustering MASH results has now run for more than a day with no log output at all, while other jobs comparing fewer than 50,000 genomes finished in under 30 minutes.

logger.log when comparing 56,910 genomes:
[screenshot of logger.log]

logger.log when comparing fewer than 50,000 genomes:
[screenshot of logger.log]

I have no idea what is happening.

Besides, I found an error in the source code that sets the fastANI parameters: in d_cluster/external.py it should be "minFrag", not "minFraction".
I also think it would be helpful if you documented the versions of the dependent software.

Looking forward to your kind reply!

@MrOlm
Owner

MrOlm commented Nov 20, 2024

Hi @PuziJiang -

Thanks for this. I have seen it take a while (over 24 hours) to cluster really big databases before. For your example with fewer than 50,000 genomes, I don't think the "saving Mdb" and "saving CdbF" steps necessarily indicate the clustering is done; I believe those are preparation for the clustering. They also only appear in --debug mode (in case you didn't run --debug on your first example).

Unfortunately I don't have a lot of good advice on how to handle this - the only thing I can recommend is to definitely use --debug if you're not already.

Thank you as well for the fastANI pointers - I'll look into that.

Best,
Matt

@PuziJiang
Author

Dear Matt,

I set the parameters "--multiround_primary_clustering --greedy_secondary_clustering --primary_chunksize 15000 --debug" in the first example.
Unfortunately, it has only recorded "Clustering MASH database" and produced no new output in the past 48 hours. The latest output is posted below:
[screenshot of the latest logger.log output]

Should I split these genomes into two parts and run dRep compare twice?

@MrOlm
Owner

MrOlm commented Nov 21, 2024

Splitting the genomes into two groups and trying again is a good idea. I would also maybe let it go for 24 more hours - with lots of similar genomes like that, it can just take a while to cluster.

Matt
