Dear Sir,
When I used dRep v3.4 to compare 56,910 genomes with the parameters "--multiround_primary_clustering --greedy_secondary_clustering --debug", the clustering stopped due to a pandas error, shown below:
..:: dRep compare Step 1. Cluster ::..
Running primary clustering
Running pair-wise MASH clustering
Will split genomes into 4 groups for primary clustering
Comparing group 1 of 4
Comparing group 2 of 4
Comparing group 3 of 4
Comparing group 4 of 4
Final step: comparing between all groups
Comparing 49,864 genomes
Traceback (most recent call last):
File "/share/home/bgi_wangj/Anaconda3/envs/drep/bin/dRep", line 32, in
Controller().parseArguments(args)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/controller.py", line 102, in parseArguments
self.compare_operation(**vars(args))
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/controller.py", line 53, in compare_operation
drep.d_workflows.compare_wrapper(kwargs['work_directory'],**kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_workflows.py", line 97, in compare_wrapper
drep.d_cluster.controller.d_cluster_wrapper(wd, **kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 179, in d_cluster_wrapper
GenomeClusterController(workDirectory, **kwargs).main()
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 32, in main
self.run_primary_clustering()
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/controller.py", line 100, in run_primary_clustering
Mdb, Cdb, cluster_ret = drep.d_cluster.compare_utils.all_vs_all_MASH(self.Bdb, self.wd.get_dir('MASH'), **self.kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 120, in all_vs_all_MASH
return run_second_round_clustering(Bdb, genome_chunks, data_folder, verbose=True, **kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 246, in run_second_round_clustering
Cdb2, cluster_ret = cluster_mash_database(mdb, **kwargs)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/drep/d_cluster/compare_utils.py", line 279, in cluster_mash_database
linkage_db = db.pivot(index="genome1", columns="genome2", values="dist")
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/frame.py", line 7793, in pivot
return pivot(self, index=index, columns=columns, values=values)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/reshape/pivot.py", line 517, in pivot
return indexed.unstack(columns_listlike)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/series.py", line 4081, in unstack
return unstack(self, level, fill_value)
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 461, in unstack
obj.index, level=level, constructor=obj._constructor_expanddim
File "/share/home/bgi_wangj/Anaconda3/envs/drep/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 131, in init
raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
ValueError: Unstacked DataFrame is too big, causing int32 overflow
Job 13251352 stderr output:
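The overflow itself is simple arithmetic: the pivot at compare_utils.py line 279 reshapes the long-format MASH table into a dense genome-by-genome matrix, and with ~49,864 genomes that matrix has more cells than an int32 can index, which the pandas version in this environment rejects in unstack(). A minimal sketch of the size check:

```python
# Sketch of why the pivot fails: an all-vs-all distance table unstacked
# into an n x n matrix exceeds the int32 cell limit that this pandas
# version enforces in unstack().
n = 49_864                 # genomes in the final all-vs-all step
cells = n * n              # cells in the pivoted genome1 x genome2 matrix
int32_max = 2**31 - 1      # 2,147,483,647
print(cells > int32_max)   # → True: the matrix is ~2.49e9 cells
```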
So I updated dRep to v3.5.0 and installed Python 3.9.0.
However, the step of clustering the MASH results has now run for over a day with no log output at all, while other runs comparing fewer than 50,000 genomes finished it in under 30 minutes.
logger.log comparing 56,910 genomes
logger.log comparing fewer than 50,000 genomes
I have no idea what is happening.
Besides, I found an error in the source code that sets the fastANI parameters: in d_cluster/external.py it should be "minFrag", not "minFraction".
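Since the flag name differs between fastANI releases ("--minFrag" in older builds, "--minFraction" in newer ones), one hedged way a wrapper could cope, sketched here with a hypothetical helper rather than dRep's actual code, is to pick the flag from the installed binary's help text:

```python
import subprocess

def pick_min_fragment_flag(help_text):
    """Return whichever minimum-fragment flag this fastANI build documents.
    The option was renamed between fastANI releases."""
    return "--minFraction" if "minFraction" in help_text else "--minFrag"

# In practice the help text would come from the installed binary, e.g.:
#   out = subprocess.run(["fastANI", "--help"], capture_output=True, text=True)
#   flag = pick_min_fragment_flag(out.stdout + out.stderr)
```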
It would also be better if you listed the versions of the dependent software.
Looking forward to your kind reply!
Thanks for this. I have seen it take a while (over 24 h) to cluster really big databases before. For your example of fewer than 50,000 genomes, I think the "saving Mdb" and "saving CdbF" steps don't necessarily indicate that the clustering is done yet; I believe those are preparing for the clustering. dRep also only logs those in --debug mode (in case you didn't run --debug on your first example).
Unfortunately I don't have a lot of good advice about how to handle this; the only thing I can recommend is to definitely use --debug if you're not already.
Thank you as well for the fastANI points- I'll look into that.
I set the parameters "--multiround_primary_clustering --greedy_secondary_clustering --primary_chunksize 15000 --debug" in the first example.
Unfortunately, it has only recorded "Clustering MASH database" and produced no new output in the past 48 hours. The latest output is posted below:
Should I split these genomes into two parts and run dRep compare twice?
Splitting the genomes into two groups and trying again is a good idea. I would also maybe let it go for 24 more hours; with lots of similar genomes like that, it can just take a while to cluster.