Dask performance on xhistogram #8
Thanks Shanice for the detailed and actionable bug report!
I am pretty sure this is the same issue I am having for a project (I mentioned this in the last meeting). Thanks for the issue @stb2145. Very curious how to improve this.
Yeah, thanks for the nice example @stb2145 -- that helps a lot.

Which Pangeo Cloud JupyterHub are you running on (e.g. the normal GCP deployment vs. https://staging.us-central1-b.gcp.pangeo.io)? Also, what worker instance types are you running on (e.g. small, medium, large)?

Here are the performance reports I got when running your example on the staging deployment with medium instances. I do see some difference in performance (~180s vs. ~210s for without vs. with `dims`). That said, I don't see the ~3x time difference which you observed.
I can run the example again and send you updated dask reports. I'll try to also send dask reports of the actual data I'm working on so you can see the real performance issue -- usually the workers die and give up on the computation.
I can also try to submit a similar use case with a different data source as a comparison. Will have to create those nice performance reports tomorrow. Thanks for working on this!
So I am 90% sure this is an xhistogram bug, and the problem is in the reshape step of its implementation. I believe it can be fixed by changing the xhistogram implementation to use `map_blocks` and avoid `dask.array.reshape`.
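For illustration, here is a rough sketch of the blockwise idea for the simple flattened (no-`dims`) case -- not xhistogram's actual code or the eventual fix. Each chunk is histogrammed independently with plain NumPy and the per-chunk counts are summed, so no global `dask.array.reshape` is needed. The helper name `blockwise_histogram` and the array sizes are made up for this sketch.

```python
import numpy as np
import dask
import dask.array as dsa

def blockwise_histogram(arr, bins):
    # Histogram each block on its own, then sum the per-block counts.
    # The work stays embarrassingly parallel and avoids the rechunking /
    # reshaping that flattening the whole dask array would require.
    blocks = arr.to_delayed().ravel()
    per_block = [
        dask.delayed(lambda b: np.histogram(b, bins=bins)[0])(b) for b in blocks
    ]
    return dask.delayed(sum)(per_block)

x = dsa.random.random((10_000, 10_000), chunks=(1_000, 1_000))
counts = blockwise_histogram(x, np.linspace(0, 1, 51)).compute()
```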
Following up on dask reports for the actual data I'm working on: I can't provide info on the performance because I always have to eventually stop the kernel, since the workers seem to freeze and never finish the tasks. I reproduced the dask reports for `theta_dims` and `theta_no_dims` and the discrepancy was similar to yours @jrbourbeau. It is worth investigating further to see if changing the xhistogram implementation will do the trick, as @rabernat already pointed out ☝️.
I will try to help out over here today. My idea was to first run a very simple test: use `xarray.map_blocks` to apply `histogram` to chunks. This should run xhistogram on numpy arrays, and I hope to bypass that dask reshaping pointed out by @rabernat. If this shows promise, I will try to dig deeper into the xhistogram code. Do you see any issues with this @jrbourbeau?
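A minimal sketch of that idea, assuming a DataArray chunked only along `time`; the variable name `theta`, the dimension names, and the bins below are illustrative stand-ins, not taken from the notebook:

```python
import numpy as np
import dask.array as dsa
import xarray as xr
from xhistogram.xarray import histogram

# Stand-in for the real data: a dask-backed DataArray chunked only along "time".
theta = xr.DataArray(
    dsa.random.random((24, 35, 180, 360), chunks=(1, 35, 180, 360)),
    dims=["time", "z", "y", "x"],
    name="theta",
)
bins = np.linspace(0, 1, 51)

def block_hist(block):
    # Each block arrives as a numpy-backed DataArray, so xhistogram takes its
    # pure-numpy code path here and no dask-level reshape is involved.
    return histogram(block, bins=[bins], dim=["z", "y", "x"])

# Because "time" is the only chunked dim, each block holds the full spatial
# domain and the per-block results simply concatenate along "time".
# If xarray cannot infer the output from its zero-sized mock block, pass an
# explicit template= to xr.map_blocks.
theta_hist = xr.map_blocks(block_hist, theta)
theta_hist.load()
```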
Thanks @jbusecke for trying that out. @gjoseph92, could you look at the section of xhistogram code that @rabernat pointed to above?
I much appreciate the help on the xhistogram side @gjoseph92, since I am not very familiar with the internals at all.
Neither are we :) I'm looking forward to digging in on this.
So not sure why, but today I've been able to load multiple variables with dask and xhistogram. I didn't do anything different, just ran the same notebook that had the issue, and I've been able to move forward... Not sure if this means it was a fluke, or if there is still an underlying problem with how xhistogram works with dask... @jbusecke, could you weigh in on the issue you've been having -- have things 'miraculously' started working again for you too?
I ran @stb2145's notebook both on staging and production (with upstream package versions). I suggest we leave this issue open while I run my (very similar) computations and can confirm that those are not experiencing any issues. A performance audit of xhistogram might still be worth it though, just because it is pretty widely used in our group.
I am working on another use case that is blowing up my memory, but I am struggling to get enough workers online to test it properly (currently crawling along on 1 worker 😭).
Ok, I just pushed a sample notebook that emulates my workflow. This should be a rather straightforward use of xhistogram, and I am using synthetic data (rather small in the context of the CMIP6 data available), yet the computation shows worker communication all over the place and eventually dies. Note that I have already set the time chunks to the smallest value (1) here; the data usually has time chunks of around 3-8. I also pushed the performance report, but am not sure how to embed it as nicely as @stb2145 did. I think this computation can be used as a good benchmark for xhistogram.
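As a rough guess at what a minimal version of such a benchmark looks like (the array sizes, variable name, and bins below are illustrative, not the notebook's actual values):

```python
import numpy as np
import dask.array as dsa
import xarray as xr
from xhistogram.xarray import histogram

# Synthetic CMIP6-like field with time chunks of 1.
nt, nz, ny, nx = 120, 35, 180, 360
thetao = xr.DataArray(
    dsa.random.random((nt, nz, ny, nx), chunks=(1, nz, ny, nx)),
    dims=["time", "lev", "y", "x"],
    name="thetao",
)
bins = np.linspace(0, 1, 50)

# Histogram over the spatial dims while keeping "time" -- the pattern that
# shows heavy inter-worker communication in the performance report.
hist = histogram(thetao, bins=[bins], dim=["lev", "y", "x"])
hist.load()
```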
There was this commit pangeo-data/pangeo-cloud-federation@e0b8afd a few days ago which updated packages on staging, but prod hasn't been updated in a few months 🤷
Thanks @jbusecke -- I'll take a look
It takes a few extra steps. (Like I said, a little extra, but convenient for those you are sharing with!)
First: should we split this into two issues? I'm not sure if @stb2145's issues with multiple dims are the same as @jbusecke's. I've just been working with @jbusecke's notebook assuming it was representative of all the problems, but after re-reading this thread from the top, I'm not sure.

Re the dims issue: (after reading a lot about joint-distribution histograms, because I'm not that good at math) I would expect that a histogram along 3 dimensions is ~3x "harder" than a histogram of the flattened array. The output is 3x larger, plus you have to bring all 3 chunks together into one place to compute their conditional probabilities relative to each other; there's no way to parallelize that step. However, the computation should still run fine, even if it takes longer. So if it's consistently failing, something's wrong.

Brief update on our progress so far:

Graphs: `hist.data.visualize()`, `hist.data.visualize(color="order")`, `hist.data.visualize(optimize_graph=True)`, `hist.data.dask.visualize("hlg.png")`

Next steps:
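For anyone following along, those graph-inspection calls in context. The toy `hist` below is a stand-in for the xhistogram result from the notebooks above, and rendering the images requires graphviz to be installed:

```python
import numpy as np
import dask.array as dsa
import xarray as xr
from xhistogram.xarray import histogram

# Toy stand-in for the real computation, just to have a graph to look at.
da = xr.DataArray(
    dsa.random.random((8, 100, 100), chunks=(1, 100, 100)),
    dims=["time", "y", "x"],
    name="da",
)
hist = histogram(da, bins=[np.linspace(0, 1, 11)], dim=["y", "x"])

# Low-level task graph, task ordering, and the optimized graph.
hist.data.visualize("graph.png")
hist.data.visualize("ordered.png", color="order")
hist.data.visualize("optimized.png", optimize_graph=True)
# High-level graph (HLG) structure.
hist.data.dask.visualize("hlg.png")
```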
Hi @gjoseph92, thanks so much for this. I just realized that my example was actually flawed, because the resulting DataArray is quite large (and, as you point out, grinds to a halt). For context: my 'real-world' calculation actually succeeds now, but I still think the performance is far from optimal, so I think using both of these cases as 'benchmarks' for xhistogram is still worth pursuing.
Hi everyone,

I'm having trouble loading my variables when I call `histogram()` from xhistogram and am not sure what is going on. I'm running with a cluster and I'm noticing my Dask dashboard is not performing well, especially when I use the `dims` argument in `histogram()`. I've attached a simple workflow that will replicate this issue, and you will see from the dask performance reports (the two links below) that the variable `theta_dist_with_dims` is performing poorly compared to when `dims` is not given, in the `theta_dist_nodims` variable.

I need to specify the dimensions argument for my variables, so I would appreciate any help in figuring out the issue here and a solution/workaround to make dask perform better with xhistogram. Thanks!

theta_dist_nodims-dask-report
theta_dist_with_dims-dask-report

To get dask reports when running `load()`:
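A minimal way to produce such a report with `dask.distributed.performance_report`, assuming a `Client` is already connected to the cluster and `theta_dist_with_dims` is the DataArray from the workflow above (the filename is illustrative):

```python
from dask.distributed import performance_report

# Wrap the computation in the performance_report context manager; the report
# is written to an HTML file you can open in a browser or share as a link.
with performance_report(filename="theta_dist_with_dims-report.html"):
    theta_dist_with_dims.load()
```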