Add cuDF spilling statistics to RMM/GPU memory plot #8148
Conversation
Unit Test Results: see the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 27 files ±0, 27 suites ±0, 11h 48m 16s ⏱️ -2m 58s. For more details on these failures, see this check. Results for commit 936f0f6. ± Comparison against base commit 87576ae. ♻️ This comment has been updated with latest results.
That's awesome @charlesbluca, when this is somewhat usable (if it isn't already), would you mind posting some screenshot(s) to see what it looks like?
Made some modifications to add this information to the RMM individual plot, since that seems to be the most sensible place for it; I've opted for something similar to the individual worker memory plots, where spilled memory is represented as grey (see screenshot). Note that the color scheme follows Dask defaults because I'm using MemoryColor.
It seems that we've now lost the opacity for GPU memory used/RMM pool size/used memory size, or is it simply that you didn't have a pool enabled and thus it didn't show?
I think the green tones were @jacobtomlinson's idea; they provide a nice touch for GPU memory, and I think the plot looked nicer as well. In any case, I'd defer visual details to Jacob; I think he's back Monday. 🙂
I'm in two minds about the colours. The defaults will feel familiar to users, but they may cause regular memory plots to be confused with GPU memory plots. Using green more clearly highlights that these are NVIDIA GPU memory plots. My previous comment on this subject was that there was an early suggestion to use a RAPIDS purple-based scheme, but my recommendation was to stick with green, since users may use other GPU libraries with Dask.
Yup, apologies for the confusion; with a partially utilized RMM pool enabled, things should look something like this (see screenshot). Based on @jacobtomlinson's comment, I think it makes sense to keep the original green to avoid potential confusion here, so I'll go ahead and push some changes to do that.
def __init__(self, neutral_color="blue", target_color="orange", terminated_color="red"):
    self.neutral_color = neutral_color
    self.target_color = target_color
    self.terminated_color = terminated_color
Worth noting that we could also configure the colors used for the worker memory bars when they are approaching the memory limit (i.e. possibly spilling) or have been terminated; right now I've only overridden the neutral color for the GPU plot, but I'm not sure if it makes sense to use different colors for these other states?
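For illustration, a minimal sketch of what overriding only the neutral color could look like, assuming `MemoryColor` is importable from `distributed.dashboard.components.scheduler`; the subclass name and hex value are hypothetical, not part of this PR:

```python
from distributed.dashboard.components.scheduler import MemoryColor


class GPUMemoryColor(MemoryColor):
    """Hypothetical color mixin for GPU plots: override only the neutral
    color, keeping the default target/terminated colors for the
    approaching-limit and terminated states."""

    def __init__(self):
        super().__init__(neutral_color="#76B900")  # NVIDIA green
```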
target = dask.config.get("distributed.worker.memory.target")
spill = dask.config.get("distributed.worker.memory.spill")
terminate = dask.config.get("distributed.worker.memory.terminate")
Assuming these variables don't come into play when deciding GPU-to-CPU spilling, they can probably be ignored in that case, but I'm not sure what the best way to do this is. My first thought was to just use a context manager to override the config when initializing the MemoryColor, but I'm not sure if that would screw up AMM behavior down the line.
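A rough sketch of that context-manager idea, assuming `MemoryColor` reads these config values in its `__init__` (as the excerpt above suggests); the color value is illustrative only:

```python
import dask
from distributed.dashboard.components.scheduler import MemoryColor

# Temporarily disable the host-memory thresholds while the color helper is
# constructed, so the GPU plot never picks up the target/spill/terminate
# breakpoints that only apply to CPU memory.
with dask.config.set(
    {
        "distributed.worker.memory.target": False,
        "distributed.worker.memory.spill": False,
        "distributed.worker.memory.terminate": False,
    }
):
    gpu_color = MemoryColor(neutral_color="#76B900")
```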
This looks really good! Thanks @charlesbluca. I think what's missing now is a test; pulling inspiration from one of the existing tests could work.
In #8148 (comment), can you remind me what all the values are? What is the 17.88 GiB?
Ah, I notice now that the numbers there look a little wonky; I had hardcoded some values to illustrate what all the colors/opacities should look like, but missed that the RMM utilization exceeds the pool size 😅 A more realistic representation of what things would look like is this (see screenshot). In the above screenshot, the plot breaks down GPU memory usage and cuDF spilling aggregated across the cluster, while the worker hover tool reports these values for each individual worker.
Yup, will go ahead and add a test once I resolve merge conflicts.
This is generally looking good to me from a GPU perspective, and I would be happy to merge it.
However there seem to be a lot of CI failures and I'm a little out of the loop on the state of the distributed CI. Perhaps @fjetter or @hendrikmakait could weigh in on whether these failures are expected?
@fjetter @hendrikmakait @jrbourbeau another gentle nudge. Are the CI failures here of concern?
# enable monitoring of cuDF spilling
export CUDF_SPILL=on
export CUDF_SPILL_STATS=1
export DASK_DISTRIBUTED__DIAGNOSTICS__CUDF=1
I initially added this config variable to get test_no_unnecessary_imports_on_worker[scipy] and test_malloc_trim_threshold passing in GPU CI, but it seems like there isn't a trivial way to enable/disable cuDF spilling monitoring on a per-test basis. Is there a way we could somehow achieve this, or would it make sense to just not run these specific tests on GPU if we're expecting users to have cuDF installed on the workers?
Would doing the same as this test work? If not, maybe our only option would be to launch the test in a separate process so that we can have full control of environment variables before `import cudf`.
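A hedged sketch of that separate-process idea, under the assumption that the relevant check can run in a small inline script; the test name and the assertion on cuDF's spill manager are illustrative, not the PR's actual test:

```python
import os
import subprocess
import sys


def test_cudf_spilling_enabled_in_subprocess():
    # Copy the environment and set the spilling variables before cudf is
    # ever imported in the child process.
    env = os.environ.copy()
    env.update(
        {
            "CUDF_SPILL": "on",
            "CUDF_SPILL_STATS": "1",
            "DASK_DISTRIBUTED__DIAGNOSTICS__CUDF": "1",
        }
    )
    code = (
        "import cudf\n"
        "from cudf.core.buffer.spill_manager import get_global_manager\n"
        "assert get_global_manager() is not None\n"
    )
    subprocess.run([sys.executable, "-c", code], env=env, check=True)
```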
@gen_cluster(
    client=True,
    nthreads=[("127.0.0.1", 1)],
    Worker=dask_cuda.CUDAWorker,
cc @pentschev this was the impetus for rapidsai/dask-build-environment#73, as when using the default worker class here we seem to be unable to initialize the spilling monitoring. This was unexpected to me, as I would think all we require is a spilling-enabled installation of cuDF on the workers.
Would my #8148 (comment) above help with the initialization issues? Otherwise, let's leave it like this for now and discuss how to improve our testing strategy next week.
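For reference, a rough sketch of what such a test might look like once the initialization issue is sorted out; the decorator mirrors the excerpt above, while the test body and the "cudf-spilled" metric name are assumptions rather than the PR's actual code:

```python
import asyncio

import dask_cuda
from distributed.utils_test import gen_cluster


@gen_cluster(
    client=True,
    nthreads=[("127.0.0.1", 1)],
    Worker=dask_cuda.CUDAWorker,
)
async def test_cudf_spill_metric(c, s, *workers):
    # Wait for the workers' heartbeats to report the (assumed) cuDF spilling
    # metric to the scheduler.
    while not any("cudf-spilled" in ws.metrics for ws in s.workers.values()):
        await asyncio.sleep(0.1)
```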
rerun tests
Sorry for the long silence, CI failures are unrelated.
if dask.config.get("distributed.diagnostics.cudf"):
    try:
        import cudf as _cudf  # noqa: F401
    except Exception:
nit:
-    except Exception:
+    except ImportError:
Addresses rapidsai/dask-cuda#1226
Exposes cuDF's spilling statistics as an extra worker metric, and refactors the RMM/GPU memory plot to reflect this information in a similar way to the standard worker memory plot.
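For context, a hedged sketch of how such a per-worker metric could read cuDF's spill statistics; it assumes cuDF's `get_global_manager()` spill-manager API, the layout of its spill totals, and a `cudf-spilled` metric name, any of which may differ from what this PR actually registers:

```python
def cudf_spill_metric():
    """Report total GPU->CPU spilled bytes, or None if spilling is disabled."""
    from cudf.core.buffer.spill_manager import get_global_manager

    manager = get_global_manager()
    if manager is None:
        return {"cudf-spilled": None}
    # spill_totals maps (src, dst) -> (bytes, seconds); keep only the byte
    # count for device-to-host spilling (attribute layout assumed here).
    nbytes, _ = manager.statistics.spill_totals.get(("gpu", "cpu"), (0, 0))
    return {"cudf-spilled": nbytes}
```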
cc @pentschev @quasiben
`pre-commit run --all-files`