
Add cuDF spilling statistics to RMM/GPU memory plot #8148

Merged: 26 commits into dask:main on Dec 18, 2023

Conversation

@charlesbluca (Member) commented on Aug 31, 2023:

Addresses rapidsai/dask-cuda#1226

Exposes cuDF's spilling statistics as an extra worker metric, and refactors the RMM/GPU memory plot to reflect this information in a similar way to the standard worker memory plot.
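For context, the statistics themselves come from cuDF's global spill manager. A minimal sketch of inspecting them directly on each worker is below; the `get_global_manager()` import path and its `statistics` attribute are internal cuDF details and an assumption here, so they may differ between cuDF versions:

```python
from dask.distributed import Client


def cudf_spill_stats():
    # Internal cuDF API (an assumption); requires CUDF_SPILL=on and
    # CUDF_SPILL_STATS=1 to be set before cudf is imported on the workers.
    from cudf.core.buffer.spill_manager import get_global_manager

    manager = get_global_manager()
    if manager is None:
        return "spilling not enabled on this worker"
    # SpillStatistics has a human-readable string form summarizing spilled bytes
    return str(manager.statistics)


if __name__ == "__main__":
    client = Client("tcp://scheduler:8786")  # hypothetical scheduler address
    print(client.run(cudf_spill_stats))      # one report per worker
```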

cc @pentschev @quasiben

  • Tests added / passed
  • Passes pre-commit run --all-files

@github-actions bot (Contributor) commented on Aug 31, 2023:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0    27 suites ±0    11h 48m 16s ⏱️ (−2m 58s)
 3 939 tests  +1:   3 826 passed ✔️ (±0),   111 skipped 💤  (+1),   2 failed (±0)
49 563 runs  +27:  47 235 passed ✔️ (±0), 2 317 skipped 💤 (+27), 11 failed (±0)

For more details on these failures, see this check.

Results for commit 936f0f6. ± Comparison against base commit 87576ae.

♻️ This comment has been updated with latest results.

@pentschev (Member) commented:

That's awesome, @charlesbluca! When this is somewhat usable (if it isn't already), would you mind posting some screenshot(s) so we can see what it looks like?

@charlesbluca (Member, Author) commented:

Made some modifications to add this information to the RMM individual plot, since that seems to be the most sensible place to include this information; I've opted for something similar to the individual worker memory plots, where spilled memory is represented as grey:

[screenshot: RMM/GPU memory plot with spilled memory shown in grey]

Note that the color scheme follows Dask defaults because I'm using MemoryColor to control the bar's coloring; if this is a problem, it shouldn't be too difficult to modify that class to allow for overriding the default color scheme.

@charlesbluca marked this pull request as ready for review September 1, 2023 19:19
@charlesbluca changed the title from "Add individual dashboard page for cuDF spilling statistics" to "Add cuDF spilling statistics to RMM/GPU memory plot" Sep 1, 2023
@pentschev (Member) commented:

> Made some modifications to add this information to the RMM individual plot, since that seems to be the most sensible place to include this information; I've opted for something similar to the individual worker memory plots, where spilled memory is represented as grey:

It seems that we've now lost the opacity for GPU memory used/RMM pool size/used memory size, or is it simply the case that you didn't have a pool enabled and thus it didn't show?

> Note that the color scheme follows Dask defaults because I'm using MemoryColor to control the bar's coloring; if this is a problem, it shouldn't be too difficult to modify that class to allow for overriding the default color scheme.

I think the green tones were @jacobtomlinson's idea; they provide a nice touch for GPU memory, and I think the plot looked nicer with them as well. In any case, I'd defer visual details to Jacob; I think he's back Monday. 🙂

@jacobtomlinson (Member) left a comment:

I'm in two minds about the colours. The defaults will feel familiar to users, but they may confuse regular memory plots with GPU memory plots. Using green more clearly highlights they are NVIDIA GPU memory plots. My previous comment on this subject was that there was an early suggestion to use a RAPIDS purple-based scheme, but my recommendation was to stick with green as users may use other GPU libraries with Dask.

@charlesbluca (Member, Author) commented:

> It seems that we've now lost the opacity for GPU memory used/RMM pool size/used memory size (#7718 (comment)), or is it simply the case that you didn't have a pool enabled and thus it didn't show?

Yup, apologies for the confusion; with a partially utilized RMM pool enabled, things should look something like this:

[screenshot: RMM/GPU memory plot with a partially utilized RMM pool]

Based on @jacobtomlinson's comment, I think it makes sense to keep the original green to avoid potential confusion here, so I'll go ahead and push some changes to MemoryColor to allow the colors to be overridden.

Comment on lines 279 to 282
def __init__(self, neutral_color="blue", target_color="orange", terminated_color="red"):
    self.neutral_color = neutral_color
    self.target_color = target_color
    self.terminated_color = terminated_color
@charlesbluca (Member, Author) commented:

Worth noting that we could also configure the colors used for the worker memory bars when they are approaching the memory limit (i.e. possibly spilling) or have been terminated; right now I've only overridden the neutral color for the GPU plot, but I'm not sure it makes sense to use different colors for those other states?
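For illustration, a rough sketch of what overriding just the neutral color could look like, based only on the constructor shown in this hunk; the import path and the idea of subclassing (rather than whatever the PR actually does) are assumptions here:

```python
from distributed.dashboard.components.scheduler import MemoryColor  # import path is a best guess


class GPUMemoryColor(MemoryColor):
    """Keep a green scheme for GPU memory bars while reusing the CPU plot logic."""

    def __init__(self):
        # Only the neutral color is overridden; target/terminated colors keep
        # the defaults from the constructor shown above.
        super().__init__(neutral_color="green")
```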

Comment on lines 284 to 286
target = dask.config.get("distributed.worker.memory.target")
spill = dask.config.get("distributed.worker.memory.spill")
terminate = dask.config.get("distributed.worker.memory.terminate")
@charlesbluca (Member, Author) commented:

Assuming these variables don't come into play when deciding GPU-to-CPU spilling, they can probably be ignored in that case, but I'm not sure what the best way to do that is. My first thought was to just use a context manager to override the config when initializing the MemoryColor, but I'm not sure whether that would screw up AMM behavior down the line.
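For reference, a minimal sketch of the context-manager idea mentioned above, assuming the goal is simply to neutralize the CPU thresholds while the color helper reads its config; whether this has knock-on effects on AMM behavior is exactly the open question here:

```python
import dask

from distributed.dashboard.components.scheduler import MemoryColor  # import path is a best guess

# Config overrides applied via dask.config.set as a context manager are
# reverted when the block exits, so later config reads see the original values.
with dask.config.set(
    {
        "distributed.worker.memory.target": False,
        "distributed.worker.memory.spill": False,
        "distributed.worker.memory.terminate": False,
    }
):
    color = MemoryColor(neutral_color="green")
```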

@quasiben (Member) commented on Sep 7, 2023:

This looks really good, thanks @charlesbluca! I think what's missing now is a test; pulling inspiration from test_worker_http.py or the previous RMM tests in test_rmm_diagnostics.py would be a good start.
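As a loose illustration of the kind of test being suggested here, a sketch is below; the "cudf" metric key and reading it off the scheduler's WorkerState are guesses for illustration, not the PR's actual implementation:

```python
import pytest

from distributed.utils_test import gen_cluster

pytest.importorskip("cudf")


@gen_cluster(client=True)
async def test_cudf_spill_metric_reported(c, s, a, b):
    # Worker heartbeats populate WorkerState.metrics on the scheduler; the
    # "cudf" key used here is a hypothetical name for the new metric.
    metrics = s.workers[a.address].metrics
    assert "cudf" in metrics
```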

@quasiben (Member) commented on Sep 7, 2023:

In #8148 (comment), can you remind me what all the values are?

  • 17.88 GiB is…?
  • 5.96 is…?
  • 95.37 GiB is…?
  • 256 GiB is the total amount of GPU memory (the RMM pool allocation across all 8 GPUs), correct?

@charlesbluca (Member, Author) commented:

> In #8148 (comment), can you remind me what all the values are?

Ah, I notice now that the numbers there look a little wonky; I had hardcoded some values to illustrate what all the colors/opacities should look like, but missed that the RMM utilization exceeds the pool size 😅. A more realistic representation of what things would look like is this:

[screenshot: RMM/GPU memory plot with realistic values]

In the above screenshot:

  • 11.93 GiB represents the sum of RMM allocated bytes across all the workers
  • 23.84 GiB represents the sum of the RMM pool sizes across all workers
  • 47.68 GiB represents the sum of GPU memory utilization (as tracked by pynvml) across all workers
  • 256.00 GiB represents the sum of total GPU memory (as tracked by pynvml) across all workers

The worker hover tool shows the same values for each individual worker.
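For reference, the two pynvml-based numbers in the list above come from NVML's per-device memory info. A minimal standalone sketch of the same query is below; it sums over local GPUs rather than over Dask workers, and is not the dashboard's actual code:

```python
import pynvml

pynvml.nvmlInit()
used = total = 0
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    used += info.used    # bytes currently in use on this GPU
    total += info.total  # total memory on this GPU
pynvml.nvmlShutdown()

print(f"{used / 2**30:.2f} GiB used of {total / 2**30:.2f} GiB total")
```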

> I think what's missing now is a test

Yup, I'll go ahead and add a test once I resolve the merge conflicts.

@jacobtomlinson (Member) left a comment:

This is generally looking good to me from a GPU perspective, and I'd be happy to merge it.

However, there seem to be a lot of CI failures, and I'm a little out of the loop on the state of the distributed CI. Perhaps @fjetter or @hendrikmakait could weigh in on whether these failures are expected?

@jacobtomlinson (Member) commented:

@fjetter @hendrikmakait @jrbourbeau another gentle nudge. Are the CI failures here of concern?

# enable monitoring of cuDF spilling
export CUDF_SPILL=on
export CUDF_SPILL_STATS=1
export DASK_DISTRIBUTED__DIAGNOSTICS__CUDF=1
@charlesbluca (Member, Author) commented:

I initially added this config variable to get test_no_unnecessary_imports_on_worker[scipy] and test_malloc_trim_threshold passing in GPU CI, but it seems like there isn't a trivial way to enable/disable cuDF spilling monitoring on a per-test basis.

Is there a way that we could somehow achieve this, or would it make sense to just not run these specific tests on GPU if we're expecting users to have cuDF installed on the workers?

A reviewer (Member) commented:

Would doing the same as in this test work? If not, maybe our only option would be to launch the test in a separate process so that we can have full control of the environment variables before cudf is imported.
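A rough sketch of the separate-process idea, assuming all we need is to set the environment variables (the same ones from the hunk earlier in this PR) before cudf is first imported; the helper name is illustrative, not from the PR:

```python
import os
import subprocess
import sys


def run_with_cudf_spill_monitoring(code: str) -> subprocess.CompletedProcess:
    """Run a Python snippet in a fresh interpreter with spill monitoring enabled."""
    env = dict(os.environ)
    env.update(
        CUDF_SPILL="on",
        CUDF_SPILL_STATS="1",
        DASK_DISTRIBUTED__DIAGNOSTICS__CUDF="1",
    )
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
```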

@gen_cluster(
    client=True,
    nthreads=[("127.0.0.1", 1)],
    Worker=dask_cuda.CUDAWorker,
@charlesbluca (Member, Author) commented:

cc @pentschev, this was the impetus for rapidsai/dask-build-environment#73; using the default worker class here, we seem to be unable to initialize the spilling monitoring. This was unexpected to me, as I would think all we require is a spilling-enabled installation of cuDF on the workers.

A reviewer (Member) commented:

Would my comment above (#8148 (comment)) help with the initialization issues? Otherwise, let's leave it like this for now and discuss how to improve our testing strategy next week.

@charlesbluca (Member, Author) commented:

rerun tests

@hendrikmakait self-requested a review October 25, 2023 15:40
@hendrikmakait (Member) left a comment:

Sorry for the long silence; the CI failures are unrelated.

if dask.config.get("distributed.diagnostics.cudf"):
    try:
        import cudf as _cudf  # noqa: F401
    except Exception:
A reviewer (Member) commented:

nit:

Suggested change:
-    except Exception:
+    except ImportError:
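For clarity, here is how the guarded import would read with the suggestion applied; the body of the except branch isn't shown in the hunk, so the fallback below is only a placeholder:

```python
import dask

if dask.config.get("distributed.diagnostics.cudf"):
    try:
        import cudf as _cudf  # noqa: F401
    except ImportError:
        # cuDF not available; placeholder for whatever fallback the actual
        # change uses, which isn't visible in this hunk.
        pass
```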

@hendrikmakait merged commit b44e661 into dask:main on Dec 18, 2023 (20 of 34 checks passed).