
Commit 206101f
Merge branch 'expr-explicit-comms' of github.com:rjzamora/dask-cuda into expr-explicit-comms
rjzamora committed Apr 2, 2024
2 parents 92b496c + a394334 commit 206101f
Showing 7 changed files with 20 additions and 12 deletions.
5 changes: 5 additions & 0 deletions ci/release/update-version.sh
@@ -56,3 +56,8 @@ for FILE in .github/workflows/*.yaml; do
sed_runner "/shared-workflows/ s/@.*/@branch-${NEXT_SHORT_TAG}/g" "${FILE}"
done
sed_runner "s/RAPIDS_VERSION_NUMBER=\".*/RAPIDS_VERSION_NUMBER=\"${NEXT_SHORT_TAG}\"/g" ci/build_docs.sh

+# Docs referencing source code
+find docs/source/ -type f -name *.rst -print0 | while IFS= read -r -d '' filename; do
+    sed_runner "s|/branch-[^/]*/|/branch-${NEXT_SHORT_TAG}/|g" "${filename}"
+done
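For illustration, with a hypothetical ``NEXT_SHORT_TAG=24.06`` the new substitution rewrites branch-pinned source links in the docs as follows (``sed_runner`` is assumed to be a thin wrapper around ``sed -i`` defined earlier in this script; plain ``sed`` is used in this sketch):

    echo "https://github.com/rapidsai/dask-cuda/blob/branch-24.04/dask_cuda/explicit_comms/dataframe/shuffle.py" \
      | sed "s|/branch-[^/]*/|/branch-24.06/|g"
    # -> https://github.com/rapidsai/dask-cuda/blob/branch-24.06/dask_cuda/explicit_comms/dataframe/shuffle.py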
3 changes: 3 additions & 0 deletions docs/source/api.rst
@@ -33,3 +33,6 @@ Explicit-comms
.. currentmodule:: dask_cuda.explicit_comms.comms
.. autoclass:: CommsContext
:members:

+.. currentmodule:: dask_cuda.explicit_comms.dataframe.shuffle
+.. autofunction:: shuffle
8 changes: 4 additions & 4 deletions docs/source/examples/ucx.rst
@@ -2,7 +2,7 @@ Enabling UCX communication
==========================

A CUDA cluster using UCX communication can be started automatically with LocalCUDACluster or manually with the ``dask cuda worker`` CLI tool.
-In either case, a ``dask.distributed.Client`` must be made for the worker cluster using the same Dask UCX configuration; see `UCX Integration -- Configuration <../ucx.html#configuration>`_ for details on all available options.
+In either case, a ``dask.distributed.Client`` must be made for the worker cluster using the same Dask UCX configuration; see `UCX Integration -- Configuration <../../ucx/#configuration>`_ for details on all available options.
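As a minimal sketch of the ``LocalCUDACluster`` route described above (the sections below give fuller examples; arguments beyond ``protocol`` are omitted here):

.. code-block:: python

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # Automatic UCX transport selection for every worker in the cluster.
    cluster = LocalCUDACluster(protocol="ucx")

    # A client created from the cluster object reuses the same UCX configuration.
    client = Client(cluster)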

LocalCUDACluster with Automatic Configuration
---------------------------------------------
@@ -29,7 +29,7 @@ To connect a client to a cluster with automatically-configured UCX and an RMM po
LocalCUDACluster with Manual Configuration
------------------------------------------

-When using LocalCUDACluster with UCX communication and manual configuration, all required UCX configuration is handled through arguments supplied at construction; see `API -- Cluster <../api.html#cluster>`_ for a complete list of these arguments.
+When using LocalCUDACluster with UCX communication and manual configuration, all required UCX configuration is handled through arguments supplied at construction; see `API -- Cluster <../../api/#cluster>`_ for a complete list of these arguments.
To connect a client to a cluster with all supported transports and an RMM pool:

.. code-block:: python
@@ -148,7 +148,7 @@ We communicate to the scheduler that we will be using UCX with the ``--protocol
Workers
^^^^^^^

-All UCX configuration options have analogous options in ``dask cuda worker``; see `API -- Worker <../api.html#worker>`_ for a complete list of these options.
+All UCX configuration options have analogous options in ``dask cuda worker``; see `API -- Worker <../../api/#worker>`_ for a complete list of these options.
To start a cluster with all supported transports and an RMM pool:

.. code-block:: bash
@@ -163,7 +163,7 @@ To start a cluster with all supported transports and an RMM pool:
Client
^^^^^^

-A client can be configured to use UCX by using ``dask_cuda.initialize``, a utility which takes the same UCX configuring arguments as LocalCUDACluster and adds them to the current Dask configuration used when creating it; see `API -- Client initialization <../api.html#client-initialization>`_ for a complete list of arguments.
+A client can be configured to use UCX by using ``dask_cuda.initialize``, a utility which takes the same UCX configuring arguments as LocalCUDACluster and adds them to the current Dask configuration used when creating it; see `API -- Client initialization <../../api/#client-initialization>`_ for a complete list of arguments.
To connect a client to the cluster we have made:

.. code-block:: python
4 changes: 2 additions & 2 deletions docs/source/explicit_comms.rst
@@ -5,7 +5,7 @@ Communication and scheduling overhead can be a major bottleneck in Dask/Distribu
The idea is that Dask/Distributed spawns workers and distributes data as usual, while the user can submit tasks on the workers that communicate explicitly.

This makes it possible to bypass Distributed's scheduler and write hand-tuned computation and communication patterns. Currently, Dask-CUDA includes an explicit-comms
-implementation of the Dataframe `shuffle <https://github.com/rapidsai/dask-cuda/blob/d3c723e2c556dfe18b47b392d0615624453406a5/dask_cuda/explicit_comms/dataframe/shuffle.py#L210>`_ operation used for merging and sorting.
+implementation of the Dataframe `shuffle <../api/#dask_cuda.explicit_comms.dataframe.shuffle.shuffle>`_ operation used for merging and sorting.


Usage
@@ -14,4 +14,4 @@ Usage
In order to use explicit-comms in Dask/Distributed automatically, simply define the environment variable ``DASK_EXPLICIT_COMMS=True`` or set the ``"explicit-comms"``
key in the `Dask configuration <https://docs.dask.org/en/latest/configuration.html>`_.
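A minimal sketch of the configuration-based route just described (the ``LocalCUDACluster`` setup around it is illustrative only):

.. code-block:: python

    import dask
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # Equivalent to exporting DASK_EXPLICIT_COMMS=True in the environment.
    dask.config.set({"explicit-comms": True})

    cluster = LocalCUDACluster()
    client = Client(cluster)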

-It is also possible to use explicit-comms in tasks manually; see the `API <api.html#explicit-comms>`_ and our `implementation of shuffle <https://github.com/rapidsai/dask-cuda/blob/branch-0.20/dask_cuda/explicit_comms/dataframe/shuffle.py>`_ for guidance.
+It is also possible to use explicit-comms in tasks manually; see the `API <../api/#explicit-comms>`_ and our `implementation of shuffle <https://github.com/rapidsai/dask-cuda/blob/branch-24.06/dask_cuda/explicit_comms/dataframe/shuffle.py>`_ for guidance.
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -11,7 +11,7 @@ While Distributed can be used to leverage GPU workloads through libraries such a

- **Automatic instantiation of per-GPU workers** -- Using Dask-CUDA's LocalCUDACluster or ``dask cuda worker`` CLI will automatically launch one worker for each GPU available on the executing node, avoiding the need to explicitly select GPUs.
- **Automatic setting of CPU affinity** -- The setting of CPU affinity for each GPU is done automatically, preventing memory transfers from taking suboptimal paths.
-- **Automatic selection of InfiniBand devices** -- When UCX communication is enabled over InfiniBand, Dask-CUDA automatically selects the optimal InfiniBand device for each GPU (see `UCX Integration <ucx.html>`_ for instructions on configuring UCX communication).
+- **Automatic selection of InfiniBand devices** -- When UCX communication is enabled over InfiniBand, Dask-CUDA automatically selects the optimal InfiniBand device for each GPU (see `UCX Integration <ucx>`_ for instructions on configuring UCX communication).
- **Memory spilling from GPU** -- For memory-intensive workloads, Dask-CUDA supports spilling from GPU to host memory when a GPU reaches the default or user-specified memory utilization limit.
- **Allocation of GPU memory** -- when using UCX communication, per-GPU memory pools can be allocated using `RAPIDS Memory Manager <https://github.com/rapidsai/rmm>`_ to circumvent the costly memory buffer mappings that would be required otherwise.

2 changes: 1 addition & 1 deletion docs/source/spilling.rst
@@ -37,7 +37,7 @@ JIT-Unspill
The regular spilling in Dask and Dask-CUDA has some significant issues. Instead of tracking individual objects, it tracks task outputs.
This means that a task returning a collection of CUDA objects will either spill all of the CUDA objects or none of them.
Other issues include *object duplication*, *wrong spilling order*, and *non-tracking of shared device buffers*
-(see: https://github.com/dask/distributed/issues/4568#issuecomment-805049321).
+(`see discussion <https://github.com/dask/distributed/issues/4568#issuecomment-805049321>`_).

In order to address all of these issues, Dask-CUDA introduces JIT-Unspilling, which can improve performance and memory usage significantly.
For workloads that require significant spilling
8 changes: 4 additions & 4 deletions docs/source/ucx.rst
@@ -37,7 +37,7 @@ Automatic

Beginning with Dask-CUDA 22.02 and assuming UCX >= 1.11.1, specifying UCX transports is now optional.

-A local cluster can now be started with ``LocalCUDACluster(protocol="ucx")``, implying automatic UCX transport selection (``UCX_TLS=all``). Starting a cluster separately -- scheduler, workers and client as different processes -- is also possible, as long as the Dask scheduler is created with ``dask scheduler --protocol="ucx"``; connecting a ``dask cuda worker`` to the scheduler will then imply automatic UCX transport selection, but that requires the Dask scheduler and client to be started with ``DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True``. See `Enabling UCX communication <examples/ucx.html>`_ for more detailed examples of UCX usage with automatic configuration.
+A local cluster can now be started with ``LocalCUDACluster(protocol="ucx")``, implying automatic UCX transport selection (``UCX_TLS=all``). Starting a cluster separately -- scheduler, workers and client as different processes -- is also possible, as long as the Dask scheduler is created with ``dask scheduler --protocol="ucx"``; connecting a ``dask cuda worker`` to the scheduler will then imply automatic UCX transport selection, but that requires the Dask scheduler and client to be started with ``DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True``. See `Enabling UCX communication <../examples/ucx/>`_ for more detailed examples of UCX usage with automatic configuration.
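A rough sketch of the separate-process route described above (the address and port are placeholders; as noted, the environment variable must also be set where the client runs):

.. code-block:: bash

    # Required so the scheduler (and client) create a CUDA context for UCX.
    export DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True
    dask scheduler --protocol="ucx"

    # In another shell: connect workers to the scheduler over UCX.
    dask cuda worker ucx://<scheduler-address>:8786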

Configuring transports manually is still possible; please refer to the subsection below.

@@ -79,12 +79,12 @@ However, some will affect related libraries, such as RMM:
.. note::
These options can be used with mainline Dask.distributed.
However, some features are exclusive to Dask-CUDA, such as the automatic detection of InfiniBand interfaces.
-See `Dask-CUDA -- Motivation <index.html#motivation>`_ for more details on the benefits of using Dask-CUDA.
+See `Dask-CUDA -- Motivation <../#motivation>`_ for more details on the benefits of using Dask-CUDA.

Usage
-----

-See `Enabling UCX communication <examples/ucx.html>`_ for examples of UCX usage with different supported transports.
+See `Enabling UCX communication <../examples/ucx/>`_ for examples of UCX usage with different supported transports.

Running in a fork-starved environment
-------------------------------------
@@ -97,7 +97,7 @@ this when using Dask-CUDA's UCX integration, processes launched via
multiprocessing should start processes using the
`"forkserver"
<https://docs.python.org/dev/library/multiprocessing.html#contexts-and-start-methods>`_
-method. When launching workers using `dask cuda worker <quickstart.html#dask-cuda-worker>`_, this can be
+method. When launching workers using `dask cuda worker <../quickstart/#dask-cuda-worker>`_, this can be
achieved by passing ``--multiprocessing-method forkserver`` as an
argument. In user code, the method can be controlled with the
``distributed.worker.multiprocessing-method`` configuration key in
