UCT/CUDA: Update cuda_copy perf estimates for Grace-Hopper #10155

Open. Wants to merge 1 commit into base: master.
1 change: 1 addition & 0 deletions config/ucx.conf
@@ -6,6 +6,7 @@ UCX_IB_MLX5_DEVX_OBJECTS=
 UCX_GDR_COPY_BW=0MBs,get_dedicated:30GBs,put_dedicated:30GBs
 UCX_GDR_COPY_LAT=30e-9
 UCX_DISTANCE_BW=auto,sys:16500MBs
+UCX_CUDA_COPY_BW=h2d:400GBs,d2h:300GBs,d2d:400GBs,other:10000MBs

 [Fujitsu ARM]
 CPU vendor=Fujitsu ARM
36 changes: 25 additions & 11 deletions src/uct/cuda/cuda_copy/cuda_copy_iface.c
@@ -37,14 +37,24 @@ static ucs_config_field_t uct_cuda_copy_iface_config_table[] = {
      "Max number of cuda events. -1 is infinite",
      ucs_offsetof(uct_cuda_copy_iface_config_t, max_cuda_events), UCS_CONFIG_TYPE_UINT},

-    {"BW", "10000MBs",
-     "Effective memory bandwidth",
-     ucs_offsetof(uct_cuda_copy_iface_config_t, bandwidth), UCS_CONFIG_TYPE_BW},
+    /* TODO: 1. Add separate keys for shared and dedicated bandwidth
+             2. Remove the "other" key (use pref_loc for managed memory) */
+    {"BW", "10000MBs,h2d:8300MBs,d2h:11660MBs,d2d:320GBs",
+     "Effective memory bandwidth", 0,
+     UCS_CONFIG_TYPE_KEY_VALUE(UCS_CONFIG_TYPE_BW,
+         {"h2d", "host to device bandwidth",
+          ucs_offsetof(uct_cuda_copy_iface_config_t, bw.h2d)},
+         {"d2h", "device to host bandwidth",
+          ucs_offsetof(uct_cuda_copy_iface_config_t, bw.d2h)},
+         {"d2d", "device to device bandwidth",
+          ucs_offsetof(uct_cuda_copy_iface_config_t, bw.d2d)},
+         {"other", "bandwidth for any other memory type combination",
+          ucs_offsetof(uct_cuda_copy_iface_config_t, bw.other)},
+         {NULL})},

     {NULL}
 };


 /* Forward declaration for the delete function */
 static void UCS_CLASS_DELETE_FUNC_NAME(uct_cuda_copy_iface_t)(uct_iface_t*);

@@ -134,7 +144,7 @@ static ucs_status_t uct_cuda_copy_iface_query(uct_iface_h tl_iface,

     iface_attr->latency = UCT_CUDA_COPY_IFACE_LATENCY;
     iface_attr->bandwidth.dedicated = 0;
-    iface_attr->bandwidth.shared = iface->config.bandwidth;
+    iface_attr->bandwidth.shared = iface->config.bw.other;
     iface_attr->overhead = UCT_CUDA_COPY_IFACE_OVERHEAD;
     iface_attr->priority = 0;

@@ -407,16 +417,17 @@ uct_cuda_copy_estimate_perf(uct_iface_h tl_iface, uct_perf_attr_t *perf_attr)
     perf_attr->bandwidth.dedicated = 0;
     if ((src_mem_type == UCS_MEMORY_TYPE_HOST) &&
         (dst_mem_type == UCS_MEMORY_TYPE_CUDA)) {
-        perf_attr->bandwidth.shared = (zcopy ? 8300.0 : 7900.0) * UCS_MBYTE;
+        perf_attr->bandwidth.shared = zcopy ? iface->config.bw.h2d :
+                                              iface->config.bw.h2d * 0.95;
     } else if ((src_mem_type == UCS_MEMORY_TYPE_CUDA) &&
                (dst_mem_type == UCS_MEMORY_TYPE_HOST)) {
-        perf_attr->bandwidth.shared = (zcopy ? 11660.0 : 9320.0) *
-                                      UCS_MBYTE;
+        perf_attr->bandwidth.shared = zcopy ? iface->config.bw.d2h :
+                                              iface->config.bw.d2h * 0.95;
Comment on lines +420 to +425

Contributor:
Why is the bcopy bandwidth slower than the zcopy one? By the way, 11660 * 0.95 is not equal to 9320. Maybe we need to introduce two different env variables, such as BCOPY_BW and ZCOPY_BW, to control these values accurately. Or, if we are OK with changing performance in the common case, maybe it is better not to distinguish bcopy/zcopy perf at all and to set one value for both cases.

Collaborator (Author):
It's actually not zcopy vs. bcopy; it's zcopy vs. short. Unlike zcopy, put/get short operations invoke cuStreamSynchronize per operation. Therefore, we want to advertise a slightly lower bandwidth for short than for zcopy operations in cuda_copy (see the cost-model sketch after this hunk).
I'm not sure what 9320 represents. Why do you want it to be equal to 9320?

Contributor:
Thanks for the explanation. I am wondering whether we need to make this distinction in this way, because any change to the performance estimation without proper performance testing can lead to unforeseen degradation in some cases. If we only want to tune performance for GH systems, I would prefer to leave performance on other platforms untouched. And if we are OK with changing performance on all platforms in this PR, does this 5% difference really matter, or can we follow the KISS principle and set the same value for both the zcopy and short cases?

@brminich @yosefe WDYT?

     } else if ((src_mem_type == UCS_MEMORY_TYPE_CUDA) &&
                (dst_mem_type == UCS_MEMORY_TYPE_CUDA)) {
-        perf_attr->bandwidth.shared = 320.0 * UCS_GBYTE;
+        perf_attr->bandwidth.shared = iface->config.bw.d2d;
     } else {
-        perf_attr->bandwidth.shared = iface->config.bandwidth;
+        perf_attr->bandwidth.shared = iface->config.bw.other;
     }
 }

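The reply above explains that short put/get operations pay a cuStreamSynchronize per operation, which is why the PR advertises a slightly lower bandwidth for them than for zcopy. The following is a small, self-contained illustration of that cost model, not UCX code; the synchronization cost and message size are hypothetical placeholders chosen only to show the shape of the effect:

#include <stdio.h>

/* Illustrative model: effective bandwidth when every operation pays a fixed
 * synchronization cost on top of the pure copy time. */
static double effective_bw(double peak_bw, double per_op_sync_sec, double msg_size)
{
    double copy_time = msg_size / peak_bw;

    return msg_size / (copy_time + per_op_sync_sec);
}

int main(void)
{
    const double d2h_zcopy_bw = 11660e6;    /* d2h default from this PR, bytes/s */
    const double sync_cost    = 2e-6;       /* hypothetical per-op cuStreamSynchronize cost, s */
    const double msg_size     = 256 * 1024; /* hypothetical message size, bytes */

    printf("zcopy estimate: %.0f MB/s\n", d2h_zcopy_bw / 1e6);
    printf("short estimate: %.0f MB/s (flat 0.95 derate: %.0f MB/s)\n",
           effective_bw(d2h_zcopy_bw, sync_cost, msg_size) / 1e6,
           d2h_zcopy_bw * 0.95 / 1e6);
    return 0;
}

With these placeholder numbers the model lands within a few percent of the flat 5% derate the PR applies, which is the intuition behind scaling the configured zcopy bandwidth by 0.95 instead of keeping separate hard-coded constants.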
@@ -491,7 +502,10 @@ static UCS_CLASS_INIT_FUNC(uct_cuda_copy_iface_t, uct_md_h md, uct_worker_h work
     self->id = ucs_generate_uuid((uintptr_t)self);
     self->config.max_poll = config->max_poll;
     self->config.max_cuda_events = config->max_cuda_events;
-    self->config.bandwidth = config->bandwidth;
+    self->config.bw.h2d = config->bw.h2d;
+    self->config.bw.d2h = config->bw.d2h;
+    self->config.bw.d2d = config->bw.d2d;
+    self->config.bw.other = config->bw.other;
     UCS_STATIC_BITMAP_RESET_ALL(&self->streams_to_sync);

     ucs_mpool_params_reset(&mp_params);
14 changes: 12 additions & 2 deletions src/uct/cuda/cuda_copy/cuda_copy_iface.h
@@ -69,7 +69,12 @@ typedef struct uct_cuda_copy_iface {
     struct {
         unsigned max_poll;
         unsigned max_cuda_events;
-        double bandwidth;
+        struct {
+            double h2d;
+            double d2h;
+            double d2d;
+            double other;
+        } bw;
     } config;
     /* handler to support arm/wakeup feature */
     struct {
@@ -87,7 +92,12 @@ typedef struct uct_cuda_copy_iface_config {
     uct_iface_config_t super;
     unsigned max_poll;
     unsigned max_cuda_events;
-    double bandwidth;
+    struct {
Contributor (@Akshay-Venkatesh, Sep 26, 2024):
@SeyedMir why not use bw[UCS_MEMORY_TYPE_LAST][UCS_MEMORY_TYPE_LAST] and avoid explicit fields for each direction? This way we don't have to introduce the "other" field, and we can populate the bandwidths by referring to the specific source-destination combination. For example, replace:

         {"h2d", "host to device bandwidth",
          ucs_offsetof(uct_cuda_copy_iface_config_t, bw.h2d)},
          ...

with

         {"h2d", "host to device bandwidth",
          ucs_offsetof(uct_cuda_copy_iface_config_t, bw[UCS_MEMORY_TYPE_UNKNOWN][UCS_MEMORY_TYPE_CUDA])},

(A sketch of this alternative layout follows the hunk below.)

Collaborator (Author):
I actually thought about doing it that way initially. But then I figured the bw matrix would have entries for memory types that are completely irrelevant to CUDA, and there would be many more entries than the four in this struct. Having said that, I'm not strongly against using the matrix.

+        double h2d;
+        double d2h;
+        double d2d;
+        double other;
+    } bw;
 } uct_cuda_copy_iface_config_t;


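For reference, here is a minimal sketch of the matrix-based layout suggested in the review thread above, indexing bandwidth by source and destination memory type instead of using named fields. It is not part of this PR; the _alt names are hypothetical, and it assumes the same headers that cuda_copy_iface.h already includes for uct_iface_config_t and the UCS_MEMORY_TYPE_* enumeration:

/* Hypothetical alternative to the named-field layout above: a bandwidth
 * matrix indexed by (source, destination) memory type. As the reply notes,
 * most entries would stay unused for cuda_copy. */
typedef struct uct_cuda_copy_iface_config_alt {
    uct_iface_config_t super;
    unsigned           max_poll;
    unsigned           max_cuda_events;
    /* bw[src][dst], e.g. bw[UCS_MEMORY_TYPE_HOST][UCS_MEMORY_TYPE_CUDA] for h2d */
    double             bw[UCS_MEMORY_TYPE_LAST][UCS_MEMORY_TYPE_LAST];
} uct_cuda_copy_iface_config_alt_t;

The config table entries would then point each "h2d"/"d2h"/"d2d" key at the corresponding matrix element via ucs_offsetof, as in the reviewer's example, and the "other" key could be dropped.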