From e6254a20812f1d59f200a94428320420275f6e62 Mon Sep 17 00:00:00 2001 From: Peter Park Date: Tue, 1 Oct 2024 12:31:11 -0400 Subject: [PATCH 1/5] Sync branches (#4) * Sync branches gpg: using RSA key 22223038B47B3ED4B3355AB11B54779B4780494E gpg: Good signature from "Peter Park (MKMPETEPARK01) " [ultimate] Make docs conform to ROCm docs style guide (#10) * add rocm-documentation as CODEOWNERS * add linting workflow * update index/toc * formatting * bump rocm-docs-core to 1.8.2 * format and convert links to intersphinx * formatting and links * wording and other stuff * link to hardware support section * fix wording more wording * Update docs/how-to/single-node-config.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * Update docs/how-to/single-node-config.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * Update docs/how-to/multi-node-config.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * Update docs/how-to/multi-node-config.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * update wording words make definition more clear update * headings and code font --------- Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * fix link --- .github/CODEOWNERS | 3 + .github/workflows/linting.yml | 19 + .wordlist.txt | 6 + docs/conf.py | 52 +- docs/how-to/multi-node-config.rst | 1181 ++++++++++--------- docs/how-to/single-node-config.rst | 1667 ++++++++++++++------------- docs/index.rst | 76 +- docs/reference/hardware-support.rst | 103 +- docs/sphinx/_toc.yml.in | 28 +- docs/sphinx/requirements.in | 2 +- docs/sphinx/requirements.txt | 294 ++--- 11 files changed, 1827 insertions(+), 1604 deletions(-) create mode 100644 .github/CODEOWNERS create mode 100644 .github/workflows/linting.yml create mode 100644 .wordlist.txt diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS new file mode 100644 index 0000000..76a87e8 --- /dev/null +++ b/.github/CODEOWNERS @@ -0,0 +1,3 @@ +docs/ @ROCm/rocm-documentation +*.md @ROCm/rocm-documentation +*.rst @ROCm/rocm-documentation diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml new file mode 100644 index 0000000..b00669a --- /dev/null +++ b/.github/workflows/linting.yml @@ -0,0 +1,19 @@ +name: Linting + +on: + push: + branches: + - develop + - 'docs/*' + - 'roc**' + pull_request: + branches: + - develop + - 'docs/*' + - 'roc**' + +jobs: + call-workflow-passing-data: + name: Documentation + uses: ROCm/rocm-docs-core/.github/workflows/linting.yml@develop + diff --git a/.wordlist.txt b/.wordlist.txt new file mode 100644 index 0000000..4e3ac0e --- /dev/null +++ b/.wordlist.txt @@ -0,0 +1,6 @@ +Broadcom +InfiniBand +NIC +RoCE +perftest +perftests diff --git a/docs/conf.py b/docs/conf.py index 9bfdeee..49b669c 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -1,26 +1,26 @@ -# Configuration file for the Sphinx documentation builder. -# -# This file only contains a selection of the most common options. For a full -# list see the documentation: -# https://www.sphinx-doc.org/en/master/usage/configuration.html - -# configurations for PDF output by Read the Docs -project = "GPU Cluster Networking Documentation" -author = "Advanced Micro Devices, Inc." -copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved." 
-version = "0.1.0" -release = version -setting_all_article_info = False - -external_toc_path = "./sphinx/_toc.yml" - -extensions = ["rocm_docs"] - -external_projects_current_project = "gpu-cluster-networking" - -html_theme = "rocm_docs_theme" -html_theme_options = {"flavor": "rocm-docs-home"} - -html_title = project - -html_theme_options = {"link_main_doc": True} +# Configuration file for the Sphinx documentation builder. +# +# This file only contains a selection of the most common options. For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# configurations for PDF output by Read the Docs +project = "GPU cluster networking documentation" +author = "Advanced Micro Devices, Inc." +copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved." +version = "0.1.0" +release = version +setting_all_article_info = False + +external_toc_path = "./sphinx/_toc.yml" + +extensions = ["rocm_docs"] + +external_projects_current_project = "gpu-cluster-networking" + +html_theme = "rocm_docs_theme" +html_theme_options = {"flavor": "rocm-docs-home"} + +html_title = project + +html_theme_options = {"link_main_doc": True} diff --git a/docs/how-to/multi-node-config.rst b/docs/how-to/multi-node-config.rst index 36344c6..68b6d01 100644 --- a/docs/how-to/multi-node-config.rst +++ b/docs/how-to/multi-node-config.rst @@ -1,565 +1,616 @@ -.. meta:: - :description: How to configure multiple nodes for testing - :keywords: network validation, DCGPU, multi node, ROCm, RCCL, machine learning, LLM, usage, tutorial - -****************************************************** -Multi node network configuration for AMD Instinct™ GPUs -****************************************************** - -With single node configuration testing completed and verified, we can move on to validating network connections in node pairs. All the tests described in this guide must be run between two nodes in a client-server relationship. Both nodes must be configured and verified per the :doc:`Single node configuration guide` before running any node-to-node performance tests. - -.. _Multinode-Prerequisites: - -Prerequisites -============= - -Before following the steps in this guide, ensure you have performed these actions first: - -* Install all required software for MPI in the `ROCm documentation `_. - - * Specifically, follow the installation instructions for Open MPI, OSU benchmarks, and collective operations. - -* Install `Slurm Workload Manager `_ (if applicable). - -* Implement passwordless SSH. - -Evaluate platform-specific BIOS tunings ---------------------------------------- - -Check your BIOS settings to make sure they are optimized for AMD GPUs. See the `AMD Instinct Optimization Guides `_ for more details. - -* Enable large bar addressing in the BIOS to support peer to peer GPU memory access. -* Verify SRIOV is enabled, if needed. -* Disable ACS (ACS forces P2P transactions through the PCIe root complex). - -.. Note:: - If using virtual devices, AER and ACS should be enabled. - -Single Tier Switch Configuration --------------------------------- - -Take these actions on each single tier (leaf/edge) switch you plan to include in network testing. - -#. Configure remote access to the switch management console. - -#. Verify the switch sees all hosts and ports are active. - -#. For an InfiniBand switch, configure Fabric Manager on the switch or start OpenSM on a host in the network if a subnet manager isn't already in place. - -#. 
For an ethernet switch, configure MTU size and priority flow control (PFC) and ECN support as needed. - -#. Clear all port counters after the switch is ready to use. - -.. _OFED-Perftest-installation-and-benchmarking: - -OFED Perftest installation and benchmarking -============================================ - -Install and run the `OFED performance tests `_ for host to host (H2H) testing. Loopback is implemented in the tests to remove the switch from benchmark results. Remember to install OFED Perfests on both nodes you plan to use in this section. Commands may require ``sudo`` depending on user privileges. - -#. From the CLI of your host, run ``git clone https://github.com/linux-rdma/perftest.git``. - -#. Navigate to the installation directory and build the tests. - - .. code-block:: shell - - cd perftest - ./autogen.sh - ./configure --prefix=$PWD/install --enable-rocm --with-rocm=/opt/rocm - -#. Locate and open ``Makefile`` in your editor of choice, then append ``-D__HIP_PLATFORM_AMD__`` to ``CFLAGS`` and ``CXXFLAGS`` (lines 450 and 458 at the time of publication). This is required to compile the code correctly for this guide. - -#. Run ``make && make install``. - -#. Repeat these steps on a second node connected to the same switch. - -Run Host-based (CPU) Performance Tests -======================================== - -Once installed, there are six main modules available with OFED Perftests: - -* ib_write_bw - Test bandwidth with RDMA write transactions. -* ib_write_lat - Test latency with RDMA write transactions. -* ib_read_bw - Test bandwidth with RDMA read transactions. -* ib_read_lat - Test latency with RDMA read transactions. -* ib_send_bw - Test bandwidth with send transactions. -* ib_send_lat - Test latency with send transactions. - -The examples in this section use ib_send_bw, but you may accomplish similar with any other test you require. The goal of the tests in this section is to verify high speed Host to Host (H2H) data transfer rates between nodes prior to including GPU traffic, therefore the ``use_rocm`` flag is avoided in all commands. - -Run H2H RDMA Benchmark ------------------------ - -To run the OFED perftest, establish an SSH connection to both nodes you installed the OFED perftests on. - -#. Initiate a server connection on the first node: - - .. code-block:: shell - - $ cd perftest #if not already in directory - - $ numactl -C 1 ./ib_send_bw -a -F -d - - ************************************ - * Waiting for client to connect... * - ************************************ - -#. Initiate a client connection on the second node: - - .. code-block:: shell - - $ cd perftest #if not already in directory - - $ numactl -C 1 ./ib_send_bw -a -F -d - -#. Test should run and complete in several moments. - - .. note:: - The use of ``numactl`` or ``taskset`` commands makes sure NUMA domains are not crossed when communicating, which can create overhead and latency. When running tests you must ensure you use cores local to the network device. - -Consult this table for an explanation of flags used in the ``numactl`` examples and other optional flags that may be useful for you. - -.. raw:: html - - - -.. container:: - :name: perftest-commands-table - - .. list-table:: - :header-rows: 1 - :stub-columns: 1 - :widths: 2 5 - - * - Flag - - Description - - * - -d - - Specifies a NIC to use. Ensure you use a NIC that is both adjacent to a GPU and not crossing NUMA domains or otherwise needing pass traffic between CPUs before egressing from the host. 
Tools like ``rocm-smi --showtopo`` and ``lstopo`` can help define which NICs are adjacent to which GPUs. - - * - -p - - Assign a port number to the server/client, when running simultaneously you must use different ports. - - * - --report_gbits - - Reports in Gb/s instead of Mb/s. - - * - -m - - Set MTU size. - - * - -b - - Bidirectional runs. - - * - -a - - Runs messages in all sizes. - - * - -n - - Provides the number of iterations. - - * - -F - - Do not show warning if cpufreq_ondemand is loaded. - - * - --use_rocm= - - This is for device testing, allows you to specify which GPU to use. Zero-based numbering. - - * - --perform_warm_up - - Runs several iterations before benchmarking to warm up memory cache. - -As servers typically have one NIC per GPU, you must change the device location frequently as you iterate through tests. - -Run Multithreaded H2H RDMA Benchmark -------------------------------------- - -You can multithread an OFED perftest by running it simultaneously on each NIC in the server. Use ``taskset`` to select a CPU core on the same NUMA domain as the NICs. Although testing the XGMI/Infinity Fabric link between CPUs is not a goal at this point, it's an option if preferred. - -Run Extended Multithreaded H2H RDMA Benchmark ---------------------------------------------- - -Run the previous test, but this time loop it and run it for a minimum of 8 hours. The goal is to stress the IO network on the fabric over a long period of time. - -Run Device-based (GPU) OFED Performance Tests -============================================= - -Once H2H performance is verified, you can run the Device to Device (D2D) OFED perftests that include GPU traffic. - -Run D2D RDMA benchmark ------------------------ - -Use this example to run an OFED perftest between GPUs in pairs (GPU0 to GPU1, GPU2 to GPU3, and so on). - -.. note:: - If you have Mellanox/Nvidia NICs, be aware that the default OFED perftest installation doesn't include ROCm support. Follow the :ref:`installation instructions` if you haven't done so already. - -In this example, localhost is used by the client to call the server. You may use a specific IP address to ensure the network is tested. - -.. code-block:: shell - - $ (ib_write_bw -b -a -d --report_gbits -F -use_rocm=0 >> /dev/null &); sleep 1; ib_write_bw -b -a -d --report_gbits -use_rocm=0 -F localhost - --------------------------------------------------------------------------------------- - RDMA_Write Bidirectional BW Test - Dual-port : OFF Device : - Number of qps : 1 Transport type : IB - Connection type : RC Using SRQ : OFF - PCIe relax order: ON - ibv_wr* API : OFF - TX depth : 128 - CQ Moderation : 100 - Mtu : 4096[B] - Link type : Ethernet - GID index : 3 - Max inline data : 0[B] - rdma_cm QPs : OFF - Data ex. 
method : Ethernet - --------------------------------------------------------------------------------------- - local address: LID 0000 QPN 0x0901 PSN 0x5e30c8 RKey 0x2000201 VAddr 0x007fe663d20000 - GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:45 - remote address: LID 0000 QPN 0x0901 PSN 0xf40c3c RKey 0x2000201 VAddr 0x007f282a06e000 - GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:35 - --------------------------------------------------------------------------------------- - #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] - 2 5000 0.142947 0.012281 0.767588 - 4 5000 0.28 0.26 8.255475 - 8 5000 0.55 0.54 8.471791 - 16 5000 1.16 1.16 9.025968 - 32 5000 2.31 2.27 8.865877 - 64 5000 4.49 4.43 8.647051 - 128 5000 8.98 8.96 8.745890 - 256 5000 17.57 16.32 7.969287 - 512 5000 34.63 34.41 8.400441 - 1024 5000 67.22 66.92 8.168969 - 2048 5000 129.04 126.20 7.702863 - 4096 5000 188.76 188.56 5.754307 - 8192 5000 194.79 192.62 2.939080 - 16384 5000 195.32 195.21 1.489355 - 32768 5000 203.15 203.13 0.774887 - 65536 5000 204.12 203.85 0.388818 - 131072 5000 204.44 204.43 0.194964 - 262144 5000 204.51 204.51 0.097517 - 524288 5000 204.56 204.56 0.048770 - 1048576 5000 204.57 204.57 0.024387 - 2097152 5000 204.59 204.59 0.012194 - 4194304 5000 204.59 204.59 0.006097 - 8388608 5000 204.59 204.59 0.003049 - --------------------------------------------------------------------------------------- - -.. note:: - If you run the test with different values for ``--use_rocm=#`` on the server and the client, the output will show results from whichever GPU is local to the node you're looking at. The tool is unable to show server and client simultaneously. - -Run H2D/D2H RDMA Benchmark ---------------------------- - -This is similar to the D2D test, but also includes the CPU on either the server or client side of the test-case scenarios. - -For a 2-CPU/8-GPU node you would have have 32 test scenarios per pairs of server. - -.. list-table:: H2D/D2H Benchmark with Server-Side CPUs - :widths: 25 25 25 25 25 25 25 25 25 - :header-rows: 1 - - * - Client - - GPU 0 - - GPU 1 - - GPU 2 - - GPU 3 - - GPU 4 - - GPU 5 - - GPU 6 - - GPU 7 - * - Server - - CPU 0 - - CPU 1 - - - - - - - - - - - - - -.. list-table:: H2D/D2H Benchmark with Client-Side CPUs - :widths: 25 25 25 25 25 25 25 25 25 - :header-rows: 1 - - * - Server - - GPU 0 - - GPU 1 - - GPU 2 - - GPU 3 - - GPU 4 - - GPU 5 - - GPU 6 - - GPU 7 - * - Client - - CPU 0 - - CPU 1 - - - - - - - - - - - - - -To run this test, use a command similar to the example in the D2D benchmark, but only add the ``--use_rocm`` flag on either the server or client side so that one node communicates with the GPUs while the other does so with CPUs. Then run the test a second time with the ``use_rocm`` flag on the other side. Continue to use the most adjacent NIC to the GPU or CPU being tested so that communication doesn't run between between intranode CPUs (testing the internal CPU-CPU fabric isn't a goal at this time). - -D2D RDMA Multithread Benchmark ------------------------------- - -For this test you must run the previous D2D benchmark simultaneously on all GPUs. Scripting is required to accomplish this, but the command input should resemble something like the following image with regard to your RDMA device naming scheme. - -.. image:: ../data/D2D-perftest-multithread.png - :alt: multithread perftest input - -Important OFED perftest flags for this effort include: - -* ``-p `` - Lets you assign specific ports for server/client combinations. 
Each pair needs an independent port number so you don't inadvertently use the wrong server. - -* ``-n <# of iterations>`` - Default is 1000, you can increase this to have the test run longer. - -* For bandwidth tests only: - - * ``-D `` - Defines how long the test runs for. - - * ``--run_infinitely`` - Requires user to break the runtime, otherwise runs indefinitely. - -D2D RDMA Multithread Extended Benchmark ---------------------------------------- - -Perform the D2D RDMA multithread benchmark again but set the duration for a minimum of 8 hours. - -Build collective tests -====================== - -This section guides you through setting up the remaining tools necessary to simulate an AI workload on your GPU nodes after they have been sufficiently traffic-tested. Per the :ref:`prerequisites`, UCX, UCC, MPI and the OSU benchmarks must already be installed. - -Install RCCL -------------- - -RCCL is likely already installed as part of ROCm on your compute nodes. Sometimes newer features and fixes might be available in the latest version of RCCL, which you can build from source at https://github.com/ROCm/rccl. - -Build RCCL Collective Test --------------------------- - -To more easily build and run the RCCL collective tests, review and implement the script provided in the drop-down (the script also includes an option to install MPICH if needed). Otherwise, you can follow the steps to manually install at https://github.com/ROCm/rccl-tests. - -.. dropdown:: build-and-run_rccl-tests_sweep_multinode.sh - - .. code-block:: shell - :linenos: - - #!/bin/bash -x - - ## change this if ROCm is installed in a non-standard path - ROCM_PATH=/opt/rocm - - ## to use pre-installed MPI, change `build_mpi` to 0 and ensure that libmpi.so exists at `MPI_INSTALL_DIR/lib`. - build_mpi=1 - MPI_INSTALL_DIR=/opt/ompi - - ## to use pre-installed RCCL, change `build_rccl` to 0 and ensure that librccl.so exists at `RCCL_INSTALL_DIR/lib`. - build_rccl=1 - RCCL_INSTALL_DIR=${ROCM_PATH} - - - WORKDIR=$PWD - - ## building mpich - if [ ${build_mpi} -eq 1 ] - then - cd ${WORKDIR} - if [ ! -d mpich ] - then - wget https://www.mpich.org/static/downloads/4.1.2/mpich-4.1.2.tar.gz - mkdir -p mpich - tar -zxf mpich-4.1.2.tar.gz -C mpich --strip-components=1 - cd mpich - mkdir build - cd build - ../configure --prefix=${WORKDIR}/mpich/install --disable-fortran --with-ucx=embedded - make -j 16 - make install - fi - MPI_INSTALL_DIR=${WORKDIR}/mpich/install - fi - - - ## building rccl (develop) - if [ ${build_rccl} -eq 1 ] - then - cd ${WORKDIR} - if [ ! -d rccl ] - then - git clone https://github.com/ROCm/rccl -b develop - cd rccl - ./install.sh -l - fi - RCCL_INSTALL_DIR=${WORKDIR}/rccl/build/release - fi - - - ## building rccl-tests (develop) - cd ${WORKDIR} - if [ ! -d rccl-tests ] - then - git clone https://github.com/ROCm/rccl-tests - cd rccl-tests - make MPI=1 MPI_HOME=${MPI_INSTALL_DIR} NCCL_HOME=${RCCL_INSTALL_DIR} -j - fi - - - ## running multi-node rccl-tests all_reduce_perf for 1GB - cd ${WORKDIR} - - ## requires a hostfile named hostfile.txt for the multi-node setup in ${WORKDIR}/ - - n=`wc --lines < hostfile.txt` # count the numbers of nodes in hostfile.txt - echo "No. of nodes: ${n}" # print number of nodes - m=8 # assuming 8 GPUs per node - echo "No. 
of GPUs/node: ${m}" # print number of GPUs per node - total=$((n * m)) # total number of MPI ranks (1 per GPU) - echo "Total ranks: ${total}" # print number of GPUs per node - - ### set these environment variables if using Infiniband interconnect - ## export NCCL_IB_HCA=^mlx5_8 - - ### set these environment variables if using RoCE interconnect - ## export NCCL_IB_GID_INDEX=3 - - for coll in all_reduce all_gather alltoall alltoallv broadcast gather reduce reduce_scatter scatter sendrecv - do - # using MPICH; comment next line if using OMPI - mpirun -np ${total} --bind-to numa -env NCCL_DEBUG=VERSION -env PATH=${MPI_INSTALL_DIR}/bin:${ROCM_PATH}/bin:$PATH -env LD_LIBRARY_PATH=${RCCL_INSTALL_DIR}/lib:${MPI_INSTALL_DIR}/lib:$LD_LIBRARY_PATH ${WORKDIR}/rccl-tests/build/${coll}_perf -b 1 -e 16G -f 2 -g 1 2>&1 | tee ${WORKDIR}/stdout_rccl-tests_${coll}_1-16G_nodes${n}_gpus${total}.txt - - ## uncomment, if using OMPI - ## mpirun -np ${total} --bind-to numa -x NCCL_DEBUG=VERSION -x PATH=${MPI_INSTALL_DIR}/bin:${ROCM_PATH}/bin:$PATH -x LD_LIBRARY_PATH=${RCCL_INSTALL_DIR}/lib:${MPI_INSTALL_DIR}/lib:$LD_LIBRARY_PATH --mca pml ucx --mca btl ^openib ${WORKDIR}/rccl-tests/build/${coll}_perf -b 1 -e 16G -f 2 -g 1 2>&1 | tee ${WORKDIR}/stdout_rccl-tests_${coll}_1-16G_nodes${n}_gpus${total}.txt - - sleep 10 - done - -.. Add or link to the RCCL config script once it's cleared for publication. - -Run OSU Micro Benchmarks -========================= - -Running the OSU Micro Benchmarks (OMB) with MPI simulates conditions similar to an AI/HPC workload over your cluster network. Successful MPI runs require that passwordless SSH be configured between all server pairs where OMB is installed and that they also be finger-printed, otherwise the runs fail. - -This section covers the the two types of OMB: - -* Point to point (pt2pt) benchmarks test communication between one discrete component on a server (host or device) to another. -* Collectives benchmarks support the use of multiple devices in a single run. - -In a typical use case, you start with a pair of nodes and run the pt2pt benchmarks then move on to collectives. - -Point to Point (pt2pt) OSU Benchmarks -------------------------------------- - -Commands in the table below must run on two nodes with RoCE or Infiniband interconnect from Host to Host (CPU to CPU). You can invoke the command from either node, but directories must mirror one another or the tests will hang. - -.. note:: - The paths for the MPI and OMB commands presume both are installed in the ``/opt`` directory. Installation paths for your environment may be different and should be updated accordingly. - -.. raw:: html - - - -.. container:: - :name: osu-commands-table - - .. 
list-table:: - :header-rows: 1 - :stub-columns: 1 - :widths: 2 5 - - * - Command - - Usage - - * - osu_bw - - $OMPI_DIR/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host , -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc $OSU_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw -d rocm - - * - osu_bibw - - $OMPI_DIR/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host , -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc $OSU_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -d rocm - - * - osu_mbw_mr - - $OMPI_DIR/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host , -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc $OSU_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -d rocm - - * - osu_latency - - /$OMPI_DIR/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host , -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc $OSU_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d rocm - - * - osu_multi_lat - - $OMPI_DIR/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host , -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc $OSU_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_multi_lat -d rocm - -You can change communications mode by appending ``D D`` to the end of command for D2D, or ``D H`` for D2H (and vice-versa). - -Collective OSU Benchmarks -------------------------- - -.. raw:: html - - - -.. container:: - :name: coll-commands-table - - .. 
list-table:: - :header-rows: 1 - :stub-columns: 1 - :widths: 2 5 - - * - Command - - Usage - - * - osu_allreduce - - /opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host 10.1.10.110,10.1.10.72 -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d rocm D D - - * - osu_allreduce 2N 16Proc - - /opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 16 -hostfile ./hostfile -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d rocm D D - - * - osu_alltoall - - /opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host 10.1.10.110,10.1.10.72 -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -d rocm D D - - * - osu_alltoall 2N 16Proc - - /opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 16 -hostfile ./hostfile -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -d rocm D D - - * - osu_allgather - - /opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host 10.1.10.110,10.1.10.72 -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather -d rocm D D - - * - osu_allgather 2N 16Proc - - /opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 16 -hostfile ./hostfile -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather -d rocm D D - -Run RCCL Collective Benchmark -============================= - -RCCL is a collective communication library optimized for collective operations by multi-GPU and multi-node communication primitives that are in turn optimized for AMD Instinct GPUs. The RCCL Test is typically launched using MPI, but you can use MPICH or Open MPI as well. - -.. list-table:: - :stub-columns: 1 - :widths: 2 5 - - * - RCCL with MPI - - /opt/ompi/bin/mpirun -mca oob_tcp_if_exclude docker,lo -mca btl_tcp_if_exclude docker,lo -host {HOST1}:8,{HOST2}:8 -np 16 -x LD_LIBRARY_PATH=/opt/rccl/build/rccl/install/lib:/opt/ompi/lib -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=VERSION -x NCCL_IB_HCA=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 -x NCCL_IGNORE_CPU_AFFINITY=1 /opt/rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 - -Reference Documentation -======================= - -* `ROCm Documentation `_ - -* `Slurm Workload Manager Documentation `_ - -* `OFED Performance Test ReadMe `_ - -* `RCCL Test Build Instructions `_ - -Resources and Helpful Links -=========================== - -* `AMD Infinity Hub `_ -* `AMD ROCm Developer Hub `_ +.. meta:: + :description: Learn how to configure multiple nodes for network testing. 
   :keywords: network validation, DCGPU, multi node, ROCm, RCCL, machine learning, LLM, usage, tutorial
+
+**************************************************************
+Multi-node network configuration for AMD Instinct accelerators
+**************************************************************
+
+After single-node configuration testing has been completed and verified, validate network connections in node pairs. All the tests described in this topic must be run between two nodes in a client-server relationship. Both nodes
+must be configured and verified per the :doc:`./single-node-config`
+before running any node-to-node performance tests.
+
+.. _Multinode-Prerequisites:
+
+Prerequisites
+=============
+
+Before following the steps in this guide, complete the following prerequisites.
+
+* Install all required software for MPI in the
+  :doc:`ROCm documentation `.
+
+  * Specifically, follow the installation instructions for Open MPI, OSU
+    benchmarks, and collective operations.
+
+* Install `Slurm Workload Manager `_
+  (if applicable). Refer to the
+  `Slurm Workload Manager documentation `_.
+
+* Implement passwordless SSH.
+
+Evaluate platform-specific BIOS tunings
+---------------------------------------
+
+Check your BIOS settings to make sure they are optimized for AMD GPUs. See the
+:doc:`AMD Instinct system optimization guides `
+for more information.
+
+* Enable large BAR addressing in the BIOS to support peer-to-peer GPU memory
+  access.
+
+* Verify SR-IOV is enabled, if needed.
+
+* Disable ACS (ACS forces P2P transactions through the PCIe root complex).
+
+.. note::
+
+   If using virtual devices, AER and ACS should be enabled.
+
+Single tier switch configuration
+--------------------------------
+
+Take these actions on each single tier (leaf/edge) switch you plan to include in network testing.
+
+#. Configure remote access to the switch management console.
+
+#. Verify the switch sees all hosts and that all ports are active.
+
+#. For an InfiniBand switch, configure Fabric Manager on the switch or start
+   OpenSM on a host in the network if a subnet manager isn't already in place.
+
+#. For an Ethernet switch, configure the MTU size, priority flow control (PFC),
+   and ECN support as needed.
+
+#. Clear all port counters after the switch is ready to use.
+
+.. _OFED-Perftest-installation-and-benchmarking:
+
+OFED perftest installation and benchmarking
+============================================
+
+Install and run the `OFED performance tests `_
+for host to host (H2H) testing. Loopback is implemented in the tests to remove
+the switch from benchmark results. Remember to install the OFED perftests on both
+nodes you plan to use in this section. Commands may require ``sudo`` depending
+on user privileges.
+
+#. From the CLI of your host, clone the perftest repository.
+
+   .. code-block:: shell
+
+      git clone https://github.com/linux-rdma/perftest.git
+
+#. Navigate to the installation directory and build the tests.
+
+   .. code-block:: shell
+
+      cd perftest
+      ./autogen.sh
+      ./configure --prefix=$PWD/install --enable-rocm --with-rocm=/opt/rocm
+
+#. Locate and open ``Makefile`` in your editor of choice, then append
+   ``-D__HIP_PLATFORM_AMD__`` to ``CFLAGS`` and ``CXXFLAGS``. This is required
+   to compile the code correctly for this guide (a scripted version of this
+   step is shown in the sketch after this list).
+
+#. Run ``make && make install``.
+
+#. Repeat these steps on a second node connected to the same switch.
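+
+For convenience, the build steps above can be consolidated into a short script.
+The following is a minimal sketch only: it assumes ROCm is installed at
+``/opt/rocm`` and that the generated ``Makefile`` assigns ``CFLAGS`` and
+``CXXFLAGS`` with a plain ``=``, so the ``sed`` expressions are illustrative and
+should be checked against your ``Makefile`` before use.
+
+.. code-block:: shell
+
+   # Clone and configure the OFED perftests with ROCm support
+   git clone https://github.com/linux-rdma/perftest.git
+   cd perftest
+   ./autogen.sh
+   ./configure --prefix=$PWD/install --enable-rocm --with-rocm=/opt/rocm
+
+   # Append -D__HIP_PLATFORM_AMD__ to the compiler flags (assumes the Makefile
+   # uses "CFLAGS = ..." and "CXXFLAGS = ..."; otherwise edit the file by hand)
+   sed -i 's/^CFLAGS = /CFLAGS = -D__HIP_PLATFORM_AMD__ /' Makefile
+   sed -i 's/^CXXFLAGS = /CXXFLAGS = -D__HIP_PLATFORM_AMD__ /' Makefile
+
+   # Build and install, then repeat the same steps on the second node
+   make && make install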
+
+Run host-based (CPU) performance tests
+======================================
+
+Once installed, there are six main modules available with the OFED perftests:
+
+* ``ib_write_bw`` - Test bandwidth with RDMA write transactions.
+
+* ``ib_write_lat`` - Test latency with RDMA write transactions.
+
+* ``ib_read_bw`` - Test bandwidth with RDMA read transactions.
+
+* ``ib_read_lat`` - Test latency with RDMA read transactions.
+
+* ``ib_send_bw`` - Test bandwidth with send transactions.
+
+* ``ib_send_lat`` - Test latency with send transactions.
+
+The examples in this section use the ``ib_send_bw`` tool, but you can achieve
+similar results with other benchmarking tools, depending on your requirements.
+The primary objective of these tests is to verify high-speed Host-to-Host (H2H)
+data transfer rates between nodes before introducing GPU traffic; as a result,
+the ``use_rocm`` flag is intentionally omitted from all commands.
+
+Run H2H RDMA benchmark
+-----------------------
+
+To run the OFED perftest, establish an SSH connection to both nodes you
+installed the OFED perftests on.
+
+#. Initiate a server connection on the first node:
+
+   .. code-block:: shell-session
+
+      $ cd perftest #if not already in directory
+
+      $ numactl -C 1 ./ib_send_bw -a -F -d
+
+      ************************************
+      * Waiting for client to connect... *
+      ************************************
+
+#. Initiate a client connection on the second node:
+
+   .. code-block:: shell-session
+
+      $ cd perftest #if not already in directory
+
+      $ numactl -C 1 ./ib_send_bw -a -F -d
+
+#. The test should run and complete within a few moments.
+
+   .. note::
+
+      The use of ``numactl`` or ``taskset`` commands makes sure NUMA domains are
+      not crossed when communicating, which can create overhead and latency.
+      When running tests you must ensure you use cores local to the network
+      device.
+
+Consult the following list for an explanation of the flags used in the
+``numactl`` examples, as well as other optional flags that may be useful to you.
+
+-d
+   Specifies a NIC to use. Ensure you use a NIC that is both adjacent to a GPU and not crossing NUMA domains or otherwise needing to pass traffic between CPUs before egressing from the host. Tools like ``rocm-smi --showtopo`` and ``lstopo`` can help define which NICs are adjacent to which GPUs.
+
+-p
+   Assign a port number to the server/client. Each instance must run on a different port when executed simultaneously.
+
+--report_gbits
+   Reports in Gb/s instead of Mb/s.
+
+-m
+   Set MTU size.
+
+-b
+   Bidirectional runs.
+
+-a
+   Runs messages in all sizes.
+
+-n
+   Provides the number of iterations.
+
+-F
+   Do not show a warning if cpufreq_ondemand is loaded.
+
+--use_rocm=
+   This is for device testing; it allows you to specify which GPU to use. Zero-based numbering.
+
+--perform_warm_up
+   Runs several iterations before benchmarking to warm up the memory cache.
+
+As servers typically have one NIC per GPU, you must change the device location
+frequently as you iterate through tests.
+
+Run multithreaded H2H RDMA benchmark
+-------------------------------------
+
+To perform a multithreaded RDMA benchmark using the OFED perftest, run it
+concurrently on each NIC in the server, as shown in the sketch below.
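+The following is a minimal sketch of such a run, using ``ib_send_bw`` as in the
+earlier examples; the RDMA device names, port numbers, and CPU cores are
+placeholders that must be replaced with values matching your topology.
+
+.. code-block:: shell
+
+   # On the server node: one ib_send_bw instance per NIC, each on its own port,
+   # pinned to a core in the same NUMA domain as that NIC (values are examples)
+   taskset -c 1  ./ib_send_bw -a -F -d mlx5_0 -p 18515 &
+   taskset -c 17 ./ib_send_bw -a -F -d mlx5_1 -p 18516 &
+   taskset -c 33 ./ib_send_bw -a -F -d mlx5_2 -p 18517 &
+   taskset -c 49 ./ib_send_bw -a -F -d mlx5_3 -p 18518 &
+   wait
+
+   # On the client node, start matching clients against the server's address:
+   # taskset -c 1 ./ib_send_bw -a -F -d mlx5_0 -p 18515 <server-address>
+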
Use the ``taskset`` command to assign a +CPU core within the same NUMA domain as the NICs. While testing the +XGMI/Infinity Fabric link between CPUs is not required at this stage, it can be +an optional test if desired. + +Run extended multithreaded H2H RDMA benchmark +--------------------------------------------- + +Repeat the multithreaded RDMA benchmark, but loop the test and run it +continuously for at least 8 hours. This extended test is designed to stress the +I/O network fabric over a prolonged period to assess stability and performance +under sustained load. + +Run device-based (GPU) OFED performance tests +============================================= + +After confirming Host-to-Host (H2H) performance, proceed to run Device-to-Device +(D2D) OFED perftests, which include GPU traffic. This will evaluate RDMA +performance between GPUs. + +Run D2D RDMA benchmark +----------------------- + +To run a D2D RDMA benchmark, use the following example setup to test GPU pairs--for +example, GPU0 to GPU1, GPU2 to GPU3. + +.. note:: + + If you have Mellanox or NVIDIA NICs, be aware that the default OFED perftest + installation doesn't include ROCm support. Follow the + :ref:`installation instructions` + if you haven't done so already. + +In this example, ``localhost`` is used by the client to call the server. You may +use a specific IP address to ensure the network is tested. + +.. code-block:: shell + + $ (ib_write_bw -b -a -d --report_gbits -F -use_rocm=0 >> /dev/null &); sleep 1; ib_write_bw -b -a -d --report_gbits -use_rocm=0 -F localhost + --------------------------------------------------------------------------------------- + RDMA_Write Bidirectional BW Test + Dual-port : OFF Device : + Number of qps : 1 Transport type : IB + Connection type : RC Using SRQ : OFF + PCIe relax order: ON + ibv_wr* API : OFF + TX depth : 128 + CQ Moderation : 100 + Mtu : 4096[B] + Link type : Ethernet + GID index : 3 + Max inline data : 0[B] + rdma_cm QPs : OFF + Data ex. method : Ethernet + --------------------------------------------------------------------------------------- + local address: LID 0000 QPN 0x0901 PSN 0x5e30c8 RKey 0x2000201 VAddr 0x007fe663d20000 + GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:45 + remote address: LID 0000 QPN 0x0901 PSN 0xf40c3c RKey 0x2000201 VAddr 0x007f282a06e000 + GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:35 + --------------------------------------------------------------------------------------- + #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] + 2 5000 0.142947 0.012281 0.767588 + 4 5000 0.28 0.26 8.255475 + 8 5000 0.55 0.54 8.471791 + 16 5000 1.16 1.16 9.025968 + 32 5000 2.31 2.27 8.865877 + 64 5000 4.49 4.43 8.647051 + 128 5000 8.98 8.96 8.745890 + 256 5000 17.57 16.32 7.969287 + 512 5000 34.63 34.41 8.400441 + 1024 5000 67.22 66.92 8.168969 + 2048 5000 129.04 126.20 7.702863 + 4096 5000 188.76 188.56 5.754307 + 8192 5000 194.79 192.62 2.939080 + 16384 5000 195.32 195.21 1.489355 + 32768 5000 203.15 203.13 0.774887 + 65536 5000 204.12 203.85 0.388818 + 131072 5000 204.44 204.43 0.194964 + 262144 5000 204.51 204.51 0.097517 + 524288 5000 204.56 204.56 0.048770 + 1048576 5000 204.57 204.57 0.024387 + 2097152 5000 204.59 204.59 0.012194 + 4194304 5000 204.59 204.59 0.006097 + 8388608 5000 204.59 204.59 0.003049 + --------------------------------------------------------------------------------------- + +.. 
note:: + + If you run the test with different values for ``--use_rocm=#`` on the server + and the client, the output will show results from whichever GPU is local to + the node you're looking at. The tool is unable to show server and client + simultaneously. + +Run H2D/D2H RDMA benchmark +--------------------------- + +This is similar to the D2D test, but also includes the CPU on either the server or client side of the test-case scenarios. + +For a 2-CPU/8-GPU node you would have 32 test scenarios per pairs of server. + +.. list-table:: H2D/D2H Benchmark with Server-Side CPUs + :widths: 25 25 25 25 25 25 25 25 25 + :header-rows: 1 + + * - Client + - GPU 0 + - GPU 1 + - GPU 2 + - GPU 3 + - GPU 4 + - GPU 5 + - GPU 6 + - GPU 7 + * - Server + - CPU 0 + - CPU 1 + - + - + - + - + - + - + +.. list-table:: H2D/D2H Benchmark with Client-Side CPUs + :widths: 25 25 25 25 25 25 25 25 25 + :header-rows: 1 + + * - Server + - GPU 0 + - GPU 1 + - GPU 2 + - GPU 3 + - GPU 4 + - GPU 5 + - GPU 6 + - GPU 7 + * - Client + - CPU 0 + - CPU 1 + - + - + - + - + - + - + +To run this test, use a command similar to the example in the D2D benchmark, but +only add the ``--use_rocm`` flag on either the server or client side so that one +node communicates with the GPUs while the other does so with CPUs. Then, run the +test a second time with the ``use_rocm`` flag on the other side. Continue to use +the most adjacent NIC to the GPU or CPU being tested so that communication +doesn't run between intranode CPUs (testing the internal CPU-CPU fabric +isn't a goal now). + +D2D RDMA multithread benchmark +------------------------------ + +For this test you must run the previous D2D benchmark simultaneously on all +GPUs. Scripting is required to accomplish this, but the command input should +resemble something like the following image with regard to your RDMA device +naming scheme. + +.. image:: ../data/D2D-perftest-multithread.png + :alt: multithread perftest input + +Important OFED perftest flags for this effort include: + +-p + Lets you assign specific ports for server/client combinations. Each pair needs an independent port number so you don't inadvertently use the wrong server. + +-n <# of iterations> + Default is 1000, you can increase this to have the test run longer. + +For bandwidth tests only: + +-D + Defines how long the test runs for. + +--run_infinitely + Requires user to break the runtime, otherwise runs indefinitely. + +D2D RDMA multithread extended benchmark +--------------------------------------- + +Perform the D2D RDMA multithread benchmark again but set the duration for a +minimum of 8 hours. + +Build collective tests +====================== + +This section guides you through setting up the remaining tools necessary to +simulate an AI workload on your GPU nodes after they have been sufficiently +traffic-tested. Per the :ref:`prerequisites`, UCX, UCC, +MPI and the OSU benchmarks must already be installed. + +Install RCCL +------------- + +RCCL is likely already installed as part of ROCm on your compute nodes. +Sometimes newer features and fixes might be available in the latest version of +RCCL, which you can build from source at ``__. + +Build RCCL collective tests +--------------------------- + +To more easily build and run the RCCL collective tests, review and implement the +script provided in the drop-down (the script also includes an option to install +MPICH if needed). Otherwise, you can follow the steps to manually install at +``__. + +.. dropdown:: build-and-run_rccl-tests_sweep_multinode.sh + + .. 
code-block:: shell + :linenos: + + #!/bin/bash -x + + ## change this if ROCm is installed in a non-standard path + ROCM_PATH=/opt/rocm + + ## to use pre-installed MPI, change `build_mpi` to 0 and ensure that libmpi.so exists at `MPI_INSTALL_DIR/lib`. + build_mpi=1 + MPI_INSTALL_DIR=/opt/ompi + + ## to use pre-installed RCCL, change `build_rccl` to 0 and ensure that librccl.so exists at `RCCL_INSTALL_DIR/lib`. + build_rccl=1 + RCCL_INSTALL_DIR=${ROCM_PATH} + + + WORKDIR=$PWD + + ## building mpich + if [ ${build_mpi} -eq 1 ] + then + cd ${WORKDIR} + if [ ! -d mpich ] + then + wget https://www.mpich.org/static/downloads/4.1.2/mpich-4.1.2.tar.gz + mkdir -p mpich + tar -zxf mpich-4.1.2.tar.gz -C mpich --strip-components=1 + cd mpich + mkdir build + cd build + ../configure --prefix=${WORKDIR}/mpich/install --disable-fortran --with-ucx=embedded + make -j 16 + make install + fi + MPI_INSTALL_DIR=${WORKDIR}/mpich/install + fi + + + ## building rccl (develop) + if [ ${build_rccl} -eq 1 ] + then + cd ${WORKDIR} + if [ ! -d rccl ] + then + git clone https://github.com/ROCm/rccl -b develop + cd rccl + ./install.sh -l + fi + RCCL_INSTALL_DIR=${WORKDIR}/rccl/build/release + fi + + + ## building rccl-tests (develop) + cd ${WORKDIR} + if [ ! -d rccl-tests ] + then + git clone https://github.com/ROCm/rccl-tests + cd rccl-tests + make MPI=1 MPI_HOME=${MPI_INSTALL_DIR} NCCL_HOME=${RCCL_INSTALL_DIR} -j + fi + + + ## running multi-node rccl-tests all_reduce_perf for 1GB + cd ${WORKDIR} + + ## requires a hostfile named hostfile.txt for the multi-node setup in ${WORKDIR}/ + + n=`wc --lines < hostfile.txt` # count the numbers of nodes in hostfile.txt + echo "No. of nodes: ${n}" # print number of nodes + m=8 # assuming 8 GPUs per node + echo "No. of GPUs/node: ${m}" # print number of GPUs per node + total=$((n * m)) # total number of MPI ranks (1 per GPU) + echo "Total ranks: ${total}" # print number of GPUs per node + + ### set these environment variables if using Infiniband interconnect + ## export NCCL_IB_HCA=^mlx5_8 + + ### set these environment variables if using RoCE interconnect + ## export NCCL_IB_GID_INDEX=3 + + for coll in all_reduce all_gather alltoall alltoallv broadcast gather reduce reduce_scatter scatter sendrecv + do + # using MPICH; comment next line if using OMPI + mpirun -np ${total} --bind-to numa -env NCCL_DEBUG=VERSION -env PATH=${MPI_INSTALL_DIR}/bin:${ROCM_PATH}/bin:$PATH -env LD_LIBRARY_PATH=${RCCL_INSTALL_DIR}/lib:${MPI_INSTALL_DIR}/lib:$LD_LIBRARY_PATH ${WORKDIR}/rccl-tests/build/${coll}_perf -b 1 -e 16G -f 2 -g 1 2>&1 | tee ${WORKDIR}/stdout_rccl-tests_${coll}_1-16G_nodes${n}_gpus${total}.txt + + ## uncomment, if using OMPI + ## mpirun -np ${total} --bind-to numa -x NCCL_DEBUG=VERSION -x PATH=${MPI_INSTALL_DIR}/bin:${ROCM_PATH}/bin:$PATH -x LD_LIBRARY_PATH=${RCCL_INSTALL_DIR}/lib:${MPI_INSTALL_DIR}/lib:$LD_LIBRARY_PATH --mca pml ucx --mca btl ^openib ${WORKDIR}/rccl-tests/build/${coll}_perf -b 1 -e 16G -f 2 -g 1 2>&1 | tee ${WORKDIR}/stdout_rccl-tests_${coll}_1-16G_nodes${n}_gpus${total}.txt + + sleep 10 + done + +Run OSU Micro Benchmarks +========================= + +Running the OSU Micro Benchmarks (OMB) with MPI simulates conditions similar to an AI/HPC workload over your cluster network. Successful MPI runs require that passwordless SSH be configured between all server pairs where OMB is installed and that they also be finger-printed, otherwise the runs fail. 
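+
+If passwordless SSH is not already configured, a minimal setup and check might
+look like the following sketch, where ``node2`` is a placeholder hostname;
+repeat it for every node pair in both directions.
+
+.. code-block:: shell
+
+   # Generate a key if one doesn't already exist, then copy it to the peer node
+   ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
+   ssh-copy-id node2
+
+   # Accept the peer's host key so it is fingerprinted, then confirm that a
+   # non-interactive login completes without a password prompt
+   ssh -o StrictHostKeyChecking=accept-new node2 hostname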
+ +This section covers the the two types of OMB: + +* Point to point (pt2pt) benchmarks test communication between one discrete component on a server (host or device) to another. +* Collectives benchmarks support the use of multiple devices in a single run. + +In a typical use case, you start with a pair of nodes and run the pt2pt benchmarks then move on to collectives. + +Point to point (pt2pt) OSU benchmarks +------------------------------------- + +Commands in the table below must run on two nodes with RoCE or InfiniBand interconnect from Host to Host (CPU to CPU). You can invoke the command from either node, but directories must mirror one another or the tests will hang. + +.. note:: + The paths for the MPI and OMB commands presume both are installed in the ``/opt`` directory. Installation paths for your environment may be different and should be updated accordingly. + +.. raw:: html + + + +.. container:: + :name: osu-commands-table + + .. list-table:: + :header-rows: 1 + :stub-columns: 1 + :widths: 2 5 + + * - Command + - Usage + + * - osu_bw + - ``$OMPI_DIR/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host , -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc $OSU_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw -d rocm`` + + * - osu_bibw + - ``$OMPI_DIR/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host , -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc $OSU_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -d rocm`` + + * - osu_mbw_mr + - ``$OMPI_DIR/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host , -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc $OSU_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -d rocm`` + + * - osu_latency + - ``/$OMPI_DIR/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host , -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc $OSU_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d rocm`` + + * - osu_multi_lat + - ``$OMPI_DIR/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host , -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc $OSU_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_multi_lat -d rocm`` + +You can change communications mode by appending ``D D`` to the end of command for D2D, or ``D H`` for D2H (and vice-versa). + +Collective OSU benchmarks +------------------------- + +.. raw:: html + + + +.. container:: + :name: coll-commands-table + + .. 
list-table:: + :header-rows: 1 + :stub-columns: 1 + :widths: 2 5 + + * - Command + - Usage + + * - osu_allreduce + - ``/opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host 10.1.10.110,10.1.10.72 -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d rocm D D`` + + * - osu_allreduce 2N 16Proc + - ``/opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 16 -hostfile ./hostfile -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d rocm D D`` + + * - osu_alltoall + - ``/opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host 10.1.10.110,10.1.10.72 -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -d rocm D D`` + + * - osu_alltoall 2N 16Proc + - ``/opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 16 -hostfile ./hostfile -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -d rocm D D`` + + * - osu_allgather + - ``/opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 2 -host 10.1.10.110,10.1.10.72 -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather -d rocm D D`` + + * - osu_allgather 2N 16Proc + - ``/opt/ompi/bin/mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^self,vader,openib --mca coll_hcoll_enable 0 --bind-to none -np 16 -hostfile ./hostfile -x UCX_TLS=all -x MV2_USE_ROCM=1 -x HIP_VISIBLE_DEVICES=1 numactl --localalloc /opt/osu-7.3/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather -d rocm D D`` + +Run RCCL collective benchmark +============================= + +RCCL is a collective communication library optimized for collective operations +by multi-GPU and multi-node communication primitives that are in turn optimized +for AMD Instinct accelerators. The RCCL Test is typically launched using MPI, +but you can use MPICH or Open MPI as well. + +.. list-table:: + :stub-columns: 1 + :widths: 2 5 + + * - RCCL with MPI + - ``/opt/ompi/bin/mpirun -mca oob_tcp_if_exclude docker,lo -mca btl_tcp_if_exclude docker,lo -host {HOST1}:8,{HOST2}:8 -np 16 -x LD_LIBRARY_PATH=/opt/rccl/build/rccl/install/lib:/opt/ompi/lib -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=VERSION -x NCCL_IB_HCA=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 -x NCCL_IGNORE_CPU_AFFINITY=1 /opt/rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1`` diff --git a/docs/how-to/single-node-config.rst b/docs/how-to/single-node-config.rst index ea195b8..687c302 100644 --- a/docs/how-to/single-node-config.rst +++ b/docs/how-to/single-node-config.rst @@ -1,783 +1,884 @@ -.. 
meta:: - :description: How to configure a single node for testing - :keywords: network validation, DCGPU, single node, ROCm, RCCL, machine learning, LLM, usage, tutorial - -******************************************************** -Single node network configuration for AMD Instinct™ GPUs -******************************************************** - -This guide explains how to set up a testing environment on a single GPU node and run benchmarks to simulate an AI/HPC workload. - -Prerequisites -============= - -Before following the steps in this guide, ensure you have performed these actions first: - -* Install GPU and network hardware. - -* Install OS and required GPU and network software on each node: - - * `Install ROCm `_. - - * Install network drivers for NICs (add opensm if using InfiniBand). - -* Configure network. - -* Configure system BIOS and OS settings according to the `system optimization guide `_ for your architecture (MI300, MI200, and so on). - -* Disable NUMA Balancing with ``sudo sysctl kernel.numa_balancing=0``. To verify NUMA balancing is disabled, run ``cat /proc/sys/kernel/numa_balancing`` and confirm that 0 is returned. - -* Run the :ref:`disable ACS script` to disable PCI ACS (Access Control Services) on all PCIe devices that support it (must be done on each reboot). - -* Add the ``iommu=pt``parameter to ``GRUB_CMDLINE_LINUX_DEFAULT`` in ``/etc/default/grub``, then run ``sudo update-grub`` and reboot. See `ROCm troubleshooting `_ for more information. - -* Verify your user belongs to the ``render`` and ``video`` groups as specified in the `group permission settings `_ for ROCm installation. - -Best Practices --------------- - -Applications must be the same on every system. There are two ways to accomplish this: - -#. Have an NFS mount available to all systems where the software is installed. - -#. Make a system image with all the software installed. Note that you must re-image anytime there is a software change. - -Validate PCIe performance -========================= - -Checking that your relevant PCIe devices (GPUs, NICs, and internal switches) are using the maximum available transfer speed and width in their respective bus keeps you from having to troubleshoot any related issues in subsequent testing where it may not be obvious. As a best practice, it's helpful to gather all the PCIe addresses for your GPUs, NICs, and switches in advance and document them so that you don't need to do it while following these steps. - -Check PCIe Device Speed and Width ---------------------------------- - -#. From the command line of your host, run ``lspci`` to retrieve a list of PCIe devices and locate your GPU and network devices. - -#. Run ``sudo lspci -s -vvv | grep Speed`` to review the speed and width of your device. This example shows the speed and width for a GPU at the address ``02:00.0``. - - .. tab-set:: - - .. tab-item:: Shell Output - - .. code-block:: shell - - $ sudo lspci -s 02:00.0 -vvv | grep Speed - - LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us - LnkSta: Speed 32GT/s (ok), Width x16 (ok) - - .. tab-item:: Commands - - :: - - sudo lspci -s 02:00.0 -vvv | grep Speed - - The maximum supported speed of the GPU is reported in ``LnkCap`` along with a width of x16. Current status is shown in ``LnkSta`` and we can see both speed and width are aligned. Your values may differ depending on your hardware. - -#. Query and validate all GPUs in your node with the previous steps. - -#. 
Gather the PCI addresses for your NICs and validate them next. See this example from a NIC running at ``05:00.0``: - - .. tab-set:: - - .. tab-item:: Shell Output - - .. code-block:: shell - - $ sudo lspci -s 05:00.0 -vvv | grep Speed - - LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported - LnkSta: Speed 16GT/s (ok), Width x16 (ok) - - .. tab-item:: Commands - - :: - - sudo lspci -s 05:00.0 -vvv | grep Speed - - Here, the NIC is running at a speed of 16GT/s. However, since the NIC configuration only supports PCIe Gen4 speeds this is an expected value. - -Once you verify all GPUs and NICs are running at maximum supported speeds and widths, then proceed to the next section. - -.. note:: - If you are running a cloud instance, hardware passthrough to your guest OS may not be accurate. Verify your ``lspci`` results with your cloud provider. - -Check PCIe Switch Speed and Width ---------------------------------- - -Similar to the previous section, you must next check the PCIe switches in your system to ensure they're reporting the maximum speed and width for ``LnkSta``. - -#. Run ``lspci -vv`` and ``lspci -tv`` to identify PCIe switch locations on the server. - -#. Run ``lspci -vvv | grep Speed`` to verify speed and width as previously demonstrated. - -Check Max Payload Size and Max Read Request -------------------------------------------- - -The ``MaxPayload`` and ``MaxReadReq`` attributes determine the permissible size of individual PCIe packets and the number of read requests permitted at once, respectively. To optimize bandwidth, ensure every GPU and NIC reports the maximum value for both attributes. - -#. Run ``sudo lspci -vvv | grep DevCtl: -C 2`` to review max payload size and max read request. Here is an example using the same NIC as before. - - .. tab-set:: - - .. tab-item:: Shell Output - - .. code-block:: shell - - $ sudo lspci -vvv 05:00.0 | grep DevCtl: -C 2 - - DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us - ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 40.000W - DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq- - RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+ FLReset- - MaxPayload 512 bytes, MaxReadReq 4096 bytes - - .. tab-item:: Commands - - :: - - sudo lspci -vvv 05:00.0 | grep DevCtl: -C 2 - -#. ``MaxReadRequest`` is unique in that it can be changed during runtime with the ``setpci`` command. If your value here is lower than expected, you can correct it as follows: - - .. tab-set:: - - .. tab-item:: Shell Output - - .. code-block:: shell - - $ sudo lspci -vvvs a1:00.0 | grep axReadReq - - MaxPayload 512 bytes, MaxReadReq 512 bytes - - $ sudo setpci -s a1:00.0 68.w - - 295e - - $ sudo setpci -s a1:00.0 68.w=595e - - $ sudo lspci -vvvs a1:00.0 | grep axReadReq - - MaxPayload 512 bytes, MaxReadReq 4096 bytes - - .. tab-item:: Commands - - :: - - sudo lspci -vvvs a1:00.0 | grep axReadReq - - sudo setpci -s a1:00.0 68.w - - sudo setpci -s a1:00.0 68.w=595e - - sudo lspci -vvvs a1:00.0 | grep axReadReq - -.. note:: - Changes made with ``setpci`` are not persistent across reboots. This example uses a single NIC for simplicity, but in practice you must run the change for each NIC in the node. - -Validate NIC Configuration -========================== - -After you've verified optimal PCIe speeds for all devices, configure your NICs according to best practices in the manufacturer or vendor documentation. This may already include some of the pre-assessment steps outlined in this guide and provide more hardware-specific tuning optimizations. 
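Before applying vendor-specific tuning, it can be convenient to sweep the PCIe checks from the previous section across every GPU and NIC at once. The following is a minimal sketch; the BDF addresses are placeholders taken from the earlier examples and must be replaced with the addresses gathered for your system.

.. code-block:: shell

   # Placeholder BDF addresses from the examples above; replace them with the
   # GPU, NIC, and switch addresses gathered for your system
   for bdf in 02:00.0 05:00.0 a1:00.0; do
       echo "== ${bdf} =="
       sudo lspci -s ${bdf} -vvv | grep -E "LnkCap:|LnkSta:|MaxPayload|MaxReadReq"
   done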
- -Vendor-specific NIC Tuning --------------------------- - -Your NICs may require tuning if it has not already been done. Some steps differ based on the type of NIC you're deploying (InfiniBand or RoCE). - -* Ensure :ref:`ACS is disabled`. - -* For Mellanox NICs (InfiniBand or RoCE): Disable ATS, enable PCI Relaxed Ordering, increase max read requests, enable advanced PCI settings. - - .. code-block:: shell - - sudo mst start - - sudo mst status - - sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s ADVANCED_PCI_SETTINGS=1 - - sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s MAX_ACC_OUT_READ=44 - - sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s PCI_WR_ORDERING=1 - - reboot - -* For Broadcom NICs, ensure RoCE is enabled and consider disabling any unused ports. See the :ref:`Broadcom RoCE configuration scripts` for more details. - -* Ensure Relaxed Ordering is enabled in the PCIe settings for your system BIOS as well. - -.. Note:: - All instructions for RoCE networks in this guide and additional guides are based on the v2 protocol. - -Check NIC link speed --------------------- - -Verify the NICs in your servers are reporting the correct speeds. Several commands and utilities are available to measure speed based on your network type. - -* RoCE / Ethernet - - ``sudo ethtool | grep -i speed`` - - ``cat /sys/class/net//speed`` - -* InfiniBand - - ``ibdiagnet`` provides an output of the entire fabric in the default log files. You can verify link speeds here. - - ``ibstat`` or ``ibstatus`` tells you if the link is up and the speed at which it is running for all HCAs in the server. - -Verify Mellanox OFED and Firmware Installation ----------------------------------------------- - -.. Note:: - This step is only necessary for InfiniBand networks. - -Download the latest version of `Mellanox OFED (MLNX_OFED) `_ from Nvidia. Run the installer and flint tools to verify the latest version of MLNX_OFED and firmware is on the HCAs. - -Set up a GPU Testing Environment -================================ - -Next, create a testing environment to gather performance data for your GPUs. This requires installation of ROCm Validation Suite (RVS), TransferBench, and ROCm Bandwidth Test (RBT). - -#. Connect to the CLI of your GPU node. - -#. Follow directions to install RVS at `Installing ROCm Validation Suite `_ - - * Once installed, RVS is located in ``/opt/rocm/``. - -#. Install TransferBench from CLI. - - .. code-block:: shell - - $ git clone https://github.com/ROCmSoftwarePlatform/TransferBench.git - - $ cd TransferBench - - $ sudo make - - # Running make without sudo seems to cause runtime issues - # If this doesn't work, install math libraries manually using https://github.com/RadeonOpenCompute/ROCm/issues/1843 - - $ sudo apt install libstdc++-12-dev - -#. Install ROCm Bandwidth Test from CLI. - - .. code-block:: shell - - $ sudo apt install rocm-bandwidth-test - -Run ROCm Validation Suite (RVS) -------------------------------- - -RVS contains many different tests, otherwise referred to as modules. The relevant tests for this guide are as follows: - -* `P2P Benchmark and Qualification Tool `_ (PBQT) -* `ROCm Configuration Qualification Tool `_ (RCQT) -* `PCI Express Bandwidth Benchmark `_ (PEBB) -* `GPU Properties `_ (GPUP) -* `GPU Stress test `_ (GST) - -You can run multiple tests at once with ``sudo /opt/rocm/rvs/rvs -d 3``, which runs all tests set in ``/opt/rocm/share/rocm-validation-suite/rvs.conf`` at verbosity level 3. 
The default tests are GPUP, PEQT, PEBB, and PBQT, but you can modify the config file to add your preferred tests. The `RVS documentation `_ has more information on how to modify ``rvs.conf`` and helpful command line options. - -When you identify a problem, use ``rvs -g`` to understand what the GPU ID is referring to. - -.. Note:: - GPU numbering in RVS does not have the same order as in ``rocm-smi``. To map the GPU order listed in ``rvs-g`` to the rocm output, run ``rocm-smi --showbus`` and match each GPU by bus ID. - -You can run a specific RVS test by calling its configuration file with ``sudo /opt/rocm/bin/rvs -c /opt/rocm/share/rocm-validation-suite/conf/.conf``. The following shell examples demonstrate what the commands and outputs look like for some of these tests. - -**Example of GPU stress tests with the GST module** - -.. tab-set:: - - .. tab-item:: Shell Output - - .. code-block:: shell - - $ sudo /opt/rocm/bin/rvs -c /opt/rocm/share/rocm-validation-suite/conf/gst_single.conf - - [RESULT] [508635.659800] Action name :gpustress-9000-sgemm-false - [RESULT] [508635.660582] Module name :gst - [RESULT] [508642.648770] [gpustress-9000-sgemm-false] gst GFLOPS - [RESULT] [508643.652155] [gpustress-9000-sgemm-false] gst GFLOPS - [RESULT] [508644.657965] [gpustress-9000-sgemm-false] gst GFLOPS - [RESULT] [508646.633979] [gpustress-9000-sgemm-false] gst GFLOPS - [RESULT] [508647.641379] [gpustress-9000-sgemm-false] gst GFLOPS - [RESULT] [508648.649070] [gpustress-9000-sgemm-false] gst GFLOPS - [RESULT] [508649.657010] [gpustress-9000-sgemm-false] gst GFLOPS - [RESULT] [508650.665296] [gpustress-9000-sgemm-false] gst GFLOPS - [RESULT] [508655.632843] [gpustress-9000-sgemm-false] gst GFLOPS Target stress : met :TRUE - - .. tab-item:: Commands - - :: - - sudo /opt/rocm/bin/rvs -c /opt/rocm/share/rocm-validation-suite/conf/gst_single.conf - -**Example of PCIe bandwidth benchmarks with the PBQT module** - -.. tab-set:: - - .. tab-item:: Shell Output - - .. code-block:: shell - - $ sudo /opt/rocm/rvs/rvs -c /opt/rocm/share/rocm-validation-suite/conf/pbqt_single.conf -d 3 - - [RESULT] [1148200.536604] Action name :action_1 - - Discovered Nodes - ============================================== - - Node Name Node Type Index GPU ID - ============================================================================================================================= - CPU 0 N/A - - CPU 1 N/A - - CPU 2 N/A - - CPU 3 N/A - - GPU 4 - - GPU 5 - ============================================================================================================================= - [RESULT] [1148200.576371] Module name :pbqt - [INFO ] [1148200.576394] Missing 'device_index' key. - [RESULT] [1148200.576498] [action_1] p2p peers:true distance:72 PCIe:72 - [RESULT] [1148205.576740] [action_1] p2p-bandwidth [1/1] bidirectional: true GBps duration: sec - [RESULT] [1148205.577850] Action name :action_2 - [RESULT] [1148205.577862] Module name :pbqt - [INFO ] [1148205.577883] Missing 'device_index' key. - [RESULT] [1148205.578085] [action_2] p2p peers:true distance:72 PCIe:72 - [INFO ] [1148216.581794] [action_2] p2p-bandwidth [1/1] bidirectional: true GBps - [INFO ] [1148217.581371] [action_2] p2p-bandwidth [1/1] bidirectional: true GBps - [INFO ] [1148218.580844] [action_2] p2p-bandwidth [1/1] bidirectional: true GBps - [INFO ] [1148219.580909] [action_2] p2p-bandwidth [1/1] bidirectional: true GBps - - .. 
tab-item:: Commands - - :: - - sudo /opt/rocm/rvs/rvs -c /opt/rocm/share/rocm-validation-suite/conf/pbqt_single.conf -d 3 - -Run TransferBench ------------------ - -TransferBench is a tool you can use to benchmark simultaneous transfers between CPU and GPU devices. To use, navigate to the TransferBench installation folder (the folder created when you ran ``git clone https://github.com/ROCmSoftwarePlatform/TransferBench.git`` in previous directions). Run the ``./TransferBench`` command to get a list of common commands, flags, and an overview of your CPU/GPU topology as detected by TransferBench. - -Like RVS, TransferBench runs tests from configuration files. You can either run one of several preset configuration files or define your own. A useful all-around test to run is ``p2p``, which tests the unidirectional and bidirectional transfer rates on all CPUs and GPUs detected by TransferBench. See the example below for the output of this test on a 2-CPU, 8-GPU node with 4 MB transfer packets. - -.. tab-set:: - - .. tab-item:: Shell Output - - .. code-block:: shell - - $ ./TransferBench p2p 4M - - TransferBench v1.50 - =============================================================== - [Common] - ALWAYS_VALIDATE = 0 : Validating after all iterations - …… - Bytes Per Direction 4194304 - Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX) - SRC+EXE\DST CPU 00 CPU 01 GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 - CPU 00 -> 24.37 25.62 17.32 16.97 17.33 17.47 16.77 17.12 16.91 16.96 - CPU 01 -> 18.83 19.62 14.84 15.47 15.16 15.13 16.11 16.13 16.01 15.91 - - GPU 00 -> 23.83 23.40 108.95 64.58 31.56 28.39 28.44 26.99 47.46 39.97 - GPU 01 -> 24.05 23.93 66.52 109.18 29.07 32.53 27.80 31.73 40.79 36.42 - GPU 02 -> 23.83 23.47 31.48 28.58 109.45 65.11 47.40 40.11 28.45 27.46 - GPU 03 -> 24.35 23.93 28.65 32.00 65.68 108.68 39.85 36.08 27.08 31.49 - GPU 04 -> 23.30 23.84 28.57 26.93 47.36 39.77 110.94 64.66 31.14 28.15 - GPU 05 -> 23.39 24.08 27.19 31.26 39.85 35.49 64.98 110.10 28.57 31.43 - GPU 06 -> 23.43 24.03 47.58 39.22 28.97 26.93 31.48 28.41 109.78 64.98 - GPU 07 -> 23.45 23.94 39.70 35.50 27.08 31.25 28.14 32.19 65.00 110.47 - CPU->CPU CPU->GPU GPU->CPU GPU->GPU - Averages (During UniDir): 22.23 16.35 23.77 37.74 - - Bidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX) - SRC\DST CPU 00 CPU 01 GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 - CPU 00 -> N/A 17.07 16.90 17.09 15.39 17.07 16.62 16.65 16.40 16.32 - CPU 00 <- N/A 13.90 24.06 24.03 24.00 24.21 23.09 23.14 22.11 22.15 - CPU 00 <-> N/A 30.97 40.96 41.12 39.39 41.28 39.71 39.80 38.51 38.47 - - CPU 01 -> 12.85 N/A 15.29 15.14 15.03 15.16 15.95 15.62 16.06 15.85 - CPU 01 <- 17.34 N/A 22.95 23.18 22.98 22.92 23.86 24.05 23.94 23.94 - CPU 01 <-> 30.19 N/A 38.24 38.32 38.01 38.08 39.80 39.67 40.00 39.79 - - - GPU 00 -> 23.99 22.94 N/A 62.40 30.30 25.15 25.00 25.20 46.58 37.99 - GPU 00 <- 16.87 14.75 N/A 65.21 31.10 25.91 25.53 25.48 47.34 38.17 - GPU 00 <-> 40.85 37.69 N/A 127.61 61.40 51.06 50.53 50.68 93.91 76.16 - - GPU 01 -> 24.11 23.20 65.10 N/A 25.88 31.74 25.66 31.01 39.37 34.75 - GPU 01 <- 17.00 14.08 61.91 N/A 26.09 31.90 25.73 31.34 38.97 34.76 - GPU 01 <-> 41.11 37.29 127.01 N/A 51.97 63.64 51.39 62.35 78.35 69.51 - - GPU 02 -> 23.89 22.78 30.94 26.39 N/A 62.22 45.73 38.40 25.95 25.26 - GPU 02 <- 16.59 13.91 30.47 26.54 N/A 63.63 47.42 38.68 26.29 25.64 - GPU 02 <-> 40.48 36.69 61.42 52.93 N/A 125.85 93.15 77.08 52.24 50.90 - - GPU 03 -> 24.15 
22.98 25.84 31.69 64.03 N/A 38.82 35.12 25.46 30.82 - GPU 03 <- 17.22 14.19 25.28 31.16 61.90 N/A 38.16 34.85 25.81 30.97 - GPU 03 <-> 41.37 37.16 51.12 62.84 125.93 N/A 76.99 69.97 51.27 61.79 - - GPU 04 -> 23.12 23.73 25.50 25.40 47.04 38.29 N/A 62.44 30.56 25.15 - GPU 04 <- 16.15 12.86 25.13 25.63 46.38 38.65 N/A 63.89 30.88 25.74 - GPU 04 <-> 39.27 36.58 50.63 51.03 93.42 76.94 N/A 126.34 61.43 50.89 - - GPU 05 -> 23.09 24.04 25.61 31.29 38.82 34.96 63.55 N/A 25.87 30.35 - GPU 05 <- 13.65 15.46 25.26 30.87 38.51 34.70 61.57 N/A 26.34 31.47 - GPU 05 <-> 36.75 39.50 50.87 62.16 77.32 69.66 125.12 N/A 52.21 61.82 - - GPU 06 -> 22.09 23.73 47.51 38.56 26.15 25.59 31.32 25.98 N/A 62.34 - GPU 06 <- 16.31 15.40 46.22 39.16 25.63 25.17 30.44 25.58 N/A 63.88 - GPU 06 <-> 38.39 39.13 93.72 77.72 51.78 50.76 61.76 51.56 N/A 126.22 - - GPU 07 -> 22.31 23.88 38.68 34.96 25.54 30.96 25.79 31.28 63.69 N/A - GPU 07 <- 16.27 15.89 38.39 35.06 25.27 30.62 25.25 30.91 62.36 N/A - GPU 07 <-> 38.58 39.77 77.07 70.02 50.81 61.58 51.05 62.20 126.04 N/A - CPU->CPU CPU->GPU GPU->CPU GPU->GPU - Averages (During BiDir): 15.29 19.72 19.39 36.17 - - .. tab-item:: Commands - - :: - - ./TransferBench p2p 4M - -If you want to define your own configuration file, run ``cat ~/TransferBench/examples/example.cfg`` to view an example configuration file with information on commands and arguments to run more granular testing. Running DMA tests between single pairs of devices is one helpful and common use-case for custom configuration files. See the `TransferBench documentation `_ for more information. - -Run ROCm Bandwidth Test (RBT) ------------------------------ - -ROCm Bandwidth Test lets you identify performance characteristics for host-to-device (H2D), device-to-host (D2H), and device-to-device (D2D) buffer copies on a ROCm platform. This assists when looking for abnormalities and tuning performance. - -Run ``/opt/rocm/bin/rocm-bandwidth-test -h`` to get a help screen with available commands. - -.. code-block:: shell - - $ /opt/rocm/bin/rocm-bandwidth-test -h - - Supported arguments: - - -h Prints the help screen - -q Query version of the test - -v Run the test in validation mode - -l Run test to collect Latency data - -c Time the operation using CPU Timers - -e Prints the list of ROCm devices enabled on platform - -i Initialize copy buffer with specified 'long double' pattern - -t Prints system topology and allocatable memory info - -m List of buffer sizes to use, specified in Megabytes - -b List devices to use in bidirectional copy operations - -s List of source devices to use in copy unidirectional operations - -d List of destination devices to use in unidirectional copy operations - -a Perform Unidirectional Copy involving all device combinations - -A Perform Bidirectional Copy involving all device combinations - - NOTE: Mixing following options is illegal/unsupported - Case 1: rocm_bandwidth_test -a with {lm}{1,} - Case 2: rocm_bandwidth_test -b with {clv}{1,} - Case 3: rocm_bandwidth_test -A with {clmv}{1,} - Case 4: rocm_bandwidth_test -s x -d y with {lmv}{2,} - - -The default behavior of ``/opt/rocm/bin/rocm-bandwidth-test`` without any flags runs unilateral and bilateral benchmarks (flags -a and -A) on all available combinations of device. Review the following for examples of common commands and output. - -Getting a list of all ROCm-detected devices: - -.. tab-set:: - - .. tab-item:: Shell Output - - .. 
code-block:: shell - - $ /opt/rocm/bin/rocm-bandwidth-test -e - - RocmBandwidthTest Version: 2.6.0 - - Launch Command is: /opt/rocm/bin/rocm-bandwidth-test -e - - - Device Index: 0 - Device Type: CPU - Device Name: - Allocatable Memory Size (KB): 1044325060 - - Device Index: 1 - Device Type: CPU - Device Name: - Allocatable Memory Size (KB): 1056868156 - - Device Index: 2 - Device Type: GPU - Device Name: - Device BDF: XX:0.0 - Device UUID: GPU-0000 - Allocatable Memory Size (KB): 67092480 - Allocatable Memory Size (KB): 67092480 - - Device Index: 3 - Device Type: GPU - Device Name: - Device BDF: XX:0.0 - Device UUID: GPU-0000 - Allocatable Memory Size (KB): 67092480 - Allocatable Memory Size (KB): 67092480 - - Device Index: 4 - Device Type: GPU - Device Name: - Device BDF: XX:0.0 - Device UUID: GPU-0000 - Allocatable Memory Size (KB): 67092480 - Allocatable Memory Size (KB): 67092480 - - Device Index: 5 - Device Type: GPU - Device Name: - Device BDF: XX:0.0 - Device UUID: GPU-0000 - Allocatable Memory Size (KB): 67092480 - Allocatable Memory Size (KB): 67092480 - - Device Index: 6 - Device Type: GPU - Device Name: - Device BDF: XX:0.0 - Device UUID: GPU-0000 - Allocatable Memory Size (KB): 67092480 - Allocatable Memory Size (KB): 67092480 - - Device Index: 7 - Device Type: GPU - Device Name: - Device BDF: XX:0.0 - Device UUID: GPU-0000 - Allocatable Memory Size (KB): 67092480 - Allocatable Memory Size (KB): 67092480 - - Device Index: 8 - Device Type: GPU - Device Name: - Device BDF: XX:0.0 - Device UUID: GPU-0000 - Allocatable Memory Size (KB): 67092480 - Allocatable Memory Size (KB): 67092480 - - Device Index: 9 - Device Type: GPU - Device Name: - Device BDF: XX:0.0 - Device UUID: GPU-0000 - Allocatable Memory Size (KB): 67092480 - Allocatable Memory Size (KB): 67092480 - - .. tab-item:: Commands - - :: - - /opt/rocm/bin/rocm-bandwidth-test -e - -Running a unidirectional benchmark between devices 0 (CPU) and 4 (GPU): - -.. tab-set:: - - .. tab-item:: Shell Output - - .. code-block:: shell - - $ /opt/rocm/bin/rocm-bandwidth-test -s 0 -d 4 - ........................................ - RocmBandwidthTest Version: 2.6.0 - - Launch Command is: /opt/rocm/bin/rocm-bandwidth-test -s 0 -d 4 - - - ================ Unidirectional Benchmark Result ================ - ================ Src Device Id: 0 Src Device Type: Cpu ================ - ================ Dst Device Id: 4 Dst Device Type: Gpu ================ - - Data Size Avg Time(us) Avg BW(GB/s) Min Time(us) Peak BW(GB/s) - 1 KB 5.400 0.190 5.280 0.194 - 2 KB 5.360 0.382 5.280 0.388 - 4 KB 5.440 0.753 5.440 0.753 - 8 KB 5.440 1.506 5.440 1.506 - 16 KB 5.880 2.786 5.760 2.844 - 32 KB 6.400 5.120 6.400 5.120 - 64 KB 7.520 8.715 7.520 8.715 - 128 KB 9.920 13.213 9.920 13.213 - 256 KB 14.520 18.054 14.400 18.204 - 512 KB 23.560 22.253 23.520 22.291 - 1 MB 41.880 25.038 41.760 25.110 - 2 MB 78.400 26.749 78.400 26.749 - 4 MB 153.201 27.378 152.641 27.478 - 8 MB 299.641 27.996 299.521 28.007 - 16 MB 592.002 28.340 592.002 28.340 - 32 MB 1176.925 28.510 1176.805 28.513 - 64 MB 2346.730 28.597 2346.730 28.597 - 128 MB 4686.180 28.641 4686.100 28.642 - 256 MB 9365.280 28.663 9365.160 28.663 - 512 MB 18722.762 28.675 18722.482 28.675 - - .. tab-item:: Commands - - :: - - /opt/rocm/bin/rocm-bandwidth-test -s 0 -d 4 - -Running a bidirectional benchmark on all available device combinations: - -.. tab-set:: - - .. tab-item:: Shell Output - - .. 
code-block:: shell - - $ /opt/rocm/bin/rocm-bandwidth-test -A - - …… - Bidirectional copy peak bandwidth GB/s - - D/D 0 1 2 3 4 5 6 7 8 9 - - 0 N/A N/A 47.703 47.679 47.619 47.586 38.106 38.160 36.771 36.773 - - 1 N/A N/A 38.351 38.395 36.488 36.454 47.495 47.512 47.525 47.471 - - 2 47.703 38.351 N/A 101.458 80.902 81.300 81.387 79.279 101.526 101.106 - - 3 47.679 38.395 101.458 N/A 81.278 80.488 79.535 79.907 101.615 101.618 - - 4 47.619 36.488 80.902 81.278 N/A 101.643 101.089 101.693 81.336 79.232 - - 5 47.586 36.454 81.300 80.488 101.643 N/A 101.217 101.478 79.460 79.922 - - 6 38.106 47.495 81.387 79.535 101.089 101.217 N/A 101.506 80.497 81.302 - - 7 38.160 47.512 79.279 79.907 101.693 101.478 101.506 N/A 81.301 80.501 - - 8 36.771 47.525 101.526 101.615 81.336 79.460 80.497 81.301 N/A 100.908 - - 9 36.773 47.471 101.106 101.618 79.232 79.922 81.302 80.501 100.908 N/A - - .. tab-item:: Commands - - :: - - /opt/rocm/bin/rocm-bandwidth-test -A - -For a more detailed explanation of different ways to run RBT, see the `ROCm Bandwidth Test User Guide `_. - -Configuration scripts -===================== - -Run these scripts where indicated to aid in the configuration and setup of your devices. - -.. _disable-acs-script: - -.. dropdown:: Disable ACS script - - .. code-block:: shell - - #!/bin/bash - # - # Disable ACS on every device that supports it - # - PLATFORM=$(dmidecode --string system-product-name) - logger "PLATFORM=${PLATFORM}" - # Enforce platform check here. - #case "${PLATFORM}" in - #"OAM"*) - #logger "INFO: Disabling ACS is no longer necessary for ${PLATFORM}" - #exit 0 - #;; - #*) - #;; - #esac - # must be root to access extended PCI config space - if [ "$EUID" -ne 0 ]; then - echo "ERROR: $0 must be run as root" - exit 1 - fi - for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do - # skip if it doesn't support ACS - setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1 - if [ $? -ne 0 ]; then - #echo "${BDF} does not support ACS, skipping" - continue - fi - logger "Disabling ACS on `lspci -s ${BDF}`" - setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000 - if [ $? -ne 0 ]; then - logger "Error enabling directTrans ACS on ${BDF}" - continue - fi - NEW_VAL=`setpci -v -s ${BDF} ECAP_ACS+0x6.w | awk '{print $NF}'` - if [ "${NEW_VAL}" != "0000" ]; then - logger "Failed to enabling directTrans ACS on ${BDF}" - continue - fi - done - exit 0 - -.. _RoCE-configuration-script-for-Broadcom-Thor-NIC: - -.. dropdown:: RoCE configuration script for Broadcom Thor NIC - - .. 
code-block:: shell - - # Increase Max Read request Size to 4k - lspci -vvvs 41:00.0 | grep axReadReq - - # Check if Relaxed Ordering is enabled - - for i in $(sudo niccli listdev | grep Interface | awk {'print $5'}); \ do echo $i - $(sudo niccli -dev=$i getoption -name pcie_relaxed_ordering); done - - # Set Relaxed Ordering if not enabled - - for i in $(sudo niccli listdev | grep Interface | awk {'print $5'}); \ do echo $i - $(sudo niccli -dev=$i setoption -name pcie_relaxed_ordering -value 1); done - - # Check if RDMA support is enabled - - for i in $(sudo niccli listdev | grep Interface | awk {'print $5'}); \ do echo $i - $(sudo niccli -dev=$i getoption -name support_rdma -scope 0) - $(sudo niccli -dev=$i \ getoption=support_rdma:1); done - - # Set RMDA support if not enabled - - for i in $(sudo niccli listdev | grep Interface | awk {'print $5'}); \ do echo $i - $(sudo \ niccli -dev=$i setoption -name support_rdma -scope 0 -value 1) - $(sudo niccli -dev=$i \ setoption -name support_rdma -scope 1 -value 1); done - - # Set Speed Mask - - niccli -dev= setoption=autodetect_speed_exclude_mask:0#01C0 - - # Set 200Gbps - - ethtool -s autoneg off speed 200000 duplex full - - # Set performance profile to RoCE ==REQUIRES REBOOT IF OLDER FIRMWARE LOADED== - - for i in $(sudo niccli listdev | grep Interface | awk {'print $5'}); \ do echo $i - $(sudo \ niccli -dev=$i setoption -name performance_profile -value 1); done - -Reference Documentation -======================= - -* `ROCm Documentation `_ - -* `ROCm installation for Linux `_ - -* `Nvidia MLNX_OFED Documentation `_ - -* `ROCm Validation Suite Documentation `_ - -* `TransferBench Documentation `_ - -* `ROCm Bandwidth Test User Guide `_ - -* `Broadcom Ethernet Network Adapter User Guide `_ - -Resources and Helpful Links -=========================== - -* `AMD Infinity Hub `_ -* `AMD ROCm Developer Hub `_ +.. meta:: + :description: Learn how to configure a single node for network testing. + :keywords: network validation, DCGPU, single node, ROCm, RCCL, machine learning, LLM, usage, tutorial + +*************************************************************** +Single-node network configuration for AMD Instinct accelerators +*************************************************************** + +This section explains setting up a testing environment on a single accelerator node and running benchmarks to simulate an AI or HPC workload. + +Prerequisites +============= + +Before following the steps in the following sections, ensure you have completed +these prerequisites. + +#. Install GPU and network hardware. Refer to the + :ref:`hardware support matrix `. + +#. Install OS and required GPU and network software on each node: + + * :doc:`Install the ROCm software stack `. + + * Install network drivers for NICs. If using InfiniBand, also install OpenSM. + +#. Ensure network settings are correctly configured for your hardware. + +#. Configure system BIOS and OS settings according to + :doc:`rocm:how-to/system-optimization/index` for your architecture + (MI300, MI200, and so on). + +#. Disable NUMA balancing. + + a. Run ``sudo sysctl kernel.numa_balancing=0``. + + b. To verify NUMA balancing is disabled, run + ``cat /proc/sys/kernel/numa_balancing`` and confirm that ``0`` is + returned. + + c. See :ref:`rocm:mi300x-disable-numa` for more information. + +#. Disable PCI ACS (access control services). Run the + :ref:`disable ACS script` on all PCIe devices supporting + it. This must be done after each reboot. + +#. Configure IOMMU settings. + + a. 
Add ``iommu=pt`` to the ``GRUB_CMDLINE_LINUX_DEFAULT`` entry in + ``/etc/default/grub``. + + b. Run ``sudo update-grub``, then reboot. + + c. See :ref:`rocm:mi300x-grub-settings` and + :ref:`rocm-install-on-linux:multi-gpu` for more information. + +#. Verify group permissions. + + a. Ensure the user belongs to the ``render`` and ``video`` groups. + + b. Refer to :ref:`rocm-install-on-linux:group_permissions` for guidance. + +Best practices for software consistency +--------------------------------------- + +To ensure consistent software configurations across systems: + +* Use a shared NFS (network file system) mount. Install the necessary software + on a common NFS mount accessible to all systems. + +* Create a system image with all the software installed. Re-image when software + changes are made. + +Validate PCIe performance +========================= + +Checking that your relevant PCIe devices (GPUs, NICs, and internal switches) are +using the maximum available transfer speed and width in their respective bus +keeps you from having to troubleshoot any related issues in subsequent testing +where it may not be obvious. + +.. tip:: + + Gather all the PCIe addresses for your GPUs, NICs, and switches in advance + and document them so that you don't need to do it while following these + steps. + +Check PCIe device speed and width +--------------------------------- + +#. From the command line of your host, run ``lspci`` to retrieve a list of PCIe + devices and locate your GPU and network devices. + +#. Run ``sudo lspci -s -vvv | grep Speed`` to review the speed and + width of your device. This example shows the speed and width for a GPU at the + address ``02:00.0``. + + .. tab-set:: + + .. tab-item:: Shell output + + .. code-block:: shell + + $ sudo lspci -s 02:00.0 -vvv | grep Speed + + LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us + LnkSta: Speed 32GT/s (ok), Width x16 (ok) + + .. tab-item:: Commands + + :: + + sudo lspci -s 02:00.0 -vvv | grep Speed + + The maximum supported speed of the GPU is reported in ``LnkCap`` along with + a width of x16. Current status is shown in ``LnkSta``--both speed and width + are aligned. Your values may differ depending on your hardware. + +#. Query and validate all GPUs in your node with the previous steps. + +#. Gather the PCI addresses for your NICs and validate them next. See this + example from a NIC running at ``05:00.0``: + + .. tab-set:: + + .. tab-item:: Shell output + + .. code-block:: shell + + $ sudo lspci -s 05:00.0 -vvv | grep Speed + + LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported + LnkSta: Speed 16GT/s (ok), Width x16 (ok) + + .. tab-item:: Commands + + :: + + sudo lspci -s 05:00.0 -vvv | grep Speed + + Here, the NIC is running at a speed of 16GT/s. However, because the NIC + configuration only supports PCIe Gen4 speeds, this is an expected value. + +Once you verify all GPUs and NICs are running at maximum supported speeds and +widths, then proceed to the next section. + +.. note:: + + If you're running a cloud instance, hardware passthrough to your guest OS + might not be accurate. Verify your ``lspci`` results with your cloud + provider. + +Check PCIe switch speed and width +--------------------------------- + +Now, check the PCIe switches to ensure they are operating at the maximum speed +and width for the ``LnkSta`` (Link Status). + +#. Run ``lspci -vv`` and ``lspci -tv`` to identify PCIe switch locations on the + server. + +#. 
Run ``lspci -vvv | grep Speed`` to verify speed and width as + previously demonstrated. + +Check max payload size and max read request +------------------------------------------- + +The ``MaxPayload`` and ``MaxReadReq`` attributes define the maximum size of PCIe +packets and the number of simultaneous read requests, respectively. For optimal +bandwidth, ensure that all GPUs and NICs are configured to use the maximum +values for both attributes. + +#. Run ``sudo lspci -vvv | grep DevCtl: -C 2`` to review max + payload size and max read request. Here is an example using the same NIC as + before. + + .. tab-set:: + + .. tab-item:: Shell output + + .. code-block:: shell-session + + $ sudo lspci -vvv 05:00.0 | grep DevCtl: -C 2 + + DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us + ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 40.000W + DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq- + RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+ FLReset- + MaxPayload 512 bytes, MaxReadReq 4096 bytes + + .. tab-item:: Commands + + :: + + sudo lspci -vvv 05:00.0 | grep DevCtl: -C 2 + +#. ``MaxReadRequest`` is unique because it can be changed during runtime with + the ``setpci`` command. If your value here is lower than expected, you can + correct it as follows: + + .. tab-set:: + + .. tab-item:: Shell output + + .. code-block:: shell + + $ sudo lspci -vvvs a1:00.0 | grep axReadReq + + MaxPayload 512 bytes, MaxReadReq 512 bytes + + $ sudo setpci -s a1:00.0 68.w + + 295e + + $ sudo setpci -s a1:00.0 68.w=595e + + $ sudo lspci -vvvs a1:00.0 | grep axReadReq + + MaxPayload 512 bytes, MaxReadReq 4096 bytes + + .. tab-item:: Commands + + :: + + sudo lspci -vvvs a1:00.0 | grep axReadReq + + sudo setpci -s a1:00.0 68.w + + sudo setpci -s a1:00.0 68.w=595e + + sudo lspci -vvvs a1:00.0 | grep axReadReq + +.. note:: + + Changes made with ``setpci`` are not persistent across reboots. This example + uses a single NIC for simplicity, but in practice you must run the change for + each NIC in the node. + +Validate NIC configuration +========================== + +After you've verified optimal PCIe speeds for all devices, configure your NICs +according to best practices in the manufacturer or vendor documentation. This +might already include some of the pre-assessment steps outlined in this guide and +provide more hardware-specific tuning optimizations. + +Vendor-specific NIC tuning +-------------------------- + +Your NICs may require tuning if it has not already been done. Some steps differ +based on the type of NIC you're deploying (InfiniBand or RoCE). + +* Ensure :ref:`ACS is disabled`. + +* For Mellanox NICs (InfiniBand or RoCE): Disable ATS, enable PCI Relaxed Ordering, increase max read requests, enable advanced PCI settings. + + .. code-block:: shell + + sudo mst start + + sudo mst status + + sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s ADVANCED_PCI_SETTINGS=1 + + sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s MAX_ACC_OUT_READ=44 + + sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s PCI_WR_ORDERING=1 + + reboot + +* For Broadcom NICs, ensure RoCE is enabled and consider disabling any unused + ports. See the :ref:`Broadcom RoCE configuration scripts` + for more details. + +* Ensure Relaxed Ordering is enabled in the PCIe settings for your system BIOS as well. + +.. note:: + + All instructions for RoCE networks in this guide and additional guides are + based on the v2 protocol. + +Check NIC link speed +-------------------- + +Verify the NICs in your servers are reporting the correct speeds. 
Several commands and utilities are available to measure speed based on your network type. + +* RoCE / Ethernet + - ``sudo ethtool | grep -i speed`` + - ``cat /sys/class/net//speed`` + +* InfiniBand + - ``ibdiagnet`` provides an output of the entire fabric in the default log files. You can verify link speeds here. + - ``ibstat`` or ``ibstatus`` tells you if the link is up and the speed at which it is running for all HCAs in the server. + +Verify Mellanox OFED and firmware installation +---------------------------------------------- + +.. note:: + + This step is only necessary for InfiniBand networks. + +Download the latest version of +`Mellanox OFED (MLNX_OFED) `_ +from NVIDIA. Run the installer and flint tools to verify the latest version of +MLNX_OFED and firmware is on the HCAs. + +Set up a GPU testing environment +================================ + +Next, create a testing environment to gather performance data for your GPUs. +This requires installation of ROCm Validation Suite (RVS), TransferBench, and +ROCm Bandwidth Test. + +#. Connect to the CLI of your GPU node. + +#. Install ROCm Validation Suite following the directions at + :doc:`ROCmValidationSuite:install/installation` + + * Once installed, RVS is located in ``/opt/rocm/``. + +#. Install TransferBench. Refer to :doc:`transferbench:install/install` for + details. + + .. code-block:: shell + + $ git clone https://github.com/ROCm/TransferBench.git + + $ cd TransferBench + + $ sudo make + + # Running make without sudo seems to cause runtime issues + # If this doesn't work, install math libraries manually using https://github.com/ROCm/ROCm/issues/1843 + + $ sudo apt install libstdc++-12-dev + +#. Install ROCm Bandwidth Test. Refer to :doc:`rocm_bandwidth_test:install/install` + for details. + + .. code-block:: shell + + $ sudo apt install rocm-bandwidth-test + +Run ROCm Validation Suite (RVS) +------------------------------- + +RVS contains many different tests, otherwise referred to as modules. The relevant tests for this guide are as follows: + +* `P2P Benchmark and Qualification Tool `_ (PBQT) + +* `ROCm Configuration Qualification Tool `_ (RCQT) + +* `PCI Express Bandwidth Benchmark `_ (PEBB) + +* `GPU Properties `_ (GPUP) + +* `GPU Stress test `_ (GST) + +You can run multiple tests at once with ``sudo /opt/rocm/rvs/rvs -d 3``, which +runs all tests set in ``/opt/rocm/share/rocm-validation-suite/rvs.conf`` at +verbosity level 3. The default tests are GPUP, PEQT, PEBB, and PBQT, but you can +modify the config file to add your preferred tests. The +:doc:`RVS documentation ` has more +information on how to modify ``rvs.conf`` and helpful command line options. + +.. tip:: + + When you identify a problem, use ``rvs -g`` to understand what the GPU ID is + referring to. + + GPU numbering in RVS does not have the same order as in ``rocm-smi``. To map + the GPU order listed in ``rvs-g`` to the rocm output, run + ``rocm-smi --showbus`` and match each GPU by bus ID. + +You can run a specific RVS test by calling its configuration file with +``sudo /opt/rocm/bin/rvs -c /opt/rocm/share/rocm-validation-suite/conf/.conf``. +The following shell examples demonstrate what the commands and outputs look like +for some of these tests. + +Example of GPU stress tests with the GST module +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. tab-set:: + + .. tab-item:: Shell output + + .. 
code-block:: shell-session + + $ sudo /opt/rocm/bin/rvs -c /opt/rocm/share/rocm-validation-suite/conf/gst_single.conf + + [RESULT] [508635.659800] Action name :gpustress-9000-sgemm-false + [RESULT] [508635.660582] Module name :gst + [RESULT] [508642.648770] [gpustress-9000-sgemm-false] gst GFLOPS + [RESULT] [508643.652155] [gpustress-9000-sgemm-false] gst GFLOPS + [RESULT] [508644.657965] [gpustress-9000-sgemm-false] gst GFLOPS + [RESULT] [508646.633979] [gpustress-9000-sgemm-false] gst GFLOPS + [RESULT] [508647.641379] [gpustress-9000-sgemm-false] gst GFLOPS + [RESULT] [508648.649070] [gpustress-9000-sgemm-false] gst GFLOPS + [RESULT] [508649.657010] [gpustress-9000-sgemm-false] gst GFLOPS + [RESULT] [508650.665296] [gpustress-9000-sgemm-false] gst GFLOPS + [RESULT] [508655.632843] [gpustress-9000-sgemm-false] gst GFLOPS Target stress : met :TRUE + + .. tab-item:: Commands + + :: + + sudo /opt/rocm/bin/rvs -c /opt/rocm/share/rocm-validation-suite/conf/gst_single.conf + +Example of PCIe bandwidth benchmarks with the PBQT module +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. tab-set:: + + .. tab-item:: Shell output + + .. code-block:: shell-session + + $ sudo /opt/rocm/rvs/rvs -c /opt/rocm/share/rocm-validation-suite/conf/pbqt_single.conf -d 3 + + [RESULT] [1148200.536604] Action name :action_1 + + Discovered Nodes + ============================================== + + Node Name Node Type Index GPU ID + ============================================================================================================================= + CPU 0 N/A + + CPU 1 N/A + + CPU 2 N/A + + CPU 3 N/A + + GPU 4 + + GPU 5 + ============================================================================================================================= + [RESULT] [1148200.576371] Module name :pbqt + [INFO ] [1148200.576394] Missing 'device_index' key. + [RESULT] [1148200.576498] [action_1] p2p peers:true distance:72 PCIe:72 + [RESULT] [1148205.576740] [action_1] p2p-bandwidth [1/1] bidirectional: true GBps duration: sec + [RESULT] [1148205.577850] Action name :action_2 + [RESULT] [1148205.577862] Module name :pbqt + [INFO ] [1148205.577883] Missing 'device_index' key. + [RESULT] [1148205.578085] [action_2] p2p peers:true distance:72 PCIe:72 + [INFO ] [1148216.581794] [action_2] p2p-bandwidth [1/1] bidirectional: true GBps + [INFO ] [1148217.581371] [action_2] p2p-bandwidth [1/1] bidirectional: true GBps + [INFO ] [1148218.580844] [action_2] p2p-bandwidth [1/1] bidirectional: true GBps + [INFO ] [1148219.580909] [action_2] p2p-bandwidth [1/1] bidirectional: true GBps + + .. tab-item:: Commands + + :: + + sudo /opt/rocm/rvs/rvs -c /opt/rocm/share/rocm-validation-suite/conf/pbqt_single.conf -d 3 + +Run TransferBench +----------------- + +TransferBench is a benchmarking tool designed to measure simultaneous data +transfers between CPU and GPU devices. To use it, first navigate to the +TransferBench installation directory. Then, execute the following command to +display available commands, flags, and an overview of your system's CPU/GPU +topology as detected by TransferBench: + +.. code-block:: shell + + ./TransferBench + +Like RVS, TransferBench operates based on configuration files. You can either +choose from several preset configuration files or create a custom configuration +to suit your testing needs. A commonly recommended test is the ``p2p`` +(peer-to-peer) test, which measures unidirectional and bidirectional transfer +rates across all CPUs and GPUs detected by the tool. 
The following example shows +the output of a ``p2p`` test on a system with 2 CPUs and 8 GPUs, using 4 MB +transfer packets. + +.. tab-set:: + + .. tab-item:: Shell output + + .. code-block:: shell-session + + $ ./TransferBench p2p 4M + + TransferBench v1.50 + =============================================================== + [Common] + ALWAYS_VALIDATE = 0 : Validating after all iterations + …… + Bytes Per Direction 4194304 + Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX) + SRC+EXE\DST CPU 00 CPU 01 GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 + CPU 00 -> 24.37 25.62 17.32 16.97 17.33 17.47 16.77 17.12 16.91 16.96 + CPU 01 -> 18.83 19.62 14.84 15.47 15.16 15.13 16.11 16.13 16.01 15.91 + + GPU 00 -> 23.83 23.40 108.95 64.58 31.56 28.39 28.44 26.99 47.46 39.97 + GPU 01 -> 24.05 23.93 66.52 109.18 29.07 32.53 27.80 31.73 40.79 36.42 + GPU 02 -> 23.83 23.47 31.48 28.58 109.45 65.11 47.40 40.11 28.45 27.46 + GPU 03 -> 24.35 23.93 28.65 32.00 65.68 108.68 39.85 36.08 27.08 31.49 + GPU 04 -> 23.30 23.84 28.57 26.93 47.36 39.77 110.94 64.66 31.14 28.15 + GPU 05 -> 23.39 24.08 27.19 31.26 39.85 35.49 64.98 110.10 28.57 31.43 + GPU 06 -> 23.43 24.03 47.58 39.22 28.97 26.93 31.48 28.41 109.78 64.98 + GPU 07 -> 23.45 23.94 39.70 35.50 27.08 31.25 28.14 32.19 65.00 110.47 + CPU->CPU CPU->GPU GPU->CPU GPU->GPU + Averages (During UniDir): 22.23 16.35 23.77 37.74 + + Bidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX) + SRC\DST CPU 00 CPU 01 GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 + CPU 00 -> N/A 17.07 16.90 17.09 15.39 17.07 16.62 16.65 16.40 16.32 + CPU 00 <- N/A 13.90 24.06 24.03 24.00 24.21 23.09 23.14 22.11 22.15 + CPU 00 <-> N/A 30.97 40.96 41.12 39.39 41.28 39.71 39.80 38.51 38.47 + + CPU 01 -> 12.85 N/A 15.29 15.14 15.03 15.16 15.95 15.62 16.06 15.85 + CPU 01 <- 17.34 N/A 22.95 23.18 22.98 22.92 23.86 24.05 23.94 23.94 + CPU 01 <-> 30.19 N/A 38.24 38.32 38.01 38.08 39.80 39.67 40.00 39.79 + + + GPU 00 -> 23.99 22.94 N/A 62.40 30.30 25.15 25.00 25.20 46.58 37.99 + GPU 00 <- 16.87 14.75 N/A 65.21 31.10 25.91 25.53 25.48 47.34 38.17 + GPU 00 <-> 40.85 37.69 N/A 127.61 61.40 51.06 50.53 50.68 93.91 76.16 + + GPU 01 -> 24.11 23.20 65.10 N/A 25.88 31.74 25.66 31.01 39.37 34.75 + GPU 01 <- 17.00 14.08 61.91 N/A 26.09 31.90 25.73 31.34 38.97 34.76 + GPU 01 <-> 41.11 37.29 127.01 N/A 51.97 63.64 51.39 62.35 78.35 69.51 + + GPU 02 -> 23.89 22.78 30.94 26.39 N/A 62.22 45.73 38.40 25.95 25.26 + GPU 02 <- 16.59 13.91 30.47 26.54 N/A 63.63 47.42 38.68 26.29 25.64 + GPU 02 <-> 40.48 36.69 61.42 52.93 N/A 125.85 93.15 77.08 52.24 50.90 + + GPU 03 -> 24.15 22.98 25.84 31.69 64.03 N/A 38.82 35.12 25.46 30.82 + GPU 03 <- 17.22 14.19 25.28 31.16 61.90 N/A 38.16 34.85 25.81 30.97 + GPU 03 <-> 41.37 37.16 51.12 62.84 125.93 N/A 76.99 69.97 51.27 61.79 + + GPU 04 -> 23.12 23.73 25.50 25.40 47.04 38.29 N/A 62.44 30.56 25.15 + GPU 04 <- 16.15 12.86 25.13 25.63 46.38 38.65 N/A 63.89 30.88 25.74 + GPU 04 <-> 39.27 36.58 50.63 51.03 93.42 76.94 N/A 126.34 61.43 50.89 + + GPU 05 -> 23.09 24.04 25.61 31.29 38.82 34.96 63.55 N/A 25.87 30.35 + GPU 05 <- 13.65 15.46 25.26 30.87 38.51 34.70 61.57 N/A 26.34 31.47 + GPU 05 <-> 36.75 39.50 50.87 62.16 77.32 69.66 125.12 N/A 52.21 61.82 + + GPU 06 -> 22.09 23.73 47.51 38.56 26.15 25.59 31.32 25.98 N/A 62.34 + GPU 06 <- 16.31 15.40 46.22 39.16 25.63 25.17 30.44 25.58 N/A 63.88 + GPU 06 <-> 38.39 39.13 93.72 77.72 51.78 50.76 61.76 51.56 N/A 126.22 + + GPU 07 -> 22.31 23.88 38.68 34.96 
25.54 30.96 25.79 31.28 63.69 N/A + GPU 07 <- 16.27 15.89 38.39 35.06 25.27 30.62 25.25 30.91 62.36 N/A + GPU 07 <-> 38.58 39.77 77.07 70.02 50.81 61.58 51.05 62.20 126.04 N/A + CPU->CPU CPU->GPU GPU->CPU GPU->GPU + Averages (During BiDir): 15.29 19.72 19.39 36.17 + + .. tab-item:: Commands + + :: + + ./TransferBench p2p 4M + +If you want to define your own configuration file, run +``cat ~/TransferBench/examples/example.cfg`` to view an example configuration +file with information on commands and arguments to run more granular testing. +Running DMA tests between single pairs of devices is one helpful and common +use case for custom configuration files. See the +`TransferBench documentation ` for more information. + +Run ROCm Bandwidth Test (RBT) +----------------------------- + +ROCm Bandwidth Test lets you identify performance characteristics for +host-to-device (H2D), device-to-host (D2H), and device-to-device (D2D) buffer +copies on a ROCm platform. This assists when looking for abnormalities and +tuning performance. + +Run ``/opt/rocm/bin/rocm-bandwidth-test -h`` to get a help screen with available +commands. + +.. code-block:: shell-session + + $ /opt/rocm/bin/rocm-bandwidth-test -h + + Supported arguments: + + -h Prints the help screen + -q Query version of the test + -v Run the test in validation mode + -l Run test to collect Latency data + -c Time the operation using CPU Timers + -e Prints the list of ROCm devices enabled on platform + -i Initialize copy buffer with specified 'long double' pattern + -t Prints system topology and allocatable memory info + -m List of buffer sizes to use, specified in Megabytes + -b List devices to use in bidirectional copy operations + -s List of source devices to use in copy unidirectional operations + -d List of destination devices to use in unidirectional copy operations + -a Perform Unidirectional Copy involving all device combinations + -A Perform Bidirectional Copy involving all device combinations + + NOTE: Mixing following options is illegal/unsupported + Case 1: rocm_bandwidth_test -a with {lm}{1,} + Case 2: rocm_bandwidth_test -b with {clv}{1,} + Case 3: rocm_bandwidth_test -A with {clmv}{1,} + Case 4: rocm_bandwidth_test -s x -d y with {lmv}{2,} + +The default behavior of ``/opt/rocm/bin/rocm-bandwidth-test`` without any flags +runs unilateral and bilateral benchmarks (flags ``-a`` and ``-A``) on all +available combinations of device. Review the following for examples of common +commands and output. + +Getting a list of all ROCm-detected devices: + +.. tab-set:: + + .. tab-item:: Shell output + + .. 
code-block:: shell-session + + $ /opt/rocm/bin/rocm-bandwidth-test -e + + RocmBandwidthTest Version: 2.6.0 + + Launch Command is: /opt/rocm/bin/rocm-bandwidth-test -e + + + Device Index: 0 + Device Type: CPU + Device Name: + Allocatable Memory Size (KB): 1044325060 + + Device Index: 1 + Device Type: CPU + Device Name: + Allocatable Memory Size (KB): 1056868156 + + Device Index: 2 + Device Type: GPU + Device Name: + Device BDF: XX:0.0 + Device UUID: GPU-0000 + Allocatable Memory Size (KB): 67092480 + Allocatable Memory Size (KB): 67092480 + + Device Index: 3 + Device Type: GPU + Device Name: + Device BDF: XX:0.0 + Device UUID: GPU-0000 + Allocatable Memory Size (KB): 67092480 + Allocatable Memory Size (KB): 67092480 + + Device Index: 4 + Device Type: GPU + Device Name: + Device BDF: XX:0.0 + Device UUID: GPU-0000 + Allocatable Memory Size (KB): 67092480 + Allocatable Memory Size (KB): 67092480 + + Device Index: 5 + Device Type: GPU + Device Name: + Device BDF: XX:0.0 + Device UUID: GPU-0000 + Allocatable Memory Size (KB): 67092480 + Allocatable Memory Size (KB): 67092480 + + Device Index: 6 + Device Type: GPU + Device Name: + Device BDF: XX:0.0 + Device UUID: GPU-0000 + Allocatable Memory Size (KB): 67092480 + Allocatable Memory Size (KB): 67092480 + + Device Index: 7 + Device Type: GPU + Device Name: + Device BDF: XX:0.0 + Device UUID: GPU-0000 + Allocatable Memory Size (KB): 67092480 + Allocatable Memory Size (KB): 67092480 + + Device Index: 8 + Device Type: GPU + Device Name: + Device BDF: XX:0.0 + Device UUID: GPU-0000 + Allocatable Memory Size (KB): 67092480 + Allocatable Memory Size (KB): 67092480 + + Device Index: 9 + Device Type: GPU + Device Name: + Device BDF: XX:0.0 + Device UUID: GPU-0000 + Allocatable Memory Size (KB): 67092480 + Allocatable Memory Size (KB): 67092480 + + .. tab-item:: Commands + + :: + + /opt/rocm/bin/rocm-bandwidth-test -e + +Running a unidirectional benchmark between devices 0 (CPU) and 4 (GPU): + +.. tab-set:: + + .. tab-item:: Shell output + + .. code-block:: shell + + $ /opt/rocm/bin/rocm-bandwidth-test -s 0 -d 4 + ........................................ + RocmBandwidthTest Version: 2.6.0 + + Launch Command is: /opt/rocm/bin/rocm-bandwidth-test -s 0 -d 4 + + + ================ Unidirectional Benchmark Result ================ + ================ Src Device Id: 0 Src Device Type: Cpu ================ + ================ Dst Device Id: 4 Dst Device Type: Gpu ================ + + Data Size Avg Time(us) Avg BW(GB/s) Min Time(us) Peak BW(GB/s) + 1 KB 5.400 0.190 5.280 0.194 + 2 KB 5.360 0.382 5.280 0.388 + 4 KB 5.440 0.753 5.440 0.753 + 8 KB 5.440 1.506 5.440 1.506 + 16 KB 5.880 2.786 5.760 2.844 + 32 KB 6.400 5.120 6.400 5.120 + 64 KB 7.520 8.715 7.520 8.715 + 128 KB 9.920 13.213 9.920 13.213 + 256 KB 14.520 18.054 14.400 18.204 + 512 KB 23.560 22.253 23.520 22.291 + 1 MB 41.880 25.038 41.760 25.110 + 2 MB 78.400 26.749 78.400 26.749 + 4 MB 153.201 27.378 152.641 27.478 + 8 MB 299.641 27.996 299.521 28.007 + 16 MB 592.002 28.340 592.002 28.340 + 32 MB 1176.925 28.510 1176.805 28.513 + 64 MB 2346.730 28.597 2346.730 28.597 + 128 MB 4686.180 28.641 4686.100 28.642 + 256 MB 9365.280 28.663 9365.160 28.663 + 512 MB 18722.762 28.675 18722.482 28.675 + + .. tab-item:: Commands + + :: + + /opt/rocm/bin/rocm-bandwidth-test -s 0 -d 4 + +Running a bidirectional benchmark on all available device combinations: + +.. tab-set:: + + .. tab-item:: Shell output + + .. 
code-block:: shell + + $ /opt/rocm/bin/rocm-bandwidth-test -A + + …… + Bidirectional copy peak bandwidth GB/s + + D/D 0 1 2 3 4 5 6 7 8 9 + + 0 N/A N/A 47.703 47.679 47.619 47.586 38.106 38.160 36.771 36.773 + + 1 N/A N/A 38.351 38.395 36.488 36.454 47.495 47.512 47.525 47.471 + + 2 47.703 38.351 N/A 101.458 80.902 81.300 81.387 79.279 101.526 101.106 + + 3 47.679 38.395 101.458 N/A 81.278 80.488 79.535 79.907 101.615 101.618 + + 4 47.619 36.488 80.902 81.278 N/A 101.643 101.089 101.693 81.336 79.232 + + 5 47.586 36.454 81.300 80.488 101.643 N/A 101.217 101.478 79.460 79.922 + + 6 38.106 47.495 81.387 79.535 101.089 101.217 N/A 101.506 80.497 81.302 + + 7 38.160 47.512 79.279 79.907 101.693 101.478 101.506 N/A 81.301 80.501 + + 8 36.771 47.525 101.526 101.615 81.336 79.460 80.497 81.301 N/A 100.908 + + 9 36.773 47.471 101.106 101.618 79.232 79.922 81.302 80.501 100.908 N/A + + .. tab-item:: Commands + + :: + + /opt/rocm/bin/rocm-bandwidth-test -A + +For a more detailed explanation of different ways to run ROCm Bandwidth Test, +see the +`ROCm Bandwidth Test user guide `_. + +Configuration scripts +===================== + +Run these scripts where indicated to aid in the configuration and setup of your devices. + +.. _disable-acs-script: + +.. dropdown:: Disable ACS script + + .. code-block:: shell + + #!/bin/bash + # + # Disable ACS on every device that supports it + # + PLATFORM=$(dmidecode --string system-product-name) + logger "PLATFORM=${PLATFORM}" + # Enforce platform check here. + #case "${PLATFORM}" in + #"OAM"*) + #logger "INFO: Disabling ACS is no longer necessary for ${PLATFORM}" + #exit 0 + #;; + #*) + #;; + #esac + # must be root to access extended PCI config space + if [ "$EUID" -ne 0 ]; then + echo "ERROR: $0 must be run as root" + exit 1 + fi + for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do + # skip if it doesn't support ACS + setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1 + if [ $? -ne 0 ]; then + #echo "${BDF} does not support ACS, skipping" + continue + fi + logger "Disabling ACS on `lspci -s ${BDF}`" + setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000 + if [ $? -ne 0 ]; then + logger "Error enabling directTrans ACS on ${BDF}" + continue + fi + NEW_VAL=`setpci -v -s ${BDF} ECAP_ACS+0x6.w | awk '{print $NF}'` + if [ "${NEW_VAL}" != "0000" ]; then + logger "Failed to enabling directTrans ACS on ${BDF}" + continue + fi + done + exit 0 + +.. _RoCE-configuration-script-for-Broadcom-Thor-NIC: + +.. dropdown:: RoCE configuration script for Broadcom Thor NIC + + .. 
code-block:: shell + + # Increase Max Read request Size to 4k + lspci -vvvs 41:00.0 | grep axReadReq + + # Check if Relaxed Ordering is enabled + + for i in $(sudo niccli listdev | grep Interface | awk {'print $5'}); \ do echo $i - $(sudo niccli -dev=$i getoption -name pcie_relaxed_ordering); done + + # Set Relaxed Ordering if not enabled + + for i in $(sudo niccli listdev | grep Interface | awk {'print $5'}); \ do echo $i - $(sudo niccli -dev=$i setoption -name pcie_relaxed_ordering -value 1); done + + # Check if RDMA support is enabled + + for i in $(sudo niccli listdev | grep Interface | awk {'print $5'}); \ do echo $i - $(sudo niccli -dev=$i getoption -name support_rdma -scope 0) - $(sudo niccli -dev=$i \ getoption=support_rdma:1); done + + # Set RMDA support if not enabled + + for i in $(sudo niccli listdev | grep Interface | awk {'print $5'}); \ do echo $i - $(sudo \ niccli -dev=$i setoption -name support_rdma -scope 0 -value 1) - $(sudo niccli -dev=$i \ setoption -name support_rdma -scope 1 -value 1); done + + # Set Speed Mask + + niccli -dev= setoption=autodetect_speed_exclude_mask:0#01C0 + + # Set 200Gbps + + ethtool -s autoneg off speed 200000 duplex full + + # Set performance profile to RoCE ==REQUIRES REBOOT IF OLDER FIRMWARE LOADED== + + for i in $(sudo niccli listdev | grep Interface | awk {'print $5'}); \ do echo $i - $(sudo \ niccli -dev=$i setoption -name performance_profile -value 1); done + diff --git a/docs/index.rst b/docs/index.rst index 9f96d47..8168aca 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,26 +1,50 @@ -.. meta:: - :description: How to perform network validation testing on optimized hardware - :keywords: network validation, DCGPU, PCIe, Infiniband, RoCE, ROCm, RCCL, machine learning, LLM, usage, tutorial - -************************************************************************ -Cluster Network Performance Validation for AMD Instinct™ GPU Accelerators -************************************************************************ - -When running AI/HPC applications in a cluster network environment, performance is only as fast as the slowest individual node. It is therefore imperative that you configure each server for maximum data transfer rate and bandwidth usage with respect to hardware, then validate host and device-based performance over your network in single and multi node capacities with the appropriate benchmarking tools. - -Refer to the appropriate networking guides for the necessary steps to validate your network configuration in single and multi node capacities. Each guide includes detailed instructions on system settings, device configuration, networking tools, and performance tests to help you verify your AMD Instinct™-powered GPU clusters are getting optimal speeds and bandwidths during operation. - -.. grid:: 2 - :gutter: 3 - - .. grid-item-card:: How to - - * :doc:`Single node network configuration for AMD Instinct GPUs` - * :doc:`Multi node network configuration for AMD Instinct GPUs` - - .. grid-item-card:: Reference - - * :doc:`reference/hardware-support` - -.. note:: - AMD Instinct™ systems come in many shapes and sizes, and cluster design adds yet another dimension of configuration to consider. These instructions are written at a sufficiently high level to ensure they can be followed for as many environments as possible. 
While some scenarios do provide specific examples of hardware, know that your configuration is likely to differ from what is being demonstrated in terms of GPUs and CPUs per server, firmware versions, and network interconnect fabric.
+.. meta::
+   :description: How to perform network validation testing on optimized hardware
+   :keywords: network validation, DCGPU, PCIe, Infiniband, RoCE, ROCm, RCCL, machine learning, LLM, usage, tutorial
+
+********************************************************************
+Cluster network performance validation for AMD Instinct accelerators
+********************************************************************
+
+When running HPC and AI applications in a cluster network environment,
+performance is only as fast as the slowest individual node in the network. To
+achieve optimal performance, each server must be configured for maximum data
+transfer rates and bandwidth utilization based on the available hardware. It is
+crucial to validate both host and device performance in single-node and
+multi-node environments using the appropriate benchmarking tools.
+
+Refer to the relevant networking guides for step-by-step instructions on
+validating network configurations in single-node and multi-node environments.
+These guides cover system settings, device configurations, networking tools, and
+performance tests to ensure AMD Instinct™-powered GPU-enabled clusters achieve
+optimal speed and bandwidth during operation.
+
+.. grid:: 2
+   :gutter: 3
+
+   .. grid-item-card:: How to
+
+      * :doc:`Single-node network configuration `
+      * :doc:`Multi-node network configuration `
+
+   .. grid-item-card:: Reference
+
+      * :doc:`reference/hardware-support`
+
+.. note::
+
+   AMD Instinct systems vary in form and configuration, with cluster design
+   introducing additional layers of complexity. The guidance in this
+   documentation is written at a high level for broad applicability across
+   diverse environments. While certain scenarios may include specific hardware
+   examples, your setup will likely differ in terms of GPUs and CPUs per server,
+   firmware versions, and network interconnects. Adjustments might be necessary
+   to align with your particular configuration.
diff --git a/docs/reference/hardware-support.rst b/docs/reference/hardware-support.rst index d9ff348..c51b210 100644 --- a/docs/reference/hardware-support.rst +++ b/docs/reference/hardware-support.rst @@ -1,42 +1,61 @@ -************************ -Hardware Support Matrix -************************ - -The processes detailed in these guides are validated to run on the following hardware in tandem with AMD Instinct™ GPUs: - -Network Cards for AMD Instinct™ MI300X -============================ - -+--------------------------+---------+---------------------+ -| Product Name | Speed | Interconnect | -+==========================+=========+=====================+ -| Broadcom P2200G | 400Gb/s | RoCEv2 | -+--------------------------+---------+---------------------+ -| Broadcom P1400GD | 400Gb/s | RoCEv2 | -+--------------------------+---------+---------------------+ -| Broadcom N1400GD | 400Gb/s | RoCEv2 | -+--------------------------+---------+---------------------+ -| Broadcom N2200G | 400Gb/s | RoCEv2 | -+--------------------------+---------+---------------------+ -| Nvidia ConnectX-7 series | 400Gb/s | RoCEv2 / InfiniBand | -+--------------------------+---------+---------------------+ - -Network Cards for AMD Instinct™ MI100X and MI200X -======================================= - -+--------------------------+---------+---------------------+ -| Product Name | Speed | Interconnect | -+==========================+=========+=====================+ -| Broadcom N2100G | 200Gb/s | RoCEv2 | -+--------------------------+---------+---------------------+ -| Broadcom N1200G | 200Gb/s | RoCEv2 | -+--------------------------+---------+---------------------+ -| Broadcom P2100G | 200Gb/s | RoCEv2 | -+--------------------------+---------+---------------------+ -| Broadcom P1200G | 200Gb/s | RoCEv2 | -+--------------------------+---------+---------------------+ -| Nvidia ConnectX-6 series | 200Gb/s | RoCEv2 / InfiniBand | -+--------------------------+---------+---------------------+ - - -When deploying ROCm, refer to the `ROCm Compatibility Matrix `_ and install the latest version with regard to OS and driver support. +.. meta:: + :description: AMD Instinct accelerator compatibility with network cards. + :keywords: network validation, DCGPU, PCIe, Infiniband, RoCE, card, + compatibility + +*********************** +Hardware support matrix +*********************** + +When deploying ROCm, compatibility between the accelerators and NICs is +critical for ensuring optimized data transfer in high-performance computing +environments. The following NICs have been validated for use with AMD Instinct +MI300X, MI200, and MI100 accelerators, supporting high-speed interconnects like +RoCE v2 (RDMA over Converged Ethernet) and InfiniBand for low-latency, +high-throughput communication. 
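
If you are not sure which NIC model a node contains, ``lspci`` reports the vendor and device names so you can check them against the tables below. This is a generic illustration rather than part of the validation flow; device names vary by platform.

.. code-block:: shell

   # List Ethernet and InfiniBand controllers so the model can be matched
   # against the supported NICs listed below.
   lspci | grep -i -e ethernet -e infiniband
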
+
+The processes detailed in these guides are validated to run on the following
+hardware with AMD Instinct™ accelerators:
+
+NICs for AMD Instinct MI300X
+============================
+
++--------------------------+--------------+----------------------+
+| Product name             | Speed (Gb/s) | Interconnect         |
++==========================+==============+======================+
+| Broadcom P2200G          | 400          | RoCE v2              |
++--------------------------+--------------+----------------------+
+| Broadcom P1400GD         | 400          | RoCE v2              |
++--------------------------+--------------+----------------------+
+| Broadcom N1400GD         | 400          | RoCE v2              |
++--------------------------+--------------+----------------------+
+| Broadcom N2200G          | 400          | RoCE v2              |
++--------------------------+--------------+----------------------+
+| NVIDIA ConnectX-7 series | 400          | RoCE v2 / InfiniBand |
++--------------------------+--------------+----------------------+
+
+NICs for AMD Instinct MI200 and MI100 series
+============================================
+
++--------------------------+--------------+----------------------+
+| Product name             | Speed (Gb/s) | Interconnect         |
++==========================+==============+======================+
+| Broadcom N2100G          | 200          | RoCE v2              |
++--------------------------+--------------+----------------------+
+| Broadcom N1200G          | 200          | RoCE v2              |
++--------------------------+--------------+----------------------+
+| Broadcom P2100G          | 200          | RoCE v2              |
++--------------------------+--------------+----------------------+
+| Broadcom P1200G          | 200          | RoCE v2              |
++--------------------------+--------------+----------------------+
+| NVIDIA ConnectX-6 series | 200          | RoCE v2 / InfiniBand |
++--------------------------+--------------+----------------------+
+
+When deploying ROCm, consult the
+:doc:`ROCm compatibility matrix ` to
+ensure compatibility, and install the latest version appropriate for your
+operating system and driver support.
+
+Refer to the
+`Broadcom Ethernet Network Adapter User Guide `_
+for installation, configuration, and tuning documentation for Broadcom devices.
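To spot-check that one of the NICs listed above is installed and linked at its
rated speed before running any validation tests, the standard ``lspci`` and
``ibstat`` utilities can be used. The following is a minimal sketch, assuming
the ``pciutils`` and ``infiniband-diags`` packages (or your vendor's equivalent
tools) are installed; device names and output fields will vary by system.

.. code-block:: shell

   # Confirm the expected Broadcom or NVIDIA adapter is visible on the PCIe bus
   lspci | grep -iE 'ethernet|infiniband'

   # Report the state, link rate, and link layer of each RDMA-capable port
   ibstat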
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 9f4fe4e..2c1b347 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -1,14 +1,14 @@ -defaults: - numbered: False - maxdepth: 6 -root: index -subtrees: -- caption: How To - entries: - - file: how-to/single-node-config - title: Single Node Networking - - file: how-to/multi-node-config - title: Multi Node Networking -- caption: Reference - entries: - - file: reference/hardware-support \ No newline at end of file +defaults: + numbered: False + maxdepth: 6 +root: index +subtrees: +- caption: How to + entries: + - file: how-to/single-node-config + title: Single-node networking + - file: how-to/multi-node-config + title: Multi-node networking +- caption: Reference + entries: + - file: reference/hardware-support diff --git a/docs/sphinx/requirements.in b/docs/sphinx/requirements.in index f74228b..1cf51d3 100644 --- a/docs/sphinx/requirements.in +++ b/docs/sphinx/requirements.in @@ -1 +1 @@ -rocm-docs-core==1.6.1 +rocm-docs-core==1.8.2 diff --git a/docs/sphinx/requirements.txt b/docs/sphinx/requirements.txt index c6da358..5c013ff 100644 --- a/docs/sphinx/requirements.txt +++ b/docs/sphinx/requirements.txt @@ -1,147 +1,147 @@ -# -# This file is autogenerated by pip-compile with Python 3.10 -# by the following command: -# -# pip-compile requirements.in -# -accessible-pygments==0.0.5 - # via pydata-sphinx-theme -alabaster==0.7.16 - # via sphinx -babel==2.15.0 - # via - # pydata-sphinx-theme - # sphinx -beautifulsoup4==4.12.3 - # via pydata-sphinx-theme -breathe==4.35.0 - # via rocm-docs-core -certifi==2024.7.4 - # via requests -cffi==1.16.0 - # via - # cryptography - # pynacl -charset-normalizer==3.3.2 - # via requests -click==8.1.7 - # via sphinx-external-toc -cryptography==43.0.0 - # via pyjwt -deprecated==1.2.14 - # via pygithub -docutils==0.21.2 - # via - # breathe - # myst-parser - # pydata-sphinx-theme - # sphinx -fastjsonschema==2.20.0 - # via rocm-docs-core -gitdb==4.0.11 - # via gitpython -gitpython==3.1.43 - # via rocm-docs-core -idna==3.7 - # via requests -imagesize==1.4.1 - # via sphinx -jinja2==3.1.4 - # via - # myst-parser - # sphinx -markdown-it-py==3.0.0 - # via - # mdit-py-plugins - # myst-parser -markupsafe==2.1.5 - # via jinja2 -mdit-py-plugins==0.4.1 - # via myst-parser -mdurl==0.1.2 - # via markdown-it-py -myst-parser==3.0.1 - # via rocm-docs-core -packaging==24.1 - # via - # pydata-sphinx-theme - # sphinx -pycparser==2.22 - # via cffi -pydata-sphinx-theme==0.15.4 - # via - # rocm-docs-core - # sphinx-book-theme -pygithub==2.3.0 - # via rocm-docs-core -pygments==2.18.0 - # via - # accessible-pygments - # pydata-sphinx-theme - # sphinx -pyjwt[crypto]==2.8.0 - # via pygithub -pynacl==1.5.0 - # via pygithub -pyyaml==6.0.1 - # via - # myst-parser - # rocm-docs-core - # sphinx-external-toc -requests==2.32.3 - # via - # pygithub - # sphinx -rocm-docs-core==1.6.1 - # via -r requirements.in -smmap==5.0.1 - # via gitdb -snowballstemmer==2.2.0 - # via sphinx -soupsieve==2.5 - # via beautifulsoup4 -sphinx==7.4.7 - # via - # breathe - # myst-parser - # pydata-sphinx-theme - # rocm-docs-core - # sphinx-book-theme - # sphinx-copybutton - # sphinx-design - # sphinx-external-toc - # sphinx-notfound-page -sphinx-book-theme==1.1.3 - # via rocm-docs-core -sphinx-copybutton==0.5.2 - # via rocm-docs-core -sphinx-design==0.6.0 - # via rocm-docs-core -sphinx-external-toc==1.0.1 - # via rocm-docs-core -sphinx-notfound-page==1.0.2 - # via rocm-docs-core -sphinxcontrib-applehelp==1.0.8 - # via sphinx 
-sphinxcontrib-devhelp==1.0.6 - # via sphinx -sphinxcontrib-htmlhelp==2.0.6 - # via sphinx -sphinxcontrib-jsmath==1.0.1 - # via sphinx -sphinxcontrib-qthelp==1.0.8 - # via sphinx -sphinxcontrib-serializinghtml==1.1.10 - # via sphinx -tomli==2.0.1 - # via sphinx -typing-extensions==4.12.2 - # via - # pydata-sphinx-theme - # pygithub -urllib3==2.2.2 - # via - # pygithub - # requests -wrapt==1.16.0 - # via deprecated +# +# This file is autogenerated by pip-compile with Python 3.10 +# by the following command: +# +# pip-compile requirements.in +# +accessible-pygments==0.0.5 + # via pydata-sphinx-theme +alabaster==1.0.0 + # via sphinx +babel==2.16.0 + # via + # pydata-sphinx-theme + # sphinx +beautifulsoup4==4.12.3 + # via pydata-sphinx-theme +breathe==4.35.0 + # via rocm-docs-core +certifi==2024.8.30 + # via requests +cffi==1.17.1 + # via + # cryptography + # pynacl +charset-normalizer==3.3.2 + # via requests +click==8.1.7 + # via sphinx-external-toc +cryptography==43.0.1 + # via pyjwt +deprecated==1.2.14 + # via pygithub +docutils==0.21.2 + # via + # breathe + # myst-parser + # pydata-sphinx-theme + # sphinx +fastjsonschema==2.20.0 + # via rocm-docs-core +gitdb==4.0.11 + # via gitpython +gitpython==3.1.43 + # via rocm-docs-core +idna==3.10 + # via requests +imagesize==1.4.1 + # via sphinx +jinja2==3.1.4 + # via + # myst-parser + # sphinx +markdown-it-py==3.0.0 + # via + # mdit-py-plugins + # myst-parser +markupsafe==2.1.5 + # via jinja2 +mdit-py-plugins==0.4.2 + # via myst-parser +mdurl==0.1.2 + # via markdown-it-py +myst-parser==4.0.0 + # via rocm-docs-core +packaging==24.1 + # via + # pydata-sphinx-theme + # sphinx +pycparser==2.22 + # via cffi +pydata-sphinx-theme==0.15.4 + # via + # rocm-docs-core + # sphinx-book-theme +pygithub==2.4.0 + # via rocm-docs-core +pygments==2.18.0 + # via + # accessible-pygments + # pydata-sphinx-theme + # sphinx +pyjwt[crypto]==2.9.0 + # via pygithub +pynacl==1.5.0 + # via pygithub +pyyaml==6.0.2 + # via + # myst-parser + # rocm-docs-core + # sphinx-external-toc +requests==2.32.3 + # via + # pygithub + # sphinx +rocm-docs-core==1.8.2 + # via -r requirements.in +smmap==5.0.1 + # via gitdb +snowballstemmer==2.2.0 + # via sphinx +soupsieve==2.6 + # via beautifulsoup4 +sphinx==8.0.2 + # via + # breathe + # myst-parser + # pydata-sphinx-theme + # rocm-docs-core + # sphinx-book-theme + # sphinx-copybutton + # sphinx-design + # sphinx-external-toc + # sphinx-notfound-page +sphinx-book-theme==1.1.3 + # via rocm-docs-core +sphinx-copybutton==0.5.2 + # via rocm-docs-core +sphinx-design==0.6.1 + # via rocm-docs-core +sphinx-external-toc==1.0.1 + # via rocm-docs-core +sphinx-notfound-page==1.0.4 + # via rocm-docs-core +sphinxcontrib-applehelp==2.0.0 + # via sphinx +sphinxcontrib-devhelp==2.0.0 + # via sphinx +sphinxcontrib-htmlhelp==2.1.0 + # via sphinx +sphinxcontrib-jsmath==1.0.1 + # via sphinx +sphinxcontrib-qthelp==2.0.0 + # via sphinx +sphinxcontrib-serializinghtml==2.0.0 + # via sphinx +tomli==2.0.1 + # via sphinx +typing-extensions==4.12.2 + # via + # pydata-sphinx-theme + # pygithub +urllib3==2.2.3 + # via + # pygithub + # requests +wrapt==1.16.0 + # via deprecated From bf975f1d0a05d52717213fe0d332c11e05c85b4c Mon Sep 17 00:00:00 2001 From: Peter Park Date: Tue, 1 Oct 2024 16:25:03 -0400 Subject: [PATCH 2/5] fix links (#12) --- docs/how-to/multi-node-config.rst | 4 ++-- docs/how-to/single-node-config.rst | 4 ++-- docs/index.rst | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/how-to/multi-node-config.rst 
b/docs/how-to/multi-node-config.rst index bcafc44..b7c6eec 100644 --- a/docs/how-to/multi-node-config.rst +++ b/docs/how-to/multi-node-config.rst @@ -75,7 +75,7 @@ the switch from benchmark results. Remember to install OFED perftests on both nodes you plan to use in this section. Commands may require ``sudo`` depending on user privileges. -#. From the CLI of your host, clone the +#. From the CLI of your host, clone the perftest repo. .. code-block:: shell @@ -410,7 +410,7 @@ script provided in the drop-down (the script also includes an option to install MPICH if needed). Otherwise, you can follow the steps to manually install at ``__. -.. dropdown:: `build-and-run_rccl-tests_sweep_multinode.sh` +.. dropdown:: ``build-and-run_rccl-tests_sweep_multinode.sh`` .. code-block:: shell :linenos: diff --git a/docs/how-to/single-node-config.rst b/docs/how-to/single-node-config.rst index 5a660a5..8b95311 100644 --- a/docs/how-to/single-node-config.rst +++ b/docs/how-to/single-node-config.rst @@ -312,7 +312,7 @@ ROCm Bandwidth Test. #. Connect to the CLI of your GPU node. #. Install ROCm Validation Suite following the directions at - :doc:`ROCmValidationSuite:install/installation` + :doc:`rocmvalidationsuite:install/installation`. * Once installed, RVS is located in ``/opt/rocm/``. @@ -560,7 +560,7 @@ If you want to define your own configuration file, run file with information on commands and arguments to run more granular testing. Running DMA tests between single pairs of devices is one helpful and common use case for custom configuration files. See the -`TransferBench documentation ` for more information. +:doc:`TransferBench documentation ` for more information. Run ROCm Bandwidth Test (RBT) ----------------------------- diff --git a/docs/index.rst b/docs/index.rst index feaf9a0..38299ff 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -16,7 +16,7 @@ multi-node environments using the appropriate benchmarking tools. Refer to the relevant networking guides for step-by-step instructions on validating network configurations in single-node and multi-node environments. These guides cover system settings, device configurations, networking tools, and -performance tests to ensure AMD Instinct™-powered GPU-enabled clusters achieve +performance tests to ensure AMD Instinct™-powered GPU clusters achieve optimal speed and bandwidth during operation. .. grid:: 2 From 970df94e0bfc0f3116a251b547c4ae0a1bc5d835 Mon Sep 17 00:00:00 2001 From: Peter Park Date: Tue, 1 Oct 2024 16:25:03 -0400 Subject: [PATCH 3/5] fix links (#12) --- docs/how-to/multi-node-config.rst | 4 ++-- docs/how-to/single-node-config.rst | 4 ++-- docs/index.rst | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/how-to/multi-node-config.rst b/docs/how-to/multi-node-config.rst index bcafc44..b7c6eec 100644 --- a/docs/how-to/multi-node-config.rst +++ b/docs/how-to/multi-node-config.rst @@ -75,7 +75,7 @@ the switch from benchmark results. Remember to install OFED perftests on both nodes you plan to use in this section. Commands may require ``sudo`` depending on user privileges. -#. From the CLI of your host, clone the +#. From the CLI of your host, clone the perftest repo. .. code-block:: shell @@ -410,7 +410,7 @@ script provided in the drop-down (the script also includes an option to install MPICH if needed). Otherwise, you can follow the steps to manually install at ``__. -.. dropdown:: `build-and-run_rccl-tests_sweep_multinode.sh` +.. dropdown:: ``build-and-run_rccl-tests_sweep_multinode.sh`` .. 
code-block:: shell :linenos: diff --git a/docs/how-to/single-node-config.rst b/docs/how-to/single-node-config.rst index 5a660a5..8b95311 100644 --- a/docs/how-to/single-node-config.rst +++ b/docs/how-to/single-node-config.rst @@ -312,7 +312,7 @@ ROCm Bandwidth Test. #. Connect to the CLI of your GPU node. #. Install ROCm Validation Suite following the directions at - :doc:`ROCmValidationSuite:install/installation` + :doc:`rocmvalidationsuite:install/installation`. * Once installed, RVS is located in ``/opt/rocm/``. @@ -560,7 +560,7 @@ If you want to define your own configuration file, run file with information on commands and arguments to run more granular testing. Running DMA tests between single pairs of devices is one helpful and common use case for custom configuration files. See the -`TransferBench documentation ` for more information. +:doc:`TransferBench documentation ` for more information. Run ROCm Bandwidth Test (RBT) ----------------------------- diff --git a/docs/index.rst b/docs/index.rst index cd2dee5..8a615ed 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -16,7 +16,7 @@ multi-node environments using the appropriate benchmarking tools. Refer to the relevant networking guides for step-by-step instructions on validating network configurations in single-node and multi-node environments. These guides cover system settings, device configurations, networking tools, and -performance tests to ensure AMD Instinct™-powered GPU-enabled clusters achieve +performance tests to ensure AMD Instinct™-powered GPU clusters achieve optimal speed and bandwidth during operation. .. grid:: 2 From 44730946d3e87f74070165ba56fd49f47043bd7d Mon Sep 17 00:00:00 2001 From: Michael Benavidez Date: Wed, 2 Oct 2024 13:04:57 -0500 Subject: [PATCH 4/5] Update multi-node-config.rst Remove duplicate paragraph. --- docs/how-to/multi-node-config.rst | 5 ----- 1 file changed, 5 deletions(-) diff --git a/docs/how-to/multi-node-config.rst b/docs/how-to/multi-node-config.rst index b7c6eec..f5c919e 100644 --- a/docs/how-to/multi-node-config.rst +++ b/docs/how-to/multi-node-config.rst @@ -114,11 +114,6 @@ Once installed, there are six main modules available with OFED perftests: * ``ib_send_lat`` - Test latency with send transactions. -The examples in this section use ``ib_send_bw``, but you can accomplish similar -with any other test you require. The goal of the tests in this section is to -verify high speed Host to Host (H2H) data transfer rates between nodes before -including GPU traffic, therefore the ``use_rocm`` flag is avoided in all commands. - The examples in this section use the ``ib_send_bw`` tool, but you can achieve similar results with other benchmarking tools, depending on your requirements. The primary objective of these tests is to verify high-speed Host-to-Host (H2H) From 57adf583e47c233fcf85e8e835155c39aa211e88 Mon Sep 17 00:00:00 2001 From: Michael Benavidez Date: Wed, 2 Oct 2024 13:26:14 -0500 Subject: [PATCH 5/5] Use full spelling of repository --- docs/how-to/multi-node-config.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/how-to/multi-node-config.rst b/docs/how-to/multi-node-config.rst index f5c919e..d73c53a 100644 --- a/docs/how-to/multi-node-config.rst +++ b/docs/how-to/multi-node-config.rst @@ -75,7 +75,7 @@ the switch from benchmark results. Remember to install OFED perftests on both nodes you plan to use in this section. Commands may require ``sudo`` depending on user privileges. -#. From the CLI of your host, clone the perftest repo. +#. 
From the CLI of your host, clone the perftest repository. .. code-block:: shell
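   # Illustrative sketch only (assumption): the guide's exact clone target is
   # not shown here, so the upstream linux-rdma perftest repository is used as
   # an example. Substitute the repository or release your environment requires.
   git clone https://github.com/linux-rdma/perftest.git
   cd perftest
   ./autogen.sh && ./configure && make -j"$(nproc)"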