28.rccl.sh fails to build for navi10 #44

Open
TyraVex opened this issue Jun 26, 2023 · 2 comments

TyraVex commented Jun 26, 2023

Environment

| Hardware | description |
| -------- | ----------- |
| GPU | RX 5700 |
| CPU | Ryzen 5 3600 |

| Software | version |
| -------- | ------- |
| OS | Ubuntu 20.04.6 LTS |
| ROCm | 5.4.x |
| Python | 3.8.10 |

What is the expected behavior

28.rccl.sh builds rccl for navi10 (gfx1010) successfully.

What actually happens


|====|
|SLOW|
|====|
/home/tyra/rocm/rocm-build/build/rccl /home/tyra/rocm/rocm-build/build/rccl
-- Could NOT find GTest (missing: GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY) (Required is at least version "1.11")
-- hip::amdhip64 is SHARED_LIBRARY
-- HIP compiler: clang
-- HIP runtime: rocclr
-- Found rocm_smi at /opt/rocm/include
RPM version 4.14.2.1
-- rocm-cmake: Set license file to /home/tyra/rocm/ROCm/rccl/LICENSE.txt.
-- Configuring done
-- Generating done
-- Build files have been written to: /home/tyra/rocm/rocm-build/build/rccl
[1/4] Updating git_version.cpp if necessary
-- Updating git_version.cpp
[2/4] Building CXX object CMakeFiles/rccl.dir/git_version.cpp.o
Warning: The --hipcc-func-supp option has been deprecated and will be removed in the future.
[3/4] Linking CXX shared library librccl.so.1.0.50400
FAILED: librccl.so.1.0.50400 
: && /opt/rocm/bin/hipcc -fPIC -O3 -DNDEBUG   -shared -Wl,-soname,librccl.so.1 -o librccl.so.1.0.50400 CMakeFiles/rccl.dir/src/collectives/device/all_reduce.cpp.o CMakeFiles/rccl.dir/src/collectives/device/all_gather.cpp.o CMakeFiles/rccl.dir/src/collectives/device/alltoall_pivot.cpp.o CMakeFiles/rccl.dir/src/collectives/device/reduce.cpp.o CMakeFiles/rccl.dir/src/collectives/device/broadcast.cpp.o CMakeFiles/rccl.dir/src/collectives/device/reduce_scatter.cpp.o CMakeFiles/rccl.dir/src/collectives/device/sendrecv.cpp.o CMakeFiles/rccl.dir/src/collectives/device/onerank_reduce.cpp.o CMakeFiles/rccl.dir/src/collectives/device/functions.cpp.o CMakeFiles/rccl.dir/src/init.cc.o CMakeFiles/rccl.dir/src/graph/trees.cc.o CMakeFiles/rccl.dir/src/graph/rings.cc.o CMakeFiles/rccl.dir/src/graph/paths.cc.o CMakeFiles/rccl.dir/src/graph/search.cc.o CMakeFiles/rccl.dir/src/graph/connect.cc.o CMakeFiles/rccl.dir/src/graph/tuning.cc.o CMakeFiles/rccl.dir/src/graph/topo.cc.o CMakeFiles/rccl.dir/src/graph/xml.cc.o CMakeFiles/rccl.dir/src/graph/rome_models.cc.o CMakeFiles/rccl.dir/src/collectives/all_reduce_api.cc.o CMakeFiles/rccl.dir/src/collectives/all_gather_api.cc.o CMakeFiles/rccl.dir/src/collectives/reduce_api.cc.o CMakeFiles/rccl.dir/src/collectives/broadcast_api.cc.o CMakeFiles/rccl.dir/src/collectives/reduce_scatter_api.cc.o CMakeFiles/rccl.dir/src/collectives/sendrecv_api.cc.o CMakeFiles/rccl.dir/src/collectives/gather_api.cc.o CMakeFiles/rccl.dir/src/collectives/scatter_api.cc.o CMakeFiles/rccl.dir/src/collectives/all_to_all_api.cc.o CMakeFiles/rccl.dir/src/collectives/all_to_allv_api.cc.o CMakeFiles/rccl.dir/src/channel.cc.o CMakeFiles/rccl.dir/src/misc/argcheck.cc.o CMakeFiles/rccl.dir/src/misc/nvmlwrap_stub.cc.o CMakeFiles/rccl.dir/src/misc/utils.cc.o CMakeFiles/rccl.dir/src/misc/ibvwrap.cc.o CMakeFiles/rccl.dir/src/misc/rocm_smi_wrap.cc.o CMakeFiles/rccl.dir/src/misc/profiler.cc.o CMakeFiles/rccl.dir/src/misc/npkit.cc.o CMakeFiles/rccl.dir/src/misc/shmutils.cc.o 
CMakeFiles/rccl.dir/src/misc/signals.cc.o CMakeFiles/rccl.dir/src/misc/socket.cc.o CMakeFiles/rccl.dir/src/misc/param.cc.o CMakeFiles/rccl.dir/src/misc/rocmwrap.cc.o CMakeFiles/rccl.dir/src/misc/strongstream.cc.o CMakeFiles/rccl.dir/src/transport/coll_net.cc.o CMakeFiles/rccl.dir/src/transport/net.cc.o CMakeFiles/rccl.dir/src/transport/net_ib.cc.o CMakeFiles/rccl.dir/src/transport/net_socket.cc.o CMakeFiles/rccl.dir/src/transport/p2p.cc.o CMakeFiles/rccl.dir/src/transport/shm.cc.o CMakeFiles/rccl.dir/src/transport.cc.o CMakeFiles/rccl.dir/src/debug.cc.o CMakeFiles/rccl.dir/src/group.cc.o CMakeFiles/rccl.dir/src/bootstrap.cc.o CMakeFiles/rccl.dir/src/proxy.cc.o CMakeFiles/rccl.dir/src/net.cc.o CMakeFiles/rccl.dir/src/enqueue.cc.o CMakeFiles/rccl.dir/git_version.cpp.o  --amdgpu-target=gfx1010  -fgpu-rdc  -parallel-jobs=8  -ldl  -lrocm_smi64  -L/opt/rocm/lib  /opt/rocm/lib/libamdhip64.so.5.4.50100  --hip-link  --offload-arch=gfx1010  /opt/rocm/llvm/lib/clang/15.0.0/lib/linux/libclang_rt.builtins-x86_64.a && :
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: ld-temp.o <inline asm>:1:2: instruction not supported on this GPU
        buffer_wbinvl1_vol
        ^


lld: error: too many errors emitted, stopping now (use --error-limit=0 to see all errors)
clang-15: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.

How to reproduce

====== CONFIG ======

export ROCM_INSTALL_DIR=/opt/rocm
export ROCM_MAJOR_VERSION=5
export ROCM_MINOR_VERSION=4
export ROCM_PATCH_VERSION=0
export ROCM_LIBPATCH_VERSION=50400
export CPACK_DEBIAN_PACKAGE_RELEASE=72~20.04
export ROCM_PKGTYPE=DEB
export ROCM_GIT_DIR=/home/tyra/rocm/ROCm
export ROCM_BUILD_DIR=/home/tyra/rocm/rocm-build/build
export ROCM_PATCH_DIR=/home/tyra/rocm/rocm-build/patch
export AMDGPU_TARGETS="gfx1010"
export CMAKE_DIR=/home/tyra/rocm/cmake
export PATH=$ROCM_INSTALL_DIR/bin:$ROCM_INSTALL_DIR/llvm/bin:$ROCM_INSTALL_DIR/hip/bin:$CMAKE_DIR/bin:$PATH

====================

Build script I use

#!/bin/bash

if [ "$EUID" -ne 0 ]; then sudo bash "$0" "$@"; exit; fi
[ "$1" = clean ] && sudo rm -rf rocm-build/ venv/ ROCm/ cmake/ repo && exit

for package in "git" "git-lfs" "python3" "python3-venv" "python-is-python3" "wget"; do
  if ! dpkg -s "$package" &> /dev/null; then
    echo "Installing $package..."
    apt install -y "$package"
  fi
done

[ -d rocm-build ] || git clone "https://github.com/xuhuisheng/rocm-build"
[ -d venv ] || python3 -m venv venv --system-site-packages
# Group the download and chmod so both are skipped when repo is already executable
[ -x repo ] || { wget --progress=bar:force "https://storage.googleapis.com/git-repo-downloads/repo" && chmod +x repo; }

if [ ! -d cmake ]; then
  wget "https://cmake.org/files/v3.18/cmake-3.18.6-Linux-x86_64.tar.gz"
  tar -xf cmake-3.18.6-Linux-x86_64.tar.gz
  mv cmake-3.18.6-Linux-x86_64 cmake
  rm -rf cmake-3.18.6-Linux-x86_64.tar.gz
fi

if [ ! -d ROCm ]; then
  mkdir -p ROCm
  cd ROCm
  git config --global user.email "[email protected]"
  git config --global user.name "$USER"
  git config --global color.ui false
  ../repo init -u "https://github.com/RadeonOpenCompute/ROCm.git" -b roc-5.4.x
  ../repo sync
  cd ..
fi

cd rocm-build
# Point env.sh at the local cmake and checkout paths and switch the target from gfx803 to gfx1010
config=$(sed -e "s:/home/work/local/cmake-3.18.6-Linux-x86_64:$(readlink -f ../cmake):g" -e "s:/home/work:$(readlink -f ..):g" -e "s:gfx803:gfx1010:g" env.sh)
echo -e "\n====== CONFIG ======\n\n\e[34m$(tail -n+3 <<< "$config")\e[0m\n\n===================="
echo "$config" > .config; source .config
source ../venv/bin/activate
progress_file=".progress"

if [ -f "$progress_file" ]; then
  if [ "$1" = "startover" ]; then
    rm "$progress_file"
    checkpoint_index=0
  else
    checkpoint_index=$(<"$progress_file")
  fi
else
  checkpoint_index=0
fi

# Match a literal ".sh" suffix; an unquoted "grep .sh" matches any character followed by "sh"
readarray -t scripts < <(ls -1 | sort -n | grep '\.sh$' | tail -n+3)
scripts=(00.rocm-core.sh "${scripts[@]}")

for i in "${!scripts[@]}"; do
  [ $i -lt $checkpoint_index ] && continue
  line="${scripts[$i]}"
  script_name="${line##*/}"
  navi_script="navi10/$script_name"
  [ -f "$navi_script" ] && scripts[$i]="$navi_script"
done

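for i in "${!scripts[@]}"; do
  [ "$i" -lt "$checkpoint_index" ] && continue
  # Keep the navi10/ prefix chosen above so the navi10 variant is the one that runs
  line="${scripts[$i]}"
  read -p $'\e[32m\n\n'"Press ENTER to execute $line"$'\e[0m '
  echo; echo
  # Run in a subshell so the cd does not leak and $progress_file stays relative to rocm-build/
  (cd "$(dirname "$line")" && bash "$(basename "$line")")
  echo "$i" > "$progress_file"
done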

echo -e "\n\n\e[34m====== Finished ======\e[0m\n\n"

Any ideas which flags I could use or modify?
I couldn't find any relevant Google results for these errors.
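
The linker reports only `<inline asm>` and no file name, so one starting point is to search the rccl sources for the rejected instruction. This is a minimal sketch, assuming the checkout lives under the ROCM_GIT_DIR shown in the config above. The helper name `find_inline_asm` is made up for illustration and is not part of rocm-build or rccl.

```shell
#!/bin/bash
# List every source line under a directory that mentions a given instruction.
# find_inline_asm is a hypothetical helper name, not part of rocm-build or rccl.
find_inline_asm() {
  local instruction="$1" source_dir="$2"
  # -r: recurse, -n: show line numbers, -F: treat the instruction as a literal string
  grep -rnF -- "$instruction" "$source_dir"
}

# Example (the path is an assumption based on ROCM_GIT_DIR above):
#   find_inline_asm buffer_wbinvl1_vol /home/tyra/rocm/ROCm/rccl/src
```

Each hit is printed as `file:line:text`, which shows where the instruction is emitted and whether it sits behind an architecture check.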

@serhii-nakon

Hello, I have exactly the same issue: #45

@serhii-nakon

For gfx1012, I fixed this issue with the attached patch. Try replacing gfx1012 with gfx1010 in it, then apply it inside the ROCm/rccl directory.

rccl_patch.zip
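
Retargeting the patch as described above might look like the sketch below. The contents of the archive are not shown in this thread, so the patch file name `rccl.patch` and the helper name `retarget_patch` are assumptions for illustration only.

```shell
#!/bin/bash
# Print a copy of a patch with every occurrence of one GPU target replaced by another.
# retarget_patch is a hypothetical helper name; it does not modify the input file.
retarget_patch() {
  local from="$1" to="$2" patch_file="$3"
  sed "s/${from}/${to}/g" "$patch_file"
}

# Example (rccl.patch is an assumed name for the file inside rccl_patch.zip):
#   unzip rccl_patch.zip
#   retarget_patch gfx1012 gfx1010 rccl.patch > rccl_gfx1010.patch
#   cd ROCm/rccl
#   git apply --check ../../rccl_gfx1010.patch   # dry run, reports problems without changing files
#   git apply ../../rccl_gfx1010.patch
```

Running `git apply --check` first is a cheap way to see whether the retargeted patch still matches the roc-5.4.x sources before anything is modified.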
