Drop Centos7 support #2010

Merged (9 commits) on May 14, 2024
37 changes: 23 additions & 14 deletions build/build-in-docker
@@ -1,7 +1,7 @@
#!/bin/bash

#
-# Copyright (c) 2022-2023, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -24,30 +24,39 @@ set -e
SCRIPTDIR=$(cd $(dirname $0); pwd)

LOCAL_MAVEN_REPO=${LOCAL_MAVEN_REPO:-"$HOME/.m2/repository"}
CUDF_USE_PER_THREAD_DEFAULT_STREAM=${CUDF_USE_PER_THREAD_DEFAULT_STREAM:-ON}
USE_GDS=${USE_GDS:-ON}
export CMAKE_GENERATOR=${CMAKE_GENERATOR:-"Ninja"}
export DOCKER_BUILD_EXTRA_ARGS="--platform=linux/amd64 --build-arg CMAKE_ARCH=x86_64"
CUDA_VER=${CUDA_VER:-cuda11}
USE_SANITIZER=${USE_SANITIZER:-ON}
BUILD_FAULTINJ=${BUILD_FAULTINJ:-ON}

if (( $# == 0 )); then
echo "Usage: $0 <Maven build arguments>"
exit 1
fi

_CUDF_CLEAN_SKIP=""
# If ccache is enabled and the user did not provide libcudf.clean.skip,
# remove the cpp build directory (i.e., do not skip the clean)
#
if [[ "$CCACHE_DISABLE" != "1" ]]; then
    if [[ ! "$*" =~ " -Dlibcudf.clean.skip=" ]]; then
        # Don't skip clean if ccache is enabled
        # unless the user overrides
        _CUDF_CLEAN_SKIP="-Dlibcudf.clean.skip=false"
    fi
fi
case $(uname -m) in
    x86_64|amd64)
        arch=amd64;;
    aarch64|arm64)
        arch=arm64;;
    *)
        echo "Unsupported CPU architecture"; exit 1;;
esac

# Set env for arm64 build
if [ "$arch" == "arm64" ]; then
    profiles="${profiles},arm64"
    USE_GDS="OFF"
Member:

Curious why we don't build GDS on arm64? It used to be separate but now is part of the CUDA toolkit. Is it not part of the arm64 CUDA toolkit?

@NvTimLiu (Collaborator, Author), May 6, 2024:

> Curious why we don't build GDS on arm64? It used to be separate but now is part of the CUDA toolkit. Is it not part of the arm64 CUDA toolkit?

Yes, the GDS cuFile RDMA library is not in the arm64 CUDA toolkit:


[INFO] [exec] Could NOT find cuFile (missing: cuFile_LIBRARY cuFileRDMA_LIBRARY
[INFO] [exec] cuFile_INCLUDE_DIR)


Also, cufaultinj links against cupti_static, which is not found in the arm64 CUDA toolkit:

[INFO] [exec] -- Generating done (0.0s)
[INFO] [exec] CMake Error at faultinj/CMakeLists.txt:34 (target_link_libraries):
[INFO] [exec] Target "cufaultinj" links to:
[INFO] [exec]
[INFO] [exec] CUDA::cupti_static
[INFO] [exec]
[INFO] [exec] but the target was not found. Possible reasons include:
[INFO] [exec]
[INFO] [exec] * There is a typo in the target name.
[INFO] [exec] * A find_package call is missing for an IMPORTED target.
[INFO] [exec] * An ALIAS target is missing.
[INFO] [exec]
[INFO] [exec]
[INFO] [exec]
[INFO] [exec] CMake Generate step failed. Build files cannot be regenerated correctly.


An RMM OOM issue was also reported for the arm64 tests when USE_SANITIZER=OFF, but I have no idea what the root cause is:

[ERROR] There was an error in the forked process
[ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/nvidia/timl/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:254: Maximum pool size exceeded
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: There was an error in the forked process
[ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/nvidia/timl/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:254: Maximum pool size exceeded
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.ja

Member:
We need to fix the tests on arm64, not hack the build script to run the sanitizer to hide the problem. We will not be running the sanitizer in production, so letting this slip through by hacking the build script is not what we want.

    USE_SANITIZER="ON"
    BUILD_FAULTINJ="OFF"
    export DOCKER_BUILD_EXTRA_ARGS="--platform=linux/arm64 --build-arg CMAKE_ARCH=aarch64"
fi

$SCRIPTDIR/run-in-docker mvn \
-Dmaven.repo.local=$LOCAL_MAVEN_REPO \
-DCUDF_USE_PER_THREAD_DEFAULT_STREAM=$CUDF_USE_PER_THREAD_DEFAULT_STREAM \
-DUSE_GDS=$USE_GDS \
$_CUDF_CLEAN_SKIP \
-DBUILD_TESTS=ON -DBUILD_FAULTINJ=${BUILD_FAULTINJ} -Dcuda.version=$CUDA_VER \
-DUSE_SANITIZER=${USE_SANITIZER} \
"$@"
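The arch-detection logic this script adds can be exercised on its own. Below is a minimal standalone sketch; the `detect_arch` function name is mine (the script inlines the `case` statement directly):

```shell
#!/bin/bash
# Mirrors the uname-to-Docker-arch mapping added in build/build-in-docker.
# detect_arch is a hypothetical helper; the real script does not define it.
detect_arch() {
  case "$1" in
    x86_64|amd64)  echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "Unsupported CPU architecture: $1" >&2; return 1 ;;
  esac
}

detect_arch "$(uname -m)"
```

The mapped value then selects the Maven profile and the `--platform`/`CMAKE_ARCH` Docker build arguments, so an unrecognized architecture fails fast before any Docker work starts.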
6 changes: 3 additions & 3 deletions build/run-in-docker
@@ -1,7 +1,7 @@
#!/bin/bash

#
-# Copyright (c) 2022-2023, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -31,7 +31,7 @@ DOCKER_RUN_EXTRA_ARGS=${DOCKER_RUN_EXTRA_ARGS:-""}
LOCAL_CCACHE_DIR=${LOCAL_CCACHE_DIR:-"$HOME/.ccache"}
LOCAL_MAVEN_REPO=${LOCAL_MAVEN_REPO:-"$HOME/.m2/repository"}

-SPARK_IMAGE_NAME="spark-rapids-jni-build:${CUDA_VERSION}-devel-centos7"
+SPARK_IMAGE_NAME="spark-rapids-jni-build:${CUDA_VERSION}-devel-rockylinux8"

# ensure directories exist
mkdir -p "$LOCAL_CCACHE_DIR" "$LOCAL_MAVEN_REPO"
@@ -74,4 +74,4 @@ $DOCKER_CMD run $DOCKER_GPU_OPTS $DOCKER_RUN_EXTRA_ARGS -u $(id -u):$(id -g) --r
-e VERBOSE \
$DOCKER_OPTS \
$SPARK_IMAGE_NAME \
-scl enable devtoolset-11 "$RUN_CMD"
+scl enable gcc-toolset-11 "$RUN_CMD"
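For reference, the new image name in run-in-docker is composed from `CUDA_VERSION`; a quick sketch, assuming the 11.8.0 default that ci/Dockerfile declares:

```shell
#!/bin/bash
# Compose the build image tag the same way run-in-docker now does.
CUDA_VERSION=${CUDA_VERSION:-11.8.0}  # 11.8.0 is the Dockerfile default ARG
SPARK_IMAGE_NAME="spark-rapids-jni-build:${CUDA_VERSION}-devel-rockylinux8"
echo "$SPARK_IMAGE_NAME"
```

Dropping CentOS 7 only changes the `-devel-centos7` suffix to `-devel-rockylinux8` here; the CUDA version remains a caller-controlled knob.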
33 changes: 15 additions & 18 deletions ci/Dockerfile
100755 → 100644
@@ -17,31 +17,28 @@
###
# Build the image for spark-rapids-jni development environment.
#
-# Arguments: CUDA_VERSION=11.8.0
+# Arguments: CUDA_VERSION=[11.X.Y, 12.X.Y], OS_RELEASE=[8, 9], TARGETPLATFORM=[linux/amd64, linux/arm64]
#
###
ARG CUDA_VERSION=11.8.0
-FROM nvidia/cuda:$CUDA_VERSION-devel-centos7
-ARG DEVTOOLSET_VERSION=11
+ARG OS_RELEASE=8
+# multi-platform build with: docker buildx build --platform linux/arm64,linux/amd64 <ARGS> on either amd64 or arm64 host
+# check available official arm-based docker images at https://hub.docker.com/r/nvidia/cuda/tags (OS/ARCH)
+FROM --platform=$TARGETPLATFORM nvidia/cuda:$CUDA_VERSION-devel-rockylinux$OS_RELEASE
+ARG TOOLSET_VERSION=11
### Install basic requirements
-RUN yum install -y centos-release-scl
-RUN yum install -y devtoolset-${DEVTOOLSET_VERSION} rh-python38 epel-release
-RUN yum install -y zlib-devel maven tar wget patch ninja-build
-# require git 2.18+ to keep consistent submodule operations
-RUN yum -y install https://packages.endpointdev.com/rhel/7/os/x86_64/endpoint-repo.x86_64.rpm && yum install -y git
-# pin urllib3<2.0 for https://github.com/psf/requests/issues/6432
-RUN scl enable rh-python38 "pip install requests 'urllib3<2.0'"
+RUN dnf --enablerepo=powertools install -y scl-utils gcc-toolset-${TOOLSET_VERSION} python39 zlib-devel maven tar wget patch ninja-build git
## pre-create the CMAKE_INSTALL_PREFIX folder, set writable by any user for Jenkins
-RUN mkdir /usr/local/rapids && mkdir /rapids && chmod 777 /usr/local/rapids && chmod 777 /rapids
+RUN mkdir /usr/local/rapids /rapids && chmod 777 /usr/local/rapids /rapids

# 3.22.3: CUDA architecture 'native' support + flexible CMAKE_<LANG>_*_LAUNCHER for ccache
ARG CMAKE_VERSION=3.26.4

-RUN cd /usr/local && wget --quiet https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-x86_64.tar.gz && \
-    tar zxf cmake-${CMAKE_VERSION}-linux-x86_64.tar.gz && \
-    rm cmake-${CMAKE_VERSION}-linux-x86_64.tar.gz
-ENV PATH /usr/local/cmake-${CMAKE_VERSION}-linux-x86_64/bin:$PATH
+# default x86_64 from x86 build, aarch64 cmake for arm build
+ARG CMAKE_ARCH=x86_64
+RUN cd /usr/local && wget --quiet https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-${CMAKE_ARCH}.tar.gz && \
+    tar zxf cmake-${CMAKE_VERSION}-linux-${CMAKE_ARCH}.tar.gz && \
+    rm cmake-${CMAKE_VERSION}-linux-${CMAKE_ARCH}.tar.gz
+ENV PATH /usr/local/cmake-${CMAKE_VERSION}-linux-${CMAKE_ARCH}/bin:$PATH

# ccache for interactive builds
ARG CCACHE_VERSION=4.6
@@ -51,7 +48,7 @@ RUN cd /tmp && wget --quiet https://github.com/ccache/ccache/releases/download/v
cd ccache-${CCACHE_VERSION} && \
mkdir build && \
cd build && \
-scl enable devtoolset-${DEVTOOLSET_VERSION} \
+scl enable gcc-toolset-${TOOLSET_VERSION} \
"cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DZSTD_FROM_INTERNET=ON \
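The parameterized CMake download above builds one tarball URL per architecture from `CMAKE_VERSION` and `CMAKE_ARCH`; a sketch of the URLs it resolves to, assuming the defaults shown in the Dockerfile:

```shell
#!/bin/bash
# Expand the CMake tarball URL for both supported CMAKE_ARCH values.
CMAKE_VERSION=3.26.4
for CMAKE_ARCH in x86_64 aarch64; do
  echo "https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-${CMAKE_ARCH}.tar.gz"
done
```

Parameterizing the arch this way is what lets a single Dockerfile replace the deleted ci/Dockerfile.multi for both amd64 and arm64 builds.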
76 changes: 0 additions & 76 deletions ci/Dockerfile.multi

This file was deleted.

6 changes: 3 additions & 3 deletions ci/Jenkinsfile.premerge
@@ -30,7 +30,7 @@ import ipp.blossom.*

def githubHelper // blossom github helper
def TEMP_IMAGE_BUILD = true
-def IMAGE_PREMERGE = "${common.ARTIFACTORY_NAME}/sw-spark-docker/plugin-jni:centos7-cuda11.8.0-blossom"
+def IMAGE_PREMERGE = "${common.ARTIFACTORY_NAME}/sw-spark-docker/plugin-jni:rockylinux8-cuda11.8.0-blossom"
def cpuImage = pod.getCPUYAML(IMAGE_PREMERGE)
def PREMERGE_DOCKERFILE = 'ci/Dockerfile'
def PREMERGE_TAG
@@ -150,7 +150,7 @@ git --no-pager diff --name-only HEAD \$BASE -- ${PREMERGE_DOCKERFILE} || true"""
}

if (TEMP_IMAGE_BUILD) {
-PREMERGE_TAG = "centos7-cuda11.8.0-blossom-dev-${BUILD_TAG}"
+PREMERGE_TAG = "rockylinux8-cuda11.8.0-blossom-dev-${BUILD_TAG}"
IMAGE_PREMERGE = "${ARTIFACTORY_NAME}/sw-spark-docker-local/plugin-jni:${PREMERGE_TAG}"
docker.build(IMAGE_PREMERGE, "--network=host -f ${PREMERGE_DOCKERFILE} -t $IMAGE_PREMERGE .")
uploadDocker(IMAGE_PREMERGE)
@@ -212,7 +212,7 @@ git --no-pager diff --name-only HEAD \$BASE -- ${PREMERGE_DOCKERFILE} || true"""
container('gpu') {
timeout(time: 3, unit: 'HOURS') { // step only timeout for test run
common.resolveIncompatibleDriverIssue(this)
-sh 'scl enable devtoolset-11 "ci/premerge-build.sh"'
+sh 'scl enable gcc-toolset-11 "ci/premerge-build.sh"'
sh 'bash ci/fuzz-test.sh'
}
}
2 changes: 1 addition & 1 deletion ci/submodule-sync.sh
@@ -18,7 +18,7 @@
# NOTE:
# this script is for jenkins only, and should not be used for local development
# run with ci/Dockerfile in jenkins:
-# scl enable devtoolset-11 rh-python38 "ci/submodule-sync.sh"
+# scl enable gcc-toolset-11 rh-python38 "ci/submodule-sync.sh"

set -ex
