adding telemetry #1692

Open
wants to merge 24 commits into
base: branch-24.12
Commits
8d5473a
adding telemetry (testing)
msarahan Oct 10, 2024
1f9ed88
add action to ensure otel-cli is available for final span update
msarahan Oct 21, 2024
421626d
update description of final span update
msarahan Oct 21, 2024
189604a
set endpoint to gha-runners.nvidia.com
msarahan Oct 29, 2024
943ddb3
rename final telemetry update job
msarahan Oct 29, 2024
496cc0d
set git resource attributes. Pass into pipelines
msarahan Oct 30, 2024
5c54a58
DRYing things off
msarahan Oct 31, 2024
d9413b5
fix reexport->reexports
msarahan Oct 31, 2024
f368d10
fix branches having too much money
msarahan Oct 31, 2024
43c7345
unquote certs?
msarahan Oct 31, 2024
6dcaa76
single quotes?
msarahan Oct 31, 2024
dc99a73
one more try for multiline
msarahan Oct 31, 2024
4236eee
add echo of certs
msarahan Oct 31, 2024
5798541
base64 certs
msarahan Oct 31, 2024
1d0818e
assuming base64-encoded certs
msarahan Oct 31, 2024
c1476aa
fix final telemetry update not depending on pr-builder
msarahan Oct 31, 2024
8d54a55
use inherited secrets for certs
msarahan Oct 31, 2024
74863aa
adapt for shared_actions merge
msarahan Nov 1, 2024
7de05af
change shared-workflows branch to branch-24.12 after merging telemetr…
msarahan Nov 4, 2024
555bbb3
Merge branch 'branch-24.12' into add-telemetry
msarahan Nov 4, 2024
8533f73
revert mambabuild->build
msarahan Nov 4, 2024
f3d7e8e
Merge branch 'add-telemetry' of github.com:msarahan/rmm into add-tele…
msarahan Nov 4, 2024
cdb84fe
fix a hard-coded shared-actions repo checkout
msarahan Nov 4, 2024
3401de0
Update .github/workflows/pr.yaml
msarahan Nov 4, 2024
178 changes: 158 additions & 20 deletions .github/workflows/pr.yaml
@@ -9,11 +9,68 @@ concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

env:
OTEL_SERVICE_NAME: 'pr-rmm'
# TODO: this should be set as an org-wide variable
OTEL_EXPORTER_OTLP_ENDPOINT: https://tempo.gha-runners.nvidia.com:4318
# These are the paths where the certificate secrets are written to files. The files don't
# exist unless a step explicitly writes them.
# Setting these environment variables tells OpenTelemetry tools where to find the files.
# We abuse them a bit by also using them as the write destinations for the certificate files.
OTEL_EXPORTER_OTLP_CERTIFICATE: "/tmp/certs/ca.crt"
OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE: "/tmp/certs/client.crt"
OTEL_EXPORTER_OTLP_CLIENT_KEY: "/tmp/certs/client.key"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
OTEL_EXPORTER_OTLP_HEADERS: ${{ secrets.OTEL_EXPORTER_OTLP_HEADERS }}
OTEL_RESOURCE_ATTRIBUTES: "git.repository=${{github.repository}},git.ref=${{github.ref}},git.sha=${{github.sha}},git.job_url=${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"

jobs:
telemetry-setup:
runs-on: ubuntu-latest
outputs:
start_time: ${{ steps.timestamp.outputs.START_TIME }}
traceparent: ${{ steps.telemetry-setup.outputs.traceparent }}
endpoint: ${{ steps.var-reexports.outputs.endpoint }}
top_level_service_name: ${{ steps.var-reexports.outputs.service_name }}
otel_resource_attributes: ${{steps.var-reexports.outputs.otel_resource_attributes}}
steps:
- name: Get starting timestamp
id: timestamp
run:
echo "START_TIME=$(date +%s.%N)" >> ${GITHUB_OUTPUT}
- name: Echo endpoint to make it available to shared workflows
id: var-reexports
run: |
echo endpoint="${OTEL_EXPORTER_OTLP_ENDPOINT}" >> ${GITHUB_OUTPUT}
echo service_name="${OTEL_SERVICE_NAME}" >> ${GITHUB_OUTPUT}
echo otel_resource_attributes="${OTEL_RESOURCE_ATTRIBUTES}" >> ${GITHUB_OUTPUT}
- name: Write certificate files for mTLS
run: |
mkdir -p /tmp/certs
cat << EOF > "${OTEL_EXPORTER_OTLP_CERTIFICATE}"
${{ secrets.OTEL_EXPORTER_OTLP_CA_CERTIFICATE }}
EOF
cat << EOF > "${OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE}"
${{ secrets.OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE }}
EOF
cat << EOF > "${OTEL_EXPORTER_OTLP_CLIENT_KEY}"
${{ secrets.OTEL_EXPORTER_OTLP_CLIENT_KEY }}
EOF
- name: Telemetry setup
id: telemetry-setup
uses: rapidsai/shared-actions/telemetry-traceparent@add-telemetry
- name: Start root span
uses: rapidsai/shared-actions/telemetry-create-span@add-telemetry
with:
name: "root span"
traceparent: ${{steps.telemetry-setup.outputs.traceparent}}
start_time: ${{steps.timestamp.outputs.start_time}}
otel_resource_attributes: "${{steps.var-reexports.outputs.otel_resource_attributes}}"
pr-builder:
needs:
- changed-files
- checks
- telemetry-setup
- conda-cpp-build
- conda-cpp-tests
- conda-python-build
@@ -24,14 +81,18 @@ jobs:
- wheel-tests
- devcontainer
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/pr-builder.yaml@branch-24.12
uses: rapidsai/shared-workflows/.github/workflows/pr-builder.yaml@add-telemetry
if: always()
with:
needs: ${{ toJSON(needs) }}
changed-files:
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/changed-files.yaml@branch-24.12
needs: telemetry-setup
uses: rapidsai/shared-workflows/.github/workflows/changed-files.yaml@add-telemetry
with:
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
files_yaml: |
test_cpp:
- '**'
@@ -50,75 +111,152 @@ jobs:
- '!img/**'
checks:
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/checks.yaml@branch-24.12
needs: telemetry-setup
uses: rapidsai/shared-workflows/.github/workflows/checks.yaml@add-telemetry
with:
enable_check_generated_files: false
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
ignored_pr_jobs: "final-telemetry-update"
conda-cpp-build:
needs: checks
needs:
- telemetry-setup
- checks
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/conda-cpp-build.yaml@branch-24.12
uses: rapidsai/shared-workflows/.github/workflows/conda-cpp-build.yaml@add-telemetry
with:
build_type: pull-request
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
conda-cpp-tests:
needs: [conda-cpp-build, changed-files]
needs: [conda-cpp-build, changed-files, telemetry-setup]
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/conda-cpp-tests.yaml@branch-24.12
uses: rapidsai/shared-workflows/.github/workflows/conda-cpp-tests.yaml@add-telemetry
if: fromJSON(needs.changed-files.outputs.changed_file_groups).test_cpp
with:
build_type: pull-request
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
conda-python-build:
needs: conda-cpp-build
needs:
- conda-cpp-build
- telemetry-setup
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/conda-python-build.yaml@branch-24.12
uses: rapidsai/shared-workflows/.github/workflows/conda-python-build.yaml@add-telemetry
with:
build_type: pull-request
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
conda-python-tests:
needs: [conda-python-build, changed-files]
needs: [conda-python-build, changed-files, telemetry-setup]
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/conda-python-tests.yaml@branch-24.12
uses: rapidsai/shared-workflows/.github/workflows/conda-python-tests.yaml@add-telemetry
if: fromJSON(needs.changed-files.outputs.changed_file_groups).test_python
with:
build_type: pull-request
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
docs-build:
needs: conda-python-build
needs:
- conda-python-build
- telemetry-setup
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/custom-job.yaml@branch-24.12
uses: rapidsai/shared-workflows/.github/workflows/custom-job.yaml@add-telemetry
with:
build_type: pull-request
node_type: "gpu-v100-latest-1"
arch: "amd64"
container_image: "rapidsai/ci-conda:latest"
run_script: "ci/build_docs.sh"
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
wheel-build-cpp:
needs: checks
needs:
- checks
- telemetry-setup
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/wheels-build.yaml@branch-24.12
uses: rapidsai/shared-workflows/.github/workflows/wheels-build.yaml@add-telemetry
with:
matrix_filter: group_by([.ARCH, (.CUDA_VER|split(".")|map(tonumber)|.[0])]) | map(max_by(.PY_VER|split(".")|map(tonumber)))
build_type: pull-request
script: ci/build_wheel_cpp.sh
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
wheel-build-python:
needs: wheel-build-cpp
needs:
- wheel-build-cpp
- telemetry-setup
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/wheels-build.yaml@branch-24.12
uses: rapidsai/shared-workflows/.github/workflows/wheels-build.yaml@add-telemetry
with:
build_type: pull-request
script: ci/build_wheel_python.sh
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
wheel-tests:
needs: [wheel-build-python, changed-files]
needs: [wheel-build-python, changed-files, telemetry-setup]
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/wheels-test.yaml@branch-24.12
uses: rapidsai/shared-workflows/.github/workflows/wheels-test.yaml@add-telemetry
if: fromJSON(needs.changed-files.outputs.changed_file_groups).test_python
with:
build_type: pull-request
script: ci/test_wheel.sh
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
devcontainer:
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/build-in-devcontainer.yaml@branch-24.12
uses: rapidsai/shared-workflows/.github/workflows/build-in-devcontainer.yaml@add-telemetry
needs:
- telemetry-setup
with:
arch: '["amd64"]'
cuda: '["12.5"]'
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{ needs.telemetry-setup.outputs.traceparent }}
build_command: |
sccache -z;
build-all -DBUILD_BENCHMARKS=ON --verbose;
sccache -s;
final-telemetry-update:
runs-on: ubuntu-latest
needs: [pr-builder, telemetry-setup]
steps:
- name: Get final timestamp
id: timestamp
run:
echo "FINAL_TIME=$(date +%s.%N)" >> ${GITHUB_OUTPUT}
# The main purpose of this traceparent step is to ensure that otel-cli is installed.
- name: Get job traceparent
uses: rapidsai/shared-actions/telemetry-traceparent@add-telemetry
- name: Write certificate files for mTLS
run: |
mkdir -p /tmp/certs
cat << EOF > ${OTEL_EXPORTER_OTLP_CERTIFICATE}
${{ secrets.OTEL_EXPORTER_OTLP_CA_CERTIFICATE }}
EOF
cat << EOF > ${OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE}
${{ secrets.OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE }}
EOF
cat << EOF > ${OTEL_EXPORTER_OTLP_CLIENT_KEY}
${{ secrets.OTEL_EXPORTER_OTLP_CLIENT_KEY }}
EOF
- name: Update root span with final completion time
if: always()
uses: rapidsai/shared-actions/telemetry-create-span@add-telemetry
with:
service: ${{needs.telemetry-setup.outputs.top_level_service_name}}
name: "end-of-job update"
default_endpoint: "${{needs.telemetry-setup.outputs.endpoint}}"
traceparent: ${{needs.telemetry-setup.outputs.traceparent}}
start_time: ${{needs.telemetry-setup.outputs.start_time}}
end_time: ${{steps.timestamp.outputs.FINAL_TIME}}
otel_resource_attributes: "${{needs.telemetry-setup.outputs.otel_resource_attributes}}"
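
The start and end timestamps recorded by `telemetry-setup` and `final-telemetry-update` can be tried out locally. What follows is a minimal sketch, not part of the PR: it assumes GNU `date` (for `%N`), and the temporary file is a hypothetical stand-in for the `GITHUB_OUTPUT` file that GitHub Actions provides. It writes both timestamps in the same `key=value` form as the workflow, reads them back, and computes the elapsed time the root span would cover.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the file GitHub Actions exposes as GITHUB_OUTPUT.
GITHUB_OUTPUT="$(mktemp)"

# Same command as the workflow: seconds since the epoch plus nanoseconds (GNU date).
echo "START_TIME=$(date +%s.%N)" >> "${GITHUB_OUTPUT}"

# Placeholder for the build and test jobs that run between the two timestamps.
sleep 0.1

echo "FINAL_TIME=$(date +%s.%N)" >> "${GITHUB_OUTPUT}"

# Read the key=value pairs back, as a later step or job would via its outputs.
start_time="$(grep '^START_TIME=' "${GITHUB_OUTPUT}" | cut -d= -f2)"
final_time="$(grep '^FINAL_TIME=' "${GITHUB_OUTPUT}" | cut -d= -f2)"

# Shell arithmetic is integer-only, so awk does the floating-point subtraction.
elapsed="$(awk -v s="${start_time}" -v e="${final_time}" 'BEGIN { printf "%.3f", e - s }')"
echo "root span duration: ${elapsed}s"

rm -f "${GITHUB_OUTPUT}"
```

Quoting `"${GITHUB_OUTPUT}"` is a defensive choice in this sketch; the workflow's unquoted form behaves the same as long as the path contains no whitespace.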
2 changes: 1 addition & 1 deletion ci/build_cpp.sh
@@ -16,6 +16,6 @@ rapids-print-env
rapids-logger "Begin cpp build"

# This calls mambabuild when boa is installed (as is the case in the CI images)
RAPIDS_PACKAGE_VERSION=$(rapids-generate-version) rapids-conda-retry mambabuild conda/recipes/librmm
RAPIDS_PACKAGE_VERSION=$(rapids-generate-version) rapids-conda-retry build conda/recipes/librmm

rapids-upload-conda-to-s3 cpp