2 hardware agnostic front and backend #5

Closed
wants to merge 36 commits into from

Conversation

smedegaard
Collaborator

@smedegaard smedegaard commented Nov 12, 2024

Description

This PR decouples the hardware layer from the front- and backend of TorchServe.
Relates to #740

Requirement Files

Added requirements/torch_rocm62.txt, requirements/torch_rocm61.txt and requirements/torch_rocm60.txt for easy install of dependencies needed for AMD support.

Backend

The Python backend currently supports NVIDIA GPUs through hardware-specific libraries. A number of functions could also be refactored to use more generalized interfaces.

Changes Made to Backend

  • Use torch.cuda for detecting GPU availability and torch.version for differentiating between GPU vendors (NVIDIA, AMD)
  • Use torch.cuda for collecting GPU metrics
    • Drop the nvgpu library, a quick-and-dirty solution that calls nvidia-smi and parses its output
    • For AMD GPUs, use a temporary solution that relies on the amdsmi library directly
    • Once the bug in torch.cuda is fixed, the same functions can be used for collecting metrics from different GPUs (NVIDIA, AMD)
  • Extend print_env_info for AMD GPUs and reimplement a number of functions
    • Detect versions of HIP runtime, ROCm and MIOpen
    • Collect model names of available GPUs with torch.cuda (NVIDIA, AMD)
    • Use pynvml for detecting NVIDIA driver and CUDA versions
    • Use torch for detecting compiled CUDA and cuDNN versions
  • Refactor NVIDIA-specific code in several places
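
A minimal sketch of the vendor check from the first bullet, assuming that ROCm builds of PyTorch set torch.version.hip and CUDA builds set torch.version.cuda. The names detect_gpu_vendor and detect_from_torch and the returned labels are illustrative and not taken from this PR.

```python
from typing import Optional


def detect_gpu_vendor(cuda_available: bool,
                      cuda_version: Optional[str],
                      hip_version: Optional[str]) -> str:
    """Map torch build information to a vendor label (hypothetical helper)."""
    if not cuda_available:
        return "NONE"
    # Check HIP first: ROCm builds expose their devices through torch.cuda too.
    if hip_version:
        return "AMD"
    if cuda_version:
        return "NVIDIA"
    return "UNKNOWN"


def detect_from_torch() -> str:
    """Read the inputs from an installed torch; report "NONE" if torch is missing."""
    try:
        import torch
    except ImportError:
        return "NONE"
    return detect_gpu_vendor(
        torch.cuda.is_available(),
        getattr(torch.version, "cuda", None),
        getattr(torch.version, "hip", None),
    )
```

Keeping the decision in a pure function makes it testable on machines without any GPU; only detect_from_torch touches the real library.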

Frontend

The Java frontend that acts as the workload manager had calls to SMIs hard-coded in a few places. This made it difficult for TorchServe to support multiple hardware vendors in a graceful manner.

Changes Made to Frontend

We've introduced a new package org.pytorch.serve.device with the classes SystemInfo and Accelerator. SystemInfo holds an array list of Accelerator objects, which hold static information about the specific accelerators on a machine along with the relevant metrics.

Instead of calling the SMIs directly in multiple places in the frontend code, we have abstracted the hardware away by adding an instance of SystemInfo to the pre-existing ConfigManager. The frontend can now get data from the hardware via the methods on SystemInfo without knowing the specifics of the hardware and SMIs.

To implement the specifics for each of the vendors that were already partially supported, we have created a number of utility classes that communicate with the hardware via the relevant SMI.
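
The relationship between these classes can be sketched in a few lines. This is a Python illustration of the idea only, not the Java code in this PR: the class names mirror the design, while the snake_case method names, the reduced field set, FakeUtility and the injected constructor argument are assumptions made so the example runs without any SMI installed.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Accelerator:
    id: int
    vendor: str
    model: str
    usage_percentage: float = 0.0
    memory_utilization_megabytes: int = 0


class AcceleratorUtility(ABC):
    """Vendor-specific access to an SMI tool (stands in for IAcceleratorUtility)."""

    @abstractmethod
    def get_available_accelerators(self) -> List[Accelerator]:
        ...

    @abstractmethod
    def get_updated_accelerators_utilization(
        self, accelerators: List[Accelerator]
    ) -> List[Accelerator]:
        ...


class SystemInfo:
    """Hardware-agnostic view used by the rest of the frontend."""

    def __init__(self, utility: Optional[AcceleratorUtility] = None):
        # In the PR the utility is chosen by vendor detection; here it is injected.
        self._utility = utility
        self.accelerators = utility.get_available_accelerators() if utility else []
        self.update_accelerator_metrics()

    def has_accelerators(self) -> bool:
        return bool(self.accelerators)

    def get_number_of_accelerators(self) -> int:
        return len(self.accelerators)

    def update_accelerator_metrics(self) -> None:
        if self._utility is not None:
            self.accelerators = self._utility.get_updated_accelerators_utilization(
                self.accelerators
            )


class FakeUtility(AcceleratorUtility):
    """Returns canned data instead of calling a real SMI (illustration only)."""

    def get_available_accelerators(self) -> List[Accelerator]:
        return [Accelerator(0, "AMD", "example-gpu"), Accelerator(1, "AMD", "example-gpu")]

    def get_updated_accelerators_utilization(self, accelerators):
        return [Accelerator(a.id, a.vendor, a.model, 50.0, 1024) for a in accelerators]
```

Callers only ever see SystemInfo and Accelerator, so adding a vendor means adding one utility class rather than touching every call site.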

The following steps are taken in the SystemInfo constructor.

  1. Detect the relevant vendor by calling which {relevant smi} for each of the supported vendors.
    This is how vendor detection was done previously; there might be more robust ways. On Windows systems, where is used instead of which.
  2. When the accelerator vendor is detected, it creates an instance of the relevant utility class, for example ROCmUtility for AMD.
  3. Accelerators are detected, respecting the relevant environment variable for selecting devices: HIP_VISIBLE_DEVICES for AMD, CUDA_VISIBLE_DEVICES for NVIDIA and XPU_VISIBLE_DEVICES for Intel. All devices are detected if the relevant environment variable is not set.
  4. Finally, the metrics for the detected devices are updated.
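
Steps 1 and 3 can be sketched as follows. This is a Python illustration rather than the Java implementation. The SMI executable names in SMI_COMMANDS are assumptions, the function names are hypothetical, and only plain integer device indices are handled (UUID-style entries are ignored). shutil.which plays the role of which/where and works on both Linux and Windows.

```python
import os
import shutil
from typing import Callable, Dict, List, Mapping, Optional

# Executable names are assumptions for illustration only.
SMI_COMMANDS: Dict[str, str] = {
    "AMD": "amd-smi",
    "NVIDIA": "nvidia-smi",
    "INTEL": "xpu-smi",
}

# Environment variable names as listed in step 3.
VISIBLE_DEVICES_ENV: Dict[str, str] = {
    "AMD": "HIP_VISIBLE_DEVICES",
    "NVIDIA": "CUDA_VISIBLE_DEVICES",
    "INTEL": "XPU_VISIBLE_DEVICES",
}


def detect_vendor(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the first vendor whose SMI executable is found on PATH."""
    for vendor, command in SMI_COMMANDS.items():
        if which(command) is not None:
            return vendor
    return "UNKNOWN"


def visible_device_ids(vendor: str,
                       device_count: int,
                       env: Mapping[str, str] = os.environ) -> List[int]:
    """Return the device indices selected by the vendor's environment variable."""
    raw = env.get(VISIBLE_DEVICES_ENV.get(vendor, ""))
    if raw is None:
        # Variable not set: all devices are visible.
        return list(range(device_count))
    selected = [int(p) for p in (part.strip() for part in raw.split(",")) if p.isdigit()]
    # Drop indices that do not exist on this machine.
    return [i for i in selected if i < device_count]
```

Passing which and env as parameters keeps the lookup logic testable without installing any vendor tooling.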

The following is a class diagram showing how the new classes relate to the existing code

classDiagram
    class Accelerator {
        +Integer id
        +AcceleratorVendor vendor
        +String model
        +IAcceleratorUtility acceleratorUtility
        +Float usagePercentage
        +Float memoryUtilizationPercentage
        +Integer memoryAvailableMegabytes
        +Integer memoryUtilizationMegabytes
        +getVendor()
        +getAcceleratorModel()
        +getAcceleratorId()
        +getMemoryAvailableMegaBytes()
        +getUsagePercentage()
        +getMemoryUtilizationPercentage()
        +getMemoryUtilizationMegabytes()
        +setMemoryAvailableMegaBytes()
        +setUsagePercentage()
        +setMemoryUtilizationPercentage()
        +setMemoryUtilizationMegabytes()
        +utilizationToString()
        +updateDynamicAttributes()
    }

    class SystemInfo {
        -AcceleratorVendor acceleratorVendor
        -ArrayList<Accelerator> accelerators
        -IAcceleratorUtility acceleratorUtil
        +hasAccelerators()
        +getNumberOfAccelerators()
        +getAccelerators()
        +updateAcceleratorMetrics()
    }

    class AcceleratorVendor {
        <<enumeration>>
        AMD
        NVIDIA
        INTEL
        APPLE
        UNKNOWN
    }

    class IAcceleratorUtility {
        <<interface>>
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +getUpdatedAcceleratorsUtilization()
    }

    class ICsvSmiParser {
        <<interface>>
        +csvSmiOutputToAccelerators()
    }

    class IJsonSmiParser {
        <<interface>>
        +jsonOutputToAccelerators()
        +extractAcceleratorId()
        +jsonObjectToAccelerator()
        +extractAccelerators()
    }

    class CudaUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +parseAccelerator()
        +parseUpdatedAccelerator()
    }

    class ROCmUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +extractAccelerators()
        +extractAcceleratorId()
        +jsonObjectToAccelerator()
    }

    class XpuUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +parseDiscoveryOutput()
        +parseUtilizationOutput()
    }

    class AppleUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +jsonObjectToAccelerator()
        +extractAcceleratorId()
        +extractAccelerators()
    }

    class ConfigManager {
        -SystemInfo systemInfo
        +init(Arguments args)
    }

    class WorkerThread {
        #ConfigManager configManager
        #WorkerLifeCycle lifeCycle
    }

    class AsyncWorkerThread {
        #boolean loadingFinished
        #CountDownLatch latch
        +run()
        #connect()
    }

    class SystemInfo {
        -Logger logger
        -AcceleratorVendor acceleratorVendor
        -ArrayList<Accelerator> accelerators
        -IAcceleratorUtility acceleratorUtil
        +SystemInfo()
        -createAcceleratorUtility() IAcceleratorUtility
        -populateAccelerators()
        +hasAccelerators() boolean
        +getNumberOfAccelerators() Integer
        +static detectVendorType() AcceleratorVendor
        -static isCommandAvailable(String) boolean
        +getAccelerators() ArrayList<Accelerator>
        -updateAccelerators(List<Accelerator>)
        +updateAcceleratorMetrics()
        +getAcceleratorVendor() AcceleratorVendor
        +getVisibleDevicesEnvName() String
    }

    class Accelerator {
        +Integer id
        +AcceleratorVendor vendor
        +String model
        +Float usagePercentage
        +Float memoryUtilizationPercentage
        +Integer memoryAvailableMegabytes
        +Integer memoryUtilizationMegabytes
        +getVendor() AcceleratorVendor
        +getAcceleratorModel() String
        +getAcceleratorId() Integer
        +getUsagePercentage() Float
        +setUsagePercentage(Float)
        +setMemoryUtilizationPercentage(Float)
        +setMemoryUtilizationMegabytes(Integer)
    }

    class WorkerLifeCycle {
        -ConfigManager configManager
        -ModelManager modelManager
        -Model model
    }

    class WorkerThread {
        #ConfigManager configManager
        #int port
        #Model model
        #WorkerState state
        #WorkerLifeCycle lifeCycle

    }

    WorkerLifeCycle --> "1" ConfigManager
    WorkerLifeCycle --> "1" Model
    WorkerLifeCycle --> "1" Connector
    WorkerThread --> "1" WorkerLifeCycle

    ConfigManager "1" --> "1" SystemInfo
    ConfigManager "1" --> "*" Accelerator
    WorkerThread --> "1" ConfigManager
    AsyncWorkerThread --|> WorkerThread

    SystemInfo --> "0..*" Accelerator
    SystemInfo --> "1" IAcceleratorUtility
    SystemInfo --> "1" AcceleratorVendor
    Accelerator --> "1" AcceleratorVendor
    CudaUtil ..|> IAcceleratorUtility
    CudaUtil ..|> ICsvSmiParser
    ROCmUtil ..|> IAcceleratorUtility
    ROCmUtil ..|> IJsonSmiParser
    XpuUtil ..|> IAcceleratorUtility
    XpuUtil ..|> ICsvSmiParser
    AppleUtil ..|> IAcceleratorUtility
    AppleUtil ..|> IJsonSmiParser

Documentation

  • Added the section "Hardware Support" in the table of contents
  • Moved the pages about hardware support to serve/docs/hardware_support/ and added them under "Hardware Support" in the TOC
  • Added the page "AMD Support"

Screenshot 2024-11-27 120848

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

We build a new Docker container for each platform using Dockerfile.dev and the build arguments CUDA_VERSION and ROCM_VERSION.

# AMD instance
docker build -f docker/Dockerfile.rocm -t torch-serve-dev-image-rocm --build-arg ROCM_VERSION=rocm62 --build-arg BUILD_FROM_SRC=true .

Run containers

# AMD instance
docker run --rm -it -w /serve --device=/dev/kfd --device=/dev/dri --entrypoint bash torch-serve-dev-image-rocm

Tests

  • Frontend tests, CPU

Logs:

> ./frontend/gradlew -p frontend clean build
...
BUILD SUCCESSFUL in 6m 35s

  • Frontend tests, CUDA

Logs:

> ./frontend/gradlew -p frontend clean build
...
BUILD SUCCESSFUL in 6m 5s

  • Frontend tests, ROCm

Logs:

> ./frontend/gradlew -p frontend clean build
...
BUILD SUCCESSFUL in 6m 43s

  • Backend tests, CPU

Logs:

> python3 -m pytest ts/tests/unit_tests ts/torch_handler/unit_tests
============================================================================ 113 passed, 30 warnings in 38.09s ============================================================================
> cd workflow-archiver && python3 -m pytest workflow_archiver/tests/unit_tests workflow_archiver/tests/integ_tests
=================================================================================== 20 passed in 0.36s ====================================================================================
> cd model-archiver && python3 -m pytest model_archiver/tests/unit_tests model_archiver/tests/integ_tests
=================================================================================== 33 passed in 0.20s ====================================================================================

  • Backend tests, CUDA

Logs:

> python3 -m pytest ts/tests/unit_tests ts/torch_handler/unit_tests
======================================================================= 113 passed, 21 warnings in 83.76s (0:01:23) =======================================================================
> cd workflow-archiver && python3 -m pytest workflow_archiver/tests/unit_tests workflow_archiver/tests/integ_tests
=================================================================================== 20 passed in 0.31s ====================================================================================
> cd model-archiver && python3 -m pytest model_archiver/tests/unit_tests model_archiver/tests/integ_tests
=================================================================================== 33 passed in 0.20s ====================================================================================

  • Backend tests, ROCm

Logs:

> python3 -m pytest ts/tests/unit_tests ts/torch_handler/unit_tests
============================ 113 passed, 21 warnings in 48.06s ============================
> cd workflow-archiver && python3 -m pytest workflow_archiver/tests/unit_tests workflow_archiver/tests/integ_tests
=================================== 20 passed in 0.32s ====================================
> cd model-archiver && python3 -m pytest model_archiver/tests/unit_tests model_archiver/tests/integ_tests
=================================== 33 passed in 0.16s ====================================

  • Regression tests, CPU

Logs:

> git submodule update --init --recursive
> python3 test/regression_tests.py
================================================================ 163 passed, 40 skipped, 15 warnings in 2014.67s (0:33:34) ================================================================

  • Regression tests, CUDA

Logs:

> git submodule update --init --recursive
> python3 test/regression_tests.py
FAILED test_handler.py::test_huggingface_bert_model_parallel_inference - assert 'Bloomberg has decided to publish a new report on the global economy' in '{\n ...
====================================================== 1 failed, 162 passed, 40 skipped, 10 warnings in 2070.51s (0:34:30) ======================================================

  • Regression tests, ROCm

Logs:

> git submodule update --init --recursive
> python3 test/regression_tests.py
FAILED test_handler.py::test_huggingface_bert_model_parallel_inference - assert 'Bloomberg has decided to publish a new report on the global economy' in '{\n  ...
=========== 1 failed, 162 passed, 40 skipped, 11 warnings in 2085.45s (0:34:45) ===========

Note: the test test_handler.py::test_huggingface_bert_model_parallel_inference fails due to:

ValueError: Input length of input_ids is 150, but max_length is set to 50. This can lead to unexpected behavior. You should consider increasing max_length or, better yet, setting max_new_tokens.

This indicates that preprocessing uses a different max_length than inference. This can be verified by looking at the handler as it was when the test was originally implemented: model.generate() has max_length=50 by default, while the tokenizer uses max_length from setup_config (max_length=150). It seems that the BERT-based Textgeneration.mar needs an update.
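
The mismatch can be reproduced with plain arithmetic. The sketch below is a simplified model of the length check, not the transformers implementation, and the helper name check_generation_length is hypothetical. It assumes, as the error message suggests, that max_length bounds prompt plus generated tokens, while max_new_tokens bounds only the generated tokens.

```python
from typing import Optional


def check_generation_length(input_length: int,
                            max_length: int = 50,
                            max_new_tokens: Optional[int] = None) -> int:
    """Return the effective total length limit, or raise if the prompt exceeds it."""
    if max_new_tokens is not None:
        # The limit is relative to the prompt, so the prompt always fits.
        return input_length + max_new_tokens
    if input_length >= max_length:
        raise ValueError(
            f"Input length of input_ids is {input_length}, "
            f"but max_length is set to {max_length}."
        )
    return max_length
```

With the values from this test, a 150-token input against max_length=50 raises, whereas the same input with max_new_tokens=50 yields a total limit of 200.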

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

smedegaard and others added 5 commits November 8, 2024 17:05
'update backend to be hardware agnostic'
Rony Leppänen <[email protected]>

'update frontend to be hardware agnostic'
Anders Smedegaard Pedersen <[email protected]>

'update Dockerfile.dev to also work for AMD'
'update requirements/ for AMD support'
Samu Tamminen <[email protected]>

Other contributions:

Bipradip Chowdhury <[email protected]>
Jarkko Lehtiranta <[email protected]>
Jarkko Vainio <[email protected]>
Tero Kemppi <[email protected]>
@smedegaard smedegaard linked an issue Nov 12, 2024 that may be closed by this pull request
@smedegaard smedegaard marked this pull request as ready for review November 12, 2024 13:37
amdsmi.amdsmi_init()

handle = amdsmi.amdsmi_get_processor_handles()[gpu_index]
mem_used = amdsmi.amdsmi_get_gpu_vram_usage(handle)["vram_used"]

I believe that torch.cuda.mem_get_info should work fine on our systems if we want to follow the same approach.

@jataylo thank you for the comment! I am not sure we can do that yet. See the example below, which uses torch 2.4.1+rocm6.1:

(venv) root@6f6ab1e7f4fb:/workspaces/torch-serve-amd# amd-smi monitor --vram
GPU  VRAM_USED  VRAM_TOTAL
  0   61496 MB    65501 MB
  1   61496 MB    65501 MB
  2   59062 MB    65501 MB
  3   61496 MB    65501 MB
  4      13 MB    65501 MB
  5      13 MB    65501 MB
  6      13 MB    65501 MB
  7      13 MB    65501 MB
(venv) root@6f6ab1e7f4fb:/workspaces/torch-serve-amd# python -c "import amdsmi; import torch; print(*[(i, amdsmi.amdsmi_get_gpu_vram_usage(amdsmi.amdsmi_get_processor_handles()[i])) for i in range(torch.cuda.device_count())], sep='\n');"
(0, {'vram_total': 65501, 'vram_used': 61496})
(1, {'vram_total': 65501, 'vram_used': 61496})
(2, {'vram_total': 65501, 'vram_used': 59062})
(3, {'vram_total': 65501, 'vram_used': 61496})
(4, {'vram_total': 65501, 'vram_used': 13})
(5, {'vram_total': 65501, 'vram_used': 13})
(6, {'vram_total': 65501, 'vram_used': 13})
(7, {'vram_total': 65501, 'vram_used': 13})
(venv) root@6f6ab1e7f4fb:/workspaces/torch-serve-amd# python -c "import torch; import numpy as np; print(*[(i, np.array(torch.cuda.mem_get_info(i)) // 1024**2) for i in range(torch.cuda.device_count())], sep='\n');"
(0, array([ 6146, 65520])) # vram_used 59374
(1, array([ 4046, 65520])) # vram_used 61474
(2, array([ 4046, 65520])) # vram_used 61474 
(3, array([ 4046, 65520])) # vram_used 61474
(4, array([65414, 65520])) # vram_used 106
(5, array([65414, 65520])) # vram_used 106
(6, array([65414, 65520])) # vram_used 106
(7, array([65414, 65520])) # vram_used 106

Here the handle-based amdsmi approach seems to provide correct numbers, but when using torch.cuda.mem_get_info() and accessing devices by index, the information does not seem to be correct (note that mem_get_info() returns (free, total) memory, so used memory is total minus free). It seems that the device indices get mixed up somehow.
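
For reference, the used-memory figures in the comments above follow from this conversion. A small sketch: the helper name used_memory_mib is hypothetical, and the sample values are copied from the first four devices in the outputs above.

```python
from typing import List, Tuple

MIB = 1024 ** 2


def used_memory_mib(mem_info: Tuple[int, int]) -> int:
    """Convert a (free_bytes, total_bytes) pair into used memory in MiB."""
    free_bytes, total_bytes = mem_info
    return (total_bytes - free_bytes) // MIB


# (free, total) in MiB as printed for torch.cuda.mem_get_info, devices 0-3.
torch_info_mib: List[Tuple[int, int]] = [
    (6146, 65520), (4046, 65520), (4046, 65520), (4046, 65520),
]
# vram_used in MB as printed by amdsmi for the same indices.
amdsmi_used: List[int] = [61496, 61496, 59062, 61496]

torch_used = [used_memory_mib((free * MIB, total * MIB)) for free, total in torch_info_mib]
for index, (from_torch, from_amdsmi) in enumerate(zip(torch_used, amdsmi_used)):
    print(index, from_torch, from_amdsmi, from_torch - from_amdsmi)
```

On a machine with torch installed, the same helper can be applied directly to the tuple returned by torch.cuda.mem_get_info(i), which is in bytes.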

@samutamm samutamm force-pushed the 2-hardware-agnostic-front-and-backend branch from 05211fa to ff4daa8 Compare November 14, 2024 08:08
@smedegaard smedegaard requested a review from jataylo November 20, 2024 13:05
@jataylo jataylo left a comment

Generally looks good to me now, I think we should consult upstream feedback once any remaining comments are addressed.

But there still seem to be unnecessary formatting changes in the Java code that you may want to clean up.

@jakki-amd jakki-amd closed this Jan 7, 2025
Development

Successfully merging this pull request may close these issues.

hardware-agnostic front- and backend
6 participants