
[Feature Request] Add CUDA kernel for the ScatterElements operator in opset 18 #18381

Open
martinResearch opened this issue Nov 9, 2023 · 6 comments
Labels
ep:CUDA issues related to the CUDA execution provider feature request request for unsupported feature or enhancement

Comments

martinResearch commented Nov 9, 2023

Describe the feature request

It seems that the operator ScatterElements is not implemented in CUDA when using opset 16, 17 or 18 with the "add" reduction.
We get the message "CUDA kernel not found in registries for Op type: ScatterElements node name: /ScatterElements" in the log when loading the onnx model, and the profiling file shows the node is not run by the "CUDAExecutionProvider".
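To confirm which execution provider actually ran each node, the profiling JSON written by onnxruntime (with `sess_options.enable_profiling = True`) can be filtered on the `args.provider` field, as the repro below does. A minimal sketch of that filtering; the trace events here are a hand-written stand-in for a real profile, which contains many more fields per event:

```python
import json

# Hand-written stand-in for the trace events found in an onnxruntime
# profiling file; the "args"/"provider" structure matches what the
# repro script reads, the values are illustrative only.
profile = json.loads("""
[
  {"name": "/ScatterElements_kernel_time",
   "args": {"op_name": "ScatterElements", "provider": "CPUExecutionProvider"}},
  {"name": "/MatMul_kernel_time",
   "args": {"op_name": "MatMul", "provider": "CUDAExecutionProvider"}}
]
""")

# Map each op to the provider that executed it, skipping events
# that carry no provider information.
assignments = {
    ev["args"]["op_name"]: ev["args"]["provider"]
    for ev in profile
    if "provider" in ev.get("args", {})
}
print(assignments)
```

A ScatterElements entry showing CPUExecutionProvider here is the CPU fallback this issue is about.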

Note that the operator ScatterElements is available on CUDA when using opset 15, but it produces wrong results (see onnx/onnx#3484).
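For context on why the opset-15 path can be numerically wrong: the "reduction" attribute was only added to ScatterElements in opset 16, so a scatter exported at opset 15 performs plain overwriting, where duplicate indices keep only the last write instead of accumulating. A small numpy illustration (not from the issue itself) using the same inputs as the repro below:

```python
import numpy as np

indices = np.array([0, 0, 1, 2])
weights = np.array([1.0, 3.0, 5.0, 7.0])

# Plain scatter (ScatterElements without reduction, opset <= 15):
# duplicate indices overwrite, so index 0 keeps only the last value.
plain = np.zeros(3)
plain[indices] = weights            # index 0 ends up as 3.0, not 1.0 + 3.0

# Scatter with reduction="add" (opset >= 16): duplicates accumulate.
added = np.zeros(3)
np.add.at(added, indices, weights)  # index 0 ends up as 4.0

print(plain, added)
```

The two results differ exactly at the duplicated index, which matches the expected output [4.0, 5.0, 7.0] asserted in the repro.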

Here is some minimal Python code to reproduce the problem, using torch==2.0.0+cu118 or torch==2.1.0+cu118 and onnxruntime-gpu==1.16.2 on an NVIDIA GeForce GTX 1050:

import io
import json

import numpy as np
import onnxruntime
import torch
import torch.nn as nn


class ScatterAdd(nn.Module):
    """Point cloud renderer"""

    def __init__(self, length: int):
        super(ScatterAdd, self).__init__()
        self.length = length

    def forward(self, indices: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        result = torch.zeros((self.length), device="cuda", dtype=torch.float)
        result = result.scatter_add(0, indices, weights)
        return result


def main():
    opset_version = 16
    onnx_provider = "CUDAExecutionProvider"

    indices = torch.Tensor([0, 0, 1, 2]).long().cuda()
    weights = torch.Tensor([1.0, 3.0, 5.0, 7.0]).cuda()
    scatter_add = ScatterAdd(length=3)
    result = scatter_add(indices=indices, weights=weights)
    assert np.allclose(result.cpu().numpy(), [4.0, 5.0, 7.0])

    bytes_io = io.BytesIO()
    torch.onnx.export(
        scatter_add, (indices, weights), bytes_io, opset_version=opset_version, input_names=["indices", "weights"]
    )
    onnxruntime.set_default_logger_severity(1)
    sess_options = onnxruntime.SessionOptions()
    sess_options.enable_profiling = True
    ort_session = onnxruntime.InferenceSession(
        bytes_io.getvalue(),
        providers=[
            onnx_provider,
        ],
        sess_options=sess_options,
    )
    # when using CUDAExecutionProvider with opset_version 16, 17 or 18, the log shows:
    # CUDA kernel not found in registries for Op type: ScatterElements node name: /ScatterElements

    numpy_inputs = {
        "indices": np.array([0, 0, 1, 2], dtype=np.int64),
        "weights": np.array([1.0, 3.0, 5.0, 7.0], dtype=np.float32),
    }
    result = ort_session.run(None, numpy_inputs)

    prof_file = ort_session.end_profiling()

    with open(prof_file) as f:
        sess_time = json.load(f)

    # fails when using opset_version=15 with CUDAExecutionProvider (wrong results)
    assert np.allclose(result, [4.0, 5.0, 7.0])

    # fails when using opset_version=16, 17 or 18 with CUDAExecutionProvider (node falls back to CPU)
    assert sess_time[3]["args"]["provider"] == "CUDAExecutionProvider"
  

if __name__ == "__main__":
    main()

Describe scenario use case

This is used in an image processing pipeline.

@martinResearch martinResearch added the feature request request for unsupported feature or enhancement label Nov 9, 2023
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Nov 9, 2023
@martinResearch (Author)

When printing the onnx model using

    torch.onnx.export(
        scatter_add, (indices, weights), "model.onnx", opset_version=opset_version, input_names=["indices", "weights"]
    )
    with open("model.onnx", "rb") as f:
        model = onnx.load(f)

I get

ir_version: 8
opset_import {
  version: 16
}
producer_name: "pytorch"
producer_version: "2.1.0"
graph {
  node {
    output: "onnx::ScatterElements_2"
    name: "Constant_0"
    op_type: "Constant"
    attribute {
      name: "value"
      type: TENSOR
      t {
        dims: 3
        data_type: 1
        raw_data: "\000\000\000\000\000\000\000\000\000\000\000\000"
      }
    }
  }
  node {
    input: "onnx::ScatterElements_2"
    input: "indices"
    input: "weights"
    output: "3"
    name: "/ScatterElements"
    op_type: "ScatterElements"
    attribute {
      name: "axis"
      type: INT
      i: 0
    }
    attribute {
      name: "reduction"
      type: STRING
      s: "add"
    }
  }
  name: "main_graph"
  input {
    name: "indices"
    type {
      tensor_type {
        elem_type: 7
        shape {
          dim {
            dim_value: 4
          }
        }
      }
    }
  }
  input {
    name: "weights"
    type {
      tensor_type {
        elem_type: 1
        shape {
          dim {
            dim_value: 4
          }
        }
      }
    }
  }
  output {
    name: "3"
    type {
      tensor_type {
        elem_type: 1
        shape {
          dim {
            dim_value: 3
          }
        }
      }
    }
  }
}

The exported onnx model seems valid.

@martinResearch (Author)

The ScatterElements operator is listed here and here with the lines

BuildKernelCreateInfo<ONNX_OPERATOR_VERSIONED_KERNEL_CLASS_NAME(kCudaExecutionProvider, kOnnxDomain, 11, 12, ScatterElements)>,

and

BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCudaExecutionProvider, kOnnxDomain, 13, ScatterElements)>,

which seems to indicate it should be available for any opset greater than or equal to 11, so it seems strange that in practice it is only available for opsets 11, 12, 13, 14 and 15.


anjandeepsahni commented Mar 22, 2024

Seconding this request. Currently, if we use scatter-add with opset 16, onnxruntime runs it on the CPUExecutionProvider, which is very slow for large inputs; the runtime seems to increase linearly with batched inputs. @martinResearch, wondering if you already found a solution for this?
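One possible workaround, offered as an untested sketch rather than anything confirmed in this thread: a 1-D scatter-add can be rewritten as a one-hot matrix product, which exports to MatMul-style ops that do have CUDA kernels. In PyTorch this would be roughly `F.one_hot(indices, num_classes=length).to(weights.dtype).T @ weights`; the numpy sketch below only checks that the math is equivalent:

```python
import numpy as np

indices = np.array([0, 0, 1, 2])
weights = np.array([1.0, 3.0, 5.0, 7.0])
length = 3

# One-hot encode the indices: row i selects bucket indices[i].
one_hot = np.eye(length)[indices]   # shape (4, 3)

# scatter_add(0, indices, weights) is equivalent to one_hot.T @ weights.
via_matmul = one_hot.T @ weights

# Reference scatter-add for comparison.
reference = np.zeros(length)
np.add.at(reference, indices, weights)

print(via_matmul, reference)
```

The trade-off is memory: the one-hot matrix is O(len(indices) * length), so this only helps when `length` is modest.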

@pranavsharma (Contributor)

Feel free to contribute. We welcome external contributions.

@anjandeepsahni

Thanks @pranavsharma. I am not an expert in the internal workings of ONNX Runtime. If I could get some guidance on how to fix this, I would be happy to create a PR. 😄

@anjandeepsahni

#19198 seems to have added support for ScatterElements in opsets 13, 15 and 18, but I am not sure why opsets 16 and 17 were skipped.

Unfortunately, PyTorch does not support opset 18 with torch.onnx.export, and torch.onnx.dynamo_export is still in beta.
