Proposal on enabling generic GPU vendor #2044
Conversation
I think I'd prefer to think of accelerator support not in terms of vendors, but interfaces or backends. You're bridging SYCL to either CUDA, or ROCm, or OpenCL, or Vulkan, or Level Zero, or... Each of those backends potentially supports N devices from M vendors. It happens that the OpenCL bit of oneDNN is positioned to support more than one vendor sooner than the other ones, but at the moment there is really just one vendor per backend, and that makes it tempting to use "vendor" when I think I wish we'd say "API".
SYCL devices have to be created specifically for particular types.

oneDNN supports two engine kinds: CPU and GPU. According to the UXFL strategy the generic kernels are expected to run on different GPUs and accelerators.
Do you know if there are kernel support differences between GPUs and Accelerators in SYCL (e.g. memory model, functionalities not mandated, ...)?
Generic SYCL is meant to be a common denominator for all devices, so I don't think there should be any differences in the SYCL programming model. However, different devices may have different capabilities, and therefore some SYCL features may or may not be available on them. For example, the USM memory model may not be supported by some devices, so users have to check the corresponding device aspect prior to allocating USM memory.
We should also be careful with those cases inside the library, e.g. we always allocate a SYCL buffer for the scratchpad, which is always supported, but if we decide to use USM memory for it we will need to do that properly.
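As a minimal standalone sketch (not oneDNN code), checking the relevant device aspect before allocating USM memory could look like this, falling back to a SYCL buffer when USM is unavailable:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q{sycl::default_selector_v};
    sycl::device dev = q.get_device();
    constexpr size_t n = 1024;

    if (dev.has(sycl::aspect::usm_device_allocations)) {
        // Device supports USM: allocate device memory directly.
        float *ptr = sycl::malloc_device<float>(n, q);
        q.fill(ptr, 0.f, n).wait();
        sycl::free(ptr, q);
    } else {
        // Fall back to a SYCL buffer, which every device must support.
        std::vector<float> host(n, 0.f);
        sycl::buffer<float, 1> buf{host.data(), sycl::range<1>{n}};
    }
    return 0;
}
```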
I'm not sure I understand what you mean by the "API" here. Maybe you can elaborate on that a bit more? Both of the GPU runtimes (OpenCL and SYCL) may support multiple vendors, and we want to provide external contributors with a way to enable them. The generic kernels are meant to be a reference and to close the functional gap for the newly enabled vendors; they are not meant to provide performance benefits. The reason we add the
I think there are two things here: the interface oneDNN exposes to applications, and the interfaces oneDNN consumes in order to accelerate its own operation. For the former, yes, OpenCL and SYCL are the interop APIs and adding more would be an error, on this we agree. For the latter...
The issue I have is that what we are presently calling the "Intel" OpenCL engine is like 96% generic already. As far as I can tell the only thing that truly requires an Intel GPU is when ngen is set to emit assembly instead of clc. At heart the requirement is not that it's an Intel GPU, it's that the OpenCL device supports the necessary extensions.

As a potential external contributor, who is considering how to integrate oneDNN into my operating system, I like the idea of having an engine that generically supports many GPUs well. The existing OpenCL engine seems to me to be the best path to that breadth of support, as virtually all of it is written to APIs that already have broad support (or seem easy to add to rusticl); so much so that I would almost suggest renaming it.
@nwnk, just to add feedback as someone helping develop OpenCL kernels: I agree with you on many points, but I don't think we can expose the implementations en masse in quite the way you are proposing. The key issue is that performance is a requirement within oneDNN. In particular, the more targets an implementation supports, the harder it is to modify and validate performance. This induces a contention between code maintainability, where fewer implementations are better, and performance maintainability, where a smaller validation space is better. Splitting on device vendors provides a natural mechanism to narrow scope, as we expect significant divergence in performance characteristics between hardware from different vendors and, in some cases, we can rely on vendor libraries such as cuDNN. I expect shared OpenCL infrastructure will be helpful (and I am looking forward to your upcoming RFC), but we need to be careful in how we balance the code/performance contention across a broad suite of developers, hardware, goals, requirements, etc. It is already a challenge to maintain this balance within only Intel's existing GPUs. In case it helps, here are some pointers for the current implementation:
You are right that a lot of the OpenCL code should be semi-portable. One of the key issues is that kernel implementations are better viewed as code generators. At a high level, we have an implementation dispatcher which takes a problem as input and outputs a configuration. The OpenCL code generators then JIT compile the configuration before it is eventually executed. I expect there is a lot of OpenCL code generator infrastructure that can be shared. The dispatcher is a problem though: by necessity, it can depend on the target device, the target compiler, and the OpenCL code generator. Any design with shared OpenCL infrastructure needs to take this into account so that shared infrastructure changes do not adversely interact with dispatchers.
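To illustrate the split being described, here is a rough, hypothetical C++ sketch (none of these types or names are the actual oneDNN interfaces): a device-dependent dispatcher maps a problem to a configuration, and a shareable code generator turns that configuration into OpenCL C source for the runtime to JIT compile.

```cpp
#include <string>

// Hypothetical problem and configuration descriptors, for illustration only.
struct problem_t {
    int mb, ic, oc; // minibatch, input channels, output channels
};

struct config_t {
    int tile_m, tile_n, simd_width; // tuning knobs chosen per target
};

// Device- and compiler-dependent part: picks a configuration for a problem.
struct dispatcher_t {
    virtual config_t choose(const problem_t &prb) const = 0;
    virtual ~dispatcher_t() = default;
};

// Potentially shared part: emits OpenCL C for a configuration; the runtime
// then JIT compiles the generated source before execution.
inline std::string generate_kernel_source(const config_t &cfg) {
    return "__kernel void tile_" + std::to_string(cfg.tile_m) + "x"
            + std::to_string(cfg.tile_n) + "(/* args */) { /* generated body */ }";
}
```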
The Intel-specific features we rely on should be listed in device_info.hpp (if you find any missing, please let us know). For kernel development, the
@nwnk, I think Roy covered the part about managing the kernels and the requirements on them pretty well. I just want to add a few things. I understand the desire to reuse the existing kernels to avoid investing additional time and effort in development and maintenance; however, oneDNN has parts that are developed and maintained more or less independently by external contributors, so we have to provide them with a certain level of autonomy. This is why we've decided to prioritize providing an independent space for development over minimizing duplication via the vendor mechanism.
Maybe those challenges could inform how we progress here. What tensions are you finding, and how are you trying to address them? One thing I noted was that the list of potential kernels for a primitive is fixed at compilation time, based purely on the supported engines. This does allow you to put fast paths first, but it means the list itself has to be fast-path-sorted for every device it might encounter. I could easily imagine that iGPU and dGPU would want entirely different sort orders, and one way to address that might be to build the list at engine setup time instead, with different lists for UMA vs. PCIe, or whatever other differentiator matters most. That kind of flexibility would almost be required (I feel) for supporting non-Intel GPUs in the OCL backend, and if it would help the Intel case as well, so much the better.
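As a purely illustrative sketch of that idea (the names below are invented and do not reflect oneDNN's actual implementation-list machinery), the candidate list could be assembled when the engine is created and ordered by a device trait such as unified memory:

```cpp
#include <string>
#include <vector>

// Hypothetical device capability bit used to pick a sort order.
struct device_caps_t {
    bool unified_memory; // true for UMA devices such as iGPUs
};

// Each entry names a candidate implementation; the engine would try them
// in order until one accepts the given problem.
std::vector<std::string> build_impl_list(const device_caps_t &caps) {
    if (caps.unified_memory)
        return {"uma_fast_path", "generic_ocl", "reference"};
    return {"pcie_fast_path", "generic_ocl", "reference"};
}
```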
I admit I don't love depending on a particular CLC compiler. But, in the event, rusticl uses clang as well, so to the extent that those extensions are implemented through to clang's SPIR-V output, it should just be a matter of handling them in rusticl. I've already seen a few we don't entirely handle, and I suspect it'll be a release or two before Mesa is really capable of running oneDNN out of the box, but I'm not seeing clang tell me it doesn't know how to handle an attribute, so I might be safe there. Once I get to enabling the SYCL runtime I'm pretty sure there will be more divergence to handle between Intel's LLVM and upstream, but I'm keeping my sights on just OpenCL for now.
@nwnk, to try and keep discussions relevant to the build interface proposed in this RFC, I plan to answer most of your questions about implementation details in #2019, and in this comment I want to check that we have addressed your concerns with regard to this RFC.
From my understanding of your comment, it sounds like we would need to enable multi-vendor builds. This is different from the current model, as most frameworks integrating with oneDNN create a new library per device vendor. There is definitely a discussion to be had here. To me, while a generic vendor does raise the question of which implementation to use for a device matching multiple vendors, it seems like a simple priority list suffices. Because of this, I do not see any issues with the proposed interface, though future work will be required.
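For illustration only (this selection mechanism does not exist in oneDNN today, and the vendor names below are placeholders), such a priority list might amount to nothing more than:

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical build-time priority: vendor-specific implementations are
// preferred, and the generic vendor comes last as a functional fallback.
static const std::vector<std::string> vendor_priority
        = {"intel", "nvidia", "amd", "generic"};

// Pick the highest-priority vendor among those that can serve the device.
std::optional<std::string> pick_vendor(
        const std::vector<std::string> &vendors_supporting_device) {
    for (const auto &preferred : vendor_priority)
        for (const auto &candidate : vendors_supporting_device)
            if (preferred == candidate) return preferred;
    return std::nullopt;
}
```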
We currently do not have a concept of multiple runtime implementations within oneDNN. This sounds like a new concept of a runtime vendor that I am not sure how best to integrate, and it likely requires another RFC to discuss. Overall, my understanding is that there is definitely more design/enablement work to support your use case, but that the specific proposal here is not problematic. Does this seem right to you?
Yes, I have no objection to this proposal; it seems like good progress towards a goal we all share. Thank you (all) for indulging my discussion here; it's been very helpful.
Thanks everyone for the discussion. The changes proposed in this RFC have been implemented and are available in main and targeting v3.6.
Link to rendered document