rfc: dnncompat compatibility layer #2101
base: rfcs
Conversation
Hi @sgeor255 and thanks for the proposal.
The main question I have is about the support status of these compatibility APIs. These symbols are currently being deprecated or removed from cuDNN, for example, so this might push oneDNN to support features that are no longer supported by the very libraries we want to help conversion from.
Another point is memory management. In cuDNN, the scratchpad is handled by the user and can be queried for specific primitives. However, oneDNN may require a scratchpad for some functionalities/platforms where cuDNN does not. Will we allow allocating scratchpad memory under dnncompat?
A further concern is that we might hit fundamental issues with training support, in particular with respect to workspace management between forward and backward passes: oneDNN requires a workspace for some functionalities where cuDNN does not, and vice versa. If the compat API does not expose that to the user, how would we handle it?
I guess it is fine to limit the scope to inference, but in that case, tools like SYCLomatic will inherit these limitations if they want to rely on dnncompat APIs.
> direction. The supported data types initially will be `f32` and `f16`. In the
> future, the support for operations, data types & directions can be expanded
> based on the need of SYCLomatic and community.
I guess a fundamental question we should deal with is: how many versions of the legacy APIs should we consider? And for how many years should we maintain compatibility?
In particular,
- most of the abovementioned functionalities are actually deprecated in cuDNN (see the legacy API documentation and reference).
- users might be relying on functionalities that are already no longer supported (e.g. see the list for cuDNN v9).
> The classes in `dnncompat` will be presented to users as opaque pointers, hiding
> implementation details and providing API as close as possible to vendor-specific
> libraries.
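The opaque-pointer approach quoted above mirrors the cuDNN handle style. As an illustration only (dnncompat is still a proposal, so every name below is hypothetical, not a real API), such a layer might expose a forward-declared struct behind a typedef and cuDNN-style integer status codes:

```cpp
#include <cassert>

// Public header would expose only the forward declaration + typedef;
// the struct definition stays in the implementation file.
struct dnncompatContext;
typedef dnncompatContext* dnncompatHandle_t;

// Internal definition, hidden from users of the compat API.
struct dnncompatContext {
    int device_id;
};

// cuDNN-style create/destroy pair returning an integer status (0 = success).
int dnncompatCreate(dnncompatHandle_t* handle) {
    if (!handle) return 1; // bad-param status
    *handle = new dnncompatContext{0};
    return 0;
}

int dnncompatDestroy(dnncompatHandle_t handle) {
    delete handle;
    return 0;
}
```

This keeps the ABI stable across implementation changes, which is the main benefit the RFC cites for staying close to the vendor libraries' conventions.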
(random spot) A few comments here:
- what is the plan for features that are not supported by oneDNN? Good examples are the NaN propagation flag or the convolution algorithms. You mention below that we should print warnings, but I could not find whether we return a non-successful status or move forward with execution. And if we move forward with execution, what values should be used?
- what about scratchpad management? IIUC, temporary buffer management is exposed in cuDNN and might not map to how it works in oneDNN. E.g., what about cases where the underlying oneDNN primitive requires a scratchpad but the cuDNN symbol does not? Shall we transparently allocate that scratchpad? And if so, shall we notify the user?
- For cases where this would affect the accuracy of the output, we could return a non-success status. For cases where it does not affect accuracy, we could print a warning and continue execution. It could be decided on a case-by-case basis.
- That's a good question. There are two possible options; let me think about it and we can discuss which approach is better.
>     CONVOLUTION_BWD_FILTER_ALGO_AUTO,
>     CONVOLUTION_BWD_FILTER_ALGO_DIRECT,
>     CONVOLUTION_BWD_FILTER_ALGO_COUNT,
> };
what should dnncompat API do when facing those algorithms?
In particular, what would SYCLomatic do if the user has some dispatching logic based on those enum values?
For example, what should we do with something like FindConvolutionForwardAlgorithm?
@zhimingwang36, can you please explain how you handle it in SYCLomatic?
One possible suggestion for the first version of dnncompat: if an unsupported algorithm is requested, report a warning that it is not supported in oneDNN and fall back to the auto algorithm.
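The warn-and-fall-back behavior suggested here could be sketched as a small mapping function. This is purely illustrative (the enum values and names below are hypothetical stand-ins, not the actual dnncompat or oneDNN API), showing how a cuDNN-style algorithm enum might be translated with an auto fallback for unsupported entries:

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical cuDNN-style forward-convolution algorithm enum.
enum ConvolutionFwdAlgo { ALGO_AUTO, ALGO_DIRECT, ALGO_WINOGRAD, ALGO_FFT };

// Hypothetical set of algorithms the oneDNN backend can honor.
enum BackendAlgo { BACKEND_AUTO, BACKEND_DIRECT, BACKEND_WINOGRAD };

// Map a requested algorithm to a backend choice; unsupported ones
// (e.g. FFT, which oneDNN lacks) warn and fall back to auto.
BackendAlgo map_algo(ConvolutionFwdAlgo a) {
    switch (a) {
        case ALGO_DIRECT:   return BACKEND_DIRECT;
        case ALGO_WINOGRAD: return BACKEND_WINOGRAD;
        case ALGO_FFT:
            std::fprintf(stderr,
                "dnncompat warning: requested algorithm not supported by "
                "oneDNN, falling back to auto\n");
            return BACKEND_AUTO;
        default:            return BACKEND_AUTO;
    }
}
```

Note this still leaves the question above open: user code that dispatches on the enum value returned by something like FindConvolutionForwardAlgorithm would observe the fallback choice, not its original request.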
> disable `NaN` propagation. However, in oneDNN, `NaN` values are always
> propagated by default, which could lead to different outcomes in the execution
> of similar operations. In the initial implementation a warning will be printed
> when `NaN` propagation is explicitly disabled.
but would we allow the code to run (so return a success status)?
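One way to read the RFC's "print a warning" proposal, sketched below under the assumption that execution does continue with a success status (that is exactly the open question here; all names are illustrative, not real dnncompat API):

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical cuDNN-style NaN propagation flag and status codes.
enum NanPropagation { PROPAGATE_NAN, NOT_PROPAGATE_NAN };
enum Status { STATUS_SUCCESS, STATUS_BAD_PARAM };

// Sketch: disabling NaN propagation warns but does not fail, since
// oneDNN always propagates NaNs and the setting cannot be honored.
Status set_nan_propagation(NanPropagation mode) {
    if (mode == NOT_PROPAGATE_NAN) {
        std::fprintf(stderr,
            "dnncompat warning: oneDNN always propagates NaN values; "
            "NOT_PROPAGATE_NAN is ignored\n");
    }
    return STATUS_SUCCESS; // execution proceeds either way
}
```

The alternative reading, returning `STATUS_BAD_PARAM`, would be safer for numerical reproducibility but would break more ported applications at runtime.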
> applications.
>
> - cuDNN includes a specialized FFT convolution implementation, while this
>   algorithm is not supported in oneDNN.
what should dnncompat behavior be here? print a warning and move forward with direct convolution algorithm?
Yes, and we can fall back to auto or direct.
> convert their applications to `dnncompat` & oneDNN more easily, while also
> reducing potential integration issues.
>
> ## Scope
any plan on cudnn_frontend API?
For the cudnn_frontend API there will be a separate RFC, to keep the discussions separate.
@zhimingwang36, can you please provide your input on how you handle oneDNN/cuDNN scratchpad and workspace? Below are related questions from @mgouicem:
Force-pushed from 5ab968a to 3a9baec
Force-pushed from 3a9baec to 2e65cfd
I have added suggestions about handling workspace and scratchpad memory to the RFC. @zhimingwang36 / others please take a look.
@sgeor255 In SYCLomatic, if some workspace or scratchpad memory is needed in cuDNN but not in oneDNN, SYCLomatic replaces the cuDNN query API call with 0. If some workspace or scratchpad memory is needed in oneDNN but not in cuDNN, that memory is allocated in the wrapper function.
@intwanghao As described in the RFC, scratchpad memory can be handled internally by dnncompat as an implementation detail. dnncompat is a replacement for the helper-level functions and sits at the same level as those helpers. The approach you mentioned can be seen as one way to handle this.
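The two-sided pattern described in this exchange could be sketched as follows. This is only an illustration of the idea under discussion (the function names are hypothetical, not SYCLomatic or dnncompat API): the cuDNN-style workspace-size query reports 0 when oneDNN needs no user-visible workspace, while any scratchpad the oneDNN backend does need is allocated and freed entirely inside the wrapper:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Case 1: cuDNN would report a workspace size, but oneDNN needs none.
// The migrated query simply reports zero bytes to the caller.
std::size_t query_workspace_size() {
    return 0;
}

// Case 2: oneDNN needs a scratchpad that the cuDNN API never exposed.
// The wrapper allocates it internally, invisible to the caller; the
// byte count is returned here only so the sketch is observable.
std::size_t run_with_internal_scratchpad(std::size_t scratchpad_bytes) {
    std::vector<unsigned char> scratchpad(scratchpad_bytes);
    // ... execute the oneDNN primitive using scratchpad.data() ...
    return scratchpad.size(); // freed when the wrapper returns
}
```

The trade-off of the internal-allocation side is the one raised earlier in the thread: the user loses control over (and visibility into) a potentially large temporary allocation.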
Description
This is a proposal for dnncompat - oneDNN compatibility layer.
Rendered document can be seen here.