rfc: dnncompat compatibility layer #2101

Open · wants to merge 1 commit into base: rfcs
Conversation

sgeor255 (Contributor):
Description

This is a proposal for dnncompat, a oneDNN compatibility layer.

Rendered document can be seen here.

sgeor255 requested a review from a team as a code owner on September 19, 2024.
The github-actions bot added the RFC (A design document) label on Sep 19, 2024.
mgouicem (Contributor) left a comment:


Hi @sgeor255 and thanks for the proposal.

The main question I have is about support for these compatibility APIs. These symbols are currently being deprecated or removed from cuDNN, for example, so this might push oneDNN to support features that are no longer supported by the very libraries we want to help users convert from.

Another question is about memory management. In cuDNN, the scratchpad is handled by the user and its size can be queried for specific primitives. However, oneDNN may require a scratchpad for some functionalities/platforms where cuDNN does not. Will we allow allocating scratchpad memory under dnncompat?

Another concern is that we might hit fundamental issues with training support, in particular workspace management between the forward and backward passes: oneDNN requires a workspace for some functionalities where cuDNN does not, and vice versa. If the compat API does not expose that to the user, how would we handle it?

I guess it is fine to limit the scope to inference, but in that case tools like SYCLomatic will inherit these limitations if they want to rely on the dnncompat APIs.

> direction. The supported data types initially will be `f32` and `f16`. In the future, support for operations, data types, and directions can be expanded based on the needs of SYCLomatic and the community.

Contributor:

I guess a fundamental question we should deal with is: how many versions of the legacy APIs should we consider, and for how many years should we maintain compatibility?
In particular:

  • most of the above-mentioned functionalities are actually deprecated in cuDNN (see the legacy API documentation and reference);
  • users might be relying on functionalities that are already no longer supported (e.g. see the list for cuDNN v9).

> The classes in `dnncompat` will be presented to users as opaque pointers, hiding implementation details and providing an API as close as possible to the vendor-specific libraries.
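For illustration, the opaque-pointer pattern described in this excerpt could look like the following sketch; every `dnncompat` name here is hypothetical, since the RFC does not fix the final API spelling:

```cpp
// Hypothetical dnncompat header sketch (illustrative names only).
// The pattern mirrors cuDNN: forward-declared opaque types,
// create/set/destroy functions, and status return codes, while the
// actual definitions (wrapping oneDNN engines, streams, and memory
// descriptors) stay inside the library.
namespace dnncompat {

struct Handle;           // opaque: could wrap a oneDNN engine + stream
struct TensorDescriptor; // opaque: could wrap a oneDNN memory descriptor

enum class Status { success, bad_param, not_supported };

Status create(Handle **handle);
Status destroy(Handle *handle);

Status createTensorDescriptor(TensorDescriptor **desc);
Status setTensor4dDescriptor(
        TensorDescriptor *desc, int n, int c, int h, int w);
Status destroyTensorDescriptor(TensorDescriptor *desc);

} // namespace dnncompat
```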

Contributor:

(random spot) A few comments here:

  • What is the plan for features that are not supported by oneDNN? Good examples are the NaN propagation flag or the convolution algorithms. You mention below that we should print warnings, but I could not find whether we return a non-successful status or move forward with execution. And if we move forward with execution, what values should be used?
  • What about scratchpad management? IIUC, temporary-buffer management is exposed to the user in cuDNN and might not map to how it works in oneDNN, e.g. what about the case where the underlying oneDNN primitive requires a scratchpad but the cuDNN symbol does not? Shall we transparently allocate that scratchpad? And if so, shall we notify the user?

sgeor255 (Author):

  • For cases where this would affect the accuracy of the output we could return a non-success status. For cases where it does not affect the accuracy we could print a warning and continue execution. It could be decided on a case-by-case basis.
  • That's a good question. There are two possible options; let me think about it and we can discuss which approach is better.

```cpp
    CONVOLUTION_BWD_FILTER_ALGO_AUTO,
    CONVOLUTION_BWD_FILTER_ALGO_DIRECT,
    CONVOLUTION_BWD_FILTER_ALGO_COUNT,
};
```
Contributor:

What should the dnncompat API do when facing those algorithms? In particular, what would SYCLomatic do if the user has some dispatching logic based on those enum values?

For example, what should we do with something like FindConvolutionForwardAlgorithm?

Contributor:

@zhimingwang36, can you please explain how you handle it in SYCLomatic?

sgeor255 (Author):

One possible suggestion for the first version of dnncompat: if an unsupported algorithm is used, report a warning that it is not supported in oneDNN and fall back to the auto algorithm.
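A hedged sketch of that fallback; the `normalize_fwd_algo` helper and the forward-algorithm enum values are hypothetical stand-ins, not part of the RFC text. Unsupported enum values degrade to the auto choice with a warning, so dispatching logic ported from cuDNN keeps running:

```cpp
// Hypothetical sketch of the proposed fallback: algorithm enum values
// with no oneDNN equivalent degrade to *_ALGO_AUTO with a warning.
#include <cstdio>

enum ConvolutionFwdAlgo {
    CONVOLUTION_FWD_ALGO_AUTO,
    CONVOLUTION_FWD_ALGO_DIRECT,
    CONVOLUTION_FWD_ALGO_FFT, // assumed unsupported: no oneDNN FFT path
    CONVOLUTION_FWD_ALGO_COUNT,
};

static ConvolutionFwdAlgo normalize_fwd_algo(ConvolutionFwdAlgo algo) {
    switch (algo) {
    case CONVOLUTION_FWD_ALGO_AUTO:
    case CONVOLUTION_FWD_ALGO_DIRECT:
        return algo; // supported as-is
    default:
        std::fprintf(stderr,
                "dnncompat warning: convolution algorithm %d is not "
                "supported by oneDNN, falling back to AUTO\n",
                static_cast<int>(algo));
        return CONVOLUTION_FWD_ALGO_AUTO;
    }
}
```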

> disable `NaN` propagation. However, in oneDNN, `NaN` values are always propagated by default, which could lead to different outcomes in the execution of similar operations. In the initial implementation a warning will be printed when `NaN` propagation is explicitly disabled.
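Read together with the author's earlier comment (warn and continue when accuracy is unaffected), the proposed behavior could look like this hypothetical sketch; the descriptor and setter names are illustrative, not part of the RFC:

```cpp
// Hypothetical sketch: oneDNN always propagates NaNs, so a request to
// disable propagation is recorded, a warning is printed, and the call
// still returns success so ported code keeps running.
#include <cstdio>

enum class NanPropagation { not_propagate_nan, propagate_nan };
enum class Status { success, bad_param };

struct ActivationDescriptor {
    NanPropagation nan_opt = NanPropagation::propagate_nan;
};

Status setActivationNanPropagation(
        ActivationDescriptor *desc, NanPropagation nan_opt) {
    if (!desc) return Status::bad_param;
    if (nan_opt == NanPropagation::not_propagate_nan)
        std::fprintf(stderr,
                "dnncompat warning: disabling NaN propagation is not "
                "supported by oneDNN; NaN values will be propagated\n");
    desc->nan_opt = nan_opt;
    return Status::success;
}
```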
Contributor:

But would we allow the code to run (i.e., return a success status)?

> applications.
>
> - cuDNN includes a specialized FFT convolution implementation, while this algorithm is not supported in oneDNN.
Contributor:

What should dnncompat's behavior be here? Print a warning and move forward with the direct convolution algorithm?

sgeor255 (Author):

Yes, and we can fall back to auto or direct.

vpirogov added and removed the RFC (A design document) label on Sep 20, 2024.
> convert their applications to `dnncompat` & oneDNN more easily, while also reducing potential integration issues.
>
> ## Scope
zhimingwang36 (Sep 23, 2024):

Any plan for the cudnn_frontend API?

sgeor255 (Author):

Any plans for the cudnn_frontend API will be covered in a separate RFC, to keep the discussions separate.

densamoilov (Contributor):

@zhimingwang36, can you please provide your input on how you handle the oneDNN/cuDNN scratchpad and workspace?

Below are the related questions from @mgouicem:

> Another question is about memory management. In cuDNN, the scratchpad is handled by the user and its size can be queried for specific primitives. However, oneDNN may require a scratchpad for some functionalities/platforms where cuDNN does not. Will we allow allocating scratchpad memory under dnncompat?
>
> Another concern is that we might hit fundamental issues with training support, in particular workspace management between the forward and backward passes: oneDNN requires a workspace for some functionalities where cuDNN does not, and vice versa. If the compat API does not expose that to the user, how would we handle it?

sgeor255 (Author):

> @zhimingwang36, can you please provide your input on how you handle the oneDNN/cuDNN scratchpad and workspace?

I have added suggestions about handling workspace and scratchpad memory to the RFC. @zhimingwang36 / others, please take a look.

intwanghao:

@sgeor255 In SYCLomatic, if some workspace or scratchpad memory is needed in cuDNN but not in oneDNN, SYCLomatic replaces the cuDNN query API call with 0. If some workspace or scratchpad memory is needed in oneDNN but not in cuDNN, that memory is allocated in the wrapper function.

sgeor255 (Author) commented Nov 19, 2024:

> @sgeor255 In SYCLomatic, if some workspace or scratchpad memory is needed in cuDNN but not in oneDNN, SYCLomatic replaces the cuDNN query API call with 0. If some workspace or scratchpad memory is needed in oneDNN but not in cuDNN, that memory is allocated in the wrapper function.

@intwanghao as described in the RFC, scratchpad memory can be handled internally by dnncompat as an implementation detail. dnncompat is a replacement for the helper-level functions and sits at the same level as those helpers. The approach you mentioned can be seen as one way to handle this.
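A hedged sketch of how such a wrapper could hide the oneDNN scratchpad from the compat-level caller; the `execute_with_internal_scratchpad` helper is hypothetical, while the `dnnl::` calls are the existing oneDNN C++ API, and the sketch assumes the primitive was created with the `dnnl::scratchpad_mode::user` attribute:

```cpp
// Hypothetical dnncompat-internal helper: allocate whatever scratchpad
// the underlying oneDNN primitive requires, so the caller never sees
// it (cuDNN code that queried a workspace size can keep getting 0).
#include <unordered_map>
#include "oneapi/dnnl/dnnl.hpp"

void execute_with_internal_scratchpad(
        const dnnl::primitive &prim, const dnnl::primitive_desc_base &pd,
        dnnl::stream &stream, std::unordered_map<int, dnnl::memory> args) {
    // Query the scratchpad requirement of this particular primitive.
    // This is only meaningful if the primitive descriptor was created
    // with the scratchpad_mode::user attribute.
    dnnl::memory::desc scratchpad_md = pd.scratchpad_md();
    dnnl::memory scratchpad; // stays empty if no scratchpad is needed
    if (scratchpad_md.get_size() > 0) {
        // Allocate internally instead of asking the user for a buffer.
        scratchpad = dnnl::memory(scratchpad_md, pd.get_engine());
        args.insert({DNNL_ARG_SCRATCHPAD, scratchpad});
    }
    prim.execute(stream, args);
}
```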
