rfc: dnncompat compatibility layer #2101

Open · wants to merge 1 commit into base: rfcs
Conversation

sgeor255 (Contributor):
Description

This is a proposal for dnncompat, a oneDNN compatibility layer.

Rendered document can be seen here.

sgeor255 requested a review from a team as a code owner on September 19, 2024.
The github-actions bot added the RFC (A design document) label on Sep 19, 2024.
mgouicem (Contributor) left a comment:


Hi @sgeor255 and thanks for the proposal.

The main question I have is about support for these compatibility APIs. These symbols are currently being deprecated or removed from cuDNN, for example, so this might push oneDNN to support features that are no longer supported by the very libraries we want to help users convert from.

Another question is about memory management. In cuDNN, the scratchpad is handled by the user and its size can be queried for specific primitives. However, oneDNN may require a scratchpad for some functionalities/platforms where cuDNN does not. Will we allow allocating scratchpad memory under dnncompat?

Another concern is that we might hit fundamental issues with training support, in particular workspace management between the forward and backward passes: oneDNN requires a workspace for some functionalities where cuDNN does not, and vice versa. If the compat API does not expose that to the user, how would we handle it?

I guess it is fine to limit the scope to inference, but in that case tools like SYCLomatic will inherit these limitations if they want to rely on the dnncompat APIs.

> direction. The supported data types initially will be `f32` and `f16`. In the future, support for operations, data types, and directions can be expanded based on the needs of SYCLomatic and the community.

Contributor:

I guess a fundamental question we should deal with is: how many versions of the legacy APIs should we consider, and for how many years should we maintain compatibility?
In particular:

  • most of the above-mentioned functionalities are actually deprecated in cuDNN (see the legacy API documentation and reference);
  • users might be relying on functionalities that are already no longer supported (e.g. see the list for cuDNN v9).

> The classes in `dnncompat` will be presented to users as opaque pointers, hiding implementation details and providing an API as close as possible to the vendor-specific libraries.
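For illustration, the opaque-pointer pattern described in this excerpt could look like the following sketch; every `dnncompat` name here is hypothetical, since the RFC does not fix the final API spelling:

```cpp
// Hypothetical dnncompat header sketch (illustrative names only).
// The pattern mirrors cuDNN: forward-declared opaque types,
// create/set/destroy functions, and status return codes, while the
// actual definitions (wrapping oneDNN engines, streams, and memory
// descriptors) stay inside the library.
namespace dnncompat {

struct Handle;           // opaque: could wrap a oneDNN engine + stream
struct TensorDescriptor; // opaque: could wrap a oneDNN memory descriptor

enum class Status { success, bad_param, not_supported };

Status create(Handle **handle);
Status destroy(Handle *handle);

Status createTensorDescriptor(TensorDescriptor **desc);
Status setTensor4dDescriptor(
        TensorDescriptor *desc, int n, int c, int h, int w);
Status destroyTensorDescriptor(TensorDescriptor *desc);

} // namespace dnncompat
```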

Contributor:

(random spot) A few comments here:

  • What is the plan for features that are not supported by oneDNN? Good examples are the NaN propagation flag or the convolution algorithms. You mention below that we should print warnings, but I could not find whether we return a non-successful status or move forward with execution. And if we move forward with execution, what values should be used?
  • What about scratchpad management? IIUC, temporary-buffer management is exposed to the user in cuDNN and might not map to how it works in oneDNN, e.g. what about the case where the underlying oneDNN primitive requires a scratchpad but the cuDNN symbol does not? Shall we transparently allocate that scratchpad? And if so, shall we notify the user?

sgeor255 (Author):

  • For cases where this would affect the accuracy of the output we could return a non-success status. For cases where it does not affect the accuracy we could print a warning and continue execution. It could be decided on a case-by-case basis.
  • That's a good question. There are two possible options; let me think about it and we can discuss which approach is better.

```cpp
    CONVOLUTION_BWD_FILTER_ALGO_AUTO,
    CONVOLUTION_BWD_FILTER_ALGO_DIRECT,
    CONVOLUTION_BWD_FILTER_ALGO_COUNT,
};
```
Contributor:

What should the dnncompat API do when facing those algorithms? In particular, what would SYCLomatic do if the user has some dispatching logic based on those enum values?

For example, what should we do with something like FindConvolutionForwardAlgorithm?

Contributor:

@zhimingwang36, can you please explain how you handle it in SYCLomatic?

sgeor255 (Author):

One possible suggestion for the first version of dnncompat: if an unsupported algorithm is used, report a warning that it is not supported in oneDNN and fall back to the auto algorithm.
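A hedged sketch of that fallback; the `normalize_fwd_algo` helper and the forward-algorithm enum values are hypothetical stand-ins, not part of the RFC text. Unsupported enum values degrade to the auto choice with a warning, so dispatching logic ported from cuDNN keeps running:

```cpp
// Hypothetical sketch of the proposed fallback: algorithm enum values
// with no oneDNN equivalent degrade to *_ALGO_AUTO with a warning.
#include <cstdio>

enum ConvolutionFwdAlgo {
    CONVOLUTION_FWD_ALGO_AUTO,
    CONVOLUTION_FWD_ALGO_DIRECT,
    CONVOLUTION_FWD_ALGO_FFT, // assumed unsupported: no oneDNN FFT path
    CONVOLUTION_FWD_ALGO_COUNT,
};

static ConvolutionFwdAlgo normalize_fwd_algo(ConvolutionFwdAlgo algo) {
    switch (algo) {
    case CONVOLUTION_FWD_ALGO_AUTO:
    case CONVOLUTION_FWD_ALGO_DIRECT:
        return algo; // supported as-is
    default:
        std::fprintf(stderr,
                "dnncompat warning: convolution algorithm %d is not "
                "supported by oneDNN, falling back to AUTO\n",
                static_cast<int>(algo));
        return CONVOLUTION_FWD_ALGO_AUTO;
    }
}
```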

> disable `NaN` propagation. However, in oneDNN, `NaN` values are always propagated by default, which could lead to different outcomes in the execution of similar operations. In the initial implementation a warning will be printed when `NaN` propagation is explicitly disabled.
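Read together with the author's earlier comment (warn and continue when accuracy is unaffected), the proposed behavior could look like this hypothetical sketch; the descriptor and setter names are illustrative, not part of the RFC:

```cpp
// Hypothetical sketch: oneDNN always propagates NaNs, so a request to
// disable propagation is recorded, a warning is printed, and the call
// still returns success so ported code keeps running.
#include <cstdio>

enum class NanPropagation { not_propagate_nan, propagate_nan };
enum class Status { success, bad_param };

struct ActivationDescriptor {
    NanPropagation nan_opt = NanPropagation::propagate_nan;
};

Status setActivationNanPropagation(
        ActivationDescriptor *desc, NanPropagation nan_opt) {
    if (!desc) return Status::bad_param;
    if (nan_opt == NanPropagation::not_propagate_nan)
        std::fprintf(stderr,
                "dnncompat warning: disabling NaN propagation is not "
                "supported by oneDNN; NaN values will be propagated\n");
    desc->nan_opt = nan_opt;
    return Status::success;
}
```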
Contributor:

But would we allow the code to run (i.e., return a success status)?

> applications.
>
> - cuDNN includes a specialized FFT convolution implementation, while this algorithm is not supported in oneDNN.
Contributor:

What should dnncompat's behavior be here? Print a warning and move forward with the direct convolution algorithm?

sgeor255 (Author):

Yes, and we can fall back to auto or direct.

vpirogov added and removed the RFC (A design document) label on Sep 20, 2024.
> convert their applications to `dnncompat` & oneDNN more easily, while also reducing potential integration issues.
>
> ## Scope
zhimingwang36 (Sep 23, 2024):

Any plan for the cudnn_frontend API?

sgeor255 (Author):

Any plans for the cudnn_frontend API will be covered in a separate RFC, to keep the discussions separate.

densamoilov (Contributor):

@zhimingwang36, can you please provide your input on how you handle the oneDNN/cuDNN scratchpad and workspace?

Below are the related questions from @mgouicem:

> Another question is about memory management. In cuDNN, the scratchpad is handled by the user and its size can be queried for specific primitives. However, oneDNN may require a scratchpad for some functionalities/platforms where cuDNN does not. Will we allow allocating scratchpad memory under dnncompat?
>
> Another concern is that we might hit fundamental issues with training support, in particular workspace management between the forward and backward passes: oneDNN requires a workspace for some functionalities where cuDNN does not, and vice versa. If the compat API does not expose that to the user, how would we handle it?

sgeor255 (Author):

> @zhimingwang36, can you please provide your input on how you handle the oneDNN/cuDNN scratchpad and workspace?

I have added suggestions about handling workspace and scratchpad memory to the RFC. @zhimingwang36 / others, please take a look.

intwanghao:

@sgeor255 In SYCLomatic, if some workspace or scratchpad memory is needed in cuDNN but not in oneDNN, SYCLomatic replaces the cuDNN query API call with 0. If some workspace or scratchpad memory is needed in oneDNN but not in cuDNN, that memory is allocated in the wrapper function.

sgeor255 (Author) commented Nov 19, 2024:

> @sgeor255 In SYCLomatic, if some workspace or scratchpad memory is needed in cuDNN but not in oneDNN, SYCLomatic replaces the cuDNN query API call with 0. If some workspace or scratchpad memory is needed in oneDNN but not in cuDNN, that memory is allocated in the wrapper function.

@intwanghao as described in the RFC, scratchpad memory can be handled internally by dnncompat as an implementation detail. dnncompat is a replacement for the helper-level functions and sits at the same level as those helpers. The approach you mentioned can be seen as one way to handle this.
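A hedged sketch of how such a wrapper could hide the oneDNN scratchpad from the compat-level caller; the `execute_with_internal_scratchpad` helper is hypothetical, while the `dnnl::` calls are the existing oneDNN C++ API, and the sketch assumes the primitive was created with the `dnnl::scratchpad_mode::user` attribute:

```cpp
// Hypothetical dnncompat-internal helper: allocate whatever scratchpad
// the underlying oneDNN primitive requires, so the caller never sees
// it (cuDNN code that queried a workspace size can keep getting 0).
#include <unordered_map>
#include "oneapi/dnnl/dnnl.hpp"

void execute_with_internal_scratchpad(
        const dnnl::primitive &prim, const dnnl::primitive_desc_base &pd,
        dnnl::stream &stream, std::unordered_map<int, dnnl::memory> args) {
    // Query the scratchpad requirement of this particular primitive.
    // This is only meaningful if the primitive descriptor was created
    // with the scratchpad_mode::user attribute.
    dnnl::memory::desc scratchpad_md = pd.scratchpad_md();
    dnnl::memory scratchpad; // stays empty if no scratchpad is needed
    if (scratchpad_md.get_size() > 0) {
        // Allocate internally instead of asking the user for a buffer.
        scratchpad = dnnl::memory(scratchpad_md, pd.get_engine());
        args.insert({DNNL_ARG_SCRATCHPAD, scratchpad});
    }
    prim.execute(stream, args);
}
```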
