
Fix derivative initialization of void functions in reverse mode #823

Closed

Conversation

kchristin22
Collaborator

Fixes #822.

Contributor

clang-tidy review says "All clean, LGTM! 👍"


codecov bot commented Mar 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.98%. Comparing base (d2df900) to head (0479442).
Report is 258 commits behind head on master.

@@           Coverage Diff           @@
##           master     #823   +/-   ##
=======================================
  Coverage   94.97%   94.98%           
=======================================
  Files          49       49           
  Lines        7543     7553   +10     
=======================================
+ Hits         7164     7174   +10     
  Misses        379      379           
Files with missing lines Coverage Δ
lib/Differentiator/ReverseModeVisitor.cpp 96.61% <100.00%> (+0.01%) ⬆️

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@kchristin22
Collaborator Author

kchristin22 commented Mar 14, 2024

Please do not merge yet; I'd like to see if I can add coverage for more cases (e.g. void foo(double *in, double *out);). I will let you know by tomorrow.

Edit: Since I'm thinking of changing the gradient function to include another argument for this specific case, it could be a separate PR. Let me know what you think.

@PetroZarytskyi
Collaborator

Hi, @kchristin22. Thank you for your work. However, I'd like to express my doubts about whether the behavior of gradients in this PR is intuitive. When the original function has only one reference/pointer-type parameter, this makes sense to some extent, as in your example:

void pointerArgOut(double* p) {
  *p *= *p;
}

void pointerArgOut_grad(double *p, clad::array_ref<double> _d_p) {
    * _d_p = 1;
    double _t0;
    _t0 = *p;
    *p *= *p;
    {
        *p = _t0;
        double _r_d0 = * _d_p;
        * _d_p -= _r_d0;
        * _d_p += _r_d0 * *p;
        * _d_p += *p * _r_d0;
    }
}

You basically consider *p an output parameter (i.e. *p represents the function value).
Even though I'm not sure such behavior should be default, let's consider the case when the number of parameters is bigger than 1:

void do_nothing(double x, double y, double z) {}

void do_nothing_grad(double x, double y, double z, clad::array_ref<double> _d_x, clad::array_ref<double> _d_y, clad::array_ref<double> _d_z) {
    * _d_x = 1;
    * _d_y = 1;
    * _d_z = 1;
}

What is the meaning of the gradient in this case? Why would a user expect the gradient to be {1, 1, 1}?
Mathematically, in this PR we consider "the value of a void function" to be the sum of its parameters. But why?
Could you give a bigger explanation of why the gradient of void functions should be computed this way?

@kchristin22
Collaborator Author

Hi @PetroZarytskyi!

So I was thinking that when we want to differentiate a function with respect to an argument, e.g. a, then da/da = 1, which is defined in the code as _d_a. For the rest of the parameters, e.g. db/da, the derivative would be 0, which is already performed as a step in the code.

Now for the more technical part: this addition should be used carefully. If the user wants to differentiate the function with respect to all the parameters, then the assignments in the code should be independent; otherwise the user should have used something like the jacobian version in two different functions, I suppose. But I'm not sure whether "protecting" the user this way should be part of clad, as the alternative is simply bad usage of this API (you can't expect a memset when you call malloc).

Regarding your example, I think it falls into this category of what the user wants to achieve. Without the initialization, the code produces a segmentation fault, so the user must initialize the derivative themselves before executing the gradient function. This PR aims to protect the user in case such an omission occurs, which is already guaranteed for functions with other return types. It also gives the user an extra capability: they can compute two otherwise independent derivatives in a single function.

One might argue that you can always use a function with a return statement instead, so there is no need to improve anything here. However, CUDA kernels are void functions, so if we want to support kernel differentiation, this PR is worth it: it protects the user and, overall, I believe it does more good than harm. That said, since I want to dive into supporting the case of void foo(double *in, double *out) for CUDA kernels, there may be safer and more sophisticated ways to add this initialization in the future. So I would really like your opinion on this.

@PetroZarytskyi
Collaborator

Hi, @kchristin22. I agree that it often makes sense to consider some parameters as output (like in your example with CUDA kernels), and this does mean initializing the adjoint of the output parameter to 1. My biggest concern is what happens when the function has multiple parameters. In that case, we need to choose which parameter to consider the output (we can't just initialize all parameter adjoints to 1). This decision should be left to the user, and it's not obvious how to do that. Considering you can already achieve this by setting the adjoint to 1 by hand before passing it to the gradient, I'm not sure we should introduce new interfaces. At least, this is worth a bigger discussion. And yes, I think this concern is relevant to most of the use cases; even your void foo example has two parameters, only one of which is output.

@kchristin22
Collaborator Author

Hello!

Yes, I can see how it can be problematic. The example I gave was meant to underline that more work needs to be done, as there are loopholes. @parth-07 also gave a very nice example of why initializing in the derived function is not always efficient.

I had started working on a way to potentially support the case of void foo(double *in, double *out) (branch). So far, I have included an extra argument in the API (clad::gradient) and stored its name in ReverseModeVisitor.h in order to adjust the differentiation accordingly. However, given our conversation and the one in the issue, I'm not sure whether continuing to work on this is worth it. What do you think?

@parth-07
Collaborator

Hi @kchristin22

Thank you for being so proactive in fixing issues.

> I had started working on a way to potentially support the case of void foo(double *in, double *out) (branch). So far, I have included an extra argument in the API (clad::gradient) and store its name in the ReverseModeVisitor.h in order to manipulate the derivation accordingly, though, due to our conversation and the one in the issue, I'm not sure if continuing working on this is worth it. What do you think?

How are you passing the argument name to clad::gradient? This would be very difficult to achieve because the adjoint types and the function signature would differ for different argument names, and we need to compute the function signature at compile time without any help from the Clang plugin infrastructure. Can you please briefly show how the user would use this extra argument of clad::gradient?

@kchristin22
Collaborator Author

No need to thank me, I really enjoy it.

This view of the changes really helps in pinpointing the additions. I basically add another argument to gradient that is initialized to nullptr and can only be assigned when the function to be differentiated is of void type. This argument is assigned to an expression stored as a class member of the request. When this request gets processed and the return argument has been specified, the function's parameters are scanned to ensure that the user gave a valid parameter name, and the name is stored in ReverseMode.
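To make the description above concrete, a call site for the proposed extra argument might look like the sketch below. This interface is a proposal under discussion in this thread, not an existing clad API; the third argument naming the output parameter is hypothetical:

```cpp
// Hypothetical interface sketch only -- not an existing clad API.
void foo(double *in, double *out) { *out = *in * *in; }

// The proposed extra argument (here "out") would name the parameter to
// treat as the function's output; it would default to nullptr and only
// be accepted when the differentiated function returns void.
auto foo_grad = clad::gradient(foo, "in, out", "out");
```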

I have some ideas on how the correct derivation will be achieved that I included in my proposal.

@vgvassilev
Owner

@kchristin22, @PetroZarytskyi, what is the fate of this PR?

@PetroZarytskyi
Collaborator

@vgvassilev Even though the PR itself looks good, I'm not convinced we need this. I think changing the interface this way will only make it less consistent and more confusing.

@vgvassilev
Owner

@kchristin22, what is the fate of this PR?


Successfully merging this pull request may close these issues.

Fix derivative initialization in void functions in reverse mode