Fix derivative initialization of void functions in reverse mode #823
Conversation
clang-tidy review says "All clean, LGTM! 👍"
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           master     #823    +/-   ##
=========================================
  Coverage   94.97%   94.98%
=========================================
  Files          49       49
  Lines        7543     7553     +10
=========================================
+ Hits         7164     7174     +10
  Misses        379      379
```
Please do not merge yet, I'd like to see if I can add coverage of more cases (e.g. void foo(double *in, double *out);). I will let you know by tomorrow.

Edit: Since I'm thinking of changing the gradient function to include another argument for this specific case, it could be a separate PR. Let me know what you think.
Hi, @kchristin22. Thank you for your work. However, I'd like to express my doubts about whether the behavior of gradients in this PR is intuitive. When the original function has only one reference/pointer type parameter, this makes sense to some extent. Like in your example:

```cpp
void pointerArgOut(double* p) {
  *p *= *p;
}
```

and the generated gradient:

```cpp
void pointerArgOut_grad(double *p, clad::array_ref<double> _d_p) {
    *_d_p = 1;
    double _t0;
    _t0 = *p;
    *p *= *p;
    {
        *p = _t0;
        double _r_d0 = *_d_p;
        *_d_p -= _r_d0;
        *_d_p += _r_d0 * *p;
        *_d_p += *p * _r_d0;
    }
}
```

You basically consider *p an output of the function and seed its adjoint to 1. What is the meaning of the gradient in this case? Why would a user expect the gradient to be initialized to 1?
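For concreteness, here is a minimal sketch of how the generated gradient above might be invoked, assuming the usual clad::gradient/CladFunction::execute interface (the exact adjoint parameter type can vary between clad versions):

```cpp
#include "clad/Differentiator/Differentiator.h"
#include <cstdio>

void pointerArgOut(double* p) { *p *= *p; }

int main() {
  auto grad = clad::gradient(pointerArgOut);
  double p = 3.0;
  double d_p = 0.0;                // adjoint storage for *p
  grad.execute(&p, &d_p);          // with this PR, *_d_p is seeded to 1 inside the derivative
  std::printf("d_p = %f\n", d_p);  // d(p_out)/d(p_in) = 2 * p = 6 when the seed is 1
  return 0;
}
```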
Hi @PetroZarytskyi! My thinking was that when we want to differentiate a function with respect to an argument, e.g. a, then da/da = 1, which is represented in the code as _d_a. For the rest of the parameters, e.g. db/da, the derivative is 0, which is already performed as a step in the code.

Now for the more technical part: this addition has to be used properly. If the user wants to differentiate the function with respect to all of its parameters, then the assignments in the code should be independent; otherwise the user should have used something like the jacobian version in two different functions, I suppose. But I'm not sure whether "protecting" the user this way should be part of clad, as that would just be bad usage of the API (you can't expect a memset when you call malloc). Regarding your example, I think it falls into this category of what the user wants to achieve.

Without the initialization, the code produces a segmentation fault, so the user must initialize the derivative themselves before executing the gradient function. This PR aims to protect the user in case such an omission occurs, which is already guaranteed for functions with other return types. It is also an extra capability, since the user can compute two otherwise independent function derivatives in a single call.

Some may argue that you can always use a function with a return statement instead, so there is no need to improve anything here. However, CUDA kernels are void functions, so if we want to support kernel differentiation, this PR is worth it: it protects the user, and overall I believe it does more good than harm. That said, I'd like to dive into supporting the case of void foo(double *in, double *out) separately.
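To make the seeding argument concrete, here is a hand-written sketch (not clad-generated code; the function and variable names are purely illustrative) of a reverse pass for a void function with separate input and output pointers:

```cpp
#include <cstdio>

// Illustrative primal: the result is written through `out`, nothing is returned.
void square(double* out, double* in) { *out = (*in) * (*in); }

// Hand-written reverse pass. Seeding *d_out to 1 encodes d(*out)/d(*out) = 1,
// i.e. *out is the quantity being differentiated; *d_in accumulates from 0.
void square_pullback_sketch(double* out, double* in, double* d_out, double* d_in) {
  *d_out = 1.0;                   // what this PR would do automatically
  *d_in += *d_out * 2.0 * (*in);  // reverse of *out = (*in) * (*in)
}

int main() {
  double in = 3.0, out = 0.0, d_in = 0.0, d_out = 0.0;
  square(&out, &in);
  square_pullback_sketch(&out, &in, &d_out, &d_in);
  std::printf("d(out)/d(in) = %f\n", d_in);  // 2 * in = 6
  return 0;
}
```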
Hi, @kchristin22. I agree that it often makes sense to consider some parameters as output (like in your example with CUDA kernels). This does mean initializing the adjoint of the output parameter to 1. My biggest concern is what happens when the function has multiple parameters. In that case, we need to choose which parameter to consider the output (we can't just initialize all parameter adjoints to 1). This decision should be given to the user, and it's not obvious how to do that. Considering you can already achieve this by setting the adjoint to 1 by hand before passing it to the gradient, I'm not sure we should introduce new interfaces. At the very least, this is worth a bigger discussion. And yes, I think this concern is relevant to most of the use cases, even your CUDA kernel example.
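As a sketch of the workaround mentioned above (assuming the usual clad::gradient/execute calling convention; the function is illustrative and not from this PR), the user picks the output parameter by seeding its adjoint to 1 by hand:

```cpp
#include "clad/Differentiator/Differentiator.h"

// Illustrative two-pointer void function: clad cannot know whether *y or *x
// should be treated as "the output".
void axpy(double* y, double* x) { *y += 2.0 * (*x); }

int main() {
  auto grad = clad::gradient(axpy);
  double y = 1.0, x = 3.0;
  double d_y = 1.0;                  // chosen by hand: differentiate *y as the output
  double d_x = 0.0;
  grad.execute(&y, &x, &d_y, &d_x);  // d_x accumulates d(y_out)/d(x) = 2
  return 0;
}
```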
Hello! Yes, I can see how it can be problematic. The example I gave was meant to underline that more work needs to be done, as there are loopholes. @parth-07 also gave a very nice example of why initializing in the derived function is not always efficient. I had started working on a way to potentially support the case of void foo(double *in, double *out).
Hi @kchristin22, thank you for being so proactive in fixing issues. How are you passing the argument name to clad::gradient?
No need to thank me, I really enjoy it. This view of the changes really helps in pinpointing the additions. I basically add another argument in gradient that is initialized to 1. I have some ideas on how the correct derivation can be achieved, which I included in my proposal.
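For context, here is a small sketch of clad's existing parameter-selection syntax (the string argument names the parameters to differentiate with respect to); the extra argument discussed in this thread for selecting the output of a void function is a proposal, not current clad API:

```cpp
#include "clad/Differentiator/Differentiator.h"

double mul(double x, double y) { return x * y; }

int main() {
  // Differentiate only with respect to the named parameter "x".
  auto d_mul = clad::gradient(mul, "x");
  double d_x = 0.0;
  d_mul.execute(2.0, 3.0, &d_x);  // d_x = y = 3
  return 0;
}
```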
@kchristin22, @PetroZarytskyi, what is the fate of this PR?
@vgvassilev Even though the PR itself looks good, I'm not convinced we need this. I think changing the interface this way will only make it less consistent and more confusing.
@kchristin22, what is the fate of this PR?
Fixes #822.