-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Efficient AD for the simulation of gradients #504
Comments
EDIT: I replaced this comment completely since I found a better way to do this (this might make some of the following comments superflous) I think this might be the way to code this. By encoding the partial derivatives in the points passed to the kernel, it is possible easily mark the index and place to take the partial derivative in. And this way this information is stored in the location. So if you store all previously evaluated locations, you know in which location a derivative was also taken. using KernelFunctions: Kernel, KernelFunctions as KF
using OneHotArrays: OneHotVector
import ForwardDiff as FD
import LinearAlgebra as LA
function partial(fun, dim, partials=())
"""
Take the partial derivative of a function `fun` with input dimesion `dim`.
If partials=(i,j), then (∂ᵢ∂ⱼ fun) is returned.
"""
if !isnothing(local next = iterate(partials))
ii, state = next
return partial(
x -> FD.derivative(0) do dx
fun(x .+ dx * OneHotVector(ii, dim))
end,
dim,
Base.rest(partials, state),
)
end
return fun
end
function partial(k, dim; partials_x=(), partials_y=())
"""
Take the partial derivative of a function with two dim-dimensional inputs,
i.e. 2dim dimensional input
"""
local f(x,y) = partial(t -> k(t,y), dim, partials_x)(x)
return (x,y) -> partial(t -> f(x,t), dim, partials_y)(y)
end
struct Pt{Dim}
pos::AbstractArray # the actual position
partial
# if empty and passed to function f return f(pos), if partial = (i,j) then return ∂ᵢ∂ⱼf(pos)
# this allows existing functions to be subtyped for Pt
end
Pt(x;partial=()) = Pt{length(x)}(x, partial) # convenience constructor
for T in subtypes(Kernel)
# this allows the passing of Pt to kernels (which makes them differentiable by default)
(k::T)(x::Pt{Dim}, y::Pt{Dim}) where {Dim} = evaluate_(k, x, y)
(k::T)(x::Pt{Dim}, y) where {Dim} = evaluate_(k, x, Pt(y))
(k::T)(x, y::Pt{Dim}) where {Dim} = evaluate_(k, Pt(x), y)
end
function evaluate_(k::T, x::Pt{Dim}, y::Pt{Dim}) where {Dim, T<:Kernel}
""" unbox the partial instructions from Pt and apply them to k, and evaluat them at the positions of Pt"""
return partial(
k, Dim,
partials_x=x.partial, partials_y=y.partial
)(x.pos, y.pos)
end
k = KF.MaternKernel()
k([1],[1]) # Cov(Z(x), Z(y)) where x=[1], y=[1]
k([1], Pt([1], partial=(1,1))) # Cov(Z(x), ∂₁∂₁Z(y)) where x=[1], y=[1]
k(Pt([1], partial=1), [2]) # Cov(∂₁Z(x), Z(y)) where x=[1], y=[2]
k(Pt([1,2], partial=(1)), Pt([1,2], partial=2))# Cov(∂₁Z(x), ∂₂Z(y)) where x=[1,2], y=[1,2]
k = KF.SEKernel()
kappa(x) = exp(- LA.dot(x,x)/2)
g = partial(k,1, partials_x=2, partials_y=0)
g(1,1)
|
I don't follow completely - what are you after in the end? Derivatives of the kernel functions with respect to one or both arguments? IIRC many kernels in KernelFunctions already support AD - and importantly not only ForwardDiff but also reverse-mode backends such as ReverseDiff, Zygote, and (AFAIK to a lesser extent) Enzyme. Depending on the input- and output dimensions of the functions you try to differentiate, you might want to use forward-mode or reverse-mode (a rule of thumb when working with differential equations is e.g. to use ForwardDiff when working with < 100 equations and reverse-mode otherwise: https://docs.sciml.ai/SciMLSensitivity/stable/getting_started/#When-Should-You-Use-Forward-or-Reverse-Mode?). Regardless of your exact intentions, I recommend you have a look at https://docs.julialang.org/en/v1/manual/performance-tips if you want to improve the performance of your code. For instance, probably you would want to avoid structs with abstract fields (https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-fields-with-abstract-type). |
Let me try a different way to phrase this: Let So lets say I want to simulate how would that work? Well first of all this is not just sampling from So how do you implement multivariate kernels? Well this package made the design decision to encode this in the input and without loss of generality do not have multivariate kernels. So the position Does that clear things up? Now that I motivated So here we see that the entries of |
I don't particularly like (k::Kernel)(x::Pt, y::Pt) then it can not tell whether to use this, or (k::SimpleKernel)(x,y) so if I want all kernel variants to first go through the "turning it into (k::T)(x::Pt, y::Pt) where {T<: Kernel} and the above wasn't allowed so I tried the next best thing (which I am not happy with either) |
Regarding the forward diff vs. backward diff part: Backward diff would be more efficient since kernels have many input dimensions and only one output. But since we want to obtain This is the disadvantage of the "no multivariate kernels" policy. You prevent possible performance improvement from calculating |
Someone pointed out the CovarianceFunctions.jl package to me (https://github.com/SebastianAment/CovarianceFunctions.jl#gradient-kernels) which apparently implements something similar as well. There is also an issue about interoperability with KernelFunctions.jl (SebastianAment/CovarianceFunctions.jl#2) so I will try to understand this package and see if it is all I want |
Introduction: How would you simulate gradients
If you want to simulate the gradient of a random function$Z$ , it turns out that you simply need to take derivatives of the covariance funcion, as (w.l.o.g. $Z$ is centered)
And similarly
For stationary covariance functions$C(x,y) = C(x-y)$ this simplifies to
So if we consider the multivariate random function$T_1(x) = (Z(x), \nabla Z(x))$ , then its covariance kernel is given by
where we use$d=x-y$ and $C(d) = f\bigl(\frac{|d|^2}2\bigr)$ in the last equation.
Performance Considerations
In principle you could just directly apply Autodiff (AD) to any kernel$C$ to obtain $C_{T_1}$ , but since I heard that the expense of autodiff scales with the number of input arguments, this would be really wasteful when we only really need to differentiate a one dimensional function in the isotropic case.
Unfortunately the way that KernelFunctions implement length scales results in general kernel functions, so I am not completely sure how to tell the compiler, that "these derivatives are much simpler than they look".
One possibility might be, to add the Abstract types
IsotropicKernel
andStationaryKernel
and carry these types over when transformations do not violate them. Scaling would not, more general affine transformations would violate isotropy but not stationarity, etc. This could probably be done with type parameters.But even once that is implemented, how do you tell autodiff what to differentiate? I have seen the file
chainrules.jl
in this repository, so I thought I would ask if someone already knows how to implement this.Kernels for multiple outputs considerations
Since you implemented kernels for multiple outputs as an extension of the input space, the reuse of the derivative$f'(\frac{|d|}{2})$ also becomes more complicated.
Maybe this is all premature optimization, as the evaluation of the kernel is complexity wise in the shadow of the cholesky decomposition.
Extension: Simulate$n$ -order derivatives
In principle you could similarly simulate$T_n(x) = (Z(x), Z'(x), \dots, Z^{(n)}(x))$ . But AD is already underdeveloped for hessians, so I don't know how to get this to work. It might be possible to write custom logic for certain isotropic random functions, like with the squared exponential one.
What do you think? I handcrafted something for first order derivatives in a personal project, but for
KernelFunctions.jl
a more general approach is probably needed.The text was updated successfully, but these errors were encountered: