How to handle a log prob for reverse mode? #903

SteveBronder · 2021-05-18T18:39:35Z

SteveBronder
May 18, 2021
Maintainer

I'd like to make a separate log prob for reverse mode autodiff as I think several optimizations we've discussed for the future (the new var matrix type, reverse mode in the compiler, etc.) would be a lot easier to work with if we had one.

While working on the new matrix type for reverse mode autodiff, I find that I'm doing a lot of weird things to handle the fact that log_prob can take in types with scalars of double, var, fvar<double>, fvar<var>, etc. It would be nice to have a separate log_prob in the program, or somewhere, where we modify the UnsizedType.autodifftype definition to

and autodifftype = DataOnly | AutoDiffable | RevAutoDiff

and then for the new reverse mode log prob we just read the old one and promote any AutoDiffable to a RevAutoDiff.
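As a rough illustration of that promotion step, here is a minimal C++ sketch. The compiler itself is written in OCaml, so the enum and both function names below are hypothetical stand-ins rather than anything that exists in stanc3; the only idea taken from the post is that AutoDiffable tags become RevAutoDiff while DataOnly tags are left alone.

```cpp
#include <vector>

// Hypothetical C++ stand-in for the OCaml UnsizedType.autodifftype,
// extended with the proposed RevAutoDiff case.
enum class AutoDiffType { DataOnly, AutoDiffable, RevAutoDiff };

// Promote a single tag: only AutoDiffable changes, data stays data.
inline AutoDiffType promote_to_rev(AutoDiffType t) {
  return t == AutoDiffType::AutoDiffable ? AutoDiffType::RevAutoDiff : t;
}

// Promote every tag in a list, e.g. the argument tags of the old log_prob
// (the "list of tags" representation is an assumption for this sketch).
inline std::vector<AutoDiffType> promote_all(std::vector<AutoDiffType> tags) {
  for (auto& t : tags) {
    t = promote_to_rev(t);
  }
  return tags;
}
```

Applying promote_all to the tags of the existing log prob would give the tags for the reverse mode copy without touching the original.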

But I'm not sure where we should put this or when we should make it. My first thought is to put it right after the optimization pass of the transformed MIR. I want to make the new reverse mode log prob after the MIR optimizations / transforms so that, if we ever move statements across blocks in the future (like moving data from log_prob or generated_quantities to prepare_data), we won't have to deal with the weirdness of having two log_probs. Then, if we want to do any additional optimization passes after generating the new log prob with var types, we can do them there.

Any thoughts on this? If this seems like an okay idea with folks I might go for it today

SteveBronder · 2021-06-08T23:43:35Z

SteveBronder
Jun 8, 2021
Maintainer Author

Thinking about this more: as an example, say we want to write a matrix multiply of two autodiffable matrices.

parameters {
  matrix[N, N] A;
  matrix[N, N] B;
}
transformed parameters {
  matrix[N, N] C = A * B;
}

The C++ we want to generate is something of the form

  // place the matrices on the autodiff stack
  auto arena_A = stan::math::to_arena(A);
  auto arena_B = stan::math::to_arena(B);
  // Do the forward pass, promoting the value matrix multiply to var
  auto C = to_arena(promote_to_var(multiply(value_of(arena_A), value_of(arena_B))));
  // place a reverse pass callback on the callback stack
  // passing in the objects we need for adjoint calcs
  reverse_pass_callback(
    [arena_A, arena_B, C]() mutable {
      // Do the adjoint accumulation for A and B
      adjoint_of(arena_A) += multiply(adjoint_of(C), transpose(value_of(arena_B)));
      adjoint_of(arena_B) += multiply(transpose(value_of(arena_A)), adjoint_of(C));
    });

In a much more abstract form this can be written like

  // place the matrices on the autodiff stack
  A_Arena = StackAllocStmt A;
  B_Arena = StackAllocStmt B;
  // Do the forward pass, promoting the value matrix multiply to var
  Return = ForwardPass (A_Arena, B_Arena);
  // place a reverse pass callback on the callback stack
  // passing in the objects we need for adjoint calcs
  ReverseCallBack(
    [A_Arena, B_Arena, Return],
    Input1Adjoint,
    Input2Adjoint
  )
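To make the adjoint rule in the callback concrete, here is a self-contained toy version that does not depend on Stan Math. Everything in it (Mat, VarMat, CallbackStack, multiply_var, run_reverse_pass) is a hypothetical stand-in: shared pointers replace arena allocation, and std::function is used only to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Toy row-major matrix; a stand-in for an Eigen matrix.
struct Mat {
  std::size_t rows, cols;
  std::vector<double> d;
  Mat(std::size_t r, std::size_t c) : rows(r), cols(c), d(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return d[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return d[i * cols + j]; }
};

inline Mat multiply(const Mat& a, const Mat& b) {
  assert(a.cols == b.rows);
  Mat out(a.rows, b.cols);
  for (std::size_t i = 0; i < a.rows; ++i)
    for (std::size_t k = 0; k < a.cols; ++k)
      for (std::size_t j = 0; j < b.cols; ++j) out(i, j) += a(i, k) * b(k, j);
  return out;
}

inline Mat transpose(const Mat& a) {
  Mat out(a.cols, a.rows);
  for (std::size_t i = 0; i < a.rows; ++i)
    for (std::size_t j = 0; j < a.cols; ++j) out(j, i) = a(i, j);
  return out;
}

inline void add_to(Mat& lhs, const Mat& rhs) {
  assert(lhs.rows == rhs.rows && lhs.cols == rhs.cols);
  for (std::size_t n = 0; n < lhs.d.size(); ++n) lhs.d[n] += rhs.d[n];
}

// Toy "var matrix": values plus adjoints (adjoints start at zero).
struct VarMat {
  Mat val, adj;
  explicit VarMat(const Mat& v) : val(v), adj(v.rows, v.cols) {}
};
using VarPtr = std::shared_ptr<VarMat>;
using CallbackStack = std::vector<std::function<void()>>;

// Forward pass on values, then push one reverse pass callback.
inline VarPtr multiply_var(const VarPtr& A, const VarPtr& B, CallbackStack& stack) {
  auto C = std::make_shared<VarMat>(multiply(A->val, B->val));
  stack.push_back([A, B, C]() {
    add_to(A->adj, multiply(C->adj, transpose(B->val)));  // adj(A) += adj(C) * B^T
    add_to(B->adj, multiply(transpose(A->val), C->adj));  // adj(B) += A^T * adj(C)
  });
  return C;
}

// Run the callbacks last-in, first-out.
inline void run_reverse_pass(CallbackStack& stack) {
  for (auto it = stack.rbegin(); it != stack.rend(); ++it) (*it)();
  stack.clear();
}
```

To use it, build two VarMat inputs, call multiply_var, seed the adjoint of the result, and call run_reverse_pass; the input adjoints then hold the gradient contributions.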

I think for the functions themselves we would have a type RevFun that looks like

(**
 * Expresses base operations we will need later
 * We could probably do something clever to generate these from
 * the stan math signatures hash table
 *)
type ('lhs , 'rhs) Multiply = UFun (StanLib ("multiply", ...), ['lhs, 'rhs])
type ('t) Transpose = UFun (StanLib ("transpose", ...), ['t])


(**
 * For tagging whether we need the value or adjoint of an input
 *)
type ReversePair = Value | Adjoint

(**
 * Tags we can use later to deduce which input is needed in the adjoint calculation 
 *)
type AdjointArgs = 
   ReturnAdj of ReversePair 
   | FirstArg of ReversePair
   | SecondArg of ReversePair
   | ThirdArg  of ReversePair
 
(* The actual adjoint function, I think it can just be like this where it's a tag essentially*)
type ('expr) adjoint_fun = 'expr


(**
 * Function comprising the forward and reverse pass
 * Type is comprised of 
 * Name of the function
 * Return type
 * List of input argument types
 * List of functions for each adjoint calculation
 *)
type reverse_mode_function =
  string * UnsizedType.returntype * UnsizedType.t list * adjoint_fun list

The reverse_mode_function's UnsizedType.t list and adjoint_fun list need to be the same length.

One thing to note in the above, if the focus is just on functions implemented
in Stan math, the forward pass is simply the StanLib function with the same
name, but operating on the values of the vars.

With something like multiply we would have

type ('lhs_expr, 'rhs_expr) ReverseMultiply = UFun (reverse_mode_function ("multiply",
  UnsizedType.UMatrix,
  [UMatrix, UMatrix],
  [adjoint_fun (Multiply (ReturnAdj Adjoint, Transpose (SecondArg Value))),
   adjoint_fun (Multiply (Transpose (FirstArg Value), ReturnAdj Adjoint))]), ['lhs_expr, 'rhs_expr])

And with that I think we can do each of the steps for generating the C++ we want.

Validate that the lhs and rhs expression types match up with the function inputs
Evaluate the lhs and rhs into StackAllocStmt statements (in pseudocode, something like)

auto lhs_arena = to_arena(lhs);
auto rhs_arena = to_arena(rhs);

Call the forward pass and make the return type

auto C = to_arena(promote_to_var(multiply(value_of(lhs_arena), value_of(rhs_arena))));

Make the callback. We can get the arguments needed by taking the union of the AdjointArgs in the list of adjoint_funs. Since the Multiply and Transpose types are defined using Stan math library functions, I think we can parse those pretty easily. Then, for the inner types of each adjoint function's inputs, we would look up the name of the first or second argument and use the ReversePair inside the AdjointArgs to deduce whether the argument needs to be wrapped in value_of() or adjoint_of(). And I think with all that we can do

reverse_pass_callback(
  [arena_A, arena_B, C]() mutable {
    // Do the adjoint accumulation for A and B
    adjoint_of(arena_A) += multiply(adjoint_of(C), transpose(value_of(arena_B)));
    adjoint_of(arena_B) += multiply(transpose(value_of(arena_A)), adjoint_of(C));
  });
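The lookup-and-wrap step can be sketched as a small renderer that walks a tagged expression tree and emits the callback statements as strings. This is C++ rather than the compiler's OCaml, and Slot, Expr, arg, call, render, and render_callback_body are all hypothetical names chosen for this illustration; only the Value / Adjoint tagging and the value_of() / adjoint_of() wrapping come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class ReversePair { Value, Adjoint };
// Mirrors the AdjointArgs tags; the numeric value indexes into `names`.
enum class Slot { ReturnAdj = 0, FirstArg = 1, SecondArg = 2 };

// Tiny expression tree: a tagged argument, or a named call on children.
struct Expr {
  bool is_arg;
  Slot slot;         // used only when is_arg is true
  ReversePair part;  // used only when is_arg is true
  std::string fn;    // used only when is_arg is false
  std::vector<std::shared_ptr<Expr>> args;
};
using ExprPtr = std::shared_ptr<Expr>;

inline ExprPtr arg(Slot s, ReversePair p) {
  return std::make_shared<Expr>(Expr{true, s, p, "", {}});
}

inline ExprPtr call(std::string fn, std::vector<ExprPtr> args) {
  return std::make_shared<Expr>(
      Expr{false, Slot::ReturnAdj, ReversePair::Value, std::move(fn), std::move(args)});
}

// names[0] is the return object, names[1] and names[2] the two inputs.
inline std::string render(const ExprPtr& e, const std::vector<std::string>& names) {
  if (e->is_arg) {
    const std::string& n = names[static_cast<std::size_t>(e->slot)];
    return (e->part == ReversePair::Value ? "value_of(" : "adjoint_of(") + n + ")";
  }
  std::string out = e->fn + "(";
  for (std::size_t i = 0; i < e->args.size(); ++i) {
    if (i > 0) out += ", ";
    out += render(e->args[i], names);
  }
  return out + ")";
}

// One statement per input: adjoint_of(input_i) += <adjoint_fun_i>;
inline std::vector<std::string> render_callback_body(
    const std::vector<ExprPtr>& adjoint_funs, const std::vector<std::string>& names) {
  assert(adjoint_funs.size() + 1 == names.size());  // one adjoint_fun per input
  std::vector<std::string> stmts;
  for (std::size_t i = 0; i < adjoint_funs.size(); ++i) {
    stmts.push_back("adjoint_of(" + names[i + 1] + ") += "
                    + render(adjoint_funs[i], names) + ";");
  }
  return stmts;
}
```

Feeding it the two multiply adjoint trees with the names C, arena_A, and arena_B yields the two accumulation statements in the callback body.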

We can also look at the AutoDiff type of the input expression and cut out adjoint calculations for values that are DataOnly.
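A minimal sketch of that pruning, assuming the adjoint statements are already available as strings with one per input (the enum and drop_data_adjoints are hypothetical names for this illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the autodiff tag on each input expression.
enum class AutoDiffType { DataOnly, AutoDiffable };

// Keep only the adjoint statements whose target input is autodiffable.
// stmts[i] is assumed to be the accumulation statement for input i.
inline std::vector<std::string> drop_data_adjoints(
    const std::vector<std::string>& stmts,
    const std::vector<AutoDiffType>& input_types) {
  assert(stmts.size() == input_types.size());
  std::vector<std::string> kept;
  for (std::size_t i = 0; i < stmts.size(); ++i) {
    if (input_types[i] != AutoDiffType::DataOnly) {
      kept.push_back(stmts[i]);
    }
  }
  return kept;
}
```

Note that dropping the statement for a data input does not remove that input from the capture list: for multiply, the value of a data-only B is still needed to compute the adjoint of A.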

One thing that's nice about this is that it can be done in pieces. So, for instance, if cholesky_decompose() is not implemented in the compiler but multiply() is, we can do

Eigen::Matrix<var, -1, -1> A = cholesky_decompose(other_obj);
  // place the matrices on the autodiff stack
  auto arena_A = stan::math::to_arena(A);
  auto arena_B = stan::math::to_arena(B);
  // Do the forward pass, promoting the value matrix multiply to var
  auto C = to_arena(promote_to_var(multiply(value_of(arena_A), value_of(arena_B))));
  // place a reverse pass callback on the callback stack
  // passing in the objects we need for adjoint calcs
  reverse_pass_callback(
    [arena_A, arena_B, C]() mutable {
      // Do the adjoint accumulation for A and B
      adjoint_of(arena_A) += multiply(adjoint_of(C), transpose(value_of(arena_B)));
      adjoint_of(arena_B) += multiply(transpose(value_of(arena_A)), adjoint_of(C));
    });

and if we have multiple reverse mode functions next to each other we can put their forward and reverse passes together so we only call one reverse pass callback and calculate the adjoints for multiple functions at once.

One glaring hole I've thought of so far is how to handle temporaries. That is, when an argument is itself a function call, what do we name the thing assigned to the arena for code like the below?

matrix[N, M] SomeObj = multiply(add(X, Y), Z);

I haven't fully thought that through yet. We could just make up some hashes to name temporaries and pull out add(X, Y) as if the user wrote

matrix[N, M] hash_tmp = add(X, Y);
matrix[N, M] SomeObj = multiply(hash_tmp, Z);

Then we do the steps above to make the reverse mode passes for each statement.
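A small sketch of that hoisting step, in C++ for illustration. The Expr struct and hoist function are hypothetical, and a running counter (tmp_0, tmp_1, ...) is used for the names instead of hashes, which is an assumption made to keep the output predictable.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Minimal expression: a variable name (no args) or a function call.
// Assumption: calls with zero arguments do not occur in this sketch.
struct Expr {
  std::string name;
  std::vector<Expr> args;  // empty for a plain variable
};

// Flatten `e` so that every call only takes variable names as arguments.
// Nested calls are pulled out into statements "tmp_<k> = f(...);" that are
// appended to `out` in evaluation order. Returns the text standing for `e`:
// the call itself at the root, or the temporary's name for nested calls.
inline std::string hoist(const Expr& e, std::vector<std::string>& out,
                         bool is_root = true) {
  if (e.args.empty()) {
    return e.name;
  }
  std::string call = e.name + "(";
  for (std::size_t i = 0; i < e.args.size(); ++i) {
    if (i > 0) call += ", ";
    call += hoist(e.args[i], out, false);
  }
  call += ")";
  if (is_root) {
    return call;
  }
  const std::string tmp = "tmp_" + std::to_string(out.size());
  out.push_back(tmp + " = " + call + ";");
  return tmp;
}
```

For multiply(add(X, Y), Z) this emits one hoisted statement for add(X, Y) and returns a root call on the temporary, so each statement maps onto a single reverse mode function.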

0 replies

bob-carpenter · 2021-06-09T14:35:26Z

bob-carpenter
Jun 9, 2021
Maintainer

Is there a way to make something like this efficient?

reverse_pass_callback(
    [arena_A, arena_B, C]() mutable {
      // Do the adjoint accumulation for A and B
      adjoint_of(arena_A) += multiply(adjoint_of(C), transpose(value_of(arena_B)));
      adjoint_of(arena_B) += multiply(transpose(value_of(arena_A)), adjoint_of(C));
    });

It's similar to what I did in my C++ example code on the forums and in the AD Handbook. I never figured out a way to get a type for the callback that made it efficient to store or access.

3 replies

SteveBronder Jun 9, 2021
Maintainer Author

The callback only goes onto the callback stack. Is there anything in particular that's not optimal, given that we have to put these callbacks somewhere? I'm trying to brew a scheme with @t4c1 right now where we actually remove the virtual function from vari and just have the ChainableStack hold a vector of function pointers. I have the scheme working with set_zero_adjoint() but am still working on chain(). That would help a bit since it will slim down vari, but I don't have other ideas on how to make this more efficient :-/.

One thing that can offset this cost is that we can compose multiple reverse passes together. As a toy example, if we have

matrix[M, M] A;
matrix[M, M] B;
matrix[M, M] C = multiply(A, B);
matrix[M, M] D = multiply(C, B);

we can put both of those passes together into one reverse pass callback.

reverse_pass_callback(
    [arena_A, arena_B, C, D]() mutable {
      // D = C * B: accumulate into C and B first so adjoint_of(C) is complete
      adjoint_of(C) += multiply(adjoint_of(D), transpose(value_of(arena_B)));
      adjoint_of(arena_B) += multiply(transpose(value_of(C)), adjoint_of(D));
      // C = A * B: now propagate adjoint_of(C) back to A and B
      adjoint_of(arena_A) += multiply(adjoint_of(C), transpose(value_of(arena_B)));
      adjoint_of(arena_B) += multiply(transpose(value_of(arena_A)), adjoint_of(C));
    });

That might not be exactly the right form, but that is the scheme we can use to help offset the costs.
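One detail that matters when fusing is the order of the statements inside the single callback: the later operation has to push its adjoint back first, so that the intermediate's adjoint is complete before it is read. Here is a scalar stand-in for C = A * B; D = C * B that can be checked by hand (Grad and fused_gradient are hypothetical names, and scalars replace matrices purely to keep the arithmetic easy to verify).

```cpp
// Gradient of d with respect to the two inputs.
struct Grad {
  double dA;
  double dB;
};

// Scalar stand-in for C = A * B; D = C * B with one fused reverse callback.
inline Grad fused_gradient(double a, double b) {
  // Forward pass on values; only c is needed by the reverse pass.
  const double c = a * b;

  double adj_a = 0.0, adj_b = 0.0, adj_c = 0.0;
  const double adj_d = 1.0;  // seed: derivative of d with respect to itself

  // One callback covering both multiplies.
  auto reverse_pass = [&]() {
    // d = c * b goes first, so adj_c is complete before it is read below.
    adj_c += adj_d * b;
    adj_b += c * adj_d;
    // c = a * b
    adj_a += adj_c * b;
    adj_b += a * adj_c;
  };
  reverse_pass();
  return Grad{adj_a, adj_b};
}
```

Since d = a * b * b, the expected gradient is b * b with respect to a and 2 * a * b with respect to b, which gives a quick check of the ordering.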

bob-carpenter Jun 10, 2021
Maintainer

What's the type of the structure storing the reverse-pass callbacks? That's what I couldn't figure out how to do efficiently. The function types in C++11 that can hold arbitrary functions were super heavy when used with vector.

SteveBronder Jun 10, 2021
Maintainer Author

I think I see what you mean. Yeah, std::function is just massive, so we can't use that. The callbacks are stored in a class called callback_vari. The lambda is stored there and then called in the chain() method. @t4c1 might be able to clarify more, but I think as long as the lambda's members are trivially destructible, the lambda should be pretty much erased / inlined by the compiler.

There's a sort of toy example here that I think describes the idea I'm talking about. In the godbolt example below, the compiler is actually smart enough to know it doesn't have to pay for storing the lambda, and its cost doesn't even show up in blah's size.

https://godbolt.org/z/cxvf3T8z4
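For readers who don't want to open the link, the general pattern of storing a lambda as a typed member behind a small virtual base (instead of in a std::function) can be sketched as follows. This is not Stan Math's actual callback_vari: the names carry a _sketch suffix or are otherwise hypothetical, and std::unique_ptr stands in for arena allocation.

```cpp
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Minimal base so differently typed callbacks can share one stack.
struct vari_base {
  virtual void chain() = 0;
  virtual ~vari_base() = default;
};

// Stores the lambda by value as a member. The concrete type F is known here,
// so the call inside chain() is a direct call the compiler can inline.
template <typename F>
struct callback_vari_sketch final : vari_base {
  F rev_functor_;
  explicit callback_vari_sketch(F f) : rev_functor_(std::move(f)) {}
  void chain() override { rev_functor_(); }
};

// Assumption: heap-owned pointers here; an arena would be used in practice.
using callback_stack_t = std::vector<std::unique_ptr<vari_base>>;

// Wrap any callable and push it onto the stack.
template <typename F>
void push_callback(callback_stack_t& stack, F&& f) {
  using functor_t = typename std::decay<F>::type;
  stack.push_back(std::unique_ptr<vari_base>(
      new callback_vari_sketch<functor_t>(std::forward<F>(f))));
}

// Call chain() on every stored callback, newest first.
inline void run_reverse_pass(callback_stack_t& stack) {
  for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
    (*it)->chain();
  }
}
```

The stack only holds one pointer per callback, and each stored object is exactly as large as the lambda's captures plus the vtable pointer, so there is no per-callback std::function overhead.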
