Expression Template

This page explains how mshadow works. The main trick behind mshadow is called Expression Template, we will explain how this technique is used in mshadow, and how it will affect the performance of compiled code. Expression template is the major trick behind the C++ matrix libraries such as Eigen, GSL.

How to write efficient C++/C update rule for machine learning

Before we start, let us think of the question above. Assume we want to write down the update rule

weight =  - eta * ( grad + lambda * weight )

Where weight and grad are a vector of length n. When you choose C++ as your programming language, I guess the major concern is efficiency. There is one principle that is important and used in most C/C++ programs:

Pre-allocate necessary memory, no temporal memory allocation during running.

An example code is like

void UpdateWeight (const float *grad, float eta, float lambda, int n, float *weight){
    for( int i = 0; i < n; ++i ){
        weight[i] =  - eta * ( grad[i] + lambda * weight[i] );
    }
}

The function takes the pre-allocated space grad, and weight, and run the calculation. Writing these functions are simple, however, it can be annoying when we write them repeatedly. So the question is, can we write as follows, and get same performance as previous code?

void UpdateWeight (const Vec& grad, float eta, float lambda, Vec& weight){
    weight =  - eta * ( grad + lambda * weight );
}

The answer is yes, but not by the most obvious solution.

A Naive Bad Solution

Let us first take a look at a most straight forward solution: operator overloading.

// Naive solution for vector operation overloading 
struct Vec {
    int len;
    float* dptr;
    Vec (int len):len(len){ 
        dptr = new float[ len ];
    }
    Vec (const Vec& src):len(src.len){
        dptr = new float[len];
        memcpy( dptr, src.dptr, sizeof(float)*len ); 
    }
    ~Vec (void){ delete dptr; }
};

inline Vec operator+ (const Vec& lhs, const Vec& rhs){
    Vec res(lhs.len);
    for( int i = 0; i < lhs.len; ++i ){
        res.dptr[i] = lhs.dptr[i] + rhs.dptr[i]; 
    } 
    return res;
}

If we add more operator overloading in the same style, we can get the effect we want to write equations instead of loop. However, this kind of approach is inefficient, because temporal memory is allocated and de-allocated during each operation, while we could have do better.

An alternative, more effective way is only overload operator+=, operator-=, which can be implemented without temporal memory allocation. But this limits the equations we can write.

Lazy Evaluation

Let us think why we need temporal memory allocation when doing operator+. This because we do not know the target that will be assigned to in operator+, otherwise we could have directly storing into target memory instead of temporal memory.

What if we can know the target? The following code (example/exp-template/exp_lazy.cpp) achieves this.

// Example Lazy evaluation code
// for simplicity, we use struct and make all members public
#include <cstdio>
struct Vec;
// expression structure holds the expression
struct BinaryAddExp{
    const Vec& lhs;
    const Vec& rhs;
    BinaryAddExp(const Vec& lhs, const Vec& rhs):lhs(lhs),rhs(rhs){}
};
// no constructor and destructor to allocate and de-allocate memory, allocation done by user
struct Vec {
    int len;
    float* dptr;
    Vec (void){}
    Vec (float *dptr, int len):len(len),dptr(dptr){}
    // here is where evaluation happens
    inline Vec& operator= (const BinaryAddExp& src){
        for( int i = 0; i < len; ++i ){
            dptr[i] = src.lhs.dptr[i] + src.rhs.dptr[i];
        }
        return *this;
    }
};
// no evaluation happens here
inline BinaryAddExp operator+ (const Vec& lhs, const Vec& rhs){
    return BinaryAddExp(lhs, rhs);
}

const int n = 3; 
int main( void ){
    float sa[n]={1,2,3},sb[n]={2,3,4},sc[n]={3,4,5};
    Vec A(sa,n), B(sb,n), C(sc,n);
    // run expression
    A = B + C;
    for( int i = 0; i < n; ++ i ){
        printf("%d:%f==%f+%f\n", i, A.dptr[i], B.dptr[i], C.dptr[i] );
    }
    return 0;
}

The idea is that we do not actually do computation in operator+, but only return a expression structure (like abstract syntax tree), and when we overload operator=, we see the target, as well as all the operands, and we can run computation without introducing extra memory! Similarly, we can define a DotExp and lazily evaluate at operator=, and redirect matrix(vector) multiplications to BLAS.

More Lengthy Expressions and Expression Template

By using lazy evaluation, we are cool by avoiding temporal memory allocations. But the function of the code is limited:

We can only write A=B+C, how about more length expressions.
When we add more expression, we need to write more operator= to evaluate each equations.

Here is where the magic of template programming comes to rescue. The following code(example/exp-template/exp_template.cpp), which is a bit more lengthy, also allows you to write lengthy equations.

// Example code, expression template, and more length equations
// for simplicity, we use struct and make all members public

#include <cstdio>

// this is expression, all expressions must inheritate it, and put their type in subtype
template<typename SubType>
struct Exp{
    // returns const reference of the actual type of this expression
    inline const SubType& self(void) const{
        return *static_cast<const SubType*>(this);
    }
};

// binary add expression
// note how it is inheritates from Exp
// and put its own type into the template argument
template<typename TLhs, typename TRhs>
struct BinaryAddExp: public Exp< BinaryAddExp<TLhs,TRhs> >{
    const TLhs& lhs;
    const TRhs& rhs;
    BinaryAddExp(const TLhs& lhs, const TRhs& rhs):lhs(lhs),rhs(rhs){}
    // evaluation function, evaluate this expression at position i
    inline float Eval( int i ) const{
        return lhs.Eval(i) + rhs.Eval(i);
    }
};
// no constructor and destructor to allocate and de-allocate memory, allocation done by user
struct Vec: public Exp<Vec>{
    int len;
    float* dptr;
    Vec (void){}
    Vec (float *dptr, int len):len(len),dptr(dptr){}
    // here is where evaluation happens
    template<typename EType>
    inline Vec& operator= (const Exp<EType>& src_){
        const EType &src = src_.self();
        for( int i=0; i < len; ++i ){
            dptr[i] = src.Eval(i);
        }
        return *this;
    }    
    // evaluation function, evaluate this expression at position i
    inline float Eval( int i ) const{
        return dptr[i];
    }
};
// template add, works for any expressions
template<typename TLhs, typename TRhs>
inline BinaryAddExp<TLhs,TRhs> operator+ (const Exp<TLhs>& lhs, const Exp<TRhs>& rhs){
    return BinaryAddExp<TLhs,TRhs>(lhs.self(), rhs.self());
}

const int n = 3; 
int main( void ){
    float sa[n]={1,2,3},sb[n]={2,3,4},sc[n]={3,4,5};
    Vec A(sa,n), B(sb,n), C(sc,n);
    // run expression, this expression is longer:)
    A = B + C + C;
    for( int i = 0; i < n; ++ i ){
        printf("%d:%f==%f+%f+%f\n", i, A.dptr[i], B.dptr[i], C.dptr[i], C.dptr[i] );
    }
    return 0;
}

The key idea of the code is the template Exp<SubType> takes type of its derived class as template argument, so it can convert itself to the SubType via self(). BinaryAddExp now is a template class that can composite expressions together, like a template version of Composite pattern. The evaluation is done through function Eval, which is done in a recursive way in BinaryAddExp.

Due to inlining, the function calls of src.Eval(i) in operator= will be compiled into B.dptr[i]+C.dptr[i]+C.dptr[i] in compile time.
We can write equations for element-wise operations with same efficiency as if we write a loop

Summary

=

Provide feedback

Saved searches

Use saved searches to filter your results more quickly