From 2a598de11a3cf18e2f9e43d3bbb849e30003d9fa Mon Sep 17 00:00:00 2001
From: James Barbetti <James.Barbetti@anu.edu.au>
Date: Tue, 21 Jul 2020 09:09:38 +1000
Subject: [PATCH] Extensive changes aimed at speeding up tree loading &
 parameter optimisation (~2x) & tree search (~2x) by "serialising" their
 memory access patterns to improve spatial and temporal locality of reference.

1. PhyloTree alignment summaries may now be "borrowed" from another tree
   that has the same alignment (the relevant member is isSummaryBorrowed;
   if it is true, this instance doesn't own the summary, it is only
   "Borrowing" a reference to a summary owned by another tree).
2. PhyloTree member functions copyPhyloTree and copyPhyloTreeMixlen take an
   extra parameter indicating whether the copy is to "borrow" a copy of the
   alignment summary of the original (if it has one).  This matters a lot
   for ratefree.cpp and +R free rate models, and modelmixture.cpp!
   The temporary copies of the phylo tree that are used during parameter
   Optimization can now re-use the AlignmentSummary of the original; which
   means they can "linearise" their memory access to sites, when they are
   Optimising branch lengths (see changes listed below, e.g. #4, #5, #6, #7).
3. PhyloTree::setAlignment does its "name check" a different way (rather
   than finding each sequence by name by scanning the tree, if asks
   MTree::getMapOfTaxonNameToNode for a name to leaf node map, and checks
   the array of sequence names against the map (updating the id on the
   node for each hit).  The new approach is equivalent but is much faster,
   O(n.ln(n)) rather than O(n^2).  This speeds up tree loads markedly
   (particularly for large trees), but it matters most for free rate
   parameter optimization (on middling inputs this was a significant
   factor: about ~10% of parameter optimization time).  This can be turned
   off by changing the FAST_NAME_CHECK symbol.
4. IQTree::optimizeModelParameters now calls prepareToComputeDistances()
   (So that AlignmentSummary's matrix of converted sequences) will be
   available (to be borrowed, via calls to PhyloTree::copyPhyloTree (see
   change # 2 above, and changes #5 through #7 below).
   Likewise IQTree::doNNISearch (so changes #5, #8 help tree searches too).
5. AlignmentPairwise::computeFunction and  AlignmentPairwise::computeFuncDerv(
   can now make use of AlignmentSummary's "Matrix of converted sequences"
   (if it is available) via PhyloTree's accessor methods,
   e.g. PhyloTree::getConvertedSequenceByNumber().
   For this to work as expected, it's necessary for callers to ask
   AlignmentSummary to construct that matrix *including* even sites where
   there is no variety at all (added the keepBoringSites parameter on the
   AlignmentSummary constructor for this).
6. RateMeyerDiscrete::computeFunction and RateMeyerDiscrete::computeFuncDerv
   likewise. And RateMeyerDiscrete::normalizeRates can make use of the "flat"
   frequency array exposed by PhyloTree::getConvertedSequenceFrequencies() too.
7. PhyloTree::computePartialLikelihoodGenericSIMD (in phylokernelnew.h)
   makes use of the matrix of converted sequences (if one is available), in
   about six (!) different places.  In terms of actual effect, this is the
   most important change in this commit, but it needs changes #1, #2, and #4
   committed too, if it is to have any effect.  This change speeds up both
   parameter optimisation and tree searching significantly.
8. As well as inv_eigenvectors, there is now an iv_eigenvectors_transposed
   (Using the transpose makes for some faster multiplications; see change #9
   listed below). ModelMarkov::calculateSquareMatrixTranspose is used to
   calculate the transpose of the inverse eigen vectors.  Unpleasant
   consequence: ModelMarkov::update_eigen_pointers has to take an extra
   parameter.  Keeping this additional member set correctly is the only
   Thing that forced changes to modelpomomixture.cpp (and .h), modelset.cpp,
   and modelsubst.h.
9. ModelMarkov::computeTransMatrix and ModelMarkov::computeTransDerv
   now use (a) calculateExponentOfScalarMultiply and
   (b) aTimesDiagonalBTimesTransposeOfC to calculate transition matrices
   (This is quite a bit faster than the Eigen code, since it doesn't
   bother to construct the diagonal matrix B.asDiagonal()...).
   (a) and (b) and the supporting functions, calculateHadamardProduct
   And dotProduct, are (for now) members of ModelMarkov.
10.Minor tweaks to vector processing code in phylokernelnew.h:
   (a) dotProductVec hand-unrolled treatment of the V array;
   (b) dotProductPairAdd treated the last item (in A and B) as the
       special case, when handling an odd number of items.
       Possibly the treatment of the AD and BD arrays should be
       hand-unrolled here, too, but I haven't tried that yet.
   (c) dotProductTriple (checking for odd uses & rather than %) (faster!)
11.The aligned_free free function (from phylotree.h ?!) does the "pointer
   Null?" check itself, and (because it takes a T*& rather than a T*),
   can itself set the pointer to nullptr.  This means that client code that
   used to go... if (x) { aligned_free(x); x=NULL; } ... can now be
   simplified to just... aligned_free(x);
12.Next to it (in phylotree.h), there is now an ensure_aligned_allocated
   method.  That lets you replace code like ... this:
        if (!eigenvalues)
            eigenvalues = aligned_alloc<double>(num_states);
   With:
        ensure_aligned_allocated(eigenvalues,  num_states);
   which is,
   I reckon, more readable.
13.In many places where there was code of the form... if (x) { delete x; }
   I have replaced it with delete x (likewise delete [] x).  delete always
   checks for null (it's required to, that's in the C++ standards), and
   "Rolling your own check" merely devalues the check that delete will
   later do! I've made similar "don't bother to check for null" changes
   in some other files, that I haven't included in this commit (since there
   aren't any *material* changes to anything in those files).
---
 alignment/alignmentpairwise.cpp | 128 +++++++--
 alignment/alignmentsummary.cpp  |   8 +-
 alignment/alignmentsummary.h    |   2 +-
 main/phyloanalysis.cpp          |   4 +-
 model/modelfactorymixlen.cpp    |   5 +-
 model/modelmarkov.cpp           | 449 ++++++++++++++++++++++----------
 model/modelmarkov.h             |  31 ++-
 model/modelmixture.cpp          | 137 +++++-----
 model/modelpomomixture.cpp      |  34 ++-
 model/modelpomomixture.h        |   3 +-
 model/modelset.cpp              | 102 ++++----
 model/modelsubst.h              |   7 +-
 model/ratefree.cpp              |   6 +-
 model/ratemeyerdiscrete.cpp     | 156 +++++++----
 tree/iqtree.cpp                 |   7 +-
 tree/mtree.cpp                  |  23 +-
 tree/mtree.h                    |  11 +
 tree/phylokernelnew.h           | 332 +++++++++++++++--------
 tree/phylotree.cpp              | 248 +++++++++---------
 tree/phylotree.h                |  26 +-
 tree/phylotreesse.cpp           |  25 +-
 21 files changed, 1130 insertions(+), 614 deletions(-)

diff --git a/alignment/alignmentpairwise.cpp b/alignment/alignmentpairwise.cpp
index f3ea23bcb..bfa692475 100644
--- a/alignment/alignmentpairwise.cpp
+++ b/alignment/alignmentpairwise.cpp
@@ -97,19 +97,22 @@ void AlignmentPairwise::setSequenceNumbers(int seq1, int seq2) {
     memset(pair_freq, 0, sizeof(double)*total_size);
     if (tree->hasMatrixOfConvertedSequences()
          && rate->getPtnCat(0) < 0 ) {
-        auto sequence1      = tree->getConvertedSequenceByNumber(seq1);
-        auto sequence2      = tree->getConvertedSequenceByNumber(seq2);
-        auto frequencies    = tree->getConvertedSequenceFrequencies();
+        auto sequence1        = tree->getConvertedSequenceByNumber(seq1);
+        auto sequence2        = tree->getConvertedSequenceByNumber(seq2);
+        auto frequencies      = tree->getConvertedSequenceFrequencies();
         size_t sequenceLength = tree->getConvertedSequenceLength();
         for (size_t i=0; i<sequenceLength; ++i) {
-            auto state1 = sequence1[i];
-            auto state2 = sequence2[i];
-            if ( state1 < num_states && state2 < num_states ) {
-                if ( state1 != STATE_UNKNOWN && state2 != STATE_UNKNOWN ) {
-                    pair_freq[state1*num_states + state2] += frequencies[i];
-                    //Todo: Would it be worth storing a multiplication table?!
-                    //      and using that instead of multiplying by num_states?
-                }
+            int state1 = sequence1[i];
+            if (num_states<=state1) {
+                continue;
+            }
+            auto pairRow = pair_freq + state1*num_states;
+            int  state2  = sequence2[i];
+            if (num_states<=state2) {
+                continue;
+            }
+            if ( state1 != STATE_UNKNOWN && state2 != STATE_UNKNOWN ) {
+                pairRow[state2] += frequencies[i];
             }
         }
         //Add back the cumulative frequencies for any sites
@@ -194,6 +197,36 @@ double AlignmentPairwise::computeFunction(double value) {
     int nptn = tree->aln->getNPattern();
     double lh = 0.0;
 
+    if (tree->hasMatrixOfConvertedSequences()) {
+        auto sequence1        = tree->getConvertedSequenceByNumber(seq_id1);
+        auto sequence2        = tree->getConvertedSequenceByNumber(seq_id2);
+        auto frequencies      = tree->getConvertedSequenceFrequencies();
+        size_t sequenceLength = tree->getConvertedSequenceLength();
+        
+        if (site_rate->isSiteSpecificRate()) {
+            for (int i = 0; i < sequenceLength; i++) {
+                int state1 = sequence1[i];
+                int state2 = sequence2[i];
+                if (state1 >= num_states || state2 >= num_states) {
+                    continue;
+                }
+                double trans = tree->getModelFactory()->computeTrans(value * site_rate->getPtnRate(i), state1, state2);
+                lh -= log(trans) * frequencies[i];
+            }
+            return lh;
+        } else if (tree->getModel()->isSiteSpecificModel()) {
+            for (int i = 0; i < nptn; i++) {
+                int state1 = sequence1[i];
+                int state2 = sequence2[i];
+                if (state1 >= num_states || state2 >= num_states) {
+                    continue;
+                }
+                double trans = tree->getModelFactory()->computeTrans(value * site_rate->getPtnRate(i), state1, state2);
+                lh -= log(trans) * frequencies[i];
+            }
+            return lh;
+        }
+    }
     // site-specific rates
     if (site_rate->isSiteSpecificRate()) {
         for (int i = 0; i < nptn; i++) {
@@ -205,7 +238,6 @@ double AlignmentPairwise::computeFunction(double value) {
         }
         return lh;
     }
-
     if (tree->getModel()->isSiteSpecificModel()) {
         for (int i = 0; i < nptn; i++) {
             int state1 = tree->aln->at(i)[seq_id1];
@@ -259,35 +291,86 @@ void AlignmentPairwise::computeFuncDerv(double value, double &df, double &ddf) {
     int nptn = tree->aln->getNPattern();
     df = 0.0;
     ddf = 0.0;
+    
+    auto sequence1        = tree->getConvertedSequenceByNumber(seq_id1);
+    auto sequence2        = tree->getConvertedSequenceByNumber(seq_id2);
+    auto frequencies      = tree->getConvertedSequenceFrequencies();
+    size_t sequenceLength = tree->getConvertedSequenceLength();
+    if (sequenceLength!=nptn) {
+        sequence1 = sequence2 = nullptr;
+        frequencies = nullptr;
+    }
 
     if (site_rate->isSiteSpecificRate()) {
         for (int i = 0; i < nptn; i++) {
-            int state1 = tree->aln->at(i)[seq_id1];
-            int state2 = tree->aln->at(i)[seq_id2];
-            if (state1 >= num_states || state2 >= num_states) continue;
+            int state1;
+            if (sequence1!=nullptr) {
+                state1 = sequence1[i];
+            } else {
+                state1 = tree->aln->at(i)[seq_id1];
+            }
+            if (num_states<=state1) {
+                continue;
+            }
+            int state2;
+            if (sequence2!=nullptr) {
+                state2 = sequence2[i];
+            } else {
+                state2 = tree->aln->at(i)[seq_id2];
+            }
+            if (num_states<=state2) {
+                continue;
+            }
             double rate_val = site_rate->getPtnRate(i);
             double rate_sqr = rate_val * rate_val;
             double derv1, derv2;
             double trans = tree->getModelFactory()->computeTrans(value * rate_val, state1, state2, derv1, derv2);
             double d1 = derv1 / trans;
-            df -= rate_val * d1 * tree->aln->at(i).frequency;
-            ddf -= rate_sqr * (derv2/trans - d1*d1) * tree->aln->at(i).frequency;
+            double freq;
+            if (frequencies!=nullptr) {
+                freq = frequencies[i];
+            }  else {
+                freq = tree->aln->at(i).frequency;
+            }
+            df -= rate_val * d1 * freq;
+            ddf -= rate_sqr * (derv2/trans - d1*d1) * freq;
         }
         return;
     }
 
     if (tree->getModel()->isSiteSpecificModel()) {
         for (int i = 0; i < nptn; i++) {
-            int state1 = tree->aln->at(i)[seq_id1];
-            int state2 = tree->aln->at(i)[seq_id2];
-            if (state1 >= num_states || state2 >= num_states) continue;
+            int state1;
+            if (sequence1!=nullptr) {
+                state1 = sequence1[i]; }
+            else {
+                state1 = tree->aln->at(i)[seq_id1];
+            }
+            if (num_states<=state1) {
+                continue;
+            }
+            int state2;
+            if (sequence2!=nullptr) {
+                state2 = sequence2[i];
+            } else {
+                state2 = tree->aln->at(i)[seq_id2];
+            }
+            if (num_states<=state2) {
+                continue;
+            }
             double rate_val = site_rate->getPtnRate(i);
             double rate_sqr = rate_val * rate_val;
             double derv1, derv2;
             double trans = tree->getModel()->computeTrans(value * rate_val,model->getPtnModelID(i), state1, state2, derv1, derv2);
             double d1 = derv1 / trans;
-            df -= rate_val * d1 * tree->aln->at(i).frequency;
-            ddf -= rate_sqr * (derv2/trans - d1*d1) * tree->aln->at(i).frequency;
+            double freq;
+            if (frequencies!=nullptr) {
+                freq = frequencies[i];
+            }  else {
+                freq = tree->aln->at(i).frequency;
+            }
+            df -= rate_val * d1 * freq;
+            ddf -= rate_sqr * (derv2/trans - d1*d1) * freq;
         }
         return;
     }
@@ -386,7 +469,6 @@ double AlignmentPairwise::recomputeDist
     ( int seq1, int seq2, double initial_dist, double &d2l ) {
     //Only called when -experimental has been passed
     if (initial_dist == 0.0) {
-        //ZORK
         if (tree->hasMatrixOfConvertedSequences()) {
             int distance    = 0;
             int denominator = 0;
diff --git a/alignment/alignmentsummary.cpp b/alignment/alignmentsummary.cpp
index ae7b0684a..8136e60c4 100644
--- a/alignment/alignmentsummary.cpp
+++ b/alignment/alignmentsummary.cpp
@@ -8,7 +8,9 @@
 #include "alignment.h"
 #include "alignmentsummary.h"
 
-AlignmentSummary::AlignmentSummary(const Alignment* a, bool keepConstSites) {
+AlignmentSummary::AlignmentSummary(const Alignment* a
+                                   , bool keepConstSites
+                                   , bool keepBoringSites) {
     alignment      = a;
     sequenceMatrix = nullptr;
     sequenceCount  = a->getNSeq();
@@ -56,14 +58,14 @@ AlignmentSummary::AlignmentSummary(const Alignment* a, bool keepConstSites) {
         s.maxState  = maxStateForSite;
     }
     sequenceLength = 0; //Number sites where there's some variability
-    std::map<int, int> map = stateToSumOfConstantSiteFrequencies;
+    std::map<int, int> & map = stateToSumOfConstantSiteFrequencies;
     for (size_t site=0; site<siteCount; ++site) {
         SiteSummary &s      = sites[site];
         bool alreadyCounted = false;
         totalFrequency     += s.frequency;
         totalFrequencyOfNonConstSites += s.isConst ? 0 : s.frequency;
         if ( keepConstSites || !s.isConst) {
-            if ( 0 < s.frequency && s.minState < s.maxState) {
+            if ( keepBoringSites || ( 0 < s.frequency && s.minState < s.maxState) ) {
                 alreadyCounted = true;
                 if (sequenceLength==0) {
                     minState = s.minState;
diff --git a/alignment/alignmentsummary.h b/alignment/alignmentsummary.h
index 4dfc620c4..8ec937e01 100644
--- a/alignment/alignmentsummary.h
+++ b/alignment/alignmentsummary.h
@@ -21,7 +21,7 @@ class Alignment;
 struct AlignmentSummary
 {
 public:
-    AlignmentSummary(const Alignment* a, bool keepConstSites);
+    AlignmentSummary(const Alignment* a, bool keepConstSites, bool keepBoringSites);
     ~AlignmentSummary();
     const Alignment*   alignment;
     std::vector<int>   siteNumbers;      //of sites with variation
diff --git a/main/phyloanalysis.cpp b/main/phyloanalysis.cpp
index c8cfe3612..4aa67bdfd 100644
--- a/main/phyloanalysis.cpp
+++ b/main/phyloanalysis.cpp
@@ -1637,7 +1637,7 @@ void computeMLDist ( Params& params, IQTree& iqtree
     longest_dist = iqtree.computeDist(params, iqtree.aln, ml_dist, ml_var);
     cout << "Computing ML distances took "
         << (getRealTime() - begin_wallclock_time) << " sec (of wall-clock time) "
-        << (getCPUTime() - begin_cpu_time) << " sec(of CPU time)" << endl;
+        << (getCPUTime() - begin_cpu_time) << " sec (of CPU time)" << endl;
     size_t n = iqtree.aln->getNSeq();
     size_t nSquared = n*n;
     if ( iqtree.dist_matrix == nullptr ) {
@@ -2226,7 +2226,7 @@ void runTreeReconstruction(Params &params, IQTree* &iqtree) {
             iqtree_new->initializePLL(params);
         }
         iqtree_new->setParams(&params);
-        iqtree_new->copyPhyloTree(iqtree);
+        iqtree_new->copyPhyloTree(iqtree, false);
 
         // replace iqtree object
         delete iqtree;
diff --git a/model/modelfactorymixlen.cpp b/model/modelfactorymixlen.cpp
index 2350c12eb..799394a19 100644
--- a/model/modelfactorymixlen.cpp
+++ b/model/modelfactorymixlen.cpp
@@ -125,14 +125,11 @@ string ModelFactoryMixlen::sortClassesByTreeLength() {
                 ((ModelMarkov*)model->getMixtureClass(m))->setEigenvalues(&model->getEigenvalues()[m*num_states]);
                 ((ModelMarkov*)model->getMixtureClass(m))->setEigenvectors(&model->getEigenvectors()[m*num_states*num_states]);
                 ((ModelMarkov*)model->getMixtureClass(m))->setInverseEigenvectors(&model->getInverseEigenvectors()[m*num_states*num_states]);
+                ((ModelMarkov*)model->getMixtureClass(m))->setInverseEigenvectorsTransposed(&model->getInverseEigenvectorsTransposed()[m*num_states*num_states]);
             }
             model->decomposeRateMatrix();
-
-//            model->writeInfo(cout);
             site_rate->writeInfo(cout);
-
         }
-
         tree->clearAllPartialLH();
         ASSERT(fabs(score - tree->computeLikelihood()) < 0.1);
     }
diff --git a/model/modelmarkov.cpp b/model/modelmarkov.cpp
index b3e0b0d42..8e262a2d2 100644
--- a/model/modelmarkov.cpp
+++ b/model/modelmarkov.cpp
@@ -22,10 +22,13 @@
 #include <string.h>
 #include "modelliemarkov.h"
 #include "modelunrest.h"
+
 #include <Eigen/Eigenvalues>
 #include <unsupported/Eigen/MatrixFunctions>
 using namespace Eigen;
 
+#include <vectorclass/vectormath_exp.h>
+#include <vectorclass/vectorclass.h>
 
 /** number of squaring for scaling-squaring technique */
 //const int TimeSquare = 10;
@@ -44,14 +47,16 @@ ModelMarkov::ModelMarkov(PhyloTree *tree, bool reversible, bool adapt_tree)
     rates = NULL;
 
     // variables for reversible model
-    eigenvalues = eigenvectors = inv_eigenvectors = NULL;
+    eigenvalues = nullptr;
+    eigenvectors = nullptr;
+    inv_eigenvectors = nullptr;
+    inv_eigenvectors_transposed = nullptr;
     highest_freq_state = num_states-1;
     freq_type = FREQ_UNKNOWN;
     half_matrix = true;
     highest_freq_state = num_states-1;
 
     // variables for non-reversible model
-//    model_parameters = NULL;
     rate_matrix = NULL;
     eigenvalues_imag = NULL;
     ceval = cevec = cinv_evec = NULL;
@@ -68,9 +73,7 @@ ModelMarkov::ModelMarkov(PhyloTree *tree, bool reversible, bool adapt_tree)
 }
 
 void ModelMarkov::setReversible(bool reversible, bool adapt_tree) {
-    
     bool old_reversible = is_reversible;
-    
     is_reversible = reversible;
 
     if (reversible) {
@@ -78,24 +81,23 @@ void ModelMarkov::setReversible(bool reversible, bool adapt_tree) {
         int i;
         int nrate = getNumRateEntries();
 
-        if (rates)
-            delete [] rates;
+        delete [] rates;
         rates = new double[nrate];
-        for (i=0; i < nrate; i++)
+        for (i=0; i < nrate; i++) {
             rates[i] = 1.0;
-
-        if (!eigenvalues)
-            eigenvalues = aligned_alloc<double>(num_states);
-        if (!eigenvectors)
-            eigenvectors = aligned_alloc<double>(num_states*num_states);
-        if (!inv_eigenvectors)
-            inv_eigenvectors = aligned_alloc<double>(num_states*num_states);
-
+        }
+        size_t num_states_squared = num_states * num_states;
+        ensure_aligned_allocated(eigenvalues,  num_states);
+        ensure_aligned_allocated(eigenvectors, num_states_squared);
+        ensure_aligned_allocated(inv_eigenvectors, num_states_squared);
+        ensure_aligned_allocated(inv_eigenvectors_transposed, num_states_squared);
+        
         num_params = nrate - 1;
 
         if (adapt_tree && phylo_tree && phylo_tree->rooted) {
-            if (verbose_mode >= VB_MED)
+            if (verbose_mode >= VB_MED) {
                 cout << "Converting rooted to unrooted tree..." << endl;
+            }
             phylo_tree->convertToUnrooted();
         }
     } else {
@@ -109,24 +111,27 @@ void ModelMarkov::setReversible(bool reversible, bool adapt_tree) {
         // reallocate the mem spaces
         if (rates && old_reversible) {
             // copy old reversible rates into new non-reversible
-            int i, j, k;
-            for (i = 0, k = 0; i < num_states; i++)
-                for (j = i+1; j < num_states; j++, k++) {
+            for (int i = 0, k = 0; i < num_states; i++) {
+                for (int j = i+1; j < num_states; j++, k++) {
                     rate_matrix[i*num_states+j] = rates[k] * state_freq[j];
                     rate_matrix[j*num_states+i] = rates[k] * state_freq[i];
                 }
+            }
             delete [] rates;
             rates = new double[num_rates];
-            for (i = 0, k = 0; i < num_states; i++)
-                for (j = 0; j < num_states; j++)
+            int k = 0;
+            int pos = 0;
+            for (int i = 0; i < num_states; i++) {
+                for (int j = 0; j < num_states; j++, pos++) {
                     if (i!=j) {
-                        rates[k] = rate_matrix[i*num_states+j];
-                        k++;
+                        rates[k] = rate_matrix[pos];
+                        ++k;
                     }
+                }
+            }
             ASSERT(k == num_rates);
         } else {
-            if (rates)
-                delete [] rates;
+            delete [] rates;
             rates = new double [num_rates];
             memset(rates, 0, sizeof(double) * (num_rates));
         }
@@ -477,16 +482,23 @@ void ModelMarkov::computeTransMatrix(double time, double *trans_matrix, int mixt
 	/* compute P(t) */
 	double evol_time = time / total_num_subst;
 
-    VectorXd eval_exp(num_states);
-    ArrayXd eval = Map<ArrayXd,Aligned>(eigenvalues, num_states);
-    eval_exp = (eval*evol_time).exp().matrix();
-    Map<Matrix<double,Dynamic,Dynamic,RowMajor>,Aligned> evectors(eigenvectors, num_states, num_states);
-    Map<Matrix<double,Dynamic,Dynamic,RowMajor>,Aligned> inv_evectors(inv_eigenvectors, num_states, num_states);
-    MatrixXd res = evectors * eval_exp.asDiagonal() * inv_evectors;
-    Map<Matrix<double,Dynamic,Dynamic,RowMajor> >map_trans(trans_matrix,num_states,num_states);
-    map_trans = res;
-    return;
-    
+    if (Params::getInstance().experimental) {
+        double eval_exp[num_states];
+        calculateExponentOfScalarMultiply(eigenvalues, num_states, evol_time, eval_exp);
+        aTimesDiagonalBTimesTransposeOfC( eigenvectors, eval_exp
+                              , inv_eigenvectors_transposed, num_states, trans_matrix);
+        return;
+    } else {
+        VectorXd eval_exp(num_states);
+        ArrayXd eval = Map<ArrayXd,Aligned>(eigenvalues, num_states);
+        eval_exp = (eval*evol_time).exp().matrix();
+        Map<Matrix<double,Dynamic,Dynamic,RowMajor>,Aligned> evectors(eigenvectors, num_states, num_states);
+        Map<Matrix<double,Dynamic,Dynamic,RowMajor>,Aligned> inv_evectors(inv_eigenvectors, num_states, num_states);
+        MatrixXd res = evectors * eval_exp.asDiagonal() * inv_evectors;
+        Map<Matrix<double,Dynamic,Dynamic,RowMajor> >map_trans(trans_matrix,num_states,num_states);
+        map_trans = res;
+        return;
+    }
     /*
     double exptime[num_states];
 	int i, j, k;
@@ -543,23 +555,167 @@ double ModelMarkov::computeTrans(double time, int state1, int state2) {
 }
 
 double ModelMarkov::computeTrans(double time, int state1, int state2, double &derv1, double &derv2) {
-	double evol_time = time / total_num_subst;
-	int i;
+    double evol_time = time / total_num_subst;
+    double trans_prob = 0.0;
+    derv1 = derv2 = 0.0;
+    for (int i = 0; i < num_states; i++) {
+        double trans = eigenvectors[state1*num_states+i]
+        * inv_eigenvectors[i*num_states+state2]
+        * exp(evol_time * eigenvalues[i]);
+        double trans2 = trans * eigenvalues[i];
+        trans_prob += trans;
+        derv1 += trans2;
+        derv2 += trans2 * eigenvalues[i];
+    }
+    return trans_prob;
+}
 
-//	double *coeff_entry = eigen_coeff + ((state1*num_states+state2)*num_states);
-	double trans_prob = 0.0;
-	derv1 = derv2 = 0.0;
-	for (i = 0; i < num_states; i++) {
-		double trans = eigenvectors[state1*num_states+i] * inv_eigenvectors[i*num_states+state2] * exp(evol_time * eigenvalues[i]);
-		double trans2 = trans * eigenvalues[i];
-		trans_prob += trans;
-		derv1 += trans2;
-		derv2 += trans2 * eigenvalues[i];
-	}
-	return trans_prob;
+void ModelMarkov::calculateExponentOfScalarMultiply(const double* source, int size
+                                                    , double scalar, double* dest) {
+    if (size==4) {
+        Vec4d v;
+        v.load(source);
+        exp(v*scalar).store(dest);
+        return;
+    }
+    int remainder = size;
+    if (4<size) {
+        //Todo: Investigate. Worth unrolling?
+        Vec4d v;
+        int step  = Vec4d::size();
+        remainder = size & (step - 1);
+        const double* sourceStop = source + (size-remainder);
+        for (; source<sourceStop; source+=step, dest+=step) {
+            v.load(source);
+            exp(v*scalar).store(dest);
+        }
+    }
+    //Do the last few operations one at a time
+    for (; 0<remainder; --remainder, ++source, ++dest) {
+        *dest = exp(*source * scalar);
+    }
+}
+
+void ModelMarkov::calculateHadamardProduct(const double* first
+                                           , const double* second, int size
+                                           , double *dest) {
+    if (size==4) {
+        Vec4d a;
+        Vec4d b;
+        a.load(first);
+        b.load(second);
+        (a*b).store(dest);
+        return;
+    }
+    int offset = 0;
+    if (4<size) {
+        //Todo: Investigate. Worth unrolling?
+        Vec4d a, b;
+        int step  = Vec4d::size();
+        int remainder = size & (step - 1);
+        int integralSize = size - remainder;
+        for (; offset<integralSize; offset += step) {
+            a.load(first + offset);
+            b.load(second + offset);
+            (a*b).store(dest + offset);
+        }
+    }
+    for (; offset<size; ++offset) {
+        dest[offset] = first[offset] * second[offset];
+    }
+}
+
+double ModelMarkov::dotProduct(const double* first
+                            , const double* second, int size) {
+    if (size==4) {
+        Vec4d a;
+        Vec4d b;
+        a.load(first);
+        b.load(second);
+        return horizontal_add(a*b);
+    }
+    int    offset    = 0;
+    double product   = 0;
+    if (4<size) {
+        //Todo: Investigate. Worth unrolling?
+        Vec4d a, b, dot = 0;
+        int step  = Vec4d::size();
+        int remainder = size & (step - 1);
+        int integralSize = size - remainder;
+        for (; offset<integralSize; offset += step) {
+            a.load(first + offset);
+            b.load(second + offset);
+            dot = mul_add(a, b, dot);
+        }
+        product = horizontal_add(dot);
+    }
+    for (; offset<size; ++offset) {
+        product += first[offset] * second[offset];
+    }
+    return product;
 }
 
 
+void ModelMarkov::calculateSquareMatrixTranspose(const double* original, int rank, double* transpose) {
+    const double* originalEntry = original;
+    for (int i=0; i<rank; ++i) {
+        double* destEntry = transpose + i;
+        for (int j=0; j<rank; ++j ) {
+            *destEntry = *originalEntry;
+            ++originalEntry;   //next in column (or: first in next row)
+            destEntry += rank; //next in column
+        }
+    }
+}
+
+void ModelMarkov::aTimesDiagonalBTimesTransposeOfC(const double* matrixA, const double* rowB
+                                      , const double* matrixCTranspose, int rank,double* dest) {
+    if (rank==4) {
+        Vec4d B;
+        B.load(rowB);
+        
+        Vec4d row1, row2, row3, row4; //rows of matrixA * corresponding entries in rowB
+        row1.load_a (matrixA)    *= B;
+        row2.load_a (matrixA+4)  *= B;
+        row3.load_a (matrixA+8)  *= B;
+        row4.load_a (matrixA+12) *= B;
+        
+        Vec4d col1, col2, col3, col4; //rows of matrixCTranspose: columns of matrixC
+        col1.load_a ( matrixCTranspose );
+        col2.load_a ( matrixCTranspose + 4  );
+        col3.load_a ( matrixCTranspose + 8  );
+        col4.load_a ( matrixCTranspose + 12 );
+
+        dest[0]  = horizontal_add(row1*col1);
+        dest[1]  = horizontal_add(row1*col2);
+        dest[2]  = horizontal_add(row1*col3);
+        dest[3]  = horizontal_add(row1*col4);
+        dest[4]  = horizontal_add(row2*col1);
+        dest[5]  = horizontal_add(row2*col2);
+        dest[6]  = horizontal_add(row2*col3);
+        dest[7]  = horizontal_add(row2*col4);
+        dest[8]  = horizontal_add(row3*col1);
+        dest[9]  = horizontal_add(row3*col2);
+        dest[10] = horizontal_add(row3*col3);
+        dest[11] = horizontal_add(row3*col4);
+        dest[12] = horizontal_add(row4*col1);
+        dest[13] = horizontal_add(row4*col2);
+        dest[14] = horizontal_add(row4*col3);
+        dest[15] = horizontal_add(row4*col4);
+        return;
+    } else {
+        double scratch[rank];
+        for (int i=0; i<rank; ++i, matrixA+=rank) {
+            calculateHadamardProduct(matrixA, rowB, rank, scratch);
+            auto rowC = matrixCTranspose;
+            for (int j=0; j<rank; ++j, rowC+=rank) {
+                *dest = dotProduct(scratch, rowC, rank );
+                ++dest;
+            }
+        }
+    }
+}
+
 void ModelMarkov::computeTransDerv(double time, double *trans_matrix, 
 	double *trans_derv1, double *trans_derv2, int mixture)
 {
@@ -600,26 +756,47 @@ void ModelMarkov::computeTransDerv(double time, double *trans_matrix,
 
 	double evol_time = time / total_num_subst;
     
-    ArrayXd eval = Map<ArrayXd,Aligned>(eigenvalues, num_states);
-    ArrayXd eval_exp = (eval*evol_time).exp();
-    ArrayXd eval_exp_derv1 = eval_exp*eval;
-    ArrayXd eval_exp_derv2 = eval_exp_derv1*eval;
-
-    Map<Matrix<double,Dynamic,Dynamic,RowMajor>,Aligned> evectors(eigenvectors, num_states, num_states);
-    Map<Matrix<double,Dynamic,Dynamic,RowMajor>,Aligned> inv_evectors(inv_eigenvectors, num_states, num_states);
-    MatrixXd res = evectors * eval_exp.matrix().asDiagonal() * inv_evectors;
-    Map<Matrix<double,Dynamic,Dynamic,RowMajor> >map_trans(trans_matrix,num_states,num_states);
-    map_trans = res;
-
-    res = evectors * eval_exp_derv1.matrix().asDiagonal() * inv_evectors;
-    Map<Matrix<double,Dynamic,Dynamic,RowMajor> >map_derv1(trans_derv1,num_states,num_states);
-    map_derv1 = res;
-
-    res = evectors * eval_exp_derv2.matrix().asDiagonal() * inv_evectors;
-    Map<Matrix<double,Dynamic,Dynamic,RowMajor> >map_derv2(trans_derv2,num_states,num_states);
-    map_derv2 = res;
-
+    if (Params::getInstance().experimental) {
+        //James' version
+        double eval_exp[num_states];
+        calculateExponentOfScalarMultiply(eigenvalues, num_states, evol_time, eval_exp);
+        aTimesDiagonalBTimesTransposeOfC( eigenvectors, eval_exp
+                              , inv_eigenvectors_transposed, num_states, trans_matrix);
+
+        double eval_exp_derv1[num_states];
+        calculateHadamardProduct(eigenvalues, eval_exp, num_states, eval_exp_derv1);
+        aTimesDiagonalBTimesTransposeOfC( eigenvectors, eval_exp_derv1
+                              , inv_eigenvectors_transposed, num_states, trans_derv1);
+
+        double eval_exp_derv2[num_states];
+        calculateHadamardProduct(eigenvalues, eval_exp_derv1, num_states, eval_exp_derv2);
+        aTimesDiagonalBTimesTransposeOfC( eigenvectors, eval_exp_derv2
+                              , inv_eigenvectors_transposed, num_states, trans_derv2);
+    }
+    else
+    {
+        //EIGEN version
+        ArrayXd eval = Map<ArrayXd,Aligned>(eigenvalues, num_states);
+        ArrayXd eval_exp = (eval*evol_time).exp();
+        ArrayXd eval_exp_derv1 = eval_exp*eval;
+        ArrayXd eval_exp_derv2 = eval_exp_derv1*eval;
+        Map<Matrix<double,Dynamic,Dynamic,RowMajor>,Aligned> evectors(eigenvectors, num_states, num_states);
+        Map<Matrix<double,Dynamic,Dynamic,RowMajor>,Aligned> inv_evectors(inv_eigenvectors, num_states, num_states);
+        MatrixXd res = evectors * eval_exp.matrix().asDiagonal() * inv_evectors;
+        Map<Matrix<double,Dynamic,Dynamic,RowMajor> >map_trans(trans_matrix,num_states,num_states);
+        map_trans = res;
+        
+        res = evectors * eval_exp_derv1.matrix().asDiagonal() * inv_evectors;
+        Map<Matrix<double,Dynamic,Dynamic,RowMajor> >map_derv1(trans_derv1,num_states,num_states);
+        map_derv1 = res;
+        
+        res = evectors * eval_exp_derv2.matrix().asDiagonal() * inv_evectors;
+        Map<Matrix<double,Dynamic,Dynamic,RowMajor> >map_derv2(trans_derv2,num_states,num_states);
+        map_derv2 = res;
+    }
+    
     /*
+     //Flat version
 	double exptime[num_states];
 
 	for (i = 0; i < num_states; i++)
@@ -1081,6 +1258,8 @@ void ModelMarkov::decomposeRateMatrixNonrev() {
     
     if (phylo_tree->params->matrix_exp_technique == MET_EIGEN_DECOMPOSITION) {
         eigensystem_nonrev(rate_matrix, state_freq, eigenvalues, eigenvalues_imag, eigenvectors, inv_eigenvectors, num_states);
+        calculateSquareMatrixTranspose(inv_eigenvectors, num_states
+                                       , inv_eigenvectors_transposed);
         return;
     }
     
@@ -1168,11 +1347,13 @@ void ModelMarkov::decomposeRateMatrixNonrev() {
             }
         }
     }
-    
+    calculateSquareMatrixTranspose(inv_eigenvectors, num_states
+                                   , inv_eigenvectors_transposed);
+
     // sanity check
-//    MatrixXcd eval_diag = eval.asDiagonal();
-//    MatrixXd check = (inv_evec * mat * evec - eval_diag).cwiseAbs();
-//    ASSERT(check.maxCoeff() < 1e-4);
+    //    MatrixXcd eval_diag = eval.asDiagonal();
+    //    MatrixXd check = (inv_evec * mat * evec - eval_diag).cwiseAbs();
+    //    ASSERT(check.maxCoeff() < 1e-4);
 }
 
 void ModelMarkov::decomposeRateMatrix(){
@@ -1210,9 +1391,14 @@ void ModelMarkov::decomposeRateMatrix(){
 			eigenvectors[i*num_states+i] = state_freq[0]/state_freq[i];
 		}
 
-		for (i = 0; i < num_states; i++)
-			for (j = 0; j < num_states; j++)
-				inv_eigenvectors[i*num_states+j] = state_freq[j]*eigenvectors[j*num_states+i];
+        for (i = 0; i < num_states; i++) {
+            for (j = 0; j < num_states; j++) {
+				inv_eigenvectors[i*num_states+j]
+                    = state_freq[j]*eigenvectors[j*num_states+i];
+            }
+        }
+        calculateSquareMatrixTranspose(inv_eigenvectors, num_states
+                                       , inv_eigenvectors_transposed);
 		writeInfo(cout);
 		// sanity check
 		double *q = new double[num_states*num_states];
@@ -1231,9 +1417,7 @@ void ModelMarkov::decomposeRateMatrix(){
 		delete [] q;
         return;
 	}
-    
     auto technique = phylo_tree->params->matrix_exp_technique;
-    
     if (technique == MET_EIGEN3LIB_DECOMPOSITION) {
         // Use Eigen3 library for eigen decomposition of symmetric matrix
         int n = 0; // the number of states where freq is non-zero
@@ -1371,40 +1555,42 @@ void ModelMarkov::decomposeRateMatrix(){
                 }
             }
         }
+        calculateSquareMatrixTranspose(inv_eigenvectors, num_states
+                                       , inv_eigenvectors_transposed);
         return;
     }
     decomposeRateMatrixRev();
 }
 
 void ModelMarkov::decomposeRateMatrixRev() {
-
-    int i, j, k;
     // general reversible model
     double **rate_matrix = new double*[num_states];
 
-    for (i = 0; i < num_states; i++)
+    for (int i = 0; i < num_states; i++) {
         rate_matrix[i] = new double[num_states];
-
+    }
     if (half_matrix) {
-        for (i = 0, k = 0; i < num_states; i++) {
+        for (int i = 0, k = 0; i < num_states; i++) {
             rate_matrix[i][i] = 0.0;
-            for (j = i+1; j < num_states; j++, k++) {
+            for (int j = i+1; j < num_states; j++, k++) {
                 rate_matrix[i][j] = (state_freq[i] <= ZERO_FREQ || state_freq[j] <= ZERO_FREQ) ? 0 : rates[k];
                 rate_matrix[j][i] = rate_matrix[i][j];
             }
         }
     } else {
         // full matrix
-        for (i = 0; i < num_states; i++) {
+        for (int i = 0; i < num_states; i++) {
             memcpy(rate_matrix[i], &rates[i*num_states], num_states*sizeof(double));
             rate_matrix[i][i] = 0.0;
         }
     }
     /* eigensystem of 1 PAM rate matrix */
     eigensystem_sym(rate_matrix, state_freq, eigenvalues, eigenvectors, inv_eigenvectors, num_states);
-    //eigensystem(rate_matrix, state_freq, eigenvalues, eigenvectors, inv_eigenvectors, num_states);
-    for (i = num_states-1; i >= 0; i--)
+    calculateSquareMatrixTranspose(inv_eigenvectors, num_states
+                                   , inv_eigenvectors_transposed);
+    for (int i = num_states-1; i >= 0; i--) {
         delete [] rate_matrix[i];
+    }
     delete [] rate_matrix;
 }
 
@@ -1414,7 +1600,9 @@ void ModelMarkov::readRates(istream &in) throw(const char*, string) {
 	in >> str;
 	if (str == "equalrate") {
 		for (int i = 0; i < nrates; i++)
+        {
 			rates[i] = 1.0;
+        }
 	} else if (is_reversible ){
         // reversible model
 		try {
@@ -1624,27 +1812,18 @@ ModelMarkov::~ModelMarkov() {
 
 void ModelMarkov::freeMem()
 {
-    if (inv_eigenvectors)
-        aligned_free(inv_eigenvectors);
-    if (eigenvectors)
-        aligned_free(eigenvectors);
-    if (eigenvalues)
-        aligned_free(eigenvalues);
-
-	if (rates) delete [] rates;
-
-    if (cinv_evec)
-        aligned_free(cinv_evec);
-    if (cevec)
-        aligned_free(cevec);
-    if (ceval)
-        aligned_free(ceval);
-    if (eigenvalues_imag)
-        aligned_free(eigenvalues_imag);
-    if (rate_matrix)
-        aligned_free(rate_matrix);
-//    if (model_parameters)
-//        delete [] model_parameters;
+    aligned_free(inv_eigenvectors);
+    aligned_free(inv_eigenvectors_transposed);
+    aligned_free(eigenvectors);
+    aligned_free(eigenvalues);
+
+	delete [] rates;
+
+    aligned_free(cinv_evec);
+    aligned_free(cevec);
+    aligned_free(ceval);
+    aligned_free(eigenvalues_imag);
+    aligned_free(rate_matrix);
 }
 
 double *ModelMarkov::getEigenvalues() const
@@ -1661,31 +1840,34 @@ double* ModelMarkov::getInverseEigenvectors() const {
 	return inv_eigenvectors;
 }
 
-//void ModelGTR::setEigenCoeff(double *eigenCoeff)
-//{
-//    eigen_coeff = eigenCoeff;
-//}
+double* ModelMarkov::getInverseEigenvectorsTransposed() const {
+    return inv_eigenvectors_transposed;
+}
+
+void ModelMarkov::setEigenvalues(double *eigenValues)
+{
+    this->eigenvalues = eigenValues;
+}
 
-void ModelMarkov::setEigenvalues(double *eigenvalues)
+void ModelMarkov::setEigenvectors(double *eigenVectors)
 {
-    this->eigenvalues = eigenvalues;
+    this->eigenvectors = eigenVectors;
 }
 
-void ModelMarkov::setEigenvectors(double *eigenvectors)
+void ModelMarkov::setInverseEigenvectors(double *eigenV)
 {
-    this->eigenvectors = eigenvectors;
+    this->inv_eigenvectors = eigenV;
 }
 
-void ModelMarkov::setInverseEigenvectors(double *inv_eigenvectors)
+void ModelMarkov::setInverseEigenvectorsTransposed(double *eigenVTranspose)
 {
-    this->inv_eigenvectors = inv_eigenvectors;
+    this->inv_eigenvectors_transposed = eigenVTranspose;
 }
 
 /****************************************************/
 /*      NON-REVERSIBLE STUFFS                       */
 /****************************************************/
 
-
 void ModelMarkov::setRates() {
 	// I don't know the proper C++ way to handle this: got error if I didn't define something here.
 	ASSERT(0 && "setRates should only be called on subclass of ModelMarkov");
@@ -1711,11 +1893,13 @@ int ModelMarkov::get_num_states_total() {
   return num_states;
 }
 
-void ModelMarkov::update_eigen_pointers(double *eval, double *evec, double *inv_evec) {
-  eigenvalues = eval;
-  eigenvectors = evec;
-  inv_eigenvectors = inv_evec;
-  return;
+void ModelMarkov::update_eigen_pointers(double *eval, double *evec
+                                        , double *inv_evec, double *inv_evec_transposed) {
+    eigenvalues      = eval;
+    eigenvectors     = evec;
+    inv_eigenvectors = inv_evec;
+    inv_eigenvectors_transposed = inv_evec_transposed;
+    return;
 }
 
 void ModelMarkov::computeTransMatrixEigen(double time, double *trans_matrix) {
@@ -1723,10 +1907,9 @@ void ModelMarkov::computeTransMatrixEigen(double time, double *trans_matrix) {
 	double evol_time = time / total_num_subst;
     int nstates_2 = num_states*num_states;
 	double *exptime = new double[nstates_2];
-	int i, j, k;
 
     memset(exptime, 0, sizeof(double)*nstates_2);
-	for (i = 0; i < num_states; i++)
+    for (int i = 0; i < num_states; i++) {
         if (eigenvalues_imag[i] == 0.0) {
             exptime[i*num_states+i] = exp(evol_time * eigenvalues[i]);
         } else {
@@ -1739,25 +1922,24 @@ void ModelMarkov::computeTransMatrixEigen(double time, double *trans_matrix) {
             exptime[i*num_states+i] = exp_eval.real();
             exptime[i*num_states+i-1] = -exp_eval.imag();
         }
-
-
+    }
     // compute V * exp(L t)
-    for (i = 0; i < num_states; i++)
-        for (j = 0; j < num_states; j++) {
+    for (int i = 0; i < num_states; i++) {
+        for (int j = 0; j < num_states; j++) {
             double val = 0;
-            for (k = 0; k < num_states; k++)
+            for (int k = 0; k < num_states; k++)
                 val += eigenvectors[i*num_states+k] * exptime[k*num_states+j];
             trans_matrix[i*num_states+j] = val;
         }
-
+    }
     memcpy(exptime, trans_matrix, sizeof(double)*nstates_2);
 
     // then compute V * exp(L t) * V^{-1}
-    for (i = 0; i < num_states; i++) {
+    for (int i = 0; i < num_states; i++) {
         double row_sum = 0.0;
-        for (j = 0; j < num_states; j++) {
+        for (int j = 0; j < num_states; j++) {
             double val = 0;
-            for (k = 0; k < num_states; k++)
+            for (int k = 0; k < num_states; k++)
                 val += exptime[i*num_states+k] * inv_eigenvectors[k*num_states+j];
             // make sure that trans_matrix are non-negative
             ASSERT(val >= -0.001);
@@ -1767,7 +1949,6 @@ void ModelMarkov::computeTransMatrixEigen(double time, double *trans_matrix) {
         }
         ASSERT(fabs(row_sum-1.0) < 1e-4);
     }
-
     delete [] exptime;
 }
 
diff --git a/model/modelmarkov.h b/model/modelmarkov.h
index f27570d94..937105d65 100644
--- a/model/modelmarkov.h
+++ b/model/modelmarkov.h
@@ -357,6 +357,7 @@ class ModelMarkov : public ModelSubst, public EigenDecomposition
 
 	virtual double *getEigenvectors() const;
 	virtual double *getInverseEigenvectors() const;
+    virtual double *getInverseEigenvectorsTransposed() const;
 
 //	void setEigenCoeff(double *eigenCoeff);
 
@@ -366,6 +367,23 @@ class ModelMarkov : public ModelSubst, public EigenDecomposition
 
 	void setInverseEigenvectors(double *inv_eigenvectors);
 
+    void setInverseEigenvectorsTransposed(double *inv_eigenvectors);
+
+    static void calculateExponentOfScalarMultiply(const double* source, int size
+                                                  , double scalar, double* dest);
+    
+    static void calculateHadamardProduct(const double* first, const double* second
+                                         , int size, double *dest);
+    
+    static double dotProduct(const double* first, const double* second, int size);
+    
+    static void calculateSquareMatrixTranspose(const double* original, int rank
+                                               , double* transpose);
+    
+    static void aTimesDiagonalBTimesTransposeOfC
+        ( const double* matrixA, const double* rowB
+         , const double* matrixCTranspose, int rank
+         , double* dest);
     /**
      * compute the memory size for the model, can be large for site-specific models
      * @return memory size required in bytes
@@ -407,7 +425,8 @@ class ModelMarkov : public ModelSubst, public EigenDecomposition
   // need to be updated recursively, if the model is a mixture model. For a
   // normal Markov model, only the standard pointers are set. This was done in
   // `ModelMixture::initMem()` before.
-  virtual void update_eigen_pointers(double *eval, double *evec, double *inv_evec);
+  virtual void update_eigen_pointers(double *eval, double *evec
+                                     , double *inv_evec, double *inv_evec_transposed);
 
 
     /**
@@ -487,11 +506,11 @@ class ModelMarkov : public ModelSubst, public EigenDecomposition
 	*/
 	double *inv_eigenvectors;
 
-	/**
-		coefficient cache, served for fast computation of the P(t) matrix
-	*/
-//	double *eigen_coeff;
-
+    /**
+        transpose of the matrix of the inverse eigenvectors of the rate matrix Q
+    */
+    double *inv_eigenvectors_transposed;
+    
 	/** state with highest frequency, used when optimizing state frequencies +FO */
 	int highest_freq_state;
 
diff --git a/model/modelmixture.cpp b/model/modelmixture.cpp
index 9dcdf29fa..7aa10b567 100644
--- a/model/modelmixture.cpp
+++ b/model/modelmixture.cpp
@@ -1282,8 +1282,7 @@ void ModelMixture::initMixture(string orig_model_name, string model_name, string
 	full_name += CLOSE_BRACKET;
 
 	int nmixtures = size();
-	if (prop)
-		aligned_free(prop);
+    aligned_free(prop);
 	prop = aligned_alloc<double>(nmixtures);
 
 	double sum = 0.0;
@@ -1345,78 +1344,76 @@ void ModelMixture::initMixture(string orig_model_name, string model_name, string
 
     delete nxs_freq_optimize;
     delete nxs_freq_empirical;
-
 }
 
 void ModelMixture::initMem() {
-
-	int nmixtures = size();
-
-  // Calculate the total number of states and take into account that each of the
-  // models may be a mixture model itself (PoMo rate heterogeneity).
-
-  int num_states_total = 0;
-  for (iterator it = begin(); it != end(); it++)
-    num_states_total += (*it)->get_num_states_total();
-
-	if (eigenvalues) aligned_free(eigenvalues);
-	if (eigenvectors) aligned_free(eigenvectors);
-	if (inv_eigenvectors) aligned_free(inv_eigenvectors);
-  // if (eigen_coeff) aligned_free(eigen_coeff);
-
-	eigenvalues = aligned_alloc<double>(num_states_total*nmixtures);
-	eigenvectors = aligned_alloc<double>(num_states_total*num_states_total*nmixtures);
-	inv_eigenvectors = aligned_alloc<double>(num_states_total*num_states_total*nmixtures);
-  // int ncoeff = num_states_total*num_states_total*num_states_total;
-  // eigen_coeff = aligned_alloc<double>(ncoeff*nmixtures);
-
-	// assigning memory for individual models
-	int m = 0;
-  int count_num_states = 0;
-  int count_num_states_2 = 0;
-	for (iterator it = begin(); it != end(); it++, m++) {
-    int num_states_this_model = (*it)->get_num_states_total();
-    int num_states_this_model_2 = num_states_this_model * num_states_this_model;
-    // first copy memory for eigen stuffs
-    memcpy(&eigenvalues[count_num_states], (*it)->eigenvalues,
-           num_states_this_model*sizeof(double));
-    memcpy(&eigenvectors[count_num_states_2], (*it)->eigenvectors,
-           num_states_this_model_2*sizeof(double));
-    memcpy(&inv_eigenvectors[count_num_states_2], (*it)->inv_eigenvectors,
-           num_states_this_model_2*sizeof(double));
-    // memcpy(&eigen_coeff[m*ncoeff], (*it)->eigen_coeff, ncoeff*sizeof(double));
-
-    // then delete
-		if ((*it)->eigenvalues) aligned_free((*it)->eigenvalues);
-		if ((*it)->eigenvectors) aligned_free((*it)->eigenvectors);
-		if ((*it)->inv_eigenvectors) aligned_free((*it)->inv_eigenvectors);
-    // if ((*it)->eigen_coeff) aligned_free((*it)->eigen_coeff);
-
-    // And assign new memory. Also, recursively, update respective pointers for
-    // the mixture components of the current model. This is relevant if the
-    // current model is a mixture model itself.
-    (*it)->update_eigen_pointers(&eigenvalues[count_num_states],
-                                 &eigenvectors[count_num_states_2],
-                                 &inv_eigenvectors[count_num_states_2]);
-    // (*it)->eigen_coeff = &eigen_coeff[m*ncoeff];
-
-    // Update the state counters, so that the pointers are assigned correctly
-    // for the next mixture component.
-    count_num_states += num_states_this_model;
-    count_num_states_2 += num_states_this_model_2;
-	}
+    
+    int nmixtures = size();
+    
+    // Calculate the total number of states and take into account that each of the
+    // models may be a mixture model itself (PoMo rate heterogeneity).
+    
+    int num_states_total = 0;
+    for (iterator it = begin(); it != end(); it++) {
+        num_states_total += (*it)->get_num_states_total();
+    }
+    
+    aligned_free(eigenvalues);
+    aligned_free(eigenvectors);
+    aligned_free(inv_eigenvectors);
+    aligned_free(inv_eigenvectors_transposed);
+    ensure_aligned_allocated(eigenvalues, num_states_total*nmixtures);
+    ensure_aligned_allocated(eigenvectors, num_states_total*num_states_total*nmixtures);
+    ensure_aligned_allocated(inv_eigenvectors, num_states_total*num_states_total*nmixtures);
+    ensure_aligned_allocated(inv_eigenvectors_transposed, num_states_total*num_states_total*nmixtures);
+    
+    // assigning memory for individual models
+    int m = 0;
+    int count_num_states = 0;
+    int count_num_states_2 = 0;
+    for (iterator it = begin(); it != end(); it++, m++) {
+        int num_states_this_model = (*it)->get_num_states_total();
+        int num_states_this_model_2 = num_states_this_model * num_states_this_model;
+        // first copy memory for eigen stuffs
+        memcpy(&eigenvalues[count_num_states], (*it)->eigenvalues,
+               num_states_this_model*sizeof(double));
+        memcpy(&eigenvectors[count_num_states_2], (*it)->eigenvectors,
+               num_states_this_model_2*sizeof(double));
+        memcpy(&inv_eigenvectors[count_num_states_2], (*it)->inv_eigenvectors,
+               num_states_this_model_2*sizeof(double));
+        memcpy(&inv_eigenvectors_transposed[count_num_states_2], (*it)->inv_eigenvectors_transposed,
+               num_states_this_model_2*sizeof(double));
+        
+        // then delete
+        aligned_free((*it)->eigenvalues);
+        aligned_free((*it)->eigenvectors);
+        aligned_free((*it)->inv_eigenvectors);
+        aligned_free((*it)->inv_eigenvectors_transposed);
+        
+        // And assign new memory. Also, recursively, update respective pointers for
+        // the mixture components of the current model. This is relevant if the
+        // current model is a mixture model itself.
+        (*it)->update_eigen_pointers(&eigenvalues[count_num_states],
+                                     &eigenvectors[count_num_states_2],
+                                     &inv_eigenvectors[count_num_states_2],
+                                     &inv_eigenvectors_transposed[count_num_states_2]);
+        
+        // Update the state counters, so that the pointers are assigned correctly
+        // for the next mixture component.
+        count_num_states += num_states_this_model;
+        count_num_states_2 += num_states_this_model_2;
+    }
 }
 
 ModelMixture::~ModelMixture() {
-	if (prop)
-		aligned_free(prop);
-	for (reverse_iterator rit = rbegin(); rit != rend(); rit++) {
-//		(*rit)->eigen_coeff = NULL;
-		(*rit)->eigenvalues = NULL;
-		(*rit)->eigenvectors = NULL;
-		(*rit)->inv_eigenvectors = NULL;
-		delete (*rit);
-	}
+    aligned_free(prop);
+    for (reverse_iterator rit = rbegin(); rit != rend(); rit++) {
+        (*rit)->eigenvalues = nullptr;
+        (*rit)->eigenvectors = nullptr;
+        (*rit)->inv_eigenvectors = nullptr;
+        (*rit)->inv_eigenvectors_transposed = nullptr;
+        delete (*rit);
+    }
 }
 
 void ModelMixture::setCheckpoint(Checkpoint *checkpoint) {
@@ -1652,7 +1649,7 @@ double ModelMixture::optimizeWithEM(double gradient_epsilon) {
     tree->central_scale_num = phylo_tree->central_scale_num;
     tree->central_partial_pars = phylo_tree->central_partial_pars;
 
-    tree->copyPhyloTree(phylo_tree);
+    tree->copyPhyloTree(phylo_tree, true);
     tree->optimize_by_newton = phylo_tree->optimize_by_newton;
     tree->setParams(phylo_tree->params);
     tree->setLikelihoodKernel(phylo_tree->sse);
@@ -1752,7 +1749,7 @@ double ModelMixture::optimizeWithEM(double gradient_epsilon) {
 
         // now optimize model one by one
         for (c = 0; c < nmix; c++) if (at(c)->getNDim() > 0) {
-            tree->copyPhyloTreeMixlen(phylo_tree, c);
+            tree->copyPhyloTreeMixlen(phylo_tree, c, true);
             ModelMarkov *subst_model;
             subst_model = at(c);
             tree->setModel(subst_model);
diff --git a/model/modelpomomixture.cpp b/model/modelpomomixture.cpp
index 0618baf52..0286f5863 100644
--- a/model/modelpomomixture.cpp
+++ b/model/modelpomomixture.cpp
@@ -156,6 +156,8 @@ void ModelPoMoMixture::decomposeRateMatrix() {
             memcpy(eigenvalues+m*num_states, eigenvalues, sizeof(double)*num_states);
             memcpy(eigenvectors+m*num_states_2, eigenvectors, sizeof(double)*num_states_2);
             memcpy(inv_eigenvectors+m*num_states_2, inv_eigenvectors, sizeof(double)*num_states_2);
+            memcpy(inv_eigenvectors_transposed+m*num_states_2
+                   , inv_eigenvectors_transposed, sizeof(double)*num_states_2);
         }
         // restore mutation_rate matrix
         memcpy(mutation_rate_matrix, saved_mutation_rate_matrix, sizeof(double)*n_alleles*n_alleles);
@@ -227,18 +229,26 @@ int ModelPoMoMixture::get_num_states_total() {
   return num_states * getNMixtures();
 }
 
-void ModelPoMoMixture::update_eigen_pointers(double *eval, double *evec, double *inv_evec) {
-  eigenvalues = eval;
-  eigenvectors = evec;
-  inv_eigenvectors = inv_evec;
-  // We assume that all mixture model components have the same number of states.
-  int m = 0;
-  for (iterator it = begin(); it != end(); it++, m++) {
-    (*it)->update_eigen_pointers(eval + m*num_states,
-                                 evec + m*num_states*num_states,
-                                 inv_evec + m*num_states*num_states);
-  }
-  return;
+void ModelPoMoMixture::update_eigen_pointers(double *eval, double *evec
+                                             , double *inv_evec, double* inv_evec_transposed) {
+    eigenvalues = eval;
+    eigenvectors = evec;
+    inv_eigenvectors = inv_evec;
+    inv_eigenvectors_transposed = inv_evec_transposed;
+    
+    // We assume that all mixture model components have the same number of states.
+    size_t rowOffset = 0;
+    size_t matrixOffset = 0; //into matrices
+    size_t num_states_squared = num_states * num_states;
+    
+    for (iterator it = begin(); it != end();
+         it++, rowOffset+=num_states, matrixOffset+=num_states_squared) {
+        (*it)->update_eigen_pointers(eval + rowOffset,
+                                     evec + matrixOffset,
+                                     inv_evec + matrixOffset,
+                                     inv_evec_transposed + matrixOffset);
+    }
+    return;
 }
 
 bool ModelPoMoMixture::isUnstableParameters() {
diff --git a/model/modelpomomixture.h b/model/modelpomomixture.h
index 51cf5754c..2ec5253fd 100644
--- a/model/modelpomomixture.h
+++ b/model/modelpomomixture.h
@@ -146,7 +146,8 @@ class ModelPoMoMixture : public ModelPoMo, public ModelMixture {
   // need to be updated recursively, if the model is a mixture model. For a
   // normal Markov model, only the standard pointers are set. This was done in
   // `ModelMixture::initMem()` before.
-  virtual void update_eigen_pointers(double *eval, double *evec, double *inv_evec);
+  virtual void update_eigen_pointers(double *eval, double *evec
+                                     , double *inv_evec, double *inv_evec_transposed);
 
 
 	/**
diff --git a/model/modelset.cpp b/model/modelset.cpp
index 27e3d7915..8909a6c57 100644
--- a/model/modelset.cpp
+++ b/model/modelset.cpp
@@ -135,7 +135,6 @@ void ModelSet::decomposeRateMatrix()
 	// rearrange eigen to obey vector_size
 	size_t vsize = phylo_tree->vector_size;
 	size_t states2 = num_states*num_states;
-	size_t ptn, i, x;
 
     size_t max_size = get_safe_upper_limit(size());
 
@@ -144,35 +143,37 @@ void ModelSet::decomposeRateMatrix()
         memcpy(&eigenvalues[m*num_states], &eigenvalues[(m-1)*num_states], sizeof(double)*num_states);
         memcpy(&eigenvectors[m*states2], &eigenvectors[(m-1)*states2], sizeof(double)*states2);
         memcpy(&inv_eigenvectors[m*states2], &inv_eigenvectors[(m-1)*states2], sizeof(double)*states2);
+        memcpy(&inv_eigenvectors_transposed[m*states2], &inv_eigenvectors_transposed[(m-1)*states2], sizeof(double)*states2);
     }
 
     double new_eval[num_states*vsize];
     double new_evec[states2*vsize];
     double new_inv_evec[states2*vsize];
 
-	for (ptn = 0; ptn < size(); ptn += vsize) {
-		double *eval_ptr = &eigenvalues[ptn*num_states];
-		double *evec_ptr = &eigenvectors[ptn*states2];
-		double *inv_evec_ptr = &inv_eigenvectors[ptn*states2];
-		for (i = 0; i < vsize; i++) {
-			for (x = 0; x < num_states; x++)
-				new_eval[x*vsize+i] = eval_ptr[x];
-			for (x = 0; x < states2; x++) {
-				new_evec[x*vsize+i] = evec_ptr[x];
-				new_inv_evec[x*vsize+i] = inv_evec_ptr[x];
-			}
-			eval_ptr += num_states;
-			evec_ptr += states2;
-			inv_evec_ptr += states2;
-		}
-		// copy new values
+    for (size_t ptn = 0; ptn < size(); ptn += vsize) {
+        double *eval_ptr = &eigenvalues[ptn*num_states];
+        double *evec_ptr = &eigenvectors[ptn*states2];
+        double *inv_evec_ptr = &inv_eigenvectors[ptn*states2];
+        for (size_t i = 0; i < vsize; i++) {
+            for (size_t x = 0; x < num_states; x++)
+                new_eval[x*vsize+i] = eval_ptr[x];
+            for (size_t x = 0; x < states2; x++) {
+                new_evec[x*vsize+i] = evec_ptr[x];
+                new_inv_evec[x*vsize+i] = inv_evec_ptr[x];
+            }
+            eval_ptr += num_states;
+            evec_ptr += states2;
+            inv_evec_ptr += states2;
+        }
+        // copy new values
         memcpy(&eigenvalues[ptn*num_states], new_eval, sizeof(double)*num_states*vsize);
         memcpy(&eigenvectors[ptn*states2], new_evec, sizeof(double)*states2*vsize);
         memcpy(&inv_eigenvectors[ptn*states2], new_inv_evec, sizeof(double)*states2*vsize);
-	}
+        calculateSquareMatrixTranspose(new_inv_evec, num_states
+                                       , &inv_eigenvectors_transposed[ptn*states2]);
+    }
 }
 
-
 bool ModelSet::getVariables(double* variables)
 {
 	ASSERT(size());
@@ -188,52 +189,57 @@ void ModelSet::setVariables(double* variables)
 	front()->setVariables(variables);
 }
 
-
 ModelSet::~ModelSet()
 {
-	for (reverse_iterator rit = rbegin(); rit != rend(); rit++) {
-		(*rit)->eigenvalues = NULL;
-		(*rit)->eigenvectors = NULL;
-		(*rit)->inv_eigenvectors = NULL;
-		delete (*rit);
-	}
+    for (reverse_iterator rit = rbegin(); rit != rend(); rit++) {
+        (*rit)->eigenvalues = nullptr;
+        (*rit)->eigenvectors = nullptr;
+        (*rit)->inv_eigenvectors = nullptr;
+        (*rit)->inv_eigenvectors_transposed = nullptr;
+        delete (*rit);
+    }
 }
 
 void ModelSet::joinEigenMemory() {
     size_t nmixtures = get_safe_upper_limit(size());
-	if (eigenvalues) aligned_free(eigenvalues);
-	if (eigenvectors) aligned_free(eigenvectors);
-	if (inv_eigenvectors) aligned_free(inv_eigenvectors);
-
+    aligned_free(eigenvalues);
+    aligned_free(eigenvectors);
+    aligned_free(inv_eigenvectors);
+    aligned_free(inv_eigenvectors_transposed);
+    
     size_t states2 = num_states*num_states;
-
-	eigenvalues = aligned_alloc<double>(num_states*nmixtures);
-	eigenvectors = aligned_alloc<double>(states2*nmixtures);
-	inv_eigenvectors = aligned_alloc<double>(states2*nmixtures);
-
-	// assigning memory for individual models
-	size_t m = 0;
-	for (iterator it = begin(); it != end(); it++, m++) {
+    
+    eigenvalues = aligned_alloc<double>(num_states*nmixtures);
+    eigenvectors = aligned_alloc<double>(states2*nmixtures);
+    inv_eigenvectors = aligned_alloc<double>(states2*nmixtures);
+    inv_eigenvectors_transposed = aligned_alloc<double>(states2*nmixtures);
+    
+    // assigning memory for individual models
+    size_t m = 0;
+    for (iterator it = begin(); it != end(); it++, m++) {
         // first copy memory for eigen stuffs
         memcpy(&eigenvalues[m*num_states], (*it)->eigenvalues, num_states*sizeof(double));
         memcpy(&eigenvectors[m*states2], (*it)->eigenvectors, states2*sizeof(double));
         memcpy(&inv_eigenvectors[m*states2], (*it)->inv_eigenvectors, states2*sizeof(double));
+        memcpy(&inv_eigenvectors_transposed[m*states2], (*it)->inv_eigenvectors_transposed, states2*sizeof(double));
         // then delete
-		if ((*it)->eigenvalues) aligned_free((*it)->eigenvalues);
-		if ((*it)->eigenvectors) aligned_free((*it)->eigenvectors);
-		if ((*it)->inv_eigenvectors) aligned_free((*it)->inv_eigenvectors);
-//		if ((*it)->eigen_coeff) aligned_free((*it)->eigen_coeff);
-
+        aligned_free((*it)->eigenvalues);
+        aligned_free((*it)->eigenvectors);
+        aligned_free((*it)->inv_eigenvectors);
+        aligned_free((*it)->inv_eigenvectors_transposed);
+        
         // and assign new memory
-		(*it)->eigenvalues = &eigenvalues[m*num_states];
-		(*it)->eigenvectors = &eigenvectors[m*states2];
-		(*it)->inv_eigenvectors = &inv_eigenvectors[m*states2];
-	}
-
+        (*it)->eigenvalues = &eigenvalues[m*num_states];
+        (*it)->eigenvectors = &eigenvectors[m*states2];
+        (*it)->inv_eigenvectors = &inv_eigenvectors[m*states2];
+        (*it)->inv_eigenvectors_transposed = &inv_eigenvectors_transposed[m*states2];
+    }
+    
     // copy dummy values
     for (m = size(); m < nmixtures; m++) {
         memcpy(&eigenvalues[m*num_states], &eigenvalues[(m-1)*num_states], sizeof(double)*num_states);
         memcpy(&eigenvectors[m*states2], &eigenvectors[(m-1)*states2], sizeof(double)*states2);
         memcpy(&inv_eigenvectors[m*states2], &inv_eigenvectors[(m-1)*states2], sizeof(double)*states2);
+        memcpy(&inv_eigenvectors_transposed[m*states2], &inv_eigenvectors_transposed[(m-1)*states2], sizeof(double)*states2);
     }
 }
diff --git a/model/modelsubst.h b/model/modelsubst.h
index c4e84a329..f4c79eea9 100644
--- a/model/modelsubst.h
+++ b/model/modelsubst.h
@@ -336,9 +336,14 @@ class ModelSubst: public Optimization, public CheckpointFactory
 	}
 
 	virtual double *getInverseEigenvectors() const {
-		return NULL;
+		return nullptr;
 	}
 
+    virtual double *getInverseEigenvectorsTransposed() const {
+        return nullptr;
+    }
+
+    
     /**
      * compute the memory size for the model, can be large for site-specific models
      * @return memory size required in bytes
diff --git a/model/ratefree.cpp b/model/ratefree.cpp
index 67ae2b4a6..4b539fe1d 100644
--- a/model/ratefree.cpp
+++ b/model/ratefree.cpp
@@ -224,8 +224,6 @@ double RateFree::targetFunk(double x[]) {
 	return -phylo_tree->computeLikelihood();
 }
 
-
-
 /**
 	optimize parameters. Default is to optimize gamma shape
 	@return the best likelihood
@@ -499,7 +497,7 @@ double RateFree::optimizeWithEM() {
 //    tree->central_scale_num = phylo_tree->central_scale_num;
 //    tree->central_partial_pars = phylo_tree->central_partial_pars;
 
-    tree->copyPhyloTree(phylo_tree);
+    tree->copyPhyloTree(phylo_tree, true);
     tree->optimize_by_newton = phylo_tree->optimize_by_newton;
     tree->setParams(phylo_tree->params);
     tree->setLikelihoodKernel(phylo_tree->sse);
@@ -609,7 +607,7 @@ double RateFree::optimizeWithEM() {
         // now optimize rates one by one
         double sum = 0.0;
         for (c = 0; c < nmix; c++) {
-            tree->copyPhyloTree(phylo_tree);
+            tree->copyPhyloTree(phylo_tree, true);
             ModelMarkov *subst_model;
             if (phylo_tree->getModel()->isMixture() && phylo_tree->getModelFactory()->fused_mix_rate)
                 subst_model = (ModelMarkov*)phylo_tree->getModel()->getMixtureClass(c);
diff --git a/model/ratemeyerdiscrete.cpp b/model/ratemeyerdiscrete.cpp
index 320aff350..d93841c27 100644
--- a/model/ratemeyerdiscrete.cpp
+++ b/model/ratemeyerdiscrete.cpp
@@ -209,7 +209,7 @@ RateMeyerDiscrete::RateMeyerDiscrete() {
 
 RateMeyerDiscrete::~RateMeyerDiscrete()
 {
-	if (rates) delete [] rates;
+	delete [] rates;
 }
 
 bool RateMeyerDiscrete::isSiteSpecificRate() { 
@@ -282,28 +282,55 @@ double RateMeyerDiscrete::computeFunction(double value) {
 	double lh = 0.0;
 	int nseq = phylo_tree->leafNum;
 	int nstate = phylo_tree->getModel()->num_states;
-	int i, j, k, state1, state2;
 	ModelSubst *model = phylo_tree->getModel();
     int trans_size = nstate * nstate;
 	double *trans_mat = new double[trans_size];
 	int *pair_freq = new int[trans_size];
-
-	for (i = 0; i < nseq-1; i++) 
-		for (j = i+1; j < nseq; j++) {
-			memset(pair_freq, 0, trans_size * sizeof(int));
-			for (k = 0; k < size(); k++) {
-				if (ptn_cat[k] != optimizing_cat) continue;
-				Pattern *pat = & phylo_tree->aln->at(k);
-				if ((state1 = pat->at(i)) < nstate && (state2 = pat->at(j)) < nstate)
-					pair_freq[state1*nstate + state2] += pat->frequency;
-			}
-			model->computeTransMatrix(value * dist_mat[i*nseq + j], trans_mat);
-			for (k = 0; k < trans_size; k++) if (pair_freq[k])
-				lh -= pair_freq[k] * log(trans_mat[k]);
-		}
-	delete [] pair_freq;
-	delete [] trans_mat;
-	return lh;
+    
+    auto frequencies = phylo_tree->getConvertedSequenceFrequencies();
+    for (size_t i = 0; i < nseq-1; i++) {
+        auto eyeSequence = phylo_tree->getConvertedSequenceByNumber(i);
+        for (size_t j = i + 1; j < nseq; j++) {
+            auto jaySequence = phylo_tree->getConvertedSequenceByNumber(j);
+            memset(pair_freq, 0, trans_size * sizeof(int));
+            if (0 && jaySequence!=nullptr) { //DISABLED
+                for (size_t k = 0; k < size(); k++) {
+                    if (ptn_cat[k] != optimizing_cat) {
+                        continue;
+                    }
+                    int state1 = eyeSequence[k];
+                    auto pairRow = pair_freq + state1*nstate;
+                    if (nstate<=state1) {
+                        continue;
+                    }
+                    int state2 = jaySequence[k];
+                    if ( state2 < nstate) {
+                        pairRow[state2] += frequencies[k];
+                    }
+                }
+            } else {
+                for (size_t k = 0; k < size(); k++) {
+                    if (ptn_cat[k] != optimizing_cat) {
+                        continue;
+                    }
+                    Pattern *pat = & phylo_tree->aln->at(k);
+                    int state1 = pat->at(i);
+                    int state2 = pat->at(j);
+                    if ( state1 < nstate && state2 < nstate) {
+                        pair_freq[state1*nstate + state2] += pat->frequency;
+                    }
+                }
+            }
+            model->computeTransMatrix(value * dist_mat[i*nseq + j], trans_mat);
+            for (size_t k = 0; k < trans_size; k++)
+            {
+                lh -= pair_freq[k] * log(trans_mat[k]);
+            }
+        }
+    }
+delete [] pair_freq;
+delete [] trans_mat;
+return lh;
 }
 
 void RateMeyerDiscrete::computeFuncDerv(double value, double &df, double &ddf) {
@@ -323,18 +350,36 @@ void RateMeyerDiscrete::computeFuncDerv(double value, double &df, double &ddf) {
 	df = ddf = 0.0;
 
 	int *pair_freq = new int[trans_size];
-
-	for (size_t i = 0; i + 1 < nseq; ++i)
+    auto frequencies = phylo_tree->getConvertedSequenceFrequencies();
+    for (size_t i = 0; i + 1 < nseq; ++i) {
+        auto eyeSequence = phylo_tree->getConvertedSequenceByNumber(i);
 		for (size_t j = i+1; j < nseq; ++j) {
+            auto jaySequence = phylo_tree->getConvertedSequenceByNumber(j);
 			memset(pair_freq, 0, trans_size * sizeof(int));
-			for (size_t k = 0; k < size(); ++k) {
-				if (ptn_cat[k] != optimizing_cat) continue;
-				Pattern *pat = & phylo_tree->aln->at(k);
-                int state1 = pat->at(i);
-                int state2 = pat->at(j);
-				if (state1 < nstate && state2 < nstate)
-					pair_freq[state1*nstate + state2] += pat->frequency;
-			}
+            if (eyeSequence!=nullptr && jaySequence!=nullptr) {
+                for (size_t k = 0; k < size(); ++k) {
+                    if (ptn_cat[k] != optimizing_cat) continue;
+                    int state1 = eyeSequence[k];
+                    if (nstate<=state1) {
+                        continue;
+                    }
+                    auto pairRow = pair_freq + state1*nstate;
+                    int state2 = jaySequence[k];
+                    if (nstate<=state2) {
+                        continue;
+                    }
+                    pairRow[state2] += frequencies[k];
+                }
+            } else {
+                for (size_t k = 0; k < size(); ++k) {
+                    if (ptn_cat[k] != optimizing_cat) continue;
+                    Pattern *pat = & phylo_tree->aln->at(k);
+                    int state1 = pat->at(i);
+                    int state2 = pat->at(j);
+                    if (state1 < nstate && state2 < nstate)
+                        pair_freq[state1*nstate + state2] += pat->frequency;
+                }
+            }
 			double dist = dist_mat[i*nseq + j];
 			double derv1 = 0.0, derv2 = 0.0;
 			model->computeTransDerv(value * dist, trans_mat, trans_derv1, trans_derv2);
@@ -352,6 +397,7 @@ void RateMeyerDiscrete::computeFuncDerv(double value, double &df, double &ddf) {
 			df -= derv1 * dist;
 			ddf -= derv2 * dist * dist;
 		}
+    }
 	delete [] pair_freq;
 	delete [] trans_derv2;
 	delete [] trans_derv1;
@@ -414,25 +460,39 @@ double RateMeyerDiscrete::optimizeCatRate(int cat) {
 }
 
 void RateMeyerDiscrete::normalizeRates() {
-	double sum = 0.0, ok = 0.0;
-	int nptn = size();
-	int i;
-
-	for (i = 0; i < nptn; i++) {
-		//at(i) = rates[ptn_cat[i]];
-		if (getPtnRate(i) < MAX_SITE_RATE) { 
-			sum += getPtnRate(i) * phylo_tree->aln->at(i).frequency; 
-			ok += phylo_tree->aln->at(i).frequency; 
-		}
-	}
-
-	if (fabs(sum - ok) > 1e-3) {
-		//cout << "Normalizing rates " << sum << " / " << ok << endl;
-		double scale_f = ok / sum;
-		for (i = 0; i < ncategory; i++)
-			if (rates[i] > 2*MIN_SITE_RATE && rates[i] < MAX_SITE_RATE)
-				rates[i] *= scale_f;
-	}
+    double sum = 0.0;
+    double ok  = 0.0;
+    size_t nptn = size();
+    
+    auto frequencies = phylo_tree->getConvertedSequenceFrequencies();
+    if (frequencies!=nullptr) {
+        for (size_t i = 0; i < nptn; i++) {
+            //at(i) = rates[ptn_cat[i]];
+            if (getPtnRate(i) < MAX_SITE_RATE) {
+                double freq = frequencies[i];
+                sum += getPtnRate(i) * freq;
+                ok  += freq;
+            }
+        }
+    } else {
+        for (size_t i = 0; i < nptn; i++) {
+            //at(i) = rates[ptn_cat[i]];
+            if (getPtnRate(i) < MAX_SITE_RATE) {
+                double freq = phylo_tree->aln->at(i).frequency;
+                sum += getPtnRate(i) * freq;
+                ok  += freq;
+            }
+        }
+    }
+    if (fabs(sum - ok) > 1e-3) {
+        //cout << "Normalizing rates " << sum << " / " << ok << endl;
+        double scale_f = ok / sum;
+        for (int i = 0; i < ncategory; i++) {
+            if (rates[i] > 2*MIN_SITE_RATE && rates[i] < MAX_SITE_RATE) {
+                rates[i] *= scale_f;
+            }
+        }
+    }
 }
 
 double RateMeyerDiscrete::classifyRatesKMeans() {
diff --git a/tree/iqtree.cpp b/tree/iqtree.cpp
index 70945d8e1..2e9271ae5 100644
--- a/tree/iqtree.cpp
+++ b/tree/iqtree.cpp
@@ -2042,6 +2042,7 @@ double IQTree::perturb(int times) {
 extern pllUFBootData * pllUFBootDataPtr;
 
 string IQTree::optimizeModelParameters(bool printInfo, double logl_epsilon) {
+    prepareToComputeDistances();
     if (logl_epsilon == -1)
         logl_epsilon = params->modelEps;
     cout << "Estimate model parameters (epsilon = " << logl_epsilon << ")" << endl;
@@ -2074,7 +2075,6 @@ string IQTree::optimizeModelParameters(bool printInfo, double logl_epsilon) {
         } else {
             modOptScore = getModelFactory()->optimizeParameters(params->fixed_branch_length, printInfo, logl_epsilon);
         }
-
         if (isSuperTree()) {
             ((PhyloSuperTree*) this)->computeBranchLengths();
         }
@@ -2083,7 +2083,6 @@ string IQTree::optimizeModelParameters(bool printInfo, double logl_epsilon) {
             outWarning("Estimated model parameters are at boundary that can cause numerical instability!");
             cout << endl;
         }
-
         if (modOptScore < curScore - 1.0 && params->partition_type != TOPO_UNLINKED) {
             cout << "  BUG: Tree logl gets worse after model optimization!" << endl;
             cout << "  Old logl: " << curScore << " / " << "new logl: " << modOptScore << endl;
@@ -2096,7 +2095,7 @@ string IQTree::optimizeModelParameters(bool printInfo, double logl_epsilon) {
         if (params->print_trees_site_posterior)
             computePatternCategories();
     }
-
+    doneComputingDistances(); //DISABLED
     return newTree;
 }
 
@@ -2912,7 +2911,9 @@ pair<int, int> IQTree::doNNISearch(bool write_info) {
                 PLL_TRUE, 0, 0, 0, PLL_SUMMARIZE_LH, 0, 0);
         readTreeString(string(pllInst->tree_string));
     } else {
+        prepareToComputeDistances();
         nniInfos = optimizeNNI(Params::getInstance().speednni);
+        doneComputingDistances();
         if (isSuperTree()) {
             ((PhyloSuperTree*) this)->computeBranchLengths();
         }
diff --git a/tree/mtree.cpp b/tree/mtree.cpp
index 05d3f391d..fe396b294 100644
--- a/tree/mtree.cpp
+++ b/tree/mtree.cpp
@@ -1278,18 +1278,33 @@ void MTree::getOrderedTaxa(NodeVector &taxa, Node *node, Node *dad) {
 }
 
 void MTree::getTaxaName(vector<string> &taxname, Node *node, Node *dad) {
-    if (!node) node = root;
+    if (!node) {
+        node = root;
+    }
     if (node->isLeaf()) {
-        if (taxname.empty()) taxname.resize(leafNum);
+        if (taxname.empty()) {
+            taxname.resize(leafNum);
+        }
         taxname[node->id] = node->name;
     }
-    //for (NeighborVec::iterator it = node->neighbors.begin(); it != node->neighbors.end(); it++)
-    //if ((*it)->node != dad)	{
     FOR_NEIGHBOR_IT(node, dad, it) {
         getTaxaName(taxname, (*it)->node, node);
     }
 }
 
+void MTree::getMapOfTaxonNameToNode(Node* node, Node* dad
+                                    , map<string, Node*> &map) {
+    if (!node) {
+        node = root;
+    }
+    if (node->isLeaf()) {
+        map[node->name] = node;
+    }
+    FOR_NEIGHBOR_IT(node, dad, it) {
+        getMapOfTaxonNameToNode((*it)->node, node, map);
+    }
+}
+
 void MTree::getNodeName(set<string> &nodename, Node *node, Node *dad) {
     if (!node) node = root;
     if (!node->name.empty())
diff --git a/tree/mtree.h b/tree/mtree.h
index cd0733861..2345de02e 100644
--- a/tree/mtree.h
+++ b/tree/mtree.h
@@ -440,6 +440,17 @@ class MTree {
      */
     void getTaxaName(vector<string> &taxname, Node *node = NULL, Node *dad = NULL);
 
+    /**
+            get the descending taxa names below the node, and map each
+            name to the corresponding leaf node
+            @param node the starting node, nullptr to start from the root
+            @param dad dad of the node, used to direct the search
+            @param[out] map, from taxon name to corresponding node
+     */
+
+    void getMapOfTaxonNameToNode(Node* node, Node* dad
+                                 , map<string, Node*> &map);
+    
     /**
             get the descending node names below the node
             @param node the starting node, NULL to start from the root
diff --git a/tree/phylokernelnew.h b/tree/phylokernelnew.h
index cd34a14bb..82bbf1e9d 100644
--- a/tree/phylokernelnew.h
+++ b/tree/phylokernelnew.h
@@ -9,6 +9,8 @@
 
 #if !defined(PHYLOKERNELNEW_H_) || !defined(PHYLOKERNELNEW_STATE_H_)
 
+#define JAMES_VERSION 1
+
 #ifdef KERNEL_FIX_STATES
 #   define PHYLOKERNELNEW_STATE_H_
 #else
@@ -21,8 +23,6 @@
 #include <omp.h>
 #endif
 
-//#include <thread>
-
 using namespace std;
 
 /*******************************************************
@@ -86,6 +86,45 @@ template <class VectorClass, class Numeric, const bool FMA>
 inline void dotProductVec(Numeric *A, VectorClass *B, VectorClass &X, size_t N)
 #endif
 {
+#if JAMES_VERSION
+    if (N==4) {
+        X = (A[0]*B[0] + A[1]*B[1]) + (A[2]*B[2] + A[3]*B[3]);
+        return;
+    }
+    size_t cruft = (N & 3);
+    if (N==cruft) {
+        X = 0;
+        //IF cruft==N, N must be 0, 1, 2, or 3
+        for (int i=0; i<cruft; ++i) {
+            X = mul_add(A[i], B[i], X);
+        }
+        return;
+    }
+    X = A[0]*B[0];
+    VectorClass v1 = A[1]*B[1];
+    VectorClass v2 = A[2]*B[2];
+    VectorClass v3 = A[3]*B[3];
+    auto Astop = A + (N-cruft);
+    A+=4;
+    B+=4;
+    for (; A<Astop; A+=4, B+=4) {
+        X  = mul_add( A[0], B[0], X);
+        v1 = mul_add( A[1], B[1], v1);
+        v2 = mul_add( A[2], B[2], v2);
+        v3 = mul_add( A[3], B[3], v3);
+    }
+    if (0<cruft) {
+        X  = mul_add( A[0], B[0], X);
+        if (1<cruft) {
+            v1 = mul_add( A[1], B[1], v1);
+            if (2<cruft) {
+                v2 = mul_add( A[2], B[2], v2);
+            }
+        }
+    }
+    X  = ( X + v1 ) + ( v2 + v3 );
+    return;
+#else
     // quick treatment for small N <= 4
     switch (N) {
     case 1:
@@ -156,6 +195,7 @@ inline void dotProductVec(Numeric *A, VectorClass *B, VectorClass &X, size_t N)
         break;
     }
     }
+#endif
 }
 
 /**
@@ -290,6 +330,9 @@ template <class VectorClass, class Numeric, const bool FMA>
 inline void productVecMat(VectorClass *A, Numeric *M, VectorClass *X, size_t N)
 #endif
 {
+    if (N==4) {
+        
+    }
     // quick treatment for small N <= 4
     size_t i;
     switch (N) {
@@ -307,7 +350,6 @@ inline void productVecMat(VectorClass *A, Numeric *M, VectorClass *X, size_t N)
             M += 3;
         }
         return;
-    case 4:
         for (i = 0; i < 4; i++) {
             // manual unrolling
             X[i] = mul_add(A[1],M[1],A[0]*M[0]) + mul_add(A[3],M[3],A[2]*M[2]);
@@ -542,7 +584,27 @@ inline void dotProductPairAdd(Numeric *A, Numeric *B, VectorClass *D,
         Y = mul_add(B[0], D[0], Y);
         return;
     }
-
+#if JAMES_VERSION
+    VectorClass AD[2], BD[2];
+    for (size_t j = 0; j < 2; j++) {
+        AD[j] = A[j] * D[j];
+        BD[j] = B[j] * D[j];
+    }
+    size_t iStop = N - (N&1);
+    for (size_t i = 2; i < iStop; i+=2) {
+        for (size_t j = 0; j < 2; j++) {
+            AD[j] = mul_add(A[i+j], D[i+j], AD[j]);
+            BD[j] = mul_add(B[i+j], D[i+j], BD[j]);
+        }
+    }
+    if ((N & 1) == 0) {
+        X += AD[0] + AD[1];
+        Y += BD[0] + BD[1];
+    } else {
+        X += mul_add(A[N-1], D[N-1], AD[0] + AD[1]);
+        Y += mul_add(B[N-1], D[N-1], BD[0] + BD[1]);
+    }
+#else
     size_t i, j;
     if (N % 2 == 0) {
         VectorClass AD[2], BD[2];
@@ -550,12 +612,12 @@ inline void dotProductPairAdd(Numeric *A, Numeric *B, VectorClass *D,
             AD[j] = A[j] * D[j];
             BD[j] = B[j] * D[j];
         }
-		for (i = 2; i < N; i+=2) {
+        for (i = 2; i < N; i+=2) {
             for (j = 0; j < 2; j++) {
                 AD[j] = mul_add(A[i+j], D[i+j], AD[j]);
                 BD[j] = mul_add(B[i+j], D[i+j], BD[j]);
             }
-		}
+        }
         X += AD[0] + AD[1];
         Y += BD[0] + BD[1];
     } else {
@@ -565,15 +627,16 @@ inline void dotProductPairAdd(Numeric *A, Numeric *B, VectorClass *D,
             AD[j] = A[j] * D[j];
             BD[j] = B[j] * D[j];
         }
-		for (i = 2; i < N-1; i+=2) {
+        for (i = 2; i < N-1; i+=2) {
             for (j = 0; j < 2; j++) {
                 AD[j] = mul_add(A[i+j], D[i+j], AD[j]);
                 BD[j] = mul_add(B[i+j], D[i+j], BD[j]);
             }
-		}
+        }
         X += mul_add(A[N-1], D[N-1], AD[0] + AD[1]);
         Y += mul_add(B[N-1], D[N-1], BD[0] + BD[1]);
     }
+#endif
 }
 
 /**
@@ -602,7 +665,7 @@ inline void dotProductTriple(Numeric *A, Numeric *B, Numeric *C, VectorClass *D,
 #endif
 {
     size_t i, j;
-    if (nstates % 2 == 0) {
+    if ((nstates & 1) == 0) {
         VectorClass AD[2], BD[2], CD[2];
         for (j = 0; j < 2; j++) {
             AD[j] = A[j] * D[j];
@@ -860,7 +923,7 @@ void PhyloTree::computePartialInfo(TraversalInfo &info, VectorClass* buffer) {
             PhyloNeighbor *child = (PhyloNeighbor*)*it;
             // precompute information buffer
             if (child->direction == TOWARD_ROOT) {
-                // tranpose probability matrix
+                // transpose probability matrix
                 double mat[nstatesqr];
                 for (c = 0; c < ncat_mix; c++) {
                     double len_child = site_rate->getRate(c%ncat) * child->length;
@@ -1341,8 +1404,9 @@ void PhyloTree::computePartialLikelihoodGenericSIMD(TraversalInfo &info, size_t
         double *vec_tip = (double*)&partial_lh_all[block];
 
         for (ptn = ptn_lower; ptn < ptn_upper; ptn+=VectorClass::size()) {
-            for (i = 0; i < block; i++)
+            for (i = 0; i < block; i++){
                 partial_lh_all[i] = 1.0;
+            }
             UBYTE *scale_dad = NULL;
             if (SAFE_NUMERIC) {
                 scale_dad = dad_branch->scale_num + ptn*ncat_mix;
@@ -1417,20 +1481,28 @@ void PhyloTree::computePartialLikelihoodGenericSIMD(TraversalInfo &info, size_t
                 } else {
                     // non site specific model
                     PhyloNeighbor *child = (PhyloNeighbor*)*it;
+                    auto stateRow = this->getConvertedSequenceByNumber(child->node->id);
+                    auto unknown  = aln->STATE_UNKNOWN;
                     UBYTE *scale_child = SAFE_NUMERIC ? child->scale_num + ptn*ncat_mix : NULL;
                     if (child->node->isLeaf()) {
                         // external node
                         // load data for tip
                         for (i = 0; i < VectorClass::size(); i++) {
-                            double *child_lh;
-                            if (ptn+i < orig_nptn)
-                                child_lh = partial_lh_leaf + block*(aln->at(ptn+i))[child->node->id];
-                            else if (ptn+i < max_orig_nptn)
-                                child_lh = partial_lh_leaf + block*aln->STATE_UNKNOWN;
-                            else if (ptn+i < nptn)
-                                child_lh = partial_lh_leaf + block*model_factory->unobserved_ptns[ptn+i-max_orig_nptn][child->node->id];
-                            else
-                                child_lh = partial_lh_leaf + block*aln->STATE_UNKNOWN;
+                            int state;
+                            if (ptn+i < orig_nptn) {
+                                if (stateRow!=nullptr) {
+                                    state = stateRow[ptn+i];
+                                } else {
+                                    state = (aln->at(ptn+i))[child->node->id];
+                                }
+                            } else if (ptn+i < max_orig_nptn) {
+                                state = unknown;
+                            } else if (ptn+i < nptn) {
+                                state = model_factory->unobserved_ptns[ptn+i-max_orig_nptn][child->node->id];
+                            } else {
+                                state = unknown;
+                            }
+                            double *child_lh = partial_lh_leaf + block*state;
                             double *this_vec_tip = vec_tip+i;
                             for (c = 0; c < block; c++) {
                                 *this_vec_tip = child_lh[c];
@@ -1562,6 +1634,10 @@ void PhyloTree::computePartialLikelihoodGenericSIMD(TraversalInfo &info, size_t
         double *vec_right =  SITE_MODEL ? &vec_left[nstates*VectorClass::size()] : &vec_left[block*VectorClass::size()];
         VectorClass *partial_lh_tmp = SITE_MODEL ? (VectorClass*)vec_right+nstates : (VectorClass*)vec_right+block;
 
+        auto leftStateRow  = this->getConvertedSequenceByNumber(left->node->id);
+        auto rightStateRow = this->getConvertedSequenceByNumber(right->node->id);
+        auto unknown = aln->STATE_UNKNOWN;
+
 		for (ptn = ptn_lower; ptn < ptn_upper; ptn+=VectorClass::size()) {
 			VectorClass *partial_lh = (VectorClass*)(dad_branch->partial_lh + ptn*block);
 
@@ -1597,26 +1673,37 @@ void PhyloTree::computePartialLikelihoodGenericSIMD(TraversalInfo &info, size_t
                     partial_lh += nstates;
                 } // FOR category
             } else {
-                VectorClass *vleft = (VectorClass*)vec_left;
+                VectorClass *vleft  = (VectorClass*)vec_left;
                 VectorClass *vright = (VectorClass*)vec_right;
                 // load data for tip
                 for (x = 0; x < VectorClass::size(); x++) {
-                    double *tip_left, *tip_right;
+                    int leftState;
+                    int rightState;
                     if (ptn+x < orig_nptn) {
-                        tip_left  = partial_lh_left  + block * (aln->at(ptn+x))[left->node->id];
-                        tip_right = partial_lh_right + block * (aln->at(ptn+x))[right->node->id];
+                        if (leftStateRow!=nullptr) {
+                            leftState = leftStateRow[ptn+x];
+                        } else {
+                            leftState = (aln->at(ptn+x))[left->node->id];
+                        }
+                        if (rightStateRow!=nullptr) {
+                            rightState =  rightStateRow[ptn+x];
+                        } else {
+                            rightState = (aln->at(ptn+x))[right->node->id];
+                        }
                     } else if (ptn+x < max_orig_nptn) {
-                        tip_left  = partial_lh_left  + block * aln->STATE_UNKNOWN;
-                        tip_right = partial_lh_right + block * aln->STATE_UNKNOWN;
+                        leftState = unknown;
+                        rightState = unknown;
                     } else if (ptn+x < nptn) {
-                        tip_left  = partial_lh_left  + block * model_factory->unobserved_ptns[ptn+x-max_orig_nptn][left->node->id];
-                        tip_right = partial_lh_right + block * model_factory->unobserved_ptns[ptn+x-max_orig_nptn][right->node->id];
+                        leftState  = model_factory->unobserved_ptns[ptn+x-max_orig_nptn][left->node->id];
+                        rightState = model_factory->unobserved_ptns[ptn+x-max_orig_nptn][right->node->id];
                     } else {
-                        tip_left  = partial_lh_left  + block * aln->STATE_UNKNOWN;
-                        tip_right = partial_lh_right + block * aln->STATE_UNKNOWN;
+                        leftState  = unknown;
+                        rightState = unknown;
                     }
-                    double *this_vec_left = vec_left+x;
-                    double *this_vec_right = vec_right+x;
+                    double* tip_left  = partial_lh_left  + block*leftState;
+                    double* tip_right = partial_lh_right + block*rightState;
+                    double* this_vec_left = vec_left+x;
+                    double* this_vec_right = vec_right+x;
                     for (i = 0; i < block; i++) {
                         *this_vec_left = tip_left[i];
                         *this_vec_right = tip_right[i];
@@ -1665,10 +1752,12 @@ void PhyloTree::computePartialLikelihoodGenericSIMD(TraversalInfo &info, size_t
         double *vec_left = buffer_partial_lh_ptr + thread_buf_size*thread_id;
         VectorClass *partial_lh_tmp = SITE_MODEL ? (VectorClass*)vec_left+2*nstates : (VectorClass*)vec_left+block;
 
+        auto leftStateRow = this->getConvertedSequenceByNumber(left->node->id);
+        auto unknown = aln->STATE_UNKNOWN;
+        
 		for (ptn = ptn_lower; ptn < ptn_upper; ptn+=VectorClass::size()) {
 			VectorClass *partial_lh = (VectorClass*)(dad_branch->partial_lh + ptn*block);
 			VectorClass *partial_lh_right = (VectorClass*)(right->partial_lh + ptn*block);
-//            memset(partial_lh, 0, sizeof(VectorClass)*block);
             VectorClass lh_max = 0.0;
 
             if (SITE_MODEL) {
@@ -1721,16 +1810,21 @@ void PhyloTree::computePartialLikelihoodGenericSIMD(TraversalInfo &info, size_t
                 VectorClass *vleft = (VectorClass*)vec_left;
                 // load data for tip
                 for (x = 0; x < VectorClass::size(); x++) {
-                    double *tip;
+                    int state;
                     if (ptn+x < orig_nptn) {
-                        tip = partial_lh_left + block*(aln->at(ptn+x))[left->node->id];
+                        if (leftStateRow!=nullptr) {
+                            state =  leftStateRow[ptn+x];
+                        } else {
+                            state = (aln->at(ptn+x))[left->node->id];
+                        }
                     } else if (ptn+x < max_orig_nptn) {
-                        tip = partial_lh_left + block*aln->STATE_UNKNOWN;
+                        state = unknown;
                     } else if (ptn+x < nptn) {
-                        tip = partial_lh_left + block*model_factory->unobserved_ptns[ptn+x-max_orig_nptn][left->node->id];
+                        state = model_factory->unobserved_ptns[ptn+x-max_orig_nptn][left->node->id];
                     } else {
-                        tip = partial_lh_left + block*aln->STATE_UNKNOWN;
+                        state = unknown;
                     }
+                    double *tip = partial_lh_left + block*state;
                     double *this_vec_left = vec_left+x;
                     for (i = 0; i < block; i++) {
                         *this_vec_left = tip[i];
@@ -1928,8 +2022,7 @@ void PhyloTree::computePartialLikelihoodGenericSIMD(TraversalInfo &info, size_t
 	}
 
     if (Params::getInstance().buffer_mem_save) {
-        if (partial_lh_leaves)
-            aligned_free(partial_lh_leaves);
+        aligned_free(partial_lh_leaves);
         aligned_free(echildren);
         info.echildren = info.partial_lh_leaves = NULL;
     }
@@ -1986,31 +2079,39 @@ void PhyloTree::computeLikelihoodBufferGenericSIMD(PhyloNeighbor *dad_branch, Ph
 
     if (dad->isLeaf()) {
         // special treatment for TIP-INTERNAL NODE case
-        double *tip_partial_lh_node = &tip_partial_lh[dad->id * max_orig_nptn*nstates];
-
-        double *vec_tip = buffer_partial_lh_ptr + tip_block*VectorClass::size()*thread_id;
-
-        for (ptn = ptn_lower; ptn < ptn_upper; ptn+=VectorClass::size()) {
-            VectorClass *partial_lh_dad = (VectorClass*)(dad_branch->partial_lh + ptn*block);
-            VectorClass *theta = (VectorClass*)(theta_all + ptn*block);
+        double *tip_partial_lh_node = &tip_partial_lh[dad->id * max_orig_nptn * nstates];
+        double *vec_tip = buffer_partial_lh_ptr + tip_block * VectorClass::size() * thread_id;
+        auto stateRow = this->getConvertedSequenceByNumber(dad->id);
+        auto unknown  = aln->STATE_UNKNOWN;
+
+        size_t offset     = ptn_lower*block;
+        size_t offsetStep = block*VectorClass::size();
+        for (ptn = ptn_lower; ptn < ptn_upper; ptn+=VectorClass::size(), offset+=offsetStep) {
+            VectorClass *partial_lh_dad = (VectorClass*)(dad_branch->partial_lh + offset);
+            VectorClass *theta = (VectorClass*)(theta_all + offset);
             //load tip vector
             if (!SITE_MODEL)
             for (i = 0; i < VectorClass::size(); i++) {
-                double *this_tip_partial_lh;
-                if (ptn+i < orig_nptn)
-                    this_tip_partial_lh = tip_partial_lh + tip_block*(aln->at(ptn+i))[dad->id];
-                else if (ptn+i < max_orig_nptn)
-                    this_tip_partial_lh = tip_partial_lh + tip_block*aln->STATE_UNKNOWN;
-                else if (ptn+i < nptn)
-                    this_tip_partial_lh = tip_partial_lh + tip_block*model_factory->unobserved_ptns[ptn+i-max_orig_nptn][dad->id];
-                else
-                    this_tip_partial_lh = tip_partial_lh + tip_block*aln->STATE_UNKNOWN;
+                int state;
+                if (ptn+i < orig_nptn) {
+                    if (stateRow!=nullptr) {
+                        state =  stateRow[ptn+i];
+                    } else {
+                        state = (aln->at(ptn+i))[dad->id];
+                    }
+                } else if (ptn+i < max_orig_nptn) {
+                    state = unknown;
+                } else if (ptn+i < nptn) {
+                    state = model_factory->unobserved_ptns[ptn+i-max_orig_nptn][dad->id];
+                } else {
+                    state = unknown;
+                }
+                double *this_tip_partial_lh = tip_partial_lh + tip_block*state;
                 double *this_vec_tip = vec_tip+i;
                 for (c = 0; c < tip_block; c++) {
                     *this_vec_tip = this_tip_partial_lh[c];
                     this_vec_tip += VectorClass::size();
                 }
-
             }
             VectorClass *lh_tip;
             if (SITE_MODEL)
@@ -2595,8 +2696,6 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
 
     size_t block = ncat_mix * nstates;
     size_t tip_block = nstates * model->getNMixtures();
-    size_t ptn; // for big data size > 4GB memory required
-    size_t c, i;
     size_t orig_nptn = aln->size();
     size_t max_orig_nptn = ((orig_nptn+VectorClass::size()-1)/VectorClass::size())*VectorClass::size();
     size_t nptn = max_orig_nptn+model_factory->unobserved_ptns.size();
@@ -2611,15 +2710,13 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
     double *eval = model->getEigenvalues();
     ASSERT(eval);
 
-//    double *val = aligned_alloc<double>(block);
-    double *val = NULL;
+    double *val = nullptr;
     double *buffer_partial_lh_ptr = buffer_partial_lh;
 
-
     double cat_length[ncat];
     double cat_prop[ncat];
     if (SITE_MODEL) {
-        for (c = 0; c < ncat; c++) {
+        for (size_t c = 0; c < ncat; c++) {
             cat_length[c] = site_rate->getRate(c) * dad_branch->length;
             cat_prop[c] = site_rate->getProp(c);
         }
@@ -2628,7 +2725,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
         buffer_partial_lh_ptr += get_safe_upper_limit(block);
         if (nstates % VectorClass::size() == 0) {
             size_t loop_size = nstates / VectorClass::size();
-            for (c = 0; c < ncat_mix; c++) {
+            for (size_t c = 0; c < ncat_mix; c++) {
                 size_t mycat = c%ncat;
                 size_t m = c/denom;
                 mix_addr_nstates[c] = m*nstates;
@@ -2637,11 +2734,12 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                 double len = site_rate->getRate(mycat)*dad_branch->getLength(mycat);
                 double prop = site_rate->getProp(mycat) * model->getMixtureWeight(m);
                 VectorClass *this_val = (VectorClass*)(val + c*nstates);
-                for (i = 0; i < loop_size; i++)
+                for (size_t i = 0; i < loop_size; i++) {
                     this_val[i] = exp(eval_ptr[i]*len) * prop;
+                }
             }
         } else {
-            for (c = 0; c < ncat_mix; c++) {
+            for (size_t c = 0; c < ncat_mix; c++) {
                 size_t mycat = c%ncat;
                 size_t m = c/denom;
                 mix_addr_nstates[c] = m*nstates;
@@ -2650,7 +2748,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                 double len = site_rate->getRate(mycat)*dad_branch->getLength(mycat);
                 double prop = site_rate->getProp(mycat) * model->getMixtureWeight(m);
                 double *this_val = val + c*nstates;
-                for (i = 0; i < nstates; i++)
+                for (size_t i = 0; i < nstates; i++)
                     this_val[i] = exp(eval_ptr[i]*len) * prop;
             }
         }
@@ -2683,9 +2781,9 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                     double *lh_node = partial_lh_node + state*block;
                     double *lh_tip = tip_partial_lh + state*tip_block;
                     double *vc_val_tmp = val;
-                    for (c = 0; c < ncat_mix; c++) {
+                    for (size_t c = 0; c < ncat_mix; c++) {
                         double *this_lh_tip = lh_tip + mix_addr_nstates[c];
-                        for (i = 0; i < nstates; i+=VectorClass::size()) {
+                        for (size_t i = 0; i < nstates; i+=VectorClass::size()) {
                             (VectorClass().load_a(&vc_val_tmp[i]) * VectorClass().load_a(&this_lh_tip[i])).store_a(&lh_node[i]);
                         }
                         lh_node += nstates;
@@ -2698,9 +2796,9 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                     double *lh_node = partial_lh_node +state*block;
                     double *val_tmp = val;
                     double *this_tip_partial_lh = tip_partial_lh + state*tip_block;
-                    for (c = 0; c < ncat_mix; c++) {
+                    for (size_t c = 0; c < ncat_mix; c++) {
                         double *lh_tip = this_tip_partial_lh + mix_addr_nstates[c];
-                        for (i = 0; i < nstates; i++) {
+                        for (size_t i = 0; i < nstates; i++) {
                               lh_node[i] = val_tmp[i] * lh_tip[i];
                         }
                         lh_node += nstates;
@@ -2709,10 +2807,12 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                 }
             }
         }
-
+        
+        auto stateRow = this->getConvertedSequenceByNumber(dad->id);
+        auto unknown  = aln->STATE_UNKNOWN;
     	// now do the real computation
 #ifdef _OPENMP
-#pragma omp parallel for private(ptn, i, c) schedule(static, 1) num_threads(num_threads)
+#pragma omp parallel for  schedule(static, 1) num_threads(num_threads)
 #endif
         for (int thread_id = 0; thread_id < num_threads; thread_id++) {
 
@@ -2730,7 +2830,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
 
             double *vec_tip = buffer_partial_lh_ptr + block*VectorClass::size()*thread_id;
 
-            for (ptn = ptn_lower; ptn < ptn_upper; ptn+=VectorClass::size()) {
+            for (size_t ptn = ptn_lower; ptn < ptn_upper; ptn+=VectorClass::size()) {
                 VectorClass lh_ptn(0.0);
 //                lh_ptn.load_a(&ptn_invar[ptn]);
                 VectorClass *lh_cat = (VectorClass*)(_pattern_lh_cat + ptn*ncat_mix);
@@ -2740,7 +2840,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                 if (SITE_MODEL) {
                     // site-specific model
                     VectorClass* eval_ptr = (VectorClass*) &eval[ptn*nstates];
-                    for (c = 0; c < ncat; c++) {
+                    for (size_t c = 0; c < ncat; c++) {
     #ifdef KERNEL_FIX_STATES
                         dotProductExp<VectorClass, double, nstates, FMA>(eval_ptr, lh_node, partial_lh_dad, cat_length[c], lh_cat[c]);
     #else
@@ -2755,26 +2855,31 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                     }
                 } else { // normal model
                     //load tip vector
-                    for (i = 0; i < VectorClass::size(); i++) {
-                        double *lh_tip;
-                        if (ptn+i < orig_nptn)
-                            lh_tip = partial_lh_node + block*(aln->at(ptn+i))[dad->id];
-                        else if (ptn+i < max_orig_nptn)
-                            lh_tip = partial_lh_node + block*aln->STATE_UNKNOWN;
+                    for (size_t i = 0; i < VectorClass::size(); i++) {
+                        int state;
+                        if (ptn+i < orig_nptn) {
+                            if (stateRow!=nullptr) {
+                                state =  stateRow[ptn+i];
+                            } else {
+                                state = (aln->at(ptn+i))[dad->id];
+                            }
+                        } else if (ptn+i < max_orig_nptn)
+                            state = unknown;
                         else if (ptn+i < nptn)
-                            lh_tip = partial_lh_node + block*model_factory->unobserved_ptns[ptn+i-max_orig_nptn][dad->id];
+                            state = model_factory->unobserved_ptns[ptn+i-max_orig_nptn][dad->id];
                         else
-                            lh_tip = partial_lh_node + block*aln->STATE_UNKNOWN;
+                            state = aln->STATE_UNKNOWN;
 
+                        double *lh_tip = partial_lh_node + block*state;
                         double *this_vec_tip = vec_tip+i;
-                        for (c = 0; c < block; c++) {
+                        for (size_t c = 0; c < block; c++) {
                             *this_vec_tip = lh_tip[c];
                             this_vec_tip += VectorClass::size();
                         }
 
                     }
                     // compute likelihood per category
-                    for (c = 0; c < ncat_mix; c++) {
+                    for (size_t c = 0; c < ncat_mix; c++) {
     #ifdef KERNEL_FIX_STATES
                         dotProductVec<VectorClass, VectorClass, nstates, FMA>(lh_node, partial_lh_dad, lh_cat[c]);
     #else
@@ -2794,16 +2899,16 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                     // numerical scaling per category
                     UBYTE *scale_dad = dad_branch->scale_num + ptn*ncat_mix;
                     UBYTE min_scale;
-                    for (i = 0; i < VectorClass::size(); i++) {
+                    for (size_t i = 0; i < VectorClass::size(); i++) {
     //                    scale_dad = dad_branch->scale_num+(ptn+i)*ncat_mix;
                         min_scale = scale_dad[0];
-                        for (c = 1; c < ncat_mix; c++)
+                        for (size_t c = 1; c < ncat_mix; c++)
                             min_scale = min(min_scale, scale_dad[c]);
 
                         vc_min_scale_ptr[i] = min_scale;
 
                         double *this_lh_cat = &_pattern_lh_cat[ptn*ncat_mix + i];
-                        for (c = 0; c < ncat_mix; c++) {
+                        for (size_t c = 0; c < ncat_mix; c++) {
                             // rescale lh_cat if neccessary
                             if (scale_dad[c] == min_scale+1) {
                                 this_lh_cat[c*VectorClass::size()] *= SCALING_THRESHOLD;
@@ -2817,7 +2922,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                     sumVec<VectorClass, true>(lh_cat, lh_ptn, ncat_mix);
 
                 } else {
-                    for (i = 0; i < VectorClass::size(); i++) {
+                    for (size_t i = 0; i < VectorClass::size(); i++) {
                         vc_min_scale_ptr[i] = dad_branch->scale_num[ptn+i];
                     }
                 }
@@ -2839,7 +2944,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                     if (horizontal_or(vc_min_scale != 0.0)) {
                         // some entries are rescaled
                         double *lh_ptn_dbl = (double*)&lh_ptn;
-                        for (i = 0; i < VectorClass::size(); i++)
+                        for (size_t i = 0; i < VectorClass::size(); i++)
                             if (vc_min_scale_ptr[i] != 0.0)
                                 lh_ptn_dbl[i] *= SCALING_THRESHOLD;
                     }
@@ -2865,7 +2970,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
     	//-------- both dad and node are internal nodes -----------/
 
 #ifdef _OPENMP
-#pragma omp parallel for private(ptn, i, c) schedule(static, 1) num_threads(num_threads)
+#pragma omp parallel for schedule(static, 1) num_threads(num_threads)
 #endif
         for (int thread_id = 0; thread_id < num_threads; thread_id++) {
 
@@ -2881,7 +2986,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
             for (vector<TraversalInfo>::iterator it = traversal_info.begin(); it != traversal_info.end(); it++)
                 computePartialLikelihood(*it, ptn_lower, ptn_upper, thread_id);
 
-            for (ptn = ptn_lower; ptn < ptn_upper; ptn+=VectorClass::size()) {
+            for (size_t ptn = ptn_lower; ptn < ptn_upper; ptn+=VectorClass::size()) {
                 VectorClass lh_ptn(0.0);
 //                lh_ptn.load_a(&ptn_invar[ptn]);
                 VectorClass *lh_cat = (VectorClass*)(_pattern_lh_cat + ptn*ncat_mix);
@@ -2891,7 +2996,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                 // compute likelihood per category
                 if (SITE_MODEL) {
                     VectorClass* eval_ptr = (VectorClass*) &eval[ptn*nstates];
-                    for (c = 0; c < ncat; c++) {
+                    for (size_t c = 0; c < ncat; c++) {
     #ifdef KERNEL_FIX_STATES
                         dotProductExp<VectorClass, double, nstates, FMA>(eval_ptr, partial_lh_node, partial_lh_dad, cat_length[c], lh_cat[c]);
     #else
@@ -2906,7 +3011,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                     }
                 } else {
                     double *val_tmp = val;
-                    for (c = 0; c < ncat_mix; c++) {
+                    for (size_t c = 0; c < ncat_mix; c++) {
     #ifdef KERNEL_FIX_STATES
                         dotProduct3Vec<VectorClass, double, nstates, FMA>(val_tmp, partial_lh_node, partial_lh_dad, lh_cat[c]);
     #else
@@ -2930,15 +3035,15 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                     UBYTE sum_scale[ncat_mix];
                     UBYTE min_scale;
 
-                    for (i = 0; i < VectorClass::size(); i++) {
+                    for (size_t i = 0; i < VectorClass::size(); i++) {
                         min_scale = sum_scale[0] = scale_dad[0] + scale_node[0];
-                        for (c = 1; c < ncat_mix; c++) {
+                        for (size_t c = 1; c < ncat_mix; c++) {
                             sum_scale[c] = scale_dad[c] + scale_node[c];
                             min_scale = min(min_scale, sum_scale[c]);
                         }
                         vc_min_scale_ptr[i] = min_scale;
                         double *this_lh_cat = &_pattern_lh_cat[ptn*ncat_mix + i];
-                        for (c = 0; c < ncat_mix; c++) {
+                        for (size_t c = 0; c < ncat_mix; c++) {
                             if (sum_scale[c] == min_scale+1) {
                                 this_lh_cat[c*VectorClass::size()] *= SCALING_THRESHOLD;
                             } else if (sum_scale[c] > min_scale+1) {
@@ -2951,7 +3056,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                     }
                     sumVec<VectorClass, true>(lh_cat, lh_ptn, ncat_mix);
                 } else {
-                    for (i = 0; i < VectorClass::size(); i++) {
+                    for (size_t i = 0; i < VectorClass::size(); i++) {
                         vc_min_scale_ptr[i] = dad_branch->scale_num[ptn+i] + node_branch->scale_num[ptn+i];
                     }
                 } // if SAFE_NUMERIC
@@ -2974,7 +3079,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
                     if (horizontal_or(vc_min_scale != 0.0)) {
                         // some entries are rescaled
                         double *lh_ptn_dbl = (double*)&lh_ptn;
-                        for (i = 0; i < VectorClass::size(); i++)
+                        for (size_t i = 0; i < VectorClass::size(); i++)
                             if (vc_min_scale_ptr[i] != 0.0)
                                 lh_ptn_dbl[i] *= SCALING_THRESHOLD;
                     }
@@ -3010,7 +3115,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
     // arbitrarily fix tree_lh if underflown for some sites
     if (!std::isfinite(tree_lh)) {
         tree_lh = 0.0;
-        for (ptn = 0; ptn < orig_nptn; ptn++) {
+        for (size_t ptn = 0; ptn < orig_nptn; ptn++) {
           if (!std::isfinite(_pattern_lh[ptn])) {
                 _pattern_lh[ptn] = LOG_SCALING_THRESHOLD*4; // log(2^(-1024))
             }
@@ -3024,23 +3129,28 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
             outError("+ASC not supported with robust phylo");
         size_t sites = getAlnNSite();
         size_t sites_drop = (int)((1.0-params->robust_phy_keep)*sites);
-        size_t site;
-        for (ptn = 0, site = 0; ptn < orig_nptn; ptn++)
-            for (i = 0; i < ptn_freq[ptn]; i++, site++)
+        size_t site = 0;
+        for (size_t ptn = 0; ptn < orig_nptn; ptn++) {
+            for (size_t i = 0; i < ptn_freq[ptn]; i++, site++) {
                 _site_lh[site] = _pattern_lh[ptn];
+            }
+        }
         ASSERT(site == sites);
         nth_element(_site_lh, _site_lh + sites_drop, _site_lh + sites);
         tree_lh = 0.0;
-        for (i = sites_drop; i < sites; i++)
+        for (size_t i = sites_drop; i < sites; i++) {
             tree_lh += _site_lh[i];
+        }
     } else if (params->robust_median) {
         if (ASC_Holder || ASC_Lewis)
             outError("+ASC not supported with robust phylo");
         size_t sites = getAlnNSite();
-        size_t site;
-        for (ptn = 0, site = 0; ptn < orig_nptn; ptn++)
-            for (i = 0; i < ptn_freq[ptn]; i++, site++)
+        size_t site = 0;
+        for (size_t ptn = 0; ptn < orig_nptn; ptn++) {
+            for (int i = 0; i < ptn_freq[ptn]; i++, site++) {
                 _site_lh[site] = _pattern_lh[ptn];
+            }
+        }
         ASSERT(site == sites);
         nth_element(_site_lh, _site_lh + sites/2, _site_lh + sites);
         if ((sites & 1) == 1) {
@@ -3058,15 +3168,15 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
         size_t step_unobserved_ptns = model_factory->unobserved_ptns.size() / nstates;
         double *const_lh_next = const_lh + step_unobserved_ptns;
         for (int step = 1; step < nstates; step++, const_lh_next += step_unobserved_ptns) {
-            for (ptn = 0; ptn < orig_nptn; ptn+=VectorClass::size())
+            for (size_t ptn = 0; ptn < orig_nptn; ptn+=VectorClass::size())
                 (VectorClass().load_a(&const_lh[ptn]) + VectorClass().load_a(&const_lh_next[ptn])).store_a(&const_lh[ptn]);
         }
         // cutoff the last entries if going beyond
-        for (ptn = orig_nptn; ptn < max_orig_nptn; ptn++)
+        for (size_t ptn = orig_nptn; ptn < max_orig_nptn; ptn++)
             const_lh[ptn] = 0.0;
         
         VectorClass sum_corr = 0.0;
-        for (ptn = 0; ptn < orig_nptn; ptn+=VectorClass::size()) {
+        for (size_t ptn = 0; ptn < orig_nptn; ptn+=VectorClass::size()) {
             VectorClass prob_variant = log(1.0 - VectorClass().load_a(&const_lh[ptn]));
             (VectorClass().load_a(&_pattern_lh[ptn]) - prob_variant).store_a(&_pattern_lh[ptn]);
             sum_corr += prob_variant*VectorClass().load_a(&ptn_freq[ptn]);
@@ -3088,7 +3198,7 @@ double PhyloTree::computeLikelihoodBranchGenericSIMD(PhyloNeighbor *dad_branch,
 //            _pattern_lh_cat[ptn] *= inv_const;
         
     	prob_const = log(1.0 - prob_const);
-    	for (ptn = 0; ptn < orig_nptn; ptn+=VectorClass::size())
+    	for (size_t ptn = 0; ptn < orig_nptn; ptn+=VectorClass::size())
             (VectorClass().load_a(&_pattern_lh[ptn])-prob_const).store_a(&_pattern_lh[ptn]);
 //    		_pattern_lh[ptn] -= prob_const;
     	tree_lh -= aln->getNSite()*prob_const;
diff --git a/tree/phylotree.cpp b/tree/phylotree.cpp
index 84a1d2908..61a0bb9ab 100644
--- a/tree/phylotree.cpp
+++ b/tree/phylotree.cpp
@@ -33,6 +33,7 @@
 #include "upperbounds.h"
 #include "utils/MPIHelper.h"
 #include "utils/hammingdistance.h"
+#include "utils/setcheck.h"
 #include "model/modelmixture.h"
 #include "phylonodemixlen.h"
 #include "phylotreemixlen.h"
@@ -133,6 +134,7 @@ void PhyloTree::init() {
     vector_size = 0;
     safe_numeric = false;
     summary = nullptr;
+    isSummaryBorrowed = false;
 }
 
 PhyloTree::PhyloTree(Alignment *aln) : MTree(), CheckpointFactory() {
@@ -204,73 +206,33 @@ void myPartitionsDestroy(partitionList *pl) {
 
 PhyloTree::~PhyloTree() {
     doneComputingDistances();
-    if (nni_scale_num)
-        aligned_free(nni_scale_num);
-    nni_scale_num = NULL;
-    if (nni_partial_lh)
-        aligned_free(nni_partial_lh);
-    nni_partial_lh = NULL;
-    if (central_partial_lh)
-        aligned_free(central_partial_lh);
-    central_partial_lh = NULL;
-    if (central_scale_num)
-        aligned_free(central_scale_num);
-    central_scale_num = NULL;
-
-    if (central_partial_pars)
-        aligned_free(central_partial_pars);
-    central_partial_pars = NULL;
-
-    if(cost_matrix){
-        aligned_free(cost_matrix);
-        cost_matrix = NULL;
-    }
-
-    if (model_factory)
-        delete model_factory;
+    aligned_free(nni_scale_num);
+    aligned_free(nni_partial_lh);
+    aligned_free(central_partial_lh);
+    aligned_free(central_scale_num);
+    aligned_free(central_partial_pars);
+    aligned_free(cost_matrix);
+
+    delete model_factory;
     model_factory = NULL;
-    if (model)
-        delete model;
+    delete model;
     model = NULL;
-    if (site_rate)
-        delete site_rate;
+    delete site_rate;
     site_rate = NULL;
-    if (_pattern_lh_cat)
-        aligned_free(_pattern_lh_cat);
-    _pattern_lh_cat = NULL;
-    if (_pattern_lh)
-        aligned_free(_pattern_lh);
-    _pattern_lh = NULL;
-    if (_site_lh)
-        aligned_free(_site_lh);
-    _site_lh = NULL;
-    //if (state_freqs)
-    //    delete [] state_freqs;
-    if (theta_all)
-        aligned_free(theta_all);
-    theta_all = NULL;
-    if (buffer_scale_all)
-        aligned_free(buffer_scale_all);
-    buffer_scale_all = NULL;
-    if (buffer_partial_lh)
-        aligned_free(buffer_partial_lh);
-    buffer_partial_lh = NULL;
-    if (ptn_freq)
-        aligned_free(ptn_freq);
-    ptn_freq = NULL;
-    if (ptn_freq_pars)
-        aligned_free(ptn_freq_pars);
-    ptn_freq_pars = NULL;
+    aligned_free(_pattern_lh_cat);
+    aligned_free(_pattern_lh);
+    aligned_free(_site_lh);
+    aligned_free(theta_all);
+    aligned_free(buffer_scale_all);
+    aligned_free(buffer_partial_lh);
+    aligned_free(ptn_freq);
+    aligned_free(ptn_freq_pars);
     ptn_freq_computed = false;
-    if (ptn_invar)
-        aligned_free(ptn_invar);
-    ptn_invar = NULL;
-    if (dist_matrix)
-        delete[] dist_matrix;
+    aligned_free(ptn_invar);
+    delete[] dist_matrix;
     dist_matrix = NULL;
 
-    if (var_matrix)
-        delete[] var_matrix;
+    delete[] var_matrix;
     var_matrix = NULL;
 
     if (pllPartitions)
@@ -283,6 +245,9 @@ PhyloTree::~PhyloTree() {
     pllPartitions = NULL;
     pllAlignment = NULL;
     pllInst = NULL;
+    if (!isSummaryBorrowed) {
+        delete summary;
+    }
     summary = nullptr;
 }
 
@@ -360,27 +325,57 @@ void PhyloTree::copyTree(MTree *tree, string &taxa_set) {
     setAlignment(aln);
 }
 
-void PhyloTree::copyPhyloTree(PhyloTree *tree) {
+void PhyloTree::copyPhyloTree(PhyloTree *tree, bool borrowSummary) {
     MTree::copyTree(tree);
     if (!tree->aln)
         return;
     setAlignment(tree->aln);
+    if (borrowSummary && summary!=tree->summary && tree->summary!=nullptr) {
+        if (!isSummaryBorrowed) {
+            delete summary;
+        }
+        summary           = tree->summary;
+        isSummaryBorrowed = (summary!=nullptr);
+    }
 }
 
-void PhyloTree::copyPhyloTreeMixlen(PhyloTree *tree, int mix) {
+void PhyloTree::copyPhyloTreeMixlen(PhyloTree *tree, int mix, bool borrowSummary) {
     if (tree->isMixlen()) {
         ((PhyloTreeMixlen*)tree)->cur_mixture = mix;
     }
-    copyPhyloTree(tree);
+    copyPhyloTree(tree, borrowSummary);
     if (tree->isMixlen()) {
         ((PhyloTreeMixlen*)tree)->cur_mixture = -1;
     }
 }
 
+#define FAST_NAME_CHECK 1
 void PhyloTree::setAlignment(Alignment *alignment) {
     aln = alignment;
-    bool err = false;
+    //double checkStart = getRealTime();
     size_t nseq = aln->getNSeq();
+    bool err = false;
+#if FAST_NAME_CHECK
+    map<string, Node*> mapNameToNode;
+    getMapOfTaxonNameToNode(nullptr, nullptr, mapNameToNode);
+    for (size_t seq = 0; seq < nseq; seq++) {
+        string seq_name = aln->getSeqName(seq);
+        auto it = mapNameToNode.find(seq_name);
+        if (it==mapNameToNode.end()) {
+            string str = "Alignment sequence ";
+            str += seq_name;
+            str += " does not appear in the tree";
+            err = true;
+            outError(str, false);
+        } else {
+            (*it).second->id = seq;
+            mapNameToNode.erase(it);
+        }
+    }
+#else
+    //For each sequence, find the correponding node, via
+    //the map (if there is one), tag it, and remove it from
+    //the map.
     for (size_t seq = 0; seq < nseq; seq++) {
         string seq_name = aln->getSeqName(seq);
         Node *node = findLeafName(seq_name);
@@ -395,14 +390,22 @@ void PhyloTree::setAlignment(Alignment *alignment) {
             node->id = seq;
         }
     }
+#endif
     if (err) {
         printTree(cout, WT_NEWLINE);
         outError("Tree taxa and alignment sequence do not match (see above)");
     }
-    if (rooted) {
-        ASSERT(root->name == ROOT_NAME);
-        root->id = nseq;
+#if FAST_NAME_CHECK
+    //Since every sequence in the alignment has had its
+    //corresponding node tagged with an id, and that node's
+    //entry has been removed from the map, it follows:
+    //any node still referenced from the map does NOT
+    //correspond to a sequence in the alignment.
+    for (auto it = mapNameToNode.begin(); it != mapNameToNode.end(); ++it) {
+        outError((string)"Tree taxon " + it->first + " does not appear in the alignment", false);
+        err = true;
     }
+#else
     StrVector taxname;
     getTaxaName(taxname);
     for (StrVector::iterator it = taxname.begin(); it != taxname.end(); it++)
@@ -410,7 +413,20 @@ void PhyloTree::setAlignment(Alignment *alignment) {
             outError((string)"Tree taxon " + (*it) + " does not appear in the alignment", false);
             err = true;
         }
-    if (err) outError("Tree taxa and alignment sequence do not match (see above)");
+#endif
+    if (err) {
+        outError("Tree taxa and alignment sequence do not match (see above)");
+    }
+    if (rooted) {
+        ASSERT(root->name == ROOT_NAME);
+        root->id = aln->getNSeq();
+    }
+    /*
+    if (verbose_mode >= VB_MED) {
+        cout << "Alignment namecheck took " << (getRealTime()-checkStart)
+        << " sec (of wall-clock time)" << endl;
+    }
+    */
 }
 
 void PhyloTree::setRootNode(const char *my_root, bool multi_taxa) {
@@ -919,41 +935,20 @@ void PhyloTree::initializeAllPartialLh() {
 }
 
 void PhyloTree::deleteAllPartialLh() {
-
-    if (central_partial_lh) {
-        aligned_free(central_partial_lh);
-    }
-    if (central_scale_num) {
-        aligned_free(central_scale_num);
-    }
-    if (central_partial_pars)
-        aligned_free(central_partial_pars);
-
-    if (nni_scale_num)
-        aligned_free(nni_scale_num);
-    nni_scale_num = NULL;
-    if (nni_partial_lh)
-        aligned_free(nni_partial_lh);
-    nni_partial_lh = NULL;
-
-    if (ptn_invar)
-        aligned_free(ptn_invar);
-    if (ptn_freq)
-        aligned_free(ptn_freq);
-    if (ptn_freq_pars)
-        aligned_free(ptn_freq_pars);
-    if (theta_all)
-        aligned_free(theta_all);
-    if (buffer_scale_all)
-        aligned_free(buffer_scale_all);
-    if (buffer_partial_lh)
-        aligned_free(buffer_partial_lh);
-    if (_pattern_lh_cat)
-        aligned_free(_pattern_lh_cat);
-    if (_pattern_lh)
-        aligned_free(_pattern_lh);
-    if (_site_lh)
-        aligned_free(_site_lh);
+    aligned_free(central_partial_lh);
+    aligned_free(central_scale_num);
+    aligned_free(central_partial_pars);
+    aligned_free(nni_scale_num);
+    aligned_free(nni_partial_lh);
+    aligned_free(ptn_invar);
+    aligned_free(ptn_freq);
+    aligned_free(ptn_freq_pars);
+    aligned_free(theta_all);
+    aligned_free(buffer_scale_all);
+    aligned_free(buffer_partial_lh);
+    aligned_free(_pattern_lh_cat);
+    aligned_free(_pattern_lh);
+    aligned_free(_site_lh);
     central_partial_lh = NULL;
     central_scale_num = NULL;
     central_partial_pars = NULL;
@@ -1722,7 +1717,6 @@ int PhyloTree::computePatternCategories(IntVector *pattern_ncat) {
     delete [] id_mixture;
     delete [] sorted_lh_mixture;
     delete [] lh_mixture;
-//    delete [] cat_prob;
     return num_best_mixture;
 }
 
@@ -2568,9 +2562,10 @@ void PhyloTree::optimizePatternRates(DoubleVector &pattern_rates) {
         ptn_id.push_back(ptn);
         paln->extractPatterns(aln, ptn_id);
         PhyloTree *tree = new PhyloTree;
-        tree->copyPhyloTree(this);
+        tree->copyPhyloTree(this, false); //Local alignment, so tree can't "borrow" the summary of this
         tree->setParams(params);
         tree->setAlignment(paln);
+        tree->prepareToComputeDistances();
         tree->sse = sse;
         tree->num_threads = 1;
         // initialize model
@@ -2580,7 +2575,7 @@ void PhyloTree::optimizePatternRates(DoubleVector &pattern_rates) {
         tree->optimizeTreeLengthScaling(MIN_SITE_RATE, pattern_rates[ptn], MAX_SITE_RATE, 0.0001);
         
         tree->setModelFactory(NULL);
-        
+        tree->doneComputingDistances();
         delete tree;
         delete paln;
     }
@@ -3107,8 +3102,12 @@ void PhyloTree::prepareToComputeDistances() {
     for (threads -= distanceProcessors.size(); 0 < threads; --threads) {
         distanceProcessors.emplace_back(new AlignmentPairwise(this));
     }
+    if (summary!=nullptr && !isSummaryBorrowed) {
+        delete summary;
+        summary = nullptr;
+    }
     if (params->experimental) {
-        summary = new AlignmentSummary(aln, true);
+        summary = new AlignmentSummary(aln, true, true);
         summary->constructSequenceMatrix(false);
     }
 }
@@ -3118,22 +3117,37 @@ bool PhyloTree::hasMatrixOfConvertedSequences() const {
 }
 
 size_t PhyloTree::getConvertedSequenceLength() const {
+    if (summary==nullptr) {
+        return 0;
+    }
     return summary->sequenceLength;
 }
 
 const char* PhyloTree::getConvertedSequenceByNumber(int seq1) const {
+    if (summary==nullptr || summary->sequenceMatrix==nullptr) {
+        return nullptr;
+    }
     return summary->sequenceMatrix + seq1 * summary->sequenceLength;
 }
 
 const int* PhyloTree::getConvertedSequenceFrequencies() const {
+    if (summary==nullptr) {
+        return nullptr;
+    }
     return summary->siteFrequencies.data();
 }
 
 const int* PhyloTree::getConvertedSequenceNonConstFrequencies() const {
+    if (summary==nullptr) {
+        return nullptr;
+    }
     return summary->nonConstSiteFrequencies.data();
 }
 
 int  PhyloTree::getSumOfFrequenciesForSitesWithConstantState(int state) const {
+    if (summary==nullptr) {
+        return 0;
+    }
     return summary->getSumOfConstantSiteFrequenciesForState(state);
 }
 
@@ -3154,7 +3168,9 @@ void PhyloTree::doneComputingDistances() {
         delete (*it);
     }
     distanceProcessors.clear();
-    delete summary;
+    if (!isSummaryBorrowed) {
+        delete summary;
+    }
     summary = nullptr;
 }
 
@@ -3369,7 +3385,7 @@ double PhyloTree::computeDist_Experimental(double *dist_mat, double *var_mat) {
         }
     }
     EX_TRACE("Summarizing...");
-    AlignmentSummary s(aln, false);
+    AlignmentSummary s(aln, false, false);
     int maxDistance = 0;
     
     EX_TRACE("Summarizing found " << s.sequenceLength
@@ -3511,7 +3527,7 @@ double PhyloTree::computeDist(Params &params, Alignment *alignment, double* &dis
         double begin_time = getRealTime();
         longest_dist = (params.experimental)
             ? computeDist_Experimental(dist_mat, var_mat)
-            : computeDist(dist_mat, var_mat);
+            : computeDist(dist_mat, var_mat); //DISABLED
         if (verbose_mode >= VB_MED) {
             cout << "Distance calculation time: "
             << getRealTime() - begin_time << " seconds" << endl;
@@ -3531,7 +3547,7 @@ double PhyloTree::computeObsDist(double *dist_mat) {
     size_t nseqs = aln->getNSeq();
     double longest_dist = 0.0;
     #ifdef _OPENMP
-    #pragma omp parallel for
+    #pragma omp parallel for schedule(dynamic)
     #endif
     for (size_t seq1 = 0; seq1 < nseqs; ++seq1) {
         size_t pos = seq1*nseqs;
@@ -3625,7 +3641,7 @@ void PhyloTree::computeBioNJ(Params &params) {
         treeBuilder->constructTree(dist_file, bionj_file);
         if (verbose_mode >= VB_MED) {
             cout << "Constructing " << treeBuilder->getName() << " tree"
-                << " (from distance file " << dist_file << ") took"
+                << " (from distance file " << dist_file << ") took "
                 << (getRealTime()-start_time) << " sec." << endl;
         }
     }
@@ -4165,8 +4181,6 @@ NNIMove PhyloTree::getBestNNIForBran(PhyloNode *node1, PhyloNode *node2, NNIMove
 
     mem_slots.eraseSpecialNei();
 
-//     aligned_free(new_partial_lh);
-
      // restore the length of 4 branches around node1, node2
      FOR_NEIGHBOR(node1, node2, it)
          (*it)->setLength((*it)->node->findNeighbor(node1));
@@ -4182,9 +4196,7 @@ NNIMove PhyloTree::getBestNNIForBran(PhyloNode *node1, PhyloNode *node2, NNIMove
      } else {
          res = nniMoves[1];
      }
-    if (newNNIMoves) {
-        delete [] nniMoves;
-    }
+    delete [] nniMoves;
     return res;
 }
 
diff --git a/tree/phylotree.h b/tree/phylotree.h
index 30a8b251f..b75d8769c 100644
--- a/tree/phylotree.h
+++ b/tree/phylotree.h
@@ -111,7 +111,17 @@ inline T *aligned_alloc(size_t size) {
     return (T*)mem;
 }
 
-inline void aligned_free(void* mem) {
+template <class T> T* ensure_aligned_allocated(T* &ptr, size_t size) {
+    if (ptr==nullptr) {
+        ptr = aligned_alloc<T>(size);
+    }
+    return ptr;
+}
+
+template <class T> void aligned_free(T* & mem) {
+    if (mem==nullptr) {
+        return;
+    }
 #if defined WIN32 || defined _WIN32 || defined __WIN32__
     #if (defined(__MINGW32__) || defined(__clang__)) && defined(BINARY32)
         __mingw_aligned_free(mem);
@@ -121,6 +131,7 @@ inline void aligned_free(void* mem) {
 #else
 	free(mem);
 #endif
+    mem = nullptr;
 }
 
 
@@ -416,16 +427,18 @@ class PhyloTree : public MTree, public Optimization, public CheckpointFactory {
             copy the phylogenetic tree structure into this tree, designed specifically for PhyloTree.
             So there is some distinction with copyTree.
             @param tree the tree to copy
+            @param borrowSummary true if the alignment summary of the other tree is to be "borrowed"
      */
-    void copyPhyloTree(PhyloTree *tree);
+    void copyPhyloTree(PhyloTree *tree, bool borrowSummary);
 
     /**
             copy the phylogenetic tree structure into this tree, designed specifically for PhyloTree.
             So there is some distinction with copyTree.
             @param tree the tree to copy
             @param mix mixture ID of branch lengths
+            @param borrowSummary true if the alignment summary of the other tree is to be "borrowed"
      */
-    virtual void copyPhyloTreeMixlen(PhyloTree *tree, int mix);
+    virtual void copyPhyloTreeMixlen(PhyloTree *tree, int mix, bool borrowSummary);
 
 
     /**
@@ -779,9 +792,9 @@ class PhyloTree : public MTree, public Optimization, public CheckpointFactory {
     UINT *tip_partial_pars;
 
     bool ptn_freq_computed;
-
-    /** site log-likelihood buffer for robust phylogeny idea */
-    double *_site_lh;
+
+    /** site log-likelihood buffer for robust phylogeny idea */
+    double *_site_lh;
     
     /** vector size used by SIMD kernel */
     size_t vector_size;
@@ -2348,6 +2361,7 @@ class PhyloTree : public MTree, public Optimization, public CheckpointFactory {
 
     std::vector<AlignmentPairwise*> distanceProcessors;
     AlignmentSummary* summary;
+    bool isSummaryBorrowed;
     
 };
 
diff --git a/tree/phylotreesse.cpp b/tree/phylotreesse.cpp
index 787c84c89..f1d1c1f0b 100644
--- a/tree/phylotreesse.cpp
+++ b/tree/phylotreesse.cpp
@@ -259,30 +259,27 @@ void PhyloTree::computeTipPartialLikelihood() {
         int nstates = aln->num_states;
         size_t nseq = aln->getNSeq();
         ASSERT(vector_size > 0);
+        
+        
 #ifdef _OPENMP
         #pragma omp parallel for schedule(static)
 #endif
         for (int nodeid = 0; nodeid < nseq; nodeid++) {
+            auto stateRow = getConvertedSequenceByNumber(nodeid);
             int i, x, v;
             double *partial_lh = tip_partial_lh + tip_block_size*nodeid;
             size_t ptn;
             for (ptn = 0; ptn < nptn; ptn+=vector_size, partial_lh += nstates*vector_size) {
-//                int state[vector_size];
-//                for (v = 0; v < vector_size; v++) {
-//                    if (ptn+v < nptn)
-//                        state[v] = aln->at(ptn+v)[nodeid];
-//                    else
-//                        state[v] = aln->STATE_UNKNOWN;
-//                }
-
                 double *inv_evec = &model->getInverseEigenvectors()[ptn*nstates*nstates];
                 for (v = 0; v < vector_size; v++) {
                     int state = 0;
-                    if (ptn+v < nptn)
-                        state = aln->at(ptn+v)[nodeid];
-    //                double *partial_lh = node_partial_lh + ptn*nstates;
-//                    double *inv_evec = models->at(ptn)->getInverseEigenvectors();
-
+                    if (ptn+v < nptn) {
+                        if (stateRow!=nullptr) {
+                            state = stateRow[ptn+v];
+                        } else {
+                            state = aln->at(ptn+v)[nodeid];
+                        }
+                    }
                     if (state < nstates) {
                         for (i = 0; i < nstates; i++)
                             partial_lh[i*vector_size+v] = inv_evec[(i*nstates+state)*vector_size+v];
@@ -290,7 +287,6 @@ void PhyloTree::computeTipPartialLikelihood() {
                         // special treatment for unknown char
                         for (i = 0; i < nstates; i++) {
                             double lh_unknown = 0.0;
-//                            double *this_inv_evec = inv_evec + i*nstates;
                             for (x = 0; x < nstates; x++)
                                 lh_unknown += inv_evec[(i*nstates+x)*vector_size+v];
                             partial_lh[i*vector_size+v] = lh_unknown;
@@ -450,7 +446,6 @@ void PhyloTree::computeTipPartialLikelihood() {
 	double *all_inv_evec = model->getInverseEigenvectors();
 	ASSERT(all_inv_evec);
 
-
 	for (state = 0; state < nstates; state++) {
 		double *this_tip_partial_lh = &tip_partial_lh[state*nstates*nmixtures];
 		for (m = 0; m < nmixtures; m++) {