TODO: make a CIFAR-bw whitened patches dataset to use in examples, can probably copy it from projects/autoenc
TODO: denoising score matching
- current implementation is wrong. Updaters for the different parameters should share the same noise vectors.
Implementing this correctly will require a stats object.
- denoising score matching: test it on this CIFAR patches dataset
TODO: implement gamma units with fixed scale
TODO: implement free energy for gaussian units with fixed variance (already computed, see lyxfile)
TODO: Fast persistent CD (FPCD) (Tieleman) - this will require another stats function that also keeps internal state (the fast weights)
TODO: implicit score matching: needs second derivatives (diagonal of the hessian). Kevin Swersky may have code for this so don't implement this yet.
TODO: Hinton suggests estimating the current mean activity of the units as an exponential average over past minibatches. Is this feasible to implement? It would need a variable to store the previous value, but that should be doable with Theano updates (a rough sketch follows below).
- this affects both the sparsity penalty for autoencoders and the SparsityUpdater.
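A minimal sketch of the exponential-average idea using a Theano shared variable and an update expression. The names (mean_activity, h_act) and the decay value are made up for illustration, not part of the current code:

    import numpy as np
    import theano
    import theano.tensor as T

    n_hidden = 100
    # shared variable holding the running estimate of the mean hidden activity
    mean_activity = theano.shared(np.zeros(n_hidden, dtype=theano.config.floatX))

    h_act = T.matrix('h_act')  # hidden activations for the current minibatch
    decay = 0.95               # exponential averaging coefficient

    # new estimate: decay * previous estimate + (1 - decay) * current minibatch mean
    new_mean = decay * mean_activity + (1 - decay) * T.mean(h_act, axis=0)

    update_mean = theano.function([h_act], new_mean,
                                  updates=[(mean_activity, new_mean)])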
TODO: test autoencoder for exotic RBMs
- test the varautoencoder to see if the proxy units support works correctly
- test a conditional autoencoder to see if the context units support works correctly
TODO: update code to use OrderedDict for update dictionaries.
TODO: truncated exponential sampler fails for large lambda (in magnitude) and for lambda = 0. These cases can occur... fix this.
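For reference, a numerically stable inverse-CDF formulation (a plain numpy sketch of the math only, not the Theano sampler itself):

    import numpy as np

    def sample_truncexp(lam, u):
        # sample x ~ p(x) propto exp(lam * x) on [0, 1] via the inverse CDF
        # x = log(1 + u * (exp(lam) - 1)) / lam, rewritten for stability
        if abs(lam) < 1e-8:
            return u  # lam -> 0: the distribution tends to uniform(0, 1)
        if lam < 0:
            # expm1(lam) lies in (-1, 0], so log1p stays finite for very negative lam
            return np.log1p(u * np.expm1(lam)) / lam
        # lam > 0: rewrite as 1 + log(u + (1 - u) * exp(-lam)) / lam to avoid overflow
        return 1.0 + np.log(u + (1.0 - u) * np.exp(-lam)) / lam

    for lam in (-50.0, -1.0, 0.0, 1.0, 50.0):
        print(lam, sample_truncexp(lam, np.random.uniform()))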
TODO: add a general purpose 'autoencoder' updater which can use a custom error measure (which would be MSE or CE in the most common case), not just the 'natural' one associated with the RBM. This would be useful to build autoencoders with NReLUs for example.
- objective = reconstruction + error measure
TODO: Parameters.energy_gradient gives the NEGATIVE of the energy gradient. Either rename this method, or fix this (and take the negative wherever it is used)
TODO: current minibatchtrainer isn't efficient when multiple functions are compiled - the shared variables for the data are recreated for each function. Find a better way of implementing this. Also, it would be good if the 'updates' could be made accessible somehow, which is useful to diagnose problems. I guess this whole class needs to be rethought a bit.
TODO: mapping a vmap onto an activation vmap is a common operation for Units; it would be good if there were a method for this, with a short name.
It has to be overridable, for unit types that have proxies (learntprecisiongaussian, for example).
Refactor units.py to use this method.
Maybe Units.amap(vmap)? (a rough sketch follows below)
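Roughly, such a method might look like this (the name amap and the assumed activation(vmap) helper are placeholders, just to pin down the idea):

    class Units(object):
        # ...
        def amap(self, vmap):
            # map a vmap onto an activation vmap: compute this Units instance's
            # activation from the given vmap, keyed by the instance itself;
            # unit types with proxies can override this.
            return {self: self.activation(vmap)}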
TODO: test TransformedParameters, write an example (FCRBM)?
TODO: implement energy_gradients for ThirdOrderFactoredParameters and Convolutional2DParameters
TODO: rbm.mean_field and rbm.sample always complete the vmap - sometimes this is undesirable. Maybe add a 'complete=True' argument (default True)
TODO: test NRELUnits
TODO: test exponential units
- with non-truncated exponential units, lambda has to be positive, requires constraint on the parameters.
TODO: binomial units (read rate-coded RBMs paper http://www.cs.toronto.edu/~hinton/absps/nips00-ywt.pdf)
TODO: poisson units
TODO: make rbm.energy_gradient_for use theano's T.grad functionality by default. If we rely on that, it's quite easy to get the derivative of the energy function with respect to a given set of units. The automatic differentiation should only be offered as a 'default' or 'fallback' for when the Parameters don't provide custom implementations of the gradients of the energy term with respect to the units.
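A hedged sketch of what such a fallback could look like, assuming the RBM can build a symbolic energy expression from a vmap (the exact method names are illustrative):

    import theano.tensor as T

    def energy_gradient_for(rbm, units, vmap):
        # fallback: differentiate the full symbolic energy with respect to the
        # values of the given Units instance; Parameters subclasses can still
        # override this with hand-written gradients where that is cheaper.
        energy = rbm.energy(vmap)               # assumed symbolic energy, one value per example
        return T.grad(T.sum(energy), vmap[units])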
TODO: Exact partition function calculation
- can this be formulated for a general RBM model easily?
- it should also be possible to select which type of units is summed over numerically, as this determines
the runtime of the computation.
TODO: reconstruction cost is worthless for PCD, so if we're going to have PCD we'll also need something like pseudo-likelihood. Also adapt test_mnist_persistent.py to use this new measure. This requires free energy to be implemented.
TODO: truncated exponential units
- free energy term is: -sum_j(log((exp(d*a_j) - 1)/a_j)) for distribution TEXP(a_j, d) (a_j is the linear activation of hidden unit h_j).
TODO: two-way factored RBM from http://www.cs.utoronto.ca/~hinton/absps/netflixICML.pdf section 5
- see if it can be built
- maybe create an example for it
TODO: try to reproduce that FCRBM model for modeling motion style!
TODO: in a conditional RBM (e.g. with visibles v, hiddens h and context x), CD won't work if the context variable isn't supplied. This is because CD will try to update the parameters tying h/v and x regardless of whether x was supplied as context. It would be nice if this didn't happen, i.e. when x is not supplied as context, all parameters tying x are simply ignored. This would also avoid some hard-to-decipher error messages. This would require knowledge about the dependencies in the RBM, so these dependencies would have to be specified on the Parameters.
TODO: AIS
- This might also be interesting to have a look at; it has a PCD implementation (and some other things such as AIS) in Python + cuv:
https://github.com/deeplearningais/CUV/tree/master/examples/rbm
TODO: docs: explain how/why parameter tying works
TODO: add docstrings everywhere
TODO: docs: explain the need for supplying an 'initial vmap' (for stats, but also for training)
+ convention: minibatch contributions are SUMMED rather than averaged.
TODO: cleanup: remove some comments that are irrelevant
TODO: docs: what are the different components and their roles, what are vmaps, umaps, dmaps, ..., and what are the interfaces the different components adhere to (e.g. the 'cd' keyword argument for samplers)
+ some common use cases + example usage for each feature (e.g. how to use momentum, how to construct a CD updater)
====================
BONUS:
- mcRBM
- the deeplearning.net tutorials now include a tutorial for building the HMC sampler that was used for the visibles in this paper, and it looks pretty damn complicated. Don't waste time on this for now.
- technically, an mcRBM isn't even an RBM!
- perhaps the samplers from MonteTheano can be of use?
- in 'Learning to represent visual input', Hinton gives a simple 'mean field' rule that also works; try to figure this out and implement it (it might require refactoring of cd_stats to allow 'plugging in' sampling strategies)
- spike and slab RBMs
- these require a special sampling procedure (in 3 steps instead of 2) and are thus not compatible with cd_stats. It will probably be a challenge to fit these into the framework, if it's possible at all. Maybe abstracting out the sampling procedure from cd_stats is a good starting point? That way, maybe HMC could be used for the mcRBM as well.
- in the code that was used for the paper (https://bitbucket.org/gdesjardins/ssrbm_utlc/src/d54c0e410964/pssrbm.py),
s and h (the latent variables) are sampled given only v. This means that the dependence between s and h can be ignored for practical purposes, which DOES make it compatible with cd_stats, in theory...
====================
autoencoder.mean_reconstruction returns a vmap which contains more than just the requested visible_units; it also includes proxy units.
This can lead to unexpected behaviour when the vmap is iterated over. It may be worth removing the spurious unit sets from the vmap.
This issue may occur in other places in the code as well. If something is done about it, look for those too!
TODO: write some proper automated tests
TODO: combining softmaxunits with the convolutional parameters is impossible, because they require different tensor shapes.
- convolution requires: (minibatches, maps, width, height)
- softmax requires: (minibatches, units, states)
Investigate whether it is possible to write a 'ReshapedUnits' proxy that makes this work (a rough sketch follows below).
Upon instantiation of the proxy, it will have to add itself to the Units.proxy_units variable, because the encapsulated Units instance is not 'aware' that it will be proxied when it is created.
If this is implemented, it will become necessary to support proxies of proxies, which is currently not supported!
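A very rough sketch of the idea (all names are hypothetical, and the proxy registration issue mentioned above is only indicated in a comment):

    import theano.tensor as T

    class ReshapedUnits(object):
        def __init__(self, units, shape):
            self.units = units   # the encapsulated Units instance
            self.shape = shape   # e.g. (minibatches, units, states) for softmax
            # NOTE: the encapsulated instance would also have to be told that it is
            # proxied, e.g. by appending self to units.proxy_units at this point.

        def reshape(self, value):
            # reinterpret e.g. a (minibatches, maps, width, height) tensor as
            # (minibatches, units, states), so softmax-style units can use it
            return T.reshape(value, self.shape)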
RBMGraph:
- a directed acyclic graph (DAG) where the nodes are RBMs
- the root of the graph corresponds to the 'top layer' of a DBN
- each edge A -> B is a 'connection' from RBM A to RBM B (meaning data flows from A to B) and specifies:
- how to get the data from A, given all its 'inputs' (typically: a list of units to sample)
- how to transform it to input data for B (optional)
- special leaf nodes that provide the input data will be necessary. It might make sense to also implement preprocessing techniques as 'nodes'...
maybe make a general 'Node' class, and provide a subclass that encapsulates an RBM.
Try not to make any assumptions about what the data passed over the edges looks like (although it will typically be a vmap)
- Maybe I'm reinventing MDP here? But this is tree-based and makes fewer assumptions about what the input/output data looks like.
- Training procedure would then be:
* traverse the graph and recursively ask for the processed data; if it is unavailable, process it, and if the node is untrained, train it first.
Simpler version: RBMStack: a linear version of RBMGraph where each node can only have a single child, yielding a simple 'chain' or 'stack' of RBMs.
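A rough sketch of the Node idea described above (the class and its methods are made up; an RBM node would implement compute() as sampling or mean field over a list of units, a leaf node as returning the input data):

    class Node(object):
        def __init__(self, parents=None):
            self.parents = parents or []   # an edge A -> B makes A a parent of B
            self.trained = False
            self._output = None

        def train(self, inputs):
            pass                           # leaf and preprocessing nodes can be no-ops

        def compute(self, inputs):
            raise NotImplementedError      # e.g. sample/transform the parent data

        def output(self):
            # recursively ask the parents for their processed data; if a node is
            # untrained, train it first, then process and cache the result
            if self._output is None:
                inputs = [p.output() for p in self.parents]
                if not self.trained:
                    self.train(inputs)
                    self.trained = True
                self._output = self.compute(inputs)
            return self._output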
TODO: some glue code to stack RBMs into DBNs.
* Is it possible for the units to differ between RBM1 and RBM2, e.g. sigmoidal in one and truncated exponential in the next? How can we easily specify in which way two RBMs are connected (through which units)? Because that does not necessarily have to be all of the hiddens.
- training:
- train layer 1
- compute the hiddens for layer 1 and store them (sampling? mean field?)
- train layer 2
- compute the hiddens for layer 2 and store them
- ...
- train layer n
(the hiddens do not need to be computed for layer n)
* Possible problems: large datasets will probably require that the computed hiddens can be stored somewhere. It would therefore be good to have an 'infer' generator. The standard 'train' method would then run through it completely and keep everything in memory, but when that is not possible, an alternative can easily be written (a rough sketch follows below).
* Maybe the idea of 'layers' is too restrictive; in fact DBNs can also be trees (e.g. Multimodal Deep Learning). The different RBMs are connected through particular Units instances.
- sampling/inference:
- given a 'starting point' from the data
- given a configuration of the top hiddens
- also 'partial' sampling, from layer m to layer n or so
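A rough sketch of the greedy layer-wise procedure with an 'infer' generator (compute_hiddens and train_rbm are placeholders for whatever interface the glue code ends up with):

    def infer(rbm, minibatches):
        # generator: yield the hidden representation of each minibatch, so the
        # caller decides whether to keep everything in memory or stream to disk
        for v in minibatches:
            yield compute_hiddens(rbm, v)   # sampling or mean field, to be decided

    def train_stack(rbms, minibatches):
        data = minibatches
        for i, rbm in enumerate(rbms):
            train_rbm(rbm, data)                  # train layer i
            if i < len(rbms) - 1:                 # no hiddens needed after the last layer
                data = list(infer(rbm, data))     # or write to disk for large datasets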
TODO: find a way to clean up namespaces, (for example, theano.tensor as T is everywhere)
TODO: Introduce ProxyUnits with trainable parameters. This can be used for the Gaussian stuff (v = v'/sigma) and for the linear combination stuff (v = Rx).
* for gaussian units, the energy function is in terms of v/sigma, not v! This is not implemented; look into this.
- currently, Parameters instances specify their energy gradient. This only works when the energy is linear in the Parameters, because then the energy gradients don't depend on other parameters.
Unfortunately, this is not always true, for example with the variance of Gaussian units.
Another problem with the variance of Gaussians is that the contributions of the Parameters to the energy function are no longer additive. Can this be made more general, so that Gaussian units with trainable variance become possible?
* for the linear R parameters in the FCRBM model, we could apply the chain rule as follows:
dE/dR = dE/dv * dv/dR. Since v = Rx, dv/dR is easy. dE/dv is also fairly easy. One might suspect that it is identical to act_v, the activation of v, but I'm not convinced this is the case. It's true when everything is linear, but what about when there are Gaussian units, for example?
- Calculate dE/dv and act_v for a gaussian-binary RBM and compare.
- Think about how this could be incorporated.
- the ICASSP 2010 paper "PHONE RECOGNITION USING RESTRICTED BOLTZMANN MACHINES" by Mohamed & Hinton trains using a combination of a generative (unsupervised) gradient and a discriminative gradient. Try to implement an example using this technique, for example digit classification on MNIST. The idea is to combine the standard CD gradient, where the label unit is seen as context, with p(l|v), which can be computed from the p(v,l)'s (since #l is finite and small), which in turn are proportional to exp(-FE(v,l)).
* it would be nice to write this as generally as possible, so it can work for any RBM with a label unit or a set of label units. Note that summing over the labels can only be done numerically (by enumerating them); it's not doable to integrate them out analytically.
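A hedged sketch of the discriminative part, p(l|v) as a softmax over negative free energies obtained by enumerating the labels (rbm.free_energy and the vmap layout are assumptions about the eventual interface):

    import theano.tensor as T

    def label_posterior(rbm, v_units, l_units, v, n_labels):
        # p(l|v) = exp(-F(v,l)) / sum_l' exp(-F(v,l')), computed by enumerating the
        # labels; assumes free_energy(vmap) returns one value per minibatch example.
        neg_F = []
        for l in range(n_labels):
            one_hot = T.ones((v.shape[0], 1)) * T.eye(n_labels)[l]   # label l for every example
            neg_F.append(-rbm.free_energy({v_units: v, l_units: one_hot}))
        return T.nnet.softmax(T.stack(neg_F, axis=1))                # (minibatch, n_labels)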
TODO: optimised 1D convolution: map visible_maps to the filter height and data (time) to the filter width
TODO: test 1D convolution
TODO: symmetric beta units (will have two proxies log(u) and log(1-u), and require hiddens with proxies for 1-...) - will need a beta sampler. Maybe MonteTheano has one?
TODO: supervised finetuning - how does this fit in the framework (separate trainers and updaters?)
TODO: maybe Tempered MCMC http://www.iro.umontreal.ca/~lisa/pointeurs/tempered_tech_report2009.pdf
TODO: Look into hessian-free optimisation, maybe that fits in as well? (although I think this is pretty much always supervised?)
TODO: implement other algorithms from "Inductive Principles for RBM Learning" by Marlin et al. (2010)
TODO: deep boltzmann machines?
TODO: rather than implicitly assuming everywhere that a parameters instance with a number of units instances as input ties all those units (bidirectionally), maybe parameters should explicitly define the dependencies they introduce. This allows for dependency checking where necessary (free energy computation would be one use case).
# problem: if the 'Parameters' class also encodes directions, we will need separate ProdParameters and DirectedProdParameters, for example. But maybe this isn't such a big deal? Or have only a single ProdParameters class, but allow the dependency relations to be specified at instantiation.
- each parameters object should be able to generate a 'dependency map', which maps Units instances to the other Units instances they depend on.
- for example, for prodparameters: { v: [h], h: [v] }
- the default would be to include all other units.
- each Parameters type can have its own way of manipulating this dependency map. For example, for ProdParameters or third order parameters, it would suffice to specify which parameters are context. Then the dependency list for the context Units would be empty.
- that way, specifying dependencies is optional, but it can still be used for stuff like model visualisation, or to perform dependency checks.
* everywhere sums of contributions are computed, the deps should now be checked + also adapt rbm.dependent_units
- this adds complexity, and the gains are limited
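A tiny sketch of what the default dependency map could look like (attribute names are hypothetical; for a plain product parameters object with units v and h it would produce the { v: [h], h: [v] } example above):

    class ProdParameters(object):
        # sketch of the dependency map part only
        def __init__(self, units_list, context_units=()):
            self.units_list = list(units_list)          # all Units tied by these parameters
            self.context_units = list(context_units)    # subset treated as context

        def dependency_map(self):
            # default: every non-context Units instance depends on all other tied Units;
            # context Units depend on nothing (their values are always given)
            deps = {}
            for u in self.units_list:
                if u in self.context_units:
                    deps[u] = []
                else:
                    deps[u] = [o for o in self.units_list if o is not u]
            return deps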
TODO: model visualisation plugin using pydot? rbm.draw() or something.
It would plot the dependencies and the Parameters objects affecting those dependencies. (this way it will also work sensibly for third order etc.)
TODO: once some proper experiments have been set up, profile things (ProfileMode).
- a nice (binary) MNIST alternative: http://www.cs.ubc.ca/~bmarlin/data.shtml
*** PRELIMINARY DOCUMENTATION ***
TRAINING
split into multiple phases:
- collect statistics, given input data and the model
* for CD-k, this is input visibles, sampled hiddens, and then visibles and hiddens after k steps
* for PCD, the negative term is different
- use statistics to compute the weight updates
* for all types of CD, this is just getting the 'gradient' from the Parameters object and filling in the blanks.
- update the weights according to some other hyperparameterised settings (this is where momentum, weight decay etc. are)
ParamUpdater: a composite object that updates parameters given some input data; it may consist of multiple other ParamUpdaters that are combined in some way (typical use case: the updates are summed)
DecayParamUpdater: provides a decay term
MomentumParamUpdater: encapsulates another ParamUpdater and applies momentum to it
CDParamUpdater: encapsulates CD-k or PCD weight update computation
* takes the input data, computes statistics (Stats object)
* gets the form of the update term from the Parameters object, fills in the blanks with the stats
* returns the resulting update
SparsityParamUpdater: encapsulates sparsity target regularisation
SumParamUpdater: sums the updates obtained from its child ParamUpdaters. It should check that the composing ParamUpdaters are for the same Parameters!
ScaleParamUpdater: takes a ParamUpdater and a scale factor, and scales the updates by this factor. This will be the result of writing '0.005 * ParamUpdater()' for example.
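A minimal sketch of just the composition mechanics suggested above, so that '+' yields a SumParamUpdater and 'scalar *' a ScaleParamUpdater (the real classes will of course also need the Parameters reference, stats handling, etc.):

    class ParamUpdater(object):
        def get_update(self):
            raise NotImplementedError   # subclasses return their update term

        def __add__(self, other):
            return SumParamUpdater([self, other])

        def __rmul__(self, scale):
            return ScaleParamUpdater(self, scale)

    class SumParamUpdater(ParamUpdater):
        def __init__(self, updaters):
            self.updaters = updaters    # should all be for the same Parameters
        def get_update(self):
            return sum(u.get_update() for u in self.updaters)

    class ScaleParamUpdater(ParamUpdater):
        def __init__(self, updater, scale):
            self.updater, self.scale = updater, scale
        def get_update(self):
            return self.scale * self.updater.get_update()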
Stats: object that encapsulates statistics (typically for CD)
cd_stats: statistics for CD-k
pcd_stats: statistics for PCD
!! since only the negative term differs between cd_stats and pcd_stats, maybe some overlapping code can be factored out here.
MODULAR RBM
v = a(f(W, h) + g(vbias))
h = a'(f'(W, v) + g'(hbias))
ActivationFunction
Sampler
Units
(unit data u + activation functions a(x) + samplers) = Units
Parameters(list<Units> units)
has
- list<Units>: list of units that are related by the parameters
- a set of actual parameters (weights)
provides
- contribution to activation of each of the Units
- contribution to the energy function
RBM
has
- list<Units>: a set of different types of units
- list<Parameters>: a set of different types of parameters that relate the Units
provides
- sampling a type of units (computing nonlinear activation and sampling)
- computing the energy function (summing the contributions for each of the Parameters)
get unit values (this goes for all unit types):
- compute linear activation
x = sum_i(f(W_i, u_i))
- apply activation function
a(x)
- sample
u ~ a(x)
Computing a linear activation consists of summing all the contributions of the different Parameters.
A Sampler can just be the identity function (mean field), a Bernoulli sampler (the most common case), a softmax sampler, a max-pooling sampler, ...
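A minimal sketch of those three steps (the contribution accessor and the sampler interface are simplified placeholders):

    import theano.tensor as T

    def unit_values(units, parameters_list, vmap, sampler):
        # 1) linear activation: sum the contributions of all Parameters tying these units
        x = sum(p.activation_contribution(units, vmap) for p in parameters_list)
        # 2) apply the activation function, e.g. a sigmoid for binary units
        a = T.nnet.sigmoid(x)
        # 3) sample: the identity for mean field, a Bernoulli sampler for binary units, ...
        return sampler(a)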
Units, ActivationFunction and Sampler are AWARE of the RBM they are part of - most of the logic is concentrated in the subclasses. While this does bring some dependencies that could technically be avoided, it should lead to a cleaner architecture.
Implement a system where different types of units, weights and sampling can be combined easily. If it's performant, that's a plus, but it doesn't have to be (this is mainly for experimentation). This will make it easier to construct heterogeneous models.