Skip to content
Avinash Ranganath edited this page Jan 23, 2017 · 34 revisions

Design Decisions

Should we support Python 2?

Decision

No, Python 3 is available everywhere, and it allow us to use nice new features while keeping the codebase clean.

Should we have two base classes for op nodes and leaf nodes?

We could either have only one base class for all nodes, or have two base classes, one for leaf nodes and one for op nodes (both inheriting from Node).

Reasoning

Option 1: One base class

Pros:

  • Abstract methods will not have to be defined twice.
  • Algorithms would check for inputs being present rather than for the type of node when looking for a leaf. It is not really a strong pro, since checking for being a subclass is equally easy.

Option 2: Two base classes

Pros:

  • Leaf nodes do not have the interface for adding inputs. This will provide a cleaned end-user interface.
  • Abstract methods that must be re-implemented in nodes will not have unnecessary inputs argument.
  • We can add checks for opnodes not having inputs defined.
  • Algorithms will not mistake an opnode without inputs for a leaf.

Decision

Since we prefer a cleaner internal and end-user interface and additional checks, and the overhead of 2 classes is not that big, we choose Option 2.

How to process multiple data samples at the same time (minibatch)?

Decision

We can use the convention employed by most TF models and assume that tensors passed through the SPN graph have at least two dimensions:

  • D0: instances in a minibatch
  • D1-DN: instance data

How to specify node-specific aspects of learning/inference?

It is often the case that we want to specify node-specific aspects of learning/inference, e.g.:

  • specify that some nodes should be learned using soft GD and some using hard GD
  • specify that weights of only some nodes should be learned
  • etc.

Decision

We can use a similar idea that TF employs when specifying which variables should be included during training (trainable flag specified during variable creation). Nodes in SPN graph should have flags that specify such options. These flags will be read and used by the algorithm generating TF graph from SPN graph.

Should TF graph be created when the SPN graph is created?

Should we generate TF graph when the SPN graph is built or defer that process to methods invoked on the graph that was previously created?

Reasoning

TF graph cannot be modified, while being able to adjust the SPN graph or build it from the top down might be a nice feature. Moreover, many algorithms can be applied to an SPN graph and the structure of the model should be independent from the algorithms used to learn/infer in the model. At the same time, adding all possible algorithms in the form of a TF graph would unnecessarily populate the graph.

Decision

SPN graph should not generate any TF graph elements corresponding to algorithms operating on the graph (inference, learning, etc.) and that require presence of input nodes. Elements that correspond to computations shared between algorithms or a state of the SPN graph (e.g. placeholders, variables, and their early processing) can be added at the SPN graph creation time. Those operations should not require that inputs to the node are already added. We assume that the SPN graph is created first, and then certain methods are invoked on that graph to generate TF graph for specific algortihms.

How can we accept custom operations as part of an SPN graph?

If we assume that the SPN graph is created independently from the TF graph, how can we include custom operations which are not defined as SPN nodes provided by the library?

Reasoning

Option 1

We can make it easy to create custom SPN nodes by implementing a node interface. The implementation would contain the custom operations defined in terms of TF ops.

Pros:

  • Simple, easy to understand, solution
  • Encapsulates custom operations

Cons:

  • Additional code is needed to wrap the TF ops into a node

Option 2

We can allow using tensors directly as inputs to the SPN nodes.

Pros:

  • slightly less code

Cons:

  • Nodes cannot be used as input to the custom ops, since the TF ops of nodes are not yet generated at the SPN graph creation time

Decision

We use Option 1.

When should inference/learning operations be created to allow for computing using multiple devices?

As noted above, we do not want to generate TF operations at the same time when SPN nodes are created. How should we generate these operation so that we can:

  • compute multiple mini-batches in parallel on different devices
  • perform different types of inferences on the same minibatch on different devices
  • perform gradient computations on multiple GPU devices in parallel with weight updates at the end done on the CPU

Reasoning & Decision

TF can automatically place operations that support GPU computations on a GPU, but does not automatically employ multiple GPU devices. Therefore, the device on which the op will run has to be specified manually, at TF op creation time. In order to use the same mechanism with SPN computations, we need to know when exactly TF ops are created by SPN algorithms and be able to place them in contexts specifying devices.

As noted in Should TF graph be created when the SPN graph is created?, variables and placeholders (parameters) being part of the SPN node definition can be generated at the SPN node creation time. Therefore, in order to specify the device on which the variables exist, we can simply create SPN nodes inside a device TF context.

Inference operations should be generated every time a method is executed on an SPN graph. That method can be a part of the SPN node class or belong to a separate class implementing a specific algorithm. This will make it clear when particular operations are created and make it possible to place the call to such method inside a device context.

Additionally, it will make it possible to generate two sets of TF operations performing the same type of computation (e.g. inference on a minibatch) employing the same weights. These computations can then be used to process multiple minibatches on different devices with the same weights.

In order to process multiple mini-batches on different devices, we need to have them in different placeholders or be able to divide a placeholder into parts. Following the approach described in Should TF graph be created when the SPN graph is created? and assuming that placeholders are created when SPN graph is created, we can still divide a larger batch into smaller batches when a specific algortihm is applied to the SPN graph. The class applying that algorithm will simply create multiple copies of the same TF graph based on the SPN graph and as input to each of them, use only a part of the batch stored in the placeholder.

More complex algorithms can be split into sub-algorithms the same way the GradientDescentOptimizer splits gradient calculation and application. This will make it possible to place parts of the computational graph of the algorithm on different devices if necessary.

How can we specify name scopes for grouping multiple SPN nodes?

Nodes should use name scopes to group operations inside a node. However, we might want to also group multiple nodes into a name scope.

Decision

We can place nodes in name scopes when the nodes are created. This way variables will be created in a scope. We can then re-create that same scope within the scope of an inference/learning computation by looking up the scopes of the variables.

How can we place an SPN in a specific TF graph?

It is sometimes convenient to maintain multiple TF graphs. How can we specify the graph in which the SPN nodes should be created?

Reasoning

All operations and variables that are used together must be placed in a single TF graph.

Decision

When the SPN graph is created, we can place nodes in a context of a specific different graph. When SPN nodes are added as children to other SPN nodes, it should be verified that the TF graphs they exist in are the same. Then, when adding inference/learning operations, we automatically add them to the TF graph in which the SPN graph exists.

How can we specify that weights should be shared between nodes?

Reasoning

Option 1

There is a special node in the SPN graph containing weights that is connected to the node performing an operation.

Pros:

  • Better encapsulation, weights-related computations are stored in a separate class
  • Sharing weights is less constrained and can be done by connecting a single weights node to multiple operation nodes

Cons:

  • For operations using weights, always two nodes must be created, one for the operation, one for the weights.

Option 2

Weights are a part of the node implementing the operation using them.

Pros:

  • Weights are tightly coupled with an operation and conceptually can be seen as a part of it.
  • Less nodes to create.

Cons:

  • Weight sharing is less flexible and requires that operations sharing weights are implemented in a single node.
  • Weight-related computations are mixed with operation-related computations in the same class

Decision

Not sure.

How can we add explicit latent variable IVs to the sum nodes?

Often, particularly in case of discriminative models, the latent variables that are summed out by the sums of an SPN need to be made explicit. It should be possible to both set and infer their values.

Reasoning

Option 1

A separate IVs node, identical to the one used for IVs corresponding to other variables is used. It is connected to the sum node. If the sum node has such a node connected, it uses the values of the variables during inference (if evidence is provided) and the IVs node provides methods for generating TF ops performing inference of the latent variable values if no evidence or partial evidence is provided (e.g. MPE inference).

Pros:

  • Better encapsulation, IV stuff is kept in IV nodes
  • IV sharing is less constrained and can be done by connecting a single IV node to multiple sums

Cons:

  • None?

Option 2

The IVs for the hidden variables are a part of the sum node and the interface of the sum node is extended.

Pros:

  • None?

Cons:

  • More difficult sharing of IVs between sum nodes in the same SPN layer. Sharing would have to happen within a sum node implementing multiple sums. Such behavior can still be achieved even if we decide to place the IVs outside the sum node.

Decision

We decide to use Option 1, and if a node implementing multiple sums using the same IVs is created, it can still be connected to an outside IV node.

How to optimize processing of layers of SPN nodes?

It is often the case that an SPN consists of layers of identical operations (multiple IVs, layer of sums or products). How should we represent such layers?

Reasoning

Option 1

Divide the layers into separate SPN nodes in the SPN graph.

Pros:

  • TensorBoard will visualize each node in the graph independently.

Cons:

  • This will lead to a large graph and many TF operations.
  • This might impact the TF performance (although this is not confirmed).
  • In case of IVs it will make it cumbersome to feed that many separate placeholders.

Option 2

Provide separate SPN node classes and SPN layer classes

This leads to the same problems as Option 1.

Option 3

Make SPN node classes generate multiple outputs corresponding to multiple operations of the same type.

Pros:

  • SPN nodes can make optimizations while processing multiple operations of the same type at the same time (e.g. use matrix multiplication to perform multiple sums)
  • Input data can be modeled using a single placeholder containing values of multiple variables

Cons:

  • The structure of the SPN cannot be fully visualized in TensorBoard

Decision

We use option 3. However, it is important to note that the only reason to group multiple operations into a single node is performance. If there is no benefit of using separate TF ops vs one op on gathered input, there is no reason to group operations into a single node. Other reasons such as placing multiple nodes in a scope, sharing weights and sharing latent variables are not good reasons as these can be accomplished differently (see below).

Should multiple sum/product operations be implemented as a single node?

Reasoning

As noted in How to optimize processing of layers of SPN nodes?, the primary benefit of having a layer of nodes is computational efficiency. Multiple sum operations should be grouped if having a single TF operation implementing multiple sums is more efficient than implementation based on separate operations for each sum.

Currently, there is no segment_sum operation in TF that could be used to realize multiple sums on the 2nd dimension of a tensor (the first dimension is used to represent a minibatch). However, currently there is also no gather operation that can be used to gather inputs from the 2nd dimension (columns). The only way to realize the gather operation is to flatten the tensor and perform a 1D operation. Therefore, in principle, that same trick could be used and regular 1D segment_sum could be applied.

In that situation, it is not clear if there is any performance benefit of grouping multiple sums into a single TF operation. It is also not clear if such single operation would be as parallelizable as multiple operations in TF. This should be investigated.

Decision

Until the results of that investigation are known, we will assume that sums and products implement only a single operation. Moreover, we will implement algorithms this way to find out what kind of groups of sums and products are needed.

Can the number of outputs of a node be dependent on the number of inputs?

Nodes might have either a fixed number of outputs (specified by the type of the node or a parameter specified at creation time), or the number of outputs might be dependent on the inputs (e.g. a node that contains multiple sums summing all elements in the input vector that correspond to the same scope).

Reasoning

The number of outputs of a node is in many cases dependent on the number of inputs. Therefore, specifying the number of outputs as a parameter of the node (to make the number independent of the number of inputs), will only add redundancy. If that specified number is not compatible with the number of inputs of the node, the graph will not be valid anyways.

Decision

The number of outputs of a node can depend on the inputs it takes. A node should be defined in terms of the inputs it takes and only the parameters which cannot be deduced from the inputs.

Should we encode information about the scope of an output of a node?

Each output of a node is calculated based on some input variables which define its scope. Multiple outputs can have the same scope.

Reasoning

Several algorithms rely on the knowledge of the scope, such structure generation. Moreover, without the scope information, it is not possible to determine if a graph is a valid SPN.

Decision

Yes, it should be possible to obtain the scope of an output of a node.

How to implement computations on nodes of the graph?

How should we design the implementation of algorithms performing computations on nodes of the graph (traversing the graph)?

Reasoning

Option 1

Use recurrence.

Cons:

  • Not possible since this can be done only on trees and not general graphs.
  • Python quickly reaches max recursion depth

Option 2

A separate function implementing graph traversal algorithms which runs a given function on every node of a graph. Various types of traversal algorithms can be implemented.

Decision

We use Option 2.

How to structure the implementation of methods computing a value for a graph node?

How should we structure the implementation of the methods computing a certain value (or values) for a single node of a graph? These methods would be used by the graph traversal algorithms to obtain values for the whole graph.

Reasoning

Option 1

The abstract classes implement a method providing a default computation and subclasses override it to provide specific implementation. They use the default implementation to pre-process the input data.

Pros:

  • Default computation is only executed if the specific implementation needs it.

Cons:

  • It is not easily possible to share the result of the default computation between multiple computation routines that are executed on the same node since the default method is called from the inside of the overriding method. This problem occurred in compute_valid where we have to compute the output scope, but also scopes of gathered inputs (which are also computed in the function calculating the output scope). One way of sharing the result of default computation would be to add another argument with that result (with default value None) to the specific implementation, which would be used if that value was already pre-computed instead of calling the base class default computation internally.
  • The default implementation is not aware of how the inputs are interpreted by the node and therefore cannot provide a very useful default behavior.

Option 2

The abstract classes implement a part of the computation that is shared by all specific implementations and call another sub-method which should be overridden in subclasses to provide a specific implementation on top of the results of the shared computation.

Cons:

  • The specific implementation cannot choose whether to use the "default" computation or not.

Option 3

The functions performing specific computations for nodes are abstract and need to be re-implemented in the sub-classes. They do not provide any default implementation in the base classes. Instead, the computations that must be shared across all specific implementations are invoked from the function passed to the graph traversal algorithm before the results are even passed to the specific implementation method. Any useful computations that might be shared, but their use should be in the gesture of the specific implementation, are realized by methods of Node which can be called from within the specific implementations if needed.

Pros:

  • The code is more readable and the names of method clearly state their purpose.
  • Nodes can choose to use some pre-processing by simply calling a method of Node.
  • The graph traversal algoritm has access to intermediate results in cases where some shared computations are alwahys used.
  • The code runnning the graph traversal algorithm has control over name scopes and can group the default and specific computation in a single scope.

Decision

We decide to use Option 3. All computations that always need to be performed for every node, should be moved to the function passed to the graph traversal algorithm. Any optional default computation should be invoked by the specific implementation, and can be placed in helper methods of Node.

How to pass multiple values during graph computations?

Sometimes, multiple values must be computed for each node and passed as inputs to the next node. How to pass these multiple values?

Decision

We should use named tuples to pass multiple values per input.

How should weights, representing a set of sum nodes, be implemented?

For representing a set of sum nodes as a single op, how should weights of the collection of the sums be represented, and where and how should they be normalized?

Reasoning for weights normalization

Option 1

Normalize sets of weights inside the Sums node

Cons:

  • The Sums node would have to update its own weights during learning
  • If multiple sums connect to the same weights node (weight sharing), they would then have to coordinate with each other to come up with normalized updates of weights.

Option 2

Provide multiple weights inputs to Sums node.

Cons:

  • This would introduce many weight/variable TF nodes to the network.

Option 3

Include a Concat op to the network.

Cons:

  • This would also introduce many TF nodes to the network.

Option 4

Change the Weights node to handle multiple sets of weights, wherein each set is normalized independently.

Pros:

  • Serves the objective of reducing TF nodes in the graph.

Reasoning for weights reprsentation

Option 1

Represent weights as a long 1D vector

Cons:

  • Not straightforward to either normalize weights, nor for multiplying with the inputs.

Option 2

Represent weights as a 2D matrix, with each row contains weights of a single sum node.

Pros:

  • Easy to normalize by dividing the matrix by taking reduce_sum of columns (D1)
  • Since all the sums in a Sums node would have the same input, it is trivial to multiply weights matrix with inputs matrix, to produce output as a weighted sum of inputs.

Decision

We decide to use Option 4 (Normalization) and 2 (Representation) respectively.

Clone this wiki locally