How TensorFI works

Niranjhana Narayanan edited this page Sep 12, 2019 · 2 revisions

TensorFI works in two stages. First, it instruments the entire TensorFlow graph by inserting custom fault injection nodes into it and overriding the run method of the main session. Second, at runtime, it checks whether a fault is to be injected, and if so, calls the inserted fault injection nodes, which then perform the actual fault injection in accordance with the configured options.

1. Instrumentation Phase:

In this phase, TensorFI walks the main graph of the session (see Known Limitations and Assumptions) and creates a custom copy of every node in the graph. The custom operations mimic the original nodes when no faults are injected, and perform the actual fault injection at runtime when faults are injected. Because the original nodes are not disturbed in any way, except perhaps for added out-degree on constant, variable, and placeholder nodes, TensorFI does not inhibit any optimizations or parallelization of the original graph (see Design Principle 3). Further, the inserted nodes neither have control dependencies on the original nodes, nor do they induce any new control dependencies. As a result, the execution of the original graph is not affected. Finally, the instrumentation phase also "monkey patches" the main session's run function, so that it can invoke the custom nodes at runtime for injection.
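The instrumentation idea can be illustrated with a toy dataflow graph in plain Python. This is a minimal sketch, not TensorFI's code: `Node`, `Session`, `evaluate`, `instrument`, and the `fi_enabled` flag are all hypothetical names, and the sketch assumes that leaf nodes (constants, placeholders) are shared between the original and the shadow graph rather than copied.

```python
import types


class Node:
    """Toy dataflow node: an operation applied to the outputs of its inputs."""

    def __init__(self, name, op, inputs=()):
        self.name = name
        self.op = op
        self.inputs = list(inputs)


def evaluate(node):
    """Recursively evaluate a node from its inputs."""
    return node.op(*[evaluate(i) for i in node.inputs])


class Session:
    """Toy stand-in for a session with a run method."""

    def run(self, node):
        return evaluate(node)


def instrument(session, fetch, inject=lambda name, value: value):
    """Build shadow copies of the nodes reachable from `fetch`, then patch run.

    `inject(name, value)` is a hypothetical hook; the default returns the
    value unchanged, so the shadow graph simply mimics the original graph.
    """
    shadows = {}

    def make_fi_op(node):
        # The shadow op computes the original result, then passes it to the hook.
        return lambda *args: inject(node.name, node.op(*args))

    def shadow_of(node):
        if not node.inputs:
            return node  # assumption: leaves are shared, gaining out-degree only
        if node.name not in shadows:
            shadow_inputs = [shadow_of(i) for i in node.inputs]
            shadows[node.name] = Node("fi_" + node.name, make_fi_op(node), shadow_inputs)
        return shadows[node.name]

    shadow_of(fetch)
    original_run = session.run  # keep the unpatched bound method
    session.fi_enabled = True

    def patched_run(self, node):
        # Route to the shadow node only when injection is enabled.
        if self.fi_enabled and node.name in shadows:
            return evaluate(shadows[node.name])
        return original_run(node)

    session.run = types.MethodType(patched_run, session)  # the monkey patch
    return shadows


# Example graph: (a + b) * b
a = Node("a", lambda: 2.0)
b = Node("b", lambda: 3.0)
add = Node("add", lambda x, y: x + y, [a, b])
mul = Node("mul", lambda x, y: x * y, [add, b])
```

Because the original `Node` objects are never modified, turning `fi_enabled` off makes the patched `run` fall straight through to the original method.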

2. Execution and Injection Phase:

In this phase, the session's run method first checks whether faults are to be injected, and if so, it invokes the run operation on the custom nodes that were inserted. These in turn mimic the execution of the original nodes until a node that is chosen for fault injection is encountered. At this point, the node checks whether the other fault injection criteria are met (e.g., whether it is executed after skipCount, and whether the configured probability of injection selects the current execution). If so, the fault is injected into the node's result, a scalar or tensor value, according to the fault type chosen. For example, a RAND fault would replace the correct scalar/tensor value with a randomly chosen scalar/tensor value.

Once the fault is injected, execution continues for the rest of the nodes. Finally, the monkey-patched run method returns the result value just as the original session's run method would, so the client does not see any difference. This satisfies the first design goal, namely compatibility.
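The injection checks described in this phase can be sketched as a small callable. This is an illustrative sketch under stated assumptions, not TensorFI's implementation: the class name `RandInjector` and its parameters are hypothetical, "executed after skipCount" is interpreted as skipping the first `skip_count` executions of the target node, and the RAND replacement is assumed to be drawn uniformly from [0, 1). Tensors are modeled as nested Python lists.

```python
import random


class RandInjector:
    """Hypothetical RAND-style injector for one target node."""

    def __init__(self, target, skip_count=0, probability=1.0, seed=None):
        self.target = target
        self.skip_count = skip_count    # assumed meaning: executions to skip first
        self.probability = probability  # chance of injecting per eligible execution
        self.rng = random.Random(seed)
        self.seen = 0                   # executions of the target so far
        self.injected = 0               # faults actually injected

    def __call__(self, name, value):
        if name != self.target:
            return value                # not the chosen node: mimic the original
        self.seen += 1
        if self.seen <= self.skip_count:
            return value                # still within the skipped executions
        if self.rng.random() >= self.probability:
            return value                # the probability check did not select this run
        self.injected += 1
        return self._rand_like(value)

    def _rand_like(self, value):
        # Replace a scalar, or every element of a nested list, with a random
        # value, preserving the shape of the original result.
        if isinstance(value, list):
            return [self._rand_like(v) for v in value]
        return self.rng.random()


inj = RandInjector("add", skip_count=2, probability=1.0, seed=0)
results = [inj("add", 5.0) for _ in range(3)]  # first two pass through unchanged
```

Values for every node other than the target pass through untouched, which mirrors how the custom nodes mimic the original ones until the chosen node is reached.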
