Skip to content

Writing a Data Flow Analysis

Bhuvan Sharma edited this page Jun 24, 2020 · 1 revision

Before writing a static analysis

Why I cannot write a data-flow analysis on-the-fly? Writing a data-flow analysis is a challenging task and can be tough. Therefore, you should be familiar with the underlying theory in order to be able to develop a novel data-flow analysis. We compiled a (incomplete) list of literature for you which you may consult before writing an analysis. Please have a look at Useful Literature.

Writing a static analysis

PhASAR provides several sophisticated mechanisms that allow you to write your own data-flow analysis. In general, PhASAR is designed in such a way that the analysis writer has to choose from several possible interfaces which he can use to implement a concrete static analysis. Depending on a user's analysis problem, some interfaces might be more suitable than others. When having found the right interface for the problem, the analysis writer usually only has to provide a new class which implements the interface's missing functionality which serves as the problem description for the analysis. This concrete problem description is handed over to the corresponding solver, which solves the problem in a fully automated manner. In the following the possible data-flow solvers are ordered according to their expressiveness (and complexity).

Choosing a control-flow graph

In general, all data-flow analyses are performed on a control-flow graph. When writing an analysis, a user has to choose a control-flow graph for their analysis. For that reason, all of our solvers work either on the CFG (intra-procedural control-flow graph) or on the ICFG (inter-procedural control-flow graph) interface.

For instance, when writing a simple intra-procedural data-flow analysis using the monotone framework, the user has to use one of CFG's concrete implementations or provide his own implementation for that interface. The pre-implemented LLVMBasedCFG.h should usually do the job and can be used out-of-the-box.

The inter-procedural call-string approach and the IFDS/IDE/WPDS frameworks solve a concrete analysis based on an inter-procedural control-flow graph. Depending on the analysis's needs, a forward-, backward-, or bi-directional inter-procedural control-flow graph may be used. For instance, if a backward analysis is required, the LLVMBasedBackwardICFG has to be used. However, most of the time the 'LLVMBasedICFG' should work just fine.

It is also possible to provide a custom implementation of the ICFG interface if necessary. In that case, a user just needs to provide another class for which they provide a reasonable name. The implementation must at least implement the interface that is defined by the ICFG.h interface.

Useful shortcuts

In the following section some useful coding shortcuts are presented which may come in handy when writing a new analysis within PhASAR.

The std::algorithm header

When overriding the classes in order to describe the problems within the monotone framework, oftentimes set operations like set union, set intersect or set difference are required. Writing these functions yourself is tedious. Therefore it makes much sense to use the existing set operation functions which are defined in the std::algorithm header file such as:

    ...
    std::includes /* subset */
    std::set_difference
    std::set_intersection
    std::set_union
    ...

The neat thing about these functions is that they are completely generic as they operate on iterators and provide several useful overloads. Thus, they work on all container types that follow the concept's required by these functions. In the following a small code example is presented:

std::set<int> a = {1, 2, 3};
std::set<int> b = {6, 4, 3, 2, 8, 1};
std::set<int> c;
std::set_union(a.begin(), a.end(),
               b.begin(), b.end(),
               inserter(c, c.begin()));
bool isSubSet = std::includes(b.begin(), b.end(), a.begin(), a.end());
The pre-defined flow function classes

When defining flow functions in IFDS, IDE or WPDS, sometimes a certain flow function type occurs more than once. For instance, the Kill flow function that kills a specific data-flow fact is often needed many times. An analysis writer can find several useful pre-defined flow (and edge) functions that can be directly used. A very useful example of such a recurring flow function is the Identity flow function. Some of the flow functions are also defined as a singleton in order to keep the amount of overhead as small as possible. This is always possible if the flow function has no state. A user may use the pre-defined flow functions inside their flow function factories. We also provide some LLVM specific flow functions for parameter mapping at call and return sites. We also provide some general edge functions that may be used when implementing an IDE analysis.

Here is a small example that makes use of the pre-defined flow functions:

// in this example let domain D be const llvm::Value*

shared_ptr<FlowFunction<const llvm::Value*>> getNormalFlowFunction(....) {
    // check the type of the instruction
    if( ...) {
        // do some work
    } else if (...) {
        // kill some data-flow fact
        return make_shared<Kill<const llvm::Value*>>(/* some fact to be killed */);
    } else {
        // just treat everything else as Identity
        return Identity<const llvm::Value*>::getInstance();
    }
}

Important template parameters

The code for the data-flow solvers is written in a very generic way. For that reason we use a lot of template parameters. Here we describe the most important template parameters:

  • D
    • The type of the data-flow facts of your data-flow domain D. When analyzing LLVM IR this will oftentimes be const llvm::Value *.
  • N
    • The type of nodes in you intra-/inter-procedural control-flow graph (aka statements/instructions). When using analysis targeting LLVM IR it will always be const llvm::Instruction *.
  • M
    • The type of functions/methods used by the framework. When using LLVM it will be const llvm::Function *.
  • I
    • The type of the inter-procedural control-flow graph to be used. Usually it will be some reference to a type implementing the ICFG.h interface, e.g. LLVMBasedICFG&.
  • V/L
    • This is the type for the second---value domain---of IDE problem. This type really depends on the concrete client analysis. When using IFDS you one does not have to worry about this type, since it is automatically chosen for the analysis writer as:
enum class BinaryDomain {
    BOTTOM = 0,
    TOP = 1
};

Writing an intra-procedural monotone framework analysis

This is probably the easiest analysis one can write and if an analysis writer is a beginner, they should definitely start here. Using the (intra-procedural) monotone framework, a data-flow analysis problem can be solved within a single procedure/function (caution: function calls within the function under analysis are not followed, but the call-sites are still in the code and trigger a callback, of course). In order to formulate such an analysis, a user has to implement the IntraMonotoneProblem.h interface. The implementation is then handed over to the corresponding IntraMonotonSolver.h which solves the analysis problem.

Writing an inter-procedural monotone framework analysis (using call-strings)

The implementation will be finished soon.

This analysis can be used when inter-procedural data-flow problems have be solved. The solver uses the classical monotone framework combined with the call-string approach to achieve k-context sensitivity. The k can be specified by the analysis implementor. The interface the analysis writer has to implement (InterMonotoneProblem) contains a few more functions than the IntraMonotoneProblem.h and thus, it is slightly more complex. Please note that this solver has scaling problems for a large k on larger programs. If the analysis writer demands a scalable analysis with infinite context sensitivity, they may would like to formulate their data-flow problem using an IFDS or IDE. Caution: IFDS and IDE can only be used when the flow functions are distributive. However, the usual Gen/Kill problems are distributive. An example for an analysis that is not distributive is full-constant propagation. In simple words a flow function is not distributive if a data-flow fact's validity depends on two or more other data-flow facts (at the same time).

Writing an IFDS analaysis

In order to solve a data-flow analysis using IFDS one basically only has implement a single class. It is a good idea to create name the analysis using the naming conventions that can be found in the corresponding problem directories.

To make the class an analysis problem our solver is able to solve, you let your class inherit from 'DefaultIFDSTabulationProblem' or 'LLVMDefaultIFDSTabulationProblem' when analyzing LLVM IR. The concrete analysis is formulated by overwriting all abstract functions of the interface. The member functions a user has to override are:

  • getNormalFlowFunction()

    • Returns flow functions that describe the intra-procedural flows.
  • getCallFlowFunction()

    • Express what the solver should do when it hits a call-site. This flow function factory usually returns a flow function that maps the actual parameters to the formal parameters of the called function.
  • getRetFlowFunction()

    • Usually propagates the information at the end of the callee back to the caller.
  • getCallToRetFlowFunction()

    • Every information that is not involved in the function call is handled here. Usually just passes every information as identity, except for pointer parameters that may be changed in the callee.
  • Optional: getSummaryFlowFunction()

    • Default: Implemented to return nullptr. But can be used to model llvm.intrinsic functions or libc function that one does not wish to follow (e.g if no implementation is available).
  • initialSeeds()

    • The initial seeds are the starting points of an analysis. An analysis can start at one or more points in the target program. The function must return start points as well as a set of data-flow facts that hold at these program points. Unless the concrete analysis requires something special, one would usually treat the first instruction of the main function as a starting point and use the special zero fact to hold at the analysis start.
  • createZeroValue()

    • This function must define what value the analysis should use as the special zero value.
  • Constructor

    • The constructor of an analysis receives the ICFG that shall be used to solve the analysis. Furthermore, the constructor of the DefaultIFDSTabulationProblem that one inherits from must be called AND the special zero value must be initialized in a suitable way. Here is an example of how your constructor can looks like:
  IFDSSolverTest::IFDSSolverTest(LLVMBasedICFG &I, vector<string> EntryPoints)
    : DefaultIFDSTabulationProblem<const llvm::Instruction *,
                                   const llvm::Value *, const llvm::Function *,
                                   LLVMBasedICFG &>(I), EntryPoints(EntryPoints) {
    DefaultIFDSTabulationProblem::zerovalue = createZeroValue();
  }

Writing an IDE analysis

When writing an IDE analysis one only has to implement a single class as well. The general concept is very similar to writing an IFDS analysis. But this time the analysis class has to inherit from DefaultIDETabulationProblem or LLVMDefaultIDETabulationProblem. In addition to this documentation, please also refer to the built-in implementations of IDE data-flow analyses in PhASAR such as the IDELinearConstantAnalysis.

The member functions you a user has to provide implementations for are:

  • getNormalFlowFunction()

    • See Writing an IFDS analysis
  • getCallFlowFuntion()

    • See writing an IFDS analysis
  • getRetFlowFunction()

    • See writing an IFDS analysis
  • getCallToRetFlowFunction()

    • See writing an IFDS analysis
  • Optional: getSummaryFlowFunction()

    • see writing an IFDS analysis
  • initialSeeds()

    • See writing an IFDS analysis
  • topElement()

    • A function that returns the top element of the underlying lattice that the analysis uses -> meaning 'no information at all'
  • bottomElement()

    • A function that returns the bottom element of the underlying lattice that the analysis is using -> meaning 'all information' (the most imprecise element in the lattice)
  • join()

    • A function that defines how information is joined (the merge operator of the lattice) that gets one higher up in the lattice (making the result less precise).
  • allTopFunction()

    • Function that returns the a special edge function allTop that is the EdgeFunction representation of the top value.
  • getNormalEdgeFunction()

    • Returns edge functions for the intra-procedural edges that specify computations that are associated with the corresponding exploded super graph edges.
  • getCallEdgeFunction()

    • Expresses the computations for inter-procedural call edges. These edge functions are oftentimes just the identity function as the actual parameters are usually mapped to the formal parameters of the called function.
  • getReturnEdgeFunction()

    • Express the edge functions that are applied to map data-flow facts that hold at the end of a callee back to the caller. Oftentimes this will be implemented as edge identity.
  • getCallToReturnEdgeFunction()

    • Here the edge functions are defined that are applied for data-flow facts that are passed alongsite the call-site.
  • Constructor

    • See Constructor of Writing an IFDS analysis

Please also refer to the IDE analysis IDELinearConstantAnalysis.

Writing a WPDS analysis

WPDS is the latest solver for distributive data-flow problems that has been implemented in PhASAR. WPDS is very similar to IDE. Basically, our WPDSSolver implementation is a specialization of the IDESolver. But instead of building an exploded super-graph using the user's flow- and edge function implementations a (weighted) pushdown system is build whose rules are drawn from the user's analysis description. Mathematically, IDE and WPDS are equally powerful but WPDS has some additional neat properties which we will not be able to discuss is this documentation. Each rule in the rule set of the pushdown system corresponds to an edge in the exploded super-graph. The analysis problem---encoded in a pushdown system---is then solved using a stack automaton that is obtained by the post* or pre* algorithm. We use a modified version of the WALi library for that task. The following variants of pushdown systems can be created in our implementation:

  • WPDS
  • EWPDS
  • FWPDS
  • SWPDS

In order to write a WPDS analysis one basically needs to write an IDE analysis. The WPDS analysis has its own problem interface WPDSProblem that can be used to encode an WPDS problem. In addition, an existing concrete IDEProblem can be easily transformed into a concrete WPDSProblem as both are distributive problems. Please refer to the WPDSLinearConstantAnalysis for the IDEProblem to WPDSProblem transformation.