Skip to content

Writing a Data Flow Analysis

LinusJungemann edited this page Jul 16, 2020 · 15 revisions

Before writing a static analysis

Why I cannot write a data-flow analysis on-the-fly? Writing a data-flow analysis is a challenging task and can be tough. Therefore, you should be familiar with the underlying theory in order to be able to develop a novel data-flow analysis. We compiled a (incomplete) list of literature for you which you may consult before writing an analysis. Please have a look at Useful Literature.

Writing a static analysis

PhASAR provides several sophisticated mechanisms that allow you to write your own data-flow analysis. In general, PhASAR is designed in such a way that the analysis writer has to choose from several possible interfaces which he can use to implement a concrete static analysis. Depending on a user's analysis problem, some interfaces might be more suitable than others. When having found the right interface for the problem, the analysis writer usually only has to provide a new class which implements the interface's missing functionality which serves as the problem description for the analysis. This concrete problem description is handed over to the corresponding solver, which solves the problem in a fully automated manner. In the following the possible data-flow solvers are ordered according to their expressiveness (and complexity).

Choosing a control-flow graph

In general, all data-flow analyses are performed on a control-flow graph. When writing an analysis, a user has to choose a control-flow graph for their analysis. For that reason, all of our solvers work either on the CFG (intra-procedural control-flow graph) or on the ICFG (inter-procedural control-flow graph) interface.

For instance, when writing a simple intra-procedural data-flow analysis using the monotone framework, the user has to use one of CFG's concrete implementations or provide his own implementation for that interface. The pre-implemented LLVMBasedCFG.h should usually do the job and can be used out-of-the-box.

The inter-procedural call-string approach and the IFDS/IDE/WPDS frameworks solve a concrete analysis based on an inter-procedural control-flow graph. Depending on the analysis's needs, a forward-, backward-, or bi-directional inter-procedural control-flow graph may be used. For instance, if a backward analysis is required, the LLVMBasedBackwardICFG has to be used. However, most of the time the 'LLVMBasedICFG' should work just fine.

It is also possible to provide a custom implementation of the ICFG interface if necessary. In that case, a user just needs to provide another class for which they provide a reasonable name. The implementation must at least implement the interface that is defined by the ICFG.h interface.

Useful shortcuts

In the following section some useful coding shortcuts are presented which may come in handy when writing a new analysis within PhASAR.

The std::algorithm header

When overriding the classes in order to describe the problems within the monotone framework, oftentimes set operations like set union, set intersect or set difference are required. Writing these functions yourself is tedious. Therefore it makes much sense to use the existing set operation functions which are defined in the std::algorithm header file such as:

    ...
    std::includes /* subset */
    std::set_difference
    std::set_intersection
    std::set_union
    ...

The neat thing about these functions is that they are completely generic as they operate on iterators and provide several useful overloads. Thus, they work on all container types that follow the concept's required by these functions. In the following a small code example is presented:

std::set<int> a = {1, 2, 3};
std::set<int> b = {6, 4, 3, 2, 8, 1};
std::set<int> c;
std::set_union(a.begin(), a.end(),
               b.begin(), b.end(),
               inserter(c, c.begin()));
bool isSubSet = std::includes(b.begin(), b.end(), a.begin(), a.end());
The pre-defined flow function classes

When defining flow functions in IFDS, IDE or WPDS, sometimes a certain flow function type occurs more than once. For instance, the Kill flow function that kills a specific data-flow fact is often needed many times. An analysis writer can find several useful pre-defined flow (and edge) functions that can be directly used. A very useful example of such a recurring flow function is the Identity flow function. Some of the flow functions are also defined as a singleton in order to keep the amount of overhead as small as possible. This is always possible if the flow function has no state. A user may use the pre-defined flow functions inside their flow function factories. We also provide some LLVM specific flow functions for parameter mapping at call and return sites. We also provide some general edge functions that may be used when implementing an IDE analysis.

Here is a small example that makes use of the pre-defined flow functions:

// in this example let domain D be const llvm::Value*

shared_ptr<FlowFunction<const llvm::Value*>> getNormalFlowFunction(....) {
    // check the type of the instruction
    if( ...) {
        // do some work
    } else if (...) {
        // kill some data-flow fact
        return make_shared<Kill<const llvm::Value*>>(/* some fact to be killed */);
    } else {
        // just treat everything else as Identity
        return Identity<const llvm::Value*>::getInstance();
    }
}

Important template parameters

The code for the data-flow solvers is written in a very generic way. For that reason we use a lot of template parameters. Here we describe the most important template parameters:

  • D
    • The type of the data-flow facts of your data-flow domain D. When analyzing LLVM IR this will oftentimes be const llvm::Value *.
  • N
    • The type of nodes in you intra-/inter-procedural control-flow graph (aka statements/instructions). When using analysis targeting LLVM IR it will always be const llvm::Instruction *.
  • M
    • The type of functions/methods used by the framework. When using LLVM it will be const llvm::Function *.
  • I
    • The type of the inter-procedural control-flow graph to be used. Usually it will be some reference to a type implementing the ICFG.h interface, e.g. LLVMBasedICFG&.
  • V/L
    • This is the type for the second---value domain---of IDE problem. This type really depends on the concrete client analysis. When using IFDS you one does not have to worry about this type, since it is automatically chosen for the analysis writer as:
enum class BinaryDomain {
    BOTTOM = 0,
    TOP = 1
};
Clone this wiki locally