Add types to graph (cherry-picked commit) #199

Merged
merged 4 commits into development from feature/82-cherry-pick
Jul 16, 2022

ChrisCummins
Owner

This supersedes #94 as a standalone commit.

This adds a fourth node type and a fourth edge flow, both called
"type". The idea is to represent types as first-class elements in the
graph representation. This allows greater compositionality by breaking
up composite types into subcomponents, and decreases the vocabulary
size required to achieve a given coverage.

Background

Currently, type information is stored in the "text" field of nodes for
constants and variables, e.g.:

node {
  type: VARIABLE
  text: "i8"
}

There are two issues with this:

  • Composite types end up with long textual representations,
    e.g. "struct foo { i32 a; i32 b; ... }". Since there is an
    unbounded number of possible structs, this prevents 100% vocabulary
    coverage on any IR with structs (or other composite types).

  • In the future, we will want to encode different information on data
    nodes, such as embedding literal values. Moving the type information
    out of the data node "frees up" space for something else.
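
As a rough illustration of the vocabulary argument, the sketch below compares the two schemes on a handful of made-up type strings. The strings and the `primitive_tokens` helper are hypothetical and are not taken from ProGraML; the point is only that whole-string entries grow with every new struct while the set of primitive parts stays small.

```python
# Hypothetical type strings as they might appear in node "text" fields.
composite_texts = [
    "struct { i32, i32 }",
    "struct { i32, i8* }",
    "struct { i8*, i8* }",
    "struct { i32, struct { i32, i8* } }",
]

# Current scheme: every distinct composite string is its own vocabulary entry.
whole_string_vocab = set(composite_texts)


def primitive_tokens(text):
    """Split a type string into its primitive parts (illustrative tokenizer)."""
    tokens = set()
    cleaned = text.replace("{", " ").replace("}", " ").replace(",", " ")
    for word in cleaned.split():
        base = word.rstrip("*")  # "i8*" -> "i8"
        if base:
            tokens.add(base)
        if word.endswith("*"):
            tokens.add("*")
    return tokens


# Decomposed scheme: only the primitive parts need vocabulary entries.
decomposed_vocab = set()
for text in composite_texts:
    decomposed_vocab |= primitive_tokens(text)

# A struct that was never seen before adds a whole-string entry,
# but no new primitive token.
unseen = "struct { i8, i8 }"
print(unseen in whole_string_vocab)
print(primitive_tokens(unseen) <= decomposed_vocab)
```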

Overview

This change represents types as first-class elements in the graph. A
"type" node represents a type using its "text" field, and a new "type"
edge connects this type to variables or constants of that type. For
example, a variable "int x" could be represented as:

node {
  type: VARIABLE
  text: "var"
}
node {
  type: TYPE
  text: "i32"
}
edge {
  flow: TYPE
  source: 1
}
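
The same graph can be mirrored in plain Python to show how a variable's type would be looked up. This is a minimal sketch, not the ProGraML API: the `Node` and `Edge` classes are hypothetical stand-ins that copy the field names used above, and it assumes that an unset `source`, `target`, or `position` reads as 0.

```python
from dataclasses import dataclass


@dataclass
class Node:
    type: str
    text: str


@dataclass
class Edge:
    flow: str
    source: int = 0  # assumption: unset fields read as 0
    target: int = 0
    position: int = 0


nodes = [Node("VARIABLE", "var"), Node("TYPE", "i32")]
edges = [Edge("TYPE", source=1)]  # type node 1 -> variable node 0


def type_of(node_index):
    """Return the text of the type node with a type edge into node_index."""
    for edge in edges:
        if edge.flow == "TYPE" and edge.target == node_index:
            return nodes[edge.source].text
    return None


print(type_of(0))
```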

Composite types

Types may be composed by connecting multiple type nodes using type
edges. This allows you to break down complex types into a graph of
primitive parts. The meaning of composite types depends on the IR
being targeted. The remainder of this description covers LLVM-IR.

Pointer types

A pointer is a composite of two types:

[variable] <- [pointer] <- [pointed-type]

For example:

int32_t* instance;

Would be represented as:

node {
  type: TYPE
  text: "i32"
}
node {
  type: TYPE
  text: "*"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  target: 1
}
edge {
  flow: TYPE
  source: 1
  target: 2
}

Variables/constants of this type receive an incoming type edge
from the [pointer] node, which in turn receives an incoming type edge
from the [pointed-type] node.

One [pointer] node is generated for each unique pointer type. If a
graph contains multiple pointer types, there will be multiple
[pointer] nodes, one for each pointed type.
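
To make the chain concrete, here is a small sketch that walks the type edges backwards from the variable and spells out the full type. The tuple encoding and the `spell` and `variable_type` helpers are illustrative assumptions; node order follows the pointer example above.

```python
# Node index -> (type, text), in the same order as the pointer example.
nodes = [("TYPE", "i32"), ("TYPE", "*"), ("VARIABLE", "var")]

# Type edges as (source, target): i32 -> "*" -> var.
type_edges = [(0, 1), (1, 2)]


def incoming(target):
    """Sources of all type edges that end at `target`."""
    return [s for s, t in type_edges if t == target]


def spell(type_index):
    """Render a type node, following pointer nodes to their pointed type."""
    _, text = nodes[type_index]
    if text == "*":
        (pointee,) = incoming(type_index)
        return spell(pointee) + "*"
    return text


def variable_type(var_index):
    (type_index,) = incoming(var_index)
    return spell(type_index)


print(variable_type(2))
```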

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
int *A() {
  int a;
  return &a;
}
EOF

Struct types

A struct is a composite type where each member is a type node which
points to the parent node. Variable/constant instances of a struct
receive an incoming type edge from the root struct node. Note that
the graph of type nodes representing a composite struct type may be
cyclical, since a struct can contain a pointer of the same type (think
of a binary tree implementation). For all other member types, a new
type node is produced. For example, a struct with two integer members
will produce two integer type nodes; they are not shared.

The type edges from member nodes to the parent struct are
positional. The position indicates the element number. E.g. for a
struct with three elements, the incoming type edges to the struct node
will have positions 0, 1, and 2.

This example struct:

struct s {
  int8_t a;
  int8_t b;
  struct s* c;
};

struct s instance;

Would be represented as:

node {
  type: TYPE
  text: "struct"
}
node {
  type: TYPE
  text: "i8"
}
node {
  type: TYPE
  text: "i8"
}
node {
  type: TYPE
  text: "*"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  source: 1
}
edge {
  flow: TYPE
  source: 2
  position: 1
}
edge {
  flow: TYPE
  source: 3
  position: 2
}
edge {
  flow: TYPE
  target: 3
}
edge {
  flow: TYPE
  target: 4
}
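
The positional edges and the cycle can be checked with a short sketch. It follows the edge direction described in the prose (each member's type edge targets the struct node, and the struct is the pointed type of its own pointer member). The tuple encoding and both helper functions are assumptions made for illustration, not ProGraML code.

```python
# Node index -> (type, text), matching the struct example's node order.
nodes = [
    ("TYPE", "struct"),   # 0
    ("TYPE", "i8"),       # 1: member a
    ("TYPE", "i8"),       # 2: member b, not shared with member a
    ("TYPE", "*"),        # 3: member c, a pointer back to the struct
    ("VARIABLE", "var"),  # 4
]

# Type edges as (source, target, position).
type_edges = [
    (1, 0, 0), (2, 0, 1), (3, 0, 2),  # members -> struct, by element number
    (0, 3, 0),                        # struct -> pointer (pointed type)
    (0, 4, 0),                        # struct -> variable instance
]


def members(struct_index):
    """Member type texts of a struct, ordered by edge position."""
    found = [(pos, src) for src, dst, pos in type_edges if dst == struct_index]
    return [nodes[src][1] for pos, src in sorted(found)]


def has_type_cycle():
    """Depth-first search for a cycle among type nodes only."""
    successors = {}
    for src, dst, _ in type_edges:
        if nodes[dst][0] == "TYPE":
            successors.setdefault(src, []).append(dst)
    state = {}

    def visit(n):
        state[n] = "open"
        for m in successors.get(n, []):
            if state.get(m) == "open" or (m not in state and visit(m)):
                return True
        state[n] = "done"
        return False

    return any(n not in state and visit(n) for n in range(len(nodes)))


print(members(0), has_type_cycle())
```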

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
struct S {
  char a;
  char b;
  struct S* c;
};

char A() {
  struct S s;
  return s.a;
}
EOF

Array Types

An array is a composite type [variable] <- [array] <- [element-type].
For example, the array:

int a[10];

Would be represented as:

node {
  type: TYPE
  text: "i32"
}
node {
  type: TYPE
  text: "[]"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  target: 1
}
edge {
  flow: TYPE
  source: 1
  target: 2
}

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
int* A() {
  int a[10];
  return a;
}
EOF

Function Pointers

A function pointer is represented by a type node that uniquely identifies
the signature of a function, i.e. its return type and parameter types. The
caveat is that pointers to different functions with the same signature
resolve to the same type node. Additionally, there is no edge connecting
a function pointer type to the instructions of the function it points to.
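
The aliasing caveat falls out of keying type nodes by signature. The sketch below shows that effect with a hypothetical `function_pointer_type` helper that interns one node per (return type, parameter types) pair; it illustrates the consequence and makes no claim about how the graph builder is implemented.

```python
# Signature -> type node index. One node per unique signature.
type_nodes = {}


def function_pointer_type(return_type, param_types):
    """Return the type node index for a signature, creating it if needed."""
    key = (return_type, tuple(param_types))
    if key not in type_nodes:
        type_nodes[key] = len(type_nodes)
    return type_nodes[key]


# Pointers to three different functions, but only two signatures.
a = function_pointer_type("i32", [])    # e.g. a pointer to int A()
b = function_pointer_type("i32", [])    # e.g. a pointer to int B()
c = function_pointer_type("float", [])  # e.g. a pointer to float C()

print(a == b, a == c, len(type_nodes))
```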

Example

This program contains two function signatures (int (void) and
float (void)), but function pointers to three different functions. This
highlights the caveat described above as the function pointers a and b
alias to the same type node:

Generated using:

br -c opt //:install && cat <<EOF | clang2graph -xc - | graph2dot
int A() {
  return 10;
}

int B() {
  return 5;
}

float C() {
  return 15;
}

int D() {
  int (*a)() = &A;
  int (*b)() = &B;
  float (*c)() = &C;
  return (*a)() + (*b)() + (*c)();
}
EOF

#82

ChrisCummins and others added 2 commits July 15, 2022 17:55
github.com//issues/82
@ChrisCummins ChrisCummins force-pushed the feature/82-cherry-pick branch from 8318ab9 to 7ba7806 Compare July 16, 2022 00:55
@ChrisCummins
Owner Author

All tests pass locally:

$ make test
bazel  test  //...
INFO: Analyzed 148 targets (0 packages loaded, 0 targets configured).
INFO: Found 121 targets and 27 test targets...
INFO: From Executing genrule //Documentation/bin:clang2graph-10:
error: unable to handle compilation, expected exactly one compiler job in ''
INFO: From Executing genrule //Documentation/bin:inst2vec:
Using backend: pytorch
----------------
Note: The failure of target //programl/bin:inst2vec (with exit code 1) may have been caused by the fact that it is running under Python 3 instead of Python 2. Examine the error to determine if that appears to be the problem. Since this target is built in the host configuration, the only way to change its version is to set --host_force_python=PY2, which affects the entire build.

If this error started occurring in Bazel 0.27 and later, it may be because the Python toolchain now enforces that targets analyzed as PY2 and PY3 run under a Python 2 and Python 3 interpreter, respectively. See https://github.com/bazelbuild/bazel/issues/7899 for more information.
----------------
INFO: Elapsed time: 254.878s, Critical Path: 254.53s
INFO: 90 processes: 90 linux-sandbox.
INFO: Build completed successfully, 54 total actions
//tests:from_llvm_ir_test                                       (cached) PASSED in 49.4s
  Stats over 4 runs: max = 49.4s, min = 44.2s, avg = 45.9s, dev = 2.0s
//tests/graph/analysis:liveness_test                            (cached) PASSED in 0.1s
//tests/graph/format:cdfg_test                                  (cached) PASSED in 0.1s
//tests/graph/format:graph_serializer_test                      (cached) PASSED in 0.1s
//tests/graph/format:graph_tuple_test                           (cached) PASSED in 0.2s
//tests/graph/format:graphviz_converter_test                    (cached) PASSED in 0.1s
//tests/graph/format:node_link_graph_test                       (cached) PASSED in 0.1s
//tests/ir/llvm:clang_test                                      (cached) PASSED in 0.3s
//tests/ir/xla:hlo_proto_reader_test                            (cached) PASSED in 0.1s
//tests/util/py:decorators_test                                 (cached) PASSED in 2.8s
//tests/util/py:pbutil_test                                     (cached) PASSED in 2.0s
//tests/util/py:progress_test                                   (cached) PASSED in 85.3s
  Stats over 4 runs: max = 85.3s, min = 6.5s, avg = 35.0s, dev = 30.0s
//benchmarks:benchmark_dataflow_analyses                                 PASSED in 83.9s
//benchmarks:benchmark_llvm2graph                                        PASSED in 40.1s
//tasks/devmap/dataset:create_test                                       PASSED in 254.5s
//tests:from_cpp_test                                                    PASSED in 5.1s
//tests:from_xla_hlo_proto_test                                          PASSED in 4.9s
//tests/cmd:llvm2graph_strict_mode                                       PASSED in 1.3s
//tests/graph/analysis:datadep_test                                      PASSED in 13.8s
//tests/graph/analysis:dominance_test                                    PASSED in 50.3s
//tests/graph/analysis:reachability_test                                 PASSED in 15.1s
//tests/graph/analysis:subexpressions_test                               PASSED in 7.8s
//tests:serialize_ops_test                                               PASSED in 64.9s
  Stats over 3 runs: max = 64.9s, min = 10.1s, avg = 36.3s, dev = 22.4s
//tests:from_clang_test                                                  PASSED in 28.2s
  Stats over 4 runs: max = 28.2s, min = 25.4s, avg = 26.2s, dev = 1.2s
//tests:to_dot_test                                                      PASSED in 35.4s
  Stats over 8 runs: max = 35.4s, min = 26.6s, avg = 31.1s, dev = 2.7s
//tests:to_json_test                                                     PASSED in 27.2s
  Stats over 8 runs: max = 27.2s, min = 22.3s, avg = 24.6s, dev = 1.5s
//tests:to_networkx_test                                                 PASSED in 27.0s
  Stats over 8 runs: max = 27.0s, min = 22.5s, avg = 25.3s, dev = 1.6s

Executed 15 out of 27 tests: 27 tests pass.
INFO: Build completed successfully, 54 total actions

The CI is on fire, but that is a problem for another time. Merging.

@ChrisCummins ChrisCummins merged commit ad83aeb into development Jul 16, 2022
@ChrisCummins ChrisCummins deleted the feature/82-cherry-pick branch July 16, 2022 01:05
ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this pull request Jul 16, 2022