Add types to graph (cherry-picked commit) #199

ChrisCummins · 2022-07-15T23:28:03Z

This supersedes #94 as a standalone commit.

This adds a fourth node type and a fourth edge flow, both called
"type". The idea is to represent types as first-class elements in the
graph representation. This allows greater compositionality by breaking
up composite types into subcomponents, and decreases the vocabulary
size required to achieve a given coverage.

Background

Currently, type information is stored in the "text" field of nodes for
constants and variables, e.g.:

node {
  type: VARIABLE
  text: "i8"
}

There are two issues with this:

* Composite types end up with long textual representations,
  e.g. "struct foo { i32 a; i32 b; ... }". Since there is an
  unbounded number of possible structs, this prevents 100% vocabulary
  coverage on any IR with structs (or other composite types).
* In the future, we will want to encode different information on data
  nodes, such as embedding literal values. Moving the type information
  out of the data node "frees up" space for something else.
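
As a rough illustration of the first point, the sketch below compares the vocabulary needed when each composite type is stored as a single string against the vocabulary needed when it is split into parts. The nested-tuple encoding and the helper names `flat` and `tokens` are assumptions made for this example only; they are not part of ProGraML.

```python
def flat(t):
    """Render a whole type as one string, as the old "text" field would."""
    if isinstance(t, str):
        return t
    return t[0] + "<" + ", ".join(flat(x) for x in t[1:]) + ">"


def tokens(t):
    """Collect the distinct node texts needed once the type is decomposed."""
    if isinstance(t, str):
        return {t}
    out = {t[0]}
    for x in t[1:]:
        out |= tokens(x)
    return out


# Hypothetical composite types: (kind, member, member, ...).
types = [
    ("struct", "i32", "i32"),
    ("struct", "i8", "i8"),
    ("struct", "i8", ("*", "i32")),
    ("struct", "i8", "i8", "i32"),
    ("*", ("struct", "i32", "i32")),
]

flat_vocab = {flat(t) for t in types}
node_vocab = set().union(*(tokens(t) for t in types))
print(len(flat_vocab), len(node_vocab))
```

Every new struct layout adds an entry to `flat_vocab`, while `node_vocab` only grows when a new primitive or composite kind appears.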

Overview

This changes the representation so that types are first-class
elements in the graph. A "type" node represents a type using its
"text" field, and a new "type" edge connects this type to variables or
constants of that type. For example, a variable "int x" could be
represented as:

node {
  type: VARIABLE
  text: "var"
}
node {
  type: TYPE
  text: "i32"
}
edge {
  flow: TYPE
  source: 1
}
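
A minimal sketch of the same graph built in plain Python, to make the edge direction explicit. The `Node`, `Edge`, and `Graph` classes are hypothetical stand-ins written for this example, not the ProGraML API; the defaults of 0 mirror how an omitted `source` or `target` reads in the message above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str  # e.g. "VARIABLE", "CONSTANT", "TYPE"
    text: str


@dataclass
class Edge:
    flow: str
    source: int = 0  # an omitted field reads as 0, as in the example above
    target: int = 0
    position: int = 0


@dataclass
class Graph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, type, text):
        self.nodes.append(Node(type, text))
        return len(self.nodes) - 1

    def add_type_edge(self, source, target, position=0):
        self.edges.append(Edge("TYPE", source, target, position))


g = Graph()
var = g.add_node("VARIABLE", "var")  # node 0
i32 = g.add_node("TYPE", "i32")      # node 1
g.add_type_edge(source=i32, target=var)  # the type points at the variable
```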

Composite types

Types may be composed by connecting multiple type nodes using type
edges. This allows you to break down complex types into a graph of
primitive parts. The meaning of composite types depends on the IR
being targeted; the remainder of this description covers the process
for LLVM-IR.

Pointer types

A pointer is a composite of two types:

[variable] <- [pointer] <- [pointed-type]

For example:

int32_t* instance;

Would be represented as:

node {
  type: TYPE
  text: "i32"
}
node {
  type: TYPE
  text: "*"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  target: 1
}
edge {
  flow: TYPE
  source: 1
  target: 2
}

Variables/constants of this type receive an incoming type edge
from the [pointer] node, which in turn receives an incoming type edge
from the [pointed-type] node.

One [pointer] node is generated for each unique pointer type. If a
graph contains multiple pointer types, there will be multiple
[pointer] nodes, one for each pointed type.
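
The "one [pointer] node per pointed type" rule can be sketched as a small lookup table. This is an illustration under assumptions: the list-based graph and the `add_node` and `pointer_to` helpers are hypothetical names for this example and do not come from the ProGraML sources.

```python
nodes = []          # (node_type, text)
edges = []          # (source, target) type edges
pointer_nodes = {}  # pointed-type node index -> [pointer] node index


def add_node(node_type, text):
    nodes.append((node_type, text))
    return len(nodes) - 1


def pointer_to(pointee):
    """Return the [pointer] node for `pointee`, creating it on first use."""
    if pointee not in pointer_nodes:
        ptr = add_node("TYPE", "*")
        edges.append((pointee, ptr))  # [pointer] <- [pointed-type]
        pointer_nodes[pointee] = ptr
    return pointer_nodes[pointee]


i32 = add_node("TYPE", "i32")
f32 = add_node("TYPE", "float")

# Two i32* variables share one [pointer] node; the float* variable gets its own.
for pointee in (i32, i32, f32):
    var = add_node("VARIABLE", "var")
    edges.append((pointer_to(pointee), var))  # [variable] <- [pointer]

pointer_count = sum(1 for _, text in nodes if text == "*")
```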

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
int *A() {
  int a;
  return &a;
}
EOF

Struct types

A struct is a composite type where each member is a type node which
points to the parent node. Variable/constant instances of a struct
receive an incoming type edge from the root struct node. Note that
the graph of type nodes representing a composite struct type may be
cyclical, since a struct can contain a pointer to the same type (think
of a binary tree implementation). For all other member types, a new
type node is produced. For example, a struct with two integer members
will produce two integer type nodes; they are not shared.

The type edges from member nodes to the parent struct are
positional. The position indicates the element number. E.g. for a
struct with three elements, the incoming type edges to the struct node
will have positions 0, 1, and 2.

This example struct:

struct s {
  int8_t a;
  int8_t b;
  struct s* c;
};

struct s instance;

Would be represented as:

node {
  type: TYPE
  text: "struct"
}
node {
  type: TYPE
  text: "i8"
}
node {
  type: TYPE
  text: "i8"
}
node {
  type: TYPE
  text: "*"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  source: 1
}
edge {
  flow: TYPE
  source: 2
  position: 1
}
edge {
  flow: TYPE
  source: 3
  position: 2
}
edge {
  flow: TYPE
  target: 3
}
edge {
  flow: TYPE
  target: 4
}
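
The following sketch builds the same struct by following the prose description above: each member type node points at the struct node with a positional edge, and the pointer member also receives an edge from the struct node, which closes the cycle. The tuple-based graph and the `add_node` helper are assumptions for illustration only.

```python
nodes = []  # (node_type, text)
edges = []  # (source, target, position) type edges


def add_node(node_type, text):
    nodes.append((node_type, text))
    return len(nodes) - 1


struct = add_node("TYPE", "struct")
members = [
    add_node("TYPE", "i8"),  # int8_t a
    add_node("TYPE", "i8"),  # int8_t b (a separate node, not shared with a)
    add_node("TYPE", "*"),   # struct s* c
]

# Member nodes point at the parent struct; position is the element number.
for position, member in enumerate(members):
    edges.append((member, struct, position))

# The pointer member points to struct s itself: [pointer] <- [pointed-type].
edges.append((struct, members[2], 0))

# An instance of the struct receives a type edge from the root struct node.
instance = add_node("VARIABLE", "var")
edges.append((struct, instance, 0))

incoming_positions = sorted(p for _, target, p in edges if target == struct)
```

Here `incoming_positions` holds the positions of the three member edges arriving at the struct node.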

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
struct S {
  char a;
  char b;
  struct S* c;
};

char A() {
  struct S s;
  return s.a;
}
EOF

Array Types

An array is a composite type [variable] <- [array] <- [element-type].
For example, the array:

int a[10];

Would be represented as:

node {
  type: TYPE
  text: "i32"
}
node {
  type: TYPE
  text: "[]"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  target: 1
}
edge {
  flow: TYPE
  source: 1
  target: 2
}
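
To read a type back out of the graph, a consumer can follow incoming type edges from a variable until it reaches a node with none. The sketch below does this for the array example; the list layout and the `type_chain` helper are hypothetical, and the `seen` set is only there to stop on cyclic type graphs such as self-referential structs.

```python
nodes = ["i32", "[]", "var"]   # node texts, indexed as in the example above
type_edges = [(0, 1), (1, 2)]  # (source, target)


def type_chain(node):
    """Follow incoming type edges from `node`, outermost type first."""
    chain = []
    seen = set()
    while True:
        sources = [s for s, t in type_edges if t == node and s not in seen]
        if not sources:
            return chain
        node = sources[0]
        seen.add(node)
        chain.append(nodes[node])


chain = type_chain(2)  # start from the variable node
```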

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
int* A() {
  int a[10];
  return a;
}
EOF

Function Pointers

A function pointer is represented by a type node that uniquely identifies
the signature of a function, i.e. its return type and parameter types. The
caveat of this is that pointers to different functions which have the same
signature will resolve to the same type node. Additionally, there is no edge
connecting a function pointer type and the instructions which belong to this
function.
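
A small sketch of why same-signature pointers alias: if the type node is keyed on the signature alone, two different functions with equal signatures map to one node. The `function_pointer_type` helper and the node text it produces are assumptions for this example; the text format used by the real implementation may differ.

```python
nodes = []            # (node_type, text)
signature_nodes = {}  # (return type, parameter types) -> type node index


def function_pointer_type(return_type, param_types):
    """Return the type node for a signature, creating it on first use."""
    key = (return_type, tuple(param_types))
    if key not in signature_nodes:
        # Hypothetical text format, chosen only for readability here.
        text = f"{return_type} ({', '.join(param_types)})*"
        nodes.append(("TYPE", text))
        signature_nodes[key] = len(nodes) - 1
    return signature_nodes[key]


a = function_pointer_type("i32", [])    # pointer to one int(void) function
b = function_pointer_type("i32", [])    # pointer to a different int(void) function
c = function_pointer_type("float", [])  # pointer to a float(void) function
```

Nothing in the key identifies which function is pointed to, so `a` and `b` resolve to the same node while `c` gets its own.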

Example

This program contains two function signatures (int (void) and
float (void)), but function pointers to three different functions. This
highlights the caveat described above as the function pointers a and b
alias to the same type node:

Generated using:

br -c opt //:install && cat <<EOF | clang2graph -xc - | graph2dot
int A() {
  return 10;
}

int B() {
  return 5;
}

float C() {
  return 15;
}

int D() {
  int (*a)() = &A;
  int (*b)() = &B;
  float (*c)() = &C;
  return (*a)() + (*b)() + (*c)();
}
EOF

#82


ChrisCummins · 2022-07-16T01:00:35Z

All tests pass locally:

$ make test
bazel  test  //...
INFO: Analyzed 148 targets (0 packages loaded, 0 targets configured).
INFO: Found 121 targets and 27 test targets...
INFO: From Executing genrule //Documentation/bin:clang2graph-10:
error: unable to handle compilation, expected exactly one compiler job in ''
INFO: From Executing genrule //Documentation/bin:inst2vec:
Using backend: pytorch
----------------
Note: The failure of target //programl/bin:inst2vec (with exit code 1) may have been caused by the fact that it is running under Python 3 instead of Python 2. Examine the error to determine if that appears to be the problem. Since this target is built in the host configuration, the only way to change its version is to set --host_force_python=PY2, which affects the entire build.

If this error started occurring in Bazel 0.27 and later, it may be because the Python toolchain now enforces that targets analyzed as PY2 and PY3 run under a Python 2 and Python 3 interpreter, respectively. See https://github.com/bazelbuild/bazel/issues/7899 for more information.
----------------
INFO: Elapsed time: 254.878s, Critical Path: 254.53s
INFO: 90 processes: 90 linux-sandbox.
INFO: Build completed successfully, 54 total actions
//tests:from_llvm_ir_test                                       (cached) PASSED in 49.4s
  Stats over 4 runs: max = 49.4s, min = 44.2s, avg = 45.9s, dev = 2.0s
//tests/graph/analysis:liveness_test                            (cached) PASSED in 0.1s
//tests/graph/format:cdfg_test                                  (cached) PASSED in 0.1s
//tests/graph/format:graph_serializer_test                      (cached) PASSED in 0.1s
//tests/graph/format:graph_tuple_test                           (cached) PASSED in 0.2s
//tests/graph/format:graphviz_converter_test                    (cached) PASSED in 0.1s
//tests/graph/format:node_link_graph_test                       (cached) PASSED in 0.1s
//tests/ir/llvm:clang_test                                      (cached) PASSED in 0.3s
//tests/ir/xla:hlo_proto_reader_test                            (cached) PASSED in 0.1s
//tests/util/py:decorators_test                                 (cached) PASSED in 2.8s
//tests/util/py:pbutil_test                                     (cached) PASSED in 2.0s
//tests/util/py:progress_test                                   (cached) PASSED in 85.3s
  Stats over 4 runs: max = 85.3s, min = 6.5s, avg = 35.0s, dev = 30.0s
//benchmarks:benchmark_dataflow_analyses                                 PASSED in 83.9s
//benchmarks:benchmark_llvm2graph                                        PASSED in 40.1s
//tasks/devmap/dataset:create_test                                       PASSED in 254.5s
//tests:from_cpp_test                                                    PASSED in 5.1s
//tests:from_xla_hlo_proto_test                                          PASSED in 4.9s
//tests/cmd:llvm2graph_strict_mode                                       PASSED in 1.3s
//tests/graph/analysis:datadep_test                                      PASSED in 13.8s
//tests/graph/analysis:dominance_test                                    PASSED in 50.3s
//tests/graph/analysis:reachability_test                                 PASSED in 15.1s
//tests/graph/analysis:subexpressions_test                               PASSED in 7.8s
//tests:serialize_ops_test                                               PASSED in 64.9s
  Stats over 3 runs: max = 64.9s, min = 10.1s, avg = 36.3s, dev = 22.4s
//tests:from_clang_test                                                  PASSED in 28.2s
  Stats over 4 runs: max = 28.2s, min = 25.4s, avg = 26.2s, dev = 1.2s
//tests:to_dot_test                                                      PASSED in 35.4s
  Stats over 8 runs: max = 35.4s, min = 26.6s, avg = 31.1s, dev = 2.7s
//tests:to_json_test                                                     PASSED in 27.2s
  Stats over 8 runs: max = 27.2s, min = 22.3s, avg = 24.6s, dev = 1.5s
//tests:to_networkx_test                                                 PASSED in 27.0s
  Stats over 8 runs: max = 27.0s, min = 22.5s, avg = 25.3s, dev = 1.6s

Executed 15 out of 27 tests: 27 tests pass.
INFO: Build completed successfully, 54 total actions

The CI is on fire, but that is a problem for another time. Merging.

ChrisCummins/ProGraML#199

ChrisCummins force-pushed the feature/82-cherry-pick branch 2 times, most recently from 115a81c to 8318ab9 Compare July 16, 2022 00:03

ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this pull request Jul 16, 2022

Bump ProGraML to include the type graph patch.

760816c

ChrisCummins/ProGraML#199

ChrisCummins mentioned this pull request Jul 16, 2022

[WIP] Bump ProGraML to include the type graph patch. facebookresearch/CompilerGym#731

Draft

ChrisCummins and others added 2 commits July 15, 2022 17:55

[pre-commit] Bump black version

29c6f72

ChrisCummins force-pushed the feature/82-cherry-pick branch from 8318ab9 to 7ba7806 Compare July 16, 2022 00:55

ChrisCummins mentioned this pull request Jul 16, 2022

Add types to the graph #94

Closed

[pre-commit] Update go install steps.

ff4e787

Bump pre-commit python dependencies.

1912c53

ChrisCummins merged commit ad83aeb into development Jul 16, 2022

ChrisCummins deleted the feature/82-cherry-pick branch July 16, 2022 01:05

ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this pull request Jul 16, 2022

Bump ProGraML to include the type graph patch.

1d1f13e

ChrisCummins/ProGraML#199
