Add types to the graph.

ChrisCummins · ChrisCummins · commit 559f37f1c3b2 · 2020-08-21T23:49:47.000+01:00
This adds a fourth node type and a fourth edge flow, both called "type". The
idea is to represent types as first-class elements in the graph
representation. This allows greater compositionality by breaking up composite
types into subcomponents, and decreases the vocabulary size required to
achieve a given coverage.

Background
----------

Currently, type information is stored in the "text" field of nodes for
constants and variables, e.g.:

    node { type: VARIABLE text: "i8" }

There are two issues with this:

* Composite types end up with long textual representations, e.g.
  "struct foo { i32 a; i32 b; ... }". Since there is an unbounded number of
  possible structs, this prevents 100% vocabulary coverage on any IR with
  structs (or other composite types).

* In the future, we will want to encode different information on data nodes,
  such as embedding literal values. Moving the type information out of the
  data node "frees up" space for something else.

Overview
--------

This changes the representation so that types are first-class elements in the
graph. A "type" node represents a type using its "text" field, and a new
"type" edge connects this type to variables or constants of that type, e.g. a
variable "int x" could be represented as:

    node { type: VARIABLE text: "var" }
    node { type: TYPE text: "i32" }
    edge { flow: TYPE source: 1 }

Composite types
---------------

Types may be composed by connecting multiple type nodes using type edges.
This allows you to break down complex types into a graph of primitive parts.
The meaning of composite types will depend on the IR being targeted; the
remainder of this message describes the process for LLVM-IR.

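The "int x" example from the overview can be sketched in a few lines of
Python. This is only an illustration, not the project's API: the `Graph`,
`Node` and `Edge` classes and the enum numbering are hypothetical stand-ins
for the protocol buffer messages, and the two checks in `add_type_edge()`
mirror the validation that `AddTypeEdge()` performs in this change (the source
must be a type node, the target must not be an instruction).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class NodeType(Enum):
    # Numbering is an assumption made for this sketch only.
    INSTRUCTION = 0
    VARIABLE = 1
    CONSTANT = 2
    TYPE = 3


class Flow(Enum):
    # Numbering is an assumption made for this sketch only.
    CONTROL = 0
    DATA = 1
    CALL = 2
    TYPE = 3


@dataclass
class Node:
    type: NodeType
    text: str


@dataclass
class Edge:
    flow: Flow
    source: int = 0  # Defaults to 0, like an unset proto field.
    target: int = 0
    position: int = 0


@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, type: NodeType, text: str) -> int:
        """Append a node and return its index."""
        self.nodes.append(Node(type, text))
        return len(self.nodes) - 1

    def add_type_edge(self, source: int, target: int, position: int = 0) -> None:
        """Connect a type node to a variable, constant, or another type."""
        if self.nodes[source].type != NodeType.TYPE:
            raise ValueError("type edge source must be a type node")
        if self.nodes[target].type == NodeType.INSTRUCTION:
            raise ValueError("type edge target must not be an instruction")
        self.edges.append(Edge(Flow.TYPE, source, target, position))


# int x;
graph = Graph()
var = graph.add_node(NodeType.VARIABLE, "var")
i32 = graph.add_node(NodeType.TYPE, "i32")
graph.add_type_edge(source=i32, target=var)
```

The variable node no longer carries the type in its text, so every variable
shares the single vocabulary entry "var" and the type lives in its own node.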
Pointer types
-------------

A pointer is a composite of two types:

    [variable] <- [pointer] <- [pointed-type]

For example:

    int32_t* instance;

would be represented as:

    node { type: TYPE text: "i32" }
    node { type: TYPE text: "*" }
    node { type: VARIABLE text: "var" }
    edge { flow: TYPE target: 1 }
    edge { flow: TYPE source: 1 target: 2 }

Variables and constants of this type receive an incoming type edge from the
[pointer] node, which in turn receives an incoming type edge from the
[pointed-type] node.

One [pointer] node is generated for each unique pointer type. If a graph
contains multiple pointer types, there will be multiple [pointer] nodes, one
for each pointed type.

Struct types
------------

A struct is a composite type where each member is a type node that points to
the parent node. Variable/constant instances of a struct receive an incoming
type edge from the root struct node. Note that the graph of type nodes
representing a composite struct type may be cyclical, since a struct can
contain a pointer of the same type (think of a binary tree implementation).

The type edges from member nodes to the parent struct are positional. The
position indicates the element number. E.g. for a struct with three elements,
the incoming type edges to the struct node will have positions 0, 1, and 2.

This example struct:

    struct s {
      int8_t a;
      int8_t b;
      struct s* c;
    };

    struct s instance;

would be represented as:

    node { type: TYPE text: "struct" }
    node { type: TYPE text: "i8" }
    node { type: TYPE text: "*" }
    node { type: VARIABLE text: "var" }
    edge { flow: TYPE source: 1 }
    edge { flow: TYPE source: 1 position: 1 }
    edge { flow: TYPE source: 2 position: 2 }
    edge { flow: TYPE source: 2 }
    edge { flow: TYPE target: 3 }

Array types
-----------

An array is a composite type [variable] <- [array] <- [element-type].
For example, the array:

    int a[10];

would be represented as:

    node { type: TYPE text: "i32" }
    node { type: TYPE text: "[]" }
    node { type: VARIABLE text: "var" }
    edge { flow: TYPE target: 1 }
    edge { flow: TYPE source: 1 target: 2 }

github.com//issues/82
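To make the point about cyclical struct types concrete, here is a minimal
sketch of a memoized "get or create" traversal over a self-referencing type.
It is not the builder's implementation: the `Type` and `TypeGraph` classes are
hypothetical, and the sketch uses a single memo table, registered before
recursing, rather than the separate `compositeTypeParts_` map in this change.
It also follows the edge direction described in the "Pointer types" section,
with the pointed-to type as the source of the edge into the pointer node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(eq=False)  # Identity-based hashing, like a pointer to a type.
class Type:
    kind: str  # "primitive", "pointer", "array" or "struct".
    text: str = ""
    elements: List["Type"] = field(default_factory=list)


class TypeGraph:
    """Hypothetical container: node labels plus (source, target, position)."""

    LABELS = {"pointer": "*", "array": "[]", "struct": "struct"}

    def __init__(self) -> None:
        self.nodes: List[str] = []
        self.edges: List[Tuple[int, int, int]] = []
        self._index: Dict[Type, int] = {}

    def get_or_create(self, t: Type) -> int:
        """Return the node index for a type, adding nodes on first use."""
        if t in self._index:
            return self._index[t]
        self.nodes.append(self.LABELS.get(t.kind, t.text))
        node = len(self.nodes) - 1
        # Register the node before visiting members, so that a member which
        # refers back to this type resolves to it instead of recursing forever.
        self._index[t] = node
        for position, element in enumerate(t.elements):
            self.edges.append((self.get_or_create(element), node, position))
        return node


# struct s { int8_t a; int8_t b; struct s* c; };
i8 = Type("primitive", "i8")
s = Type("struct")
s.elements = [i8, i8, Type("pointer", elements=[s])]

g = TypeGraph()
root = g.get_or_create(s)
```

Here `g.nodes` holds one node each for "struct", "i8" and "*", the shared "i8"
node is reused for both members, and the edge from the struct node into the
pointer node closes the cycle.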
diff --git a/programl/graph/format/graphviz_converter.cc b/programl/graph/format/graphviz_converter.cc
@@ -134,7 +134,7 @@ labm8::Status SerializeGraphVizToString(const ProgramGraph& graph,
 
     // Determine the subgraph to add this node to.
     boost::subgraph<GraphvizGraph>* dst = &external;
-    if (i && node.type() != Node::CONSTANT) {
+    if (i && (node.type() == Node::INSTRUCTION || node.type() == Node::VARIABLE)) {
       dst = &functionGraphs[node.function()].get();
     }
 
@@ -192,29 +192,33 @@ labm8::Status SerializeGraphVizToString(const ProgramGraph& graph,
     }
     labm8::TruncateWithEllipsis(text, kMaximumLabelLen);
     attributes["label"] = text;
+    attributes["style"] = "filled";
 
     // Set the node shape.
     switch (node.type()) {
       case Node::INSTRUCTION:
         attributes["shape"] = "box";
-        attributes["style"] = "filled";
         attributes["fillcolor"] = "#3c78d8";
         attributes["fontcolor"] = "#ffffff";
         break;
       case Node::VARIABLE:
         attributes["shape"] = "ellipse";
-        attributes["style"] = "filled";
         attributes["fillcolor"] = "#f4cccc";
         attributes["color"] = "#990000";
         attributes["fontcolor"] = "#990000";
         break;
       case Node::CONSTANT:
-        attributes["shape"] = "diamond";
-        attributes["style"] = "filled";
+        attributes["shape"] = "octagon";
         attributes["fillcolor"] = "#e99c9c";
         attributes["color"] = "#990000";
         attributes["fontcolor"] = "#990000";
         break;
+      case Node::TYPE:
+        attributes["shape"] = "diamond";
+        attributes["fillcolor"] = "#cccccc";
+        attributes["color"] = "#cccccc";
+        attributes["fontcolor"] = "#222222";
+        break;
     }
   }
 
@@ -242,15 +246,21 @@ labm8::Status SerializeGraphVizToString(const ProgramGraph& graph,
         attributes["color"] = "#65ae4d";
         attributes["weight"] = "1";
         break;
+      case Edge::TYPE:
+        attributes["color"] = "#aaaaaa";
+        attributes["weight"] = "1";
+        attributes["penwidth"] = "1.5";
+        break;
     }
 
     // Set the edge label.
     if (edge.position()) {
       // Position labels for control edge are drawn close to the originating
-      // instruction. For data edges, they are drawn closer to the consuming
-      // instruction.
+      // instruction. For control edges, they are drawn close to the branching
+      // instruction. For data and type edges, they are drawn close to the
+      // consuming node.
       const string label =
-          edge.flow() == Edge::DATA ? "headlabel" : "taillabel";
+          edge.flow() == Edge::CONTROL ? "taillabel" : "headlabel";
       attributes[label] = std::to_string(edge.position());
       attributes["labelfontcolor"] = attributes["color"];
     }
diff --git a/programl/graph/program_graph_builder.cc b/programl/graph/program_graph_builder.cc
@@ -69,6 +69,10 @@ Node* ProgramGraphBuilder::AddConstant(const string& text) {
   return AddNode(Node::CONSTANT, text);
 }
 
+Node* ProgramGraphBuilder::AddType(const string& text) {
+  return AddNode(Node::TYPE, text);
+}
+
 labm8::StatusOr<Edge*> ProgramGraphBuilder::AddControlEdge(int32_t position,
                                                            const Node* source,
                                                            const Node* target) {
@@ -145,6 +149,26 @@ labm8::StatusOr<Edge*> ProgramGraphBuilder::AddCallEdge(const Node* source,
   return AddEdge(Edge::CALL, /*position=*/0, source, target);
 }
 
+labm8::StatusOr<Edge*> ProgramGraphBuilder::AddTypeEdge(int32_t position,
+                                                        const Node* source,
+                                                        const Node* target) {
+  DCHECK(source) << "nullptr argument";
+  DCHECK(target) << "nullptr argument";
+
+  if (source->type() != Node::TYPE) {
+    return Status(labm8::error::Code::INVALID_ARGUMENT,
+                  "Invalid source type ({}) for type edge. Expected type",
+                  Node::Type_Name(source->type()));
+  }
+  if (target->type() == Node::INSTRUCTION) {
+    return Status(labm8::error::Code::INVALID_ARGUMENT,
+                  "Invalid destination type (instruction) for type edge. "
+                  "Expected {variable,constant,type}");
+  }
+
+  return AddEdge(Edge::TYPE, position, source, target);
+}
+
 labm8::StatusOr<ProgramGraph> ProgramGraphBuilder::Build() {
   if (options().strict()) {
     RETURN_IF_ERROR(ValidateGraph());
diff --git a/programl/graph/program_graph_builder.h b/programl/graph/program_graph_builder.h
@@ -64,6 +64,8 @@ class ProgramGraphBuilder {
 
   Node* AddConstant(const string& text);
 
+  Node* AddType(const string& text);
+
   // Edge factories.
   [[nodiscard]] labm8::StatusOr<Edge*> AddControlEdge(int32_t position,
                                                       const Node* source,
@@ -76,6 +78,10 @@ class ProgramGraphBuilder {
   [[nodiscard]] labm8::StatusOr<Edge*> AddCallEdge(const Node* source,
                                                    const Node* target);
 
+  [[nodiscard]] labm8::StatusOr<Edge*> AddTypeEdge(int32_t position,
+                                                   const Node* source,
+                                                   const Node* target);
+
   const Node* GetRootNode() const { return &graph_.node(0); }
 
   // Return the graph protocol buffer.
@@ -123,7 +129,7 @@ class ProgramGraphBuilder {
   int32_t GetIndex(const Function* function);
   int32_t GetIndex(const Node* node);
 
-  // Maps which covert store the index of objects in repeated field lists.
+  // Maps that store the index of objects in repeated field lists.
   absl::flat_hash_map<Module*, int32_t> moduleIndices_;
   absl::flat_hash_map<Function*, int32_t> functionIndices_;
   absl::flat_hash_map<Node*, int32_t> nodeIndices_;
diff --git a/programl/ir/llvm/inst2vec_encoder.py b/programl/ir/llvm/inst2vec_encoder.py
@@ -122,6 +122,7 @@ def Encode(
     # Add the node features.
     var_embedding = self.dictionary["!IDENTIFIER"]
     const_embedding = self.dictionary["!IMMEDIATE"]
+    type_embedding = self.dictionary["!IMMEDIATE"]  # Types are immediates
 
     text_index = 0
     for node in proto.node:
@@ -143,6 +144,12 @@ def Encode(
         node.features.feature["inst2vec_embedding"].int64_list.value.append(
           const_embedding
         )
+      elif node.type == node_pb2.Node.TYPE:
+        node.features.feature["inst2vec_embedding"].int64_list.value.append(
+          type_embedding
+        )
+      else:
+        raise TypeError(f"Unknown node type {node}")
 
     proto.features.feature["inst2vec_annotated"].int64_list.value.append(1)
     return proto
diff --git a/programl/ir/llvm/internal/BUILD b/programl/ir/llvm/internal/BUILD
@@ -43,6 +43,7 @@ cc_library(
         "@com_google_absl//absl/container:flat_hash_set",
         "@labm8//labm8/cpp:status_macros",
         "@labm8//labm8/cpp:statusor",
+        "@labm8//labm8/cpp:logging",
         "@labm8//labm8/cpp:string",
         "@llvm//10.0.0",
     ],
diff --git a/programl/ir/llvm/internal/program_graph_builder.cc b/programl/ir/llvm/internal/program_graph_builder.cc
@@ -20,6 +20,7 @@
 
 #include "absl/container/flat_hash_map.h"
 #include "absl/container/flat_hash_set.h"
+#include "labm8/cpp/logging.h"
 #include "labm8/cpp/status_macros.h"
 #include "labm8/cpp/string.h"
 #include "llvm/IR/BasicBlock.h"
@@ -39,14 +40,18 @@ namespace ir {
 namespace llvm {
 namespace internal {
 
+namespace {
+
+BytesList* getStringsList(ProgramGraph* programGraph) {
+  return (*programGraph->mutable_features()->mutable_feature())["strings"].mutable_bytes_list();
+}
+
+}  // anonymous namespace
+
 ProgramGraphBuilder::ProgramGraphBuilder(const ProgramGraphOptions& options)
-    : programl::graph::ProgramGraphBuilder(),
-      options_(options),
+    : programl::graph::ProgramGraphBuilder(options),
       blockCount_(0),
-      stringsList_((*GetMutableProgramGraph()
-          ->mutable_features()
-          ->mutable_feature())["strings"]
-                       .mutable_bytes_list()) {
+      stringsList_(getStringsList(GetMutableProgramGraph())) {
   // Add an empty
   graph::AddScalarFeature(GetMutableRootNode(), "llvm_string", AddString(""));
 }
@@ -357,29 +362,125 @@ Node* ProgramGraphBuilder::AddLlvmInstruction(
 Node* ProgramGraphBuilder::AddLlvmVariable(const ::llvm::Instruction* operand,
                                            const programl::Function* function) {
   const LlvmTextComponents text = textEncoder_.Encode(operand);
-  Node* node = AddVariable(text.lhs_type, function);
+  Node* node = AddVariable("var", function);
   node->set_block(blockCount_);
   graph::AddScalarFeature(node, "llvm_string", AddString(text.lhs));
 
+  compositeTypeParts_.clear();  // Reset after previous call.
+  Node* type = GetOrCreateType(operand->getType());
+  CHECK(AddTypeEdge(/*position=*/0, type, node).ok());
+
   return node;
 }
 
 Node* ProgramGraphBuilder::AddLlvmVariable(const ::llvm::Argument* argument,
                                            const programl::Function* function) {
   const LlvmTextComponents text = textEncoder_.Encode(argument);
-  Node* node = AddVariable(text.lhs_type, function);
+  Node* node = AddVariable("var", function);
   node->set_block(blockCount_);
   graph::AddScalarFeature(node, "llvm_string", AddString(text.lhs));
 
+  compositeTypeParts_.clear();  // Reset after previous call.
+  Node* type = GetOrCreateType(argument->getType());
+  CHECK(AddTypeEdge(/*position=*/0, type, node).ok());
+
   return node;
 }
 
 Node* ProgramGraphBuilder::AddLlvmConstant(const ::llvm::Constant* constant) {
   const LlvmTextComponents text = textEncoder_.Encode(constant);
-  Node* node = AddConstant(text.lhs_type);
+  Node* node = AddConstant("val");
   node->set_block(blockCount_);
   graph::AddScalarFeature(node, "llvm_string", AddString(text.text));
 
+  compositeTypeParts_.clear();  // Reset after previous call.
+  Node* type = GetOrCreateType(constant->getType());
+  CHECK(AddTypeEdge(/*position=*/0, type, node).ok());
+
+  return node;
+}
+
+Node* ProgramGraphBuilder::AddLlvmType(const ::llvm::Type* type) {
+  // Dispatch to the type-specific handlers.
+  if (::llvm::dyn_cast<::llvm::StructType>(type)) {
+    return AddLlvmType(::llvm::dyn_cast<::llvm::StructType>(type));
+  } else if (::llvm::dyn_cast<::llvm::PointerType>(type)) {
+    return AddLlvmType(::llvm::dyn_cast<::llvm::PointerType>(type));
+  } else if (::llvm::dyn_cast<::llvm::FunctionType>(type)) {
+    return AddLlvmType(::llvm::dyn_cast<::llvm::FunctionType>(type));
+  } else if (::llvm::dyn_cast<::llvm::ArrayType>(type)) {
+    return AddLlvmType(::llvm::dyn_cast<::llvm::ArrayType>(type));
+  } else if (::llvm::dyn_cast<::llvm::VectorType>(type)) {
+    return AddLlvmType(::llvm::dyn_cast<::llvm::VectorType>(type));
+  } else {
+    const LlvmTextComponents text = textEncoder_.Encode(type);
+    Node *node = AddType(text.text);
+    graph::AddScalarFeature(node, "llvm_string", AddString(text.text));
+    return node;
+  }
+}
+
+Node* ProgramGraphBuilder::AddLlvmType(const ::llvm::StructType* type) {
+  Node* node = AddType("struct");
+  compositeTypeParts_[type] = node;
+  graph::AddScalarFeature(node, "llvm_string",
+                          AddString(textEncoder_.Encode(type).text));
+
+  // Add types for the struct elements, and add type edges.
+  for (int i = 0; i < type->getNumElements(); ++i) {
+    const auto& member = type->elements()[i];
+    // Re-use the type if it already exists to prevent duplication of member
+    // types.
+    auto memberNode = GetOrCreateType(member);
+    CHECK(AddTypeEdge(/*position=*/i, memberNode, node).ok());
+  }
+
+  return node;
+}
+
+Node* ProgramGraphBuilder::AddLlvmType(const ::llvm::PointerType* type) {
+  Node* node = AddType("*");
+  graph::AddScalarFeature(node, "llvm_string",
+                          AddString(textEncoder_.Encode(type).text));
+
+  auto elementType = type->getElementType();
+  auto parent = compositeTypeParts_.find(elementType);
+  if (parent == compositeTypeParts_.end()) {
+    // Re-use the type if it already exists to prevent duplication.
+    auto elementNode = GetOrCreateType(type->getElementType());
+    CHECK(AddTypeEdge(/*position=*/0, elementNode, node).ok());
+  } else {
+    // Bottom-out for self-referencing types.
+    CHECK(AddTypeEdge(/*position=*/0, node, parent->second).ok());
+  }
+
+  return node;
+}
+
+Node* ProgramGraphBuilder::AddLlvmType(const ::llvm::FunctionType* type) {
+  Node* node = AddType("fn");
+  graph::AddScalarFeature(node, "llvm_string",
+                          AddString(textEncoder_.Encode(type).text));
+  return node;
+}
+
+Node* ProgramGraphBuilder::AddLlvmType(const ::llvm::ArrayType* type) {
+  Node* node = AddType("[]");
+  graph::AddScalarFeature(node, "llvm_string",
+                          AddString(textEncoder_.Encode(type).text));
+  // Re-use the type if it already exists to prevent duplication.
+  auto elementType = GetOrCreateType(type->getElementType());
+  CHECK(AddTypeEdge(/*position=*/0, elementType, node).ok());
+  return node;
+}
+
+Node* ProgramGraphBuilder::AddLlvmType(const ::llvm::VectorType* type) {
+  Node* node = AddType("vector");
+  graph::AddScalarFeature(node, "llvm_string",
+                          AddString(textEncoder_.Encode(type).text));
+  // Re-use the type if it already exists to prevent duplication.
+  auto elementType = GetOrCreateType(type->getElementType());
+  CHECK(AddTypeEdge(/*position=*/0, elementType, node).ok());
   return node;
 }
 
diff --git a/programl/ir/llvm/internal/program_graph_builder.h b/programl/ir/llvm/internal/program_graph_builder.h
@@ -71,6 +71,12 @@ class ProgramGraphBuilder : public programl::graph::ProgramGraphBuilder {
 
   void Clear();
 
+  // Return the node representing a type. If no node already exists
+  // for this type, a new node is created and added to the graph. In
+  // the case of composite types, multiple new nodes may be added by
+  // this call, and the root type returned.
+  Node* GetOrCreateType(const ::llvm::Type* type);
+
  protected:
   [[nodiscard]] labm8::StatusOr<FunctionEntryExits> VisitFunction(
       const ::llvm::Function& function, const Function* functionMessage);
@@ -90,6 +96,12 @@ class ProgramGraphBuilder : public programl::graph::ProgramGraphBuilder {
   Node* AddLlvmVariable(const ::llvm::Argument* argument,
                         const Function* function);
   Node* AddLlvmConstant(const ::llvm::Constant* constant);
+  Node* AddLlvmType(const ::llvm::Type* type);
+  Node* AddLlvmType(const ::llvm::StructType* type);
+  Node* AddLlvmType(const ::llvm::PointerType* type);
+  Node* AddLlvmType(const ::llvm::FunctionType* type);
+  Node* AddLlvmType(const ::llvm::ArrayType* type);
+  Node* AddLlvmType(const ::llvm::VectorType* type);
 
   // Add a string to the strings list and return its position.
   //
@@ -118,6 +130,26 @@ class ProgramGraphBuilder : public programl::graph::ProgramGraphBuilder {
   absl::flat_hash_map<string, int32_t> stringsListPositions_;
   // The underlying storage for the strings table.
   BytesList* stringsList_;
+
+  // A map from an LLVM type to the node message that represents it.
+  absl::flat_hash_map<const ::llvm::Type*, Node*> types_;
+
+  // When adding a new type to the graph we need to know whether the type that
+  // we are adding is part of a composite type that references itself. For
+  // example:
+  //
+  //     struct BinaryTree {
+  //       int data;
+  //       struct BinaryTree* left;
+  //       struct BinaryTree* right;
+  //     }
+  //
+  // When the recursive GetOrCreateType() resolves the "left" member, it needs
+  // to know that the parent BinaryTree type has already been processed. This
+  // map stores the Nodes corresponding to any parent structs that have been
+  // already added in a call to GetOrCreateType(). It must be cleared between
+  // calls.
+  absl::flat_hash_map<const ::llvm::Type*, Node*> compositeTypeParts_;
 };
 
 }  // namespace internal
diff --git a/programl/ir/llvm/py/BUILD b/programl/ir/llvm/py/BUILD
@@ -39,6 +39,7 @@ py_test(
     srcs = ["llvm_test.py"],
     deps = [
         ":llvm",
+        "//programl/proto:edge_py",
         "//programl/proto:node_py",
         "//programl/proto:program_graph_options_py",
         "//programl/proto:program_graph_py",
diff --git a/programl/ir/llvm/py/llvm_test.py b/programl/ir/llvm/py/llvm_test.py
diff --git a/programl/proto/edge.proto b/programl/proto/edge.proto
diff --git a/programl/proto/node.proto b/programl/proto/node.proto