Skip to content

Performance improvement walk extraction #96

Open
@MartinBoeckling

Description

@MartinBoeckling

🚀 Feature/ Enhancement

Increase walk extraction speed

I have experimented with the pyRDF2Vec package for my thesis and therefore appreciated the provided python package. During the course of my thesis I came across the described performance limitations for large knowledge graphs. As I faced memory issues and long computation times also after optimizing the different parameter settings for the walk extractor I partially implemented the walk extraction with the python igraph package which gave a performance boost regarding the walk extraction. Since I had benefited from the reduced calculation time I thought it would be useful to share my implemented approach.

Additional context

The igraph package is partially implemented in C and implements multiple graph algorithms, like the Breadth-first search and Depth-first search algorithm. A performance comparison can be seen in the picture attached below where the Pyrdf2vec approach took around 53 minutes and the implemented approach with igraph took around 2 minutes. The performance comparison was measured with the following Google Colab document on the FB15k train dataset. As I did not transfer the Knowledge Graph object to an igraph object I directly read in the csv file.

If additional information regarding the implemented approach are necessary, feel free to reach out.

Solution

A possible solution approach can be found here:

# import packages
import pandas as pd
import numpy as np
from tqdm import tqdm
from igraph import Graph
from itertools import groupby
import multiprocessing as mp

def predicateGeneration(self, pathList):
  # assign class graph to graph variable
  graph = self.graph
  # extract description of edge given edge id stored in numpy
  predValues = np.array([e.attributes()['description'] for e in graph.es(pathList)])
  # extract node sequences that are part of the edge path and flatten numpy array
  nodeSequence = np.array([graph.vs().select(e.tuple).get_attribute_values('name') for e in graph.es(pathList)]).flatten()
  # delete consecutive character values in numpy array based from prior matrix
  nodeSequence = np.array([key for key, _group in groupby(nodeSequence)])
  # combine description values and node sequences to one single array
  pathSequence = np.insert(predValues, np.arange(len(nodeSequence)), nodeSequence)
  # convert numpy array to list 
  pathSequence = pathSequence.tolist()
  # return path sequence numpy array
  return pathSequence

def walkIteration(self, idNumber):
  # assign class graph variable to local graph variable
  graph = self.graph
  # extract index of graph node
  nodeIndex = graph.vs.find(idNumber).index
  # perform breadth-first search algorithm
  bfsList = graph.bfsiter(nodeIndex, 'out', advanced=True)
  # iterate over breadth-first search iterator object to filter those paths out
  # defined distance variable
  distanceList = [nodePath for nodePath in bfsList if nodePath[1] <= self.distance]
  # create vertex list from distance list extracting vertex element
  vertexList = [vertexElement[0] for vertexElement in distanceList]
  # compute shortest path from focused node index to extracted vertex list outputting edge ID
  shortestPathList = graph.get_shortest_paths(v=nodeIndex, to=vertexList, output='epath')
  # extract walk sequences with edge id to generate predicates
  walkSequence = list(map(self.predicateGeneration, shortestPathList))      
  if len(walkSequence) < maxWalks: maxWalks = len(walkSequence)
  random.seed(15)
  walkSequence = random.sample(walkSequence, maxWalks)
  return walkSequence

# Read a CSV file containing the entities we want to extract.
graphData = pd.read_csv(**filePath**)
# transform values of row values dataframe into list
graphDataValues = graphData.to_records(index=False)
# initialize Knowledge Graph
self.graph = Graph().TupleList(rowValues, directed=True, edge_attrs=['description'])
# initialize multiprocessing pool with cpu number
pool = mp.Pool(mp.cpu_count())
# extract walk predicates using the walkIteration method 
walkPredicateList = list(tqdm(pool.imap_unordered(self.walkIteration,
entities, chunksize=8), desc='Walk Extraction', total=len(entities)))

Performance Comparison

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions