Description
🚀 Feature / Enhancement
Increase walk extraction speed
I have experimented with the pyRDF2Vec package for my thesis and really appreciated the provided Python package. During the course of my thesis I came across the described performance limitations for large knowledge graphs. As I faced memory issues and long computation times even after optimizing the different parameter settings of the walk extractor, I partially reimplemented the walk extraction with the Python igraph package, which gave a noticeable performance boost. Since I benefited from the reduced computation time, I thought it would be useful to share my approach.
Additional context
The igraph package is partially implemented in C and provides multiple graph algorithms, such as breadth-first search and depth-first search. A performance comparison can be seen in the picture attached below: the pyRDF2Vec approach took around 53 minutes, while the igraph-based approach took around 2 minutes. The comparison was measured with the following Google Colab document on the FB15k train dataset. As I did not convert the pyRDF2Vec Knowledge Graph object to an igraph object, I read in the CSV file directly.
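For illustration, here is a minimal sketch of how such a conversion could look, assuming the triples are available as an iterable of (subject, object, predicate) tuples; the triples below are only placeholders, and the exact way to obtain them from the pyRDF2Vec Knowledge Graph object may differ:
from igraph import Graph

# placeholder triples for illustration; in practice these would come from the
# pyRDF2Vec Knowledge Graph object or, as in my case, directly from the CSV file
triples = [
    ('Brussels', 'Belgium', 'capitalOf'),
    ('Belgium', 'Europe', 'locatedIn'),
]

# the first two tuple elements become the source and target vertices,
# the remaining element is stored as the edge attribute 'description'
igraphObject = Graph.TupleList(triples, directed=True, edge_attrs=['description'])
print(igraphObject.summary())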
If additional information regarding the implemented approach is necessary, feel free to reach out.
Solution
A possible solution approach can be found below. To keep the snippet self-contained, the two walk extraction methods are wrapped in a small helper class (IGraphWalker is just an illustrative name):
# import packages
import random
import multiprocessing as mp
from itertools import groupby

import numpy as np
import pandas as pd
from igraph import Graph
from tqdm import tqdm
# illustrative wrapper class so that the snippet is self-contained;
# the two methods correspond to the original implementation
class IGraphWalker:

    def __init__(self, graph, distance, maxWalks):
        # directed igraph Graph with the edge attribute 'description'
        self.graph = graph
        # maximum walk depth used to filter the breadth-first search results
        self.distance = distance
        # maximum number of walks sampled per entity
        self.maxWalks = maxWalks

    def predicateGeneration(self, pathList):
        # assign class graph to local graph variable
        graph = self.graph
        # extract the description (predicate) of each edge on the path, given its edge id
        predValues = np.array([e.attributes()['description'] for e in graph.es(pathList)])
        # extract the node names along the edge path and flatten the resulting array
        nodeSequence = np.array([graph.vs().select(e.tuple).get_attribute_values('name')
                                 for e in graph.es(pathList)]).flatten()
        # remove consecutive duplicate node names introduced by the flattening
        nodeSequence = np.array([key for key, _group in groupby(nodeSequence)])
        # interleave node names and predicate values into one single sequence
        pathSequence = np.insert(predValues, np.arange(len(nodeSequence)), nodeSequence)
        # convert the numpy array to a list and return the path sequence
        return pathSequence.tolist()
    def walkIteration(self, idNumber):
        # assign class graph to local graph variable
        graph = self.graph
        # extract the index of the graph node with the given identifier
        nodeIndex = graph.vs.find(idNumber).index
        # perform the breadth-first search algorithm from that node
        bfsList = graph.bfsiter(nodeIndex, 'out', advanced=True)
        # iterate over the breadth-first search iterator and keep only the nodes
        # whose distance to the root does not exceed the defined distance variable
        distanceList = [nodePath for nodePath in bfsList if nodePath[1] <= self.distance]
        # extract the vertex element from each (vertex, distance, parent) tuple
        vertexList = [vertexElement[0] for vertexElement in distanceList]
        # compute the shortest paths from the focused node to the extracted vertices, output as edge ids
        shortestPathList = graph.get_shortest_paths(v=nodeIndex, to=vertexList, output='epath')
        # convert the edge-id paths into walk sequences of node names and predicates
        walkSequence = list(map(self.predicateGeneration, shortestPathList))
        # sample at most maxWalks walks per entity, with a fixed seed for reproducibility
        maxWalks = min(self.maxWalks, len(walkSequence))
        random.seed(15)
        walkSequence = random.sample(walkSequence, maxWalks)
        return walkSequence
# Read a CSV file containing the knowledge graph triples
# (filePath: path to the CSV file, e.g. the FB15k train split).
graphData = pd.read_csv(filePath)
# transform the dataframe rows into a sequence of tuples
graphDataValues = graphData.to_records(index=False)
# initialize the knowledge graph: the first two columns are used as source and target
# vertices, the remaining column is stored as the edge attribute 'description'
walker = IGraphWalker(
    graph=Graph.TupleList(graphDataValues, directed=True, edge_attrs=['description']),
    distance=4,    # example walk depth
    maxWalks=500,  # example number of walks per entity
)
# initialize a multiprocessing pool with the number of available cpus
pool = mp.Pool(mp.cpu_count())
# extract the walk predicates for every entity of interest using the walkIteration method
# (entities: list of entity identifiers for which walks should be extracted)
walkPredicateList = list(tqdm(pool.imap_unordered(walker.walkIteration, entities, chunksize=8),
                              desc='Walk Extraction', total=len(entities)))
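The resulting walkPredicateList then contains one list of walks per entity. As a minimal sketch of how it could be consumed afterwards (the variable names are only illustrative), the pool can be closed and the walks flattened into a single corpus for a downstream embedding model:
# close the multiprocessing pool once all walks have been extracted
pool.close()
pool.join()

# flatten the per-entity walk lists into one corpus of token sequences,
# which can then be fed into a downstream embedding model
walkCorpus = [walk for entityWalks in walkPredicateList for walk in entityWalks]
print(f'Extracted {len(walkCorpus)} walks for {len(walkPredicateList)} entities')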