Code Monkey home page Code Monkey logo

gll4graph's Introduction

example workflow

GLL4Graph

This is the implementation of the GLL-based context-free path querying (CFPQ) algorithm. It is based on a high-performance GLL parsing algorithm implemented in Iguana project. Then, it was modificated to support graph-structured input data. Proposed solution solves both reachability and all paths problems for all pairs and multiple sources cases.

Performance

The proposed solution has been evaluated on several real-world graphs for both the all pairs and the multiple sources scenarios. The evaluation shows that the proposed solution is more than 45 times faster than the previous solution for Neo4j and is comparable, in some scenarios and cases, with the linear algebra based solution.

Machine configuration: PC with Ubuntu 20.04, Intel Core i7-6700 3.40GHz CPU, DDR4 64Gb RAM.

Enviroment configuration:

  • Java HotSpot(TM) 64-Bit server virtual machine (build 15.0.2+7-27, mixed mode, sharing).
  • JVM heap configuration: 55Gb both xms and xmx.
  • Neo4j 4.0.3 is used. Almost all configurations are default except two:
    • total off-heap transaction memory (tx_state_max_off_heap_memory parameter) is set to 24Gb
    • pagecache_warmup_enabled is set to true.

Graphs

The graph data is selected from CFPQ_Data dataset. There are two types of graph:

  • Graphs related to RDF analysis problems
  • Graphs related to static code analysis problems

A detailed description of the graphs is listed bellow.

RDF analysis

Graph name |V| |E| #subClassOf #type #broaderTransitive
Core 1 323 2 752 178 0 0
Pathways 6 238 12 363 3 117 3 118 0
Go hierarchy 45 007 490 109 490 109 0 0
Enzyme 48 815 86 543 8 163 14 989 8 156
Eclass_514en 239 111 360 248 90 962 72 517 0
Geospecies 450 609 2 201 532 0 89 065 20 867
Go 582 929 1 437 437 94 514 226 481 0
Taxonomy 5 728 398 14 922 125 2 112 637 2 508 635 0

Static code analysis

Graph name |V| |E| #a #d
Apache 1 721 418 1 510 411 362 799 1 147 612
Block 3 423 234 2 951 393 669 238 2 282 155
Fs 4 177 416 3 609 373 824 430 2 784 943
Ipc 3 401 022 2 931 498 664 151 2 267 347
Lib 3 401 355 2 931 880 664 311 2 267 569
Mm 2 538 243 2 191 079 498 918 1 692 161
Net 4 039 470 3 500 141 807 162 2 692 979
Postgre 5 203 419 4 678 543 1 209 597 3 468 946
Security 3 479 982 3 003 326 683 339 2 319 987
Sound 3 528 861 3 049 732 697 159 2 352 573
Init 2 446 224 2 112 809 481 994 1 630 815
Arch 3 448 422 2 970 242 6 712 95 2 298 947
Crypto 3 464 970 2 988 387 678 408 2 309 979
Drivers 4 273 803 3 707 769 858 568 2 849 201
Kernel 11 254 434 9 484 213 1 981 258 7 502 955

Grammars

All queries used in evaluation are variants of same-generation query. The inverse of an x relation and the respective edge is denoted as x_r.


Grammars used for RDF graphs:

G1

S -> subClassOf_r S subClassOf | subClassOf_r subClassOf 
     | type_r S type | type_r type

G2

S -> subClassOf_r S subClassOf | subClassOf

Geo

S -> broaderTransitive S broaderTransitive_r
     | broaderTransitive broaderTransitive_r 

Grammar used for static code analysis graphs:

PointsTo

M -> d_r V d
V -> (M? a_r)* M? (a M?)* 

Results

The results of the all pairs reachability queries evaluation on graphs related to RDF analysis are listed below.

The sign ’–’ in cells means that the respective query is not applicable to the graph, so time is not measured.

Graph name G1 G2 Geo
time (sec) #answer time (sec) #answer time (sec) #answer
Core 0,02 204 0,01 214
Pathways 0,07 884 0,04 3117
Go hierarchy 3,68 588 976 5,42 738 937
Enzyme 0,22 396 0,17 8163 5,7 14 267 542
Eclass 1,5 90 994 0,97 96 163
Geospecies 2,89 85 2,65 0 145,8 226 669 749
Go 5,56 640 316 4,24 659 501
Taxonomy 45,47 151 706 36,06 2 112 637

The evaluation results for single source CFPQ for graphs related to RDF analysis and G1, G2, Geo grammars respectively in reachability and all paths scenarious:

time time time

The results for graphs related to static code analysis are compared to results of Azimov’s CFPQ algorithm based on matrix operations. The implementation from CFPQ_PyAlgo was taken as the implementation of the matrix CFPQ algorithm. This library contains the implementation for both scenarios, all pairs reachability and single source reachability. To perform matrix operations pygraphblas is used. Pygraphblas is a python wrapper over the SuiteSparse library, which based on the GraphBLAS framework.

The results of the all pairs reachability queries evaluation on graphs related to static code analysis are listed below.

The sign ’–’ in cells means that the respective query and graph require a considerable amount of memory during algorithm execution that leads to unpredictable time to get the result.

Graph name PointsTo
Neo4j time (sec) GraphBLAS time (sec) #answer
Apache 536,7 92 806 768
Block 113,01 123,88 5 351 409
Fs 167,73 105,72 9 646 475
Ipc 109,43 79,52 5 249 389
Lib 111,09 121,79 5 276 303
Mm 77,92 84,15 3 990 305
Net 160,64 206,29 8 833 403
Postgre 969,88 90 661 446
Security 115,75 181,7 5 593 387
Sound 120,14 133,64 6 085 269
Init 87,25 45,84 3 783 769
Arch 130,77 119,92 5 339 563
Crypto 128,8 122,09 5 428 237
Drivers 371,18 279,39 18 825 025
Kernel 614,047 378,05 16 747 731

The evaluation results for single source CFPQ for graphs related to static code analysis and pointsTo grammar in reachability and all paths scenarious:

time

Download and build

The project is build with Maven.

git clone https://github.com/JetBrains-Research/GLL4Graph
cd GLL4Graph
mvn compile

Usage

To replicate experiments:

  • make the directory to save results
mkdir results
  • run the following command with arguments
mvn exec:java -Dexec.mainClass="benchmark.Neo4jBenchmark" -Dexec.args="arg1 arg2 arg3 arg4 arg5 agr6 agr7 arg8"
Argument Description
arg1 Relationship types of the edges in graph to use. Currently this argument can take one of three possible values:
- st is used for G1 and G2 grammars
- bt is used for Geo grammar
- ad is used for PointsTo grammar
arg2 The number of vertices in graph
arg3 The number of warm up iterations
arg4 The total number of iterations
arg5 Path to dataset
arg6 Path to grammar
arg7 The name of a file with result
arg8 Grammar name (g1/g2/geo/pointsTo)

Example

Here is an example which can be run right after project is downloaded and results directory is created without any additional preparation.

To run experiments on Core graph on G1 grammar use the following command:

mvn exec:java -Dexec.mainClass="benchmark.Neo4jBenchmark" -Dexec.args="st 1323 2 5 <ABS_PATH_TO_PROJECT>/data/core/ test/resources/grammars/graph/g1/grammar.json core g1"

Data

To get more graph data examples use Python script:

graph_loader.py --graph {graph_name} --relationships {comma_separated_relationships_list}

For example:

graph_loader.py --graph core --relationships subClassOf,type

License

This project is licensed under OpenBSD License. License text can be found in the license file.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.