Gradoop: Distributed Graph Analytics on Hadoop
Gradoop is an open source (GPLv3) research framework for scalable graph analytics. It offers a graph data model which extends the widespread property graph model by the concept of logical subgraphs and further provides operators that can be applied on single graphs and collections of graphs. The combination of these operators allows the flexible, declarative definition of graph analytical workflows.
// load social network from hdfs
LogicalGraph db = EPGMDatabase.fromJsonFile("hdfs://...").getDatabaseGraph();
// detect communities
GraphCollection communities = db.callForCollection(new LabelPropagation(...));
// filter large communities
GraphCollection communities = communities.select((LogicalGraph g) -> g.vertexCount() > 100);
// combine them to a single graph
LogicalGraph relevantSubgraph = communities.reduce((LogicalGraph g1, LogicalGraph g2) -> g1.combine(g2));
// summarize the network based on the city users live in
LogicalGraph summarizedGraph = relevantSubgraph.summarize("city");
// write back to HDFS
summarizedGraph.writeAsJson("hdfs://...");
Gradoop is work in progress which means APIs may change. It is currently used as a proof of concept implementation and far from production ready.
Data Model
In the extended property graph model (EPGM), a database consists of multiple property graphs which are called logical graphs. These graphs are application-specific subsets from shared sets of vertices and edges, i.e., may have common vertices and edges. Additionally, not only vertices and edges but also logical graphs have a type label and can have different properties.
Data Model elements (logical graphs, vertices and edges) have a unique identifier inside their domain, a single label (e.g., User) and a number of key-value properties (e.g, name = Alice). There is no schema involved, meaning each element can have arbitrary number of properties even if they have the same label.
Graph operators
The EPGM provides operators for both single graphs as well as collections of graphs; operators may also return single graphs or graph collections.
Our operator implementations are based on Apache Flink. The following table contains an overview (GC = GraphCollection, G = Graph).
Operator | In | Out | Output description | Impl |
---|---|---|---|---|
Selection | GC | GC | Graphs that fulfil a predicate function | Yes |
Distinct | GC | GC | No duplicate graphs | No |
SortBy | GC | GC | Graphs sorted by graph property | No |
Top | GC | GC | The first n elements of the input collection | No |
Union | GC x GC | GC | All graphs from both collections | Yes |
Intersection | GC x GC | GC | Only graphs that exist in both collections | Yes |
Difference | GC x GC | GC | Only graphs that exist in one collection | Yes |
Combination | G x G | G | Vertices and edges from both graphs | Yes |
Overlap | G x G | G | Vertices and edges that exist in both graphs | Yes |
Exclusion | G x G | G | Vertices and edges that exist in only one graph | Yes |
Pattern Match | G | GC | Graphs that match a given graph pattern | No |
Aggregation | G | G | Graph with result of an aggregate function | Yes |
Projection | G | G | Graph with projected vertex and edge sets | Yes |
Summarization | G | G | Structural condense of the input graph | Yes |
Apply | GC | GC | Applies operator to each graph in collection | No |
Reduce | GC | G | Reduces collection to graph using operator | No |
Setup
Build gradoop from source
-
Clone Gradoop into your local file system
-
Build and execute tests
cd gradoop
mvn clean install
Load data into gradoop
JSON
Gradoop supports JSON as input format for vertices, edges and graphs. Besides the
unique id, each JSON document stores the properties of the specific entity in an
embedded document data
. Meta information, like the obligatory label, is stored
in a second embedded document meta
. The meta document of vertices and edges may
contain a mapping to the logical graphs they are contained in.
Two persons (Alice and Bob) that have three properties each and are contained in
two logical graphs ("graphs":[0,2]
).
// content of hdfs:///input/nodes.json
{"id":0,"data":{"gender":"f","city":"Leipzig","name":"Alice"},"meta":{"label":"Person","graphs":[0,2]}}
{"id":1,"data":{"gender":"m","city":"Leipzig","name":"Bob"},"meta":{"label":"Person","graphs":[0,2]}}
Edges are represented in a similar way. Alice and Bob are connected by an edge
(knows). Edges may have properties (e.g., "since":2014
) and may also be contained
in logical graphs. Additionally, edge JSON documents store the obligatory source and
target vertex identifier.
// content of hdfs:///input/edges.json
{"id":0,"source":0,"target":1,"data":{"since":2014},"meta":{"label":"knows","graphs":[0,2]}}
Graphs may also have properties and must have a label (e.g., Community).
// content of hdfs:///input/graphs.json
{"id":0,"data":{"interest":"Databases","vertexCount":3},"meta":{"label":"Community"}}
{"id":1,"data":{"interest":"Hadoop","vertexCount":3},"meta":{"label":"Community"}}
{"id":2,"data":{"interest":"Graphs","vertexCount":4},"meta":{"label":"Community"}}
HBase
Gradoop can read and write an EPGM database from HBase using an EPGM store. The current implementation is work in progress, at the moment one can read or write the whole database. We are working on reading only data that is needed for the analysis (e.g., a collection of specific communities).
The following example shows how to create an EPGM Store and how to write an EPGM database to it.
EPGMDataBase epgmDB = EPGMDatabase.fromJsonFile(...);
// do some fancy analysis ...
EPGMStore epgmStore = HBaseEPGMStore.createOrOpenEPGMStore(vertexTable, edgeTable, graphHeadTable);
epgmDB.writeToHBase(epgmStore);
epgmStore.close();
You can now read the database from HBase.
EPGMStore epgmStore = HBaseEPGMStore.createOrOpenEPGMStore(vertexTable, edgeTable, graphHeadTable);
EPGMDatabase epgmDB = EPGMDatabase.fromHBase(epgmStore);
// do some fancy analysis ...
epgmStore.close()
Example: Extract schema graph from possibly large-scale graph
In this example, we use the summarize
operator to create a condensed version
of our input graph. By summarizing on vertex and edge labels, we compute the schema
of our graph. Each vertex in the resulting graph represents all vertices with the
same label (e.g., Person or Group), each edge represents all edges with the same
label that connect vertices from the same vertex groups.
String vertexInputPath = "hdfs:///input/nodes.json";
String edgeInputPath = "hdfs:///input/edges.json";
String graphInputPath = "hdfs:///input/graphs.json";
EPGMDatabase db = EPGMDatabase.fromJsonFile(vertexInputPath, edgeInputPath, graphInputPath, env);
LogicalGraph schemaGraph = db.getDatabaseGraph().summarizeOnVertexAndEdgeLabels();
schemaGraph.writeAsJson(vertexOutputPath, edgeOutputPath, graphOutputPath);
Cluster deployment
If you want to execute Gradoop on a cluster, you need Hadoop 2.5.1 and Flink 0.9.0 for Hadoop 2.4.1 installed and running.
- start a flink yarn-session (e.g. 5 Task managers with 4GB RAM and 4 processing slots each)
./bin/yarn-session.sh -n 5 -tm 4096 -s 4
- run your program (e.g. the included Summarization example)
./bin/flink run -c org.gradoop.examples.SummarizationExample ~/gradoop-flink-0.0.2-jar-with-dependencies.jar --vertex-input-path hdfs:///nodes.json --edge-input-path hdfs://edges.json --use-vertex-labels --use-edge-labels
Gradoop modules
gradoop-core
The main contents of that module are the Extended Property Graph Data Model, the corresponding graph repository and its reference implementation for Apache HBase (wip).
Furthermore, the module contains the Bulk Load / Write drivers based on MapReduce and file readers / writers for user defined file and graph formats.
gradoop-flink
This module contains a reference implementation of the EPGM including the data model and its operators. The concepts of the EPGM are mapped to the property graph data model offered by Flink Gelly and additional Flink Datasets. The gradoop operator implementations leverage existing Flink and Gelly operators.
gradoop-examples
Contains example pipelines showing use cases for Gradoop.
- Graph summarization (build structural aggregates of property graphs)
gradoop-checkstyle
Used to maintain the codestyle for the whole project.
Developer notes
Code style for IntelliJ IDEA
-
copy codestyle from dev-support to your local IDEA config folder
cp dev-support/gradoop-idea-codestyle.xml ~/.IntelliJIdea14/config/codeStyles
-
restart IDEA
-
File -> Settings -> Code Style -> Java -> Scheme -> "Gradoop"
Troubleshooting
-
Exception while running test org.apache.giraph.io.hbase .TestHBaseRootMarkerVertexFormat (incorrect permissions, see http://stackoverflow.com/questions/17625938/hbase-minidfscluster-java-fails -in-certain-environments for details)
umask 022
-
Ubuntu + Giraph hostname problems. To avoid hostname issues comment the following line in /etc/hosts
127.0.1.1 <your-host-name>
-
And add your hostname to the localhost entry
127.0.0.1 localhost <your-host-name>
Version History
- 0.0.1 first prototype using Hadoop MapReduce and Apache Giraph for operator processing
- 0.0.2 support for HBase as distributed graph storage
- 0.0.3 Apache Flink replaces MapReduce and Giraph as operator implementation layer and distributed execution engine