Code Monkey home page Code Monkey logo

gradoop's Introduction

Build Status

Gradoop: Distributed Graph Analytics on Hadoop

Gradoop is an open source (GPLv3) research framework for scalable graph analytics. It offers a graph data model which extends the widespread property graph model by the concept of logical subgraphs and further provides operators that can be applied on single graphs and collections of graphs. The combination of these operators allows the flexible, declarative definition of graph analytical workflows.

// load social network from hdfs
LogicalGraph db = EPGMDatabase.fromJsonFile("hdfs://...").getDatabaseGraph();
// detect communities
GraphCollection communities = db.callForCollection(new LabelPropagation(...));
// filter large communities
GraphCollection communities = communities.select((LogicalGraph g) -> g.vertexCount() > 100);
// combine them to a single graph
LogicalGraph relevantSubgraph = communities.reduce((LogicalGraph g1, LogicalGraph g2) -> g1.combine(g2));
// summarize the network based on the city users live in
LogicalGraph summarizedGraph = relevantSubgraph.summarize("city");
// write back to HDFS
summarizedGraph.writeAsJson("hdfs://...");

Gradoop is work in progress which means APIs may change. It is currently used as a proof of concept implementation and far from production ready.

Data Model

In the extended property graph model (EPGM), a database consists of multiple property graphs which are called logical graphs. These graphs are application-specific subsets from shared sets of vertices and edges, i.e., may have common vertices and edges. Additionally, not only vertices and edges but also logical graphs have a type label and can have different properties.

Data Model elements (logical graphs, vertices and edges) have a unique identifier inside their domain, a single label (e.g., User) and a number of key-value properties (e.g, name = Alice). There is no schema involved, meaning each element can have arbitrary number of properties even if they have the same label.

Graph operators

The EPGM provides operators for both single graphs as well as collections of graphs; operators may also return single graphs or graph collections.

Our operator implementations are based on Apache Flink. The following table contains an overview (GC = GraphCollection, G = Graph).

Operator In Out Output description Impl
Selection GC GC Graphs that fulfil a predicate function Yes
Distinct GC GC No duplicate graphs No
SortBy GC GC Graphs sorted by graph property No
Top GC GC The first n elements of the input collection No
Union GC x GC GC All graphs from both collections Yes
Intersection GC x GC GC Only graphs that exist in both collections Yes
Difference GC x GC GC Only graphs that exist in one collection Yes
Combination G x G G Vertices and edges from both graphs Yes
Overlap G x G G Vertices and edges that exist in both graphs Yes
Exclusion G x G G Vertices and edges that exist in only one graph Yes
Pattern Match G GC Graphs that match a given graph pattern No
Aggregation G G Graph with result of an aggregate function Yes
Projection G G Graph with projected vertex and edge sets Yes
Summarization G G Structural condense of the input graph Yes
Apply GC GC Applies operator to each graph in collection No
Reduce GC G Reduces collection to graph using operator No

Setup

Build gradoop from source

Load data into gradoop

JSON

Gradoop supports JSON as input format for vertices, edges and graphs. Besides the unique id, each JSON document stores the properties of the specific entity in an embedded document data. Meta information, like the obligatory label, is stored in a second embedded document meta. The meta document of vertices and edges may contain a mapping to the logical graphs they are contained in.

Two persons (Alice and Bob) that have three properties each and are contained in two logical graphs ("graphs":[0,2]).

// content of hdfs:///input/nodes.json
{"id":0,"data":{"gender":"f","city":"Leipzig","name":"Alice"},"meta":{"label":"Person","graphs":[0,2]}}
{"id":1,"data":{"gender":"m","city":"Leipzig","name":"Bob"},"meta":{"label":"Person","graphs":[0,2]}}

Edges are represented in a similar way. Alice and Bob are connected by an edge (knows). Edges may have properties (e.g., "since":2014) and may also be contained in logical graphs. Additionally, edge JSON documents store the obligatory source and target vertex identifier.

// content of hdfs:///input/edges.json
{"id":0,"source":0,"target":1,"data":{"since":2014},"meta":{"label":"knows","graphs":[0,2]}}

Graphs may also have properties and must have a label (e.g., Community).

// content of hdfs:///input/graphs.json
{"id":0,"data":{"interest":"Databases","vertexCount":3},"meta":{"label":"Community"}}
{"id":1,"data":{"interest":"Hadoop","vertexCount":3},"meta":{"label":"Community"}}
{"id":2,"data":{"interest":"Graphs","vertexCount":4},"meta":{"label":"Community"}}

HBase

Gradoop can read and write an EPGM database from HBase using an EPGM store. The current implementation is work in progress, at the moment one can read or write the whole database. We are working on reading only data that is needed for the analysis (e.g., a collection of specific communities).

The following example shows how to create an EPGM Store and how to write an EPGM database to it.

EPGMDataBase epgmDB = EPGMDatabase.fromJsonFile(...);

// do some fancy analysis ...

EPGMStore epgmStore = HBaseEPGMStore.createOrOpenEPGMStore(vertexTable, edgeTable, graphHeadTable);
epgmDB.writeToHBase(epgmStore);
epgmStore.close();

You can now read the database from HBase.

EPGMStore epgmStore = HBaseEPGMStore.createOrOpenEPGMStore(vertexTable, edgeTable, graphHeadTable);
EPGMDatabase epgmDB = EPGMDatabase.fromHBase(epgmStore);

// do some fancy analysis ...

epgmStore.close()

Example: Extract schema graph from possibly large-scale graph

In this example, we use the summarize operator to create a condensed version of our input graph. By summarizing on vertex and edge labels, we compute the schema of our graph. Each vertex in the resulting graph represents all vertices with the same label (e.g., Person or Group), each edge represents all edges with the same label that connect vertices from the same vertex groups.

String vertexInputPath = "hdfs:///input/nodes.json";
String edgeInputPath = "hdfs:///input/edges.json";
String graphInputPath = "hdfs:///input/graphs.json";
EPGMDatabase db = EPGMDatabase.fromJsonFile(vertexInputPath, edgeInputPath, graphInputPath, env);
LogicalGraph schemaGraph = db.getDatabaseGraph().summarizeOnVertexAndEdgeLabels();
schemaGraph.writeAsJson(vertexOutputPath, edgeOutputPath, graphOutputPath);

Cluster deployment

If you want to execute Gradoop on a cluster, you need Hadoop 2.5.1 and Flink 0.9.0 for Hadoop 2.4.1 installed and running.

  • start a flink yarn-session (e.g. 5 Task managers with 4GB RAM and 4 processing slots each)

./bin/yarn-session.sh -n 5 -tm 4096 -s 4

  • run your program (e.g. the included Summarization example)

./bin/flink run -c org.gradoop.examples.SummarizationExample ~/gradoop-flink-0.0.2-jar-with-dependencies.jar --vertex-input-path hdfs:///nodes.json --edge-input-path hdfs://edges.json --use-vertex-labels --use-edge-labels

Gradoop modules

gradoop-core

The main contents of that module are the Extended Property Graph Data Model, the corresponding graph repository and its reference implementation for Apache HBase (wip).

Furthermore, the module contains the Bulk Load / Write drivers based on MapReduce and file readers / writers for user defined file and graph formats.

gradoop-flink

This module contains a reference implementation of the EPGM including the data model and its operators. The concepts of the EPGM are mapped to the property graph data model offered by Flink Gelly and additional Flink Datasets. The gradoop operator implementations leverage existing Flink and Gelly operators.

gradoop-examples

Contains example pipelines showing use cases for Gradoop.

  • Graph summarization (build structural aggregates of property graphs)

gradoop-checkstyle

Used to maintain the codestyle for the whole project.

Developer notes

Code style for IntelliJ IDEA

  • copy codestyle from dev-support to your local IDEA config folder

    cp dev-support/gradoop-idea-codestyle.xml ~/.IntelliJIdea14/config/codeStyles

  • restart IDEA

  • File -> Settings -> Code Style -> Java -> Scheme -> "Gradoop"

Troubleshooting

  • Exception while running test org.apache.giraph.io.hbase .TestHBaseRootMarkerVertexFormat (incorrect permissions, see http://stackoverflow.com/questions/17625938/hbase-minidfscluster-java-fails -in-certain-environments for details)

    umask 022

  • Ubuntu + Giraph hostname problems. To avoid hostname issues comment the following line in /etc/hosts

    127.0.1.1 <your-host-name>

  • And add your hostname to the localhost entry

    127.0.0.1 localhost <your-host-name>

Version History

  • 0.0.1 first prototype using Hadoop MapReduce and Apache Giraph for operator processing
  • 0.0.2 support for HBase as distributed graph storage
  • 0.0.3 Apache Flink replaces MapReduce and Giraph as operator implementation layer and distributed execution engine

gradoop's People

Contributors

freeclimbing avatar p3et avatar s1ck avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.