The unipop from unipop-graph

"Customizing & Extending Unipop" guide

Customizing Schemas

index by time
multiple rows
multiple buckets

Extending Controllers

Elasticsearch
- RoutedDocumentController
- TemplateController
- AggregationController
- GeoIntersectController
Jdbc
- StoredProcedureController

Use JEST library instead of ES native java client

Jest is a Java client for Elasticsearch that communicates with ES over REST. This (hopefully) means that we can use one client for all versions, thus killing the separation (and code duplication) between unipop-elastic and unipop-elastic2.

Theoretically it should also make unipop consume less memory.

How to deploy a gremlin server with Unipop

I am working on a project that requires graph queries on Elasticsearch, and I found Unipop, which fits our use case perfectly. Well done!
However, I found few clues about how to set up a gremlin server with unipop. The unipop-elastic2 is half developed. I found that I could not even package unipop-elastic successfully.
It would be very helpful if you could give any advice. I am really looking forward to the release of a stable version.

ReduceQuery - JDBC

Implement ReduceController in jdbc module

sub-task of #46

Usage

I wondered how I can actually use this library to run a simple gremlin graph traversal. ElasticGraphProvider looks promising, but that doesn't actually get shipped in your jar, as it is in test code...

OrderGlobalStep - Strategy, Step, JDBC & Elastic

Pass the order step to the controller, so that the ordering can be partly pushed down to the DBs and the results then merged locally.

Performance Metrics

Metrics - utilize Tinkerpop's TraversalMetrics

Bulk query Repeat/Union/Coalesce/Where Steps

ElasticGraphController

A new elastic controller utilizing the new Elasticsearch Graph API

Text Predicates

PREFIX, SUFFIX, REGEX, FUZZY, etc.
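
A minimal sketch of how such predicates could be modeled, assuming they are expressed as plain java.util.function.BiPredicate implementations (the same shape TinkerPop's built-in comparison predicates have). The enum name TextPredicate is hypothetical and not part of Unipop or TinkerPop; FUZZY is left out because its semantics depend on the backing DB.

```java
import java.util.function.BiPredicate;
import java.util.regex.Pattern;

// Hypothetical enum for illustration only; not an existing Unipop class.
// test(value, arg): "value" is the element's property, "arg" is the query argument.
enum TextPredicate implements BiPredicate<String, String> {
    PREFIX {
        @Override
        public boolean test(String value, String arg) {
            return value != null && value.startsWith(arg);
        }
    },
    SUFFIX {
        @Override
        public boolean test(String value, String arg) {
            return value != null && value.endsWith(arg);
        }
    },
    REGEX {
        @Override
        public boolean test(String value, String arg) {
            // The whole value must match the regular expression.
            return value != null && Pattern.matches(arg, value);
        }
    }
}
```

A Controller would normally translate these into native queries (e.g. a prefix query) rather than evaluate them locally; the local test method is only the fallback semantics.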

NestedController (post-refactor)

elastic2 NestedController - refactor

Connection terminated

If the connection to the databases terminates for any reason, Unipop won't reconnect and thus won't work.

Parallelism

Add parallel execution when issuing UniQueries.

Possible Parallelism points:

Controller Parallelism
Bulk Parallelism
Schema Parallelism
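
As a rough illustration of the first point (Controller parallelism), the sketch below issues the same query to several controllers concurrently and merges the results. It assumes a controller can be modeled as a Function from a query to a list of results; the class and method names are hypothetical and do not reflect Unipop's actual Controller API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

// Hypothetical helper for illustration; a "controller" is modeled as a plain Function.
final class ParallelQuerySketch {
    // Submits the query to every controller on the given pool, then
    // concatenates the results in controller order.
    static <Q, R> List<R> searchAll(List<Function<Q, List<R>>> controllers,
                                    Q query,
                                    ExecutorService pool) {
        List<CompletableFuture<List<R>>> futures = new ArrayList<>();
        for (Function<Q, List<R>> controller : controllers) {
            futures.add(CompletableFuture.supplyAsync(() -> controller.apply(query), pool));
        }
        List<R> merged = new ArrayList<>();
        for (CompletableFuture<List<R>> future : futures) {
            merged.addAll(future.join()); // blocks until this controller has answered
        }
        return merged;
    }
}
```

Bulk and Schema parallelism could follow the same pattern, with one task per bulk or per schema instead of one per controller.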

Dependency Management - Collections Library

Currently we use multiple collections libraries, including:

Apache Collections
Google Collections
Java Stream API

All of them lack some features that exist in other languages, such as Scala and the .NET family.

One solution is to use the Seq API from jOOQ/jOOL.

Much easier to use than StreamSupport and the Stream API
Has more features
Very readable
Encapsulates the required API from both the Apache/Google libraries and the Stream API.

some resources:

From personal experience, the jOOL API fits our needs and gives us greater flexibility.

Bulk mutations

Solution suggestion

Optimize the Mutation steps to issue mutation UniQuerys with a bulk of Elements, enabling the Controllers to issue bulk commands to the DB.

Questions

Should we enable bulk mutations directly from the UniGraph? That would entail adding to Tinkerpop's current Graph API.
Should changes be automatically committed to the DB after every query, or should we add a commit() method to UniGraph/Traversal?
Maybe using BulkLoaderVertexProgram, or a similar solution, would be a better choice?
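
To make the commit question more concrete, here is a minimal sketch of one possible behavior: mutations are buffered and flushed either when the buffer reaches a bulk size or on an explicit commit(). MutationBuffer and its bulkWriter callback are hypothetical names used only for illustration; they are not existing Unipop classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical buffer for illustration; bulkWriter stands in for a Controller's bulk command.
final class MutationBuffer<E> {
    private final int bulkSize;
    private final Consumer<List<E>> bulkWriter;
    private final List<E> pending = new ArrayList<>();

    MutationBuffer(int bulkSize, Consumer<List<E>> bulkWriter) {
        if (bulkSize < 1) {
            throw new IllegalArgumentException("bulkSize must be at least 1");
        }
        this.bulkSize = bulkSize;
        this.bulkWriter = bulkWriter;
    }

    // Queues one mutation and flushes automatically once a full bulk is pending.
    void add(E element) {
        pending.add(element);
        if (pending.size() >= bulkSize) {
            commit();
        }
    }

    // Flushes whatever is pending as a single bulk; does nothing when empty.
    void commit() {
        if (pending.isEmpty()) {
            return;
        }
        bulkWriter.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

With this shape, "commit after every query" is simply a bulk size of 1, while an explicit commit() lets callers decide when partial bulks are written.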

jdbc - edges as columns in vertex rows

OLAP GraphComputer

Theoretically we should be able to run distributed Unipop queries on Spark or something. Some of Unipop's data sources even have Hadoop integration (e.g. Elasticsearch RDD, Jdbc RDD, etc).
Utilizing a Unigraph's schema configuration, this feature should provide Unipop's users a transparent, zero-configuration way to execute distributed queries over their data.

Questions

Should we implement Tinkerpop's GraphComputer? What does that entail?
Can we utilize Tinkerpop's HadoopGraph implementation? If so, how?
Can we utilize Tinkerpop's SparkGraphComputer (is that the name)? If so, how?

elastic DocumentController - refactor

UniBulkStep - Incremental Bulk

Implement the bulk step so that the bulk size grows exponentially, to make streaming faster.
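
One way to read "grow exponentially" is sketched below: the first bulk is small so that the first results stream out quickly, and each following bulk doubles in size up to a cap. The class name GrowingBulkIterator and the doubling factor are assumptions for illustration, not the actual UniBulkStep implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical iterator for illustration: groups a source iterator into bulks of growing size.
final class GrowingBulkIterator<T> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final int maxBulk;
    private int nextBulk;

    GrowingBulkIterator(Iterator<T> source, int startBulk, int maxBulk) {
        if (startBulk < 1 || maxBulk < startBulk) {
            throw new IllegalArgumentException("require 1 <= startBulk <= maxBulk");
        }
        this.source = source;
        this.nextBulk = startBulk;
        this.maxBulk = maxBulk;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public List<T> next() {
        if (!source.hasNext()) {
            throw new NoSuchElementException();
        }
        List<T> bulk = new ArrayList<>();
        while (bulk.size() < nextBulk && source.hasNext()) {
            bulk.add(source.next());
        }
        // Double the size for the next bulk, capped at maxBulk (long math avoids overflow).
        nextBulk = (int) Math.min((long) maxBulk, 2L * nextBulk);
        return bulk;
    }
}
```

For ten elements with a start size of 1 and a cap of 4, this yields bulks of 1, 2, 4 and 3 elements.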

Json configuration-array support

Make sure that json fields that support arrays also support a single value without an array. E.g:
"foo": "bar" == "foo": ["bar"]
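
A minimal sketch of the normalization, assuming the parsed field value arrives either as a scalar or as a java.util.Collection (some JSON libraries use their own array type, which would need an extra branch). ConfigValues is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not an existing Unipop class.
final class ConfigValues {
    // Treats a single value as a one-element list, so "bar" and ["bar"] read the same.
    static List<Object> asList(Object value) {
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof Collection) {
            return new ArrayList<>((Collection<?>) value);
        }
        return Collections.singletonList(value);
    }
}
```

Schema code that always reads such fields through one helper like this never needs to care which form the user wrote.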

Exists Predicate

Add back ExistsP for g.V().has("fieldname")

Grouping TraversalStrategy

Enable Controllers to implement optimized "group by" and "group count" functions.
This needs to be re-implemented following the changes in #44.

elastic2 DocumentController refactor

ReduceQuery - Elastic

Implement ReduceController in elastic module

sub-task of #46

jdbc - StoredProcedureController

Dynamic UniFeatures

UniFeatures are part of the graph object.
This does not allow controllers to have different features.

A solution is needed that allows a dynamic result, based on what is being executed at the moment and on which controller is performing the action.

Suggestions are welcome

"inner JOIN" steps into single query

Most databases have an SQL-like "join" feature. Utilizing this across Traversal Steps can bring a big boost to performance. E.g. g.V().hasLabel('foo').out('bar') could be queried as select ... from foo join bar on ...

This is how I thought to implement this:

Analyze Traversal

Use TraversalStrategys to analyze possible joins:

Adjacent steps
SelectStep
Inner traversal
Aggregations
MatchStep
...

SearchQuery will gain a property, SearchVertexQuery[] getNextQueries(), that returns the possible joins found. A controller can use this recursively to get more join possibilities.

Validate Join

The Controller should use the Schemas to validate the 'legitimacy' of joining the query with any of the "next queries". Things to check:

Is the "next query" handled only by this Controller (i.e. is all the data on the same DB)?
Else, is a full copy available? (We can duplicate often-joined data in our different databases, and mark them as such in the schema configuration).
Skip VirtualVertexs and join with the next query.

Query

Each controller implements the join query in a different way:

Elasticsearch
- Graph Api
- Siren Join Plugin (https://siren.solutions/searchplugins/join/)
Jdbc
- JOIN

The Controller should return each result and its corresponding "future steps" results as a set, with each result in the set associated with its relevant stepId.
The issuing Step will create a Traverser from its relevant result, and add the rest as "Traverser side effects", enabling the future steps to access the results when they are called.

jdbc TableController - refactor

Elasticsearch Scroll API

Use elasticsearch query Scroll API to iterate many results in the StartStep.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

According to the docs, "Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration". Should we use Scroll in Unipop? If so, should we always use it? Maybe only in StartStep (assuming that most of the time it will iterate over a large number of results).

The scroll functionality is already implemented in QueryIterator, but it's currently unused.

Dependency convergence error on build

This sounds cool, so I wanted to give unipop-elastic a try.
Is there a jar hosted on some public repository?

When I build it manually I get the following dependency convergence error, which I couldn't even fix when adding a direct dependency from unipop-elastic on snakeyaml:1.15

Failed while enforcing releasability the error(s) are [
Dependency convergence error for org.yaml:snakeyaml:1.15 paths to dependency are:
+-unipop:unipop-elastic:0.1
  +-unipop:unipop-core:0.1
    +-org.apache.tinkerpop:gremlin-core:3.0.2-incubating
      +-org.yaml:snakeyaml:1.15
and
+-unipop:unipop-elastic:0.1
  +-unipop:unipop-core:0.1
    +-org.yaml:snakeyaml:1.15
and
+-unipop:unipop-elastic:0.1
  +-org.elasticsearch:elasticsearch:1.7.3
    +-org.yaml:snakeyaml:1.12
]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Unipop ............................................. SUCCESS [  0.627 s]
[INFO] Unipop :: Core ..................................... SUCCESS [  0.762 s]
[INFO] Unipop :: Elasticsearch Controllers ................ FAILURE [  0.709 s]
[INFO] Unipop :: JDBC Controllers ......................... SKIPPED
[INFO] Unipop :: Integration tests ........................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE

Dependency Management - JSON

At the moment, we depend on 4 different JSON libraries (including inner dependencies).

json-simple
json.org
Gson
Jackson

For the sake of fewer dependencies, we should consider removing most of them.

The top candidates are Jackson and Gson, as Jackson is heavily used by Tinkerpop, and Gson is heavily used by Jest & Hadoop. This requires further analysis of which API is easier to use and which performs better.

Tinkerpop
Jest
Hadoop

Either way, this requires some refactoring to remove the other libraries.

ElementSchema - internal vertex/edge

Add Integration Tests

Run JDBC and Elastic together as a whole, with tests that include both.

Part of the unipop-test module.

TemplateController (post-refactor)

Multi Field PropertySchema

Have properties that are composed of multiple fields in the origin data.

JDBC - smart table union filter.

In order to improve performance when querying SQL databases, we can filter out tables from the union by analysis of which tables conform to the PredicatesHolder and its 'must-have' fields, as any tables that do not have those fields can be cut out.
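
The filtering idea can be sketched as follows, assuming the schema knows each table's columns and that the 'must-have' fields have already been extracted from the PredicatesHolder. TableUnionFilter and its parameters are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not an existing Unipop class.
final class TableUnionFilter {
    // Keeps only the tables that contain every must-have field;
    // any other table cannot match the predicates and is cut from the union.
    static List<String> tablesToQuery(Map<String, Set<String>> columnsByTable,
                                      Set<String> mustHaveFields) {
        List<String> tables = new ArrayList<>();
        for (Map.Entry<String, Set<String>> table : columnsByTable.entrySet()) {
            if (table.getValue().containsAll(mustHaveFields)) {
                tables.add(table.getKey());
            }
        }
        return tables;
    }
}
```

An empty set of must-have fields keeps every table, so the union falls back to the unfiltered behavior.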

Schema configuration guide

Reducing TraversalStrategy

Enable Controllers to implement optimized reducing functions - count, sum, average, min, max, etc.
This needs to be re-implemented following the changes in #44.

Elastic Edges Example

I was digging through the code a bit, trying to find an example of what I would do if my documents in Elasticsearch represent both nodes and edges. For instance, I have documents that look like the following. How would I represent that using Unipop? User A and User B are nodes, and a case could be made for the message also being a node, with the edge containing the timestamp.

{
"userA": "000000001",
"timestamp": "2015-01-01T05:14:22",
"message": "Writing on your wall",
"userB": "000000002"
}

From what I can tell, my guess would be that I use an ElasticEdgeController, but it's not entirely clear how to actually use that to build my graph and run gremlin queries across it. Can you provide an example of how to do that?

Major Refactor

Adding this in retrospect for documentation's sake:

A much needed refactor to split Unipop's code to different components:

Structure - implementation of default Tinkerpop model classes that issue UniQuerys to Controllers
Process - implementation of Tinkerpop Strategies that issue UniQuerys to Controllers, and add Unipop-specific optimizations.
UniQuery - a set of APIs Controllers can implement.
Controller - a component responsible for executing the different UniQuerys.
Schema - a set of helper classes meant to ease schema management for Controllers, and standardize schema mappings.

Performance Benchmarks

https://github.com/ellitron/ldbc-snb-impls/tree/master/snb-interactive-neo4j

SLF4J - Practical Logging Implementation

While #19 considers Metrics, the profile step and traversal metrics, this issue considers only the implementation of simple logging with SLF4J.

NestedVertexSchema

Virtual Controller

TBD

SourceProvider should receive standard `Configuration`

Currently SourceProviders, ElementSchemas, and PropertySchemas receive a JSONObject for initialization. The ControllerManager should provide them with a standard org.apache.commons.configuration.Configuration so that other sources of configuration can be provided.

Cardinality Strategy

Most database optimizers use statistics-based cardinality estimates to determine the optimal order in which to run a query's steps. Should we do something similar?

We could implement a TraversalStrategy that rearranges the Traversal's steps according to their estimated cardinality. Each Controller would provide the necessary information by utilizing its database's capabilities, e.g. Oracle Statistics.

Tinkerpop has something similar in its MatchStep, except it does a run-time statistics calculation for every traversal.

Should we use MatchStep's MatchAlgorithm to implement this feature?
Why is this only implemented for MatchStep? Can't the same logic be applied for all steps in the traversal?

Getting started guide

Optimize property fetching

Current querying behavior

Vertex - fetch all properties + any "inner" edges and their properties.
Edge - fetch all properties + both vertices and their properties.
- If a vertex's schema is of type 'ref', its properties will only be fetched when it passes through a UniGraphVertexPropertiesSideEffectStep, a step that comes before any step that uses properties, and issues a DeferredVertexQuery. This ensures that vertices are only queried if and when they are needed (i.e. lazy loading).

Problems

1. When an Element is queried, all its properties are fetched, whether or not they are needed by this traversal.
2. When a Vertex is queried, all its "inner" edges are fetched, whether or not they are needed by this traversal.

This issue tries to solve problem 1. We should probably create another ticket for solving problem 2 in the future.

Solution suggestion

SearchQuery/SearchVertexQuery/DeferredVertexQuery should pass a list of property keys needed from the queried element. UniGraphPropertiesStepStrategy should provide the property lists to the querying steps by analyzing the traversal. Scenarios:

No step in the traversal needs any property - empty list.
Step(s) in the traversal need specific property(s) - property list.
Step(s) in the traversal iterate over all properties - null list.
Unknown (the strategy couldn't identify which properties are needed) - null list. ???

Next, when a Controller receives these queries it should only fetch the relevant properties, or not issue a query at all when possible.
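
As an illustration of how a JDBC-style Controller might act on the three cases (null list, empty list, property list), the sketch below turns a property key list into a column list for a SELECT. PropertySelection and selectColumns are hypothetical names, and the sketch assumes property keys map one-to-one to column names taken from the schema configuration.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not an existing Unipop class.
final class PropertySelection {
    // null  -> all properties are (or may be) needed, so select everything
    // empty -> no property is needed, so only the id column is selected
    // keys  -> the id column plus exactly the requested properties
    static String selectColumns(String idColumn, List<String> propertyKeys) {
        if (propertyKeys == null) {
            return "*";
        }
        Set<String> columns = new LinkedHashSet<>(); // keeps order, drops duplicates
        columns.add(idColumn);
        columns.addAll(propertyKeys);
        return String.join(", ", columns);
    }
}
```

In the empty-list case a Controller that already knows the element's id (for example from an edge row) could skip the query entirely, which is the "not issue a query at all" option.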

JDBC - MultiRow schema

Implement a schema that is able to consume multiple rows and treat them as a single vertex, with each row representing an edge to that vertex.
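
A minimal sketch of the grouping step, assuming each row has already been read into a Map of column name to value, and that one column holds the vertex id while another holds the id of the vertex on the other side of the edge. MultiRowSketch and the column parameters are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not an existing Unipop schema class.
final class MultiRowSketch {
    // Groups rows by vertex id: each distinct id becomes one vertex,
    // and every row with that id contributes one edge target.
    static Map<Object, List<Object>> groupRows(List<Map<String, Object>> rows,
                                               String vertexIdColumn,
                                               String edgeTargetColumn) {
        Map<Object, List<Object>> edgesByVertex = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            edgesByVertex
                .computeIfAbsent(row.get(vertexIdColumn), id -> new ArrayList<>())
                .add(row.get(edgeTargetColumn));
        }
        return edgesByVertex;
    }
}
```

A real schema would also have to decide which of the repeated rows supplies the vertex's own properties; the sketch only covers the row-to-edge grouping.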
