unipop-graph / unipop
Data Integration Graph
License: Apache License 2.0
JEST is an Elasticsearch Java client that communicates with ES through REST. This (hopefully) means that we can use one client for all versions, thus killing the separation (and code duplication) between unipop-elastic and unipop-elastic2.
Theoretically it should also make unipop consume less memory.
I am working on a project that requires graph queries on ElasticSearch, and I found Unipop, which fits our use case perfectly. Well done!
However, I found few clues about how to set up a gremlin server with unipop. The unipop-elastic2 is half developed. I found that I could not even package unipop-elastic successfully.
It would be very helpful if you could give any advice. I am really looking forward to the release of a stable version.
Implement ReduceController in jdbc module
sub-task of #46
I wondered how I can actually use this library to run a simple gremlin graph traversal. ElasticGraphProvider looks promising, but that doesn't actually get shipped in your jar, as it is in test code...
Pass the order step to the controller, so that the ordering can be partly delegated to the DBs and then merged locally.
Metrics - utilize Tinkerpop's TraversalMetrics
A new elastic controller utilizing the new Elasticsearch Graph API
PREFIX, SUFFIX, REGEX, FUZZY, etc.
If the connection to the databases terminates for any reason, Unipop won't reconnect and thus won't work.
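One possible direction for the reconnection problem is to wrap each database call in a retry with exponential backoff. The sketch below is only an illustration under stated assumptions: `RetryingCall` and `withRetry` are hypothetical names that do not exist in Unipop, and a real fix would also have to re-create the underlying client or connection inside the action.

```java
import java.util.concurrent.Callable;

public class RetryingCall {
    /**
     * Hypothetical helper (not part of Unipop): runs the action and, on failure,
     * retries it with a delay that doubles after every failed attempt.
     */
    public static <T> T withRetry(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // The action is expected to (re)open the connection if needed.
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential backoff
                }
            }
        }
        // All attempts failed: surface the last error to the caller.
        throw last;
    }
}
```

A controller could then route its queries through something like `RetryingCall.withRetry(() -> runQuery(), 5, 100)`, where `runQuery` stands in for whatever call currently fails after a disconnect.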
Add parallel execution when issuing UniQueries.
Possible Parallelism points:
Currently we use multiple collections libraries, including:
All of them lack some features that exist in other languages, such as Scala and the .NET family.
A solution to this is using the Seq API from JOOQ/JOOL.
StreamSupport and the stream API
Some resources:
From personal experience, the JOOL API fits our needs and gives us greater flexibility.
Optimize the Mutation steps to issue mutation UniQueries with a bulk of Elements, enabling the Controllers to issue bulk commands to the DB.
Graph API.
commit() method to UniGraph/Traversal?
Theoretically we should be able to run distributed Unipop queries on Spark or something. Some of Unipop's data sources even have Hadoop integration (e.g. Elasticsearch RDD, Jdbc RDD, etc).
Utilizing a Unigraph's schema configuration, this feature should provide Unipop's users a transparent, zero-configuration way to execute distributed queries over their data.
GraphComputer? What does that entail?
Implement bulk step to grow exponentially to make streaming faster.
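The growing bulk size could be sketched as an iterator that hands out batches whose size doubles on every call, up to a cap, so the first results arrive quickly while later requests are amortized over larger bulks. This is a minimal sketch, not Unipop's bulk step: `GrowingBulkIterator`, `startBulk`, and `maxBulk` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch: groups a source iterator into bulks of doubling size. */
public class GrowingBulkIterator<T> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final int maxBulk;
    private int nextBulk;

    public GrowingBulkIterator(Iterator<T> source, int startBulk, int maxBulk) {
        if (startBulk < 1 || maxBulk < startBulk) {
            throw new IllegalArgumentException("require 1 <= startBulk <= maxBulk");
        }
        this.source = source;
        this.nextBulk = startBulk;
        this.maxBulk = maxBulk;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public List<T> next() {
        if (!source.hasNext()) {
            throw new NoSuchElementException();
        }
        List<T> bulk = new ArrayList<>(nextBulk);
        while (bulk.size() < nextBulk && source.hasNext()) {
            bulk.add(source.next());
        }
        // Double the size for the next bulk, but never exceed the cap.
        nextBulk = (int) Math.min((long) nextBulk * 2, maxBulk);
        return bulk;
    }
}
```

With ten elements, `startBulk = 1` and `maxBulk = 4`, the bulks have sizes 1, 2, 4 and 3.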
Make sure that json fields that support arrays also support a single value without an array. E.g:
"foo": "bar" == "foo": ["bar"]
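One way to get this equivalence is to normalize every field value to a list before comparing or matching it. The sketch below assumes the JSON has already been parsed into plain Java objects (scalars or collections); `FieldValues.asList` is a hypothetical helper, and treating `null` as an empty list is an assumption rather than something the issue specifies.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: treats a single value and a one-element array alike. */
public class FieldValues {
    public static List<Object> asList(Object value) {
        if (value == null) {
            // Assumption: a missing value behaves like an empty array.
            return Collections.emptyList();
        }
        if (value instanceof Collection) {
            // Already an array: copy it so callers get a plain list.
            return new ArrayList<>((Collection<?>) value);
        }
        // Single value: wrap it so "bar" and ["bar"] compare equal.
        return Collections.singletonList(value);
    }
}
```

After normalization, `FieldValues.asList("bar")` and `FieldValues.asList(Arrays.asList("bar"))` are equal lists, so downstream code only needs to handle the array form.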
Add back ExistsP for g.V().has("fieldname")
Enable Controllers to implement optimized "group by" and "group count" functions.
This needs to be re-implemented following the changes in #44.
Implement ReduceController in elastic module,
sub-task of #46
UniFeatures are part of the graph object.
This does not allow controllers to have different features.
We need a solution that yields a dynamic result, based on what is being executed at the moment and which controller is performing the action.
Suggestions are welcome.
Most databases have an SQL-like "join" feature. Utilizing this across Traversal Steps can bring a big boost to performance, e.g. g.V().hasLabel('foo').out('bar') could be queried as select ... from foo join bar on ...
This is how I thought to implement this:
Use TraversalStrategies to analyze possible joins:
SelectStep
SearchQuery will gain a property, SearchVertexQuery[] getNextQueries(), returning the possible joins found. A controller can use this recursively to get more join possibilities.
The Controller should use the Schemas to validate the 'legitimacy' of joining the query with any of the "next queries". Things to check:
VirtualVertex objects and join with the next query.
Each controller implements the join query in a different way:
The Controller should return each result and its corresponding "future steps" results as a set, with each result in the set associated with its relevant stepId.
The issuing Step will create a Traverser from its relevant result, and add the rest as "Traverser side effects", enabling the future steps to access the results when they are called.
Use elasticsearch query Scroll API to iterate many results in the StartStep.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
According to the docs, "Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration". Should we use Scroll in Unipop? If so, should we always use it? Maybe only in StartStep (assuming that most of the time it will iterate over a large number of results).
The scroll functionality is already implemented in QueryIterator, but it's currently unused.
This sounds cool, so I wanted to give unipop-elastic a try.
Is there a jar hosted on some public repository?
When I build it manually I get the following dependency convergence error, which I couldn't even fix when adding a direct dependency from unipop-elastic on snakeyaml:1.15
Failed while enforcing releasability the error(s) are [
Dependency convergence error for org.yaml:snakeyaml:1.15 paths to dependency are:
+-unipop:unipop-elastic:0.1
+-unipop:unipop-core:0.1
+-org.apache.tinkerpop:gremlin-core:3.0.2-incubating
+-org.yaml:snakeyaml:1.15
and
+-unipop:unipop-elastic:0.1
+-unipop:unipop-core:0.1
+-org.yaml:snakeyaml:1.15
and
+-unipop:unipop-elastic:0.1
+-org.elasticsearch:elasticsearch:1.7.3
+-org.yaml:snakeyaml:1.12
]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Unipop ............................................. SUCCESS [ 0.627 s]
[INFO] Unipop :: Core ..................................... SUCCESS [ 0.762 s]
[INFO] Unipop :: Elasticsearch Controllers ................ FAILURE [ 0.709 s]
[INFO] Unipop :: JDBC Controllers ......................... SKIPPED
[INFO] Unipop :: Integration tests ........................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
At the moment, we depend on 4 different JSON libraries (including inner dependencies).
For the sake of fewer dependencies, we should consider removing most of them.
The top candidates to keep are Jackson and Gson, as Jackson is heavily used by Tinkerpop, and Gson is heavily used by Jest & Hadoop. Further analysis is required of which API is easier to use and which performs better.
Either way, this requires some refactoring to remove the other libraries.
Run JDBC and Elastic together as a whole, with tests that include both.
Part of the unipop-test module.
Have properties that are composed of multiple fields in the origin data.
In order to improve performance when querying SQL databases, we can filter out tables from the union by analyzing which tables conform to the PredicatesHolder and its 'must-have' fields, as any tables that do not have those fields can be cut out.
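The pruning idea can be sketched independently of Unipop's classes: given the columns of each table and the set of fields a predicate requires, keep only the tables that contain all of them. `TablePruner` and its inputs are hypothetical; how the 'must-have' fields are actually extracted from a PredicatesHolder is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: drops tables that cannot satisfy the required fields. */
public class TablePruner {
    public static List<String> tablesWithFields(Map<String, Set<String>> tableColumns,
                                                Set<String> mustHaveFields) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : tableColumns.entrySet()) {
            // A table stays in the union only if it has every required field.
            if (entry.getValue().containsAll(mustHaveFields)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Only the surviving tables would then be included in the generated union, so the database never scans tables that cannot match.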
Enable Controllers to implement optimized reducing functions: count, sum, average, min, max, etc.
This needs to be re-implemented following the changes in #44.
Was digging through the code a bit trying to find an example of what I would do if my documents in Elasticsearch represent nodes and edges. For instance, I have documents that look like the following. How would I represent that using unipop? User A and User B are nodes, and a case could be made for the message also being a node and the edge contains the timestamp.
{
"userA": "000000001",
"timestamp": "2015-01-01T05:14:22",
"message": "Writing on your wall",
"userB": "000000002"
}
From what I can tell, my guess would be that I use an ElasticEdgeController, but it's not entirely clear how to actually use that to build my graph and run gremlin queries across it. Can you provide an example of how to do that?
Adding this in retrospect for documentation's sake:
A much needed refactor to split Unipop's code to different components:
UniQueries to Controllers
UniQueries to Controllers, and add Unipop-specific optimizations.
Controllers can implement.
UniQueries.
Controllers, and standardize schema mappings.
While #19 considers Metrics, the profile step, and traversal metrics, this issue considers only the implementation of simple logging with SLF4J.
TBD
Currently SourceProviders, ElementSchemas, and PropertySchemas receive a JSONObject for initialization. The ControllerManager should provide them a standard org.apache.commons.configuration so that other sources of configuration can be provided.
Most database optimizers use statistics-based cardinality estimates to determine the optimal order in which to run a query's steps. Should we do something similar?
We could implement a TraversalStrategy that rearranges the Traversal's steps according to their estimated cardinality. Each Controller can provide the necessary information by utilizing its database's capabilities, e.g. Oracle Statistics.
Tinkerpop has something similar in its MatchStep, except it does a run-time statistics calculation for every traversal.
MatchStep's MatchAlgorithm to implement this feature?
MatchStep? Can't the same logic be applied for all steps in the traversal?
UniGraphVertexPropertiesSideEffectStep, a step that comes before any step that uses properties, and issues a DeferredVertexQuery. This ensures that the vertices will only be queried if and when needed (i.e. lazy loading).
This issue tries to solve problem 1. We should probably create another ticket for solving problem 2 in the future.
SearchQuery/SearchVertexQuery/DeferredVertexQuery should pass a list of property keys needed from the queried element. UniGraphPropertiesStepStrategy should provide the property lists to the querying steps by analyzing the traversal. Scenarios:
null list.
null list. ???
Next, when a Controller receives these queries it should only fetch the relevant properties, or not issue a query at all when possible.
Implement a schema that is able to consume multiple rows and treat them as a single vertex, with each row representing an edge to that vertex.