ldbc / gcore-spark

Implementation of the G-CORE graph query language on Spark

License: Apache License 2.0

Scala 100.00%

gcore-spark's People

neo-technology ramhgarcias pstutz rgarcias isabella232

gcore-spark's Issues

CONSTRUCT (v1 GROUP foo), (v1 GROUP bar IN v2)

Introduce a new optional IN subclause of GROUP in CONSTRUCT, written as "IN v1". With it, v1 and v2 belong to the same domain, so v1 and v2 are the same whenever the expressions foo and bar evaluate to the same value. This also makes it possible to construct a self-referencing edge here. The use case for which Hannes Voigt championed this feature is graph summarization.

Change in spoofax parser
Change in CONSTRUCT translation

Add specific functions to make it easier to work with paths

This work should be defined and started once the system works stably and we begin testing it for trajectory storage & analysis applications.

For example, if we want to select paths from a repo of stored paths that pass through two nodes, cut them out, and then feed them into GROUP BY paths, we would at least need:

path_intersects(path,vertex) : bool
path_cut(path,vertex,vertex) : path
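
A minimal sketch of what these two functions could look like, assuming a path is modelled as the ordered sequence of vertex ids it visits. The `Path` alias, the `PathFunctions` object, and the choice to return an `Option` when no segment exists are assumptions made for illustration; they are not part of gcore-spark.

```scala
object PathFunctions {
  // Assumption: a path is the ordered list of vertex ids it visits.
  type Path = Seq[Long]

  // path_intersects(path, vertex): true if the path visits the vertex.
  def pathIntersects(path: Path, vertex: Long): Boolean =
    path.contains(vertex)

  // path_cut(path, from, to): the sub-path from the first occurrence of `from`
  // to the first occurrence of `to` at or after it; None if there is no such segment.
  def pathCut(path: Path, from: Long, to: Long): Option[Path] = {
    val start = path.indexOf(from)
    val end = if (start < 0) -1 else path.indexOf(to, start)
    if (start < 0 || end < 0) None else Some(path.slice(start, end + 1))
  }
}
```

Selecting the paths that pass through two nodes would then combine two `pathIntersects` checks before calling `pathCut`.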

Support GROUP BY and HAVING in case of SELECT queries

Add in spoofax parser
Translation: should just pass these down to the Spark SQL runtime

Construct with empty nodes

Describe the bug
The following query does not work: Construct () match ()
The problem seems to be the empty node.

Unsupported expression: property p.prop = ""

For an expression of the form: construct (n) match (n) where n.name = "";
The parser assumes that n.name = n.""

Replace CONSTRUCT .. WHEN with CONSTRUCT … WHERE

in parser (very simple renaming)

Terminal-based front-end

G-CORE must provide a command-line tool to execute operations on a database.

Only permit reverse of one label (RPQ)

When using an RPQ, only allow the reverse symbol to be applied to a single label, not to a parenthesized expression.

Error saving new graph

When a graph is saved, the application shows an error and does not save the graph to disk.

Support creating stored paths

This means creating the path dataframes, which contain a src_id, dst_id and edge_list. Currently, these are not created yet in CONSTRUCT.
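
As a rough illustration of the row shape described above, the sketch below builds one path row from a chain of edges. Only the column names src_id, dst_id and edge_list come from the issue; the `Edge` and `PathRow` case classes and the `toPathRow` helper are hypothetical, and plain Scala collections stand in for Spark dataframes.

```scala
object StoredPaths {
  // Hypothetical shapes; only src_id, dst_id and edge_list are named in the issue.
  final case class Edge(id: Long, srcId: Long, dstId: Long)
  final case class PathRow(srcId: Long, dstId: Long, edgeList: Seq[Long])

  // Builds a path row from a non-empty chain of consecutive edges.
  // Returns None if the chain is empty or two neighbouring edges do not connect.
  def toPathRow(edges: Seq[Edge]): Option[PathRow] = {
    val connected = edges.zip(edges.drop(1)).forall { case (a, b) => a.dstId == b.srcId }
    if (edges.isEmpty || !connected) None
    else Some(PathRow(edges.head.srcId, edges.last.dstId, edges.map(_.id)))
  }
}
```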

CONSTRUCT .. ()-[e HAVING …]-()

This feature applies a selection after grouping on the binding table for a construct pattern.

spoofax parser needs to be extended
Translation needs to be created (in CONSTRUCT)

Support the FROM table T clause

adds T as a binding table variable,
creates the new binding table as cartesian product between original binding table and T
support the columns of T as single-valued properties (this entails usage of properties in SELECT, MATCH and CONSTRUCT)
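
The Cartesian-product step can be sketched as follows, assuming a binding-table row is modelled as a map from variable or column names to values. The `Row` alias, the `crossJoin` helper, and the `T.column` naming of the added columns are illustrative assumptions, not the project's actual representation.

```scala
object FromTable {
  // Assumption: a row maps variable/column names to values.
  type Row = Map[String, Any]

  // Cartesian product of the original binding table with table T.
  // Columns of T are exposed as "<tableVar>.<column>" so they can be used like properties.
  def crossJoin(bindings: Seq[Row], tableVar: String, table: Seq[Row]): Seq[Row] =
    for {
      b <- bindings
      t <- table
    } yield b ++ t.map { case (col, value) => s"$tableVar.$col" -> value }
}
```

With 2 binding rows and 3 rows in T, the new binding table has 6 rows, each carrying the original bindings plus the columns of T.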

Implement missing semantic checks

Expression types match for binary expressions (if possible to check)
(?) An edge is between two distinct vertices
Variable bindings in SimpleMatchClause have different names. Note: Bindings can be re-used across multiple SimpleMatchClauses. HOWEVER, edges should not be reused.
Ambiguous labeling of entities. For example, in a query such as (v1:L1)->(v2)<-(v3), (v1:L2), v1 is labeled differently in the two patterns.
Eliminate similar queries? For example, in (v1:L1)->(v2), (v3:L1)->(v4), v1 and v3 are the same vertex, v2 and v4 are the same vertex, and their edges are the same too. It is a repeated query, which we translate into a join over the two edges.
Validate that all keys in edgeRestriction (GraphSchema) are present in the graph. Also validate that all values in edgeRestrictions are present in the graph.
ALL PATHS can only be used with stored paths
(?) Throw error or warn the user if, after label inference, an entity can have more than one label. This translates into a UNION ALL of all labels for that entity.
Each match variable must be matched on only one graph. Validation should go in MapBindingToGraph.
All variables in an edge or path pattern should be part of the same graph. Validation should go in MapBindingToGraph.
(?) An EXISTS subquery must have at least one common variable with the main MATCH clause.
UnionAll operator is applied on two relations with the same header.
Property exists for given label, or exists for given entity type.
Property types match with Expression types.
(?) No two Tables contain the same id.
A constructed entity must be of the same type as its matched counterpart, if they are the same variable (this can be checked in CreateGroupingSets).
A specific GROUP clause can only be used with unbound variables. This can be checked in CreateGroupingSets
Check that each variable in CONSTRUCT ends up with at most one label. If the label is missing, we can create a new one in VertexCreate. This check can be done in CreateGroupingSets.
Are aggregate expressions allowed in the MATCH’s WHERE clause?
An entity can only be GROUP-ed once, or else the GROUP-ings must be combined.

Access labels of objects

We need an option to access labels of objects, like where n.label = "Label"

WHERE for MATCH

The current grammar does not allow a WHERE clause for the entire MATCH clause, when OPTIONAL patterns are included.

Change the license of gcore-spark to Apache

https://stackoverflow.com/questions/31639059/how-to-add-license-to-an-existing-github-project

We should also add at the top of each file a copyright and license notice.

We need to hold off on this until it is clear that we agree on Apache as the license (and possibly on a copyright transfer).

Correct use of parentheses RPQ

For some RPQ expressions using parentheses, the query crashes.

To Reproduce
Steps to reproduce the behavior:

Open the console
Run a query like this: CONSTRUCT (n)-/@p:reach/->(m) MATCH (n)-/p<(:HasInterest |:IsLocatedIn)! :Knows>/->(m)
Error: "Key not found"

Expected behavior
A successful query execution.

Check if left-to-right semantics in CONSTRUCT is supported

If a node variable x is created in a CONSTRUCT pattern, then a subsequent pattern (to the right) could use it, e.g. CONSTRUCT (x GROUP y.foo), (x)->(y)

MATCH … ON (<CONSTRUCT subquery>)

This can be implemented as a rewrite:

GRAPH VIEW tmp as CONSTRUCT…
MATCH … ON tmp

Only permit negation of one label (RPQ)

When using an RPQ, only allow the negation symbol to be applied to a single label, not to a parenthesized expression.

GCORE sparkSession support

Wrap the code in a gcore-spark module, so that a single import statement initializes the gcore-spark subsystem and reads the default catalog. After that, it is ready to execute queries through a gcore(string) : Graph method added to the sparkSession.

If the query is a SELECT query, we should just return a dataframe
If the query is a CONSTRUCT, we should return some SparkGraph object, and maybe offer some simple basic methods to look into graphs (like returning a list of V, E, P dataframes, or even coupling with a graph visualization library).

Implement “full graph” operations

g1 OP g2 : g3, where OP in { UNION, INTERSECT, MINUS }
Essentially, we need to pair all (vertex,edge,path) dataframes (df1i,df2i) of both g1 and g2, with the same label. Use an empty dataframe if the dataframe does not exist in either graph. Then apply df3i = df1i OP df2i
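
The pairing-by-label step can be sketched as below. This is an illustration under simplifying assumptions: each graph is modelled as a map from label to a set of rows (so UNION behaves as a set union, which may differ from dataframe union semantics), and the names `GraphSetOps`, `Op` and `combine` are hypothetical.

```scala
object GraphSetOps {
  // Assumption: a graph maps each label to the set of rows stored under it.
  type Table = Set[Map[String, Any]]
  type Graph = Map[String, Table]

  sealed trait Op
  case object Union extends Op
  case object Intersect extends Op
  case object Minus extends Op

  private val emptyTable: Table = Set.empty

  // Pairs the tables of g1 and g2 by label (using an empty table when a label
  // is missing on one side) and applies OP to each pair.
  def combine(g1: Graph, g2: Graph, op: Op): Graph =
    (g1.keySet ++ g2.keySet).map { label =>
      val left = g1.getOrElse(label, emptyTable)
      val right = g2.getOrElse(label, emptyTable)
      val result = op match {
        case Union     => left.union(right)
        case Intersect => left.intersect(right)
        case Minus     => left.diff(right)
      }
      label -> result
    }.toMap
}
```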

Path operations

Define syntax and semantics for path operations.

Operations between sets of paths: join, union, difference, intersection
Filter functions for paths: the WHERE clause could contain functions to filter paths, e.g. path.contains(node)
Path construction: the CONSTRUCT clause could contain operators to add labels and properties to the paths returned by the MATCH

CREATE and DROP GRAPH

Support CREATE GRAPH x, which indicates that graph x is persistent, and that the catalog has to be changed also in a persistent way.

Also support DROP GRAPH x, which indicates that a graph x that is persistent has to be deleted from Spark and the catalog.

CREATE GRAPH x should have default semantics (for example, there should be a default rule that indicates where graph x is stored and in which format).

CREATE GRAPH x should also let the user specify parameters such as the directory where x is stored, the format for x, …

It could be something like CREATE GRAPH x (directory="/foo", format="parquet")

Graph output format

G-CORE must provide a textual format to represent the output of a query.

Support multi-valued properties

Spark dataframes can support lists of literal values, so this is easily added on the storage level.
We should support lists of literal values in the schema languages and catalog
This means we should then also support binding a property value to a variable. This has the effect of “unrolling” the multi-valued value into individual rows of the new binding table. If the multi-valued value was in fact unbound (NULL), the binding is not lost but consists of a single row in the binding table (with the variable taking NULL in that row).
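
The unrolling rule can be sketched on plain Scala collections as follows. The `Binding` and `Unrolled` case classes are hypothetical, `None` stands in for NULL, and treating an empty list like an unbound value is an assumption (the issue only describes the NULL case). On Spark dataframes, the built-in `explode_outer` function behaves in a similar way.

```scala
object Unroll {
  // Hypothetical shapes: a node with an optional multi-valued property,
  // and the row produced after binding one value of it to a variable.
  final case class Binding(node: String, tags: Option[Seq[String]])
  final case class Unrolled(node: String, tag: Option[String])

  // One output row per list element; an unbound (None) or empty list
  // keeps a single row with the variable set to None (NULL).
  def unroll(rows: Seq[Binding]): Seq[Unrolled] =
    rows.flatMap {
      case Binding(node, Some(tags)) if tags.nonEmpty =>
        tags.map(t => Unrolled(node, Some(t)))
      case Binding(node, _) =>
        Seq(Unrolled(node, None))
    }
}
```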

Support SELECT

Add in spoofax parser
Translation: basically implement x.prop expressions on the binding table

Add missing data types (date, datetime, timestamp)

Add to the translator typesystem and catalog

Review the use of CREATE

Describe the bug
The following query is allowed: "CREATE 'nuevo' CONSTRUCT (n) MATCH (n)"

To Reproduce

Execute: CREATE 'nuevo' CONSTRUCT (n) MATCH (n);

Expected behavior
Show a parser error, because it must be "CREATE GRAPH"

RPQ with KleeneBounds

Implement RPQ with KleeneBounds

Ex. MATCH (n:Person)-/ALL p<:knows*{2}>/->(m:Person)

Change syntax of unbound test

support use of NULL as a constant symbol indicating an unbound property
introduce “e IS NULL” (like SQL) and remove the exist(e) notation (affects both parser and translation)

PATH expressions

basic PATH pat = (src)-...pattern..-(dst),... MATCH ... -/ pat* /- … ON g
This can be implemented as a rewrite:

MATCH … -/ pat*/- … ON (
CONSTRUCT g,(src)-[pat]-(dst)
MATCH (src)-..pattern…-(dst),.. )

Add weighted paths on PATH expressions by adding a “weight” property on the newly created edges. In the translation, and specifically in the generation of the GraphX code, we need GraphX to sum the “weight” properties to calculate the path length (as opposed to taking the hop count). It is probably better not to expose this at the G-CORE syntax level, but to put it as an annotation on the GraphX operator during the translation.

The rewrite for weighted PATH pat = (src)-...pattern..-(dst),...COST ..Y.. used in MATCH … -/ pat* COST x/- … is therefore:

MATCH … -/ pat* COST x/- … ON (
CONSTRUCT g,(src)-[pat {weight:=..Y..}]-(dst)
MATCH (src)-..pattern…-(dst),.. )

Again, we somehow need to ensure that COST x now gets filled not with the hop count, but with SUM(weight). The GraphX implementation already has some support for this, but it needs to be triggered. This extra info is probably best kept as a property attached to the algebra tree nodes, so that the GraphX code generation can pick it up and generate the appropriate code.
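
The difference between the two cost definitions can be shown on a single path. This sketch does not use GraphX; the `Edge` case class and both helper functions are hypothetical, and counting a missing weight as 1 is an assumption.

```scala
object PathCost {
  // Hypothetical edge shape with an optional "weight" property.
  final case class Edge(srcId: Long, dstId: Long, weight: Option[Double])

  // Hop count: every edge on the path contributes 1.
  def hopCount(path: Seq[Edge]): Double = path.size.toDouble

  // Weighted cost: SUM(weight) over the path; a missing weight is assumed to count as 1.
  def weightedCost(path: Seq[Edge]): Double =
    path.map(_.weight.getOrElse(1.0)).sum
}
```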

Implement a history of commands for command line

Describe the solution you'd like
When a user enters a command, the command is stored. Then, when the user presses the up or down arrow key, the system shows the most recent commands.
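
One possible shape for such a history is sketched below. The `CommandHistory` class is hypothetical; it only models storing commands and stepping through them, not terminal key handling.

```scala
final class CommandHistory {
  private var entries = Vector.empty[String]
  // Index into entries; entries.size means "positioned after the newest command".
  private var cursor = 0

  // Stores a command and resets the cursor to just past the newest entry.
  def add(command: String): Unit = {
    entries = entries :+ command
    cursor = entries.size
  }

  // Up arrow: step back to the previous command, if any.
  def previous(): Option[String] =
    if (cursor > 0) { cursor -= 1; Some(entries(cursor)) } else None

  // Down arrow: step forward to the next command, or leave the history.
  def next(): Option[String] =
    if (cursor < entries.size - 1) { cursor += 1; Some(entries(cursor)) }
    else { cursor = entries.size; None }
}
```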

GRAPH VIEW x AS CONSTRUCT ...

This is a lightweight version of CREATE GRAPH
Keep the result dataframes of the CONSTRUCT query around in the Spark session (currently they get de-allocated), but do not save them persistently as in CREATE GRAPH
Temporarily add the meta-data of the CONSTRUCT query to the catalog under graph name X (the new catalog should not be stored on disk)
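
The distinction between a persistent graph and a view could be tracked in the catalog roughly as follows. The `Entry` and `Catalog` types are hypothetical and hold no dataframes; the sketch only shows that views are visible for lookup but excluded from what gets written to disk.

```scala
object ViewCatalog {
  // Hypothetical catalog entry: a graph name plus a flag saying whether it is persisted.
  final case class Entry(name: String, persistent: Boolean)

  final class Catalog {
    private var entries = Map.empty[String, Entry]

    // CREATE GRAPH x: the entry must survive the session.
    def createGraph(name: String): Unit =
      entries = entries + (name -> Entry(name, persistent = true))

    // GRAPH VIEW x AS CONSTRUCT ...: the entry lives only in the session.
    def registerView(name: String): Unit =
      entries = entries + (name -> Entry(name, persistent = false))

    def lookup(name: String): Option[Entry] = entries.get(name)

    // Only persistent graphs would be written to the on-disk catalog.
    def namesToPersist: Seq[String] =
      entries.values.filter(_.persistent).map(_.name).toSeq
  }
}
```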