Hello, it would be great if we could use unique keys in nodes and edges. in Neo4j it's

As <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id=

Hello, the commit <a class="commit-link" data-hovercard-type="commit" data-hovercard-u

Here some links just for reference. List of <a href="https://www.ope

unique properties / indexing,about apache/age

Comments (25)

pdpotter commented on June 3, 2024 6

It would indeed be very useful to enable the creation of constraints / indexes. This could help to accelerate the creation of edges between nodes that were added earlier, which slows down a lot when there are a lot of nodes with the same label:

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2})
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype)

The CREATE UNIQUE, CREATE CONSTRAINT, and CREATE INDEX clauses / commands are part of cypher, but not of openCypher, so I don't know what the best approach would be here.

@ehsanonline

See comment below for an update. Click here to view the original comment content.

You can create a unique index, which will prevent the creation of duplicate values, but will not speed up matching, by creating an immutable function that casts to json over text

CREATE OR REPLACE FUNCTION get_name(properties agtype) 
  RETURNS text
AS
$BODY$
    select $1::text::json->>'name';
$BODY$
LANGUAGE sql
IMMUTABLE;

and using that immutable function in the create index command

CREATE UNIQUE INDEX person_name_idx ON mygraph.person(get_name(properties)) ;

Link to the mentioned comment below

from age.

JoshInnis commented on June 3, 2024 3

Hello,

A graph name is a schema and a label name is a table. Id and properties are columns in vertex table. Id, start_id, end_id, and properties are columns in the edge tables. Use the agtype_access_operator(properties, key) to get to get a property value.

Knowing all that you can use Postges' standard DDL language to implement constraints, indices and unique values.

ALTER TABLE graph_name.label_name 
ADD CONSTRAINT constraint_name
CHECK(agtype_access_operator(properties, "name_of_property") != '"Check against here"'::agtype);

from age.

pdpotter commented on June 3, 2024 3

As #228 has been fixed, it is now possible to use GIN indices for adding relations.
After adding GIN indices by executing

CREATE INDEX d:LabelA__properties
ON graph_name.LabelA USING GIN (properties);
CREATE INDEX d:LabelB__properties
ON graph_name.LabelB USING GIN (properties);

Adding relations using a query like

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2})
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype);

is now over 30 times faster (~300/s on my local VM) than without the GIN indices.

I tried improving the performance by creating indices using

CREATE INDEX d:LabelA__properties
ON graph_name.LabelA USING GIN (properties jsonb_path_ops);

but jsonb_path_ops is not supported for agtype.

Another attempt using a specialized index created by

CREATE INDEX d:LabelA__properties_id
ON graph_name.LabelA USING GIN ((properties->'id'));

didn't seem to work because the index wasn't used when executing MATCH queries (this was already mentioned in #212 (comment)).

As using GIN indixes is ~30 times slower than the method described in #45 (comment), I think I'm going to keep using this other method for initial data import.

from age.

JoshInnis commented on June 3, 2024 2

Hello, the commit 57e11a3 in master should have resolved this issue and will be available in the next release.

from age.

Dzordzu commented on June 3, 2024 1

Just wondering if this issue is resolved? Or, is there more that needs to be done that this issue needs to stay open? It is a bit difficult to tell from the correspondence.

I wouldn't say it's resolved. Fact - currently there are mechanisms that allow client to create indexes / constraints. On the other hand there is neither good documentation on this, nor agreement on the topic if constraints should also be included within cypher-like methods.

from age.

audiBookning commented on June 3, 2024

Here some links just for reference.

List of Cypher Improvement Proposals (CIP)

As for indexes and constraints in Opencypher, there are already some PR for their addition, but i think that they are "blocked" by another CIP

CIP:

PR:

from age.

pdpotter commented on June 3, 2024

I tried using id() instead of using properties for matching, but this was even slower (I noticed using an additional WHERE clause instead of defining properties in the MATCH clause is slower in another use case as well).

For this, I rewrote

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2})
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype);

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA), (r:LabelB)
WHERE id(d) = 11111 and id(r) = 11112
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype);

where I obtained the ids (the ones above are made up) using

SELECT * FROM cyper('graph_name',$$
MATCH (n:LabelA)
return id(n), n.id
$$) as (id agtype, prop agtype);

from age.

pdpotter commented on June 3, 2024

There is a (1000 times faster) workaround for the creation of edges between vertices that were added earlier. The workaround consists of inserting the data directly in the underlying tables that are used by Apache AGE.

See comment below for an update. Click here to view the original comment content.

First, the underlying ids of the vertices are retrieved:

SELECT * FROM cyper('graph_name',$$
MATCH (n:LabelA)
return id(n), n.id
$$) as (id agtype, prop agtype);

A single edge is then created using the cypher function to make sure the underlying tables are created correctly (the id values are made up)

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA), (r:LabelB)
WHERE id(d) = 11111 and id(r) = 11112
CREATE (d)-[:RelationA {prop: 'value'}]->(r)
$$) as (a agtype);

Indices on start_id and end_id are then created in the underlying table for the edges of a certain type, to speed up the inserts in the _ag_label_edge table later on:

CREATE INDEX RelationA__start_id ON graph_name.RelationA(start_id);
CREATE INDEX RelationA__end_id ON graph_name.RelationA(end_id);

All other edges are created by direct insertion, first in the specific edge table (using executemany):

INSERT INTO graph_name.RelationA (start_id, end_id, properties)
VALUES ($1, $2, $3)

Where $1 are the domain_ids, $2 are the range_ids and $3 are the properties (as json dump).

Secondly, the edges are also directly inserted into the _ag_label_edge table (also using executemany). In my use case, each edge has an id property that can be used to select the correct edge when there are multiple edges between two vertices:

INSERT INTO graph_name._ag_label_edge (id, start_id, end_id, properties)
VALUES (
    (
        SELECT id from graph_name.RelationA
        WHERE start_id = $1
        AND end_id = $2
        AND properties::text::json->>'id' = $3
    ),
    $1,
    $2,
    $4
)

Where $1 are the domain_ids, $2 are the range_ids, $3 are the relation ids and $4 are the properties (as json dump).

Any thoughts on this workaround?
Are there any plans to add property indexes to Apache AGE later on?

from age.

pdpotter commented on June 3, 2024

When the edges are inserted in the specific edge table, they are automatically added to the _ag_label_edge table, reducing the workaround to the following and making the relation creation even faster:

First, the underlying ids of the vertices are retrieved:

SELECT * FROM cyper('graph_name',$$
MATCH (n:LabelA)
return id(n), n.id
$$) as (id agtype, prop agtype);

A single edge is then created using the cypher function to make sure the underlying tables are created correctly (the id values are made up)

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA), (r:LabelB)
WHERE id(d) = 11111 and id(r) = 11112
CREATE (d)-[:RelationA {prop: 'value'}]->(r)
$$) as (a agtype);

All other edges are created by direct insertion in the specific edge table (using executemany):

INSERT INTO graph_name.RelationA (start_id, end_id, properties)
VALUES ($1, $2, $3)

Where $1 are the domain_ids, $2 are the range_ids and $3 are the properties (as json dump).

from age.

pdpotter commented on June 3, 2024

This is great, thank you!

Since indices require an immutable function, an additional function will still need to be created for them. When I create a get_id function with

CREATE OR REPLACE FUNCTION get_id(properties agtype) 
  RETURNS agtype
AS
$BODY$
    select agtype_access_operator($1, '"id"');
$BODY$
LANGUAGE sql
IMMUTABLE;

and use it in an index with

CREATE UNIQUE INDEX person_id_idx ON mygraph.person(get_id(properties)) ;

the creation of vertices with the same id will be prevented

ERROR:  duplicate key value violates unique constraint "person_id_idx"
DETAIL:  Key (get_id(properties))=(2250) already exists.

but the index will still not be used when trying to match vertices with a specific id:

SELECT * FROM ag_catalog.cypher('mygraph', $$EXPLAIN ANALYZE MATCH (a:person {id:2250}) return a$$) as (a agtype);

                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on person a  (cost=0.00..2775.32 rows=16906 width=32) (actual time=12.688..85.521 rows=1 loops=1)
   Filter: _property_constraint_check(properties, agtype_build_map('id'::text, '2250'::agtype))
   Rows Removed by Filter: 50718
 Planning Time: 0.122 ms
 Execution Time: 85.550 ms
(5 rows)

Is there a way to use indices when matching?

from age.

JoshInnis commented on June 3, 2024

It might be safe to change agtype_access_operator to immutable.

Per Postgres' Documentation:

IMMUTABLE indicates that the function cannot modify the database and always returns the same result when given the same argument values; that is, it does not do database lookups or otherwise use information not directly present in its argument list. If this option is given, any call of the function with all-constant arguments can be immediately replaced with the function value.

STABLE indicates that the function cannot modify the database, and that within a single table scan it will consistently return the same result for the same argument values, but that its result could change across SQL statements. This is the appropriate selection for functions whose results depend on database lookups, parameter variables (such as the current time zone), etc. (It is inappropriate for AFTER triggers that wish to query rows modified by the current command.) Also note that the current_timestamp family of functions qualify as stable, since their values do not change within a transaction.

The access operator works for Agtype Lists and Maps, it does not perform any database lookup, it just extracts a value from the first passed in parameter.

from age.

JoshInnis commented on June 3, 2024

Indices cannot currently be used while matching. There will need to be some re factoring done to allow the planner to realize opportunities where the indices can be used.

from age.

pdpotter commented on June 3, 2024

Thank you for the update! Is it possible to explain in more detail what issue was resolved?

I tried creating a unique index as described in #45 (comment), but a Sequential scan is still executed when I use a MATCH query to find a vertex based on the property on which a unique index has been created.

I pulled the latest changes, did a make and sudo make install, DROP EXTENSTION age CASCADE; and CREATE EXTENTION age;. Is this sufficient to use the changes from 57e11a3, or do I have to drop and recreate the database I'm working on?

from age.

JoshInnis commented on June 3, 2024

Hi @pdpotter, sorry for the confusion, AGE now supports constraints, the MATCH clause does not yet support using index scans. Constraints now work and the updating (SET, REMOVE, etc) clauses work with constraints and they no longer break indices. The patch 6279c10 supports GIN indices. So if you create an index on a label's properties and place the quals in the {} in the MATCH clause. Such as MATCH (:label {prop_name: "filter value"}) the gin index will be used. Index scans using filters in the WHERE clause in nearing completion and will be in the review process shortly.

from age.

pdpotter commented on June 3, 2024

See comment below for an update. Click here to view the original comment content.

Wow, this is fantastic. It is now possible to create relations quickly (~10 000/s on my local VM) using simple queries (with executemany) in the form of ``` SELECT * FROM cypher('graph_name', $$ MATCH (d:LabelA {id: 1}), (r:LabelB {id: 2}) CREATE (d)-[:Relation {prop: 'value'}]->(r) $$) as (a agtype) ```

After simply adding GIN indexes

CREATE INDEX d:LabelA__properties
ON graph_name.LabelA USING GIN (properties)

CREATE INDEX d:LabelB__properties
ON graph_name.LabelB USING GIN (properties)

For this specific use case, it would of course be more disk space efficient to only index the id property and not all of them, but having MATCH working so much faster is a huge leap forward. Great work!

from age.

pdpotter commented on June 3, 2024

Commit 379983b includes some improvements which are relevant here:

agtype_access_operator is now immutable. This means it can be used directly to create unique indices. It is no longer necessary to create a separate immutable function. E.g.,

CREATE UNIQUE INDEX person_id_idx ON mygraph.person(ag_catalog.agtype_access_operator(properties, '"id"'::agtype));

~~GIN indices are now used when filtering using WHERE clauses.~~
edit: The unique index created above is used when filtering using WHERE clauses.

Very nice work, thank you @JoshInnis!

from age.

pdpotter commented on June 3, 2024

It looks like I have been a bit too enthusiastic in my previous comments.

Property constraints

When using GIN indices on properties and creating relations using these indices, I didn't check if the relations were actually added. Unfortunately, they were not.

After adding a GIN index and adding enough vertices so the index is used, a match query doesn't return any results. E.g.,

SELECT * FROM cypher('test_graph', $$CREATE (p:person {id: 1}) return p$$) as (p agtype);
SELECT * FROM cypher('test_graph', $$CREATE (p:person {id: 2}) return p$$) as (p agtype);
...
SELECT * FROM cypher('test_graph', $$CREATE (p:person {id: 500}) return p$$) as (p agtype);
CREATE INDEX person__properties ON test_graph.person USING GIN (properties);

A match query using property constraints returns 0 results.

SELECT * FROM cypher('test_graph', $$MATCH (p:person {id: 1}) return p$$) as (p agtype);
 p 
---
(0 rows)

Query plan:

                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on person p  (cost=12.01..16.03 rows=1 width=32) (actual time=0.026..0.027 rows=0 loops=1)
   Recheck Cond: (properties @> agtype_build_map('id'::text, '1'::agtype))
   ->  Bitmap Index Scan on person__properties  (cost=0.00..12.01 rows=1 width=0) (actual time=0.024..0.024 rows=0 loops=1)
         Index Cond: (properties @> agtype_build_map('id'::text, '1'::agtype))
 Planning Time: 0.158 ms
 Execution Time: 0.075 ms
(6 rows)

Where clause

When using the where clause, it is not the GIN index that is being used, but the unique index that was added to prevent duplicate entries by executing

CREATE UNIQUE INDEX person__id ON test_graph.person (properties);

The WHERE clause does give a correct result:

SELECT * FROM cypher('test_graph', $$MATCH (p:person) WHERE p.id = 1) return p$$) as (p agtype);
                                      p                                      
-----------------------------------------------------------------------------
 {"id": 844424930131969, "label": "person", "properties": {"id": 1}}::vertex
(1 row)

Query plan:

      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using person__id on person p  (cost=0.27..8.30 rows=1 width=32) (actual time=0.063..0.065 rows=1 loops=1)
   Index Cond: (agtype_access_operator(VARIADIC ARRAY[properties, '"id"'::agtype]) = '1'::agtype)
 Planning Time: 0.190 ms
 Execution Time: 0.103 ms
(4 rows)

When using the WHERE clause for creating relations, the performance decreases when adding a lot of relations (when adding relations in batches of 5000, the first batch achieves ~5000 it/s, while the ninth batch achieves ~500 it/s) using something similar to

SELECT * FROM cypher('graph_name', $$
MATCH (d:LabelA ), (r:LabelB)
WHERE d.id = 1 AND r.id = 2
CREATE (d)-[:Relation {prop: 'value'}]->(r)
$$) as (a agtype)

Test suite

When having a new look at the tests (https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L299-L336) to check if I was making mistakes, I found out some of the index related tests might have some issues:

Are the indices being used with a limited amount of vertices?
The query at https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L304 returns 0 results, which is expected, but it might be useful to check if results are returned for other queries (e.g., MATCH (c:City {country_code: "US"}). Queries https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L318 and https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L330 are similar cases.
There are errors on lines https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L314 and https://github.com/apache/incubator-age/blob/master/regress/expected/index.out#L335

Thank you so much for making Apache AGE better and better with each commit. I'm sorry if I caused any confusion with my previous comments.

from age.

JoshInnis commented on June 3, 2024

Hi @pdpotter, GIN Indices are for a subset of JsonB operators https://www.postgresql.org/docs/11/gin-builtin-opclasses.html These operators are not usabale in the cypher Where clause. Currently the only place they can be used is in the property constraint field in the match clause. The where clause is now compatible with the comparison operators that the cypher command currently has. These new operators need to be added to the cypher command's where clause

from age.

JoshInnis commented on June 3, 2024

For an article about this you can checkout this https://bitnine.net/blog-postgresql/postgresql-internals-jsonb-type-and-its-indexes/ Its a bit dated and about Agensgraph and not AGE, but if you replace insances of the '->>' with the '.' operator the information is still accurate

from age.

Dzordzu commented on June 3, 2024

For onlookers: within Rust driver I've implemented methods that provide unique indexes, and properties contraints. As of now it can be treated as a summary of this disscussion

from age.

jrgemignani commented on June 3, 2024

Just wondering if this issue is resolved? Or, is there more that needs to be done that this issue needs to stay open? It is a bit difficult to tell from the correspondence.

from age.

jrgemignani commented on June 3, 2024

I will try to see if I can get others engaged that can help.

from age.

ddrao commented on June 3, 2024

Any update on indexing side of things? Any pointers on enabling the cypher queries to use specific property indexes? Other than the GIN index, nothing seems to be working.

from age.

pdpotter commented on June 3, 2024

~~In #954 (comment), a way to create specific property indexes has been mentioned. I haven't been able to test it out, but the issue has been closed, so I'm hopeful it will work.~~ Edit: it doesn't seem to work, see #1009

In #1000, a patch that is still being worked on is mentioned that would allow GIN indices to be used in WHERE clauses as well.

from age.

vladiksun commented on June 3, 2024

@pdpotter I suppose the indexes mentioned in the #1000 to be used from the WHERE clauses should be of BTREE type instead of GIN because GIN index is good for pattern matches like in the MATCH clauses.

from age.

unique properties / indexing about age HOT 25 OPEN

Comments (25)

Property constraints

Where clause

Test suite

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent