
Comments (10)

gkorland commented on May 18, 2024

Origin comment by: @jeffreylovitz
Hi @mboldisc,

The major bottleneck in the speed of graph creation is reallocating the matrices to accommodate new vertices and edges. Using a clause like MERGE will perform these reallocations one vertex at a time, which is basically worst-case behavior!

The binary encoding we chose for the bulk loader saves some time in processing and reduces some of the pressure on the Redis input buffer (which has limits of roughly 1 million tokens and/or 1 gigabyte of data). That benefit is fairly trivial compared to the speed gained by reducing the frequency of matrix reallocations, however.

I think that the most straightforward approach, as you suggest, would be writing a script similar to the CSV loader that encodes data in the same binary format. What format is your input data in?

I like the idea of improving bulk load capabilities very much. As one word of caution, when using the GRAPH.BULK endpoint, the module is incapable of de-duping, which makes commands much more similar to CREATE than MERGE. Any script using that endpoint should be capable of sanitizing inputs.
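Since GRAPH.BULK behaves like CREATE rather than MERGE, de-duplication has to happen client-side before encoding. A minimal sketch of that sanitization step (field names here are hypothetical illustrations, not the bulk loader's actual schema):

```python
# Client-side de-duplication before a GRAPH.BULK-style insert.
# The bulk endpoint cannot de-dupe, so duplicates must be removed
# before encoding. The "id" key is a hypothetical unique field.
def dedupe_nodes(rows, key="id"):
    """Keep the first row seen for each unique key value."""
    seen = set()
    unique = []
    for row in rows:
        k = row[key]
        if k not in seen:
            seen.add(k)
            unique.append(row)
    return unique

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 1, "name": "a2"}]
print(dedupe_nodes(rows))  # first occurrence of id 1 wins
```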

from falkordb.

gkorland commented on May 18, 2024

Origin comment by: @mboldisc
Ah. Thanks for the response! I need to think about this a bit. I really need a "MERGE" instead of a "CREATE". I'm taking various data sources in and dumping them in a graph, trying not to clobber existing data. Neo4j has a batch JDBC PreparedStatement approach. Maybe we need something similar to that.


gkorland commented on May 18, 2024

Origin comment by: @mboldisc
It sounds like the bottleneck is the matrix memory allocation and associated processing. Here are two other potential ideas. Let me know if either would help:

  1. Add a configuration parameter to pre-allocate a matrix of a given size, with the extra elements reserved for new vertex data. Say, for example, I don't expect more than six million vertices in my graph; I could set a config specifying: redisgraph.vertex.allocation=6000000

  2. Similar to the idea above, could adding new vertices beyond the current capacity cause the matrix to expand beyond the immediately requested size, perhaps doubling on each expansion? Many list data structures grow past the currently requested element count to reserve future capacity.
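Idea 2 is the classic amortized-doubling strategy used by dynamic arrays. A toy model (the real matrix code is C; this only illustrates why doubling cuts the reallocation count so dramatically):

```python
# Toy model of capacity growth: exact-fit vs. doubling resizing.
# Counts how many reallocations occur while inserting n vertices one at a time.
def reallocs_exact(n):
    # grow by 1 each time: one reallocation per insert past capacity
    return n

def reallocs_doubling(n, initial=1):
    cap, count = initial, 0
    for size in range(1, n + 1):
        if size > cap:
            cap *= 2
            count += 1
    return count

print(reallocs_exact(6_000_000))     # 6,000,000 reallocations
print(reallocs_doubling(6_000_000))  # 23 reallocations
```

Six million single-vertex insertions shrink from six million reallocations to twenty-three, which is the behavior MERGE-heavy workloads are missing today.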


gkorland commented on May 18, 2024

Origin comment by: @jeffreylovitz
Those are both good suggestions, and in the past we have in fact implemented variants of both. The trade-off between perfectly-sized matrices (which we maintain right now in all contexts except GRAPH.BULK usage) and matrices with empty space for growing is a tricky one to call.

One reason for this is that the memory cost associated with matrix size is pretty steep. Each label has one associated matrix, and each relationship type has two. For traversals to work properly, all labels and relations in a query have to be of the same size (since we perform a series of matrix multiplications). Per matrix, the cost of an allocated-but-unused vertex is 16 bytes, which is not in itself that bad, but you can see how it would quickly add up.
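To make the 16-bytes-per-slot figure concrete, a rough back-of-envelope calculation (the label and relationship-type counts below are made-up example values, not anything from a real deployment):

```python
# Back-of-envelope cost of pre-allocated but unused matrix rows.
# 16 bytes per unused vertex slot per matrix (figure from the comment above).
# Matrix count: one per label, two per relationship type.
BYTES_PER_SLOT = 16

def wasted_bytes(unused_vertices, n_labels, n_rel_types):
    n_matrices = n_labels + 2 * n_rel_types
    return unused_vertices * BYTES_PER_SLOT * n_matrices

# e.g. 1M slack rows, 5 labels, 10 relationship types -> 25 matrices
print(wasted_bytes(1_000_000, 5, 10))  # 400,000,000 bytes, roughly 381 MiB
```

So a million rows of slack in a modestly-labeled graph already costs hundreds of megabytes, which is why the trade-off is hard to call.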

I think that one interesting approach would be to change matrix sizing policy either through user specification (like your idea 1) or through some simple heuristics that determine whether the graph is write-heavy (and would thus benefit from fewer reallocs, suggesting more generous sizing like in idea 2) or read-heavy (and would thus benefit more from reducing wasted RAM).
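Such a heuristic could be as simple as choosing a growth factor from the observed write ratio. A purely hypothetical sketch (RedisGraph exposes no such knob; thresholds and factors are invented for illustration):

```python
# Hypothetical sizing heuristic: pick a matrix growth factor from the
# observed write ratio. Write-heavy graphs get generous over-allocation
# (fewer reallocs); read-heavy graphs get tight sizing (less wasted RAM).
def growth_factor(writes, reads):
    total = writes + reads
    if total == 0:
        return 1.0
    write_ratio = writes / total
    if write_ratio > 0.5:
        return 2.0   # double on resize, amortized O(1) growth
    elif write_ratio > 0.1:
        return 1.25  # modest slack
    return 1.0       # exact fit, no wasted space

print(growth_factor(900, 100))  # 2.0
print(growth_factor(10, 990))   # 1.0
```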

I'll discuss this with the rest of our team, but I'm not sure when we'll be able to place it in our roadmap!

If you'd like to experiment with some changes yourself, I can point you to the areas in the code where we set the resizing policies and the dimensions to which matrices should be resized (a la idea 2).


gkorland commented on May 18, 2024

Origin comment by: @mboldisc
@jeffreylovitz - If you have a couple minutes to point out where to start in the code, I'd appreciate it.


gkorland commented on May 18, 2024

Origin comment by: @mboldisc
I should also mention that I was off by an order of magnitude in MERGE statement speed. It's more like 25 operations per second.


gkorland commented on May 18, 2024

Origin comment by: @jeffreylovitz
I'd be happy to!

One of the first things that happens when a query is received is that the matrix synchronization policy is set:
https://github.com/RedisLabsModules/RedisGraph/blob/06bad49ffc3624a59c5bd3dac3d0a2e7b83172e8/src/graph/graph.c#L220

As the comments indicate, these policies serve two roles at the moment - managing the dimensions of matrices as well as flushing all pending changes to those matrices. I think that we'll likely split those roles into two separate policies when we try to optimize this area more thoroughly.

The first role is fairly safe to change, and you can create a buffer of unused space in matrices by making changes in or around the matrix dimension setter:
https://github.com/RedisLabsModules/RedisGraph/blob/06bad49ffc3624a59c5bd3dac3d0a2e7b83172e8/src/graph/graph.c#L300

The second role is a bit trickier, but also important, as one can break atomicity guarantees if read queries are being issued to the database while matrices are out of sync. Pending changes are flushed with calls to GrB_Matrix_nvals(). The default synchronization function (used for all endpoints except GRAPH.DELETE and GRAPH.BULK) is here:
https://github.com/RedisLabsModules/RedisGraph/blob/06bad49ffc3624a59c5bd3dac3d0a2e7b83172e8/src/graph/graph.c#L172

In normal execution, synchronization occurs whenever a matrix is fetched - a MERGE query, for example, will fetch every matrix associated with the edges and vertices your pattern describes.

By adding some buffer space in Graph_RequiredMatrixDim() and changing the associated calls to never shrink the retrieved matrix, some performance gains should be possible. If concurrent queries are not a risk for your use case, you should be able to get even greater gains by changing when pending changes are flushed to GraphBLAS.
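The suggested change to Graph_RequiredMatrixDim() can be modeled as a padded, never-shrinking dimension policy. A Python sketch of the policy only (the real code is C calling GraphBLAS resize routines, and the padding constant here is an arbitrary example):

```python
# Sketch of a padded, never-shrinking matrix dimension policy:
# round the required dimension up to a padded boundary, and never
# return a value smaller than the current allocation.
PAD = 10_000  # hypothetical slack granularity; could be multiplicative instead

def required_matrix_dim(node_count, current_dim):
    padded = ((node_count + PAD - 1) // PAD) * PAD  # round up to PAD boundary
    return max(padded, current_dim)  # never shrink the matrix

print(required_matrix_dim(12_345, 0))    # 20000
print(required_matrix_dim(500, 20_000))  # stays 20000 (no shrink)
```

With this policy, single-vertex MERGE inserts only trigger a real resize once per PAD vertices instead of once per vertex.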

The RedisGraph team discussed this issue today, and we think we should be able to make significant improvements to support batch operations like this, but we won't be able to make changes until we're sure they're safe for all use cases. From what you've described, I think that a more bespoke approach would work fine for you in the meantime - we ought to be able to introduce a formal solution in the next few months.

I hope that this was helpful! Let me know if you have any questions or need help, and please tell us what your experience is like. This is an area we can make big improvements in, so additional sets of eyes are a huge help.


gkorland commented on May 18, 2024

Origin comment by: @github-actions[bot]
Stale issue message



kkonevets commented on May 18, 2024

Origin comment by: @kkonevets
@mboldisc Have you achieved the state of ready to use bulk loader for your input format? I need to do the same #709

I ended up implementing a graph database on top of LMDB; it turns out to be easy to do.

At the time I tried it, Redis didn't persist state to disk (it was purely in-memory), so I switched to LMDB.


