
typedb-loader's Introduction



If your TypeDB project

  • has a lot of data
  • and you want/need to focus on schema design, inference, and querying

Use TypeDB Loader to take care of your data migration for you. TypeDB Loader streams data from files and migrates it into TypeDB at scale!

Features:

  • Data Input:

    • data is streamed to reduce memory requirements
    • supports any tabular data file with your separator of choice (i.e.: csv, tsv, whatever-sv...)
    • supports gzipped files
    • ignores unnecessary columns
  • Attribute, Entity, Relation Loading:

    • load required/optional attributes of any TypeDB type (string, boolean, long, double, datetime)
    • load required/optional role players (attribute / entity / relation)
    • load list-like attribute columns as n attributes (recommended procedure until attribute lists are fully supported by TypeDB)
    • load list-like player columns as n players for a relation
    • load entity if not present - if present, either do not write or append attributes
  • Appending Attributes to existing things

  • Append-Attribute-Or-Insert-Entity for entities

  • Data Validation:

    • validate input data rows and log issues for easy diagnosis of input-data-related problems (e.g. missing attributes/players, invalid characters...)
  • Configuration Validation:

    • write your configuration with confidence: warnings will display useful information for fine tuning, errors will let you know what you forgot. All BEFORE the database is touched.
  • Performance:

    • parallelized asynchronous writes to TypeDB to make the most of your hardware configuration, optimized with engineers @vaticle
  • Stop/Restart (in re-implementation, currently NOT available):

    • tracking of your migration status to stop/restart, or restart after failure
  • Basic Column Preprocessing using RegEx's

Create a Loading Configuration (example) and use TypeDB Loader

How it works:

To illustrate how to use TypeDB Loader, we will use a slightly extended version of the "phone-calls" example dataset and schema from the TypeDB developer documentation.

Configuration

The configuration file tells TypeDB Loader what things you want to insert for each of your data files and how to do it.

Here are some examples:

For detailed documentation, please refer to the WIKI.

The config in the phone-calls test is a good starting example of a configuration.
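A minimal sketch of what a configuration might look like, loosely following the phone-calls example (the key names here are recalled from the project wiki and may differ between versions - treat the WIKI as authoritative):

{
    "globalConfig": {
        "separator": ",",
        "rowsPerCommit": 50,
        "parallelisation": 24,
        "schema": "path/to/phone-calls-schema.tql"
    },
    "entities": {
        "person": {
            "data": ["path/to/person.csv"],
            "insert": {
                "entity": "person",
                "ownerships": [
                    { "attribute": "first-name", "column": "first_name" },
                    { "attribute": "phone-number", "column": "phone_number" }
                ]
            }
        }
    }
}

Each block under "entities" points at one or more data files and describes which columns map to which attribute ownerships.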

Migrate Data

Once your configuration files are complete, you can use TypeDB Loader in one of two ways:

  1. As an executable command line interface - no coding required:
./bin/typedb-loader load \
                -tdb localhost:1729 \
                -c /path/to/your/config.json \
                -db databaseName \
                -cm

See details here

  2. As a dependency in your own Java code:
import com.vaticle.typedb.osi.loader.cli.LoadOptions;
import com.vaticle.typedb.osi.loader.loader.TypeDBLoader;

public class LoadingData {

    public void loadData() {
        String uri = "localhost:1729";
        String config = "path/to/your/config.json";
        String database = "databaseName";

        String[] args = {
                "load",
                "-tdb", uri,
                "-c", config,
                "-db", database,
                "-cm"
        };

        LoadOptions options = LoadOptions.parse(args);
        TypeDBLoader loader = new TypeDBLoader(options);
        loader.load();
    }
}

See details here

Step-by-Step Tutorial

A complete tutorial for TypeDB version >= 2.5.0 is in progress and will be published.

An example of configuration and usage of TypeDB Loader on real data can be found in the TypeDB Examples.

A complete tutorial for TypeDB (Grakn) version < 2.0 can be found on Medium.

There is an example repository for your convenience.

Connecting to TypeDB Cluster

To connect to TypeDB Cluster, a set of options is provided:

--typedb-cluster=<address:port>
--username=<username>
--password // can be asked for interactively
--tls-enabled
--tls-root-ca=<path/to/CA/cert>
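For example, a cluster invocation might look like this (assuming the cluster options combine with the standard load flags; the address and paths are placeholders):

./bin/typedb-loader load \
        --typedb-cluster=localhost:1729 \
        --username=admin \
        --password \
        --tls-enabled \
        --tls-root-ca=/path/to/ca.pem \
        -c /path/to/your/config.json \
        -db databaseName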

Compatibility Table

Ranges are [inclusive, inclusive].

TypeDB Loader    TypeDB Driver (internal)   TypeDB            TypeDB Cloud
1.9.x            2.26.6                     2.25.x -          2.25.x -
1.8.0            2.25.6                     2.25.x -          2.25.x -
1.7.0            2.18.1                     2.18.x - 2.23.x   2.18.x - 2.23.x
1.6.0            2.14.2                     2.14.x - 2.17.x   2.14.x - 2.16.x
1.2.0 - 1.5.x    2.8.0 - 2.14.0             2.8.0 - 2.14.0    N/A
1.1.0 - 1.1.x    2.8.0                      2.8.x             N/A
1.0.0            2.5.0 - 2.7.1              2.5.x - 2.7.x     N/A
0.1.1            2.0.0 - 2.4.x              2.0.x - 2.4.x     N/A
<0.1             1.8.0                      1.8.x             N/A

Find the Readme for GraMi for grakn < 2.0 here

Package hosting

Package repository hosting is graciously provided by Cloudsmith. Cloudsmith is the only fully hosted, cloud-native, universal package management solution that enables your organization to create, store and share packages in any format, to any place, with total confidence.

Contributions

TypeDB Loader was built @Bayer AG in the Semantic and Knowledge Graph Technology Group with the support of the engineers @Vaticle.

Licensing

This repository includes software developed at Bayer AG. It is released under the Apache License, Version 2.0.

Credits

Icon in banner by Freepik from Flaticon


typedb-loader's Issues

Data Table / Column Pre-Processors

As of now, two pre-processors are already implemented:

  • DEFAULT: cleans all column values by removing illegal characters and leading/trailing whitespace
  • splitting list-like column values given a separator

Would like to be able to do more:

  • provide a general pre-processor that takes a regex (see the sketch after this list)
  • collect input from users as to what would be useful
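A general regex pre-processor could be as small as the following sketch. The interface and class names are hypothetical, not the loader's actual API:

import java.util.regex.Pattern;

// Hypothetical pre-processor contract; the loader's real API may differ.
interface ColumnPreProcessor {
    String apply(String rawValue);
}

// Applies a user-supplied regex and replacement to every column value.
class RegexPreProcessor implements ColumnPreProcessor {
    private final Pattern pattern;
    private final String replacement;

    RegexPreProcessor(String regex, String replacement) {
        this.pattern = Pattern.compile(regex);
        this.replacement = replacement;
    }

    @Override
    public String apply(String rawValue) {
        if (rawValue == null) return null;
        return pattern.matcher(rawValue).replaceAll(replacement);
    }
}

For example, new RegexPreProcessor("[^0-9+]", "").apply("+1 (123) 456-789") would normalise a phone-number column to "+1123456789".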

Progress logging has invalid characters

There seems to be an encoding issue when progress is logged while running typedb-loader in a Windows Terminal/PowerShell session - the progress output is rendered with invalid characters (screenshot omitted).

This happens in typedb-loader-1.1.0 on Windows 11.

Optionally log every query

Occasionally, we get an error from the loader that needs to be traced back to the source query that caused the error. For this purpose, it would be good to be able to set TRACE log level and see each query run.

Leverage workflow description language syntax

Problem to Solve

Replace handwritten migrators with type-safe composable workflow syntax.

Current Workaround

Currently, we need to manually manage batching/pooling and do lots of string interpolation for mutations.

Proposed Solution

Write a command extension for OpenWDL (Spec 1.0), or adopt a similar sort of syntax.

Given a schema.tql:

  define
  name sub attribute,
      value string;
  location sub entity,
      abstract,
      owns name @key,
      plays location-hierarchy:superior,
      plays location-hierarchy:subordinate;
  area sub location;
  city sub location;
  country sub location;
  location-hierarchy sub relation,
      relates superior,
      relates subordinate;

We create a workflow description:

import "schema.tql.wdl" # Codegen the AST traversal to spit out the typed struct mapping.

struct LocationBatch {
    Array[Location] locations
    Array[Area] areas
    Array[City] cities
    Array[Country] countries
    Array[LocationHierarchy] location_hierarchies
}

task cast_entity_relations {
  input {
    LocationBatch batch
  }

  command <<<
    typedb_loader <<CODE
    for {name, lat, long} <- %locations% {
      insert
      $location isa location,
        has? %name%,
        has %lat%,
        has %long%;
    }
    CODE
  >>>

  output {
    TypeDBBatch[Location] = read_batch_result(stdout())
  }

  meta {
    concurrency: 32,
    batch_size: 1000
  }
}

This would be a currently valid program description (given a typedb_loader extension). This could also be improved with better syntax surrounding the projection of source -> subgraph for insertion.

Additional Information

The current loader approach certainly works, but it's anything but readable/convenient. This type of solution has the added benefit of composability, so you can have a task for csv->subgraph and a task for subgraph->batch_upsert, etc...

Importing nested attributes

Hi TypeDB-Loaders,
Thank you for open-sourcing such a great tool!

I cannot find any documentation on importing attributes of attributes. The appendAttribute functionality is only documented to work for entities and relations (as far as I can see). Could you please tell me if this is possible or if it is on the roadmap?

I am wondering whether something like the below is possible in the TypeDB Loader config file, where entityA owns attributeA, and attributeA owns attributeB.

"entities": {
    "entityA": {
        "data": [
            "mydata.csv"
        ],
        "insert": {
            "entity": "entityA",
            "ownerships": [
                {
                    "attribute": "attributeA",
                    "column": "attributeA",
                    "ownerships": [
                        {
                            "attribute": "attributeB",
                            "column": "attributeB"
                        }
                    ]
                }
            ]
        }
    }
}

This is of course possible directly in TypeQL (if entityA and attributeA are already loaded), with something like:

match $a isa attributeA; $a "Foo";
insert $a has attributeB "Bar";

It would just be lovely to do it all using TypeDB Loader :)
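Until nested ownerships are supported in the config, the TypeQL workaround above can be scripted against the TypeDB Java driver. A rough sketch (package layout and method names vary across 2.x driver versions - verify against yours):

import com.vaticle.typedb.client.TypeDB;
import com.vaticle.typedb.client.api.TypeDBClient;
import com.vaticle.typedb.client.api.TypeDBSession;
import com.vaticle.typedb.client.api.TypeDBTransaction;
import com.vaticle.typeql.lang.TypeQL;

public class AppendNestedAttribute {
    public static void main(String[] args) {
        try (TypeDBClient client = TypeDB.coreClient("localhost:1729");
             TypeDBSession session = client.session("databaseName", TypeDBSession.Type.DATA);
             TypeDBTransaction tx = session.transaction(TypeDBTransaction.Type.WRITE)) {
            // Match the existing attribute instance by value, then attach
            // the nested attribute to it.
            tx.query().insert(TypeQL.parseQuery(
                    "match $a \"Foo\" isa attributeA; " +
                    "insert $a has attributeB \"Bar\";").asInsert());
            tx.commit();
        }
    }
}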

Many thanks in advance!
Andy

Current Implementation of RelationInsertGenerator.playersMatch does not cast player tokens to corresponding conceptGenerator idValueType

Context

I am trying to load in a relation table that looks like the following:

# tag.csv

label_name,text_id
blue,28974
flower,83682
might,263684
...

Quick and dirty sample schema:

# schema.gql

uid sub attribute,
    value long;

name sub attribute,
    value string;

text sub entity,
    key uid,
    plays tagged;

label sub entity,
    key name,
    plays tagger;

tag sub relation,
    relates tagger,
    relates tagged;

# processorConfig.json

"tagged": {
    "playerType": "text",
    "uniquePlayerId": "uid",
    "idValueType": "long",
    "roleType": "tagged",
    "required": true
}

The label and text entity loaders have worked fine; however, upon attempting to migrate this tag.csv file I am seeing the following error:

java.util.concurrent.ExecutionException: grakn.client.exception.GraknClientException: UNKNOWN: grakn.core.kb.graql.exception.GraqlSemanticException: 
Value 203088375 is not compatible with attribute value type: java.lang.Long. 
Please check server logs for the stack trace.

Suspected Error

I suspected that the match query was not correctly casting the text_id to long and found that currently the token is passed through the cleanToken method here https://github.com/bayer-science-for-a-better-life/grami/blob/f98b2b4e33094eb33275bd7fb6d1cc3676edc51a/src/main/java/generator/RelationInsertGenerator.java#L148
but that there is no further cast statement that I can find.

I was curious why this didn't break on the entity insertion but it seems that you do cast the value in the EntityInsertGenerator here https://github.com/bayer-science-for-a-better-life/grami/blob/f98b2b4e33094eb33275bd7fb6d1cc3676edc51a/src/main/java/generator/GeneratorUtil.java#L64
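A fix along these lines would cast the cleaned token according to the configured idValueType before it is placed into the match clause. Illustrative only - the method and its callers are not the actual grami code:

// Convert a cleaned CSV token into a typed value according to the
// generator's configured idValueType, so the generated match clause
// compares long-to-long instead of string-to-long.
static Object castToIdValueType(String cleanedToken, String idValueType) {
    switch (idValueType) {
        case "long":    return Long.parseLong(cleanedToken);
        case "double":  return Double.parseDouble(cleanedToken);
        case "boolean": return Boolean.parseBoolean(cleanedToken);
        default:        return cleanedToken; // string; datetime handled elsewhere
    }
}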

Problematic .tsv processing

When trying to ingest from .tsv files using Loader 1.4.1 on Ubuntu 20.04, I receive the following error:

[open_alex_0::5] ERROR com.vaticle.typedb.osi.loader.loader - async-writer-4: [THW07] Invalid Thing Write: Attempted to assign a key ',' of type 'id' that had been taken by another 'researcher'.

However, I've reviewed the .tsv and confirmed there are no comma values in this column; all values are open_alex identifiers, which are URLs starting with https.

In my TypeDB Loader config.json file, I have it set to expect tab separators, and it successfully ingests hundreds of thousands of rows.

"separator": "\t",

I confirmed with Python and Pandas that there are no commas in the id column (screenshot omitted).

I considered it being a header issue, since it fails on the 2nd .tsv it goes through and there is one record in the database with a comma for an id (screenshot omitted). However, according to the loader's progress updates, it doesn't fail until over 600,000 rows have been processed (screenshot omitted).

import-task dependency management update

  • add the ability to specify the order of imports. After adding attribute players in relations, one can now break an import by just taking the default ordering of:
  1. independent attributes
  2. entities
  3. relations
  4. nested-relations
  5. appending attributes

because when attributes are players, they can be added in steps 3, 4, and 5 - and it is no longer guaranteed that they are already present...

Solution:

  1. adding independent attributes can and should always happen first
  2. adding entities next can and should always happen second
  3. next would be relations containing only entities
  4. next would be relations containing entities and other relations
  5. next would be appending attributes to existing entities or relations
  6. finally all relations that have attributes as players
  7. there should then be an option to have something be imported "afterwards" - for example, when one has a nested relation that contains attribute players, requiring that step 6 is completed (a sketch of this ordering follows below)
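One way to encode the proposed ordering is a phase enum whose declaration order drives scheduling. This is only a sketch of the idea - the enum and its phase names are hypothetical, not part of the loader:

enum ImportPhase {
    // Ordered so that anything a later phase depends on
    // has already been loaded by an earlier phase.
    INDEPENDENT_ATTRIBUTES,            // 1. always first
    ENTITIES,                          // 2. always second
    ENTITY_ONLY_RELATIONS,             // 3. relations with only entity players
    NESTED_RELATIONS,                  // 4. relations containing relations
    APPEND_ATTRIBUTES,                 // 5. appending to existing things
    RELATIONS_WITH_ATTRIBUTE_PLAYERS,  // 6. requires attributes appended above
    AFTERWARDS                         // 7. user-deferred tasks
}

Import tasks could then be sorted with Comparator.comparingInt(task -> task.phase().ordinal()) before execution.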

Implement fault-tolerant loading and restarting

Goal

One feature that has been removed as of the 1.0.0 version is the ability to stop and/or continue loading the dataset from a particular point. This is particularly useful when loading large datasets that run for many hours/days.

This feature is quite tricky to implement, and it will require new features on TypeDB.

Design

The thread reading from the CSV should apply a deterministic counter/ID to each batch that it reads. This should be supplied to the consumer threads which are going to write each row batch in a transaction.

At the start of a transaction, we need to write a globally unique Transaction ID (not available at this time) and the batch ID to a durable log - essentially a write ahead log. We then proceed to write and commit the data to TypeDB. If the commit succeeds, the row batch ID is added to a durable set of committed batches and removed from the WAL. If it fails, we remove it from the WAL only. Otherwise, if the program crashes or the state is uncertain, on restarting the data load the server has to be queried to check if the transaction ID was successfully committed or not (this API is not yet available).

There are bits we can optimise, such as compacting the set of committed batch IDs, etc.
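A file-based sketch of that bookkeeping follows. Since the transaction-ID APIs mentioned above do not exist yet, this only illustrates the WAL/committed-set mechanics, with hypothetical file names:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Logs a batch before its transaction starts and records it as committed
// once the commit succeeds; recovery re-reads both files on restart.
class BatchLog {
    private final Path wal = Paths.get("loader.wal");
    private final Path committed = Paths.get("loader.committed");

    void beginBatch(long batchId, String txnId) throws IOException {
        Files.writeString(wal, batchId + " " + txnId + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void commitBatch(long batchId) throws IOException {
        Files.writeString(committed, batchId + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        // A real implementation would also remove the WAL entry here and
        // periodically compact the committed set.
    }
}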

Improved error message for faulty row tokenisation

Users encounter errors such as:

[...] skipped row 59213 b/c does not have a proper <isa> statement or is missing required attributes. Faulty tokenized row: [...]

However, it is unclear exactly which part of the row is faulty. If the issue is that a required attribute is missing, we should be able to print the number/name of the column that is required but missing.
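The improvement could be as simple as checking every required column before tokenisation and naming the blank ones in the skip message. An illustrative sketch, not the loader's actual validation code:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Returns the required columns that are missing or blank in a row, so the
// skip message can state exactly which column caused the row to be skipped.
static List<String> missingRequiredColumns(Map<String, String> row,
                                           List<String> requiredColumns) {
    List<String> missing = new ArrayList<>();
    for (String column : requiredColumns) {
        String value = row.get(column);
        if (value == null || value.isBlank()) missing.add(column);
    }
    return missing;
}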

Logging of insertion errors failing on Windows

When a record fails insertion, I believe a log message should be written to *_invalid.log.
On a Windows machine, the formatting of the timestamp leads to an invalid file path being generated (: characters are not valid in Windows file paths), and the logger is unable to create the file to record the insertion failure.

An example of the exception is as follows:

00:31:46.394 [main] INFO  com.bayer.dt.tdl.loader - buffered-read: total: 2,929, rate: 29,585.86/s
java.io.FileNotFoundException: 2022-06-12-00:31:46\files_invalid.log (The filename, directory name, or volume label syntax is incorrect)
        at java.base/java.io.FileOutputStream.open0(Native Method)
        at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:158)
        at java.base/java.io.FileWriter.<init>(FileWriter.java:82)
        at io.FileLogger.logInvalid(FileLogger.java:64)
        at generator.EntityGenerator.write(EntityGenerator.java:70)
        at loader.AsyncLoaderWorker.lambda$asyncWrite$0(AsyncLoaderWorker.java:415)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
        at loader.AsyncLoaderWorker.lambda$asyncWrite$1(AsyncLoaderWorker.java:413)
        at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
(the same FileNotFoundException and stack trace repeat several times in the log)
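The likely fix is to format the log-directory timestamp without the : characters that Windows forbids in paths. A minimal sketch:

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// "2022-06-12-00:31:46" fails on Windows because ':' is illegal in file
// paths; a dash-separated time component is safe on every platform.
static String logDirectoryName() {
    return LocalDateTime.now()
            .format(DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss"));
}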

relation players in relation

currently, only entities can be players in a relation - this needs to be extended to, and tested for, relations and attributes as players.

clearer error messages for end-users

Error messages in Grami are a little vague.

Example: com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: Expected name at line 135 column 10 path $.[4].conceptGenerators

Something is wrong with a .json file, but which one? There are 4 at play during ingest.

Add attributes to existing entities/relations

Given already existing entities/relations in your graph, do match/insert to add attributes:

match $entity isa entity, has identifying-existing-attribute "read from data table with new attributes";
insert $entity has new-attribute "read from data table with new attributes", has new-attribute-2 "...", ...;

This will require adjusting the processor and data configs, plus a new query builder.
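The new query builder would essentially interpolate the identifying value and the new attribute values from each data row into that match/insert pattern. A rough sketch (not the loader's actual builder; real code would also escape the interpolated values):

// Build a match/insert query that appends an attribute to an existing
// entity identified by one of its attribute values.
static String buildAppendQuery(String entityType, String idAttribute, String idValue,
                               String newAttribute, String newValue) {
    return "match $e isa " + entityType + ", has " + idAttribute + " \"" + idValue + "\"; " +
           "insert $e has " + newAttribute + " \"" + newValue + "\";";
}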

rules management

  • have a separate file for managing rules (define and undefine) to dynamically manage rules, their relations, etc. in an already existing database

Wiki update for: 10 TypeDB Loader as Executable CLI

Hi Folks,
The current wiki on running TypeDB Loader as an executable CLI doesn't work for me. There appears to be a missing dash and an extra comma in the command. Any chance of updating the wiki accordingly to help new users?

It works for me when I use the following (using macOS Ventura 13.1 on Apple M1 Max, however, also had the same issue on an Intel Mac last year):

./bin/typedb-loader load \
        -tdb localhost:1729 \
        -c /path/to/your/config.json \
        -db databaseName \
        -cm

Currently, it states:

./bin/typedbloader load \
        -tdb localhost:1729, \
        -c /path/to/your/config.json \
        -db databaseName \
        -cm

Cheers,
Andy

Support multiple files per entity/relationship type

I have hundreds of files for a dataset with ~100 million records, where in many cases a single entity type's records are spread across 20 different files. From the dataConfig.json example, it appears I can only associate a single .csv file with a given entity/relationship type at a time. It would be quite nice to be able to configure multiple dataPaths for all the files belonging to a single entity type, rather than having to run 20 different config files separately.
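Since "data" is already an array in the configuration (see the entity example earlier on this page), the natural shape would be to list every file for the type. Illustrative only - whether multiple entries are supported depends on your loader version:

"entities": {
    "entityA": {
        "data": [
            "part-01.csv",
            "part-02.csv",
            "part-03.csv"
        ],
        "insert": {
            "entity": "entityA",
            "ownerships": [
                { "attribute": "attributeA", "column": "attributeA" }
            ]
        }
    }
}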

Generators can't validate query with updated TypeQL version

It appears that as the TypeDB Java client was updated and the TypeQL version changed, the string representation of a query changed. There are places where the generators look for a substring in the query (e.g. RelationGenerator.java:104) that can no longer appear. The loader ends up silently ignoring the data, generating a query like

insert
$null isa null,
    has null "null";

More specific WARN logging

Request that the specific row number be captured in grami-log.log when there is an issue.

For example- in a dataset of 44 million rows:

10:49:06.448 [main] INFO com.bayer.dt.grami - processed 4950k rows

10:49:06.458 [main] WARN com.bayer.dt.grami.data - current row has column of type <datetime> with non-<ISO 8601 format> datetime value:

10:49:06.458 [main] WARN com.bayer.dt.grami.data - Text 'pub_date' could not be parsed at index 0

10:49:07.438 [main] INFO com.bayer.dt.grami - processed 5000k rows

While this is certainly helpful, I have to review 50,000 rows to find the specific problem.
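Threading the source row number through to the logger would make these warnings actionable. A minimal sketch, assuming an SLF4J logger and a caller that tracks the row counter:

// Illustrative: include the exact source row in the warning instead of
// leaving the user to search a 50,000-row window.
static void warnBadDatetime(org.slf4j.Logger log, long rowNumber,
                            String column, String rawValue) {
    log.warn("row {}: column <{}> has non-ISO-8601 datetime value: {}",
            rowNumber, column, rawValue);
}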
