kgtk's Introduction

KGTK: Knowledge Graph Toolkit

The Knowledge Graph Toolkit (KGTK) is a comprehensive framework for the creation and exploitation of large hyper-relational knowledge graphs (KGs), designed for ease of use, scalability, and speed. KGTK represents KGs in tab-separated (TSV) files with four columns: edge-identifier, head, edge-label, and tail. All KGTK commands consume and produce KGs represented in this simple format, so they can be composed into pipelines to perform complex transformations on KGs. KGTK provides:

  • a suite of import commands for importing Wikidata, RDF, and popular graph representations into KGTK format;
  • a rich collection of transformation commands that make it easy to clean, union, filter, and sort KGs;
  • graph combination commands that support efficient intersection, subtraction, and joining of large KGs;
  • a query language, a variant of Cypher optimized for querying KGs stored on disk, that supports efficient ad hoc queries;
  • graph analytics commands that support scalable computation of centrality metrics such as PageRank, degrees, connected components, and shortest paths;
  • advanced commands that support lexicalization of graph nodes and computation of multiple variants of text and graph embeddings over the whole graph;
  • a suite of export commands that transform KGTK KGs into commonly used formats, including the Wikidata JSON format, RDF triples, JSON documents for ElasticSearch indexing, and graph-tool;
  • a development environment using Jupyter notebooks that provides seamless integration with Pandas.

KGTK can process Wikidata-sized KGs with billions of edges on a laptop. We have used KGTK in multiple use cases, focusing primarily on construction of subgraphs of Wikidata, analysis of over 300 Wikidata dumps since the inception of the Wikidata project, linking tables to Wikidata, construction of a commonsense KG combining multiple existing sources, and creation of Wikidata extensions for food security and the pharmaceutical industry.

KGTK is open source software, well documented, actively used and developed, and released using the MIT license. We invite the community to try KGTK. It is easy to get started with our tutorial notebooks available and executable online.

Installation

The following instructions install KGTK and the KGTK Jupyter Notebooks on Linux and MacOS systems.

If you want to install KGTK on a Microsoft Windows system, please
contact the KGTK team.

Our KGTK installations use a Conda virtual environment. If you don't have the Conda tools installed, follow this guide to install them. We recommend installing Miniconda rather than the full Anaconda distribution.

Next, execute the following steps to install the latest stable release of KGTK:

conda create -n kgtk-env python=3.9
conda activate kgtk-env
conda install -c conda-forge graph-tool
conda install -c conda-forge jupyterlab
pip --no-cache install -U kgtk

Please see our installation document for more details. If you encounter problems with your installation, or are interested in a detailed explanation of these commands, read more about the installation procedure here.

Installation issues on Macbooks with M1 chip

Running pip install -e . (development mode) fails with errors about three libraries:

  1. thinc
  2. blis
  3. tokenizers

The thinc issue was fixed by:

a. commenting out [this line in requirements.txt](https://github.com/usc-isi-i2/kgtk/blob/dev/requirements.txt#L11)

b. running `pip install thinc-apple-ops`

The tokenizers issue was fixed by running the following commands in the conda environment:

# Download and install Rust; follow the on-screen instructions.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python/
pip install setuptools_rust
python setup.py install

Then continue installing KGTK with pip install -e .

Installing KGTK with Docker

Please refer to this document for installing KGTK with Docker

Getting started

Online Documentation

You can read our latest documentation online with:

https://kgtk.readthedocs.io/en/latest/

KGTK Notebooks

For examples of using KGTK, please see our Tutorial Notebooks.

Releases

KGTK Text Search API

The documentation for the KGTK Text Search API is here

KGTK Semantic Similarity API

The documentation for the KGTK Semantic Similarity API is here

How to cite

@inproceedings{ilievski2020kgtk,
  title={{KGTK}: A Toolkit for Large Knowledge Graph Manipulation and Analysis},
  author={Ilievski, Filip and Garijo, Daniel and Chalupsky, Hans and Divvala, Naren Teja and Yao, Yixiang and Rogers, Craig and Li, Ronpeng and Liu, Jun and Singh, Amandeep and Schwabe, Daniel and Szekely, Pedro},
  booktitle={International Semantic Web Conference},
  pages={278--293},
  year={2020},
  organization={Springer},
  url={https://arxiv.org/pdf/2006.00088.pdf}
}

kgtk's People

Contributors

aidankelley, bhatiadivij, bin-go2, chalypso, ckxz105, cmungall, craigmilorogers, dangiankit, dgarijo, filievski, g1eb, grantxie, greatyyx, kartik2112, kyao, naren954, nicklein, rijulvohra, rongpenl, saggu, shashank73744, shreya027, szeke, thadguidry

kgtk's Issues

Clarify documentation of --output-stats in graph_statistics

The documentation says:

--output-stats        do not output the graph but statistics only

It is unclear what this option means; the documentation should clarify it.

I think we need the following options:

  • ability to put the global statistics of the graph in a file
  • ability to echo the original graph, adding the statistics edges at the end
  • ability to only output the statistics edges

Date Validation

The command kgtk generate_wikidata_triples is dropping dates found on Wikidata.

This is the content of ignored.log file:

Corrupted statement at line number: 7145 with id  with current corrupted id None
Corrupted statement at line number: 7146 with id  with current corrupted id None
Corrupted statement at line number: 29804 with id  with current corrupted id None
Corrupted statement at line number: 32151 with id  with current corrupted id None
Corrupted statement at line number: 63640 with id  with current corrupted id None

The corresponding lines are:

Q201000001      P580    ^-4000-01-01T00:00:00Z/11
Q201000001      P582    ^49500-01-01T00:00:00Z/11
Q201000028      P582    ^5675674-01-01T00:00:00Z/11
Q201000031      P582    ^20080-01-01T00:00:00Z/9
Q201000054      P580    ^-0499-01-01T00:00:00Z/11
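
If dates like these are meant to be accepted, a more tolerant parser is one possible direction. The following is a minimal sketch, not the code used by generate_wikidata_triples. It assumes the literal shape seen in the lines above (a leading ^, a signed year of four or more digits, and an optional /precision suffix); the names DATE_RE and parse_date_literal are hypothetical.

```python
import re

# Assumed shape: ^<year>-MM-DDTHH:MM:SS[Z][/<precision>]
# The year may carry a sign and may have more than four digits.
DATE_RE = re.compile(
    r"^\^(?P<year>[+-]?\d{4,})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"T(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})Z?"
    r"(?:/(?P<precision>\d{1,2}))?$"
)


def parse_date_literal(value):
    """Return the date fields as integers, or None if the literal is malformed."""
    match = DATE_RE.match(value)
    if match is None:
        return None
    fields = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    # Month and day 0 are tolerated here on the assumption that they mark
    # low-precision dates; anything above the calendar range is rejected.
    if fields["month"] > 12 or fields["day"] > 31:
        return None
    return fields


for literal in ("^-4000-01-01T00:00:00Z/11", "^5675674-01-01T00:00:00Z/11"):
    print(literal, parse_date_literal(literal))
```

Python's datetime type cannot represent most of these years, which is why the sketch returns plain integers rather than a datetime object.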

`import-wikidata` fails on early json dump of wikidata.

Describe the bug
import-wikidata fails on early json dump of wikidata.

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://archive.org/download/wikidata-json-20150330
  2. Download the json dump.
  3. Unzip it
  4. Run kgtk import-wikidata -i 20150330.json --node nodefile.tsv --edge edgefile.tsv --qual qualfile.tsv --procs 1 --debug --lang en

Expected behavior
Three kgtk files will be generated. The file size should be more than 1GB in general.

Actual behavior

An error was reported, but the program didn't stop. The generated KGTK files are only about 100K.

Screenshots
Screen Shot 2020-08-03 at 10 54 01 AM

Example toy KGTK file

It would be great to have an example toy graph to demonstrate the different features of KGTK without having to wait a long time when dealing with a KG.

This toy KGTK file could also be used for quick testing of all commands on each release.
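
As a starting point, a toy graph could be as small as a handful of edges. The sketch below writes one in the tab-separated layout with id, node1, label, and node2 columns; the edges themselves and the function name write_toy_graph are invented for illustration and are not an official KGTK sample.

```python
import io

# Invented edges: (id, node1, label, node2)
TOY_EDGES = [
    ("e1", "alice", "knows", "bob"),
    ("e2", "bob", "knows", "carol"),
    ("e3", "alice", "works_at", "acme"),
    ("e4", "carol", "works_at", "acme"),
]


def write_toy_graph(stream):
    """Write the toy edges to a text stream as a tab-separated edge file."""
    stream.write("\t".join(("id", "node1", "label", "node2")) + "\n")
    for edge in TOY_EDGES:
        stream.write("\t".join(edge) + "\n")


buffer = io.StringIO()
write_toy_graph(buffer)
print(buffer.getvalue(), end="")
```

Passing an open file instead of the StringIO buffer would produce a small .tsv that runs through any command in well under a second.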

Error in remove_columns when removing multiple columns, separated by space

I am using the following command to remove the exploded columns from a Wikidata file:

kgtk remove_columns -c 'id,rank,node2;magnitude,node2;unit,node2;date,node2;item,node2;lower,node2;upper,node2;latitude,node2;longitude,node2;precision,node2;calendar,node2;entity-type'

If I put spaces between the column names, e.g., id, rank, node2;magnitude, it doesn't work.

This command also has a non-standard -dt option.
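
One way to tolerate spaces after the commas is to strip each name after splitting. This is a minimal sketch of that idea, not the parsing code remove_columns actually uses; parse_column_list is a hypothetical helper.

```python
def parse_column_list(spec):
    """Split a comma-separated column list, ignoring spaces around names."""
    # Empty entries (for example from a trailing comma) are dropped.
    return [name.strip() for name in spec.split(",") if name.strip()]


print(parse_column_list("id, rank, node2;magnitude"))
```

The semicolon in names such as node2;magnitude is left untouched, because only the comma is treated as a separator.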

Exception in clean_data while cleaning wikidata_qualifiers_20200504.tsv.gz

The file wikidata_qualifiers_20200504.tsv.gz seems to have a large number of errors (reported by Itay), I tried to run clean_data on it and it threw an exception:

(kgtk-env) D22ML-PSZEKELY:wikidata-20200504 pedroszekely$ gzcat wikidata_qualifiers_20200504.tsv.gz | pv -p -s 24077191262 | kgtk clean_data --repair-month-or-day-zero 2> ~/Downloads/kgtk-err.txt | gzip > ~/data/wikidata-20200504/wikidata_qualifiers_20200504-clean.tsv.gz
 0.21% | 14570 ETA | 48.75MB Transferred | 1.57MB/sevents.js:186
      throw er; // Unhandled 'error' event
      ^

Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:87:16)
Emitted 'error' event on Socket instance at:
    at Socket.onerror (/usr/local/lib/node_modules/pv/node_modules/readable-stream/lib/_stream_readable.js:640:52)
    at Socket.emit (events.js:209:13)
    at errorOrDestroy (internal/streams/destroy.js:107:12)
    at onwriteError (_stream_writable.js:449:5)
    at onwrite (_stream_writable.js:470:5)
    at internal/streams/destroy.js:49:7
    at Socket.dummyDestroy [as _destroy] (internal/process/stdio.js:7:3)
    at Socket.destroy (internal/streams/destroy.js:37:8)
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:88:12) {
  errno: 'EPIPE',
  code: 'EPIPE',
  syscall: 'write'
}

Update connected components command to understand node/edge file conventions

The connected components command uses non-standard options to specify the columns to operate on:

--noheader: Option to specify that the input file does not contain a header.
--subj {integer}: Column in which the subject is given. Default: 0.
--pred {integer}: Column in which the predicate is given. Default: 1.
--obj {integer}: Column in which the object is given. Default: 2.

The task is to update the command to follow the same conventions as the others.

Change format of property definitions for generate Wikidata triples

The currently supported file format looks like this:

node1 label node2 data_type
P2020008 label "attributed to software"@en string
P2020008 description software used to extract information from an article or text fragment  
P2020013 label "vertex in-degree"@en quantity
P2020013 description measure of vertex in degree in a graph

We want it to look like this, where data_type is not another column but a label. This makes the properties file an edge file that can be used in KGTK:

node1 label node2
P2020001 label "text Fragment"@en
P2020001 description "text fragment of a scholarly article"@en
P2020001 data_type item
P2020008 label "attributed to software"@en
P2020008 description "software used to extract information from an article or text fragment"@en
P2020008 data_type string
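
A conversion between the two layouts could look roughly like the sketch below. It assumes the old rows have already been split into four fields (node1, label, node2, data_type), with an empty data_type where none is given. The function name convert_property_rows is hypothetical, and the sketch does not add the quoting or @en suffix that the target descriptions carry.

```python
def convert_property_rows(rows):
    """Turn (node1, label, node2, data_type) rows into three-column edges."""
    edges = {}       # node1 -> list of (label, node2), in input order
    data_types = {}  # node1 -> first non-empty data_type seen
    for node1, label, node2, data_type in rows:
        edges.setdefault(node1, []).append((label, node2))
        if data_type:
            data_types.setdefault(node1, data_type)

    converted = []
    for node1, pairs in edges.items():
        converted.extend((node1, label, node2) for label, node2 in pairs)
        # The former column becomes one extra edge per property.
        if node1 in data_types:
            converted.append((node1, "data_type", data_types[node1]))
    return converted


old_rows = [
    ("P2020008", "label", '"attributed to software"@en', "string"),
    ("P2020008", "description", "software used to extract information", ""),
]
for edge in convert_property_rows(old_rows):
    print("\t".join(edge))
```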

Implode not working for coordinates

Describe the bug
Node2 of coordinate property P625 is missing after implosion.

To Reproduce
Run kgtk implode command:

unzip files.zip
kgtk implode exploded.tsv --remove-prefixed-columns True --without si_units language_suffix > imploded.tsv

files.zip

Expected behavior
Node2 of P625 should contain coordinate value

Desktop (please complete the following information):
KGTK 0.3.2

Kgtk Timing

It helps to know how long a KGTK command took to execute. Rather than relying on the vagaries of command shells, or adding this feature to each individual command, provide kgtk --timing support in the top-level command dispatcher. Ideally, when executing a KGTK pipe, it will report the timing for each section of the pipe.
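
Inside a dispatcher, per-command timing can be done with a small context manager that wraps each command invocation. The sketch below is only an illustration of that approach; the name timed and the output format are assumptions, not part of KGTK.

```python
import sys
import time
from contextlib import contextmanager


@contextmanager
def timed(name, stream=None):
    """Report the wall-clock time of the wrapped block on stream (default stderr)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        # Timing goes to stderr by default so it never mixes with piped data.
        print(f"{name}: {elapsed:.3f}s", file=stream or sys.stderr)


with timed("sort"):
    sorted(range(100000), reverse=True)
```

Writing to stderr matters for pipes, since stdout carries the graph from one section of the pipe to the next.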

import triples does not import numbers

When importing the following triples all the triples with numbers are skipped in the output.

<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://www.tablenet.l3s.uni-hannover.de/TableNet#Table> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#hasTableID> 5020573 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#numberOfColumns> 2 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#numberOfRows> 1 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#document> <http://en.wikipedia.org/wiki/Metropolis_Gold%23> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <http://purl.org/dc/terms/source> <http://dbepdia.org/resource/Metropolis_Gold> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#resourceURL> <http://www.tablenet.l3s.uni-hannover.de/TableNet/json/5020573> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/column/5020573_0_Professional_ratings> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.tablenet.l3s.uni-hannover.de/TableNet#Column> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/column/5020573_0_Professional_ratings> <https://www.tablenet.l3s.uni-hannover.de/TableNet#columnPosition> 0 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/column/5020573_0_Professional_ratings> <https://www.tablenet.l3s.uni-hannover.de/TableNet#hasLevel> 0 .

The output:

node1	label	node2
tn-table:5020573	rdf:type	tn:Table
tn-table:5020573	tn:document	wiki:Metropolis_Gold%23
tn-table:5020573	dc:source	db:Metropolis_Gold
tn-table:5020573	tn:resourceURL	tn-json:5020573
tn-column:5020573_0_Professional_ratings	rdf:type	n1:Column
tn-column:5020573_0_Professional_ratings	tn:columnName	"Professional_ratings"

Make It Easier to Export KGTK Correctly

We need some tools and guidelines to assist with exporting KGTK to other formats. For example, a destringify routine that removes the escape (\) from the vertical bar (pipe) (|) character would help.
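
A destringify routine might look like the following sketch. It assumes that a string value is wrapped in double quotes, may carry a language suffix such as @en after the closing quote, and escapes the pipe with a backslash; the function is hypothetical and handles only the pipe escape.

```python
def destringify(value):
    """Remove the enclosing quotes and the pipe escape from a quoted value."""
    if not value.startswith('"'):
        return value  # symbols and numbers are returned unchanged
    closing = value.rfind('"')
    if closing == 0:
        return value  # no closing quote; leave malformed input alone
    body = value[1:closing]  # anything after the closing quote is dropped
    return body.replace("\\|", "|")


print(destringify('"left\\|right"@en'))
```

Other escape sequences, and an escaped backslash directly before a pipe, would need additional handling in a real exporter.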

[Dockerfiles] Do not run notebooks as root

It is not a good practice to run the notebooks as root in the container. The following lines address the problem:

ARG NB_USER=jovyan
ARG NB_UID=1000
ENV USER ${NB_USER}
ENV NB_UID ${NB_UID}
ENV HOME /home/${NB_USER}

RUN adduser --disabled-password \
    --gecos "Default user" \
    --uid ${NB_UID} \
    ${NB_USER}
    
COPY . ${HOME}
USER root
RUN chown -R ${NB_UID} ${HOME}
RUN chown -R ${NB_UID} kgtk
USER ${NB_USER}

This was taken from the mybinder documentation. The invocation command for the notebook should probably be updated too.

unit test not passed on text embedding

Describe the bug
When running the text embedding unit test, the test raises an error like this:

======================================================================
ERROR: test_vector (test_embedding.TestEmbedding)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/usc-isi-i2/kgtk/kgtk/tests/test_embedding.py", line 10, in test_vector
    assert cli_entry("kgtk", "text_embedding", test_input, "--use-cache", "false", "--embedding-projector-metadata-path", "none", "--property-value", "P1629", "P1466") == 0
  File "/home/travis/virtualenv/python3.7.7/lib/python3.7/site-packages/kgtk/cli_entry.py", line 73, in cli_entry
    mod = importlib.import_module('.{}'.format(h), 'kgtk.cli')
  File "/opt/python/3.7.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/travis/virtualenv/python3.7.7/lib/python3.7/site-packages/kgtk/cli/connected-components.py", line 7, in <module>
    from kgtk.gt.connected_components import ConnectedComponents
  File "/home/travis/virtualenv/python3.7.7/lib/python3.7/site-packages/kgtk/gt/connected_components.py", line 10, in <module>
    from graph_tool.topology import label_components
ModuleNotFoundError: No module named 'graph_tool'
----------------------------------------------------------------------
Ran 11 tests in 12.109s
FAILED (errors=1)
The command "coverage run -m unittest discover" exited with 1.

For details, please refer to https://travis-ci.org/github/usc-isi-i2/kgtk/builds/713131958
This seems to happen because graph-tool is not installed in the unit test environment.
If it is installed (e.g., when testing locally), the unit test passes.

To Reproduce
Run the unit test with text embedding

Expected behavior
The unit test finishes successfully.

Desktop (please complete the following information):

  • OS: MacOS 10.14.6

Create notebooks for each command

Some commands' documentation is not up to date. For example, @szeke reports that the lift documentation had examples that no longer work.

We should have notebooks that illustrate how to use KGTK for each command. These notebooks can inform tests for the toolkit.

Command Option Syntax Is Confusing

Users are confused about where to put positional arguments, typically file names. If they place them after an option that takes a varying number of arguments, such as a column list, the file name is absorbed into the list.
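
The effect is easy to reproduce with Python's argparse. The parser below is a stand-in written for this illustration, not KGTK's real option setup: it has one optional positional file name and one option that takes a varying number of column names.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="demo")
    # Optional positional input file; "-" stands for stdin.
    parser.add_argument("input_file", nargs="?", default="-")
    # Option that accepts one or more column names.
    parser.add_argument("--columns", nargs="+", default=[])
    return parser


parser = build_parser()

# File name placed after the column list: it is absorbed into the list.
absorbed = parser.parse_args(["--columns", "id", "rank", "graph.tsv"])
print(absorbed.columns, absorbed.input_file)

# File name placed before the option: it is parsed as the input file.
separated = parser.parse_args(["graph.tsv", "--columns", "id", "rank"])
print(separated.columns, separated.input_file)
```

In the first call the parser silently falls back to the default input, so the user gets no error pointing at the misplaced file name.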

Add --left-prefix and --right-prefix to `kgtk join`

kgtk join currently supports --prefix for right-graph additional column name prefixing. When using kgtk join to create horizontally-joined records (kgtk join / compact), it would be useful to be able to apply a prefix to the left-graph additional column names.

Add left-graph column name prefixing, supporting --right-prefix and --left-prefix. Retain --prefix for compatibility.

kgtk connected-components Cannot Process Stdin

kgtk/gt/connected-components passes the input path to both KgtkReader (to get the header line) and to graph_tool.load_from_csv(...). The input path is "-" when we want to read stdin.

  1. graph_tool.load_from_csv(...) isn't prepared to use "-" as an indicator to read stdin
  2. I suspect that the no_header logic is wrong, because KgtkReader will read the header line from stdin before load_from_csv can access the input stream.
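
One possible workaround, sketched below under the assumption that both readers only need a real file path, is to spool stdin into a temporary file once and hand that path to each of them. The helper materialize_input is hypothetical, and the caller is responsible for deleting the temporary file.

```python
import shutil
import sys
import tempfile


def materialize_input(path, stdin=None):
    """Return a readable file path, copying stdin to a temp file when path is "-"."""
    if path != "-":
        return path
    source = stdin if stdin is not None else sys.stdin
    with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
        shutil.copyfileobj(source, tmp)
        return tmp.name


print(materialize_input("graph.tsv"))
```

Because the whole stream is on disk before either reader starts, the header line is still available to the second reader. The cost is one extra copy of the input.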

Examples should be updated

The kgtk code has changed so rapidly in the last 3 weeks that some of the commands used in the notebooks are no longer valid.

The notebooks should be updated accordingly.

cli branch unable to be used now because of import error

Calling any kgtk CLI command fails with an import error:

(gpt2) ➜  kgtk git:(feature/cli) ✗ kgtk dummy 
Traceback (most recent call last):
  File "/Users/minazuki/miniconda3/envs/gpt2/bin/kgtk", line 11, in <module>
    load_entry_point('kgtk', 'console_scripts', 'kgtk')()
  File "/Users/minazuki/Desktop/studies/master/2018Summer/DSBOX_2019/kgtk/kgtk/cli_entry.py", line 70, in cli_entry
    mod = importlib.import_module('.{}'.format(h), 'kgtk.cli')
  File "/Users/minazuki/miniconda3/envs/gpt2/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 674, in exec_module
  File "<frozen importlib._bootstrap_external>", line 781, in get_code
  File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/minazuki/Desktop/studies/master/2018Summer/DSBOX_2019/kgtk/kgtk/cli/wikidata_nodes_import.py", line 46
    from __future__ import unicode_literals
    ^
SyntaxError: from __future__ imports must occur at the beginning of the file
(gpt2) ➜  kgtk git:(feature/cli) ✗ 

Improve wikidata triple generation.

  1. Handle external identifiers and URIs separately rather than grouping them into the StringValue class.
  2. Integrate the Wikidata properties as internal data. This way, users only need to specify their own properties, with very large identifier numbers, in prop_types.tsv.

Numeric sorting

Describe the bug
Sort does not allow for numeric sorting, it always performs string sorting.

To Reproduce
Steps to reproduce the behavior:

  1. Jun tried to compute PageRank by using graph_statistics
  2. then he used kgtk sort on the resulting PageRank statistics

Expected behavior
The sort should accept a flag --numeric that makes it perform a numeric rather than lexical sort.

Desktop (please complete the following information):
Jun's laptop (don't know the specs)
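
The difference between the two orderings shows up even on three rows. In this sketch the rows and PageRank values are invented, and sort_rows is a hypothetical helper rather than the kgtk sort implementation.

```python
def sort_rows(rows, column, numeric=False):
    """Sort rows by one column, as text or as floating-point numbers."""
    if numeric:
        return sorted(rows, key=lambda row: float(row[column]))
    return sorted(rows, key=lambda row: row[column])


# Invented values for illustration.
rows = [
    ("Q1", "vertex_pagerank", "0.5"),
    ("Q2", "vertex_pagerank", "10.0"),
    ("Q3", "vertex_pagerank", "2.25"),
]

print([row[0] for row in sort_rows(rows, 2)])                # lexical
print([row[0] for row in sort_rows(rows, 2, numeric=True)])  # numeric
```

The lexical sort places "10.0" before "2.25" because it compares character by character, which is the behavior described in this report.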

Update instructions: How to update KGTK?

We have no instructions stating how to update an already installed version of KGTK.
In theory it should be as easy as:

git clone https://github.com/usc-isi-i2/kgtk/
cd kgtk
python setup.py install

Sorting Order Disagreement

Python's sorting function sorts in the following order:

P560 P580-count 2 21
P5603 P580-count 2 845
P5607 P580-count 326 25788

The GNU sort command sorts in the following order, unless the environment variable LC_ALL is set to C:

P5603 P580-count 2 845
P5607 P580-count 326 25788
P560 P580-count 2 21

This means that the output of kgtk sort probably doesn't satisfy the sorting requirements for optimization in kgtk lift.
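
Python compares strings by code point, which for these rows gives the same result as a byte-wise comparison. The sketch below reproduces the Python order, assuming the fields are separated by tabs.

```python
# The three rows from above, with tab-separated fields (an assumption).
rows = [
    "P5607\tP580-count\t326\t25788",
    "P560\tP580-count\t2\t21",
    "P5603\tP580-count\t2\t845",
]

# Default Python ordering: code-point comparison, so the tab after "P560"
# sorts before the digit "3" in "P5603".
python_order = sorted(rows)

# Explicit byte-wise ordering on the UTF-8 encoding.
byte_order = sorted(rows, key=lambda row: row.encode("utf-8"))

for row in python_order:
    print(row.split("\t")[0])
```

A locale-aware collation can weigh the separator differently, which would explain why the two tools disagree unless the locale is forced to C.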

language_suffix Can't Be Matched with field_value

Language_suffix can't be matched with field_value because the language suffix values start with a dash, so the comparison values look like invalid numbers. Even if the dash were removed, there would be confusion because Wikidata allows all-numeric suffixes.

The best solution is probably the rule that if the comparison value is a string (not a symbol), compare the field value to the contents of the string, since field values can't themselves be strings.

Filter should use the header parser by Craig

Describe the bug
The header of the file currently has to be node1, label, node2, and this can only be overridden by setting the columns explicitly.

To Reproduce
Given a sample file:

node1   property        node2   id
Q8      P31     Q331769 Q8-P31-0
Q8      P31     Q60539479       Q8-P31-1
Q8      P31     Q9415   Q8-P31-2
Q8      P1343   Q20743760       Q8-P1343-3
Q8      P1343   Q1970746        Q8-P1343-4
Q8      P1343   Q19180675       Q8-P1343-5
Q8      P461    Q169251 Q8-P461-6
Q8      P279    Q16748867       Q8-P279-7
Q8      P460    Q935526 Q8-P460-8

You can filter successfully as follows:
kgtk filter -p " ; vertex_pagerank ; " --pred property wikidata-pagerank-degrees_all.tsv > wikidata-pagerank-only.tsv

But if you run the following, you get nothing:
kgtk filter -p " ; vertex_pagerank ; " wikidata-pagerank-degrees_all.tsv > wikidata-pagerank-only.tsv

Expected behavior
The filter command should employ Craig's parser. It should at least accept all aliases for the three key columns. It should (probably) also use the first three columns otherwise.
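
The expected behavior can be sketched as alias lookup with a positional fallback. The alias table here is hypothetical: only "property" comes from the sample file above, and the real set of aliases is whatever the existing header parser defines.

```python
# Hypothetical alias table; "property" is taken from the sample header.
ALIASES = {
    "node1": {"node1"},
    "label": {"label", "property"},
    "node2": {"node2"},
}


def resolve_key_columns(header):
    """Map node1/label/node2 to column indexes, by alias or by position."""
    resolved = {}
    for canonical, names in ALIASES.items():
        for index, column in enumerate(header):
            if column in names:
                resolved[canonical] = index
                break
    if len(resolved) < len(ALIASES):
        # Fallback: treat the first three columns as the key columns.
        return {"node1": 0, "label": 1, "node2": 2}
    return resolved


print(resolve_key_columns(["node1", "property", "node2", "id"]))
```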

Desktop (please complete the following information):

  • OS: iOS
  • Browser: chrome
  • Version 83.

Smartphone (please complete the following information):

  • Device: macbook
  • OS: Catalina
  • Browser: chrome
  • Version 10.15

Additional context
Jun Liu reported the issue, during the June 4th 2020 meeting.

Usage instructions

The readme file does not specify how to use the tool. There are examples in the examples folder, but they are not very informative.

Also, is it possible to use kgtk with graph-tool within its Docker image?

kgtk connected-components Will Fail on Certain Strings

If the KGTK input file contains strings (or symbols) with internal double quotes, failure is likely because load_from_csv isn't passed the csv_options needed to use an escape character (\") instead of doubling ("") to process the internal quotes.
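
The two quoting conventions can be compared directly with Python's csv module, on the assumption that the graph loader forwards its csv_options to csv.reader. The sample line is invented for the illustration.

```python
import csv
import io

# One tab-separated line whose third field is "say \"hi\"" on disk.
line = 'Q1\tlabel\t"say \\"hi\\""\n'

# Default dialect: an internal quote is expected to be doubled ("").
doubled = next(csv.reader(io.StringIO(line), delimiter="\t"))

# Backslash-escape dialect: an internal quote is written as \".
escaped = next(
    csv.reader(io.StringIO(line), delimiter="\t", doublequote=False, escapechar="\\")
)

print(doubled[2])
print(escaped[2])
```

Only the second reader recovers the intended text; the first one misreads the field because it treats the backslash as an ordinary character.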

Getting <class 'ValueError'> error when running the generate_wikidata_triples command

Describe the bug
When I try to run the command kgtk generate_wikidata_triples -pf wikibase_property_mapping_kgtk.tsv < kgtk_test.tsv > output_file.ttl, I am getting the error
"UserWarning: Please raise KGTKException instead of <class 'ValueError'>
warnings.warn('Please raise KGTKException instead of {}'.format(type_))
KGTKException found"

To Reproduce
Steps to reproduce the behavior:

  1. input file: kgtk_test.txt (Cannot attach tsv format)

  2. property file:
    wikibase_property_mapping_kgtk.txt

  3. kgtk generate_wikidata_triples -pf wikibase_property_mapping_kgtk.tsv < kgtk_test.tsv > output_file.ttl

  4. See error
    UserWarning: Please raise KGTKException instead of <class 'ValueError'>
    warnings.warn('Please raise KGTKException instead of {}'.format(type_))
    KGTKException found

Expected behavior
No error and generating the TTL file

Screenshots
image

Desktop (please complete the following information):

  • OS: MacOS

Additional context
There is an inconsistency in the kgtk documentation,
https://kgtk.readthedocs.io/en/latest/export/generate_wikidata_triples/
image
property_type here should be data_type

[Feature request] Extract common paths from KGTK edge file

Is your feature request related to a problem? Please describe.
Given a KGTK file, I would like to know how the information is connected, that is, what are the typical paths that can be found between instances of a class in the graph.

For example, given Wikidata and the class Human, I would like to know what are the common paths one could explore from a Human in the endpoint. Examples could be (the properties do not reflect reality):

  • Human --marriedTo--> Human
  • Human --artistOf--> Song --includedIn--> Music collection
  • Human --partOfTeam-->SoccerTeam --hasAwards--> Award
  • Human --presidentOf-->Country--partOf-->Continent

This helps me know which types of queries the endpoint supports most commonly, instead of my trying to figure out what is in there.

Describe the solution you'd like
Common pattern analysis (sub graph extraction, random walks, association rule mining) to obtain a set of candidate paths with a given support, where support is the frequency that path has. For example, if I state support=100, that path must have occurred > 100 times in the graph.
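
As a very small instance of this idea, the sketch below counts two-step label paths in an edge list and keeps those whose count exceeds the requested support. It works on edge labels only, ignores node classes, and uses invented edges; common_label_paths is a hypothetical function, not a KGTK command.

```python
from collections import Counter, defaultdict


def common_label_paths(edges, min_support):
    """Count (label1, label2) paths of length two occurring more than min_support times."""
    outgoing = defaultdict(list)
    for node1, label, node2 in edges:
        outgoing[node1].append((label, node2))

    counts = Counter()
    for _, first_label, middle in edges:
        # Extend each edge by every edge leaving its target node.
        for second_label, _ in outgoing.get(middle, ()):
            counts[(first_label, second_label)] += 1
    return {path: n for path, n in counts.items() if n > min_support}


edges = [
    ("a", "artistOf", "s1"), ("s1", "includedIn", "c1"),
    ("b", "artistOf", "s2"), ("s2", "includedIn", "c1"),
    ("a", "marriedTo", "b"),
]
print(common_label_paths(edges, min_support=1))
```

Longer paths and class-aware patterns would need the heavier techniques named above; this exhaustive count is only practical for short paths.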

Discussion of the status and path forward for this issue
https://docs.google.com/document/d/1CEsQw46c4MGX1l6C7_d4XDfBGhtP2YfiIv8nSKTWlog/edit#heading=h.fvmyc2aqdp5x

kgtk ifnotexists returns existing triples

Describe the bug
kgtk ifnotexists does not behave as expected. In my case, triples that exist in the filter file still appear in the output. Please see the steps to reproduce below.

To Reproduce
Steps to reproduce the behavior:

  1. Go to this folder and download both existing_triples.tsv and matched_triples_kgtk_format.tsv
  2. Run the following command
kgtk ifnotexists path/to/file/matched_triples_kgtk_format.tsv \
 --input-keys node1 label node2 \
 --filter-on path/to/file/exist_triples.tsv \
 --filter-keys node1 label node2 \
 > path/to/file/not_exist_triples.tsv

Expected behavior
3. Read in data to check the result using any tools. In this example I use pandas

import pandas as pd

existing_candidate = pd.read_csv('path/to/file/exist_triples.tsv', delimiter='\t')
existing_candidate[(existing_candidate['node1'] == 'Q544059') & (existing_candidate['label'] == 'P1469')]

You should see this output:
image

Try reading in the output of kgtk ifnotexists commands.

not_existing_candidate = pd.read_csv('path/to/file/not_exist_triples.tsv', delimiter='\t')
not_existing_candidate[(not_existing_candidate['node1'] == 'Q544059') & (not_existing_candidate['node2'] == '295149')]

image
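
For comparison, the behavior I expect can be written as a plain anti-join on the key columns. This is a reference sketch with rows held as dictionaries, not the ifnotexists implementation; if_not_exists is a hypothetical name, and the second sample row is invented.

```python
def if_not_exists(rows, filter_rows, key_columns):
    """Keep the rows whose key does not occur among the filter rows."""
    filter_keys = {tuple(row[c] for c in key_columns) for row in filter_rows}
    return [
        row for row in rows
        if tuple(row[c] for c in key_columns) not in filter_keys
    ]


keys = ("node1", "label", "node2")
matched = [
    {"node1": "Q544059", "label": "P1469", "node2": "295149"},
    {"node1": "Q1", "label": "P31", "node2": "Q5"},
]
existing = [{"node1": "Q544059", "label": "P1469", "node2": "295149"}]

print(if_not_exists(matched, existing, keys))
```

Under this definition the Q544059 triple is removed, since all three key values match a row in the filter file.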
