usc-isi-i2 / kgtk

Knowledge Graph Toolkit

Home Page: https://kgtk.readthedocs.io/en/latest/

License: MIT License

Python 41.47% Shell 1.66% Makefile 0.01% Dockerfile 0.02% Jupyter Notebook 46.79% HTML 10.04%

knowledge-graphs graphs rdf etl-framework embeddings wikidata toolkit kg

kgtk's Introduction

KGTK: Knowledge Graph Toolkit

The Knowledge Graph Toolkit (KGTK) is a comprehensive framework for the creation and exploitation of large hyper-relational knowledge graphs (KGs), designed for ease of use, scalability, and speed. KGTK represents KGs in tab-separated (TSV) files with four columns: edge-identifier, head, edge-label, and tail. All KGTK commands consume and produce KGs represented in this simple format, so they can be composed into pipelines to perform complex transformations on KGs. KGTK provides:

a suite of import commands to import Wikidata, RDF and popular graph representations into KGTK format;
a rich collection of transformation commands that make it easy to clean, union, filter, and sort KGs;
graph combination commands that support efficient intersection, subtraction, and joining of large KGs;
a query language, a variant of Cypher optimized for querying KGs stored on disk, that supports efficient ad hoc queries;
graph analytics commands that support scalable computation of centrality metrics such as PageRank, degrees, connected components and shortest paths;
advanced commands that support lexicalization of graph nodes, and computation of multiple variants of text and graph embeddings over the whole graph;
a suite of export commands that transform KGTK KGs into commonly used formats, including the Wikidata JSON format, RDF triples, JSON documents for ElasticSearch indexing and graph-tool;
a development environment using Jupyter notebooks that provides seamless integration with Pandas.
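To make the tab-separated edge format concrete, here is a minimal Python sketch that parses a tiny edge list with the standard library only. The column names `node1`, `label`, `node2`, and `id` are taken from the sample files quoted in the issues further down this page; the helper name `read_edges` and the sample text are illustrative assumptions, not part of KGTK's API.

```python
import csv
import io

# Toy edge file; the rows are borrowed from an issue report later on this page.
SAMPLE = (
    "node1\tlabel\tnode2\tid\n"
    "Q8\tP31\tQ331769\tQ8-P31-0\n"
    "Q8\tP279\tQ16748867\tQ8-P279-7\n"
)

def read_edges(text):
    """Parse tab-separated edges into a list of dicts keyed by the header names."""
    # QUOTE_NONE keeps any double quotes inside the values untouched.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

edges = read_edges(SAMPLE)
for edge in edges:
    print(edge["id"], edge["node1"], edge["label"], edge["node2"])
```

For real work the KGTK commands themselves are the intended interface; this sketch only shows that the files are ordinary TSV.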

KGTK can process Wikidata-sized KGs with billions of edges on a laptop. We have used KGTK in multiple use cases, focusing primarily on construction of subgraphs of Wikidata, analysis of over 300 Wikidata dumps since the inception of the Wikidata project, linking tables to Wikidata, construction of a commonsense KG combining multiple existing sources, and creation of Wikidata extensions for food security and the pharmaceutical industry.

KGTK is open source software, well documented, actively used and developed, and released under the MIT license. We invite the community to try KGTK. It is easy to get started with our tutorial notebooks, which are available and executable online.

Installation

The following instructions install KGTK and the KGTK Jupyter Notebooks on Linux and MacOS systems.

If you want to install KGTK on a Microsoft Windows system, please
contact the KGTK team.

Our KGTK installations use a Conda virtual environment. If you don't have the Conda tools installed, follow this guide to install them. We recommend installing Miniconda rather than the full Anaconda distribution.

Next, execute the following steps to install the latest stable release of KGTK:

conda create -n kgtk-env python=3.9
conda activate kgtk-env
conda install -c conda-forge graph-tool
conda install -c conda-forge jupyterlab
pip --no-cache install -U kgtk

Please see our installation document for more details. If you encounter problems with your installation, or are interested in a detailed explanation of these commands, read more about the installation procedure here.

Installation issues on Macbooks with M1 chip

Running pip install -e . (development mode) fails with errors about three libraries:

thinc
blis
tokenizers

To fix the thinc issue:

a. comment out [this line in requirements.txt](https://github.com/usc-isi-i2/kgtk/blob/dev/requirements.txt#L11)

b. run `pip install thinc-apple-ops`

To fix the tokenizers issue, run the following commands in the conda environment:

# download and install Rust. Follow the on screen instructions

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python/
pip install setuptools_rust
python setup.py install

Then continue installing KGTK with pip install -e .

Installing KGTK with Docker

Please refer to this document for installing KGTK with Docker

Getting started

Online Documentation

You can read our latest documentation online at:

https://kgtk.readthedocs.io/en/latest/

KGTK Notebooks

For examples of using KGTK, please see our Tutorial Notebooks.

Releases

See all source code releases

KGTK Text Search API

The documentation for the KGTK Text Search API is here

KGTK Semantic Similarity API

The documentation for the KGTK Semantic Similarity API is here

How to cite

@inproceedings{ilievski2020kgtk,
  title={{KGTK}: A Toolkit for Large Knowledge Graph Manipulation and Analysis},
  author={Ilievski, Filip and Garijo, Daniel and Chalupsky, Hans and Divvala, Naren Teja and Yao, Yixiang and Rogers, Craig and Li, Ronpeng and Liu, Jun and Singh, Amandeep and Schwabe, Daniel and Szekely, Pedro},
  booktitle={International Semantic Web Conference},
  pages={278--293},
  year={2020},
  organization={Springer},
  url={https://arxiv.org/pdf/2006.00088.pdf}
}


kgtk's Issues

Have a single requirements file

Right now there are 3 requirement files:

requirements.txt, used for installation of kgtk.
requirements-dev
requirements-text_embeddings, which are now included in requirements.txt

Can we have just one? @saggu, @GreatYYX, @ckxz105 ?

Add Precision to Coordinates

import_wikidata would like to add a precision field to coordinates.

Clarify documentation of --output-stats in graph_statistics

The documentation says:

--output-stats        do not output the graph but statistics only

It is unclear what this option means; the documentation should clarify it.

I think we need the following options:

ability to put the global statistics of the graph in a file
ability to echo the original graph, adding the statistics edges at the end
ability to only output the statistics edges

Date Validation

The command kgtk generate_wikidata_triples is dropping dates found on Wikidata.

This is the content of ignored.log file:

Corrupted statement at line number: 7145 with id  with current corrupted id None
Corrupted statement at line number: 7146 with id  with current corrupted id None
Corrupted statement at line number: 29804 with id  with current corrupted id None
Corrupted statement at line number: 32151 with id  with current corrupted id None
Corrupted statement at line number: 63640 with id  with current corrupted id None

The corresponding lines are:

Q201000001      P580    ^-4000-01-01T00:00:00Z/11
Q201000001      P582    ^49500-01-01T00:00:00Z/11
Q201000028      P582    ^5675674-01-01T00:00:00Z/11
Q201000031      P582    ^20080-01-01T00:00:00Z/9
Q201000054      P580    ^-0499-01-01T00:00:00Z/11
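One plausible complication with values like these is that Python's `datetime` type only covers years 1 through 9999, so negative or five-digit years need separate handling. The sketch below shows one way the literals above could be split into parts with a regular expression. It is an assumption for illustration: `parse_kgtk_date` is a hypothetical helper and not the code used by `generate_wikidata_triples`.

```python
import re
from datetime import MAXYEAR, MINYEAR

# ^<signed year>-<month>-<day>T<hour>:<minute>:<second>Z/<precision>
DATE_RE = re.compile(
    r"^\^(?P<year>[+-]?\d+)-(?P<month>\d{2})-(?P<day>\d{2})"
    r"T(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})Z"
    r"(?:/(?P<precision>\d+))?$"
)

def parse_kgtk_date(value):
    """Return the date parts as a dict, or None if the value does not match."""
    match = DATE_RE.match(value)
    if match is None:
        return None
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    # Flag years that datetime.datetime could not represent.
    parts["fits_datetime"] = MINYEAR <= parts["year"] <= MAXYEAR
    return parts

for literal in ("^-4000-01-01T00:00:00Z/11", "^49500-01-01T00:00:00Z/11"):
    print(literal, parse_kgtk_date(literal))
```

Keeping the year as a plain integer sidesteps the `datetime` range limit, at the cost of doing any calendar arithmetic by hand.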

graph_statistics doesn't accept input in standard input

The input must be provided in a file argument. We should make it consistent with every other command. In addition, the output header should use label instead of property.

`import-wikidata` fails on early json dump of wikidata.

Describe the bug
import-wikidata fails on early json dump of wikidata.

To Reproduce
Steps to reproduce the behavior:

Go to https://archive.org/download/wikidata-json-20150330
Download the json dump.
Unzip it
Run kgtk import-wikidata -i 20150330.json --node nodefile.tsv --edge edgefile.tsv --qual qualfile.tsv --procs 1 --debug --lang en

Expected behavior
Three kgtk files will be generated. The file size should be more than 1GB in general.

Actual behavior

An error was reported, but the program didn't stop. The generated kgtk files are about 100K.

Example toy KGTK file

It would be great to have an example toy graph to demonstrate the different features of KGTK without having to wait a long time when dealing with a KG.

This toy KGTK file could be used for quick testing of all commands on each release.

Create a docker image with KGTK

The usage instructions cover only the Conda installation. It would be great to have a dockerized version of KGTK.

Error in remove_columns when removing multiple columns, separated by space

I am using the following command to remove the exploded columns from a Wikidata file:

kgtk remove_columns -c 'id,rank,node2;magnitude,node2;unit,node2;date,node2;item,node2;lower,node2;upper,node2;latitude,node2;longitude,node2;precision,node2;calendar,node2;entity-type'

If I put spaces between the column names (e.g., id, rank, node2;magnitude), it doesn't work.
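A whitespace-tolerant way of splitting the column list would avoid this. The following is a minimal sketch of that idea, not the `remove_columns` implementation; `split_column_names` is a hypothetical helper.

```python
def split_column_names(spec):
    """Split a comma-separated column list, ignoring blanks around each name."""
    return [name.strip() for name in spec.split(",") if name.strip()]

print(split_column_names("id, rank, node2;magnitude"))
print(split_column_names("id,rank,node2;magnitude"))
```

Both calls yield the same three names, so the spelling with spaces and the one without behave identically.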

This command also has a non-standard -dt option.

Exception in clean_data while cleaning wikidata_qualifiers_20200504.tsv.gz

The file wikidata_qualifiers_20200504.tsv.gz seems to have a large number of errors (reported by Itay). I tried to run clean_data on it, and it threw an exception:

(kgtk-env) D22ML-PSZEKELY:wikidata-20200504 pedroszekely$ gzcat wikidata_qualifiers_20200504.tsv.gz | pv -p -s 24077191262 | kgtk clean_data --repair-month-or-day-zero 2> ~/Downloads/kgtk-err.txt | gzip > ~/data/wikidata-20200504/wikidata_qualifiers_20200504-clean.tsv.gz
 0.21% | 14570 ETA | 48.75MB Transferred | 1.57MB/sevents.js:186
      throw er; // Unhandled 'error' event
      ^

Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:87:16)
Emitted 'error' event on Socket instance at:
    at Socket.onerror (/usr/local/lib/node_modules/pv/node_modules/readable-stream/lib/_stream_readable.js:640:52)
    at Socket.emit (events.js:209:13)
    at errorOrDestroy (internal/streams/destroy.js:107:12)
    at onwriteError (_stream_writable.js:449:5)
    at onwrite (_stream_writable.js:470:5)
    at internal/streams/destroy.js:49:7
    at Socket.dummyDestroy [as _destroy] (internal/process/stdio.js:7:3)
    at Socket.destroy (internal/streams/destroy.js:37:8)
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:88:12) {
  errno: 'EPIPE',
  code: 'EPIPE',
  syscall: 'write'
}

Description of graph_statistics is incorrect

kgtk --help produces

graph_statistics    Import a CSV file in Graph-tool.
gt_loader           Import a CSV file in Graph-tool.

Add Revised Precision to Dates

Add support for the revised (named) precisions to dates.

Update connected components command to understand node/edge file conventions

The connected components command uses non-standard options to specify the columns to operate on:

--noheader: Option to specify that the input file does not contain a header.
--subj {integer}: Column in which the subject is given. Default: 0.
--pred {integer}: Column in which the predicate is given. Default: 1.
--obj {integer}: Column in which the object is given. Default: 2.

The task is to update the command to follow the same conventions as the others.

Fix Reliability and Performance Issues in Sort and Pipes

kgtk sort was unreliable in pipes. A workaround has been implemented, but a thorough examination and possible rewrite of the code is called for.

Change format of property definitions for generate Wikidata triples

The currently supported file format looks like this:

node1	label	node2	data_type
P2020008	label	"attributed to software"@en	string
P2020008	description	software used to extract information from an article or text fragment
P2020013	label	"vertex in-degree"@en	quantity
P2020013	description	measure of vertex in degree in a graph

We want it to look like this: data_type is not another column, but a label. This makes the properties file an edge file that can be used in KGTK.

node1	label	node2
P2020001	label	"text Fragment"@en
P2020001	description	"text fragment of a scholarly article"@en
P2020001	data_type	item
P2020008	label	"attributed to software"@en
P2020008	description	"software used to extract information from an article or text fragment"@en
P2020008	data_type	string
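The structural part of that change can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's converter: `to_edge_format` is a hypothetical helper, and it only moves the fourth column into separate `data_type` edges. It does not add the quoting and `@en` suffix that the target example shows on descriptions.

```python
import csv
import io

def to_edge_format(old_text):
    """Rewrite a 4-column property file as 3-column edges with data_type rows."""
    rows = csv.reader(io.StringIO(old_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(rows)  # expected: node1, label, node2, data_type
    out = [header[:3]]
    for row in rows:
        if not row:
            continue  # skip blank lines
        out.append(row[:3])
        # Only rows with a non-empty fourth field yield a data_type edge.
        if len(row) > 3 and row[3]:
            out.append([row[0], "data_type", row[3]])
    return "\n".join("\t".join(r) for r in out) + "\n"

OLD = (
    "node1\tlabel\tnode2\tdata_type\n"
    'P2020008\tlabel\t"attributed to software"@en\tstring\n'
    "P2020008\tdescription\tsoftware used to extract information\n"
)
print(to_edge_format(OLD), end="")
```

In this sketch the `data_type` edge is emitted right after the row that carried the type, so the row order differs slightly from the target example above.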

Implode not working for coordinates

Describe the bug
Node2 of coordinate property P625 is missing after implosion.

To Reproduce
Run kgtk implode command:

unzip files.zip
kgtk implode exploded.tsv --remove-prefixed-columns True --without si_units language_suffix > imploded.tsv

files.zip

Expected behavior
Node2 of P625 should contain coordinate value

Desktop (please complete the following information):
KGTK 0.3.2

Importing to Graph Tools using a CSV Reader Breaks KGTK Strings

kgtk/gt/connected_components.py was using a CSV reader to import into Graph Tools. This broke the KGTK strings being imported. The other KGTK commands using Graph Tools are probably in a similar condition.
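The effect can be reproduced with Python's standard `csv` module, used here as a stand-in for the loader mentioned above (the graph-tool call itself is not exercised). The sample value assumes that KGTK strings are double-quoted with backslash-escaped inner quotes; `read_fields` is a hypothetical helper.

```python
import csv
import io

# One edge whose node2 is a string containing backslash-escaped quotes.
RAW_VALUE = r'"say \"hi\""'
LINE = "Q1\tlabel\t" + RAW_VALUE + "\n"

def read_fields(text, **csv_options):
    """Return the fields of the first row, parsed with the given csv options."""
    return next(csv.reader(io.StringIO(text), delimiter="\t", **csv_options))

# Default dialect: the double quote is a quote character, so the value is reinterpreted.
default_fields = read_fields(LINE)
# QUOTE_NONE: quotes are ordinary characters, so the value survives verbatim.
verbatim_fields = read_fields(LINE, quoting=csv.QUOTE_NONE)

print(repr(default_fields[2]))
print(repr(verbatim_fields[2]))
```

Disabling quote processing is one option; configuring the reader with an explicit escape character is another, depending on whether the consumer wants the raw or the decoded string.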

Typo in md help string

It says

Additional options are shown in expert help.
kgtk --expert **cat** --help

generate_wikidata_triples ignores a valid ISO date if it does not start with `^`

2014-04-01T00:00:00Z/9 is a valid date in ISO format, but the command ignores it and throws an error because the string does not start with ^.

The command should handle this case.

Kgtk Timing

It helps to know how long a KGTK command executed. Rather than relying on the vagaries of command shells, or adding this feature to each individual command, provide kgtk --timing support in the top-level command dispatcher. Ideally, when executing a KGTK pipe, it will provide timing for each section of the pipe.

import triples does not import numbers

When importing the following triples all the triples with numbers are skipped in the output.

<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://www.tablenet.l3s.uni-hannover.de/TableNet#Table> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#hasTableID> 5020573 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#numberOfColumns> 2 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#numberOfRows> 1 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#document> <http://en.wikipedia.org/wiki/Metropolis_Gold%23> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <http://purl.org/dc/terms/source> <http://dbepdia.org/resource/Metropolis_Gold> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/table/5020573> <https://www.tablenet.l3s.uni-hannover.de/TableNet#resourceURL> <http://www.tablenet.l3s.uni-hannover.de/TableNet/json/5020573> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/column/5020573_0_Professional_ratings> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.tablenet.l3s.uni-hannover.de/TableNet#Column> .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/column/5020573_0_Professional_ratings> <https://www.tablenet.l3s.uni-hannover.de/TableNet#columnPosition> 0 .
<https://www.tablenet.l3s.uni-hannover.de/TableNet/column/5020573_0_Professional_ratings> <https://www.tablenet.l3s.uni-hannover.de/TableNet#hasLevel> 0 .

The output:

node1	label	node2
tn-table:5020573	rdf:type	tn:Table
tn-table:5020573	tn:document	wiki:Metropolis_Gold%23
tn-table:5020573	dc:source	db:Metropolis_Gold
tn-table:5020573	tn:resourceURL	tn-json:5020573
tn-column:5020573_0_Professional_ratings	rdf:type	n1:Column
tn-column:5020573_0_Professional_ratings	tn:columnName	"Professional_ratings"

Make It Easier to Export KGTK Correctly

We need some tools and guidelines to assist with exporting KGTK to other formats. For example, a destringify routine that removes the escape (\) from the vertical bar (pipe) (|) character would help.
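A rough sketch of such a routine follows, assuming that the escape character is a backslash and that strings are enclosed in double quotes. The function below is hypothetical and deliberately narrow: it only undoes the pipe escape and leaves every other escape sequence as it is.

```python
def destringify(value):
    """Strip the enclosing double quotes and drop the backslash in front of '|'."""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    out = []
    i = 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value):
            # Undo only the pipe escape; copy any other escape pair unchanged.
            out.append("|" if value[i + 1] == "|" else value[i:i + 2])
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)

print(destringify(r'"red\|green"'))
```

Scanning escape pairs one at a time, rather than using a plain string replace, keeps an escaped backslash followed by a pipe from being misread.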

text_embedding doesn't accept input in standard input

The input right now must be provided in a file argument. We should make it consistent with every other command.

Binder not working

kgtk fails to run on Binder because it requires graph-tool, which is not installed there.

[Dockerfiles] Do not run notebooks as root

It is not a good practice to run the notebooks as root in the container. The following lines address the problem:

ARG NB_USER=jovyan
ARG NB_UID=1000
ENV USER ${NB_USER}
ENV NB_UID ${NB_UID}
ENV HOME /home/${NB_USER}

RUN adduser --disabled-password \
    --gecos "Default user" \
    --uid ${NB_UID} \
    ${NB_USER}
    
COPY . ${HOME}
USER root
RUN chown -R ${NB_UID} ${HOME}
RUN chown -R ${NB_UID} kgtk
USER ${NB_USER}

This was taken from the mybinder documentation. The invocation command for the notebook should probably be updated too.

Add calendar to dates

Add support for calendars to dates.

unit test not passed on text embedding

Describe the bug
When running the text embedding unit test, the test raises an error like this:

======================================================================
ERROR: test_vector (test_embedding.TestEmbedding)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/usc-isi-i2/kgtk/kgtk/tests/test_embedding.py", line 10, in test_vector
    assert cli_entry("kgtk", "text_embedding", test_input, "--use-cache", "false", "--embedding-projector-metadata-path", "none", "--property-value", "P1629", "P1466") == 0
  File "/home/travis/virtualenv/python3.7.7/lib/python3.7/site-packages/kgtk/cli_entry.py", line 73, in cli_entry
    mod = importlib.import_module('.{}'.format(h), 'kgtk.cli')
  File "/opt/python/3.7.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/travis/virtualenv/python3.7.7/lib/python3.7/site-packages/kgtk/cli/connected-components.py", line 7, in <module>
    from kgtk.gt.connected_components import ConnectedComponents
  File "/home/travis/virtualenv/python3.7.7/lib/python3.7/site-packages/kgtk/gt/connected_components.py", line 10, in <module>
    from graph_tool.topology import label_components
ModuleNotFoundError: No module named 'graph_tool'
----------------------------------------------------------------------
Ran 11 tests in 12.109s
FAILED (errors=1)
The command "coverage run -m unittest discover" exited with 1.

For details, please refer to https://travis-ci.org/github/usc-isi-i2/kgtk/builds/713131958
This seems to happen because graph-tool is not installed in the unit test environment.
If it is installed (e.g., when testing locally), the unit test passes.

To Reproduce
Run the unit test with text embedding

Expected behavior
The unit test finishes successfully.

Desktop (please complete the following information):

OS: MacOS 10.14.6

Create notebooks for each command

The documentation of some commands is not up to date. For example, @szeke reports that the lift documentation had examples that no longer work.

We should have notebooks that illustrate how to use KGTK for each command. These notebooks can inform tests for the toolkit.

Command Option Syntax Is Confusing

Users are confused about where to put positional arguments, typically file names. If they place them after an option that takes a varying number of arguments, such as a column list, the file name is absorbed into the list.
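The absorption can be reproduced with Python's `argparse`, sketched here on a stand-alone parser. The option name `--columns` and the `demo` program are made up for illustration and do not correspond to any particular KGTK command.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="demo")
    # Optional positional input file; "-" stands for standard input.
    parser.add_argument("input_file", nargs="?", default="-")
    # An option that takes a varying number of arguments.
    parser.add_argument("--columns", nargs="+", default=[])
    return parser

parser = build_parser()
# File name after the column list: it is swallowed by --columns.
absorbed = parser.parse_args(["--columns", "id", "rank", "edges.tsv"])
# File name before the option: it is parsed as the input file.
separate = parser.parse_args(["edges.tsv", "--columns", "id", "rank"])

print(absorbed.columns, absorbed.input_file)
print(separate.columns, separate.input_file)
```

Placing positional arguments before any variable-length option, or giving the input file its own named option, avoids the ambiguity.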

Modify output of gt_loader

Add header
Output in and out degrees
option to output only stats and not data rows

Add --left-prefix and --right-prefix to `kgtk join`

kgtk join currently supports --prefix for right-graph additional column name prefixing. When using kgtk join to create horizontally-joined records (kgtk join / compact), it would be useful to be able to apply a prefix to the left-graph additional column names.

Add left-graph column name prefixing, supporting --right-prefix and --left-prefix. Retain --prefix for compatibility.

kgtk connected-components Cannot Process Stdin

kgtk/gt/connected-components passes the input path to both KgtkReader (to get the header line) and to graph_tool.load_from_csv(...). The input path is "-" when we want to read stdin.

graph_tool.load_from_csv(...) isn't prepared to use "-" as an indicator to read stdin
I suspect that the no_header logic is wrong, because KgtkReader will read the header line from stdin before load_from_csv can access the input stream.
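The two points above can be illustrated with a small sketch that treats "-" as standard input and reads the header exactly once. The helpers `open_input` and `read_header_and_rows` are hypothetical and independent of KgtkReader and graph-tool; the `stdin` parameter exists only so the example can run without a real pipe.

```python
import io
import sys

def open_input(path, stdin=None):
    """Return a text stream for path, treating "-" as standard input."""
    if path == "-":
        return stdin if stdin is not None else sys.stdin
    return open(path, "r", encoding="utf-8")

def read_header_and_rows(stream):
    """Read the header once, then hand back the same stream for the data rows."""
    header = stream.readline().rstrip("\n").split("\t")
    return header, stream

fake_stdin = io.StringIO("node1\tlabel\tnode2\nQ1\tP31\tQ5\n")
header, rows = read_header_and_rows(open_input("-", stdin=fake_stdin))
print(header)
print(rows.readline(), end="")
```

Because a pipe cannot be rewound, whatever consumes `rows` afterwards must be told that the header has already been read.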

Examples should be updated

The kgtk code has changed so rapidly in the last 3 weeks that some of the commands used in the notebooks are no longer valid.

The notebooks should be updated accordingly.

cli branch unable to be used now because of import error

Calling any kgtk CLI command raises an error while the command modules are being imported:

(gpt2) ➜  kgtk git:(feature/cli) ✗ kgtk dummy 
Traceback (most recent call last):
  File "/Users/minazuki/miniconda3/envs/gpt2/bin/kgtk", line 11, in <module>
    load_entry_point('kgtk', 'console_scripts', 'kgtk')()
  File "/Users/minazuki/Desktop/studies/master/2018Summer/DSBOX_2019/kgtk/kgtk/cli_entry.py", line 70, in cli_entry
    mod = importlib.import_module('.{}'.format(h), 'kgtk.cli')
  File "/Users/minazuki/miniconda3/envs/gpt2/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 674, in exec_module
  File "<frozen importlib._bootstrap_external>", line 781, in get_code
  File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/minazuki/Desktop/studies/master/2018Summer/DSBOX_2019/kgtk/kgtk/cli/wikidata_nodes_import.py", line 46
    from __future__ import unicode_literals
    ^
SyntaxError: from __future__ imports must occur at the beginning of the file
(gpt2) ➜  kgtk git:(feature/cli) ✗

Improve wikidata triple generation.

Handle external identifier and URI rather than group them into the StringValue class.
Integrate the Wikidata properties as internal data. This way, users only need to specify their own properties, with very large identifier numbers, in prop_types.tsv.

Import n-triples in help does not use ntriples format

In the KGTK help, import_ntriples is as follows:

kgtk import_ntriples -i dbpedia_wikipedia_links.ttl -o DbpediaWikipediaLinks.tsv

However, can the command read a Turtle file? Shouldn't the input file in the example be an .nt file?

Numeric sorting

Describe the bug
Sort does not allow for numeric sorting, it always performs string sorting.

To Reproduce
Steps to reproduce the behavior:

Jun tried to compute PageRank by using graph_statistics
then he used kgtk sort on the resulting PageRank statistics

Expected behavior
The sort should accept a flag --numeric that makes it perform a numeric rather than lexical sort.
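The difference between the two orderings is easy to see in a short Python sketch. The rows and scores below are made up, and `sort_rows` is a hypothetical helper rather than the `kgtk sort` implementation.

```python
def sort_rows(rows, column, numeric=False):
    """Sort rows on one column, comparing as numbers when numeric is True."""
    if numeric:
        return sorted(rows, key=lambda row: float(row[column]))
    return sorted(rows, key=lambda row: row[column])

# Illustrative PageRank-style edges: (node1, label, node2).
ROWS = [
    ("Q1", "vertex_pagerank", "0.5"),
    ("Q2", "vertex_pagerank", "10"),
    ("Q3", "vertex_pagerank", "9e-05"),
]

print([row[0] for row in sort_rows(ROWS, 2)])
print([row[0] for row in sort_rows(ROWS, 2, numeric=True)])
```

The lexical sort places "9e-05" last because it compares characters, while the numeric sort places it first because it is the smallest value.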

Desktop (please complete the following information):
Jun's laptop (don't know the specs)

Update instructions: How to update KGTK?

We have no instructions stating how to update an already installed version of KGTK.
In theory it should be as easy as:

git clone https://github.com/usc-isi-i2/kgtk/
cd kgtk
python setup.py install

Sorting Order Disagreement

Python's sorting function sorts in the following order:

P560 P580-count 2 21
P5603 P580-count 2 845
P5607 P580-count 326 25788

The Gnu sort command sorts in the following order, unless the envar LC_ALL is set to C:

P5603 P580-count 2 845
P5607 P580-count 326 25788
P560 P580-count 2 21

This means that the output of kgtk sort probably doesn't satisfy the sorting requirements for optimization in kgtk lift.
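Python compares strings by code point, which gives the same order as comparing their UTF-8 bytes, and byte comparison is what GNU sort does when LC_ALL is set to C. The sketch below checks this on the three rows above, assuming the fields are separated by tabs as in a KGTK file.

```python
LINES = [
    "P5603\tP580-count\t2\t845",
    "P5607\tP580-count\t326\t25788",
    "P560\tP580-count\t2\t21",
]

# Default Python ordering: code point by code point.
python_order = sorted(LINES)
# Explicit byte-wise ordering, as a C-locale sort would compare the lines.
byte_order = sorted(LINES, key=lambda line: line.encode("utf-8"))

for line in python_order:
    print(line)
```

The tab after P560 sorts before the digit 3, which is why P560 comes first here; a locale-aware collation may ignore or reweight the separator and produce the other order.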

language_suffix Can't Be Matched with field_value

Language_suffix can't be matched with field_value because the language suffix values start with a dash, so the comparison values look like invalid numbers. Even if the dash were removed, there would be confusion because Wikidata allows all-numeric suffixes.

The best solution is probably the rule that if the comparison value is a string (not a symbol), compare the field value to the contents of the string, since field values can't themselves be strings.

Filter should use the header parser by Craig

Describe the bug
Right now the header of the file has to be node1,label,node2, and this can only be overridden by setting the columns explicitly.

To Reproduce
Given a sample file:

node1   property        node2   id
Q8      P31     Q331769 Q8-P31-0
Q8      P31     Q60539479       Q8-P31-1
Q8      P31     Q9415   Q8-P31-2
Q8      P1343   Q20743760       Q8-P1343-3
Q8      P1343   Q1970746        Q8-P1343-4
Q8      P1343   Q19180675       Q8-P1343-5
Q8      P461    Q169251 Q8-P461-6
Q8      P279    Q16748867       Q8-P279-7
Q8      P460    Q935526 Q8-P460-8

You can filter successfully as follows:
kgtk filter -p " ; vertex_pagerank ; " --pred property wikidata-pagerank-degrees_all.tsv > wikidata-pagerank-only.tsv

But if you run the following, you get nothing:
kgtk filter -p " ; vertex_pagerank ; " wikidata-pagerank-degrees_all.tsv > wikidata-pagerank-only.tsv

Expected behavior
The filter command should employ Craig's parser. It should at least accept all aliases for the three key columns. It should (probably) also use the first three columns otherwise.
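The expected behavior could look roughly like the sketch below: look for a known alias of each key column, and fall back to the first three positions otherwise. The alias table is an assumption made up for this example (it is not copied from the KGTK header parser), and `resolve_key_columns` is a hypothetical helper.

```python
# Hypothetical alias table; the real parser may accept a different set of names.
COLUMN_ALIASES = {
    "node1": ("node1", "from", "subject"),
    "label": ("label", "predicate", "relation", "property"),
    "node2": ("node2", "to", "object"),
}

def resolve_key_columns(header):
    """Map node1/label/node2 to column indexes, falling back to positions 0, 1, 2."""
    resolved = {}
    for position, (canonical, aliases) in enumerate(COLUMN_ALIASES.items()):
        index = next((i for i, name in enumerate(header) if name in aliases), None)
        resolved[canonical] = position if index is None else index
    return resolved

print(resolve_key_columns(["node1", "property", "node2", "id"]))
print(resolve_key_columns(["id", "subject", "predicate", "object"]))
```

With a lookup like this, the sample file in this report would be filterable without passing the column name explicitly.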

Desktop (please complete the following information):

OS: iOS
Browser: chrome
Version 83.

Smartphone (please complete the following information):

Device: macbook
OS: Catalina
Browser: chrome
Version 10.15

Additional context
Jun Liu reported the issue, during the June 4th 2020 meeting.

Usage instructions

The readme file does not specify how to use the tool. There are examples on the examples folder, but they are not very informative.

Also, is it possible to use kgtk with graphtools within its docker image?

`paths` does not consistently output edge ids

The output of kgtk paths sometimes shows the node1 or node2 instead of the edge id. There seems to be a problem of not reading the header correctly.

Add_id's documentation of id-styles needs to be updated/clarified

The documentation of the id-style options in https://kgtk.readthedocs.io/en/dev/transform/add_id/ is confusing and needs to be updated or clarified. The first table conveniently lists the possible styles, but the actual options in the command below it are different (or at least do not correspond one-to-one to the styles in the table).

`kgtk import-wikidata` Creates Incorrect Sitelink URLs

kgtk import-wikidata drops many sitelink URLs. It always uses "wikipedia.org" in the URL, when it should use other strings (e.g., "wikinews.org"), depending upon the site.
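One way to pick the domain is to look at the suffix of the Wikidata site identifier, for example `enwiki` for English Wikipedia and `enwikinews` for English Wikinews. The sketch below is an assumption about how this could be done, not the `import-wikidata` code: the suffix table is not exhaustive, the list of special sites is a guess, and `sitelink_url` is a hypothetical helper.

```python
from urllib.parse import quote

# Longer suffixes come first so that "wikinews" is tried before "wiki".
SITE_SUFFIXES = [
    ("wikiversity", "wikiversity.org"),
    ("wikisource", "wikisource.org"),
    ("wikivoyage", "wikivoyage.org"),
    ("wiktionary", "wiktionary.org"),
    ("wikibooks", "wikibooks.org"),
    ("wikiquote", "wikiquote.org"),
    ("wikinews", "wikinews.org"),
    ("wiki", "wikipedia.org"),
]
# Sites that do not follow the <language><project> pattern (assumed, incomplete).
SPECIAL_SITES = {"commonswiki", "specieswiki", "metawiki", "wikidatawiki"}

def sitelink_url(site, title):
    """Build a sitelink URL from a site id and page title, or return None."""
    if site in SPECIAL_SITES:
        return None
    for suffix, domain in SITE_SUFFIXES:
        if site.endswith(suffix) and len(site) > len(suffix):
            language = site[: -len(suffix)].replace("_", "-")
            page = quote(title.replace(" ", "_"))
            return "https://{}.{}/wiki/{}".format(language, domain, page)
    return None

print(sitelink_url("enwiki", "Douglas Adams"))
print(sitelink_url("enwikinews", "Main Page"))
```

Returning None for unknown or special sites keeps the caller from emitting a URL that only looks plausible.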

Jupyter Lab Example 4 uses gzcat, which isn't portable

KGTK jupyter lab Example 4 uses gzcat. This command isn't portable; although it is available on MacOS, it is not provided in some flavors of Linux. The portable alternative is gunzip -c, or even better, gzip -dc.

kgtk connected-components Will Fail on Certain Strings

If the KGTK input file contains strings (or symbols) with internal double quotes, failure is likely because load_from_csv isn't passed the csv_options to use an escape character (\) instead of doubling ("") to process the internal quotes.

Getting <class 'ValueError'> error when running the generate_wikidata_triples command

Describe the bug
When I try to run the command kgtk generate_wikidata_triples -pf wikibase_property_mapping_kgtk.tsv < kgtk_test.tsv > output_file.ttl, I am getting the error
"UserWarning: Please raise KGTKException instead of <class 'ValueError'>
warnings.warn('Please raise KGTKException instead of {}'.format(type_))
KGTKException found"

To Reproduce
Steps to reproduce the behavior:

input file: kgtk_test.txt (Cannot attach tsv format)
property file:
wikibase_property_mapping_kgtk.txt
kgtk generate_wikidata_triples -pf wikibase_property_mapping_kgtk.tsv < kgtk_test.tsv > output_file.ttl
See error
UserWarning: Please raise KGTKException instead of <class 'ValueError'>
warnings.warn('Please raise KGTKException instead of {}'.format(type_))
KGTKException found

Expected behavior
No error and generating the TTL file

Desktop (please complete the following information):

OS: MacOS

Additional context
There is an inconsistency in the kgtk documentation,
https://kgtk.readthedocs.io/en/latest/export/generate_wikidata_triples/

property_type here should be data_type

[Feature request] Extract common paths from KGTK edge file

Is your feature request related to a problem? Please describe.
Given a KGTK file, I would like to know how the information is connected, that is, what are the typical paths that can be found between instances of a class in the graph.

For example, given Wikidata and the class Human, I would like to know what are the common paths one could explore from a Human in the endpoint. Examples could be (the properties do not reflect reality):

Human --marriedTo--> Human
Human --artistOf--> Song --includedIn--> Music collection
Human --partOfTeam--> SoccerTeam --hasAwards--> Award
Human --presidentOf--> Country --partOf--> Continent

This helps me know which types of queries the endpoint supports most commonly, instead of my trying to figure out what is in there.

Describe the solution you'd like
Common pattern analysis (sub graph extraction, random walks, association rule mining) to obtain a set of candidate paths with a given support, where support is the frequency that path has. For example, if I state support=100, that path must have occurred > 100 times in the graph.
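As a starting point for the notion of support, the sketch below counts one-hop patterns of the form (source type, label, target type) and keeps those that occur more than a given number of times. It is a toy illustration with made-up data: `frequent_patterns` is a hypothetical helper, and multi-hop paths, random walks, and rule mining are out of its scope.

```python
from collections import Counter

def frequent_patterns(edges, node_types, min_support):
    """Count (source type, label, target type) patterns and keep those above min_support."""
    counts = Counter()
    for node1, label, node2 in edges:
        # Edges whose endpoints have no known type are skipped.
        if node1 in node_types and node2 in node_types:
            counts[(node_types[node1], label, node_types[node2])] += 1
    return {pattern: n for pattern, n in counts.items() if n > min_support}

# Made-up edges and types, mirroring the style of the examples above.
EDGES = [("Q1", "marriedTo", "Q2"), ("Q2", "marriedTo", "Q1"), ("Q1", "partOfTeam", "Q3")]
TYPES = {"Q1": "Human", "Q2": "Human", "Q3": "SoccerTeam"}

print(frequent_patterns(EDGES, TYPES, min_support=1))
```

Longer paths could be built by joining such one-hop patterns on their shared type, at a cost that grows quickly with path length.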

Discussion of the status and path forward for this issue
https://docs.google.com/document/d/1CEsQw46c4MGX1l6C7_d4XDfBGhtP2YfiIv8nSKTWlog/edit#heading=h.fvmyc2aqdp5x

kgtk ifnotexists returns existing triples

Describe the bug
kgtk ifnotexists does not behave as expected. In my case, triples that exist in the filter file still appear in the output. Please see the steps to reproduce below.

To Reproduce
Steps to reproduce the behavior:

Go to this folder and download both existing_triples.tsv and matched_triples_kgtk_format.tsv
Run the following command

kgtk ifnotexists path/to/file/matched_triples_kgtk_format.tsv \
 --input-keys node1 label node2 \
 --filter-on path/to/file/exist_triples.tsv \
 --filter-keys node1 label node2 \
 > path/to/file/not_exist_triples.tsv

Expected behavior
3. Read in data to check the result using any tools. In this example I use pandas

import pandas as pd

existing_candidate = pd.read_csv('path/to/file/exist_triples.tsv', delimiter='\t')
existing_candidate[(existing_candidate['node1'] == 'Q544059') & (existing_candidate['label'] == 'P1469')]

You should see this output:

Then read in the output of the kgtk ifnotexists command:

not_existing_candidate = pd.read_csv('path/to/file/not_exist_triples.tsv', delimiter='\t')
not_existing_candidate[(not_existing_candidate['node1'] == 'Q544059') & (not_existing_candidate['node2'] == '295149')]
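For reference, the intended semantics can be written as a small anti-join: keep an input row only if its key does not occur in the filter file. This is a minimal sketch with made-up rows, comparing keys as plain strings; `if_not_exists` is a hypothetical helper and not the command's implementation.

```python
def if_not_exists(input_rows, filter_rows, keys=("node1", "label", "node2")):
    """Return the input rows whose key tuple is absent from the filter rows."""
    seen = {tuple(row[k] for k in keys) for row in filter_rows}
    return [row for row in input_rows if tuple(row[k] for k in keys) not in seen]

# Made-up sample rows.
INPUT_ROWS = [
    {"node1": "Q1", "label": "P31", "node2": "Q5"},
    {"node1": "Q2", "label": "P31", "node2": "Q5"},
]
FILTER_ROWS = [{"node1": "Q1", "label": "P31", "node2": "Q5"}]

print(if_not_exists(INPUT_ROWS, FILTER_ROWS))
```

Running the same check on the two files from this report would show whether the rows in question differ in some field, such as whitespace or quoting, that makes their keys unequal.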
