
soweego's Introduction

soweego: link Wikidata to large catalogs


soweego is a pipeline that connects Wikidata to large-scale third-party catalogs.

soweego is the only system that makes statisticians, epidemiologists, historians, and computer scientists agree. Why? Because it performs record linkage, data matching, and entity resolution at the same time. Too easy: they all seem to be synonyms!

Oh, soweego also embeds Machine Learning and advocates for Linked Data.

Is soweego similar to the Go game?

Official Project Pages

soweego is made possible thanks to the Wikimedia Foundation.

Documentation

https://soweego.readthedocs.io/

Highlights

Get Ready

Install Docker and Compose, then enter soweego:

$ git clone -b v1.1 https://github.com/Wikidata/soweego.git
$ cd soweego
$ ./docker/run.sh
Building soweego
...

root@70c9b4894a30:/app/soweego#

Now it's too late to get out!

Run the Pipeline

Piece of cake:

:/app/soweego# python -m soweego run CATALOG

Pick CATALOG from discogs, imdb, or musicbrainz.

These steps are executed by default:

  1. import the target catalog into a local database;
  2. link Wikidata to the target with a supervised linker;
  3. synchronize Wikidata to the target.

Results are in /app/shared/results.

Use the Command Line

You can launch every single soweego action with CLI commands:

:/app/soweego# python -m soweego
Usage: soweego [OPTIONS] COMMAND [ARGS]...

  Link Wikidata to large catalogs.

Options:
  -l, --log-level <TEXT CHOICE>...
                                  Module name followed by one of [DEBUG, INFO,
                                  WARNING, ERROR, CRITICAL]. Multiple pairs
                                  allowed.
  --help                          Show this message and exit.

Commands:
  importer  Import target catalog dumps into a SQL database.
  ingester  Take soweego output into Wikidata items.
  linker    Link Wikidata items to target catalog identifiers.
  run       Launch the whole pipeline.
  sync      Sync Wikidata to target catalogs.

Just two things to remember:

  1. you can always get --help;
  2. each command may have sub-commands.

Contribute

The best way is to import a new catalog. Please also have a look at the guidelines.

License

The source code is under the terms of the GNU General Public License, version 3.

soweego's People

Contributors

edoardolenzi9, marfox, maxfrax, pre-commit-ci[bot], tupini07


soweego's Issues

Match MusicBrainz releases

Currently, only 37 out of 256k albums in Wikidata have a MusicBrainz identifier:

SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q482994 ;
        wdt:P434 ?identifier .
}

Investigate how we could leverage the album data in MusicBrainz to populate both album identifiers and performer statements, like:
Animals, performer, Pink Floyd

Bot ingestor

  • create a bot request, after #66;
  • implement the bot in soweego/ingestor;
  • import accurate links as per #62.

Token similarity for URLs

We cannot be sure that a URL built from Wikidata is the same as the one available in the target databases.

We could split the URLs on '/' and '.' to get tokens. After excluding grammar words like "https" and so on, we can estimate the similarity between URLs.

E.g. (trivial):
https://twitter.com/BarackObama -> "https" "twitter" "com" "BarackObama"
After removing grammar words we are left with "twitter" "BarackObama", which is the core information we need to match against another form of the same Twitter URL, like http://www.twitter.com/BarackObama -> "twitter" "BarackObama" (after cleaning).
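
A minimal sketch of this tokenization in plain Python; the list of grammar words below is an illustrative assumption, and the split also covers ':' so that the URL scheme is dropped:

import re

# Assumed "grammar words"; a real blacklist would need to be curated.
STOP_TOKENS = {'http', 'https', 'www', 'com', 'org', 'net'}

def url_tokens(url):
    """Split a URL on '/', '.' and ':', then drop grammar words."""
    tokens = re.split(r'[/.:]+', url)
    return {t for t in tokens if t and t.lower() not in STOP_TOKENS}

def urls_match(url1, url2):
    """Consider two URLs equivalent if their core tokens coincide."""
    return url_tokens(url1) == url_tokens(url2)

# urls_match('https://twitter.com/BarackObama', 'http://www.twitter.com/BarackObama')
# -> True: both reduce to {'twitter', 'BarackObama'}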

Jaro-Winkler distance name matcher

- [ ] Port this Java implementation into Python: https://github.com/fbk/utils/blob/master/utils-core/src/main/java/eu/fbk/utils/core/strings/JaroWinklerDistance.java,
  based on Apache Commons Lang: https://commons.apache.org/proper/commons-text/javadocs/api-release/index.html?org/apache/commons/text/similarity/JaroWinklerDistance.html;
  the original source code should be contained here: http://it.apache.contactlab.it//commons/lang/source/commons-lang3-3.7-src.tar.gz;

  • use jellyfish;
  • for each full name of a target identifier, compute the Jaro-Winkler edit distance against all Wikidata item labels.
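
A minimal sketch of the jellyfish-based approach; depending on the library version, the function is jellyfish.jaro_winkler() or jellyfish.jaro_winkler_similarity(), and the threshold below is an assumption to be tuned:

import jellyfish

THRESHOLD = 0.85  # assumed cut-off

def best_label_match(full_name, wikidata_labels):
    """Return the Wikidata label closest to the target full name, with its score."""
    scored = [(label, jellyfish.jaro_winkler_similarity(full_name, label))
              for label in wikidata_labels]
    label, score = max(scored, key=lambda pair: pair[1])
    return (label, score) if score >= THRESHOLD else (None, score)

# best_label_match('Jon Smith', ['John Smith', 'Jane Doe'])
# -> ('John Smith', <score above the threshold>)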

Script to import BIBSYS dump

It should perform the following actions:

  • import the dataset into the s51434__mixnmatch_large_catalogs_p database on Toolforge.

Person entries in the dump look like:

<http://data.bibsys.no/data/notrbib/authorityentry/x90054225> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .

Name cleaning procedure

  1. Remove honorific words, e.g., Sir, Mlle, Jr., Sr., PhD, MD, M.D. (prefixes), de, de la, of, von (infixes). See the mix'n'match regexps below:
/[, ]+(Jr\.{0,1}|Sr\.{0,1}|PhD\.{0,1}|MD|M\.D\.)$
/^(Sir|Baron|Baronesse{0,1}|Graf|Gräfin)\s+/
/\b(Mmle|pseud\.|diverses)\b/
/ Bt$/' , ' Baronet'
'/^(.+)[ ,]+[sj]r\.{0,1}$/i'
/^(.+)[ ,]+I+\.{0,1}$/i
  2. remove name initials, with or without a dot, e.g., M.;
  3. remove commas, dashes, and quotes;
  4. normalize: convert to ASCII via a diacritics map, e.g., { 'à': 'a' };
  5. lowercase;
  6. split on white spaces.
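
A minimal sketch of the whole procedure in Python; the honorifics regexp is a small illustrative subset of the mix'n'match ones, and unicodedata stands in for the diacritics map:

import re
import unicodedata

# Illustrative subset of prefixes and infixes, not the full mix'n'match set.
HONORIFICS = re.compile(r'\b(Sir|Mlle|Jr\.?|Sr\.?|PhD|M\.?D\.?|de la|de|of|von)\b')

def clean_name(name):
    name = HONORIFICS.sub(' ', name)                    # 1. honorific words
    name = re.sub(r'\b[A-Z]\.?(?=\s|$)', ' ', name)     # 2. name initials, with or without dot
    name = re.sub("[-,.'\"]", ' ', name)                # 3. commas, dashes, quotes (plus stray dots)
    name = unicodedata.normalize('NFKD', name)          # 4. ASCII normalization
    name = name.encode('ascii', 'ignore').decode('ascii')
    name = name.lower()                                 # 5. lowercase
    return name.split()                                 # 6. split on white spaces

# clean_name('Sir Arthur C. Clarke, Jr.') -> ['arthur', 'clarke']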

Investigate phonetic algorithms

Soundex, metaphone, and other phonetic algorithms can be useful to normalize ideographic languages like Japanese and Chinese.

The jellyfish Python library we use implements those algorithms, but does not seem to support ideographic languages (exceptions raised or empty strings as output).
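
A quick illustration with jellyfish; the behaviour on ideographic input is the one described above (whether it raises or returns an empty string depends on the library version):

import jellyfish

jellyfish.soundex('Robert')    # 'R163': fine for Latin-script names
jellyfish.metaphone('Robert')  # 'RBRT'
# Ideographic input is the problem: calls like jellyfish.soundex('東京')
# raise an exception or return an empty/meaningless string, so a
# transliteration (romanization) step would be needed first.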

Perfect string matcher

This is the most straightforward one.

For each full name of a target identifier:

  1. lowercase the string;
  2. match against Wikidata item labels (also lowercased).
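
A minimal sketch in plain Python, assuming wikidata_labels maps QIDs to label lists (the sample QID below is illustrative):

def perfect_matches(target_names, wikidata_labels):
    """Yield (target name, QID) pairs whose lowercased strings are identical."""
    index = {}
    for qid, labels in wikidata_labels.items():
        for label in labels:
            index.setdefault(label.lower(), set()).add(qid)
    for name in target_names:
        for qid in index.get(name.lower(), ()):
            yield name, qid

# list(perfect_matches(['Pink Floyd'], {'Q123': ['Pink Floyd']}))  # illustrative QID
# -> [('Pink Floyd', 'Q123')]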

Birth and death dates matcher

For each (birth, death) pair (death is optional) of a target identifier, and for each of the matchers in [#6, #9, #10], match against Wikidata item birth and death dates, as well as labels.
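
A minimal sketch of the date check in plain Python, assuming dates are already parsed into comparable values (or None) and that name_matches comes from one of the matchers above; the exact agreement policy is an assumption:

def dates_agree(target_dates, wikidata_dates):
    """Birth dates must coincide; death dates are compared only when both are known."""
    t_birth, t_death = target_dates
    w_birth, w_death = wikidata_dates
    if t_birth != w_birth:
        return False
    if t_death is not None and w_death is not None and t_death != w_death:
        return False
    return True

def confirm_by_dates(name_matches, target_dates, wikidata_dates):
    """Keep only the name matches whose (birth, death) pairs also agree."""
    return [(name, qid) for name, qid in name_matches
            if dates_agree(target_dates[name], wikidata_dates[qid])]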

Index target database entries

  • standard analyzer (tokenization, bag of words);
    - [ ] fuzzy analyzer;
  • optimize #9 #10 #33 runtime from O(nm) to O(n), where n is the length of the source term list and m is the length of the target term list. With an index, the m factor effectively drops to a "constant": the size of the result set returned by a query against the index.

Implement a fuzzy analyzer

As per #35, MariaDB on Toolforge doesn't seem to support one.
Therefore, the fuzzy analyzer must be implemented before ingesting datasets into MariaDB.

Wikipedia links matcher

For each Wikipedia/DBpedia link of a target identifier, match against Wikidata item site links.

Refine item validation criteria

We have 3 criteria: 2 generic ones and 1 specific to the item type.

  1. existence: whether the target ID still exists in the target catalog;
  2. links: URLs overlap;
  3. metadata: statements overlap.
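
A minimal sketch of the two generic criteria in plain Python, assuming sets of URLs and of (property, value) statement pairs on both sides:

def links_overlap(wikidata_urls, target_urls):
    """Criterion 2: fraction of Wikidata URLs also found in the target catalog."""
    if not wikidata_urls:
        return 0.0
    return len(wikidata_urls & target_urls) / len(wikidata_urls)

def metadata_overlap(wikidata_statements, target_statements):
    """Criterion 3: fraction of Wikidata statements also found in the target catalog."""
    if not wikidata_statements:
        return 0.0
    return len(wikidata_statements & target_statements) / len(wikidata_statements)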

Community discussion initiated at:

Refactor target selection module

So far, we have developed several different things in the target_selection module.
It's time now to move logic to appropriate places:

  • discogs, bne, musicbrainz, matching_strategies.py -> linker module;
  • data extraction functions, e.g., discogs/baseline_matcher#extract_data_from_dump -> importer module, with appropriate sub-folders.

Add code style guidelines

In particular:

  • for paths, no string concatenation: use os.path.join;
  • resources should be loaded with resources.get_data;
  • input paths should be set via click.argument;
  • output paths should be set via click.option.

Band matcher

  • Get Wikidata bands;
  • implement #54;
  • implement #59;
  • run name-based matchers over album titles.

Build a whitelist of URL domains

It should be used at dump extraction time to feed the link table in MariaDB with "good" URLs.
A starting point would be to query Wikidata for URL domains of existing catalog identifiers.
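
A minimal sketch of the filtering step in Python; the domains below are illustrative placeholders, not the actual whitelist:

from urllib.parse import urlparse

WHITELIST = {'musicbrainz.org', 'discogs.com', 'imdb.com'}  # illustrative

def is_good_url(url):
    """Keep a URL only if its domain (ignoring a leading 'www.') is whitelisted."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith('www.'):
        domain = domain[len('www.'):]
    return domain in WHITELIST

Only URLs passing this filter would feed the link table at dump extraction time.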

json library breaks in Python 3.4 with pkgutil.get_data

Python 3.4 is the version deployed on Toolforge.

pkgutil.get_data returns a binary string, i.e., bytes.
json.loads docstring:

  • Python 3.4
    Deserialize s (a str instance containing a JSON document) to a Python object.
  • Python 3.6.5
    Deserialize s (a str, bytes or bytearray instance containing a JSON document) to a Python object.

Temporary fix: don't use get_data
See discogs baseline matcher.
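
An alternative workaround that keeps pkgutil.get_data and still runs on Python 3.4 is to decode the bytes before calling json.loads; the resource path below is illustrative:

import json
import pkgutil

raw = pkgutil.get_data('soweego', 'resources/example.json')  # bytes, illustrative path
data = json.loads(raw.decode('utf-8'))                       # str: accepted on 3.4 too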

Album matcher

  • Get Wikidata musical albums;
  • implement #55;
  • implement #56;
  • run name-based matchers.

Script to import MusicBrainz dump

It should perform the following actions:

  • download the dump;
  • slice the subset about artists;
  • import the subset into the s51434__mixnmatch_large_catalogs_p database on Toolforge.

Validate the existence of MusicBrainz ID in Wikidata

INPUT: full set of Wikidata musicians and bands QIDs with MusicBrainz ID; full set of MusicBrainz artists;
OUTPUT: set of Wikidata QIDs with invalid MusicBrainz ID
Please put the implementation under the soweego/validator folder.
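
A minimal sketch of the core check in plain Python, assuming wikidata_ids maps each QID to its MusicBrainz ID and musicbrainz_ids is the full set of valid artist identifiers:

def invalid_qids(wikidata_ids, musicbrainz_ids):
    """Return the Wikidata QIDs whose MusicBrainz ID is not in the catalog."""
    return {qid for qid, mb_id in wikidata_ids.items()
            if mb_id not in musicbrainz_ids}

The same set difference applies to the Discogs validation issue below.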

Compute IMDB coverage estimation

While working on #62, we realized that BIBSYS suffers from data inconsistency:

  1. the available dump is out of sync with the online resources, with identifiers yielding HTTP 404;
  2. an identifier may have multiple cross-catalog links (VIAF, GND);
  3. an identifier may have correct sitelinks, but wrong cross-catalog links.

IMDB is the next big fish in line.

Validate the existence of Discogs ID in Wikidata

INPUT: full set of Wikidata musicians and bands QIDs with Discogs ID; full set of Discogs artists;
OUTPUT: set of Wikidata QIDs with invalid Discogs ID
Please put the implementation under the soweego/validator folder.
