
soweego's Introduction

soweego: link Wikidata to large catalogs


soweego is a pipeline that connects Wikidata to large-scale third-party catalogs.

soweego is the only system that makes statisticians, epidemiologists, historians, and computer scientists agree. Why? Because it performs record linkage, data matching, and entity resolution at the same time. Too easy: they all seem to be synonyms!

Oh, soweego also embeds Machine Learning and advocates for Linked Data.

Is soweego similar to the Go game?

Official Project Pages

soweego is made possible thanks to the Wikimedia Foundation.

Documentation

https://soweego.readthedocs.io/

Highlights

Get Ready

Install Docker and Compose, then enter soweego:

$ git clone -b v1.1 https://github.com/Wikidata/soweego.git
$ cd soweego
$ ./docker/run.sh
Building soweego
...

root@70c9b4894a30:/app/soweego#

Now it's too late to get out!

Run the Pipeline

Piece of cake:

:/app/soweego# python -m soweego run CATALOG

Pick CATALOG from discogs, imdb, or musicbrainz.

These steps are executed by default:

  1. import the target catalog into a local database;
  2. link Wikidata to the target with a supervised linker;
  3. synchronize Wikidata to the target.

Results are in /app/shared/results.

Use the Command Line

You can launch every single soweego action with CLI commands:

:/app/soweego# python -m soweego
Usage: soweego [OPTIONS] COMMAND [ARGS]...

  Link Wikidata to large catalogs.

Options:
  -l, --log-level <TEXT CHOICE>...
                                  Module name followed by one of [DEBUG, INFO,
                                  WARNING, ERROR, CRITICAL]. Multiple pairs
                                  allowed.
  --help                          Show this message and exit.

Commands:
  importer  Import target catalog dumps into a SQL database.
  ingester  Take soweego output into Wikidata items.
  linker    Link Wikidata items to target catalog identifiers.
  run       Launch the whole pipeline.
  sync      Sync Wikidata to target catalogs.

Just two things to remember:

  1. you can always get --help;
  2. each command may have sub-commands.

Contribute

The best way is to import a new catalog. Please also have a look at the guidelines.

License

The source code is under the terms of the GNU General Public License, version 3.

soweego's People

Contributors

edoardolenzi9, marfox, maxfrax, pre-commit-ci[bot], tupini07


soweego's Issues

Match MusicBrainz releases

Currently, only 37 out of 256k albums in Wikidata have a MusicBrainz identifier:

SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q482994 ;
        wdt:P434 ?identifier .
}

Investigate how we could leverage the album data in MusicBrainz to populate both album identifiers and performer statements, like:
Animals, performer, Pink Floyd

Bot ingestor

  • create a bot request, after #66;
  • implement the bot in soweego/ingestor;
  • import accurate links as per #62.

Token similarity for URLs

We cannot be sure that a URL built from Wikidata is the same as the one available in the target databases.

We could split the URLs on '/' and '.' to get tokens. After excluding grammar words like "https" and so on, we can estimate the similarity between URLs.

E.g. (trivial):
https://twitter.com/BarackObama -> "https" "twitter" "com" "BarackObama"
After removing grammar words we are left with "twitter" "BarackObama", which is the core information we need to match against another form of the same Twitter URL, like http://www.twitter.com/BarackObama -> "twitter" "BarackObama" (after cleaning).
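
A minimal sketch of this tokenization in plain Python; the list of grammar words below is an illustrative assumption, and the split also covers ':' so that the URL scheme is dropped:

import re

# Assumed "grammar words"; a real blacklist would need to be curated.
STOP_TOKENS = {'http', 'https', 'www', 'com', 'org', 'net'}

def url_tokens(url):
    """Split a URL on '/', '.' and ':', then drop grammar words."""
    tokens = re.split(r'[/.:]+', url)
    return {t for t in tokens if t and t.lower() not in STOP_TOKENS}

def urls_match(url1, url2):
    """Consider two URLs equivalent if their core tokens coincide."""
    return url_tokens(url1) == url_tokens(url2)

# urls_match('https://twitter.com/BarackObama', 'http://www.twitter.com/BarackObama')
# -> True: both reduce to {'twitter', 'BarackObama'}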

Jaro-Winkler distance name matcher

- [ ] Port this Java implementation into Python: https://github.com/fbk/utils/blob/master/utils-core/src/main/java/eu/fbk/utils/core/strings/JaroWinklerDistance.java,
  based on Apache Commons Lang: https://commons.apache.org/proper/commons-text/javadocs/api-release/index.html?org/apache/commons/text/similarity/JaroWinklerDistance.html;
  the original source code should be contained here: http://it.apache.contactlab.it//commons/lang/source/commons-lang3-3.7-src.tar.gz;

  • use jellyfish;
  • for each full name of a target identifier, compute the Jaro-Winkler edit distance against all Wikidata item labels.
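
A minimal sketch of the jellyfish-based approach; depending on the library version, the function is jellyfish.jaro_winkler() or jellyfish.jaro_winkler_similarity(), and the threshold below is an assumption to be tuned:

import jellyfish

THRESHOLD = 0.85  # assumed cut-off

def best_label_match(full_name, wikidata_labels):
    """Return the Wikidata label closest to the target full name, with its score."""
    scored = [(label, jellyfish.jaro_winkler_similarity(full_name, label))
              for label in wikidata_labels]
    label, score = max(scored, key=lambda pair: pair[1])
    return (label, score) if score >= THRESHOLD else (None, score)

# best_label_match('Jon Smith', ['John Smith', 'Jane Doe'])
# -> ('John Smith', <score above the threshold>)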

Script to import BIBSYS dump

It should perform the following actions:

  • import the dataset into the s51434__mixnmatch_large_catalogs_p database on Toolforge.

Person entries in the dump look like:

<http://data.bibsys.no/data/notrbib/authorityentry/x90054225> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .

Name cleaning procedure

  1. Remove honorific words, e.g., Sir, Mlle, Jr., Sr., PhD, MD, M.D. (prefixes), de, de la, of, von (infixes). See the mix'n'match regexps below:
/[, ]+(Jr\.{0,1}|Sr\.{0,1}|PhD\.{0,1}|MD|M\.D\.)$
/^(Sir|Baron|Baronesse{0,1}|Graf|Gräfin)\s+/
/\b(Mmle|pseud\.|diverses)\b/
/ Bt$/' , ' Baronet'
'/^(.+)[ ,]+[sj]r\.{0,1}$/i'
/^(.+)[ ,]+I+\.{0,1}$/i
  2. remove name initials, with or without a dot, e.g., M.;
  3. remove commas, dashes, and quotes;
  4. normalize: convert to ASCII via a diacritics map, e.g., { 'à': 'a' };
  5. lowercase;
  6. split on white spaces.
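
A minimal sketch of the whole procedure in Python; the honorifics regexp is a small illustrative subset of the mix'n'match ones, and unicodedata stands in for the diacritics map:

import re
import unicodedata

# Illustrative subset of prefixes and infixes, not the full mix'n'match set.
HONORIFICS = re.compile(r'\b(Sir|Mlle|Jr\.?|Sr\.?|PhD|M\.?D\.?|de la|de|of|von)\b')

def clean_name(name):
    name = HONORIFICS.sub(' ', name)                    # 1. honorific words
    name = re.sub(r'\b[A-Z]\.?(?=\s|$)', ' ', name)     # 2. name initials, with or without dot
    name = re.sub("[-,.'\"]", ' ', name)                # 3. commas, dashes, quotes (plus stray dots)
    name = unicodedata.normalize('NFKD', name)          # 4. ASCII normalization
    name = name.encode('ascii', 'ignore').decode('ascii')
    name = name.lower()                                 # 5. lowercase
    return name.split()                                 # 6. split on white spaces

# clean_name('Sir Arthur C. Clarke, Jr.') -> ['arthur', 'clarke']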

Investigate phonetic algorithms

Soundex, metaphone, and other phonetic algorithms can be useful to normalize ideographic languages like Japanese and Chinese.

The jellyfish Python library we use implements those algorithms, but does not seem to support ideographic languages (exceptions raised or empty strings as output).
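
A quick illustration with jellyfish; the behaviour on ideographic input is the one described above (whether it raises or returns an empty string depends on the library version):

import jellyfish

jellyfish.soundex('Robert')    # 'R163': fine for Latin-script names
jellyfish.metaphone('Robert')  # 'RBRT'
# Ideographic input is the problem: calls like jellyfish.soundex('東京')
# raise an exception or return an empty/meaningless string, so a
# transliteration (romanization) step would be needed first.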

Perfect string matcher

This is the most straightforward one.

For each full name of a target identifier:

  1. lowercase the string;
  2. match against Wikidata item labels (also lowercased).
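
A minimal sketch in plain Python, assuming wikidata_labels maps QIDs to label lists (the sample QID below is illustrative):

def perfect_matches(target_names, wikidata_labels):
    """Yield (target name, QID) pairs whose lowercased strings are identical."""
    index = {}
    for qid, labels in wikidata_labels.items():
        for label in labels:
            index.setdefault(label.lower(), set()).add(qid)
    for name in target_names:
        for qid in index.get(name.lower(), ()):
            yield name, qid

# list(perfect_matches(['Pink Floyd'], {'Q123': ['Pink Floyd']}))  # illustrative QID
# -> [('Pink Floyd', 'Q123')]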

Birth and death dates matcher

For each (birth, death) pair (death is optional) of a target identifier, and for each of the matchers in [#6, #9, #10], match against Wikidata item birth and death dates, as well as labels.
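
A minimal sketch of the date check in plain Python, assuming dates are already parsed into comparable values (or None) and that name_matches comes from one of the matchers above; the exact agreement policy is an assumption:

def dates_agree(target_dates, wikidata_dates):
    """Birth dates must coincide; death dates are compared only when both are known."""
    t_birth, t_death = target_dates
    w_birth, w_death = wikidata_dates
    if t_birth != w_birth:
        return False
    if t_death is not None and w_death is not None and t_death != w_death:
        return False
    return True

def confirm_by_dates(name_matches, target_dates, wikidata_dates):
    """Keep only the name matches whose (birth, death) pairs also agree."""
    return [(name, qid) for name, qid in name_matches
            if dates_agree(target_dates[name], wikidata_dates[qid])]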

Index target database entries

  • standard analyzer (tokenization, bag of words);
    - [ ] fuzzy analyzer;
  • optimize #9 #10 #33 runtime from O(nm) to O(n), where n is the length of the source term list and m is the length of the target term list. With an index, the m factor effectively drops to a "constant": the size of the result set returned by a query against the index.

Implement a fuzzy analyzer

As per #35, MariaDB on Toolforge doesn't seem to support one.
Therefore, the fuzzy analyzer must be implemented before ingesting datasets into MariaDB.

Wikipedia links matcher

For each Wikipedia/DBpedia link of a target identifier, match against Wikidata item site links.

Refine item validation criteria

We have 3 criteria: 2 generic ones and 1 specific to the item type.

  1. existence: whether the target ID still exists in the target catalog;
  2. links: URLs overlap;
  3. metadata: statements overlap.
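
A minimal sketch of the two generic criteria in plain Python, assuming sets of URLs and of (property, value) statement pairs on both sides:

def links_overlap(wikidata_urls, target_urls):
    """Criterion 2: fraction of Wikidata URLs also found in the target catalog."""
    if not wikidata_urls:
        return 0.0
    return len(wikidata_urls & target_urls) / len(wikidata_urls)

def metadata_overlap(wikidata_statements, target_statements):
    """Criterion 3: fraction of Wikidata statements also found in the target catalog."""
    if not wikidata_statements:
        return 0.0
    return len(wikidata_statements & target_statements) / len(wikidata_statements)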

Community discussion initiated at:

Refactor target selection module

So far, we have developed several different things in the target_selection module.
It's time now to move logic to appropriate places:

  • discogs, bne, musicbrainz, matching_strategies.py -> linker module;
  • data extraction functions, e.g., discogs/baseline_matcher#extract_data_from_dump -> importer module, with appropriate sub-folders.

Add code style guidelines

In particular:

  • for paths, no string concatenation: use os.path.join;
  • resources should be loaded with resources.get_data;
  • input paths should be set via click.argument;
  • output paths should be set via click.option.

Band matcher

  • Get Wikidata bands;
  • implement #54;
  • implement #59;
  • run name-based matchers over album titles.

Build a whitelist of URL domains

It should be used at dump extraction time to feed the link table in MariaDB with "good" URLs.
A starting point would be to query Wikidata for URL domains of existing catalog identifiers.
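
A minimal sketch of the filtering step in Python; the domains below are illustrative placeholders, not the actual whitelist:

from urllib.parse import urlparse

WHITELIST = {'musicbrainz.org', 'discogs.com', 'imdb.com'}  # illustrative

def is_good_url(url):
    """Keep a URL only if its domain (ignoring a leading 'www.') is whitelisted."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith('www.'):
        domain = domain[len('www.'):]
    return domain in WHITELIST

Only URLs passing this filter would feed the link table at dump extraction time.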

json library breaks in Python 3.4 with pkgutil.get_data

Python 3.4 is the version deployed on Toolforge.

pkgutil.get_data returns a binary string, i.e., bytes.
json.loads docstring:

  • Python 3.4
    Deserialize s (a str instance containing a JSON document) to a Python object.
  • Python 3.6.5
    Deserialize s (a str, bytes or bytearray instance containing a JSON document) to a Python object.

Temporary fix: don't use get_data
See discogs baseline matcher.
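
An alternative workaround that keeps pkgutil.get_data and still runs on Python 3.4 is to decode the bytes before calling json.loads; the resource path below is illustrative:

import json
import pkgutil

raw = pkgutil.get_data('soweego', 'resources/example.json')  # bytes, illustrative path
data = json.loads(raw.decode('utf-8'))                       # str: accepted on 3.4 too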

Album matcher

  • Get Wikidata musical albums;
  • implement #55;
  • implement #56;
  • run name-based matchers.

Script to import MusicBrainz dump

It should perform the following actions:

  • download the dump;
  • slice the subset about artists;
  • import the subset into the s51434__mixnmatch_large_catalogs_p database on Toolforge.

Validate the existence of MusicBrainz ID in Wikidata

INPUT: full set of Wikidata musicians and bands QIDs with MusicBrainz ID; full set of MusicBrainz artists;
OUTPUT: set of Wikidata QIDs with invalid MusicBrainz ID
Please put the implementation under the soweego/validator folder.
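
A minimal sketch of the core check in plain Python, assuming wikidata_ids maps each QID to its MusicBrainz ID and musicbrainz_ids is the full set of valid artist identifiers:

def invalid_qids(wikidata_ids, musicbrainz_ids):
    """Return the Wikidata QIDs whose MusicBrainz ID is not in the catalog."""
    return {qid for qid, mb_id in wikidata_ids.items()
            if mb_id not in musicbrainz_ids}

The same set difference applies to the Discogs validation issue below.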

Compute IMDB coverage estimation

While working on #62, we realized that BIBSYS suffers from data inconsistency:

  1. the available dump is out of sync with the online resources, with identifiers yielding HTTP 404;
  2. an identifier may have multiple cross-catalog links (VIAF, GND);
  3. an identifier may have correct sitelinks, but wrong cross-catalog links.

IMDB is the next big fish in line.

Validate the existence of Discogs ID in Wikidata

INPUT: full set of Wikidata musicians and bands QIDs with Discogs ID; full set of Discogs artists;
OUTPUT: set of Wikidata QIDs with invalid Discogs ID
Please put the implementation under the soweego/validator folder.
