Comments (21)

ringgaard avatar ringgaard commented on September 28, 2024 1

The name table is used in the knowledge browser for name prefix lookups. For entity resolution, you would normally use the phrase table. It can be used for looking up all entities matching a (normalized) name/phrase.
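
To make the difference between the two lookups concrete, here is a toy sketch in plain Python. It does not use SLING and is not how SLING implements its tables: the `normalize` function, the sample names, and the `ID_*` identifiers are all made up for illustration. The phrase-style index returns every entity for an exact normalized phrase, while the name-style index supports prefix search.

```python
import bisect
import string
from collections import defaultdict


def normalize(phrase):
    """Toy normalization: lowercase, drop ASCII punctuation, collapse spaces."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(phrase.lower().translate(table).split())


# Hypothetical (name, id) pairs; the ids are placeholders, not real QIDs.
ALIASES = [
    ("Paris", "ID_CITY"),
    ("Paris", "ID_PERSON"),
    ("Paris, Texas", "ID_TOWN"),
    ("Parma", "ID_OTHER"),
]

# Phrase-table style: normalized phrase -> all entities with that phrase.
phrase_index = defaultdict(list)
for name, item_id in ALIASES:
    phrase_index[normalize(name)].append(item_id)

# Name-table style: sorted normalized names, searched by prefix.
sorted_names = sorted((normalize(name), item_id) for name, item_id in ALIASES)


def lookup_phrase(phrase):
    """Return all ids whose normalized name equals the normalized phrase."""
    return phrase_index.get(normalize(phrase), [])


def lookup_prefix(prefix, limit=10):
    """Return up to `limit` (name, id) pairs whose name starts with prefix."""
    key = normalize(prefix)
    start = bisect.bisect_left(sorted_names, (key,))
    matches = []
    for name, item_id in sorted_names[start:]:
        if not name.startswith(key) or len(matches) >= limit:
            break
        matches.append((name, item_id))
    return matches
```

In this toy, `lookup_phrase("PARIS")` resolves an ambiguous mention to two candidate entities, which is the kind of lookup entity resolution needs, whereas `lookup_prefix("pari")` behaves like search-box completion.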

from sling.

ringgaard avatar ringgaard commented on September 28, 2024 1

Do you want to look up the item for the QID? In that case you can load the KB into memory and do the lookup directly:

import sling

# Load Wikidata knowledge base into memory.
kb = sling.Store()
kb.load("data/e/kb/kb.sling")

# Look up item in KB.
item = kb["Q1222"]
print(item.id, item.name)
print(item.data())

It takes some time (minutes) and resources (~20 GB RAM) to load the knowledge base, but after this, it is very fast to look up items.

You can fetch the knowledge base from the ringgaard.com site with this command:

sling fetch --dataset kb

ringgaard avatar ringgaard commented on September 28, 2024 1

Right now I only store the name and description for the items in the knowledge base to save space. You can find all the labels and aliases in the items (--dataset items). You can look these up using a RecordDatabase:

import sling

items = sling.RecordDatabase("data/e/kb/[email protected]")

commons = sling.Store()
n_name = commons["name"]
n_alias = commons["alias"]
commons.freeze()

rec = items["Q1222"]
item = sling.Store(commons).parse(rec)

for name in item(n_name): print("name", name, name.qualifier())
for alias in item(n_alias): print("alias", alias, alias.qualifier())

ringgaard avatar ringgaard commented on September 28, 2024 1

Each item record contains all the labels and aliases for all the languages (at least the languages supported by SLING). The qualifier tells you the language (if any) of the label/alias.

ringgaard avatar ringgaard commented on September 28, 2024

It does take a long time to import the Wikidata dump because decompressing the bz2 file is slow. You can monitor the progress of the job by looking at http://localhost:6767. It does not involve a "sort step", so it should not produce any large intermediate data sets.
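
To get a feel for how fast a single core decodes bz2 on a given machine, a minimal sketch using only Python's standard `bz2` module could look like this. It is not SLING's importer; the function name `bz2_decode_rate`, the byte limit, and the chunk size are assumptions chosen for illustration.

```python
import bz2
import time


def bz2_decode_rate(path, max_bytes=256 * 1024 * 1024, chunk_size=1 << 20):
    """Decompress up to max_bytes from a .bz2 file.

    Returns (decoded_bytes, elapsed_seconds).
    """
    total = 0
    start = time.perf_counter()
    with bz2.open(path, "rb") as f:
        while total < max_bytes:
            chunk = f.read(min(chunk_size, max_bytes - total))
            if not chunk:  # end of file
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

Calling `bz2_decode_rate("data/c/wikidata/wikidata-latest-all.json.bz2")` and dividing the byte count by the elapsed time gives a rough decoded MB/s figure, which can be compared against the size of the dump to estimate the running time.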

You can fetch phrase and name tables for other languages by using the --language parameter:

sling fetch --dataset nametab,phrasetab --language fr

Evraa avatar Evraa commented on September 28, 2024

Thank you very much for your fast reply and help.

A follow-up question: which languages are available for this command?
sling fetch --dataset nametab,phrasetab --language fr

ringgaard avatar ringgaard commented on September 28, 2024

I process approximately 30 different wikipedias. You can see the language directories here:

https://ringgaard.com/data/kb/

Evraa avatar Evraa commented on September 28, 2024

Thanks Michael, that is great.

I really appreciate the work you and your team are doing.

If I wanted to produce the phrase table and name table for other languages, what tips would you give me?

I got stuck multiple times due to errors regarding export TMPDIR=<path>, etc.

Sorry for bothering you, and thanks.

ringgaard avatar ringgaard commented on September 28, 2024

What command are you running?

Evraa avatar Evraa commented on September 28, 2024

sling build_wiki

Edit:
After the build and installation steps complete, it gets stuck as shown in the image in the first comment.

ringgaard avatar ringgaard commented on September 28, 2024

It takes a lot of resources to run the whole pipeline. You could try to break it into smaller runs and run each task by itself, e.g.,

sling import_wikidata
sling import_wikipedia
sling map_wikipedia
sling parse_wikipedia
sling extract_wikilinks
sling merge_categories
sling invert_categories
sling compute_fanin
sling fuse_items
sling build_kb
sling extract_aliases
sling build_nametab
sling build_phrasetab

Then you can let me know which task gets stuck, and we can take it from there.
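
One way to run the tasks above one at a time and see where things stop is a small driver script. This is only a sketch: it assumes the `sling` command is on the PATH and that it returns a non-zero exit code when a task fails, neither of which is confirmed in this thread. The `run_tasks` helper and its `runner` parameter are made up for illustration.

```python
import subprocess
import time

# Task names taken from the list above, in the same order.
TASKS = [
    "import_wikidata",
    "import_wikipedia",
    "map_wikipedia",
    "parse_wikipedia",
    "extract_wikilinks",
    "merge_categories",
    "invert_categories",
    "compute_fanin",
    "fuse_items",
    "build_kb",
    "extract_aliases",
    "build_nametab",
    "build_phrasetab",
]


def run_tasks(tasks, runner=subprocess.run):
    """Run `sling <task>` one at a time.

    Returns the name of the first task that exits non-zero, or None.
    """
    for task in tasks:
        start = time.time()
        print(f"starting {task}", flush=True)
        result = runner(["sling", task])
        elapsed = time.time() - start
        if result.returncode != 0:
            print(f"{task} failed with exit code {result.returncode} "
                  f"after {elapsed:.0f} s", flush=True)
            return task
        print(f"{task} finished in {elapsed:.0f} s", flush=True)
    return None
```

Calling `run_tasks(TASKS)` stops at the first failing task. A task that hangs is not detected by this script, but the last "starting ..." line on the console shows which task it is.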

Evraa avatar Evraa commented on September 28, 2024

Thank you for your help.
Actually, I broke it down, and it gets stuck at the first step: importing Wikipedia, or even Wikidata. I tried both.

I thought the issue was related to TMPDIR, so after searching for a while I ran these commands:

sudo -i
---@root .. export TMPDIR=../tmp --(umount it first)
then work from there ..
sling import_wikidata
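
A quick sanity check on the TMPDIR setting can rule out the most common problems, such as a relative path like `../tmp` that changes meaning with the current directory. The sketch below uses only the Python standard library; `check_tmpdir` is a hypothetical helper, and whether SLING itself reads TMPDIR is not confirmed in this thread.

```python
import os
import tempfile


def check_tmpdir(env=None):
    """Return a list of problems with TMPDIR; an empty list means it looks usable."""
    env = os.environ if env is None else env
    path = env.get("TMPDIR")
    if not path:
        return ["TMPDIR is not set"]
    problems = []
    if not os.path.isabs(path):
        problems.append(
            f"TMPDIR is relative ({path!r}); it depends on the current directory")
    if not os.path.isdir(path):
        problems.append(f"{path!r} is not an existing directory")
        return problems
    try:
        # Create and delete a scratch file to confirm the directory is writable.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as e:
        problems.append(f"cannot write to {path!r}: {e}")
    return problems
```

Running `print(check_tmpdir())` in the same shell session where TMPDIR was exported shows whether the variable points at an absolute, existing, writable directory.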

ringgaard avatar ringgaard commented on September 28, 2024

You say that it gets stuck at the first step. Did you look at the web dashboard (localhost:6767) to see if there is any progress? You should expect the import_wikidata step to take several hours without any output on the console.

Does it crash? What is the CPU utilization? The import_wikidata task should not use any temp files, so I am a bit confused about what happens.

Evraa avatar Evraa commented on September 28, 2024

[Two screenshots attached.]

Here I ran
sling import_wikipedia

and monitored the CPU utilization. I have left it running for quite some time now, and the utilization is still the same.

ringgaard avatar ringgaard commented on September 28, 2024

It looks like your task is I/O bound, which would be surprising, since the bottleneck in this step is the single-threaded bz2 decoding. Could it be that your access to ./data/c/wikidata/wikidata-latest-all.json.bz2 is slow? You seem to be running in the Azure cloud, so your disks are probably virtual (network) disks. You can try this to see how fast your "disk" I/O is:

time cp ./data/c/wikidata/wikidata-latest-all.json.bz2 /dev/null
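
If copying the whole dump takes too long, a rough alternative is to read only the first part of the file and compute the throughput. This is a sketch with the Python standard library; `read_throughput` and its default sizes are assumptions for illustration, not part of SLING.

```python
import time


def read_throughput(path, max_bytes=1 << 30, chunk_size=1 << 22):
    """Read up to max_bytes sequentially from path and return MB/s."""
    total = 0
    start = time.perf_counter()
    # buffering=0 reads straight from the OS without Python-level buffering.
    with open(path, "rb", buffering=0) as f:
        while total < max_bytes:
            chunk = f.read(min(chunk_size, max_bytes - total))
            if not chunk:  # end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / 1e6 / max(elapsed, 1e-9)
```

Note that a second run over the same bytes may be served from the operating system's page cache and report a much higher number than the disk can actually deliver.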

When I run "sling import_wikidata" and open the dashboard, it looks like this:

[Screenshot of the dashboard.]

There is a built-in webserver running on port 6767 which allows you to monitor the tasks.
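
When working on a remote machine, it can help to first confirm that something is actually listening on that port. A minimal sketch with Python's `socket` module follows; `dashboard_is_up` is a hypothetical helper, and it only checks that a TCP connection can be opened, not that the dashboard page itself loads.

```python
import socket


def dashboard_is_up(host="localhost", port=6767, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution errors.
        return False
```

If this returns True while a task is running, pointing a browser at http://localhost:6767 (or forwarding the port over SSH from a cloud machine) should show the progress.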

Evraa avatar Evraa commented on September 28, 2024

I hadn't tried the dashboard for monitoring; I will try it.
And yes, I am using Azure with mounted storage, which would make it I/O bound. I will try a different machine that has the data on a local disk rather than a mounted one.

Evraa avatar Evraa commented on September 28, 2024

Sorry for bothering you again, Michael.
I have a question: what is name-table.repo used for, and how should it be used in entity recognition?

Thanks.

Evraa avatar Evraa commented on September 28, 2024

Hello again Michael.

In SLING, is there a function I can use to look up QIDs from Wikidata (e.g. Q1222)?
If not, would you recommend a fast Python library for this?

I tried pywikibot and qwikidata,
but they are slow, handling a request in about 0.3 sec on average.

Thanks.

Evraa avatar Evraa commented on September 28, 2024

That's perfect, and a really quick lookup.
But one important part of the data is missing: the "labels".
Is there a way to extract them, or are they not stored in the knowledge base in the first place?

Thanks

Evraa avatar Evraa commented on September 28, 2024

This is astonishing. I will try that out at once.

Regarding the items, is there a different item set for each language?

Evraa avatar Evraa commented on September 28, 2024

This is actually GREAT.
Thanks a lot, Michael.
