Greetings All, Great effort and astonishing work you've been doing here, really ap

Halting on wikidata/text-file-reader about sling HOT 21 CLOSED

ringgaard commented on September 28, 2024

Halting on wikidata/text-file-reader

from sling.

Comments (21)

ringgaard commented on September 28, 2024 1

The name table is used in the knowledge browser for name prefix lookups. For entity resolution, you would normally use the phrase table. It can be used for looking up all entities matching a (normalized) name/phrase.
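To make the difference between the two lookups concrete, here is a toy in-memory sketch in plain Python. It is not SLING's name table or phrase table format or API; the class `ToyNameIndex`, the `normalize` function, and the entity IDs `E1` to `E4` are all hypothetical and only illustrate prefix lookup versus normalized exact-phrase lookup.

```python
import bisect
import unicodedata


def normalize(phrase):
    """Toy normalization: strip accents, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", phrase)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


class ToyNameIndex:
    """Illustrative index only; not how SLING stores its tables."""

    def __init__(self, pairs):
        # pairs: iterable of (name, entity_id) tuples.
        self.sorted_names = sorted((normalize(n), e) for n, e in pairs)
        self.by_phrase = {}
        for name, entity in self.sorted_names:
            self.by_phrase.setdefault(name, []).append(entity)

    def prefix_lookup(self, prefix, limit=10):
        """Name-table style: names that start with the given prefix."""
        key = normalize(prefix)
        start = bisect.bisect_left(self.sorted_names, (key, ""))
        matches = []
        for name, entity in self.sorted_names[start:]:
            if not name.startswith(key) or len(matches) >= limit:
                break
            matches.append((name, entity))
        return matches

    def phrase_lookup(self, phrase):
        """Phrase-table style: all entities matching the normalized phrase."""
        return self.by_phrase.get(normalize(phrase), [])


# Hypothetical entities; the IDs are placeholders, not Wikidata QIDs.
index = ToyNameIndex([
    ("Paris", "E1"),
    ("Paris", "E2"),
    ("Paris Hilton", "E3"),
    ("Parma", "E4"),
])
```

Here `index.prefix_lookup("Par")` returns all four entries, which is what a browser search box wants, while `index.phrase_lookup("  PARIS ")` returns only the two entities whose name is exactly "Paris" after normalization, which is the candidate set an entity resolver would then disambiguate.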

ringgaard commented on September 28, 2024 1

Do you want to look up the item for the QID? In that case you can load the KB into memory and do the look up directly:

import sling

# Load Wikidata knowledge base into memory.
kb = sling.Store()
kb.load("data/e/kb/kb.sling")

# Look up item in KB.
item = kb["Q1222"]
print(item.id, item.name)
print(item.data())

It takes some time (minutes) and resources (~20 GB RAM) to load the knowledge base, but after this, it is very fast to look up items.

You can fetch the knowledge base from the ringgaard.com site with this command:

sling fetch --dataset kb

ringgaard commented on September 28, 2024 1

Right now I only store the name and description for the items in the knowledge base to save space. You can find all the labels and aliases in the items (--dataset items). You can look these up using a RecordDatabase:

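import sling

# Open the record database with the full Wikidata items.
items = sling.RecordDatabase("data/e/kb/[email protected]")

# Look up the symbols for the name and alias roles, then freeze the commons store.
commons = sling.Store()
n_name = commons["name"]
n_alias = commons["alias"]
commons.freeze()

# Fetch the raw record for the item and parse it into a local store.
rec = items["Q1222"]
item = sling.Store(commons).parse(rec)

# Print every label and alias together with its qualifier.
for name in item(n_name): print("name", name, name.qualifier())
for alias in item(n_alias): print("alias", alias, alias.qualifier())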

ringgaard commented on September 28, 2024 1

Each item record contains all the labels and aliases for all the languages (at least the languages supported by SLING). The qualifier tells you the language (if any) of the label/alias.

ringgaard commented on September 28, 2024

It does take a long time to import the Wikidata dump because decompressing the bz2 file is slow. You can monitor the progress of the job by looking at http://localhost:6767. It does not involve a "sort step", so it should not produce any large intermediate data sets.

You can fetch phrase and name tables for other languages by using the --language parameter:

sling fetch --dataset nametab,phrasetab --language fr

Evraa commented on September 28, 2024

Thank you very much for your fast reply and help.

Follow-up question: which languages are available for this command?
sling fetch --dataset nametab,phrasetab --language fr

ringgaard commented on September 28, 2024

I process approximately 30 different wikipedias. You can see the language directories here:

https://ringgaard.com/data/kb/

Evraa commented on September 28, 2024

Thanks Michael, that is great.

I really appreciate the work you and your team are doing.

If I wanted to produce the phrase table and name table for other languages, what tips would you give me?

I got stuck multiple times on errors related to export TMPDIR=<path> and similar.

Sorry for bothering you, and thanks.

ringgaard commented on September 28, 2024

What command are you running?

Evraa commented on September 28, 2024

sling build_wiki

Edit: after the builds and installations complete, it gets stuck as shown in the image in the first comment.

ringgaard commented on September 28, 2024

It takes a lot of resources to run the whole pipeline. You could try to break it into smaller runs and run each task by itself, e.g.:

sling import_wikidata
sling import_wikipedia
sling map_wikipedia
sling parse_wikipedia
sling extract_wikilinks
sling merge_categories
sling invert_categories
sling compute_fanin
sling fuse_items
sling build_kb
sling extract_aliases
sling build_nametab
sling build_phrasetab

Then you can let me know which task gets stuck, and we can take it from there.
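If typing the tasks one at a time is tedious, a small wrapper along these lines could run them in order and report where things stop. This is only a sketch: it assumes the `sling` command is on your PATH and returns a non-zero exit code when a task fails, and the names `run_task` and `run_pipeline` are made up for this example.

```python
import subprocess

# Task names copied from the list above, in the same order.
TASKS = [
    "import_wikidata",
    "import_wikipedia",
    "map_wikipedia",
    "parse_wikipedia",
    "extract_wikilinks",
    "merge_categories",
    "invert_categories",
    "compute_fanin",
    "fuse_items",
    "build_kb",
    "extract_aliases",
    "build_nametab",
    "build_phrasetab",
]


def run_task(task):
    """Run `sling <task>` and return its exit code (assumes sling is on PATH)."""
    return subprocess.run(["sling", task]).returncode


def run_pipeline(tasks=TASKS, runner=run_task):
    """Run tasks in order; return the first failing task name, or None."""
    for task in tasks:
        # Printed before the task starts, so a hang is easy to attribute.
        print("running", task, flush=True)
        code = runner(task)
        if code != 0:
            print("task failed:", task, "exit code", code, flush=True)
            return task
    return None
```

Calling `run_pipeline()` stops at the first task that exits with an error. A task that hangs will simply block, in which case the last "running" line on the console tells you which task it was.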

Evraa commented on September 28, 2024

Thank you for your help.
Actually, I broke it down and it gets stuck at the first step: importing Wikipedia, or even Wikidata. I tried both.

I thought the issue was related to TMPDIR, so after searching for a while I ran these commands:

sudo -i
(as root) export TMPDIR=../tmp   (after unmounting it first)
then, working from there:
sling import_wikidata

ringgaard commented on September 28, 2024

You say that it gets stuck at first step. Did you look at the web dashboard (localhost:6767) to see if there is any progress? You should expect the import_wikidata step to take several hours without any output on the console.

Does it crash? What is the CPU utilization? The import_wikidata task should not use any temp files, so I am a bit confused about what happens.
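If the machine is remote and you cannot open a browser on it, a quick way to check that the dashboard is at least listening is to try a TCP connection to the port. This is a minimal sketch using only the Python standard library; the function name `dashboard_is_up` is made up for this example, and it only tests that something accepts connections on the port, not that the job is making progress.

```python
import socket


def dashboard_is_up(host="localhost", port=6767, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False
```

If `dashboard_is_up()` returns False while the task is supposedly running, the process has probably exited or never started its web server.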

Evraa commented on September 28, 2024

Here I ran
sling import_wikipedia

and monitored CPU utilization. I have left it running for quite some time now and the utilization is still the same.

ringgaard commented on September 28, 2024

It looks like your task is I/O bound, which would be surprising, since the bottleneck in this step is the single-threaded bz2 decoding. Could it be that your access to ./data/c/wikidata/wikidata-latest-all.json.bz2 is slow? You seem to be running in Azure cloud, so your disks are probably virtual (network) disks. You can try this to see how fast your "disk" I/O is:

time cp ./data/c/wikidata/wikidata-latest-all.json.bz2 /dev/null
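To compare that number with how fast a single core can decode bz2 on the same machine, something like the following Python sketch could help. It uses only the standard library and is not how SLING reads the dump; the function names and the `limit_bytes` parameter are made up for this example. Both functions report MB/s of compressed input, so the two results are directly comparable.

```python
import bz2
import time

CHUNK = 1 << 20  # read 1 MiB of compressed data at a time


def raw_read_rate(path, limit_bytes=None):
    """Read the file without decoding; return MB/s of bytes read."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while limit_bytes is None or total < limit_bytes:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total / elapsed / 1e6


def bz2_decode_rate(path, limit_bytes=None):
    """Read and decode the file; return MB/s of compressed bytes consumed."""
    start = time.perf_counter()
    consumed = 0
    decoder = bz2.BZ2Decompressor()
    with open(path, "rb") as f:
        while limit_bytes is None or consumed < limit_bytes:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            consumed += len(chunk)
            while chunk:
                decoder.decompress(chunk)  # decoded output is discarded
                chunk = b""
                if decoder.eof:
                    # The file may hold several bz2 streams back to back.
                    chunk = decoder.unused_data
                    decoder = bz2.BZ2Decompressor()
    elapsed = max(time.perf_counter() - start, 1e-9)
    return consumed / elapsed / 1e6
```

For example, call `raw_read_rate(path, limit_bytes=2 * 10**9)` first and `bz2_decode_rate(path, limit_bytes=2 * 10**9)` second. If the raw rate is not clearly higher than the decode rate, storage is the likely bottleneck. Keep in mind that the second pass over the same bytes may be served from the OS page cache, so the raw number from a cold first read is the one to trust.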

When I run "sling import_wikidata" and open the dashboard, it looks like this:

There is a built-in webserver running on port 6767 which allows you to monitor the tasks.

Evraa commented on September 28, 2024

I haven't tried accessing the dashboard for monitoring yet; I will try it.
And yes, I am using Azure with mounted storage, so that would be I/O bound. I will try a different machine that has the data on a local disk rather than a mount.

Evraa commented on September 28, 2024

Sorry for bothering you again, Michael.
I have a question: what is name-table.repo used for, and how do I use it appropriately for entity recognition?

Thanks

Evraa commented on September 28, 2024

Hello again Michael.

In SLING, is there a function I can use to look up QIDs from Wikidata (e.g. Q1222)?
If there is not, which Python library would you recommend as the fastest?

I tried pywikibot and qwikidata, but they are slow, handling requests in 0.3 s on average.

Thanks.

Evraa commented on September 28, 2024

That's perfect, and a really quick lookup.
But one important part of the data is missing: the "labels".
Is there a way to extract them, or are they not stored in the knowledge base in the first place?

Thanks

Evraa commented on September 28, 2024

This is astonishing. I will try that out at once.

Regarding items, is there a different item set for each language?

Evraa commented on September 28, 2024

This is actually GREAT.
Thanks a lot, Michael.
