Comments (21)
The name table is used in the knowledge browser for name prefix lookups. For entity resolution, you would normally use the phrase table. It can be used for looking up all entities matching a (normalized) name/phrase.
from sling.
Do you want to look up the item for the QID? In that case you can load the KB into memory and do the lookup directly:
import sling
# Load Wikidata knowledge base into memory.
kb = sling.Store()
kb.load("data/e/kb/kb.sling")
# Look up item in KB.
item = kb["Q1222"]
print(item.id, item.name)
print(item.data())
It takes some time (minutes) and resources (~20 GB RAM) to load the knowledge base, but after this, it is very fast to look up items.
You can fetch the knowledge base from the ringgaard.com site with this command:
sling fetch --dataset kb
from sling.
Right now I only store the name and description for the items in the knowledge base to save space. You can find all the labels and aliases in the items (--dataset items). You can look these up using a RecordDatabase:
import sling
items = sling.RecordDatabase("data/e/kb/[email protected]")
commons = sling.Store()
n_name = commons["name"]
n_alias = commons["alias"]
commons.freeze()
rec = items["Q1222"]
item = sling.Store(commons).parse(rec)
for name in item(n_name): print("name", name, name.qualifier())
for alias in item(n_alias): print("alias", alias, alias.qualifier())
from sling.
Each item record contains all the labels and aliases for all the languages (at least the languages supported by SLING). The qualifier tells you the language (if any) of the label/alias.
from sling.
It does take a long time to import the Wikidata dump because decompressing the bz2 file is slow. You can monitor the progress of the job by looking at http://localhost:6767. It does not involve a "sort step", so it should not produce any large intermediate data sets.
You can fetch phrase and name tables for other languages by using the --language parameter:
sling fetch --dataset nametab,phrasetab --language fr
from sling.
Thank you very much for your fast reply and help.
Follow-up question: which languages are available for this command:
sling fetch --dataset nametab,phrasetab --language fr
from sling.
I process approximately 30 different wikipedias. You can see the language directories here:
https://ringgaard.com/data/kb/
from sling.
Thanks Michael, that is great.
I really appreciate the work you and your team are doing.
If I wanted to produce the phrase table and name table for other languages, what tips would you give me?
I got stuck multiple times due to errors regarding export TMPDIR=<path>, etc.
Sorry for bothering you, and thanks.
from sling.
What command are you running?
from sling.
sling build_wiki
Edit: it gets stuck after the build and installation steps have completed, as in the image in the first comment.
from sling.
It takes a lot of resources to run the whole pipeline. You could try to break it into smaller runs and run each task by itself, e.g.:
sling import_wikidata
sling import_wikipedia
sling map_wikipedia
sling parse_wikipedia
sling extract_wikilinks
sling merge_categories
sling invert_categories
sling compute_fanin
sling fuse_items
sling build_kb
sling extract_aliases
sling build_nametab
sling build_phrasetab
Then you can let me know which task gets stuck, and we can take it from there.
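One way to find the failing task without babysitting the console is a small driver script. This is only a sketch: the task names are copied from the list above, while run_tasks and its run parameter are hypothetical helpers introduced for illustration (not part of SLING). It assumes the sling command is on the PATH and that a failing task exits with a non-zero code.

```python
import subprocess
import time

# Task names copied from the list above, in pipeline order.
TASKS = [
    "import_wikidata", "import_wikipedia", "map_wikipedia", "parse_wikipedia",
    "extract_wikilinks", "merge_categories", "invert_categories",
    "compute_fanin", "fuse_items", "build_kb", "extract_aliases",
    "build_nametab", "build_phrasetab",
]


def run_tasks(tasks, run=subprocess.call):
    """Run `sling <task>` for each task in order, stopping at the first failure.

    `run` takes an argument list and returns an exit code. It is a parameter
    only so the loop can be exercised without SLING installed (hypothetical
    helper, not a SLING API).
    Returns (completed, failed); failed is None if every task succeeded.
    """
    completed = []
    for task in tasks:
        start = time.monotonic()
        code = run(["sling", task])
        elapsed = time.monotonic() - start
        print(f"{task}: exit code {code} after {elapsed:.0f}s")
        if code != 0:
            return completed, task
        completed.append(task)
    return completed, None


# Usage (assumes `sling` is installed and on the PATH):
#   done, failed = run_tasks(TASKS)
#   print("first failing task:", failed)
```

Note that this only catches tasks that exit with an error. A task that hangs will simply block the loop, so the elapsed times are mainly useful for the tasks that do finish.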
from sling.
Thank you for your help.
Actually, I broke it down and it gets stuck at the first step: importing Wikipedia, or even Wikidata. I tried both.
I thought the issue was related to TMPDIR, so after searching for a while I ran these commands:
sudo -i
export TMPDIR=../tmp   (as root; umount it first)
and then worked from there:
sling import_wikidata
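Since TMPDIR is set to a relative path here (../tmp), it resolves against whatever the current directory is when a program starts, which is an easy way to end up pointing at a directory that does not exist. A minimal sanity check could look like the sketch below. check_tmpdir is a hypothetical helper, not part of SLING, and it says nothing about whether a given SLING task actually uses TMPDIR.

```python
import os
import shutil
import tempfile


def check_tmpdir(path):
    """Report whether `path` is usable as a temp directory.

    Hypothetical helper for illustration only.
    """
    report = {
        "path": path,
        "absolute": os.path.isabs(path),    # relative paths depend on the cwd
        "resolved": os.path.abspath(path),  # what it resolves to right now
        "is_dir": os.path.isdir(path),
        "writable": False,
        "free_bytes": None,
    }
    if report["is_dir"]:
        try:
            # Creating a file is a more reliable test than checking mode bits.
            with tempfile.TemporaryFile(dir=path):
                report["writable"] = True
        except OSError:
            pass
        report["free_bytes"] = shutil.disk_usage(path).free
    return report


# Usage:
#   print(check_tmpdir(os.environ.get("TMPDIR", tempfile.gettempdir())))
```

Setting TMPDIR to an absolute path avoids the dependence on the current directory.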
from sling.
You say that it gets stuck at the first step. Did you look at the web dashboard (localhost:6767) to see if there is any progress? You should expect the import_wikidata step to take several hours without any output on the console.
Does it crash? What is the CPU utilization? The import_wikidata task should not use any temp files, so I am a bit confused about what happens.
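If the machine is remote or headless, a quick check from a second terminal can at least confirm that the dashboard is answering while the task runs. The sketch below only tests whether an HTTP server responds on the port mentioned above. dashboard_reachable is a hypothetical helper, and it makes no assumption about what the dashboard page contains.

```python
import urllib.error
import urllib.request


def dashboard_reachable(url="http://localhost:6767", timeout=5.0):
    """Return True if an HTTP server answers at `url` (hypothetical helper)."""
    # Bypass any configured HTTP proxy, since the target is on this machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: nothing is listening.
        return False


# Usage, while a sling task is running in another terminal:
#   print(dashboard_reachable())
```

If this returns False while the task is running, the job has probably not started yet or has already exited.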
from sling.
Here I ran
sling import_wikipedia
and monitored CPU utilization. I have left it running for quite some time now, and the utilization is still the same.
from sling.
It looks like your task is I/O bound, which would be surprising, since the bottleneck in this step is the single-threaded bz2 decoding. Could it be that your access to ./data/c/wikidata/wikidata-latest-all.json.bz2 is slow? You seem to be running in Azure cloud, so your disks are probably virtual (network) disks. You can try this to see how fast your "disk" I/O is:
time cp ./data/c/wikidata/wikidata-latest-all.json.bz2 /dev/null
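The same comparison can be made from Python, which also lets you time bz2 decoding on a prefix of the file instead of the whole dump. This is a rough sketch using only the standard library. read_throughput and bz2_throughput are hypothetical helper names, and the numbers are only indicative because the operating system may cache the file after the first read.

```python
import bz2
import time


def read_throughput(path, chunk_size=1 << 20, max_bytes=None):
    """Read `path` sequentially; return (bytes_read, seconds).

    Roughly what `time cp <file> /dev/null` measures. Stops once at least
    `max_bytes` have been read, if given.
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while max_bytes is None or total < max_bytes:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def bz2_throughput(path, chunk_size=1 << 20, max_bytes=None):
    """Decompress `path`; return (decompressed_bytes, seconds).

    The time includes reading the compressed data from disk.
    """
    total = 0
    start = time.perf_counter()
    with bz2.open(path, "rb") as f:
        while max_bytes is None or total < max_bytes:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


# Usage (path taken from the comment above):
#   path = "./data/c/wikidata/wikidata-latest-all.json.bz2"
#   n, secs = read_throughput(path, max_bytes=1 << 30)
#   print(f"raw read: {n / secs / 1e6:.1f} MB/s")
```

If the raw read rate is very low, the storage is the likely bottleneck rather than the bz2 decoder.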
When I run "sling import_wikidata" and open the dashboard, it looks like this:
There is a built-in webserver running on port 6767 which allows you to monitor the tasks.
from sling.
I hadn't tried the dashboard for monitoring; I will try it.
And yes, I am using Azure with mounted storage, so it would be I/O bound. I will try a different machine that has the data on a local disk instead of a mounted one.
from sling.
Sorry for bothering you again, Michael.
I have a question: what is name-table.repo used for, and how should it be used in entity recognition?
Thanks.
from sling.
Hello again, Michael.
In SLING, is there any function that I can use to look up QIDs from Wikidata (e.g. Q1222)?
If there is not, which Python library would you recommend as the fastest?
I tried pywikibot and qwikidata, but they are slow, handling requests in 0.3 sec on average.
Thanks.
from sling.
That's perfect, and the lookup is really quick.
But one important part of the data is missing: the "labels".
Is there a way to extract them, or are they not stored in the knowledge base in the first place?
Thanks
from sling.
This is astonishing. I will try that out at once.
Regarding the items: is there a different item set for each language?
from sling.
This is actually GREAT.
Thanks a lot, Michael.
from sling.