
wikipedia-wikidata's Issues

Cleanup wikipedia/data importance computation scripts

The scripts in this repository could do with some cleanup. Currently the import process needs a Postgres database and more than 1TB of disk space. There is room for improvement here. It might even be that we don't need a database at all and can create a PostgreSQL dump directly from the Wiki dumps using some clever stream processing.

Filter WP: titles

WP: and Wikipedia: seem to be special article pages with a huge number of links from articles. converted/vi/pages.csv.gz has 15.420.39 entries, so 99% of pages have a link to the Wikipedia_tiếng_Việt page.

Check for more prefixes.

Check the top 10,000 for more titles to be filtered.

# select * from wikipedia_article_full where totalcount > 10000000 order by totalcount desc limit 20;
 language |        title         | langcount | othercount | totalcount |     importance     |
----------+----------------------+-----------+------------+------------+--------------------+
 vi       | Wikipedia_tiếng_Việt |  15361391 |        270 |   15361661 |                  1 |
 vi       | WP:TNCBQV            |  15220189 |          0 |   15220189 | 0.9994408730181494 |
 vi       | WP:VPBQ              |  15216148 |          0 |   15216148 | 0.9994248258988223 |
 vi       | WP:VBM               |  15201393 |          0 |   15201393 | 0.9993661964570792 |
 vi       | WP:TDTL              |  14266708 |          0 |   14266708 | 0.9955312487794986 |
 vi       | WP:N                 |  11192996 |          0 |   11192996 |  0.980867875718218 |
(6 rows)
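One way to start is a title filter applied during conversion. A minimal sketch; the prefix tuple is an assumption and would need to be extended per language from the top-10,000 check above (pages like Wikipedia_tiếng_Việt would need an extra per-language rule):

# Hypothetical filter; WP: and Wikipedia: are taken from the query above,
# anything beyond that is per-language guesswork.
NAMESPACE_PREFIXES = ("WP:", "Wikipedia:")

def keep_title(title: str) -> bool:
    """Return False for pseudo-article titles that should not be counted."""
    return not title.startswith(NAMESPACE_PREFIXES)

assert not keep_title("WP:TNCBQV")
assert keep_title("Eiffel_Tower")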

null values in titles

Errors during the following UPDATE:

UPDATE ${LANG}pagelinkcount
              SET othercount = ${LANG}pagelinkcount.othercount + x.count
              FROM (
                SELECT page_title AS title,
                       count
                FROM ${LANG}langlinks
[14:18:39] ERROR:  null value in column "title" violates not-null constraint
[14:18:39] DETAIL:  Failing row contains (ca, null, 4, 0, 4, null, null, null, null, null, null, null, null).
[14:21:31] ERROR:  null value in column "title" violates not-null constraint
[14:21:31] DETAIL:  Failing row contains (cs, null, 33, 0, 33, null, null, null, null, null, null, null, null).
[14:23:37] ERROR:  null value in column "title" violates not-null constraint
[14:23:37] DETAIL:  Failing row contains (da, null, 2, 0, 2, null, null, null, null, null, null, null, null).
[14:30:19] ERROR:  null value in column "title" violates not-null constraint
[14:30:19] DETAIL:  Failing row contains (de, null, 7, 0, 7, null, null, null, null, null, null, null, null).
[14:44:21] ERROR:  null value in column "title" violates not-null constraint
[14:44:21] DETAIL:  Failing row contains (en, null, 10, 0, 10, null, null, null, null, null, null, null, null).
[14:51:43] ERROR:  null value in column "title" violates not-null constraint
[14:51:43] DETAIL:  Failing row contains (es, null, 5, 0, 5, null, null, null, null, null, null, null, null).
[15:08:08] ERROR:  null value in column "title" violates not-null constraint
[15:08:08] DETAIL:  Failing row contains (fr, null, 27, 0, 27, null, null, null, null, null, null, null, null).
[15:11:54] ERROR:  null value in column "title" violates not-null constraint
[15:11:54] DETAIL:  Failing row contains (ko, null, 3, 0, 3, null, null, null, null, null, null, null, null).
[15:15:13] ERROR:  null value in column "title" violates not-null constraint
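Until the root cause is found, a stop-gap could be to drop title-less rows before they reach the database. A sketch, not the project's code, assuming the title,page_id,language CSV layout shown in the next issue:

import csv, sys

# Drop langlinks rows with an empty title so they never hit the NOT NULL
# column; reads CSV on stdin, writes filtered CSV to stdout.
writer = csv.writer(sys.stdout)
for row in csv.reader(sys.stdin):
    if row and row[0]:
        writer.writerow(row)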

Filter languages in langlinks file earlier

Example:

Eiffel Tower,9232,hif
The Plot to Blow Up the Eiffel Tower,3009907,hu
The Man on the Eiffel Tower,22716058,lb
Under the Eiffel Tower,59917348,pt
Eiffel Tower,9232,simple 

Eiffel Tower has article id 9232 in the English Wikipedia. The langlinks file shows 157 links from other Wikipedia projects (languages). We have a positive list of languages we want to process (conf/languages.txt), so we could remove many rows; see the filter sketch below.
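A sketch of such a filter, assuming the title,page_id,language layout from the example above and a one-language-per-line format for conf/languages.txt:

import csv, sys

# Keep only langlinks rows whose target language is on the allowlist.
with open("conf/languages.txt") as fh:
    wanted = {line.strip() for line in fh if line.strip()}

writer = csv.writer(sys.stdout)
for row in csv.reader(sys.stdin):
    if len(row) >= 3 and row[2] in wanted:
        writer.writerow(row)

Usage would be something like zcat converted/wikipedia/en/langlinks.csv.gz | python3 filter_langlinks.py | gzip -9 > langlinks_filtered.csv.gz (the script and file names here are hypothetical).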

Duplicate entries in wikimedia_importance CSV

The wikimedia_importance.csv contains duplicates with respect to (language, title). This is not allowed because we have a uniqueness constraint in the database. The duplicate rows differ in their wikidata entry, which is really odd. Example:

en,01_Distribution,0.3191129613920406,Q1554656
en,01_Distribution,0.6355219495742352,Q19616

'Q19616' is the Italian TV station RAI, so that row looks rather wrong.
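A quick way to list the offending pairs before they hit the uniqueness constraint; a sketch (it holds all pairs in memory, which is acceptable for a one-off check):

import csv, gzip
from collections import Counter

# Count (language, title) pairs in the importance CSV and print duplicates.
pairs = Counter()
with gzip.open("wikimedia_importance.csv.gz", "rt") as fh:
    for row in csv.reader(fh):
        pairs[(row[0], row[1])] += 1

for (lang, title), n in pairs.items():
    if n > 1:
        print(lang, title, n)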

StopIteration error in mysqldump_to_csv.py

A new error that causes the output file to be empty. It seems to affect the pagelinks file only, and only a couple of languages:

[01:02:40] [language en] Pagelinks table SQL => CSV
[01:03:40] StopIteration:
[...]
[01:09:13] 4.0K	wiki_build_20220620/converted/wikipedia/en/pagelinks.csv.gz
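For reference, a general Python pattern (not a diagnosis of mysqldump_to_csv.py): a bare next() raises StopIteration when its input ends early, e.g. on a truncated or unexpectedly empty dump, while next() with a default fails gracefully:

def first_insert(lines):
    """Return the first INSERT line, or None if there is none; a bare
    next() here would raise StopIteration on a file without INSERTs."""
    it = (l for l in lines if l.startswith("INSERT"))
    return next(it, None)

print(first_insert(["-- header only, no data"]))  # None instead of a crash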

SPARQL queries for wikidata types can timeout

The queries we run against https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service in the "Get wikidata places from wikidata query API" step sometimes produce incomplete (JSON) output:

tail -1000 *Q811979*.json

#     }, {
#       "item" : {
#         "type" : "uri",
#         "value" : "SPARQL-QUERY: queryStr=SELECT ?item WHERE{?item wdt:P31*/wdt:P279*wd:Q811979;}
# java.util.concurrent.TimeoutException
# 	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
# 	at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:292)

which in the next step produces a warning, e.g. "parse error: Invalid numeric literal at line 7154740, column 18", but the script just continues.

grep -c TimeoutException *.json | grep -v ':0'
tmp_Q15324.json:1
tmp_Q17334923.json:1
tmp_Q177634.json:1
tmp_Q271669.json:1
tmp_Q3257686.json:1
tmp_Q41176.json:1
tmp_Q43229.json:1
tmp_Q486972.json:1
tmp_Q532.json:1
tmp_Q811979.json:1
tmp_Q82794.json:1

Add proper checks for the output and the HTTP status in the loop, possibly with abort or retry behaviour, and check that the limits in https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits are observed. Maybe the queries can be split into sub-types. A sketch of such a guarded loop follows.
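A sketch only; the endpoint and parameters follow the public query service, while retry counts and timeouts are arbitrary choices:

import json, time, urllib.error, urllib.parse, urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

def run_query(sparql, retries=3, backoff=60):
    """Run a SPARQL query, retrying on HTTP errors and on truncated JSON."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": sparql, "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "wikipedia-wikidata"})
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(req, timeout=300) as resp:
                return json.loads(resp.read())   # raises on truncated output
        except (urllib.error.URLError, json.JSONDecodeError):
            time.sleep(backoff * (attempt + 1))  # simple linear backoff
    raise RuntimeError("query failed after %d attempts" % retries)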

wikimedia-importance.sql.gz needs to include index creation

The bottom of the old file contains index creation statements. We currently don't create any:

CREATE INDEX wikipedia_article_osm_type_osm_id_idx ON public.wikipedia_article USING btree (osm_type, osm_id) WHERE (osm_type IS NOT NULL);
CREATE INDEX wikipedia_article_slim_title_language_idx ON public.wikipedia_article USING btree (title, language);
CREATE INDEX wikipedia_article_wd_page_title_idx ON public.wikipedia_article USING btree (wd_page_title);
CREATE INDEX wikipedia_redirect_language_from_title_idx ON public.wikipedia_redirect USING btree (language, from_title);

Different importances for same page in different languages

I did a lookup for pages in all languages for Köln and got different importances:

nominatim=# select * from wikipedia_article_new where wd_page_title = 'Q365';
 language |       title        | langcount | othercount | totalcount |   lat    |   lon   |     importance     | title_en | osm_type | osm_id | wd_page_title | instance_of 
----------+--------------------+-----------+------------+------------+----------+---------+--------------------+----------+----------+--------+---------------+-------------
 eo       | Kolonjo            |      1052 |      79339 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 lt       | Kelnas             |       324 |      80067 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 he       | קלן                |      1826 |      78565 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 ja       | ケルン             |      2011 |      78380 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 fr       | Cologne            |      7434 |      72957 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 cs       | Kolín_nad_Rýnem    |      1629 |      80391 |      82020 | 50.94222 | 6.95778 | 0.6837768090380653 |          |          |        | Q365          | Q14757767
 de       | Köln               |     31105 |      49286 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 fa       | کلن                |      1107 |      79284 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 pt       | Colônia_(Alemanha) |      1478 |      80391 |      81869 | 50.94222 | 6.95778 |  0.683665449382748 |          |          |        | Q365          | Q14757767
 kk       | Кёльн              |       430 |      79961 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 ko       | 쾰른               |       624 |      79767 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 ru       | Кёльн              |      4361 |      76030 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 no       | Köln               |      1242 |      79149 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 sk       | Kolín_nad_Rýnom    |       476 |      80391 |      80867 | 50.94222 | 6.95778 | 0.6829212484910951 |          |          |        | Q365          | Q14757767
 sr       | Келн               |       607 |      79784 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 it       | Colonia_(Germania) |      5862 |      80391 |      86253 | 50.94222 | 6.95778 | 0.6868178766557351 |          |          |        | Q365          | Q14757767
 nl       | Keulen_(stad)      |      4087 |      80391 |      84478 | 50.94222 | 6.95778 |  0.685561261670655 |          |          |        | Q365          | Q14757767
 ro       | Köln               |       872 |      79519 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 sv       | Köln               |      1641 |      78750 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 fi       | Köln               |       944 |      79447 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 vi       | Köln               |       538 |      79853 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 uk       | Кельн              |      1977 |      78414 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 pl       | Kolonia_(Niemcy)   |      3897 |      80391 |      84288 | 50.94222 | 6.95778 | 0.6854251894599938 |          |          |        | Q365          | Q14757767
 es       | Colonia_(Alemania) |      3915 |      80391 |      84306 | 50.94222 | 6.95778 | 0.6854380936576088 |          |          |        | Q365          | Q14757767
 war      | Colonia            |        65 |      80326 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 hu       | Köln               |      1495 |      78896 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 hr       | Köln               |       492 |      79899 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 zh       | 科隆               |      1299 |      79092 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 hi       | कोलोन              |        27 |      80364 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 sl       | Köln               |       413 |      79978 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 tr       | Köln               |      1020 |      79371 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 eu       | Kolonia_(Alemania) |       381 |      80391 |      80772 | 50.94222 | 6.95778 |   0.68285021255408 |          |          |        | Q365          | Q14757767
 da       | Köln               |       806 |      79585 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 ms       | Cologne            |       159 |      80232 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 bg       | Кьолн              |      1547 |      78844 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 id       | Köln               |       315 |      80076 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 ca       | Colònia_(Alemanya) |      1521 |      80391 |      81912 | 50.94222 | 6.95778 | 0.6836971819756548 |          |          |        | Q365          | Q14757767
 ar       | كولونيا            |      1344 |      79047 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767
 en       | Cologne            |     13314 |      67077 |      80391 | 50.94222 | 6.95778 | 0.6825644791239005 |          |          |        | Q365          | Q14757767

It almost looks like all totalcounts for entries where the title contains an underscore are wrong. Note how the expected totalcount (80391) shows up in the othercount column instead.

This anomaly is not present in the wikidata dump we use currently.
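For context: the importance values above are consistent with log(totalcount) normalised by the log of the largest totalcount in the dataset. A quick check of that inference (the formula is deduced from the numbers, not taken from the code):

import math

# 15361661 is the largest totalcount seen (Wikipedia_tiếng_Việt, above).
max_totalcount = 15361661
print(math.log(80391) / math.log(max_totalcount))  # ≈ 0.68256, matching most rows
print(math.log(86253) / math.log(max_totalcount))  # ≈ 0.68682, matching the it row

So the inflated totalcounts of the underscore rows translate directly into inflated importances.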

QRank

I recently came across QRank, which provides regularly updated ranks for Wikidata items, designed for map-related applications. Maybe it would be a useful signal for ranking search results, though I don't know to what extent it overlaps with the heuristics already implemented here.

Why is othercount incremented by number of page links?

UPDATE ${i}pagelinkcount
SET othercount = ${i}pagelinkcount.othercount + x.count
FROM (
  SELECT page_title AS title,
         count
  FROM ${i}langlinks
  JOIN ${i}page ON (ll_from = page_id)
  JOIN ${j}pagelinkcount ON (ll_lang = '${j}' AND ll_title = ${j}pagelinkcount.title)
) AS x
WHERE x.title = ${i}pagelinkcount.title

where i and j are two language codes, e.g. "en".

A wikipedia page can have "inlinks" from other pages of the same language (pagelinkcount.count) and from pages in other languages (pagelinkcount.othercount, filled from the langlinks tables).

The SQL logic above raises two questions:

  • It only increases othercount for pages that already have an entry in the table; it never adds new records. Arguably othercount should be independent of count.
  • It increases othercount by count, which means that when a page has an inlink from another language, its othercount grows by the counterpart page's number of same-language inlinks. I would expect it to grow by only 1, because the langlinks table has only one entry (one link). See the toy example below.

This logic should be revisited when we look at scoring again.
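A toy illustration of the second point, using the de/en Köln numbers from the previous issue:

# One langlink between en:Cologne and de:Köln adds Köln's entire German
# in-link count to Cologne's othercount, not 1.
de_pagelinkcount = {"Köln": 31105}   # langcount of de:Köln
en_othercount = {"Cologne": 0}

# What the UPDATE above does:
en_othercount["Cologne"] += de_pagelinkcount["Köln"]   # -> 31105
# What the issue argues for instead (one langlink, one link):
# en_othercount["Cologne"] += 1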

inspect checksum when downloading from wikimedia servers

Make sure the download is complete. We download, for example:

    download https://dumps.wikimedia.org/${i}wiki/latest/${i}wiki-latest-page.sql.gz
    download https://dumps.wikimedia.org/${i}wiki/latest/${i}wiki-latest-pagelinks.sql.gz
    download https://dumps.wikimedia.org/${i}wiki/latest/${i}wiki-latest-langlinks.sql.gz
    download https://dumps.wikimedia.org/${i}wiki/latest/${i}wiki-latest-redirect.sql.gz

The directory https://dumps.wikimedia.org/dewiki/latest/ contains a dewiki-latest-md5sums.txt file.
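A verification sketch (md5sums files on dumps.wikimedia.org use the usual "<md5>  <filename>" format; the shell script could equally use md5sum -c):

import hashlib, urllib.request

lang = "de"
base = "https://dumps.wikimedia.org/%swiki/latest/" % lang
sums = urllib.request.urlopen(base + "%swiki-latest-md5sums.txt" % lang).read().decode()
expected = dict(line.split()[::-1] for line in sums.splitlines() if line.strip())

# Hash the local download in chunks and compare.
fname = "%swiki-latest-page.sql.gz" % lang
md5 = hashlib.md5()
with open(fname, "rb") as fh:
    for chunk in iter(lambda: fh.read(1 << 20), b""):
        md5.update(chunk)
assert md5.hexdigest() == expected[fname], "checksum mismatch for " + fname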

pre-count pagelinks

In wikipedia_process.sh the ${LANG}pagelinks table (one column: a page title) is read and a new ${LANG}pagelinkcount table is created with two columns: page title and number of occurrences.

The ${LANG}pagelinks tables are huge. For English that is 820m entries and 40GB; a full-traversal query, e.g. SELECT COUNT(*), takes 10 minutes.

For all languages together the ${LANG}pagelinks tables are 179GB, while the ${LANG}pagelinkcount tables are 12GB.

Currently adding the counts is something like:

CREATE TABLE enpagelinkcount AS
SELECT pl_title AS title,COUNT(*) FROM enpagelinks GROUP BY pl_title;
26,959,559 rows
11 minutes

The new approach would be:

zcat converted/wikipedia/en/pagelinks.csv.gz | bin/count_first_column.py | gzip -9 > pagelinkcount_pre.csv.gz
59,475,009 lines
4 minutes
import pagelinkcount_pre.csv.gz: 20 seconds
CREATE TABLE enpagelinkcount AS
SELECT pl_title AS title, SUM(count) FROM pagelinkcount_pre GROUP BY pl_title
26,959,558 rows
2 minutes

total 6-7 minutes

Similar to uniq -c, count_first_column.py only checks whether the previous line is the same as the current one and increases a counter, keeping just one key in memory. The output can therefore still contain duplicate keys; removing them would require an additional sort. A sketch of the approach:
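This is the described behaviour, not the real script's code (that lives in bin/count_first_column.py):

import sys

# Like `uniq -c`: compare each line's first column with the previous one,
# keeping only one key and one counter in memory. Unsorted input therefore
# still produces duplicate keys in the output.
prev, count = None, 0
for line in sys.stdin:
    key = line.split(",", 1)[0]
    if key == prev:
        count += 1
    else:
        if prev is not None:
            print("%s,%d" % (prev, count))
        prev, count = key, 1
if prev is not None:
    print("%s,%d" % (prev, count))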

All pagelinks files together contain 3.7b lines, so 4.5x English. Total savings for all languages might be 15-20 minutes of processing and 150GB of disk space (database).

replace mysql2pgsql.perl

In import_wikipedia.sh we pipe three ${language}wiki*.sql.gz files per language through mysql2pgsql.perl and then into psql to import them into database tables.

mysql2pgsql contains logic to deal with various column type conversions. Most of it is no longer useful to us because we already move the CREATE TABLE and index creation into import_wikipedia.sh: the former to skip creation of sequences, the latter to skip creation of most indices (creating them after the full table import is also faster).

The data in the MySQL-format input consists of comma-separated records like

(39335,2,'Pit/In_Arbeit','',0,0,0.165717677367588,'20171118161939','20171118161943',6778517,0,'wikitext',NULL)

It should be possible to use either a very stripped-down version of the Perl script or to replace it with another, smaller script. It must of course deal with quoting and unicode encoding; a sketch of the parsing step follows.
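A sketch of the core parsing step such a replacement would need; not a drop-in replacement (real dumps also contain multi-tuple INSERT statements and binary columns):

import csv, sys

def parse_tuple(s):
    """Parse one (...) value tuple from a MySQL INSERT into a list of fields.
    Handles single-quoted strings, backslash escapes and unquoted NULL."""
    fields, buf = [], []
    in_str = quoted = False
    i, n = 1, len(s)                      # position 0 is the opening '('
    while i < n:
        c = s[i]
        if in_str:
            if c == "\\" and i + 1 < n:   # backslash escape inside a string
                buf.append(s[i + 1]); i += 2; continue
            if c == "'":
                in_str = False
            else:
                buf.append(c)
        elif c == "'":
            in_str = quoted = True
        elif c in ",)":
            val = "".join(buf)
            fields.append(None if val == "NULL" and not quoted else val)
            buf, quoted = [], False
            if c == ")":
                break
        else:
            buf.append(c)
        i += 1
    return fields

row = parse_tuple("(39335,2,'Pit/In_Arbeit','',0,0,0.165717677367588,"
                  "'20171118161939','20171118161943',6778517,0,'wikitext',NULL)")
csv.writer(sys.stdout).writerow(row)    # None (SQL NULL) becomes an empty field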

Currently time zcat dewiki-latest-page.sql.gz | ../mysql2pgsql.perl --nodrop /dev/stdin /dev/stdout | wc takes about 3 minutes for a 250MB compressed input file, with one CPU core at 100%. All language files together are about 36GB compressed, so we can estimate 3min * 4 * 36 = 432min = 7h12min for the mysql2pgsql.perl execution.

As a later optimization we could investigate writing CSV files instead.

As a later optimization we could import fewer columns. For example, for the ${language}page table the input has 13 columns but only 3 (page_id, page_title, page_namespace) are used.
Later processing also includes WHERE page_namespace = 0 (0 is 'article', https://en.wikipedia.org/wiki/Wikipedia:Namespace), so we could try to skip page_namespace != 0 rows earlier.

Practical use

Hello!

Using this repository, I processed the dumps from Wikipedia.
I received, as you described, four files:
wikimedia_importance.csv.gz
wikipedia_importance.sql.gz
wikipedia_article.csv.gz
wikipedia_redirect.csv.gz

I don't understand how to continue using these files with Nominatim; as far as I can tell it only requires the wikimedia-importance.sql.gz file.

My Nominatim version is 4.3.0.

Missing data for river 'Elbe'

A couple of weeks ago somebody changed the Wikipedia entry for the river Elbe from de:Elbe to cs:Labe. As a result the importance could no longer be matched. A couple of observations:

  • rivers don't seem to have wikidata entries in our wikipedia_article table
  • there is only an entry for the German version in the wikipedia_article table
  • the redirect table does not have cross-language entries either

The first point surely needs fixing. Not sure about the second and third.

UnicodeDecodeError in mysqldump_to_csv.py

We need to figure out whether this stops the script and leaves the output file incomplete. Across all languages, so far only the langlinks files seem affected.

[01:40:48] [language bg] langlinks table SQL => CSV
[01:40:52] Traceback (most recent call last):
[01:40:52]   File "bin/mysqldump_to_csv.py", line 114, in <module>
[01:40:52]     main()
[01:40:52]   File "bin/mysqldump_to_csv.py", line 104, in main
[01:40:52]     for line in fileinput.input():
[01:40:52]   File "/usr/lib/python3.8/fileinput.py", line 248, in __next__
[01:40:52]     line = self._readline()
[01:40:52]   File "/usr/lib/python3.8/codecs.py", line 322, in decode
[01:40:52]     (result, consumed) = self._buffer_decode(data, self.errors, final)
[01:40:52] UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 690-691: invalid continuation byte
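If lossy decoding is acceptable for the affected rows, one option (a sketch, assuming the plain fileinput.input() usage shown in the traceback) is to decode with a replacement handler instead of failing:

import fileinput

# Bad byte sequences become U+FFFD instead of aborting the whole conversion.
hook = fileinput.hook_encoded("utf-8", errors="replace")
for line in fileinput.input(openhook=hook):
    pass  # existing per-line conversion goes here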

Investigate importance of 0

About 3% of the entries in wikipedia_article currently have an importance of 0. We should investigate when this happens and whether it is expected. If importance is computed as log(totalcount)/log(max totalcount), as the numbers in the issues above suggest, an importance of 0 would correspond exactly to totalcount = 1.

Depending on how the values come to be, we have to handle the entries either as 'insufficient data' or as 'of extremely low importance'. In the first case the lines can be dropped; in the second case we should keep them.
