
Comments (8)

attardi commented on June 23, 2024

On 27/5/2015 20:58, sylvia1 wrote:

> I want to get the title and the content of every Wikipedia article. I found WikiExtractor to be very useful for this purpose, and I use it according to the instructions on GitHub. When running WikiExtractor V2.8, I ran into a 'maximum template recursion' error after a few hours. I am getting WikiExtractor from this GitHub page: https://github.com/bwbaugh/wikipedia-extractor/blob/master/WikiExtractor.py

You should not worry about this warning. It occurs because of malformed
code in the templates. Wikipedia itself performs a similar check when
generating the HTML pages.
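
(To illustrate why the warning is harmless, here is a minimal sketch rather than WikiExtractor's actual code: a template can transclude itself, directly or through malformed markup, so any expander must cap its recursion depth and fall back to leaving the template unexpanded. The function name, regex, and the limit of 30 below are all assumptions for illustration.)

    import re

    MAX_TEMPLATE_RECURSION = 30  # assumed cap, in the spirit of MediaWiki's expansion limits

    def expand_template(text, templates, depth=0):
        """Expand {{Name|...}} transclusions in text, refusing to recurse forever."""
        if depth > MAX_TEMPLATE_RECURSION:
            # A malformed or self-referential template chain ends up here; the
            # page is still emitted, just with this template left unexpanded.
            print('maximum template recursion')
            return text
        def substitute(match):
            name = match.group(1).strip()
            return expand_template(templates.get(name, ''), templates, depth + 1)
        return re.sub(r'\{\{([^{}|]+)(?:\|[^{}]*)?\}\}', substitute, text)

    # A circular definition trips the guard instead of looping forever:
    print(expand_template('{{Loop}}', {'Loop': '{{Loop}}'}))

A page that hits the cap is still extracted; the warning just marks the template that could not be fully expanded.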

> So I tried previous versions: V2.6, V2.5, and V2.4.
>
> In WikiExtractor V2.4, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QH.
>
> In WikiExtractor V2.5, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QN.
>
> In WikiExtractor V2.6, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QN.
>
> But I am really confused, because I have no idea which version produced the complete set of Wikipedia articles. As I understand it, none of them succeeded, because the resulting directories should run from AA to AZ, BA to BZ, ... QA to QZ, RA to RZ, ... ZA to ZZ, but in V2.5 and V2.6 they stop at QN.

The number of directories produced depends on the value of the -b parameter, which sets the maximum size of each output file.
The only way to know how far the extraction went is to look at what it prints.
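
(As background, a sketch rather than the tool's actual code: the extractor writes numbered files wiki_00 through wiki_99 into directories named AA, AB, ..., AZ, BA, ..., opening a new file whenever the current one would exceed the -b byte limit. The hypothetical generator below reproduces just that naming sequence, assuming 100 files per directory as reported in this thread.)

    from string import ascii_uppercase

    def output_paths():
        """Yield output paths in WikiExtractor's order: AA/wiki_00 ... AA/wiki_99,
        AB/wiki_00, ... This sketches the naming scheme only; the real splitting
        is driven by the -b byte limit, not by a fixed article count."""
        for first in ascii_uppercase:
            for second in ascii_uppercase:
                for n in range(100):
                    yield '%c%c/wiki_%02d' % (first, second, n)

    # e.g. a run whose directories end at QN has produced about
    # (16 * 26 + 14) * 100 = 43,000 output files, the last one partial.
    paths = output_paths()
    print(next(paths))  # AA/wiki_00

So the last directory letter says nothing about completeness by itself; a smaller -b simply means more files and more directories for the same amount of text.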

The dump I used reached:

46315506 Christopher Wood (Australian cricketer)

> Could anyone who has run WikiExtractor successfully please shed some light on this? What should a successful result look like? And which version should I run to get the correct result?

You should use the latest version: 2.33.

β€”
Reply to this email directly or view it on GitHub
#21.

from wikiextractor.

sylvia1 commented on June 23, 2024

I tried the latest version, V2.32; I did not find a V2.33. But it seems to have the same problem as V2.8: the program appears to run into an infinite loop, printing only "maximum template recursion". I have to stop the program (close the terminal), because it prints "maximum template recursion" endlessly. The command I use is:
python WikiExtractor.py -cb 250K -o extracted enwiki-20150304-pages-meta-current.xml.bz2

When I check the results of V2.32, the directories also stop at EQ. Before EQ there are AA to AZ, BA to BZ, CA to CZ, and then EA to EQ.

In every directory from AA to EN there are 100 files; only EQ has 25 files.

You said "The only way to know how far the extraction went is to look at what it prints". But when I run V2.32 and V2.8, the program prints "maximum template recursion" all over the screen, all the time, so I have no way to check how far it got.

You said I should not worry about the "maximum template recursion" warning, but it has been printing that message constantly for hours.

Could you please shed some light on this? Is there anything I did wrong? What should I do to fix it? I just want the title and content of the full English Wikipedia. Thank you very much.


attardi commented on June 23, 2024

On 28/5/2015 17:52, sylvia1 wrote:

> I tried the latest version V2.32. [...] When I run V2.32 and V2.8, the program prints "maximum template recursion" all over the screen, all the time, so I have no way to check how far it got.

You can redirect the output to a file:
python WikiExtractor.py -cb 250K -o extracted enwiki-20150304-pages-meta-current.xml.bz2 > list.txt
and look at the ID of the last article reported before the loop.

I will try it myself with the 20150304 dump to see what happens.



sylvia1 commented on June 23, 2024

I ran the program as you suggested, and here are the attached screenshots of the result. One shows the end of the output file; the other shows the last INFO line the program printed, after which it prints "Max template recursion" continuously.

[two screenshots attached: the end of the output file, and the last INFO line before the recursion messages]


attardi commented on June 23, 2024

You should use the dump of the articles (pages-articles), which contains only the article pages, rather than the meta-current dump:

python WikiExtractor.py -o extracted enwiki-20150304-pages-articles.xml.bz2


sylvia1 commented on June 23, 2024

Thanks for the info. But when I use enwiki-20150515-pages-articles.xml.bz2, it still has a problem:
[screenshot attached]

Now I am downloading the 20141208 version. May I know which dump you are using? Which dump will produce no errors? Thanks.


attardi commented on June 23, 2024

I had no problem with that article.
I am using WikiExtractor.py version 2.33 on the dump enwiki-20150304-pages-articles.xml.bz2.
I am running Python 2.7.6.


attardi commented on June 23, 2024

Fixed in version 2.34.

