
Comments (8)

attardi commented on June 23, 2024

On 27/5/2015 20:58, sylvia1 wrote:

> I want to get the title and the content of every Wikipedia article. I found WikiExtractor to be very useful for this purpose, and I use it according to the instructions on GitHub. When running WikiExtractor V2.8, I ran into a 'maximum template recursion' error after a few hours. I am getting WikiExtractor from this GitHub page: https://github.com/bwbaugh/wikipedia-extractor/blob/master/WikiExtractor.py

You should not worry about this warning. It occurs because of malformed
code in the templates. Wikipedia itself performs a similar check when
generating the HTML pages.
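
(To illustrate why the warning is harmless, here is a minimal sketch rather than WikiExtractor's actual code: a template can transclude itself, directly or through malformed markup, so any expander must cap its recursion depth and fall back to leaving the template unexpanded. The function name, regex, and the limit of 30 below are all assumptions for illustration.)

    import re

    MAX_TEMPLATE_RECURSION = 30  # assumed cap, in the spirit of MediaWiki's expansion limits

    def expand_template(text, templates, depth=0):
        """Expand {{Name|...}} transclusions in text, refusing to recurse forever."""
        if depth > MAX_TEMPLATE_RECURSION:
            # A malformed or self-referential template chain ends up here; the
            # page is still emitted, just with this template left unexpanded.
            print('maximum template recursion')
            return text
        def substitute(match):
            name = match.group(1).strip()
            return expand_template(templates.get(name, ''), templates, depth + 1)
        return re.sub(r'\{\{([^{}|]+)(?:\|[^{}]*)?\}\}', substitute, text)

    # A circular definition trips the guard instead of looping forever:
    print(expand_template('{{Loop}}', {'Loop': '{{Loop}}'}))

A page that hits the cap is still extracted; the warning just marks the template that could not be fully expanded.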

> So I tried previous versions: V2.6, V2.5, and V2.4.
>
> In WikiExtractor V2.4, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QH.
>
> In WikiExtractor V2.5, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QN.
>
> In WikiExtractor V2.6, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QN.
>
> But I am really confused, because I have no idea which version produced the complete set of Wikipedia articles. As I understand it, none of them succeeded, because the resulting directories should run from AA to AZ, BA to BZ, ... QA to QZ, RA to RZ, ... ZA to ZZ, but in V2.5 and V2.6 they stop at QN.

The number of directories produced depends on the value of the -b parameter, which sets the maximum size of each output file.
The only way to know how far the extraction went is to look at what it prints.
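
(As background, a sketch rather than the tool's actual code: the extractor writes numbered files wiki_00 through wiki_99 into directories named AA, AB, ..., AZ, BA, ..., opening a new file whenever the current one would exceed the -b byte limit. The hypothetical generator below reproduces just that naming sequence, assuming 100 files per directory as reported in this thread.)

    from string import ascii_uppercase

    def output_paths():
        """Yield output paths in WikiExtractor's order: AA/wiki_00 ... AA/wiki_99,
        AB/wiki_00, ... This sketches the naming scheme only; the real splitting
        is driven by the -b byte limit, not by a fixed article count."""
        for first in ascii_uppercase:
            for second in ascii_uppercase:
                for n in range(100):
                    yield '%c%c/wiki_%02d' % (first, second, n)

    # e.g. a run whose directories end at QN has produced about
    # (16 * 26 + 14) * 100 = 43,000 output files, the last one partial.
    paths = output_paths()
    print(next(paths))  # AA/wiki_00

So the last directory letter says nothing about completeness by itself; a smaller -b simply means more files and more directories for the same amount of text.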

The dump I used reached:

46315506 Christopher Wood (Australian cricketer)

> Could anyone who has run WikiExtractor successfully please shed some light on this? What should a successful result look like? And which version should I run to get the correct result?

You should use the latest version: 2.33.

β€”
Reply to this email directly or view it on GitHub
#21.

from wikiextractor.

sylvia1 commented on June 23, 2024

I tried the latest version, V2.32; I did not find a V2.33. But it seems to have the same problem as V2.8: the program appears to run into an infinite loop, printing only "maximum template recursion". I have to stop the program (close the terminal), because it prints "maximum template recursion" endlessly. The command I use is:
python WikiExtractor.py -cb 250K -o extracted enwiki-20150304-pages-meta-current.xml.bz2

When I check the results of V2.32, the directories also stop at EQ. Before EQ there are AA to AZ, BA to BZ, CA to CZ, and then EA to EQ.

In every directory from AA to EN there are 100 files; only EQ has 25 files.

You said "The only way to know how far the extraction went is to look at what it prints". But when I run V2.32 and V2.8, the program prints "maximum template recursion" all over the screen, all the time, so I have no way to check how far it got.

You said I should not worry about the "maximum template recursion" warning, but it has been printing that message constantly for hours.

Could you please shed some light on this? Is there anything I did wrong? What should I do to fix it? I just want the title and content of the full English Wikipedia. Thank you very much.


attardi commented on June 23, 2024

On 28/5/2015 17:52, sylvia1 wrote:

> I tried the latest version V2.32. [...] When I run V2.32 and V2.8, the program prints "maximum template recursion" all over the screen, all the time, so I have no way to check how far it got.

You can redirect the output to a file:
python WikiExtractor.py -cb 250K -o extracted enwiki-20150304-pages-meta-current.xml.bz2 > list.txt
and look at the ID of the last article reported before the loop.

I will try it myself with the 20150304 dump to see what happens.



sylvia1 commented on June 23, 2024

I ran the program as you suggested, and here are the attached screenshots of the result. One shows the end of the output file; the other shows the last INFO line the program printed, after which it prints "Max template recursion" continuously.

[two screenshots attached: the end of the output file, and the last INFO line before the recursion messages]


attardi commented on June 23, 2024

You should use the dump of the articles (pages-articles), which contains only the article pages, rather than the meta-current dump:

python WikiExtractor.py -o extracted enwiki-20150304-pages-articles.xml.bz2


sylvia1 commented on June 23, 2024

Thanks for the info. But when I use enwiki-20150515-pages-articles.xml.bz2, it still has a problem:
[screenshot attached]

Now I am downloading the 20141208 version. May I know which dump you are using? Which dump will produce no errors? Thanks.


attardi commented on June 23, 2024

I had no problem with that article.
I am using WikiExtractor.py version 2.33 on the dump enwiki-20150304-pages-articles.xml.bz2.
I am running Python 2.7.6.


attardi commented on June 23, 2024

Fixed in version 2.34.

