Comments (8)
On 27/5/2015 20:58, sylvia1 wrote:
> I want to get the title and the content of every Wikipedia article. I
> found the wiki extractor to be very useful for this purpose. I use wiki
> extractor according to the instructions on GitHub. When running
> wiki extractor V2.8, I ran into a 'maximum template recursion' error
> after a few hours. I am getting wiki extractor from this GitHub
> webpage: https://github.com/bwbaugh/wikipedia-extractor/blob/master/WikiExtractor.py
You should not worry about this warning. It occurs because of malformed
code in the templates. Wikipedia itself performs a similar check when
generating the HTML pages.
> So I tried previous commits/versions: V2.6, V2.5 and V2.4.
> In wiki extractor V2.4, the program seems to run successfully; the
> program stops after printing "45581241 Kaduthuruthy Thazhathupally" to
> the terminal; the resulting directories range from AA to QH.
> In wiki extractor V2.5, the program seems to run successfully; the
> program stops after printing "45581241 Kaduthuruthy Thazhathupally" to
> the terminal; the resulting directories range from AA to QN.
> In wiki extractor V2.6, the program seems to run successfully; the
> program stops after printing "45581241 Kaduthuruthy Thazhathupally" to
> the terminal; the resulting directories range from AA to QN.
> But I am really confused, because I have no idea which version has the
> complete Wikipedia articles. In my understanding, it seems none of
> them succeeded, because the resulting directories should range
> from AA to AZ, BA to BZ, ... QA to QZ, RA to RZ ... ZA to ZZ. But in
> V2.5 and V2.6, they stop at QN.
The number of directories produced depends on the value of parameter -b,
which determines how many pages to store in each file.
The only way to know how far the extraction went, is to look at what it
prints.
The dump I used reached:
46315506 Christopher Wood (Australian cricketer)
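To make the relationship between -b and the directory names concrete, here is a minimal Python sketch. It assumes that each directory holds 100 files (the count reported elsewhere in this thread) and that directories are named with two letters, in order, from AA to ZZ. The helper names `dir_name` and `dirs_needed` are hypothetical and are not part of WikiExtractor.

```python
import math

# Assumption: 100 files per directory, as reported in this thread.
FILES_PER_DIR = 100

def dir_name(dir_index):
    """Two-letter name of the dir_index-th directory (0 -> AA, 1 -> AB, 26 -> BA)."""
    if not 0 <= dir_index < 26 * 26:
        raise ValueError("only AA..ZZ are representable")
    first, second = divmod(dir_index, 26)
    return chr(ord("A") + first) + chr(ord("A") + second)

def dirs_needed(total_bytes, bytes_per_file):
    """Rough number of directories needed for a given amount of extracted text."""
    files = math.ceil(total_bytes / bytes_per_file)
    return math.ceil(files / FILES_PER_DIR)

# Q is letter index 16 and N is letter index 13, so QN is directory number 430.
last = 16 * 26 + 13
print(dir_name(last), last + 1)
```

Under these assumptions the name of the last directory depends only on how much text was extracted and on the value given to -b, so a run that ends at QN rather than ZZ is not in itself a sign of failure.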
> Could anyone who has run the wiki extractor successfully please shed some
> light on this? What should the successful result look like? And which
> version should I run to get the correct result?
You should use the latest version: 2.33.
Reply to this email directly or view it on GitHub
#21.
from wikiextractor.
I tried the latest version, V2.32; I could not find V2.33. It seems to have the same problem as V2.8: the program seems to run into an infinite loop that prints only "maximum template recursion". I have to stop the program (by closing the terminal) because it prints "maximum template recursion" all the time. The command I use is:
python WikiExtractor.py -cb 250K -o extracted enwiki-20150304-pages-meta-current.xml.bz2
When I check the results of V2.32, the directories stop at EQ. Before EQ, there are AA to AZ, BA to BZ, CA to CZ, and then EA to EQ.
In every directory from AA to EN, there are 100 files. Only EQ has 25 files.
You said "The only way to know how far the extraction went, is to look at what it prints". But when I run V2.32 and V2.8, the program prints "maximum template recursion" all over the screen the whole time, so I have nowhere to check how far the program got.
You said I should not worry about the warning "maximum template recursion". But it has been printing this message continuously (for hours).
Could you please shed some light on this? Is there anything I did wrong? What should I do to get this fixed? I just want to get the title and content of the full English Wikipedia. Thank you very much.
On 28/5/2015 17:52, sylvia1 wrote:
> I tried the latest version, V2.32; I could not find V2.33. It seems
> to have the same problem as V2.8: the program seems to run into an
> infinite loop that prints only "maximum template recursion". I have
> to stop the program (by closing the terminal) because it prints
> "maximum template recursion" all the time. The command I use is:
> python WikiExtractor.py -cb 250K -o extracted enwiki-20150304-pages-meta-current.xml.bz2
> When I check the results of V2.32, the directories stop at EQ.
> Before EQ, there are AA to AZ, BA to BZ, CA to CZ, and then EA to EQ.
> In every directory from AA to EN, there are 100 files. Only EQ has
> 25 files.
> You said "The only way to know how far the extraction went, is to look
> at what it prints". But when I run V2.32 and V2.8, the program prints
> "maximum template recursion" all over the screen the whole time, so I
> have nowhere to check how far the program got.
> You said I should not worry about the warning "maximum template
> recursion". But it has been printing this message continuously
> (for hours).
You can redirect the output to a file:
python WikiExtractor.py -cb 250K -o extracted enwiki-20150304-pages-meta-current.xml.bz2 > list.txt
and look at the ID of the last article reported before the loop.
I will try myself with the dump from 20150304 to see what happens.
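Finding the last reported article in a large list.txt by eye is tedious, so here is a minimal Python sketch of one way to do it. It assumes that progress lines have the form "<page id> <title>", as in the lines quoted in this thread, optionally preceded by a logging prefix such as "INFO:". That prefix, the function name `last_article`, and the sample lines are illustrative assumptions, not taken from WikiExtractor itself.

```python
import re

# Assumption: progress lines look like "<page id> <title>", optionally
# preceded by a logging prefix such as "INFO:". Adjust the pattern if
# your version of the script prints something different.
PROGRESS = re.compile(r"^(?:INFO:\s*)?(\d+)\s+(\S.*)$")

def last_article(lines):
    """Return (page_id, title) of the last progress line seen, or None."""
    last = None
    for line in lines:
        match = PROGRESS.match(line.strip())
        if match:
            last = (int(match.group(1)), match.group(2))
    return last

# Illustrative sample built from strings quoted in this thread.
sample = [
    "45581241 Kaduthuruthy Thazhathupally",
    "46315506 Christopher Wood (Australian cricketer)",
    "maximum template recursion",
    "maximum template recursion",
]
print(last_article(sample))

# For a real run, iterate over the redirected output instead:
# with open("list.txt", encoding="utf-8", errors="replace") as f:
#     print(last_article(f))
```

Because the function keeps only the most recent match, it reads the file in a single pass and skips the repeated warning lines without holding them in memory.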
> Could you please shed some light on this? Is there anything I did wrong?
> What should I do to get this fixed? I just want to get the title and
> content of the full English Wikipedia. Thank you very much.
I ran the program as you told me. Attached are two screenshots of the result: one shows the end of the file; the other shows the last INFO line the program printed, after which it prints "Max template recursion" all the time.
You should use the dump of the articles:
python WikiExtractor.py -o extracted enwiki-20150304-pages-articles.xml.bz2
Thanks for the info. But when I use enwiki-20150515-pages-articles.xml.bz2, it still has a problem.
Now I am downloading the 20141208 version. May I know which dump version you are using? Which dump version will have no error? Thanks.
I had no problem with that article.
I am using WikiExtractor.py version 2.33 on the dump enwiki-20150304-pages-articles.xml.bz2.
I am running Python 2.7.6.
Fixed in version 2.34.