wptools's People

Contributors

0x9900, jayvdb, kovarden, lisongx, matthewgehring, mcepl, robbieclarken, siznax, uriva

wptools's Issues

language variant not working

Hi Steve,

I'm using zh-cn to get a wikibase's title, but it seems the label and title are both traditional Chinese, not simplified Chinese (also, what's the difference between these two?), even though Wikipedia displays zh-cn on its page:

[screenshot: the zh.wikipedia.org page rendered in the zh-cn variant]

Q20474 (zh/zh-cn)
{
  lang: zh
  variant: zh-cn
  wikibase: Q20474
}
www.wikidata.org (wikidata) Q20474
www.wikidata.org (claims) Q1165777|Q8391309|Q188451
zh.wikipedia.org (query) 迴響貝斯
zh.wikipedia.org (parse) 1565553
迴響貝斯 (zh/zh-cn)
{
  cache: <dict(4)> {claims, parse, query, wikidata}
  claims: <dict(3)> {Q1165777, Q188451, Q8391309}
  extext: <str(520)> **回响贝斯**(英语:**Dubstep**),回响贝斯在1990年代诞生,源起英国伦敦南部...
  extract: <str(593)> <p><b>回响贝斯</b>(<span>英语:<span lang="en" xml:la...
  label: 迴響貝斯
  lang: zh
  links: https://en.wikipedia.org/wiki/Dubstep
  modified: <dict(2)> {page, wikidata}
  pageid: 1565553
  parsetree: <str(1268)> <root><template><title>Expand English</titl...
  props: <dict(3)> {P279, P31, P910}
  random: 游铭训
  title: 迴響貝斯
  url: https://zh.wikipedia.org/wiki/%E8%BF%B4%E9%9F%BF%E8%B2%9D%E6%96%AF
  url_raw: https://zh.wikipedia.org/wiki/%E8%BF%B4%E9%9F%BF%E8%B2%9D%E6%96%AF?action=raw
  variant: zh-cn
  wikibase: Q20474
  wikidata: <dict(3)> {category, instance, subclass}
  wikidata_url: https://www.wikidata.org/wiki/Q20474
  wikitext: <str(1000)> {{Expand English|Dubstep}}{{unreferenced|tim...
}

wikibase lookup bug

For example, q = wptools.page(wikibase='Q43303').get_query() will always give back a None page.

I think the code should throw an exception in this case, when it doesn't have enough information to get the query.
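
A minimal sketch of such a guard, assuming a get_query-style method that needs either a title or a pageid (names are illustrative, not necessarily wptools' actual internals):

def get_query(self):
    # fail loudly instead of silently returning None when we lack
    # the parameters needed to form a query
    if not self.title and not self.pageid:
        raise LookupError("get_query: need title or pageid")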

Consider high-level service class

Currently, we do this:

>>> query = wptools.lead('Aardvark', test=True)
>>> query[:72]
'http://en.wikipedia.org/w/api.php?action=parse&format=json&page=Aardvark'

>>> html = wptools.lead('Aardvark')  # HTTP request 1/2
>>> html[:72]
'<p>The <b>aardvark</b> (<span class="nowrap"><span class="IPA nopopups">'    

>>> text = wptools.lead('Aardvark', plain=True)  # HTTP request 2/2
>>> text[:72]
'The **aardvark** (/ˈɑːrd.vɑːrk/ _**ARD**-vark_; _Orycteropus afer_)'

Would it be worth it to make a service class, so we can do this?

>>> o = wptools.WPTools(fn='lead', title='Aardvark')

>>> o.query[:72]
'http://en.wikipedia.org/w/api.php?action=parse&format=json&page=Aardvark'

>>> o.html
None

>>> o.get()  # HTTP request 1/1

>>> o.html[:72]
'<p>The <b>aardvark</b> (<span class="nowrap"><span class="IPA nopopups">'    

>>> o.text[:72]
'The **aardvark** (/ˈɑːrd.vɑːrk/ _**ARD**-vark_; _Orycteropus afer_)'

The second request ("HTTP request 2/2") in the first block is not really necessary, and the advantage of the second block is that you can refer to the same object to access both the HTML and the text, with only one HTTP request made.

Not all functions may work out this nicely though.
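
For discussion, a minimal sketch of what such a service class might look like; build_query, fetch, extract_html, and plain_text are hypothetical helpers standing in for the existing internals:

class WPTools(object):

    def __init__(self, fn, title):
        self.fn = fn
        self.title = title
        self.query = build_query(fn, title)  # no HTTP yet
        self.html = None
        self.text = None

    def get(self):
        # one HTTP request serves both representations
        response = fetch(self.query)
        self.html = extract_html(response)
        self.text = plain_text(self.html)
        return self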

Figure out "best" image

We get the following images from MediaWiki:API and Wikidata:

Image => Wikidata Property:P18
pageimage => MediaWiki:API (action=query) pageimage
thumbnail => MediaWiki:API (action=query) thumbnail

RESTBase /page/mobile-text also gives us a couple of images:

image => apparently, pageimage (above)
thumb => apparently, scaled (larger) thumbnail (above)

Spot-checking some articles, the only conflicts were with thumbnail, where the RESTBase thumb was the scaled (larger) version of the MediaWiki:API thumbnail.

In the case of Napoleon, the Wikidata Image is a portrait (great!), while the get_query images are an Imperial Coat of Arms (what?).

››› Napoleon (en)

In the case of Stephen Fry, pageimage seems more current.

››› Stephen Fry (en)

In the case of Ella Fitzgerald, it's hard to say.

››› Ella Fitzgerald (en) {}

So, which is the "best" image? I guess it depends on your goal.
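
If a consumer just wants one image, a preference order is one way to encode a goal. A sketch only, with illustrative attribute names:

def best_image(page):
    # prefer the curated Wikidata image (P18), then the API's
    # pageimage, then the (smaller) thumbnail
    for attr in ('wikidata_image', 'pageimage', 'thumbnail'):
        image = getattr(page, attr, None)
        if image:
            return image
    return None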

Add request method for common "get_" tasks

Most get_ methods do nearly the same thing (see the sketch after this list):

  • check cache
  • maybe raise LookupError
  • form query
  • get response, info
  • cache query, response, info
  • set attributes from response data
  • maybe get_claims, get_imageinfo, show
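
A sketch of a shared request method covering those steps; helper names here are illustrative, not the current internals:

def _request(self, action, set_data):
    if action in self.cache:
        return self  # check cache
    if not self.title and not self.pageid:
        raise LookupError("%s: need title or pageid" % action)  # maybe raise
    query = self._form_query(action)        # form query
    response, info = self._fetch(query)     # get response, info
    self.cache[action] = {'query': query,
                          'response': response,
                          'info': info}     # cache query, response, info
    set_data(response)                      # set attributes from response data
    return self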

Ubuntu dependencies

Email submission...

For Xubuntu 16.04, it was just a few steps:

$ sudo apt-get install python-pip
$ sudo apt-get install libcurl4-gnutls-dev
$ sudo apt-get install libgnutls28-dev
$ sudo pip install wptools

Monkeypatch for python3 unicode

For those who want to use wptools in Python 3 after cloning the repo and installing a local copy on your system:

core.py

-            if type(prop) is str or type(prop) is unicode:
+            if type(prop) is str:

Since Python 3 moved to str only, it is difficult to make a patch for this particular unicode check that works in both Python 2 and 3.

A version check is kinda tricky:

import sys

# note: sys.version_info.major is an int, so compare against 2, not "2"
if sys.version_info.major == 2:
    if type(prop) is str or type(prop) is unicode:
        ...
elif sys.version_info.major == 3:
    if type(prop) is str:
        ...
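
An alternative that avoids checking the version at each call is to define the string types once at import time; this is a common compatibility idiom, not what wptools currently does:

try:
    string_types = (str, unicode)  # Python 2
except NameError:
    string_types = (str,)          # Python 3

# then, wherever the check is needed:
if isinstance(prop, string_types):
    ...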

Support language variants (simplified Chinese)

I'm using p = wptools.page(wikibase='Q5094115', lang='zh').get() to get the Chinese content of the page, but the content I get from p.extract is traditional Chinese:

>>> print p.extract
<p><b>林一峰</b>(英文名:Chet Lam,1976年4月11日<span title="Template:BLP editintro">-</span>),香港土生土長的創作歌手,未受過正統音樂訓練,靠自學而學懂彈結他、作曲、寫詞;除了大量替其他歌手創作及製作外,林成立了兩間工作室:LYFE及思人創作,推出過十多張粵語、國語跟英語專輯,替自己及其他音樂單位舉辦音樂演出。文字方面,林已經推出了7本著作,包括2014年集音樂專輯、旅行故事、中英對照食譜於一身的《慢煮快活》。</p>
<p>林是香港唱作組合at17成員林二汶的哥哥,兩人經常合作演出。</p>

I'm wondering, is there a way to get the simplified version of the content from the API? On https://zh.wikipedia.org/ you can do this by selecting a different Chinese flavor:

[screenshot: the zh.wikipedia.org language-variant selector]

Thanks Steve!

Helper method to validate/resize image URLs

Currently, we calculate image URLs with utils.media_url and default to namespace='commons' to save additional API calls. This results in an invalid path if the file is not in Commons.

404 NOT FOUND:

https://upload.wikimedia.org/wikipedia/commons/1/14/Super_Sentai_World_screenshot.jpg

200 OK:

https://upload.wikimedia.org/wikipedia/en/1/14/Super_Sentai_World_screenshot.jpg

We should offer a helper method to validate image paths using MediaWiki API:Imageinfo. If the pages key is -1 or has a missing member, then the image should be located in the language namespace repository instead.

Missing from commons.wikimedia.org:

https://commons.wikimedia.org/w/api.php?action=query&titles=File:Super_Sentai_World_screenshot.jpg&prop=imageinfo

"pages": {
    "-1": {
        "ns": 6,
        "title": "File:Super Sentai World screenshot.jpg",
        "missing": "",
        "imagerepository": ""
    }
}

Found in en.wikipedia.org:

https://en.wikipedia.org/w/api.php?action=query&titles=File:Super_Sentai_World_screenshot.jpg&prop=imageinfo

"pages": {
    "31043864": {
        "pageid": 31043864,
        "ns": 6,
        "title": "File:Super Sentai World screenshot.jpg",
        "imagerepository": "local",
        "imageinfo": [
            {
                "timestamp": "2011-03-01T06:56:07Z",
                "user": "Jonny2x4"
            }
        ]
    }
}

This may require not one but two additional API calls: 1) verify the path via API:Imageinfo, and if the image is missing from Commons, 2) recompute the path with namespace=<lang> and verify it with an API:Imageinfo call to <lang>.wikipedia.org.
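
A minimal sketch of the first check using the requests library (illustrative, not wptools' actual code path):

import requests

def image_exists(filename, wiki='commons.wikimedia.org'):
    # query API:Imageinfo for the file; a pageid of -1 or a
    # "missing" member means the file is not in this repository
    params = {'action': 'query', 'format': 'json',
              'titles': 'File:' + filename, 'prop': 'imageinfo'}
    api = 'https://%s/w/api.php' % wiki
    pages = requests.get(api, params=params).json()['query']['pages']
    page = list(pages.values())[0]
    return '-1' not in pages and 'missing' not in page

# fall back to the language repository when Commons misses
if not image_exists('Super_Sentai_World_screenshot.jpg'):
    image_exists('Super_Sentai_World_screenshot.jpg',
                 wiki='en.wikipedia.org')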

We should also use this same method/code path to recompute image URLs for a specific size by adding the iiurlwidth param:

https://en.wikipedia.org/w/api.php?action=query&titles=File:Super_Sentai_World_screenshot.jpg&prop=imageinfo&iiprop=url&iiurlwidth=320

"pages": {
    "31043864": {
        "pageid": 31043864,
        "ns": 6,
        "title": "File:Super Sentai World screenshot.jpg",
        "imagerepository": "local",
        "imageinfo": [
            {
"thumburl": "https://upload.wikimedia.org/wikipedia/en/1/14/Super_Sentai_World_screenshot.jpg",
"thumbwidth": 320,
"thumbheight": 240,
"url": "https://upload.wikimedia.org/wikipedia/en/1/14/Super_Sentai_World_screenshot.jpg",
"descriptionurl": "https://en.wikipedia.org/wiki/File:Super_Sentai_World_screenshot.jpg",
"descriptionshorturl": "https://en.wikipedia.org/w/index.php?curid=31043864"
            }
        ]
    }
}

Support more wikis by allowing insecure requests

Currently, all queries use HTTPS, so if you try a non-Wikimedia site, you may get an SSL error:

>>> p = wptools.page(wiki='tolkiengateway.net')
tolkiengateway.net (action=random) None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "wptools/core.py", line 113, in __init__
    self.get_random()
  File "wptools/core.py", line 588, in get_random
    response = self.__fetch.curl(query)
  File "wptools/fetch.py", line 110, in curl
    return self.curl_perform(crl)
  File "wptools/fetch.py", line 118, in curl_perform
    crl.perform()
pycurl.error: (7, 'Failed to connect to tolkiengateway.net port 443: Connection refused')

If we adjust the query to use 'http' instead of 'https':

diff --git a/wptools/fetch.py b/wptools/fetch.py
index 249b210..81250d7 100644
--- a/wptools/fetch.py
+++ b/wptools/fetch.py
@@ -176,7 +176,7 @@ class WPToolsFetch(object):

         self.action = action
         self.thing = thing
-        return qry
+        return qry.replace('https', 'http')

Then we can access more MediaWiki instances...

>>> p = wptools.page(wiki='tolkiengateway.net')
tolkiengateway.net (action=random) None
Helm's_Deep_(scene) (en)
{
  lang: en
  pageid: 41211
  title: Helm's_Deep_(scene)
}
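
A safer variant of the patch would downgrade only when the HTTPS connection is refused, rather than rewriting every query; a sketch:

import pycurl

def curl_with_fallback(self, query):
    try:
        return self.curl(query)
    except pycurl.error as err:
        # error 7: failed to connect (e.g. no HTTPS listener)
        if err.args[0] == pycurl.E_COULDNT_CONNECT:
            return self.curl(query.replace('https://', 'http://', 1))
        raise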

No-lead text output probably double-encoded

$ ./text.py Kintsugi | head
_**Kintsugi (���?)**_ (Japanese: _golden joinery_)

Okay with -l (lead) option:

$ ./text.py Kintsugi -l | head
_**Kintsugi (金継ぎ?)**_ (Japanese: _golden joinery_)
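
For reference, a round-trip like the one below repairs mojibake only when the original bytes survive the bad decode; the replacement characters (�) above suggest the bytes were already lost, so the real fix is to decode as UTF-8 in the right place (my reading of the symptom, not a confirmed diagnosis):

# -*- coding: utf-8 -*-
text = u'金継ぎ'
garbled = text.encode('utf-8').decode('latin-1')   # simulate the bug
fixed = garbled.encode('latin-1').decode('utf-8')  # lossless round-trip
assert fixed == text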

Consolidate attributes

Let's have some reasonable level of access to important API data on an instance, but also consolidate where it makes sense. Let's use this issue for discussions on what makes sense, and start here:

  1. Put image items (image, thumbnail, etc.) in images attribute (see #33)
  2. Put cache items (g_query, etc.) in cache attribute

AttributeError on get_wikidata in __get_image_files

Traceback:

  File "/var/dae/apps/fm/fm/model/wiki/fetch.py", line 27, in fetch_wikidata_item_by_name_and_lang
    timeout=WIKI_FETCH_TIMEOUT).get_wikidata()
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 830, in get_wikidata
    self.get_imageinfo(False)
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 601, in get_imageinfo
    files = self.__get_image_files()
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 131, in __get_image_files
    fname = item.replace('_', ' ')
AttributeError: 'list' object has no attribute 'replace'

output is not json format

Thanks for the release,

But when I run "./infobox.py University_of_Cambridge", the output contains 631 lines and is not in JSON format. Does this problem occur on your machine?

wikidata item classification

It's not really about this lib, but it would be great if wptools could help with this:

I'm using the claims attribute from Wikidata to classify whether the page I get is a musician or a musical group.

WIKI_DATA_HUMAN = 'Q5'
WIKI_DATA_BAND = 'Q215380'
WIKI_DATA_ROCK_BAND = 'Q5741069'

Right now I'm using these three values: if a page has any one of them in its claims, I consider it a musician. But Wikidata has tons of these instance-of values (it's a tree, I guess), so is there a clever way to tell whether one Wikidata item is a subclass of another?
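
For what it's worth, the Wikidata Query Service can answer exactly this with a property path; a sketch using requests, outside of wptools (P31 = instance of, P279 = subclass of):

import requests

def is_instance_of(item, target):
    # true if item is an instance of target or of any
    # transitive subclass of target
    query = 'ASK { wd:%s wdt:P31/wdt:P279* wd:%s }' % (item, target)
    r = requests.get('https://query.wikidata.org/sparql',
                     params={'query': query, 'format': 'json'})
    return r.json()['boolean']

print(is_instance_of('Q2831', 'Q5'))  # is Michael Jackson a human? True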

Thanks in advance, Steve!

Infobox templates should revert to wikitext

Infobox output (e.g. wptools.infobox('Aardvark')) is derived from the parsetree in order to avoid string-hacking wikitext; however, this can lead to values that do not appear to be the most useful:

"fossil_range": "<template><title>Fossil range</title><part><name index=\"1\"/>
<value>5</value></part><part><name index=\"2\"/><value>0</value></part>
</template>&lt;small&gt;Early [[Pliocene]] &#8211; Recent&lt;/small&gt;"

I think a better result would look more like the original wikitext, e.g. wptools.wikitext('Aardvark'):

fossil_range = {{Fossil range|5|0}}<small>Early [[Pliocene]] – Recent</small>

We can expand this template via the API (see API:Parsing_wikitext), but this yields a large amount of HTML which may be equally uninteresting:

[screenshot: the expanded fossil_range template for the Aardvark infobox, rendered as HTML]

I think the most useful result may be:

fossil_range = {{Fossil range|5|0}}

Perhaps there is a way to get the original wikitext out of the parsetree template?
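
One possible route, sketched against the parsetree shown above: walk the <template> element and re-join its title and parts with pipes (nested templates would need recursion):

import lxml.etree

def template_to_wikitext(node):
    # parsetree templates always carry a <title> and <part> children
    parts = [node.findtext('title').strip()]
    for part in node.findall('part'):
        name = part.find('name')
        value = part.findtext('value') or ''
        if name.get('index'):                       # positional parameter
            parts.append(value)
        else:                                       # named parameter
            parts.append('%s=%s' % (name.text.strip(), value))
    return '{{%s}}' % '|'.join(parts)

node = lxml.etree.fromstring(
    '<template><title>Fossil range</title>'
    '<part><name index="1"/><value>5</value></part>'
    '<part><name index="2"/><value>0</value></part></template>')
print(template_to_wikitext(node))  # {{Fossil range|5|0}}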

RESTBase /page/mobile-text/ missing some maths

For example, in the Size functor article, the RESTBase /page/mobile-text/ result is missing:

* for each {\displaystyle x\in \mathbb {R} \ } , {\displaystyle F_{i}(x)=H_{i}(M_{x});\ }
* {\displaystyle F_{i}(k_{xy})=H_{i}(j_{xy}).\ }

MediaWiki:API action=query:

equal to the morphism in <span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mrow class="MJX-TeXAtom-ORD">
          <mi mathvariant="normal">R</mi>
          <mi mathvariant="normal">o</mi>
          <mi mathvariant="normal">r</mi>
          <mi mathvariant="normal">d</mi>
        </mrow>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle \mathrm {Rord} \ }</annotation>
  </semantics></math></span></span> from <span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>x</mi>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle x\ }</annotation>
  </semantics></math></span></span> to <span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>y</mi>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle y\ }</annotation>
  </semantics></math></span></span>,</p>
<ul><li>for each <span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>x</mi>
        <mo></mo>
        <mrow class="MJX-TeXAtom-ORD">
          <mi mathvariant="double-struck">R</mi>
        </mrow>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle x\in \mathbb {R} \ }</annotation>
  </semantics></math></span></span>, <span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <msub>
          <mi>F</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>i</mi>
          </mrow>
        </msub>
        <mo stretchy="false">(</mo>
        <mi>x</mi>
        <mo stretchy="false">)</mo>
        <mo>=</mo>
        <msub>
          <mi>H</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>i</mi>
          </mrow>
        </msub>
        <mo stretchy="false">(</mo>
        <msub>
          <mi>M</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>x</mi>
          </mrow>
        </msub>
        <mo stretchy="false">)</mo>
        <mo>;</mo>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle F_{i}(x)=H_{i}(M_{x});\ }</annotation>
  </semantics></math></span></span></li>
<li><span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <msub>
          <mi>F</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>i</mi>
          </mrow>
        </msub>
        <mo stretchy="false">(</mo>
        <msub>
          <mi>k</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>x</mi>
            <mi>y</mi>
          </mrow>
        </msub>
        <mo stretchy="false">)</mo>
        <mo>=</mo>
        <msub>
          <mi>H</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>i</mi>
          </mrow>
        </msub>
        <mo stretchy="false">(</mo>
        <msub>
          <mi>j</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>x</mi>
            <mi>y</mi>
          </mrow>
        </msub>
        <mo stretchy="false">)</mo>
        <mo>.</mo>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle F_{i}(k_{xy})=H_{i}(j_{xy}).\ }</annotation>
  </semantics></math></span></span></li>
</ul><p>In other words,

RESTBase /page/mobile-text/:

equal to the <a href="/wiki/Morphism" title="Morphism">morphism</a> in <span><span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mrow class="MJX-TeXAtom-ORD">
          <mi mathvariant="normal">R</mi>
          <mi mathvariant="normal">o</mi>
          <mi mathvariant="normal">r</mi>
          <mi mathvariant="normal">d</mi>
        </mrow>
        <mtext>&nbsp;</mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle \mathrm {Rord} \ }</annotation>
  </semantics></math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/6c971f380a0932266ee272cb327075e361d632d6" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.338ex; width:5.7ex; height:2.176ex;" alt="{\displaystyle \mathrm {Rord} \ }"></span> from <span><span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>x</mi>
        <mtext>&nbsp;</mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle x\ }</annotation>
  </semantics></math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/4bf17264a35330beeb310c35f9676cf9837482e3" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.338ex; width:1.921ex; height:1.676ex;" alt="x\ "></span> to <span><span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>y</mi>
        <mtext>&nbsp;</mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle y\ }</annotation>
  </semantics></math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/6e6f65d38cec79fda789e1335dea91732a186a41" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.671ex; width:1.747ex; height:2.009ex;" alt="y\ "></span>,
In other words,

get_imageinfo hangs on proxy with timeout

The timeout seems not to be working here:

>>> wptools.page(wikibase='Q312559', lang='zh').get(proxy=DOUBAN_PROXY, timeout=2)
Q312559 (zh)
{
  lang: zh
  wikibase: Q312559
}
www.wikidata.org (wikidata) Q312559
www.wikidata.org (claims) Q21|Q5|Q11399
zh.wikipedia.org (query) 比尔·怀曼
zh.wikipedia.org (parse) 1542057
zh.wikipedia.org (imageinfo) File:Bill Wyman 2009.jpg

BTW, is there a way to skip imageinfo processing, since I only need the text data?

Simplify "wptool" script

Recent improvements should allow us to make the wptool script much simpler. In fact, a goal should be for that script to do as little as possible, supported by attributes resulting from the fewest queries.

get_rest() raises ValueError

The following title:

Красная_площадь (ru)
{
  lang: ru
  title: Красная_площадь
}

causes an exception on get_rest():

Traceback (most recent call last):
  File "try.py", line 32, in test_rest
    r.get_rest()
  File "/Users/steve/Code/wptools/wptools/core.py", line 615, in get_rest
    self._set_rest_data()
  File "/Users/steve/Code/wptools/wptools/core.py", line 385, in _set_rest_data
    lead = self.__get_lead(data)
  File "/Users/steve/Code/wptools/wptools/core.py", line 141, in __get_lead
    lead.append(self.__get_lead_rest(data))
  File "/Users/steve/Code/wptools/wptools/core.py", line 209, in __get_lead_rest
    return self.__postprocess_lead(html)
  File "/Users/steve/Code/wptools/wptools/core.py", line 215, in __postprocess_lead
    snip = utils.snip_html(html, verbose=1 if self.verbose else 0)
  File "/Users/steve/Code/wptools/wptools/utils.py", line 120, in snip_html
    elem.remove(desc)
  File "src/lxml/lxml.etree.pyx", line 950, in lxml.etree._Element.remove (src/lxml/lxml.etree.c:50798)
ValueError: Element is not a child of this node.

Add support for get_wikidata continuations

Hi Steve,
I'm not sure this is possible yet in Wikidata's API, but I have a list of 1000+ ids and only need their titles right now. Instead of querying them one by one, it would be great to get all the names in one query.
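
The Wikidata API does support this: wbgetentities accepts up to 50 ids per request, so 1000+ labels take about 20 calls rather than 1000. A sketch outside of wptools:

import requests

def get_labels(ids, lang='en'):
    api = 'https://www.wikidata.org/w/api.php'
    labels = {}
    for i in range(0, len(ids), 50):  # 50-id batches
        params = {'action': 'wbgetentities', 'format': 'json',
                  'ids': '|'.join(ids[i:i + 50]),
                  'props': 'labels', 'languages': lang}
        entities = requests.get(api, params=params).json()['entities']
        for qid, entity in entities.items():
            label = entity.get('labels', {}).get(lang)
            labels[qid] = label['value'] if label else None
    return labels

print(get_labels(['Q42', 'Q5']))  # {'Q42': 'Douglas Adams', 'Q5': 'human'}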

UnicodeDecodeError raised on LookupError

wptools.page('阿Vane', lang='zh', variant='zh-cn').get_wikidata()

It seems that when the title mixes Chinese and English words, there is an error decoding it:

  File "/var/dae/apps/fm/fm/model/wiki/fetch.py", line 37, in fetch_wikidata_item_by_name_and_lang
    name, lang=lang, silent=True, variant=variant).get_wikidata()
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 721, in get_wikidata
    self._set_wikidata()
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 482, in _set_wikidata
    self.g_wikidata['query'].replace('&format=json', ''))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 175: ordinal not in range(128)

props and claims differences

Hi Steve,

I'm wondering about the page object's props and claims: what's the difference?
Am I correct that props is more like the raw data, while claims is parsed so it's friendlier to use in the library?

thanks so much!

Consider custom exception class

It would be nice to have a custom exception class if necessary. It could protect users from changes in our implementation (for example, we raise pycurl.error now, but that may change to a requests exception in the future), but we should have a clear and strong case for making another class.
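
If we do make one, the class could simply wrap whatever the transport raises; a sketch:

import pycurl

class WPToolsError(Exception):
    """Raised for any wptools request failure, whatever the
    underlying HTTP library."""

def curl_perform(crl):
    try:
        crl.perform()
    except pycurl.error as err:
        raise WPToolsError('request failed: %s' % str(err))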

AttributeError: 'NoneType' object has no attribute 'get'

>>> wptools.page(u'松下奈緒',  lang='zh', variant='zh-cn').get_wikidata()
松下奈緒 (zh/zh-cn)
{
  lang: zh
  title: 松下奈緒
  variant: zh-cn
}
www.wikidata.org (wikidata) 松下奈緒
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 778, in get_wikidata
    self._set_wikidata()
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 475, in _set_wikidata
    self._marshal_claims(item.get('claims'))
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 444, in _marshal_claims
    self.props = self._wikidata_props(query_claims)
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 421, in _wikidata_props
    snak = prop.get('mainsnak').get('datavalue').get('value')
AttributeError: 'NoneType' object has no attribute 'get'

Add flag to defer get_imageinfo when calling get()

Currently, when we call get() we may end up calling get_imageinfo() up to three times (for get_query, get_parse, and get_wikidata):

In [11]: be = wptools.page('Let It Be').get()
Let_It_Be (en)
{
  lang: en
  title: Let_It_Be
}
en.wikipedia.org (query) Let_It_Be
en.wikipedia.org (imageinfo) File:Billy Preston perforning in 1971.jpg
en.wikipedia.org (parse) 140537
en.wikipedia.org (imageinfo) File:Billy Preston perforning in 1971.jpg|F...
www.wikidata.org (wikidata) Q199585
www.wikidata.org (claims) Q484958|Q2078599|Q640978|Q184259|Q1299|Q11399|...
Let_It_Be (en)
{
  ...
}

Let's add a flag to defer getting imageinfo until the last call in get().
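
A sketch of the flag; the signature and the imageinfo kwarg are illustrative, not the current API:

def get(self, defer_imageinfo=True):
    self.get_query(imageinfo=not defer_imageinfo)
    self.get_parse(imageinfo=not defer_imageinfo)
    self.get_wikidata(imageinfo=not defer_imageinfo)
    if defer_imageinfo:
        self.get_imageinfo()  # one call, after all files are known
    return self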

API misses should raise an exception

I'm using wptools.page().get() to get a page's data, but it turns out the data doesn't exist.

Still, the instance it returns has these:

{
  g_parse: <dict(3)> {info, query, response}
  g_query: <dict(3)> {info, query, response}
  g_wikidata: <dict(3)> {info, query, response}
  images: <dict(3)> {rimage, rthumb, wimage}
  lang: en
  title: 张楚
  wikidata: <dict(3)> {category, instance, subclass}
}

The wikidata here is from the previous request it made, not the current page.

missing title, pageid should raise LookupError

When I look up a wikibase's Chinese Wikipedia page, I'm using:

>>> p = wptools.page(wikibase='Q8075262', lang='zh', variant='zh-cn').get()
Q8075262 (zh/zh-cn)
{
  images: <dict(1)> {wimage}
  lang: zh
  variant: zh-cn
  wikibase: Q8075262
}
www.wikidata.org (action=wikidata) Q8075262
www.wikidata.org (action=wikidata) Q215380|Q837837
get_wikidata: need title or pageid
get_wikidata: need title or pageid

but it didn't throw a LookupError as I expected.

add api to set request timeout

I'm using a proxy to request Wikipedia, so the response often times out. Would it be better to add a timeout parameter (like the requests package has)?

  File "/var/dae/apps/fm/venv/src/wptools/wptools/fetch.py", line 112, in curl
    return self.curl_perform(crl)
  File "/var/dae/apps/fm/venv/src/wptools/wptools/fetch.py", line 120, in curl_perform
    crl.perform()
pycurl.error: (28, 'Operation timed out after 15000 milliseconds with 0 out of 0 bytes received')

Consolidate images

Each image should be put in a dict with its key indicating the source, and no attempt to compute a media URL until later. Something like this:

images {
  get_query: <filename>,
  get_parse: <filename>,
  get_wikidata: [<filename>],
  get_rest: <filename>
}

We should open another issue to provide a method to resolve filenames.

Need solid unit tests for language variants (esp. Chinese)

As shown in #50, getting language variants right is complicated by:

  1. incorrect API data
  2. incomplete support by wikisites
  3. knowing what to expect under ideal conditions

We need some solid test cases that operate on some title/page/item with variants that are correct.
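
A skeleton of such a test, using the Q20474 example from #50; the expected value assumes the zh-cn conversion is correct, which is exactly what these tests should pin down first:

import unittest
import wptools

class TestZhVariants(unittest.TestCase):

    def test_simplified_label(self):
        page = wptools.page(wikibase='Q20474', lang='zh',
                            variant='zh-cn').get()
        # simplified (zh-cn), not traditional 迴響貝斯
        self.assertEqual(page.label, u'回响贝斯')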

pycurl SSL certificate problem (on Windows)

After installing via pip, I start testing with:
import wptools

a = wptools.page('usa')

I always get:

a.get()
en.wikipedia.org (action=query) USA
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda2\lib\site-packages\wptools\core.py", line 460, in get
    self.get_query(show=False)
  File "C:\Anaconda2\lib\site-packages\wptools\core.py", line 547, in get_query
    query['response'] = self.__fetch.curl(qry)
  File "C:\Anaconda2\lib\site-packages\wptools\fetch.py", line 109, in curl
    return self.curl_perform(crl)
  File "C:\Anaconda2\lib\site-packages\wptools\fetch.py", line 117, in curl_perform
    crl.perform()
pycurl.error: (60, 'SSL certificate problem: unable to get local issuer certificate')

or

a.get()
en.wikipedia.org (action=query) USA
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda2\lib\site-packages\wptools\core.py", line 460, in get
    self.get_query(show=False)
  File "C:\Anaconda2\lib\site-packages\wptools\core.py", line 547, in get_query
    query['response'] = self.__fetch.curl(qry)
  File "C:\Anaconda2\lib\site-packages\wptools\fetch.py", line 109, in curl
    return self.curl_perform(crl)
  File "C:\Anaconda2\lib\site-packages\wptools\fetch.py", line 117, in curl_perform
    crl.perform()
pycurl.error: (28, 'Resolving timed out after 15007 milliseconds')

Any hints on where to go from here?

Implement get_imageinfo()

When API requests populate an instance with filenames (e.g. images, see #33), we need a method to resolve those filenames into valid URLs. We should provide a method that computes an initial guess at the URL using utils.media_url() with the default namespace (commons), performs a HEAD request on that URL, and adjusts the namespace by language and so on until each filename URL yields HTTP status 200.
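
A sketch of that loop, using the requests library for the HEAD probe and assuming media_url's namespace parameter works as described in #33:

import requests
from wptools import utils

def resolve_media_url(filename, lang='en'):
    # guess Commons first, then the language repository
    for namespace in ('commons', lang):
        url = utils.media_url(filename, namespace=namespace)
        if requests.head(url).status_code == 200:
            return url
    return None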

support Wikidata claims lists

I see that for a lot of musicians, the genre is a list:

[screenshot: a Wikidata genre claim with multiple values]

but wptools returns a single value; I'm not sure whether the API only returns the first value.

feature request: get all disambiguation results

Hi Steve,

Not sure whether this can be done now.

When I encounter a disambiguation page in get_wikidata(), I want to get all the pages in the disambiguation, and then decide which one I want (maybe by looking up its Wikidata claims).

Fix error on multibyte input to utils.media_url()

Probably just need to encode('utf-8') the input and urllib.quote() the resulting name.

Here's the thumb source from the MediaWiki API:

"source": "https://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Cuill%C3%A8re_Fontal%C3%A8s_Pal%C3%A9olithique_MHNT.PRE.2010.0.11.1.jpg/320px-Cuill%C3%A8re_Fontal%C3%A8s_Pal%C3%A9olithique_MHNT.PRE.2010.0.11.1.jpg",

and the traceback:

$ ./scripts/images.py "Spoon" pageimages
Traceback (most recent call last):
  File "./scripts/images.py", line 56, in <module>
    main()
  File "./scripts/images.py", line 52, in main
    wpimages(args.title, args.source, args.t, args.v, args.w)
  File "./scripts/images.py", line 25, in wpimages
    data = wptools.images(title, source, test, verbose, wiki)
  File "/Users/steve/Code/wptools/wptools/api.py", line 35, in images
    return extract.qry_images(data, source)
  File "/Users/steve/Code/wptools/wptools/extract.py", line 110, in qry_images
    return img_pageimages(data)
  File "/Users/steve/Code/wptools/wptools/extract.py", line 78, in img_pageimages
    data["source"] = utils.media_url(data["pageimage"])
  File "/Users/steve/Code/wptools/wptools/utils.py", line 51, in media_url
    digest = hashlib.md5(name).hexdigest()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 5: ordinal not in range(128)
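
The suggested fix applied to a media_url-style helper (a Python 2 sketch; the URL scheme matches the examples in #33):

# -*- coding: utf-8 -*-
import hashlib
import urllib

def media_url(fname, namespace='commons'):
    # hash and quote the UTF-8 bytes, not the unicode string
    name = fname.replace(' ', '_').encode('utf-8')
    digest = hashlib.md5(name).hexdigest()
    path = '/'.join([digest[:1], digest[:2], urllib.quote(name)])
    return 'https://upload.wikimedia.org/wikipedia/%s/%s' % (namespace, path)

print(media_url(u'Cuillère Fontalès Paléolithique MHNT.PRE.2010.0.11.1.jpg'))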
