wptools's People

Contributors

0x9900, jayvdb, kovarden, lisongx, matthewgehring, mcepl, robbieclarken, siznax, uriva

wptools's Issues

language variant not working

Hi Steve,

I'm using zh-cn to get a wikibase's title, but it seems the label and title are both traditional Chinese, not simplified Chinese (also, what's the difference between these two?), even though Wikipedia displays zh-cn on its page:

[screenshot: the zh.wikipedia.org page rendered in the zh-cn variant]

Q20474 (zh/zh-cn)
{
  lang: zh
  variant: zh-cn
  wikibase: Q20474
}
www.wikidata.org (wikidata) Q20474
www.wikidata.org (claims) Q1165777|Q8391309|Q188451
zh.wikipedia.org (query) 迴響貝斯
zh.wikipedia.org (parse) 1565553
迴響貝斯 (zh/zh-cn)
{
  cache: <dict(4)> {claims, parse, query, wikidata}
  claims: <dict(3)> {Q1165777, Q188451, Q8391309}
  extext: <str(520)> **回响贝斯**(英语:**Dubstep**),回响贝斯在1990年代诞生,源起英国伦敦南部...
  extract: <str(593)> <p><b>回响贝斯</b>(<span>英语:<span lang="en" xml:la...
  label: 迴響貝斯
  lang: zh
  links: https://en.wikipedia.org/wiki/Dubstep
  modified: <dict(2)> {page, wikidata}
  pageid: 1565553
  parsetree: <str(1268)> <root><template><title>Expand English</titl...
  props: <dict(3)> {P279, P31, P910}
  random: 游铭训
  title: 迴響貝斯
  url: https://zh.wikipedia.org/wiki/%E8%BF%B4%E9%9F%BF%E8%B2%9D%E6%96%AF
  url_raw: https://zh.wikipedia.org/wiki/%E8%BF%B4%E9%9F%BF%E8%B2%9D%E6%96%AF?action=raw
  variant: zh-cn
  wikibase: Q20474
  wikidata: <dict(3)> {category, instance, subclass}
  wikidata_url: https://www.wikidata.org/wiki/Q20474
  wikitext: <str(1000)> {{Expand English|Dubstep}}{{unreferenced|tim...
}

wikibase lookup bug

For example, q = wptools.page(wikibase='Q43303').get_query() will always give back a None page.

I think the code should throw an exception in this case, when it doesn't have enough information to get the query.
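
A minimal sketch of such a guard, assuming a get_query-style method that needs either a title or a pageid (names are illustrative, not necessarily wptools' actual internals):

def get_query(self):
    # fail loudly instead of silently returning None when we lack
    # the parameters needed to form a query
    if not self.title and not self.pageid:
        raise LookupError("get_query: need title or pageid")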

Consider high-level service class

Currently, we do this:

>>> query = wptools.lead('Aardvark', test=True)
>>> query[:72]
'http://en.wikipedia.org/w/api.php?action=parse&format=json&page=Aardvark'

>>> html = wptools.lead('Aardvark')  # HTTP request 1/2
>>> html[:72]
'<p>The <b>aardvark</b> (<span class="nowrap"><span class="IPA nopopups">'    

>>> text = wptools.lead('Aardvark', plain=True)  # HTTP request 2/2
>>> text[:72]
'The **aardvark** (/ˈɑːrd.vɑːrk/ _**ARD**-vark_; _Orycteropus afer_)'

Would it be worth it to make a service class, so we can do this?

>>> o = wptools.WPTools(fn='lead', title='Aardvark')

>>> o.query[:72]
'http://en.wikipedia.org/w/api.php?action=parse&format=json&page=Aardvark'

>>> o.html
None

>>> o.get()  # HTTP request 1/1

>>> o.html[:72]
'<p>The <b>aardvark</b> (<span class="nowrap"><span class="IPA nopopups">'    

>>> o.text[:72]
'The **aardvark** (/ˈɑːrd.vɑːrk/ _**ARD**-vark_; _Orycteropus afer_)'

The second request ("HTTP request 2/2") in the first block is not really necessary, and the advantage of the second block is that you can refer to the same object to access both the HTML and the text, with only one HTTP request made.

Not all functions may work out this nicely though.
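
For discussion, a minimal sketch of what such a service class might look like; build_query, fetch, extract_html, and plain_text are hypothetical helpers standing in for the existing internals:

class WPTools(object):

    def __init__(self, fn, title):
        self.fn = fn
        self.title = title
        self.query = build_query(fn, title)  # no HTTP yet
        self.html = None
        self.text = None

    def get(self):
        # one HTTP request serves both representations
        response = fetch(self.query)
        self.html = extract_html(response)
        self.text = plain_text(self.html)
        return self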

Figure out "best" image

We get the following images from MediaWiki:API and Wikidata:

Image => Wikidata Property:P18
pageimage => MediaWiki:API (action=query) pageimage
thumbnail => MediaWiki:API (action=query) thumbnail

RESTBase /page/mobile-text also gives us a couple of images:

image => apparently, pageimage (above)
thumb => apparently, scaled (larger) thumbnail (above)

Spot-checking some articles, the only conflicts were with thumbnail, where the RESTBase thumb was the scaled (larger) version of the MediaWiki:API thumbnail.

In the case of Napoleon, the Wikidata Image is a portrait (great!), while the get_query images are an Imperial Coat of Arms (what?).

››› Napoleon (en)

In the case of Stephen Fry, pageimage seems more current.

››› Stephen Fry (en)

In the case of Ella Fitzgerald, it's hard to say.

››› Ella Fitzgerald (en) {}

So, which is the "best" image? I guess it depends on your goal.
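
If a consumer just wants one image, a preference order is one way to encode a goal. A sketch only, with illustrative attribute names:

def best_image(page):
    # prefer the curated Wikidata image (P18), then the API's
    # pageimage, then the (smaller) thumbnail
    for attr in ('wikidata_image', 'pageimage', 'thumbnail'):
        image = getattr(page, attr, None)
        if image:
            return image
    return None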

Add request method for common "get_" tasks

Most get_ methods do nearly the same thing (see the sketch after this list):

  • check cache
  • maybe raise LookupError
  • form query
  • get response, info
  • cache query, response, info
  • set attributes from response data
  • maybe get_claims, get_imageinfo, show
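
A sketch of a shared request method covering those steps; helper names here are illustrative, not the current internals:

def _request(self, action, set_data):
    if action in self.cache:
        return self  # check cache
    if not self.title and not self.pageid:
        raise LookupError("%s: need title or pageid" % action)  # maybe raise
    query = self._form_query(action)        # form query
    response, info = self._fetch(query)     # get response, info
    self.cache[action] = {'query': query,
                          'response': response,
                          'info': info}     # cache query, response, info
    set_data(response)                      # set attributes from response data
    return self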

Ubuntu dependencies

Email submission...

For Xubuntu 16.04, it was just a few steps:

$ sudo apt-get install python-pip
$ sudo apt-get install libcurl4-gnutls-dev
$ sudo apt-get install libgnutls28-dev
$ sudo pip install wptools

Monkeypatch for python3 unicode

For those who want to use wptools in Python 3 after cloning the repo and installing a local copy on your system:

core.py

-            if type(prop) is str or type(prop) is unicode:
+            if type(prop) is str:

Since Python 3 moved to str only, it is difficult to make a patch for this particular unicode check that works in both Python 2 and 3.

A version check is kinda tricky:

import sys

# note: sys.version_info.major is an int, so compare against 2, not "2"
if sys.version_info.major == 2:
    if type(prop) is str or type(prop) is unicode:
        ...
elif sys.version_info.major == 3:
    if type(prop) is str:
        ...
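
An alternative that avoids checking the version at each call is to define the string types once at import time; this is a common compatibility idiom, not what wptools currently does:

try:
    string_types = (str, unicode)  # Python 2
except NameError:
    string_types = (str,)          # Python 3

# then, wherever the check is needed:
if isinstance(prop, string_types):
    ...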

Support language variants (simplified Chinese)

I'm using p = wptools.page(wikibase='Q5094115', lang='zh').get() to get the Chinese content of the page, but the content I get from p.extract is traditional Chinese:

>>> print p.extract
<p><b>林一峰</b>(英文名:Chet Lam,1976年4月11日<span title="Template:BLP editintro">-</span>),香港土生土長的創作歌手,未受過正統音樂訓練,靠自學而學懂彈結他、作曲、寫詞;除了大量替其他歌手創作及製作外,林成立了兩間工作室:LYFE及思人創作,推出過十多張粵語、國語跟英語專輯,替自己及其他音樂單位舉辦音樂演出。文字方面,林已經推出了7本著作,包括2014年集音樂專輯、旅行故事、中英對照食譜於一身的《慢煮快活》。</p>
<p>林是香港唱作組合at17成員林二汶的哥哥,兩人經常合作演出。</p>

I'm wondering, is there a way to get the simplified version of the content from the API? On https://zh.wikipedia.org/ you can do this by selecting a different Chinese flavor:

[screenshot: the zh.wikipedia.org language-variant selector]

Thanks Steve!

Helper method to validate/resize image URLs

Currently, we calculate image URLs with utils.media_url and default to namespace='commons' to save additional API calls. This results in an invalid path if the file is not in Commons.

404 NOT FOUND:

https://upload.wikimedia.org/wikipedia/commons/1/14/Super_Sentai_World_screenshot.jpg

200 OK:

https://upload.wikimedia.org/wikipedia/en/1/14/Super_Sentai_World_screenshot.jpg

We should offer a helper method to validate image paths using MediaWiki API:Imageinfo. If the pages key is -1 or has a missing member, then the image should be located in the language namespace repository instead.

Missing from commons.wikimedia.org:

https://commons.wikimedia.org/w/api.php?action=query&titles=File:Super_Sentai_World_screenshot.jpg&prop=imageinfo

"pages": {
    "-1": {
        "ns": 6,
        "title": "File:Super Sentai World screenshot.jpg",
        "missing": "",
        "imagerepository": ""
    }
}

Found in en.wikipedia.org:

https://en.wikipedia.org/w/api.php?action=query&titles=File:Super_Sentai_World_screenshot.jpg&prop=imageinfo

"pages": {
    "31043864": {
        "pageid": 31043864,
        "ns": 6,
        "title": "File:Super Sentai World screenshot.jpg",
        "imagerepository": "local",
        "imageinfo": [
            {
                "timestamp": "2011-03-01T06:56:07Z",
                "user": "Jonny2x4"
            }
        ]
    }
}

This may require not one but two additional API calls: 1) verify the path via API:Imageinfo, and if the image is missing from Commons, 2) recompute the path with namespace=<lang> and verify it with an API:Imageinfo call to <lang>.wikipedia.org.
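
A minimal sketch of the first check using the requests library (illustrative, not wptools' actual code path):

import requests

def image_exists(filename, wiki='commons.wikimedia.org'):
    # query API:Imageinfo for the file; a pageid of -1 or a
    # "missing" member means the file is not in this repository
    params = {'action': 'query', 'format': 'json',
              'titles': 'File:' + filename, 'prop': 'imageinfo'}
    api = 'https://%s/w/api.php' % wiki
    pages = requests.get(api, params=params).json()['query']['pages']
    page = list(pages.values())[0]
    return '-1' not in pages and 'missing' not in page

# fall back to the language repository when Commons misses
if not image_exists('Super_Sentai_World_screenshot.jpg'):
    image_exists('Super_Sentai_World_screenshot.jpg',
                 wiki='en.wikipedia.org')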

We should also use this same method/code path to recompute image URLs for a specific size by adding the iiurlwidth param:

https://en.wikipedia.org/w/api.php?action=query&titles=File:Super_Sentai_World_screenshot.jpg&prop=imageinfo&iiprop=url&iiurlwidth=320

"pages": {
    "31043864": {
        "pageid": 31043864,
        "ns": 6,
        "title": "File:Super Sentai World screenshot.jpg",
        "imagerepository": "local",
        "imageinfo": [
            {
"thumburl": "https://upload.wikimedia.org/wikipedia/en/1/14/Super_Sentai_World_screenshot.jpg",
"thumbwidth": 320,
"thumbheight": 240,
"url": "https://upload.wikimedia.org/wikipedia/en/1/14/Super_Sentai_World_screenshot.jpg",
"descriptionurl": "https://en.wikipedia.org/wiki/File:Super_Sentai_World_screenshot.jpg",
"descriptionshorturl": "https://en.wikipedia.org/w/index.php?curid=31043864"
            }
        ]
    }
}

Support more wikis by allowing insecure requests

Currently, all queries use HTTPS, so if you try a non-Wikimedia site, you may get an SSL error:

>>> p = wptools.page(wiki='tolkiengateway.net')
tolkiengateway.net (action=random) None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "wptools/core.py", line 113, in __init__
    self.get_random()
  File "wptools/core.py", line 588, in get_random
    response = self.__fetch.curl(query)
  File "wptools/fetch.py", line 110, in curl
    return self.curl_perform(crl)
  File "wptools/fetch.py", line 118, in curl_perform
    crl.perform()
pycurl.error: (7, 'Failed to connect to tolkiengateway.net port 443: Connection refused')

If we adjust the query to use 'http' instead of 'https':

diff --git a/wptools/fetch.py b/wptools/fetch.py
index 249b210..81250d7 100644
--- a/wptools/fetch.py
+++ b/wptools/fetch.py
@@ -176,7 +176,7 @@ class WPToolsFetch(object):

         self.action = action
         self.thing = thing
-        return qry
+        return qry.replace('https', 'http')

Then we can access more MediaWiki instances...

>>> p = wptools.page(wiki='tolkiengateway.net')
tolkiengateway.net (action=random) None
Helm's_Deep_(scene) (en)
{
  lang: en
  pageid: 41211
  title: Helm's_Deep_(scene)
}
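
A safer variant of the patch would downgrade only when the HTTPS connection is refused, rather than rewriting every query; a sketch:

import pycurl

def curl_with_fallback(self, query):
    try:
        return self.curl(query)
    except pycurl.error as err:
        # error 7: failed to connect (e.g. no HTTPS listener)
        if err.args[0] == pycurl.E_COULDNT_CONNECT:
            return self.curl(query.replace('https://', 'http://', 1))
        raise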

No-lead text output probably double-encoded

$ ./text.py Kintsugi | head
_**Kintsugi (���?)**_ (Japanese: _golden joinery_)

Okay with -l (lead) option:

$ ./text.py Kintsugi -l | head
_**Kintsugi (金継ぎ?)**_ (Japanese: _golden joinery_)
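
For reference, a round-trip like the one below repairs mojibake only when the original bytes survive the bad decode; the replacement characters (�) above suggest the bytes were already lost, so the real fix is to decode as UTF-8 in the right place (my reading of the symptom, not a confirmed diagnosis):

# -*- coding: utf-8 -*-
text = u'金継ぎ'
garbled = text.encode('utf-8').decode('latin-1')   # simulate the bug
fixed = garbled.encode('latin-1').decode('utf-8')  # lossless round-trip
assert fixed == text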

Consolidate attributes

Let's have some reasonable level of access to important API data on an instance, but also consolidate where it makes sense. Let's use this issue for discussions on what makes sense, and start here:

  1. Put image items (image, thumbnail, etc.) in images attribute (see #33)
  2. Put cache items (g_query, etc.) in cache attribute

AttributeError on get_wikidata in __get_image_files

Traceback:

  File "/var/dae/apps/fm/fm/model/wiki/fetch.py", line 27, in fetch_wikidata_item_by_name_and_lang
    timeout=WIKI_FETCH_TIMEOUT).get_wikidata()
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 830, in get_wikidata
    self.get_imageinfo(False)
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 601, in get_imageinfo
    files = self.__get_image_files()
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 131, in __get_image_files
    fname = item.replace('_', ' ')
AttributeError: 'list' object has no attribute 'replace'

output is not json format

Thanks for the release,

But when I run "./infobox.py University_of_Cambridge", the output contains 631 lines and is not in JSON format. Does this problem occur on your machine?

wikidata item classification

It's not really about this lib, but it would be great if wptools could help with this:

I'm using the claims attribute from Wikidata to classify whether the page I get is a musician or a musical group.

WIKI_DATA_HUMAN = 'Q5'
WIKI_DATA_BAND = 'Q215380'
WIKI_DATA_ROCK_BAND = 'Q5741069'

Right now I'm using these three values: if a page has any one of them in its claims, I consider it a musician. But Wikidata has tons of these instance-of values (it's a tree, I guess), so is there a clever way to tell whether one Wikidata item is a subclass of another?
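
For what it's worth, the Wikidata Query Service can answer exactly this with a property path; a sketch using requests, outside of wptools (P31 = instance of, P279 = subclass of):

import requests

def is_instance_of(item, target):
    # true if item is an instance of target or of any
    # transitive subclass of target
    query = 'ASK { wd:%s wdt:P31/wdt:P279* wd:%s }' % (item, target)
    r = requests.get('https://query.wikidata.org/sparql',
                     params={'query': query, 'format': 'json'})
    return r.json()['boolean']

print(is_instance_of('Q2831', 'Q5'))  # is Michael Jackson a human? True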

Thanks in advance, Steve!

Infobox templates should revert to wikitext

Infobox output (e.g. wptools.infobox('Aardvark')) is derived from the parsetree in order to avoid string-hacking wikitext; however, this can lead to values that do not appear to be the most useful:

"fossil_range": "<template><title>Fossil range</title><part><name index=\"1\"/>
<value>5</value></part><part><name index=\"2\"/><value>0</value></part>
</template>&lt;small&gt;Early [[Pliocene]] &#8211; Recent&lt;/small&gt;"

I think a better result would look more like the original wikitext, e.g. wptools.wikitext('Aardvark'):

fossil_range = {{Fossil range|5|0}}<small>Early [[Pliocene]] – Recent</small>

We can expand this template via the API (see API:Parsing_wikitext), but this yields a large amount of HTML which may be equally uninteresting:

[screenshot: the expanded fossil_range template for the Aardvark infobox, rendered as HTML]

I think the most useful result may be:

fossil_range = {{Fossil range|5|0}}

Perhaps there is a way to get the original wikitext out of the parsetree template?
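
One possible route, sketched against the parsetree shown above: walk the <template> element and re-join its title and parts with pipes (nested templates would need recursion):

import lxml.etree

def template_to_wikitext(node):
    # parsetree templates always carry a <title> and <part> children
    parts = [node.findtext('title').strip()]
    for part in node.findall('part'):
        name = part.find('name')
        value = part.findtext('value') or ''
        if name.get('index'):                       # positional parameter
            parts.append(value)
        else:                                       # named parameter
            parts.append('%s=%s' % (name.text.strip(), value))
    return '{{%s}}' % '|'.join(parts)

node = lxml.etree.fromstring(
    '<template><title>Fossil range</title>'
    '<part><name index="1"/><value>5</value></part>'
    '<part><name index="2"/><value>0</value></part></template>')
print(template_to_wikitext(node))  # {{Fossil range|5|0}}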

RESTBase /page/mobile-text/ missing some maths

For example, in the Size functor article, the RESTBase /page/mobile-text/ result is missing:

* for each {\displaystyle x\in \mathbb {R} \ } , {\displaystyle F_{i}(x)=H_{i}(M_{x});\ }
* {\displaystyle F_{i}(k_{xy})=H_{i}(j_{xy}).\ }

MediaWiki:API action=query:

equal to the morphism in <span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mrow class="MJX-TeXAtom-ORD">
          <mi mathvariant="normal">R</mi>
          <mi mathvariant="normal">o</mi>
          <mi mathvariant="normal">r</mi>
          <mi mathvariant="normal">d</mi>
        </mrow>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle \mathrm {Rord} \ }</annotation>
  </semantics></math></span></span> from <span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>x</mi>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle x\ }</annotation>
  </semantics></math></span></span> to <span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>y</mi>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle y\ }</annotation>
  </semantics></math></span></span>,</p>
<ul><li>for each <span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>x</mi>
        <mo></mo>
        <mrow class="MJX-TeXAtom-ORD">
          <mi mathvariant="double-struck">R</mi>
        </mrow>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle x\in \mathbb {R} \ }</annotation>
  </semantics></math></span></span>, <span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <msub>
          <mi>F</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>i</mi>
          </mrow>
        </msub>
        <mo stretchy="false">(</mo>
        <mi>x</mi>
        <mo stretchy="false">)</mo>
        <mo>=</mo>
        <msub>
          <mi>H</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>i</mi>
          </mrow>
        </msub>
        <mo stretchy="false">(</mo>
        <msub>
          <mi>M</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>x</mi>
          </mrow>
        </msub>
        <mo stretchy="false">)</mo>
        <mo>;</mo>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle F_{i}(x)=H_{i}(M_{x});\ }</annotation>
  </semantics></math></span></span></li>
<li><span><span><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <msub>
          <mi>F</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>i</mi>
          </mrow>
        </msub>
        <mo stretchy="false">(</mo>
        <msub>
          <mi>k</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>x</mi>
            <mi>y</mi>
          </mrow>
        </msub>
        <mo stretchy="false">)</mo>
        <mo>=</mo>
        <msub>
          <mi>H</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>i</mi>
          </mrow>
        </msub>
        <mo stretchy="false">(</mo>
        <msub>
          <mi>j</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>x</mi>
            <mi>y</mi>
          </mrow>
        </msub>
        <mo stretchy="false">)</mo>
        <mo>.</mo>
        <mtext> </mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle F_{i}(k_{xy})=H_{i}(j_{xy}).\ }</annotation>
  </semantics></math></span></span></li>
</ul><p>In other words,

RESTBase /page/mobile-text/:

equal to the <a href="/wiki/Morphism" title="Morphism">morphism</a> in <span><span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mrow class="MJX-TeXAtom-ORD">
          <mi mathvariant="normal">R</mi>
          <mi mathvariant="normal">o</mi>
          <mi mathvariant="normal">r</mi>
          <mi mathvariant="normal">d</mi>
        </mrow>
        <mtext>&nbsp;</mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle \mathrm {Rord} \ }</annotation>
  </semantics></math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/6c971f380a0932266ee272cb327075e361d632d6" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.338ex; width:5.7ex; height:2.176ex;" alt="{\displaystyle \mathrm {Rord} \ }"></span> from <span><span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>x</mi>
        <mtext>&nbsp;</mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle x\ }</annotation>
  </semantics></math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/4bf17264a35330beeb310c35f9676cf9837482e3" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.338ex; width:1.921ex; height:1.676ex;" alt="x\ "></span> to <span><span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <mi>y</mi>
        <mtext>&nbsp;</mtext>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle y\ }</annotation>
  </semantics></math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/6e6f65d38cec79fda789e1335dea91732a186a41" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.671ex; width:1.747ex; height:2.009ex;" alt="y\ "></span>,
In other words,

get_imageinfo hangs on proxy with timeout

The timeout seems not to be working here:

>>> wptools.page(wikibase='Q312559', lang='zh').get(proxy=DOUBAN_PROXY, timeout=2)
Q312559 (zh)
{
  lang: zh
  wikibase: Q312559
}
www.wikidata.org (wikidata) Q312559
www.wikidata.org (claims) Q21|Q5|Q11399
zh.wikipedia.org (query) 比尔·怀曼
zh.wikipedia.org (parse) 1542057
zh.wikipedia.org (imageinfo) File:Bill Wyman 2009.jpg

BTW, is there a way to skip imageinfo processing, since I only need the text data?

Simplify "wptool" script

Recent improvements should allow us to make the wptool script much simpler. In fact, a goal should be for that script to do as little as possible, supported by attributes resulting from the fewest queries.

get_rest() raises ValueError

The following title:

Красная_площадь (ru)
{
  lang: ru
  title: Красная_площадь
}

causes an exception on get_rest():

Traceback (most recent call last):
  File "try.py", line 32, in test_rest
    r.get_rest()
  File "/Users/steve/Code/wptools/wptools/core.py", line 615, in get_rest
    self._set_rest_data()
  File "/Users/steve/Code/wptools/wptools/core.py", line 385, in _set_rest_data
    lead = self.__get_lead(data)
  File "/Users/steve/Code/wptools/wptools/core.py", line 141, in __get_lead
    lead.append(self.__get_lead_rest(data))
  File "/Users/steve/Code/wptools/wptools/core.py", line 209, in __get_lead_rest
    return self.__postprocess_lead(html)
  File "/Users/steve/Code/wptools/wptools/core.py", line 215, in __postprocess_lead
    snip = utils.snip_html(html, verbose=1 if self.verbose else 0)
  File "/Users/steve/Code/wptools/wptools/utils.py", line 120, in snip_html
    elem.remove(desc)
  File "src/lxml/lxml.etree.pyx", line 950, in lxml.etree._Element.remove (src/lxml/lxml.etree.c:50798)
ValueError: Element is not a child of this node.

Add support for get_wikidata continuations

Hi Steve,
I'm not sure this is possible yet in Wikidata's API, but I have a list of 1000+ ids and only need their titles right now. Instead of querying them one by one, it would be great to get all the names in one query.
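
The Wikidata API does support this: wbgetentities accepts up to 50 ids per request, so 1000+ labels take about 20 calls rather than 1000. A sketch outside of wptools:

import requests

def get_labels(ids, lang='en'):
    api = 'https://www.wikidata.org/w/api.php'
    labels = {}
    for i in range(0, len(ids), 50):  # 50-id batches
        params = {'action': 'wbgetentities', 'format': 'json',
                  'ids': '|'.join(ids[i:i + 50]),
                  'props': 'labels', 'languages': lang}
        entities = requests.get(api, params=params).json()['entities']
        for qid, entity in entities.items():
            label = entity.get('labels', {}).get(lang)
            labels[qid] = label['value'] if label else None
    return labels

print(get_labels(['Q42', 'Q5']))  # {'Q42': 'Douglas Adams', 'Q5': 'human'}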

UnicodeDecodeError raised on LookupError

wptools.page('阿Vane', lang='zh', variant='zh-cn').get_wikidata()

It seems that when the title mixes Chinese and English words, there is an error decoding it:

  File "/var/dae/apps/fm/fm/model/wiki/fetch.py", line 37, in fetch_wikidata_item_by_name_and_lang
    name, lang=lang, silent=True, variant=variant).get_wikidata()
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 721, in get_wikidata
    self._set_wikidata()
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 482, in _set_wikidata
    self.g_wikidata['query'].replace('&format=json', ''))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 175: ordinal not in range(128)

props and claims differences

Hi Steve,

I'm wondering about the page object's props and claims: what's the difference?
Am I correct that props is more like the raw data, while claims is parsed so it's friendlier to use in the library?

thanks so much!

Consider custom exception class

It would be nice to have a custom exception class if necessary. It could protect users from changes in our implementation (for example, we raise pycurl.error now, but that may change to a requests exception in the future), but we should have a clear and strong case for making another class.
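
If we do make one, the class could simply wrap whatever the transport raises; a sketch:

import pycurl

class WPToolsError(Exception):
    """Raised for any wptools request failure, whatever the
    underlying HTTP library."""

def curl_perform(crl):
    try:
        crl.perform()
    except pycurl.error as err:
        raise WPToolsError('request failed: %s' % str(err))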

AttributeError: 'NoneType' object has no attribute 'get'

>>> wptools.page(u'松下奈緒',  lang='zh', variant='zh-cn').get_wikidata()
松下奈緒 (zh/zh-cn)
{
  lang: zh
  title: 松下奈緒
  variant: zh-cn
}
www.wikidata.org (wikidata) 松下奈緒
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 778, in get_wikidata
    self._set_wikidata()
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 475, in _set_wikidata
    self._marshal_claims(item.get('claims'))
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 444, in _marshal_claims
    self.props = self._wikidata_props(query_claims)
  File "/var/dae/apps/fm/venv/src/wptools/wptools/core.py", line 421, in _wikidata_props
    snak = prop.get('mainsnak').get('datavalue').get('value')
AttributeError: 'NoneType' object has no attribute 'get'

Add flag to defer get_imageinfo when calling get()

Currently, when we call get() we may end up calling get_imageinfo() up to three times (for get_query, get_parse, and get_wikidata):

In [11]: be = wptools.page('Let It Be').get()
Let_It_Be (en)
{
  lang: en
  title: Let_It_Be
}
en.wikipedia.org (query) Let_It_Be
en.wikipedia.org (imageinfo) File:Billy Preston perforning in 1971.jpg
en.wikipedia.org (parse) 140537
en.wikipedia.org (imageinfo) File:Billy Preston perforning in 1971.jpg|F...
www.wikidata.org (wikidata) Q199585
www.wikidata.org (claims) Q484958|Q2078599|Q640978|Q184259|Q1299|Q11399|...
Let_It_Be (en)
{
  ...
}

Let's add a flag to defer getting imageinfo until the last call in get().
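
A sketch of the flag; the signature and the imageinfo kwarg are illustrative, not the current API:

def get(self, defer_imageinfo=True):
    self.get_query(imageinfo=not defer_imageinfo)
    self.get_parse(imageinfo=not defer_imageinfo)
    self.get_wikidata(imageinfo=not defer_imageinfo)
    if defer_imageinfo:
        self.get_imageinfo()  # one call, after all files are known
    return self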

API misses should raise an exception

I'm using wptools.page().get() to get a page's data, but it turns out the data doesn't exist.

Still, the instance it returns has these:

{
  g_parse: <dict(3)> {info, query, response}
  g_query: <dict(3)> {info, query, response}
  g_wikidata: <dict(3)> {info, query, response}
  images: <dict(3)> {rimage, rthumb, wimage}
  lang: en
  title: 张楚
  wikidata: <dict(3)> {category, instance, subclass}
}

The wikidata here is from the previous request it made, not the current page.

missing title, pageid should raise LookupError

When I look up a wikibase's Chinese Wikipedia page, I'm using:

>>> p = wptools.page(wikibase='Q8075262', lang='zh', variant='zh-cn').get()
Q8075262 (zh/zh-cn)
{
  images: <dict(1)> {wimage}
  lang: zh
  variant: zh-cn
  wikibase: Q8075262
}
www.wikidata.org (action=wikidata) Q8075262
www.wikidata.org (action=wikidata) Q215380|Q837837
get_wikidata: need title or pageid
get_wikidata: need title or pageid

but it didn't throw a LookupError as I expected.

add api to set request timeout

I'm using a proxy to request Wikipedia, so the response often times out. Would it be better to add a timeout parameter (like the requests package has)?

  File "/var/dae/apps/fm/venv/src/wptools/wptools/fetch.py", line 112, in curl
    return self.curl_perform(crl)
  File "/var/dae/apps/fm/venv/src/wptools/wptools/fetch.py", line 120, in curl_perform
    crl.perform()
pycurl.error: (28, 'Operation timed out after 15000 milliseconds with 0 out of 0 bytes received')

Consolidate images

Each image should be put in a dict with its key indicating the source, and no attempt to compute a media URL until later. Something like this:

images {
  get_query: <filename>,
  get_parse: <filename>,
  get_wikidata: [<filename>],
  get_rest: <filename>
}

We should open another issue to provide a method to resolve filenames.

Need solid unit tests for language variants (esp. Chinese)

As shown in #50, getting language variants right is complicated by:

  1. incorrect API data
  2. incomplete support by wikisites
  3. knowing what to expect under ideal conditions

We need some solid test cases that operate on some title/page/item with variants that are correct.
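
A skeleton of such a test, using the Q20474 example from #50; the expected value assumes the zh-cn conversion is correct, which is exactly what these tests should pin down first:

import unittest
import wptools

class TestZhVariants(unittest.TestCase):

    def test_simplified_label(self):
        page = wptools.page(wikibase='Q20474', lang='zh',
                            variant='zh-cn').get()
        # simplified (zh-cn), not traditional 迴響貝斯
        self.assertEqual(page.label, u'回响贝斯')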

pycurl SSL certificate problem (on Windows)

After installing via pip, I start testing with:
import wptools

a = wptools.page('usa')

I always get:

a.get()
en.wikipedia.org (action=query) USA
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda2\lib\site-packages\wptools\core.py", line 460, in get
    self.get_query(show=False)
  File "C:\Anaconda2\lib\site-packages\wptools\core.py", line 547, in get_query
    query['response'] = self.__fetch.curl(qry)
  File "C:\Anaconda2\lib\site-packages\wptools\fetch.py", line 109, in curl
    return self.curl_perform(crl)
  File "C:\Anaconda2\lib\site-packages\wptools\fetch.py", line 117, in curl_perform
    crl.perform()
pycurl.error: (60, 'SSL certificate problem: unable to get local issuer certificate')

or

a.get()
en.wikipedia.org (action=query) USA
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda2\lib\site-packages\wptools\core.py", line 460, in get
    self.get_query(show=False)
  File "C:\Anaconda2\lib\site-packages\wptools\core.py", line 547, in get_query
    query['response'] = self.__fetch.curl(qry)
  File "C:\Anaconda2\lib\site-packages\wptools\fetch.py", line 109, in curl
    return self.curl_perform(crl)
  File "C:\Anaconda2\lib\site-packages\wptools\fetch.py", line 117, in curl_perform
    crl.perform()
pycurl.error: (28, 'Resolving timed out after 15007 milliseconds')

Any hints on where to go from here?

Implement get_imageinfo()

When API requests populate an instance with filenames (e.g. images, see #33), we need a method to resolve those filenames into valid URLs. We should provide a method that computes an initial guess at the URL using utils.media_url() with the default namespace (commons), performs a HEAD request on that URL, and adjusts the namespace by language and so on until each filename URL yields HTTP status 200.
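
A sketch of that loop, using the requests library for the HEAD probe and assuming media_url's namespace parameter works as described in #33:

import requests
from wptools import utils

def resolve_media_url(filename, lang='en'):
    # guess Commons first, then the language repository
    for namespace in ('commons', lang):
        url = utils.media_url(filename, namespace=namespace)
        if requests.head(url).status_code == 200:
            return url
    return None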

support Wikidata claims lists

I see that for a lot of musicians, the genre is a list:

[screenshot: a Wikidata genre claim with multiple values]

but wptools returns a single value; I'm not sure whether the API only returns the first value.

feature request: get all disambiguation results

Hi Steve,

Not sure whether this can be done now.

When I encounter a disambiguation page in get_wikidata(), I want to get all the pages in the disambiguation, and then decide which one I want (maybe by looking up its Wikidata claims).

Fix error on multibyte input to utils.media_url()

Probably just need to encode('utf-8') the input and urllib.quote() the resulting name.

Here's the thumb source from the MediaWiki API:

"source": "https://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Cuill%C3%A8re_Fontal%C3%A8s_Pal%C3%A9olithique_MHNT.PRE.2010.0.11.1.jpg/320px-Cuill%C3%A8re_Fontal%C3%A8s_Pal%C3%A9olithique_MHNT.PRE.2010.0.11.1.jpg",

and the traceback:

$ ./scripts/images.py "Spoon" pageimages
Traceback (most recent call last):
  File "./scripts/images.py", line 56, in <module>
    main()
  File "./scripts/images.py", line 52, in main
    wpimages(args.title, args.source, args.t, args.v, args.w)
  File "./scripts/images.py", line 25, in wpimages
    data = wptools.images(title, source, test, verbose, wiki)
  File "/Users/steve/Code/wptools/wptools/api.py", line 35, in images
    return extract.qry_images(data, source)
  File "/Users/steve/Code/wptools/wptools/extract.py", line 110, in qry_images
    return img_pageimages(data)
  File "/Users/steve/Code/wptools/wptools/extract.py", line 78, in img_pageimages
    data["source"] = utils.media_url(data["pageimage"])
  File "/Users/steve/Code/wptools/wptools/utils.py", line 51, in media_url
    digest = hashlib.md5(name).hexdigest()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 5: ordinal not in range(128)
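
The suggested fix applied to a media_url-style helper (a Python 2 sketch; the URL scheme matches the examples in #33):

# -*- coding: utf-8 -*-
import hashlib
import urllib

def media_url(fname, namespace='commons'):
    # hash and quote the UTF-8 bytes, not the unicode string
    name = fname.replace(' ', '_').encode('utf-8')
    digest = hashlib.md5(name).hexdigest()
    path = '/'.join([digest[:1], digest[:2], urllib.quote(name)])
    return 'https://upload.wikimedia.org/wikipedia/%s/%s' % (namespace, path)

print(media_url(u'Cuillère Fontalès Paléolithique MHNT.PRE.2010.0.11.1.jpg'))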
