
mediawiki's Introduction

MediaWiki


*mediawiki* is a Python wrapper and parser for the MediaWiki API. The goal is to allow users to quickly and efficiently pull data from the MediaWiki site of their choice instead of dealing directly with the API. It does not force the use of a particular MediaWiki site: it defaults to Wikipedia, but any other MediaWiki site can be used.

MediaWiki wraps the MediaWiki API so you can focus on leveraging your favorite MediaWiki site's data, not on getting it. Please check out the code on GitHub!

Note: this library was designed for ease of use and simplicity. If you plan on doing serious scraping, automated requests, or editing, please look into Pywikibot, which has a larger API, advanced rate limiting, and other features that help keep usage considerate of the MediaWiki infrastructure. Pywikibot also has extra features such as support for Wikibase (which runs Wikidata).

Installation

Pip Installation:

$ pip install pymediawiki

To install from source:

To install mediawiki, simply clone the repository on GitHub, then run from the folder:

$ python setup.py install

mediawiki supports Python versions 3.7 - 3.12

For Python 2.7 support, install release 0.6.7:

$ pip install pymediawiki==0.6.7

Documentation

Documentation of the latest release is hosted on readthedocs.io

To build the documentation yourself, run:

$ pip install sphinx
$ cd docs/
$ make html

Automated Tests

To run the automated tests, run the following command from the project folder:

$ python setup.py test

Quickstart

Import mediawiki and run a standard search against Wikipedia:

>>> from mediawiki import MediaWiki
>>> wikipedia = MediaWiki()
>>> wikipedia.search('washington')

Run more advanced searches:

>>> wikipedia.opensearch('washington')
>>> wikipedia.allpages('a')
>>> wikipedia.geosearch(title='washington, d.c.')
>>> wikipedia.geosearch(latitude='0.0', longitude='0.0')
>>> wikipedia.prefixsearch('arm')
>>> wikipedia.random(pages=10)

Pull a MediaWiki page and some of the page properties:

>>> p = wikipedia.page('Chess')
>>> p.title
>>> p.summary
>>> p.categories
>>> p.images
>>> p.links
>>> p.langlinks

See the documentation for more examples!

Changelog

Please see the changelog for a list of all changes.

License

MIT licensed. See the LICENSE file for full details.


mediawiki's Issues

Replacing ''' with """

According to the Google Python Style Guide, docstrings should use """ instead of ''':

Prefer """ for multi-line strings rather than '''. Projects may choose to use ''' for all non-docstring multi-line strings if and only if they also use ' for regular strings. Docstrings must use """ regardless.

Text Search Functionality

The current search looks for titles related to the passed query. It would be interesting to add the ability to search within the text of a page. The MediaWiki API's search module has an option that signals to search the text of the article instead of the title. A use case could be something like the following:

wikipedia = MediaWiki()
res = wikipedia.textsearch('chess match', results=100)
print(res)
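For reference, the underlying API already supports full-text search via list=search with srwhat=text (the textsearch method above is only a proposal); a minimal sketch using requests against the public Wikipedia endpoint:

import requests

# Illustrative only: full-text search through the raw MediaWiki API,
# which supports srwhat=text (title search is the default).
params = {
    "action": "query",
    "list": "search",
    "srsearch": "chess match",
    "srwhat": "text",
    "srlimit": 100,
    "format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
titles = [hit["title"] for hit in resp.json()["query"]["search"]]
print(titles)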

Mediawiki.allpages(limit=999999999) capped at 500 (?)

Hello there,

I need to get all the pages for a given wiki instance. I tried

en_wikipedia.allpages(limit=999999999999)

(no query specified - in hopes of having ALL the pages returned) and yet it only returns 500 values (len()).

Is it possible to get ALL the pages for a given wiki, please? Perhaps limit=-1 or limit=None to return the entirety of page listing?

Thank you for the really easy-to-use module!

search giving wrong page

from mediawiki import DisambiguationError, MediaWiki
wikipedia = MediaWiki()
search = "Mother"
meaning = wikipedia.page(search)
print(meaning)
def_meaning = meaning.summarize(chars=300)
print(meaning)
print(def_meaning)

this will result in:
<MediaWikiPage 'Father'>
<MediaWikiPage 'Father'>
A father is the male parent of a child. Besides the paternal bonds of a father to his children, the father may have a parental, legal, and social relationship with the child that carries with it certain rights and obligations. An adoptive father is a male who has become the child's parent through the...

Why does it give the Father page instead of Mother?

Support pagination (continue)

.search("a", results=1000) will lead to 500 results since 500 is the maximum. However, a continue parameter is supported ("When more results are available, use this to continue"), which I suppose allows for paging through more results.

It would be nice if this was somehow supported (maybe as an iterator that does another query when needed).
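A rough sketch of such an iterator using the raw API's generic continue mechanism (the search_all name and the direct use of requests are illustrative, not part of this library):

import itertools

import requests

API = "https://en.wikipedia.org/w/api.php"

def search_all(query, page_size=500):
    """Illustrative generator: yields titles across result pages."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": page_size,
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for hit in data["query"]["search"]:
            yield hit["title"]
        if "continue" not in data:
            break
        # merge the continuation tokens (e.g. sroffset) into the next request
        params.update(data["continue"])

# first 1000 results, fetched 500 at a time
first_1000 = list(itertools.islice(search_all("a"), 1000))
print(len(first_1000))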

Better test data handling

It would be better if test data were stored in JSON, or the like, rather than in a pickle. Some issues to currently overcome (one possible approach is sketched after the list):

  • sets are not serializable
  • storing keys as strings
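A minimal sketch of one approach: a json.JSONEncoder subclass that serializes sets as sorted lists (the encoder name and sample data are illustrative):

import json

class SetFriendlyEncoder(json.JSONEncoder):
    """Illustrative encoder: serializes sets as sorted lists so test
    fixtures can round-trip through JSON instead of pickle."""
    def default(self, obj):
        if isinstance(obj, (set, frozenset)):
            return sorted(obj)
        return super().default(obj)

data = {"categories": {"Chess", "Board games"}, "pageid": 2334}
text = json.dumps(data, cls=SetFriendlyEncoder, indent=2)
# note: JSON object keys are always strings, so non-string keys
# (e.g. integer page ids) need converting back when loading
restored = json.loads(text)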

Querying more than (500/5000) results in an automated way?

Hi,

I'm attempting to get the results of a query (categorymembers) and the number of results is more than 500. I'm fine with doing multiple queries, but is there a way to continue where I left off? I know the "cmcontinue" param is available in raw_res in the categorymembers function, but I'm not sure if I can leverage it directly to get the results I want, or if I'm missing something.

For example, let's say I want to use the user default max (500) to get a list of all the pages that exist in a category, but there's 8000 pages in the category. Is it possible to loop a query to get all the pages?
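One way to loop today is to drive the API directly and feed cmcontinue back into each subsequent request; a sketch using requests against the public Wikipedia endpoint (parameter names come from the MediaWiki API, not from this library):

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Physics",
    "cmlimit": 500,
    "format": "json",
}

members = []
while True:
    data = requests.get(API, params=params).json()
    members.extend(m["title"] for m in data["query"]["categorymembers"])
    cont = data.get("continue")
    if cont is None:
        break
    params["cmcontinue"] = cont["cmcontinue"]  # resume where the last batch stopped

print(len(members))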

Method `page.sections` returns HTML in some cases

Hello,
I'm using this library to get textual descriptions for classes in the CUB 2011 dataset.

For each of the 200 bird classes in the CUB dataset, I get the corresponding Wikipedia page and look at the sections with the page.sections property.
In some cases I get HTML codes inside the section titles, for example:

from mediawiki import MediaWiki
wikipedia = MediaWiki()
page = wikipedia.page('Pied billed Grebe')
print(page.sections)

output:
[u'Taxonomy and name', u'Subspecies<sup>&#91;8&#93;</sup>', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']

Then, if I use the page.section(str) method with the string u'Subspecies<sup>&#91;8&#93;</sup>':

print(page.section(page.sections[1]))

output: None

The correct string to find the section with the page.section(str) method is simply 'Subspecies'.

I actually managed to fix this issue by implementing this function:

import re

def fixed_sections(page_content, verbose=False):
    sections = []
    section_regexp = r'\n==* .* ==*\n'  # '== {STUFF_NOT_\n} =='
    # use the page_content argument (the original accidentally used the outer `page` object)
    found_obj = re.findall(section_regexp, page_content)

    for obj in found_obj:  # re.findall returns a (possibly empty) list
        obj = obj.lstrip('\n= ').rstrip(' =\n')
        sections.append(obj)
        if verbose:
            print("Found section: {}".format(obj))
    return sections

correct_sections  = fixed_sections(page.content)
print(correct_sections)
print(page.section(correct_sections[1]))

With this code I get the correct output, i.e. the content of the section (sub-section in this case):

[u'Taxonomy and name', u'Subspecies', u'Description', u'Vocalization', u'Distribution and habitat', u'Behaviour', u'Breeding', u'Diet', u'Threats', u'In culture', u'Status', u'References', u'External links']
P. p. podiceps, (Linnaeus, 1758), North America to Panama & Cuba.
P. p. antillarum, (Bangs, 1913), Greater & Lesser Antilles.
P. p. antarcticus, (Lesson, 1842), South America to central Chile & Argentina.

This fix works for me, but it requires executing a regex for each page, so it may not be optimal.

Support for files (URL, uploading user, etc)

Hiya, I was wondering if this project has a place for wrapping some file-related information.

Now, I forked this repository and added wrapping for a given file's URL, but I was wondering if this sort of functionality has enough merit and reason to add to the main project before I create a pull request.

I haven't added any tests yet, but if that's okay I'll go ahead and add some.
Thank you 🙂

feature: add an available_languages property

There is a supported_languages property, but that doesn't mean that a language is available. A property such as available_languages would be useful. The solution would be a list of languages that were successfully pinged. Maybe there is something out there that needs to be exposed via this API...
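A rough sketch of what "successfully pinged" could mean in practice; the available_languages helper, the per-language URL pattern, and the use of requests are all assumptions for illustration, not existing library behavior:

import requests

def available_languages(langs, timeout=5):
    """Illustrative check: a language counts as available if its
    api.php endpoint answers a trivial siteinfo query."""
    available = []
    for lang in langs:
        url = f"https://{lang}.wikipedia.org/w/api.php"
        try:
            r = requests.get(
                url,
                params={"action": "query", "meta": "siteinfo", "format": "json"},
                timeout=timeout,
            )
            if r.ok and "query" in r.json():
                available.append(lang)
        except requests.RequestException:
            pass
    return available

print(available_languages(["en", "de", "zz"]))  # 'zz' should drop out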

No wikitext in current PyPI package

The following code

from mediawiki import MediaWiki
mw = MediaWiki('https://handwiki.org/wiki/api.php')
p = mw.page('Nikola Tesla') # this gives the wrong page
print(p)
p.wikitext

Gives the following output

<MediaWikiPage 'Tesla (unit)'>
AttributeError: 'MediaWikiPage' object has no attribute 'wikitext'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/projects/wiki_subsetting/app/classes.py in <module>
      2 mw = MediaWiki('https://handwiki.org/wiki/api.php')
      3 p = mw.page('Nikola Tesla') # this gives the wrong page
----> 4 p.wikitext

AttributeError: 'MediaWikiPage' object has no attribute 'wikitext'

Using version 0.7.0 on 64-bit Linux.

The latest documentation states that this attribute should be available (link).

"raw" page content

Hi, is there a way to get the wikicode of a page? page.content does not really work for this.

Installing Extensions

How exactly do I install the needed "extensions" for the code to work? I am trying to get the content of a fandom page and it's throwing me an error called "Unable to extract page content, The TextExtracts extension must be installed!"

Parse hatnote(s)

It would be nice if we could also parse the hatnote information. It is currently only available in the html, so we may need to parse it from there to make it work.
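As a stopgap, one can pull hatnotes out of page.html with BeautifulSoup; a sketch assuming the div.hatnote class that English Wikipedia's templates use (other wikis may mark them differently):

from bs4 import BeautifulSoup
from mediawiki import MediaWiki

wikipedia = MediaWiki()
page = wikipedia.page("Chess")

# English Wikipedia wraps hatnotes in <div class="hatnote">; this is a
# convention of that wiki's templates, not something the API guarantees.
soup = BeautifulSoup(page.html, "html.parser")
hatnotes = [div.get_text(" ", strip=True) for div in soup.select("div.hatnote")]
print(hatnotes)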

Bug encountered during accessing local Wiki

Python 3.5.2 (default, Oct  8 2019, 13:06:37) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mediawiki import MediaWiki
>>> mw = MediaWiki(url="http://192.168.10.3:10080/mywiki/api.php")
Traceback (most recent call last):
  File "/XXXXXXX/barrust-mediawiki/mediawiki/mediawiki/mediawiki.py", line 92, in __init__
    self._get_site_info()
  File "/XXXXXXX/barrust-mediawiki/mediawiki/mediawiki/mediawiki.py", line 868, in _get_site_info
    raise MediaWikiException("Missing query in response")
mediawiki.exceptions.MediaWikiException: An unknown error occured: "Missing query in response". Please report it on GitHub!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/XXXXXXX/barrust-mediawiki/mediawiki/mediawiki/mediawiki.py", line 94, in __init__
    raise MediaWikiAPIURLError(url)
mediawiki.exceptions.MediaWikiAPIURLError: http://192.168.10.3:10080/mywiki/api.php is not a valid MediaWiki API URL

Wiki is valid, works perfectly, API is enabled. The Wiki version is quite recent: 1.33.0
Hm, so what to do now?

Category title transformation is language specific (categorytree is broken as a result for non-english languages)

The English prefix "Category:" is removed in mediawiki.py:

elif rec['type'] == 'subcat':
    tmp = rec['title']
    if tmp.startswith('Category:'):
        tmp = tmp[9:]
    subcats.append(tmp)

This does not work for non-English languages. categorytree is broken for non-English languages because each category is prefixed with "Category:":

try:
    categories[cat] = self.page('Category:{0}'.format(cat))
    parent_cats = categories[cat].categories
    links[cat] = self.categorymembers(cat, results=None,
                                      subcategories=True)
    break

In German, for example, this leads to titles like "Category:Kategorie:Kategorietitel". I see three possible ways forward (which is why I did not want to create a PR):

  • do not strip/prepend "Category:" and let the user handle it in their language
  • offer defaults and a setting for some languages (five with most articles or something like that?)
  • query some wikipedia api to get the category prefix in the respective language.

I like it simple, so I would stick with the first solution :)

Initial Update

Hi 👊

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! 🤖

Docstring Documentation style

pymediawiki uses Sphinx-style docstrings, but switching to the napoleon extension for Sphinx would be more compact and more readable.

Is an async version planned?

It's an awesome library, but I think an async version would benefit a lot of people. Currently, to use it in an async program, run_in_executor is needed. Is an async version planned for the future?
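For reference, the run_in_executor workaround mentioned above looks roughly like this (a sketch only, not an official async API):

import asyncio
from functools import partial

from mediawiki import MediaWiki

wikipedia = MediaWiki()

async def fetch_page(title):
    loop = asyncio.get_running_loop()
    # run the blocking library call in the default thread-pool executor
    return await loop.run_in_executor(None, partial(wikipedia.page, title))

async def main():
    page = await fetch_page("Chess")
    print(page.title)

asyncio.run(main())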

Parse possible logos

It would be helpful if we could get a listing of all images within the infobox. To do this, one would likely have to parse the html elements to get to the infobox class.
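A sketch of that HTML approach, assuming the infobox table class used on English Wikipedia (class names may differ on other wikis):

from bs4 import BeautifulSoup
from mediawiki import MediaWiki

wikipedia = MediaWiki()
page = wikipedia.page("IBM")

# Infoboxes on English Wikipedia are tables with the 'infobox' class;
# any <img> inside them is a candidate logo or lead image.
soup = BeautifulSoup(page.html, "html.parser")
infobox = soup.find("table", class_="infobox")
logos = [img["src"] for img in infobox.find_all("img")] if infobox else []
print(logos)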

Combine property pulls into single requests

In order to reduce the load on MediaWiki servers, it would be good to combine as many of the property requests as possible. Things that can be pulled at one time should be (see the sketch after the list below).

Some possible pitfalls:

  • Figuring out how to properly use the continue parameter when multiple elements are being returned
  • Determining which properties should be combined into a single MediaWiki API request
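For illustration, the API does allow several prop modules in a single round trip; a sketch using requests directly (the combination shown and its parameters are an example, not the library's internals):

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "titles": "Chess",
    # several property modules combined into one round trip
    "prop": "extracts|categories|links|langlinks",
    "exintro": 1,
    "cllimit": "max",
    "pllimit": "max",
    "lllimit": "max",
    "format": "json",
}
data = requests.get(API, params=params).json()
page = next(iter(data["query"]["pages"].values()))
# note: a 'continue' block may still appear for individual modules,
# which is exactly the first pitfall listed above
print(page.get("extract", "")[:80])
print(len(page.get("categories", [])), "categories in the first batch")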

Problem with hidden files in the article

Per goldsmith/Wikipedia Issue #132:
User: mkasprz

Error appeared while trying to get 'images' from articles which contain hidden files. For example calling wikipedia.WikipediaPage('One Two Three... Infinity').images caused an error.
I also considered using `and 'filehidden' not in page['imageinfo'][0]`, but this one seems more generic. I'm not really sure which one is better in this exact case.

pypi support

Getting mediawiki onto PyPI would allow for better visibility and a larger user base.

Python 2.7 EOL

Due to Python 2.7's EOL, a plan to remove Python 2 support should be in place. Once a plan is decided upon, all code specific to Python 2 support should be removed.

Cache timeout

Currently the caching mechanism does not have a timeout to 'refresh' the data. It would be helpful if one could provide a timeout parameter so that long-running applications can refresh the data.

from mediawiki import MediaWiki

wikipedia = MediaWiki()
# today, cached queries are never cleared out after X seconds;
# the proposal: refresh queries once they have been stale for 30 seconds
wikipedia.set_refresh = 30

It may be more beneficial to have the refresh be more than a few minutes as data is not updated so frequently.
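As an illustration of what such a timeout could look like internally, here is a small TTL-style memoization decorator; this is a sketch only, and set_refresh above is the proposed interface, not an existing one:

import time
from functools import wraps

def memoize_with_ttl(ttl_seconds):
    """Illustrative cache: entries older than ttl_seconds are recomputed."""
    def decorator(func):
        cache = {}
        @wraps(func)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = time.time()
            if key in cache:
                stamp, value = cache[key]
                if now - stamp < ttl_seconds:
                    return value
            value = func(*args, **kwargs)
            cache[key] = (now, value)
            return value
        return wrapper
    return decorator

@memoize_with_ttl(30)
def fetch_summary(title):
    # stand-in for an expensive call such as wikipedia.page(title).summary
    return "summary of " + title

print(fetch_summary("Chess"))   # computed
print(fetch_summary("Chess"))   # served from cache for the next 30 seconds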

generate_test_data raises MediaWikiException for PULL_GEOSEARCH

If the variable PULL_GEOSEARCH in the script generate_test_data.py is set to True, the script ends with the following error:

Traceback (most recent call last):
  File "[removed content]/mediawiki/scripts/generate_test_data.py", line 217, in
    longitude=Decimal('0.0'), results=22, radius=10000)
  File "../mediawiki/mediawiki/utilities.py", line 66, in wrapper
    cache[func.name][key] = (time.time(), func(*args, **kwargs))
  File "../mediawiki/mediawiki/mediawiki.py", line 489, in geosearch
    self._check_error_response(raw_results, title)
  File "../mediawiki/mediawiki/mediawiki.py", line 878, in _check_error_response
    raise MediaWikiException(err)
mediawiki.exceptions.MediaWikiException: An unknown error occured: "Page coordinates unknown.". Please report it on GitHub!

failure to raise DisambiguationError for "Leaching" page on Wikipedia

Using pymediawiki 0.3.15 with Python 3.6.3 on Linux, I observed the following error when attempting to access the disambiguation page for "Leaching" on Wikipedia.

python -c "import mediawiki; wikipedia=mediawiki.MediaWiki(); wikipedia.page('Leaching')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/lebedov/miniconda3/lib/python3.6/site-packages/mediawiki/mediawiki.py", line 731, in page
    preload=preload)
  File "/home/lebedov/miniconda3/lib/python3.6/site-packages/mediawiki/mediawikipage.py", line 72, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/home/lebedov/miniconda3/lib/python3.6/site-packages/mediawiki/mediawikipage.py", line 532, in __load
    self._raise_disambiguation_error(page, pageid)
  File "/home/lebedov/miniconda3/lib/python3.6/site-packages/mediawiki/mediawikipage.py", line 569, in _raise_disambiguation_error
    one_disambiguation['title'] = item[0]['title']
  File "/home/lebedov/miniconda3/lib/python3.6/site-packages/bs4/element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'title'

Relatedly, wikipedia 1.4.0 correctly raises a DisambiguationError exception for the above.

Authentication not supported?

Authentication seems not to be supported? That would be an essential feature, as non-public wikis tend to require authentication, sometimes even for reading.

Ability to set language for a specific lookup

Hi there, thanks for this library.

I have an IRC bot that fetches content when it sees certain URLs. It would be nice to pull the correct article for a given URL based on its language code in the URL. Currently, this is only possible by setting the language property. This has two downsides: (a) the memoized cache is cleared, (b) you need to make sure the language property is set back afterwards.

If page() took a language parameter, it would be possible to easily toggle it for specific lookups. The cache could either be ignored in these cases, or the cache could be extended to use a per-language dictionary.
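Until a per-call parameter exists, the workaround looks roughly like this, assuming the language property behaves as described above (setting it clears the memoized cache):

from mediawiki import MediaWiki

wikipedia = MediaWiki()

def page_in_language(wiki, title, lang):
    """Illustrative workaround: temporarily switch the language property."""
    original = wiki.language
    try:
        wiki.language = lang
        return wiki.page(title)
    finally:
        wiki.language = original  # restore, even if the lookup fails

p = page_in_language(wikipedia, "Fromage", "fr")
print(p.title)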

DisambiguationError gives incorrect titles of possible results

The titles of possible results given in the details property of a DisambiguationError can be incorrect. They can be the full text of the list item on the disambiguation page describing the possible result, rather than just the title of the possible result page.

Example:

from mediawiki import MediaWiki, DisambiguationError
wikipedia = MediaWiki()

try:
    p = wikipedia.page("Lincoln Museum")
except DisambiguationError as d:
    print(d.details[5]['title'])
    print(d.details[5]['description'])

gives

Ford's Theatre, Washington, DC, USA — where Abraham Lincoln was assassinated; known as Lincoln Museum from 1936 to 1965 and legally "Ford's Theater (Lincoln Museum)" since 1965
Ford's Theatre, Washington, DC, USA — where Abraham Lincoln was assassinated; known as Lincoln Museum from 1936 to 1965 and legally "Ford's Theater (Lincoln Museum)" since 1965

when what is expected is

Ford's Theatre
Ford's Theatre, Washington, DC, USA — where Abraham Lincoln was assassinated; known as Lincoln Museum from 1936 to 1965 and legally "Ford's Theater (Lincoln Museum)" since 1965

affected version: 0.6.0

Error when launching QuickStart tests on MacOS

I installed mediawiki on my machine (a MacBook Pro) running macOS Mojave, with Python 3.9. Right after finishing installation, I ran the first example in README.rst. I get an error at the wikipedia = MediaWiki() line: __init__() got an unexpected keyword argument 'encoding'. I am not sure the issue comes from mediawiki (actually, I have no clue what causes it), but I wanted to share it with you in case you want to take a look.

Support references scraping link title with URL?

I'm working on a tool where I want to scrape the common "Official Website" link from the "External links" section that appears in almost every company and organization article (for example). The references property returns links that appear in the "External links" section, but my problem is that I cannot easily locate which link is the "Official Website", as it returns a giant list of URLs.

Perhaps references could return a dictionary that contained both the URL and name of the link?

Something like this:

>>> from mediawiki import MediaWiki
>>> wikipedia = MediaWiki()
>>> page = wikipedia.page("McDonald's")
>>> page.references
{ ... "Official Website": "https://www.mcdonalds.com" ... }
>>> page.references["Official Website"]
"https://www.mcdonalds.com"

I'm submitting this issue here as it seems to be the most updated and active Python MediaWiki wrapper. Thanks for your work on this. I'll be looking to see if I can add this feature myself; however, I'm sure you are more familiar with the source code and may have a better solution to this problem. Thanks.

Edit: Hmm, after some digging it looks like the MediaWiki API just doesn't support returning the link title. I hope I can solve this problem without having to do some regex or beautifulsoup on page.html or something.

Pulling Table of Contents

Pulling the table of contents based on section titles could be very useful. Perhaps as an OrderedDict object of OrderedDicts.
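A rough sketch of building such a structure from page.content, which marks sections with '== Heading ==' style markers (the parsing shown is illustrative and only handles two levels):

import re
from collections import OrderedDict

from mediawiki import MediaWiki

wikipedia = MediaWiki()
page = wikipedia.page("Chess")

# '== Title ==' marks a section, '=== Title ===' a subsection in page.content
toc = OrderedDict()
current = None
for match in re.finditer(r"^(={2,3})\s*(.+?)\s*\1\s*$", page.content, re.MULTILINE):
    level, title = len(match.group(1)), match.group(2)
    if level == 2:
        current = title
        toc[current] = OrderedDict()
    elif current is not None:
        toc[current][title] = OrderedDict()

print(list(toc.keys()))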

Support all language variants. One wikipedia can offer a variety of languages variants

Correctly supporting all the languages might be even more difficult than #48.

E.g. the Chinese Wikipedia is under zh.wikipedia.org. The languages available are:

  • Continental Simplified
  • Hong Kong Traditional
  • Traditional Chinese
  • Ma Xinxing
  • Taiwanese

And the URL changes, but not the subdomain. So perhaps another test case/issue: getting a specific Chinese variant. (I am simply assuming that this does not work yet; please close it if I am wrong.)

Caught API URL Exception issue

When set_api_url throws an exception, it does not revert the API URL to a stable state. If the developer catches the exception, future requests will fail.

It would be beneficial to revert to the last good API URL if an exception or error is thrown when setting the property.

wikipedia = MediaWiki()

try:
    wikipedia.set_api_url('http://french.wikipedia.org/w/api.php', lang='fr')
except:
    pass

wikipedia.page('foobar')  # this will fail!
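Until the property reverts itself, a user-side guard can restore the previous endpoint; a sketch assuming api_url exposes the current endpoint (this may not fully restore internal state, which is why reverting inside the library would be better):

from mediawiki import MediaWiki
from mediawiki.exceptions import MediaWikiAPIURLError

wikipedia = MediaWiki()
previous_url = wikipedia.api_url  # assumed accessor for the current endpoint

try:
    wikipedia.set_api_url('http://french.wikipedia.org/w/api.php', lang='fr')
except MediaWikiAPIURLError:
    # fall back to the last known-good endpoint ourselves; ideally the
    # library would do this revert internally when set_api_url fails
    wikipedia.set_api_url(previous_url, lang='en')

wikipedia.page('foobar')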

KeyError: 'extract'

Hello there, I am getting an 'extract' KeyError when I try to access "content" or anything other than the title of a page.

Here is my simple code and the URL I am trying to access.

from mediawiki import MediaWiki
tropwiki = MediaWiki('https://posmotre.li/api.php')

test = tropwiki.page(pageid=4242)
print('found title>' + test.title)
print('found content>' + test.content)

Here is the traceback I am getting.

found title>The Elder Scrolls
Traceback (most recent call last):
File "/home/test/suptest.py", line 8, in <module>
    print('found content>'+(test.content))
  File "/home/.local/lib/python3.6/site-packages/mediawiki/mediawikipage.py", line 154, in content
    self._pull_content_revision_parent()
  File "/home/.local/lib/python3.6/site-packages/mediawiki/mediawikipage.py", line 140, in _pull_content_revision_parent
    self._content = page_info["extract"]
KeyError: 'extract'

It does succeed in returning the page title, but it errors on content/summary and anything else. Is this a bug, or is it something to do with the site?
How can I access the page's content or summary on this specific site?

Tested on Python 3.8 and 3.6.
Please check this; I would be grateful for any answer.
