wpoa / recitation-bot
MediaWiki bot to upload content to Wikimedia projects and update corresponding citations on Wikipedia.
License: GNU General Public License v3.0
At
https://en.wikisource.org/w/index.php?title=Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Schizophrenia_and_Violence_Systematic_Review_and_Meta-Analysis&oldid=4970952 ,
the category display is broken because
https://en.wikisource.org/w/index.php?title=Template:Header&oldid=4660353#Categories
only allows for 10 categories. We should therefore pass at most 10 categories and define rules for which ones to drop when the list is longer.
Example:
https://en.wikisource.org/w/index.php?title=Wikisource%3AWikiProject_Open_Access%2FProgrammatic_import_from_PubMed_Central%2FSchizophrenia_and_Violence_Systematic_Review_and_Meta-Analysis&diff=4970959&oldid=4970952 .
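A minimal sketch of such a trimming rule. The 10-category cap comes from Template:Header; the "keep extraction order, drop the tail" policy is only an assumption, to be replaced by a smarter priority scheme:

```python
MAX_HEADER_CATEGORIES = 10  # limit imposed by Template:Header on English Wikisource

def trim_categories(categories):
    """Return at most MAX_HEADER_CATEGORIES categories.

    Hypothetical rule: deduplicate, keep the order in which categories
    were extracted from the article, and drop the tail. A ranking by
    category importance could be plugged in here later.
    """
    seen = []
    for cat in categories:
        if cat not in seen:  # drop exact duplicates first
            seen.append(cat)
    return seen[:MAX_HEADER_CATEGORIES]
```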
To benefit from ongoing bug fixes, it would be useful if the web interface had the option to re-upload (a) text or (b) media files or (c) both.
https://en.wikisource.org/w/index.php?title=Template:Recitation-bot
is used by file uploads to Wikisource and should be modeled after
https://commons.wikimedia.org/wiki/Template:Recitation-bot .
In some journals (e.g. at PLOS), tables are made available as image files in addition to tabular format. If the tabular version exists, we should always prefer it and not embed the image in the Wikisource text.
I am open to the idea of importing the image file nonetheless, as it may sometimes be useful in Wikipedia articles, but my suggested default would be to ignore those image files entirely.
Sample case:
Restarting jobs:
To stop and restart a running job in a single command (e.g. after you made a bugfix), use
qmod -rj job_number
according to
https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#jsub_options
Currently, recitation bot ignores non-image files. However, for audio and video files, it should sync with OAMI.
For instance,
https://en.wikisource.org/w/index.php?title=Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/A_new_species_of_the_Boophis_rappiodes_group_%28Anura_Mantellidae%29_from_the_Sahamalaza_Peninsula_northwest_Madagascar_with_acoustic_monitoring_of_i&oldid=5020464#Supplementary_material_1
links to "File:Zookeys-435-111-s001.wav", which already exists at https://commons.wikimedia.org/wiki/File:A-new-species-of-the-Boophis-rappiodes-group-%28Anura-Mantellidae%29-from-the-Sahamalaza-Peninsula-zookeys-435-111-s001.oga .
Eventually, the relevant parts of OAMI should be folded into recitation bot.
A number of file types are not uploaded to Commons (e.g. *.doc). These should not be embedded into the wiki text, but we should link to an external copy instead - be that available from PMC or through Wikimedia Labs.
Examples:
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Saudi_Arabian_Y-Chromosome_diversity_and_its_relationship_with_nearby_regions#Additional_file_1
and
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/In_Silico_Gene_Prioritization_by_Integrating_Multiple_Data_Sources#Supporting_Information
and
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/The_land_crab_Johngarthia_planata_%28Stimpson_1860%29_%28Crustacea_Brachyura_Gecarcinidae%29_colonizes_human-dominated_ecosystems_in_the_continental_main#XML_Treatment_for_Johngarthiaplanata .
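A sketch of how such files could be handled. The extension blacklist and the PMC supplementary-file URL pattern below are assumptions, not verified PMC behavior:

```python
import os

# Extensions that Commons does not accept; link externally instead of embedding.
NON_UPLOADABLE = {'.doc', '.docx', '.xls', '.xlsx'}

def supplementary_wikitext(filename, pmcid):
    """Return wikitext for one supplementary file.

    For non-uploadable types, emit an external link to the copy at PMC
    (the URL pattern below is an assumption); otherwise embed the file
    as a normal Commons file link.
    """
    ext = os.path.splitext(filename)[1].lower()
    if ext in NON_UPLOADABLE:
        url = ('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC%s/bin/%s'
               % (pmcid, filename))
        return '[%s %s]' % (url, filename)
    return '[[File:%s]]' % filename
```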
Just tried to upload http://dx.doi.org/10.7554/eLife.01621 , which was published in June but does not seem to be in PMC yet, so the upload understandably failed:
http://tools.wmflabs.org/recitation-bot/10.7554/eLife.01621.html .
There are multiple reasons why uploads may fail, and I am wondering what the best process would be to expose them in a way that is compatible with our bug tracking here on GitHub.
This should integrate with wpoa/OA-signalling#81 .
Some articles in PMC do not have DOIs, and a (small) subset thereof is openly licensed (e.g. http://www.ncbi.nlm.nih.gov/pmc/journals/896/ ), so it would be good if the web form also accepted PMC IDs.
These uploads should go to the respective Wikisource.
At OAMI, the file naming of the uploads to Commons gets rid of many special characters.
At Wikisource, we should strive to keep the paper titles as intact as possible (see also
#15 ), taking into account technical limitations of MediaWiki (e.g. colons or slashes in page names).
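A sketch of a title cleaner that keeps as much of the paper title as possible. MediaWiki forbids certain characters in page names outright; the replacement choices for slashes (which would create subpages) and colons (which can look like namespace prefixes) are assumptions:

```python
# Characters MediaWiki cannot have in page titles at all.
FORBIDDEN = '#<>[]|{}'

def wikisource_page_title(paper_title):
    """Keep the paper title as intact as possible while producing a
    valid MediaWiki page name. The stand-in characters chosen here are
    assumptions to be refined.
    """
    title = ''.join(c for c in paper_title if c not in FORBIDDEN)
    # Slashes would create subpages; colons can be mistaken for a
    # namespace prefix, so soften both.
    title = title.replace('/', '\u2044')   # U+2044 fraction slash as stand-in
    title = title.replace(':', ' -')
    return ' '.join(title.split())          # collapse runs of whitespace
```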
While I agree that the uploads by recitation-bot to Commons should eventually go through OAMI, I don't think the image uploads during this test phase should go to
https://commons.wikimedia.org/wiki/Category:Uploaded_with_Open_Access_Media_Importer
and
https://commons.wikimedia.org/wiki/Category:Uploaded_with_Open_Access_Media_Importer_and_needing_category_review .
Suggestion:
https://commons.wikimedia.org/wiki/Category:Uploaded_with_reCitation_Bot
and
https://commons.wikimedia.org/wiki/Category:Uploaded_with_reCitation_Bot_and_needing_category_review .
This was the one that failed due to the block:
http://tools.wmflabs.org/recitation-bot/10.1186/2049-9957-3-6.html .
as per
wpoa/OA-signalling#92 (comment) .
Especially "in press" stuff - example in
https://en.wikisource.org/wiki/Biodiversity_Assessment_of_the_Fishes_of_Saba_Bank_Atoll,_Netherlands_Antilles#cite_note-pone.0010676-Toller1-5 .
Just tried to import
http://elifesciences.org/content/1/e00248
and got this error message.
Would like to use it in a talk tomorrow.
@notconfusing @wrought - any chance you can fix this today?
Default should be both.
The current file descriptions (e.g. "Media belonging to article 10.1371/journal.pbio.1000436 which is cited on Wikipedia, and automatically imported." in
https://commons.wikimedia.org/w/index.php?title=File:Tracking-Marsupial-Evolution-Using-Archaic-Genomic-Retroposon-Insertions-pbio.1000436.g002.jpg&oldid=126539861 )
are not very helpful.
The corresponding code should thus be replaced with that in OAMI. Sample upload from PLOS:
https://commons.wikimedia.org/wiki/File:Messages-Do-Diffuse-Faster-than-Messengers-Reconciling-Disparate-Estimates-of-the-Morphogen-Bicoid-pcbi.1003629.s006.ogv .
Fig. 1 of
https://en.wikisource.org/w/index.php?title=Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Modelling_the_Species_Distribution_of_Flat-Headed_Cats_%28Prionailurus_planiceps%29_an_Endangered_South-East_Asian_Small_Felid&oldid=5032599
was imported into
https://commons.wikimedia.org/wiki/File:Modelling-the-Species-Distribution-of-Flat-Headed-Cats-%28Prionailurus-planiceps%29-an-Endangered-South-pone.0009612.g001.jpg
but the image there already existed (in higher resolution) as
https://commons.wikimedia.org/wiki/File:Plionailurus_planiceps.png .
According to Commons policies, our upload should thus be deleted.
In such cases, it would be best if we could
(a) detect such duplicates before upload, and
(b) post a message on the existing file's talk page with the proper metadata.
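Duplicate detection before upload could use the MediaWiki API's list=allimages with its aisha1 filter, which returns any Commons file whose SHA-1 matches a given hex digest. A sketch, with the network call kept separate from the hashing:

```python
import hashlib
import json
import urllib.parse
import urllib.request

COMMONS_API = 'https://commons.wikimedia.org/w/api.php'

def local_sha1(path):
    """Hex SHA-1 of a local file, as expected by the aisha1 parameter."""
    with open(path, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest()

def find_duplicates(path):
    """Return the names of Commons files with the same SHA-1 as `path`,
    so an exact duplicate can be caught before uploading.
    """
    params = urllib.parse.urlencode({
        'action': 'query', 'list': 'allimages',
        'aisha1': local_sha1(path), 'format': 'json',
    })
    with urllib.request.urlopen('%s?%s' % (COMMONS_API, params)) as resp:
        data = json.load(resp)
    return [img['name'] for img in data['query']['allimages']]
```

Note this only catches byte-identical copies; the higher-resolution pre-existing file in the example above would still need a perceptual-similarity check, which is out of scope here.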
On Commons, we have
https://commons.wikimedia.org/wiki/Category:Uploaded_with_reCitation_Bot ,
which has led to
https://en.wikisource.org/wiki/Category:Uploaded_with_reCitation_Bot
for equations and tables,
although we are using
https://en.wikisource.org/wiki/Category:Uploaded_by_Recitation-bot
to track uploads on Wikisource.
This means we have differences in capitalization, "by" vs. "with", and "_" vs. "-". These naming schemes need to be made consistent.
There is a maximum length for page titles at MediaWiki - 255 bytes according to https://www.mediawiki.org/wiki/Manual:Page_table#page_title . At the OAMI, we have opted to take the first 100 characters of a paper title before we append portions of the DOI.
This has worked fine so far.
For Wikisource, this is not the best approach, though, and I think we should try to accommodate as much of the article title as possible in the page name. Example:
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/A_cladistically_based_reinterpretation_of_the_taxonomy_of_two_Afrotropical_tenebrionid_genera_Ectateus_Koch_1956_and_Selinus_Mulsant_%26_Rey_1853_%28 .
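A sketch of byte-aware truncation against the 255-byte limit. The `reserved` parameter (space for a prefix path or DOI suffix) is a hypothetical knob, not existing bot behavior:

```python
MAX_TITLE_BYTES = 255  # MediaWiki page_title limit (bytes, not characters)

def fit_title(title, reserved=0):
    """Truncate a title so its UTF-8 encoding, plus `reserved` bytes for
    any prefix or suffix, stays within the MediaWiki limit without
    splitting a multi-byte character.
    """
    budget = MAX_TITLE_BYTES - reserved
    encoded = title.encode('utf-8')
    if len(encoded) <= budget:
        return title
    # errors='ignore' silently drops a trailing partial character
    return encoded[:budget].decode('utf-8', errors='ignore')
```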
Uploads by re-citation bot (example: https://commons.wikimedia.org/wiki/File:A-New-Basal-Hadrosauroid-Dinosaur-%28Dinosauria-Ornithopoda%29-with-Transitional-Features-from-the-Late-pone.0098821.g015.jpg ) should be marked by a template
https://commons.wikimedia.org/w/index.php?title=Template:Recitation-bot
to be modeled after
https://commons.wikimedia.org/wiki/Template:Open_Access_Media_Importer .
The latter template has been used in earlier uploads by VIAF bot and re-citation bot and should be replaced by the new one.
to the file or main namespaces.
At the moment, for instance, PMID and PMCID are missing:
https://commons.wikimedia.org/w/index.php?title=File:New-Family-of-Bluish-Pyranoanthocyanins-40403.fig.001.jpg&oldid=126893111 .
Instead of a license statement, some articles have only a "{{}}".
Examples:
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Analysis_of_Human_Cytomegalovirus-Encoded_SUMO_Targets_and_Temporal_Regulation_of_SUMOylation_of_the_Immediate-Early_Proteins_IE1_and_IE2_during
and
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/A_cladistically_based_reinterpretation_of_the_taxonomy_of_two_Afrotropical_tenebrionid_genera_Ectateus_Koch_1956_and_Selinus_Mulsant_%26_Rey_1853_%28
I think we've discussed this in at least two threads, but I couldn't find any of them right now. Anyway, it seems that high-res images are accessible via Europe PMC, e.g.
http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3258128/supplementaryFiles
These uploads by VIAFbot all use the {{Open Access Media Importer}} template:
https://commons.wikimedia.org/w/index.php?limit=250&tagfilter=&title=Special%3AContributions&contribs=user&target=VIAFbot&namespace=&tagfilter=&year=2014&month=6
That should be replaced by {{Recitation-bot}}.
The same goes for some earlier uploads by Recitation Bot: https://commons.wikimedia.org/w/index.php?title=Special:Contributions/Recitation-bot&offset=20140724025401&limit=250&target=Recitation-bot
Abstractly, we need to do the following to implement testing:
Once the upload of images for equations/tables to Wikisource works, we will need another checkbox in the web form, with the option to force a re-upload of these, perhaps separately for tables and figures.
I'm getting badtoken errors whenever I am trying to upload any paper. The issue seems not to be unique to us, as per https://phabricator.wikimedia.org/T61678 . Pinging @notconfusing @wrought .
See http://tools.wmflabs.org/recitation-bot/faillog.html for details.
For instance,
http://tools.wmflabs.org/recitation-bot/10.1371/journal.pcbi.1003149.html
and
http://tools.wmflabs.org/recitation-bot/10.1371/journal.pcbi.1000361.html
recently failed, but
http://tools.wmflabs.org/recitation-bot/faillog.html
still only shows the one test entry.
Files like the one at
https://en.wikisource.org/w/index.php?title=File:Neurobiological-Models-of-Two-Choice-Decision-Making-Can-Be-Reduced-to-a-One-Dimensional-Nonlinear-pcbi.1000046.e061.jpg&oldid=5068812
have multiple categories assigned to them, none of which is particularly helpful.
I thus propose to do away with these article-level keywords entirely for equation images, and to just put them into some maintenance category of the
https://en.wikisource.org/wiki/Category:Equations_uploaded_with_reCitation_Bot
and
https://en.wikisource.org/wiki/Category:Equations_uploaded_with_reCitation_Bot_and_needing_category_review
kind.
Note that the current category names use a different spelling for the bot than the bot's user name suggests.
Right now,
http://tools.wmflabs.org/recitation-bot/10.1371/journal.pone.0103437.html
only reads
doi: 10.1371/journal.pone.0103437
success: failed
but upon reupload, this should be updated, leaving the original information intact for debugging purposes.
So by default, multiple requests are ignored, but the user can force a re-upload.
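A sketch of a status log that appends rather than overwrites, so earlier attempts stay available for debugging. The JSON-dict-of-lists layout and field names are assumptions, not the current log format:

```python
import datetime
import json

def record_attempt(log_path, doi, status):
    """Append a timestamped status entry for `doi`, keeping earlier
    attempts intact. The file layout (a JSON dict mapping DOI to a
    list of attempts) is a hypothetical choice.
    """
    try:
        with open(log_path) as f:
            log = json.load(f)
    except (IOError, ValueError):
        log = {}  # no log yet, or unreadable: start fresh
    log.setdefault(doi, []).append({
        'time': datetime.datetime.utcnow().isoformat(),
        'status': status,
    })
    with open(log_path, 'w') as f:
        json.dump(log, f, indent=2)
    return log
```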
I already submitted a Bugzilla report for it:
https://bugzilla.wikimedia.org/show_bug.cgi?id=66962
Implement at least two types of logging using the core Python logging module:
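One way to get two log streams from the stdlib logging module: a concise console stream for operators and a verbose file for debugging. The levels, format, and file name are assumptions:

```python
import logging

def setup_logging(logfile='recitation-bot.log'):
    """Configure two handlers on one logger: INFO and above to the
    console, full DEBUG detail to a file. All choices here are
    placeholders, not the bot's actual configuration.
    """
    logger = logging.getLogger('recitation-bot')
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
    console = logging.StreamHandler()   # stderr by default
    console.setLevel(logging.INFO)
    console.setFormatter(fmt)
    filelog = logging.FileHandler(logfile)
    filelog.setLevel(logging.DEBUG)
    filelog.setFormatter(fmt)
    logger.addHandler(console)
    logger.addHandler(filelog)
    return logger
```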
gratuitous link dump:
Some of the files we upload to Commons will be renamed (example: https://commons.wikimedia.org/w/index.php?title=File%3ATrichostetha_bicolor_feeding_on_flowers_of_Agathosma_capensis_%28Rutaceae%29_at_Saldanha_Bay.jpg&diff=133547664&oldid=133547464 ).
When we re-upload a full text to Wikisource (which we should only do in Wikisource namespace), we should check for such renames and use the new ones when embedding the figures.
Not sure what the problem is precisely (cf. #35 ), but the last ca. 10 attempts to upload something all went nowhere.
Currently, only .jpg and .png are uploaded, but spelling variants thereof (e.g. .JPEG and .PNG) should also be allowed, as should TIF/TIFF and SVG.
https://github.com/wpoa/recitation-bot/blob/master/recitation-bot/journal_article.py#L167
Edits like these can easily lead to files being deleted:
https://commons.wikimedia.org/w/index.php?title=File:New-Family-of-Bluish-Pyranoanthocyanins-40403.fig.001.jpg&diff=prev&oldid=130065483 .
I do want the option of updating file pages, but this should probably be switched off by default:
Need to provide pywikibot config for testing purposes
=========================================================== ERRORS ============================================================
_______________________________________ ERROR collecting tests/test_journal_article.py ________________________________________
tests/test_journal_article.py:1: in <module>
> from recitationbot import journal_article
recitationbot/journal_article.py:11: in <module>
> import pywikibot
env/local/lib/python2.7/site-packages/pywikibot-2.0b1-py2.7.egg/pywikibot/__init__.py:30: in <module>
> from pywikibot import config2 as config
env/local/lib/python2.7/site-packages/pywikibot-2.0b1-py2.7.egg/pywikibot/config2.py:162: in <module>
> _base_dir = _get_base_dir()
env/local/lib/python2.7/site-packages/pywikibot-2.0b1-py2.7.egg/pywikibot/config2.py:158: in _get_base_dir
> raise RuntimeError(exc_text)
E RuntimeError: No user-config.py found in directory '/home/wrought/.pywikibot'.
E Please check that user-config.py is stored in the correct location.
E Directory where user-config.py is searched is determined as follows:
E
E Return the directory in which user-specific information is stored.
E
E This is determined in the following order -
E 1. If the script was called with a -dir: argument, use the directory
E provided in this argument
E 2. If the user has a PYWIKIBOT2_DIR environment variable, use the value
E of it
E 3. Use (and if necessary create) a 'pywikibot' folder under
E 'Application Data' or 'AppData\Roaming' (Windows) or
E '.pywikibot' directory (Unix and similar) under the user's home
E directory.
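A minimal user-config.py along these lines would satisfy pywikibot for test runs; the values are placeholders, not the production bot account, and `usernames` is a dict that pywikibot itself provides when it executes this file:

```python
# Minimal user-config.py for the test suite; values are placeholders.
family = 'wikisource'
mylang = 'en'
usernames['wikisource']['en'] = 'Recitation-bot'
```

Point pywikibot at it with the PYWIKIBOT2_DIR environment variable or the -dir: argument, per the search order in the traceback above.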
These uploads should go to the respective Wikisource.
What could possibly go wrong? Haven't tried it yet.
The "category" element of Template:Header on Wikisource is not very important, nor necessarily standard.
Because of #12, we should work around this: instead, we should simply add normal category wikilinks, e.g. [[Category:Evolutionary biology]].
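Emitting plain category links instead of the header parameter is trivial; a sketch:

```python
def category_wikitext(categories):
    """Emit plain [[Category:...]] links, one per line, instead of
    Template:Header's category parameter, sidestepping the 10-category
    limit there.
    """
    return '\n'.join('[[Category:%s]]' % c for c in categories)
```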
It makes sense to generate a comprehensive link dump based on the BEACON format: https://de.wikipedia.org/wiki/Wikipedia:BEACON/Format#Daten-Zeilen
However, we need to consider some more use cases for export formats. Namely:
For this, and probably other use cases, we just want to serve up JSON from a public URL endpoint. This should be straightforward with Python.
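The per-article record such an endpoint could serve might look as follows; the field names are assumptions, not a fixed schema:

```python
import json

def export_record(doi, status, wikisource_page=None):
    """Build the JSON document a public endpoint could serve for one
    import attempt. Field names here are hypothetical, not a fixed
    schema.
    """
    return json.dumps({
        'doi': doi,
        'status': status,
        'wikisource_page': wikisource_page,
    }, indent=2)
```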
http://tools.wmflabs.org/recitation-bot/10.1371/journal.pbio.1000247.html
states
doi: 10.1371/journal.pbio.1000247
success: failed
'http://creativecommons.org/publicdomain/zero/1.0/'
I don't think ef73cfb#diff-fb47ee8565fbf50e7e1ac9df3de0b94e is enough to fix that, but did not find the relevant code.