wpoa / recitation-bot
MediaWiki bot to upload content to Wikimedia projects and update corresponding citations on Wikipedia.
License: GNU General Public License v3.0
At
https://en.wikisource.org/w/index.php?title=Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Schizophrenia_and_Violence_Systematic_Review_and_Meta-Analysis&oldid=4970952 ,
the category display is broken because
https://en.wikisource.org/w/index.php?title=Template:Header&oldid=4660353#Categories
only allows for 10 categories. We should therefore pass at most 10 categories and define rules for which ones to drop when the list is longer.
Example:
https://en.wikisource.org/w/index.php?title=Wikisource%3AWikiProject_Open_Access%2FProgrammatic_import_from_PubMed_Central%2FSchizophrenia_and_Violence_Systematic_Review_and_Meta-Analysis&diff=4970959&oldid=4970952 .
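A minimal sketch of such a trimming rule. The 10-category cap comes from Template:Header; the "keep extraction order, drop the tail" policy is only an assumption, to be replaced by a smarter priority scheme:

```python
MAX_HEADER_CATEGORIES = 10  # limit imposed by Template:Header on English Wikisource

def trim_categories(categories):
    """Return at most MAX_HEADER_CATEGORIES categories.

    Hypothetical rule: deduplicate, keep the order in which categories
    were extracted from the article, and drop the tail. A ranking by
    category importance could be plugged in here later.
    """
    seen = []
    for cat in categories:
        if cat not in seen:  # drop exact duplicates first
            seen.append(cat)
    return seen[:MAX_HEADER_CATEGORIES]
```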
To benefit from ongoing bug fixes, it would be useful if the web interface had the option to re-upload (a) text or (b) media files or (c) both.
https://en.wikisource.org/w/index.php?title=Template:Recitation-bot
is used by file uploads to Wikisource and should be modeled after
https://commons.wikimedia.org/wiki/Template:Recitation-bot .
In some journals (e.g. at PLOS), tables are made available as image files in addition to tabular format. If the tabular version exists, we should always prefer it and not embed the image in the Wikisource text.
I am open to the idea of importing the image file nonetheless, as it may sometimes be useful in Wikipedia articles, but my suggested default would be to ignore those image files entirely.
Sample case:
Restarting jobs:
To stop and restart a running job in a single command (e.g. after you made a bugfix), use
qmod -rj job_number
according to
https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#jsub_options
Currently, recitation bot ignores non-image files. However, for audio and video files, it should sync with OAMI.
For instance,
https://en.wikisource.org/w/index.php?title=Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/A_new_species_of_the_Boophis_rappiodes_group_%28Anura_Mantellidae%29_from_the_Sahamalaza_Peninsula_northwest_Madagascar_with_acoustic_monitoring_of_i&oldid=5020464#Supplementary_material_1
links to "File:Zookeys-435-111-s001.wav", which already exists at https://commons.wikimedia.org/wiki/File:A-new-species-of-the-Boophis-rappiodes-group-%28Anura-Mantellidae%29-from-the-Sahamalaza-Peninsula-zookeys-435-111-s001.oga .
Eventually, the relevant parts of OAMI should be folded into recitation bot.
A number of file types are not uploaded to Commons (e.g. *.doc). These should not be embedded into the wiki text, but we should link to an external copy instead - be that available from PMC or through Wikimedia Labs.
Examples:
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Saudi_Arabian_Y-Chromosome_diversity_and_its_relationship_with_nearby_regions#Additional_file_1
and
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/In_Silico_Gene_Prioritization_by_Integrating_Multiple_Data_Sources#Supporting_Information
and
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/The_land_crab_Johngarthia_planata_%28Stimpson_1860%29_%28Crustacea_Brachyura_Gecarcinidae%29_colonizes_human-dominated_ecosystems_in_the_continental_main#XML_Treatment_for_Johngarthiaplanata .
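A sketch of how such files could be handled. The extension blacklist and the PMC supplementary-file URL pattern below are assumptions, not verified PMC behavior:

```python
import os

# Extensions that Commons does not accept; link externally instead of embedding.
NON_UPLOADABLE = {'.doc', '.docx', '.xls', '.xlsx'}

def supplementary_wikitext(filename, pmcid):
    """Return wikitext for one supplementary file.

    For non-uploadable types, emit an external link to the copy at PMC
    (the URL pattern below is an assumption); otherwise embed the file
    as a normal Commons file link.
    """
    ext = os.path.splitext(filename)[1].lower()
    if ext in NON_UPLOADABLE:
        url = ('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC%s/bin/%s'
               % (pmcid, filename))
        return '[%s %s]' % (url, filename)
    return '[[File:%s]]' % filename
```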
Just tried to upload http://dx.doi.org/10.7554/eLife.01621 , which was published in June but does not seem to be in PMC yet, so the upload understandably failed:
http://tools.wmflabs.org/recitation-bot/10.7554/eLife.01621.html .
There are multiple reasons why uploads may fail, and I am wondering what the best process would be to expose them in a way that is compatible with our bug tracking here on GitHub.
This should integrate with wpoa/OA-signalling#81 .
Some articles in PMC do not have DOIs, and a (small) subset thereof is openly licensed (e.g. http://www.ncbi.nlm.nih.gov/pmc/journals/896/ ), so it would be good if the web form also accepted PMC IDs.
These uploads should go to the respective Wikisource.
At OAMI, the file naming of the uploads to Commons gets rid of many special characters.
At Wikisource, we should strive to keep the paper titles as intact as possible (see also
#15 ), taking into account technical limitations of MediaWiki (e.g. colons or slashes in page names).
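A sketch of a title cleaner that keeps as much of the paper title as possible. MediaWiki forbids certain characters in page names outright; the replacement choices for slashes (which would create subpages) and colons (which can look like namespace prefixes) are assumptions:

```python
# Characters MediaWiki cannot have in page titles at all.
FORBIDDEN = '#<>[]|{}'

def wikisource_page_title(paper_title):
    """Keep the paper title as intact as possible while producing a
    valid MediaWiki page name. The stand-in characters chosen here are
    assumptions to be refined.
    """
    title = ''.join(c for c in paper_title if c not in FORBIDDEN)
    # Slashes would create subpages; colons can be mistaken for a
    # namespace prefix, so soften both.
    title = title.replace('/', '\u2044')   # U+2044 fraction slash as stand-in
    title = title.replace(':', ' -')
    return ' '.join(title.split())          # collapse runs of whitespace
```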
While I agree that the uploads by recitation-bot to Commons should eventually go through OAMI, I don't think the image uploads during this test phase should go to
https://commons.wikimedia.org/wiki/Category:Uploaded_with_Open_Access_Media_Importer
and
https://commons.wikimedia.org/wiki/Category:Uploaded_with_Open_Access_Media_Importer_and_needing_category_review .
Suggestion:
https://commons.wikimedia.org/wiki/Category:Uploaded_with_reCitation_Bot
and
https://commons.wikimedia.org/wiki/Category:Uploaded_with_reCitation_Bot_and_needing_category_review .
This was the one that failed due to the block:
http://tools.wmflabs.org/recitation-bot/10.1186/2049-9957-3-6.html .
as per
wpoa/OA-signalling#92 (comment) .
Especially "in press" stuff - example in
https://en.wikisource.org/wiki/Biodiversity_Assessment_of_the_Fishes_of_Saba_Bank_Atoll,_Netherlands_Antilles#cite_note-pone.0010676-Toller1-5 .
Just tried to import
http://elifesciences.org/content/1/e00248
and got this error message.
Would like to use it in a talk tomorrow.
@notconfusing @wrought - any chance you can fix this today?
Default should be both.
The current file descriptions (e.g. "Media belonging to article 10.1371/journal.pbio.1000436 which is cited on Wikipedia, and automatically imported." in
https://commons.wikimedia.org/w/index.php?title=File:Tracking-Marsupial-Evolution-Using-Archaic-Genomic-Retroposon-Insertions-pbio.1000436.g002.jpg&oldid=126539861 )
are not very helpful.
The corresponding code should thus be replaced with that in OAMI. Sample upload from PLOS:
https://commons.wikimedia.org/wiki/File:Messages-Do-Diffuse-Faster-than-Messengers-Reconciling-Disparate-Estimates-of-the-Morphogen-Bicoid-pcbi.1003629.s006.ogv .
Fig. 1 of
https://en.wikisource.org/w/index.php?title=Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Modelling_the_Species_Distribution_of_Flat-Headed_Cats_%28Prionailurus_planiceps%29_an_Endangered_South-East_Asian_Small_Felid&oldid=5032599
was imported into
https://commons.wikimedia.org/wiki/File:Modelling-the-Species-Distribution-of-Flat-Headed-Cats-%28Prionailurus-planiceps%29-an-Endangered-South-pone.0009612.g001.jpg
but the image there already existed (in higher resolution) as
https://commons.wikimedia.org/wiki/File:Plionailurus_planiceps.png .
According to Commons policies, our upload should thus be deleted.
In such cases, it would be best if we could
(a) detect such duplicates before upload, and
(b) post a message on the existing file's talk page with the proper metadata.
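Duplicate detection before upload could use the MediaWiki API's list=allimages with its aisha1 filter, which returns any Commons file whose SHA-1 matches a given hex digest. A sketch, with the network call kept separate from the hashing:

```python
import hashlib
import json
import urllib.parse
import urllib.request

COMMONS_API = 'https://commons.wikimedia.org/w/api.php'

def local_sha1(path):
    """Hex SHA-1 of a local file, as expected by the aisha1 parameter."""
    with open(path, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest()

def find_duplicates(path):
    """Return the names of Commons files with the same SHA-1 as `path`,
    so an exact duplicate can be caught before uploading.
    """
    params = urllib.parse.urlencode({
        'action': 'query', 'list': 'allimages',
        'aisha1': local_sha1(path), 'format': 'json',
    })
    with urllib.request.urlopen('%s?%s' % (COMMONS_API, params)) as resp:
        data = json.load(resp)
    return [img['name'] for img in data['query']['allimages']]
```

Note this only catches byte-identical copies; the higher-resolution pre-existing file in the example above would still need a perceptual-similarity check, which is out of scope here.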
On Commons, we have
https://commons.wikimedia.org/wiki/Category:Uploaded_with_reCitation_Bot ,
which has led to
https://en.wikisource.org/wiki/Category:Uploaded_with_reCitation_Bot
for equations and tables,
although we are using
https://en.wikisource.org/wiki/Category:Uploaded_by_Recitation-bot
to track uploads on Wikisource.
This means we have differences in capitalization, "by" vs. "with", and "_" vs. "-". These naming schemes need to be made consistent.
There is a maximum length for page titles at MediaWiki - 255 bytes according to https://www.mediawiki.org/wiki/Manual:Page_table#page_title . At the OAMI, we have opted to take the first 100 characters of a paper title before we append portions of the DOI.
This has worked fine so far.
For Wikisource, this is not the best approach, though, and I think we should try to accommodate as much of the article title as possible in the page name. Example:
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/A_cladistically_based_reinterpretation_of_the_taxonomy_of_two_Afrotropical_tenebrionid_genera_Ectateus_Koch_1956_and_Selinus_Mulsant_%26_Rey_1853_%28 .
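A sketch of byte-aware truncation against the 255-byte limit. The `reserved` parameter (space for a prefix path or DOI suffix) is a hypothetical knob, not existing bot behavior:

```python
MAX_TITLE_BYTES = 255  # MediaWiki page_title limit (bytes, not characters)

def fit_title(title, reserved=0):
    """Truncate a title so its UTF-8 encoding, plus `reserved` bytes for
    any prefix or suffix, stays within the MediaWiki limit without
    splitting a multi-byte character.
    """
    budget = MAX_TITLE_BYTES - reserved
    encoded = title.encode('utf-8')
    if len(encoded) <= budget:
        return title
    # errors='ignore' silently drops a trailing partial character
    return encoded[:budget].decode('utf-8', errors='ignore')
```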
Uploads by re-citation bot (example: https://commons.wikimedia.org/wiki/File:A-New-Basal-Hadrosauroid-Dinosaur-%28Dinosauria-Ornithopoda%29-with-Transitional-Features-from-the-Late-pone.0098821.g015.jpg ) should be marked by a template
https://commons.wikimedia.org/w/index.php?title=Template:Recitation-bot
to be modeled after
https://commons.wikimedia.org/wiki/Template:Open_Access_Media_Importer .
The latter template has been used in earlier uploads by VIAF bot and re-citation bot and should be replaced by the new one.
to the file or main namespaces.
At the moment, for instance, PMID and PMCID are missing:
https://commons.wikimedia.org/w/index.php?title=File:New-Family-of-Bluish-Pyranoanthocyanins-40403.fig.001.jpg&oldid=126893111 .
Instead of a license statement, some articles have only a "{{}}".
Examples:
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Analysis_of_Human_Cytomegalovirus-Encoded_SUMO_Targets_and_Temporal_Regulation_of_SUMOylation_of_the_Immediate-Early_Proteins_IE1_and_IE2_during
and
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/A_cladistically_based_reinterpretation_of_the_taxonomy_of_two_Afrotropical_tenebrionid_genera_Ectateus_Koch_1956_and_Selinus_Mulsant_%26_Rey_1853_%28
I think we've discussed this in at least two threads, but I couldn't find any of them right now. Anyway, it seems that high-res images are accessible via Europe PMC, e.g.
http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3258128/supplementaryFiles
These uploads by VIAFbot all use the {{Open Access Media Importer}} template:
https://commons.wikimedia.org/w/index.php?limit=250&tagfilter=&title=Special%3AContributions&contribs=user&target=VIAFbot&namespace=&tagfilter=&year=2014&month=6
That should be replaced by {{Recitation-bot}}.
The same goes for some earlier uploads by Recitation Bot: https://commons.wikimedia.org/w/index.php?title=Special:Contributions/Recitation-bot&offset=20140724025401&limit=250&target=Recitation-bot
Abstractly, we need to do the following to implement testing:
Once the upload of images for equations/tables to Wikisource works, we will need another checkbox in the web form, with the option to force a re-upload of these, perhaps separately for tables and figures.
I'm getting badtoken errors whenever I am trying to upload any paper. The issue seems not to be unique to us, as per https://phabricator.wikimedia.org/T61678 . Pinging @notconfusing @wrought .
See http://tools.wmflabs.org/recitation-bot/faillog.html for details.
For instance,
http://tools.wmflabs.org/recitation-bot/10.1371/journal.pcbi.1003149.html
and
http://tools.wmflabs.org/recitation-bot/10.1371/journal.pcbi.1000361.html
recently failed, but
http://tools.wmflabs.org/recitation-bot/faillog.html
still only shows the one test entry.
Files like the one at
https://en.wikisource.org/w/index.php?title=File:Neurobiological-Models-of-Two-Choice-Decision-Making-Can-Be-Reduced-to-a-One-Dimensional-Nonlinear-pcbi.1000046.e061.jpg&oldid=5068812
have multiple categories assigned to them, none of which is particularly helpful.
I thus propose to do away with these article-level keywords entirely for equation images, and to just put them into some maintenance category of the
https://en.wikisource.org/wiki/Category:Equations_uploaded_with_reCitation_Bot
and
https://en.wikisource.org/wiki/Category:Equations_uploaded_with_reCitation_Bot_and_needing_category_review
kind.
Note that the current category names use a different spelling for the bot than the bot's user name suggests.
Right now,
http://tools.wmflabs.org/recitation-bot/10.1371/journal.pone.0103437.html
only reads
doi: 10.1371/journal.pone.0103437
success: failed
but upon reupload, this should be updated, leaving the original information intact for debugging purposes.
So by default, multiple requests are ignored, but the user can force a re-upload.
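A sketch of a status log that appends rather than overwrites, so earlier attempts stay available for debugging. The JSON-dict-of-lists layout and field names are assumptions, not the current log format:

```python
import datetime
import json

def record_attempt(log_path, doi, status):
    """Append a timestamped status entry for `doi`, keeping earlier
    attempts intact. The file layout (a JSON dict mapping DOI to a
    list of attempts) is a hypothetical choice.
    """
    try:
        with open(log_path) as f:
            log = json.load(f)
    except (IOError, ValueError):
        log = {}  # no log yet, or unreadable: start fresh
    log.setdefault(doi, []).append({
        'time': datetime.datetime.utcnow().isoformat(),
        'status': status,
    })
    with open(log_path, 'w') as f:
        json.dump(log, f, indent=2)
    return log
```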
I already submitted a Bugzilla report for it:
https://bugzilla.wikimedia.org/show_bug.cgi?id=66962
Implement at least two types of logging using the core Python logging module:
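One way to get two log streams from the stdlib logging module: a concise console stream for operators and a verbose file for debugging. The levels, format, and file name are assumptions:

```python
import logging

def setup_logging(logfile='recitation-bot.log'):
    """Configure two handlers on one logger: INFO and above to the
    console, full DEBUG detail to a file. All choices here are
    placeholders, not the bot's actual configuration.
    """
    logger = logging.getLogger('recitation-bot')
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
    console = logging.StreamHandler()   # stderr by default
    console.setLevel(logging.INFO)
    console.setFormatter(fmt)
    filelog = logging.FileHandler(logfile)
    filelog.setLevel(logging.DEBUG)
    filelog.setFormatter(fmt)
    logger.addHandler(console)
    logger.addHandler(filelog)
    return logger
```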
gratuitous link dump:
Some of the files we upload to Commons will be renamed (example: https://commons.wikimedia.org/w/index.php?title=File%3ATrichostetha_bicolor_feeding_on_flowers_of_Agathosma_capensis_%28Rutaceae%29_at_Saldanha_Bay.jpg&diff=133547664&oldid=133547464 ).
When we re-upload a full text to Wikisource (which we should only do in Wikisource namespace), we should check for such renames and use the new ones when embedding the figures.
Not sure what the problem is precisely (cf. #35 ), but the last ca. 10 attempts to upload something all went nowhere.
Currently, only .jpg and .png are uploaded, but spelling variants thereof (e.g. .JPEG and .PNG) should also be allowed, as should TIF/TIFF and SVG.
https://github.com/wpoa/recitation-bot/blob/master/recitation-bot/journal_article.py#L167
Edits like these can easily lead to files being deleted:
https://commons.wikimedia.org/w/index.php?title=File:New-Family-of-Bluish-Pyranoanthocyanins-40403.fig.001.jpg&diff=prev&oldid=130065483 .
I do want the option of updating file pages, but this should probably be switched off by default:
Need to provide pywikibot config for testing purposes
=========================================================== ERRORS ============================================================
_______________________________________ ERROR collecting tests/test_journal_article.py ________________________________________
tests/test_journal_article.py:1: in <module>
> from recitationbot import journal_article
recitationbot/journal_article.py:11: in <module>
> import pywikibot
env/local/lib/python2.7/site-packages/pywikibot-2.0b1-py2.7.egg/pywikibot/__init__.py:30: in <module>
> from pywikibot import config2 as config
env/local/lib/python2.7/site-packages/pywikibot-2.0b1-py2.7.egg/pywikibot/config2.py:162: in <module>
> _base_dir = _get_base_dir()
env/local/lib/python2.7/site-packages/pywikibot-2.0b1-py2.7.egg/pywikibot/config2.py:158: in _get_base_dir
> raise RuntimeError(exc_text)
E RuntimeError: No user-config.py found in directory '/home/wrought/.pywikibot'.
E Please check that user-config.py is stored in the correct location.
E Directory where user-config.py is searched is determined as follows:
E
E Return the directory in which user-specific information is stored.
E
E This is determined in the following order -
E 1. If the script was called with a -dir: argument, use the directory
E provided in this argument
E 2. If the user has a PYWIKIBOT2_DIR environment variable, use the value
E of it
E 3. Use (and if necessary create) a 'pywikibot' folder under
E 'Application Data' or 'AppData\Roaming' (Windows) or
E '.pywikibot' directory (Unix and similar) under the user's home
E directory.
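A minimal user-config.py along these lines would satisfy pywikibot for test runs; the values are placeholders, not the production bot account, and `usernames` is a dict that pywikibot itself provides when it executes this file:

```python
# Minimal user-config.py for the test suite; values are placeholders.
family = 'wikisource'
mylang = 'en'
usernames['wikisource']['en'] = 'Recitation-bot'
```

Point pywikibot at it with the PYWIKIBOT2_DIR environment variable or the -dir: argument, per the search order in the traceback above.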
These uploads should go to the respective Wikisource.
What could possibly go wrong? Haven't tried it yet.
The "category" element of Template:Header on Wikisource is not very important, nor necessarily standard.
Because of #12, we should work around this: instead, we should simply add normal category wikilinks, e.g. [[Category:Evolutionary biology]].
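Emitting plain category links instead of the header parameter is trivial; a sketch:

```python
def category_wikitext(categories):
    """Emit plain [[Category:...]] links, one per line, instead of
    Template:Header's category parameter, sidestepping the 10-category
    limit there.
    """
    return '\n'.join('[[Category:%s]]' % c for c in categories)
```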
It makes sense to generate a comprehensive link dump based on the BEACON format: https://de.wikipedia.org/wiki/Wikipedia:BEACON/Format#Daten-Zeilen
However, we need to consider some more use cases for export formats. Namely:
For this, and probably other use cases, we just want to serve up JSON from a public URL endpoint. This should be straightforward with Python.
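The per-article record such an endpoint could serve might look as follows; the field names are assumptions, not a fixed schema:

```python
import json

def export_record(doi, status, wikisource_page=None):
    """Build the JSON document a public endpoint could serve for one
    import attempt. Field names here are hypothetical, not a fixed
    schema.
    """
    return json.dumps({
        'doi': doi,
        'status': status,
        'wikisource_page': wikisource_page,
    }, indent=2)
```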
http://tools.wmflabs.org/recitation-bot/10.1371/journal.pbio.1000247.html
states
doi: 10.1371/journal.pbio.1000247
success: failed
'http://creativecommons.org/publicdomain/zero/1.0/'
I don't think ef73cfb#diff-fb47ee8565fbf50e7e1ac9df3de0b94e is enough to fix that, but did not find the relevant code.