arthurpsmith / author-disambiguator
Wikidata service to help create or link author items to published articles
License: GNU General Public License v3.0
e.g. when using
https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=Li%20Li&filter=wdt%3AP921+wd%3AQ2092064
to look for papers by "Li Li" about hemophilia A, it would be useful to surface or highlight, in the "potential author items" section, all those who have already published on the same topic.
e.g. when a DOI, PMID, PMCID, arXiv ID etc. is listed for an entry, it should be linked to facilitate checking (e.g. of affiliations).
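A minimal sketch of how such identifier links could be built. The function name `identifier_link` and the `URL_TEMPLATES` table are hypothetical and not part of the tool, and the sketch assumes the PMCID is stored without its "PMC" prefix.

```python
from urllib.parse import quote

# Hypothetical lookup table: identifier type -> resolver URL template.
# Assumption: PMCID values are stored as bare digits, without "PMC".
URL_TEMPLATES = {
    "doi": "https://doi.org/{}",
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/{}/",
    "pmcid": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{}/",
    "arxiv": "https://arxiv.org/abs/{}",
}


def identifier_link(kind, value):
    """Return a URL for the identifier, or None if the type is unknown."""
    template = URL_TEMPLATES.get(kind.lower())
    if template is None:
        return None
    # Keep "/" unescaped: DOIs and old-style arXiv IDs contain slashes.
    return template.format(quote(str(value), safe="/"))


print(identifier_link("DOI", "10.1000/182"))
```

Unknown identifier types return `None`, so the caller can fall back to showing the plain string.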
Suggestion from Tony Catapano at WikiCite meeting: allow entry of multiple authors for match, when they are typically co-authors:
All displayed author lists should be ordered by author number. If there are gaps in the author numbers, that should somehow be indicated (perhaps with [xx .. xx]?).
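One possible way to do this, as a rough sketch: sort by author number and insert a bracketed marker wherever numbers are missing. The function name `format_author_list` and the (number, name) input shape are assumptions for illustration, not the tool's actual data model.

```python
def format_author_list(authors):
    """authors: iterable of (author_number, name) pairs with integer numbers.

    Returns a display string ordered by author number, with a bracketed
    marker for every run of missing numbers.
    """
    parts = []
    previous = 0
    for number, name in sorted(authors):
        if number > previous + 1:
            first, last = previous + 1, number - 1
            # A single missing number is shown as [n], a run as [a .. b].
            parts.append(f"[{first}]" if first == last else f"[{first} .. {last}]")
        parts.append(f"{number}. {name}")
        previous = number
    return ", ".join(parts)


print(format_author_list([(4, "D"), (1, "A"), (2, "B"), (8, "H")]))
# 1. A, 2. B, [3], 4. D, [5 .. 7], 8. H
```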
From User:Jura on Wikidata - BTW, if one types a full name (first middle last name), fuzzy search seems to find people without the middle name, but not those where the middle name is limited to its initial. Maybe these should also be found when starting from first+last name.
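A sketch of one way to cover the full-name case: expand the typed name into variants with the middle names reduced to initials or dropped, then search for each variant. `name_variants` is a hypothetical helper, not the tool's existing fuzzy search, and it does not handle the reverse direction (starting from first + last name).

```python
def name_variants(full_name):
    """Return the name plus variants with middle names as initials or omitted."""
    parts = full_name.split()
    if len(parts) < 3:
        # No middle name to abbreviate or drop.
        return [full_name]
    first, middles, last = parts[0], parts[1:-1], parts[-1]
    dotted = " ".join(m[0] + "." for m in middles)
    plain = " ".join(m[0] for m in middles)
    variants = [
        full_name,
        f"{first} {dotted} {last}",
        f"{first} {plain} {last}",
        f"{first} {last}",
    ]
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(variants))


print(name_variants("John Quincy Adams"))
# ['John Quincy Adams', 'John Q. Adams', 'John Q Adams', 'John Adams']
```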
The default ordering should be based on publication date (oldest first?)
The general concept of the tool to
is in principle applicable for disambiguating things other than authors as well, e.g. organizations or titles of works, venues or events.
Doing this for titles would be especially interesting, as that would help with topic tagging.
For some further inspiration, see https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/Wikidata_lists/Long_words_in_work_titles&oldid=837291802 .
Example: https://www.wikidata.org/wiki/Q56917927 - Wang Jun/Jun Wang
There should probably be something automatic to pull in public ORCID data on authorship... anyway this needs to be looked into somehow.
Potentially saves lots of clicks and confusion in "Li Li" style scenarios.
If there's text in the Q item field the form seems to select that even if the radio button for that field is not checked?!
So that people can find the repo more easily to provide comment or other contributions.
Probably the first name as an initial breaks something!
So for something like
https://tools.wmflabs.org/author-disambiguator/?fuzzy=0&name=Collins%20WE ,
list all the identified authors for which their name has been stated as "Collins WE" at least once.
See also #18.
When there's an author string and the author item associated with it is already linked to the article, the data associated with the author string statement (stated as, references, etc.) should be moved to the P50 statement, and the author string statement should be deleted.
e.g. the "hemophilia A" link in https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=Li%20Li&filter=wdt%3AP921+wd%3AQ2092064 (as per #31 ) would link to https://tools.wmflabs.org/scholia/topic/Q2092064/missing , which links back to the disambiguator.
Making such roundtripping simple (for multiple Scholia/ Author Disambiguator combinations) would probably be a prerequisite for getting larger numbers of people to use these disambiguation workflows.
Looks like some filtering needed...
Just using "Li Li" now gives:
Warning: assert(): SPARQL query failed: SELECT DISTINCT ?q { VALUES ?name { "Li Li"@en "Li Li"@de "Li Li"@fr "Li Li"@es "Li Li"@nl } . ?q (rdfs:label|skos:altLabel) ?name ; wdt:P31/wdt:P279* wd:Q16334295 . } #TOOL: legacy code failed in /Library/WebServer/Documents/disambiguator/magnustools/ToolforgeCommon.php on line 427
??
.. or only the three around the target author.
That could help address the issues with many-authored papers.
Right now, the journal is just given as a string, but it is another unit of curation, so it should be linked somewhere useful.
This could be simply the corresponding /missing page in Scholia (example) or something else.
Things seem to display correctly, but the redirected one looks like a duplicate of what it redirects to. Probably should be some display indication of the issue.
From User:Jura on Wikidata -
Maybe the tool could also check if VIAF is present (and suggest its addition). If you just check for a single one, that might be the most useful one. There are obviously a few others (non-library ones) likely to be found on such author items (notably Scopus, Researchgate, even Linkedout).
Currently,
https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=International%20Commission%20On%20Zoological%20Nomenclature
cannot be used to generate QuickStatements to link the papers to Q1071346,
and the suggested string for the ORCID part creates shortened "middle names" C O Z as well,
presumably due to 7690d3d .
Right now articles with no matching "author string" statements are not listed or included in the clustering. If an article would be in a cluster with other articles that have already been processed, that would be a strong indicator that the author should be mapped to the Q value in the already-processed articles. Articles with "author string" or "author" statements should both be clustered consistently.
For papers with large numbers of authors (like Q21481859) wikidatatools sometimes(?) doesn't even seem to fetch the data with the WikidataItemList load function. In any case it appears to be using much more memory than necessary; a rewrite of some sort to improve this is needed.
Mostly useful for those cases which give lots of results when unfiltered (see e.g. #9 ), but can also be useful for smaller sets.
There may be additional qualifiers (for example for affiliation) besides "series ordinal" that should be copied from the author name string entry to the author - we should just copy all of them, and add in the "stated as" qualifier from the string value.
Error when trying this author name:
Fatal error: Allowed memory size of 1572864000 bytes exhausted (tried to allocate 64 bytes) in /data/project/author-disambiguator/public_html/magnustools/wikidata.php on line 331
Any way to fix this???
That way, it is simpler to get an overview of the things that are useful for disambiguation.
Since the "Misc" grouping is just whatever didn't fit into a cluster, we don't expect it to match a single author; it would be better to show for each individual paper which of the possible authors it matches in this section.
Example -
https://www.wikidata.org/wiki/Q21558717
shows Physical Review Letters twice ??
After adding the new author statements, the old author string statements should be removed to avoid duplication of the same information on article records
The API query used limits the "Potential Authors" list to only 10, and doesn't warn when that limit is hit! We should increase it to at least 50 and add a warning if more may be found.
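A sketch of what the higher limit plus warning could look like. This assumes the query in question is the Wikibase `wbsearchentities` API call (the issue does not say which API it is) and that the response carries a `search-continue` key when more results exist. The helper names `search_params` and `limit_warning` are made up for illustration.

```python
SEARCH_LIMIT = 50  # proposed new limit, up from 10


def search_params(name, limit=SEARCH_LIMIT):
    """Build query parameters, assuming the wbsearchentities API is used."""
    return {
        "action": "wbsearchentities",
        "search": name,
        "language": "en",
        "type": "item",
        "limit": limit,
        "format": "json",
    }


def limit_warning(response, limit=SEARCH_LIMIT):
    """Return a warning string if the result list may be truncated, else None."""
    results = response.get("search", [])
    # Assumption: "search-continue" is present when more results exist.
    if "search-continue" in response or len(results) >= limit:
        return f"Showing {len(results)} potential authors; more may be found."
    return None
```

Checking both the continuation key and the result count is deliberately cautious: either signal alone is enough to show the warning.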
From User:billinghurst - any person who has an author page at one of the Wikisources should be considered as worthy of being a hit on the tool's search results. Many of those people writing at WSes will not be traditional "authors" but will be writers in the sense of explorers, military officers, politicians, scientists, journalists, etc. Also, without exactly knowing the scope of your tool, I would like to flag a page like s:Littell's Living Age/Volume 135 as an example of a ToC for a journal, of which there are a large range of other samples that may be of interest. A number of these will have red links, and many will have solutions for red links, as we have done a lot of work in identifying these writers over time.
The results page should include the QS commands in a form text box that can be submitted to QuickStatements directly by clicking a button, rather than requiring cut and paste.
The way it works right now is a little confusing, these should probably be two separate forms.
When author statements are added on an article based on the author string statements, they should preserve all the qualifiers and references from the original "author string" entry (and add "stated as" with the string value as well).
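As a rough illustration, the QuickStatements (V1, tab-separated) commands for one article could be generated along these lines. P50 (author), P2093 (author name string) and P1932 (stated as) are the Wikidata properties involved. The function name `author_commands` and the pre-formatted `qualifiers` argument are assumptions, and copying references (QuickStatements "S" columns) is left out of this sketch.

```python
def author_commands(article, author_item, name_string, qualifiers=()):
    """Return [add_command, remove_command] in QuickStatements V1 syntax.

    qualifiers: (property, value) pairs copied from the author name string
    statement, with values already in QuickStatements form, e.g.
    ("P1545", '"3"') for a series ordinal.
    Assumption: name_string contains no double quotes.
    """
    add = [article, "P50", author_item]
    for prop, value in qualifiers:
        add += [prop, value]
    # Record the original string as a "stated as" qualifier.
    add += ["P1932", f'"{name_string}"']
    # A leading "-" asks QuickStatements to remove the matching statement.
    remove = [f"-{article}", "P2093", f'"{name_string}"']
    return ["\t".join(add), "\t".join(remove)]


# Placeholder item IDs, for illustration only.
for line in author_commands("Q1", "Q2", "Li Li", [("P1545", '"3"')]):
    print(line)
```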
It's not very useful to have a list of thousands of author names. If the author count is above a certain number (20?) we should do the following:
Publication date, journal, affiliation if available, main subject if available, should be accounted for in the clustering analysis (right now it just compares author lists)
e.g. https://tools.wmflabs.org/author-disambiguator/?name=Ph.+Schwemling&doit=Look+for+author&limit=50&filter=wdt%3AP2093+%22Ph.+Schwemling%22
resulted in just one item being edited, despite 50 having been marked.
Not sure what's going on, but the string matching is not working for a search on 'François M. Peeters' - the first few articles don't highlight the matching name in the author list, despite being retrieved in the search!
Current clustering works well in some circumstances, but it does not help with some cases with common name strings. More precise partitioning based on the exact preceding and following author name strings may help.
Example - S. Bhattacharya - appears twice in many high energy physics collaboration papers, one from Brown University and one from the Saha Institute in Kolkata.
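A minimal sketch of the neighbour-based partitioning idea: each occurrence of the target string is keyed by the exact name strings immediately before and after it, so two occurrences in the same paper fall into different groups. The function name and the input shape (paper ID mapped to an ordered list of author name strings) are assumptions for illustration.

```python
from collections import defaultdict


def partition_by_neighbours(papers, target):
    """Group occurrences of `target` by (preceding, following) name strings.

    papers: dict mapping a paper ID to its ordered list of author name strings.
    Returns a dict mapping (before, after) to a list of (paper_id, position).
    """
    groups = defaultdict(list)
    for paper_id, authors in papers.items():
        for i, name in enumerate(authors):
            if name != target:
                continue
            before = authors[i - 1] if i > 0 else None
            after = authors[i + 1] if i + 1 < len(authors) else None
            groups[(before, after)].append((paper_id, i))
    return dict(groups)
```

Exact neighbours are a strict criterion; on collaboration papers whose author lists shift between publications, a looser window (e.g. the three names around the target) may work better.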
That list can easily grow very long, so providing a useful way to filter (e.g. by selecting a substring) would be useful.
The idea here is that people often cite papers of their own, so for something like
https://tools.wmflabs.org/author-disambiguator/?fuzzy=0&name=Collins%20WE ,
citations to or from papers with that string would be expected to bring up some people named Collins as identified authors, which could then serve as another starting point for identifying authors.
If you click one of the "common names" at the bottom, it defaults back to the 500-article limit, which is probably not a good idea for names on multi-thousand-author papers.
E.g. for https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=PLOS%20ONE%20Staff .
In similar cases before, I could work around by just including one of the groups suggested by the tool, but here, it suggests only one group (which makes sense), and while I suppose I could click away manually a certain number of those tick boxes, I don't know what number that would need to be, and I do not want to try to find out manually.
One solution is likely to increase memory (just like in #9), another would be pre-filtering, as per #17, and yet another would be to limit batches to a certain number of publications at a time, as suggested in option 1 in this comment in #9.
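The batching option could be sketched like this: split the list of publications into fixed-size chunks and process one chunk at a time, so memory use is bounded by the batch size rather than the full result set. `batches` is a hypothetical helper and the batch size is arbitrary.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


publications = [f"paper-{n}" for n in range(5)]  # placeholder IDs
for batch in batches(publications, 2):
    print(batch)
```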