author-disambiguator's People

Contributors

arthurpsmith, daniel-mietchen, egonw, guyfawcus, nintendofan885, wetneb

author-disambiguator's Issues

Link external identifiers

When a DOI, PMID, PMCID, arXiv ID, etc. is listed for an entry, it should be linked to facilitate checking (e.g. of affiliations).

Suggestion: handle multiple authors simultaneously

Suggestion from Tony Catapano at WikiCite meeting: allow entry of multiple authors for match, when they are typically co-authors:

  • find all papers they are both (all) on
  • generate QuickStatements to replace both (all) author string records with the author items
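
The first step above can be sketched as a set intersection. This is only an illustration, not the tool's code: the function name and the input shape (a mapping from each author name string to the paper IDs carrying that string) are assumptions, and the paper IDs in the usage example are placeholders.

```python
def papers_in_common(papers_by_author):
    """Return the IDs of papers that list every one of the given authors.

    papers_by_author is a hypothetical mapping: author name string -> iterable
    of paper item IDs that carry that string.
    """
    paper_sets = [set(papers) for papers in papers_by_author.values()]
    if not paper_sets:
        return set()
    # A paper qualifies only if it appears in every author's set.
    return set.intersection(*paper_sets)


shared = papers_in_common({
    "Author One": ["Q1", "Q2", "Q3"],
    "Author Two": ["Q2", "Q3", "Q4"],
})
```

The QuickStatements for both (all) authors would then be generated only for the papers in `shared`.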

Author lists should be ordered

All displayed author lists should be ordered by author number. If there are gaps in the author numbers, that should be indicated somehow (perhaps with [xx .. xx]?).
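
A minimal sketch of such an ordering, assuming each author arrives as an (author number, name) pair. The function name and the exact gap marker are illustrative choices, not what the tool does today.

```python
def format_ordered_authors(authors):
    """Order (number, name) pairs by author number and mark gaps in the numbering."""
    lines = []
    expected = 1
    for number, name in sorted(authors):
        if number > expected:
            # One or more author numbers are missing before this entry.
            last_missing = number - 1
            if expected == last_missing:
                lines.append(f"[{expected}]")
            else:
                lines.append(f"[{expected} .. {last_missing}]")
        lines.append(f"{number}. {name}")
        expected = number + 1
    return lines


example = format_ordered_authors([(5, "E"), (1, "A"), (2, "B")])
```

Here `example` lists authors 1 and 2, then a `[3 .. 4]` marker, then author 5.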

Improve "fuzzy" search

From User:Jura on Wikidata - BTW, if one types a full name (first middle last name), fuzzy search seems to find people without the middle name, but not those where the middle name is limited to its initial. Maybe these should also be found when starting from first+last name.
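
One way to cover the first case is to expand a full name into the variants to search for. The sketch below is an assumption about how that could look, not the tool's current fuzzy search. Finding initial-only forms when starting from just first+last name would need a wildcard or prefix query instead, since the initial is not known in advance.

```python
def name_variants(full_name):
    """Expand 'First Middle Last' into forms with and without the middle name."""
    parts = full_name.split()
    if len(parts) < 2:
        return parts[:]  # nothing to vary for a single token or empty input
    first, last = parts[0], parts[-1]
    middles = parts[1:-1]
    variants = [" ".join(parts)]

    def add(variant):
        if variant not in variants:
            variants.append(variant)

    add(f"{first} {last}")
    if middles:
        initials = [m[0] for m in middles]
        # Middle names reduced to initials, with and without a trailing period.
        add(" ".join([first] + [i + "." for i in initials] + [last]))
        add(" ".join([first] + initials + [last]))
    return variants
```

For example, `name_variants("John Michael Smith")` yields the full name, "John Smith", "John M. Smith" and "John M Smith".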

Build a variant for topic disambiguation

The general concept of the tool, namely to

  • help reconcile strings with items
  • convert a successful reconciliation into QuickStatements edits

is in principle applicable for disambiguating things other than authors as well, e.g. organizations or titles of works, venues or events.

Doing this for titles would be especially interesting, as that would help with topic tagging.

For some further inspiration, see https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/Wikidata_lists/Long_words_in_work_titles&oldid=837291802 .

Handle case when author item already linked to article

When there's an author string and the author item associated with it is already linked to the article, the data associated with the author string statement (stated as, references, etc.) should be moved to the P50 statement, and the author string statement should be deleted.

Next to the (currently red) topic links, add link to Scholia's /missing page for the topic.

e.g. the "hemophilia A" link in https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=Li%20Li&filter=wdt%3AP921+wd%3AQ2092064 (as per #31 ) would link to https://tools.wmflabs.org/scholia/topic/Q2092064/missing , which links back to the disambiguator.

Making such roundtripping simple (for multiple Scholia/Author Disambiguator combinations) would probably be a prerequisite for getting larger numbers of people to use these disambiguation workflows.

Li Li causing a SPARQL failure now??

Just using "Li Li" now gives:
Warning: assert(): SPARQL query failed: SELECT DISTINCT ?q { VALUES ?name { "Li Li"@en "Li Li"@de "Li Li"@fr "Li Li"@es "Li Li"@nl } . ?q (rdfs:label|skos:altLabel) ?name ; wdt:P31/wdt:P279* wd:Q16334295 . } #TOOL: legacy code failed in /Library/WebServer/Documents/disambiguator/magnustools/ToolforgeCommon.php on line 427

??

Link journals to some useful page

Right now, the journal is just given as a string, but it is another unit of curation, so it should be linked somewhere useful.
This could simply be the corresponding /missing page in Scholia, or something else.

Work with other author identifiers

From User:Jura on Wikidata -
Maybe the tool could also check if VIAF is present (and suggest its addition). If you just check for a single one, that might be the most useful one. There are obviously a few other (non-library) ones likely to be found on such author items (notably Scopus, ResearchGate, even Linkedout).

Articles with matching "author" statements should be included in clustering

Right now articles with no matching "author string" statements are not listed or included in the clustering. If an article would be in a cluster with other articles that have already been processed, that would be a strong indicator that the author should be mapped to the Q value in the already-processed articles. Articles with "author string" or "author" statements should both be clustered consistently.

Replace wikidatatools in article fetch to improve memory performance

For papers with large numbers of authors (like Q21481859) wikidatatools sometimes(?) doesn't even seem to fetch the data with the WikidataItemList load function. In any case it appears to be using much more memory than necessary; a rewrite of some sort to improve this is needed.

Bug: Li Li uses too much memory

Error when trying this author name:
Fatal error: Allowed memory size of 1572864000 bytes exhausted (tried to allocate 64 bytes) in /data/project/author-disambiguator/public_html/magnustools/wikidata.php on line 331

Any way to fix this???

Link individual papers in Misc section

Since the "Misc" grouping is just whatever didn't fit into a cluster, we don't expect it to match a single author; it would be better to show for each individual paper which of the possible authors it matches in this section.

Remove old "author string" statements

After adding the new author statements, the old author string statements should be removed to avoid duplicating the same information on article records.

Potential author list is limited to 10 items!

The API query used limits the "Potential Authors" list to only 10 - and doesn't warn when that limit is hit! We should increase the limit to at least 50 and add a warning if more may be found.

Work with author pages in Wikisource(s)

From User:billinghurst - any person who has an author page at one of the Wikisources should be considered worthy of being a hit in the tool's search results. Many of the people writing at the Wikisources will not be traditional "authors", though they will be writers in the sense of explorers, military officers, politicians, scientists, journalists, etc. Also, without knowing the exact scope of your tool, I would like to flag a page like s:Littell's Living Age/Volume 135 as an example of a ToC for a journal. There is a large range of other samples that may be of interest; a number of these will have red links, and many will have solutions for red links, as we have done a lot of work in identifying these writers over time.

One-click submission to Quickstatements

The results page should include the QS commands in a form text box that can be submitted to QuickStatements directly by clicking a button, rather than requiring cut and paste.

Preserve references in added P50 statements

When author statements are added on an article based on the author string statements, they should preserve all the qualifiers and references from the original "author string" entry (and add "stated as" with the string value as well).
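
A rough sketch of what the generated commands could look like in QuickStatements V1 (tab-separated) syntax. The helper is hypothetical. It assumes the statements in question are P2093 (author name string), P1545 (series ordinal) and P1932 (the "stated as" qualifier), and that references are passed in as (property, value) pairs whose values are already formatted for QuickStatements. The item IDs in the usage example are placeholders.

```python
def qs_replace_author_string(article, author_item, name, ordinal=None, references=()):
    """Build two QuickStatements V1 lines: add P50 with qualifiers and
    references, then remove the matching author name string statement."""
    add = [article, "P50", author_item]
    if ordinal is not None:
        add += ["P1545", f'"{ordinal}"']   # keep the series ordinal qualifier
    add += ["P1932", f'"{name}"']          # "stated as" with the original string
    for prop, value in references:
        # QuickStatements marks reference properties with S instead of P.
        add += ["S" + prop[1:], value]
    remove = ["-" + article, "P2093", f'"{name}"']
    return ["\t".join(add), "\t".join(remove)]


commands = qs_replace_author_string(
    "Q1", "Q2", "Li Li", ordinal=3, references=[("P248", "Q5")]
)
```

Other qualifiers present on the original statement would need to be appended in the same way as the series ordinal.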

Compress author lists when thousands of authors

It's not very useful to have a list of thousands of author names. If the author count is above a certain number (20?) we should do the following:

  • Display the first 10 authors
  • Display 2 authors before and 2 authors after a matched author name string
  • Replace the remaining author entries with ellipses ("...")
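
The three rules above could be sketched as follows. The function name, the parameter names and the defaults (threshold of 20, first 10, context of 2) are taken from the suggestion or invented for illustration. `matched_indexes` is assumed to hold the 0-based positions of the matched author name strings.

```python
def compress_author_list(names, matched_indexes, threshold=20, head=10, context=2):
    """Shorten a long author list, keeping the head and the matches' neighbours."""
    if len(names) <= threshold:
        return list(names)
    keep = set(range(min(head, len(names))))
    for i in matched_indexes:
        # Keep the match itself plus `context` authors on each side.
        keep.update(range(max(0, i - context), min(len(names), i + context + 1)))
    compressed = []
    previous = -1
    for i in sorted(keep):
        if i > previous + 1:
            compressed.append("...")  # one marker per run of skipped authors
        compressed.append(names[i])
        previous = i
    if previous < len(names) - 1:
        compressed.append("...")
    return compressed


authors = [f"Author {n}" for n in range(1, 31)]
short = compress_author_list(authors, matched_indexes=[20])
```

With 30 authors and a match at position 20, `short` keeps the first 10 names, an ellipsis, the five names around the match, and a closing ellipsis.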

Improve clustering algorithm

Publication date, journal, affiliation if available, and main subject if available should be accounted for in the clustering analysis (right now it just compares author lists).
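
As a starting point for discussion, a pairwise similarity score combining these signals might look like the sketch below. It is not the current clustering code. The dictionary keys, the weights (0.5 each for the secondary signals) and the 10-year window are arbitrary assumptions that would need tuning.

```python
def jaccard(a, b):
    """Overlap of two collections as |intersection| / |union|; 0.0 if either is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def paper_similarity(p, q):
    """Score two papers (plain dicts) on shared authors, topics, affiliations,
    journal and closeness of publication year."""
    score = jaccard(p.get("authors", ()), q.get("authors", ()))
    score += 0.5 * jaccard(p.get("topics", ()), q.get("topics", ()))
    score += 0.5 * jaccard(p.get("affiliations", ()), q.get("affiliations", ()))
    if p.get("journal") and p.get("journal") == q.get("journal"):
        score += 0.5
    if p.get("year") is not None and q.get("year") is not None:
        gap = abs(p["year"] - q["year"])
        score += 0.5 * max(0.0, 1 - gap / 10)  # fades to zero after 10 years
    return score
```

Missing fields simply contribute nothing, so papers without affiliation or main subject data are still compared on author lists as they are now.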

Problems with accented characters?

Not sure what's going on, but the string matching is not working for a search on 'François M. Peeters' - the first few articles don't highlight the matching name in the author list, despite being retrieved in the search!
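
One possible cause, not confirmed here, is that the same accented character is stored in different Unicode forms (a single precomposed "ç" versus "c" followed by a combining cedilla), so a plain string comparison fails. The sketch below shows how normalising both sides before comparing would handle that case. The function name is illustrative.

```python
import unicodedata


def names_match(a, b):
    """Compare two names after Unicode NFC normalisation and case folding."""
    def normalise(s):
        return unicodedata.normalize("NFC", s).casefold()
    return normalise(a) == normalise(b)


precomposed = "Fran\u00e7ois M. Peeters"   # single code point for the c with cedilla
decomposed = "Franc\u0327ois M. Peeters"   # "c" followed by a combining cedilla
```

The two strings display identically but `precomposed == decomposed` is false, while `names_match(precomposed, decomposed)` is true.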

On author string pages, include lists for authors cited to and from

That list can easily grow very long, so providing a useful way to filter (e.g. by selecting a substring) would be useful.

The idea here is that people often cite papers of their own, so for something like
https://tools.wmflabs.org/author-disambiguator/?fuzzy=0&name=Collins%20WE ,
citations to or from papers with that string would be expected to bring up some people named Collins as identified authors, which could then serve as another starting point for identifying authors.

For authors with lots of papers to reconcile, the conversion to QuickStatements does not work

E.g. for https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=PLOS%20ONE%20Staff .

In similar cases before, I could work around this by including just one of the groups suggested by the tool. Here, though, it suggests only one group (which makes sense). I suppose I could manually untick a certain number of those tick boxes, but I don't know what that number would need to be, and I do not want to find out by trial and error.

One solution is likely to increase memory (just like in #9), another would be pre-filtering, as per #17, and yet another would be to limit batches to a certain number of publications at a time, as suggested in option 1 in this comment in #9.
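
The third option, limiting batches, can be sketched generically as below. The function name and the batch size used in the example are assumptions. A workable size would depend on the memory limits discussed in #9.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


publication_ids = [f"paper-{n}" for n in range(1, 6)]  # placeholder identifiers
chunks = list(batches(publication_ids, 2))
```

Each chunk could then be converted to QuickStatements separately, so that no single request has to hold all publications in memory.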
