teester / entityshape


An API to compare a Wikidata item with an EntitySchema

License: GNU General Public License v3.0

Python 82.33% HTML 0.12% JavaScript 17.55%

wikidata shex


entityshape's Issues

Disallowed section

There should be a separate section for properties not allowed in the item. They should be green if missing and red if present.
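
A minimal sketch of the colouring rule described above. The function name `classify_disallowed` and its plain-list inputs are hypothetical and do not reflect the project's actual data structures.

```python
def classify_disallowed(disallowed, item_properties):
    """Map each disallowed property to a display colour.

    Hypothetical helper: a disallowed property is "red" when the item
    has it and "green" when the item does not.
    """
    present = set(item_properties)
    return {prop: ("red" if prop in present else "green") for prop in disallowed}


print(classify_disallowed(["P21", "P569"], ["P31", "P21"]))
```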

Add support for groups

Currently shex groups are not supported and the grouping is ignored.

Thus the following from E228 will not correctly evaluate:

{ ( pq:P1534 @ +; pq:P582 xsd:dateTime +; )* ; }

This translates as must have 0 or more of the following: 1 or more P1534 and 1 or more of P582.

shape.py evaluates it as "must have 1 or more P1534 and 1 or more P582", which is incorrect.
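
The intended semantics can be sketched as follows. This is an illustration only, not the code in shape.py: `group_satisfied` is a hypothetical helper that works on per-property statement counts and ignores any upper bound on the group.

```python
def group_satisfied(counts, members, group_min=0):
    """Check a group such as ( A+ ; B+ )* against statement counts.

    counts    -- property -> number of matching statements on the item
    members   -- property -> minimum occurrences inside ONE repetition
    group_min -- minimum number of repetitions of the whole group
    (All names are assumptions made for this sketch.)
    """
    found = [counts.get(prop, 0) for prop in members]
    if all(n == 0 for n in found):
        # The group is not used at all: fine only if it may repeat 0 times.
        return group_min == 0
    # The group is used: every member must meet its minimum for each
    # required repetition (at least one repetition).
    repetitions = max(group_min, 1)
    return all(counts.get(p, 0) >= m * repetitions for p, m in members.items())


members = {"P1534": 1, "P582": 1}
print(group_satisfied({}, members))                       # neither present
print(group_satisfied({"P1534": 1}, members))             # only one present
print(group_satisfied({"P1534": 1, "P582": 2}, members))  # both present
```

With a `*` group, an item with neither qualifier passes, while an item with only one of them fails.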

Use PyShExC to generate shapes in JSON-LD

Use PyShExC (https://github.com/shexSpec/grammar-python-antlr) to generate shapes in JSON-LD, which is a standard, rather than the bespoke json we are currently generating.

Advantages:

- will translate parts of schemas that we currently do not, such as AND and OR, making it easier to solve #1, #2 and #3
- lets PyShExC worry about generating JSON shapes rather than supporting it here
- adheres to a standard, which may make reuse easier

Disadvantages:

- requires rewriting compareshape.py to parse JSON-LD instead of the current JSON.
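
To give a feel for the rewrite, here is a sketch of walking a shape in ShExJ (the JSON-LD serialisation of ShEx) to collect predicates and cardinalities. The sample schema is hand-written for illustration, not PyShExC output, and `triple_constraints` is a hypothetical helper rather than part of compareshape.py.

```python
# Hand-written ShExJ fragment (assumption: illustrative only).
SHEXJ = {
    "type": "Schema",
    "shapes": [{
        "type": "Shape",
        "id": "http://example.org/human",
        "expression": {
            "type": "EachOf",
            "expressions": [
                {"type": "TripleConstraint",
                 "predicate": "http://www.wikidata.org/prop/direct/P31"},
                {"type": "TripleConstraint",
                 "predicate": "http://www.wikidata.org/prop/direct/P21",
                 "min": 0, "max": -1},
            ],
        },
    }],
}


def triple_constraints(expr):
    """Yield (predicate, min, max) for each TripleConstraint under expr.

    In ShExJ a missing min/max means exactly one, and max == -1 means
    unbounded. References to other expressions (plain strings) are skipped.
    """
    if not isinstance(expr, dict):
        return
    kind = expr.get("type")
    if kind == "TripleConstraint":
        yield expr["predicate"], expr.get("min", 1), expr.get("max", 1)
    elif kind in ("EachOf", "OneOf"):
        for child in expr["expressions"]:
            yield from triple_constraints(child)


for constraint in triple_constraints(SHEXJ["shapes"][0]["expression"]):
    print(constraint)
```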

Process multiple shapes for a single entity

With the imminent deployment of a new EntitySchema data type on Wikidata and, presumably, the approval of the Shape Expression for Class property, it should become possible to determine programmatically which shapes apply to an entity. Depending on how it is adopted, I could see the API checking items or properties associated with the queried entity for Shape Expression for Class properties and putting together a list of entity schemas to check the entity against. It would then get the results of all the shapes and concatenate them, giving a list of the shapes checked and where (if at all) each property and statement fails.

For example: Simon Harris (Q7518922) is a human (E10) and a Member of the Oireachtas (E236). So if human (Q5) had a Shape Expression for Class property of E10 and Oireachtas Member ID (P4690) had a Shape Expression for Class property of E236, the API could detect E10 and E236 and run a check on Q7518922 with E10, then with E236. The script should then parse the results of both. In the summary section, the properties from both schemas should be listed in the appropriate sections.

In cases where the same property is checked in both schemas, the property should appear in the most restrictive section on the summary. i.e. if a property is necessary in one schema and optional in the other, it should appear in the necessary section only. If the property fails in either schema, it should be listed as a fail. On mousing over the properties, the breakdown from each schema should appear as a tooltip. This should be done in a similar way in the tags added to the properties and statements on the page.
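
The merging rule could look something like the sketch below. The per-schema result format (`necessity` and `passed` keys) is an assumption made for the example, not the API's actual response format.

```python
# Higher rank = more restrictive (assumed labels, taken from the text above).
NECESSITY_RANK = {"optional": 1, "necessary": 2}


def merge_results(results_by_schema):
    """Combine per-schema property results into one summary.

    results_by_schema maps schema id -> {property: {"necessity", "passed"}}.
    A property takes the most restrictive necessity and fails if it
    fails in any schema; the per-schema breakdown is kept for tooltips.
    """
    merged = {}
    for schema, properties in results_by_schema.items():
        for prop, result in properties.items():
            entry = merged.setdefault(
                prop,
                {"necessity": result["necessity"], "passed": True, "breakdown": {}},
            )
            if NECESSITY_RANK[result["necessity"]] > NECESSITY_RANK[entry["necessity"]]:
                entry["necessity"] = result["necessity"]
            entry["passed"] = entry["passed"] and result["passed"]
            entry["breakdown"][schema] = result
    return merged


summary = merge_results({
    "E10": {"P21": {"necessity": "optional", "passed": True}},
    "E236": {"P21": {"necessity": "necessary", "passed": False},
             "P4690": {"necessity": "necessary", "passed": True}},
})
print(summary["P21"]["necessity"], summary["P21"]["passed"])
```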

It should also be possible to check a random schema in the usual way, and also to check multiple schemas from the search box, perhaps using a space or comma as a separator. Checking when there is no input in the search field should trigger automatic schema determination. The UI will need a minor update to make it clear that this will happen.
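
As a sketch of the search box behaviour, the input could be split on spaces or commas, with an empty result standing for "determine the schemas automatically". The function name `parse_schema_input` and the choice to raise on malformed ids are assumptions.

```python
import re


def parse_schema_input(text):
    """Split search-box text into EntitySchema ids.

    Accepts spaces and/or commas as separators. An empty list means
    no schema was given, i.e. automatic determination should be used.
    """
    ids = [token.upper() for token in re.split(r"[,\s]+", text.strip()) if token]
    invalid = [token for token in ids if not re.fullmatch(r"E\d+", token)]
    if invalid:
        raise ValueError(f"not EntitySchema ids: {invalid}")
    return ids


print(parse_schema_input("E10, E236"))
print(parse_schema_input(""))
```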

Tasks to complete

- get entityshape to check an item against multiple schemas and return the results
- get the script to display multiple sets of results initially & update the UI to show how to check multiple schemas
- get the script to concatenate the results - this will allow people to check multiple schemas at the same time
- once Shape Expression for Class is approved, get the script to autodetect schemas from pages associated with the entity and update the UI to make it clear how this works.

Statements with {0} are marked as required

Statements containing {0} to describe cardinality translate to 'does not contain'. Currently, these statements are being evaluated as if they are required to contain at least 1 match.
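
One way to handle this is to turn the cardinality token into a (min, max) pair first and only then decide the necessity. The sketch below assumes the standard ShEx cardinality tokens. The helper names and the "absent"/"required"/"optional" labels are hypothetical.

```python
import re


def parse_cardinality(token):
    """Return (min, max) for a ShEx cardinality token; max None = unbounded."""
    simple = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}
    if token in simple:
        return simple[token]
    match = re.fullmatch(r"\{(\d+)(,(\d+|\*)?)?\}", token)
    if not match:
        raise ValueError(f"unrecognised cardinality: {token!r}")
    minimum = int(match.group(1))
    if match.group(2) is None:           # {n}
        return minimum, minimum
    if match.group(3) in (None, "*"):    # {n,} or {n,*}
        return minimum, None
    return minimum, int(match.group(3))  # {n,m}


def necessity(minimum, maximum):
    """Classify a constraint from its cardinality."""
    if maximum == 0:
        return "absent"  # {0}: the item must NOT contain the property
    return "required" if minimum >= 1 else "optional"


print(necessity(*parse_cardinality("{0}")))
```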

Required items which are missing should show up as red and not orange

Items in the required section which are missing currently show up as orange. They should show up as red.

All entityschemas should return a 200 response

Currently, a number of entityschemas fail to parse correctly and return a 500 error, even when the entityschema is valid. Any valid entityschema should return a 200 response along with some sort of result.

The following entityschemas return a 500 response:
~~- E1 - ShExR~~
~~- E2 - Wikimedia~~
~~- E3 - Wikidata Item~~
~~- E4 - Labels/Descriptions~~
~~- E5 - Statement - Blank schema~~
~~- E6 - Language mappings - Blank Schema~~
~~- E7 - Citation - Blank Schema~~
~~- E8 - External RDF - Blank Schema~~
~~- E9 - Wikidata-Wikibase - Blank Schema~~
~~- E16 - Software Titles~~
~~- E37 - human gene~~
~~- E38 - human protein~~
- E39 - Reactome Pathway
~~- E44 - University Teacher~~
~~- E49 - Wikidata prefixes~~
~~- E53 - sportsperson~~
~~- E55 - programming language~~

- E59 - evidence and conclusion ontology term
- E70 - Clinical Interpretations of Variants in Cancer
~~- E72 - pharmaceutical drug~~
~~- E74 - pseudogene~~
~~- E75 - gene~~
~~- E86 - native Wikipathways schema~~
~~- E87 - biological pathway in Wikidata~~
~~- E89 - Public library branch in The Netherlands~~
~~- E90 - Public library organisation in The Netherlands~~
- E93 - FLOSS emulator
~~- E96 - dummy - Blank Schema~~
~~- E99 - statue~~
~~- E100 - city~~
~~- E103 - gene variant according to myvariant.info~~
~~- E117 - newspaper with direct claim properties only~~
~~- E118 - virtual assistant~~
~~- E121 - [empty schema] - Blank Schema~~
~~- E122 - [empty schema] - Blank Schema~~
- E123 - Sandbox schema
~~- E124 - [empty schema] - Blank Schema~~
~~- E128 - extrasolar planet~~
~~- E129 - one-of-a-kind computer~~
~~- E132 - web comic~~
~~- E150 - Specific event in figure skating~~
- E165 - virus gene
~~- E166 - [empty schema] - Blank Schema~~
~~- E169 - virus protein~~
~~- E175 - [empty schema] - Blank Schema~~
~~- E176 - Chilean astronomers~~
~~- E180 - [empty schema] - Blank Schema~~
~~- E181 - [empty schema] - Blank Schema~~
~~- E182 - [empty schema] - Blank Schema~~
~~- E183 - Chilean Women Football Players~~
~~- E187 - hospital~~
~~- E194 - Complex Portal entity~~
~~- E221 - YouTube - Blank Schema~~
~~- E226 - Swedish Academy Chair~~
~~- E227 - Gender~~
- E245 - Unicode plane
- E246 - Unicode block
- E247 - Unicode character
- E251 - non-coding RNA
~~- E252 - non-coding RNA gene~~
~~- E258 - Genewiki schema~~
~~- E259 - Wikibase property~~
~~- E261 - Fredmans Epistel places - Blank Schema~~
~~- E262 - Fredman Epistels person - Blank Schema~~
~~- E263 - Type specimens of Oxalis~~
~~- E265 - Gene Wiki SARS-COV2 primary sources~~
~~- E266 - Gene Wiki SARS-COV2 external identifiers~~
~~- E269 - monument historique français~~
~~- E570 - recently deceased humans - Blank Schema~~
- E999 - Borked
~~- E12345 - Sandbox Schema~~

~~Total: 69/272 failures (~25%)~~
~~Total: 53/272 failures (~19.5%)~~
~~Total: 39/272 failures (~14.5%)~~
~~Total: 17/272 (6.25%)~~
Total: 10/301 (3.32%)
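
A tally like the one above could be produced by a small script along these lines. The `fetch_status` callable is a stand-in: in real use it would request the API for one schema and return the HTTP status code. The endpoint is not shown here because it is not given in this issue.

```python
def find_failing_schemas(schema_ids, fetch_status):
    """Return (failing ids, failure percentage) for the given schemas.

    fetch_status is any callable mapping a schema id to an HTTP status
    code (hypothetical; inject a real HTTP call or a fake for testing).
    """
    failing = [sid for sid in schema_ids if fetch_status(sid) != 200]
    rate = 100 * len(failing) / len(schema_ids) if schema_ids else 0.0
    return failing, round(rate, 2)


# Fake statuses standing in for real API responses.
fake_statuses = {"E1": 200, "E10": 200, "E39": 500}
print(find_failing_schemas(list(fake_statuses), fake_statuses.get))
```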

Userscript should work on mobile wikidata

It should be possible to use the user script on Wikidata's mobile site.

Add support for Wikibases other than Wikidata

EntityShape currently seems to assume it will only ever be run on Wikidata and contains hardcoded URIs in both the Python and the JS code, in obvious places like _get_property_name and _get_entity_json, but also in _strip_schema_comments and _compare_statements.

In practice, Wikidata does not contain all data, and in my use-case I wanted to check our Q3 against our EntitySchema:E1 - which as documented in that schema works with the ShEx2 validator once CORS is whitelisted for that domain with CORS Everywhere.

Seeing this script recommended in Wikidata's shape tutorial, I tried to adapt the JS by changing the hardcoded URLs within it and using mw.loader.load() on it. However, this failed. Looking closer, I saw it passes entity and property IDs into the API, not URIs/IRIs.

It would be nice if it could work with a different base URI that was passed into it, or potentially across disparate URIs for entity and schema, to aid use of Wikibase by third-parties and federation between these and Wikidata. Failing that, moving hardcoded URIs into centrally-defined constants would probably make it easier to reuse the code if hosted elsewhere (e.g. by @wbstack).
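
As a rough illustration of the "centrally-defined constants" option, the URIs could live in one small configuration object. `WikibaseConfig` and its method names are hypothetical. The defaults follow Wikidata's URL layout, and the sketch assumes other Wikibases use the same `/wiki/Special:...` and `/entity/` paths, which may not hold for every installation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikibaseConfig:
    """Single place for the URIs that are currently hardcoded."""

    base_url: str = "https://www.wikidata.org"     # site serving pages and data
    concept_base: str = "http://www.wikidata.org"  # prefix of entity URIs in RDF

    def entity_json_url(self, entity_id):
        return f"{self.base_url}/wiki/Special:EntityData/{entity_id}.json"

    def schema_text_url(self, schema_id):
        return f"{self.base_url}/wiki/Special:EntitySchemaText/{schema_id}"

    def entity_uri(self, entity_id):
        return f"{self.concept_base}/entity/{entity_id}"


wikidata = WikibaseConfig()
other = WikibaseConfig(base_url="https://wikibase.example.org",
                       concept_base="https://wikibase.example.org")
print(wikidata.entity_json_url("Q42"))
print(other.schema_text_url("E1"))
```

Functions that currently embed Wikidata URLs could take such an object as a parameter, with the Wikidata defaults keeping existing behaviour unchanged.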

Add support for lexemes

Lexemes currently get the "failed to validate schema" error message.

Support for lexemes needs to be added to the API. This requires determining the important prefixes used by lexemes and translating them into properties, forms and senses in shape.py. The analysis in compareshape.py will presumably also have to take into account forms and senses.

The userscript may also need to be updated if the HTML element ids and classes differ from the ones used for entities.

An initial partial solution may be to ignore certain prefixes and only analyse properties (which the api can currently do) so there's at least partial support available.
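
The partial approach might be sketched like this: keep only constraints whose predicate uses a statement prefix the API already understands and set the rest aside. The prefix list and the `split_constraints` helper are assumptions for illustration. The lexeme-specific predicates in the example (`ontolex:lexicalForm`, `dct:language`) are of the kind that appears in lexeme schemas.

```python
# Prefixes assumed to be handled already (statement-related ones).
STATEMENT_PREFIXES = ("wdt:", "p:", "ps:", "pq:")


def split_constraints(predicates):
    """Separate predicates into (supported, ignored) by prefix."""
    supported, ignored = [], []
    for predicate in predicates:
        if predicate.startswith(STATEMENT_PREFIXES):
            supported.append(predicate)
        else:
            ignored.append(predicate)
    return supported, ignored


supported, ignored = split_constraints(
    ["wdt:P5185", "ontolex:lexicalForm", "dct:language", "p:P5137"]
)
print(supported)
print(ignored)
```

Reporting the ignored predicates alongside the result would make it clear that the check was only partial.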