incatools / kgcl-rdflib

Tools for working with KGCL

License: MIT License

Makefile 0.82% Shell 0.02% Python 98.37% CSS 0.07% HTML 0.72%

ontology knowledge-graph diff owl rdf kg semantic-diff linkml diffs ontology-change-language

kgcl-rdflib's Introduction

kgcl-rdflib

An engine that applies changes or diffs specified in KGCL to an RDF graph stored using rdflib

KGCL

KGCL (Knowledge Graph Change Language) is a datamodel and language for representing changes in ontologies and knowledge graphs

The core KGCL repo is here:

https://github.com/INCATools/kgcl

This kgcl-rdflib repo applies the KGCL model to rdflib graphs.

kgcl-rdflib's People

nadia-el matthewhorridge ckindermann smartniz joeflack4 hrshdhgd

kgcl-rdflib's Issues

Implement NodeObsoletionWithDirectReplacement

required for mondo, will assign to @joeflack4

parser fails on obsolete with replacement when replacement is a CURIE

This fails:

obsolete GO:0005634 with replacement GO:999

I think the grammar expects the replacement to be a URI
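
A minimal sketch of a pattern that accepts either form, assuming a hypothetical regex-based parser (the grammar actually used in this repo may be structured quite differently; `ENTITY`, `OBSOLETE` and `parse_obsolete` are illustrative names):

```python
import re

# Hypothetical token pattern: a <URI> in angle brackets, or a CURIE such as GO:0005634.
ENTITY = r"(?:<[^<>\s]+>|[A-Za-z_][\w.\-]*:[\w.\-]+)"
OBSOLETE = re.compile(
    rf"^obsolete\s+(?P<about>{ENTITY})"
    rf"(?:\s+with\s+replacement\s+(?P<replacement>{ENTITY}))?$"
)


def parse_obsolete(statement: str) -> dict:
    """Parse an obsolete statement into its about node and optional replacement."""
    match = OBSOLETE.match(statement.strip())
    if match is None:
        raise ValueError(f"cannot parse: {statement!r}")
    return match.groupdict()
```

With this pattern the failing statement above parses, and the replacement may be given as a CURIE or as a URI in angle brackets.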

retire kgcl_2_rdf.py

Despite its name, this doesn't map to RDF; it renders YAML.

note there is no need for all this code to generate yaml - linkml yaml_dumper does this automatically!

diff fails on existential restrictions

cls = Namespace("http://www.w3.org/2002/07/owl#"), name = 'some_values_from'
default = None

    def __getitem__(cls, name, default=None):
        name = str(name)
        if str(name).startswith("__"):
            return super().__getitem__(name, default)
        if (cls._warn or cls._fail) and name not in cls:
            if cls._fail:
>               raise AttributeError(f"term '{name}' not in namespace '{cls._NS}'")
E               AttributeError: term 'some_values_from' not in namespace 'http://www.w3.org/2002/07/owl#'

/Users/cjm/Library/Caches/pypoetry/virtualenvs/kgcl-schema-ImX8xLey-py3.9/lib/python3.9/site-packages/rdflib/namespace/__init__.py:196: AttributeError

retire render_kgcl.py

this is all done automatically by linkml

CLI: make more clig-compliant, add click tests

Rename command doesn't work if literals are xsd:strings

If the input has xsd:strings on literals (as is common for many
ontologies like GO):

<http://purl.obolibrary.org/obo/NCBITaxon_2>
<http://www.w3.org/2000/01/rdf-schema#label> "Bacteria"^^xsd:string .

then rename fails: it adds a new name but does not delete the old

This seems to be a more general issue
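
One plausible cause, which is an assumption and not something confirmed in this issue, is that the old label is looked up as a plain literal and therefore never equals the xsd:string-typed one. A minimal sketch of datatype-insensitive matching, using plain tuples as a stand-in for rdflib terms (all names here are hypothetical):

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# A literal is modelled as (lexical form, datatype IRI or None);
# a triple as (subject, predicate, literal).


def normalize(literal):
    """RDF 1.1 treats a literal without a datatype as an xsd:string."""
    lexical, datatype = literal
    return (lexical, datatype or XSD_STRING)


def rename(graph, node, old, new):
    """Replace matching labels of node, keeping the datatype of the removed label."""
    for triple in list(graph):
        s, p, lit = triple
        if s == node and p == RDFS_LABEL and normalize(lit) == normalize((old, None)):
            graph.discard(triple)
            graph.add((s, p, (new, lit[1])))
```

The old triple is removed whether or not it carries an explicit xsd:string datatype, so the rename no longer leaves two labels behind.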

Remove about_node_representation

{
  "id": "CHANGE:001",
  "old_value": "'foo'",
  "new_value": "'bar'",
  "about_node": "GO:1",
  "about_node_representation": "curie",
  "@type": "NodeRename"
}

This is unnecessary: just as with JSON-LD, we can leave the value as either a CURIE or a URI. Of course, we should always have a context file that maps prefixes.

Annotate method inputs and outputs using typing

For example:

def parse(input):
    """
    Parse a set of KGCL command separated by next-line operator.

    Returns instantiated dataclass objects from model.kgcl.
    """
    statements = input.splitlines()
    parsed = []

    for s in statements:
        parsed.append(parse_statement(s))
    return parsed

The arguments and the output should be typed
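
A sketch of what the typed version could look like. `Change` and the `parse_statement` stub are hypothetical placeholders for the dataclasses in model.kgcl and the real single-statement parser:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    """Hypothetical stand-in for the dataclasses in model.kgcl."""

    statement: str


def parse_statement(statement: str) -> Change:
    """Placeholder for the real single-statement parser."""
    return Change(statement=statement.strip())


def parse(input: str) -> List[Change]:
    """Parse newline-separated KGCL commands into change objects."""
    return [parse_statement(s) for s in input.splitlines()]
```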

Obsolete command obsoletes everything in the ontology

Running obsolete on any term in the ontology ends up obsoleting all elements.

The current tests don't test for this scenario

Tweak serialisation for Node Moves

The current serialisation suggests move {about edge} from {old value} to {new value}.
If an edge is specified as a triple (s,p,o), then move (s,p,o) from o to o' would duplicate the {old value}.
How about move {subject} {predicate} from {old value} to {new value}?

Create code to translate linear representation to change objects

Given linear representations of changes such as:

Rename foo to bar
Move bar from under baz to under fred
Obsolete 'bad' with replacement 'new'
Merge 'nerve' into 'peripheral nerve'

Translate these into instances of Change objects https://cmungall.github.io/knowledge-graph-change-language/

These changes objects can then be translated into transactions at the owl, rdf, rdf*, or kgx level

The original idea was laid out in this doc:

https://docs.google.com/document/d/1__7p64FOI5ZhiZ6F2TXtUc8JN1XXGwglOiVRrlg9G_c/edit#

This doc outlined a grammar, but I think this should instead be derived by reversing the f-string-style string_serialization templates in the datamodel.

  node rename:
    is_a: node change
    description: >-
      A node change where the name (aka rdfs:label) of the node changes
    slots:
      - old value
      - new value
      - has textual diff      
    slot_usage:
      old value:
        multivalued: false
      new value:
        multivalued: false
      change description:
        string_serialization: "rename {about} from {old value} to {new value}"
    examples:
      - value: "rename UBERON:0002398 from 'manus' to 'hand'"
        description: "replacing the rdfs:label of 'manus' on an uberon class with the rdfs:label 'hand'"
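
One way to reverse such a template is to turn each `{slot}` into a named regex group. This is only a sketch under the assumption that slot values are either single-quoted strings or whitespace-free tokens; `template_to_regex` is a hypothetical helper, not part of linkml:

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Turn a string_serialization template into a regex with one named group per slot."""
    parts = re.split(r"\{([^{}]+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Literal text between slots; allow flexible whitespace.
            pattern += r"\s+".join(re.escape(word) for word in part.split(" "))
        else:
            # Slot name such as "old value"; group names cannot contain spaces.
            name = part.replace(" ", "_")
            pattern += rf"(?P<{name}>'[^']*'|\S+)"
    return re.compile(f"^{pattern}$")


rename = template_to_regex("rename {about} from {old value} to {new value}")
match = rename.match("rename UBERON:0002398 from 'manus' to 'hand'")
slots = match.groupdict()
```

Here `slots` holds the `about`, `old_value` and `new_value` strings (quotes included), which could then be used to instantiate the corresponding change object.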

Add tests that check that reification is respected in updates

This library uses low-level rdflib/sparql operations which may be unaware of reification

add a test for a synonym with reification and check that rdf:object is updated

Use modern python idioms

use fstrings rather than java style string concatenation
use ttl files in tests/input rather than encoding ontologies as turtle strings in tests
don't use double-underscore in variable names, this has special meaning in python
remove filesystem path assumptions
follow clig.dev for CLI
Don't use java-style getter-setters on data objects - it is more pythonic just to access fields directly. See also https://docs.python.org/3/library/dataclasses.html
change if type(kgcl_instance) is NodeRename: to use isinstance
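
A short illustration of three of these points (f-strings, direct field access, isinstance). `NodeRename` here is a simplified, hypothetical stand-in for the generated class:

```python
from dataclasses import dataclass


@dataclass
class NodeRename:
    """Hypothetical, simplified stand-in for the generated KGCL class."""

    about_node: str
    old_value: str
    new_value: str


def describe(change: object) -> str:
    # isinstance also accepts subclasses, unlike `type(x) is NodeRename`.
    if isinstance(change, NodeRename):
        # Fields are read directly (no getters) and formatted with an f-string.
        return f"rename {change.about_node} from '{change.old_value}' to '{change.new_value}'"
    raise TypeError(f"unsupported change type: {type(change).__name__}")
```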

Rename calls are agnostic to the `from` value

e.g: Both

rename MONDO:0000087 from 'polymicrogyria' to 'polymicrogyria ABCD' and
rename MONDO:0000087 from 'foo bar' to 'polymicrogyria ABCD'

yield the same result: whatever label MONDO:0000087 currently has is replaced. The code does not currently check that the from value ('polymicrogyria' in this case) matches the existing label.
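
A minimal sketch of the missing check, using a plain dict from node to label as a stand-in for the graph (`checked_rename` is a hypothetical name, and raising an error is only one possible policy):

```python
def checked_rename(labels: dict, node: str, old: str, new: str) -> None:
    """Rename node only if its current label matches the stated old value."""
    current = labels.get(node)
    if current != old:
        raise ValueError(f"{node} has label {current!r}, not {old!r}")
    labels[node] = new
```

With this check, the second command above would be rejected instead of silently overwriting the label.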

Dependencies from imports are not generated correctly in Python Dataclasses

Running pipenv run gen-py-classes kgcl.yaml > kgcl.py generates the desired kgcl.py file.
However, import dependencies from the modules ontology_model and prov are not generated correctly.
For example, instead of about_edge: Optional[Union[dict, Edge]] = None I find about_edge: Optional[Union[dict, "Edge"]] = None.

Do not require <>s in URIs

The original implementation forced URIs in the DSL to be written in <>s, as in Turtle.

Let's deprecate this but make them optional so existing tests don't break

Language tags in KGCL

How should language tags be supported explicitly by KGCL?

I am assuming statements such as "rename UBERON:0002398 from 'manus' to 'hand'" should preserve available language tags. But what about inputs such as "rename UBERON:0002398 from 'manus@en' to 'hand@en'"? Should 'manus@en' be interpreted as "manus@en" or "manus"@en?

This ambiguity could be avoided by making language tags explicit in KGCL, e.g. "rename UBERON:0002398 from 'manus'@en to 'hand'@en". In this case, we'd need to extend the KGCL data model to hold information about language tags.
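
To make the proposal concrete, here is a sketch of parsing the explicit form, where the tag sits outside the quotes. The syntax and the names `TaggedString`, `LITERAL` and `parse_literal` are assumptions for illustration, not part of KGCL:

```python
import re
from typing import NamedTuple, Optional


class TaggedString(NamedTuple):
    value: str
    language: Optional[str]


# Hypothetical syntax: a single-quoted string optionally followed by @lang.
LITERAL = re.compile(
    r"^'(?P<value>[^']*)'(?:@(?P<language>[A-Za-z]+(?:-[A-Za-z0-9]+)*))?$"
)


def parse_literal(token: str) -> TaggedString:
    """Split a quoted token into its string value and optional language tag."""
    match = LITERAL.match(token)
    if match is None:
        raise ValueError(f"not a quoted string: {token!r}")
    return TaggedString(match["value"], match["language"])
```

Under this reading, `'manus'@en` is the string manus tagged as English, while `'manus@en'` is the untagged string manus@en.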

Add tests for grammar

Currently only apply is tested. We need tests for grammar - for both parsing and rendering

Provide a better definition of edge

This is quite minimal: https://cmungall.github.io/knowledge-graph-change-language/Edge/

The description of edge should be improved, as well as the mapping to OWL

Mapping to OWL

For v1 we can have the simple mapping:

A subClassOf B <==> Edge(subject=A, predicate=subClassOf, object=B)
A subClassOf P some B <==> Edge(subject=A, predicate=P, object=B)

For v2 we extend this, with an additional owlinterpretation edge property, as per owl star. See https://github.com/cmungall/owlstar/blob/master/owlstar.ttl.

For example:

A subClassOf P only B <==> Edge(subject=A, predicate=P, object=B, interpretation=allValuesFrom)
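
The mapping can be sketched as follows. The `Edge` dataclass and the tuple encoding of restrictions are hypothetical simplifications, and `interpretation` is the proposed v2 property, not an existing slot:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str
    object: str
    interpretation: Optional[str] = None  # proposed v2 property, e.g. "allValuesFrom"


def edge_from_axiom(sub: str, sup) -> Edge:
    """Map `sub subClassOf sup` to an Edge.

    sup is either a class name or a (property, quantifier, filler) restriction.
    """
    if isinstance(sup, str):
        return Edge(sub, "subClassOf", sup)
    prop, quantifier, filler = sup
    if quantifier == "some":
        return Edge(sub, prop, filler)  # v1: existential is the default reading
    if quantifier == "only":
        return Edge(sub, prop, filler, interpretation="allValuesFrom")  # v2
    raise ValueError(f"unsupported quantifier: {quantifier}")
```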

CLI: (i) create, (ii) document

Description

I might have missed it, but I'm looking through the code and docs for a CLI and not seeing one.

Not sure yet how to use the tool.

Obsolete command uses the obsolete oio:ObsoleteClass

Obsoletion should not place things under oio:ObsoleteClass.

See:

http://wiki.geneontology.org/index.php/Obsoleting_an_Existing_Ontology_Term

Provide basic docs on how to use kgcl-tools

@ckindermann it would be great if we could come up with a small set of instructions on how to use the KGCL tools locally to demo them to potential supporters. Please advise on what would be the best way to try these locally - https://kgcl.ontodev.com/ is great and works for small examples, but we couldn't use it with private data or bigger ontologies.

Thank you! ;)

support CURIEs

We don't want to show full URIs to users. For many apply commands the label should work but if the user wants to be more precise they can specify a CURIE

similar for reporting, the format can be "label" (CURIE)
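
A minimal sketch of CURIE handling with a prefix map. The single-entry `PREFIXES` dict and the helper names are illustrative; a real implementation would load the map from a context file:

```python
PREFIXES = {"GO": "http://purl.obolibrary.org/obo/GO_"}  # illustrative prefix map


def expand(curie: str) -> str:
    """Expand a CURIE such as GO:0005634 to a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def contract(uri: str) -> str:
    """Contract a URI to a CURIE, falling back to the URI if no prefix matches."""
    for prefix, base in PREFIXES.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    return uri


def report(label: str, uri: str) -> str:
    """Render a node for reporting in the "label" (CURIE) format."""
    return f'"{label}" ({contract(uri)})'
```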

Create code to translate change objects to SPARQL UPDATEs

counterpoint to #3

Given an instance of a change object, e.g.

[ a kgcl:NodeRename ;
  kgcl:about UBERON:0002398 ;
  kgcl:old_value "manus" ;
  kgcl:new_value "hand" ]

turn this into the equivalent SPARQL UPDATE

Note: it may be the case that no code as such is required, generic SPARQL could be written to do the transform
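
As a sketch of the code-based route, a NodeRename could be rendered to a SPARQL UPDATE string like this. `rename_to_sparql` is a hypothetical helper; comparing on `STR(?old)` is a design choice that ignores any datatype or language tag on the existing label:

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def escape(text: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rename_to_sparql(about_uri: str, old_value: str, new_value: str) -> str:
    """Render a NodeRename as a SPARQL UPDATE that swaps the rdfs:label."""
    return (
        f"PREFIX rdfs: <{RDFS}>\n"
        f"DELETE {{ <{about_uri}> rdfs:label ?old }}\n"
        f'INSERT {{ <{about_uri}> rdfs:label "{escape(new_value)}" }}\n'
        f"WHERE {{ <{about_uri}> rdfs:label ?old . "
        f'FILTER(STR(?old) = "{escape(old_value)}") }}'
    )


query = rename_to_sparql(
    "http://purl.obolibrary.org/obo/UBERON_0002398", "manus", "hand"
)
```

Because the WHERE clause must match for anything to happen, an incorrect old value results in no change to the graph.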

create see_also links in schema pointing to obook pages

Currently we have links to GO wiki pages:

  node creation:
    is_a: node change
    mixins:
      - creation
    description: >-
      a node change in which a new node is created
    slots:
      - node id
      - name
      - owl type
      - annotation set
      - language
    slot_usage:
      change description:
        string_serialization: "creating node {id} {label} with {annotation set}"
    todos:
      - allow this for the creation of an instance from a class. This may include metaclasses (templates)
    see_also:
      - http://wiki.geneontology.org/index.php/Guidelines_for_creating_a_GO_term

these should remain, but we should also point to the relevant obook page

in this case, https://oboacademy.github.io/obook/howto/create-new-term/

Note there are some ODK pages that are becoming obsolete, see

INCATools/ontology-development-kit#578

Add github actions to this repo

Currently set up for travis, but we now use gh actions for everything

we should follow one of our existing repos and have actions for

pypi release
CI tests
rebuild of any downstream artefacts

I like what @wdduncan did for NMDC: https://github.com/microbiomedata/nmdc-schema/tree/main/.github/workflows

Modernize repo

Switch to GH Actions
Switch to poetry and tox
Clean-up Makefile and conform to Mark's template

Fix schema for edge changes to clarify how to uniquely identify an edge

The node change hierarchy makes use of an about field to indicate which node is being modified. Nodes have primary keys so it's easy to do things like:

c = NewSynonym(id='chg12345', about='ANAT:HindLimb', new_value='hindlimb')

It looks like I attempted to do something similar for edges, using an about field that points to an edge. However, edges don't have singular primary keys and we want to be able to refer to edges by SPO triple.

so the current model does allow this:

c = PredicateChange(id='chg12345', about=Edge(subject='FOO:1', predicate=..., object=...), new_value='hindlimb')

This has a natural YAML, JSON, JSON-LD, RDF serialization. However, due to the nesting it does NOT have a natural CSV serialization, and a core use case is being able to specify a set of changes in a spreadsheet

option 1: denormalize the core model

for edges, rather than about we would have 3 fields about_{s,p,o}

The first disadvantage is denormalization

The second is that it assumes every edge (OWL axiom) is uniquely identified by a SPO triple. But in fact that is neither true of OWL nor of the more generalized use cases.

I have quite a few ontologies where I have >1 axiom with the same SPO. This is useful for a number of reasons, such as provenance. Axiom 1 may have evidence E1, contributor C1, publication P1, and Axiom 2 may have evidence E2, contributor C2, publication P2. If these all get conflated to the same edge then we mix E1 with P2 and so on.

See the discussion here for further context on how we deal with this in robot: ontodev/robot#214

option 2: denormalize/flatten at time of mapping to CSV

here we would keep the normalized/nested datamodel, but when translating to and from CSVs we would flatten things such that the outer field is concatenated with the inner field

e.g.

id: chg123
description: I am changing predicate because blah blah
about:
  subject: FOO:1
  predicate: P:1
  object: FOO:2
...

==>

id: chg123
description: I am changing predicate because blah blah
about_subject: FOO:1
about_predicate: P:1
about_object: FOO:2

(as a generalized algorithm this only works when the containing slot is singlevalued)
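
The flattening described in option 2 can be sketched in a few lines. The helper names are hypothetical, and the inverse assumes the set of nested slots (here just about) is known in advance:

```python
def flatten(record: dict, sep: str = "_") -> dict:
    """Flatten one level of nesting: {'about': {'subject': x}} -> {'about_subject': x}."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for inner_key, inner_value in value.items():
                flat[f"{key}{sep}{inner_key}"] = inner_value
        else:
            flat[key] = value
    return flat


def unflatten(row: dict, nested_slots=("about",), sep: str = "_") -> dict:
    """Inverse of flatten for the named single-valued nested slots."""
    record = {}
    for key, value in row.items():
        slot, _, inner = key.partition(sep)
        if slot in nested_slots and inner:
            record.setdefault(slot, {})[inner] = value
        else:
            record[key] = value
    return record
```

The flattened dict maps directly onto one CSV row, for example via csv.DictWriter from the standard library.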

now, technically the SPO may not be unique here. We can imagine two modes:

unambiguous: if your about selector does not uniquely identify a single edge, fail
global: the change is applied to all edges that match
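
The two modes can be sketched as a selector over (subject, predicate, object) tuples. `select_edges` is a hypothetical helper in which any field left as None acts as a wildcard:

```python
def select_edges(edges, subject=None, predicate=None, object=None, mode="unambiguous"):
    """Return edges matching the given fields; None acts as a wildcard."""
    wanted = (subject, predicate, object)
    matches = [
        edge for edge in edges
        if all(w is None or w == v for w, v in zip(wanted, edge))
    ]
    if mode == "unambiguous" and len(matches) != 1:
        raise ValueError(f"selector matched {len(matches)} edges, expected exactly 1")
    return matches


edges = [("CNS", "part_of", "nervous system"), ("brain", "part_of", "CNS")]
# The predicate is omitted: subject and object already identify the edge.
selected = select_edges(edges, subject="CNS", object="nervous system")
```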

This has some nice properties. E.g. most graphs have at most one edge between any pair of nodes, so P doesn't necessarily do much work as a disambiguator, and we could say

"change predicate to subClassOf in edge between CNS and nervous system"

merge kgcl_tool into kgcl

Organisation:

kgcl
- utils
- model - autogenerated from linkml
  - kgcl
  - kgmodel
- apply
- diff
tests
- test_model.py
- test_apply/
- test_diff/

CLI:

Follow https://clig.dev/, e.g. -o for outputs

For summaries:

ideally this would follow the datamodel, which is hopefully used underneath

Add change type "Invert"

Example:

invert A and B

If A R B, then remove edge and add edge B R A
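
A minimal sketch over a set of (subject, predicate, object) tuples. It follows the rule as stated, so only edges from A to B are flipped; whether edges from B to A should be flipped as well is left open here:

```python
def invert(edges: set, a: str, b: str) -> None:
    """Replace every edge a -R-> b with b -R-> a, in place."""
    for s, p, o in list(edges):
        if s == a and o == b:
            edges.discard((s, p, o))
            edges.add((o, p, s))
```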

Changes where the about node cannot be unambiguously determined

http://purl.obolibrary.org/obo/NCBITaxon_2 http://www.w3.org/2000/01/rdf-schema#label "Bacteria" .
http://purl.obolibrary.org/obo/NCBITaxon_3 http://www.w3.org/2000/01/rdf-schema#label "Virus" .

rename 'Bacteria' to 'Virus'
rename 'Virus' to 'Vaccine'

KGCL Model:
NodeRename(ID=test_id_322, Old Value='Bacteria', New Value='Virus')
NodeRename(ID=test_id_323, Old Value='Virus', New Value='Vaccine')

Output Graph:
http://purl.obolibrary.org/obo/NCBITaxon_2 http://www.w3.org/2000/01/rdf-schema#label "Vaccine" .
http://purl.obolibrary.org/obo/NCBITaxon_3 http://www.w3.org/2000/01/rdf-schema#label "Vaccine" .

The behavior is not wrong given the current under-specification, but we should specify this better

I think we are better being strict here, forcing all SimpleChanges to be about a single node. If we later have a use case for updating multiple nodes at once we create specific change classes for that.

NodeRename has a field with cardinality 0..1 for about-node.

I propose that the procedure for going from the change model to the output graph first fills in the about-node (or about-edge) slot. If this cannot be done unambiguously, this would raise an error.

However, I appreciate that may be harder for a SPARQL implementation. It's nice to be able to translate the change object into a simple SPARQL object. I am happy with this behavior for now if we document this.

This might be non-ideal for cases where no about node can be determined: a direct SPARQL update would simply result in no changes. But maybe that can just be detected directly.
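
The stricter procedure could look like the sketch below, which uses a dict from node to label in place of the graph. Resolving every about node against the input graph before applying any change is one possible policy, assumed here for illustration; the function names are hypothetical:

```python
def resolve_about(labels: dict, name: str) -> str:
    """Return the single node whose label is name, or raise if there is not exactly one."""
    nodes = [node for node, label in labels.items() if label == name]
    if len(nodes) != 1:
        raise LookupError(f"label {name!r} matches {len(nodes)} nodes")
    return nodes[0]


def apply_renames(labels: dict, renames: list) -> dict:
    """Apply (old, new) renames after resolving each about node against the input graph."""
    resolved = [(resolve_about(labels, old), new) for old, new in renames]
    result = dict(labels)
    for node, new in resolved:
        result[node] = new
    return result


graph = {"NCBITaxon_2": "Bacteria", "NCBITaxon_3": "Virus"}
output = apply_renames(graph, [("Bacteria", "Virus"), ("Virus", "Vaccine")])
```

Under this policy NCBITaxon_2 ends up labelled Virus and NCBITaxon_3 ends up labelled Vaccine, and a label that matches zero or several nodes raises an error instead of silently changing nothing or everything.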