
kgcl-rdflib's Introduction

kgcl-rdflib

An engine that applies changes or diffs specified in KGCL to an RDF graph stored using rdflib

KGCL

KGCL (Knowledge Graph Change Language) is a datamodel and language for representing changes in ontologies and knowledge graphs

The core KGCL repo is here:

This kgcl-rdflib repo applies the KGCL model to rdflib graphs.

kgcl-rdflib's People

Contributors

ckindermann, cmungall, hrshdhgd, jamesaoverton, joeflack4

kgcl-rdflib's Issues

retire kgcl_2_rdf.py

Despite its name, this doesn't map to RDF; it renders as YAML.

note there is no need for all this code to generate yaml - linkml yaml_dumper does this automatically!

diff fails on existential restrictions

cls = Namespace("http://www.w3.org/2002/07/owl#"), name = 'some_values_from'
default = None

    def __getitem__(cls, name, default=None):
        name = str(name)
        if str(name).startswith("__"):
            return super().__getitem__(name, default)
        if (cls._warn or cls._fail) and name not in cls:
            if cls._fail:
>               raise AttributeError(f"term '{name}' not in namespace '{cls._NS}'")
E               AttributeError: term 'some_values_from' not in namespace 'http://www.w3.org/2002/07/owl#'

/Users/cjm/Library/Caches/pypoetry/virtualenvs/kgcl-schema-ImX8xLey-py3.9/lib/python3.9/site-packages/rdflib/namespace/__init__.py:196: AttributeError

Parsed strings retain quotes

Given "rename GO:1 from 'foo' to 'bar'"

the parser returns

{
  "id": "CHANGE:001",
  "old_value": "'foo'",
  "new_value": "'bar'",
  "about_node": "GO:1",
  "about_node_representation": "curie",
  "@type": "NodeRename"
} 

The quoted value is itself quoted
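
One possible fix is to strip the surrounding quotes after parsing. This is only a sketch: `unquote` is a hypothetical helper, not an existing kgcl-rdflib function, and it assumes values are wrapped in one pair of matching quotes.

```python
def unquote(value: str) -> str:
    """Strip one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        return value[1:-1]
    return value


# The two quoted fields from the parser output shown above.
parsed = {"old_value": "'foo'", "new_value": "'bar'"}
cleaned = {key: unquote(val) for key, val in parsed.items()}
```

Unquoted values such as the CURIE `GO:1` pass through unchanged.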

Rename command doesn't work if literals are xsd:strings

If the input has xsd:strings on literals (as is common for many
ontologies like GO):

<http://purl.obolibrary.org/obo/NCBITaxon_2>
<http://www.w3.org/2000/01/rdf-schema#label> "Bacteria"^^xsd:string .

then rename fails: it adds a new name but does not delete the old

This seems to be a more general issue
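
A library-free sketch of one way to handle this: treat a plain literal and an xsd:string literal with the same lexical form as the same label when matching the old value. Literals are modelled as plain tuples here rather than rdflib terms, and `same_string`, `rename` and the new label "Eubacteria" are all hypothetical, for illustration only.

```python
from typing import List, Optional, Tuple

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# A literal is modelled as (lexical form, datatype IRI or None).
Literal = Tuple[str, Optional[str]]
Triple = Tuple[str, str, Literal]


def same_string(literal: Literal, text: str) -> bool:
    """True for plain and xsd:string literals with this lexical form."""
    lexical, datatype = literal
    return lexical == text and datatype in (None, XSD_STRING)


def rename(triples: List[Triple], node: str, old: str, new: str) -> List[Triple]:
    """Replace the label `old` on `node` with `new`, keeping the datatype."""
    result = []
    for s, p, o in triples:
        if s == node and p == RDFS_LABEL and same_string(o, old):
            result.append((s, p, (new, o[1])))  # old label is dropped
        else:
            result.append((s, p, o))
    return result


NODE = "http://purl.obolibrary.org/obo/NCBITaxon_2"
graph = [(NODE, RDFS_LABEL, ("Bacteria", XSD_STRING))]
renamed = rename(graph, NODE, "Bacteria", "Eubacteria")
```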

Remove about_node_representation

{
  "id": "CHANGE:001",
  "old_value": "'foo'",
  "new_value": "'bar'",
  "about_node": "GO:1",
  "about_node_representation": "curie",
  "@type": "NodeRename"
} 

This is unnecessary: just as with JSON-LD, we can leave the value as either a CURIE or a URI. Of course, we should always have a context file that maps prefixes.

Annotate method inputs and outputs using typing

For example:

def parse(input):
    """
    Parse a set of KGCL commands separated by newlines.

    Returns instantiated dataclass objects from model.kgcl.
    """
    statements = input.splitlines()
    parsed = []

    for s in statements:
        parsed.append(parse_statement(s))
    return parsed

The arguments and the output should be typed
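
A typed version might look like the sketch below. `Change` and the body of `parse_statement` are hypothetical stand-ins so that the example runs on its own; the real function would return the dataclasses from model.kgcl.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    """Stand-in for the dataclasses generated in model.kgcl (hypothetical)."""

    text: str


def parse_statement(statement: str) -> Change:
    """Placeholder for the real single-statement parser."""
    return Change(text=statement.strip())


def parse(input: str) -> List[Change]:
    """Parse newline-separated KGCL commands into change objects."""
    return [parse_statement(s) for s in input.splitlines()]
```

The parameter keeps its original name `input` so existing callers are unaffected, although it shadows the builtin.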

Tweak serialisation for Node Moves

The current serialisation suggests move {about edge} from {old value} to {new value}.
If an edge is specified as a triple (s,p,o), then move (s,p,o) from o to o' would duplicate the {old value}.
How about move {subject} {predicate} from {old value} to {new value}?

Create code to translate linear representation to change objects

Given linear representations of changes such as:

  • Rename foo to bar
  • Move bar from under baz to under fred
  • Obsolete 'bad' with replacement 'new'
  • Merge ‘nerve’ into ‘peripheral nerve’

Translate these into instances of Change objects https://cmungall.github.io/knowledge-graph-change-language/

These change objects can then be translated into transactions at the owl, rdf, rdf*, or kgx level

The original idea was laid out in this doc:

https://docs.google.com/document/d/1__7p64FOI5ZhiZ6F2TXtUc8JN1XXGwglOiVRrlg9G_c/edit#

This doc outlined a grammar, but I think the grammar should instead be derived by reversing the f-string-style string_serialization templates in the datamodel.

  node rename:
    is_a: node change
    description: >-
      A node change where the name (aka rdfs:label) of the node changes
    slots:
      - old value
      - new value
      - has textual diff      
    slot_usage:
      old value:
        multivalued: false
      new value:
        multivalued: false
      change description:
        string_serialization: "rename {about} from {old value} to {new value}"
    examples:
      - value: "rename UBERON:0002398 from 'manus' to 'hand'"
        description: "replacing the rdfs:label of 'manus' on an uberon class with the rdfs:label 'hand'"
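
As a rough illustration of "reversing" a string_serialization template, the sketch below turns the template into a regular expression with one named group per slot. The helper names are hypothetical, and it assumes slot values never contain the literal text between slots (for example " to ").

```python
import re
from typing import Dict, Optional


def template_to_regex(template: str) -> re.Pattern:
    """Turn 'rename {about} from {old value} ...' into a compiled regex."""
    # re.split with a capturing group alternates literal text and slot names.
    parts = re.split(r"\{([^}]+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal text between slots
        else:
            group = part.replace(" ", "_")  # slot name -> valid group name
            pattern += f"(?P<{group}>.+?)"
    return re.compile(f"^{pattern}$")


def parse_with_template(template: str, text: str) -> Optional[Dict[str, str]]:
    """Return slot values if `text` matches the template, else None."""
    match = template_to_regex(template).match(text)
    return match.groupdict() if match else None


TEMPLATE = "rename {about} from {old value} to {new value}"
fields = parse_with_template(TEMPLATE, "rename UBERON:0002398 from 'manus' to 'hand'")
```

With this approach the quotes stay part of the captured values, so they would still need to be stripped afterwards.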

Use modern python idioms

  • use f-strings rather than Java-style string concatenation
  • use ttl files in tests/input rather than encoding ontologies as turtle strings in tests
  • don't use double-underscore in variable names, this has special meaning in python
  • remove filesystem path assumptions
  • follow clig.dev for CLI
  • Don't use java-style getter-setters on data objects - it is more pythonic just to access fields directly. See also https://docs.python.org/3/library/dataclasses.html
  • change if type(kgcl_instance) is NodeRename: to use isinstance
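
A small sketch of a few of these idioms together (dataclass fields accessed directly, isinstance, f-strings). The two classes are simplified, hypothetical stand-ins for the generated model classes, not the real ones.

```python
from dataclasses import dataclass


@dataclass
class NodeChange:
    about_node: str


@dataclass
class NodeRename(NodeChange):
    old_value: str
    new_value: str


def describe(change: NodeChange) -> str:
    """Render a change as text without getters or string concatenation."""
    # isinstance also accepts subclasses, unlike `type(x) is NodeRename`.
    if isinstance(change, NodeRename):
        return (
            f"rename {change.about_node} "
            f"from '{change.old_value}' to '{change.new_value}'"
        )
    return f"change {change.about_node}"


text = describe(NodeRename("GO:1", "foo", "bar"))
```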

Rename calls are agnostic to the `from` value

e.g: Both

  • rename MONDO:0000087 from 'polymicrogyria' to 'polymicrogyria ABCD' and
  • rename MONDO:0000087 from 'foo bar' to 'polymicrogyria ABCD'

both rename whatever the current label of MONDO:0000087 is. The code currently does not check that the from value (polymicrogyria in this case) actually matches the existing label.
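
A minimal sketch of the missing check, using a plain dict of labels in place of a graph. The `rename` function here is hypothetical, and raising an error is just one option; a warning would be another.

```python
from typing import Dict


def rename(labels: Dict[str, str], node: str, old: str, new: str) -> None:
    """Rename `node` only if its current label equals `old`."""
    current = labels.get(node)
    if current != old:
        raise ValueError(f"{node} is labelled {current!r}, not {old!r}")
    labels[node] = new


labels = {"MONDO:0000087": "polymicrogyria"}
rename(labels, "MONDO:0000087", "polymicrogyria", "polymicrogyria ABCD")
```

With this check, the second command above (from 'foo bar') would be rejected instead of silently succeeding.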

Dependencies from imports are not generated correctly in Python Dataclasses

Running pipenv run gen-py-classes kgcl.yaml > kgcl.py generates the desired kgcl.py file.
However, import dependencies from the modules ontology_model and prov are not generated correctly.
For example, instead of about_edge: Optional[Union[dict, Edge]] = None I find about_edge: Optional[Union[dict, "Edge"]] = None.

Do not require <>s in URIs

The original implementation forces URIs in the DSL to be written in <>s, as in Turtle.

Let's deprecate this but make them optional so existing tests don't break
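
Accepting both forms could be as simple as normalising the token before further processing. A sketch, with `strip_angle_brackets` as a hypothetical helper name:

```python
def strip_angle_brackets(term: str) -> str:
    """Accept '<http://...>' and 'http://...' alike; return the bare form."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    return term


bracketed = strip_angle_brackets("<http://purl.obolibrary.org/obo/NCBITaxon_2>")
bare = strip_angle_brackets("http://purl.obolibrary.org/obo/NCBITaxon_2")
```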

Language tags in KGCL

How should language tags be supported explicitly by KGCL?

I am assuming statements such as "rename UBERON:0002398 from 'manus' to 'hand'" should preserve available language tags. But what about inputs such as "rename UBERON:0002398 from 'manus@en' to 'hand@en'"? Should 'manus@en' be interpreted as "manus@en" or "manus"@en?

This ambiguity could be avoided by making language tags explicit in KGCL, e.g. "rename UBERON:0002398 from 'manus'@en to 'hand'@en". In this case, we'd need to extend the KGCL data model to hold information about language tags.
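
To make the explicit-tag proposal concrete, here is a sketch of how a parser could separate the value from the tag. The `'value'@lang` syntax is the proposal from this issue, not something KGCL is known to support; `TaggedString` and `parse_literal` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class TaggedString(NamedTuple):
    value: str
    language: Optional[str]


# A single-quoted value, optionally followed by @ and a language tag.
_LITERAL = re.compile(
    r"^'(?P<value>.*)'(?:@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*))?$"
)


def parse_literal(token: str) -> TaggedString:
    """Split a quoted literal into its value and optional language tag."""
    match = _LITERAL.match(token)
    if match is None:
        raise ValueError(f"not a quoted literal: {token}")
    return TaggedString(match.group("value"), match.group("lang"))


explicit = parse_literal("'manus'@en")   # tag outside the quotes
embedded = parse_literal("'manus@en'")   # '@en' is part of the value
```

Under this reading, an `@` inside the quotes is always literal text, which removes the ambiguity.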

Add tests for grammar

Currently only apply is tested. We need tests for grammar - for both parsing and rendering

Provide a better definition of edge

This is quite minimal: https://cmungall.github.io/knowledge-graph-change-language/Edge/

The description of edge should be improved, as well as the mapping to OWL

Mapping to OWL

For v1 we can have the simple mapping:

  • A subClassOf B <==> Edge(subject=A, predicate=subClassOf, object=B)
  • A subClassOf P some B <==> Edge(subject=A, predicate=P, object=B)

For v2 we extend this, with an additional owlinterpretation edge property, as per owl star. See https://github.com/cmungall/owlstar/blob/master/owlstar.ttl.

For example:

  • A subClassOf P only B <==> Edge(subject=A, predicate=P, object=B, interpretation=allValuesFrom)
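
The mapping above can be sketched as a function from an edge to its axiom text. `Edge` here is a simplified, hypothetical stand-in for the model class, and the value "someValuesFrom" for the interpretation property is an assumption; only "allValuesFrom" appears in the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str
    object: str
    interpretation: Optional[str] = None  # v2 only, e.g. "allValuesFrom"


# Quantifier keyword per interpretation; None is the v1 default (some).
_QUANTIFIERS = {None: "some", "someValuesFrom": "some", "allValuesFrom": "only"}


def to_axiom(edge: Edge) -> str:
    """Render an edge as the OWL axiom it stands for."""
    if edge.predicate == "subClassOf":
        return f"{edge.subject} subClassOf {edge.object}"
    quantifier = _QUANTIFIERS[edge.interpretation]
    return f"{edge.subject} subClassOf {edge.predicate} {quantifier} {edge.object}"


v1_axiom = to_axiom(Edge("A", "P", "B"))
v2_axiom = to_axiom(Edge("A", "P", "B", interpretation="allValuesFrom"))
```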

CLI: (i) create, (ii) document

Description

I might have missed it, but I'm looking through the code and docs for a CLI and not seeing one.

Not sure yet how to use the tool.

Provide basic docs on how to use kgcl-tools

@ckindermann it would be great if we could come up with a small set of instructions on how to use the KGCL tools locally to demo them to potential supporters. Please advise on the best way to try these locally: https://kgcl.ontodev.com/ is great and works for small examples, but we couldn't use it with private data or bigger ontologies.

Thank you! ;)

support CURIEs

We don't want to show full URIs to users. For many apply commands the label should work but if the user wants to be more precise they can specify a CURIE

similar for reporting, the format can be "label" (CURIE)
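
A sketch of CURIE expansion and contraction against a prefix map. The helper names are hypothetical, and the report format with the label in double quotes is just one reading of the format suggested above.

```python
from typing import Dict

PREFIXES: Dict[str, str] = {
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
}


def expand(curie: str, prefixes: Dict[str, str]) -> str:
    """Expand a CURIE to a full URI; unknown prefixes are returned as-is."""
    prefix, _, local = curie.partition(":")
    if prefix in prefixes and local:
        return prefixes[prefix] + local
    return curie


def contract(uri: str, prefixes: Dict[str, str]) -> str:
    """Contract a URI to a CURIE when a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri


def report(label: str, uri: str, prefixes: Dict[str, str]) -> str:
    """Format a node for display without showing the full URI."""
    return f'"{label}" ({contract(uri, prefixes)})'


uri = expand("NCBITaxon:2", PREFIXES)
line = report("Bacteria", uri, PREFIXES)
```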

Create code to translate change objects to SPARQL UPDATEs

Counterpart to #3

Given an instance of a change object, e.g.

[ a kgcl:NodeRename ;
  kgcl:about UBERON:0002398 ;
  kgcl:old_value "manus" ;
  kgcl:new_value "hand" ]

turn this into the equivalent SPARQL UPDATE

Note: it may be the case that no code as such is required, generic SPARQL could be written to do the transform

create see_also links in schema pointing to obook pages

Currently we have links to GO wiki pages:

  node creation:
    is_a: node change
    mixins:
      - creation
    description: >-
      a node change in which a new node is created
    slots:
      - node id
      - name
      - owl type
      - annotation set
      - language
    slot_usage:
      change description:
        string_serialization: "creating node {id} {label} with {annotation set}"
    todos:
      - allow this for the creation of an instance from a class. This may include metaclasses (templates)
    see_also:
      - http://wiki.geneontology.org/index.php/Guidelines_for_creating_a_GO_term

these should remain, but we should also point to the relevant obook page

in this case, https://oboacademy.github.io/obook/howto/create-new-term/

Note there are some ODK pages that are becoming obsolete, see

Modernize repo

  • Switch to GH Actions
  • Switch to poetry and tox
  • Clean-up Makefile and conform to Mark's template

Fix schema for edge changes to clarify how to uniquely identify an edge

The node change hierarchy makes use of an about field to indicate which node is being modified. Nodes have primary keys so it's easy to do things like:

c = NewSynonym(id='chg12345', about='ANAT:HindLimb', new_value='hindlimb')

It looks like I attempted to do something similar for edges, using an about field that points to an edge. However, edges don't have singular primary keys and we want to be able to refer to edges by SPO triple.

so the current model does allow this:

c = PredicateChange(id='chg12345', about=Edge(subject='FOO:1', predicate=..., object=...), new_value='hindlimb')

This has a natural YAML, JSON, JSON-LD, RDF serialization. However, due to the nesting it does NOT have a natural CSV serialization, and a core use case is being able to specify a set of changes in a spreadsheet

option 1: denormalize the core model

for edges, rather than about we would have 3 fields about_{s,p,o}

The first disadvantage is denormalization

The second is that it assumes every edge (OWL axiom) is uniquely identified by a SPO triple. But in fact that is neither true of OWL nor of the more generalized use cases.

I have quite a few ontologies where I have >1 axiom with the same SPO. This is useful for a number of reasons, such as provenance. Axiom 1 may have evidence E1, contributor C1, publication P1, and Axiom 2 may have evidence E2, contributor C2, publication P2. If these all get conflated into the same edge then we mix E1 with P2 and so on.

See the discussion here for further context on how we deal with this in robot: ontodev/robot#214

option 2: denormalize/flatten at time of mapping to CSV

here we would keep the normalized/nested datamodel, but when translating to and from CSVs we would flatten things such that the outer field is concatenated with the inner field

e.g.

id: chg123
description: I am changing predicate because blah blah
about:
  subject: FOO:1
  predicate: P:1
  object: FOO:2
...

==>

id: chg123
description: I am changing predicate because blah blah
about_subject: FOO:1
about_predicate: P:1
about_object: FOO:2

(as a generalized algorithm this only works when the containing slot is singlevalued)
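
A sketch of option 2 on plain dicts. `flatten` and `unflatten` are hypothetical helpers; `unflatten` handles only one level of nesting and has to be told which slots are containers, because other slot names may contain underscores too.

```python
from typing import Any, Dict, Set


def flatten(record: Dict[str, Any], sep: str = "_") -> Dict[str, Any]:
    """Concatenate outer and inner field names for nested (single-valued) slots."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for inner_key, inner_value in flatten(value, sep).items():
                flat[f"{key}{sep}{inner_key}"] = inner_value
        else:
            flat[key] = value
    return flat


def unflatten(row: Dict[str, Any], containers: Set[str], sep: str = "_") -> Dict[str, Any]:
    """Rebuild one level of nesting for the named container slots."""
    record: Dict[str, Any] = {}
    for key, value in row.items():
        outer, _, inner = key.partition(sep)
        if outer in containers and inner:
            record.setdefault(outer, {})[inner] = value
        else:
            record[key] = value
    return record


change = {
    "id": "chg123",
    "about": {"subject": "FOO:1", "predicate": "P:1", "object": "FOO:2"},
}
row = flatten(change)
```

Each key of `row` can then become a CSV column header.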

now, technically the SPO may not be unique here. We can imagine two modes:

  1. unambiguous: if your about selector does not uniquely identify a single edge, fail
  2. global: the change is applied to all edges that match

This has some nice properties. E.g. most pairs of nodes have at most one edge between them, so P doesn't necessarily do much work as a disambiguator, and we could say

  • "change predicate to subClassOf in edge between CNS and nervous system"
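
The two modes could be sketched like this over a list of plain (s, p, o) tuples. `select_edges` and the sample edges are hypothetical, for illustration only.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


def select_edges(
    edges: List[Triple],
    subject: str,
    obj: str,
    predicate: Optional[str] = None,
    mode: str = "unambiguous",
) -> List[Triple]:
    """Select edges matching the (possibly partial) about selector."""
    if mode not in ("unambiguous", "global"):
        raise ValueError(f"unknown mode: {mode}")
    matches = [
        edge
        for edge in edges
        if edge[0] == subject
        and edge[2] == obj
        and (predicate is None or edge[1] == predicate)
    ]
    # In unambiguous mode the selector must identify exactly one edge.
    if mode == "unambiguous" and len(matches) != 1:
        raise ValueError(f"selector matched {len(matches)} edges, expected 1")
    return matches


edges = [
    ("CNS", "related_to", "nervous system"),
    ("brain", "part_of", "CNS"),
    ("brain", "develops_from", "CNS"),
]
# No predicate needed: there is only one edge between these two nodes.
selected = select_edges(edges, "CNS", "nervous system")
```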

merge kgcl_tool into kgcl

Organisation:

  • kgcl
    • utils
    • model - autogenerated from linkml
      • kgcl
      • kgmodel
    • apply
    • diff
  • tests
    • test_model.py
    • test_apply/
    • test_diff/

CLI:

Follow https://clig.dev/, e.g. -o for outputs

For summaries:

[image not preserved]

ideally this would follow the datamodel, which is hopefully used underneath

Changes where the about node cannot be unambiguously determined

http://purl.obolibrary.org/obo/NCBITaxon_2 http://www.w3.org/2000/01/rdf-schema#label "Bacteria" .
http://purl.obolibrary.org/obo/NCBITaxon_3 http://www.w3.org/2000/01/rdf-schema#label "Virus" .

rename 'Bacteria' to 'Virus'
rename 'Virus' to 'Vaccine'

KGCL Model:
NodeRename(ID=test_id_322, Old Value='Bacteria', New Value='Virus')
NodeRename(ID=test_id_323, Old Value='Virus', New Value='Vaccine')

Output Graph:
http://purl.obolibrary.org/obo/NCBITaxon_2 http://www.w3.org/2000/01/rdf-schema#label "Vaccine" .
http://purl.obolibrary.org/obo/NCBITaxon_3 http://www.w3.org/2000/01/rdf-schema#label "Vaccine" .

The behavior is not wrong by the current under-specification, but we should better specify this

I think we are better being strict here, forcing all SimpleChanges to be about a single node. If we later have a use case for updating multiple nodes at once we create specific change classes for that.

NodeRename has a field with cardinality 0..1 for about-node.

I propose that the procedure for going from the change model to the output graph first fills in the about-node (or about-edge) slot. If this cannot be done unambiguously, an error is raised.

However, I appreciate that may be harder for a SPARQL implementation. It's nice to be able to translate the change object into a simple SPARQL object. I am happy with this behavior for now if we document this.

This might be non-ideal for cases where no about node can be determined: a direct SPARQL update would simply result in no changes. But maybe that can just be detected directly.
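
A sketch of the strict behavior proposed above, using a dict from node to label in place of the graph. `resolve_about` and `apply_renames` are hypothetical names. Each rename must resolve to exactly one node before it is applied.

```python
from typing import Dict, List, Tuple


def resolve_about(labels: Dict[str, str], old_value: str) -> str:
    """Find the single node whose label is `old_value`, or fail."""
    matches = [node for node, label in labels.items() if label == old_value]
    if len(matches) != 1:
        raise LookupError(
            f"label {old_value!r} matches {len(matches)} nodes, expected 1"
        )
    return matches[0]


def apply_renames(
    labels: Dict[str, str], renames: List[Tuple[str, str]]
) -> Dict[str, str]:
    """Apply (old, new) renames in order, filling in the about node first."""
    result = dict(labels)
    for old, new in renames:
        node = resolve_about(result, old)
        result[node] = new
    return result


labels = {"NCBITaxon:2": "Bacteria", "NCBITaxon:3": "Virus"}
first_only = apply_renames(labels, [("Bacteria", "Virus")])
```

Applying the second rename from the example ('Virus' to 'Vaccine') on top of this now raises, because two nodes carry the label 'Virus' at that point, instead of renaming both to 'Vaccine'. A label that matches no node raises as well, which covers the "no changes" case.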
