incatools / relation-graph

Materialize OWL existential relations

License: MIT License

Scala 100.00%

obofoundy ontology owl rdf

relation-graph's Issues

Provide TSV export

4 columns:

subject
predicate
object
named graph (e.g., redundant, non-redundant)

With CURIEs

Why:

many tools speak TSV (pandas, sqlite, cut, grep)
loading the ttl files using the OWLAPI causes the object properties (OPs) to be inferred as annotation properties (APs). I believe querying with robot routes through the OWLAPI, which injects these unwanted triples
having the two files combined into one is useful
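A rough Python sketch of what the requested four-column export could look like. This is only an illustration of the format, not relation-graph code: the prefix map, the to_curie and write_tsv helpers, and the sample quad are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical prefix map; a real run would load this from a prefixes file.
PREFIXES = {
    "BFO": "http://purl.obolibrary.org/obo/BFO_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}


def to_curie(iri, prefixes=PREFIXES):
    """Contract an IRI to a CURIE; return the IRI unchanged if no prefix matches."""
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return iri


def write_tsv(quads, out):
    """Write (subject, predicate, object, graph) rows as tab-separated CURIEs."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["subject", "predicate", "object", "graph"])
    for s, p, o, graph in quads:
        writer.writerow([to_curie(s), to_curie(p), to_curie(o), graph])


# Illustrative quad only; the graph column carries the named-graph label.
quads = [
    (
        "http://purl.obolibrary.org/obo/UBERON_0002101",
        "http://purl.obolibrary.org/obo/BFO_0000050",
        "http://purl.obolibrary.org/obo/UBERON_0000468",
        "redundant",
    ),
]
buffer = io.StringIO()
write_tsv(quads, buffer)
```

Putting the named graph in the fourth column is what lets the redundant and non-redundant outputs live in one file while staying easy to split with cut or grep.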

add info to readme about CLI options

It would be helpful (at least for me) if the README had info about the CLI options (e.g., non-redundant-output-file, redundant-output-file). The relation-graph --help output is a bit terse:

Config
Usage: config [options]
  --usage  <bool>
        Print usage and exit
  --help | -h  <bool>
        Print help message and exit
  --ontology-file  <string>
  --non-redundant-output-file  <string>
  --redundant-output-file  <string>
  --mode  <output mode>
  --property  <string*>
  --properties-file  <string?>
  --output-subclasses  <boolean value>
  --reflexive-subclasses  <boolean value>
  --equivalence-as-subclass  <boolean value>

include RBox

Can we include the transitive closure of rdfs:subPropertyOf?

there is also an argument for treating domain/range as edges too and including inferred rbox axioms like this
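For concreteness, a small Python sketch of the transitive closure being asked for. It works on plain (sub, super) pairs and is an assumption about the desired output, not how relation-graph computes anything; relatedTo is a made-up property added to get a two-step path.

```python
def sub_property_closure(asserted):
    """Return every (sub, super) pair implied by transitivity of subPropertyOf."""
    parents = {}
    for sub, sup in asserted:
        parents.setdefault(sub, set()).add(sup)

    closure = set()
    for start in parents:
        # Depth-first walk upward from each property that has a parent.
        stack = list(parents[start])
        seen = set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            closure.add((start, current))
            stack.extend(parents.get(current, ()))
    return closure


# partOf and overlaps are borrowed from examples elsewhere on this page.
asserted = {("partOf", "overlaps"), ("overlaps", "relatedTo")}
closure = sub_property_closure(asserted)
```

Each pair in the closure would become one rdfs:subPropertyOf edge in the output graph.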

I understand some are against this, but adding rdfs:labels to output would be helpful. Since OBO ontologies use opaque IRIs, you have to look up the class names if you don't remember the IRI's label ... which often happens for me :)

FWIW, here is a robot command to extract the labels:

robot remove -i input.owl --select annotation-properties --exclude-term rdfs:label \
      remove --axioms logical \
      -o output-labels.ttl

A sparql query is easy too.

The output would need to be merged with the graph.

cc @balhoff

Unsatisfiable classes even when passing --disable-owl-nothing

robot merge -I http://purl.obolibrary.org/obo/upheno.owl -o db/upheno.owl
relation-graph --disable-owl-nothing true \
               --ontology-file db/upheno.owl \
               --output-file db/upheno-relation-graph.tsv.ttl.tmp \
               --equivalence-as-subclass true \
               --output-subclasses true \
               --reflexive-subclasses true

yields

2022.05.16 16:22:47:341 zio-default-async-1 INFO org.renci.relationgraph.Main.program:57
    Running reasoner
2022.05.16 16:28:28:181 zio-default-async-1 INFO org.renci.relationgraph.Main.program:60
    Done running reasoner
2022.05.16 16:28:28:257 zio-default-async-1 ERROR org.renci.relationgraph.Main.run:77
    Ontology is incoherent; please correct unsatisfiable classes.

From #101 it seems maybe that --disable-owl-nothing is not sufficient in itself, although this ontology should not have individuals, and the error is about correcting unsatisfiable classes. Perhaps it is the case the RL part does things like check for disjointness violations as rules regardless of owl-nothing?

While we should fix upheno, it would be good to have a mode that is for just calculating the relation graph from nothing more than RBox axioms, subClassOf, someValuesFrom (perhaps classAssertion). While this can be done as a robot step it's convenient to do directly

help output out of date

When I run relation-graph --help, it produces:

Usage: config [options]
  --usage  <bool>
        Print usage and exit
  --help | -h  <bool>
        Print help message and exit
  --ontology-file  <string>
  --output-file  <string>
  --mode  <output mode>
  --property  <string*>
  --properties-file  <string?>
  --output-subclasses  <boolean value>
  --reflexive-subclasses  <boolean value>
  --equivalence-as-subclass  <boolean value>

This is not the same as the README, which includes:

--output-classes  <bool>
        Output any triples where classes are subjects (default true)
  --output-individuals  <bool>
        Output triples where individuals are subjects, with classes as objects (default false)
  --disable-owl-nothing  <bool>
        Disable inference of unsatisfiable classes by the whelk reasoner (default false)
  --prefixes  <filename>
        Prefix mappings to use for TSV output (YAML dictionary)
  --obo-prefixes  <bool>
        Compact OBO-style IRIs regardless of inclusion in prefixes file
  --verbose  <bool>
        Set log level to INFO

cc @balhoff

publish library

The core of relation-graph can be extracted and published as a library.

algorithm to make RG faster

I suspect RG spends a lot of time working through combinations that can be eliminated ahead of time.

E.g. consider a large ontology consisting of protein classes P and GO terms T, where the two are linked by P subClassOf involvedIn some T, where T employs a rich RBox such that there are many interesting R some Ts, but there are no named subclasses of any R some P. We waste a lot of time if we materialize all R some Ps; instead we can filter ahead of time by determining that no named classes are inferred to be subclasses of these expressions.

The following algorithm builds a reduced set of class expressions Xs to materialize:

Cs = bfsort(Ont.classes)        # breadth-first: parents before children
Rs = bfsort(Ont.properties)
Excluded = {}
Xs = []
For r in Rs:
  For c in Cs:
    if exists r', c' such that r' in {parent(r),r} and c' in {parent(c),c} and Excluded[(r',c')]:
      # a more general expression has no named subclasses, so neither does this one
      Excluded[(r,c)] = True
      Continue
    Else:
      Excluded[(r,c)] = |Reasoner.queryDescendants( r some c )| == 0
      If not Excluded[(r,c)]:
        Xs.append( r some c )

Intuitively this will give better results in many cases, but may come at a penalty of doing extra computation when Xs approximates the full Cs x Rs cross product. Benchmarks required!
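A minimal runnable sketch of the pruning idea above, in Python. The reasoner is replaced by a hypothetical lookup table (TOY_DESCENDANTS), the class and property lists are assumed to already be in breadth-first order, and a pair counts as excluded when its query returns no named subclasses. None of these names come from relation-graph itself.

```python
# Toy stand-in for the reasoner: named subclasses of each "r some c" expression.
# Every name below is hypothetical and only serves to illustrate the pruning.
TOY_DESCENDANTS = {
    ("relatedTo", "Process"): {"Kinase"},
    ("relatedTo", "Signaling"): {"Kinase"},
    ("involvedIn", "Process"): {"Kinase"},
    ("involvedIn", "Signaling"): {"Kinase"},
}
calls = []  # records every reasoner query that was actually issued


def toy_query(prop, cls):
    calls.append((prop, cls))
    return TOY_DESCENDANTS.get((prop, cls), set())


def select_expressions(properties, classes, prop_parents, class_parents, query):
    """Return the (property, class) pairs worth materializing.

    A pair is skipped without querying when the pair itself or a pair built
    from a direct parent is already known to have no named subclasses.
    """
    excluded = {}
    selected = []
    for r in properties:
        for c in classes:
            r_candidates = [r, *prop_parents.get(r, ())]
            c_candidates = [c, *class_parents.get(c, ())]
            if any(
                excluded.get((r2, c2), False)
                for r2 in r_candidates
                for c2 in c_candidates
            ):
                excluded[(r, c)] = True
                continue
            excluded[(r, c)] = len(query(r, c)) == 0
            if not excluded[(r, c)]:
                selected.append((r, c))
    return selected


selected = select_expressions(
    properties=["relatedTo", "involvedIn"],
    classes=["Process", "Protein", "Signaling", "Kinase"],
    prop_parents={"involvedIn": ["relatedTo"]},
    class_parents={"Signaling": ["Process"], "Kinase": ["Protein"]},
    query=toy_query,
)
```

In this toy input the full cross product has 8 expressions but only 5 queries are issued, because everything with a Protein or Kinase filler is pruned once relatedTo some Protein comes back empty. Whether that saving holds on real ontologies is exactly what the benchmarks would need to show.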

As an aside, I still think the strategy of materializing R-some-Cs is a bit contorted. I suspect you will always get better performance by modeling the RG directly in the ABox and using simple property chains/rules, e.g. R o SubClassOf -> R, together with transitivity. Initial experiments with datalog suggest this is the case, but it is not clear whether the performance difference is due to the datalog system (Souffle) or the differing strategies (TBox vs ABox).

CLI options on README are out of sync with latest build

README has

relation-graph --ontology-file uberon.owl --non-redundant-output-file nonredundant.ttl --redundant-output-file redundant.ttl --mode rdf --property 'http://purl.obolibrary.org/obo/BFO_0000050' --property 'http://purl.obolibrary.org/obo/BFO_0000051' --properties-file more_properties.txt

but the latest has a single output file:

➜  relation-graph git:(master) ./target/universal/stage/bin/relation-graph --help
Config
Usage: config [options]
  --usage  <bool>
        Print usage and exit
  --help | -h  <bool>
        Print help message and exit
  --ontology-file  <string>
  --output-file  <string>
  --mode  <output mode>
  --property  <string*>
  --properties-file  <string?>
  --output-subclasses  <boolean value>
  --reflexive-subclasses  <boolean value>
  --equivalence-as-subclass  <boolean value>

it was very useful having both named graphs as separate outputs before - any recommendations for how to get this out with the new CLI?

including a non-existent URI in a properties-file results in zero results

If I include a non-URI or a URI that does not correspond to an OP in the ontology in my properties file, something silently fails, and the resulting rdf file is empty.

Also: an empty properties file leads to all properties being used. This is potentially useful as a kind of default behavior but maybe surprising.
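One possible behavior, sketched in Python: fail loudly instead of silently producing an empty file. This is a suggestion only, not what relation-graph currently does; resolve_properties and the DECLARED set are hypothetical names for the example.

```python
def resolve_properties(lines, declared):
    """Validate the lines of a properties file against the declared object properties.

    Raises ValueError for an empty file or for entries that are not declared
    object properties, so that problems surface before any reasoning starts.
    """
    requested = [line.strip() for line in lines if line.strip()]
    if not requested:
        raise ValueError("properties file is empty; omit it to use all properties")
    unknown = [p for p in requested if p not in declared]
    if unknown:
        raise ValueError("not declared as object properties: " + ", ".join(unknown))
    return requested


# Hypothetical set of object properties declared in the loaded ontology.
DECLARED = {
    "http://purl.obolibrary.org/obo/BFO_0000050",
    "http://purl.obolibrary.org/obo/BFO_0000051",
}
ok = resolve_properties(["http://purl.obolibrary.org/obo/BFO_0000050\n", "\n"], DECLARED)
```

A softer variant would log a warning and continue with the known properties; either way the user finds out why the output is empty.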

tsv mode: if multiple prefix contractions are possible, choose the longest namespace

given:

UBERON: "http://purl.obolibrary.org/obo/UBERON_"
obo: "http://purl.obolibrary.org/obo/"

either emit an error, as the results are non-deterministic, or (preferably) choose the longest namespace, making the shortest CURIE, i.e., favor UBERON:nnnn
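The preferred behavior can be sketched in a few lines of Python. The contract function is a hypothetical helper written for this example; the prefix map is the one given above.

```python
def contract(iri, prefixes):
    """Contract an IRI using the longest matching namespace, if any."""
    matches = [
        (namespace, prefix)
        for prefix, namespace in prefixes.items()
        if iri.startswith(namespace)
    ]
    if not matches:
        return iri
    # The longest namespace leaves the shortest local part, hence the shortest CURIE.
    namespace, prefix = max(matches, key=lambda match: len(match[0]))
    return f"{prefix}:{iri[len(namespace):]}"


prefixes = {
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
    "obo": "http://purl.obolibrary.org/obo/",
}
uberon_curie = contract("http://purl.obolibrary.org/obo/UBERON_0000955", prefixes)
bfo_curie = contract("http://purl.obolibrary.org/obo/BFO_0000050", prefixes)
```

Because the choice depends only on namespace length, the result no longer depends on the order of entries in the prefixes file.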

vocabulary for different kinds of redundancy/inference at graph and owl level

Capturing in this repo for now. The goal is to have a better categorization of different notions of directness or redundancy of triples in a Relation Graph. Even if these are not implemented by RG immediately, having this vocabulary will help us in discussions of what to include in or exclude from any given user-facing graph.

The overall goal is to be able to classify triples in a relation graph with categories that help inform users as to whether they are redundant and the nature of the redundancy. E.g. for some applications it is critical to be able to separate "one-hop" edges from edges that are redundant with a multi-hop path (classic Transitive Reduction).

Relation Graph Transforms

First we define the relation-graph transform (RGT) between an axiom pattern on the left and a triple on the right:

A subClassOf B where A and B are named classes => A rdfs:subClassOf B
A subClassOf R some B where A and B are named classes => A R B
A equivalentClass B where A and B are named classes => A owl:equivalentClass B

(these are a subset of owlstar, with the interpretation triple excluded)

RGT(O_entailed) = RG
RGT(O_asserted) = RG_asserted

The axiom patterns are assumed to match entailed axioms in the input ontology. The original asserted axioms are labeled O_asserted.
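To make the transform concrete, here is a small Python sketch. Axioms are encoded as tuples purely for illustration (this encoding, and the rgt and relation_graph names, are assumptions of the sketch): a named class is a string and an existential restriction is ("some", R, B).

```python
RDFS_SUBCLASS_OF = "rdfs:subClassOf"
OWL_EQUIVALENT_CLASS = "owl:equivalentClass"


def is_named(expression):
    """Named classes are plain strings in this toy encoding."""
    return isinstance(expression, str)


def rgt(axiom):
    """Map one axiom to a triple, or None if it matches none of the three patterns."""
    kind, sub, sup = axiom
    if not is_named(sub):
        return None
    if kind == "SubClassOf" and is_named(sup):
        return (sub, RDFS_SUBCLASS_OF, sup)
    if (
        kind == "SubClassOf"
        and isinstance(sup, tuple)
        and len(sup) == 3
        and sup[0] == "some"
        and is_named(sup[2])
    ):
        return (sub, sup[1], sup[2])
    if kind == "EquivalentClasses" and is_named(sup):
        return (sub, OWL_EQUIVALENT_CLASS, sup)
    return None


def relation_graph(axioms):
    """Apply the transform to every axiom and keep the triples that result."""
    return {triple for triple in map(rgt, axioms) if triple is not None}


axioms = [
    ("SubClassOf", "A", "B"),
    ("SubClassOf", "A", ("some", "R", "B")),
    ("EquivalentClasses", "A", "B"),
    ("SubClassOf", "A", ("some", "R", ("some", "S", "B"))),  # nested: no triple
]
triples = relation_graph(axioms)
```

Feeding in the entailed axioms gives RG and feeding in only the asserted ones gives RG_asserted; the function itself is the same in both cases.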

A possible extension is to include properties and individuals in the graph, we omit this for now

The set of axioms in O that do not match any LHS pattern is called O' (e.g. RBox axioms, logical axioms with nesting or using other constructs)

We use the term "triple" when talking about the output RG triple, and "axiom" when talking about the input OWL ontology axiom. We assume only semantics at the OWL/axiom level and structure at the triple level.

Any triple in the RG must correspond to an axiom (asserted or entailed) in the input ontology according to the transforms above

Each triple can be categorized according to one or more entailment/redundancy categories below

Additional assumption: the input OWL is coherent. Preprocessing may be applied to reach this state.

Triple categories

A triple t is categorized based on characteristics of its corresponding axiom a, which matches one of the patterns above

asserted

The triple comes from an asserted axiom on the LHS, i.e., the triple matches an axiom in O_asserted

entailed

The triple comes from an entailed axiom on the LHS

note that entailed is not disjoint from asserted. Every asserted triple is necessarily an entailed triple

entailment trivially holds for all edges but we include here for completeness, e.g. we can use it to categorize unseen edges

An example of an entailed triple is A partOf C, given A partOf B, B partOf C, where partOf is transitive in O

(if partOf is not transitive, then the triple is not entailed and hence not in RG)

reflexive

iff A=B

Note that reflexive triples SHOULD NOT be asserted for globally reflexive or locally reflexive properties, as the input ontology SHOULD NOT assert these. Presence of these indicates the input MAY come from an incorrectly configured robot.

For non-reflexive properties, reflexive triples MAY be asserted. For example, from an asserted axiom neuron subClassOf connected-to some neuron

root-tip-tautological (aka taut)

Either A=Nothing or B=Thing or P=topProperty or P=bottomProperty

A default configuration of RG MAY be to always exclude tautologies, and the nodes Thing and Nothing

entailment-direct

Given a triple t0 A R B, corresponding to axiom a0,

t0 is entailment-direct if there exists no triple t1 corresponding to axiom a1 such that O-a1 |= a0

this has the effect of excluding both two-hops over transitive predicates, as well as one-hops with more general predicates

graph-onehop

A triple A R B is graph-onehop if there exists no pair of triples A _ Z, Z _ B

Note this is predicate-blind. It should not be assumed non-onehops encode no useful information. E.g.

A sub C [2-hop]
A partOf B
B sub C

If the 2-hop is excluded, this can remove information that cannot be recapitulated elsewhere, and will impede operations such as inference of a category for a node by subClass traversal
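A predicate-blind check along these lines is easy to sketch in Python. One assumption beyond the text: the intermediate node Z is required to differ from both A and B, otherwise any reflexive triple in the graph would turn every edge into a 2-hop. The is_graph_onehop name is made up for this example.

```python
def is_graph_onehop(triple, triples):
    """True if no intermediate node Z exists with A _ Z and Z _ B (predicates ignored)."""
    subject, _, obj = triple
    out_of_subject = {o for s, _, o in triples if s == subject}
    into_object = {s for s, _, o in triples if o == obj}
    # Assumption: Z must be a third node, so reflexive triples do not count.
    intermediates = (out_of_subject & into_object) - {subject, obj}
    return not intermediates


# The three triples from the example above.
triples = {
    ("A", "sub", "C"),
    ("A", "partOf", "B"),
    ("B", "sub", "C"),
}
onehop_flags = {t: is_graph_onehop(t, triples) for t in triples}
```

On the example, A sub C is flagged as a non-onehop even though no other triple recovers the subclass link, which is the information loss the note above warns about.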

graph-Nhop

Generalization of above

non-redundant predicate

a triple A R B (corresponding to axiom a0) is a nrpred iff there is no triple

t1 A R' B (corresponding to a1) such that R' != R and

O-a0 |= a0 and NOT: O-(a0 + a1) |= a0

Examples

Single hop subproperty example

input owl

A sub partOf some B
partOf sub overlaps
reflexive(partOf)

entailed owl

A sub partOf some B
partOf sub overlaps
overlaps sub topProperty
A sub Thing
B sub Thing
A sub A
B sub B
Nothing sub Nothing
Nothing sub A
Nothing sub B
bottomProperty sub partOf
bottomProperty sub overlaps
A sub partOf some A
B sub partOf some B
A sub overlaps some A
B sub overlaps some B
A sub topProperty some A
B sub topProperty some B
A sub overlaps some B
A sub topProperty some B

(we omit irrelevant patterns as the set is of infinite size, e.g. if we include unions)

triples in RG annotated with categories (omitting 'entailed' categorization)

A partOf B: asserted, entailment-direct, graph-onehop, nrpred
A overlaps B: non-entailment-direct, graph-onehop, redundant-pred
A sub Thing: taut, entailment-direct, graph-onehop
B sub Thing: taut, non-entailment-direct, non-onehop
Nothing sub A: taut, non-entailment-direct, non-onehop
Nothing sub B: entailment-direct, graph-onehop
A partOf A: reflexive
B partOf B: reflexive
A sub A: reflexive
B sub B: reflexive
A overlaps A: reflexive
B overlaps B: reflexive
A topProp A: reflexive, taut
B topProp B: reflexive, taut

Note that if we select ONLY non-taut, entailment-direct from RG we get a single triple A partOf B

This is what a "typical user" might expect from the graph

Transitivity 2-hop example

input owl

A sub partOf some B
B sub partOf some C
transitive(partOf)

triples (excluding tauts and reflexive for brevity)

A partOf B: asserted, entailment-direct, graph-onehop
B partOf C: asserted, entailment-direct, graph-onehop
A partOf C: non-entailment-direct, graph-2hop

contrast with next example

Non-transitivity 2-hop example

input owl

A sub adjacentTo some B
B sub adjacentTo some C

triples (excluding tauts and reflexive for brevity)

A adjacentTo B: asserted, entailment-direct, graph-onehop
B adjacentTo C: asserted, entailment-direct, graph-onehop

Note this triple is NOT in RG:

A adjacentTo C: NON-ENTAILED, non-entailment-direct, graph-2hop

Property-chain example

input owl

A sub negReg some B
B sub negReg some C
negReg sub reg
posReg sub reg
negReg o negReg -> posReg

triples (excluding tauts and reflexive for brevity)

A negReg B: asserted, entailment-direct, graph-onehop
B negReg C: asserted, entailment-direct, graph-onehop

A reg B: non-entailment-direct, graph-onehop
B reg C: non-entailment-direct, graph-onehop

A reg C: non-entailment-direct, graph-2hop, redundant-pred
A posReg C: non-entailment-direct, graph-2hop, nrpred

The intuition here is that even though A posReg C is a 2-hop and can be entailed from existing triple axioms, it is in some sense interesting in that it is a non-redundant pred
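The entailed triples in this example can be reproduced with a tiny rule-based saturation in Python. This is a sketch over named-class triples only, covering just subproperty and two-step chain rules; it is not a substitute for an OWL reasoner and is not how RG computes its output. The saturate function and its argument shapes are assumptions of the sketch.

```python
def saturate(asserted, sub_properties, chains):
    """Apply subproperty and property-chain rules until nothing new is derived.

    sub_properties: set of (sub, super) property pairs
    chains: dict mapping (p, q) to the property implied by the chain p o q
    """
    triples = set(asserted)
    while True:
        derived = set()
        for a, p, b in triples:
            # A p B and p sub q  =>  A q B
            for sub, sup in sub_properties:
                if p == sub:
                    derived.add((a, sup, b))
            # A p B, B q C and p o q -> r  =>  A r C
            for b2, q, c in triples:
                if b2 == b and (p, q) in chains:
                    derived.add((a, chains[(p, q)], c))
        if derived <= triples:
            return triples
        triples |= derived


asserted = {("A", "negReg", "B"), ("B", "negReg", "C")}
sub_properties = {("negReg", "reg"), ("posReg", "reg")}
chains = {("negReg", "negReg"): "posReg"}
entailed = saturate(asserted, sub_properties, chains)
```

Transitivity is the special case where a property chains with itself, e.g. {("partOf", "partOf"): "partOf"}, so the same function also reproduces the difference between the transitive and non-transitive 2-hop examples above.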

provide more user friendly error message if ontology is incoherent

wget http://purl.obolibrary.org/obo/cheminf.owl -O owl/cheminf.owl
relation-graph --ontology-file owl/cheminf.owl --output-file inferences/cheminf-inf.ttl

makes a giant stack trace. scroll up and you can see the ontology is incoherent

consider: non-strict option that will drop all constraining axioms if ontology is incoherent. it is awkward to do this with robot as part of a pipeline, and sometimes we just want the graph walking regardless of some bfo nonsense at the top

provide rdf/owl output option

ttl is more compact but rdf/xml can be used with a wider variety of tools, e.g. rdftab

it's trivial to convert outside RG e.g using riot, but this adds plumbing complexity

Consider outputting certain "blank nodes"

For a logical definition like this:

HP:Astrocytosis EquivalentTo RO:has_part some (PATO:increased_rate and (RO:inheres_in_part_of some (GO:cell_growth and (RO:occurs_in some CL:astrocyte))) and (RO:has_modifier some PATO:abnormal))

The term HP:Astrocytosis will have outgoing edges only where they are inferred to have named class targets, for example RO:has_part PATO:increased_rate, and any property chain inferences that result from the other nested targets.

Creating nodes for all the sub-expressions would allow more complete query, such as:

?s has_part:/inheres_in_part_of: cell_growth: .

Remove non-redundant output option

The non-redundant output is not really non-redundant. The code could be simplified if it's just removed. It works better to filter redundancy after the fact, e.g., https://github.com/INCATools/ubergraph/blob/master/prune.dl.

Add safe profile for use when combining ontologies

Use case: combine multiple linked ontologies, use RG to determine formal ancestor relationships; avoid accidentally inserting intra-ontology is-as. See: OWL is not modular

This can be done by preprocessing using robot remove with --axioms "equivalent disjoint annotation" as well as Domain and Range axioms (can't figure out how to do that in robot). It would be convenient to have this as a "graph walker" profile in RG, to avoid preprocessing. This will yield correct n-hop edges while not overwriting vetted assertions from source ontologies.

This issue may subsume:

#127

relation-graph fails to exit for an ontology with no declared object properties

Reported by @cmungall.

Output is not valid n-triples

Latest versions of RG use the Turtle "a" shorthand for rdf:type

This isn't valid n-triples:
https://www.w3.org/TR/n-triples/#grammar-production-IRIREF

Now, strictly speaking this is not a problem. The documentation gives no guarantee about the format, and the output is valid turtle.

However, it looks sufficiently like n-triples that it fools some tools - for example:

➜  semantic-sql git:(adding-sources) ✗ cat foo 
<http://www.reactome.org/biopax/81/48887#CellularLocationVocabulary88> a <http://www.biopax.org/release/biopax-level3.owl#CellularLocationVocabulary> .
➜  semantic-sql git:(adding-sources) ✗ riot foo
11:54:09 ERROR riot            :: [line: 1, col: 72] Expected IRI: Got: [KEYWORD:a]

For jena, this is obviated by either giving a file suffix OR with an explicit --syntax

However, it still seems a bit of a bait and switch - either use idiomatic nested turtle or use pure n-triples.

It may also be the case that forcing use of a turtle parser would be slower (just checked: with Jena, parsing as turtle takes 5% longer than using an n-triples parser with explicit rdf:types, nbd, but other toolchains may be different)

some of this is maybe irrelevant with a TSV format which will allow guaranteed streaming parsing
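As a stopgap on the consumer side, the shorthand can be expanded line by line. The Python sketch below assumes the output has exactly one triple per line with an IRI or blank node as subject, which is how the sample above looks; it is not a general Turtle parser, and expand_a is a made-up helper name.

```python
import re

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Subject (IRI or blank node) followed by the bare keyword "a" as predicate.
_A_PREDICATE = re.compile(r"^(\s*(?:<[^>]*>|_:\S+))\s+a\s+")


def expand_a(line):
    """Replace the Turtle 'a' predicate on a one-triple line with the full rdf:type IRI."""
    return _A_PREDICATE.sub(lambda m: f"{m.group(1)} {RDF_TYPE} ", line, count=1)


line = (
    "<http://www.reactome.org/biopax/81/48887#CellularLocationVocabulary88> a "
    "<http://www.biopax.org/release/biopax-level3.owl#CellularLocationVocabulary> ."
)
expanded = expand_a(line)
```

Lines whose predicate is already a full IRI do not match the pattern and pass through unchanged.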

Add rdf:types to relation-graph

Example:

@prefix : <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:test rdf:type owl:Ontology .
:b rdfs:subClassOf :a .
:c rdfs:subClassOf :b .
:i1 rdf:type :c .

Add an option for triples for i instantiating all 3 classes

Needless to say this should be types at the OWL level - we don't want type owl:Class type owl:NamedIndividual cluttering the output
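For the example above, the expected output can be sketched in Python by walking the subclass hierarchy upward from each asserted type. The inferred_types function is hypothetical and only handles named classes linked by subClassOf; a real implementation would take the types from the reasoner.

```python
def inferred_types(asserted_types, subclass_of):
    """Return (individual, 'rdf:type', class) for asserted types and all their superclasses.

    asserted_types: dict mapping each individual to its asserted classes
    subclass_of: set of (sub, super) pairs between named classes
    """
    parents = {}
    for sub, sup in subclass_of:
        parents.setdefault(sub, set()).add(sup)

    result = set()
    for individual, classes in asserted_types.items():
        stack = list(classes)
        seen = set()
        while stack:
            cls = stack.pop()
            if cls in seen:
                continue
            seen.add(cls)
            result.add((individual, "rdf:type", cls))
            stack.extend(parents.get(cls, ()))
    return result


# The example ontology: c subClassOf b subClassOf a, and i1 is asserted to be a c.
types = inferred_types({":i1": {":c"}}, {(":b", ":a"), (":c", ":b")})
```

Only OWL-level class membership is emitted here, so no owl:Class or owl:NamedIndividual typing triples appear in the result.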

Include ObjectPropertyAssertion axioms as triples

lightweight adoption of owlstar graph-level annotations

While RG is awesome, some things I worry about:

it requires a fair bit of tacit knowledge to understand both the operations and the relevance of the target output (materialized existential relations)
there is no metadata in the derived graph that tells machines how to interpret the triples

I think the use of a README works fine in a local context but things can potentially get confusing when RG triples are remixed back into triples following the standard OWL-RDF serialization. This also has implications for ubergraph - the knowledge that certain queries are transferrable from ubergraph to other triplestores is currently also tacit, rather than having queryable metadata in the triplestore.

I think this could be mitigated by the adoption of some kind of standard. As far as I know the closest we have is https://github.com/cmungall/owlstar

The main anticipated use of owlstar is to annotate individual triples with OWL interpretations, as an alternative to the standard OWL-RDF mapping. Now obviously this would result in massive inflation in size of already massive turtle files (somewhat less if turtlestar is used, but still).

However, there are other ways to achieve the same thing, for example, annotating the graph the triples are contained in; e.g.

<g> os:hasTriplePattern os:SubClassOfSomeValuesFrom , os:TypeSomeValuesFrom, os:SubClassBetweenNamedClasses

this graph-level mechanism would also be a good way of communicating other tacit knowledge about the graph - e.g. that it must include reflexive entailed axioms, that materialization was performed for a certain subset of triples (this bleeds into OMO work, cc @matentzn).

The metadata can be probed by clients to determine if certain kinds of queries are applicable. It can also serve as a mechanism for humans to go and lookup in a standard place what we mean by a triple UBERON:1 BFO:2 UBERON:3

We may still want to switch off graph-annotations by default (it is handy in local contexts to assume that all triples represent edges)

There would still be work to do - e.g algebra of compositions for merging different RGs. But this can come later.

There may be other ways to do this - e.g SHACL shapes?

output triples with individuals as subjects

Add a new command-line option to include individuals. Also add an option to exclude class subjects.

cc @dosumis @hkir-dev

update build instructions

Install sbt (Scala Build Tool) on your system. For Mac OS X, it is easily done using Homebrew: brew install sbt. sbt requires a working Java installation, but you do not need to otherwise install Scala.

After sbt is installed, run sbt stage to create the executable. The executable is created in your target/universal/stage/bin directory.

After running sbt stage I get

✗ sbt stage
[info] welcome to sbt 1.6.2 (Homebrew Java 17.0.2)
....
....
[info] Fetched artifacts of
[info] compiling 2 Scala sources to /Users/cjm/repos/relation-graph/core/target/scala-2.13/classes ...
[info] Non-compiled module 'compiler-bridge_2.13' for Scala 2.13.8. Compiling...
[info]   Compilation completed in 6.29s.
[info] compiling 4 Scala sources to /Users/cjm/repos/relation-graph/cli/target/scala-2.13/classes ...
[info] Main Scala API documentation to /Users/cjm/repos/relation-graph/cli/target/scala-2.13/api...
[info] Main Scala API documentation successful.
[info] Main Scala API documentation to /Users/cjm/repos/relation-graph/core/target/scala-2.13/api...
[info] Wrote /Users/cjm/repos/relation-graph/core/target/scala-2.13/relation-graph_2.13-2.2.1.pom
[info] Main Scala API documentation successful.

✗ ls -alt target/universal/stage/bin
total 56
drwxr-xr-x  4 cjm  staff    128 Feb 11 14:57 .
-rwxr--r--  1 cjm  staff  14579 Feb 11 14:57 relation-graph
-rw-r--r--  1 cjm  staff  10648 Feb 11 14:57 relation-graph.bat
drwxr-xr-x  4 cjm  staff    128 Feb  4 16:47 ..
