incatools / boomer
Bayesian OWL ontology merging
Home Page: https://incatools.github.io/boomer/
License: BSD 3-Clause "New" or "Revised" License
This would be consistent with the treatment of 0 as impossible.
In
We add obojson output, which can be fed into obographviz, but we don't really explain any of this in the README.
We also need general docs on how to explore the different output files.
The docs say the output.txt file looks like this, but there is no output.txt: when not specified on the command line, it doesn't seem to be generated. I'd expect an --output arg:
boomer --ptable probs.tsv --ontology slim-exposure.obo --window-count 10 --runs 20 --prefixes prefixes.yaml --output boomer.txt
2020.04.09 14:57:04 [ERROR] org.monarchinitiative.boomer.Main.$anonfun.applyOrElse:60:60 - Unrecognized argument: --output
cc @wdduncan
Prefix declarations are needed for the input TSV. Right now inferred equivalents within these namespaces are also prohibited. @matentzn would like to be able to enable/disable that feature more selectively.
in addition to the ptable, also accept SSSOM mappings
cc @cmungall
when looking at pngs, it is useful to be able to see which mappings were rejected
I think the easiest way is to include a triple for SiblingOf calls (there is a way to represent properSiblingOf in OWL but it's verbose)
For example given a ptable where one mapping is likely to be interpreted as siblingOf:
A:1 B:1 0.01 0.01 0.05 0.93
A:2 B:1 0.01 0.01 0.95 0.03
I get output that is very useful, where each mapping is traceable; however, the JSON doesn't have the first mapping, and thus the PNG also lacks it (and also loses the A:1 node altogether).
I just noticed that the documentation doesn't appear on the home page of GitHub.
The functionality for this may go in sssom-py, but it seems logical to put anything involving probabilistic calculations into an issue here.
Currently boomer assumes the user specifies priors for all 4 possibilities.
What if we have a file containing a mapping with a probability specified for only one interpretation? In this case we should use standard probability axioms to calculate the other probabilities, based on priors for the probability of any mapping having a particular interpretation.
E.g. assuming global priors
P(equiv) = 0.8
P(sub) = 0.05
P(sup) = 0.05
P(sib) = 0.1
assume sssom contains equiv statement with confidence 0.4
P(sub | equiv) = 0.0
P(sub | NOT equiv) = P(NOT equiv | sub) * P(sub) / P(NOT equiv)
                   = (1 * 0.05) / 0.2
                   = 0.25
therefore posterior p(sub) = 0.4 * 0 + 0.6 * 0.25 = 0.15
all posteriors:
p(equiv) = 0.4
p(sub) = 0.15
p(sup) = 0.15
p(sib) = 0.3
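The worked example above can be sketched as a small function. This is an illustrative Python sketch of the proposed calculation, not boomer's actual implementation; the function name and argument shapes are assumptions:

```python
def fill_posteriors(priors, known_interp, confidence):
    """Given global priors over the four interpretations and a single
    stated confidence for one interpretation, distribute the remaining
    probability mass over the other interpretations in proportion to
    their priors (i.e. renormalise P(other) over NOT-known)."""
    rest = 1.0 - confidence
    denom = sum(p for k, p in priors.items() if k != known_interp)
    return {
        k: (confidence if k == known_interp else rest * p / denom)
        for k, p in priors.items()
    }

priors = {"equiv": 0.8, "sub": 0.05, "sup": 0.05, "sib": 0.1}
# SSSOM row asserts equiv with confidence 0.4
posteriors = fill_posteriors(priors, "equiv", 0.4)
# -> equiv 0.4, sub 0.15, sup 0.15, sib 0.3 (up to float rounding)
```

This reproduces the posteriors derived above: the 0.6 of non-equiv mass is split 0.05 : 0.05 : 0.1 among sub, sup and sib.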
Right now, on a large run we output all cliques, including those with just two members. These seem useful neither for visualisation nor for human review.
If I need to see the JSON or PNG of the report associated with an entry in output.md, I cannot directly find it without some grep on the command line. Discussed with Jim having an entry in the output.md file itself linking to the corresponding JSON/PNG file.
boomer exits with a non-zero exit code all the time; this is non-standard and hampers normal Unix-style workflows.
currently we report the posterior probability of the solution, and the prior probability of each axiom
we should report the posterior of each axiom. We do this by taking the sum of the probabilities of all solutions that include that axiom, and dividing by the sum of the probabilities of solutions that have a different interpretation.
If not all solutions are explored this will be an estimate. The estimate will be biased in favor of axioms with higher probabilities, as lower-probability solutions may never be explored. We can account for this with a prior probability that the estimate is biased:
Pr(Axiom) = IsBiasedPrior * AxiomPrior + (1-IsBiasedPrior) * AxiomEstimatedPosterior
We can estimate IsBiasedPrior based on how many times the axiom or its alternatives were explored in the overall search tree.
We can also have strategies to minimize bias. E.g for each potential axiom we start at least one search with that as our initial choice.
This will require new functionality in whelk.
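The bias correction above can be sketched as follows. This is a Python illustration only; in particular the 1/(1+n) shrinkage for IsBiasedPrior is an assumed, illustrative choice for "estimate IsBiasedPrior from exploration counts", not boomer's or whelk's actual scheme:

```python
def corrected_axiom_posterior(axiom_prior, estimated_posterior,
                              n_explorations, k=1.0):
    """Blend the raw axiom prior with the search-based posterior estimate:
    Pr(Axiom) = IsBiasedPrior * AxiomPrior
              + (1 - IsBiasedPrior) * AxiomEstimatedPosterior.
    IsBiasedPrior shrinks toward 0 the more often the axiom (or its
    alternatives) was explored; k/(k+n) is an illustrative form."""
    is_biased_prior = k / (k + n_explorations)
    return (is_biased_prior * axiom_prior
            + (1 - is_biased_prior) * estimated_posterior)

# never explored: fall back entirely on the prior
p0 = corrected_axiom_posterior(0.8, 0.0, n_explorations=0)
# heavily explored: trust the search estimate almost entirely
p1 = corrected_axiom_posterior(0.8, 0.95, n_explorations=99)
```

With no explorations the result equals the prior (0.8); after 99 explorations it is dominated by the estimate (0.01 * 0.8 + 0.99 * 0.95 = 0.9485).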
Follow on from
It would be great to have -v, -vv, etc. to see where boomer is at. I have a process that has been running for 3 days and I have no idea if it's stuck in a particular clique or is generally slow because of ontology size etc.
I could just go in and add some printf statements, but maybe there is a better way...
Ideally there would also be stats added to the clique report, e.g. this clique took 5 mins, this one took 2 s, etc.
But primarily we can't wait for the report if something is hanging; we need a verbosity option.
For end-users of boomer.
This issue is just so I can link to some discussion while I make other tickets. The question is what boomer should do when faced with an enormous clique:
I would like boomer to at least try 2, but it's hard to do this in a principled manner. Maybe you have a better idea @balhoff?
for text files see #157.
Given:
(in each case, the only other possibility is siblingOf)
note each class is in a separate prefix space, so there is no penalty for equivalence between any pair
Solutions:
boomer generally selects {1} depending on params, but never the optimal solution.
I am pretty sure I have not made a typo: I put each class in its own ID space, so it is not avoiding 2 or 3 (which would happen if A/B/C were in the same ID space).
boomer -p prefixes.yaml -w 100 -r 1000 -t ptable.tsv --ontology logical.omn
...
2021.02.05 09:23:19:376 [zio-def...] [INFO ] org.monarchinitiative.boomer.Main.program:49 - Most probable: 0.0024750000000000015
...
$ more output.txt
A:1 SiblingOf B:1 0.05
B:1 SiblingOf C:1 0.05
A:1 ProperSubClassOf C:1 (most probable) 0.99
triad.tsv:
X:1 Y:1 0.01 0.01 0.97 0.01
Y:1 Z:1 0.01 0.01 0.97 0.01
X:1 Z:1 0.2 0.2 0.01 0.59
empty.rdf:
prefix X: <http://purl.obolibrary.org/obo/X_>
prefix Y: <http://purl.obolibrary.org/obo/Y_>
prefix Z: <http://purl.obolibrary.org/obo/Z_>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
X:1 a owl:Class .
prefixes.yaml:
X: http://purl.obolibrary.org/obo/X_
Y: http://purl.obolibrary.org/obo/Y_
Z: http://purl.obolibrary.org/obo/Z_
run:
$ boomer --ptable triad.tsv --ontology empty.rdf --prefixes prefixes.yaml --runs 5 --window-count 2
2020.04.27 19:57:25 [INFO] org.monarchinitiative.boomer.Boom.evaluate:21:18 - Bin size: 3; Most probable: 0.97
2020.04.27 19:57:25 [INFO] org.monarchinitiative.boomer.Boom.evaluate:24:16 - Max possible joint probability: -0.5885511570517892
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -4.666088600957509
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -4.666088600957509
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -5.163262135555172
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -5.163262135555172
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -5.163262135555172
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Main.$anonfun:42:34 - Most probable: -6.731742884949919
2020.04.27 19:57:27 [INFO] org.monarchinitiative.boomer.Main.$anonfun:57:34 - 5s
output.txt:
X:1 EquivalentTo Z:1 false
Y:1 EquivalentTo Z:1 true
X:1 EquivalentTo Y:1 true
this is incoherent, as equivalence is symmetric and transitive.
I think this is just an error in reporting, because we have
output.ofn:
# Class: <http://purl.obolibrary.org/obo/X_1> (<http://purl.obolibrary.org/obo/X_1>)
SubClassOf(<http://purl.obolibrary.org/obo/X_1> <http://purl.obolibrary.org/obo/Y_1>)
SubClassOf(<http://purl.obolibrary.org/obo/X_1> <http://purl.obolibrary.org/obo/Z_1>)
# Class: <http://purl.obolibrary.org/obo/Y_1> (<http://purl.obolibrary.org/obo/Y_1>)
SubClassOf(<http://purl.obolibrary.org/obo/Y_1> <http://purl.obolibrary.org/obo/X_1>)
SubClassOf(<http://purl.obolibrary.org/obo/Y_1> <http://purl.obolibrary.org/obo/Z_1>)
# Class: <http://purl.obolibrary.org/obo/Z_1> (<http://purl.obolibrary.org/obo/Z_1>)
SubClassOf(<http://purl.obolibrary.org/obo/Z_1> <http://purl.obolibrary.org/obo/X_1>)
SubClassOf(<http://purl.obolibrary.org/obo/Z_1> <http://purl.obolibrary.org/obo/Y_1>)
When I run this through robot reason -A EquivalentClass -i output.ofn -s true -o ..., I get the expected:
EquivalentClasses(<http://purl.obolibrary.org/obo/X_1> <http://purl.obolibrary.org/obo/Y_1> <http://purl.obolibrary.org/obo/Z_1>)
Assume we start with:
A:1 A:2 0.1 0.0 0.0 0.0
B:2 B:1 0.1 0.0 0.0 0.0
A:1 B:1 0.0 0.0 1.0 0.0
A:2 B:2 0.0 0.0 1.0 0.0
this is the same as:
A:1 sub A:2
B:2 sub B:1
A:1 = B:1
A:2 = B:2
i.e. the relationship between 1 and 2 is flipped between A and B, yet they are equivalent. This is unsat if we add the
yields:
Method: singletons
Score: 0.0
Estimated probability: 1.0
Confidence: 1.0
Subsequent scores (max 10):
Which is odd. It looks like it's rejecting p=1.0 axioms, but in fact it's accepting them:
Prefix(:=<urn:unnamed:ontology#ont1>)
Prefix(owl:=<http://www.w3.org/2002/07/owl#>)
Prefix(rdf:=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
Prefix(xml:=<http://www.w3.org/XML/1998/namespace>)
Prefix(xsd:=<http://www.w3.org/2001/XMLSchema#>)
Prefix(rdfs:=<http://www.w3.org/2000/01/rdf-schema#>)
Ontology(<urn:unnamed:ontology#ont1>
Declaration(Class(<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d>))
Declaration(Class(<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e>))
Declaration(Class(<http://example.org/A/1>))
Declaration(Class(<http://example.org/A/2>))
Declaration(Class(<http://example.org/B/1>))
Declaration(Class(<http://example.org/B/2>))
############################
# Classes
############################
# Class: <http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d> (<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d>)
SubClassOf(<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d> <http://example.org/B/1>)
DisjointClasses(<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d> <http://example.org/B/2>)
# Class: <http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e> (<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e>)
SubClassOf(<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e> <http://example.org/A/2>)
DisjointClasses(<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e> <http://example.org/A/1>)
# Class: <http://example.org/A/1> (<http://example.org/A/1>)
EquivalentClasses(<http://example.org/A/1> <http://example.org/B/1>)
SubClassOf(<http://example.org/A/1> <http://example.org/A/2>)
# Class: <http://example.org/A/2> (<http://example.org/A/2>)
EquivalentClasses(<http://example.org/A/2> <http://example.org/B/2>)
# Class: <http://example.org/B/2> (<http://example.org/B/2>)
SubClassOf(<http://example.org/B/2> <http://example.org/B/1>)
)
which is unsat:
Right now if two classes are forced to not be subclasses in either direction, this mapping won't be depicted in any way in the output images.
I kind of get that this tool is an entity mapper between 2 ontologies? Does it also report on logical inconsistencies when it is 100% sure of an entity match?
Consider this:
We should communicate this clearly to the user. The question is: should this be implemented as a mapping-set diff? I.e. boomer exports an SSSOM mapping table alongside the OWL file, and we show the user an SSSOM diff between the input mappings and the output? Or should boomer say something in nicely formatted markdown?
If you like I can have @hrshdhgd do this using either sphinx or mkdocs (I would do mkdocs with a material theme)
However, there may be a preferred way of making docs for scala projects that exposes the scala API
The first pass would be just to break the README into a couple of pages:
then we can add more docs for different workflows etc
The mapping integration workflow (as opposed to QC, #333) is about effective integration of new mappings into an ontology while maintaining consistency. The goal is to be able to rapidly slurp up existing mappings (almost) without the need for human review.
Equivalence mappings are represented as mutual subclass axioms in whelk. ROBOT can be used to generate OWL equivalentClass axioms.
See #39.
For working with boomer more seamlessly, it would be great if it could output the results as a table:
mapping_set_id: https://w3id.org/boomer/s8169763872632786387263
license: https://creativecommons.org/publicdomain/zero/1.0/
curie_map:
UBERON: http...
FMA: http...
subject_id | subject_label | predicate_id | object_id | object_label | confidence | mapping_justification
---|---|---|---|---|---|---
UBERON:123 | heart | skos:exactMatch | FMA:321 | heart | 0.9 | semapv:UnspecifiedMatching
UBERON:123 | soul | skos:exactMatch | FMA:321 | human soul | 0.9 | semapv:UnspecifiedMatching
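Serializing such a table in SSSOM style (a '#'-commented YAML metadata block followed by a TSV body) could look like the sketch below. The helper name is hypothetical and this is not sssom-py; it only illustrates the file layout implied by the example above:

```python
import csv
import io

def write_sssom_tsv(metadata, columns, rows):
    """Hypothetical helper: emit SSSOM-style output, i.e. YAML metadata
    in '#'-prefixed comment lines followed by a plain TSV table."""
    buf = io.StringIO()
    for key, value in metadata.items():
        if isinstance(value, dict):  # nested block, e.g. curie_map
            buf.write(f"#{key}:\n")
            for k, v in value.items():
                buf.write(f"#  {k}: {v}\n")
        else:
            buf.write(f"#{key}: {value}\n")
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()

doc = write_sssom_tsv(
    {"mapping_set_id": "https://w3id.org/boomer/s8169763872632786387263",
     "license": "https://creativecommons.org/publicdomain/zero/1.0/",
     "curie_map": {"UBERON": "http://purl.obolibrary.org/obo/UBERON_"}},
    ["subject_id", "subject_label", "predicate_id", "object_id",
     "object_label", "confidence", "mapping_justification"],
    [["UBERON:123", "heart", "skos:exactMatch", "FMA:321", "heart",
      0.9, "semapv:UnspecifiedMatching"]],
)
```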
These will be fed to obographviz for visualization.
Add an axiom annotation using <https://w3id.org/kgviz/width> to indicate probability (prob * 10); e.g. for Pr(E)=0.7, do:
{
  "sub": "GO:123",
  "pred": "owl:equivalentClass",
  "obj": "RHEA:456",
  "meta": {
    "basicPropertyValues": [
      {
        "pred": "https://w3id.org/kgviz/width",
        "val": 7
      }
    ]
  }
}
alternatively we could output directly to dot, but working with clusters is a bit of a hassle. ogv has methods for customizing via stylesheets
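Building such an edge with the width annotation could be done with a small helper. Python sketch; the function name is illustrative, and the obographs edge shape follows the JSON example above:

```python
def edge_with_width(sub, pred, obj, probability):
    """Build an obographs-style edge carrying a kgviz width annotation,
    with width = probability * 10 (rounded), per the convention above."""
    return {
        "sub": sub,
        "pred": pred,
        "obj": obj,
        "meta": {
            "basicPropertyValues": [
                {"pred": "https://w3id.org/kgviz/width",
                 "val": round(probability * 10)}
            ]
        },
    }

edge = edge_with_width("GO:123", "owl:equivalentClass", "RHEA:456", 0.7)
# edge["meta"]["basicPropertyValues"][0]["val"] -> 7
```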
Hey @balhoff! I ran boomer on MONDO with just exactMatches and generated reports as seen here. @matentzn and I just wanted to understand what these images mean. I'm going to throw random images here just to get the ball rolling:
cc: @cmungall
Hi,
I have been using boomer for some ontology merging tests, and sometimes I get a "No possible resolution of perplexity" message. Then boomer stops without producing any result.
What does it mean exactly, and how can I solve the issue?
I am using the binary "boomer-0.2" version.
I have attached the input ptable and the ontology (which contains the two ontologies I am trying to merge) that lead to this issue.
The ontologies being merged are VIDO (Virus Infectious Disease Ontology) and IDO (Infectious Disease Ontology).
The ptable file is an arbitrary probabilistic reinterpretation of an alignment generated with LogMap (I have converted the LogMap mappings into a ptable).
I have also attached the prefixes.yaml file.
This is my command line (launched on Windows 11):
boomer --ptable logmap-mappings-converted-to-ptable.tsv --ontology union-ido-vido-owl-functional-syntax.ofn --window-count 1 --runs 100 --prefixes prefixes.yaml --output boomer_output
Thanks for helping
_BOOMER-INPUT-DATA.zip
E.g. if the txt output is
A:1 ProperSubClassOf B:1 (most probable) 0.7
A:2 ProperSuperClassOf B:2 (most probable) 0.7
A:3 EquivalentTo B:3 (most probable) 0.7
A:4 SiblingOf B:4 (most probable) 0.7
then add annotations to axioms 1-3. TBD: predicate for probability? Biolink?
Also:
currently no axiom is emitted for 4. Can we emit an annotation assertion? It is useful to be explicit about what was not inferred
Also:
Annotations on ontology:
The mapping QC workflow is about reviewing the existing mappings on an ongoing basis. The idea is to review the bottom N clusters once per month and thereby implement an ongoing cycle of ever-improving mappings.
Note, no mappings are generated by this workflow; that is part of another issue.
cluster-X.png and cluster-X.md
semapv:MappingReview
justification, which is separately curated from the existing mapping. If need be, the existing mapping will be changed as well. This will be used to generate confidence scores for input M. There should never be more than 10 issues open. Ideally we can somehow recognise for a given cluster that an issue already exists (by parsing its title for the hashcode boomer provides).

Hello,
I am trying to merge 14 ontologies at once with Boomer: DERMO, DO, HUGO, ICDO, IDO, IEDB, MESH, MFOMD, MPATH, NCIT, OBI, OGMS, ORPHANET and SCDO.
This is how I proceed:
I have run various tests and it seems that when the ptable is too large, the problem becomes intractable.
By removing MESH and NCIT (i.e. now I try to merge 12 ontologies), the resulting union ontology is only 81K classes (242 MB) and the ptable contains only 7K entries. In this case, Boomer ends with a result in 30 min (on an i7 @ 1.90 GHz with 32 GB RAM).
But I also need the MESH and the NCIT ontologies to be included in my merge result.
Overall, I am wondering if that's the correct way to proceed?
Here follow some questions:
Should I continue with this strategy?
-> Should I keep trying to merge all at once, in order to give Boomer complete decision power on selecting the best mappings (without introducing any bias)?
Or should I change my merging strategy?
-> Should I split the problem into smaller sub-problems,
-> then organize them in some order (according to some criteria), which could introduce some bias,
-> and launch Boomer following this order?
For example, I could try this :
- I convert the 91 alignments into 91 ptables (instead of converting and merging them into 1 single ptable)
- For each of the 91 ptables
----> I launch Boomer with this ptable and the union OWL file.
----> In the union OWL file, I add all the equivalence axioms generated by Boomer for this ptable.
So far, it seems to work much faster.
But the problem is that the arbitrary order in the for-loop introduces a bias: each equivalence axiom added at one step will influence Boomer's results in the next steps.
Any suggestions ?
Oliver
PS: I couldn't attach the Boomer input union ontology (compressed ~140 MB) since the maximum attachment size is 25 MB. However, the input ptable is here: ptable-91-mappings.zip.
Currently, for P(A|H) we assume a uniform probability, except in the case where the ontology O is incoherent.
We want P(A|H) to be higher when the pre-existing axioms A are justified by the hypothetical axioms.
Consider:
A:
classes: cat, felis, mammal, mammalia
cat SubClassOf mammal
felis SubClassOf mammalia
H:
Pr(cat=felis) = 0.5
Pr(mammal=mammalia) = 0.5
(here we may be trying to align two terminologies, a formal and common one, but that is not strictly relevant for this example)
Under the existing boomer posterior probability calculation, as specified in the kboom paper, all 4 solutions have equal posterior probability.
Intuitively we would like to "reward" the selection of { cat=felis, mammal=mammalia }, not just because of our prior knowledge or guesses based on labels, but because the two hierarchies mutually support one another: the fact that cat is-a mammal justifies that felis is-a mammalia when the two equivalence axioms are assumed.
conversely, consider
A:
classes: cat, felis, mammal, mammalia, octopus
cat SubClassOf mammal
H:
Pr(cat=octopus) = 0.5
Pr(mammal=mammalia) = 0.5
Again, using the existing algorithm, all 4 combos have equal posterior probability. However, here we want to weigh against the solution { cat=octopus, mammal=mammalia }, not because of our prior knowledge, but because there was no assertion that octopus is a mammalia. If we believe cat=octopus, then this entails an entirely new fact that was not asserted.
I'm open to ideas on how to incorporate this. I think the latter case may be faster to compute. Just as we make a UNA for pre-populating implicit NotEquivalent axioms between classes in a single ontology set, we can make a probabilistic OWA assumption: if an input sub-ontology does not entail an axiom (where the signature of the axiom is a subset of the sub-ontology signature), then we assign a low probability to that axiom. We might think of this intuitively as the alignment 'disrupting' an ontology by introducing new entailments.
For the former case, this can be posed in terms of the concept of Justification in the DL literature. This may be quite expensive to compute in the general case. See also ontodev/robot#528
A more efficient less complete solution would be to look for "justified squares":
d1 subClassOf[direct] c1
c1 = c2
d1 = d2
d2 subClassOf+ c2
entailed by H plus A, but not entailed by A alone (just calculate all justified squares from A in advance of running the tree search and subtract this set).
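The "justified squares" pattern can be sketched as a simple lookup over precomputed subsumptions. Illustrative Python, not whelk; the data structures (edge sets and a hypothesis-equivalence map) are assumptions made for the example:

```python
def justified_squares(direct_sub, sub_closure, hyp_equiv):
    """Find 'justified squares': pairs where d1 SubClassOf[direct] c1
    holds in A, and the hypothesised equivalents d2 = hyp_equiv[d1],
    c2 = hyp_equiv[c1] already satisfy d2 SubClassOf+ c2 in A.

    direct_sub:  set of (child, parent) asserted direct subclass edges
    sub_closure: set of (child, ancestor) transitive subclass pairs
    hyp_equiv:   dict mapping a class to its hypothesised equivalent"""
    squares = []
    for d1, c1 in direct_sub:
        d2 = hyp_equiv.get(d1)
        c2 = hyp_equiv.get(c1)
        if d2 is not None and c2 is not None and (d2, c2) in sub_closure:
            squares.append((d1, c1, d2, c2))
    return squares

# the cat/felis example: the two hierarchies support one another
sq = justified_squares(
    direct_sub={("cat", "mammal")},
    sub_closure={("felis", "mammalia")},
    hyp_equiv={"cat": "felis", "mammal": "mammalia"},
)
# sq -> [("cat", "mammal", "felis", "mammalia")]
```

In the octopus example no square exists, since ("octopus", "mammalia") is not in the closure, so that solution earns no reward.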
I can't currently think of a principled way to go from this metric to P(A|H). If we treat the final posterior probability only as a ranking rather than an absolute value, this is less important.
We can also give an ad-hoc confidence score which is the probability of the selected solution divided by the next best (infinite if there is only one solution). Intuitively, if the two best solutions are close we have lower confidence, if it's 100x more likely we can have high confidence.
Note that we can also obtain a more accurate estimate of the probability by dividing the probability of a solution by the sum of the probabilities of all solutions. This will be higher than the simple joint probability when some solutions have a posterior probability of zero, i.e. unsats.
As a trivial example, given an ontology
A Equiv B
and prior probabilities
Pr(A Equiv B) = 0.1
a naive calculation gives Pr=0.1 for the full ontology. However, there are no other possibilities, so we are forced to update our prior belief.
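Both ideas (renormalisation over found solutions, and the ad-hoc best/second-best confidence ratio) are tiny computations. A Python sketch, with function names chosen for the example:

```python
import math

def normalized_posteriors(joint_probs):
    """Renormalise joint probabilities over the coherent solutions
    actually found; unsat solutions contribute zero mass."""
    total = sum(joint_probs)
    return [p / total for p in joint_probs]

def adhoc_confidence(joint_probs):
    """Ratio of the best solution to the second best; infinite when
    only one solution exists (the trivial-example case above)."""
    ranked = sorted(joint_probs, reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return math.inf
    return ranked[0] / ranked[1]

# trivial example: Pr(A Equiv B) = 0.1 is the only coherent solution,
# so the normalised probability is 1.0 and confidence is infinite
probs = [0.1]
norm = normalized_posteriors(probs)   # -> [1.0]
conf = adhoc_confidence(probs)        # -> inf
```

With two solutions at 0.5 and 0.005, the confidence would be 100, i.e. high confidence in the winner.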
See #267 (comment).
I ran boomer on our ECTO ontologies. The output.txt lists a number of equivalences:
https://github.com/EnvironmentOntology/environmental-exposure-ontology/blob/issue-97/src/mapping/output.txt
XCO:0000105 EquivalentTo ECTO:0002049 true
NCIT:C920 EquivalentTo XCO:0000625 true
NCIT:C119053 EquivalentTo XCO:0000266 true
XCO:0000042 EquivalentTo NCIT:C44462 true
ECTO:0002048 EquivalentTo XCO:0000094 true
PECO:0000059 EquivalentTo XCO:0000088 true
XCO:0000346 EquivalentTo NCIT:C645 true
XCO:0000038 EquivalentTo NCIT:C61398 true
...
But in the axioms-boomer.obo file these are is_a relations:
https://github.com/EnvironmentOntology/environmental-exposure-ontology/blob/issue-97/src/mapping/axioms-boomer.obo
id: ECTO:0002049
is_a: XCO:0000105
id: XCO:0000105
is_a: ECTO:0002049
Logically, ECTO:0002049 and XCO:0000105 would be inferred to be equivalent. Is it possible for you to run the reasoner on the output before creating the output.ofn?
cc @cmungall