incatools / boomer
Bayesian OWL ontology merging
Home Page: https://incatools.github.io/boomer/
License: BSD 3-Clause "New" or "Revised" License
This would be consistent with the treatment of 0 as impossible.
In
We add obojson output, which can be fed into obographviz, but we don't really explain any of this in the README.
We also need general docs on how to explore the different output files.
The docs say the output.txt file looks like this, but there is no output.txt: when not specified on the command line, it doesn't seem to be generated. I'd expect an --output arg:
boomer --ptable probs.tsv --ontology slim-exposure.obo --window-count 10 --runs 20 --prefixes prefixes.yaml --output boomer.txt
2020.04.09 14:57:04 [ERROR] org.monarchinitiative.boomer.Main.$anonfun.applyOrElse:60:60 - Unrecognized argument: --output
cc @wdduncan
Prefix declarations are needed for the input TSV. Right now inferred equivalents within these namespaces are also prohibited. @matentzn would like to be able to enable/disable that feature more selectively.
in addition to the ptable, also accept SSSOM mappings
cc @cmungall
when looking at pngs, it is useful to be able to see which mappings were rejected
I think the easiest way is to include a triple for SiblingOf calls (there is a way to represent properSiblingOf in OWL but it's verbose)
For example given a ptable where one mapping is likely to be interpreted as siblingOf:
A:1 B:1 0.01 0.01 0.05 0.93
A:2 B:1 0.01 0.01 0.95 0.03
I get output that is very useful, where each mapping is traceable; however, the JSON doesn't have the first mapping, and thus the PNG also lacks it (and also loses the A:1 node altogether).
I just noticed that the documentation doesn't appear on the home page of GitHub.
The functionality for this may go in sssom-py, but it seems logical to put anything involving probabilistic calculations into an issue here.
Currently boomer assumes the user specifies priors for all 4 possibilities.
What if we have a file containing a mapping with a probability specified for only one interpretation? In this case we should use standard probability axioms to calculate the other probabilities, based on priors for the probability of any mapping having a particular interpretation.
E.g. assuming global priors
P(equiv) = 0.8
P(sub) = 0.05
P(sup) = 0.05
P(sib) = 0.1
assume sssom contains equiv statement with confidence 0.4
P(sub | equiv) = 0.0
P(sub | NOT equiv) = P(NOT equiv | sub) * P(sub) / P(NOT equiv)
                   = (1 * 0.05) / 0.2
                   = 0.25
therefore posterior p(sub) = 0.4 * 0 + 0.6 * 0.25 = 0.15
all posteriors:
p(equiv) = 0.4
p(sub) = 0.15
p(sup) = 0.15
p(sib) = 0.3
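The worked example above can be sketched as a small function. This is an illustrative Python sketch of the proposed calculation, not boomer's actual implementation; the function name and argument shapes are assumptions:

```python
def fill_posteriors(priors, known_interp, confidence):
    """Given global priors over the four interpretations and a single
    stated confidence for one interpretation, distribute the remaining
    probability mass over the other interpretations in proportion to
    their priors (i.e. renormalise P(other) over NOT-known)."""
    rest = 1.0 - confidence
    denom = sum(p for k, p in priors.items() if k != known_interp)
    return {
        k: (confidence if k == known_interp else rest * p / denom)
        for k, p in priors.items()
    }

priors = {"equiv": 0.8, "sub": 0.05, "sup": 0.05, "sib": 0.1}
# SSSOM row asserts equiv with confidence 0.4
posteriors = fill_posteriors(priors, "equiv", 0.4)
# -> equiv 0.4, sub 0.15, sup 0.15, sib 0.3 (up to float rounding)
```

This reproduces the posteriors derived above: the 0.6 of non-equiv mass is split 0.05 : 0.05 : 0.1 among sub, sup and sib.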
Right now, on a large run we output all cliques, including those with just two members. These seem useful neither for visualisation nor for human review.
If I need to see the JSON or PNG of the report associated with an entry in output.md, I cannot directly find it without some grep on the command line. Discussed with Jim having an entry in the output.md file itself linking to the corresponding JSON/PNG file.
boomer exits with a non-zero exit code all the time; this is non-standard and hampers normal Unix-style workflows.
currently we report the posterior probability of the solution, and the prior probability of each axiom
we should report the posterior of each axiom. We do this by taking the sum of the probabilities of all solutions that include that axiom, and dividing by the sum of the probabilities of solutions that have a different interpretation.
If not all solutions are explored this will be an estimate. The estimate will be biased in favor of axioms with higher probabilities, as lower-probability solutions may never be explored. We can account for this with a prior probability that the estimate is biased:
Pr(Axiom) = IsBiasedPrior * AxiomPrior + (1-IsBiasedPrior) * AxiomEstimatedPosterior
We can estimate IsBiasedPrior based on how many times the axiom or its alternatives were explored in the overall search tree.
We can also have strategies to minimize bias. E.g for each potential axiom we start at least one search with that as our initial choice.
This will require new functionality in whelk.
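The bias correction above can be sketched as follows. This is a Python illustration only; in particular the 1/(1+n) shrinkage for IsBiasedPrior is an assumed, illustrative choice for "estimate IsBiasedPrior from exploration counts", not boomer's or whelk's actual scheme:

```python
def corrected_axiom_posterior(axiom_prior, estimated_posterior,
                              n_explorations, k=1.0):
    """Blend the raw axiom prior with the search-based posterior estimate:
    Pr(Axiom) = IsBiasedPrior * AxiomPrior
              + (1 - IsBiasedPrior) * AxiomEstimatedPosterior.
    IsBiasedPrior shrinks toward 0 the more often the axiom (or its
    alternatives) was explored; k/(k+n) is an illustrative form."""
    is_biased_prior = k / (k + n_explorations)
    return (is_biased_prior * axiom_prior
            + (1 - is_biased_prior) * estimated_posterior)

# never explored: fall back entirely on the prior
p0 = corrected_axiom_posterior(0.8, 0.0, n_explorations=0)
# heavily explored: trust the search estimate almost entirely
p1 = corrected_axiom_posterior(0.8, 0.95, n_explorations=99)
```

With no explorations the result equals the prior (0.8); after 99 explorations it is dominated by the estimate (0.01 * 0.8 + 0.99 * 0.95 = 0.9485).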
Follow on from
It would be great to have -v, -vv, etc. to see where boomer is at. I have a process that has been running for 3 days and I have no idea if it's stuck in a particular clique or is generally slow because of ontology size etc.
I could just go in and add some printf statements, but maybe there is a better way...
Ideally there would also be stats added to the clique report, e.g. this clique took 5 mins, this one took 2 s, etc.
But primarily we can't wait for the report if something is hanging; we need a verbosity option.
For end-users of boomer.
This issue is just so I can link to some discussion while I make other tickets. The question is what boomer should do when faced with an enormous clique:
I would like boomer to at least try 2, but it's hard to do this in a principled manner. Maybe you have a better idea @balhoff?
for text files see #157.
Given:
(in each case, the only other possibility is siblingOf)
note each class is in a separate prefix space, so there is no penalty for equivalence between any pair
Solutions:
boomer generally selects {1} depending on params, but never the optimal solution.
I am pretty sure I have not made a typo: I put each class in its own ID space, so it is not avoiding 2 or 3 (which would happen if A/B/C were in the same ID space).
boomer -p prefixes.yaml -w 100 -r 1000 -t ptable.tsv --ontology logical.omn
...
2021.02.05 09:23:19:376 [zio-def...] [INFO ] org.monarchinitiative.boomer.Main.program:49 - Most probable: 0.0024750000000000015
...
$ more output.txt
A:1 SiblingOf B:1 0.05
B:1 SiblingOf C:1 0.05
A:1 ProperSubClassOf C:1 (most probable) 0.99
triad.tsv:
X:1 Y:1 0.01 0.01 0.97 0.01
Y:1 Z:1 0.01 0.01 0.97 0.01
X:1 Z:1 0.2 0.2 0.01 0.59
empty.rdf:
prefix X: <http://purl.obolibrary.org/obo/X_>
prefix Y: <http://purl.obolibrary.org/obo/Y_>
prefix Z: <http://purl.obolibrary.org/obo/Z_>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
X:1 a owl:Class .
prefixes.yaml:
X: http://purl.obolibrary.org/obo/X_
Y: http://purl.obolibrary.org/obo/Y_
Z: http://purl.obolibrary.org/obo/Z_
run:
$ boomer --ptable triad.tsv --ontology empty.rdf --prefixes prefixes.yaml --runs 5 --window-count 2
2020.04.27 19:57:25 [INFO] org.monarchinitiative.boomer.Boom.evaluate:21:18 - Bin size: 3; Most probable: 0.97
2020.04.27 19:57:25 [INFO] org.monarchinitiative.boomer.Boom.evaluate:24:16 - Max possible joint probability: -0.5885511570517892
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -4.666088600957509
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -4.666088600957509
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -5.163262135555172
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -5.163262135555172
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -5.163262135555172
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Main.$anonfun:42:34 - Most probable: -6.731742884949919
2020.04.27 19:57:27 [INFO] org.monarchinitiative.boomer.Main.$anonfun:57:34 - 5s
output.txt:
X:1 EquivalentTo Z:1 false
Y:1 EquivalentTo Z:1 true
X:1 EquivalentTo Y:1 true
this is incoherent, as equivalence is symmetric and transitive.
I think this is just an error in reporting, because we have
output.ofn:
# Class: <http://purl.obolibrary.org/obo/X_1> (<http://purl.obolibrary.org/obo/X_1>)
SubClassOf(<http://purl.obolibrary.org/obo/X_1> <http://purl.obolibrary.org/obo/Y_1>)
SubClassOf(<http://purl.obolibrary.org/obo/X_1> <http://purl.obolibrary.org/obo/Z_1>)
# Class: <http://purl.obolibrary.org/obo/Y_1> (<http://purl.obolibrary.org/obo/Y_1>)
SubClassOf(<http://purl.obolibrary.org/obo/Y_1> <http://purl.obolibrary.org/obo/X_1>)
SubClassOf(<http://purl.obolibrary.org/obo/Y_1> <http://purl.obolibrary.org/obo/Z_1>)
# Class: <http://purl.obolibrary.org/obo/Z_1> (<http://purl.obolibrary.org/obo/Z_1>)
SubClassOf(<http://purl.obolibrary.org/obo/Z_1> <http://purl.obolibrary.org/obo/X_1>)
SubClassOf(<http://purl.obolibrary.org/obo/Z_1> <http://purl.obolibrary.org/obo/Y_1>)
When I run this through robot reason -A EquivalentClass -i output.ofn -s true -o ..., I get the expected:
EquivalentClasses(<http://purl.obolibrary.org/obo/X_1> <http://purl.obolibrary.org/obo/Y_1> <http://purl.obolibrary.org/obo/Z_1>)
Assume we start with:
A:1 A:2 0.1 0.0 0.0 0.0
B:2 B:1 0.1 0.0 0.0 0.0
A:1 B:1 0.0 0.0 1.0 0.0
A:2 B:2 0.0 0.0 1.0 0.0
this is the same as:
A:1 sub A:2
B:2 sub B:1
A:1 = B:1
A:2 = B:2
i.e. the relationship between 1 and 2 is flipped between A and B, yet they are equivalent. This is unsat if we add the
yields:
Method: singletons
Score: 0.0
Estimated probability: 1.0
Confidence: 1.0
Subsequent scores (max 10):
Which is odd. It looks like it's rejecting p=1.0 axioms, but in fact it's accepting them:
Prefix(:=<urn:unnamed:ontology#ont1>)
Prefix(owl:=<http://www.w3.org/2002/07/owl#>)
Prefix(rdf:=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
Prefix(xml:=<http://www.w3.org/XML/1998/namespace>)
Prefix(xsd:=<http://www.w3.org/2001/XMLSchema#>)
Prefix(rdfs:=<http://www.w3.org/2000/01/rdf-schema#>)
Ontology(<urn:unnamed:ontology#ont1>
Declaration(Class(<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d>))
Declaration(Class(<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e>))
Declaration(Class(<http://example.org/A/1>))
Declaration(Class(<http://example.org/A/2>))
Declaration(Class(<http://example.org/B/1>))
Declaration(Class(<http://example.org/B/2>))
############################
# Classes
############################
# Class: <http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d> (<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d>)
SubClassOf(<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d> <http://example.org/B/1>)
DisjointClasses(<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d> <http://example.org/B/2>)
# Class: <http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e> (<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e>)
SubClassOf(<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e> <http://example.org/A/2>)
DisjointClasses(<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e> <http://example.org/A/1>)
# Class: <http://example.org/A/1> (<http://example.org/A/1>)
EquivalentClasses(<http://example.org/A/1> <http://example.org/B/1>)
SubClassOf(<http://example.org/A/1> <http://example.org/A/2>)
# Class: <http://example.org/A/2> (<http://example.org/A/2>)
EquivalentClasses(<http://example.org/A/2> <http://example.org/B/2>)
# Class: <http://example.org/B/2> (<http://example.org/B/2>)
SubClassOf(<http://example.org/B/2> <http://example.org/B/1>)
)
which is unsat:
Right now if two classes are forced to not be subclasses in either direction, this mapping won't be depicted in any way in the output images.
I kind of get that this tool is an entity mapper between 2 ontologies? Does it also report on logical inconsistencies when it is 100% sure of an entity match?
Consider this:
We should communicate this clearly to the user. The question is: should this be implemented as a mapping-set diff? I.e. boomer exports an SSSOM mapping table alongside the OWL file, and we show the user an SSSOM diff between the input mappings and the output? Or should boomer say something in nicely formatted markdown?
If you like I can have @hrshdhgd do this using either sphinx or mkdocs (I would do mkdocs with a material theme)
However, there may be a preferred way of making docs for scala projects that exposes the scala API
The first pass would be just to break the README into a couple of pages:
then we can add more docs for different workflows etc
The mapping integration workflow (as opposed to QC, #333) is about effective integration of new mappings into an ontology while maintaining consistency. The goal is to be able to rapidly slurp up existing mappings (almost) without the need for human review.
Equivalence mappings are represented as mutual subclass axioms in whelk. ROBOT can be used to generate OWL equivalentClass axioms.
See #39.
For working with boomer more seamlessly, it would be great if it could output the results as a table:
mapping_set_id: https://w3id.org/boomer/s8169763872632786387263
license: https://creativecommons.org/publicdomain/zero/1.0/
curie_map:
UBERON: http...
FMA: http...
subject_id | subject_label | predicate_id | object_id | object_label | confidence | mapping_justification
---|---|---|---|---|---|---
UBERON:123 | heart | skos:exactMatch | FMA:321 | heart | 0.9 | semapv:UnspecifiedMatching
UBERON:123 | soul | skos:exactMatch | FMA:321 | human soul | 0.9 | semapv:UnspecifiedMatching
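Serializing such a table in SSSOM style (a '#'-commented YAML metadata block followed by a TSV body) could look like the sketch below. The helper name is hypothetical and this is not sssom-py; it only illustrates the file layout implied by the example above:

```python
import csv
import io

def write_sssom_tsv(metadata, columns, rows):
    """Hypothetical helper: emit SSSOM-style output, i.e. YAML metadata
    in '#'-prefixed comment lines followed by a plain TSV table."""
    buf = io.StringIO()
    for key, value in metadata.items():
        if isinstance(value, dict):  # nested block, e.g. curie_map
            buf.write(f"#{key}:\n")
            for k, v in value.items():
                buf.write(f"#  {k}: {v}\n")
        else:
            buf.write(f"#{key}: {value}\n")
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()

doc = write_sssom_tsv(
    {"mapping_set_id": "https://w3id.org/boomer/s8169763872632786387263",
     "license": "https://creativecommons.org/publicdomain/zero/1.0/",
     "curie_map": {"UBERON": "http://purl.obolibrary.org/obo/UBERON_"}},
    ["subject_id", "subject_label", "predicate_id", "object_id",
     "object_label", "confidence", "mapping_justification"],
    [["UBERON:123", "heart", "skos:exactMatch", "FMA:321", "heart",
      0.9, "semapv:UnspecifiedMatching"]],
)
```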
These will be fed to obographviz for visualization.
Add an axiom annotation using <https://w3id.org/kgviz/width> to indicate probability (prob * 10); e.g. for Pr(E)=0.7, do:
{
  "sub": "GO:123",
  "pred": "owl:equivalentClass",
  "obj": "RHEA:456",
  "meta": {
    "basicPropertyValues": [
      {
        "pred": "https://w3id.org/kgviz/width",
        "val": 7
      }
    ]
  }
}
alternatively we could output directly to dot, but working with clusters is a bit of a hassle. ogv has methods for customizing via stylesheets
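Building such an edge with the width annotation could be done with a small helper. Python sketch; the function name is illustrative, and the obographs edge shape follows the JSON example above:

```python
def edge_with_width(sub, pred, obj, probability):
    """Build an obographs-style edge carrying a kgviz width annotation,
    with width = probability * 10 (rounded), per the convention above."""
    return {
        "sub": sub,
        "pred": pred,
        "obj": obj,
        "meta": {
            "basicPropertyValues": [
                {"pred": "https://w3id.org/kgviz/width",
                 "val": round(probability * 10)}
            ]
        },
    }

edge = edge_with_width("GO:123", "owl:equivalentClass", "RHEA:456", 0.7)
# edge["meta"]["basicPropertyValues"][0]["val"] -> 7
```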
Hey @balhoff! I ran boomer on MONDO with just exactMatches and generated reports as seen here. @matentzn and I just wanted to understand what these images mean. I'm going to throw random images here just to get the ball rolling:
cc: @cmungall
Hi,
I have been using boomer for some ontology merging tests, and sometimes I get a "No possible resolution of perplexity" message. Then boomer stops without producing any result.
What does it mean exactly, and how can I solve the issue?
I am using the binary "boomer-0.2" version.
I have attached the input ptable and the ontology (which contains the two ontologies I am trying to merge) that lead to this issue.
The ontologies being merged are VIDO (Virus Infectious Disease Ontology) and IDO (Infectious Disease Ontology).
The ptable file is an arbitrary probabilistic reinterpretation of an alignment generated with LogMap (I have converted the LogMap mappings into a ptable).
I have also attached the prefixes.yaml file.
This is my command line (launched on Windows 11):
boomer --ptable logmap-mappings-converted-to-ptable.tsv --ontology union-ido-vido-owl-functional-syntax.ofn --window-count 1 --runs 100 --prefixes prefixes.yaml --output boomer_output
Thanks for helping
_BOOMER-INPUT-DATA.zip
E.g. if the txt output is
A:1 ProperSubClassOf B:1 (most probable) 0.7
A:2 ProperSuperClassOf B:2 (most probable) 0.7
A:3 EquivalentTo B:3 (most probable) 0.7
A:4 SiblingOf B:4 (most probable) 0.7
then add annotations to axioms 1-3. TBD: predicate for probability? Biolink?
Also:
currently no axiom is emitted for 4. Can we emit an annotation assertion? It is useful to be explicit about what was not inferred
Also:
Annotations on ontology:
The mapping QC workflow is about reviewing the existing mappings on an ongoing basis. The idea is to review the bottom N clusters once per month and thereby implement an ongoing cycle of ever-improving mappings.
Note, no mappings are generated by this workflow; that is part of another issue.
cluster-X.png and cluster-X.md
semapv:MappingReview
justification, which is separately curated from the existing mapping. If need be, the existing mapping will be changed as well. This will be used to generate confidence scores for input M. There should never be more than 10 issues open. Ideally we can somehow recognise for a given cluster that an issue already exists (by parsing its title for the hashcode boomer provides).

Hello,
I am trying to merge 14 ontologies at once with Boomer: DERMO, DO, HUGO, ICDO, IDO, IEDB, MESH, MFOMD, MPATH, NCIT, OBI, OGMS, ORPHANET and SCDO.
This is how I proceed:
I have run various tests and it seems that when the ptable is too large, the problem becomes intractable.
By removing MESH and NCIT (i.e. now I try to merge 12 ontologies), the resulting union ontology is only 81K classes (242 MB) and the ptable contains only 7K entries. In this case, Boomer ends with a result in 30 min (on an i7 @ 1.90 GHz with 32 GB RAM).
But I also need the MESH and the NCIT ontologies to be included in my merge result.
Overall, I am wondering if that's the correct way to proceed?
Here follow some questions:
Should I continue with this strategy?
-> Should I keep trying to merge all at once, in order to give Boomer complete decision power on selecting the best mappings (without introducing any bias)?
Or should I change my merging strategy?
-> Should I split the problem into smaller sub-problems,
-> then organize them in some order (according to some criteria), which could introduce some bias,
-> and launch Boomer following this order?
For example, I could try this :
- I convert the 91 alignments into 91 ptables (instead of converting and merging them into 1 single ptable)
- For each of the 91 ptables
----> I launch Boomer with this ptable and the union OWL file.
----> In the union OWL file, I add all the equivalence axioms generated by Boomer for this ptable.
So far, it seems to work much faster.
But the problem is that the arbitrary order in the for-loop introduces a bias: each equivalence axiom added at one step will influence Boomer's results in the next steps.
Any suggestions ?
Oliver
PS: I couldn't attach the Boomer input union ontology (compressed ~140 MB) since the maximum attachment size is 25 MB. However, the input ptable is here: ptable-91-mappings.zip.
Currently, for P(A|H) we assume a uniform probability, except in the case where the ontology O is incoherent.
We want P(A|H) to be higher when the pre-existing axioms A are justified by the hypothetical axioms.
Consider:
A:
classes: cat, felis, mammal, mammalia
cat SubClassOf mammal
felis SubClassOf mammalia
H:
Pr(cat=felis) = 0.5
Pr(mammal=mammalia) = 0.5
(here we may be trying to align two terminologies, a formal and common one, but that is not strictly relevant for this example)
Under the existing boomer posterior probability calculation, as specified in the kboom paper, all 4 solutions have equal posterior probability.
Intuitively we would like to "reward" the selection of { cat=felis, mammal=mammalia }, not just because of our prior knowledge or guesses based on labels, but because the two hierarchies mutually support one another: the fact that cat is-a mammal justifies that felis is-a mammalia when the two equivalence axioms are assumed.
conversely, consider
A:
classes: cat, felis, mammal, mammalia, octopus
cat SubClassOf mammal
H:
Pr(cat=octopus) = 0.5
Pr(mammal=mammalia) = 0.5
Again, using the existing algorithm, all 4 combos have equal posterior probability. However, here we want to weigh against the solution { cat=octopus, mammal=mammalia }, not because of our prior knowledge, but because there was no assertion that octopus is a mammalia. If we believe cat=octopus, then this entails an entirely new fact that was not asserted.
I'm open to ideas on how to incorporate this. I think the latter case may be faster to compute. Just as we make a UNA for pre-populating implicit NotEquivalent axioms between classes in a single ontology set, we can make a probabilistic OWA assumption: if an input sub-ontology does not entail an axiom (where the signature of the axiom is a subset of the sub-ontology signature), then we assign a low probability to that axiom. We might think of this intuitively as the alignment 'disrupting' an ontology by introducing new entailments.
For the former case, this can be posed in terms of the concept of Justification in the DL literature. This may be quite expensive to compute in the general case. See also ontodev/robot#528
A more efficient less complete solution would be to look for "justified squares":
d1 subClassOf[direct] c1
c1 = c2
d1 = d2
d2 subClassOf+ c2
entailed by H plus A, but not entailed by A alone (just calculate all justified squares from A in advance of running the tree search and subtract this set).
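The "justified squares" pattern can be sketched as a simple lookup over precomputed subsumptions. Illustrative Python, not whelk; the data structures (edge sets and a hypothesis-equivalence map) are assumptions made for the example:

```python
def justified_squares(direct_sub, sub_closure, hyp_equiv):
    """Find 'justified squares': pairs where d1 SubClassOf[direct] c1
    holds in A, and the hypothesised equivalents d2 = hyp_equiv[d1],
    c2 = hyp_equiv[c1] already satisfy d2 SubClassOf+ c2 in A.

    direct_sub:  set of (child, parent) asserted direct subclass edges
    sub_closure: set of (child, ancestor) transitive subclass pairs
    hyp_equiv:   dict mapping a class to its hypothesised equivalent"""
    squares = []
    for d1, c1 in direct_sub:
        d2 = hyp_equiv.get(d1)
        c2 = hyp_equiv.get(c1)
        if d2 is not None and c2 is not None and (d2, c2) in sub_closure:
            squares.append((d1, c1, d2, c2))
    return squares

# the cat/felis example: the two hierarchies support one another
sq = justified_squares(
    direct_sub={("cat", "mammal")},
    sub_closure={("felis", "mammalia")},
    hyp_equiv={"cat": "felis", "mammal": "mammalia"},
)
# sq -> [("cat", "mammal", "felis", "mammalia")]
```

In the octopus example no square exists, since ("octopus", "mammalia") is not in the closure, so that solution earns no reward.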
I can't currently think of a principled way to go from this metric to P(A|H). If we treat the final posterior probability only as a ranking rather than an absolute value, this is less important.
We can also give an ad-hoc confidence score which is the probability of the selected solution divided by the next best (infinite if there is only one solution). Intuitively, if the two best solutions are close we have lower confidence, if it's 100x more likely we can have high confidence.
Note that we can also obtain a more accurate estimate of the probability by dividing the probability of a solution by the sum of the probabilities of all solutions. This will be higher than the simple joint probability when some solutions have a posterior probability of zero, i.e. unsats.
As a trivial example, given an ontology
A Equiv B
and prior probabilities
Pr(A Equiv B) = 0.1
a naive calculation gives Pr=0.1 for the full ontology. However, there are no other possibilities, so we are forced to update our prior belief.
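Both ideas (renormalisation over found solutions, and the ad-hoc best/second-best confidence ratio) are tiny computations. A Python sketch, with function names chosen for the example:

```python
import math

def normalized_posteriors(joint_probs):
    """Renormalise joint probabilities over the coherent solutions
    actually found; unsat solutions contribute zero mass."""
    total = sum(joint_probs)
    return [p / total for p in joint_probs]

def adhoc_confidence(joint_probs):
    """Ratio of the best solution to the second best; infinite when
    only one solution exists (the trivial-example case above)."""
    ranked = sorted(joint_probs, reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return math.inf
    return ranked[0] / ranked[1]

# trivial example: Pr(A Equiv B) = 0.1 is the only coherent solution,
# so the normalised probability is 1.0 and confidence is infinite
probs = [0.1]
norm = normalized_posteriors(probs)   # -> [1.0]
conf = adhoc_confidence(probs)        # -> inf
```

With two solutions at 0.5 and 0.005, the confidence would be 100, i.e. high confidence in the winner.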
See #267 (comment).
I ran boomer on our ECTO ontologies. The output.txt lists a number of equivalences:
https://github.com/EnvironmentOntology/environmental-exposure-ontology/blob/issue-97/src/mapping/output.txt
XCO:0000105 EquivalentTo ECTO:0002049 true
NCIT:C920 EquivalentTo XCO:0000625 true
NCIT:C119053 EquivalentTo XCO:0000266 true
XCO:0000042 EquivalentTo NCIT:C44462 true
ECTO:0002048 EquivalentTo XCO:0000094 true
PECO:0000059 EquivalentTo XCO:0000088 true
XCO:0000346 EquivalentTo NCIT:C645 true
XCO:0000038 EquivalentTo NCIT:C61398 true
...
But in the axioms-boomer.obo file these are is_a relations:
https://github.com/EnvironmentOntology/environmental-exposure-ontology/blob/issue-97/src/mapping/axioms-boomer.obo
id: ECTO:0002049
is_a: XCO:0000105
id: XCO:0000105
is_a: ECTO:0002049
Logically, ECTO:0002049 and XCO:0000105 would be inferred to be equivalent. Is it possible for you to run the reasoner on the output before creating the output.ofn?
cc @cmungall