incatools / boomer Goto Github PK

View Code? Open in Web Editor NEW

26.0 8.0 2.0 1.44 MB

Bayesian OWL ontology merging

Home Page: https://incatools.github.io/boomer/

License: BSD 3-Clause "New" or "Revised" License

Scala 99.89% Makefile 0.11%

owl-ontology boomer obofoundry owl reasoning ontology-merging monarchinitiative mondo geneontology probabilistic-owl

boomer's Introduction

boomer

Bayesian OWL ontology merging

Boomer uses a combined logical and probabilistic approach to translate mappings into logical axioms that can be used to merge ontologies. Boomer implements a search algorithm to find the combined ontology with the highest probability that is also logically coherent. Boomer takes as input ontologies, plus mappings with probabilities for OWL interpretations for each mapping, and produces as output a set of cliques (each representing a group of overlapping mappings which interact with each other when selecting logical interpretations), together with a visualization of each clique. The resulting graphics provide a focused view into problematic tangles of mappings; this lends itself extremely well to an iterative inference process with a curator in the loop.

Boomer is implemented in Scala, and is built on the Whelk OWL reasoner. Whelk uses immutable data structures which allow Boomer to readily roll back to previous reasoning states in the course of its search algorithm. Boomer’s exhaustive search tests axiom combinations in order of decreasing joint probability. For larger cliques Boomer runs many depth-first “greedy” searches, with different starting points searched in parallel.

For some additional background, see this preprint describing an earlier implementation of the software: https://doi.org/10.1101/048843

Usage

Usage: boomer [options]
  --usage  <bool>
        Print usage and exit
  --help | -h  <bool>
        Print help message and exit
  --output | -o  <output files/dir name>
        Name used for folder to ouput clique JSON files; also basename for ontology and Markdown output files.
  --ptable | -t  <filename>
        TSV file containing table of mappings with probabilities.
  --ontology | -a  <filename>
        OWL file containing all asserted/background axioms.
  --prefixes | -p  <filename>
        YAML dictionary of prefix-to-expansion mappings for all prefixes used in the ptable. These namespaces are also used to check for new within-namespace equivalences.
  --window-count | -w  <positive integer>
        Number of groups to split a clique of mappings into when shuffling between greedy search runs. Windows maintain their order; mappings within a window are shuffled.
  --runs | -r  <positive integer>
        Number of separate shuffled runs to conduct for each greedy search.
  --exhaustive-search-limit | -e  <positive integer>
        Maximum size clique for exhaustive search algorithm. Larger cliques use greedy search.
  --output-internal-axioms | -e  <bool>
        Include axioms used to enforce proper subclass relationships (e.g. generated disjoint sibling classes) in OWL output (default false).
  --restrict-output-to-prefixes | -e  <prefix strings (max of 2)>
        Generate output only for cliques where a mapping between these two namespaces was resolved as something other than its highest probability option.

Running

A pre-built copy of Boomer can downloaded from the releases page, e.g., https://github.com/INCATools/boomer/releases/download/v0.1/boomer-0.1.tgz. After unzipping the archive you should see a bin and a lib folder. Keep these together in the same folder, and add the bin folder to your path.

The current command-line interface is specialized for the term mapping use case (see options above).

To run Boomer, you must have Java installed. To set the JVM heap size (usually necessary for processing larger files), use an environment variable:

export JAVA_OPTS=-Xmx10G

Example command line (from https://github.com/geneontology/go-rhea-boom/blob/master/Makefile)

boomer --ptable probs.tsv --ontology go-rhea.ofn --window-count 1 --runs 100 --prefixes prefixes.yaml --output rhea-boom --exhaustive-search-limit 14 --restrict-output-to-prefixes=GO --restrict-output-to-prefixes=RHEA

Building

If you want to build the code yourself, you must first install sbt. Clone the repository and run sbt in the project root folder. A few SBT commands will be useful:

compile: build the code
test: run all tests
stage: create an executable for local use at ./target/universal/stage/bin/boomer
packageZipTarball: package the executable for release

boomer's People

Contributors

Stargazers

Watchers

Forkers

scala-steward stephanieshong

boomer's Issues

Boomer: How to deal with huge cliques?

This issue is just so I can link to some discussion while I make other tickets. The question is what boomer should do when faced with an enormous clique:

Ignore it and spit the clique out for people to break it
Try to break by applying some trivial heuristics (fast ones, like bulk dropping low probability axioms)
Anything else come to mind?

I would like boomer to at least try 2, but its hard to do this in a principled manner. Maybe you have a better idea @balhoff?

Export obojson file for each clique

these will be fed to obographviz for visualization

add an axiom annotation using <https://w3id.org/kgviz/width> to indicate probability, prob * 10, e.g. for Pr(E)=0.7, do:

{
      "sub": "GO:123",
      "pred": "owl:equivalentClass",
      "obj": "RHEA:456"
      "meta": {
        "basicPropertyValues": [
          {
            "pred": "https://w3id.org/kgviz/width",
            "val": 7
          }
        ]
      }
    }

alternatively we could output directly to dot, but working with clusters is a bit of a hassle. ogv has methods for customizing via stylesheets

Supporting the Mapping Integration workflow

The mapping Integration (as opposed to QC #333) workflow is about effective integration of new mappings into an ontology while maintaining consistency. The goal is to be able to rapidly slurp up existing mappings (almost) without the need of human review

Workflow:

Input Ontology O (e.g. Mondo)
Input M:
- Merged set of mappings:
- Internal (existing, verified mappings):
  - Reviewed: 0.99 %
  - Not reviewed 0.95 %
- External (OAK lexmatch, existing mapping sets)
  - Confidence on a case by case basis, configured as part of mapping commons
PT=sssom-py:ptable(M)
{best-guess.sssom.tsv, results.json, |cluster-X.png|, |cluster-X.md}, =boomer(PT, O)
EDIT: I thought we would do a proper human review of questionable clusters here, but maybe we leave this to #333 instead to make this workflow more scalable
difference.sssom.tsv = sssom-py:diff(M, best-guess.sssom.tsv)
Cursory human review of difference.sssom.tsv (eyeballing), no semapv:MappingReview justification added. Links from SSSOM file to related cluster facilitates to effectively review using a nice image (this could be an app one day).
Rejected mappings from the difference.sssom.tsv should be recorded in a "negative.sssom.tsv" mapping file by the curators

New boomer requirements

Output best-guess.sssom.tsv should be sssom #47 and also include a notion of mapping confidence (I didn't get 100% how cluster and mapping confidence should relate in our meeting, but I think you did) and a link to the associated mapping cluster. If there is other metadata you think that can help with the review, you can add it into the comment section.
Most of the stuff in #333

Comments:

"low prior property mapping will be rejected in a high probability clique" (@cmungall)
boomer does not necessarily create a globally coherent outcome model (@balhoff)

JSON filenames that are hashed have no mention in the `output.md` file

If I need to see the JSON or PNG of the report associating with an entry in output.md, I cannot directly find it without some grep on the command line.

Discussed with Jim about having that as an entry in the output.md file itself to track it to the corresponding JSON/PNG file.

Using justification in posterior probability scoring

Currently, for P(A|H ) we assume a uniform probability, except in the case where the ontology O is incoherent.

We want to set P(A|H) be higher when the pre-existing axioms A are justified by the hypothetical axioms.

Consider:

A:
classes: cat, felis, mammal, mammalia
cat SubClassOf mammal
felis SubClassOf mammalia
H:
Pr(cat=felis) = 0.5
Pr(mammal=mammalia) = 0.5

(here we may be trying to align two terminologies, a formal and common one, but that is not strictly relevant for this example)

Under the existing boomer posterior probability calculation as specified in the kboom paper, all 4 solutions have equal posterior probability

intuitively we would like to "reward" the selection of { cat=felis, mammal=mammalia }, not just because of our prior knowledge or guesses based on labels, but on the fact that two hierarchies mutually support one another. The fact that cat isa mammal justiifies that felis isa mammalia when the two equivalence axioms are assumed.

conversely, consider

A:
classes: cat, felis, mammal, mammalia, octopus
cat SubClassOf mammal
H:
Pr(cat=octopus) = 0.5
Pr(mammal=mammalia) = 0.5

Again, using existing algorithm all 4 combos have equal posterior probability. However, here we want to weigh against the solution { cat=octopus, mammal=mammalia} -- not because of our prior knowledge, but because the fact there was no assertion that octopus is a mammalia. If we believe cat=octopus, then this entails an entirely new fact that was not asserted.

I'm open to ideas of how to incorporate this. I think the latter case may be faster compute. Just as we make a UNA for pre-populating implicit NotEquivalent axioms between classes in a single ontology.set, we can make a probabilistic OWA assumption, that that if an input sub-ontology does not entail an axiom (where the signature in the axiom is a subset of the sub-ontology signature), then we assign a low probability for that axiom. We might think of this intuitively as the alignment 'disrupting' an ontology by introducing new entailments.

For the former case, this can be posed in terms of the concept of Justification in the DL literature. This may be quite expensive to compute in the general case. See also ontodev/robot#528

A more efficient less complete solution would be to look for "justified squares":

d1 subClassOf[direct] c1
c1 = c2
d1 = d2
d2 subClassOf+ c2

entailed by H/A, but not entailed by A alone. (just calculate all justified squares from A in advance of running tree search and subtract this set).

I can't currently think of a principled way to go from this metric to P(A|H). If we only treat the final posterior probability as a ranking rather than absolute this is less important.

fail fast if there are no satisfiable solutions

Assume we start with:

A:1	A:2	0.1	0.0	0.0	0.0
B:2	B:1	0.1	0.0	0.0	0.0
A:1	B:1	0.0	0.0	1.0	0.0
A:2	B:2	0.0	0.0	1.0	0.0

this is the same as:

A:1 sub A:2
B:2 sub B:1
A:1 = B:1
A:2 = B:2

i.e. the relationship between 1 and 2 are flipped between A and B, yet they are equivalent. This is unsat if we add the

yields:

SINGLETONS

Method: singletons
Score: 0.0
Estimated probability: 1.0
Confidence: 1.0
Subsequent scores (max 10):

A:2 EquivalentTo B:2 (most probable) 1.0

Which is odd. It looks like it's rejecting p=1.0 axioms, but in fact it's accepting them:

Prefix(:=<urn:unnamed:ontology#ont1>)
Prefix(owl:=<http://www.w3.org/2002/07/owl#>)
Prefix(rdf:=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
Prefix(xml:=<http://www.w3.org/XML/1998/namespace>)
Prefix(xsd:=<http://www.w3.org/2001/XMLSchema#>)
Prefix(rdfs:=<http://www.w3.org/2000/01/rdf-schema#>)


Ontology(<urn:unnamed:ontology#ont1>

Declaration(Class(<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d>))
Declaration(Class(<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e>))
Declaration(Class(<http://example.org/A/1>))
Declaration(Class(<http://example.org/A/2>))
Declaration(Class(<http://example.org/B/1>))
Declaration(Class(<http://example.org/B/2>))
############################
#   Classes
############################

# Class: <http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d> (<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d>)

SubClassOf(<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d> <http://example.org/B/1>)
DisjointClasses(<http://boom.monarchinitiative.org/vocab/DisjointSibling#b5c8477c41201d94c1ab8968e4c9e91f59c6b59d> <http://example.org/B/2>)

# Class: <http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e> (<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e>)

SubClassOf(<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e> <http://example.org/A/2>)
DisjointClasses(<http://boom.monarchinitiative.org/vocab/DisjointSibling#1d6090917442e5bc22d17b586d26b7b4b7d81d5e> <http://example.org/A/1>)

# Class: <http://example.org/A/1> (<http://example.org/A/1>)

EquivalentClasses(<http://example.org/A/1> <http://example.org/B/1>)
SubClassOf(<http://example.org/A/1> <http://example.org/A/2>)

# Class: <http://example.org/A/2> (<http://example.org/A/2>)

EquivalentClasses(<http://example.org/A/2> <http://example.org/B/2>)

# Class: <http://example.org/B/2> (<http://example.org/B/2>)

SubClassOf(<http://example.org/B/2> <http://example.org/B/1>)


)

which is unsat:

What do the images mean?

Hey @balhoff ! I ran boomer on MONDO with just exactMatches and generated reports as seen here. @matentzn and I just wanted to understand what these images mean. I'm going to throw random images here just to get the ball rolling:

From what I hear, images are generated to highlight if existing mappings are faulty. Is this actually the case or fake news :) ?
If so, what could be the possible problems in these images?

cc: @cmungall

What is the best way to communicate changes in axioms to a user

Consider this:

the user provides mappings in a ptable of the "exact" variety
then boomer applies these, but changes some of them to "subclassof"

We should communicate this clearly to the user. The question is: should this be implemented as a mappingset diff? I.e. Boomer exports and sssom mapping table alongside the owlfile and we show a sssom diff to the user between the input mappings and the output? Or should boomer say something in a nicely formatted markdown?

Report posterior probability of each proposed axiom in a solution

currently we report the posterior probability of the solution, and the prior probability of each axiom

we should report the posterior of each axiom. We do this by taking the sum of all solutions that include that axiom, and dividing that by the sum of probabilities of solutions that have a different interpretation.

If not all solutions are explored this will be an estimate. The estimate will be biased in favor of those with higher probabilities, as lower priorities may never be explored. We can account for this with a prior probability that the solution is biased

Pr(Axiom) = IsBiasedPrior * AxiomPrior + (1-IsBiasedPrior) * AxiomEstimatedPosterior

We can estimate IsBiasedPrior based on how many times the axiom or its alternatives were explored in the overall search tree.

We can also have strategies to minimize bias. E.g for each potential axiom we start at least one search with that as our initial choice.

json output should include "sibling" relations, for depiction in graphics

Right now if two classes are forced to not be subclasses in either direction, this mapping won't be depicted in any way in the output images.

Add more documentation on how to explore output

#185

We add obojson output, which can be fed into obographviz - but we don't really explain any of this on the README

Also just general docs on how to explore the different output files

Treat 1.0 probability as required axiom

This would be consistent with the treatment of 0 as impossible.

Include SiblingOf in json output

when looking at pngs, it is useful to be able to see which mappings were rejected

I think the easiest way is to include a triple for SiblingOf calls (there is a way to represent properSiblingOf in OWL but it's verbose)

For example given a ptable where one mapping is likely to be interpreted as siblingOf:

A:1	B:1	0.01	0.01	0.05	0.93
A:2	B:1	0.01	0.01	0.95	0.03

I get output that is very useful, where each mapping is traceable:

A:1 SiblingOf B:1 (most probable) 0.93
A:2 EquivalentTo B:1 (most probable) 0.95

however the json doesn't have the first mapping, and thus the png also lacks it (and also loses the A1 node altogether

equivalence in output.txt but not in ontology

I ran boomer on our ECTO ontologies. The output.txt lists a number of equivalences:
https://github.com/EnvironmentOntology/environmental-exposure-ontology/blob/issue-97/src/mapping/output.txt

XCO:0000105 EquivalentTo ECTO:0002049	true
NCIT:C920 EquivalentTo XCO:0000625	true
NCIT:C119053 EquivalentTo XCO:0000266	true
XCO:0000042 EquivalentTo NCIT:C44462	true
ECTO:0002048 EquivalentTo XCO:0000094	true
PECO:0000059 EquivalentTo XCO:0000088	true
XCO:0000346 EquivalentTo NCIT:C645	true
XCO:0000038 EquivalentTo NCIT:C61398	true
...

But in the axiom-boomer.obo file these are is_a relations:
https://github.com/EnvironmentOntology/environmental-exposure-ontology/blob/issue-97/src/mapping/axioms-boomer.obo

id: ECTO:0002049
is_a: XCO:0000105

id: XCO:0000105
is_a: ECTO:0002049

Logically, ECTO:0002049 and XCO:0000105 would be inferred to be equivalent. Is possible for you run the reasoner on the output before creating the output.ofn?

cc @cmungall

allow providing prefixes declarations separate from list of prohibited within-namespace equivalents

Prefix declarations are needed for the input TSV. Right now inferred equivalents within these namespaces are also prohibited. @matentzn would like to be able to enable/disable that feature more selectively.

Provided more detailed documentation of window-count

For end-users of boomer.

Incoherent results in output.txt with equivalence triads

triad.tsv:

X:1	Y:1	0.01	0.01	0.97	0.01
Y:1	Z:1	0.01	0.01	0.97	0.01
X:1	Z:1	0.2	0.2	0.01	0.59

empty.rdf:

prefix X: <http://purl.obolibrary.org/obo/X_>
prefix Y: <http://purl.obolibrary.org/obo/Y_>
prefix Z: <http://purl.obolibrary.org/obo/Z_>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
prefix owl: <http://www.w3.org/2002/07/owl#>

X:1 a owl:Class .

prefixes.yaml:

X: http://purl.obolibrary.org/obo/X_
Y: http://purl.obolibrary.org/obo/Y_
Z: http://purl.obolibrary.org/obo/Z_

run:

$ boomer --ptable triad.tsv --ontology empty.rdf --prefixes prefixes.yaml --runs 5 --window-count 2
2020.04.27 19:57:25 [INFO] org.monarchinitiative.boomer.Boom.evaluate:21:18 - Bin size: 3; Most probable: 0.97
2020.04.27 19:57:25 [INFO] org.monarchinitiative.boomer.Boom.evaluate:24:16 - Max possible joint probability: -0.5885511570517892
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -4.666088600957509
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -4.666088600957509
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -5.163262135555172
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -5.163262135555172
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Boom.evaluateInOrder:39:20 - Found joint probability: -5.163262135555172
2020.04.27 19:57:26 [INFO] org.monarchinitiative.boomer.Main.$anonfun:42:34 - Most probable: -6.731742884949919
2020.04.27 19:57:27 [INFO] org.monarchinitiative.boomer.Main.$anonfun:57:34 - 5s

output.txt:

X:1 EquivalentTo Z:1    false
Y:1 EquivalentTo Z:1    true
X:1 EquivalentTo Y:1    true

this is incoherent as equivalence is symmetric, transitive

I think this is just an error in reporting, because we have

output.ofn:

# Class: <http://purl.obolibrary.org/obo/X_1> (<http://purl.obolibrary.org/obo/X_1>)

SubClassOf(<http://purl.obolibrary.org/obo/X_1> <http://purl.obolibrary.org/obo/Y_1>)
SubClassOf(<http://purl.obolibrary.org/obo/X_1> <http://purl.obolibrary.org/obo/Z_1>)

# Class: <http://purl.obolibrary.org/obo/Y_1> (<http://purl.obolibrary.org/obo/Y_1>)

SubClassOf(<http://purl.obolibrary.org/obo/Y_1> <http://purl.obolibrary.org/obo/X_1>)
SubClassOf(<http://purl.obolibrary.org/obo/Y_1> <http://purl.obolibrary.org/obo/Z_1>)

# Class: <http://purl.obolibrary.org/obo/Z_1> (<http://purl.obolibrary.org/obo/Z_1>)

SubClassOf(<http://purl.obolibrary.org/obo/Z_1> <http://purl.obolibrary.org/obo/X_1>)
SubClassOf(<http://purl.obolibrary.org/obo/Z_1> <http://purl.obolibrary.org/obo/Y_1>)

When I run through robot reason -A EquivalentClass -i output.ofn -s true -o ... I get the expected

EquivalentClasses(<http://purl.obolibrary.org/obo/X_1> <http://purl.obolibrary.org/obo/Y_1> <http://purl.obolibrary.org/obo/Z_1>)

Boomer markdown output

Boomer markdown output should not entirely obfuscate the IDs (labels with ID in brackets maybe?)
Combined posterior probability for each clique
Some confidence measure (multiplication of the probabilities with least next likely probability)
Some way to go quickly to the images from the markdown output

Add instructions for post-processing output axioms with ROBOT

Equivalence mappings are represented as mutual subclass axioms in whelk. ROBOT can be used to generate OWL equivalentClass axioms.

See #39.

Add additional diagnostics to figure out points where boomer doesn't complete

Follow on from

#292

It would be great to have -v -vv etc to see where boomer is at. I have a process that has been running 3 days and I have no idea if it's stuck in a particular clique or is generally slow because of ontology size etc

I could just go in and add some printf statements but maybe there is a better way..

ideally there would also be stats added to the clique report - e.g. this clique took 5mins, this one took 2s, etc.

But primarily we can't want for the report if something is hanging, need a verbosity option

allow prevention of equivalence inference within a namespace

This will require new functionality in whelk.

Docs: Add link to `README.md`

Overview

I just noticed that the documentation doesn't appear on the home page of GitHub.

"No possible resolution of perplexity" + no results

Hi,

I have been using boomer for some ontology merging tests and sometimes I get a "No possible resolution of perplexity" message. Then boomer stops without producing any result.

What does it mean exactly and how to solve the issue ?

I am using the binary "boomer-0.2" version.

I have attached the input ptable and ontology (that contains the two ontologies that I try to merge) that leads to this issue.

The ontologies merged are the VIDO (Virus Infectious Disease Ontology) and the IDO (Infectious Disease Ontology).

The ptable file is an arbitrary probabilistic reinterpretation of an alignment generated with LOGMAP (I have converted the logmap mappings into a ptable).

I have also attached the prefixes.yaml file.

This is my command line (launched on Windows 11) :

boomer --ptable logmap-mappings-converted-to-ptable.tsv --ontology union-ido-vido-owl-functional-syntax.ofn --window-count 1 --runs 100 --prefixes prefixes.yaml --output boomer_output

Thanks for helping
_BOOMER-INPUT-DATA.zip

output files

The output.txt file looks like this:

there is no output.txt specified on the command line, it doesn't seem to be generated

I'd expect an output arg

boomer --ptable probs.tsv --ontology slim-exposure.obo --window-count 10 --runs 20 --prefixes prefixes.yaml --output boomer.txt
2020.04.09 14:57:04 [ERROR] org.monarchinitiative.boomer.Main.$anonfun.applyOrElse:60:60 - Unrecognized argument: --output

cc @wdduncan

Make a static site for this project

If you like I can have @hrshdhgd do this using either sphinx or mkdocs (I would do mkdocs with a material theme)

However, there may be a preferred way of making docs for scala projects that exposes the scala API

The first pass would be just to break the README into a couple of pages:

about.md
install.md
running.md

then we can add more docs for different workflows etc

Annotate axioms in output ontology with prior probabilities

E.g. if the txt output is

A:1 ProperSubClassOf B:1        (most probable) 0.7
A:2 ProperSuperClassOf B:2      (most probable) 0.7
A:3 EquivalentTo B:3    (most probable) 0.7
A:4 SiblingOf B:4       (most probable) 0.7

then add annotations to axioms 1-3. TBD: predicate for probability? biollink?

Also:

currently no axiom is emitted for 4. Can we emit an annotation assertion? It is useful to be explicit about what was not inferred

Also:

Annotations on ontology:

ontology ID. Get from user?
metadata about the input ontologies, parameters, steps. Consider using a full PROV model. But even just some rdfs:comments would be awesome

Report joint probability of solution, and next best solution

report Pr1Pr2...*Pr_n for the selected solution
also report same number for 2nd best, or M best solutions

We can also give an ad-hoc confidence score which is the probability of the selected solution divided by the next best (infinite if there is only one solution). Intuitively, if the two best solutions are close we have lower confidence, if it's 100x more likely we can have high confidence.

Note that we can also obtain a more accurate estimation of the probability by dividing the probability of a solution by the sum of all probabilities of all solutions. This will be higher than the simple joint probability when there are cases where solutions have a posterior probability of zero ie unsats.

As a trivial example, given an ontology

A Equiv B

and prior probabilities

Pr(A Equiv B) = 0.1

naive calculation gives Pr=0.1 for the full ontology. However, there are no other possibilities so we are forced to update our prior belief

accept SSSOM format

in addition to the ptable, also accept SSSOM mappings

cc @cmungall

Explain what boomer does in a leading paragraph?

I kind of get that this tool is an entity mapper between 2 ontologies? Does it also report on logical inconsistencies when it is 100% sure of an entity match?

--output-internal-axioms is dropping disjointness axioms

See #267 (comment).

Exit code is set to error even when boomer succeeds

boomer exits with a non-zero exit code all the time, this is a bit non-standard and hampers normal unixy workflows

Supporting Mapping QC workflow

The mapping QC workflow is about reviewing the existing mappings on an ongoing basis. The idea is to review the bottom N clusters once per month and thereby implement an ongoing cycle of ever improving mappings.

Note, there is no mappings being generated by this workflow. This is part of another issue.

Workflow:

Input Ontology O
Input M: existing mappings separated into two levels of confidence
- Reviewed: 0.99 %
- Not reviewed 0.95 %
- Key: No new mappings are added
PT=sssom-py:ptable(M)
{results.json, |cluster-X.png|, |cluster-X.md}, =boomer(PT, O)
{BOTTOM_10_CLUSTERS, LEAST_PROBABLE_MAPPINGS} = oak:boomerang(results.json, N)
GitHub Action: make issues for BOTTOM_10_CLUSTERS, including cluster-X.png and cluster-X.md
The reviewer now checks each cluster and _adds a semapv:MappingReview justification, which is separately curated from the existing mapping. If need be the existing mapping will be changed as well. This will be used to generate confidence scores for input M. There should never be more than 10 issues open. Ideally we can somehow recognise for a given cluster that an issue already exists (by parsing its title for the hashcode boomer provides).

New boomer requirements

Output report results.json contains probability scores that enable us to select cliques which should be reviewed.
results.json should conform to the new OAK cluster data model
cluster-X.md files should be on a by-clique basis rather than one huge file and ideally already contain the image tag which can be assumed to be in the same directory (not sure how this will work with posting a github issue though - maybe you know how this could be automated)

Comments

"joint posterior prop most likely of clique / prop next most likely - how interesting is this cluster?" @cmungall

Merging 14 Ontologies (huge merge)

Hello,

I am trying to merge 14 ontologies at once with Boomer : DERMO, DO, HUGO, ICDO, IDO, IEDB, MESH, MFOMD, MPATH, NCIT, OBI, OGMS, ORPHANET and SCDO.

This is how I proceed :

I compute the 91 LOGMAP alignments between every pair of ontologies (i.e. 91 = n(n-1)/2 with n=14)
I convert and merge these alignments into a single ptable (Boomer format)
I join all these ontologies into a single "union" OWL file (622K classes ~ 2.5 GB)
I launch Boomer on the union OWL file and the single ptable (54K entries ~ 7 MB).

I have run various tests and it seems that when the ptable is too large, the problem becomes intractable.

By removing the MESH and NCIT (i.e. now I try to merge 12 ontologies), the resulting union ontology is only 81K classes (242 MB) and the ptable contains only 7K entries. In this case, Boomer ends with a result in 30 min (on a i7 - 1.90 GHz with 32 GB RAM).

But I also need the MESH and the NCIT ontologies to be included in my merge result.

Overall, I am wondering if that's the correct way to proceed ?

Here follow some questions :

Should I continue with this strategy ?
-> Should I keep trying to merge all at once ? In order to give Boomer complete decision power on selecting the best mappings (without introducing any bias)...
Or should I change my merging strategy ?
-> Should I split the problem into smaller sub-problems
-> Then organize them in some order (according to some criteria) : this could introduce some bias...
-> And launch Boomer following this order.

For example, I could try this :
- I convert the 91 alignments into 91 ptables (instead of converting and merging them into 1 single ptable)
- For each of the 91 ptables
----> I launch Boomer with this ptable and the union OWL file.
----> In the union OWL file, I add all the equivalence axioms generated by Boomer for this ptable.

So far, it seems to work much faster.
But the problem is the arbitrary order in the for-loop that is introducing a bias : since each equivalence axiom added at one step will influence Boomer results in the next steps.

Any suggestions ?

Oliver

PS : I couldn't attach the Boomer input union ontology (compressed ~ 140 MB) since the maximum attachment size is 25 MB. However, the input ptable is here ptable-91-mappings.zip .

Bayesian calculation of unspecified probabilities from priors

The functionality for this may go in sssom-py but it seems logical to put anything involving probabilistic calculations into an issue here

currently boomer assumes the user specifies priors for all 4 possibilities

What if we have a file which we have a mapping with a probability specified for only one interpretation? In this case we should use standard probability axioms to calculate other probabilities based on priors of probability of any mapping having a particular interp

E.g. assuming global priors

P(equiv) = 0.8
P(sub) = 0.05
P(sup) = 0.05
P(sib) = 0.1

assume sssom contains equiv statement with confidence 0.4

P(sub | equiv) = 0.0
P(sub | NOTequiv) = P(NOTequiv | sub) . P(sub)
                            ----
                            P(NOTequiv)

                  = 1 * 0.05
                      ---
                      0.2

                  = 0.25

therefore posterior p(sub) = 0.4*0 + 0.6 * .25 = 0.15

all posterior

p(equiv) = 0.4
p(sub) = 0.15
p(sup) = 0.15
p(sub) = 0.3

boomer selects suboptimal solution in simple 3-node problem

for text files see #157.

Given:

Pr(A properSubClassOf C) = 0.99
Pr(A equiv B) = 0.95
Pr(B equiv C) = 0.95

(in each case, the only other possibility is siblingOf)

note each class is in a separate prefix space, so there is no penalty for equivalence between any

Solutions:

1,2,3 : incoherent
1,2 : .99 * .95 * (1-.95) = 0.04
1,3 : .99 * .95 * (1-.95) = 0.04
2,3 : .95 * .95 * (1-0.99) = 0.009
1 : .99 * .05 * .05 = 0.0023
2 : .01 * .95 * .05 = 0.000475
3 : .01 * .95 * .05 = 0.000475
{} : .01 * .05 * .05 = 2.5e-05

boomer generally selects {1} depending on params, but never the optimal

I am pretty sure I have not made a typo - I put each class in its own ID space, so it is not avoiding 2 or 3 (which would happen if A/B/C were in the same ID space)

boomer -p prefixes.yaml -w 100 -r 1000 -t ptable.tsv --ontology logical.omn 
...
2021.02.05 09:23:19:376 [zio-def...] [INFO ] org.monarchinitiative.boomer.Main.program:49 - Most probable: 0.0024750000000000015
...
$ more output.txt 
A:1 SiblingOf B:1               0.05
B:1 SiblingOf C:1               0.05
A:1 ProperSubClassOf C:1        (most probable) 0.99

Output sssom mapping files rather than (just) owl

For working with boomer more seemlessly, it would be great if it could output the results as a table:

mapping_set_id: https://w3id.org/boomer/s8169763872632786387263
license: https://creativecommons.org/publicdomain/zero/1.0/
curie_map:
  UBERON: http...
  FMA: http...

subject_id	subject_label	predicate_id	object_id	object_label	confidence	mapping_justification
UBERON:123	heart	skos:exactMatch	FMA:321	heart	0.9	semapv:UnspecifiedMatching
UBERON:123	soul	skos:exactMatch	FMA:321	human soul	0.9	semapv:UnspecifiedMatching

Consider not outputting cliques of size 2 or less

Right now, on a large run we output all cliques, including those of with just two members. These are neither useful for visualisation nor for human review?

incatools / boomer Goto Github PK

boomer's Introduction

boomer

Bayesian OWL ontology merging

Usage

Running

Example command line (from https://github.com/geneontology/go-rhea-boom/blob/master/Makefile)

Building

boomer's People

Contributors

Stargazers

Watchers

Forkers

boomer's Issues

Workflow:

New boomer requirements

Comments:

SINGLETONS

Overview

Workflow:

New boomer requirements

Comments

Recommend Projects

Recommend Topics

Recommend Org