colleenXu commented on June 28, 2024
This is all of my initial understanding. There seems to be a LOT more going on, which we'll track in the opening post of this issue.

[EDIT: PAUSE FOR DISCUSSION. NEEDS MORE INVESTIGATION ON WHAT THE SPEC IS, based on Eric's reply in Translator Slack]

Previously, we implemented QNode is_set behavior (original issue, behavior with ID/node-expansion).

  • default behavior (property missing or is_set: false): each result has 1 KG node bound to each QNode
  • but if a QNode has is_set: true, a result can have >= 1 KG node bound to that QNode, i.e. results that differ only in that QNode's binding are merged/consolidated.

Feature

In TRAPI 1.5, is_set is replaced with set_interpretation, which has more explicit rules for results-assembly (PR, lines 881-896). It's an optional property with string values (enum).

  • default behavior (property missing or null) == "BATCH": same as before, each result has 1 KG node bound to each QNode
  • "MANY": same as previous is_set: true behavior.
    • Note: This new specification only seems to cover when QNodes have multiple starting IDs. But I'd like to keep our current use of is_set:true / set_interpretation: MANY on QNodes with no starting IDs to merge/consolidate results.
  • "ALL": new behavior. This should only be set on QNodes that have multiple starting IDs/entities. Similar to the "MANY" behavior, but only keep results that contain all starting IDs/entities.
    • AKA if only some of the starting IDs/entities are in the consolidated result, it should be thrown out (and any KG nodes/edges unique to it should be pruned).
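
A minimal sketch of the three results-assembly modes described above, for a single n0 -> n1 edge. The function names (`interpretation_of`, `assemble`), the result shape, and the placeholder IDs (`P1`, `D1`, ...) are assumptions for illustration only, not BTE's implementation or the TRAPI schema. It ignores ID/node-expansion and treats the legacy is_set: true as "MANY".

```python
from collections import defaultdict


def interpretation_of(qnode):
    """Return the effective set_interpretation for a QNode dict."""
    value = qnode.get("set_interpretation")
    if value is None:
        # Missing or null: legacy is_set: true behaves like MANY, otherwise BATCH.
        return "MANY" if qnode.get("is_set") else "BATCH"
    return value


def assemble(pairs, qnode):
    """Assemble results from (n0_id, n1_id) KG matches.

    Each result is a dict holding the set of KG nodes bound to n0
    and the single KG node bound to n1 (hypothetical shape).
    """
    mode = interpretation_of(qnode)
    if mode == "BATCH":
        # One KG node per QNode in every result.
        return [{"n0": {a}, "n1": b} for a, b in sorted(set(pairs))]

    # MANY / ALL: consolidate every n0 binding that shares the same n1 node.
    grouped = defaultdict(set)
    for a, b in pairs:
        grouped[b].add(a)
    results = [{"n0": ids, "n1": b} for b, ids in sorted(grouped.items())]

    if mode == "ALL":
        # Keep only results that contain every starting ID.
        required = set(qnode["ids"])
        results = [r for r in results if required <= r["n0"]]
    return results


# Toy data with placeholder IDs (not real query output).
pairs = [("P1", "D1"), ("P2", "D1"), ("P1", "D2"), ("P2", "D3")]
node = {"ids": ["P1", "P2"]}

batch = assemble(pairs, node)                                  # 4 results
many = assemble(pairs, dict(node, set_interpretation="MANY"))  # 3 results
all_ = assemble(pairs, dict(node, set_interpretation="ALL"))   # 1 result (D1)
```

In this toy case only D1 is connected to both starting IDs, so "ALL" drops the D2 and D3 results that "MANY" keeps.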

Examples

All examples use the same basic query and differ only in the value of set_interpretation. I used HP IDs with no descendants, because ID/node-expansion automatically applies is_set: true (see #555 (comment)).
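
Since the three queries differ in a single property, they could be generated from one base query. This is only a convenience sketch: `BASE_QUERY` and `with_set_interpretation` are hypothetical names, and the query content is copied from the examples in this comment.

```python
import copy
import json

# Base query shared by all three examples (set_interpretation left unset).
BASE_QUERY = {
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:PhenotypicFeature"],
                    "ids": ["HP:0500041", "HP:0007750"],
                },
                "n1": {"categories": ["biolink:Disease"]},
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"],
                }
            },
        }
    }
}


def with_set_interpretation(value, base=BASE_QUERY):
    """Return a deep copy of the base query with n0.set_interpretation set."""
    query = copy.deepcopy(base)
    query["message"]["query_graph"]["nodes"]["n0"]["set_interpretation"] = value
    return query


queries = {v: with_set_interpretation(v) for v in ("BATCH", "MANY", "ALL")}
all_query_json = json.dumps(queries["ALL"], indent=4)
```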

set to BATCH (default)

Query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0500041", "HP:0007750"],
                    "set_interpretation": "BATCH"
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

Response should have 39 results: current_default.json. This was generated with the current default (not setting is_set).

set to MANY

Query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0500041", "HP:0007750"],
                    "set_interpretation": "MANY"
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

Response should have 37 results (rather than 39): current_is_set.json. This was generated with current is_set: true.

set to ALL

Query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0500041", "HP:0007750"],
                    "set_interpretation": "ALL"
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

Response should have 2 results (rather than 39 or 37). Only 2 disease entities are connected to both starting entities. See the first two results of current_is_set.json. This was generated with current is_set: true.

  • MONDO:0009003 (achromatopsia 2)
  • MONDO:0013560 (Hermansky-Pudlak syndrome 8)

Complications that need discussion

(1) ID/node expansion

Currently, if we find descendants of a starting ID (ID/node expansion), we set that starting ID's QNode to is_set: true (see #555 (comment)). Can we remove this behavior?

  • the current behavior has unintended consequences, like being completely unable to do is_set: false / set_interpretation: BATCH for some queries
  • but I'm not sure if we depend on this behavior (we want to keep the behavior of "subclass_of edges using different descendant IDs are kept in the same result")
  • Context: we implemented this with an old version of representing subclass info (comment). Now we use "constructed edges" + aux-graphs (issue)

Example

I expect 11 results for the following query (not setting is_set), but end up with 10 results, which is the same as setting is_set: true.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0007800", "HP:0025586"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

This happens because ID/node-expansion finds a descendant for 1 of the starting IDs. Console logs:

  bte:biothings-explorer-trapi:main Expanded ids for node n0: (2 ids -> 3 ids) +0ms
  bte:biothings-explorer-trapi:main Added is_set:true to node n0 +1ms
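
The behavior in these logs can be mimicked with a small sketch. This is not BTE's code: `expand_node`, the `descendants` lookup, and the placeholder IDs are assumptions, and the logs do not say which starting ID has the descendant. It shows why a query that never asked for is_set still ends up with it.

```python
def expand_node(qnode, descendants):
    """Add descendant IDs to a QNode and, if any were added, force is_set: true.

    `descendants` is a hypothetical dict mapping an ID to its descendant IDs.
    """
    expanded = list(qnode["ids"])
    for start_id in qnode["ids"]:
        for child in descendants.get(start_id, []):
            if child not in expanded:
                expanded.append(child)

    out = dict(qnode, ids=expanded)
    if len(expanded) > len(qnode["ids"]):
        # Mirrors "Added is_set:true to node n0": the caller's choice is overridden.
        out["is_set"] = True
    return out


# Placeholder IDs: one of two starting IDs has a single descendant.
n0 = {"categories": ["biolink:PhenotypicFeature"], "ids": ["X:1", "X:2"]}
expanded_n0 = expand_node(n0, {"X:1": ["X:1a"]})   # 2 ids -> 3 ids, is_set forced
untouched_n0 = expand_node(n0, {})                 # no descendants, no is_set
```

Because the flag is added whenever expansion finds anything, there is no way in this sketch to request the per-ID (BATCH-like) results for such a node, which matches the 10-instead-of-11 outcome above.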

Note to self with another example

This query returns 130 results whether or not is_set: true is used; without is_set: true, it should return >= 134 results. It also has some subclass_of edges/aux-graphs, but I'm not sure if it's a good test for seeing if ID/node-expansion becomes wonky after the changes.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0003259", "HP:0000110"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

(2) Unclear what the KG Node is_set property is for

Asked in Translator Slack:
The PR for set_interpretation also adds an is_set property to KG Nodes (lines 1011-1017). I'm not sure if this is meant to be used, and how (merging KG Nodes??).

Eric's reply in Translator Slack - needs more investigation.

(3) Clarifying an edge case

Asked in Translator Slack:
If set_interpretation is set on a QNode with multiple starting IDs, but these IDs all map to the same entity (using NodeNorm), then there isn't any set behavior to do. Is that fine? Does there need to be any log noting this?
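
One way to detect this edge case, sketched under assumptions: `canonical` is a hypothetical dict standing in for a NodeNorm lookup (ID -> normalized ID), `set_behavior_applies` is an illustrative name, and whether a log entry is wanted at all is the open question above.

```python
import logging

logger = logging.getLogger("set_interpretation_check")


def distinct_entities(ids, canonical):
    """Collapse IDs to their normalized form; unknown IDs map to themselves."""
    return {canonical.get(i, i) for i in ids}


def set_behavior_applies(qnode, canonical):
    """Return True if MANY/ALL has more than one distinct entity to work with."""
    mode = qnode.get("set_interpretation")
    ids = qnode.get("ids", [])
    if mode not in ("MANY", "ALL") or len(ids) < 2:
        return False
    if len(distinct_entities(ids, canonical)) == 1:
        # All starting IDs are the same entity: nothing to merge or require.
        logger.warning(
            "set_interpretation=%s has no effect: %d IDs normalize to one entity",
            mode, len(ids),
        )
        return False
    return True


# Placeholder IDs: Y:1 is an equivalent identifier for X:1.
same = set_behavior_applies(
    {"ids": ["X:1", "Y:1"], "set_interpretation": "ALL"}, {"Y:1": "X:1"}
)
different = set_behavior_applies(
    {"ids": ["X:1", "X:2"], "set_interpretation": "ALL"}, {}
)
```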

from biothings_explorer.
