pandasaurus's Introduction

Pandasaurus

STATUS: BETA

A python library supporting simple queries over ontology annotations in dataframes, using UberGraph queries.

The aim for now is to keep this as a very simple independent Python lib avoiding any complex dependencies.

With the basic library in place, the first planned use for this is as a base for a library that provides simple enrichment and queryability to AnnData Cell X Gene matrices following the CZ single cell curation standard.

pandasaurus's People

Contributors

anitacaron, dosumis, ubyndr

pandasaurus's Issues

Ubergraph filters

STATUS: DRAFT

Problem

The results of direct queries of ubergraph redundant and non-redundant graphs often give suboptimal results for biologist-facing use cases. The redundant

Case 1: Most precise object term needed.

Querying the redundant graph returns object terms that are too abstract. Querying the non-redundant graph fails to return any object term in cases where redundancy stripping assumes users will be able to infer it from the properties of a more general class.

Examples:

cell_ontology | GO
sensory epithelial cell | biological_process
interneuron | biological_process
motor neuron | biological_process
sensory neuron | biological_process
polymodal neuron | biological_process

  • Querying the non-redundant graph gives too little, e.g., GABAergic only links to GO BP on the most general grouping class.

https://api.triplydb.com/s/166HwhWEo

cell_ontology | GO
GABAergic neuron | gamma-aminobutyric acid secretion, neurotransmission

The redundant graph => 67 cell types

cell_ontology | GO
basket cell | gamma-aminobutyric acid secretion, neurotransmission
cerebellar Golgi cell | gamma-aminobutyric acid secretion, neurotransmission
GABAergic neuron | gamma-aminobutyric acid secretion, neurotransmission
Kolmer-Agduhr neuron | gamma-aminobutyric acid secretion, neurotransmission
rosehip neuron | gamma-aminobutyric acid secretion, neurotransmission
cerebral cortex GABAergic interneuron | gamma-aminobutyric acid secretion, neurotransmission
GABAergic interneuron | gamma-aminobutyric acid secretion, neurotransmission
...
  • What we want:
cell_ontology | GO
fan Martinotti neuron | biological_process
fan Martinotti neuron | transmission of nerve impulse
fan Martinotti neuron | secretion by cell
fan Martinotti neuron | acid secretion
fan Martinotti neuron | gamma-aminobutyric acid secretion, neurotransmission
fan Martinotti neuron | secretion
fan Martinotti neuron | transport
fan Martinotti neuron | cellular process
fan Martinotti neuron | biological regulation
fan Martinotti neuron | regulation of neurotransmitter levels
fan Martinotti neuron | system process
fan Martinotti neuron | neurotransmitter transport
fan Martinotti neuron | neurotransmitter secretion
fan Martinotti neuron | signal release
fan Martinotti neuron | multicellular organismal process
fan Martinotti neuron | nervous system process
fan Martinotti neuron | gamma-aminobutyric acid secretion
fan Martinotti neuron | gamma-aminobutyric acid transport
fan Martinotti neuron | localization
fan Martinotti neuron | establishment of localization
fan Martinotti neuron | regulation of biological quality
fan Martinotti neuron | organic substance transport
fan Martinotti neuron | export from cell
fan Martinotti neuron | establishment of localization in cell
fan Martinotti neuron | signal release from synapse

-->

cell_ontology | GO
fan Martinotti neuron | transmission of nerve impulse
f
fan Martinotti neuron | gamma-aminobutyric acid secretion, neurotransmission
  • Proposed Solution
    For each subject: query for all subClassOf relationships between object terms. Filter out all triples from the original query where the term has subclasses according to this second query. However, this would require many secondary queries and so would be inefficient. Is there some clever way to do this in SPARQL with subqueries?
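A minimal sketch of the filtering step described above, assuming the subClassOf relationships between object terms have already been fetched in one bulk query rather than one query per subject. The function name, the column names `s` and `o`, and the toy data are all hypothetical and are not part of pandasaurus.

```python
import pandas as pd


def keep_most_specific(triples: pd.DataFrame, subclass_pairs: set) -> pd.DataFrame:
    """Drop rows whose object has a more specific term among the same subject's objects.

    triples: dataframe with columns "s" (subject) and "o" (object); names are assumptions.
    subclass_pairs: set of (child, parent) tuples meaning child subClassOf parent.
    """
    keep = []
    for _, group in triples.groupby("s", sort=False):
        objects = set(group["o"])
        for idx, obj in group["o"].items():
            # obj is too general if another object of this subject is a subclass of it
            has_more_specific = any(
                (other, obj) in subclass_pairs for other in objects if other != obj
            )
            if not has_more_specific:
                keep.append(idx)
    return triples.loc[keep].reset_index(drop=True)


# Toy data for illustration only
triples = pd.DataFrame(
    {
        "s": ["fan Martinotti neuron"] * 3,
        "o": [
            "secretion",
            "gamma-aminobutyric acid secretion",
            "gamma-aminobutyric acid secretion, neurotransmission",
        ],
    }
)
subclass_pairs = {
    ("gamma-aminobutyric acid secretion", "secretion"),
    ("gamma-aminobutyric acid secretion, neurotransmission", "secretion"),
    (
        "gamma-aminobutyric acid secretion, neurotransmission",
        "gamma-aminobutyric acid secretion",
    ),
}
result = keep_most_specific(triples, subclass_pairs)
```

This does the filtering client-side after two queries in total, which avoids the per-subject secondary queries but still transfers the full redundant result set.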

Case 2: Graph-view generation

Aim: simple redundancy stripping that does not assume users can deal with inheritance of properties down the class hierarchy.

{details and examples TBA}

Add redundancy stripping functions for UberGraph queries

We need a pair of functions for querying Ubergraph that include stripping redundancy (see #28 for rationale, discussion and example use cases).

  1. Given a set of subjects, a property and a target ontology, return all triples from the redundant graph with the specified subject and property and the most specific object class in the target ontology. The return value should be a pandas dataframe of triples following the standards established for all other pandasaurus queries.

  2. As 1, but given a set of objects, returning the most specific subject.

See https://api.triplydb.com/s/C3Nf1qnYu for an example of the SPARQL query logic (needs modification to fit the above specs).
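One possible shape for the first function's query is sketched below, using `FILTER NOT EXISTS` to discard any object that has a more specific sibling object for the same subject. This is not the query at the link above. The helper name is hypothetical, the graph IRI is an assumption about how Ubergraph names its redundant graph, and the query has not been run against the endpoint.

```python
# Assumed IRI of the Ubergraph redundant graph; verify before use.
REDUNDANT_GRAPH = "http://reasoner.renci.org/redundant"


def most_specific_object_query(subjects, predicate, target_prefix):
    """Build a SPARQL query string (hypothetical helper, not part of pandasaurus).

    subjects: iterable of full subject IRIs.
    predicate: full IRI of the property to follow.
    target_prefix: IRI prefix that restricts objects to the target ontology.
    """
    values = " ".join(f"<{iri}>" for iri in subjects)
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?s ?o WHERE {{
  VALUES ?s {{ {values} }}
  GRAPH <{REDUNDANT_GRAPH}> {{ ?s <{predicate}> ?o . }}
  FILTER(STRSTARTS(STR(?o), "{target_prefix}"))
  # Discard ?o when the same subject also links to a proper subclass of ?o
  FILTER NOT EXISTS {{
    GRAPH <{REDUNDANT_GRAPH}> {{
      ?s <{predicate}> ?o2 .
      ?o2 rdfs:subClassOf ?o .
    }}
    FILTER(?o2 != ?o)
    FILTER(STRSTARTS(STR(?o2), "{target_prefix}"))
  }}
}}"""


query = most_specific_object_query(
    subjects=["http://purl.obolibrary.org/obo/CL_0000617"],
    predicate="http://purl.obolibrary.org/obo/RO_0002215",
    target_prefix="http://purl.obolibrary.org/obo/GO_",
)
```

One caveat of this pattern: two equivalent object classes would each count as a subclass of the other, so both would be dropped. The second function would swap the roles of `?s` and `?o`.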

EPIC - overview of functionality for initial release.

Ubergraph queries to populate

Given 2 seeds of classes S(s) and S(o): return all triples from the redundant graph for s subClassOf o as a simple Pandas dataframe with subject and object columns. We may expand in future to include objectProperties in seed, but OK to stay simple for now. (I’ll refer to this as enriched_df below).

Methods for deriving S(s) and S(o) from an initial seed, S(i), prior to calling basic query

  • Simple: S(s) = S(i); S(o) = S(i)
  • Minimal Slim enrichment: S(s) = S(i); S(o) = S(i) + all classes in some specified (set of) slims (where class in slim = class tagged with some specified ‘subset’ axiom)
  • Full Slim enrichment: as Minimal slim enrichment but with transitive query of the non-redundant graph (rdfs:subClassOf*)
  • Contextual enrichment: S(s) = S(i); S(o) = S(i) + all classes satisfied by some (set of) existential restrictions in the ubergraph redundant graph (e.g. part_of 'Kidney')
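The first two derivation methods above can be sketched with plain sets. The function names and the toy subClassOf pairs are hypothetical, and `select_triples` stands in for the basic query against the redundant graph.

```python
def simple_seeds(seed):
    """Simple: S(s) = S(i); S(o) = S(i)."""
    subjects = set(seed)
    return subjects, set(subjects)


def minimal_slim_seeds(seed, slim_members):
    """Minimal slim enrichment: S(s) = S(i); S(o) = S(i) + classes in the slim(s)."""
    subjects = set(seed)
    return subjects, subjects | set(slim_members)


def select_triples(redundant_pairs, subjects, objects):
    """Stand-in for the basic query: keep (s, o) pairs with s in S(s) and o in S(o)."""
    return sorted((s, o) for s, o in redundant_pairs if s in subjects and o in objects)


# Toy (child, parent) pairs standing in for redundant-graph subClassOf triples
redundant_pairs = {
    ("CL:0000617", "CL:0000540"),
    ("CL:0000617", "CL:0000000"),
    ("CL:0000540", "CL:0000000"),
}
seed = ["CL:0000617"]

simple_result = select_triples(redundant_pairs, *simple_seeds(seed))
slim_result = select_triples(
    redundant_pairs, *minimal_slim_seeds(seed, slim_members={"CL:0000540"})
)
```

With a single-term seed the simple method finds no relationships in this toy data, while adding a slim term to S(o) brings in the link from the seed term to it.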

It is entirely possible that users will provide invalid IDs in the initial seed, so we need to test for this:

  • Are the CURIE prefixes valid for UberGraph - throw exception/warning if not (option hard/soft fail)
  • Are the CURIEs valid terms in UberGraph - throw exception/warning if not (option hard/soft fail)
  • Do any CURIEs correspond to obsolete terms? (Option to update terms in the list via replaced_by?)
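A minimal sketch of the prefix check with a hard/soft fail option follows. The exception class, the function name, the CURIE pattern and the set of known prefixes are assumptions; in practice the valid prefixes would have to come from UberGraph itself.

```python
import re
import warnings

# Loose CURIE shape: a prefix, a colon, then a non-empty local part (assumption)
CURIE_PATTERN = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):(\S+)$")


class InvalidSeedError(ValueError):
    """Hypothetical exception raised on hard fail."""


def check_prefixes(curies, known_prefixes, hard_fail=True):
    """Return the CURIEs whose prefix is malformed or not in known_prefixes."""
    invalid = []
    for curie in curies:
        match = CURIE_PATTERN.match(curie)
        if match is None or match.group(1) not in known_prefixes:
            invalid.append(curie)
    if invalid:
        message = f"Unrecognised CURIE prefix in: {', '.join(invalid)}"
        if hard_fail:
            raise InvalidSeedError(message)
        warnings.warn(message)  # soft fail: report and carry on
    return invalid


invalid = check_prefixes(
    ["CL:0000540", "XX:1", "nocolon"], known_prefixes={"CL", "GO"}, hard_fail=False
)
```

The term-level and obsolete-term checks would follow the same hard/soft pattern but need a query per batch of CURIEs, so they are not shown here.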

Users will need some way to find which slims are available. To do this we should include methods for finding available slims and displaying their contents:

  • find_available_slims: query for all slims available in a specified ontology, returning name and definition.
  • show_slim_contents: list names and IDs of all terms in slim.

For all classes in enriched_df, query for labels and synonyms. Return as table with ID, label, synonyms (pipe sep) + column indicating presence in original seed (I’ll refer to this as name_lookup_df below)

User query methods over dataframes

The dataframes are designed to be sufficiently simple that a bioinformatician with basic competency could work out how to use them to enrich their results. But I still think it is useful to wrap these.

Add sqlite as dependency for wrapping queries.

Basic query:

semantic_filter: (arg = dataframe + column name + name or synonym of query term) : result = filtered input dataframe via join on column to subject of enriched_df, looking up of object name or synonym via query of name_lookup.
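A rough sketch of how semantic_filter could behave is shown below. It swaps in plain pandas operations instead of sqlite, and it assumes column names `s`/`o` for enriched_df and `ID`/`label`/`synonyms` (pipe separated) for name_lookup_df. These names, the signature and the toy data are illustrative assumptions only.

```python
import pandas as pd


def semantic_filter(df, column, query_name, enriched_df, name_lookup_df):
    """Keep rows of df whose `column` value is a subject linked to the named object term."""
    # Resolve the query string to IDs via label or any pipe-separated synonym
    synonyms = name_lookup_df["synonyms"].fillna("").str.split("|")
    is_match = (name_lookup_df["label"] == query_name) | synonyms.apply(
        lambda names: query_name in names
    )
    object_ids = name_lookup_df.loc[is_match, "ID"]
    # Subjects in enriched_df that have one of those IDs as object
    subjects = enriched_df.loc[enriched_df["o"].isin(object_ids), "s"].unique()
    # Semi-join of the input dataframe on the chosen column
    return df[df[column].isin(subjects)].reset_index(drop=True)


# Toy data for illustration only
df = pd.DataFrame(
    {"cell_type": ["CL:0000617", "CL:0000540", "CL:0000066"], "n_cells": [10, 20, 30]}
)
enriched_df = pd.DataFrame({"s": ["CL:0000617"], "o": ["CL:0000540"]})
name_lookup_df = pd.DataFrame(
    {
        "ID": ["CL:0000540", "CL:0000617"],
        "label": ["neuron", "GABAergic neuron"],
        "synonyms": ["nerve cell", "GABA-ergic neuron"],
    }
)
filtered = semantic_filter(df, "cell_type", "nerve cell", enriched_df, name_lookup_df)
```

Note that a row annotated with the query term itself is only kept if enriched_df contains a reflexive row for that term; whether it should is a design decision left open here.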

Library should be released on PyPI

Improve docstrings

The current docstrings are very hard to understand. They need to be made clearer and more explicit. Ideally they would also follow some standard that allows for automated generation of documentation (e.g. Sphinx).

Query class is responsible for returning the non-redundant graph for s subClassOf o as a simple Pandas dataframe
with given 2 seeds of classes, S(s) and S(o) from an initial seed, S(i).

I don't think it is clear what it might mean to return a non-redundant graph as a dataframe, or what the seed classes might mean, without some context. Here's an attempt at a more readable docstring:

"""A Query object is initialised by passing a list of seed terms (where each term is a CURIE string, 
e.g. CL:0000001; all OBO standard curies are recognised). It generates a Pandas dataframe that 
enriches the seed list with synonyms and all inferred subClassOf relationships between terms in 
the seed. Additional methods allow enrichment with terms outside the seed from slims or specified 
by a semantic context.

Args: 
  :seed_list:  list of CURIE strings

Attributes:
  :enriched_df:   {Description of dataframe here}"""

(Might be better to put construction args pydoc on __init__? )

Please do the same for all methods.

Queries for graph generation

The default set of queries in Pandasaurus is designed to flatten content for dataframes, including only triples linking to terms provided in the arg. For graph generation we need to use the object list in the subject position, followed by a transitive reduction step. Rather than generating a dataframe output, I think it makes more sense to store these using some in-memory graph representation on the Pandasaurus object (e.g. as rdflib.Graph objects).

Action: for each dataframe enrichment, generate a reduced graph in parallel, stored on the object*

* I think this argues in favour of extension to PandaSaurus extending base objects rather than generating custom ones that call them.
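The transitive reduction step mentioned above can be sketched in plain Python over (child, parent) pairs. The function name is hypothetical, the input is assumed to be acyclic, and loading the result into an in-memory graph object is left out.

```python
def transitive_reduction(edges):
    """Remove edges implied by longer paths.

    edges: iterable of (child, parent) pairs forming a DAG; reflexive pairs are ignored.
    """
    edges = {(s, o) for s, o in edges if s != o}
    parents = {}
    for s, o in edges:
        parents.setdefault(s, set()).add(o)

    def reachable(start):
        """All nodes reachable from start by following one or more parent edges."""
        seen, stack = set(), list(parents.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return seen

    reduced = set()
    for s, o in edges:
        # (s, o) is redundant if o can be reached through another direct parent of s
        if not any(o in reachable(mid) for mid in parents[s] if mid != o):
            reduced.add((s, o))
    return reduced


closed = {("A", "B"), ("B", "C"), ("A", "C"), ("A", "A")}
reduced = transitive_reduction(closed)
```

This recomputes reachability per edge, which is fine for small enrichment results; a larger graph would want cached ancestor sets or a graph library.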

Add ancestor_enrichment

Copied from INCATools/pandasaurus_cxg#46

add another enrichment method to the existing set.

User specifies number of hops

name: ancestor_enrichment (?)
Args: number of hops

subject - terms in seed
object - terms in seed + terms 1-n hops from terms in subject.
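A sketch of how the object set could be built, assuming direct parent links are available as a dictionary (a stand-in for a query of the non-redundant graph). The function name and data structure are hypothetical.

```python
def ancestor_enrichment_objects(seed, direct_parents, hops):
    """Return seed terms plus all terms within `hops` parent steps of any seed term.

    direct_parents: dict mapping a term to the set of its direct parents (assumption).
    """
    visited = set(seed)
    frontier = set(seed)
    for _ in range(hops):
        # Step one hop up from the current frontier, skipping terms already seen
        frontier = {
            parent for term in frontier for parent in direct_parents.get(term, ())
        } - visited
        if not frontier:
            break
        visited |= frontier
    return visited


direct_parents = {"A": {"B"}, "B": {"C"}, "C": {"D"}}
objects = ancestor_enrichment_objects(["A"], direct_parents, hops=2)
```

The subject set stays equal to the seed, so the basic query can then be run with the seed as subjects and this set as objects.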

Design architecture

Sketch out repo structure, packages structure, class and method names.

We should aim for test-driven development with docstrings for all methods and general documentation suitable for publicising on PyPI.

cellXGene_extension

Background & use cases.

The CxG schema used by the CellXGene app standardises key names and ontology values for recording sample metadata and cell type annotation for AnnData-format cellXGene matrices constructed from single cell transcriptomics data. The primary aim of the CxG extension to pandasaurus is to support enrichment and filtering of matrices by cell type/class using CL ontology structure.

It will do this by generating the initial term list from the cell_type field in the CxG standard AnnData file. For contextual enrichment it will use the CxG standard tissue field and query UberGraph for 'cell' AND 'part_of {tissue}'.

  • We are developing an extension to the CxG schema - which we call the CxG metaschema. This overcomes the flat nature of the current schema - by using JSON to type and link fields. One major usage of this schema will be to type free text cell type fields and infer relative granularity between these annotations and cell_type ontology annotation from co-annotation, taking advantage of ontology enrichment. Details TBA.

Decisions:

Q: Should this be a separate library: pandasaurus_cxg, importing pandasaurus?
A: Yes. Pandasaurus will have other, completely unrelated uses.
Q: Should this library package the CxG metaschema testing, or should this be in a separate lib & imported?
A: Provisional - separate lib, imported.

Features

The Pandasaurus CxG extension should load a CxG standard AnnData file, check field compliance* and report:

  • Absence of standard fields
  • Any fields following standards that do not have expected content
  • check metaschema for compliance (if present)

* Use the standard validator.

Methods

Update ontology terms (warn before update and provide report).

  • Update labels if they have changed
  • Update obsoletes using replaced_by

Highest priority - cell type queries:

  • Show Cell Type enrichment slims (in the cell ontology): Return names and descriptions of slims.
  • Enrich cell types - standard pandasaurus enrichment with list terms in cell_type field as input & optional choice of slims
  • Enrich cell types by anatomical context: Uses the anatomical context in the CxG Anndata file to enrich via the pandasaurus.contextual_enrichment method.
  • Filter matrix:
    Generates a new matrix containing only terms that are subclasses of those in some specified filter set, where the filter set is chosen from terms in the enrichment table.

Major functionality missing = synonym lookup

The original plan was to put this in a list in enriched_df. I think this is not ideal for queries, as we need people to be able to specify single strings when running filters etc. It may be better to have a second table with:

ID; name; type

Where the name column takes labels and synonyms and the type column records whether this is a label or a synonym + scope.
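The proposed table could look like the following sketch, with one row per label or synonym so that a single string can be matched directly. The input record shape, the function name and the `type` values are assumptions made for illustration.

```python
import pandas as pd


def build_name_table(records):
    """Build a long-format ID / name / type table.

    records: list of dicts with "id", "label" and optional "synonyms", where
    synonyms is a list of (scope, text) tuples (hypothetical input shape).
    """
    rows = []
    for rec in records:
        rows.append({"ID": rec["id"], "name": rec["label"], "type": "label"})
        for scope, synonym in rec.get("synonyms", []):
            rows.append({"ID": rec["id"], "name": synonym, "type": f"{scope}_synonym"})
    return pd.DataFrame(rows, columns=["ID", "name", "type"])


# Toy record for illustration only
name_table = build_name_table(
    [{"id": "CL:0000540", "label": "neuron", "synonyms": [("exact", "nerve cell")]}]
)
matches = name_table.loc[name_table["name"] == "nerve cell", "ID"].tolist()
```

Compared with a list column, an exact string match is a single equality test, and the scope can be used to restrict which synonym types count.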

Report slim coverage

Method to report how well a slim covers a set of terms - provided as a list or specified as an ubergraph query.

By default, coverage is tested via subClassOf, but other relations can be (optionally) specified. Ancestors of terms in the slim are not counted. The result should be a simple percentage.
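A minimal sketch of the coverage calculation, assuming the relationships have already been retrieved as (child, parent) pairs. The function name is hypothetical, and counting a term that is itself in the slim as covered is an assumption not stated above.

```python
def slim_coverage(terms, slim, related_pairs):
    """Percentage of terms that are in the slim or related to a slim term.

    related_pairs: set of (term, slim_term) pairs, subClassOf by default, or any
    other relation the caller chose to query (assumption).
    """
    terms = set(terms)
    if not terms:
        return 0.0
    slim = set(slim)
    covered = {
        term
        for term in terms
        if term in slim or any((term, member) in related_pairs for member in slim)
    }
    return 100.0 * len(covered) / len(terms)


# Toy data: A and B are subclasses of slim term S; C is an ancestor of S; D is unrelated
pairs = {("A", "S"), ("B", "S"), ("S", "C")}
coverage = slim_coverage({"A", "B", "C", "D"}, {"S"}, pairs)
```

Because only pairs pointing from a term to a slim member are checked, ancestors of slim terms (C in the toy data) are not counted, as required.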
