incatools / pandasaurus

Supporting simple queries over ontology annotations in dataframes, using UberGraph queries

License: Apache License 2.0

Python 100.00%

pandasaurus's Introduction

Pandasaurus

STATUS: BETA

A python library supporting simple queries over ontology annotations in dataframes, using UberGraph queries.

The aim for now is to keep this as a very simple independent Python lib avoiding any complex dependencies.

With the basic library in place, the first planned use is as a base for a library that provides simple enrichment and queryability for AnnData Cell X Gene matrices following the CZ single cell curation standard.

pandasaurus's Issues

Extend query init to add optional arg for properties to use for enrichment

The current query object only supports enrichment via subClassOf. We should add an optional arg allowing additional properties to be used. The value of the arg should be a list of CURIE strings.

Doc should be changed accordingly.

Ubergraph filters

STATUS: DRAFT

Problem

The results of direct queries of ubergraph redundant and non-redundant graphs often give suboptimal results for biologist-facing use cases. The redundant

Case 1: Most precise object term needed.

Querying the redundant graph returns object terms that are too abstract. Querying the non-redundant graph fails to return any object term in cases where redundancy stripping assumes users will be able to infer it from the properties of a more general class.

Examples:

Querying for all GO processes from some set of CL terms (e.g. via capable of)
- Querying the redundant graph => uselessly abstract GO terms. https://api.triplydb.com/s/EYZJ7_buH

cell_ontology	GO
sensory epithelial cell	biological_process
interneuron	biological_process
motor neuron	biological_process
sensory neuron	biological_process
polymodal neuron	biological_process

Querying the non-redundant graph gives too little, e.g., GABAergic only links to GO BP on the most general grouping class.

https://api.triplydb.com/s/166HwhWEo

cell_ontology	GO
GABAergic neuron	gamma-aminobutyric acid secretion, neurotransmission

The redundant graph => 67 cell types

cell_ontology	GO
basket cell	gamma-aminobutyric acid secretion, neurotransmission
cerebellar Golgi cell	gamma-aminobutyric acid secretion, neurotransmission
GABAergic neuron	gamma-aminobutyric acid secretion, neurotransmission
Kolmer-Agduhr neuron	gamma-aminobutyric acid secretion, neurotransmission
rosehip neuron	gamma-aminobutyric acid secretion, neurotransmission
cerebral cortex GABAergic interneuron	gamma-aminobutyric acid secretion, neurotransmission
GABAergic interneuron	gamma-aminobutyric acid secretion, neurotransmission
...

What we want:

cell_ontology	GO
fan Martinotti neuron	biological_process
fan Martinotti neuron	transmission of nerve impulse
fan Martinotti neuron	secretion by cell
fan Martinotti neuron	acid secretion
fan Martinotti neuron	gamma-aminobutyric acid secretion, neurotransmission
fan Martinotti neuron	secretion
fan Martinotti neuron	transport
fan Martinotti neuron	cellular process
fan Martinotti neuron	biological regulation
fan Martinotti neuron	regulation of neurotransmitter levels
fan Martinotti neuron	system process
fan Martinotti neuron	neurotransmitter transport
fan Martinotti neuron	neurotransmitter secretion
fan Martinotti neuron	signal release
fan Martinotti neuron	multicellular organismal process
fan Martinotti neuron	nervous system process
fan Martinotti neuron	gamma-aminobutyric acid secretion
fan Martinotti neuron	gamma-aminobutyric acid transport
fan Martinotti neuron	localization
fan Martinotti neuron	establishment of localization
fan Martinotti neuron	regulation of biological quality
fan Martinotti neuron	organic substance transport
fan Martinotti neuron	export from cell
fan Martinotti neuron	establishment of localization in cell
fan Martinotti neuron	signal release from synapse

-->

cell_ontology	GO
fan Martinotti neuron	transmission of nerve impulse
fan Martinotti neuron	gamma-aminobutyric acid secretion, neurotransmission

Proposed Solution
For each subject: query for all subClassOf relationships between object terms. Filter out all triples from the original query where the term has subclasses according to this second query. However, this would require many secondary queries and so would be inefficient. Is there some clever way to do this in SPARQL with subqueries?
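
One way to avoid a secondary query per subject is to fetch the subClassOf pairs between all object terms once and filter locally. The following is a minimal pandas sketch of that idea, not the library's implementation. The function name, the column names (`s`, `o`, `sub`, `super`) and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def keep_most_specific(triples: pd.DataFrame, subclass_pairs: pd.DataFrame) -> pd.DataFrame:
    """Drop (s, o) rows where the same subject also has a more specific object.

    triples: columns 's' and 'o' (hypothetical names).
    subclass_pairs: columns 'sub' and 'super', holding strict subClassOf
    pairs between object terms (no reflexive rows).
    """
    # Pair every object of a subject with each of its known subclasses.
    candidates = triples.merge(subclass_pairs, left_on="o", right_on="super")
    # Keep only pairs where that subclass is also an object of the same subject.
    hits = candidates.merge(triples.rename(columns={"o": "sub"}), on=["s", "sub"])
    redundant = set(zip(hits["s"], hits["o"]))
    is_redundant = pd.Series(
        [(s, o) in redundant for s, o in zip(triples["s"], triples["o"])],
        index=triples.index,
        dtype=bool,
    )
    return triples[~is_redundant].reset_index(drop=True)


# Toy data only: three object terms forming a single subClassOf chain.
triples = pd.DataFrame(
    {
        "s": ["fan Martinotti neuron"] * 3,
        "o": ["biological_process", "secretion", "gamma-aminobutyric acid secretion"],
    }
)
subclass_pairs = pd.DataFrame(
    {
        "sub": ["secretion", "gamma-aminobutyric acid secretion", "gamma-aminobutyric acid secretion"],
        "super": ["biological_process", "secretion", "biological_process"],
    }
)
most_specific = keep_most_specific(triples, subclass_pairs)
```

With this toy input only the `gamma-aminobutyric acid secretion` row survives, because the other two objects each have a subclass that is also an object of the same subject.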

CASE2: Graph-view generation

Aim: simple redundancy stripping that does not assume users can deal with inheritance of properties down the class hierarchy.

{details and examples TBA}

Add redundancy stripping functions for UberGraph queries

We need a pair of functions for querying UberGraph that include stripping redundancy (see #28 for rationale, discussion and example use cases).

1. Given a set of subjects, a property and a target ontology, return all triples from the redundant graph with the specified subject and property and the most specific object class in the target ontology. The return value should be a pandas dataframe of triples following the standards established for all other pandasaurus queries.
2. As 1, but given a set of objects, returning the most specific subject.

See https://api.triplydb.com/s/C3Nf1qnYu for an example of the SPARQL query logic (needs modification to fit the above specs).
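
As a rough illustration of the first function, the redundancy stripping could be pushed into SPARQL with `FILTER NOT EXISTS`. The sketch below only builds the query string. It is not the query at the link above, and several things are assumptions: the function names, the named-graph IRI used for the redundant graph, restricting the target ontology by OBO IRI prefix, and the example IDs.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OBO = "http://purl.obolibrary.org/obo/"
# Assumption: IRI of the named graph holding UberGraph's redundant triples.
REDUNDANT_GRAPH = "http://reasoner.renci.org/redundant"


def obo_iri(curie: str) -> str:
    """Expand an OBO CURIE such as 'CL:0000617' to its PURL."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO}{prefix}_{local_id}"


def most_specific_objects_query(subjects, prop, target_prefix):
    """Build a SPARQL query (hypothetical helper, not part of pandasaurus)."""
    values = " ".join(f"<{obo_iri(curie)}>" for curie in subjects)
    p = f"<{obo_iri(prop)}>"
    target = f"{OBO}{target_prefix}_"
    return f"""
PREFIX rdfs: <{RDFS}>
SELECT DISTINCT ?s ?o WHERE {{
  VALUES ?s {{ {values} }}
  GRAPH <{REDUNDANT_GRAPH}> {{ ?s {p} ?o . }}
  FILTER(STRSTARTS(STR(?o), "{target}"))
  # Discard ?o when the same subject also links to a strict subclass of it.
  FILTER NOT EXISTS {{
    GRAPH <{REDUNDANT_GRAPH}> {{
      ?s {p} ?o2 .
      ?o2 rdfs:subClassOf ?o .
    }}
    FILTER(?o2 != ?o && STRSTARTS(STR(?o2), "{target}"))
  }}
}}
""".strip()


query = most_specific_objects_query(["CL:0000617"], "RO:0002215", "GO")
```

The second function would follow the same shape with the roles swapped: objects go in the `VALUES` clause and the `NOT EXISTS` block looks for a strict subclass of `?s`.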

EPIC - overview of functionality for initial release.

Ubergraph queries to populate

Given 2 seeds of classes S(s) and S(o): return all triples from the redundant graph for s subClassOf o as a simple Pandas dataframe with subject and object columns. We may expand in future to include objectProperties in seed, but OK to stay simple for now. (I’ll refer to this as enriched_df below).

Methods for deriving S(s) and S(o) from an initial seed, S(i), prior to calling basic query

Simple: S(s) = S(i); S(o) = S(i)
Minimal Slim enrichment: S(s) = S(i); S(o) = S(i) + all classes in some specified (set of) slims (where class in slim = class tagged with some specified ‘subset’ axiom)
Full Slim enrichment: as Minimal slim enrichment but with transitive query of the non-redundant graph (rdfs:subClassOf*)
Contextual enrichment: S(s) = S(i); S(o) = S(i) + all classes satisfied by some (set of) existential restrictions in the ubergraph redundant graph (e.g. part_of 'Kidney')
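
The relationship between the seed and the basic query can be sketched in a few lines. This is an illustration under assumptions, not the library's code: the function names, the `s`/`o` column names and the placeholder `EX:` IDs are hypothetical, and the redundant subClassOf triples are supplied as a local dataframe instead of being fetched from UberGraph.

```python
import pandas as pd


def derive_seeds(seed, extra_objects=()):
    """S(s) = S(i); S(o) = S(i) plus extra terms (e.g. slim members)."""
    return list(seed), [*seed, *extra_objects]


def enrich(subclass_closure: pd.DataFrame, subjects, objects) -> pd.DataFrame:
    """Keep rows whose subject is in S(s) and whose object is in S(o)."""
    keep = subclass_closure["s"].isin(subjects) & subclass_closure["o"].isin(objects)
    return subclass_closure[keep].reset_index(drop=True)


# Toy stand-in for redundant-graph subClassOf triples.
closure = pd.DataFrame(
    {
        "s": ["EX:a", "EX:a", "EX:b", "EX:a"],
        "o": ["EX:b", "EX:slim1", "EX:slim1", "EX:other"],
    }
)
seed = ["EX:a", "EX:b"]

simple_df = enrich(closure, *derive_seeds(seed))
minimal_slim_df = enrich(closure, *derive_seeds(seed, extra_objects=["EX:slim1"]))
```

Simple enrichment keeps only the `EX:a` to `EX:b` row, while adding the slim term to S(o) also keeps both rows that point at `EX:slim1`.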

It is entirely possible that users will provide invalid IDs in the initial seed, so we need to test for this:

Are the CURIE prefixes valid for UberGraph - throw exception/warning if not (option hard/soft fail)
Are the CURIEs valid terms in UberGraph - throw exception/warning if not (option hard/soft fail)
Do any CURIEs correspond to obsolete terms? (Option to update terms in the list via replaced_by?)
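
The hard/soft fail option for the prefix check might look something like the sketch below. The function name, the exception class and the idea of passing in a set of valid prefixes are assumptions; in practice the valid prefixes would have to come from UberGraph itself.

```python
import re
import warnings

# Loose CURIE shape: a prefix, a colon, then a non-empty local part.
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z][\w.]*):(?P<local>\S+)$")


class InvalidCurieError(ValueError):
    """Raised on hard fail when a seed contains unusable CURIEs."""


def check_prefixes(curies, valid_prefixes, hard_fail=True):
    """Return CURIEs that are malformed or use an unknown prefix.

    hard_fail=True raises InvalidCurieError; hard_fail=False only warns.
    """
    bad = []
    for curie in curies:
        match = CURIE_PATTERN.match(curie)
        if match is None or match.group("prefix") not in valid_prefixes:
            bad.append(curie)
    if bad:
        message = f"Invalid or unknown CURIEs in seed: {', '.join(bad)}"
        if hard_fail:
            raise InvalidCurieError(message)
        warnings.warn(message)
    return bad
```

The same hard/soft switch could be reused for the term-existence and obsolete-term checks, which need a query and are not shown here.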

Users will need some way to find which slims are available. To do this we should include methods for finding available slims and displaying their contents:

find_available_slims: query for all slims available in a specified ontology, returning name and definition.
show_slim_contents: list names and IDs of all terms in slim.

For all classes in enriched_df, query for labels and synonyms. Return as table with ID, label, synonyms (pipe sep) + column indicating presence in original seed (I’ll refer to this as name_lookup_df below)
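
A possible shape for name_lookup_df, as a sketch only: the builder function, the column names and the placeholder IDs are assumptions, and the labels and synonyms are passed in as a dict instead of being queried.

```python
import pandas as pd


def build_name_lookup(names, seed) -> pd.DataFrame:
    """names maps term ID -> (label, list of synonyms); seed is the original S(i)."""
    seed = set(seed)
    rows = [
        {
            "ID": term_id,
            "label": label,
            "synonyms": "|".join(synonyms),  # pipe-separated, empty if none
            "in_seed": term_id in seed,
        }
        for term_id, (label, synonyms) in names.items()
    ]
    return pd.DataFrame(rows, columns=["ID", "label", "synonyms", "in_seed"])


name_lookup_df = build_name_lookup(
    {
        "EX:1": ("neuron", ["nerve cell", "neurone"]),
        "EX:2": ("glial cell", []),
    },
    seed=["EX:1"],
)
```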

User query methods over dataframes

The dataframes are designed to be sufficiently simple that a bioinformatician with basic competency could work out how to use them to enrich their results. But I still think it is useful to wrap these.

Add sqlite as dependency for wrapping queries.

Basic query:

semantic_filter: (arg = dataframe + column name + name or synonym of query term) : result = filtered input dataframe via join on column to subject of enriched_df, looking up of object name or synonym via query of name_lookup.
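
To make the intended behaviour concrete, here is a sketch of semantic_filter written with plain pandas instead of the sqlite wrapper proposed above. The signature, the column names (`s`, `o`, `ID`, `label`, `synonyms`) and the toy data are assumptions.

```python
import pandas as pd


def semantic_filter(df, column, query_name, enriched_df, name_lookup_df):
    """Keep rows of df whose term in `column` is a subject of the queried object term."""
    names = name_lookup_df
    # Resolve the query string to term IDs via label or pipe-separated synonyms.
    is_match = (names["label"] == query_name) | names["synonyms"].apply(
        lambda synonyms: query_name in synonyms.split("|")
    )
    object_ids = names.loc[is_match, "ID"]
    # Subjects linked to any matching object in the enrichment table.
    subjects = enriched_df.loc[enriched_df["o"].isin(object_ids), "s"]
    return df[df[column].isin(subjects)].reset_index(drop=True)


enriched_df = pd.DataFrame(
    {"s": ["EX:neuron_a", "EX:neuron_b", "EX:glia_a"], "o": ["EX:neuron", "EX:neuron", "EX:glia"]}
)
name_lookup_df = pd.DataFrame(
    {"ID": ["EX:neuron", "EX:glia"], "label": ["neuron", "glial cell"], "synonyms": ["nerve cell", ""]}
)
results = pd.DataFrame(
    {"cell_type_id": ["EX:neuron_a", "EX:glia_a", "EX:neuron_b"], "n_cells": [10, 20, 30]}
)
neurons_only = semantic_filter(results, "cell_type_id", "nerve cell", enriched_df, name_lookup_df)
```

Here the synonym "nerve cell" resolves to `EX:neuron`, so only the two rows annotated with its subclasses are kept.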

Library should be released on PyPI

Implement tests using these Test cases

Cell types from https://cellxgene.cziscience.com/e/21d3e683-80a4-4d9b-bc89-ebb2df513dde.cxg/ + blood_and_immune_upper_slim
Cell types from https://cellxgene.cziscience.com/e/0b75c598-0893-4216-afe8-5414cab7739d.cxg/

contextual enrichment using terms from tissue field, e.g. part_of kidney or part_of 'renal medulla'

This project needs a better name!

Chat GPT3 suggests Pandasaurus

From that name (with a bit of prompting) Dall-e => this logo

Add ontology name validation in SlimManager

Background: INCATools/pandasaurus_cxg#8 (comment)

Review CURIE checking code for efficiency

Improve docstrings

The current doc strings are very hard to understand. They need to be made clearer and more explicit. Ideally they would also follow some standard that allows for automated generation of doc (e.g. sphinx)

Query class is responsible for returning the non-redundant graph for s subClassOf o as a simple Pandas dataframe
with given 2 seeds of classes, S(s) and S(o) from an initial seed, S(i).

I don't think it is clear what it might mean to return a non-redundant graph as a dataframe and what the seed classes might mean without some context. Here's an attempt at a more readable doc:

"""A Query object is initialised by passing a list of seed terms (where each term is a CURIE string, 
e.g. CL:0000001; all OBO standard curies are recognised). It generates a Pandas dataframe that 
enriches the seed list with synonyms and all inferred subClassOf relationships between terms in 
the seed. Additional methods allow enrichment with terms outside the seed from slims or specified 
by a semantic context.

Args: 
  :seed_list:  list of CURIE strings

Attributes:
  :enriched_df:   {Description of dataframe here}"""

(Might be better to put construction args pydoc on __init__? )

Please do the same for all methods.

Queries for graph generation

The default set of queries in Pandasaurus are designed to flatten content for dataframes - including only triples linking to terms provided in the arg. For graph generation we need to use the object list in the subject position, followed by a transitive reduction step. Rather than generating a dataframe output, I think it makes more sense to store these using some in-memory graph representation on the Pandasaurus object (e.g. as rdflib.Graph objects).

Action: for each dataframe enrichment, generate a reduced graph in parallel, stored on the object*

* I think this argues in favour of extension to PandaSaurus extending base objects rather than generating custom ones that call them.
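
The transitive reduction step can be sketched independently of the storage choice. The example below works on plain (subject, object) tuples, not rdflib.Graph objects, and assumes the edges form a DAG with no reflexive pairs. The function name is hypothetical.

```python
def transitive_reduction(edges):
    """Drop every edge (s, o) for which a longer path from s to o exists.

    edges: iterable of (subject, object) pairs forming a DAG, no (x, x) pairs.
    """
    edges = set(edges)
    targets = {}
    for s, o in edges:
        targets.setdefault(s, set()).add(o)

    def reachable(start):
        """All nodes reachable from start by following one or more edges."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in targets.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reduced = set()
    for s, o in edges:
        # (s, o) is redundant if o is reachable via another direct target of s.
        others = targets[s] - {o}
        if not any(o in reachable(mid) for mid in others):
            reduced.add((s, o))
    return reduced


reduced_edges = transitive_reduction({("a", "b"), ("b", "c"), ("a", "c")})
```

The shortcut edge from `a` to `c` is removed because `c` is still reachable through `b`. Reachability is recomputed for every edge here, which is fine for small seeds but would want caching on larger graphs.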

Parent lookup feature

Should we add a parent lookup just like synonym lookup? @dosumis

Add ancestor_enrichment

Copied from INCATools/pandasaurus_cxg#46

add another enrichment method to the existing set.

User specifies number of hops

name: ancestor_enrichment (?)
Args: number of hops

subject - terms in seed
object - terms in seed + terms 1-n hops from terms in subject.
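
The object set for this method could be computed with a bounded breadth-first walk. The sketch below assumes a local mapping from each term to its direct parents (as would come from the non-redundant graph); the function name and that mapping are illustrative only.

```python
def ancestors_within(seed, parents, hops):
    """Return the seed terms plus every term 1..hops subClassOf steps above them.

    parents: dict mapping a term to an iterable of its direct parents.
    """
    objects = set(seed)
    frontier = set(seed)
    for _ in range(hops):
        # Step one hop up from the current frontier, skipping known terms.
        frontier = {p for term in frontier for p in parents.get(term, ())} - objects
        if not frontier:
            break
        objects |= frontier
    return objects


parents = {"a": ["b"], "b": ["c"], "c": ["d"]}
two_hops = ancestors_within(["a"], parents, hops=2)
```

With the chain a, b, c, d and two hops from `a`, the object set is `a`, `b` and `c`; `d` is three hops away and is left out.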

E203 whitespace before ':'

Decide whether to keep or ignore this error, it clashes with Black convention

Design architecture

Sketch out repo structure, packages structure, class and method names.

We should aim for test driven development with docstrings for all methods & general doc suitable for publicising on PyPI.

Implement core methods in Pandasaurus

See #1 for details

cellXGene_extension

Background & use cases.

The CxG schema used by the CellXGene app standardises key names and ontology values for recording sample metadata & cell type annotation for AnnData format cellXGene matrices, constructed from single cell transcriptomics data. The primary aim of the CxG extension to pandasaurus is to support enrichment and filtering of matrices by cell-type/class using CL ontology structure.

It will do this by generating the initial term list from the cell_type field in the CxG standard AnnData file. For contextual enrichment it will use the CxG standard tissue field and query UberGraph for 'cell' AND 'part_of {tissue}'.

We are developing an extension to the CxG schema - which we call the CxG metaschema. This overcomes the flat nature of the current schema - by using JSON to type and link fields. One major usage of this schema will be to type free text cell type fields and infer relative granularity between these annotations and cell_type ontology annotation from co-annotation, taking advantage of ontology enrichment. Details TBA.

Decisions:

Q: Should this be a separate library: pandasaurus_cxg, importing pandasaurus?
A: Yes. Pandasaurus will have other, completely unrelated uses.
Q: Should this library package the CxG metaschema testing, or should this be in a separate lib & imported?
A: Provisional - separate lib, imported.

Features

Pandasaurus CxG extension should load a CxG standard AnnData file, check field compliance* and report:

Absence of standard fields
Any fields following standards that do not have expected content
check metaschema for compliance (if present)

* Use standard validator.

Methods

Update ontology terms (warn before update and provide report).

Update labels if they have changed
Update obsoletes using replaced_by

Highest priority - cell type queries:

Show Cell Type enrichment slims (in the cell ontology): Return names and descriptions of slims.
Enrich cell types - standard pandasaurus enrichment with list terms in cell_type field as input & optional choice of slims
Enrich cell types by anatomical context: Uses the anatomical context in the CxG Anndata file to enrich via the pandasaurus.contextual_enrichment method.
Filter matrix:
Generates a new matrix containing only terms that are subclasses of those in some specified filter set, where the filter set is chosen from terms in the enrichment table.

Minor Bug Fixes and Improvements - Manual Testing Notes

Issues/bugs:

object_list in the enrichment method should be a set: the source list plus terms from other sources might contain duplicates.
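
A small sketch of the fix, assuming a hypothetical helper name. A plain `set` would remove the duplicates but lose the input order, so this version uses `dict.fromkeys`, which keeps the first occurrence of each term in order.

```python
def merge_object_terms(source_list, extra_terms):
    """Combine two term lists, dropping duplicates but keeping first-seen order."""
    return list(dict.fromkeys([*source_list, *extra_terms]))


object_list = merge_object_terms(["EX:1", "EX:2"], ["EX:2", "EX:3", "EX:1"])
```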

Use poetry for dependency management and publish package

Poetry documentation on pyproject.toml and common commands.

Major functionality missing = synonym lookup

Original plan was to put this in list in enriched_df. I think this is not ideal for queries as we need people to be able to specify single strings when running filters etc. May be better to have second table with:

ID; name; type

Where the name column takes labels and synonyms & the type column records whether this is a label or synonym+scope.
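
The proposed second table could be built as in the sketch below. The input structure, the function name and the `synonym:<scope>` encoding of the type column are assumptions chosen for illustration. The point is that a single string match on `name` finds a term whether the string is its label or one of its synonyms.

```python
import pandas as pd


def build_names_table(terms) -> pd.DataFrame:
    """terms maps ID -> {'label': str, 'synonyms': [(scope, text), ...]}."""
    rows = []
    for term_id, info in terms.items():
        rows.append((term_id, info["label"], "label"))
        for scope, synonym in info.get("synonyms", []):
            rows.append((term_id, synonym, f"synonym:{scope}"))
    return pd.DataFrame(rows, columns=["ID", "name", "type"])


names_table = build_names_table(
    {
        "EX:1": {"label": "neuron", "synonyms": [("exact", "nerve cell")]},
        "EX:2": {"label": "glial cell"},
    }
)
# Look up a term by a single string, label or synonym alike.
matching_ids = names_table.loc[names_table["name"] == "nerve cell", "ID"].tolist()
```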

Add code coverage automation

From Huseyin's comment:

Can we make automatic code coverage tests part of CI? Recently I tried Codecov. It's free for public repos. You can configure it to run as part of your CI github action (example action). It will automatically calculate the code coverage change for each pull request and make a comment (hkir-dev/brain_data_standards_ontologies#1 (comment)).

Also it would be wonderful if we can find similar apps for code style and quality checks.

Report slim coverage

Method to report how well a slim covers a set of terms - provided as a list or specified as an ubergraph query.

By default, coverage is tested via subClassOf, but other relations can be (optionally) specified. Ancestors of terms in the slim are not counted. The result should be a simple percentage.
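
The coverage calculation itself is simple once ancestors are known. In this sketch the function name and the `ancestors` mapping (term to its subClassOf ancestors, as might be read from the redundant graph) are assumptions, as is counting a term that is itself in the slim as covered.

```python
def slim_coverage(terms, slim, ancestors):
    """Percentage of terms that are in the slim or have an ancestor in it.

    ancestors: dict mapping a term to an iterable of its ancestors.
    Terms that are only ancestors of slim terms are not counted as covered.
    """
    terms, slim = set(terms), set(slim)
    if not terms:
        return 0.0
    covered = {
        term
        for term in terms
        if term in slim or slim.intersection(ancestors.get(term, ()))
    }
    return 100.0 * len(covered) / len(terms)


coverage = slim_coverage(
    terms=["a", "b", "c", "d"],
    slim=["s1"],
    ancestors={"a": {"s1", "root"}, "b": {"root"}, "c": {"s1"}},
)
```

Two of the four toy terms sit under `s1`, so the result is 50.0. Supporting other relations would mean building the `ancestors` mapping from those properties as well.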

Mock run_sparql_query method in test cases

Background: #17 (comment)
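
A sketch of what mocking could look like with `unittest.mock`. To stay self-contained it patches `run_sparql_query` on a small stand-in class. `SlimFinder`, `find_slims` and the canned rows are hypothetical, and in the real code base the patch target would be wherever `run_sparql_query` is actually looked up.

```python
from unittest import TestCase
from unittest.mock import patch


class SlimFinder:
    """Hypothetical stand-in for code that depends on run_sparql_query."""

    def run_sparql_query(self, query):
        raise RuntimeError("network access is not allowed in unit tests")

    def find_slims(self, ontology):
        rows = self.run_sparql_query(f"# slims for {ontology}")
        return sorted(row["label"] for row in rows)


class FindSlimsTest(TestCase):
    def test_find_slims_uses_canned_rows(self):
        canned = [{"label": "b_slim"}, {"label": "a_slim"}]
        # Replace the network call with canned rows for the duration of the test.
        with patch.object(SlimFinder, "run_sparql_query", return_value=canned) as mocked:
            result = SlimFinder().find_slims("CL")
        self.assertEqual(result, ["a_slim", "b_slim"])
        mocked.assert_called_once()
```

Run with `python -m unittest`. The test never reaches the endpoint, so it stays fast and does not break when UberGraph content changes.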