incatools / pandasaurus_cxg

Ontology enrichment tool for CxG standard AnnData files

License: Apache License 2.0

Python 36.18% Jupyter Notebook 63.75% Shell 0.06%

pandasaurus_cxg's Issues

Investigate switching graph generation to use https://github.com/aws/graph-notebook

Could this provide a more flexible python-native route to graph generation than the current one?

https://github.com/aws/graph-notebook

Support cell set graph generation

Generate RDF graph of cell sets & their subsets, with annotations & visualise the results as a graph

Generate individuals for cell sets;
add subClusterOf relationships between cell sets (and redundancy strip?)
Support graph generation (NetworkX or OAKLIB)

Contextual enrichment should restrict to cells

Inspecting the RDF graph generated from the kidney example file shows that it contains Uberon terms that have CL IDs.

This suggests 2 bugs that need fixing:

- Contextual enrichment in pandasaurus_cxg should be restricted to subClassOf CL:cell. This does no harm to graph/enrichment generation, so it is not urgent to fix.
- #36: This is quite serious, as we don't want files with incorrect IDs getting out into the wild. In general we should avoid rewriting IRIs and instead take them as-is from source.

Add additional annotations based on enrichment

The enrichment allows us to find all cells that are instances of any CL term in the enrichment table. For any set of non-overlapping enrichment terms, we can add a new obs field with the corresponding annotations.

Before implementing - add example to Jupyter Notebook & chat with Evan & Mary about whether worth adding as a method

User chooses a set of one or more terms from the enrichment table and a name for a new obs field.
Test if these classes are subclasses of each other:
- if not, add new obs field pair (name & ID following standard CxG naming convention).
- otherwise fail with warning indicating which classes are subclasses of each other.
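
The check described above could be sketched as follows. This is a minimal illustration in plain Python, not the pandasaurus_cxg API: `find_subclass_pairs` and `add_annotation_field` are hypothetical names, obs is modelled as a list of dicts, and the enrichment table is assumed to be available as (subject, object) subClassOf pairs that already include indirect (transitive) relationships.

```python
def find_subclass_pairs(terms, enrichment_pairs):
    """Return (subclass, superclass) pairs where both terms were chosen by the user."""
    chosen = set(terms)
    return [(s, o) for s, o in enrichment_pairs if s in chosen and o in chosen and s != o]


def add_annotation_field(obs_rows, field_name, terms, enrichment_pairs, labels):
    """Add `<field_name>` and `<field_name>_ontology_term_id` to every obs row.

    obs_rows: list of dicts, each with a 'cell_type_ontology_term_id' key.
    terms: CL IDs chosen from the enrichment table.
    labels: dict mapping CL ID -> label (hypothetical helper input).
    """
    conflicts = find_subclass_pairs(terms, enrichment_pairs)
    if conflicts:
        # Fail, reporting which classes are subclasses of each other.
        raise ValueError(f"Chosen terms overlap (subclass, superclass): {conflicts}")
    chosen = set(terms)
    # Map each enriched term to the chosen term it falls under; chosen terms map to themselves.
    # Note: a term with two chosen ancestors keeps only the last one seen.
    target = {s: o for s, o in enrichment_pairs if o in chosen}
    target.update({t: t for t in chosen})
    for row in obs_rows:
        term = target.get(row["cell_type_ontology_term_id"])
        row[field_name] = labels.get(term) if term else None
        row[f"{field_name}_ontology_term_id"] = term
    return obs_rows
```

Cells whose cell type falls under none of the chosen terms get `None` in both new fields; whether that is the right default is an open design choice.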

Add function to suggest mappings to CL terms based on lexical matches

OAK has lexmap function(s) that can fuzzy match strings to ontology terms based on labels/synonyms. Pandasaurus enrichment allows us to narrow down potential mappings for cell types to a branch of CL.

New function: use author cell type names to find candidate lexical mappings from among the subclasses of the CL term. Investigate using LEXMAP for this. For an MVP just use direct annotation (subcluster_of entries in the co-annotation table). In future we may extend this to use cell sets.

Fix doc errors

Roadmap and snippets in doc links point to deleted branch

String for return value incomplete/split on create_cell_type_dict()

Minor Bug Fixes and Improvements - Manual Testing Notes

Issues/bugs:

Edge labels were displaying IDs instead of labels. Text wrapping should also be applied to edge labels.
Cell set and CL terms should have different colours.
Some nodes were missing types; generate_subgraph in graph_generator_utils.py needs some updates.
The AnndataEnrichmentAnalyzer initialiser has an issue related to type hints.
from_file_path in AnndataEnricher is missing a parameter.

Questions:

How should we handle a single dataset with a single cell type (e.g. https://cellxgene.cziscience.com/e/c477bce4-66b3-4495-a8bb-4b51d88f65a1.cxg/)?

Stop rewriting Uberon IDs as CL IDs in RDF generation.

Add new enrichment method

add another enrichment method to the existing set.

User specifies number of hops

name: ancestor_enrichment (?)
Args: number of hops

subject - terms in seed
object - terms in seed + terms 1-n hops from terms in subject.
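
One possible reading of this proposal is sketched here in plain Python. The function name follows the tentative `ancestor_enrichment` above; the `parents` mapping (term to direct superclasses) is a hypothetical stand-in for whatever ontology query would supply it. In this sketch, objects are limited to terms reachable within `n_hops` of each subject, which is an assumption about the intended behaviour.

```python
def ancestor_enrichment(seed, parents, n_hops):
    """Return sorted (subject, object) pairs, where object is 1..n_hops above subject.

    seed: iterable of term IDs.
    parents: dict mapping a term ID to an iterable of its direct superclasses.
    """
    pairs = set()
    for subject in seed:
        frontier = {subject}
        seen = {subject}
        for _ in range(n_hops):
            # Step one hop up from every term on the current frontier.
            frontier = {p for term in frontier for p in parents.get(term, ())} - seen
            if not frontier:
                break
            seen |= frontier
            pairs.update((subject, obj) for obj in frontier)
    return sorted(pairs)
```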

Update Jupyter Notebook to tell user-focused story

Current Notebook content is good but descriptions are too abstract. Needs to be more user-focused.

RDF representation should include dataset object populated with properties from h5ad `uns`

Dataset object should be linked to all cell sets via http://purl.org/dc/terms/source

(:Cluster)-[:has_source]->(ds:Dataset)

Property details (from VFB representation)
{
"iri": "http://purl.org/dc/terms/source",
"short_form": "source",
"label": "has_source",
"type": "Annotation"
}

anndata dependency needs specifying?

GraphGeneration.visualize_rdf_graph needs pydoc

In general we need better documentation on how to run this:

How is the node list used to generate graphs?
What is the predicate argument used for?

This should be in PyDoc, but we also need examples in notebook of how we expect users to discover nodes.

Support basic metaschema for flagging cell-type fields

stored in anndata.uns['obs_meta']:

[
  {
    "field_name": "some_field_name",
    "field_type": "author_cell_type_label"
  },
  ...
]

With this, you can drop the need for a schema on the co-annotation method.
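
As a rough illustration of how such a metaschema could replace an explicit schema argument, the helper below looks up obs field names by type. `fields_of_type` is a hypothetical name, `uns` is modelled as a plain dict, and the `"other"` field type in the example data is invented for the illustration.

```python
def fields_of_type(uns, field_type):
    """Return obs field names whose 'obs_meta' entry has the given field_type."""
    return [
        entry["field_name"]
        for entry in uns.get("obs_meta", [])
        if entry.get("field_type") == field_type
    ]


# Example metaschema as it might appear in anndata.uns (illustrative values only).
example_uns = {
    "obs_meta": [
        {"field_name": "author_cell_type", "field_type": "author_cell_type_label"},
        {"field_name": "donor_id", "field_type": "other"},
    ]
}
cell_type_fields = fields_of_type(example_uns, "author_cell_type_label")
```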

Pandasaurus_CxG should support authoring as well as reading.

Design extended notebook that includes graphs

Sketch

Use Kidney example

Generate co-annotation report & cell sets
Visualise graph of cell sets
Enrichment
Visualise graph of cell sets + ontology enrichment.

Add function to filter AnnData matrix by Cell Type

The enrichment allows us to find all cells that are instances of any CL term in the enrichment table. We should be able to generate a new matrix including only expression data for cells that are instances of this type.

Before implementing - add example to Jupyter Notebook & chat with Evan & Mary about whether worth adding as a method.

Add label setting function for co-annotation graph generation

Support option of choosing only 'normal' cells for analysis

For canonical atlas building, it is important to be able to focus on 'normal' cells only. These have a disease field tagged with the PATO term for normal.

Aim: an option to choose to use only cells tagged with 'disease: normal' in co-annotation analysis.

(Might want to consider this being built on a general method for choosing disease)
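
A minimal sketch of the general method suggested above, with 'normal' as the default. Obs is modelled as a list of dicts with a `disease` key holding the label; `select_by_disease` is a hypothetical name, not an existing method.

```python
def select_by_disease(obs_rows, disease="normal"):
    """Keep only the obs rows whose 'disease' label equals the requested value."""
    return [row for row in obs_rows if row.get("disease") == disease]
```

Co-annotation analysis would then run on the returned subset rather than on the full obs table.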

refactor to call transitive reduction method from pandasaurus

Jupyter notebook snippets needed

Examples of usage as doc for users.

The two running example datasets we've been working with are good for that. (Probably need to grab via get request from CxG).
I can provide an additional example with PCL.

Supported workflows

Wrapper object

User initialises with a CxG anndata file.
this wraps analyzer and enricher objects

Analyzer object

User initialises, choosing free-text cell type fields --> analysis (#14)

Enricher object

Runs enrichment method => enrichment tables + graphs (both stored on the object). Graphs are RDF, transitively reduced, and stored in memory.

Graph generation (on wrapper object)

User specifies priority list of obs to use for label --> generates RDF graph of co-annotation (transitively reduced) (#30)
Generate enriched graph (combining with co-annotation graph)
Outputs:
- Visualize graphs with networkX
- Save as rdf
- Return networkX object.

Tasks:

Provide term names for contextual enrichment terms

ade._AnndataEnricher__context_list holds IDs only. It would be useful to have a method that displays the context list with labels.

Add method to query CL slims

Wrapper for pandasaurus.slim_manager that pre-fills CL 'Cell Ontology' as default in get_slim_list. Users coming to this naively will not know precisely what string to put in the Ontology arg.

MVP / proof of concept release

Load CxG Anndata file:
- Run minimal enrichment using contents of cell_type field
- Support enrichment via slims (will already work given basic Pandasaurus functionality)
- Support contextual enrichment via contents of CxG tissue field & part_of
- Filter matrix by cell type from enrichment table.

Extend graph visualization to give more flexibility in filtering

STATUS: DRAFT

Choosing nodes:
- Users should be able to refer to cell sets and classes by rdfs:label.
- User should also be able to refer to cell sets by obs key + value.
Interpretation of list of input nodes:
- Bottom up: Choose cell sets and show graph above these.
- Top down: Choose classes and show graph below these.

Support CxG schema validation

It should be possible to use cellxgene-schema

See https://github.com/chanzuckerberg/single-cell-curation/blob/main/cellxgene_schema_cli/tests/test_schema_compliance.py for direct use (doc for the PyPI lib is focussed on command line usage).

Record percentage contribution of cells defined by CxG standard obs fields on each cell set

Represent this as edges linking cell sets to nodes representing ontology terms used in obs. The edge property should be the obs key. The edge should also include the percentage of cells in the cell set that have the specified property.
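
The edge data described above could be computed along these lines. This is a sketch in plain Python with obs modelled as a list of dicts; `obs_contribution` and its parameters are hypothetical names, and the output is a list of tuples rather than RDF.

```python
from collections import Counter, defaultdict


def obs_contribution(obs_rows, cell_set_key, obs_keys):
    """Return (cell_set, obs_key, obs_value, percentage) edges.

    cell_set_key: obs field whose values define the cell sets.
    obs_keys: CxG standard obs fields to summarise on each cell set.
    """
    members = defaultdict(list)
    for row in obs_rows:
        members[row[cell_set_key]].append(row)

    edges = []
    for cell_set, rows in members.items():
        for key in obs_keys:
            counts = Counter(row[key] for row in rows)
            for value, n in counts.items():
                # Percentage of cells in this cell set carrying this obs value.
                edges.append((cell_set, key, value, round(100 * n / len(rows), 2)))
    return edges
```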

Pandasaurus_CxG Functional Spec

STATUS: DRAFT

Background & use cases.

The CxG schema used by the CellXGene app standardises key names and ontology values for recording sample metadata & cell type annotation for AnnData-format CellXGene matrices constructed from single cell transcriptomics data. The primary aim of the CxG extension to pandasaurus is to support enrichment and filtering of matrices by cell-type/class using CL ontology structure.

It will do this by generating the initial term list from the cell_type field in the CxG standard AnnData file. For contextual enrichment it will use the CxG standard tissue field and query UberGraph for 'cell' AND 'part_of {tissue}'.

We are developing a meta-schema extension (celltag schema) to the [CxG schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md). This overcomes the flat nature of the current schema by using JSON to type and link fields. One major use of this schema will be to type free-text cell type fields and infer relative granularity between these annotations and cell_type ontology annotation from co-annotation, taking advantage of ontology enrichment. Details TBA.

Dependencies:

pandasaurus
cell_tag_schema
cellxgene-schema - for validation.

Features

Pandasaurus CxG extension should load a CxG standard AnnData file, check field compliance*, and report:

Absence of standard fields
Any fields following standards that do not have expected content
check metaschema for compliance (if present)

* Use standard validator.

Methods

Update ontology terms (warn before update and provide report).

Update labels if they have changed
Update obsoletes using replaced_by

Highest priority - cell type queries:

Show Cell Type enrichment slims (in the cell ontology): Return names and descriptions of slims.
Enrich cell types - standard pandasaurus enrichment with list terms in cell_type field as input & optional choice of slims
Enrich cell types by anatomical context: Uses the anatomical context in the CxG Anndata file to enrich via the pandasaurus.contextual_enrichment method.
Filter matrix:
Generates a new matrix containing only cells annotated with terms that are subclasses of those in some specified filter set, where the filter set is chosen from terms in the enrichment table.
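
The matrix filter could work roughly as follows. This sketch uses plain Python lists in place of an AnnData matrix, with one cell type ID per matrix row; `filter_by_cell_type` is a hypothetical name, and the enrichment table is assumed to be available as (subject, object) subClassOf pairs.

```python
def filter_by_cell_type(matrix, cell_type_ids, filter_terms, enrichment_pairs):
    """Keep the rows whose cell type is a filter term or a subclass of one.

    matrix: list of expression rows, one per cell.
    cell_type_ids: CL ID of each cell, in the same order as matrix.
    """
    wanted = set(filter_terms)
    # Filter terms themselves plus every enrichment subject that sits under one of them.
    keep_terms = wanted | {s for s, o in enrichment_pairs if o in wanted}
    keep = [i for i, term in enumerate(cell_type_ids) if term in keep_terms]
    return [matrix[i] for i in keep], [cell_type_ids[i] for i in keep]
```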

Priority 2

Analyse co-occurrence & use to generate draft hierarchy of ontology terms and free text annotations - extending CL.

Figure showing intended functionality (some work to do to completely align this with functional spec)

cellXGene metaschema = celltag_schema - intended for more generic use than just CxG metaschema.

Support co-annotation analysis

Multiple fields are tagged as representing cell type.

Use of each value in a cell type field defines a cell set.

We can analyse co-occurrence of these values to infer relative granularity:

set(X) cluster_overlaps set(Y)
set(X) cluster_matches set(Y)
set(x) subcluster_of set(y)

field_name1; value1; predicate; field_name2; value2

e.g.

field_name1	value1	predicate	field_name2	value2
author_category	TissueResMemT	cluster_matches	cell_type	memory T cell

To analyse co-occurrence in AnnData files we can just do this:

anndata.obs[['author_cell_type', 'cell_type']].drop_duplicates()

(The example looks at co-occurrence of just 2 fields.) The resulting dataframe can be analysed for co-occurrence of key:value pairs.

See #7 (comment) for examples of inferring relationships from co-annotation analysis.
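
The three predicates can be derived by comparing the sets of cells behind each value. This sketch is not the library's implementation: it works on per-cell membership sets (rather than the de-duplicated dataframe), models obs as a list of dicts, and uses hypothetical function names. When the first set strictly contains the second, the row is emitted in reverse so that only the three predicates listed above are needed; that convention is an assumption.

```python
from collections import defaultdict


def cell_sets(obs_rows, field):
    """Map each value of `field` to the set of cell indices carrying it."""
    sets = defaultdict(set)
    for i, row in enumerate(obs_rows):
        sets[row[field]].add(i)
    return sets


def co_annotation(obs_rows, field1, field2):
    """Return (field_name1, value1, predicate, field_name2, value2) rows."""
    report = []
    for v1, s1 in cell_sets(obs_rows, field1).items():
        for v2, s2 in cell_sets(obs_rows, field2).items():
            if not s1 & s2:
                continue  # no shared cells, nothing to report
            if s1 == s2:
                report.append((field1, v1, "cluster_matches", field2, v2))
            elif s1 < s2:
                report.append((field1, v1, "subcluster_of", field2, v2))
            elif s1 > s2:
                report.append((field2, v2, "subcluster_of", field1, v1))
            else:
                report.append((field1, v1, "cluster_overlaps", field2, v2))
    return report
```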

Co-annotation analysis - enrichment extension

Are any of the CL terms in the initial seed (the set of CL terms used to initialise the Pandasaurus object) also in the object column of the enrichment table?

If so, repeat co-annotation analysis including everything that maps to this term, directly or via the enrichment table.

e.g.

Co-annotation (in this case, everything is 1:1 => predicate cluster_matches).

print(ade._anndata.obs[['author_cell_type', 'cell_type', 'cell_type_ontology_term_id']].drop_duplicates().to_csv(sep='\t'))

author_cell_type	cell_type	cell_type_ontology_term_id
naive B cell	naive B cell	CL:0000788
memory B cell	memory B cell	CL:0000787
gamma-delta T cell	gamma-delta T cell	CL:0000798
plasmablast	plasmablast	CL:0000980
regulatory T cell	regulatory T cell	CL:0000815
CD4-positive, alpha-beta memory T cell	CD4-positive, alpha-beta memory T cell	CL:0000897
CD8-positive, alpha-beta memory T cell	CD8-positive, alpha-beta memory T cell	CL:0000909
naive CD8+ T cell	naive thymus-derived CD8-positive, alpha-beta T cell	CL:0000900
naive CD4+ T cell	naive thymus-derived CD4-positive, alpha-beta T cell	CL:0000895
mucosal invariant T cell (MAIT)	mucosal invariant T cell	CL:0000940
TissueResMemT	memory T cell	CL:0000813
double-positive T cell (DPT)	double-positive, alpha-beta thymocyte	CL:0000809
double negative T cell (DNT)	double negative thymocyte	CL:0002489
TCRVbeta13.1pos	T cell	CL:0000084

enrichment table:

print(ade.minimal_slim_enrichment(['simple_term_enrichment']).to_csv(sep='\t'))

	s	s_label	p	o	o_label
0	CL:0000895	naive thymus-derived CD4-positive, alpha-beta T cell	rdfs:subClassOf	CL:0000084	T cell
1	CL:0000897	CD4-positive, alpha-beta memory T cell	rdfs:subClassOf	CL:0000084	T cell
2	CL:0000900	naive thymus-derived CD8-positive, alpha-beta T cell	rdfs:subClassOf	CL:0000084	T cell
3	CL:0000909	CD8-positive, alpha-beta memory T cell	rdfs:subClassOf	CL:0000084	T cell
4	CL:0000940	mucosal invariant T cell	rdfs:subClassOf	CL:0000084	T cell
5	CL:0002489	double negative thymocyte	rdfs:subClassOf	CL:0000084	T cell
6	CL:0000897	CD4-positive, alpha-beta memory T cell	rdfs:subClassOf	CL:0000813	memory T cell
7	CL:0000909	CD8-positive, alpha-beta memory T cell	rdfs:subClassOf	CL:0000813	memory T cell
8	CL:0000798	gamma-delta T cell	rdfs:subClassOf	CL:0000084	T cell
9	CL:0000809	double-positive, alpha-beta thymocyte	rdfs:subClassOf	CL:0000084	T cell
10	CL:0000813	memory T cell	rdfs:subClassOf	CL:0000084	T cell
11	CL:0000815	regulatory T cell	rdfs:subClassOf	CL:0000084	T cell

T cell (CL:0000084) is in the original seed (from obs.cell_type...) and in the object column of the enrichment table.

Co-occurrence analysis without enrichment => TCRVbeta13.1pos cluster_matches T cell (CL:0000084)

Co-occurrence analysis with enrichment =>

TCRVbeta13.1pos subClusterOf T cell (CL:0000084)
mucosal invariant T cell (MAIT) subClusterOf T cell (CL:0000084)
...

(Note: working through like this makes it clear that we need to start making the distinction between ontology terms (where subClassOf is appropriate) and cell sets defined by tagging with ontology terms (cell sets are related by subClusterOf). This will be important for the model, but needs to be handled carefully for user-facing assertions. For now, I think we need to gloss over this distinction.)
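
The effect of the enrichment step on cell sets can be illustrated with a three-cell subset of the example above. This is a sketch under stated assumptions: `enriched_cell_sets` is a hypothetical name, cells are identified by list position, and only seed terms that appear as objects in the enrichment pairs are extended.

```python
def enriched_cell_sets(cell_type_ids, enrichment_pairs):
    """Map each seed CL term to the cells annotated with it or with any of its subclasses."""
    direct = {}
    for i, term in enumerate(cell_type_ids):
        direct.setdefault(term, set()).add(i)
    # Start from direct annotation, then fold in cells from enriched subclasses.
    sets = {term: set(cells) for term, cells in direct.items()}
    for s, o in enrichment_pairs:
        if s in direct and o in sets:
            sets[o] |= direct[s]
    return sets


# One cell per author label, taken from the table above.
authors = ["TCRVbeta13.1pos", "mucosal invariant T cell (MAIT)", "TissueResMemT"]
cell_type_ids = ["CL:0000084", "CL:0000940", "CL:0000813"]
pairs = [("CL:0000940", "CL:0000084"), ("CL:0000813", "CL:0000084")]

t_cell_set = enriched_cell_sets(cell_type_ids, pairs)["CL:0000084"]
tcrv_set = {i for i, a in enumerate(authors) if a == "TCRVbeta13.1pos"}
# Strict subset, so the relation becomes subClusterOf instead of cluster_matches.
is_subcluster = tcrv_set < t_cell_set
```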

Release on PyPI as beta

Needed today or tomorrow for grant application.

Set IRIs

consist_of --> iri: "http://purl.obolibrary.org/obo/RO_0002473", label: "composed_primarily_of"
subcluster_of --> iri: "http://purl.obolibrary.org/obo/RO_0015003", label: "subcluster of"
cluster --> iri: "http://purl.obolibrary.org/obo/PCL_0010001", label: "cell cluster" # This one might change

Generate graphs of cell sets + cell ontology (enrichment) classes

One of the main aims of Pandasaurus_CxG is to use it to build a graph/tree showing the implicit hierarchy of cell (type) annotations based on co-annotation, relating this back to the ontology subClass graph.

To do this we should:

Unify sets of annotations on the same cell_set (all labels linked by cluster_matches).
Generate a graph using subClusterOf between cell sets.
Relate cell sets to ontology terms using some standard relation (consists of?)
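
The first step, unifying labels linked by cluster_matches, amounts to finding connected components. A small union-find sketch is shown here; `unify_matching_labels` is a hypothetical name, and labels are modelled as (obs key, value) tuples. Every label named in `matches` is assumed to also appear in `labels`.

```python
def unify_matching_labels(labels, matches):
    """Group labels connected by cluster_matches into one cell set each.

    labels: list of (obs_key, value) tuples.
    matches: list of (label_a, label_b) pairs related by cluster_matches.
    """
    parent = {label: label for label in labels}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)

    groups = {}
    for label in labels:
        groups.setdefault(find(label), set()).add(label)
    return sorted(sorted(group) for group in groups.values())
```

Each returned group would become a single cell set node carrying all of its labels, which also keeps track of which obs key each label came from.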

We should then support display of a simple graph of cell sets (MVP) & one extended using ontology (subClassOf) hierarchy to terms in the enrichment. To make this manageable for large datasets it should be possible to choose to generate a graph visualisation from some set of cell sets.

Note we still need to keep track of labelsets (the set of terms under one obs).

REPRESENTATION: represent in RDF as individuals. We should follow the schema in Tan et al., 2022 and generate using SPARQL UPDATE.
TBD: How to translate OBS key names into APs.

Outputs: It should be possible to save the resulting RDF to disk. It can be used as input to OBASK.

VISUALIZATION: Use OAKLIB. It works nicely with OWL.

Difficulties:

Visualization requires a choice of which keys to use to label nodes, but we can't know the obs names a priori.
- SOLUTION: Users get choice of preferred keys for visualization on graph visualization generation method.
- #30

Breakdown:

Extend metadata pulled from CxG H5AD

#67
#65
- represent this as edges linking cell sets to nodes representing ontology terms used in obs. The edge property should be the obs key. The edge should also include the percentage of cells in the cell set that have the specified property.

Add wrapper object that initialises and contains enricher and analyzer objects

Extend cell set graphs with (cell) ontology classification from enrichment

Extend the cell type RDF graph with:

rdfs:type: consists_of some { CL } to nodes in the cell-set graph that correspond to
A non-redundant subClassOf hierarchy above the classes and relationships added in step 1.

Needed: a general redundancy stripping step, as Ubergraph inevitably produces a flattened representation of the subClassOf hierarchy.
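
Redundancy stripping of this kind is a transitive reduction. The sketch below is a plain-Python illustration, not the pandasaurus method: `transitive_reduction` is a hypothetical standalone function, and the input is assumed to be an acyclic list of (subclass, superclass) edges.

```python
def transitive_reduction(edges):
    """Drop every edge (s, o) that is implied by a longer path from s to o.

    edges: list of (subclass, superclass) pairs forming a DAG.
    """
    parents = {}
    for s, o in edges:
        parents.setdefault(s, set()).add(o)

    def ancestors(start):
        """All terms reachable by following superclass edges from start."""
        seen, stack = set(), list(parents.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return seen

    reduced = []
    for s, o in edges:
        # Redundant if o can be reached through another direct parent of s.
        if not any(o in ancestors(mid) for mid in parents[s] if mid != o):
            reduced.append((s, o))
    return reduced
```

Applied to three rows of the enrichment table shown earlier (CL:0000897 to CL:0000084, CL:0000897 to CL:0000813, CL:0000813 to CL:0000084), this removes the direct CL:0000897 to CL:0000084 edge and keeps the other two.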