incatools / pandasaurus_cxg
Ontology enrichment tool for CxG standard AnnData files
License: Apache License 2.0
Could this provide a more flexible python-native route to graph generation than the current one?
Generate RDF graph of cell sets & their subsets, with annotations & visualise the results as a graph
Inspecting the RDF graph generated from the kidney example file shows that it contains Uberon terms that have CL IDs.
This suggests 2 bugs that need fixing:
The enrichment allows us to find all cells that are instances of any CL term in the enrichment table. For any set of non-overlapping enrichment terms, we can add a new obs field with the corresponding annotations.
Before implementing - add example to Jupyter Notebook & chat with Evan & Mary about whether worth adding as a method
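A minimal sketch of the obs-field idea, written in plain Python so it runs without dependencies. With a real AnnData object the same mapping would be applied to `adata.obs`. The function name `add_enrichment_field` and the field name `enriched_cell_type` are hypothetical, and the enrichment rows are reduced to (subject, object) ID pairs.

```python
def add_enrichment_field(obs_rows, enrichment, chosen_terms, new_field):
    """Return copies of obs_rows with new_field set to the chosen term each cell falls under."""
    chosen = set(chosen_terms)
    # Map each CL ID to the chosen terms it is an instance of (itself or via enrichment).
    covers = {term: {term} for term in chosen}
    for subject, obj in enrichment:
        if obj in chosen:
            covers.setdefault(subject, set()).add(obj)
    out = []
    for row in obs_rows:
        cl_id = row["cell_type_ontology_term_id"]
        hits = covers.get(cl_id, set())
        if len(hits) > 1:
            # The chosen terms are supposed to be non-overlapping.
            raise ValueError(f"chosen terms overlap for {cl_id}: {sorted(hits)}")
        out.append({**row, new_field: next(iter(hits), None)})
    return out


# (subject, object) pairs, a subset of a simple_term_enrichment result
enrichment = [
    ("CL:0000897", "CL:0000084"),
    ("CL:0000897", "CL:0000813"),
    ("CL:0000909", "CL:0000813"),
    ("CL:0000940", "CL:0000084"),
]
obs_rows = [
    {"cell_type_ontology_term_id": "CL:0000897"},
    {"cell_type_ontology_term_id": "CL:0000909"},
    {"cell_type_ontology_term_id": "CL:0000788"},
]
annotated = add_enrichment_field(obs_rows, enrichment, ["CL:0000813"], "enriched_cell_type")
```

Cells that fall under none of the chosen terms get `None` in the new field. Choosing two terms that share a cell (for example T cell and memory T cell) raises an error rather than silently picking one.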
OAK has lexmap function(s) that can fuzzy match strings to ontology terms based on labels/synonyms. Pandasaurus enrichment allows us to narrow down potential mappings for cell types to branch of CL.
New function: use author cell type names to find candidate lexical mappings from among the subclasses of the CL term. Investigate using LEXMAP for this. For an MVP, just use direct annotation (subcluster_of entries in the co-annotation table). In future we may extend this to use cell sets.
add another enrichment method to the existing set.
User specifies number of hops
name: ancestor_enrichment (?)
Args: number of hops
subject - terms in seed
object - terms in seed + terms 1-n hops from terms in subject.
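One way to read this spec, as a sketch only: walk up to n subClassOf steps from each seed term and emit a (subject, object) row for every term reached. The `parents` dictionary stands in for whatever query would supply direct superclasses, and the `EX:` identifiers are made up for illustration.

```python
def ancestor_enrichment(seed, parents, hops):
    """Return (subject, object) pairs where object is 1..hops subClassOf steps above a seed term."""
    rows = set()
    for subject in seed:
        frontier = {subject}
        seen = {subject}
        for _ in range(hops):
            # Move one step up from every term on the current frontier.
            frontier = {p for term in frontier for p in parents.get(term, ())} - seen
            if not frontier:
                break
            seen |= frontier
            rows.update((subject, obj) for obj in frontier)
    return sorted(rows)


# Hypothetical direct-superclass lookup (not real ontology IDs).
parents = {
    "EX:naive_T": ["EX:T"],
    "EX:T": ["EX:lymphocyte"],
    "EX:lymphocyte": ["EX:leukocyte"],
}
two_hops = ancestor_enrichment(["EX:naive_T"], parents, hops=2)
```

With `hops=2` the term three steps up (`EX:leukocyte`) is left out, which is the behaviour the "number of hops" argument is meant to control.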
Current Notebook content is good but descriptions are too abstract. Needs to be more user-focused.
Dataset object should be linked to all cell sets via http://purl.org/dc/terms/source
(:Cluster)-[:has_source]->(ds:Dataset)
Property details (from VFB representation)
{
"iri": "http://purl.org/dc/terms/source",
"short_form": "source",
"label": "has_source",
"type": "Annotation"
}
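As a rough illustration of the link described above, the sketch below writes one N-Triples line per cell set using the `http://purl.org/dc/terms/source` property. The `example.org` IRIs and the function name `source_triples` are placeholders, not identifiers the tool actually mints.

```python
DCTERMS_SOURCE = "http://purl.org/dc/terms/source"


def source_triples(cell_set_iris, dataset_iri):
    """Yield one N-Triples line per cell set, linking it to the dataset node."""
    for iri in cell_set_iris:
        yield f"<{iri}> <{DCTERMS_SOURCE}> <{dataset_iri}> ."


# Placeholder IRIs for illustration only.
cell_sets = ["http://example.org/cluster/1", "http://example.org/cluster/2"]
lines = list(source_triples(cell_sets, "http://example.org/dataset/kidney"))
```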
In general we need better documentation on how to run this:
How is the node list used to generate graphs?
What is the predicate argument used for?
This should be in PyDoc, but we also need examples in notebook of how we expect users to discover nodes.
stored in anndata.uns['obs_meta']:
[
{
"field_name": "some_field_name",
"field_type": "author_cell_type_label"
},
...
]
With this, you can drop the need for a schema on the co-annotation method.
Pandasaurus_CxG should support authoring as well as reading.
Sketch
Use Kidney example
The enrichment allows us to find all cells that are instances of any CL term in the enrichment table. We should be able to generate a new matrix including only expression data for cells that are instances of this type.
Before implementing - add example to Jupyter Notebook & chat with Evan & Mary about whether worth adding as a method.
For canonical atlas building, it is important to be able to focus on 'normal' cells only. These have a disease field tagged with the PATO term for normal.
Aim: an option to choose to use only cells tagged with 'disease: normal' in co-annotation analysis.
(Might want to consider this being built on a general method for choosing disease)
Examples of usage as doc for users.
The two running example datasets we've been working with are good for that. (Probably need to grab via get request from CxG).
I can provide an additional example with PCL.
Wrapper object
Analyzer object
Enricher object
Graph generation (on wrapper object)
Tasks:
ade._AnndataEnricher__context_list currently holds IDs only. It would be useful to have a method that displays the context list with labels.
Wrapper for pandasaurus.slim_manager that pre-fills CL 'Cell Ontology' as default in get_slim_list. Users coming to this naively will not know precisely what string to put in the Ontology arg.
STATUS:DRAFT
It should be possible to use cellxgene-schema
See https://github.com/chanzuckerberg/single-cell-curation/blob/main/cellxgene_schema_cli/tests/test_schema_compliance.py for direct use (the doc for the PyPI lib is focussed on command-line usage).
Represent this as edges linking cell sets to nodes representing ontology terms used in obs. The edge property should be the obs key. The edge should also include the percentage of cells in the cell set that have the specified property.
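A minimal sketch of how such edges could be computed, in plain Python over per-cell records. The edge layout `(cell_set, obs_key, value, percent)` and the function name `annotation_edges` are assumptions for illustration; the example values are invented.

```python
from collections import Counter, defaultdict


def annotation_edges(obs_rows, cell_set_key, obs_keys):
    """Return edges (cell_set, obs_key, value, percent of the cell set's cells with that value)."""
    sizes = Counter(row[cell_set_key] for row in obs_rows)
    counts = defaultdict(Counter)
    for row in obs_rows:
        for key in obs_keys:
            counts[(row[cell_set_key], key)][row[key]] += 1
    edges = []
    for (cell_set, key), values in counts.items():
        for value, n in values.items():
            edges.append((cell_set, key, value, 100.0 * n / sizes[cell_set]))
    return edges


# Four cells in one author-defined cell set; tissue values are illustrative.
obs_rows = [
    {"author_cell_type": "TissueResMemT", "tissue": "blood"},
    {"author_cell_type": "TissueResMemT", "tissue": "blood"},
    {"author_cell_type": "TissueResMemT", "tissue": "blood"},
    {"author_cell_type": "TissueResMemT", "tissue": "bone marrow"},
]
edges = annotation_edges(obs_rows, "author_cell_type", ["tissue"])
```

Each edge carries the obs key as its property and the percentage as an attribute, so a downstream graph builder can filter out low-percentage links if needed.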
STATUS: DRAFT
The CxG schema, used by the CellXGene app, standardises key names and ontology values for recording sample metadata and cell type annotation in AnnData-format CellXGene matrices constructed from single-cell transcriptomics data. The primary aim of the CxG extension to pandasaurus is to support enrichment and filtering of matrices by cell type/class using the CL ontology structure.
It will do this by generating the initial term list from the cell_type field in the CxG standard AnnData file. For contextual enrichment it will use the CxG standard tissue field and query UberGraph for 'cell' AND 'part_of {tissue}'.
We are developing a meta-schema extension (celltag schema) to the [CxG schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md). This overcomes the flat nature of the current schema by using JSON to type and link fields. One major usage of this schema will be to type free-text cell type fields and infer relative granularity between these annotations and cell_type ontology annotation from co-annotation, taking advantage of ontology enrichment. Details TBA.
pandasaurus
cell_tag_schema
cellxgene-schema - for validation.
Pandasaurus CxG extension should load a CxG standard AnnData file, check field compliance* and report:
* Use standard validator.
Update ontology terms (warn before update and provide report).
Highest priority - cell type queries:
Priority 2
Figure showing intended functionality (some work to do to completely align this with functional spec)
cellXGene metaschema = celltag_schema - intended for more generic use than just CxG metaschema.
Multiple fields are tagged as representing cell type.
Use of each value in a cell type field defines a cell set.
We can analyse co-occurrence of these values to infer relative granularity:
set(X) cluster_overlaps set(Y)
set(X) cluster_matches set(Y)
set(x) subcluster_of set(y)
field_name1; value1; predicate; field_name2; value2
e.g.
field_name1 | value1 | predicate | field_name2 | value2 |
---|---|---|---|---|
author_category | TissueResMemT | cluster_matches | cell_type | memory T cell |
To analyse co-occurrence in AnnData files we can just do this:
anndata.obs[['author_cell_type', 'cell_type']].drop_duplicates()
(This example looks at co-occurrence of just 2 fields.) The resulting dataframe can be analysed for co-occurrence of key:value pairs.
See #7 (comment) for examples of using co-annotation analysis to infer relationships.
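The step from unique value pairs to a predicate can be sketched as follows. This is one possible reading, assuming every cell has exactly one value in each of the two fields, so that a value seen with only one partner is wholly contained in that partner's cell set. The input is the list of unique pairs that `drop_duplicates()` would give; `T_sub1` and `T_sub2` are invented author labels.

```python
from collections import defaultdict


def infer_predicates(pairs):
    """Infer a predicate for each unique (value1, value2) co-annotation pair."""
    right_of = defaultdict(set)  # value1 -> value2s it co-occurs with
    left_of = defaultdict(set)   # value2 -> value1s it co-occurs with
    for v1, v2 in pairs:
        right_of[v1].add(v2)
        left_of[v2].add(v1)
    rows = []
    for v1, v2 in sorted(set(pairs)):
        one_right = len(right_of[v1]) == 1
        one_left = len(left_of[v2]) == 1
        if one_right and one_left:
            rows.append((v1, "cluster_matches", v2))
        elif one_right:
            # All v1 cells carry v2, but v2 also covers other values.
            rows.append((v1, "subcluster_of", v2))
        elif one_left:
            rows.append((v2, "subcluster_of", v1))
        else:
            rows.append((v1, "cluster_overlaps", v2))
    return rows


pairs = [
    ("TissueResMemT", "memory T cell"),
    ("T_sub1", "T cell"),  # hypothetical author labels
    ("T_sub2", "T cell"),
]
relations = infer_predicates(pairs)
```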
Are any of the CL terms in the initial seed (the set of CL terms used to initialise the Pandasaurus object) also in the object column of the enrichment table?
If so, repeat co-annotation analysis including everything that maps to this term, directly or via the enrichment table.
e.g.
Co-annotation (in this case, everything is 1:1 => predicate cluster_matches).
print(ade._anndata.obs[['author_cell_type', 'cell_type', 'cell_type_ontology_term_id']].drop_duplicates().to_csv(sep='\t'))
author_cell_type | cell_type | cell_type_ontology_term_id |
---|---|---|
naive B cell | naive B cell | CL:0000788 |
memory B cell | memory B cell | CL:0000787 |
gamma-delta T cell | gamma-delta T cell | CL:0000798 |
plasmablast | plasmablast | CL:0000980 |
regulatory T cell | regulatory T cell | CL:0000815 |
CD4-positive, alpha-beta memory T cell | CD4-positive, alpha-beta memory T cell | CL:0000897 |
CD8-positive, alpha-beta memory T cell | CD8-positive, alpha-beta memory T cell | CL:0000909 |
naive CD8+ T cell | naive thymus-derived CD8-positive, alpha-beta T cell | CL:0000900 |
naive CD4+ T cell | naive thymus-derived CD4-positive, alpha-beta T cell | CL:0000895 |
mucosal invariant T cell (MAIT) | mucosal invariant T cell | CL:0000940 |
TissueResMemT | memory T cell | CL:0000813 |
double-positive T cell (DPT) | double-positive, alpha-beta thymocyte | CL:0000809 |
double negative T cell (DNT) | double negative thymocyte | CL:0002489 |
TCRVbeta13.1pos | T cell | CL:0000084 |
enrichment table:
print(ade.minimal_slim_enrichment(['simple_term_enrichment']).to_csv(sep='\t'))
 | s | s_label | p | o | o_label |
---|---|---|---|---|---|
0 | CL:0000895 | naive thymus-derived CD4-positive, alpha-beta T cell | rdfs:subClassOf | CL:0000084 | T cell |
1 | CL:0000897 | CD4-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000084 | T cell |
2 | CL:0000900 | naive thymus-derived CD8-positive, alpha-beta T cell | rdfs:subClassOf | CL:0000084 | T cell |
3 | CL:0000909 | CD8-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000084 | T cell |
4 | CL:0000940 | mucosal invariant T cell | rdfs:subClassOf | CL:0000084 | T cell |
5 | CL:0002489 | double negative thymocyte | rdfs:subClassOf | CL:0000084 | T cell |
6 | CL:0000897 | CD4-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000813 | memory T cell |
7 | CL:0000909 | CD8-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000813 | memory T cell |
8 | CL:0000798 | gamma-delta T cell | rdfs:subClassOf | CL:0000084 | T cell |
9 | CL:0000809 | double-positive, alpha-beta thymocyte | rdfs:subClassOf | CL:0000084 | T cell |
10 | CL:0000813 | memory T cell | rdfs:subClassOf | CL:0000084 | T cell |
11 | CL:0000815 | regulatory T cell | rdfs:subClassOf | CL:0000084 | T cell |
T cell (CL:0000084) is in the original seed (from obs.cell_type...) and in the object column of the enrichment table.
Co-occurrence analysis without enrichment => TCRVbeta13.1pos cluster_matches T cell (CL:0000084)
Co-occurrence analysis with enrichment =>
TCRVbeta13.1pos subClusterOf T cell (CL:0000084)
mucosal invariant T cell (MAIT) subClusterOf T cell (CL:0000084)
...
(Note: working through like this makes it clear that we need to start making the distinction between ontology terms (where subClassOf is appropriate) and cell sets defined by tagging with ontology terms (cell sets are related by subClusterOf). This will be important for the model, but needs to be handled carefully for user-facing assertions. For now, I think we need to gloss over this distinction.)
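The worked example above can be reproduced with a small set-based sketch. It is an illustration only: each cell type's cell set is widened to include cells annotated with any of its subclasses in the enrichment table, and sets are then compared directly. Function names are hypothetical, and the data is cut down to one cell per author label.

```python
from collections import defaultdict


def cell_sets_with_enrichment(cell_types, enrichment):
    """Map each CL ID to the cell indices annotated with it or with one of its subclasses."""
    ancestors = defaultdict(set)
    for subject, obj in enrichment:
        ancestors[subject].add(obj)
    sets = defaultdict(set)
    for i, cl_id in enumerate(cell_types):
        sets[cl_id].add(i)
        for anc in ancestors[cl_id]:
            sets[anc].add(i)
    return sets


def relate(a, b):
    """Name the relation of cell set a to cell set b."""
    if a == b:
        return "cluster_matches"
    if a < b:
        return "subcluster_of"
    if a > b:
        return None  # b is the subcluster; call relate(b, a) for that direction
    if a & b:
        return "cluster_overlaps"
    return None


authors = ["TCRVbeta13.1pos", "mucosal invariant T cell (MAIT)", "TissueResMemT"]
cell_types = ["CL:0000084", "CL:0000940", "CL:0000813"]
enrichment = [("CL:0000940", "CL:0000084"), ("CL:0000813", "CL:0000084")]

author_sets = defaultdict(set)
for i, name in enumerate(authors):
    author_sets[name].add(i)

plain = cell_sets_with_enrichment(cell_types, [])
enriched = cell_sets_with_enrichment(cell_types, enrichment)
without = relate(author_sets["TCRVbeta13.1pos"], plain["CL:0000084"])
with_enrichment = relate(author_sets["TCRVbeta13.1pos"], enriched["CL:0000084"])
```

Without enrichment the T cell set holds only the cell tagged directly with CL:0000084, so the author label matches it. With enrichment the T cell set grows and the same author label becomes a subcluster.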
Needed today or tomorrow for grant application.
consist_of --> "iri": "http://purl.obolibrary.org/obo/RO_0002473", "label": "composed_primarily_of"
subcluster_of --> "iri": "http://purl.obolibrary.org/obo/RO_0015003", "label": "subcluster of"
cluster --> "iri": "http://purl.obolibrary.org/obo/PCL_0010001", "label": "cell cluster" # This one might change
One of the main aims of Pandasaurus_CxG is to use it to build a graph/tree showing the implicit hierarchy of cell (type) annotations based on co-annotation, relating this back to the ontology subClass graph.
To do this we should:
We should then support display of a simple graph of cell sets (MVP) & one extended using ontology (subClassOf) hierarchy to terms in the enrichment. To make this manageable for large datasets it should be possible to choose to generate a graph visualisation from some set of cell sets.
Note we still need to keep track of labelsets (the set of terms under one obs).
REPRESENTATION: represent in RDF as individuals. We should follow the schema in Tan et al., 2022 and generate using SPARQL UPDATE.
TBD: How to translate OBS key names into APs.
Outputs: It should be possible to save the resulting RDF to disk. It can be used as input to OBASK.
VISUALIZATION: Use OAKLIB. It works nicely with OWL.
Difficulties:
Breakdown:
Extend the cell type RDF graph with:
Needed - a general redundancy stripping step - as Ubergraph inevitably produces a flattened representation of the subClassOf hierarchy.
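One common way to do this kind of redundancy stripping is a transitive reduction: drop any subClassOf edge whose object can still be reached through a longer path. The sketch below assumes the edges form an acyclic graph and uses a hypothetical function name. It is not necessarily how the step would be implemented in the tool.

```python
def strip_redundant(edges):
    """Drop any (s, o) edge where o is also reachable from s through another path."""
    edges = set(edges)
    parents = {}
    for s, o in edges:
        parents.setdefault(s, set()).add(o)

    def reachable(start, skip):
        # Terms reachable from start without using the direct edge `skip`.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in parents.get(node, ()):
                if (node, nxt) == skip or nxt in seen:
                    continue
                seen.add(nxt)
                stack.append(nxt)
        return seen

    return {(s, o) for s, o in edges if o not in reachable(s, (s, o))}


# Three rows from a flattened enrichment table.
flattened = [
    ("CL:0000897", "CL:0000084"),  # redundant: implied by the two edges below
    ("CL:0000897", "CL:0000813"),
    ("CL:0000813", "CL:0000084"),
]
reduced = strip_redundant(flattened)
```

Here the direct edge from CL:0000897 to T cell is removed because the same relationship follows from the path through memory T cell.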