
pandasaurus_cxg's People

Contributors

anitacaron, dosumis, ubyndr

Forkers

anitacaron

pandasaurus_cxg's Issues

Support cell set graph generation

Generate RDF graph of cell sets & their subsets, with annotations & visualise the results as a graph

  1. Generate individuals for cell sets;
  2. add subClusterOf relationships between cell sets (and redundancy strip?)
  3. Support graph generation (NetworkX or OAKLIB)

Contextual enrichment should restrict to cells

Inspecting the RDF graph generated from the kidney example file, there are Uberon terms in it that have CL IDs.

This suggests 2 bugs that need fixing:

  • contextual enrichment in pandasaurus_cxg should be restricted to subClassOf CL:cell.
    This does no harm to graph/enrichment generation, so it is not urgent to fix.
  • #36
    This is quite serious, as we don't want files with incorrect IDs getting out into the wild. In general we should avoid rewriting IRIs and instead take them as-is from source.

Add additional annotations based on enrichment

The enrichment allows us to find all cells that are instances of any CL term in the enrichment table. For any set of non-overlapping enrichment terms, we can add a new obs field with the corresponding annotations.

Before implementing - add example to Jupyter Notebook & chat with Evan & Mary about whether worth adding as a method

  1. User chooses a set of one or more terms from the enrichment table and a name for a new obs field.
  2. Test if these classes are subclasses of each other
    • if not add new obs field pair (name & ID following standard CxG naming convention).
    • otherwise fail with warning indicating which classes are subclasses of each other.
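
The check-then-annotate steps above could be sketched roughly as follows. This is a minimal illustration and not the pandasaurus_cxg implementation: the function name `add_enrichment_annotation`, the list-of-dicts cell representation, and the `_ontology_term_id` suffix for the ID field are assumptions made for the example. It also assumes the enrichment table already contains indirect subclass pairs.

```python
def add_enrichment_annotation(cells, chosen, subclass_pairs, labels, field_name):
    """Add a label/ID field pair to each cell, or fail if chosen terms are nested.

    cells: list of dicts with a 'cell_type_ontology_term_id' key (stand-in for obs rows)
    subclass_pairs: set of (subclass_id, superclass_id) taken from an enrichment table
    labels: dict mapping term ID -> label
    """
    chosen = sorted(set(chosen))
    # Step 2: test whether any chosen term is a subclass of another chosen term.
    nested = sorted(
        (s, o) for s, o in subclass_pairs if s in chosen and o in chosen and s != o
    )
    if nested:
        raise ValueError(f"Chosen terms overlap (subclass, superclass): {nested}")
    for cell in cells:
        term = cell["cell_type_ontology_term_id"]
        # A cell matches a chosen term if annotated with it or with one of its subclasses.
        match = next(
            (c for c in chosen if c == term or (term, c) in subclass_pairs), None
        )
        cell[field_name] = labels.get(match)
        cell[f"{field_name}_ontology_term_id"] = match
    return cells


pairs = {
    ("CL:0000897", "CL:0000813"),
    ("CL:0000897", "CL:0000084"),
    ("CL:0000813", "CL:0000084"),
}
labels = {"CL:0000813": "memory T cell", "CL:0000084": "T cell"}
cells = [
    {"cell_type_ontology_term_id": "CL:0000897"},
    {"cell_type_ontology_term_id": "CL:0000788"},
]
annotated = add_enrichment_annotation(cells, ["CL:0000813"], pairs, labels, "broad_type")
```

Cells that match none of the chosen terms get `None` in both new fields. Choosing both CL:0000813 and CL:0000084 would raise, because one is a subclass of the other.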

Add function to suggest mappings to CL terms based on lexical matches

OAK has lexmap function(s) that can fuzzy match strings to ontology terms based on labels/synonyms. Pandasaurus enrichment allows us to narrow down potential mappings for cell types to branch of CL.

New function: use author cell type names to find candidate lexical mappings from among the subclasses of the CL term. Investigate using LEXMAP for this. For an MVP, just use direct annotation (subcluster_of entries in the co-annotation table). In future we may extend this to use cell sets.
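
To illustrate the idea only, here is a sketch that uses the standard library's `difflib` as a stand-in for OAK's lexical matching (it is not LEXMAP and does not call OAK). The function name `suggest_mappings`, the candidate dictionary, and the synonym shown are hypothetical example inputs. In practice the candidates would be the labels and synonyms of the subclasses of the enriched CL term.

```python
from difflib import get_close_matches


def suggest_mappings(author_labels, candidate_terms, cutoff=0.6, n=3):
    """Suggest candidate term IDs for each author label by fuzzy string match.

    candidate_terms: dict mapping term ID -> list of labels/synonyms
    Returns a dict mapping author label -> list of (term ID, matched name), best first.
    """
    lookup = {}
    for term_id, names in candidate_terms.items():
        for name in names:
            lookup.setdefault(name.lower(), term_id)
    suggestions = {}
    for label in author_labels:
        hits = get_close_matches(label.lower(), list(lookup), n=n, cutoff=cutoff)
        suggestions[label] = [(lookup[hit], hit) for hit in hits]
    return suggestions


# Illustrative candidates; the synonym for CL:0000900 is made up for the example.
candidates = {
    "CL:0000900": [
        "naive thymus-derived CD8-positive, alpha-beta T cell",
        "naive CD8+ T cell",
    ],
    "CL:0000940": ["mucosal invariant T cell"],
}
suggestions = suggest_mappings(
    ["naive CD8+ T cell", "mucosal invariant T cell (MAIT)"], candidates
)
```

Restricting the candidates to one branch of CL before matching is what keeps a loose cutoff like this from producing many spurious hits.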

Fix doc errors

  1. Roadmap and snippets in doc links point to a deleted branch.
  2. String for return value incomplete/split on create_cell_type_dict().

Minor Bug Fixes and Improvements - Manual Testing Notes

Issues/bugs:

  • edge labels were displaying IDs instead of labels. Text wrapping should also be applied to edge labels.
  • cell set and CL terms should have different colours.
  • some nodes were missing types; generate_subgraph in graph_generator_utils.py needs some updates.
  • AnndataEnrichmentAnalyzer's initialiser has an issue related to type hints.
  • from_file_path in AnndataEnricher is missing a parameter.

Questions:

Add new enrichment method

Add another enrichment method to the existing set.

User specifies number of hops

name: ancestor_enrichment (?)
Args: number of hops

subject - terms in seed
object - terms in seed + terms 1-n hops from terms in subject.
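
One possible reading of this proposal, sketched in plain Python: for each seed term, walk up the subClassOf hierarchy for at most the given number of hops and emit a (subject, object) pair for every ancestor reached. The `parents` dictionary is a toy, simplified hierarchy standing in for an ontology query, and the function signature is an assumption.

```python
def ancestor_enrichment(seed, parents, hops):
    """Return sorted (subject, object) pairs for ancestors within `hops` steps.

    seed: iterable of term IDs
    parents: dict mapping term ID -> iterable of direct superclass IDs
    hops: maximum number of subClassOf steps to follow
    """
    pairs = set()
    for subject in seed:
        frontier, seen = {subject}, {subject}
        for _ in range(hops):
            # Move one step up from every term on the current frontier.
            frontier = {p for term in frontier for p in parents.get(term, ())} - seen
            if not frontier:
                break
            seen |= frontier
            pairs.update((subject, obj) for obj in frontier)
    return sorted(pairs)


# Toy hierarchy for illustration only.
toy_parents = {
    "CL:0000897": ["CL:0000813"],
    "CL:0000813": ["CL:0000084"],
}
one_hop = ancestor_enrichment(["CL:0000897"], toy_parents, hops=1)
two_hops = ancestor_enrichment(["CL:0000897"], toy_parents, hops=2)
```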

GraphGeneration.visualize_rdf_graph needs pydoc

In general we need better documentation on how to run this:

How is the node list used to generate graphs?
What is the predicate argument used for?

This should be in PyDoc, but we also need examples in notebook of how we expect users to discover nodes.

Support basic metaschema for flagging cell-type fields

stored in anndata.uns['obs_meta']:

[
  {
    "field_name": "some_field_name",
    "field_type": "author_cell_type_label"
  },
  ...
]

With this, you can drop the need for a schema on the co-annotation method.

Pandasaurus_CxG should support authoring as well as reading.
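
A small sketch of how reading and authoring this metaschema might look. A plain dict stands in for `anndata.uns` here, and the helper names `tag_field` and `fields_of_type` are hypothetical, not existing API.

```python
def tag_field(uns, field_name, field_type):
    """Authoring: record the type of an obs field, replacing any earlier entry for it."""
    entries = uns.setdefault("obs_meta", [])
    entries[:] = [e for e in entries if e["field_name"] != field_name]
    entries.append({"field_name": field_name, "field_type": field_type})


def fields_of_type(uns, field_type):
    """Reading: list the obs field names tagged with the given type."""
    return [
        e["field_name"] for e in uns.get("obs_meta", []) if e["field_type"] == field_type
    ]


uns = {}  # stand-in for anndata.uns
tag_field(uns, "author_cell_type", "author_cell_type_label")
tag_field(uns, "author_category", "author_cell_type_label")
cell_type_fields = fields_of_type(uns, "author_cell_type_label")
```

With the fields discoverable this way, a co-annotation method could look them up itself rather than being handed a schema.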

Add function to filter AnnData matrix by Cell Type

The enrichment allows us to find all cells that are instances of any CL term in the enrichment table. We should be able to generate a new matrix including only expression data for cells that are instances of this type.
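
As a rough sketch of the filtering step, assuming the enrichment table has `s` (subject) and `o` (object) columns as in the examples elsewhere in these issues: build a boolean mask over obs, then use it to subset. The function name `cells_of_type` is hypothetical, and a plain DataFrame stands in for `adata.obs`.

```python
import pandas as pd


def cells_of_type(obs, enrichment, term_id):
    """Boolean mask: cells annotated with term_id or with a term enriched to it."""
    subjects = set(enrichment.loc[enrichment["o"] == term_id, "s"]) | {term_id}
    return obs["cell_type_ontology_term_id"].isin(subjects)


obs = pd.DataFrame(
    {"cell_type_ontology_term_id": ["CL:0000897", "CL:0000788", "CL:0000813"]},
    index=["cell_1", "cell_2", "cell_3"],
)
enrichment = pd.DataFrame(
    {"s": ["CL:0000897", "CL:0000909"], "o": ["CL:0000813", "CL:0000813"]}
)
mask = cells_of_type(obs, enrichment, "CL:0000813")
filtered_obs = obs[mask]
# With an AnnData object, the same mask could be applied along the obs axis,
# for example adata[mask.values].copy(), to get the filtered expression matrix.
```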

Before implementing - add example to Jupyter Notebook & chat with Evan & Mary about whether worth adding as a method.

Support option of choosing only 'normal' cells for analysis

For canonical atlas building, it is important to be able to focus on 'normal' cells only. These have a disease field tagged with the PATO term for normal.

Aim: an option to choose to use only cells tagged with 'disease: normal' in co-annotation analysis.

(Might want to consider this being built on a general method for choosing disease)
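
A general disease selector with 'normal' as the default might look something like this. The function name is an assumption, and a plain DataFrame stands in for `adata.obs`. The example matches on the `disease` label field mentioned above.

```python
import pandas as pd


def select_by_disease(obs, disease="normal"):
    """Boolean mask over obs rows whose 'disease' value equals the requested one."""
    return obs["disease"] == disease


obs_example = pd.DataFrame(
    {
        "author_cell_type": ["TissueResMemT", "naive B cell", "plasmablast"],
        "disease": ["normal", "COVID-19", "normal"],
    }
)
normal_mask = select_by_disease(obs_example)
normal_obs = obs_example[normal_mask]
```

Running co-annotation analysis on `normal_obs` rather than the full obs table would then restrict it to cells tagged 'disease: normal'.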

Jupyter notebook snippets needed

Examples of usage as doc for users.

The two running example datasets we've been working with are good for that. (Probably need to grab via get request from CxG).
I can provide an additional example with PCL.

Supported workflows

Wrapper object

  • User initialises with a CxG anndata file.
  • this wraps analyzer and enricher objects

Analyzer object

  • User chooses free-text cell type fields at initialisation --> analysis (#14)

Enricher object

  • Runs enrichment method => enrichment tables + graphs (both stored on the object). Graphs are RDF, transitively reduced, and held in memory.

Graph generation (on wrapper object)

  • User specifies priority list of obs to use for label --> generates RDF graph of co-annotation (transitively reduced) (#30)
  • Generate enriched graph (combining with co-annotation graph)
  • Outputs:
    • Visualize graphs with networkX
    • Save as rdf
    • Return networkX object.

Tasks:

Add method to query CL slims

Wrapper for pandasaurus.slim_manager that pre-fills CL 'Cell Ontology' as the default in get_slim_list. Users coming to this naively will not know precisely what string to put in the Ontology arg.

MVP / proof of concept release

  • Load CxG Anndata file:
    • Run minimal enrichment using contents of cell_type field
    • Support enrichment via slims (will already work given basic Pandasaurus functionality)
    • Support contextual enrichment via contents of CxG tissue field & part_of
    • Filter matrix by cell type from enrichment table.

Extend graph visualization to give more flexibility in filtering

STATUS:DRAFT

  1. Choosing nodes:
    • Users should be able to refer to cell sets and classes by rdfs:label.
    • User should also be able to refer to cell sets by obs key + value.
  2. Interpretation of list of input nodes:
    • Bottom up: choose cell sets and show graph above these.
    • Top down: choose classes and show graph below these.
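
The two interpretations of the input node list could be sketched as a traversal over (child, parent) edges. This is plain Python for illustration; the function name, the `direction` values, and the edge representation are assumptions rather than the existing visualisation API.

```python
def select_subgraph(edges, start_nodes, direction):
    """Keep edges reachable from start_nodes, walking up or down the hierarchy.

    edges: iterable of (child, parent) pairs
    direction: 'bottom_up' (show graph above the nodes) or 'top_down' (graph below)
    """
    if direction not in ("bottom_up", "top_down"):
        raise ValueError("direction must be 'bottom_up' or 'top_down'")
    edges = list(edges)
    neighbours = {}
    for child, parent in edges:
        src, dst = (child, parent) if direction == "bottom_up" else (parent, child)
        neighbours.setdefault(src, set()).add(dst)
    keep, stack = set(start_nodes), list(start_nodes)
    while stack:
        for nxt in neighbours.get(stack.pop(), ()):
            if nxt not in keep:
                keep.add(nxt)
                stack.append(nxt)
    # Return edges in their original order, restricted to the nodes kept.
    return [(c, p) for c, p in edges if c in keep and p in keep]


graph_edges = [
    ("TissueResMemT", "memory T cell"),
    ("memory T cell", "T cell"),
    ("naive B cell", "B cell"),
]
above = select_subgraph(graph_edges, ["TissueResMemT"], "bottom_up")
below = select_subgraph(graph_edges, ["B cell"], "top_down")
```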

Pandasaurus_CxG Functional Spec

STATUS: DRAFT

Background & use cases.

The CxG schema, used by the CellXGene app, standardises key names and ontology values for recording sample metadata & cell type annotation for AnnData-format CellXGene matrices constructed from single-cell transcriptomics data. The primary aim of the CxG extension to pandasaurus is to support enrichment and filtering of matrices by cell-type/class using CL ontology structure.

It will do this by generating the initial term list from the cell_type field in the CxG standard AnnData file. For contextual enrichment it will use the CxG standard tissue field and query UberGraph for 'cell' AND 'part_of {tissue}'.

We are developing a meta-schema extension (the celltag schema) to the [CxG schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md). This overcomes the flat nature of the current schema by using JSON to type and link fields. One major use of this schema will be to type free-text cell type fields and infer relative granularity between these annotations and cell_type ontology annotation from co-annotation, taking advantage of ontology enrichment. Details TBA.

Dependencies:

pandasaurus
cell_tag_schema
cellxgene-schema - for validation.

Features

The Pandasaurus CxG extension should load a CxG standard AnnData file, check field compliance*, and report:

  • Absence of standard fields
  • Any fields following standards that do not have expected content
  • check metaschema for compliance (if present)

* Use the standard validator.

Methods

Update ontology terms (warn before update and provide report).

  • Update labels if they have changed
  • Update obsoletes using replaced_by
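
A sketch of the update-with-report idea, under the assumption that current labels and replaced_by mappings have already been fetched from the ontology as dictionaries. The function name and the `EX:` IDs are hypothetical, and the sketch does not follow chains of replacements.

```python
def update_terms(annotations, current_labels, replaced_by):
    """Return (updated annotations, report of changes) without mutating the input.

    annotations: list of (term_id, label) pairs as found in the file
    current_labels: dict mapping term ID -> current label
    replaced_by: dict mapping obsolete term ID -> replacement term ID
    """
    updated, report = [], []
    for term_id, label in annotations:
        new_id = replaced_by.get(term_id, term_id)
        new_label = current_labels.get(new_id, label)
        if (new_id, new_label) != (term_id, label):
            report.append({"old": (term_id, label), "new": (new_id, new_label)})
        updated.append((new_id, new_label))
    return updated, report


# EX: IDs are made-up placeholders for an obsolete term and its replacement.
found = [("CL:0000084", "T-cell"), ("EX:0000001", "old name"), ("CL:0000813", "memory T cell")]
updated, report = update_terms(
    found,
    current_labels={"CL:0000084": "T cell", "EX:0000002": "new name"},
    replaced_by={"EX:0000001": "EX:0000002"},
)
```

Because the report is built before anything is written back, it can be shown to the user as the warning step and applied only on confirmation.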

Highest priority - cell type queries:

  • Show Cell Type enrichment slims (in the cell ontology): Return names and descriptions of slims.
  • Enrich cell types - standard pandasaurus enrichment with list terms in cell_type field as input & optional choice of slims
  • Enrich cell types by anatomical context: Uses the anatomical context in the CxG Anndata file to enrich via the pandasaurus.contextual_enrichment method.
  • Filter matrix:
    Generates a new matrix containing only terms that are subclasses of those in some specified filter set, where the filter set is chosen from terms in the enrichment table.

Priority 2

  • Analyse co-occurrence & use it to generate a draft hierarchy of ontology terms and free-text annotations - extending CL.

Figure showing intended functionality (some work to do to completely align this with functional spec)
cellXGene metaschema = celltag_schema - intended for more generic use than just CxG metaschema.

Support co-annotation analysis

Multiple fields are tagged as representing cell type.

Use of each value in a cell type field defines a cell set.

We can analyse co-occurrence of these values to infer relative granularity:


set(X) cluster_overlaps set(Y)
set(X) cluster_matches set(Y)
set(x) subcluster_of set(y)

field_name1; value1; predicate; field_name2; value2

e.g.

field_name1 | value1 | predicate | field_name2 | value2
author_category | TissueResMemT | cluster_matches | cell_type | memory T cell

To analyse co-occurrence in AnnData files we can just do this:

anndata.obs[['author_cell_type', 'cell_type']].drop_duplicates()

(The example looks at co-occurrence of just 2 fields.) The resulting dataframe can be analysed for co-occurrence of key:value pairs.
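
One way the predicates might be derived from that dataframe is sketched below. Each value defines a cell set (the rows where it occurs), and each co-occurring pair of sets is compared. The function name and the toy obs table are assumptions. If A is a proper superset of B, the row is emitted the other way round, so that the predicate stays within the three listed above.

```python
import pandas as pd


def co_annotation(obs, field1, field2):
    """Compare the cell sets defined by co-occurring values of two obs fields."""
    cells1 = {v: set(idx) for v, idx in obs.groupby(field1).groups.items()}
    cells2 = {v: set(idx) for v, idx in obs.groupby(field2).groups.items()}
    rows = []
    for v1, v2 in obs[[field1, field2]].drop_duplicates().itertuples(index=False):
        a, b = cells1[v1], cells2[v2]
        if a == b:
            rows.append((field1, v1, "cluster_matches", field2, v2))
        elif a < b:  # proper subset
            rows.append((field1, v1, "subcluster_of", field2, v2))
        elif b < a:
            rows.append((field2, v2, "subcluster_of", field1, v1))
        else:
            rows.append((field1, v1, "cluster_overlaps", field2, v2))
    columns = ["field_name1", "value1", "predicate", "field_name2", "value2"]
    return pd.DataFrame(rows, columns=columns)


# Toy obs table; annotations are illustrative, not taken from a real dataset.
toy_obs = pd.DataFrame(
    {
        "author_cell_type": ["TissueResMemT", "TissueResMemT", "naive CD8+ T cell", "TCRVbeta13.1pos"],
        "cell_type": ["memory T cell", "memory T cell", "T cell", "T cell"],
    }
)
table = co_annotation(toy_obs, "author_cell_type", "cell_type")
```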

See #7 (comment) for examples of inferring relationships from co-annotation analysis.

Co-annotation analysis - enrichment extension

Are any of the CL terms in the initial seed (the set of CL terms used to initialise the Pandasaurus object) also in the object column of the enrichment table?

If so, repeat co-annotation analysis including everything that maps to this term, directly or via the enrichment table.

e.g.

Co-annotation (in this case, everything is 1:1 => predicate cluster_matches).

print(ade._anndata.obs[['author_cell_type', 'cell_type', 'cell_type_ontology_term_id']].drop_duplicates().to_csv(sep='\t'))
author_cell_type | cell_type | cell_type_ontology_term_id
naive B cell | naive B cell | CL:0000788
memory B cell | memory B cell | CL:0000787
gamma-delta T cell | gamma-delta T cell | CL:0000798
plasmablast | plasmablast | CL:0000980
regulatory T cell | regulatory T cell | CL:0000815
CD4-positive, alpha-beta memory T cell | CD4-positive, alpha-beta memory T cell | CL:0000897
CD8-positive, alpha-beta memory T cell | CD8-positive, alpha-beta memory T cell | CL:0000909
naive CD8+ T cell | naive thymus-derived CD8-positive, alpha-beta T cell | CL:0000900
naive CD4+ T cell | naive thymus-derived CD4-positive, alpha-beta T cell | CL:0000895
mucosal invariant T cell (MAIT) | mucosal invariant T cell | CL:0000940
TissueResMemT | memory T cell | CL:0000813
double-positive T cell (DPT) | double-positive, alpha-beta thymocyte | CL:0000809
double negative T cell (DNT) | double negative thymocyte | CL:0002489
TCRVbeta13.1pos | T cell | CL:0000084

enrichment table:

print(ade.minimal_slim_enrichment(['simple_term_enrichment']).to_csv(sep='\t'))
   | s | s_label | p | o | o_label
0 | CL:0000895 | naive thymus-derived CD4-positive, alpha-beta T cell | rdfs:subClassOf | CL:0000084 | T cell
1 | CL:0000897 | CD4-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000084 | T cell
2 | CL:0000900 | naive thymus-derived CD8-positive, alpha-beta T cell | rdfs:subClassOf | CL:0000084 | T cell
3 | CL:0000909 | CD8-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000084 | T cell
4 | CL:0000940 | mucosal invariant T cell | rdfs:subClassOf | CL:0000084 | T cell
5 | CL:0002489 | double negative thymocyte | rdfs:subClassOf | CL:0000084 | T cell
6 | CL:0000897 | CD4-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000813 | memory T cell
7 | CL:0000909 | CD8-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000813 | memory T cell
8 | CL:0000798 | gamma-delta T cell | rdfs:subClassOf | CL:0000084 | T cell
9 | CL:0000809 | double-positive, alpha-beta thymocyte | rdfs:subClassOf | CL:0000084 | T cell
10 | CL:0000813 | memory T cell | rdfs:subClassOf | CL:0000084 | T cell
11 | CL:0000815 | regulatory T cell | rdfs:subClassOf | CL:0000084 | T cell

T cell (CL:0000084) is in the original seed (from obs.cell_type...) and in the object column of the enrichment table.

Co-occurrence analysis without enrichment => TCRVbeta13.1pos cluster_matches T cell (CL:0000084)

Co-occurrence analysis with enrichment =>

TCRVbeta13.1pos subClusterOf T cell (CL:0000084)
mucosal invariant T cell (MAIT) subClusterOf T cell (CL:0000084)
...
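
The reasoning in this example can be sketched in plain Python. The sketch assumes each author label maps to a single CL term (as in the 1:1 table above) and treats a label as a subcluster whenever more than one label maps to the same seed term directly or through the enrichment table. The function name and input shapes are hypothetical.

```python
def enriched_co_annotation(co_annotation, enrichment):
    """Re-derive predicates for seed terms that also appear as enrichment objects.

    co_annotation: list of (author_label, term_id) pairs from direct annotation
    enrichment: set of (subject_id, object_id) subClassOf pairs
    Returns a list of (author_label, predicate, term_id) tuples.
    """
    seed = {term for _, term in co_annotation}
    objects = {obj for _, obj in enrichment}
    results = []
    for term in sorted(seed & objects):
        # Everything annotated with the term directly, or with one of its subclasses.
        members = sorted(
            label for label, t in co_annotation if t == term or (t, term) in enrichment
        )
        predicate = "cluster_matches" if len(members) == 1 else "subcluster_of"
        results.extend((label, predicate, term) for label in members)
    return results


direct = [
    ("TCRVbeta13.1pos", "CL:0000084"),
    ("mucosal invariant T cell (MAIT)", "CL:0000940"),
    ("TissueResMemT", "CL:0000813"),
    ("naive B cell", "CL:0000788"),
]
subclass_of = {("CL:0000940", "CL:0000084"), ("CL:0000813", "CL:0000084")}
relations = enriched_co_annotation(direct, subclass_of)
```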

(Note: working through the example like this makes it clear that we need to start distinguishing between ontology terms (where subClassOf is appropriate) and cell sets defined by tagging with ontology terms (cell sets are related by subClusterOf). This will be important for the model, but needs to be handled carefully for user-facing assertions. For now, I think we need to gloss over this distinction.)

Generate graphs of cell sets + cell ontology (enrichment) classes

One of the main aims of Pandasaurus_CxG is to use it to build a graph/tree showing the implicit hierarchy of cell (type) annotations based on co-annotation, relating this back to the ontology subClass graph.

To do this we should:

  1. Unify sets of annotations on the same cell_set (all labels linked by cluster_matches).
  2. Generate a graph using subClusterOf between cell sets.
  3. Relate cell sets to ontology terms using some standard relation (consists of?)
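
Step 1 (unifying labels that are linked by cluster_matches into one cell set) is essentially a connected-components problem. A small union-find sketch follows, with (obs key, value) tuples as node identifiers. The representation, the function name, and the `cluster`/`c7` example are assumptions.

```python
def unify_cell_sets(matches):
    """Group (obs_key, value) labels connected by cluster_matches into cell sets.

    matches: iterable of ((key, value), (key, value)) pairs
    Returns a sorted list of sorted label groups, one group per unified cell set.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in matches:
        parent[find(left)] = find(right)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


matches = [
    (("author_category", "TissueResMemT"), ("cell_type", "memory T cell")),
    (("cell_type", "memory T cell"), ("cluster", "c7")),
    (("author_cell_type", "plasmablast"), ("cell_type", "plasmablast")),
]
cell_sets = unify_cell_sets(matches)
```

Keeping the obs key in each node identifier is what preserves the labelset information after unification.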

We should then support display of a simple graph of cell sets (MVP) & one extended using ontology (subClassOf) hierarchy to terms in the enrichment. To make this manageable for large datasets it should be possible to choose to generate a graph visualisation from some set of cell sets.

Note we still need to keep track of labelsets (the set of terms under one obs).

REPRESENTATION: represent in RDF as individuals. We should follow the schema in Tan et al., 2022 and generate using SPARQL UPDATE.
TBD: How to translate OBS key names into APs.

Outputs: It should be possible to save the resulting RDF to disk. It can be used as input to OBASK.

VISUALIZATION: Use OAKLIB. It works nicely with OWL.

Difficulties:

  • Visualization requires a choice of which keys to use to label nodes, but we can't know the obs names a priori.
    • SOLUTION: Users get choice of preferred keys for visualization on graph visualization generation method.
    • #30

Breakdown:

Extend metadata pulled from CxG H5AD

  • #67
  • #65
    • represent this as edges linking cell sets to nodes representing ontology terms used in obs. The edge property should be the obs key. The edge should also include the percentage of cells in the cell set that have the specified property.
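
The edge description above could be computed along these lines. The cell records, the dictionary keys in the output, and the function name are all illustrative assumptions.

```python
from collections import Counter


def property_edges(cells, set_key, property_key):
    """Build edges from cell sets to property values, with percentage of cells.

    cells: list of dicts (stand-in for obs rows)
    set_key: obs key whose values define the cell sets
    property_key: obs key used as the edge property
    """
    totals = Counter(cell[set_key] for cell in cells)
    counts = Counter((cell[set_key], cell[property_key]) for cell in cells)
    return [
        {
            "cell_set": cell_set,
            "predicate": property_key,
            "object": value,
            "percentage": round(100 * n / totals[cell_set], 2),
        }
        for (cell_set, value), n in sorted(counts.items())
    ]


cells = [
    {"author_cell_type": "TissueResMemT", "tissue": "kidney"},
    {"author_cell_type": "TissueResMemT", "tissue": "kidney"},
    {"author_cell_type": "TissueResMemT", "tissue": "kidney"},
    {"author_cell_type": "TissueResMemT", "tissue": "cortex of kidney"},
]
edges_with_pct = property_edges(cells, "author_cell_type", "tissue")
```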

Extend cell set graphs with (cell) ontology classification from enrichment

Extend the cell type RDF graph with:

  1. rdf:type: consists_of some { CL } to nodes in the cell-set graph that correspond to
  2. A non-redundant subClassOf hierarchy above the classes and relationships added in step 1.

Needed - a general redundancy stripping step - as ubergraph inevitably produces a flattened representation of the subClassOf hierarchy.
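
Redundancy stripping of this kind amounts to a transitive reduction: an edge (a, c) is dropped when c can still be reached from a by some other path. Below is a minimal plain-Python sketch, assuming the subClassOf edges form a DAG. It is illustrative only and not tuned for large graphs.

```python
def transitive_reduction(edges):
    """Remove edges implied by longer paths. Assumes the edges form a DAG.

    edges: iterable of (subclass, superclass) pairs
    """
    edges = set(edges)
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)

    def reachable_without(edge):
        """True if edge's target is reachable from its source without using edge."""
        start, goal = edge
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in successors.get(node, ()):
                if (node, nxt) == edge or nxt in seen:
                    continue
                if nxt == goal:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return sorted(edge for edge in edges if not reachable_without(edge))


flattened = [
    ("CL:0000897", "CL:0000813"),
    ("CL:0000813", "CL:0000084"),
    ("CL:0000897", "CL:0000084"),  # redundant: implied by the two edges above
]
reduced = transitive_reduction(flattened)
```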
