jamesaoverton / cell-name-and-marker-validator

Cell Name and Marker Validation Demo

Home Page: https://cell-name-and-marker-validator.ontodev.com

Makefile 19.29% Python 38.31% HTML 40.33% Ruby 2.07%

cell-name-and-marker-validator's Issues

Rework tokenization and normalization

This is only partly implemented in the current code.

Tokenize:

take source.tsv
read the 'POPULATION_DEFNITION_REPORTED' column
split on whatever delimiters that project has used
write a new 'Gating tokenized' column with semicolon+space separated gates
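
The tokenize steps above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the default delimiter set (comma, semicolon, slash) is an assumption because each project uses its own delimiters, and rows are assumed to be dicts such as those produced by csv.DictReader with delimiter="\t".

```python
import re

SOURCE_COLUMN = "POPULATION_DEFNITION_REPORTED"  # spelled as in the issue text
TOKENIZED_COLUMN = "Gating tokenized"


def tokenize(definition, delimiters=",;/"):
    """Split a reported definition into gates on any of the delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]"
    gates = [gate.strip() for gate in re.split(pattern, definition)]
    # Drop empty pieces left by doubled or trailing delimiters.
    return [gate for gate in gates if gate]


def add_tokenized_column(rows, delimiters=",;/"):
    """Return copies of the rows with a semicolon+space separated tokenized column."""
    result = []
    for row in rows:
        new_row = dict(row)
        gates = tokenize(row[SOURCE_COLUMN], delimiters)
        new_row[TOKENIZED_COLUMN] = "; ".join(gates)
        result.append(new_row)
    return result
```

The delimiter set is a parameter so that each project can pass whatever it actually used.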

Normalize:

read the 'Gating tokenized' column
for each gate split the name from the level (e.g. 'hi', '+')
use gate-mappings.tsv to map the names to ontology IDs (e.g. PR:000001828) or keywords (e.g. singlets)
use value-scale.tsv to normalize the levels
write a new 'Gating mapped to ontologies' column
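
A minimal sketch of the normalize steps, under several assumptions: levels are written as a suffix on the gate, the list of recognised suffixes is a guess, unmapped names and levels are passed through unchanged, and the two dictionaries stand in for the contents of gate-mappings.tsv and value-scale.tsv (whose real layout is not shown here). All function names are hypothetical.

```python
import re

# Assumed level suffixes; '++' is listed before '+' so the longer form wins.
GATE_PATTERN = re.compile(r"^(?P<name>.+?)(?P<level>\+\+|\+|-|hi|lo|dim|bright)?$")


def split_gate(gate):
    """Split a gate such as 'CD4+' into its name and its level suffix."""
    match = GATE_PATTERN.match(gate.strip())
    if not match:
        return "", ""
    return match.group("name").strip(), match.group("level") or ""


def normalize_gate(gate, gate_mappings, value_scale):
    """Map the gate name to an ontology ID or keyword and normalize its level."""
    name, level = split_gate(gate)
    mapped = gate_mappings.get(name, name)  # assumption: keep unknown names as-is
    return mapped + value_scale.get(level, level)


def normalize_column(tokenized, gate_mappings, value_scale):
    """Normalize every gate in a 'Gating tokenized' cell."""
    gates = [gate for gate in tokenized.split("; ") if gate]
    return "; ".join(normalize_gate(g, gate_mappings, value_scale) for g in gates)
```

A name that itself ends in one of the suffixes (for example a marker ending in "lo") would be split incorrectly, so the real suffix list needs care.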

Cache JSON files from ImmPort API

While doing some analysis today, I found it annoying to keep downloading the same JSON files every time I modified the Python processing. I think it would be better to keep local copies of the JSON files for re-analysis.
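
One possible shape for such a cache is sketched below. The function name is hypothetical, and the download itself is passed in as a callable so that nothing is assumed about the ImmPort API.

```python
import json
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return JSON data cached at `path`, calling `fetch()` only on a cache miss."""
    path = Path(path)
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    data = fetch()
    # Create the cache directory on first use, then store the fresh copy.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        json.dump(data, handle)
    return data
```

Deleting a cached file is then enough to force a fresh download on the next run.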

Further revisions to batch validation

I'd like some more changes to the batch validation.

The cached data should be divided by assay type: data/fcsAnalyzed/SDY113.json, data/hai/SDY113.json. The current code will overwrite the JSON without checking the assay types.
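
The per-assay layout described above could be computed by a small helper like this one. The helper name and the `root` parameter are illustrative assumptions; only the data/<assay type>/<study>.json layout comes from the text.

```python
from pathlib import Path


def cache_path(assay_type, study, root="data"):
    """Build the cache location for one study and one assay type."""
    return Path(root) / assay_type / (study + ".json")
```

Because the assay type is part of the path, the hai and fcsAnalyzed results for the same study no longer overwrite each other.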

Currently the script is re-authenticating for every fetch from ImmPort, but it should only authenticate once each time the script is executed.
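
A sketch of authenticating once per run: the token is requested lazily and then reused for every fetch. The class name is hypothetical, and `authenticate` and `fetch` are placeholder callables supplied by the caller, so no real ImmPort endpoint or signature is assumed.

```python
class ImmPortSession:
    """Hypothetical wrapper that authenticates at most once per script run."""

    def __init__(self, authenticate, fetch):
        self._authenticate = authenticate  # callable returning a token
        self._fetch = fetch                # callable(token, assay_type, study)
        self._token = None

    def token(self):
        """Request a token on first use and reuse it afterwards."""
        if self._token is None:
            self._token = self._authenticate()
        return self._token

    def get(self, assay_type, study):
        """Fetch one study's results using the shared token."""
        return self._fetch(self.token(), assay_type, study)
```

A long-running script might still need to refresh an expired token; that case is not handled here.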

In general, I don't want the commands to be interactive. I want everything scripted and version controlled, so it's reproducible. I'd like to pass the username and password using environment variables. I'd like to be able to call make build/hai.tsv (or something), and have a script look in HIPC_Studies.tsv to get all the right SDY accessions, then process all those studies.
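
A non-interactive script along these lines might start as sketched below. The environment variable names IMMPORT_USERNAME and IMMPORT_PASSWORD are assumptions, as are the function names. Because the column layout of HIPC_Studies.tsv is not given here, the sketch simply collects every cell that looks like an SDY accession.

```python
import csv
import os
import re


def credentials_from_env(environ=os.environ):
    """Read the username and password from the environment, or exit with a message."""
    try:
        return environ["IMMPORT_USERNAME"], environ["IMMPORT_PASSWORD"]
    except KeyError as missing:
        raise SystemExit("Missing environment variable: " + missing.args[0])


def study_accessions(lines):
    """Collect SDY accessions from tab-separated lines, in order, without duplicates."""
    found = []
    for row in csv.reader(lines, delimiter="\t"):
        for cell in row:
            value = cell.strip()
            if re.fullmatch(r"SDY\d+", value) and value not in found:
                found.append(value)
    return found
```

An open file object can be passed directly as `lines`, since csv.reader accepts any iterable of strings.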

For interactive scripts, having an option for each input (e.g. --studiesinfo build/HIPC_Studies.tsv) is good. For scripts used in Makefiles I prefer the convention of specifying the required files in the right order for a Make task, and then just calling the list of inputs and the output:

result.tsv: src/script.py build/input1.tsv build/input2.tsv
    $^ $@

and so

$ make result.tsv
src/script.py build/input1.tsv build/input2.tsv result.tsv
$

The main advantage is avoiding duplication of the arguments. I also like make to notice if the script has been updated.
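
On the Python side, a script following this convention only needs to treat its last argument as the output and everything before it as inputs. A minimal sketch, with a hypothetical function name:

```python
def split_args(argv):
    """Split `script INPUT... OUTPUT` into a list of inputs and one output path."""
    args = argv[1:]  # argv[0] is the script itself, the first prerequisite in $^
    if len(args) < 2:
        raise SystemExit("usage: " + argv[0] + " INPUT... OUTPUT")
    return args[:-1], args[-1]
```

The script would call split_args(sys.argv) at startup. Note that the recipe runs the script directly, so the file needs a shebang line and the executable bit.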

Store a smaller version of `pr.owl` in our repo for testing.

Travis CI builds often fail with "Failed to connect to ftp.proconsortium.org port 21: Connection timed out" (through no fault of our own). It would be better if our tests did not rely on fetching pr.owl. Maybe we could store a smaller version of pr.owl in our repo for testing, containing just the terms that we need.

Implement 'cell type & gates' syntax

When specifying the cell type, users should also be able to specify a Cell Ontology label, followed by & and a gating definition.
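
A minimal sketch of splitting such input, assuming the first & separates the label from the gating definition and that the gating part is handed on to the existing gate parsing unchanged. The function name is hypothetical.

```python
def parse_cell_type(text):
    """Split 'label & gates' into a label and a gating definition.

    The gating definition is an empty string when no '&' is present.
    """
    label, _, gates = text.partition("&")
    return label.strip(), gates.strip()
```

Input without an & is treated as a bare Cell Ontology label, so existing entries keep working.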

Add batch validation script

Add a batch validation script for population definition and name, similar to the batch validation script being added to the https://github.com/jamesaoverton/hipc-validation repository (see PR jamesaoverton/hipc-validation#3).

Continue with refactoring of global maps

The class IriMaps in common.py manages some of the maps used by different modules, and contains static methods to extract others from a given data set. This mix of instance properties and static methods is not very elegant and is confusing. Either all the maps should be attached to instances of the class, or else all of them should be extractable via static methods. I lean towards the former.
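
To illustrate the instance-based option only: the sketch below is not the existing IriMaps class. The attribute names and the (IRI, label) input format are hypothetical; the point is that every map is built in the constructor and read from the instance.

```python
class IriMaps:
    """Hypothetical instance-only variant: all maps live on the instance."""

    def __init__(self, iri_label_pairs):
        self.iri_labels = {}
        self.label_iris = {}
        for iri, label in iri_label_pairs:
            self.iri_labels[iri] = label
            self.label_iris[label] = iri

    def label_for(self, iri):
        """Return the label for an IRI, or None when it is unknown."""
        return self.iri_labels.get(iri)
```

Callers then receive one object holding a consistent set of maps, instead of mixing instance properties with static extraction calls.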

Eliminate old `normalize.py` and `report.py`

These two files should be replaced by their new counterparts: normalize2.py and report2.py.

Refactor code for better sharing

The batch processing and web page are doing a lot of the same things, and should share more code.

The history is that I wrote the batch processing first, then wrote server.py from scratch (in a hurry). The key difference is that the batch processing scripts need to ingest legacy data, while the web site needs to validate new data.

Update to use VALVE

I would like to add examples, terminology browser, and validation form for cell names and markers using VALVE, following the model of the extended immune exposure site:

I've done preliminary work on a Google Sheet here:

https://docs.google.com/spreadsheets/d/109FaxCDuwj9fxPqk1_haIrwo3_Y6EPU8lbNGYzmbRsU/edit#gid=0

I also have some data and scripts that I will share.

Consider removing globals from `common.py`

The server.py had some global variables. I wanted to reuse some of that code in normalize2.py, so I moved it to common.py, but I had to move the globals with it.

Some of the globals never change, and that's fine. But some hold state that is loaded from files. I would prefer to avoid holding that state in globals, and instead pass the necessary information as function arguments. This is a preference, not an absolute requirement, but I think it's worth looking into.
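
As a small illustration of the preference, state loaded from a file can be returned by a loader and handed to the functions that need it. The function names and the two-column file layout below are assumptions, not the current common.py code.

```python
def load_levels(lines):
    """Parse tab-separated 'level<TAB>normalized' lines into a dict."""
    levels = {}
    for line in lines:
        level, normalized = line.rstrip("\n").split("\t")
        levels[level] = normalized
    return levels


def normalize_level(level, levels):
    """Look the level up in the mapping passed by the caller, not in a global."""
    return levels.get(level, level)
```

Both server.py and normalize2.py could then load the data once at startup and pass it down, which also makes the functions easier to test with small in-memory tables.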

Add preferred labels

Most of our gate terms come from the Protein Ontology (PR). PR uses the rdfs:label predicate to indicate its primary labels. These primary labels are correct and informative, but usually too long for our purposes. Instead we want to use one of the synonyms PR provides.

The synonym we want uses the oboInOwl:hasExactSynonym predicate, annotated with an oboInOwl:hasSynonymType of 'http://purl.obolibrary.org/obo/pr#PRO-short-label'. These are a little tricky to extract, because the synonym type is attached to a separate owl:Axiom rather than to the term itself.

Here's a fragment of pr.owl. We want to extract the pair of the term ID 'http://purl.obolibrary.org/obo/PR_000000005' and the short label 'TGFBR2'.

    <owl:Axiom>
        <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">PRO:DNx</oboInOwl:hasDbXref>
        <owl:annotatedTarget rdf:datatype="http://www.w3.org/2001/XMLSchema#string">TGFBR2</owl:annotatedTarget>
        <owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/PR_000000005"/>
        <oboInOwl:hasSynonymType rdf:resource="http://purl.obolibrary.org/obo/pr#PRO-short-label"/>
        <owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"/>
    </owl:Axiom>

There are two main options:

We can process the XML and look for this pattern.
We can use rapper to convert the XML to N-Triples format (line-based), then look for the pattern on consecutive lines. In N-Triples the same example looks like this:

_:genid132818 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Axiom> .
_:genid132818 <http://www.geneontology.org/formats/oboInOwl#hasDbXref> "PRO:DNx"^^<http://www.w3.org/2001/XMLSchema#string> .
_:genid132818 <http://www.w3.org/2002/07/owl#annotatedTarget> "TGFBR2"^^<http://www.w3.org/2001/XMLSchema#string> .
_:genid132818 <http://www.w3.org/2002/07/owl#annotatedSource> <http://purl.obolibrary.org/obo/PR_000000005> .
_:genid132818 <http://www.geneontology.org/formats/oboInOwl#hasSynonymType> <http://purl.obolibrary.org/obo/pr#PRO-short-label> .
_:genid132818 <http://www.w3.org/2002/07/owl#annotatedProperty> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> .

Note that _:genid132818 is not a stable identifier, and may change on each run.

When doing batch processing, we'll use these labels to generate a new 'Gating preferred label' column.
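
The N-Triples option could be sketched as follows. Instead of relying on the lines being consecutive, this version groups triples by their blank node, which assumes all blank-node triples fit in memory. The function name is hypothetical and literal escape sequences are not decoded.

```python
import re
from collections import defaultdict

OWL = "http://www.w3.org/2002/07/owl#"
OIO = "http://www.geneontology.org/formats/oboInOwl#"
SHORT_LABEL = "<http://purl.obolibrary.org/obo/pr#PRO-short-label>"
EXACT_SYNONYM = "<" + OIO + "hasExactSynonym>"

# subject, predicate IRI, object, then the terminating dot
TRIPLE = re.compile(r"^(\S+)\s+<([^>]+)>\s+(.+?)\s*\.\s*$")
# a quoted literal with an optional datatype or language tag
LITERAL = re.compile(r'^"(.*)"(?:\^\^<[^>]+>|@[A-Za-z0-9-]+)?$')


def short_labels(lines):
    """Map term IRIs to PRO short labels found on owl:Axiom blank nodes."""
    axioms = defaultdict(dict)
    for line in lines:
        match = TRIPLE.match(line)
        if match and match.group(1).startswith("_:"):
            subject, predicate, obj = match.groups()
            axioms[subject][predicate] = obj
    labels = {}
    for properties in axioms.values():
        if (properties.get(OIO + "hasSynonymType") == SHORT_LABEL
                and properties.get(OWL + "annotatedProperty") == EXACT_SYNONYM):
            source = properties.get(OWL + "annotatedSource", "")
            target = LITERAL.match(properties.get(OWL + "annotatedTarget", ""))
            if source.startswith("<") and target:
                labels[source[1:-1]] = target.group(1)
    return labels
```

Since the blank node ID is used only as a grouping key within one run, its instability between runs does not matter. The resulting dictionary could feed the 'Gating preferred label' column.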