jamesaoverton / cell-name-and-marker-validator
Home Page: https://cell-name-and-marker-validator.ontodev.com
Cell Name and Marker Validation Demo
This is only partly implemented in the current code.
Tokenize:
source.tsv
Normalize:
gate-mappings.tsv to map the names to ontology IDs (e.g. PR:000001828) or keywords (e.g. singlets)
value-scale.tsv to normalize the levels
Trying to do some analysis today, it was annoying to have to keep downloading the same JSON files while I modified the Python processing. Now I think it would be better to keep copies of the JSON files for re-analysis.
I'd like some more changes to the batch validation.
The cached data should be divided by assay type: data/fcsAnalyzed/SDY113.json, data/hai/SDY113.json. The current code will overwrite the JSON without checking the assay types.
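The per-assay cache layout described above could be sketched roughly as follows. This is a minimal illustration, not the repository's code: `cache_path`, `load_or_fetch`, and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever actually requests the data from ImmPort.

```python
import json
from pathlib import Path


def cache_path(cache_dir, assay_type, study_id):
    """Build a path such as data/hai/SDY113.json (one directory per assay type)."""
    return Path(cache_dir) / assay_type / f"{study_id}.json"


def load_or_fetch(cache_dir, assay_type, study_id, fetch):
    """Return the cached JSON if it exists; otherwise fetch it and cache it.

    `fetch` is a hypothetical callable (assay_type, study_id) -> JSON-serializable
    data, standing in for the real ImmPort request.
    """
    path = cache_path(cache_dir, assay_type, study_id)
    if path.exists():
        with path.open() as f:
            return json.load(f)
    data = fetch(assay_type, study_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        json.dump(data, f)
    return data
```

Because the assay type is part of the path, fetching `hai` data for SDY113 cannot overwrite the `fcsAnalyzed` data for the same study.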
Currently the script is re-authenticating for every fetch from ImmPort, but it should only authenticate once each time the script is executed.
In general, I don't want the commands to be interactive. I want everything scripted and version controlled, so it's reproducible. I'd like to pass the username and password using environment variables. I'd like to be able to call make build/hai.tsv (or something), and have a script look in HIPC_Studies.tsv to get all the right SDY accessions, then process all those studies.
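A non-interactive version of these requests might look something like the sketch below. Everything named here is an assumption for illustration: the environment variable names `IMMPORT_USERNAME` and `IMMPORT_PASSWORD`, the `Session` class, the `authenticate` callable, and the `Study Accession` column header are not taken from the repository or from ImmPort.

```python
import csv
import os


def get_credentials(environ=os.environ):
    """Read credentials from the environment instead of prompting for them.

    IMMPORT_USERNAME and IMMPORT_PASSWORD are assumed variable names.
    """
    try:
        return environ["IMMPORT_USERNAME"], environ["IMMPORT_PASSWORD"]
    except KeyError as e:
        raise SystemExit(f"Missing environment variable: {e.args[0]}")


class Session:
    """Authenticate lazily, and at most once per script run."""

    def __init__(self, authenticate, username, password):
        self._authenticate = authenticate  # hypothetical login callable
        self._username = username
        self._password = password
        self._token = None

    @property
    def token(self):
        # Only the first access triggers a login; later fetches reuse the token.
        if self._token is None:
            self._token = self._authenticate(self._username, self._password)
        return self._token


def read_study_ids(lines, column="Study Accession"):
    """Return the accessions from a tab-separated table with a header row.

    The column name is an assumption about HIPC_Studies.tsv.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [row[column] for row in reader if row.get(column)]
```

A batch script could then open `HIPC_Studies.tsv`, pass the file object to `read_study_ids`, and reuse one `Session` for every study. If tokens expire during a long run, the session would also need a refresh step, which this sketch omits.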
For interactive scripts, having an option for each input (e.g. --studiesinfo build/HIPC_Studies.tsv) is good. For scripts used in Makefiles I prefer the convention of listing the required files in the right order as the prerequisites of a Make task, and then just passing the list of inputs and the output:
result.tsv: src/script.py build/input1.tsv build/input2.tsv
	$^ $@
and so
$ make result.tsv
src/script.py build/input1.tsv build/input2.tsv result.tsv
$
The main advantage is avoiding duplication of the arguments. I also like that make notices when the script has been updated, because the script is itself a prerequisite.
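On the Python side, a script that follows this convention only needs to treat its last argument as the output and everything before it as inputs. The sketch below assumes that layout; the name `main` and the pass-through processing are placeholders, not the project's actual script.

```python
import csv
import sys


def main(argv):
    """Treat every argument but the last as an input, and the last as the output."""
    if len(argv) < 3:
        raise SystemExit(f"usage: {argv[0]} INPUT [INPUT ...] OUTPUT")
    *input_paths, output_path = argv[1:]
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for path in input_paths:
            with open(path, newline="") as f:
                # Placeholder processing: copy the rows through unchanged.
                writer.writerows(csv.reader(f, delimiter="\t"))


# In a real script, call main(sys.argv) under an
# `if __name__ == "__main__":` guard so that Make can run the file directly.
```

With the rule above, `$^` expands to the script followed by the inputs and `$@` to the target, so `argv` arrives in exactly the order this function expects.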
Travis CI builds often fail due to "Failed to connect to ftp.proconsortium.org port 21: Connection timed out" (due to no fault of our own). It would be better if our tests did not rely on fetching pr.owl. Maybe we could store a smaller version of pr.owl in our repo for testing, with just the terms that we need.
When specifying the cell type, users should also be able to specify a Cell Ontology label, followed by & and a gating definition.
Add a batch validation script for population definition and name, similar to the batch validation script being added to the https://github.com/jamesaoverton/hipc-validation repository (see PR jamesaoverton/hipc-validation#3).
The class IriMaps in common.py manages some of the maps used by different modules, and contains static methods to extract others from a given data set. This mix of instance properties and static methods is confusing and not very elegant. Either all the maps should be attached to instances of the class, or else all of them should be extractable via static methods. I lean towards the former.
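The "attach everything to instances" option could look roughly like this. It is only a sketch of the shape: the attribute names `iri_labels` and `label_iris`, the `from_pairs` constructor, and the `preferred_label` helper are hypothetical and do not mirror the real class in common.py.

```python
from dataclasses import dataclass, field


@dataclass
class IriMaps:
    """Hold every lookup table as instance state (hypothetical field names)."""

    iri_labels: dict = field(default_factory=dict)  # IRI -> label
    label_iris: dict = field(default_factory=dict)  # label -> IRI

    @classmethod
    def from_pairs(cls, pairs):
        """Build all maps in one pass over (iri, label) pairs."""
        maps = cls()
        for iri, label in pairs:
            maps.add(iri, label)
        return maps

    def add(self, iri, label):
        # Keep both directions in sync so callers never rebuild one from the other.
        self.iri_labels[iri] = label
        self.label_iris[label] = iri


def preferred_label(maps, iri):
    """Receive the maps as an argument rather than reading module-level state."""
    return maps.iri_labels.get(iri, iri)
```

Passing one instance around keeps the loaded state explicit, which also makes it straightforward to build a small `IriMaps` for tests.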
These two files should be replaced by their new counterparts: normalize2.py and report2.py.
The batch processing and web page are doing a lot of the same things, and should share more code.
The history is that I wrote the batch processing first, then wrote server.py from scratch (in a hurry). The key difference is that the batch processing scripts need to ingest legacy data, while the web site needs to validate new data.
I would like to add examples, a terminology browser, and a validation form for cell names and markers using VALVE, following the model of the extended immune exposure site:
I've done preliminary work on a Google Sheet here:
https://docs.google.com/spreadsheets/d/109FaxCDuwj9fxPqk1_haIrwo3_Y6EPU8lbNGYzmbRsU/edit#gid=0
I also have some data and scripts that I will share.
The server.py had some global variables. I wanted to reuse some of that code in normalize2.py, so I moved it to common.py, but I had to move the globals with it.
Some of the globals never change, and that's fine. But some hold state that is loaded from files. I would prefer to avoid holding that state in globals, and instead pass the necessary information with function arguments. This is a preference, and not an absolute requirement, but I think it's worth looking into.
Most of our gate terms come from the Protein Ontology (PR). PR uses the rdfs:label predicate to indicate its primary labels. These primary labels are correct and informative, but usually too long for our purposes. Instead we want to use one of the synonyms PR provides.
The synonym we want uses the oio:hasExactSynonym predicate annotated with an oboInOwl:hasSynonymType of 'http://purl.obolibrary.org/obo/pr#PRO-short-label'. It's a little tricky to extract these.
Here's a fragment of pr.owl. We want to extract the pair of the term ID 'http://purl.obolibrary.org/obo/PR_000000005' and the short label 'TGFBR2'.
<owl:Axiom>
<oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">PRO:DNx</oboInOwl:hasDbXref>
<owl:annotatedTarget rdf:datatype="http://www.w3.org/2001/XMLSchema#string">TGFBR2</owl:annotatedTarget>
<owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/PR_000000005"/>
<oboInOwl:hasSynonymType rdf:resource="http://purl.obolibrary.org/obo/pr#PRO-short-label"/>
<owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"/>
</owl:Axiom>
There are two main options:
Use rapper to convert the XML to N-Triples format (line-based), then look for the pattern on consecutive lines. In N-Triples the same example looks like this:
_:genid132818 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Axiom> .
_:genid132818 <http://www.geneontology.org/formats/oboInOwl#hasDbXref> "PRO:DNx"^^<http://www.w3.org/2001/XMLSchema#string> .
_:genid132818 <http://www.w3.org/2002/07/owl#annotatedTarget> "TGFBR2"^^<http://www.w3.org/2001/XMLSchema#string> .
_:genid132818 <http://www.w3.org/2002/07/owl#annotatedSource> <http://purl.obolibrary.org/obo/PR_000000005> .
_:genid132818 <http://www.geneontology.org/formats/oboInOwl#hasSynonymType> <http://purl.obolibrary.org/obo/pr#PRO-short-label> .
_:genid132818 <http://www.w3.org/2002/07/owl#annotatedProperty> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> .
Note that _:genid132818 is not a stable identifier, and may change on each run.
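The N-Triples route can be sketched as follows. Rather than relying on line order, this version groups triples by their blank-node subject, so it does not depend on the `_:genid…` value or on the lines being consecutive. The function name `short_labels` and the regular expression are assumptions for illustration; the pattern handles the simple lines shown above and does not unescape literals.

```python
import re
from collections import defaultdict

OWL = "http://www.w3.org/2002/07/owl#"
OIO = "http://www.geneontology.org/formats/oboInOwl#"
SHORT_LABEL = "http://purl.obolibrary.org/obo/pr#PRO-short-label"

# subject, <predicate>, then either an <IRI> or a "literal" with optional
# datatype or language suffix, then the terminating dot.
TRIPLE = re.compile(
    r'^(\S+)\s+<([^>]+)>\s+(?:<([^>]+)>|"((?:[^"\\]|\\.)*)"\S*)\s*\.\s*$'
)


def short_labels(lines):
    """Map term IRI -> PRO short label, given an iterable of N-Triples lines."""
    axioms = defaultdict(dict)
    for line in lines:
        match = TRIPLE.match(line)
        if not match:
            continue
        subject, predicate, iri, literal = match.groups()
        # Annotation axioms are blank nodes; ignore every other subject.
        if subject.startswith("_:"):
            axioms[subject][predicate] = iri if iri is not None else literal

    labels = {}
    for props in axioms.values():
        source = props.get(OWL + "annotatedSource")
        target = props.get(OWL + "annotatedTarget")
        if (
            source
            and target
            and props.get(OIO + "hasSynonymType") == SHORT_LABEL
            and props.get(OWL + "annotatedProperty") == OIO + "hasExactSynonym"
        ):
            labels[source] = target
    return labels
```

This keeps every blank node in memory until the end, which may be too much for the full pr.owl; a streaming variant that flushes each axiom when the subject changes would follow the "consecutive lines" idea more closely.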
When doing batch processing, we'll use these labels to generate a new 'Gating preferred label' column.