cell-name-and-marker-validator's People

Contributors

beckyjackson, dependabot[bot], jamesaoverton, lmcmicu

Forkers

tuqiang2014

cell-name-and-marker-validator's Issues

Rework tokenization and normalization

This is only partly implemented in the current code.

Tokenize:

  • take source.tsv
  • read the 'POPULATION_DEFNITION_REPORTED' column
  • split on whatever delimiters that project has used
  • write a new 'Gating tokenized' column with semicolon+space separated gates
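
The tokenize step above could be sketched roughly as follows. This is only an illustration: the delimiter pattern and the function names `tokenize` and `tokenize_table` are assumptions, since each project uses its own delimiters.

```python
import csv
import re

# Assumed delimiters; each project would supply its own pattern.
DELIMITERS = re.compile(r"[,/|]")


def tokenize(reported, delimiters=DELIMITERS):
    """Split a reported population definition into gates joined by '; '."""
    gates = [gate.strip() for gate in delimiters.split(reported)]
    return "; ".join(gate for gate in gates if gate)


def tokenize_table(source, target, delimiters=DELIMITERS):
    """Copy a TSV stream, appending a 'Gating tokenized' column."""
    reader = csv.DictReader(source, delimiter="\t")
    fields = list(reader.fieldnames) + ["Gating tokenized"]
    writer = csv.DictWriter(
        target, fieldnames=fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        reported = row["POPULATION_DEFNITION_REPORTED"]
        row["Gating tokenized"] = tokenize(reported, delimiters)
        writer.writerow(row)
```

Here `source` and `target` would be open file objects for source.tsv and the output table.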

Normalize:

  • read the 'Gating tokenized' column
  • for each gate split the name from the level (e.g. 'hi', '+')
  • use gate-mappings.tsv to map the names to ontology IDs (e.g. PR:000001828) or keywords (e.g. singlets)
  • use value-scale.tsv to normalize the levels
  • write a new 'Gating mapped to ontologies' column
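
A minimal sketch of the normalize step, under several assumptions: the two dictionaries stand in for gate-mappings.tsv and value-scale.tsv, `MARKER1` is a placeholder gate name, the normalized level symbols are made up for illustration, and unmapped names are simply passed through.

```python
import re

# Hypothetical stand-ins for gate-mappings.tsv and value-scale.tsv.
GATE_MAPPINGS = {"MARKER1": "PR:000001828", "singlets": "singlets"}
VALUE_SCALE = {"+": "+", "hi": "++", "-": "-"}


def gate_pattern(scale):
    """Build a regex that splits a gate into a name and an optional level."""
    levels = "|".join(
        re.escape(level) for level in sorted(scale, key=len, reverse=True)
    )
    return re.compile(rf"^(?P<name>.+?)(?P<level>{levels})?$")


def normalize_gate(gate, mappings, scale):
    """Map the gate name to an ID or keyword and normalize its level."""
    match = gate_pattern(scale).match(gate.strip())
    name = match.group("name")
    level = match.group("level") or ""
    # Assumption: names without a mapping are kept unchanged.
    return mappings.get(name, name) + scale.get(level, level)


def normalize(tokenized, mappings=GATE_MAPPINGS, scale=VALUE_SCALE):
    """Normalize every gate in a 'Gating tokenized' cell."""
    gates = [gate for gate in tokenized.split(";") if gate.strip()]
    return "; ".join(normalize_gate(gate, mappings, scale) for gate in gates)
```

The result of `normalize` is what would go into the 'Gating mapped to ontologies' column.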

Cache JSON files from ImmPort API

While doing some analysis today, I found it annoying to keep downloading the same JSON files as I modified the Python processing. I think it would be better to keep copies of the JSON files for re-analysis.
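
One possible shape for this, as a sketch only: `load_cached_json` is a hypothetical helper, and `fetch` stands for whatever function actually downloads the JSON from ImmPort.

```python
import json
from pathlib import Path


def load_cached_json(path, fetch):
    """Return JSON from `path` if it exists; otherwise fetch and store it."""
    path = Path(path)
    if path.exists():
        with path.open() as f:
            return json.load(f)
    data = fetch()  # only reached when there is no cached copy
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        json.dump(data, f)
    return data
```

Deleting a cached file forces a fresh download on the next run.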

Further revisions to batch validation

I'd like some more changes to the batch validation.

The cached data should be divided by assay type: data/fcsAnalyzed/SDY113.json, data/hai/SDY113.json. The current code will overwrite the JSON without checking the assay types.

Currently the script is re-authenticating for every fetch from ImmPort, but it should only authenticate once each time the script is executed.

In general, I don't want the commands to be interactive. I want everything scripted and version controlled, so it's reproducible. I'd like to pass the username and password using environment variables. I'd like to be able to call make build/hai.tsv (or something), and have a script look in HIPC_Studies.tsv to get all the right SDY accessions, then process all those studies.
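
A rough sketch of these three changes, with assumptions flagged: the environment variable names `IMMPORT_USERNAME` and `IMMPORT_PASSWORD` are made up here, and `authenticate` stands for whatever function performs the real ImmPort login and returns a token.

```python
import os
from pathlib import Path


def cache_path(assay_type, study, root="data"):
    """Keep cached JSON separated by assay type, e.g. data/hai/SDY113.json."""
    return Path(root) / assay_type / f"{study}.json"


def credentials(env=os.environ):
    """Read the username and password from (assumed) environment variables."""
    return env["IMMPORT_USERNAME"], env["IMMPORT_PASSWORD"]


class Session:
    """Authenticate lazily, and at most once per script run."""

    def __init__(self, authenticate, env=os.environ):
        self._authenticate = authenticate  # hypothetical login function
        self._env = env
        self._token = None

    @property
    def token(self):
        if self._token is None:
            self._token = self._authenticate(*credentials(self._env))
        return self._token
```

Every fetch would then use `session.token`, so the login happens only on the first fetch.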

For interactive scripts, having an option for each input (e.g. --studiesinfo build/HIPC_Studies.tsv) is good. For scripts used in Makefiles I prefer the convention of specifying the required files in the right order for a Make task, and then just calling the list of inputs and the output:

result.tsv: src/script.py build/input1.tsv build/input2.tsv
    $^ $@

and so

$ make result.tsv
src/script.py build/input1.tsv build/input2.tsv result.tsv
$

The main advantage is avoiding duplication of the arguments. I also like make to notice if the script has been updated.
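
On the Python side, a script following this convention might look like the sketch below. The processing in `run` is only a placeholder (it concatenates the inputs), and the function names are assumptions.

```python
import sys


def run(input_paths, output_path):
    """Placeholder processing: concatenate the inputs into the output."""
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as f:
                out.write(f.read())


def main(argv):
    """Treat every argument but the last as an input; the last is the output."""
    if len(argv) < 3:
        sys.exit(f"usage: {argv[0]} INPUT... OUTPUT")
    *input_paths, output_path = argv[1:]
    run(input_paths, output_path)

# A real script would end by calling main(sys.argv).
```

Because the script itself is the first prerequisite, `$^ $@` expands to the script path followed by the inputs and the target.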

Store a smaller version of `pr.owl` in our repo for testing.

Travis CI builds often fail with "Failed to connect to ftp.proconsortium.org port 21: Connection timed out" (through no fault of our own). It would be better if our tests did not rely on fetching pr.owl. Maybe we could store a smaller version of pr.owl in our repo for testing, with just the terms that we need.

Continue with refactoring of global maps

The class IriMaps in common.py manages some of the maps used by different modules, and contains static methods to extract others from a given data set. This mix of instance properties and static methods is not very elegant and is confusing. Either all the maps should be attached to instances of the class, or else all of them should be extractable via static methods. I lean towards the former.
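
To illustrate the first option, here is a sketch in which every map lives on the instance. The map names `iri_labels` and `label_iris` and the constructor `from_pairs` are hypothetical and do not reflect the current contents of common.py.

```python
from dataclasses import dataclass, field


@dataclass
class IriMaps:
    """All maps are instance attributes; none are computed by static methods."""

    iri_labels: dict = field(default_factory=dict)  # IRI -> label
    label_iris: dict = field(default_factory=dict)  # label -> IRI

    @classmethod
    def from_pairs(cls, pairs):
        """Build every map once from (IRI, label) pairs in a data set."""
        maps = cls()
        for iri, label in pairs:
            maps.iri_labels[iri] = label
            maps.label_iris[label] = iri
        return maps
```

Modules would then receive one `IriMaps` instance instead of mixing instance lookups with static extraction calls.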

Refactor code for better sharing

The batch processing and web page are doing a lot of the same things, and should share more code.

The history is that I wrote the batch processing first, then wrote server.py from scratch (in a hurry). The key difference is that the batch processing scripts need to ingest legacy data, while the web site needs to validate new data.

Update to use VALVE

I would like to add examples, terminology browser, and validation form for cell names and markers using VALVE, following the model of the extended immune exposure site:

I've done preliminary work on a Google Sheet here:

https://docs.google.com/spreadsheets/d/109FaxCDuwj9fxPqk1_haIrwo3_Y6EPU8lbNGYzmbRsU/edit#gid=0

I also have some data and scripts that I will share.

Consider removing globals from `common.py`

The server.py had some global variables. I wanted to reuse some of that code in normalize2.py, so I moved it to common.py, but I had to move the globals with it.

Some of the globals never change, and that's fine. But some hold state that is loaded from files. I would prefer to avoid holding that state in globals, and instead pass the necessary information with function arguments. This is a preference, and not an absolute requirement, but I think it's worth looking into.

Add preferred labels

Most of our gate terms come from the Protein Ontology (PR). PR uses the rdfs:label predicate to indicate its primary labels. These primary labels are correct and informative, but usually too long for our purposes. Instead we want to use one of the synonyms PR provides.

The synonym we want uses the oboInOwl:hasExactSynonym predicate annotated with an oboInOwl:hasSynonymType of 'http://purl.obolibrary.org/obo/pr#PRO-short-label'. It's a little tricky to extract these.

Here's a fragment of pr.owl. We want to extract the pair of the term ID 'http://purl.obolibrary.org/obo/PR_000000005' and the short label 'TGFBR2'.

    <owl:Axiom>
        <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">PRO:DNx</oboInOwl:hasDbXref>
        <owl:annotatedTarget rdf:datatype="http://www.w3.org/2001/XMLSchema#string">TGFBR2</owl:annotatedTarget>
        <owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/PR_000000005"/>
        <oboInOwl:hasSynonymType rdf:resource="http://purl.obolibrary.org/obo/pr#PRO-short-label"/>
        <owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"/>
    </owl:Axiom>

There are two main options:

  1. We can process the XML and look for this pattern.
  2. We can use rapper to convert the XML to N-Triples format (line-based), then look for the pattern on consecutive lines. In N-Triples the same example looks like this:
_:genid132818 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Axiom> .
_:genid132818 <http://www.geneontology.org/formats/oboInOwl#hasDbXref> "PRO:DNx"^^<http://www.w3.org/2001/XMLSchema#string> .
_:genid132818 <http://www.w3.org/2002/07/owl#annotatedTarget> "TGFBR2"^^<http://www.w3.org/2001/XMLSchema#string> .
_:genid132818 <http://www.w3.org/2002/07/owl#annotatedSource> <http://purl.obolibrary.org/obo/PR_000000005> .
_:genid132818 <http://www.geneontology.org/formats/oboInOwl#hasSynonymType> <http://purl.obolibrary.org/obo/pr#PRO-short-label> .
_:genid132818 <http://www.w3.org/2002/07/owl#annotatedProperty> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> .

Note that _:genid132818 is not a stable identifier, and may change on each run.

When doing batch processing, we'll use these labels to generate a new 'Gating preferred label' column.
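
Option 1 could be sketched with the standard library's ElementTree, as below. This is an untuned illustration: the function names are assumptions, and for a file as large as pr.owl the memory use of `iterparse` would need checking, since only the `owl:Axiom` elements are cleared after use.

```python
import xml.etree.ElementTree as ET

NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
}
AXIOM = "{http://www.w3.org/2002/07/owl#}Axiom"
RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"
SHORT_LABEL = "http://purl.obolibrary.org/obo/pr#PRO-short-label"
EXACT_SYNONYM = "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"


def resource(axiom, tag):
    """Return the rdf:resource of the first child with this tag, if any."""
    child = axiom.find(tag, NS)
    return None if child is None else child.get(RDF_RESOURCE)


def short_labels(source):
    """Yield (term IRI, short label) pairs from an RDF/XML file or stream."""
    for _, elem in ET.iterparse(source):
        if elem.tag != AXIOM:
            continue
        target = elem.find("owl:annotatedTarget", NS)
        if (
            target is not None
            and resource(elem, "oboInOwl:hasSynonymType") == SHORT_LABEL
            and resource(elem, "owl:annotatedProperty") == EXACT_SYNONYM
        ):
            yield resource(elem, "owl:annotatedSource"), target.text
        elem.clear()  # drop the axiom's children once it has been handled
```

Matching on the axiom's contents rather than its blank-node identifier avoids any dependence on unstable IDs such as _:genid132818.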
