atlas-annotations's People

Contributors

a-solovyev12, alfonsomunozpomer, irisdianauy, mkeays, pcm32, pinin4fjords, rpetry, suhaibmo, wbazant

atlas-annotations's Issues

Downloaded BioMart annotation TSV files are in an incorrect format

In the E196 E!G43 and WBPS12 update, we noticed that the Scala script src/pipeline/Start.sc downloads the files successfully, but for a few plant species the file contents were corrupted.
The files contained HTML/XML tags instead of annotation data.

brassica_napus.ensgene.interpro.tsv
hordeum_vulgare.ensgene.go.tsv
oryza_indica.ensgene.go.tsv
oryza_sativa.ensgene.interpro.tsv
physcomitrella_patens.ensgene.interpro.tsv
solanum_tuberosum.ensgene.interpro.tsv

The code did not fail, because the downloaded file exists.

Solution:
We need a way to verify the format and content of each downloaded file, and to scan the files already on the file system for HTML tags.
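
One possible shape for such a check is sketched below. The object `TsvSanity` and its methods are hypothetical names, not part of the existing pipeline, and the heuristic (a line that starts with a tag-like token is suspicious) is an assumption about what never appears in a valid annotation file.

```scala
import scala.io.Source

// Hypothetical helper, not part of the existing pipeline.
object TsvSanity {
  // A line is suspicious when it starts with a tag-like token,
  // for example <html>, </div> or <!-- ... -->.
  def looksLikeMarkup(line: String): Boolean = {
    val trimmed = line.trim
    trimmed.startsWith("<") && trimmed.contains(">")
  }

  // Returns the 1-based numbers of all suspicious lines.
  def suspiciousLines(lines: Iterator[String]): List[Int] =
    lines.zipWithIndex.collect {
      case (line, index) if looksLikeMarkup(line) => index + 1
    }.toList

  // Convenience wrapper that reads a file from disk and closes it afterwards.
  def check(path: String): List[Int] = {
    val source = Source.fromFile(path, "UTF-8")
    try suspiciousLines(source.getLines()) finally source.close()
  }
}
```

A non-empty result from `TsvSanity.check` on any of the files listed above would be a reason to fail the run, or to download that file again.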

The species-properties.json file generated by atlas_species.sh is malformed

Error introduced in 8f0da81

The following println statement in src/Directories.sc:

println("Using the following paths for sources of experiments:")

causes the script that generates the species file, ./sh/atlas_species.sh, to produce the following (malformed) JSON file:

Using the following paths for sources of experiments:
[{
  "name":"Aegilops_tauschii",
  "defaultQueryFactorType":"ORGANISM_PART",
  "kingdom":"plants",
  "resources":[{
    "type":"genome_browser",
...
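
One way to keep the generated JSON clean, sketched here under the assumption that the message is purely diagnostic, is to write it to standard error so that only the JSON reaches standard output. `Diagnostics.log` is a hypothetical name, not an existing function in src/Directories.sc.

```scala
// Hypothetical helper, not part of the existing pipeline.
object Diagnostics {
  // Writes to stderr; stdout stays reserved for the script's real output,
  // so redirecting stdout to a .json file captures only the JSON.
  def log(message: String): Unit = Console.err.println(message)
}
```

The println call in src/Directories.sc could then become `Diagnostics.log("Using the following paths for sources of experiments:")`.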

Make pipeline more amenable to test runs

To test changes in the Scala part of the pipeline, I delete lots of annotations, run source/setup_test_env.sh, and then run:
amm src/pipeline/Start.sc --force true

This issue proposes adding an optional command-line argument that defines where the annotation sources are, so that the pipeline can be integration tested automatically.

Downloader for array designs fails silently

We have seen array design downloads fail silently: the downloader writes a file full of HTML error markup instead of a proper design file, yet the process does not fail. It should either exit with an error code, or retry and then fail.

A recent download of the oryza_indica array design A-AFFY-126 produced:

                                                                    <h1><a href="/" title="Back to Server error homepage">Server error</a></h1>^M       
                                            </div>^M    
                                        <li id="about" class=" last"><a href="//www.ebi.ac.uk/about" title="About us">About us</a></li>^M       
                                        <li id="industry" class=""><a href="//www.ebi.ac.uk/industry" title="Industry">Industry</a></li>^M      
                                        <li id="research" class=""><a href="//www.ebi.ac.uk/research" title="Research">Research</a></li>^M      
                                        <li id="services" class=" first "><a href="//www.ebi.ac.uk/services" title="Services">Services</a></li>^M       
                                        <li id="training" class=""><a href="//www.ebi.ac.uk/training" title="Training">Training</a></li>^M      
                                    </ul>^M     
                                <div id="global-masthead" class="masthead grid_24">^M   
                                <div id="local-masthead" class="masthead grid_24 nomenu">^M     
                    <!-- set active class as appropriate -->^M  
                <h3 class="about"><a href="//www.ebi.ac.uk/about">About us</a></h3>^M   
                <h3 class="embl-ebi"><a href="//www.ebi.ac.uk/" title="EMBL-EBI">EMBL-EBI</a></h3>^M    
                <h3 class="industry"><a href="//www.ebi.ac.uk/industry">Industry</a></h3>^M     
                <h3 class="research"><a href="//www.ebi.ac.uk/research">Research</a></h3>^M     
                <h3 class="services"><a href="//www.ebi.ac.uk/services">Services</a></h3>^M     
                <h3 class="training"><a href="//www.ebi.ac.uk/training">Training</a></h3>^M     
                <ul id="global-nav">^M  
            <!-- NB: for additional title style patterns, see http://frontier.ebi.ac.uk/web/style/patterns -->^M        
            <!-- local-title -->^M      
            <!--This has to be one line and no newline characters-->^M  
            </div>^M    
            </nav>^M
...
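
The "retry and then fail" behaviour could look roughly like the sketch below. `Retry`, `retry` and `validated` are hypothetical names, and treating any body that begins with `<` as a failed download is an assumption based on the output above, not a rule taken from the existing downloader.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of the existing downloader.
object Retry {
  // Treats a response body as a failed download when it begins with markup.
  def validated(body: String): Try[String] =
    if (body.trim.startsWith("<"))
      Failure(new RuntimeException("download returned markup, not a design file"))
    else
      Success(body)

  // Runs `attempt` up to `maxAttempts` times and returns the first Success,
  // or the last Failure once every attempt has failed.
  @annotation.tailrec
  def retry[A](maxAttempts: Int)(attempt: () => Try[A]): Try[A] =
    attempt() match {
      case success @ Success(_)                     => success
      case failure @ Failure(_) if maxAttempts <= 1 => failure
      case Failure(_)                               => retry(maxAttempts - 1)(attempt)
    }
}
```

A caller would wrap the actual download in `attempt`, pass its body through `validated`, and exit with a non-zero code when the final result is a Failure.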

Clean up annotations without ids

Sometimes BioMart returns annotations of the form

id<tab><empty space>

We want to remove these in almost all cases. The only exceptions are:

  • array designs
  • gene id to gene name files
    The reason we keep them there: we use them for decorating files, and we assume that they'll be complete.

The benefit is operational efficiency: the files take about 30% less space, and quite a few processes will run correspondingly faster. The resulting files will also be slightly more "correct" in the abstract sense of representing the annotations we want.

To implement this, you will need to add a new step in Transform.sc, test it, and then run the annotation update part of atlasprod against the resulting annotations. I am fairly certain that array design files and gene id to gene name files are the only ones where we want to keep the blanks, but this needs verifying.
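
As a starting point, the filter itself might look like the sketch below. `BlankAnnotations` is a hypothetical name rather than existing code in Transform.sc, and two assumptions are baked in: a line with no tab at all counts as blank, and a line with several value columns is dropped only when all of them are blank.

```scala
// Hypothetical helper, not part of the existing Transform.sc.
object BlankAnnotations {
  // True when the line carries an id but no non-blank value after the first tab.
  def isBlank(line: String): Boolean = {
    // The -1 limit keeps trailing empty fields, so "id\t" splits into two fields.
    val fields = line.split("\t", -1)
    fields.length < 2 || fields.tail.forall(_.trim.isEmpty)
  }

  // Lazily drops blank annotations; the caller decides which files to apply it to.
  def dropBlank(lines: Iterator[String]): Iterator[String] =
    lines.filterNot(isBlank)
}
```

Array design files and gene id to gene name files would simply skip `dropBlank`.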

Wrong dataset error is not reported adequately to users

If the BioMart dataset name is wrong, the user still gets a list of BioMart attribute fields displayed as not found (validation code from Scala). It should instead display something along the lines of:

Problem retrieving attributes for dataset oglumaepatula_eg_gene in schema , check your parameters

instead of saying that the attributes are not found.

Protist getting a dangerous free ride on its version

Currently the Scala code resolves the type of an organism by string matching 'ensembl', 'wbsp', or 'genomes' against the path of the config file fed in for that organism. Because our only protist organism resides inside ensembl AND the ensembl version == the ensembl protists version, this works out nicely for that protist.

Potential solutions:
1.- Move the protist species to a path containing 'protists', and inject the version for protists as a separate argument.
2.- Make the Scala code rely on the type written inside the species config file (inside annsrc), in the field 'type', and inject the version for protists as a separate argument.

In the meantime, we are safe since the two versions match, but this shouldn't be left for long.
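
A minimal sketch of the lookup that solution 2 implies, assuming the value of the 'type' field is available as a string and that versions are injected as a map keyed by that type. `AnnotationSourceVersion`, the map keys and the idea of returning an error message are all hypothetical, not taken from the existing code.

```scala
// Hypothetical helper, not part of the existing pipeline.
object AnnotationSourceVersion {
  // Looks up the version for the type declared in the species config file.
  // Returns Left with an error message when no version was injected for it,
  // so a missing version fails loudly instead of borrowing another one.
  def versionFor(sourceType: String, versions: Map[String, String]): Either[String, String] =
    versions.get(sourceType).toRight(s"No version supplied for source type '$sourceType'")
}
```

With this shape, the protist version stops depending on the config file path and on the two versions happening to match.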

New species for bulk: Chenopodium quinoa

Please add Chenopodium quinoa to the bulk (and SCEA) genomes as a new species - thanks!
I'm pretty sure I already updated the isl_genomes list with the required features.
