ebi-gene-expression-group / atlas-annotations

The pipeline producing bioentity annotations used in Expression Atlas searches

Scala 33.07% Shell 36.95% Perl 27.19% Python 2.79%

atlas-annotations's Issues

Add Coprinopsis cinerea and Aspergillus nidulans for bulk Expression Atlas E-MTAB-7005; E-MTAB-6996

Add Coprinopsis cinerea for bulk Expression Atlas E-MTAB-7005
Ensembl fungi species

Downloads of BioMart annotation TSV files produce incorrectly formatted files

In the E196 E!G43 and WBPS12 update, we noticed that the Scala script src/pipeline/Start.sc downloads the files successfully, but for a few plant species the file contents were corrupted.
The affected files contained HTML/XML tags instead of annotation data:

brassica_napus.ensgene.interpro.tsv
hordeum_vulgare.ensgene.go.tsv
oryza_indica.ensgene.go.tsv
oryza_sativa.ensgene.interpro.tsv
physcomitrella_patens.ensgene.interpro.tsv
solanum_tuberosum.ensgene.interpro.tsv

The code didn't break because the downloaded files exist; the pipeline only checks for their presence, not their content.

Solution:
We need a way to verify the format/content of each downloaded file, for example by looking for HTML tags in the files on the file system.
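
As a rough illustration of the check proposed above, the following Python sketch rejects content that contains markup or lines that are not tab-separated. The function names, the list of tags, and the two-column assumption are hypothetical and not taken from the pipeline.

```python
import re

# Hypothetical helper, not part of the pipeline: a short list of opening tags
# that should never appear in a tab-separated annotation file.
MARKUP = re.compile(r"<\s*(!doctype|html|head|body|div|h1|title|\?xml)\b", re.IGNORECASE)


def looks_like_markup(text: str) -> bool:
    """Return True if the text contains common HTML/XML opening tags."""
    return MARKUP.search(text) is not None


def is_valid_annotation_tsv(text: str, min_columns: int = 2) -> bool:
    """Accept non-empty, markup-free text whose non-blank lines all have
    at least `min_columns` tab-separated fields (assumed layout)."""
    if not text.strip() or looks_like_markup(text):
        return False
    lines = [line for line in text.splitlines() if line.strip()]
    return all(len(line.split("\t")) >= min_columns for line in lines)
```

A download step could call `is_valid_annotation_tsv` on the file contents and fail loudly when it returns False, rather than relying on the file merely existing.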

The species-properties.json file generated by atlas_species.sh is not right

Error introduced in 8f0da81

The following println statement in src/Directories.sc:

println("Using the following paths for sources of experiments:")

causes the script that generates the species file, ./sh/atlas_species.sh, to produce the following (badly formatted) JSON file:

Using the following paths for sources of experiments:
[{
  "name":"Aegilops_tauschii",
  "defaultQueryFactorType":"ORGANISM_PART",
  "kingdom":"plants",
  "resources":[{
    "type":"genome_browser",
...
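
The effect can be reproduced in a small Python sketch: a diagnostic line written to standard output ends up in front of the JSON and makes it unparseable, while the same line written to standard error leaves the output intact. This only illustrates the general idea; it is not the Scala fix, and the helper names and the shortened species record are assumptions.

```python
import io
import json
import sys
from contextlib import redirect_stderr, redirect_stdout

MESSAGE = "Using the following paths for sources of experiments:"


def captured_stdout(species, log_to_stderr):
    """Return what a consumer of standard output would receive."""
    out, err = io.StringIO(), io.StringIO()
    with redirect_stdout(out), redirect_stderr(err):
        # The diagnostic goes either to stdout (the bug) or to stderr.
        log_stream = sys.stderr if log_to_stderr else sys.stdout
        print(MESSAGE, file=log_stream)
        print(json.dumps(species))
    return out.getvalue()


def is_valid_json(text):
    """Return True if the whole text parses as a single JSON document."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True
```

With diagnostics on standard error, redirecting standard output to species-properties.json would capture only the JSON document.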

Make pipeline more amenable to test runs

To test changes in the Scala part of the pipeline, I delete lots of annotations, run source/setup_test_env.sh, and do:
amm src/pipeline/Start.sc --force true

This issue proposes adding an optional command-line argument that defines where the annotation sources are, so that the pipeline can be integration-tested automatically.
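
A minimal sketch of such an optional argument, written in Python for illustration only (the pipeline itself is launched through amm). The flag name `--annotation-sources` and the default directory are assumptions, not existing options of Start.sc.

```python
import argparse
from pathlib import Path
from typing import List, Optional

# Assumed default location of the annotation sources (hypothetical).
DEFAULT_SOURCES = Path("annsrc")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command line; the sources directory is optional."""
    parser = argparse.ArgumentParser(description="Sketch of a pipeline entry point")
    parser.add_argument("--force", choices=["true", "false"], default="false")
    # Hypothetical flag: lets a test run point at a small fixture directory.
    parser.add_argument("--annotation-sources", type=Path, default=DEFAULT_SOURCES)
    return parser.parse_args(argv)
```

An integration test could then pass a directory containing a handful of species configs instead of deleting annotations by hand.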

Add Aedes aegypti genome

This is already in ISL but needed in Single-Cell Atlas now for E-CURD-58.
Thanks!

Import Ensembl genomes for Caenorhabditis brenneri, remanei, japonica and briggsae

Genomes for all of these are available in EnsemblMetazoa (http://metazoa.ensembl.org/biomart/martview/e5f7c91d34e0866ca19fbd37ee50e394)

Downloader for array designs fails silently

We have seen cases where array design downloads fail silently, generating a file full of HTML error markup instead of a proper design file, without the process itself failing. The downloader should either exit with an error code or retry and then fail.

A recent download of the oryza_indica array design A-AFFY-126 produced:

                                                                    <h1><a href="/" title="Back to Server error homepage">Server error</a></h1>^M       
                                            </div>^M    
                                        <li id="about" class=" last"><a href="//www.ebi.ac.uk/about" title="About us">About us</a></li>^M       
                                        <li id="industry" class=""><a href="//www.ebi.ac.uk/industry" title="Industry">Industry</a></li>^M      
                                        <li id="research" class=""><a href="//www.ebi.ac.uk/research" title="Research">Research</a></li>^M      
                                        <li id="services" class=" first "><a href="//www.ebi.ac.uk/services" title="Services">Services</a></li>^M       
                                        <li id="training" class=""><a href="//www.ebi.ac.uk/training" title="Training">Training</a></li>^M      
                                    </ul>^M     
                                <div id="global-masthead" class="masthead grid_24">^M   
                                <div id="local-masthead" class="masthead grid_24 nomenu">^M     
                    <!-- set active class as appropriate -->^M  
                <h3 class="about"><a href="//www.ebi.ac.uk/about">About us</a></h3>^M   
                <h3 class="embl-ebi"><a href="//www.ebi.ac.uk/" title="EMBL-EBI">EMBL-EBI</a></h3>^M    
                <h3 class="industry"><a href="//www.ebi.ac.uk/industry">Industry</a></h3>^M     
                <h3 class="research"><a href="//www.ebi.ac.uk/research">Research</a></h3>^M     
                <h3 class="services"><a href="//www.ebi.ac.uk/services">Services</a></h3>^M     
                <h3 class="training"><a href="//www.ebi.ac.uk/training">Training</a></h3>^M     
                <ul id="global-nav">^M  
            <!-- NB: for additional title style patterns, see http://frontier.ebi.ac.uk/web/style/patterns -->^M        
            <!-- local-title -->^M      
            <!--This has to be one line and no newline characters-->^M  
            </div>^M    
            </nav>^M
...
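
The "retry and then fail" behaviour could look roughly like the Python sketch below. The fetch step is passed in as a callable so the example stays self-contained; `download_with_retry`, `DownloadError` and `not_html` are hypothetical names, and the validity check is deliberately crude.

```python
import time
from typing import Callable


class DownloadError(RuntimeError):
    """Raised when no attempt produced a usable design file."""


def not_html(content: str) -> bool:
    """Crude check: non-empty and free of a few obvious HTML tags."""
    lowered = content.lower()
    return bool(content.strip()) and not any(
        tag in lowered for tag in ("<html", "<div", "<h1>")
    )


def download_with_retry(fetch: Callable[[], str],
                        is_valid: Callable[[str], bool] = not_html,
                        attempts: int = 3,
                        delay_seconds: float = 0.0) -> str:
    """Call `fetch` up to `attempts` times; raise if nothing valid arrives."""
    last_problem = "no attempts made"
    for attempt in range(1, attempts + 1):
        try:
            content = fetch()
        except OSError as exc:
            last_problem = f"attempt {attempt}: {exc}"
        else:
            if is_valid(content):
                return content
            last_problem = f"attempt {attempt}: response is not a design file"
        if attempt < attempts:
            time.sleep(delay_seconds)
    raise DownloadError(last_problem)
```

A calling script could catch `DownloadError`, print the message to standard error and exit with a non-zero status, so the failure is no longer silent.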

Clean up annotations without ids

Sometimes BioMart returns annotations of the form

id<tab><empty space>

We want to remove these in almost all cases. The only places where we don't are:

array designs
gene id to gene name files
The reason we keep them there: we use these files for decorating other files, and we assume that they are complete.

The benefit is operational efficiency: about 30% less space, and quite a few processes will run correspondingly faster. The resulting files will also be slightly more "correct" in the abstract sense of representing the annotations we want.

To implement this functionality you will need to add a new piece in the file Transform.sc, test it, and then take the annotation update part of atlasprod for a run with the resulting annotations. I am fairly certain that array design files and gene id to gene name files are the only ones where we want the blanks, but this would need verifying.
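
The filtering step described above might be sketched as follows, in Python rather than in Transform.sc. It assumes two-column lines of the form id, tab, value; the function name and the `keep_blanks` switch for array designs and gene name files are hypothetical.

```python
from typing import Iterable, Iterator


def drop_blank_annotations(lines: Iterable[str],
                           keep_blanks: bool = False) -> Iterator[str]:
    """Yield lines whose second tab-separated field is non-blank.

    With keep_blanks=True (assumed to be used for array designs and
    gene id to gene name files) every line is passed through untouched.
    """
    for line in lines:
        if keep_blanks:
            yield line
            continue
        fields = line.rstrip("\r\n").split("\t")
        # Lines with no second field, or only whitespace in it, are dropped.
        if len(fields) >= 2 and fields[1].strip():
            yield line
```

Because the function is a generator over lines, it can be applied to large annotation files without loading them into memory.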

Wrong dataset error is not adequately prompted to users

If the BioMart dataset name is wrong, the user still gets a list of BioMart attribute fields reported as not found (this comes from the validation code in Scala). It should instead display something along the lines of:

Problem retrieving attributes for dataset oglumaepatula_eg_gene in schema , check your parameters

instead of saying that the attributes are not found.
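
One way to separate the two failure modes is sketched below in Python. It assumes the validation step can tell that no attributes at all were retrieved for the dataset; the function name and the "Attribute not found" wording are hypothetical.

```python
from typing import List, Optional, Set


def validate_attributes(dataset: str,
                        schema: str,
                        available: Optional[Set[str]],
                        requested: List[str]) -> List[str]:
    """Return error messages; an empty list means validation passed.

    `available` is None (or empty) when the attribute list could not be
    retrieved, which usually points at a wrong dataset or schema name.
    """
    if not available:
        return [
            f"Problem retrieving attributes for dataset {dataset} "
            f"in schema {schema}, check your parameters"
        ]
    # Only report individual attributes once the dataset itself resolved.
    return [f"Attribute not found: {name}" for name in requested if name not in available]
```

Checking the dataset first means a single clear message replaces a long list of attributes that are "missing" only because the dataset lookup failed.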

new species for bulk (gramene import): Cucumis sativus (cucumber)

New species request for bulk

Add plasmodium_falciparum and callithrix_jacchus as they are being used in Atlas SC

We need those two additional organisms added here, so that their data is retrieved from E!

Import Ensembl genomes for Drosophila ananassae, mojavensis, pseudoobscura, simulans, virilis and yakuba

Genomes for all of these are available in EnsemblMetazoa (http://metazoa.ensembl.org/biomart/martview/e5f7c91d34e0866ca19fbd37ee50e394)

Protist getting a dangerous free ride on its version

Currently the Scala code resolves the type of the organism by string-matching 'ensembl', 'wbsp', or 'genomes' against the path of the config file for that organism. Our only protist organism resides inside the ensembl path, and the Ensembl version currently equals the Ensembl Protists version, so the protist picks up the right version only by coincidence.

Potential solutions:
1.- Move protist species to a path containing 'protists', and inject the version for protists as a separate argument.
2.- Make the Scala code rely on the type written inside the species config file (inside annsrc), in the field 'type', and inject the version for protists as a separate argument.

In the meantime, we are safe since the two versions match, but this shouldn't be left for long.
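
Option 2 could be sketched like this in Python: the version is looked up from the 'type' field of the species config rather than inferred from the file path. The type values and version numbers used in the usage example are placeholders, and `resolve_version` is a hypothetical name.

```python
from typing import Mapping


def resolve_version(species_config: Mapping[str, str],
                    versions: Mapping[str, str]) -> str:
    """Pick the release version for a species from its declared type.

    `versions` maps each source type to the version injected on the
    command line, so protists get their own entry instead of borrowing
    the Ensembl one.
    """
    source_type = species_config.get("type")
    if source_type is None:
        raise ValueError("species config has no 'type' field")
    if source_type not in versions:
        raise ValueError(f"no version supplied for source type '{source_type}'")
    return versions[source_type]
```

For example, `resolve_version({"type": "protists"}, {"ensembl": "1", "protists": "2"})` returns "2", and a config whose type has no supplied version fails immediately instead of silently reusing another version.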

new species for bulk Chenopodium quinoa

Please add Chenopodium quinoa to the bulk (and SCEA) genomes as a new species - thanks!
I'm pretty sure I already updated the isl_genomes list with the required features.

Scala code for redecoration doesn't pick up correct array design files for oryza non-reference species

The Scala code picks up oryza_sativa.A-AFFY-126.tsv for redecoration of Oryza indica. Even if we process oryza_indica.A-AFFY-126.tsv, it will NOT get utilised.
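
The cause in the Scala code is not stated here, but the intended behaviour can be illustrated with a short Python sketch that selects a design file only when its name matches the species exactly. The function name and the directory used in the usage example are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional


def design_file_for(species: str,
                    array_design: str,
                    candidates: Iterable[str]) -> Optional[str]:
    """Return the candidate named <species>.<array_design>.tsv, if any.

    Matching on the full file name avoids falling back to another
    species that shares the same array design accession.
    """
    wanted = f"{species}.{array_design}.tsv"
    for candidate in candidates:
        if Path(candidate).name == wanted:
            return candidate
    return None
```

Given both designs/oryza_sativa.A-AFFY-126.tsv and designs/oryza_indica.A-AFFY-126.tsv, a lookup for oryza_indica returns the oryza_indica file, and a species with no file of its own yields None rather than a file from a related species.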
