ebi-gene-expression-group / atlas-annotations
The pipeline producing bioentity annotations used in Expression Atlas searches
Add Coprinopsis cinerea for bulk Expression Atlas E-MTAB-7005
Ensembl fungi species
In the E196 E!G43 and WBPS12 update, we noticed that the Scala script src/pipeline/Start.sc
downloads the files successfully, but for a few plant species the file contents were corrupted.
The files contained HTML/XML tags instead of usable annotations.
brassica_napus.ensgene.interpro.tsv
hordeum_vulgare.ensgene.go.tsv
oryza_indica.ensgene.go.tsv
oryza_sativa.ensgene.interpro.tsv
physcomitrella_patens.ensgene.interpro.tsv
solanum_tuberosum.ensgene.interpro.tsv
The code didn't break because the downloaded files exist.
Solution.
We need a way to verify the format/content of each downloaded file, and to look for any HTML tags in the files on the file system.
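One way to approach the check described above is a small content validator run after each download. The following is a minimal sketch, not the pipeline's actual code: the object name `AnnotationFileCheck`, the tag regex, and the "every non-empty line contains a tab" rule are all assumptions, and the regex could give false positives if a legitimate annotation value ever contains angle brackets.

```scala
import scala.io.Source

// Hypothetical helper; names and rules are illustrative assumptions.
object AnnotationFileCheck {
  // Matches things like <div ...>, </ul> or <!-- comment -->.
  private val markupTag = """</?[A-Za-z!][^>]*>""".r

  /** True if any line contains something that looks like an HTML/XML tag. */
  def looksLikeMarkup(lines: Seq[String]): Boolean =
    lines.exists(line => markupTag.findFirstIn(line).isDefined)

  /** True if the content is non-empty, tag-free, and every non-empty line has a tab. */
  def isValidContent(lines: Seq[String]): Boolean =
    lines.nonEmpty &&
      !looksLikeMarkup(lines) &&
      lines.filter(_.trim.nonEmpty).forall(_.contains("\t"))

  /** Reads the file at `path` and validates its content. */
  def isValidFile(path: String): Boolean = {
    val source = Source.fromFile(path)
    try isValidContent(source.getLines().toList)
    finally source.close()
  }
}
```

A caller could treat a `false` result the same way as a failed download, so that a corrupted file is no longer accepted just because it exists.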
Error introduced in 8f0da81
The following println statement in src/Directories.sc:
println("Using the following paths for sources of experiments:")
means that running the script that generates the species file, ./sh/atlas_species.sh, produces the following (badly formatted) JSON file:
Using the following paths for sources of experiments:
[{
"name":"Aegilops_tauschii",
"defaultQueryFactorType":"ORGANISM_PART",
"kingdom":"plants",
"resources":[{
"type":"genome_browser",
...
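A common way to avoid this kind of problem is to send diagnostic messages to standard error, so that standard output carries only the JSON. The sketch below illustrates the idea; `SpeciesOutput`, `logInfo`, and `emit` are hypothetical names, not part of the repository, and the actual fix in the pipeline may differ.

```scala
// Hypothetical illustration; not the code in src/Directories.sc.
object SpeciesOutput {
  /** Writes a diagnostic message to stderr, leaving stdout untouched. */
  def logInfo(message: String): Unit = Console.err.println(message)

  /** Logs the diagnostic, then writes only the JSON payload to stdout. */
  def emit(json: String): Unit = {
    logInfo("Using the following paths for sources of experiments:")
    println(json)
  }
}
```

With this arrangement, redirecting stdout to a file captures only the JSON, while the message still appears on the terminal.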
To test any changes in the Scala part of the pipeline, I delete lots of annotations, run source/setup_test_env.sh, and do:
amm src/pipeline/Start.sc --force true
This issue proposes to add an optional command line argument that defines where the annotation sources are, so that the pipeline can be integration tested automatically.
This is already in ISL but needed in Single-Cell Atlas now for E-CURD-58.
Thanks!
Genomes for all these available in EnsemblMetazoa (http://metazoa.ensembl.org/biomart/martview/e5f7c91d34e0866ca19fbd37ee50e394)
We have seen cases where array design downloads fail silently, generating a file full of HTML errors instead of a proper design file, without the process failing. The download should either exit with an error code or retry and then fail.
A recent download of the oryza_indica array design A-AFFY-126 produced:
<h1><a href="/" title="Back to Server error homepage">Server error</a></h1>^M
</div>^M
<li id="about" class=" last"><a href="//www.ebi.ac.uk/about" title="About us">About us</a></li>^M
<li id="industry" class=""><a href="//www.ebi.ac.uk/industry" title="Industry">Industry</a></li>^M
<li id="research" class=""><a href="//www.ebi.ac.uk/research" title="Research">Research</a></li>^M
<li id="services" class=" first "><a href="//www.ebi.ac.uk/services" title="Services">Services</a></li>^M
<li id="training" class=""><a href="//www.ebi.ac.uk/training" title="Training">Training</a></li>^M
</ul>^M
<div id="global-masthead" class="masthead grid_24">^M
<div id="local-masthead" class="masthead grid_24 nomenu">^M
<!-- set active class as appropriate -->^M
<h3 class="about"><a href="//www.ebi.ac.uk/about">About us</a></h3>^M
<h3 class="embl-ebi"><a href="//www.ebi.ac.uk/" title="EMBL-EBI">EMBL-EBI</a></h3>^M
<h3 class="industry"><a href="//www.ebi.ac.uk/industry">Industry</a></h3>^M
<h3 class="research"><a href="//www.ebi.ac.uk/research">Research</a></h3>^M
<h3 class="services"><a href="//www.ebi.ac.uk/services">Services</a></h3>^M
<h3 class="training"><a href="//www.ebi.ac.uk/training">Training</a></h3>^M
<ul id="global-nav">^M
<!-- NB: for additional title style patterns, see http://frontier.ebi.ac.uk/web/style/patterns -->^M
<!-- local-title -->^M
<!--This has to be one line and no newline characters-->^M
</div>^M
</nav>^M
...
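The "retry and then fail" behaviour requested above could be sketched as follows. This is an illustration under stated assumptions: `RetryingDownload` and `fetchWithRetry` are hypothetical names, the download itself is passed in as a function, and the validity check is supplied by the caller (for instance, a test that the body contains no HTML tags).

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical helper; the real download code is not shown in this issue.
object RetryingDownload {
  /** Calls `fetch` up to `maxAttempts` times and returns the first result
    * accepted by `isValid`, or a Left describing the last failure. */
  def fetchWithRetry[A](maxAttempts: Int)(fetch: () => A)(isValid: A => Boolean): Either[String, A] = {
    @tailrec
    def loop(attempt: Int, lastError: String): Either[String, A] =
      if (attempt > maxAttempts) Left(s"Giving up after $maxAttempts attempts: $lastError")
      else Try(fetch()) match {
        case Success(body) if isValid(body) => Right(body)
        case Success(_)                     => loop(attempt + 1, "response failed validation")
        case Failure(e)                     => loop(attempt + 1, String.valueOf(e.getMessage))
      }
    loop(1, "no attempts made")
  }
}
```

On a `Left`, the calling script would print the message and exit with a non-zero code rather than writing the invalid body to disk.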
Sometimes BioMart returns annotations of the form
id<tab><empty space>
We want to remove these in almost all cases. The only places where we don't are:
The benefit is operational efficiency: about 30% less space, and quite a few processes will run correspondingly faster. The resulting files will also be slightly more "correct" in the abstract sense of representing the annotations we want.
To implement this functionality you will need to add a new piece in the file Transform.sc, and then test it, and then take the annotation update part of atlasprod for a run with the resulting annotations. I am fairly certain about array design files and gene id to gene name files being the only ones where we want the blanks but it would need verifying.
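The filtering step described above might look like the sketch below. It is not the code in Transform.sc: `BlankAnnotationFilter` and its methods are hypothetical names, and the `keepBlanks` flag stands in for whatever mechanism ends up identifying the file types (such as array designs and gene id to gene name files) where blanks should be preserved.

```scala
// Hypothetical sketch; names are assumptions, not part of Transform.sc.
object BlankAnnotationFilter {
  /** True if the line has at least one non-blank field after the id column. */
  def hasValue(line: String): Boolean =
    line.split("\t", -1).drop(1).exists(_.trim.nonEmpty)

  /** Drops id<tab><blank> lines unless blanks must be kept for this file type. */
  def dropBlanks(lines: Seq[String], keepBlanks: Boolean = false): Seq[String] =
    if (keepBlanks) lines else lines.filter(hasValue)
}
```

Using `split("\t", -1)` keeps trailing empty fields, so a line ending in a bare tab is still seen as having an (empty) value column rather than being mistaken for a one-column line.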
If the BioMart dataset name is wrong, the user still gets a list of BioMart attribute fields reported as not found (validation code from Scala). It should instead display something along the lines of:
Problem retrieving attributes for dataset oglumaepatula_eg_gene in schema , check your parameters
instead of saying that the attributes are not found.
new species for bulk (gramene import): Cucumis sativus (cucumber)
New species request for bulk
We need those two additional organisms added here, so that their data is retrieved from E!
Genomes for all these available in EnsemblMetazoa (http://metazoa.ensembl.org/biomart/martview/e5f7c91d34e0866ca19fbd37ee50e394)
Currently the Scala code resolves the type of the organism by string matching 'ensembl', 'wbsp', or 'genomes' against the path of the config file fed in for that organism. Because our only protist organism resides inside ensembl AND the ensembl version == the ensembl protist version, this works out nicely for that protist.
Potential solutions:
1.- Move protist species to a 'protists' containing path, and inject the version for protists as a separate argument.
2.- Make scala code rely on the type written inside the species config file (inside annsrc), in the field 'type', and inject the version of protist as a separate argument.
In the meantime, we are safe since the two versions match, but this shouldn't be left for long.
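The second option could be sketched roughly as below. Everything here is an assumption made for illustration: the `SourceKind` object, the set of kinds, the string values accepted for the 'type' field, and the representation of the species config as a simple key/value map are all hypothetical, since the actual config format is not shown in this issue.

```scala
// Hypothetical sketch; the real 'type' values and config format may differ.
object SourceKind {
  sealed trait Kind
  case object Ensembl extends Kind
  case object EnsemblGenomes extends Kind
  case object Protists extends Kind
  case object WormBaseParaSite extends Kind

  // Assumed spellings for the 'type' field.
  private val byName: Map[String, Kind] = Map(
    "ensembl"  -> Ensembl,
    "genomes"  -> EnsemblGenomes,
    "protists" -> Protists,
    "wbps"     -> WormBaseParaSite
  )

  /** Resolves the kind from the config's 'type' field instead of the file path. */
  def fromConfig(config: Map[String, String]): Either[String, Kind] =
    config.get("type").map(_.trim.toLowerCase) match {
      case None       => Left("species config has no 'type' field")
      case Some(name) => byName.get(name).toRight(s"unknown source type '$name'")
    }

  /** Looks up the release version injected for a given kind. */
  def versionFor(kind: Kind, versions: Map[Kind, String]): Either[String, String] =
    versions.get(kind).toRight(s"no version supplied for $kind")
}
```

Keeping the versions in a map keyed by kind means the protist version can be passed separately from the Ensembl one, so the two no longer need to match.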
Please add Chenopodium quinoa to the bulk (and SCEA) genomes as a new species - thanks!
I'm pretty sure I updated the isl_genomes list with the required features already.
The Scala code picks up oryza_sativa.A-AFFY-126.tsv for redecoration of Oryza indica; even if we process oryza_indica.A-AFFY-126.tsv, it will NOT get utilised.