The galaxy-hackathon from bxlab

Submit workflows to ToolShed via the Galaxy workflow interface

Galaxy can upload workflows to myexperiment.org but it would be also nice to upload workflows directly to the Galaxy ToolShed. For the sake of simplicity we should assume a TS account for the hackathon.

Things you can learn:

workflows
Galaxy and TS API

Skills:

python
JavaScript

FASTQ datatype enhancements

As discussed with @frederikcoppens it would be nice to upload compressed FASTQ-files and to handle them properly in tools.

One way to do this is to enhance the already existing FASTQ datatype with a new metadata element that indicates if it compressed or not. Tools should then be able to recognize this metadata element. GSNAP is one of the tools which would benefit from it.

You can learn about:

datatypes
metadata
tools & metadata

Skills:

python

Add functionality for Bioconductor tool integration to planemo

Working on fork of planemo (https://github.com/nturaga/planemo/tree/add_bioc_tool_init)

Currently adding and debugging new argument to planemo bioc_tool_init --rscipt

Aim to support generating tool definition files directly from R scripts
Eventually support converting bioc vignettes into sample tools if there is specific pipeline for package in the vignette.

UCSC TrackHub Visualization

#Tool_development #Python #UCSC_trackHub #Visualization #Tracks #Cheetah

Hey!

Another project we could work on is...TrackHub UCSC Visualization through Galaxy:

Track hubs are web-accessible directories of genomic data that can be viewed on the UCSC Genome Browser.

Track hubs can be displayed on genomes that UCSC directly supports, or on your own sequence. Hubs are a useful tool for visualizing a large number of genome-wide data sets. For example, a project that has produced several wiggle plots of data can use the hub utility to organize the tracks into composite and super-tracks, making it possible to show the data for a large collection of tissues and experimental conditions in a visually elegant way, similar to how the ENCODE native data tracks are displayed in the browser.

Here are a few screenshots about what we have available:

The project is already well advanced, and you can see the repository here: https://github.com/goeckslab/hub-archive-creator
The TODO is here, but new ideas are more than welcome :)

You can also already play with it on this instance, under the G-OnRamp category.

Will be glad to work with you, guys!

Galaxy Communication Server (Chat)

Hey!

So with @bgruening and @erasche, we are implementing a chat inside Galaxy for users to discuss all together, and have private rooms. If you are interested in such a way of people interacting right inside Galaxy, please come to talk with all three of us and take a look at the PR we made into Galaxy branch:

galaxyproject#2515

The project contains Python / Javascript / Galaxy communication so everyone could contribute :)

Coding Hack Request: Data Collections use and improvements

Generate Data Collections upon upload in Data libraries
Interactive Tour?

Training Data (includes tutorial, example)

Contributors: @jennaj @griffinp @kpoterlo @yvanlebras @BoughAida @ssander5 @devikaatgit @cschu @tnabtaf @kmurat1

This issue is dedicated to Training Data hackathon group. The idea is to gather sample data who can be used as example, tutorial, .... on Galaxy instances.

Please, don't hesitate to create a comment and add data links and description ;)

Example:

RADseq technology

Genetic map

-parents

female http://546969.197.189.163/datasets/bbbfa414ae315caf/display/
male http://546969.197.189.163/datasets/4467809fea030689/display/

-progeny

progeny 1 http://546969.197.189.163/datasets/ddf83cf807e6e774/display/
progeny 2 http://546969.197.189.163/datasets/30bf7a4ced2335cc/display/
....

Population genomics

barcode http://546969.197.189.163/datasets/6df0b7b066ddc4c9/display/
population map http://546969.197.189.163/datasets/d796ca8e1687a54b/display/
reference genome http://546969.197.189.163/datasets/06cf32e9aa8aad75/display/
FastQ file http://546969.197.189.163/datasets/34c3e3c01e1a37f4/display/

If data are not reachable through the web (personal data on your laptop, ...) , the best way is to upload the data on a https://usegalaxy.org/ Galaxy history

The idea can be to meet after having gathering data and discuss about which one are good / duplicate / too big before proposing actions like, data directly shareable, need to be reduced, ....

Side by side Common Workflow Language workflow & Galaxy workflow

See common-workflow-language/common-workflow-language#104

Tasks: produce Galaxy workflow following the instructions at https://github.com/nekrut/galaxy/wiki/Reference-based-RNA-seq

Repeat the above using the latest draft-4 syntax of CWL

Adding a Neo4j IE [late submission]

We have implemented a Neo4j GIE based on a modified version of the Neo4j:2.3.3 image which takes a neostore datatype generated by this tool allowing users to explore a Neo4j Graph database. We have a demo galaxy instance with the IE here.

Enhancing Shared Data Libraries

Goal:
Design representative testing/tutorial/training data to be published as Shared Data Libraries.

Objectives:

Migrate intact Data Libraries from one Galaxy to another #40 #24
Compile and provide small, representative data that is well annotated.
Aim for data that provides interesting results or highlights typical issues in the analysis
Aim for a range of workflows
Base off of published analyses (key publications with reproducible results)
Publically accessible original data source
Capture data reduction methods

Discussion points:

Analyses to target? I.e: genomes, transcriptomes/exomes, NGS samples, reference annotation, and any other content required to execute complete workflows.
- what are the common issues with each analysis to highlight?
Useful data already on the site?

Related

Submit workflows to ToolShed via the Galaxy workflow interface #4
Training Data (includes tutorial, example) #28
Coding Hack: Make the necessary modifications to Data Library structure/interface etc.

Hisat tool splice junction output

Add an output for splice junctions

2016 GCC Data Hackathon Hub

2016 GCC Data Hackathon Hub

The 2nd annual Data Wrangling Hackathon (“Datathon”) starts on Saturday, June 25 and continues on Sunday, June 26. The associated Coding Hackathon #1 overlaps with these dates. The working rooms of both hacks will be in close proximity and the two Hackathons will share meals to encourage and facilitate collaboration.
https://gcc2016.iu.edu/hacks/index.php
https://gcc2016.iu.edu/hacks/hacks-data.php

Goals

Datathon Focus

Training Data (includes tutorial, example) #28 @jennaj @griffinp @kpoterlo @yvanlebras @BoughAida @ssander5 @devikaatgit @cschu @tnabtaf
Enhancing Shared Data Libs #30 @ssander5 @jennaj
Data Libs: Implement Controlled Vocab/Metadata/Source and Document as a Guide #42 @ssander5 @jennaj
Best Practise Workflows #29 @MoHeydarian @firaan1 @cschu @frederikcoppens
- Associated Datasets/Training Data
- Interactive Tours based on content
- Rnaseq variant calling pipeline
- Share with GTN/GOBLET
- Epigenomics analysis

Code Hack Focus Collaboration

Those participating in the GCC 2016 Datathon will have the opportunity to contribute to Galaxy in the following areas:

Help to improve the overall Galaxy experience from a scientific usage perspective by suggesting how specific use-cases and other interface changes make “Using Galaxy” even better.
Contribute ideas, provide example data, and perform follow-up testing for point-changes involving wrapped tools and/or the user interface for modifications performed in collaboration with the Coding Hackathon. Participants are encouraged to directly contribute to coding activities if interested.
Develop best-practice workflows for publication at http://usegalaxy.org (tutorial design: written, video, shared objects, tours - all could be part of this effort). There are many analysis areas that are well represented, yet there are opportunities for those in the scientific community to contribute and to share in their area of expertise.
Design representative testing/tutorial/training data to be published as Shared Data Libraries. Data that has been manipulated to be small, representative, and provides interesting results or highlights typical issues in the analysis, for a range of analysis workflows, is the primary goal. Which analysis to target? This is up to the participants. "Data" may include genomes, transcriptomes/exomes, NGS samples, reference annotation, and any other content required to execute complete workflows.
General documentation of all flavors and types. Learned something new? Have a tip? We’ll wrangle these up during the Datathon and find the best place to get the information out there where other users can benefit.

Aside from the aforementioned areas, all forms of contribution are welcome! Everyone is strongly encouraged to post suggestion/opinion/criticism below.

We will be using Github to organize this year (not Trello). This is how it works:

Get a Github account
Watch this repository and star this ticket
Share your ideas and feedback
New tickets opened and linked to this one - for concrete and actionable tasks directly related to Galaxy and/or specific Tools (created in the appropriate repositories).

See all soon! Cheers from Christian (@cschu), Frederik (@frederikcoppens), and Jennifer (@jennaj)

For prior comments, please see: galaxyproject#2520

JBrowse extensions

E Rasche's (@erasche) JBrowse tool is pretty awesome.. The idea here is to add some functionality to the tool.

Add the indexed reference genomes to @erasche's JBrowse tool
Add some minor display options
- gff default label type (gene, product etc)
- Same for vcf
Default start display options and location

Should be a small job.

Neostore Composite Datatype

As part of the #26 we needed to create a neostore composite datatype. This is similar to blastdb composite datatype. This datatype is for a neo4j database which is generated by this tool https://github.com/SANBI-SA/tools-sanbi-uwc/tree/master/tools/build_ctb_gene currently in the testtoolshed.

The datatype is here: https://github.com/SANBI-SA/tools-sanbi-uwc/tree/master/datatypes/neo4j_datatypes

Coding Hack Request: Allow hmmer3 tools to use models from tool-data tables

@bgruening

data sharing between galaxy instances, extending "DATA LIBRARIES"

Coding Hack Request: Workflow export to bash command line script

@frederikcoppens @ssander5 @dannon

GCC2016 Hackathon - Comment here to be added!

Comment on this issue and I'll add your user to the hackathon team so you'll be able to create branches here and commit.

Also, please go ahead and make planning issues for any projects you'd like to work on!

Microbial reference data tool wrapper

A colleague has written a python script that uses the data library api to add data libraries of microbial references to Galaxy by genus and/or species from Refseq.

It will need to be an admin only tool.

Install conda packages at install-time

@bwlang and me have decided to check if we can

trigger conda installation when installing from the toolshed (as opposed to when first running the tool).
display the resolver status (seems to already work partially in current code-base)

The API endpoint already works, although strikingly a GET causes the dependency to be resolved and installed.
r = requests.get(url+"/api/dependency_resolvers/0/requirements", params={"key": your_api_key, "name": "bowtie2==2.2.8"})

One difficulty here is that the tool shed takes the dependency information (or requirements) from the tool_dependencies.xml file (which I guess will be phased out in new tools at some point).
So if we want to do that nice "Install dependencies / requirements " the toolshed should probably return the requirements from the tool xml.

And indeed, the toolsshed does already provide that information through the API
r = requests.get(ts_url+"/api/repositories/get_repository_revision_install_info", {"name":"msp_sr_readmap_and_size_histograms", "owner":"drosofff", "changeset_revision":"d6b93af0da55"})

IGV.js browser needs galaxy api updates

Make changes to the galaxy to allow access from IGV.js

api/genomes.py
return fasta index .fai
type param to return raw sequence from show

retrieve external dataset url for bed,gff,bam,vcf from api

Python 3 support

Python 2.7 will not be maintained past 2020.

Add support for Python >= 3.3 while maintaining support for Python 2.6 and 2.7.

xref.: https://trello.com/c/dZcCVf9I/2702-migrate-to-python-3

The list of files already fixed is in .ci/py3_sources.txt

Useful tools and documentation:
https://docs.python.org/3/howto/pyporting.html
https://docs.python.org/2/library/2to3.html
https://python-modernize.readthedocs.org/
https://pythonhosted.org/six/
http://python3porting.com/preparing.html

Requirements which need to be ported or dropped:

bx-python
Cheetah (cheetahtemplate/cheetah#7 and cheetahtemplate/cheetah#9)
Fabric (fabric/fabric#1424)
feedparser (presently copied inside https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/util/sanitize_html.py , needs to become a proper dependency and need to wait for a new release > 5.2.1, see comment below)
mercurial (used only by the Tool Shed, I think)
SVGFig (issue galaxyproject#1717, PR galaxyproject#1747)
wchartype
WebHelpers (unmaintained)

xref: galaxyproject#1715

Galaxy Disk Usage Visualization

I'm working on (https://github.com/dannon/galaxy/tree/quota_display) a better way for users to view and delete the data they have. I have a work in progress branch that uses d3 to generate a treemap, and plotly to display it. It looks like this:

Would love help. It's primarily javascript w/ backbone, though we might have to add functionality to the API in Galaxy, which would of course be work in python.

Primary targets are better hover information, the ability to select histories from the view, and general usability improvements prior to linking it into Galaxy.

Revival of Galaxy Scientists

Last year we started Galaxy Scientists as a forum for power uses to discuss UX improvements.
This did not really take of using Slack.

Is it worth to try to revive this? e.g. through a galaxy/galaxy_scientists github repo and discuss there?

Compression of existing FASTQ datasets

We have Galaxy instance running in our laboratory for more than 4 years. Recently, we ran out of disk space (100TB) and then we realised the FASTQ files occupied more than 60% of our disk space. I would like whether there is any solution to compress the existing FASTQ datasets both in histories and libraries of the galaxy.

History item labels: preserve sample names in datasets

Provide a way to label history items, and have the labels propagate downstream to derivative history items.

In other words: Show sample names on history items, which is much more useful than sticking "on data 11" onto the title.

UI:
Show each label in a box off to the side of the title.

even when collapsed
If the label list is too wide, show a "..." label which shows a drop-down of the rest on click.
User can delete or rename any label.
but it won't propagate to existing downstream history items (?)

Implementation:
Store in a new database table

2 fields: HDA id, label
When a new history item is created, label it with all the labels of its inputs.

Data manager for GTF/GFF files

There should be a data manager for genome annotation files to populate the .loc file (e.g. to use for featurecounts).

Coding Hack Request: finish ballgown wrapper at some point?

@davebx ?

Workflow Best Practices

Build Workflows
- RNA-seq quantification
- RNA-seq variant calling
- ChIP-seq
Associated Datasets/Training Data
Interactive Tours based on content
Share with GTN/GOBLET
Galaxy Flavour for Workflow

Associated Coding Hack Tasks

Workflow improvements @bgruening @yvanlebras @kpoterlo @jennaj @firaan1 @MoHeydarian @ssander5 @kmurat @cschu

#4 (Submit workflows from workflow designer)
Creation/Export of High Res Images for workflows
Workflow Export to Unix @ssander5 @frederikcoppens

Enhance Charts visualizations, allow xy-ranges and color pickers

Recently we integrated shared form building procedures into Charts, this allows us to use color pickers and sliders to configure visualization. This hackathon project intends to demonstrate this by adding color pickers to specify series colors.

Publishing a Workflow to TS & installing it elsewhere

POC of steps needed to

export a workflow
define dependencies
get it in TS
install it elsewhere (with dependencies)

To Do

add test data set to come along at install
same for interactive tour

make an interactive tour for this too?

wrap gene ontology GOLR services

Working on this here .. will try to publish to the toolshed as needed:

https://github.com/nathandunn/galaxy-monarch-curl/

Trinotateviz

Trinotate is a transcriptome annotation and quantification pipeline that stores the results in a sqlite database. The database contains transcript to gene mapping, protein predictions, blast matches for nucleotides and proteins against multiple databases, several prediction software tools results (signalp etc) and differential expression results.
Trinotate web provides an interface to the results but requires a web server and sqlite. The interface doesn’t support filtering or pagination but only keyword search in blobs of data combining all results for a given transcript.
I have created a galaxy visualisation for Trinotate sqlite db. Currently it can display the contents of the db with filtering and pagination. The source code is in https://github.com/Eduardo-Alves/trinotateviz.git The test folder contains a sample trinotate sqlite database provided with Trinotate Web.
My plan is to add the ability to compare multiple transcriptomes and use galaxy trackster browser to see how different transcriptomes align to uniref proteins or a related species. I have several questions on how to implement this:
1-The grid was implemented using angular.js ui-grid because of the support for expandable grids. Is this a good choice? Any better alternatives?
2-Server side filtering and pagination was implemented in the sql call to sqlite data provider. In order to keep the various sub grids in sync all results include gene_id subquery that joins all tables and filters. This is quite messy. Again is there any better alternative?
3-The first improvement I need is to be able to compare multiple transcriptomes. I don’t want to modify the database schema so to maintain compatibility with trinotate. I am planning to pre- append the gene names with transcriptome name and then filter in sql. Each assembly will have a sub-grid and filtering will be done across all transcriptome so that it will be possible to see for instance how many genes are differentially expressed in each assembly or how many of these have blast results for a given e-value threshold. Is the subgrid idea reasonable?
4-Next requirement is to be able to visualise transcripts from various transcriptomes against a single reference, which can be alignments of predicted proteins against uniref. This will allow researchers to see which transcriptomes best assemble their proteins of interest. The idea is to use trackster to display blast results of transcripts aligning to a given protein. I haven’t started looking into this yet so any ideas are welcome.
Any suggestions, bug fixes, pull requests are welcome. This is my first javascript project so please forgive in advance the bugs.

Optional dataset naming based on original fastq files

I'd like to implement an optional default naming convention that includes

*tool name* on *original input dataset*

@mvdbeek thinks this could be implemented with the element identifier field being talked about:
galaxyproject#2006
galaxyproject#2140

Perhaps as a next phase we could find a way to designate an input dataset or collection as a "naming source"

I'd like to find a way to make this work in the context of a workflow, but also in the context of ad-hoc analysis.

Coding Hack Request: Export workflows as hi-res images

@firaan1 @MoHeydarian

Circos

I've built a Circos tool. If you'd like to play around with it and give feedback, just say hi and I've got a copy installed.

http://circos.ca
https://github.com/TAMU-CPT/galaxy-circos-tool

Coding hack request: update Stringtie to the latest version

Stringtie was updated and now includes a merge option, providing an alternative to Cuffmerge/Cuffcompare.

I think @davebx has started wrapping the new version.

Installation of workflow re-installs existing tools?

When installing a workflow with dependencies, the tools are installed again while they were already available in the instance.
Can this be avoided?

I did not specify versions in the xml for dependencies (in case this would make a difference)

Galaxy Security Issues

Adding more support for password strength, changes and handling; inactive account suspension, and adding account creation with tokens and email to bioblend.

Coding Hack Request: History functionality tweaks

Directly import data from history to data libs
Multiple history panel
search all datasets should offer more options for what to do with the results
- import everything into a datalib
- delete selected data
- import everything into new history
- multi-dataset visualisation (e.g. visualise selected data in Trackster)
- multi-set renaming?
- run specific single-input tool on selected datasets

Upload file larger than 2GB

It would be nice to support uploading file larger than 2GB. For example the http://resumablejs.com library can be used.

You can learn about:

The client side of Galaxy

Skills:

JavaScript

Coding Hack Request: add option to install data manager when installing a tool

When installing a tool, you need to separately install the data managers to generate indices, ... Could this be added as a (default on) option upon tool install?

Data Libs: Implement Controlled Vocab/Metadata/Source and Document as a Guide

Related to

Enhancing Shared Data Libs #30

Migrate intact Data Libraries from one Galaxy to another #40

Make fastq datasets compressible on the fly.

Uncompressed fastq is a huge waste of disk space. So, a proposal:

Datasets should have attributes "compressible", "uncompressible", "compress_to", "uncompress_to" and "compressed" that allow to flag that this type can be compressed and what datatype it would become once compressed.
The API should be extended with a "compress" and "uncompress" method.
Implicit datatype converters should be leveraged to do on-the-fly decompression for tools that need it.
The tool input form should be aware of where compressed datasets can be used where an uncompressed dataset of appropriate type would typically be used.
A new type, FastqGz should be created to support gzipped fastq.
(Optional) uncompressed datasets should be cache.

As discussed with @ashvark and @dannon.