Comments (15)
great idea - could be a big speed improvement for slow IO environments
from galaxy-hackathon.
So are you talking about uploading and storing/managing the newly uploaded data as compressed data? Or just uploading a compressed fastq file and then extracting it server-side once the upload is complete?
Or do you want the data to remain compressed in storage on the server?
from galaxy-hackathon.
This is a good idea. @timothom I think Galaxy already decompresses fastq files that are uploaded, so I assume this would involve storing as compressed data? Would this involve disabling the decompression of uploaded fastq files?
It would be nice for us in our lab to be able to operate directly on compressed fastq files though.
from galaxy-hackathon.
I'd love it if even linking in compressed files worked seamlessly. That wouldn't require mucking with the upload tool to stop it auto-decompressing stuff.
from galaxy-hackathon.
This would make several French Galaxy admins very happy!
We have a kind of patch/hack that works on our instances, but I'm sure there would be a more elegant way to do it: https://www.e-biogenouest.org/wiki/ManArchiveGalaxy
from galaxy-hackathon.
There has also been some work by @yhoogstrate in galaxyproject#2535 with a different approach.
from galaxy-hackathon.
@mvdbeek: @ashvark and I reviewed @yhoogstrate's work and it requires this squashfs thing to be installed all over the cluster, no?
from galaxy-hackathon.
@ashvark has started some work on a compressed Fastq type, see #38. @bgruening: how would this work with tools that do not support compressed fastq? And how would compressing existing datasets work - would set_meta() compress / decompress if that key changed?
Finally, see the issue @frederikcoppens mentioned on #38 - something to look out for.
from galaxy-hackathon.
@pvanheus With a new compressed fastq datatype, this would require updating the wrappers to also allow this datatype I assume? Then tools that do not support it require a conversion to use it as input.
Would adding a "convert" tool to uncompress (and compress) be an option?
from galaxy-hackathon.
Yes. I am planning to add converters, but I am afraid that would not be a good idea for larger fastq files.
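For illustration, a converter along these lines could stream the data in fixed-size chunks, so memory use stays flat regardless of file size. This is only a minimal sketch using the Python standard library, not Galaxy's converter framework; the function names and the chunk size are assumptions made up for this example. It does not remove the disk-space and run-time cost of keeping both copies of a large file, which is presumably the concern above.

```python
import gzip
import shutil

# Hypothetical chunk size for this sketch: 1 MiB per read.
CHUNK_SIZE = 1024 * 1024


def decompress_fastq(src_path, dest_path):
    """Stream-decompress a gzipped FASTQ file into dest_path."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        # copyfileobj reads and writes CHUNK_SIZE bytes at a time,
        # so the whole file is never held in memory.
        shutil.copyfileobj(src, dest, CHUNK_SIZE)


def compress_fastq(src_path, dest_path):
    """Stream-compress a plain FASTQ file into a gzipped dest_path."""
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest, CHUNK_SIZE)
```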
from galaxy-hackathon.
Why a new format? Just annotate the old format and convert tools that do not support compressed fastq to react to the metadata. This should be compatible and doable without much effort. I'm assuming here that most of the tools already have native support for gzipped fastq.
from galaxy-hackathon.
@bgruening because metadata is per-user not per-dataset. However, how about we make a new type: uncompressed fastq. So Fastq is compressed fastq. I'm just thinking of a way to convert existing datasets... @natefoo also pointed out to me that the correct way to handle tools that depend on .gz extension is that at job run time the dataset is linked in with the extension as per datatypes_conf.xml.
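To make the linking idea concrete: the stored file keeps its .dat name, and the job sees a symlink whose name carries the extension the tool expects. The sketch below is an assumption-laden illustration, not Galaxy's job-preparation code; `link_with_extension` and its parameters are hypothetical names, and the extension would in practice come from datatypes_conf.xml rather than being passed in by hand.

```python
import os


def link_with_extension(dataset_path, job_dir, extension, name="input"):
    """Symlink a stored dataset into job_dir under a name ending in extension.

    Hypothetical helper for illustration only.
    """
    link_path = os.path.join(job_dir, "%s.%s" % (name, extension))
    # Link to the absolute path so the link still resolves
    # when the job runs from a different working directory.
    os.symlink(os.path.abspath(dataset_path), link_path)
    return link_path
```

A tool that checks the file name would then be handed something like `input.fastq.gz`, while the dataset on disk stays untouched.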
from galaxy-hackathon.
@bgruening and @pvanheus: I have created a separate branch (https://github.com/ashvark/galaxy/tree/fastq_enhancements) in my repository to enhance the fastq datatype so that it handles gzipped fastq files as such. I have tested this only with simple test cases. The changes are explained below:
- added metadata element 'is_gzipped' for the Fastq datatype in the file datatypes/Sequence.py
- modified get_headers() method in datatypes/sniff.py to handle gzipped files.
- added a condition in upload.py to avoid the decompression of gzipped fastq files during upload
TO DO
- test with various scenarios so that it does not disturb any other functionalities
I would like to know your suggestions and improvements.
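For readers who have not looked at the branch, the general technique behind the first two changes can be sketched as follows: detect gzip by its two magic bytes, then read the first lines through `gzip.open` when the file is compressed. This is not the code from the branch and not Galaxy's sniff.py; the helper names below are assumptions chosen for this example.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at path starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def open_maybe_gzipped(path):
    """Open path as text, transparently decompressing gzip input."""
    if is_gzipped(path):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_first_lines(path, count=4):
    """Return up to count leading lines, e.g. one FASTQ record."""
    lines = []
    with open_maybe_gzipped(path) as fh:
        for line in fh:
            lines.append(line.rstrip("\n"))
            if len(lines) >= count:
                break
    return lines
```

Checking the content instead of the file name matters here, since stored datasets do not keep a .gz extension.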
from galaxy-hackathon.
+ref: galaxyproject/tools-iuc#354
from galaxy-hackathon.
@yhoogstrate that pull request remains open and it seems no further development has been done on it.
Another discussion is here: #38
Perhaps we should combine efforts around this.
@ashvark I briefly tested your changes locally and they worked OK.
The other issue is the file/dataset extension that tools sometimes use to determine the format of a file. Is there any reason why Galaxy forces the .dat extension? I know it would be a big change, but could files be stored and tracked with their original extension in Galaxy?
from galaxy-hackathon.