workflowconversion / ctdconverter
Series of python scripts to convert CTD files into other formats such as Galaxy, CWL
License: MIT License
It would be useful to have unit tests for the conversions. I am happy to do this for the Galaxy component.
We (as in I) reinvented the wheel. What started as a few calls to print()
turned into a reimplementation of a logger, which is kind of sad, if you really think about it.
Python offers logging capabilities, so these should be used instead (see https://docs.python.org/2/library/logging.html).
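A minimal sketch of what the switch to the standard logging module could look like. The logger name "ctdconverter" and the helper names configure_logging and convert are assumptions for illustration, not names taken from the project.

```python
import logging

# Hypothetical logger name; one module-level logger replaces the ad-hoc print() helpers.
logger = logging.getLogger("ctdconverter")


def configure_logging(verbose=False):
    """Set up a console handler that mimics the 'INFO: ...' style of the current output."""
    level = logging.DEBUG if verbose else logging.INFO
    logging.basicConfig(format="%(levelname)s: %(message)s", level=level)
    logger.setLevel(level)


def convert(input_ctd):
    """Placeholder conversion step that reports progress through the logger."""
    logger.info("Converting from %s", input_ctd)
    logger.debug("Finished reading %s", input_ctd)
    return input_ctd


if __name__ == "__main__":
    configure_logging(verbose=True)
    convert("/tmp/ctd/FileInfo.ctd")
```

Severity filtering, formatting and redirection to a file then come from the library instead of hand-written code.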
#if $param_printsextract_version:
-printsextract:version "$param_printsextract_version"
#end if
#if $param_input_infile:
-input:infile $param_input_infile
#end if
Should just be
#if $param_printsextract_version:
-version "$param_printsextract_version"
#end if
#if $param_input_infile:
-infile $param_input_infile
#end if
Although XML documents are valid regardless of the ordering of their attributes, a consistent attribute order would make it easier to compare different documents manually.
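One way to get a stable attribute order is to sort the attributes of every element before writing the document. The sketch below uses the standard library's xml.etree.ElementTree; the converter itself may use a different XML library, and sort_attributes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def sort_attributes(root):
    """Reorder the attributes of every element alphabetically, in place."""
    for element in root.iter():
        ordered = sorted(element.attrib.items())
        element.attrib.clear()
        element.attrib.update(ordered)
    return root


if __name__ == "__main__":
    tool = ET.fromstring('<param type="text" name="param_in" label="Input"/>')
    print(ET.tostring(sort_attributes(tool), encoding="unicode"))
```

With sorted attributes, two generated files can be compared with a plain line-based diff.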
e.g.
the `param_debug` field is not valid because
tried int but
`u'0'` is not int
which is
- default: '0'
doc: Sets the debug level
id: param_debug
inputBinding:
prefix: -debug
label: Sets the debug level
type:
- 'null'
- int
So the type of the default must match (one of?) the CWL parameter types defined.
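A possible fix is to convert the CTD default, which is read as a string, to the declared type before it is written to the CWL file. This is only a sketch: coerce_default is a hypothetical helper, and the type names it handles ("int", "float", "double") are assumed to be the ones that appear in the CTD.

```python
def coerce_default(value, ctd_type):
    """Convert a CTD default (a string) to the Python type matching its declared type.

    Empty values become None so that no default is emitted at all.
    """
    if value is None or value == "":
        return None
    if ctd_type == "int":
        return int(value)
    if ctd_type in ("float", "double"):
        return float(value)
    # Strings and file types are passed through unchanged.
    return value


if __name__ == "__main__":
    print(repr(coerce_default("0", "int")))
    print(repr(coerce_default("2.1.0", "string")))
```

A default converted this way is serialized as default: 0 rather than default: '0', which agrees with the int type.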
Currently < and > in the command block are quoted. E.g. <![CDATA[ gets &lt;![CDATA[; shell redirection is also affected.
The same goes for help.
For the values of attributes, quoting is still required.
If someone could give me a clue how this could be done, I could implement it.
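One possible approach is to add the command text as a real CDATA node instead of plain text, so the serializer does not escape it. The sketch below uses xml.dom.minidom from the standard library, which may not be the XML library the converter is built on; build_command_element is a hypothetical helper.

```python
from xml.dom import minidom


def build_command_element(command_text):
    """Return a <tool> fragment whose <command> body is wrapped in a CDATA section."""
    doc = minidom.Document()
    tool = doc.createElement("tool")
    doc.appendChild(tool)
    command = doc.createElement("command")
    # A CDATA section is written verbatim, so < and > survive unescaped.
    command.appendChild(doc.createCDATASection(command_text))
    tool.appendChild(command)
    return tool.toxml()


if __name__ == "__main__":
    print(build_command_element("mytool -in $param_in > $param_out"))
```

Attribute values still go through the normal escaping, which matches the requirement that quoting stays in place there.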
NoFlo has a graph language for Flow-Based Programming workflows in javascript.
@timosachsenberg , Oliver said you might have some Galaxy workflows that used the output of CTDConverter?
At the moment the python code is not structured fully as a package.
It would be useful for unit-testing and installation purposes if we packaged up the code (I can do this)
FYI: @bernt-matthias
Most tools do not take force-feeding of parameters well (Unknown option(s) '[-version]' given. Aborting!).
The cwl:
- default: 2.1.0
doc: Version of the tool that generated this parameters file.
id: param_version
inputBinding:
prefix: -version
label: Version of the tool that generated this parameters file.
type:
- 'null'
- string
from the ctd
<?xml version="1.0" encoding="UTF-8"?>
<tool ctdVersion="1.7" version="2.1.0" name="FileInfo" docurl="http://ftp.mi.fu-berlin.de/OpenMS/release-documentation/html/TOPP_FileInfo.html" category="File Handling" >
<description><![CDATA[Shows basic information about the file, such as data ranges and file type.]]></description>
<manual><![CDATA[Shows basic information about the file, such as data ranges and file type.]]></manual>
<PARAMETERS version="1.6.2" xsi:noNamespaceSchemaLocation="http://open-ms.sourceforge.net/schemas/Param_1_6_2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<NODE name="FileInfo" description="Shows basic information about the file, such as data ranges and file type.">
<ITEM name="version" value="2.1.0" type="string" description="Version of the tool that generated this parameters file." required="false" advanced="true" />
<NODE name="1" description="Instance '1' section for 'FileInfo'">
<ITEM name="in" value="" type="input-file" description="input file " required="true" advanced="false" supported_formats="*.mzData,*.mzXML,*.mzML,*.dta,*.dta2d,*.mgf,*.featureXML,*.consensusXML,*.idXML,*.pepXML,*.fid,*.mzid" />
<ITEM name="in_type" value="" type="string" description="input file type -- default: determined from file extension or content" required="false" advanced="false" restrictions="mzData,mzXML,mzML,dta,dta2d,mgf,featureXML,consensusXML,idXML,pepXML,fid,mzid" />
<ITEM name="out" value="" type="output-file" description="Optional output file. If left out, the output is written to the command line." required="false" advanced="false" supported_formats="*.txt" />
<ITEM name="out_tsv" value="" type="output-file" description="Second optional output file. Tab separated flat text file." required="false" advanced="true" supported_formats="*.csv" />
<ITEM name="m" value="false" type="string" description="Show meta information about the whole experiment" required="false" advanced="false" restrictions="true,false" />
<ITEM name="p" value="false" type="string" description="Shows data processing information" required="false" advanced="false" restrictions="true,false" />
<ITEM name="s" value="false" type="string" description="Computes a five-number statistics of intensities, qualities, and widths" required="false" advanced="false" restrictions="true,false" />
<ITEM name="d" value="false" type="string" description="Show detailed listing of all spectra and chromatograms (peak files only)" required="false" advanced="false" restrictions="true,false" />
<ITEM name="c" value="false" type="string" description="Check for corrupt data in the file (peak files only)" required="false" advanced="false" restrictions="true,false" />
<ITEM name="v" value="false" type="string" description="Validate the file only (for mzML, mzData, mzXML, featureXML, idXML, consensusXML, pepXML)" required="false" advanced="false" restrictions="true,false" />
<ITEM name="i" value="false" type="string" description="Check whether a given mzML file contains valid indices (conforming to the indexedmzML standard)" required="false" advanced="false" restrictions="true,false" />
<ITEM name="log" value="" type="string" description="Name of log file (created only when specified)" required="false" advanced="true" />
<ITEM name="debug" value="0" type="int" description="Sets the debug level" required="false" advanced="true" />
<ITEM name="threads" value="1" type="int" description="Sets the number of threads allowed to be used by the TOPP tool" required="false" advanced="false" />
<ITEM name="no_progress" value="false" type="string" description="Disables progress logging to command line" required="false" advanced="true" restrictions="true,false" />
<ITEM name="force" value="false" type="string" description="Overwrite tool specific checks." required="false" advanced="true" restrictions="true,false" />
<ITEM name="test" value="false" type="string" description="Enables the test mode (needed for internal use only)" required="false" advanced="true" restrictions="true,false" />
</NODE>
</NODE>
</PARAMETERS>
</tool>
My guess is that the generator produces params from each ITEM as long as it is a child of a NODE, but it should only(???) do so from the innermost NODE.
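The guess above could be checked with a small sketch that only collects ITEMs below the instance NODE (the one named "1") and skips ITEMs that hang directly off the tool-level NODE, such as version. Whether this is the right rule for every CTD is an assumption; instance_items is a hypothetical helper built on the standard library parser.

```python
import xml.etree.ElementTree as ET


def instance_items(ctd_xml):
    """Return the names of ITEMs found below the instance NODE(s) of a CTD.

    ITEMs that are direct children of the tool-level NODE are ignored.
    """
    root = ET.fromstring(ctd_xml)
    tool_node = root.find("./PARAMETERS/NODE")
    names = []
    for instance_node in tool_node.findall("NODE"):
        for item in instance_node.iter("ITEM"):
            names.append(item.get("name"))
    return names


if __name__ == "__main__":
    sample = """<tool name="FileInfo"><PARAMETERS><NODE name="FileInfo">
    <ITEM name="version" value="2.1.0" type="string"/>
    <NODE name="1"><ITEM name="in" value="" type="input-file"/></NODE>
    </NODE></PARAMETERS></tool>"""
    print(instance_items(sample))
```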
Perhaps issue an error or warning if no outputs are defined in the CTD?
/home/mcrusoe/miniconda2/envs/ctd-converter/bin/cwltool 1.0.20180130110340
Resolved 'mason_methylation.cwl' to 'file:///home/mcrusoe/src/CTDSchema/seqan/mason_methylation.cwl'
Tool definition failed validation:
mason_methylation.cwl:6:1: Object `mason_methylation.cwl` is not valid because
tried `CommandLineTool` but
missing required field `outputs`
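A check along these lines could run before the CWL file is written. It is a sketch under the assumption that outputs are the ITEM or ITEMLIST elements with type="output-file"; the helper names are hypothetical.

```python
import warnings
import xml.etree.ElementTree as ET


def output_parameter_names(ctd_xml):
    """List the names of all parameters declared with type="output-file"."""
    root = ET.fromstring(ctd_xml)
    return [
        element.get("name")
        for element in root.iter()
        if element.tag in ("ITEM", "ITEMLIST") and element.get("type") == "output-file"
    ]


def warn_if_no_outputs(ctd_xml, tool_name):
    """Emit a warning when a CTD defines no outputs and return the output names."""
    names = output_parameter_names(ctd_xml)
    if not names:
        warnings.warn("No outputs are defined in the CTD of %s" % tool_name)
    return names
```

Raising an error instead of warning would be a stricter variant of the same check.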
An idea just discussed with @bgruening: use a configfile and add cheetah code for variables.
For OpenMS this seems to work only for the ini files (which are very similar to the ctd files), since I found no parameter to give the ctd to the OpenMS tools.
Flags (parameters without an argument) are placed in the CWL together with default values (their boolean default values in the tool).
Originally posted by @mwalzer in https://github.com/WorkflowConversion/CTDConverter/issue_comments#issuecomment-444969252
This is unwanted, not sure how to get rid of:
<param name="param_printsextract_version" type="text" size="30" value="6.6.0" label="Version of the tool that generated this parameters file" help="(-version) ">
<sanitizer>
<valid initial="string.printable">
<remove value="'"/>
<remove value="&quot;"/>
</valid>
</sanitizer>
</param>
CTD looks like:
<?xml version="1.0" ?>
<tool name="printsextract" version="6.6.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://github.com/genericworkflownodes/CTDopts/raw/master/schemas/CTD_0_3.xsd">
<description>Extract data from PRINTS database for use by pscan</description>
<PARAMETERS version="1.6.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://github.com/genericworkflownodes/CTDopts/raw/master/schemas/Param_1_6_2.xsd">
<NODE description="Extract data from PRINTS database for use by pscan" name="printsextract">
<ITEM description="Version of the tool that generated this parameters file." name="version" tags="advanced" type="string" value="6.6.0"/>
<NODE description="Parameters of printsextract" name="1">
<NODE description="Input section" name="input">
<ITEM description="PRINTS database file" name="infile" supported_formats="*.prints database" type="input-file" value=""/>
</NODE>
</NODE>
</NODE>
</PARAMETERS>
</tool>
Also produces unused command bit:
#if $param_printsextract_version:
-printsextract:version "$param_printsextract_version"
#end if
This is the list of TODOs that I would like to cover in the near future:
CDATA for command and help: started here bernt-matthias@a9ea258
optional string/int/... parameters don't get optional="true"; it's currently only set for parameters with restrictions, bernt-matthias@9ab3b25
int/float parameters without a default value are currently auto-set to 0 -- needs a manual cherry-pick from #49
fix indentation of the closing #end if for the #if str(...) and the -parameter, bernt-matthias@eb50cfa
there may be different ways to specify itemlists on the CLI: -param A B C or -param A -param B -param C, see OpenMS/OpenMS#4196, but in OpenMS only the former is used.
add the possibility to add hidden parameters
<item type='boolean' >
<item type='string' restriction="true,false">
<item type='string' restriction="...,...,..."> and <itemlist type='string' restriction="...,...,...">
'$param_param_wodefault_mandatory_restricted' instead of the inner if, bernt-matthias@9ab3b25
<itemlist type='string/int/float'> are currently rendered as repeat, which does not allow for default values. The best option seems to be to render them as a comma-separated list of elements with appropriate validators and sanitizers. bernt-matthias@ed73ced
max=1 is set?
value="A B"
Input files
format="txt,txt,txt" bernt-matthias@e4b4333
Output files
param_stdout (seems absent automatically if there are outputs)
Use {# ' '.join([\"'%s'\"%str(_) for _ in " + actual_parameter + "])} instead of tmp variable in command block for list selection, maybe also for multiple file input (by adding an if _ != None), also all references to repeats could be removed from there. bernt-matthias@bcb9986
tests?
macro="..."
from conditional there for add the conditional select param
<test> (ie. remove all the rest) from the ctd files in the OpenMS tests
implement <NODE> in tests (and render as section?)
OpenMS has parameters specifying prefixes for output files, e.g. DTAExtractor.ctd <ITEM name="out" value="" type="string" description="base name of DTA output files (RT, m/z and extension are appended)" required="true" advanced="false" />
data type mapping
tof_const parameter (OpenMS expects csv) works only if the file extension is absent or csv -> maybe remove extension for tabular or hardcode to csv (the former might be easier for tests because test data is detected as text)
There appears to be a parsing error in one of the OpenMS ctd files, though I'm not sure where; the file looks okay at first glance. I can't find any u'\xb2' (² ?).
Offending file attached.
INFO: Converting from /tmp/ctd/MetaProSIP.ctd
Traceback (most recent call last):
File "generator.py", line 352, in main
xsd_location=args.xsd_location)
File "generator.py", line 598, in convert
create_inputs(tool, model, **kwargs)
File "generator.py", line 965, in create_inputs
create_param_attribute_list(param_node, param, kwargs["supported_file_formats"])
File "generator.py", line 1129, in create_param_attribute_list
label, help_text = generate_label_and_help(param.description)
File "generator.py", line 1139, in generate_label_and_help
desc = str(desc).replace("#br#", " <br>")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb2' in position 9: ordinal not in range(128)
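The traceback points at str(desc), which under Python 2 forces an ASCII encoding of a unicode description. A sketch of a replacement that never encodes the text is shown below, written for Python 3 where str is already a text type; normalize_description is a hypothetical name, and under Python 2 the equivalent would be to stay with unicode objects instead of calling str().

```python
def normalize_description(desc):
    """Replace the #br# marker without forcing the text through an ASCII codec."""
    if isinstance(desc, bytes):
        # Assumption: CTD files are UTF-8 encoded, as their XML declaration states.
        desc = desc.decode("utf-8")
    return desc.replace("#br#", " <br>")


if __name__ == "__main__":
    print(normalize_description("Area in nm\u00b2#br#second line"))
```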
When a commandline flag is converted with CTD to a dropdown menu, the menu has three options: TRUE, FALSE and "Nothing selected". This is due to the entry optional="True". However, for flags this is unwanted behaviour, as a missing flag normally means that the default is FALSE. So the option "Nothing selected" should not be there. A flag would be better reflected with optional="False", value="False".
To be clear, my definition of "flag" is: a commandline parameter which takes no additional input (T/F, String etc.). For example: adding -append to a command translates to append=T, with the default append=F.
Sorry for the lengthy explanation, I am still new to this. I hope I made myself clear.
The converter produces
<outputs>
<data name="param_out" format="data"/>
</outputs>
I think, there are two issues with that:
What I presume would be better:
<outputs>
<data name="param_out" type="data" format="idXML"/>
</outputs>
from what is available in the ctd:
<ITEM name="out" value="" type="output-file" description="Output file" required="true" advanced="false" supported_formats="*.idXML" />
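The format could be derived from the supported_formats attribute by stripping the glob prefix from each entry. The sketch below shows that step only; galaxy_formats is a hypothetical helper, and a real converter would probably still need a mapping from file extensions to the datatype names known to Galaxy.

```python
def galaxy_formats(supported_formats):
    """Turn a CTD supported_formats string such as '*.idXML' into a format attribute value.

    Falls back to the generic 'data' when the CTD lists no formats.
    """
    formats = []
    for pattern in supported_formats.split(","):
        extension = pattern.strip().lstrip("*").lstrip(".")
        if extension and extension not in formats:
            formats.append(extension)
    return ",".join(formats) if formats else "data"


if __name__ == "__main__":
    print(galaxy_formats("*.idXML"))
    print(galaxy_formats("*.mzData,*.mzXML,*.mzML"))
```

The same helper would apply to input parameters, where the format attribute must not be left empty.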
CTDs do not include this information. So the schema should be modified to support positional parameters, which are to be used in output CWL/ToolConfig files.
Then we can have a conda package conda-forge/staged-recipes#14908
@TorHou @bgruening what would be the best way of doing this... EMBOSS has a couple of cases where the presence/absence of an output depends on a specific parameter's value.
E.g. passing the -plot flag to iep will cause the output -graph to become used. Same for -noreport and removing the -outfile output.
Any thoughts on this? (No rush, it's a weekend :) )
If one pip installs this project, it creates common, galaxy, and cwl directly in the python distribution. It would be nice to have all those modules in one folder, e.g. ctdconverter/common.
galaxyproject/galaxy#35 which would map to the "add_group" command in CTD
for label="Input file"
type is correctly set as 'data', but format attribute is empty. E.g.:
<param name="param_in" type="data" format="" optional="False" label="Input file" help="(-in) "/>
with ctd
<ITEM name="in" value="" type="input-file" description="Input file" required="true" advanced="false" supported_formats="*.idXML" />
The most reasonable way to go around this would be to refactor generator.py following these guidelines:
The generator.py script will be moved to galaxy/converter.py. A new script under cwl/converter.py will also be added.
common
convert.py should be created on the topmost level. This script will then decide which specific script (whether galaxy/converter.py or cwl/converter.py) should be invoked.
In the end, invoking the converter should look similar to:
$ python convert.py [FORMAT] [ADDITIONAL_ARGUMENTS ...]
That is, the first positional parameter will be the output format (either galaxy or cwl), while the rest of the arguments should not be modified in order to keep current functionality (i.e., generating Galaxy ToolConfig files). In other words, this is how a single CTD file is currently converted into a ToolConfig file:
$ python generator.py -i tool.ctd -o tool.xml
After the refactoring, this will change to:
$ python convert.py galaxy -i tool.ctd -o tool.xml
And in the case of CWL, it would look like:
$ python convert.py cwl -i tool.ctd -o tool.cwl
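The proposed top-level script could be sketched with argparse as follows. Only the positional format and the -i/-o options come from the proposal; the long option names, the function names and the way the converter module is selected are assumptions for illustration.

```python
import argparse

SUPPORTED_FORMATS = ("galaxy", "cwl")


def parse_arguments(argv):
    """Parse 'FORMAT -i input -o output' as described in the proposal."""
    parser = argparse.ArgumentParser(prog="convert.py")
    parser.add_argument("format", choices=SUPPORTED_FORMATS, metavar="FORMAT")
    # Long option names are hypothetical; only -i and -o appear in the proposal.
    parser.add_argument("-i", "--input", dest="input_file", required=True)
    parser.add_argument("-o", "--output", dest="output_file", required=True)
    return parser.parse_args(argv)


def select_converter(output_format):
    """Map the chosen format to the module that would handle it."""
    return {"galaxy": "galaxy.converter", "cwl": "cwl.converter"}[output_format]


if __name__ == "__main__":
    args = parse_arguments(["cwl", "-i", "tool.ctd", "-o", "tool.cwl"])
    print(select_converter(args.format), args.input_file, args.output_file)
```

Keeping the remaining arguments unchanged means existing Galaxy invocations only gain the extra galaxy word.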
It'd be nice to be able to pip install this :)
Even when the missing ctdVersion is added, CTDOpts gets grumpy with empty values like this snippet from flexbar --write-ctd using Seqan 2.4.0
<ITEM name="barcode-tail-length" value="" type="int" description="Region size in tail trim-end modes. Default: barcode length." required="false" advanced="true" />
(ctd-converter) mcrusoe@mrcdev:~$ python ~/src/CTDConverter/convert.py cwl -i ~/debian/flexbar/flexbar.ctd -o f.c
INFO: Using cwl converter
WARNING: Validation against a schema has not been enabled.
INFO: Parsing /home/mcrusoe/debian/flexbar/flexbar.ctd
Traceback (most recent call last):
File "/home/mcrusoe/src/CTDConverter/convert.py", line 215, in main
converter.get_preferred_file_extension())
File "/home/mcrusoe/src/CTDConverter/common/utils.py", line 118, in parse_input_ctds
parsed_ctds.append(ParsedCTD(CTDModel(from_file=input_ctd), input_ctd, output_file))
File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 643, in __init__
self._load_from_file(from_file)
File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 688, in _load_from_file
self.parameters = self._build_param_model(params_container_node, base=None)
File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 708, in _build_param_model
self._build_param_model(child, current_group)
File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 713, in _build_param_model
base.add(**setup) # register parameter in model
File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 564, in add
self.parameters[name] = Parameter(name, self, **kwargs)
File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 376, in __init__
self._validate_numerical_defaults(default)
File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 456, in _validate_numerical_defaults
"default": ', '.join(map(str, errors_so_far))})
ModelParsingError: An error occurred while parsing the CTD file: Invalid default value(s) provided for parameter barcode-min-overlap of type <type 'int'>: ''
ERROR: There seems to be a problem with one of your input CTDs.
Traceback (most recent call last):
File "/home/mcrusoe/src/CTDConverter/convert.py", line 272, in <module>
sys.exit(main())
File "/home/mcrusoe/src/CTDConverter/convert.py", line 234, in main
utils.error("Reason: " + e.msg, 0)
AttributeError: 'ModelParsingError' object has no attribute 'msg'
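The second traceback is a separate problem: the error handler itself fails because the exception has no msg attribute. A defensive way to build the message is sketched below; the ModelParsingError class here is a stand-in defined only for the example, not the CTDopts class, and describe_error is a hypothetical helper.

```python
class ModelParsingError(Exception):
    """Stand-in for the exception raised while parsing a CTD (illustration only)."""


def describe_error(error):
    """Build the 'Reason: ...' text without assuming the exception has a msg attribute."""
    message = getattr(error, "msg", None) or str(error)
    return "Reason: " + message


if __name__ == "__main__":
    try:
        raise ModelParsingError("Invalid default value(s) provided for parameter")
    except ModelParsingError as error:
        print(describe_error(error))
```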
Tagging in @h-2
From lambda --write-ctd
<?xml version="1.0" encoding="UTF-8"?>
<tool name="Lambda" version="1.0.2 (Git commit )" docurl="http://www.seqan.de" category="" >
<executableName>lambda</executableName>
<description>the Local Aligner for Massive Biological DatA</description>
<manual>Lambda is a local aligner optimized for many query sequences and searches in protein space. It is compatible to BLAST, but much faster than BLAST and many other comparable tools.
Detailed information is available in the wiki: <https://github.com/seqan/lambda/wiki>
</manual>
produces via convert.py cwl
#!/usr/bin/env cwl-runner
# This CWL file was automatically generated using CTDConverter.
# Visit https://github.com/WorkflowConversion/CTDConverter for more information.
baseCommand: lambda
class: CommandLineTool
cwlVersion: v1.0
doc: "Lambda is a local aligner optimized for many query sequences and searches in\
\ protein space. It is compatible to BLAST, but much faster than BLAST and many\
\ other comparable tools.\nDetailed information is available in the wiki: <https://github.com/seqan/lambda/wiki>\n\
\n\n\nFor more information, visit http://www.seqan.de"
which could instead be written as
#!/usr/bin/env cwl-runner
# This CWL file was automatically generated using CTDConverter.
# Visit https://github.com/WorkflowConversion/CTDConverter for more information.
baseCommand: lambda
class: CommandLineTool
cwlVersion: v1.0
doc: |
Lambda is a local aligner optimized for many query sequences and searches in
protein space. It is compatible to BLAST, but much faster than BLAST and many
other comparable tools.
Detailed information is available in the wiki: https://github.com/seqan/lambda/wiki
For more information, visit http://www.seqan.de
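The more readable form corresponds to a YAML literal block scalar. The sketch below formats one by hand with the standard library only; yaml_literal_block is a hypothetical helper, and a converter that already relies on a YAML library would more likely configure that library's string representation instead.

```python
def yaml_literal_block(key, text, indent=2):
    """Format a multi-line string as a YAML literal block scalar ('key: |')."""
    pad = " " * indent
    lines = [line.rstrip() for line in text.strip().splitlines()]
    # Blank lines stay empty so that no trailing whitespace is written.
    body = "\n".join(pad + line if line else "" for line in lines)
    return "%s: |\n%s\n" % (key, body)


if __name__ == "__main__":
    manual = "Lambda is a local aligner.\nDetailed information is available in the wiki.\n"
    print(yaml_literal_block("doc", manual))
```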
On the CTD side, you can have plain booleans, but a plain old boolean won't let you set a restriction on it with choices.
This is because with type=bool and choices=[a, b], we run into this bit of code: https://github.com/erasche/CTDopts/blob/master/CTDopts/CTDopts.py#L302 which tries to join the values. The values are previously mapped to param.type, meaning everything is coerced into a boolean, which then doesn't get joined properly.
I've found that by using type=str and choices=['true', 'false'], I can get this script to output what I want in galaxy XML, i.e. (truevalue="--param-name" falsevalue=""), however that's not very user-friendly.
That would also allow us to create some unit tests for CTD2Galaxy. Yours, Steffen
For the xml:
<param name="param_additional_termini" type="boolean"
truevalue="-additional:termini" falsevalue="" optional="True"
label="Include charge at N and C terminus" help="(-termini) "/>
Currently:
#if $param_additional_termini:
-additional:termini
#end if
Expected:
$param_additional_termini
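Because truevalue and falsevalue already carry the text to put on the command line, the generator only needs to emit the variable itself. A sketch of how a boolean flag and its command fragment could be produced together; boolean_flag_param is a hypothetical helper and the use of the checked attribute for the default is an assumption about the desired output.

```python
import xml.etree.ElementTree as ET


def boolean_flag_param(param_name, cli_flag, label, default=False):
    """Build a boolean <param> plus the command fragment that uses it directly."""
    param = ET.Element("param")
    param.set("name", param_name)
    param.set("type", "boolean")
    param.set("truevalue", cli_flag)
    param.set("falsevalue", "")
    param.set("checked", "true" if default else "false")
    param.set("label", label)
    # No #if block: the variable expands to truevalue or falsevalue by itself.
    command_fragment = "$" + param_name
    return param, command_fragment


if __name__ == "__main__":
    element, fragment = boolean_flag_param(
        "param_additional_termini", "-additional:termini", "Include charge at N and C terminus"
    )
    print(ET.tostring(element, encoding="unicode"))
    print(fragment)
```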