
ctdconverter's People

Contributors

bernt-matthias, bgruening, blankclemens, chahuistle, jpfeuffer, mr-c, mwalzer, tomnl, torhou

ctdconverter's Issues

Colon separated command line flags

#if $param_printsextract_version:
  -printsextract:version     "$param_printsextract_version"
#end if
#if $param_input_infile:
  -input:infile $param_input_infile
#end if

Should just be

#if $param_printsextract_version:
  -version     "$param_printsextract_version"
#end if
#if $param_input_infile:
  -infile $param_input_infile
#end if
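
A minimal sketch of the mapping asked for above, assuming the converter has the colon-separated parameter path available as a string. `flag_from_path` is a hypothetical helper, not an existing CTDConverter function, and dropping the prefix only suits tools that expect flat flag names.

```python
def flag_from_path(param_path):
    """Return the command-line flag for a colon-separated CTD parameter path.

    Only the last path component is kept, e.g. "input:infile" -> "-infile".
    """
    return "-" + param_path.split(":")[-1]


print(flag_from_path("printsextract:version"))  # -version
print(flag_from_path("input:infile"))           # -infile
```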

defaults render cwl invalid

e.g.

the `param_debug` field is not valid because
                               tried int but
                                 `u'0'` is not int

which is

 - default: '0'
  doc: Sets the debug level
  id: param_debug
  inputBinding:
    prefix: -debug
  label: Sets the debug level
  type:
  - 'null'
  - int

So the type of the default must match (one of?) the CWL parameter types defined.
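
One way this could be handled is to convert the default before it is written, trying each declared CWL type in turn. The sketch below is only an illustration; `coerce_default` is a hypothetical name and the set of handled types is an assumption.

```python
def coerce_default(value, cwl_types):
    """Convert a CTD default (always a string) to the first matching CWL type."""
    converters = {"int": int, "float": float, "double": float, "string": str}
    for cwl_type in cwl_types:
        if cwl_type == "null":
            continue
        if cwl_type == "boolean":
            if value.lower() in ("true", "false"):
                return value.lower() == "true"
            continue
        converter = converters.get(cwl_type)
        if converter is None:
            continue
        try:
            return converter(value)
        except ValueError:
            continue
    return value  # no conversion possible, keep the value unchanged


print(repr(coerce_default("0", ["null", "int"])))  # 0
```

With this, the `param_debug` default would be emitted as the integer 0 rather than the string '0'.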

disable quoting in command block

Currently < and > in the command block are escaped. E.g. <![CDATA[ becomes &lt;![CDATA[, and shell redirection is affected as well.
Same for help.

For the values of the attributes quoting is still required.

If someone could give me a clue how this could be done I could implement it.
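
As a possible clue, XML libraries usually offer a dedicated CDATA node that is written verbatim while attribute values keep their normal escaping. The sketch below uses the standard library's `xml.dom.minidom`; the tool name and command string are made up for illustration. If the converter builds its output with lxml, `lxml.etree.CDATA` plays the same role.

```python
from xml.dom import minidom

doc = minidom.Document()
tool = doc.createElement("tool")
# Attribute values are still escaped on output.
tool.setAttribute("name", 'a "quoted" <name>')
doc.appendChild(tool)

command = doc.createElement("command")
# A CDATA section is written verbatim, so < and > survive unescaped.
command.appendChild(doc.createCDATASection("mytool -in '$param_in' > '$param_out'"))
tool.appendChild(command)

xml_text = tool.toxml()
print(xml_text)
```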

Python package

At the moment the python code is not structured fully as a package.

It would be useful for unit-testing and installation purposes if we packaged up the code (I can do this)

FYI: @bernt-matthias

cwl conversion adds version parameter even if the tool does not have a version parameter

Most tools do not take force-fed parameters well (Unknown option(s) '[-version]' given. Aborting!).
The cwl:

- default: 2.1.0
  doc: Version of the tool that generated this parameters file.
  id: param_version
  inputBinding:
    prefix: -version
  label: Version of the tool that generated this parameters file.
  type:
  - 'null'
  - string

from the ctd

<?xml version="1.0" encoding="UTF-8"?>
<tool ctdVersion="1.7" version="2.1.0" name="FileInfo" docurl="http://ftp.mi.fu-berlin.de/OpenMS/release-documentation/html/TOPP_FileInfo.html" category="File Handling" >
<description><![CDATA[Shows basic information about the file, such as data ranges and file type.]]></description>
<manual><![CDATA[Shows basic information about the file, such as data ranges and file type.]]></manual>
<PARAMETERS version="1.6.2" xsi:noNamespaceSchemaLocation="http://open-ms.sourceforge.net/schemas/Param_1_6_2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <NODE name="FileInfo" description="Shows basic information about the file, such as data ranges and file type.">
    <ITEM name="version" value="2.1.0" type="string" description="Version of the tool that generated this parameters file." required="false" advanced="true" />
    <NODE name="1" description="Instance &apos;1&apos; section for &apos;FileInfo&apos;">
      <ITEM name="in" value="" type="input-file" description="input file " required="true" advanced="false" supported_formats="*.mzData,*.mzXML,*.mzML,*.dta,*.dta2d,*.mgf,*.featureXML,*.consensusXML,*.idXML,*.pepXML,*.fid,*.mzid" />
      <ITEM name="in_type" value="" type="string" description="input file type -- default: determined from file extension or content" required="false" advanced="false" restrictions="mzData,mzXML,mzML,dta,dta2d,mgf,featureXML,consensusXML,idXML,pepXML,fid,mzid" />
      <ITEM name="out" value="" type="output-file" description="Optional output file. If left out, the output is written to the command line." required="false" advanced="false" supported_formats="*.txt" />
      <ITEM name="out_tsv" value="" type="output-file" description="Second optional output file. Tab separated flat text file." required="false" advanced="true" supported_formats="*.csv" />
      <ITEM name="m" value="false" type="string" description="Show meta information about the whole experiment" required="false" advanced="false" restrictions="true,false" />
      <ITEM name="p" value="false" type="string" description="Shows data processing information" required="false" advanced="false" restrictions="true,false" />
      <ITEM name="s" value="false" type="string" description="Computes a five-number statistics of intensities, qualities, and widths" required="false" advanced="false" restrictions="true,false" />
      <ITEM name="d" value="false" type="string" description="Show detailed listing of all spectra and chromatograms (peak files only)" required="false" advanced="false" restrictions="true,false" />
      <ITEM name="c" value="false" type="string" description="Check for corrupt data in the file (peak files only)" required="false" advanced="false" restrictions="true,false" />
      <ITEM name="v" value="false" type="string" description="Validate the file only (for mzML, mzData, mzXML, featureXML, idXML, consensusXML, pepXML)" required="false" advanced="false" restrictions="true,false" />
      <ITEM name="i" value="false" type="string" description="Check whether a given mzML file contains valid indices (conforming to the indexedmzML standard)" required="false" advanced="false" restrictions="true,false" />
      <ITEM name="log" value="" type="string" description="Name of log file (created only when specified)" required="false" advanced="true" />
      <ITEM name="debug" value="0" type="int" description="Sets the debug level" required="false" advanced="true" />
      <ITEM name="threads" value="1" type="int" description="Sets the number of threads allowed to be used by the TOPP tool" required="false" advanced="false" />
      <ITEM name="no_progress" value="false" type="string" description="Disables progress logging to command line" required="false" advanced="true" restrictions="true,false" />
      <ITEM name="force" value="false" type="string" description="Overwrite tool specific checks." required="false" advanced="true" restrictions="true,false" />
      <ITEM name="test" value="false" type="string" description="Enables the test mode (needed for internal use only)" required="false" advanced="true" restrictions="true,false" />
    </NODE>
  </NODE>
</PARAMETERS>
</tool>

My guess is that the generator produces a parameter for every ITEM that is a child of any NODE, but it should only(?) do so for ITEMs in the innermost NODE.
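
To illustrate the guess, here is a sketch that skips ITEMs sitting directly under the outermost NODE and collects only those below its child NODEs. It rests on the assumption that the tool's real parameters always live below a child NODE such as "1"; the CTD is a trimmed copy of the one above, and `tool_items` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

CTD = """<tool name="FileInfo" version="2.1.0">
  <PARAMETERS version="1.6.2">
    <NODE name="FileInfo" description="">
      <ITEM name="version" value="2.1.0" type="string" />
      <NODE name="1" description="">
        <ITEM name="in" value="" type="input-file" />
        <ITEM name="debug" value="0" type="int" />
      </NODE>
    </NODE>
  </PARAMETERS>
</tool>"""


def tool_items(ctd_text):
    """Yield ITEM names, skipping ITEMs directly under the outermost NODE."""
    root = ET.fromstring(ctd_text)
    outer = root.find("PARAMETERS/NODE")
    for node in outer.findall("NODE"):
        # iter() also descends into nested NODEs such as "input".
        for item in node.iter("ITEM"):
            yield item.get("name")


print(list(tool_items(CTD)))  # ['in', 'debug']
```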

cwl always needs an `output` section, even if empty

Perhaps issue an error or warning if no outputs are defined in the CTD?

/home/mcrusoe/miniconda2/envs/ctd-converter/bin/cwltool 1.0.20180130110340
Resolved 'mason_methylation.cwl' to 'file:///home/mcrusoe/src/CTDSchema/seqan/mason_methylation.cwl'
Tool definition failed validation:                                              
mason_methylation.cwl:6:1: Object `mason_methylation.cwl` is not valid because
                             tried `CommandLineTool` but                  
                               missing required field `outputs`
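
A sketch of the suggestion above: warn when the CTD defines no outputs, but still emit an empty `outputs` list so the document validates. `build_cwl_tool` is a hypothetical function, not the converter's actual code.

```python
import logging


def build_cwl_tool(base_command, inputs, outputs):
    """Assemble a CommandLineTool dict that always carries an `outputs` key."""
    if not outputs:
        logging.warning("No outputs defined for %s; emitting an empty list.", base_command)
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        "inputs": inputs,
        "outputs": outputs or [],
    }


tool = build_cwl_tool("mason_methylation", [], None)
print(tool["outputs"])  # []
```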

Galaxy xml via configfile

An idea just discussed with @bgruening:

  • dump ctd to configfile and add cheetah code for variables
  • command is then just supplying the ctd to the original command

For OpenMS this seems to work only for the ini files (which are very similar to the ctd files) since I found no parameter to give the ctd to the OpenMS tools.

Including a version command

This is unwanted, and I am not sure how to get rid of it:

    <param name="param_printsextract_version" type="text" size="30" value="6.6.0" label="Version of the tool that generated this parameters file" help="(-version) ">
      <sanitizer>
        <valid initial="string.printable">
          <remove value="'"/>
          <remove value="&quot;"/>
        </valid>
      </sanitizer>
    </param>

CTD looks like:

<?xml version="1.0" ?>
<tool name="printsextract" version="6.6.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://github.com/genericworkflownodes/CTDopts/raw/master/schemas/CTD_0_3.xsd">
        <description>Extract data from PRINTS database for use by pscan</description>
        <PARAMETERS version="1.6.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://github.com/genericworkflownodes/CTDopts/raw/master/schemas/Param_1_6_2.xsd">
                <NODE description="Extract data from PRINTS database for use by pscan" name="printsextract">
                        <ITEM description="Version of the tool that generated this parameters file." name="version" tags="advanced" type="string" value="6.6.0"/>
                        <NODE description="Parameters of printsextract" name="1">
                                <NODE description="Input section" name="input">
                                        <ITEM description="PRINTS database file" name="infile" supported_formats="*.prints database" type="input-file" value=""/>
                                </NODE>
                        </NODE>
                </NODE>
        </PARAMETERS>
</tool>

Also produces unused command bit:

#if $param_printsextract_version:
  -printsextract:version     "$param_printsextract_version"
#end if

Lesson learned from unit tests

This is the list of TODOs that I would like to cover in the near future:

  • CDATA for command and help: started here bernt-matthias@a9ea258

  • optional string/int/... parameters don't get optional="true"; it is currently only set for parameters with restrictions bernt-matthias@9ab3b25

  • int/float without a default value currently are autoset to 0 -- needs manual cherry pick from #49

  • fix indentation of closing #end if for the #if str(...) and the -parameter, bernt-matthias@eb50cfa

  • there may be different ways to specify itemlists on the cli: -param A B C or -param A -param B -param C, see OpenMS/OpenMS#4196, but in OpenMS only the former is used.

  • add the possibility to add hidden parameters

  • <item type='boolean' >

    • are possible in ctd v1.7
  • <item type='string' restriction="true,false">

  • <item type='string' restriction="...,...,..."> and <itemlist type='string' restriction="...,...,...">

    • itemlist with restrictions should be a select; currently it is a repeat, bernt-matthias@9ab3b25
    • I would prefer non-radio selects .. lets keep radio up to 4 options
    • Do text and select inputs need the sanitizer (selects definitely can contain weird characters)
    • selects with default select the option and a value="default" (redundant), bernt-matthias@12fe492
    • select: cheetah can probably be simplified by using '$param_param_wodefault_mandatory_restricted' instead of the inner if, bernt-matthias@9ab3b25
  • <itemlist type='string/int/float'> are currently rendered as repeat, which does not allow for default values. The best option seems to be to render them as a comma-separated list of elements with appropriate validators and sanitizers. bernt-matthias@ed73ced

    • if default value(s) are given max=1 is set?
    • defaults are not set as repeat elements, but as value="A B"
    • int and float repeats are rendered as text
  • Input files

  • Output files

    • all output files need quoting bernt-matthias@bcb9986
    • optional output files? -> add boolean to inputs and filter to the output bernt-matthias@5873f96
    • do output files (with a single possible format) need to have the correct extension?
    • output files allowing multiple formats currently just set one
      • if there is an input with the same supported formats (maybe check if its only one input to be sure) then set it as format_source
      • otherwise a select box for the user which to choose
    • multiple output files (needs to be the same number as for a specific input in most cases .. seems difficult to determine)
      • maybe check if there is only a single input or have an additional tabular config for controlling this
    • Can output parameters be black listed? Seems not possible to black list param_stdout (seems absent automatically if there are outputs)
  • Use {# ' '.join([\"'%s'\"%str(_) for _ in " + actual_parameter + "])} instead of tmp variable in command block for list selection, maybe also for multiple file input (by adding an if _ != None ), also all references to repeats could be removed from there. bernt-matthias@bcb9986

  • tests?

    • remove macro="..." from conditional, therefore add the conditional select param
    • implement the generation of simple test(s) for the test cases: setting all mandatory options wo defaults bernt-matthias@7d22fad
    • use the same code to extract a <test> (ie. remove all the rest) from the ctd files in the OpenMS tests
  • implement <NODE> in tests (and render as section?)

  • OpenMS has parameters specifying prefixes for output files, e.g. DTAExtractor.ctd <ITEM name="out" value="" type="string" description="base name of DTA output files (RT, m/z and extension are appended)" required="true" advanced="false" />

  • data type mapping

    • TOFCalibration tof_const parameter (OpenMS expects csv) works only if the file extension is absent or csv -> maybe remove extension for tabular or hardcode to csv (the former might be easier for tests because test data is detected as text)

UnicodeEncodeError for u'\xb2'

There appears to be a parsing error in one of the OpenMS ctd files, though I'm not sure where; the file looks okay at first glance. I can't find any u'\xb2' (² ?).
Offending file attached.

INFO: Converting from /tmp/ctd/MetaProSIP.ctd 
Traceback (most recent call last):
  File "generator.py", line 352, in main
    xsd_location=args.xsd_location)
  File "generator.py", line 598, in convert
    create_inputs(tool, model, **kwargs)
  File "generator.py", line 965, in create_inputs
    create_param_attribute_list(param_node, param, kwargs["supported_file_formats"])
  File "generator.py", line 1129, in create_param_attribute_list
    label, help_text = generate_label_and_help(param.description)
  File "generator.py", line 1139, in generate_label_and_help
    desc = str(desc).replace("#br#", " <br>")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb2' in position 9: ordinal not in range(128)
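
The traceback points at `str(desc)`: under Python 2, calling `str()` on a unicode object encodes it with the ASCII codec, which fails for '²'. A sketch of a replacement that never goes through `str()` follows; `normalize_description` is a hypothetical name, and treating byte input as UTF-8 is an assumption.

```python
def normalize_description(desc):
    """Return desc as text with the CTD "#br#" markers turned into line breaks."""
    if desc is None:
        return u""
    if isinstance(desc, bytes):
        # Decode explicitly instead of relying on the implicit ASCII codec.
        desc = desc.decode("utf-8")
    return desc.replace(u"#br#", u" <br>")


print(normalize_description(u"Area in m\xb2#br#second line"))
```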

Conversion of commandline flags

When a commandline flag is converted with CTD to a dropdown menu, the menu has three options: TRUE, FALSE and "Nothing selected". This is due to the entry optional="True". However, for flags this is unwanted behaviour, as a missing flag normally means that the default is FALSE. So the option "Nothing selected" should not be there. A flag would be better reflected with optional="False", value="False".

To be clear, my definition of "flag" is: a commandline parameter, which takes no additional input (T/F, String etc.). For example: adding -append to a command translates to append=T, with the default append=F.

Sorry for the lengthy explanation, I am still new to this. I hope I made myself clear.

produces incomplete output tag

The converter produces

<outputs>
    <data name="param_out" format="data"/>
</outputs>

I think, there are two issues with that:

  • type is missing and should be 'data'
  • format is available in the ctd, but not included

What I presume would be better:

<outputs>
    <data name="param_out" type="data" format="idXML"/>
</outputs>

from what is available in the ctd:

<ITEM name="out" value="" type="output-file" description="Output file" required="true" advanced="false" supported_formats="*.idXML" />
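
A sketch of how the format could be derived from `supported_formats`. `galaxy_formats` is a hypothetical helper; a real converter would probably also have to map the extensions to the datatype names known to the Galaxy instance, which this sketch does not attempt.

```python
def galaxy_formats(supported_formats):
    """Turn a CTD supported_formats value into a Galaxy format string."""
    extensions = []
    for pattern in supported_formats.split(","):
        extension = pattern.strip()
        if extension.startswith("*."):
            extension = extension[2:]
        if extension and extension not in extensions:
            extensions.append(extension)
    # Fall back to the generic "data" datatype when nothing is declared.
    return ",".join(extensions) or "data"


print(galaxy_formats("*.idXML"))  # idXML
```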

Positional parameters in generated files

CTDs do not include this information. So the schema should be modified to support positional parameters, which are to be used in output CWL/ToolConfig files.

output when

@TorHou @bgruening what would be the best way of doing this... EMBOSS has a couple of cases where the presence/absence of an output depends on a specific parameter's value.

E.g. passing the -plot flag to iep causes the -graph output to be used. The same goes for -noreport, which removes the -outfile output.

Any thoughts on this? (No rush, it's a weekend :) )

restructure python package

If one pip installs this project, it creates common, galaxy, and cwl directly in the Python distribution. It would be nice to have all those modules in one folder, e.g. ctdconverter/common.

Inputs tag for input files is missing format

For label="Input file", type is correctly set to 'data', but the format attribute is empty. E.g.:

<param name="param_in" type="data" format="" optional="False" label="Input file" help="(-in) "/>

with ctd

<ITEM name="in" value="" type="input-file" description="Input file" required="true" advanced="false" supported_formats="*.idXML" />

Add CWL support

The most reasonable way to go about this would be to refactor generator.py following these guidelines:

  • Each supported format (Galaxy, CWL) should have its own folder, and each folder should contain a converter script.
  • The generator.py script will be moved to galaxy/converter.py. A new script under cwl/converter.py will also be added.
  • All common functionality, such as CTD parsing, validation against a schema, logging, etc., should be housed in a separate folder, e.g., common.
  • The documentation will clearly define which parameters are required across all supported formats and which ones are specific to one format. This should also be reflected in the code.
  • A main script, convert.py, should be created on the topmost level. This script will then decide which specific script (whether galaxy/converter.py or cwl/converter.py) should be invoked.

In the end, invoking the converter should look similar to:

$ python convert.py [FORMAT] [ADDITIONAL_ARGUMENTS ...]

That is, the first positional parameter will be the output format (either galaxy or cwl), while the rest of the arguments should not be modified in order to keep current functionality (i.e., generating Galaxy ToolConfig files). In other words, this is how a single CTD file is converted into a ToolConfig file:

$ python generator.py -i tool.ctd -o tool.xml

After the refactoring, this will change to:

$ python convert.py galaxy -i tool.ctd -o tool.xml

And in the case of CWL, it would look like:

$ python convert.py cwl -i tool.ctd -o tool.cwl
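
The dispatch described above could be sketched with argparse sub-commands. This is only an outline under assumptions: `convert_galaxy` and `convert_cwl` are placeholders standing in for galaxy/converter.py and cwl/converter.py, and only the -i and -o options from the examples are modelled.

```python
import argparse


def convert_galaxy(args):
    # Placeholder for the code that would live in galaxy/converter.py.
    return "galaxy: %s -> %s" % (args.input, args.output)


def convert_cwl(args):
    # Placeholder for the code that would live in cwl/converter.py.
    return "cwl: %s -> %s" % (args.input, args.output)


def build_parser():
    parser = argparse.ArgumentParser(prog="convert.py")
    subparsers = parser.add_subparsers(dest="format")
    subparsers.required = True  # the output format must always be given
    for name, handler in (("galaxy", convert_galaxy), ("cwl", convert_cwl)):
        sub = subparsers.add_parser(name)
        sub.add_argument("-i", dest="input", required=True)
        sub.add_argument("-o", dest="output", required=True)
        sub.set_defaults(handler=handler)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return args.handler(args)


print(main(["cwl", "-i", "tool.ctd", "-o", "tool.cwl"]))
```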

Unable to parse CTD from Seqan 2.4.0

Even when the missing ctdVersion is added, CTDOpts gets grumpy with empty values like the one in this snippet from flexbar --write-ctd using Seqan 2.4.0:

<ITEM name="barcode-tail-length" value="" type="int" description="Region size in tail trim-end modes. Default: barcode length." required="false" advanced="true" />
(ctd-converter) mcrusoe@mrcdev:~$ python ~/src/CTDConverter/convert.py cwl  -i ~/debian/flexbar/flexbar.ctd -o f.c
INFO: Using cwl converter
WARNING: Validation against a schema has not been enabled.
INFO: Parsing /home/mcrusoe/debian/flexbar/flexbar.ctd
Traceback (most recent call last):
  File "/home/mcrusoe/src/CTDConverter/convert.py", line 215, in main
    converter.get_preferred_file_extension())
  File "/home/mcrusoe/src/CTDConverter/common/utils.py", line 118, in parse_input_ctds
    parsed_ctds.append(ParsedCTD(CTDModel(from_file=input_ctd), input_ctd, output_file))
  File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 643, in __init__
    self._load_from_file(from_file)
  File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 688, in _load_from_file
    self.parameters = self._build_param_model(params_container_node, base=None)
  File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 708, in _build_param_model
    self._build_param_model(child, current_group)
  File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 713, in _build_param_model
    base.add(**setup)  # register parameter in model
  File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 564, in add
    self.parameters[name] = Parameter(name, self, **kwargs)
  File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 376, in __init__
    self._validate_numerical_defaults(default)
  File "/home/mcrusoe/miniconda2/envs/ctd-converter/lib/python2.7/site-packages/CTDopts/CTDopts.py", line 456, in _validate_numerical_defaults
    "default": ', '.join(map(str, errors_so_far))})
ModelParsingError: An error occurred while parsing the CTD file: Invalid default value(s) provided for parameter barcode-min-overlap of type <type 'int'>: ''
ERROR: There seems to be a problem with one of your input CTDs.
Traceback (most recent call last):
  File "/home/mcrusoe/src/CTDConverter/convert.py", line 272, in <module>
    sys.exit(main())
  File "/home/mcrusoe/src/CTDConverter/convert.py", line 234, in main
    utils.error("Reason: " + e.msg, 0)
AttributeError: 'ModelParsingError' object has no attribute 'msg'
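
The second traceback comes from the error handler itself, which assumes the exception has a `msg` attribute. A sketch of a more defensive handler follows; the `ModelParsingError` class here is a local stand-in for the CTDopts exception, and `describe_error` is a hypothetical helper.

```python
class ModelParsingError(Exception):
    """Local stand-in for the CTDopts exception of the same name."""


def describe_error(error):
    """Return a message for any exception, with or without a .msg attribute."""
    return "Reason: " + str(getattr(error, "msg", error))


try:
    raise ModelParsingError("Invalid default value(s) provided")
except ModelParsingError as e:
    print(describe_error(e))  # Reason: Invalid default value(s) provided
```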

Tagging in @h-2

CWL: \n is not valid YAML, nor do URLs need wrapping in <>

From lambda --write-ctd

<?xml version="1.0" encoding="UTF-8"?>
<tool name="Lambda" version="1.0.2 (Git commit )" docurl="http://www.seqan.de" category="" >
        <executableName>lambda</executableName>
        <description>the Local Aligner for Massive Biological DatA</description>
        <manual>Lambda is a local aligner optimized for many query sequences and searches in protein space. It is compatible to BLAST, but much faster than BLAST and many other comparable tools.
Detailed information is available in the wiki: &lt;https://github.com/seqan/lambda/wiki&gt;
</manual>

produces via convert.py cwl

#!/usr/bin/env cwl-runner

# This CWL file was automatically generated using CTDConverter.
# Visit https://github.com/WorkflowConversion/CTDConverter for more information.

baseCommand: lambda
class: CommandLineTool
cwlVersion: v1.0
doc: "Lambda is a local aligner optimized for many query sequences and searches in\
  \ protein space. It is compatible to BLAST, but much faster than BLAST and many\
  \ other comparable tools.\nDetailed information is available in the wiki: <https://github.com/seqan/lambda/wiki>\n\
  \n\n\nFor more information, visit http://www.seqan.de"

which could be written as

#!/usr/bin/env cwl-runner

# This CWL file was automatically generated using CTDConverter.
# Visit https://github.com/WorkflowConversion/CTDConverter for more information.

baseCommand: lambda
class: CommandLineTool
cwlVersion: v1.0
doc: |
  Lambda is a local aligner optimized for many query sequences and searches in
  protein space. It is compatible to BLAST, but much faster than BLAST and many
  other comparable tools.
  Detailed information is available in the wiki: https://github.com/seqan/lambda/wiki
 
  For more information, visit http://www.seqan.de
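
A sketch of how such a block could be produced without a YAML library: strip the angle brackets that only wrap a URL, then indent the text under a literal block scalar (`|`). `yaml_doc_block` is a hypothetical helper, and it assumes no line of the text starts with whitespace.

```python
import re


def yaml_doc_block(text, key="doc"):
    """Format text as a YAML literal block scalar (key: |)."""
    # Drop angle brackets that only wrap a URL.
    text = re.sub(r"<(https?://[^>\s]+)>", r"\1", text)
    lines = [line.rstrip() for line in text.strip().splitlines()]
    # Indent non-empty lines by two spaces; keep blank lines empty.
    body = "\n".join("  " + line if line else "" for line in lines)
    return "%s: |\n%s\n" % (key, body)


manual = (
    "Detailed information is available in the wiki: "
    "<https://github.com/seqan/lambda/wiki>\n\n"
    "For more information, visit http://www.seqan.de"
)
print(yaml_doc_block(manual))
```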

Confusing handling of booleans

On the CTD side, you can have plain booleans, but a plain old boolean won't let you set a restriction on it with choices.

This is because with type=bool and choices=[a, b], we run into this bit of code: https://github.com/erasche/CTDopts/blob/master/CTDopts/CTDopts.py#L302 which tries to join the values. The values are previously mapped to param.type, meaning everything is coerced into a boolean, which then doesn't get joined properly.

I've found that by using type=str and choices=['true', 'false'], I can get this script to output what I want in galaxy XML, i.e. (truevalue="--param-name" falsevalue=""), however that's not very user-friendly.

Booleans in command string could be more succinct

For the xml:

<param name="param_additional_termini" type="boolean" 
  truevalue="-additional:termini" falsevalue="" optional="True" 
  label="Include charge at N and C terminus" help="(-termini) "/>

Currently:

#if $param_additional_termini:
  -additional:termini
#end if

Expected:

$param_additional_termini
