More things I done learned about Galaxy tool development

2012-01-01

A potpourri of titbits that are probably documented somewhere, but weren't obvious to me.

Labels on the tool menu

You can't nest sections on the tool panel, but you can put labels in sections for the same effect. Look at tool_conf.xml and the "NGS: QC and manipulation" section for an example. A label is a single self-closing tag:

<label text="Illumina data" id="illumina" />

More / advanced options

The fasta groomer tool is a good demo of how to implement a dropdown to expose or hide extra options for a tool. To summarize, in your parameters have a conditional section with a selector, and then a when for each value of the selector:

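<!-- the name and label attributes and the "less" option text are illustrative -->
<conditional name="options">
<param name="selector" type="select" label="Options">
<option value="less">Fewer options</option>
<option value="more">More options</option>
</param>
<when value="less">
<!-- no options -->
</when>
<when value="more">
<!-- many parameters -->
</when>
</conditional>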

File formats

The correct file format for text files is txt (no 'e'). Interestingly, you can give a file a bogus or incorrect format (e.g. format="fsta") and no error is generated. The only apparent symptom is that when you click on the eye icon to view the file, it downloads instead.

Errors in the tool config file

Curiously, not all errors in an individual tool config file are picked up. I had a situation where I'd failed to nest <configfile> correctly inside <configfiles> and no error was signaled: the element just wasn't used. So beware: just because Galaxy doesn't complain about the file doesn't mean it's right.
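Since Galaxy stays quiet about this particular mistake, a small check of your own can catch it before you load the tool. The following is only a rough sketch using Ruby's REXML library: the method name misplaced_configfiles and the sample tool XML are made up for illustration, and it checks this one nesting rule only, not the tool config format in general.

```ruby
require "rexml/document"

# Return every <configfile> element whose parent is not <configfiles>.
def misplaced_configfiles(tool_xml)
  doc = REXML::Document.new(tool_xml)
  REXML::XPath.match(doc, "//configfile").reject do |el|
    el.parent.name == "configfiles"
  end
end

# Hypothetical tool config containing the nesting mistake described above.
BAD_TOOL = <<~XML
  <tool id="gene_finder" name="Gene finder">
    <command>gene_finder.rb $cfg</command>
    <configfile name="cfg">some text</configfile>
  </tool>
XML

# Report each offending element on stderr.
misplaced_configfiles(BAD_TOOL).each do |el|
  warn "configfile '#{el.attributes["name"]}' is not inside <configfiles>"
end
```

Run against a correctly nested file, the method returns an empty list and nothing is reported.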

Config files

These are very useful ... just not always for the things you might imagine.

A config file is unfortunately named, because it actually means a file that is generated by the tool config file. You can't set the name of a config file to a hardcoded name that is expected by your executable (e.g. "myprog.cfg"); it will be something arbitrary. Therefore the config file path must be passed on the commandline. In effect, it's another input file, just one that is generated for that invocation of the tool.

Which brings us to a wider issue in Galaxy tool development: passing parameters. The need to pass all input and output file names to a Galaxy tool means that your commandline can get astonishingly overloaded and you have to write parsing code to sort the options out, so it all becomes more complicated than it should be. Config files offer a quick way around all this mess. The example below uses Ruby, but variants should be possible in most other scripting languages.

In the tool config file, dump all the parameters, file names etc. into a config file, in the form of a hash:

<configfile name="gene_finder_cfg">
# arguments for the gene_finder script
args = {
 :input_files => [
 #for $in_seq in $input_seq_files
 { :path => "$in_seq.in", :title => "$in_seq.in.name", },
 #end for
 ],
 :ref_files => [
 #for $ref_seq in $ref_seq_files
 { :path => "$ref_seq.in", :title => "$ref_seq.in.name", },
 #end for
 ],
 ...
}
</configfile>

Also write the tool commandline so that it passes the config file path:

<command interpreter="ruby"> gene_finder.rb $gene_finder_cfg </command>

In the script, grab the config file and read it and evaluate it:

config_file = ARGV[0]
options = eval(File.read(config_file))

All your parameters are now in "options". Easy.
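To make the script side concrete, here is a minimal, self-contained sketch of the same idea. It is an illustration under assumptions, not the actual gene_finder script: the rendered config text, the dataset paths and titles, and the helper names load_options and demo are all invented, and a temporary file stands in for the file Galaxy would generate.

```ruby
require "tempfile"

# Stand-in for what Galaxy might render from the <configfile> template.
# The paths and titles are made up for illustration.
RENDERED_CONFIG = <<~CFG
  # arguments for the gene_finder script
  args = {
    :input_files => [
      { :path => "/tmp/dataset_1.dat", :title => "sample A", },
      { :path => "/tmp/dataset_2.dat", :title => "sample B", },
    ],
    :ref_files => [
      { :path => "/tmp/dataset_3.dat", :title => "reference", },
    ],
  }
CFG

# Read the config file and evaluate it. The value of the last expression
# in the file (the hash assigned to args) is what eval returns.
def load_options(path)
  eval(File.read(path))
end

# Write the rendered config to a temporary file, then load it the way the
# tool script would load ARGV[0].
def demo
  Tempfile.create("gene_finder_cfg") do |f|
    f.write(RENDERED_CONFIG)
    f.flush
    load_options(f.path)
  end
end

options = demo
options[:input_files].each { |f| puts "#{f[:title]}: #{f[:path]}" }
```

Bear in mind that eval runs whatever Ruby code the file contains, and that parameter values typed in by users end up inside that file, so it is worth being careful about what the template lets through.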

Tabular data

Tabular data must be uploaded as .tsv, i.e. with tab-separated columns. Galaxy doesn't understand CSV.
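If your data starts out as CSV, it needs converting before upload. A rough sketch in Ruby using the standard csv library follows; the method name csv_to_tsv and the sample data are made up, and it assumes no field contains a tab or newline character.

```ruby
require "csv"

# Convert CSV text to tab-separated text.
# Assumes no field contains a tab or newline character.
def csv_to_tsv(csv_text)
  CSV.parse(csv_text).map { |row| row.join("\t") }.join("\n") + "\n"
end

# Made-up sample with a quoted field containing a comma.
sample = "gene,note\nabc1,\"kinase, putative\"\n"
puts csv_to_tsv(sample)
```

Using a real CSV parser rather than splitting on commas means quoted fields that contain commas survive the conversion intact.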

Reference data

Many tools will require reference data: long-lasting data that is not analysed itself, but is used in the analysis of other data, changes infrequently and is usually shared. While there are some clear mechanisms for some of the genomic tools, the general idiom for new sorts of reference data is unclear. Two obvious mechanisms suggest themselves:

Hardcode it into the tool and put a reference data file in the tool data directory. But this means that the tool can only use one reference (e.g. not one from a selection). Updating means replacing the data file, which is easy, but the change is not made visible to the user.
Alternatively, you could just upload the reference file and share it. This would allow multiple reference files to choose from ("I need to run this against the E. coli set, not the yeast one") and allow easy updates and explicit versioning ("v Nov 2011"). It may prove useful semantically to subtype the tables.

Repeating inputs

Repeating (i.e. a variable number of) inputs can easily be configured like so:

<inputs>
<repeat name="maf_filters" title="Filter">
<param name="in" type="data" format="maf" label="MAF File"/>
</repeat>
</inputs>

That is, a variable number of MAF files are selected as inputs. Where an easy mistake can be made is when configuring the commandline. You might do this:

<command> myawesomescript #for $m in $maf_filters: $m #end for </command>

This will generate an uninformative error message. The problem is that $m is the value for one iteration of the repeat, not the param itself. To get at the param, you have to qualify it with the name of the param:

<command> myawesomescript #for $m in $maf_filters: $m.in #end for </command>

The reason for this is more obvious when you are looping over multiple things:

<inputs>
<repeat name="maf_filters" title="Filter">
<param name="in" type="data" format="maf" label="MAF File"/>
<param name="in_foo" type="text" label="What foo value?" value="bar"/>
</repeat>
</inputs>