soba's Introduction

Description:

The Sequence Ontology Bioinformatics Analysis command line tool (SOBAcl) will generate a variety of tables, graphs and reports from the data in GFF3 files and format the output in a variety of ways. Selections can be specified which will cause a subset of the data within files to be used to generate the output. The sections below give descriptions of the major features of SOBAcl. Detailed descriptions of each of the command line options will be printed by running SOBAcl --help. And finally, the shell script t/sobacl_test.sh has a lot of working examples of most of the major functionality. The command line examples given in this file can also be executed in the t directory.

Web Based Version

This tool is also available for web-based analysis here.

Tables:

SOBAcl allows you to flexibly generate data tables by specifying the data that is to be presented in each row of the table, how the summarized data should be grouped in each column of the table, and what type of summary statistic should be reported in each cell of the table. All output for tables is printed to STDOUT.

Examples of tables include:

  • Create a table of the count of every feature type grouped by file for multiple files.
  • Create a table of the mean length of all features in a file.
  • Create a table of the CDS footprint of all mRNAs by chromosome.

To create the above examples you would specify the rows, columns, data and data_type. For example the command line to generate a table of the count of every feature type grouped by file for multiple files would look like this - try it:

SOBAcl --columns file --rows type --data type --data_type count   \
  data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff 
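To make the rows/columns/data/data_type idea concrete, here is a minimal Python sketch of the same kind of summary: a count of each feature type (rows) grouped by file (columns). It is not SOBAcl's implementation. The two tiny inline GFF3 strings, the file names a.gff and b.gff, and the helper names feature_types and count_table are all hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical miniature inputs standing in for two GFF3 files;
# real files would be read from disk.
FILES = {
    "a.gff": (
        "##gff-version 3\n"
        "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n"
        "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1\n"
        "chr1\tsrc\texon\t1\t50\t.\t+\t.\tParent=m1\n"
        "chr1\tsrc\texon\t60\t100\t.\t+\t.\tParent=m1\n"
    ),
    "b.gff": (
        "##gff-version 3\n"
        "chr2\tsrc\tgene\t5\t80\t.\t-\t.\tID=g2\n"
        "chr2\tsrc\texon\t5\t80\t.\t-\t.\tParent=g2\n"
    ),
}


def feature_types(gff_text):
    """Yield column 3 (type) of every feature line, skipping comments."""
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        yield line.split("\t")[2]


def count_table(files):
    """Return table[type][file] = number of features of that type."""
    table = defaultdict(Counter)
    for name, text in files.items():
        for ftype in feature_types(text):
            table[ftype][name] += 1
    return table


table = count_table(FILES)
columns = sorted(FILES)

# Print a tab-delimited table: one row per type, one column per file.
print("type\t" + "\t".join(columns))
for row in sorted(table):
    print(row + "\t" + "\t".join(str(table[row][c]) for c in columns))
```

Swapping the counter for a different summary statistic (for example a mean of feature lengths) corresponds to changing --data and --data_type on the SOBAcl command line.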

Tables can be output in 4 different formats: text, tab, html and html_page. The text format (used in the example above) produces an ASCII table that is aligned for viewing with fixed-width fonts such as on a terminal screen. The tab format produces tab-delimited files appropriate for import into a spreadsheet application such as Excel or for post-processing with other command line tools like perl or awk. The html format will produce just the table in html format (for inclusion into a larger html page) and the html_page format will create an entire html page appropriate for viewing in a browser.

Graphs:

SOBAcl will produce a variety of graphs: lines, bars, hbars (horizontal bar chart), points, linespoints, area, and pie. For each of these graphs the --x_axis [rows] parameter specifies the data that becomes the x-axis values. The --series [columns] parameter defines the values that become the data series. The --y_axis [data] parameter determines the data used for the y-axis and the --data_type parameter specifies the summary statistic used to report the y-axis value. The name of the output file is specified on the command line with the --output parameter. The --collapse parameter determines whether each series will be presented on a separate chart (this is always the case for pie charts) or whether series data will be collapsed onto a single chart.

The GD graphics library is used to create the graphs and parameters can be passed to the perl module GD::Graph as key=value pairs to the --gd parameter.

Examples of graphs include:

  • Create a graph of the mean length of every feature by chromosome. Each feature would appear on the x-axis, the mean length would be the y-axis and a different bar would appear for each chromosome for every feature on the x-axis.
  • Create a pie chart based on the footprint of each type of feature in the file.
  • Create a line graph of the total length of CDSs on each chromosome.

To create a graph you would specify the x-axis, y-axis, series and data_type. For example the command line to generate a graph of the mean length of every feature by chromosome would be written:

SOBAcl --series seqid --x_axis type --y_axis length --data_type mean \
       --layout bars --gd width=600 --output seqid_type_hbars        \
       --format png data/dmel-all-r5.30_0001000.gff
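The data behind such a graph can be pictured as one series per seqid, with one mean length per feature type along the x-axis. The Python sketch below computes that structure from a small inline GFF3 string. It is an illustration, not SOBAcl's code: the sample data and the function name mean_length_series are hypothetical, and feature length is assumed to be end - start + 1 (GFF3 coordinates are 1-based and inclusive).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data; a real run would read a GFF3 file.
GFF3 = (
    "##gff-version 3\n"
    "2L\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n"
    "2L\tsrc\tgene\t201\t500\t.\t+\t.\tID=g2\n"
    "2L\tsrc\texon\t1\t50\t.\t+\t.\tParent=g1\n"
    "2R\tsrc\tgene\t11\t60\t.\t-\t.\tID=g3\n"
    "2R\tsrc\texon\t11\t30\t.\t-\t.\tParent=g3\n"
)


def mean_length_series(gff_text):
    """Return series[seqid][type] = mean feature length."""
    lengths = defaultdict(lambda: defaultdict(list))
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        seqid, _source, ftype, start, end = line.split("\t")[:5]
        # Assumed length definition: inclusive 1-based coordinates.
        lengths[seqid][ftype].append(int(end) - int(start) + 1)
    return {
        seqid: {ftype: mean(values) for ftype, values in by_type.items()}
        for seqid, by_type in lengths.items()
    }


series = mean_length_series(GFF3)

# The x-axis is the sorted set of feature types seen in any series.
x_axis = sorted({ftype for by_type in series.values() for ftype in by_type})
for seqid in sorted(series):
    print(seqid, [series[seqid].get(ftype) for ftype in x_axis])
```

Each printed list holds the y-axis values for one series, in x-axis order; a plotting library would draw one bar per value.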

Graphviz Graphs:

If a --layout type of graphviz is specified, a graphical view of the directed acyclic graph (DAG) of the ontology terms found in the file and the relationships between them will be generated. These graphs are generated by the Graphviz library and you can pass parameters to the Perl module GraphViz as key=value pairs via the --gv, --gv_node and --gv_edge parameters. This allows you to control many details (the color, shape, etc.) of the graph as well as its nodes and edges.

Reports:

Some types of data are too complex to be easily summarized by charts or tables. SOBAcl provides built in reports to describe the data in GFF3 files in complex ways. Currently only one report is provided, but more will be developed in the future. The report currently supported is a count of all of the attributes in a file grouped by feature source and type. You can generate this report with the following command line:

SOBAcl --report attributes --format html_page \
     data/dmel-all-r5.30_0001000.gff
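As a rough picture of what this report summarizes, the Python sketch below counts attribute keys from column 9 of each feature, grouped by source and then by type. It is only an illustration under stated assumptions: the inline sample data, the source names srcA and srcB, and the function name attribute_counts are hypothetical, and SOBAcl's own parsing may differ.

```python
from collections import Counter, defaultdict

# Hypothetical sample data with two sources and a few attributes.
GFF3 = (
    "##gff-version 3\n"
    "chr1\tsrcA\tgene\t1\t100\t.\t+\t.\tID=g1;Name=abc\n"
    "chr1\tsrcA\tgene\t200\t300\t.\t+\t.\tID=g2\n"
    "chr1\tsrcA\texon\t1\t50\t.\t+\t.\tParent=g1\n"
    "chr1\tsrcB\texon\t60\t90\t.\t+\t.\tParent=g1;Note=x\n"
)


def attribute_counts(gff_text):
    """Return att_count[source][type][attribute_key] = occurrences."""
    att_count = defaultdict(lambda: defaultdict(Counter))
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        source, ftype, attributes = cols[1], cols[2], cols[8]
        # Column 9 holds semicolon-separated key=value pairs.
        for pair in attributes.split(";"):
            if pair:
                key = pair.split("=", 1)[0]
                att_count[source][ftype][key] += 1
    return att_count


att_count = attribute_counts(GFF3)
for source in sorted(att_count):
    for ftype in sorted(att_count[source]):
        for key, n in sorted(att_count[source][ftype].items()):
            print(source, ftype, key, n, sep="\t")
```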

Custom Templates:

All text output from SOBAcl is generated by Perl's Template Toolkit templating system. You can pass a custom template on the command line to generate fully customized output. Template Toolkit offers a simple but very powerful templating syntax. SOBAcl processes a GFF3 file, prepares the data in various ways and then passes variables containing those data to the template engine. Those variables and the data they contain are then available within the custom templates that you write. The contents of these variables include all of the parameter values that you specified on the command line to SOBAcl, as well as the data from the file summarized as you specified on the command line. Within your template you can create loops to iterate through the data, filter, transform or format the data in various ways and finally include the data in your output. A SOBAcl command line and a very simple template are given below. See the excellent documentation provided on the Template Toolkit website and have a look at the templates provided in the templates directory of this distribution to learn more.

SOBAcl --columns file --rows type --data length --data_type mean \
       --layout table --format tab                               \
       --template soba_custom_template_text.tt                   \
       data/dmel-all-r5.30_0001000.gff                           \
       data/dmel-all-r5.30_0010000.gff

My Template
[% title %]
[%- FOR row = data_set.keys.sort -%]
[%- FOR column = data_set.$row.keys.sort -%]
[%- data_set.$row.$column.stats.count -%]
[%- "\t" -%]
[%- data_set.$row.$column.stats.$data_type %]
[%- loop.last ? "\n" : "\t" -%]
[%- END -%]
[%- END -%]

Command line parameters passed to SOBAcl are passed along to the templates. Within the templates the values of those parameters are available as template variables via the name of the parameter. Template Toolkit uses '[%' and '%]' to delimit templating commands. In the template example above the text "My Template" would be printed in the output unchanged while '[% title %]' would print the title passed to SOBAcl via the --title parameter. The template commands '[% FOR %]' and '[% END %]' define loops which will iterate over the data summarized by SOBAcl. You will need to have a pretty good understanding of Template Toolkit to take full advantage of custom templates.

The data processed by SOBAcl is passed to the template via the data_set variable. For tables this data structure is a hash. The hash structure is shown below (first as it would appear in Perl and then as it would be used in a template):

$data_set{$row}{$column}{stats}->mean
data_set.$row.$column.stats.mean

The data prepared by SOBAcl for the attributes report looks like this:

$data_set{att_count}{$source}{$type}{$attribute_key}
data_set.att_count.$source.$type.$key

Notice how the template uses the '.' for both hash references and method calls. The '$' sigil is used by the template system to disambiguate an interpolated variable from a hash key or method call.
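For readers more comfortable with a general-purpose language, the nested row/column/stats layout and the two loops of the example template can be mimicked in Python. This is a sketch under assumptions: the stats entry is modeled as a plain dictionary (in SOBAcl it is an object with methods such as mean), and the sample values, file names and the function name render are hypothetical.

```python
# Hypothetical stand-in for the data_set variable passed to a template:
# data_set[row][column]["stats"] holds the summary statistics.
data_set = {
    "gene": {
        "a.gff": {"stats": {"count": 1, "mean": 100.0}},
        "b.gff": {"stats": {"count": 1, "mean": 76.0}},
    },
    "exon": {
        "a.gff": {"stats": {"count": 2, "mean": 45.0}},
        "b.gff": {"stats": {"count": 1, "mean": 76.0}},
    },
}

# Plays the role of the --data_type value visible inside the template.
data_type = "mean"


def render(data_set, data_type):
    """Emit count and the chosen statistic per cell, one line per row."""
    lines = []
    for row in sorted(data_set):                 # FOR row = data_set.keys.sort
        cells = []
        for column in sorted(data_set[row]):     # FOR column = ...keys.sort
            stats = data_set[row][column]["stats"]
            cells.append(f"{stats['count']}\t{stats[data_type]}")
        # Cells are tab-separated; each row ends with a newline.
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


output = render(data_set, data_type)
print(output, end="")
```

Indexing with the variable data_type here corresponds to the '$' interpolation in data_set.$row.$column.stats.$data_type.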

See the templates in the template directory for more examples.

Databases:

GFF3 files may optionally be loaded into a database before being processed by SOBAcl, and SOBAcl can then use the database rather than a GFF3 flat file. This is most useful when complex SQL based data selection is needed as described below.

Feature Selection:

Sometimes you only want to report or summarize some of the data in a GFF3 file. You can do this with SOBAcl on the fly by using the --select parameter. The format of values passed to the --select parameter is borrowed from Perl's SQL::Abstract module. Briefly, you pass a Perl hash to SOBAcl. Hash keys specify GFF3 columns and hash values specify the value to constrain that column's data to. SQL::Abstract allows very complex queries that expose most SQL features via this interface. If you load your data into a SOBA database first (see above) you can exploit the full power of SQL::Abstract to build complex SQL queries from the command line. If you are processing a GFF3 text file rather than a database, only one level of key-value pairs is supported in the select parameter. The examples below show simple command lines that use a select statement:

SOBAcl --columns seqid --rows type --data length --data_type mean \
--layout table --format text --select 'type => ["ne", "contig"]'  \
data/refseq_short.gff3

SOBAcl --columns seqid --rows type --data length --data_type mean \
--layout table --format text                                      \
--select 'start => [">=", "1000"], end => ["<=", "1000000"]'      \
data/refseq_short.gff3

Note that both select values above use custom comparison operators. The following comparison operators are available:

== eq != ne < lt <= le > gt >= ge =~ !~

The '=~' and '!~' operators are only available when processing text files because SQL does not have robust regular expression support. The 'eq' operator is the default if none is given in the statement, in which case the array reference collapses to a simple value as seen in this select statement:

--select 'type => "contig"'
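The single-level selection described above can be sketched in Python as a dictionary of column constraints applied to each feature. This is not how SOBAcl parses --select. It assumes that the operators follow Perl's conventions (symbolic operators compare numerically, word operators compare as strings), that multiple keys are combined with AND, and that a bare value means 'eq'. The sample data and the names parse, matches and OPS are hypothetical.

```python
import operator
import re

COLUMNS = ["seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes"]


def _numeric(op):
    # Numeric comparison, as Perl's ==, !=, <, <=, >, >= perform.
    return lambda a, b: op(float(a), float(b))


def _string(op):
    # String comparison, as Perl's eq, ne, lt, le, gt, ge perform.
    return lambda a, b: op(str(a), str(b))


OPS = {
    "==": _numeric(operator.eq), "!=": _numeric(operator.ne),
    "<": _numeric(operator.lt), "<=": _numeric(operator.le),
    ">": _numeric(operator.gt), ">=": _numeric(operator.ge),
    "eq": _string(operator.eq), "ne": _string(operator.ne),
    "lt": _string(operator.lt), "le": _string(operator.le),
    "gt": _string(operator.gt), "ge": _string(operator.ge),
    "=~": lambda a, b: re.search(b, str(a)) is not None,
    "!~": lambda a, b: re.search(b, str(a)) is None,
}


def parse(gff_text):
    """Yield each feature line as a dict keyed by GFF3 column name."""
    for line in gff_text.splitlines():
        if line and not line.startswith("#"):
            yield dict(zip(COLUMNS, line.split("\t")))


def matches(feature, select):
    """True if the feature satisfies every column constraint (AND)."""
    for column, condition in select.items():
        if isinstance(condition, (list, tuple)):
            op_name, wanted = condition
        else:
            op_name, wanted = "eq", condition  # bare value defaults to eq
        if not OPS[op_name](feature[column], wanted):
            return False
    return True


# Hypothetical sample data.
GFF3 = (
    "##gff-version 3\n"
    "chr1\tsrc\tcontig\t1\t2000000\t.\t+\t.\tID=c1\n"
    "chr1\tsrc\tgene\t1500\t3000\t.\t+\t.\tID=g1\n"
    "chr1\tsrc\tgene\t500\t900\t.\t+\t.\tID=g2\n"
    "chr1\tsrc\texon\t1500\t1700\t.\t+\t.\tParent=g1\n"
)
features = list(parse(GFF3))

not_contig = [f for f in features if matches(f, {"type": ["ne", "contig"]})]
in_window = [f for f in features
             if matches(f, {"start": [">=", "1000"], "end": ["<=", "1000000"]})]
only_contig = [f for f in features if matches(f, {"type": "contig"})]
print(len(not_contig), len(in_window), len(only_contig))
```

The three selections mirror the three --select values shown above: exclude contigs, keep features inside a coordinate window, and keep only contigs.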

soba's People

Contributors

srynobio, barrymoore

soba's Issues

SO_Tree: Add SVG support for sequenceontology.org/miso

The current pictures on SO.org are basically unreadable. Here is a fresh example (from http://www.sequenceontology.org/miso/current_release/term/SO:0000375):

image

Instead of doing the old higher-res image thing, we could make this image even better by just taking the SVG vector output from GraphViz. The SVG format is widely supported on the web, and it tends to produce much smaller images.

(I don't know who to talk to for actually switching the web miso thing to SVG after soba has it though.)
