java-genomics-toolkit's Introduction

Java Genomics Toolkit

This is a collection of applications for genomics data processing, primarily for high-throughput next-generation sequencing data. There is a particular focus on processing data in Wiggle format: many other tools already cover SAM, BAM, FastQ, etc., while Wiggle/BigWig formats provide a compact way to store numerical data resulting from ChIP-seq and MNase-seq experiments. Common computations provided in this toolkit include adding, subtracting, dividing, multiplying, log-transforming, averaging, Z-scoring, and smoothing Wig files. There are also tools for processing MNase-seq (nucleosome) data, creating heatmap matrices, computing basic statistics on intervals in the genome, and k-means clustering.

All tools are designed to process data in pieces so that memory requirements never exceed ~1GB, regardless of genome size. Tools are intended to be modular, so that multiple tools can easily be strung together into ad hoc pipelines or workflows in Galaxy. For example, a common pipeline for our ChIP-seq experiments is: 1) map reads with bowtie, 2) calculate coverage of sequencing reads, 3) normalize by subtracting input, 4) Z-score the normalized coverage, 5) correlate replicates, 6) average multiple replicates, and 7) make a heatmap of the final result.

Tools may be run from the command line or from Galaxy (getgalaxy.org).

This toolkit requires Java 7, available at www.oracle.com/technetwork/java/javase/downloads/index.html.

Available Tools

For a list of available tools, see palpant.us/java-genomics-toolkit or search for java-genomics-toolkit in the Galaxy Tool Shed (toolshed.g2.bx.psu.edu) to see the tools in action.

Loading the Tools into Galaxy

One-click installation is available for your local Galaxy instance through the Galaxy Tool Shed (toolshed.g2.bx.psu.edu). Configuration files are provided for loading the applications into Galaxy manually. Unzip or check out the java-genomics-toolkit distribution into Galaxy’s “tools” folder, and add the supplied toolConf entries to your toolConf.xml file.

Command-Line Usage

Applications can be run on the command line, and the toolRunner.sh script is provided for convenience. toolRunner.sh sets up the correct classpath and allows tools to be run using their short name (e.g. converters.IntervalToWig). Calling any tool without arguments will display its help, as well as the missing mandatory arguments:

$ > ./toolRunner.sh wigmath.Add
$ Usage: <main class> [options] Input files
$   Options:
$   * -o, --output   Output file

Mandatory arguments are denoted with a (*).

Other tools require more input:

$ > ./toolRunner.sh ngs.Autocorrelation
$ Usage: <main class> [options]
$   Options:
$   * -i, --input    Input file
$   * -l, --loci     Genomic loci (Bed format)
$   -m, --max        Autocorrelation limit (bp)
$                    Default: 200
$   * -o, --output   Output file

Log transform a Wig file with base 2

$ > ./toolRunner.sh wigmath.LogTransform --input input.wig --base 2 --output output.log2.wig
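
As a rough illustration of the arithmetic behind this command, the sketch below applies a base-2 logarithm to a few values using the change-of-base identity. The class and method names are hypothetical and are not part of the toolkit. How the real tool treats zero or negative values is not stated here, so the sketch only shows what Java's Math.log does.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the toolkit's LogTransform implementation.
class LogTransformSketch {
    // Change of base: log_b(x) = ln(x) / ln(b)
    static double logBase(double value, double base) {
        return Math.log(value) / Math.log(base);
    }

    // Apply the transform to every value in a chunk of Wig data.
    static double[] transform(double[] values, double base) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = logBase(values[i], base);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] coverage = {1.0, 2.0, 8.0, 0.0};
        // In Java, Math.log(0.0) is -Infinity, so the last entry is -Infinity.
        System.out.println(Arrays.toString(transform(coverage, 2.0)));
    }
}
```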

List all available tools

$ > ./toolRunner.sh list

Downloading the toolkit

The recommended way to download the toolkit is to check out the source code with git:

$ > git clone https://github.com/timpalpant/java-genomics-toolkit.git

since then updates may be easily retrieved with

$ > git pull

To build the tools from source, run the Ant build script:

$ > ant

Precompiled binaries that include JRE7 and are ready-to-use are available for Linux i586 and x64 platforms from the downloads tab.

Adding new assemblies

By default, java-genomics-toolkit loads assembly information from chromosome length files in the resources/assemblies directory (or from tool-data resources if loaded into Galaxy). To use an assembly that is not included, you can either specify the full path to a custom *.len file (see the examples in the resources directory for the format), or copy your *.len file into the resources/assemblies directory and refer to it by its short name, e.g.

$ > ./toolRunner.sh ngs.BaseAlignCounts -i reads.bam -x 250 -a /path/to/my/sacCer4.len -o counts.wig

or

$ > ./toolRunner.sh ngs.BaseAlignCounts -i reads.bam -x 250 -a sacCer4 -o counts.wig

if sacCer4.len is available in the resources/assemblies/ directory.
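
For scripts that need to read such a file directly, a minimal sketch follows. It assumes the two-column, tab-separated layout (chromosome name, then length in base pairs) commonly used for Galaxy .len files; check the examples in the resources directory before relying on it. The class name and the sample lengths are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of java-genomics-toolkit.
class LenFileSketch {
    // Parse "name<TAB>length" lines into an ordered map, skipping blank lines.
    static Map<String, Long> parse(String contents) {
        Map<String, Long> lengths = new LinkedHashMap<>();
        for (String rawLine : contents.split("\r?\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                throw new IllegalArgumentException("Expected 2 tab-separated columns: " + line);
            }
            lengths.put(fields[0], Long.parseLong(fields[1].trim()));
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Made-up chromosome lengths, for illustration only.
        String example = "chrI\t1000\nchrII\t2500\n";
        System.out.println(parse(example));
    }
}
```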

Java Genomics IO

Those wishing to write their own scripts may be interested in github.com/timpalpant/java-genomics-io, the toolkit upon which these applications are built.

java-genomics-toolkit's People

Contributors

colindaven, timpalpant

java-genomics-toolkit's Issues

Bigwig conversion ?

Hi, thanks for the nice toolkit.

Is there any way of converting to bigwig (from either wig or bedgraph) within the toolkit? Is this planned?

Thanks,
Colin

null pointer exception when doing sum

Ians-MacBook-Air Thu Jun 04 12:33 java-genomics-toolkit $./toolRunner.sh wigmath.Add ~/mus_hive/pipeline_data/rnaseq/STAR_output/SPRETEiJ/ebi/liver/ERR476403/Signal.Unique.str1.out.renamed.bw ~/mus_hive/pipeline_data/rnaseq/STAR_output/SPRETEiJ/ebi/liver/ERR476405/Signal.Unique.str1.out.renamed.bw -o t
DEBUG - Executing setup operations
DEBUG - Initializing input files
INFO - Autodetected BigWig file type: /Users/ifiddes/mus_hive/pipeline_data/rnaseq/STAR_output/SPRETEiJ/ebi/liver/ERR476403/Signal.Unique.str1.out.renamed.bw
INFO - Autodetected BigWig file type: /Users/ifiddes/mus_hive/pipeline_data/rnaseq/STAR_output/SPRETEiJ/ebi/liver/ERR476405/Signal.Unique.str1.out.renamed.bw
DEBUG - Initialized 2 input files
DEBUG - Found 36 chromosomes in the intersection of all inputs
DEBUG - Initializing thread pool with 1 threads
DEBUG - Performing main computation
DEBUG - Processing chunk 1:3106778-13106777
Exception in thread "main" java.lang.NullPointerException
    at edu.unc.genomics.WigAnalysisTool.run(WigAnalysisTool.java:122)
    at edu.unc.genomics.CommandLineTool.instanceMain(CommandLineTool.java:64)
    at edu.unc.genomics.wigmath.Add.main(Add.java:71)

Correlate tool will always end in error

  1. In BigWigFileReader you unnecessarily reject any interval that is not entirely within the boundaries of the BigWig file:
     if (!includes(interval)) {
       throw new WigFileException("BigWigFile does not contain data for region: "+interval);
     }
  2. In Correlate you take the maximum extent of all the WigFiles, which can reach beyond the boundaries of an individual file:
     // Get the maximum extent for each chromosome

==> Boom!
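
One way to avoid the exception, sketched under assumptions, is to intersect the requested interval with the range each file actually covers before querying it. The names below are hypothetical and do not correspond to the toolkit's Interval or BigWigFileReader classes; coordinates are treated as 1-based and inclusive.

```java
// Hypothetical sketch; SimpleInterval is a stand-in, not the toolkit's Interval class.
class IntervalClampSketch {
    static final class SimpleInterval {
        final String chr;
        final int start; // 1-based, inclusive
        final int stop;  // 1-based, inclusive

        SimpleInterval(String chr, int start, int stop) {
            this.chr = chr;
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return chr + ":" + start + "-" + stop;
        }
    }

    // Return the part of `query` covered by `fileExtent`, or null if they do not overlap.
    static SimpleInterval clamp(SimpleInterval query, SimpleInterval fileExtent) {
        if (!query.chr.equals(fileExtent.chr)) {
            return null;
        }
        int start = Math.max(query.start, fileExtent.start);
        int stop = Math.min(query.stop, fileExtent.stop);
        return (start <= stop) ? new SimpleInterval(query.chr, start, stop) : null;
    }

    public static void main(String[] args) {
        SimpleInterval maxExtent = new SimpleInterval("chr1", 1, 2000);  // union over all files
        SimpleInterval thisFile = new SimpleInterval("chr1", 101, 1500); // one file's own range
        System.out.println(clamp(maxExtent, thisFile)); // prints chr1:101-1500
    }
}
```

Positions outside the clamped range would then have to be treated as missing for that file (for example as NaN), which is a design choice the report above does not settle.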

log4j

Warning - contains an old (but likely not vulnerable?) version of log4J.

java-genomics-toolkit/lib/BigWig.jar!/lib/log4j-1.2.15.jar contains Log4J-1.x <= 1.2.17 OLD
java-genomics-toolkit/lib/log4j-1.2.15.jar contains Log4J-1.x <= 1.2.17 OLD

Detected by
https://github.com/mergebase/log4j-detector

getDataVector loop incorrect

I don't understand why this was never reported, but I think this is wrong (in getDataVector):

int chunkStop = Math.min(chunkStart+DEFAULT_CHUNK_SIZE-1, stop);
// Take bin-sized chunks
chunkStop = (chunkStop-chunkStart)/stepSize;
Interval chunk = new Interval(chr, chunkStart, chunkStop);

You end up with a chunkStop that is smaller than chunkStart, for example:
Exception in thread "main" edu.unc.genomics.CommandLineToolException: edu.unc.genomics.io.WigFileException: BigWigFile does not contain data for region: chr1:3000100-199999
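
The numbers in the exception can be reproduced with a short sketch. It assumes DEFAULT_CHUNK_SIZE is 10,000,000 and stepSize is 50; neither value is given in the report, they are simply a combination that yields the 199999 seen in the message. The bin-aligned variant is one possible correction, not necessarily what the method was meant to do, and all names are hypothetical.

```java
// Hypothetical reproduction of the arithmetic; not the toolkit's getDataVector code.
class ChunkStopSketch {
    static final int DEFAULT_CHUNK_SIZE = 10_000_000; // assumed value

    // As reported: the result is a bin count, not a genomic coordinate.
    static int reportedChunkStop(int chunkStart, int stop, int stepSize) {
        int chunkStop = Math.min(chunkStart + DEFAULT_CHUNK_SIZE - 1, stop);
        return (chunkStop - chunkStart) / stepSize;
    }

    // One possible fix: trim the chunk to a whole number of bins, but keep it
    // in genomic coordinates by adding the offset back onto chunkStart.
    static int binAlignedChunkStop(int chunkStart, int stop, int stepSize) {
        int chunkStop = Math.min(chunkStart + DEFAULT_CHUNK_SIZE - 1, stop);
        int numBins = (chunkStop - chunkStart + 1) / stepSize;
        return chunkStart + numBins * stepSize - 1;
    }

    public static void main(String[] args) {
        int chunkStart = 3_000_100;
        int stop = 195_000_000; // assumed end of the requested region
        int stepSize = 50;      // assumed bin size
        System.out.println(reportedChunkStop(chunkStart, stop, stepSize));   // 199999
        System.out.println(binAlignedChunkStop(chunkStart, stop, stepSize)); // 13000099
    }
}
```

If fewer than stepSize bases remain in the region, numBins is 0 and the caller would still need to decide how to handle the leftover partial bin.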
