KmerStream

Streaming algorithm for computing k-mer statistics for massive genomics datasets.

Installation

To compile, run make.

Running

To see the usage message, run KmerStream without arguments:

KmerStream 1.1

Estimates occurrences of k-mers in fastq or fasta files and saves results

Usage: KmerStream [options] ... FASTQ files

-k, --kmer-size=INT      Size of k-mers, either a single value or comma separated list
-q, --quality-cutoff=INT Comma separated list, keep k-mers with bases above quality threshold in PHRED (default 0)
-o, --output=STRING      Filename for output
-e, --error-rate=FLOAT   Error rate guaranteed (default value 0.01)
-t, --threads=INT        Number of threads to use (default value 1)
-s, --seed=INT           Seed value for the randomness (default value 0, use time based randomness)
-b, --bam                Input is in BAM format (default false)
    --binary             Output is written in binary format (default false)
    --tsv                Output is written in TSV format (default false)
    --verbose            Print lots of messages during run
    --online             Prints out estimates every 100K reads
    --q64                set if PHRED+64 scores are used (@...h) default used PHRED+33

Options:

  • -k the k-mer size: an integer or a comma-separated list of integers, e.g. -k 31 or -k 31,47,63. Odd values behave better than even values.
  • -q optional quality cutoff values (comma-separated list, default 0); any k-mer containing a base below the threshold is discarded
  • -o filename where the output should be written
  • -e guaranteed error rate of the estimator (default 0.01, i.e. 1%); lower values increase memory usage
  • -t number of threads to use
  • -s KmerStream uses random hash functions to compute the statistics. To fix the hash functions for reproducibility, set the seed to a fixed value, e.g. -s 42. The default value 0 selects time-based randomness.
  • -b Input is in BAM format
  • --binary write output in binary format, which includes the data needed to run KmerStreamJoin; the output filename is used as a prefix, and the file containing the output is named PREFIX + _Q_0_k_31
  • --tsv Write output in TSV (tab separated values) format for easier parsing
  • --online prints estimates every 100K reads; see https://pmelsted.wordpress.com/2014/07/12/analyzing-data-while-downloading/ for example usage
  • --q64 quality values are encoded in PHRED+64 format rather than the default PHRED+33; use this if your quality characters range from @ to h rather than from ! to I
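The interaction between -q and --q64 can be illustrated with a minimal Python sketch. This is not KmerStream's implementation; the function names are hypothetical, and treating a base as passing when its score is greater than or equal to the cutoff is an assumption about the boundary case.

```python
from typing import List


def phred_scores(quality: str, q64: bool = False) -> List[int]:
    """Decode a FASTQ quality string into PHRED scores.

    PHRED+33 maps '!' to 0 and 'I' to 40; PHRED+64 maps '@' to 0 and 'h' to 40.
    """
    offset = 64 if q64 else 33
    return [ord(ch) - offset for ch in quality]


def kmers_passing_cutoff(seq: str, quality: str, k: int,
                         cutoff: int = 0, q64: bool = False) -> List[str]:
    """Return the k-mers of seq whose bases all meet the quality cutoff.

    Assumption: a base passes when score >= cutoff, so the default
    cutoff of 0 keeps every k-mer.
    """
    scores = phred_scores(quality, q64)
    kept = []
    for i in range(len(seq) - k + 1):
        # A single low-quality base discards every k-mer that overlaps it.
        if min(scores[i:i + k]) >= cutoff:
            kept.append(seq[i:i + k])
    return kept


if __name__ == "__main__":
    # The fifth base has quality '!' (score 0), so only the first two 3-mers survive.
    print(kmers_passing_cutoff("ACGTAC", "IIII!I", k=3, cutoff=20))
```

Under these assumptions, one low-quality base removes up to k overlapping k-mers, which is why a higher cutoff can noticeably reduce the number of k-mers counted.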

KmerStreamJoin

KmerStreamJoin 1.1

Creates union of many stream estimates

Usage: KmerStreamJoin -o output files ...
       KmerStreamJoin merged-file

-o, --output=STRING      Filename for output
    --verbose            Print output at the end

When run with the -o option, KmerStreamJoin takes a list of KmerStream binary output files (created with the --binary option of KmerStream) and creates a single binary output file that is equivalent to the result of a single KmerStream run on all of the input files. When the -o option is omitted, it prints the KmerStream result stored in the binary input file.

This utility is useful when the binary files are created by separate, distributed runs or when the estimates are computed incrementally.
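A distributed run could be planned as in the following Python sketch, which only builds the command lines and does not execute them. It uses the flags documented above; the function name, the part0, part1, ... prefixes, and the merged file name are hypothetical. Two assumptions are made: every partial run writes its binary output to PREFIX + _Q_0_k_31 with the actual k substituted (quality cutoff left at its default of 0), and all runs share a fixed seed so that their estimates are built with the same hash functions.

```python
from typing import List, Sequence


def plan_distributed_run(fastq_files: Sequence[str], merged: str,
                         k: int = 31, seed: int = 42) -> List[List[str]]:
    """Build one KmerStream command per input file plus a final KmerStreamJoin."""
    commands: List[List[str]] = []
    partial_outputs: List[str] = []
    for i, fastq in enumerate(fastq_files):
        prefix = f"part{i}"  # hypothetical output prefix
        commands.append(["KmerStream", "-k", str(k), "-s", str(seed),
                         "--binary", "-o", prefix, fastq])
        # Assumption: the binary file name follows the PREFIX + _Q_0_k_31
        # pattern from the option list, with quality cutoff 0 and the chosen k.
        partial_outputs.append(f"{prefix}_Q_0_k_{k}")
    # KmerStreamJoin -o OUTPUT FILES... merges the partial estimates.
    commands.append(["KmerStreamJoin", "-o", merged, *partial_outputs])
    return commands


if __name__ == "__main__":
    for cmd in plan_distributed_run(["a.fastq", "b.fastq"], "merged.bin"):
        print(" ".join(cmd))
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the binaries are on the PATH; running KmerStreamJoin on the merged file without -o should then print the combined result.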

KmerStreamEstimate.py

KmerStreamEstimate is a Python script that reads a TSV file (generated by KmerStream with --tsv) and estimates the genome size (G), error rate (e), and coverage (lambda).
