
nubeam's Introduction

Nucleotide be a matrix (Nubeam)

Nubeam is a reference-free approach to analyzing short sequencing reads. It represents nucleotides by matrices, transforms each read into a product of matrices, and assigns numbers to the read based on that product. A sequencing sample, which is a collection of reads, thus becomes a collection of numbers that form an empirical distribution. The genetic difference between samples is then quantified by the distance between their empirical distributions.
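
As a rough illustration of the idea, the sketch below turns a read into four numbers, one per nucleotide. The two 2x2 matrices, the use of the trace as the summary number, and the names M1, M0, matmul and read_to_numbers are all assumptions made for this example; the description above does not say which matrices or which scoring rule Nubeam itself uses.

```python
# Illustrative only: the matrices and the trace-based score are assumptions,
# not necessarily what Nubeam computes.
M1 = ((1, 1), (0, 1))  # hypothetical matrix for "this position is the nucleotide"
M0 = ((1, 0), (1, 1))  # hypothetical matrix for "this position is another nucleotide"


def matmul(a, b):
    """Multiply two 2x2 integer matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def read_to_numbers(read, alphabet="ACGT"):
    """Return one number per nucleotide in `alphabet` for the given read."""
    numbers = []
    for nuc in alphabet:
        product = ((1, 0), (0, 1))  # start from the identity matrix
        for base in read:
            product = matmul(product, M1 if base == nuc else M0)
        numbers.append(product[0][0] + product[1][1])  # trace of the product
    return tuple(numbers)


print(read_to_numbers("ACGTTGCA"))
```

Applying such a function to every read in a sample would give the collection of numbers whose empirical distribution is compared between samples.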

Compiling:

Dependency

zlib is required to compile. To install zlib, run the following commands:

wget https://www.zlib.net/zlib1211.zip
unzip zlib1211.zip
cd zlib-1.2.11/
./configure
make
sudo make install

Compile nubeam

Run the following commands:

wget --no-check-certificate --content-disposition https://github.com/daihang16/nubeam/archive/master.zip
unzip Nubeam-master.zip
cd Nubeam-master/
make

Usage:

./nubeam -h gives you the following messages:

./nubeam [qtf, rgc_beta, rgc_res, cad, cad2]

./nubeam qtf [-iodwSnfh]
compute quadruplets for reads in fastq format.
produces prefix.quad.gz (gc content is within) and prefix.quad.log.
-i : input filename
-o : output prefix
-d : length of the reads (default d=75).
-w : sliding window size (default w=d).
-S : sliding window step (default S=w).
-n : number of missing nucleotides allowed.
-f : value, plus 33 is the PHRED quality value of fastq reads.
-h : print this help

./nubeam rgc_beta [-ioh]
perform regression on gc contents from read quantification and output regression coefficients.
produces prefix.beta.log.
-i : input file name.
-o : output prefix.
-h : print this help

./nubeam rgc_res [-ioh beta]
regress out gc contents from read quantification and output residuals.
produces prefix.nogc.gz and prefix.nogc.log.
-i : input file name.
-beta : beta file name.
-o : output prefix.
-h : print this help

./nubeam cad [-iombh bf]
compute pairwise distances of a set; the inputs are nubeam qtf outputs.
produces prefix.cad.log.
-i : specifies input file which is output of nubeam.
-o : output prefix (prefix.log contains pairwise distance matrix)
-m : choice of methods: h2 (Hellinger distance), cos (Cosine dissimilarity).
-b : designating the number of bins per column of scores
-bf : the file describing how to partition the bins
-h : print this help

./nubeam cad2 [-ijombh bf]
compute pairwise cross distances between two sets; the inputs are nubeam qtf outputs.
produces prefix.cad2.log.
-i : specifies input file of first set, which is output of nubeam.
-j : specifies input file of second set, which is output of nubeam.
-o : output prefix (prefix.log contains pairwise distance matrix)
-m : choice of methods: h2 (Hellinger distance), cos (Cosine dissimilarity).
-b : designating the number of bins per column of scores
-bf : the file describing how to partition the bins
-h : print this help

Examples:

  • Quantify reads

    ./nubeam qtf -i S1.fq -o S1.fq -d 75 -a 0 -n 0 -f 0

    This quantifies the reads in the input file S1.fq with a read length of 75, an adaptor size of 0, and no N allowed in a read. The output file is S1.fq.quad.gz. It has six columns of numbers: the first four columns are the Nubeam quadruplets for the reads, and the last two columns are the GC counts for the reads.
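
To inspect the six-column output downstream, a small reader along the following lines may help. It is a sketch that assumes the columns are separated by whitespace, which is not stated above; the function name read_quad and the demo values are made up for illustration.

```python
import gzip
import io


def read_quad(stream):
    """Parse six-column rows: four quadruplet values, then two GC counts."""
    rows = []
    for line in stream:
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        quad = tuple(float(x) for x in fields[:4])
        gc = tuple(float(x) for x in fields[4:])
        rows.append((quad, gc))
    return rows


# Self-contained demo on made-up values, gzip-compressed in memory.
demo = gzip.compress(b"1.5 2.5 3.5 4.5 10 20\n0.5 0.25 0.75 1.0 7 9\n")
with gzip.open(io.BytesIO(demo), "rt") as handle:
    rows = read_quad(handle)
print(len(rows), rows[0])
```

For a real file, the same function could be called on gzip.open("S1.fq.quad.gz", "rt").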

  • Regress out GC content

    • Obtain regression coefficients

      First, concatenate all the output files produced by qtf:

      cat S1.fq.quad.gz S2.fq.quad.gz S3.fq.quad.gz > all.quad.gz

      Then calculate the regression coefficients for GC count:

      ./nubeam rgc_beta -i all.quad.gz -o all.quad

      The regression coefficients are in all.quad.beta.log.

    • Obtain residuals

      For each original output file produced by qtf, run:

      ./nubeam rgc_res -i S1.fq.quad.gz -beta all.quad.beta.log -o S1.fq.quad

      The residuals will be written to S1.fq.quad.nogc.gz.
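
The two-step procedure above (fit coefficients on the pooled data, then compute residuals per sample) can be sketched as follows. This assumes a simple linear model with an intercept and a single GC predictor, fitted by ordinary least squares; the actual model used by rgc_beta and rgc_res is not described here, and the function names and numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Subtract the fitted GC effect from each score."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Step 1: fit on pooled, made-up data (analogous to all.quad.gz).
gc_pooled = [10, 20, 30, 40]
score_pooled = [21.0, 41.0, 61.0, 81.0]
intercept, slope = fit_line(gc_pooled, score_pooled)

# Step 2: apply the pooled coefficients to one sample (analogous to S1).
gc_s1 = [15, 25]
score_s1 = [32.0, 50.0]
res_s1 = residuals(gc_s1, score_s1, intercept, slope)
print(intercept, slope, res_s1)
```

Fitting once on the pooled data and reusing the coefficients keeps the residuals of different samples on a common scale.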

  • Quantify pair-wise distance

    • Calculate within-group distances

      ./nubeam cad -o output -m h2 -b 10 -bf bin.txt -i S1.fq.quad.nogc.gz -i S2.fq.quad.nogc.gz -i S3.fq.quad.nogc.gz

      For n samples, the command calculates n(n-1)/2 Hellinger distances. The number of bins used to partition R^4 space is 10^4 (10 bins per column). If bin.txt exists, it will be used for partitioning; if not, the partitioning will be calculated and written to bin.txt. If the input files are too large, there may not be enough memory to calculate the bin partitioning file. To deal with this problem, you can down-sample the input files and use them to calculate a bin partitioning file first, and then use this bin partitioning file together with the original input files to calculate the distance matrix. The distance matrix is at the end of output.cad.log.
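
The pairwise computation can be pictured with the sketch below, which computes the Hellinger distance between bin counts for every pair of samples. It assumes the reads of each sample have already been counted per bin, uses the textbook definition sqrt(1 - sum(sqrt(p_i * q_i))), and works on made-up counts; whether nubeam uses exactly this normalization is an assumption.

```python
from itertools import combinations
from math import sqrt


def hellinger(counts_p, counts_q):
    """Hellinger distance between two histograms over the same bins."""
    total_p = sum(counts_p)
    total_q = sum(counts_q)
    # Bhattacharyya coefficient of the two normalized histograms.
    bc = sum(
        sqrt((p / total_p) * (q / total_q))
        for p, q in zip(counts_p, counts_q)
    )
    return sqrt(max(0.0, 1.0 - bc))  # clamp tiny negative rounding errors


# Made-up bin counts for three samples (four bins instead of 10^4).
samples = {
    "S1": [1, 1, 2, 0],
    "S2": [1, 1, 2, 0],
    "S3": [0, 0, 0, 4],
}

# n samples give n(n-1)/2 unordered pairs.
distances = {
    (a, b): hellinger(samples[a], samples[b])
    for a, b in combinations(sorted(samples), 2)
}
for pair, value in distances.items():
    print(pair, value)
```

Identical histograms give a distance of 0 and histograms with no shared bins give 1, which is why all samples need to be binned with the same partitioning.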

    • Calculate between-group distances

      ./nubeam cad2 -o output -m h2 -b 10 -bf bin.txt -i S1.fq.quad.nogc.gz -i S2.fq.quad.nogc.gz -i S3.fq.quad.nogc.gz -j S4.fq.quad.gz -j S5.fq.quad.gz

      For a group of n samples (specified by -i) and a group of m samples (specified by -j), the command calculates n*m Hellinger distances. The number of bins used to partition R^4 space is 10^4. Unlike in cad, the argument -bf is required in cad2, and the specified file bin.txt is used for partitioning. If you don't have a bin partitioning file, you need to calculate one using cad first. The distance matrix is at the end of output.cad2.log.
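
The n*m layout of the cross distances can be sketched as below, here using cosine dissimilarity (the cos method) defined as one minus the cosine similarity of two bin-count vectors. The function names and the counts are hypothetical, and the exact definition used by nubeam is an assumption.

```python
from math import sqrt


def cosine_dissimilarity(u, v):
    """One minus the cosine similarity of two non-zero count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cross_distances(group_i, group_j, dist):
    """Return an n x m matrix: rows follow group_i, columns follow group_j."""
    return [[dist(u, v) for v in group_j] for u in group_i]


# Made-up bin counts: three samples in the first group, two in the second.
group_i = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
group_j = [[1, 0, 0], [0, 0, 1]]

matrix = cross_distances(group_i, group_j, cosine_dissimilarity)
for row in matrix:
    print(row)
```

Both groups have to be counted over the same bins for the vectors to be comparable, which matches the requirement that cad2 be given an existing bin partitioning file.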

nubeam's People

Contributors

daihang16


nubeam's Issues

Missing dependencies in readme

It appears that the install instructions are incomplete. For example:

make
g++ -O2 -Wall -m64 -std=c++11  -c model.cpp  			
model.cpp:34:10: fatal error: gsl/gsl_rng.h: No such file or directory
 #include <gsl/gsl_rng.h>
          ^~~~~~~~~~~~~~~
compilation terminated.
Makefile:34: recipe for target 'model.o' failed
make: *** [model.o] Error 1

Appears to be solved with:

$ sudo apt-get install libgsl-dev

After that,

make
g++ -O2 -Wall -m64 -std=c++11  -c model.cpp  			
g++ -static-libgcc  -O2 -Wall -m64 -std=c++11  model.o  /usr/local/lib/libgsl.a /usr/local/lib/libgslcblas.a  -lm -lz  -o nubeam	
g++: error: /usr/local/lib/libgsl.a: No such file or directory
g++: error: /usr/local/lib/libgslcblas.a: No such file or directory
Makefile:38: recipe for target 'model' failed
make: *** [model] Error 1

Which can be solved with

sudo apt-get install libgsl-dev
sudo cp /usr/lib/x86_64-linux-gnu/libgsl* /usr/local/lib/

And then Nubeam installs correctly.

It would be great if non-sudo install instructions could be provided, e.g. by not hard-coding the location of GSL, since people without sudo may want to install that library in a different location.

A question

./nubeam qtf -i S1.fq -o S1.fq -d 75 -a 0 -n 0 -f 0
For metagenomic data, what is a good value for d?
Another question: what do the GC values in the last two columns of this file (xxxx.quad.log) actually mean?
