kmer-counting's Introduction

The code in this repository was written by Tay Hui Yi for CS4330 - Computational Methods in Bioinformatics, under Prof. Sung Wing Kin.

Introduction to kmers

kmers are substrings of length k contained within a biological sequence; for DNA, they are composed of the nucleotides A, T, C and G. The term kmer usually refers to all of the substrings of length k in a particular DNA sequence (Figure 1). Hence, a sequence of length L has (L-k+1) kmers. Over the four-letter DNA alphabet, the total number of possible kmers is 4^k. In bioinformatics, kmers are mainly used in computational genomics and sequence analysis.

Figure 1. The sequence ATGG has two 3-mers: ATG and TGG.
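
The definition above can be sketched in a few lines of Java. This is only an illustration and is not taken from kmer.java; the class name KmerExtraction and the method kmers are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: the names here are hypothetical
// and do not come from kmer.java.
final class KmerExtraction {

    // Returns every substring of length k, in order of position.
    static List<String> kmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<String> result = new ArrayList<>();
        // A sequence of length L yields L - k + 1 kmers (none if k > L).
        for (int i = 0; i + k <= sequence.length(); i++) {
            result.add(sequence.substring(i, i + k));
        }
        return result;
    }

    public static void main(String[] args) {
        // The sequence from Figure 1.
        System.out.println(kmers("ATGG", 3)); // prints [ATG, TGG]
    }
}
```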

Description of the Program

The kmer program counts all kmers in a fasta file and reports those that occur at least q times, where q is defined by the user. Counting is carried out with a memory-based approach (a Counting Bloom Filter). The program writes a new output text file containing each kmer that occurs at least q times together with its count.
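
As a rough illustration of the memory-based approach, the sketch below shows one way a counting Bloom filter can keep approximate kmer counts. It is not the code in kmer.java: the class name CountingBloomSketch, the table size, the number of hash functions and the hashing scheme (String.hashCode combined with an FNV-1a hash) are all assumptions made for this example.

```java
// Illustrative sketch only: names, sizes and hash functions are
// assumptions and do not come from kmer.java.
final class CountingBloomSketch {
    private final int[] counters;
    private final int numHashes;

    CountingBloomSketch(int size, int numHashes) {
        this.counters = new int[size];
        this.numHashes = numHashes;
    }

    // 32-bit FNV-1a hash, used here as a second, independent hash.
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h;
    }

    // Double hashing: the i-th index is derived from two base hashes.
    private int index(String kmer, int i) {
        int h1 = kmer.hashCode();
        int h2 = fnv1a(kmer) | 1; // force an odd step so it is never zero
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    // Increments every counter that the kmer hashes to.
    void add(String kmer) {
        for (int i = 0; i < numHashes; i++) {
            counters[index(kmer, i)]++;
        }
    }

    // The smallest counter is the estimate. Collisions can only inflate
    // it, so it is never below the true count but may exceed it.
    int estimate(String kmer) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < numHashes; i++) {
            min = Math.min(min, counters[index(kmer, i)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountingBloomSketch filter = new CountingBloomSketch(1 << 20, 3);
        String[] kmers = {"ATG", "CCA", "ATG", "ATG"};
        for (String kmer : kmers) {
            filter.add(kmer);
        }
        int q = 3; // user-defined threshold
        System.out.println("ATG reaches q: " + (filter.estimate("ATG") >= q));
    }
}
```

Because the estimate can only overcount, a kmer that truly occurs at least q times is never missed by this kind of filter, but a rarer kmer may occasionally be reported as well.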

Note

  • Only the canonical form of each kmer is counted, where the canonical form is the alphabetically smaller of the kmer and its reverse complement.
  • An n or N in a sequence indicates a missing nucleotide; kmers containing one are disregarded during kmer counting.
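
The two notes above can be sketched as follows. This is a minimal example rather than the code in kmer.java: the class CanonicalKmer and its method names are hypothetical, and the input is assumed to use upper-case A, T, C and G apart from the n or N placeholder.

```java
// Illustrative sketch only: the names here are hypothetical
// and do not come from kmer.java.
final class CanonicalKmer {

    // Complement of a single upper-case base.
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:
                throw new IllegalArgumentException("Unexpected base: " + base);
        }
    }

    // Reads the kmer backwards and complements each base.
    static String reverseComplement(String kmer) {
        StringBuilder sb = new StringBuilder(kmer.length());
        for (int i = kmer.length() - 1; i >= 0; i--) {
            sb.append(complement(kmer.charAt(i)));
        }
        return sb.toString();
    }

    // The alphabetically smaller of the kmer and its reverse complement.
    static String canonical(String kmer) {
        String rc = reverseComplement(kmer);
        return kmer.compareTo(rc) <= 0 ? kmer : rc;
    }

    // True if the kmer contains a missing nucleotide and should be skipped.
    static boolean hasMissingBase(String kmer) {
        return kmer.indexOf('N') >= 0 || kmer.indexOf('n') >= 0;
    }

    public static void main(String[] args) {
        for (String kmer : new String[] {"ATG", "TGG", "ANG"}) {
            if (hasMissingBase(kmer)) {
                System.out.println(kmer + " skipped");
            } else {
                System.out.println(kmer + " -> " + canonical(kmer));
            }
        }
        // prints:
        // ATG -> ATG
        // TGG -> CCA
        // ANG skipped
    }
}
```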

Brief Instructions

  1. Ensure that the input fasta files containing the DNA sequence(s) are formatted in the following manner.
     >s1 
     ATCGCTG...
     >s2
     ATCGCTG...
  2. Download kmer_counting.zip and unzip it in the directory containing the input fasta files.
  3. Manually change all paths in the main method of kmer.java to the location containing the input fasta files.
  4. To run the program, run the following commands in your terminal.
    /*
    Description of parameters 
      f: Name of the input file containing the DNA sequence(s)
      k: length of kmers to be counted 
      q: minimum number of occurrences for a kmer to be reported
      
    Example 
      java kmer 15mer.fa 15 3
    */
    
    javac *.java 
    java kmer f k q
  5. Open the output text file, "f_out.txt", for the generated kmer counts.
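
For reference, the input format from step 1 could be read along the following lines. This is a hedged sketch and not the parser used in kmer.java: the class FastaSketch and the method parse are hypothetical, and the sketch assumes that a sequence may span several lines after its header.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the names here are hypothetical
// and do not come from kmer.java.
final class FastaSketch {

    // Maps each header (without the leading '>') to its sequence,
    // preserving the order of the records in the file.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> records = new LinkedHashMap<>();
        String header = null;
        StringBuilder sequence = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if any.
                if (header != null) {
                    records.put(header, sequence.toString());
                }
                header = line.substring(1);
                sequence.setLength(0);
            } else if (header != null) {
                sequence.append(line);
            }
        }
        if (header != null) {
            records.put(header, sequence.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Reads the file named on the command line, or a small demo input.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList(">s1", "ATCGCTG", ">s2", "ATCGCTG");
        parse(lines).forEach((name, seq) ->
                System.out.println(name + ": " + seq.length() + " bases"));
    }
}
```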

Additional Information

  • A disk-based approach (DSK algorithm) and accuracy improvements are currently being implemented. The program will be updated in the near future.
  • A summary of the methods used in the program can be found in "Summary.pdf".
  • For each input fasta file, the program prints the total RAM used and the wall-clock running time to the terminal.
