Code Monkey home page Code Monkey logo

Comments (1)

marekkokot avatar marekkokot commented on August 24, 2024

Hi,

Not sure how much details you need.

Data structure and Sorting Algo:
All you mentioned are in fact used in our approach, but the list is probably a little longer (for example we use also Shell sort, introspective sort etc, because our sorting algorithm is highly based on a radix sort but in fact it is a hybrid of a couple of algorithms, but those may be not important details for you). On the other hand, the priority queue is used only in our sorting routine to distribute tasks to threads in a more balanced fashion, so it may be also not an important detail from your perspective.
We do also use binary heap after sorting.
The implementation details and number of optimization (including some low level optimizations) are also very important to KMC performance. In some places we use look up table, memory pooling. The other strength of KMC is the high utilization of parallelism.

Approach:
I am not sure what you mean by "Two" in this sentence. Could you explain?
You mentioned signatures, which are in fact improved minimizers and they are one of the most important algorithmic part of our algorithm, which allows us to reduce disk space usage, but also reduce the memory peak. The second approach that is important is usage of so called (k+x)-mers (you may read a paper describing KMC2 for details). Simply speaking on the second stage of KMC, where k-mers are sorted, we sort a little longer sequences, which reduces the number of records to sort by half in average. This step is mainly for reduction of memory requirements per partition, but as it turns out it also make the whole program a little faster, even though additional postprocessing stage after sorting is required (in this stage we use the binary heap to convert sorted (k+x)-mers into sorted k-mers). In KMC3 we introduced some novelties (new version of radix sort is one of them, in fact, we plan to replace it further by yet faster radix sort) like improved reading when the input is compressed, better threads utilization.
From end user perspective I think it is also important to point, that you may run KMC without disk usage (memory peak will be higher, but in some cases, there is enough RAM, and disks are really slow). On the other hand one of the input parameters of KMC is the amount of memory it may use, but in some cases, it may be not enough (especially for large input), which is clearly a disadvantage of KMC, yet there is a parameter that may set strict memory limit in which KMC will not use more memory than specified (running time may be a little longer).

The limit of k-size :
This is the default limit, as we think it is enough for most applications, but it may be easily extended, by modifiing this line and recompiling KMC, so I would say there is no limit (of course there are technical difficulties and the limit must exist, furthermore extending this constant will increase compilation time and executable size (as more templates must be instantiated in c++ code).

Supports online k-mer frequency retrieval :
Could you explain what you mean by that?
The purpose of KMC is to count k-mers, so there is an input sequence, and the output is a database of k-mers and its counters.
To process such database we supplied a multiple ways:

  • using our API, which is the most flexible option, but require writing c++ program. Using this API one may go through all k-mers in the database, of find a number of occurrences for a specific k-mer
  • use our program kmc_dump which converts our binary database to text format (one line is k-mer and its counter), this is in general not recommended as resulting text file will be in many cases big. To implement this tool we used our API, so one may check the code as an example
  • use our program kmc_tools which is also quite flexible, and do not require programming skills at all. Is supports a lot of operation that may be performed on a single k-mers databse of on multiple k-mers databases (more details in linked document).
    Is one of these functionalities equal to what you call "online k-mer frequency retrieval"? If no I would really appreciate if you could explain me this functionality because we want our software to be as useful as possible.

Supports compressed file processing: Yes :)

Thanks for investigating it deeply, I really appreciate your attitude and this sequence

As we are doing analysis so we want to be very sure about details

In case of questions don't hesitate

from kmc.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.