Hi We are working on analysis of Bioinformatics tools (related to Km

Help: KMC configuration about kmc HOT 1 CLOSED

refresh-bio commented on August 24, 2024

Help: KMC configuration

from kmc.

Comments (1)

marekkokot commented on August 24, 2024

Hi,

Not sure how much details you need.

Data structure and Sorting Algo:
All you mentioned are in fact used in our approach, but the list is probably a little longer (for example we use also Shell sort, introspective sort etc, because our sorting algorithm is highly based on a radix sort but in fact it is a hybrid of a couple of algorithms, but those may be not important details for you). On the other hand, the priority queue is used only in our sorting routine to distribute tasks to threads in a more balanced fashion, so it may be also not an important detail from your perspective.
We do also use binary heap after sorting.
The implementation details and number of optimization (including some low level optimizations) are also very important to KMC performance. In some places we use look up table, memory pooling. The other strength of KMC is the high utilization of parallelism.

Approach:
I am not sure what you mean by "Two" in this sentence. Could you explain?
You mentioned signatures, which are in fact improved minimizers and they are one of the most important algorithmic part of our algorithm, which allows us to reduce disk space usage, but also reduce the memory peak. The second approach that is important is usage of so called (k+x)-mers (you may read a paper describing KMC2 for details). Simply speaking on the second stage of KMC, where k-mers are sorted, we sort a little longer sequences, which reduces the number of records to sort by half in average. This step is mainly for reduction of memory requirements per partition, but as it turns out it also make the whole program a little faster, even though additional postprocessing stage after sorting is required (in this stage we use the binary heap to convert sorted (k+x)-mers into sorted k-mers). In KMC3 we introduced some novelties (new version of radix sort is one of them, in fact, we plan to replace it further by yet faster radix sort) like improved reading when the input is compressed, better threads utilization.
From end user perspective I think it is also important to point, that you may run KMC without disk usage (memory peak will be higher, but in some cases, there is enough RAM, and disks are really slow). On the other hand one of the input parameters of KMC is the amount of memory it may use, but in some cases, it may be not enough (especially for large input), which is clearly a disadvantage of KMC, yet there is a parameter that may set strict memory limit in which KMC will not use more memory than specified (running time may be a little longer).

The limit of k-size :
This is the default limit, as we think it is enough for most applications, but it may be easily extended, by modifiing this line and recompiling KMC, so I would say there is no limit (of course there are technical difficulties and the limit must exist, furthermore extending this constant will increase compilation time and executable size (as more templates must be instantiated in c++ code).

Supports online k-mer frequency retrieval :
Could you explain what you mean by that?
The purpose of KMC is to count k-mers, so there is an input sequence, and the output is a database of k-mers and its counters.
To process such database we supplied a multiple ways:

using our API, which is the most flexible option, but require writing c++ program. Using this API one may go through all k-mers in the database, of find a number of occurrences for a specific k-mer
use our program kmc_dump which converts our binary database to text format (one line is k-mer and its counter), this is in general not recommended as resulting text file will be in many cases big. To implement this tool we used our API, so one may check the code as an example
use our program kmc_tools which is also quite flexible, and do not require programming skills at all. Is supports a lot of operation that may be performed on a single k-mers databse of on multiple k-mers databases (more details in linked document).
Is one of these functionalities equal to what you call "online k-mer frequency retrieval"? If no I would really appreciate if you could explain me this functionality because we want our software to be as useful as possible.

Supports compressed file processing: Yes :)

Thanks for investigating it deeply, I really appreciate your attitude and this sequence

As we are doing analysis so we want to be very sure about details

In case of questions don't hesitate

from kmc.

Help: KMC configuration about kmc HOT 1 CLOSED

Comments (1)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent