kmer-cnt's Introduction

Getting Started

git clone https://github.com/lh3/kmer-cnt
cd kmer-cnt
make  # C++11 required to compile the two C++ implementations
wget https://github.com/lh3/kmer-cnt/releases/download/v0.1/M_abscessus_HiSeq_10M.fa.gz
./yak-count M_abscessus_HiSeq_10M.fa.gz > kc-c4.out

Introduction

K-mer counting is the foundation of many mappers, assemblers and miscellaneous tools (e.g. genotypers and metagenomics profilers), and it is one of the most important classes of algorithms in bioinformatics. Here we implement basic k-mer counting algorithms, but with increasingly advanced engineering tricks, and see how far better engineering can go.

In this repo, each {kc,yak}-*.* file implements a standalone k-mer counter. As to the other files: ketopt.h is a command-line option parser; khashl.h is a generic hash table library in C; kseq.h is a FASTA/FASTQ parser; kthread.{h,c} provides two multi-threading models; robin_hood.h is a C++11 hash table library.

Results

We provide eight k-mer counters, which are detailed below the result table. All implementations count canonical k-mers, i.e. for each k-mer the lexicographically smaller of the k-mer and its reverse complement on the opposite DNA strand.
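
As a concrete illustration of this definition, here is a minimal sketch written for this write-up (not code taken from the repo; the helper names comp and canonical_kmer are ours): it reverse-complements a k-mer and keeps the lexicographically smaller of the two strings.

    /* Minimal sketch for illustration, not code from this repo: pick the
     * canonical form of a k-mer as the lexicographically smaller of the k-mer
     * and its reverse complement. */
    #include <stdio.h>
    #include <string.h>

    static char comp(char c) /* complement of one base; anything unexpected becomes N */
    {
        switch (c) {
        case 'A': return 'T'; case 'C': return 'G';
        case 'G': return 'C'; case 'T': return 'A';
        default:  return 'N';
        }
    }

    /* write the canonical form of s[0..k-1] into dst; dst must hold k+1 bytes, k < 64 */
    static void canonical_kmer(const char *s, int k, char *dst)
    {
        char rc[64];
        int i;
        for (i = 0; i < k; ++i) rc[i] = comp(s[k - 1 - i]); /* reverse complement */
        if (strncmp(s, rc, k) <= 0) memcpy(dst, s, k);      /* forward k-mer is smaller */
        else memcpy(dst, rc, k);                            /* reverse complement is smaller */
        dst[k] = 0;
    }

    int main(void)
    {
        char buf[64];
        canonical_kmer("ACGTT", 5, buf);
        puts(buf); /* prints AACGT: the reverse complement sorts before ACGTT */
        return 0;
    }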

The following table shows the timing and peak memory of the different implementations for counting 31-mers from 2.5 million pairs of 100 bp reads sampled from the HiSeq M. abscessus 6G-0125-R dataset in GAGE-B. They were run on a Linux server equipped with two EPYC 7301 CPUs and 512 GB of RAM.

| Implementation             | Limitation            | Elapsed time (s) | CPU time (s) | Peak RAM (GB) |
|----------------------------|-----------------------|------------------|--------------|---------------|
| kc-py1 + Python3.7         |                       | 499.6            | 499.5        | 8.15          |
| kc-py1 + Pypy7.3           |                       | 1220.8           | 1220.8       | 12.21         |
| kc-cpp1                    |                       | 528.0            | 527.9        | 8.27          |
| kc-cpp2                    |                       | 319.6            | 319.6        | 6.90          |
| kc-c1                      | <=32-mer              | 39.3             | 38.3         | 1.52          |
| kc-c2                      | <=32-mer; <1024 count | 38.7             | 37.9         | 1.05          |
| kc-c3                      | <=32-mer; <1024 count | 34.1             | 38.7         | 1.15          |
| kc-c4 (2+4 threads)        | <=32-mer; <1024 count | 7.5              | 35.1         | 1.27          |
| yak-count (2+4; >=2 count) | <=32-mer; <1024 count | 14.6             | 54.8         | 0.47          |
| jellyfish2 (16 threads)    |                       | 10.8             | 163.9        | 0.82          |
| KMC3 (16 thr; in-mem)      |                       | 9.2              | 36.2         | 5.02          |

Discussions

  • kc-py1.py is a basic Python 3 implementation. It uses the string translate method to complement sequences quickly. Interestingly, PyPy is much slower than Python 3 here; perhaps the official Python 3 interpreter ships a better hash table implementation, but that is just a guess. I often recommend PyPy over Python; I need to be more careful with this recommendation in the future.

  • kc-cpp1.cpp implements a basic counter in C++11 using STL's unordered_map. It is slower than the Python 3 version, partly because STL's hash table implementation is very inefficient. C++ does not necessarily lead to a fast implementation.

  • kc-cpp2.cpp replaces std::unordered_map with Martin Ankerl's robin_hood hash table library, which is among the fastest hash table implementations. It is now faster than kc-py1.py, though the performance gap is small.

  • kc-c1.c packs k-mers no longer than 32bp into 64-bit integers. This dramatically improves speed and reduces the peak memory. Most practical k-mer counters employ bit packing. Excluding the library files, this counter is fewer than 100 lines of code, not much more complex than the C++ or Python implementations. (A minimal sketch of 2-bit packing appears after this list.)

  • kc-c2.c uses an ensemble of hash tables to save 8 bits for the counter. This reduces the peak memory. The key advantage of using multiple hash tables, though, is that it enables multi-threading; see the kc-c4 bullet below and the sketch after this list for how k-mers can be routed to sub-tables.

  • kc-c3.c puts file reading and parsing into a separate thread. The performance improvement is minor here, but it sets the stage for the next multi-threaded implementation.

  • kc-c4.c is the fastest counter in this series. Thanks to the ensemble of hash tables introduced in kc-c2, we can parallelize the insertion of a batch of k-mers, which makes it much faster than the previous versions. Notably, kc-c4 also uses less CPU time, probably because batching improves data locality.

  • yak-count.c is adapted from yak and uses the same algorithm as kc-c4. Similar to BFCounter, it can optionally use a Bloom filter to filter out most singleton k-mers (k-mers occurring only once in the input). With the filter enabled, yak has to update the Bloom filter and read the input twice, counting twice; it is slower but uses less memory. Yak-count is the most complex example in this repo, but it is still short, and its code is better organized. Command line: -b30 (a Bloom filter with about 1 billion bits). A toy version of the filtering idea appears after this list.

  • jellyfish2 is probably the fastest in-memory k-mer counter to date. It uses less memory and is more flexible than kc-c4, but it is slower and much more complex. Command line: count -m 31 -C -s 100000000 -o /dev/null -t 16.

  • KMC3 is one of the fastest k-mer counters. It uses minimizers and relies on sorting. KMC3 is run in the in-memory mode here. The disk mode is as fast. KMC3 is optimized for counting much larger datasets. Although it uses more RAM here, it generally uses less RAM than jellyfish2 and other in-memory counters given high-coverage human data. Command line: -k31 -t16 -r -fa.
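
The sketch below, referenced from the kc-c1 and kc-c2 bullets above, illustrates the two ideas in simplified form. It was written for this write-up and is not the repo's actual code; the mixing function, the number of sub-tables and all helper names are arbitrary choices for illustration. A k-mer of up to 32 bases is packed into a 64-bit integer with two bits per base, the forward and reverse-complement encodings are updated incrementally while scanning, the smaller encoding is taken as canonical, and a few bits of an integer hash pick which sub-table of the ensemble receives the k-mer, so that each sub-table could later be filled by its own thread.

    /* Sketch only; the table count and hash mixer are illustrative choices,
     * not the ones used in kc-c2/kc-c4. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define N_TABLE_BITS 8  /* 2^8 = 256 sub-tables in the ensemble */

    static inline int base2int(char b) /* A=0, C=1, G=2, T=3, anything else = 4 */
    {
        switch (b) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return 4;
        }
    }

    static inline uint64_t mix64(uint64_t x) /* generic 64-bit integer mixer */
    {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33;
        return x;
    }

    static void scan_kmers(const char *seq, int k) /* requires 1 <= k <= 32 */
    {
        uint64_t fwd = 0, rev = 0, mask = k < 32? (1ULL << 2*k) - 1 : ~0ULL;
        int i, l = 0, len = strlen(seq);
        for (i = 0; i < len; ++i) {
            int c = base2int(seq[i]);
            if (c > 3) { l = 0, fwd = rev = 0; continue; }      /* ambiguous base: restart the window */
            fwd = (fwd << 2 | c) & mask;                        /* append base on the forward strand */
            rev = rev >> 2 | (uint64_t)(3 - c) << 2*(k - 1);    /* prepend complement on the reverse strand */
            if (++l < k) continue;                              /* window not full yet */
            uint64_t canon = fwd < rev? fwd : rev;              /* canonical = smaller 2-bit encoding */
            int tid = mix64(canon) & ((1 << N_TABLE_BITS) - 1); /* which sub-table receives this k-mer */
            printf("k-mer starting at %d -> sub-table %d\n", i - k + 1, tid);
        }
    }

    int main(void)
    {
        scan_kmers("ACGTACGTTGCANACGT", 5);
        return 0;
    }

For the Bloom-filter pre-filtering mentioned in the yak-count bullet, the toy version below conveys the idea under loud assumptions: a single hash function and a plain bit array, whereas a real Bloom filter (including yak's) uses several hash functions. The point is that a k-mer is only promoted to the counting hash table once the filter has already seen it, so most k-mers that occur exactly once never enter the table.

    /* Toy single-hash "Bloom filter" (really just a bit array) for illustration;
     * not yak's implementation. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        uint64_t n_bits; /* e.g. 1ULL<<30 bits, roughly what the -b30 option above requests */
        uint8_t *bits;
    } bloom_t;

    static bloom_t *bloom_init(uint64_t n_bits)
    {
        bloom_t *b = (bloom_t*)calloc(1, sizeof(bloom_t));
        b->n_bits = n_bits;
        b->bits = (uint8_t*)calloc(n_bits / 8 + 1, 1);
        return b;
    }

    /* record a hashed k-mer; return 1 if it was possibly seen before, 0 if definitely new */
    static int bloom_insert(bloom_t *b, uint64_t h)
    {
        uint64_t i = h % b->n_bits;
        int seen = b->bits[i >> 3] >> (i & 7) & 1;
        b->bits[i >> 3] |= 1 << (i & 7);
        return seen;
    }

    /* intended use in a counter:
     *     if (bloom_insert(bf, mix64(canon)))
     *         count_in_hash_table(canon);   // hypothetical counting step
     * so a k-mer seen for the first time only sets a bit and is not counted yet. */
    int main(void)
    {
        bloom_t *b = bloom_init(1ULL << 20);   /* small filter for the demo */
        uint64_t h = 0x123456789abcdefULL;     /* stand-in for a hashed canonical k-mer */
        printf("first insert: seen=%d\n", bloom_insert(b, h));  /* 0: definitely new */
        printf("second insert: seen=%d\n", bloom_insert(b, h)); /* 1: possibly seen before */
        free(b->bits); free(b);
        return 0;
    }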

Conclusions

The k-mer counters here are fairly basic implementations using only generic hash tables. Nonetheless, we show that better engineering can carry the basic idea a long way. If you want to implement your own k-mer counter, yak-count.c could be a good starting point: it is fast and relatively simple. By the way, if you have an efficient and simple k-mer counter (implemented in a few files), please let me know; I will be happy to add it to the table.

kmer-cnt's Issues

Bottleneck of Py1 code is that it relies on gunzip to feed stdin

The current kc-py1 + Python3.7 implementation takes about ten times as long as kc-c2 or kc-c3. However, this may largely be because kc-py1 reads from stdin and relies on a command-line gunzip -c to feed it. Python has a built-in gzip module, and if I use it to read the test file M_abscessus_HiSeq_10M.fa.gz directly, the whole process becomes roughly 10x faster; see kc-py2.py. Here is the test. I used my prockreport script to track CPU time and peak memory usage (it calls ps at 0.5 s intervals).

$ wget https://gist.githubusercontent.com/taoliu/2501892/raw/92a66ba0cf6026dbdc0d3f941a9f8d6be6d6c176/prockreport
$ wget https://github.com/lh3/kmer-cnt/releases/download/v0.1/M_abscessus_HiSeq_10M.fa.gz
$ gunzip -c M_abscessus_HiSeq_10M.fa.gz | python kc-py1.py &> py1.out &
$ bash prockreport 0.5 $!
CPU time (mm:ss): 00:04:01
Max mem (KB): 8551588
$ python kc-py2.py 31  M_abscessus_HiSeq_10M.fa.gz &> py2.out &
$ bash prockreport 0.5 $!
CPU time (mm:ss): 00:00:40
Max mem (KB): 3468044
$ kc-c2 M_abscessus_HiSeq_10M.fa.gz &>c2.out &
$ bash prockreport 0.5 $!
CPU time (mm:ss): 00:00:29
Max mem (KB): 1107240
$ kc-c3 M_abscessus_HiSeq_10M.fa.gz &>c3.out &
$ bash prockreport 0.5 $!
CPU time (mm:ss): 00:00:30
Max mem (KB): 1217792

The observation is that kc-py1.py plus gunzip took 4 minutes (240 s) and 8 GB of memory, whereas kc-py2, reading the gzip file directly, took only 40 s and 3.5 GB. On the same server, kc-c2 and kc-c3 took about 30 s and 1.1-1.2 GB. So in fact the difference is not that enormous. I also modified kc-py2 with Cython (not yet pushed to GitHub), specifying types for all variables; that brings the simple Python counter down to 34 s, although its memory usage is still about three times that of kc-c2 or kc-c3.

The testing environment is a desktop server with an Intel Xeon W-2145 CPU @ 3.70 GHz, 64 GB of memory and an SSD.

hash64 for hash table ensemble

Hi again :)

Quick question: is the use of hash64() before inserting into any of the h->h tables done to ensure an even distribution of k-mers among the ensemble?

Speedup from kc-cpp2.cpp to kc-c1.c

Hi @lh3,

Thanks for this survey! It's been very useful for someone learning C++.

I have a newbie question about the 9-10x speedup you see when going from kc-cpp2.cpp to kc-c1.c. As I interpret it, you attribute the speedup to the fact that "kc-c1.c packs k-mers no longer than 32bp into 64-bit integers".

Q1. But is this really the cause of the drastic speedup, or is it rather something more foundational within your library khashl.c? The reason I ask is that I only see about a 10-20% speedup when I bit-pack the k-mers (with your function kmer_to_uint64) and then use Martin Ankerl's robin_hood hash table library, compared with inserting the k-mer strings into the robin_hood table directly.

Q2. Furthermore, if khashl.c is responsible for the major speedup rather than the bit packing itself, would you know (or guesstimate) whether it is possible to use khashl.c with the same efficiency from C++?

Some background on why I did this analysis: I want to store all positions of a k-mer on a chromosome in a vector or similar. I profiled a robin_hood hash map std::string -> std::vector (inserting the k-mer string directly) against a robin_hood hash map uint64_t -> std::vector (bit-packing the k-mer first). The uint64_t -> std::vector version is only about 10-20% faster. In my profiling, I made sure that no other operation (such as vector::push_back) was the bottleneck. Furthermore, somewhat disappointingly, my C++ program using

uint64_t bkmer = kmer_to_uint64(kmer, kmask);
h[bkmer].push_back(kmer_pos);  // uint64_t -> std::vector, Martin Ankerl's robin_hood hash table library

is almost as slow as my python program using

h[hash(k)].append(kmer_pos) # int -> list

Any feedback you can give on this is appreciated.

Problem with 32mers

Hi @lh3

I noticed a small problem when counting 32-mers.

Given the sequence:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT

    ./kc-c1 -k 32 seq.fa | head
    1       0
    2       1
    3       0
    4       0
    5       0
    6       0
    7       0
    8       0
    9       0
    10      0

There are two distinct 32-mers in this sequence, but kc-c1 reports only one.

After inspecting your code I noticed:

mask = (1ULL<<k*2) - 1

This is used to extract the k*2 bits corresponding to k bases.

Testing how mask is set, I got:

//mask.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(int argc, char *argv[]) {
    int k = atoi(argv[1]);
    uint64_t mask = (1ULL<<k*2) - 1;
    fprintf(stderr, "Mask: %lx\n", mask);
}

gcc -o mask mask.c
./mask 31
Mask: 3fffffffffffffff //0011111111111111111111111111111111 AOK
./mask 32
Mask: 0 // Not AOK, expected  ffffffffffffffff

When counting 32-mers, the zero mask hashes all k-mers to the same value (0).

I understood that we should be able to count k-mers up to 32 bases. Is the limit actually 31, and did I misunderstand the "up to"?

I also don't get why

mask = (1ULL<<k*2)

sets mask to 1 instead of 0 when k == 32

EDIT Regarding my last point: it seems that shifting by a count greater than or equal to the width of the type is undefined behavior, which explains why "<<32*2" does not work. (On x86, the shift instruction only uses the low six bits of the count, so 1ULL<<64 is effectively 1ULL<<0 == 1, which matches what I observed above.)
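
For what it is worth, a small standalone check like the one below (written for this report, not a patch from the repo) shows two mask expressions that remain well defined for k from 1 to 32: special-casing k == 32, or right-shifting an all-ones value so the shift count stays in range.

    /* Illustration only, not a patch to kc-c1: two mask expressions that stay
     * well defined for any k from 1 to 32. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int k;
        for (k = 30; k <= 32; ++k) {
            uint64_t m1 = k < 32? (1ULL << 2*k) - 1 : ~0ULL; /* special-case k == 32 */
            uint64_t m2 = ~0ULL >> (64 - 2*k);               /* shift count 64-2*k stays in [0,62] for 1 <= k <= 32 */
            printf("k=%d  mask=%016llx  alt=%016llx\n", k,
                   (unsigned long long)m1, (unsigned long long)m2);
        }
        return 0;
    }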
