
kmc's Introduction

KMC


KMC is a disk-based program for counting k-mers from (possibly gzipped) FASTQ/FASTA files. KMC is one of many projects developed by REFRESH Bioinformatics Group.

An API (in the kmc_api directory) gives access to the k-mers stored in a database produced by KMC. Note that the database format of KMC versions 0.x and 1.x differs from the one produced by KMC version 2.x. From version 2.2.0 the API is unified for both formats, and all new features and bug fixes are present only in the 2.x branch. The standalone API for older KMC versions is no longer under development, so the new version of the API should be used even for databases produced by older KMC versions.

Quick start

Getting the executable

The simplest way to get KMC is to download the newest release for the appropriate operating system from KMC releases.

Counting the k-mers from a single fastq file

./kmc -k27 input.fastq 27mers .

The command above will count all the 27-mers occurring in input.fastq at least twice (configurable with the -ci switch). The result will be stored in a KMC database, which is split into two files: 27mers.kmc_pre and 27mers.kmc_suf. KMC will create hundreds of intermediate files. In the case of the above command, those will be created in the current working directory (the . at the end of the command). It may be more convenient to use a dedicated directory for KMC temporary files, for example:

mkdir kmc_tmp # create directory for kmc temporary files
./kmc -k27 input.fastq 27mers kmc_tmp
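The two commands above can also be driven from a script. The following is a minimal Python sketch, assuming the kmc executable sits in the current directory. The helper names build_kmc_command and run_kmc are made up for illustration and are not part of KMC; only the -k and -ci switches mentioned above are used.

```python
import subprocess
from pathlib import Path


def build_kmc_command(k, input_file, output_prefix, tmp_dir, min_count=2, kmc="./kmc"):
    """Assemble the kmc argument list (hypothetical helper, not part of KMC)."""
    return [kmc, f"-k{k}", f"-ci{min_count}",
            str(input_file), str(output_prefix), str(tmp_dir)]


def run_kmc(k, input_file, output_prefix, tmp_dir="kmc_tmp", **kwargs):
    """Create the temporary directory if needed, then run kmc."""
    Path(tmp_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_kmc_command(k, input_file, output_prefix, tmp_dir, **kwargs)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit


# Only builds and prints the command; call run_kmc(...) to actually execute it.
print(build_kmc_command(27, "input.fastq", "27mers", "kmc_tmp"))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.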

Create text dump from KMC database binary format

Once the k-mers are counted, the binary KMC database can be dumped to text form with kmc_tools.

./kmc_tools transform 27mers dump 27mers.txt
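The resulting text file can then be read with a few lines of Python. This sketch assumes that each line of the dump holds a k-mer and its count separated by whitespace; check this against your own 27mers.txt before relying on it. The function name read_kmc_dump and the two sample lines are made up for illustration.

```python
import io


def read_kmc_dump(lines):
    """Yield (kmer, count) pairs from a text dump (assumed 'kmer<TAB>count' per line)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        kmer, count = line.split()
        yield kmer, int(count)


# Made-up sample standing in for open("27mers.txt")
sample = io.StringIO(
    "AAAAAAAAAAAAAAAAAAAAAAAAAAC\t12\n"
    "AAAAAAAAAAAAAAAAAAAAAAAAAAG\t3\n"
)
counts = dict(read_kmc_dump(sample))
print(counts)
```

For large databases it is usually better to iterate over the generator than to build a dictionary of all k-mers in memory.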

Installation details

Compile from sources
git clone --recurse-submodules https://github.com/refresh-bio/kmc.git
cd kmc
make -j32

The following libraries come with KMC in a binary (64-bit compiled for x86 platform) form. If your system needs other binary formats, you should put the following libraries in kmc_core/libs:

  • zlib - for support for gzip-compressed input FASTQ/FASTA files

The following libraries come with KMC in a source code form.

If needed, you can also redefine the maximal k-mer length (MAX_K), which is 256 in the current version.

Note: KMC is highly optimized and spends only as many bytes for k-mer (rounded up to 8) as necessary, so using large values of MAX_K does not affect the KMC performance for short k-mers.
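The note above can be illustrated with a small calculation. This is only a sketch under two assumptions that the text does not state: each nucleotide is packed into 2 bits, and "rounded up to 8" means rounding up to a multiple of 8 bytes. The function name kmer_bytes is made up for illustration.

```python
def kmer_bytes(k, bits_per_symbol=2, word_bytes=8):
    """Bytes needed for one k-mer, rounded up to a multiple of word_bytes.

    Assumes 2 bits per nucleotide (an assumption, not taken from the KMC sources).
    """
    bits = k * bits_per_symbol
    raw_bytes = (bits + 7) // 8                      # round bits up to whole bytes
    return -(-raw_bytes // word_bytes) * word_bytes  # round up to a multiple of 8


for k in (27, 32, 33, 256):
    print(k, kmer_bytes(k))
```

Under these assumptions a 27-mer takes 8 bytes regardless of whether MAX_K is 256 or smaller, which is the point the note makes.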

Some parts of KMC use C++17 features, so you need a compatible C++ compiler.

After that, you can run make to compile kmc and kmc_dump applications.

Additional information for Mac OS installation

The g++ path in makefile_mac might need to be changed. If needed, we recommend installing g++ with brew (http://brew.sh/).

Note that KMC creates hundreds of temporary files, while the default limit on open files is small on the Mac OS platform. To increase this limit, use the following command before running KMC:

ulimit -n 2048
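When KMC is launched from a Python script, the same limit can be raised from within the script before the child process starts. A minimal sketch using the standard resource module (available on Unix-like systems only); the function name raise_open_file_limit is made up for illustration.

```python
import resource


def raise_open_file_limit(target=2048):
    """Raise the soft limit on open files to `target`, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < target:
        new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print(raise_open_file_limit(2048))
```

The new limit applies to the Python process and to the processes it starts afterwards, so KMC has to be launched from the same script (for example with subprocess.run) to benefit from it.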

Directory structure

  • bin - executables and libraries will be stored here after compilation
  • include - the header file for using the kmc core through the C++ API will be stored here after compilation
  • kmc_core - source code of kmc core library
  • kmc_CLI - source code of kmc command line interface
  • kmc_tools - source codes of kmc_tools program
  • kmc_core/libs - libraries used by KMC
  • kmc_api - C++ source codes implementing API to access KMC databases; must be used by any program that wants to process databases produced by kmc
  • kmc_dump - source codes of kmc_dump program listing k-mers in databases produced by kmc (deprecated, use kmc_tools instead)
  • py_kmc_api - python wrapper for kmc API
  • tests - test files

Use the KMC directly from code through the API

It is possible to use KMC directly from C++ code through the API. A detailed API description is available at the wiki.

Python wrapper for KMC API

The Python wrapper for the KMC API was created using pybind11.

Warning: the Python binding is experimental. The library used to create the binding, as well as the public interface, may change in the future.

Warning 2: the Python wrapper for the C++ KMC API is much slower than the native C++ API (much, much more than I had expected). In fact, the first attempt to create the Python wrapper used ctypes, but it turned out to be even slower than pybind11.

The wrapper was designed for and tested only with Python 3. The main goal was to make it as similar to the C++ API as possible. For this reason the API may not be pythonic (https://blog.startifact.com/posts/older/what-is-pythonic.html) enough for a regular Python programmer. Suggestions or pull requests to make it more robust are welcome.

The Python module wrapping the KMC API must be compiled.

  • for Windows there is a Visual Studio project (note that you will probably need to change the include directories and library directories to point to the Python include and libs locations)
  • for Linux or Mac OS one should run make py_kmc_api

As a result, pybind11 creates a *.so file (for Linux and Mac OS) or a *.pyd file (for Windows), which may be used as a Python module. A *.pyd file is in fact a DLL file; the only difference is its extension.

  • for Windows the following file is created: x64/Release/py_kmc_api.pyd
  • for Linux/Mac OS the following file is created: bin/py_kmc_api`python3-config --extension-suffix`

To use this file, make it visible to Python. One way to do this is to extend the PYTHONPATH environment variable. On Linux/Mac OS one may just run

source py_kmc_api/set_path.sh

while, for windows:

py_kmc_api\set_path.bat

It will export the appropriate path. An example of using the Python wrapper for the KMC API is presented in the file py_kmc_api/py_kmc_dump.py
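As an alternative to changing PYTHONPATH in the shell, the module directory can be added to sys.path from within a script. A minimal sketch, assuming the compiled module was placed in bin as described above; the function name load_py_kmc_api is made up for illustration, and nothing here calls into the wrapper's own interface.

```python
import importlib
import sys
from pathlib import Path


def load_py_kmc_api(module_dir="bin"):
    """Add module_dir to sys.path and try to import py_kmc_api.

    Returns the module, or None if it has not been built or is not in module_dir.
    """
    path = str(Path(module_dir).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    try:
        return importlib.import_module("py_kmc_api")
    except ImportError:
        return None


module = load_py_kmc_api("bin")
print("py_kmc_api available:", module is not None)
```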

Detailed API description is available at wiki

Binaries

After compilation you will obtain the following files:

  • bin/kmc - the main program for counting k-mer occurrences
  • bin/kmc_dump - the program listing k-mers in a database produced by kmc
  • bin/kmc_tools - the program allowing to manipulate kmc databases (set operations, transformations, etc.)
  • bin/libkmc_core.a - compiled KMC code sources
  • py_kmc_api.cpython-39-x86_64-linux-gnu.so - compiled python wrapper for KMC API

License

In case of doubt, please consult the original documentation.

Archival source codes, binaries and documentation

Archival source codes, binaries and documentation are available at wiki.

Warranty

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Citing

Marek Kokot, Maciej Długosz, Sebastian Deorowicz, KMC 3: counting and manipulating k-mer statistics, Bioinformatics, Volume 33, Issue 17, 01 September 2017, Pages 2759–2761, https://doi.org/10.1093/bioinformatics/btx304

Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, Agnieszka Debudaj-Grabysz, KMC 2: fast and resource-frugal k-mer counting, Bioinformatics, Volume 31, Issue 10, 15 May 2015, Pages 1569–1576, https://doi.org/10.1093/bioinformatics/btv022

Deorowicz, S., Debudaj-Grabysz, A. & Grabowski, S. Disk-based k-mer counting on a PC. BMC Bioinformatics 14, 160 (2013). https://doi.org/10.1186/1471-2105-14-160

kmc's People

Contributors

gitter-badger, jamshed, jvhaarst, karasikov, maciejdlugosz, marcom, marekkokot, notestaff, satta, sebastiandeorowicz, tmaklin


kmc's Issues

test suite?

@marekkokot Is there a test suite you use to verify correctness of kmc and kmc_tools? If there is, could it be checked into github?

Ignore N in the reads.

I am looking to use KMC to filter rare k-mers pre-assembly and was wondering if there's a way to tell it to ignore Ns in the reads (which could be uncalled bases or masked low-quality bases). Maybe KMC automatically does that?

Fail quickly on missing directories

Please could you implement a check for the existence of the directories specified on the command line for <output_file_name> and <working_directory> (or create them).

As a user it is frustrating for KMC to spend many minutes or hours doing computation only for it to fail because the directory I specified for the working directory did not exist. Similarly for the parent directory I specify for the output file name.
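Until such a check exists, a wrapper script can verify the directories before starting KMC. A minimal sketch; the function name missing_kmc_directories is made up for illustration, and the two arguments correspond to <output_file_name> and <working_directory> from the request above.

```python
from pathlib import Path


def missing_kmc_directories(output_file_name, working_directory):
    """Return a list of problems with the directories KMC would write to."""
    problems = []
    if not Path(working_directory).is_dir():
        problems.append(f"working directory does not exist: {working_directory}")
    parent = Path(output_file_name).parent  # "." when only a file name is given
    if not parent.is_dir():
        problems.append(f"output directory does not exist: {parent}")
    return problems


for problem in missing_kmc_directories("results/27mers", "kmc_tmp"):
    print("Error:", problem)
```

Running this before the actual kmc call makes the job fail within milliseconds instead of after the counting stages.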

compile issue

Hello, ran into this problem with make today:

g++ -Wall -O3 -m64 -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++11  -c kmer_counter/kmer_counter.cpp -o kmer_counter/kmer_counter.o
In file included from /usr/local/include/assert.h:5:0,
                 from /usr/include/c++/5/cassert:43,
                 from kmer_counter/radix.h:13,
                 from kmer_counter/kb_collector.h:18,
                 from kmer_counter/kmc.h:26,
                 from kmer_counter/kmer_counter.cpp:18:
/usr/local/include/except.h:15:32: error: conflicting declaration ‘typedef struct Except_Frame_T* Except_Frame_T’
 typedef struct Except_Frame_T *Except_Frame_T;
                                ^
/usr/local/include/except.h:15:16: note: previous declaration as ‘struct Except_Frame_T’
 typedef struct Except_Frame_T *Except_Frame_T;
                ^
/usr/local/include/except.h:17:18: error: field ‘prev’ has incomplete type ‘Except_Frame_T’
   Except_Frame_T prev;
                  ^
/usr/local/include/except.h:16:8: note: definition of ‘struct Except_Frame_T’ is not complete until the closing brace
 struct Except_Frame_T {
        ^
makefile:79: recipe for target 'kmer_counter/kmer_counter.o' failed
make: *** [kmer_counter/kmer_counter.o] Error 1

Count K-mers read by read

Hi, I need to count k-mers for each read in a metagenomic FASTQ or FASTA file.
Is it possible to get counts for each single read of the file instead of the whole file?
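Independently of whether KMC supports this, per-read counting can be sketched in plain Python for small inputs. The function name count_kmers_per_read and the two reads are made up for illustration; this does not use KMC and will be far slower on real metagenomic data.

```python
from collections import Counter


def count_kmers_per_read(reads, k):
    """Yield (read_name, Counter of k-mers) for each (name, sequence) pair."""
    for name, seq in reads:
        yield name, Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


reads = [("read1", "ACGTAC"), ("read2", "AAAA")]  # made-up reads
per_read = dict(count_kmers_per_read(reads, 3))
print(per_read["read2"])
```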

Install error

Hi,
I tried to install KMC, but some problems occurred when I used the command "make DISABLE_ASMLIB=true" and I don't know how to solve them. Could you give me some advice?
Best,
############################################################################

make DISABLE_ASMLIB=true
g++ -Wall -O3 -m64 -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++11 -DDISABLE_ASMLIB -mavx2 -mfma -fabi-version=0 -c kmer_counter/raduls_avx2.cpp -o kmer_counter/raduls_avx2.o
/tmp/cckJQjTJ.s: Assembler messages:
/tmp/cckJQjTJ.s:27711: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0' /tmp/cckJQjTJ.s:27714: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rcx)'
/tmp/cckJQjTJ.s:27716: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0' /tmp/cckJQjTJ.s:27718: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rcx)'
/tmp/cckJQjTJ.s:27823: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0' /tmp/cckJQjTJ.s:27829: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)'
/tmp/cckJQjTJ.s:27831: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0' /tmp/cckJQjTJ.s:27833: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rax)'
/tmp/cckJQjTJ.s:36827: Error: no such instruction: vinserti128 $0x1,%xmm1,%ymm0,%ymm0' /tmp/cckJQjTJ.s:36829: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx)'
/tmp/cckJQjTJ.s:36900: Error: no such instruction: vinserti128 $0x1,%xmm1,%ymm0,%ymm0' /tmp/cckJQjTJ.s:36902: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)'
/tmp/cckJQjTJ.s:41669: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0' /tmp/cckJQjTJ.s:41670: Error: suffix or operands invalid for vpaddq'
/tmp/cckJQjTJ.s:41672: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)' /tmp/cckJQjTJ.s:42925: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0'
/tmp/cckJQjTJ.s:42926: Error: suffix or operands invalid for vpaddd' /tmp/cckJQjTJ.s:42928: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)'
/tmp/cckJQjTJ.s:46028: Error: no such instruction: vinserti128 $0x1,%xmm1,%ymm0,%ymm0' /tmp/cckJQjTJ.s:46031: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)'
/tmp/cckJQjTJ.s:46033: Error: no such instruction: vinserti128 $0x1,%xmm1,%ymm0,%ymm0' /tmp/cckJQjTJ.s:46035: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rax)'
/tmp/cckJQjTJ.s:46852: Error: no such instruction: vinserti128 $0x1,%xmm2,%ymm3,%ymm2' /tmp/cckJQjTJ.s:46853: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:46855: Error: no such instruction: vextracti128 $0x1,%ymm2,16(%rax)' /tmp/cckJQjTJ.s:46857: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rax)'
/tmp/cckJQjTJ.s:46913: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0' /tmp/cckJQjTJ.s:46919: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)'
/tmp/cckJQjTJ.s:46921: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0' /tmp/cckJQjTJ.s:46923: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rax)'
/tmp/cckJQjTJ.s:47314: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0' /tmp/cckJQjTJ.s:47315: Error: suffix or operands invalid for vpaddq'
/tmp/cckJQjTJ.s:47317: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)' /tmp/cckJQjTJ.s:48587: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0'
/tmp/cckJQjTJ.s:48588: Error: suffix or operands invalid for vpaddd' /tmp/cckJQjTJ.s:48590: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)'
/tmp/cckJQjTJ.s:50550: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0' /tmp/cckJQjTJ.s:52688: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0'
/tmp/cckJQjTJ.s:52689: Error: suffix or operands invalid for vpaddq' /tmp/cckJQjTJ.s:52691: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)'
/tmp/cckJQjTJ.s:53960: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0' /tmp/cckJQjTJ.s:53961: Error: suffix or operands invalid for vpaddd'
/tmp/cckJQjTJ.s:53963: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)' /tmp/cckJQjTJ.s:57720: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0'
/tmp/cckJQjTJ.s:57721: Error: suffix or operands invalid for vpaddq' /tmp/cckJQjTJ.s:57723: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)'
/tmp/cckJQjTJ.s:58988: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0' /tmp/cckJQjTJ.s:58989: Error: suffix or operands invalid for vpaddd'
/tmp/cckJQjTJ.s:58991: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)' /tmp/cckJQjTJ.s:62058: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0'
/tmp/cckJQjTJ.s:62059: Error: suffix or operands invalid for vpaddq' /tmp/cckJQjTJ.s:62061: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)'
/tmp/cckJQjTJ.s:63323: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0' /tmp/cckJQjTJ.s:63324: Error: suffix or operands invalid for vpaddd'
/tmp/cckJQjTJ.s:63326: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)' /tmp/cckJQjTJ.s:65465: Error: no such instruction: vinserti128 $0x1,%xmm1,%ymm0,%ymm0'
/tmp/cckJQjTJ.s:65467: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:66050: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:66052: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:66091: Error: no such instruction: vinserti128 $0x1,%xmm1,%ymm0,%ymm0'
/tmp/cckJQjTJ.s:66093: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:66451: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0'
/tmp/cckJQjTJ.s:66452: Error: suffix or operands invalid for vpaddq' /tmp/cckJQjTJ.s:66454: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)'
/tmp/cckJQjTJ.s:67714: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0' /tmp/cckJQjTJ.s:67715: Error: suffix or operands invalid for vpaddd'
/tmp/cckJQjTJ.s:67717: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)' /tmp/cckJQjTJ.s:70575: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0'
/tmp/cckJQjTJ.s:70576: Error: suffix or operands invalid for vpaddq' /tmp/cckJQjTJ.s:70578: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)'
/tmp/cckJQjTJ.s:71836: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0' /tmp/cckJQjTJ.s:71837: Error: suffix or operands invalid for vpaddd'
/tmp/cckJQjTJ.s:71839: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)' /tmp/cckJQjTJ.s:73822: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0'
/tmp/cckJQjTJ.s:73823: Error: suffix or operands invalid for vpaddq' /tmp/cckJQjTJ.s:73825: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)'
/tmp/cckJQjTJ.s:75091: Error: no such instruction: vinserti128 $0x1,16(%rdx,%rax),%ymm0,%ymm0' /tmp/cckJQjTJ.s:75092: Error: suffix or operands invalid for vpaddd'
/tmp/cckJQjTJ.s:75094: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx,%rax)' /tmp/cckJQjTJ.s:81696: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:81698: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:81760: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:81762: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:81832: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:81834: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rdx)' /tmp/cckJQjTJ.s:81909: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:81911: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:83185: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:83187: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:83249: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:83251: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:83323: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:83325: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rsi)' /tmp/cckJQjTJ.s:83399: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:83401: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:93800: Error: no such instruction: vinserti128 $0x1,%xmm2,%ymm3,%ymm2'
/tmp/cckJQjTJ.s:93801: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0' /tmp/cckJQjTJ.s:93803: Error: no such instruction: vextracti128 $0x1,%ymm2,16(%rax)'
/tmp/cckJQjTJ.s:93805: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rax)' /tmp/cckJQjTJ.s:93896: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:93902: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:93904: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:93906: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rax)' /tmp/cckJQjTJ.s:94006: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:94009: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rcx)' /tmp/cckJQjTJ.s:94011: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:94013: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rcx)' /tmp/cckJQjTJ.s:94117: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:94123: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:94125: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:94127: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rax)' /tmp/cckJQjTJ.s:95555: Error: no such instruction: vinserti128 $0x1,%xmm2,%ymm3,%ymm2'
/tmp/cckJQjTJ.s:95556: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0' /tmp/cckJQjTJ.s:95558: Error: no such instruction: vextracti128 $0x1,%ymm2,16(%rax)'
/tmp/cckJQjTJ.s:95560: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rax)' /tmp/cckJQjTJ.s:95650: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:95656: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:95658: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:95660: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rax)' /tmp/cckJQjTJ.s:95763: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:95766: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rcx)' /tmp/cckJQjTJ.s:95768: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:95770: Error: no such instruction: vextracti128 $0x1,%ymm0,48(%rcx)' /tmp/cckJQjTJ.s:95872: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:95878: Error: no such instruction: vextracti128 $0x1,%ymm0,16(%rax)' /tmp/cckJQjTJ.s:95880: Error: no such instruction: vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
/tmp/cckJQjTJ.s:95882: Error: no such instruction: `vextracti128 $0x1,%ymm0,48(%rax)'
make: *** [kmer_counter/raduls_avx2.o] Error 1

(k,x)-mers explanation

Hi, I'm not sure I understand how you split a super k-mer into (k,x)-mers (I guess x=1 is too simple and not explanatory enough) and why the subsets are non-overlapping.
Could you kindly provide a practical example with the real x value you use in the program? (Is it 3?)

Best regards

makefile_mac

Hi
I am willing to install KMC as part of the required dependencies for IVA and I am struggling with the installation of KMC on my Mac.
#1. I set the new path of gcc in the makefile_mac
#2. Considering the following error message:

[image: screen shot 2017-03-12 at 9 29 26 am]

I removed the -fopenmp option from makefile_mac

#3. I reran make -f makefile_mac but I finally got this error I cannot fix (sorry for that...)
[image: screen shot 2017-03-12 at 9 28 49 am]

Any advice would be much appreciated.

thanks ++
a

Using kmc_file.h from v3.0.1

I'm trying to use the C++ API but when I include kmc_file.h and compile, I get the following compilation error.

Compile command

g++ read_db.cpp -o read_db

#include <iostream>
#include "../KMC-3.0.1/kmc_api/kmc_file.h"

int main(){
  CKMCFile kmer_database;
  return 0;
}

Error,

Compilation started at Thu Apr 13 13:59:49

g++ read_kmc.cpp -o read_kmc
In file included from read_kmc.cpp:2:
In file included from ./../KMC-3.0.1/kmc_api/kmc_file.h:14:
./../KMC-3.0.1/kmc_api/kmer_defs.h:36:11: fatal error: 'ext/algorithm' file not found
        #include <ext/algorithm>
                 ^
1 error generated.

Compilation exited abnormally with code 1 at Thu Apr 13 13:59:49

gcc version 6.3.0
OSX El Capitan 10.11.6

kmer counts incorrect?

I've been testing out KMC with the following dummy sequence

>dummy
AATGGGTCCCTGTTTCGCGATAAAATGCCAATCGCTCTAAATATCGCGCTAGC

with the command kmc -ci0 -fm -k3 -cs300 dummy_genome.fa dummy_kmc kmc_temp

The result is

Stage 1: 100%
1st stage: 0.002018s
2nd stage: 0.001863s
Total    : 0.003881s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :           25
   No. of unique counted k-mers       :           25
   Total no. of k-mers                :           51
   Total no. of sequences             :            1
   Total no. of super-k-mers          :            0

while the sequence actually has 33 unique 3-mers.

I'm compiling with gcc 6.3.0. Maybe I'm doing something wrong...
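One possible explanation, assuming KMC counts canonical k-mers (a k-mer and its reverse complement are treated as the same k-mer), is that the reported number refers to distinct canonical 3-mers. The sketch below counts the 3-mers of the dummy sequence both ways; the function names canonical and kmer_stats are made up for illustration, and the lexicographically smaller of the two strands is used as the canonical form.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_stats(seq, k):
    """Return (total k-mers, distinct k-mers, distinct canonical k-mers)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return len(kmers), len(set(kmers)), len({canonical(x) for x in kmers})


seq = "AATGGGTCCCTGTTTCGCGATAAAATGCCAATCGCTCTAAATATCGCGCTAGC"
total, distinct, distinct_canonical = kmer_stats(seq, 3)
print(total, distinct, distinct_canonical)
```

For this sequence the sketch gives 51 k-mers in total and 25 distinct canonical 3-mers, which matches the "Total no. of k-mers" and "No. of unique k-mers" lines in the Stats output above.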

Option to NOT keep .kmc_pre and .kmc_suf outputs

Running KMC produces two output files: XXX.kmc_pre and XXX.kmc_suf.

Could you please add an option to not create/write or keep these files?

For example, often we only want the summary Stats:.

tag the 2.2 release in GitHub

Could you please tag the 2.2 release in GitHub? 2.1.1 is the latest release tagged here, but the KMC web site says that the current release is 2.2. Thanks.

avx2 detection broken.

Hello, when trying to compile KMC-3.0.1 I get some AVX2-related errors,
the same kind of error as mentioned in #17.

A quick dig through the code shows that AVX2 is enabled on Xeon processors,
but some Xeons do not provide AVX2 support.

See for example my /proc/cpuinfo
(running Docker on a Mac):

processor       : 5
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
stepping        : 4
cpu MHz         : 3699.593
cache size      : 10240 KB
physical id     : 5
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 5
initial apicid  : 5
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht pbe syscall nx pdpe1gb lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq dtes64 ds_cpl ssse3 cx16 xtpr pcid dca sse4_1 sse4_2 popcnt aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase erms xsaveopt arat
bugs            :
bogomips        : 7570.22
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
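Whether a CPU advertises AVX2 can be checked from the flags line of /proc/cpuinfo rather than from the model name. A minimal sketch of such a check; the function names cpu_flags and supports_avx2 are made up for illustration, the sample text is an abbreviated version of the listing above, and this is not how the KMC makefile performs its detection.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supports_avx2(cpuinfo_text):
    """True if the 'avx2' flag is listed."""
    return "avx2" in cpu_flags(cpuinfo_text)


# Abbreviated from the listing above: avx is present, avx2 is not.
sample = (
    "model name      : Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz\n"
    "flags           : fpu sse sse2 ssse3 sse4_1 sse4_2 popcnt avx f16c\n"
)
print(supports_avx2(sample))
```

On a Linux machine the real input would be open("/proc/cpuinfo").read().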

build error: "modf is not a member of std"

I am building on linux with g++ 5.4.0 and as/binutils 2.2.8.
My build fails with:
In file included from kmc_dump/nc_utils.cpp:15:0:
kmc_dump/nc_utils.h: In static member function 'static int CNumericConversions::Double2PChar(double, int, uchar*)':
kmc_dump/nc_utils.h:124:22: error: 'modf' is not a member of 'std'
double fractPart = std::modf(val, &ipart);
^
kmc_dump/nc_utils.h:124:22: note: suggested alternative:
In file included from /usr/include/features.h:346:0,
from /global/common/genepool/usg/languages/gcc/5.4.0/include/c++/5.4.0/x86_64-unknown-linux-gnu/bits/os_defines.h:39,
from /global/common/genepool/usg/languages/gcc/5.4.0/include/c++/5.4.0/x86_64-unknown-linux-gnu/bits/c++config.h:482,
from /global/common/genepool/usg/languages/gcc/5.4.0/include/c++/5.4.0/string:38,
from kmc_dump/nc_utils.h:14,
from kmc_dump/nc_utils.cpp:15:
/usr/include/bits/mathcalls.h:116:1: note: 'modf'
__MATHCALL (modf,, (Mdouble __x, Mdouble *__iptr));

Any idea what is happening? Thx...

Error message: Error: Cannot open temporary file tmp/kmc_00000.bin

Dear marekkokot,

I am new to Linux, so this problem may look silly. I downloaded the KMC3 file and ran make. In the bin folder, I can see the three files: kmc, kmc_dump and kmc_tools. But when I run the command line /home/niu/KMC-3.0.1/bin/kmc -k20 reads.fq kmers1 tmp, I get the error: Error: Cannot open temporary file tmp/kmc_00000.bin. I tried several times and got the same error. I also checked my tmp; it still has 50G of space. Could you help me figure out this problem? Thank you very much.

Best,
Tim

Extend the length of supported reads in fastq and fasta format

Bug reported by mail:

Is there a limit to the length of sequences in a fasta file for the 'kmc' command?

I run this command
kmc -k25 -ci1 -fa input/test.fasta output/test.res work

When the fasta file contains one sequence of 50,000 'A's, the program completes.
When the fasta file contains one sequence of 60,000 'A's, the programs halts with the message 'Error: Wrong input file!'.

So, I conclude that sequence lengths have a limit between 50,000 and 60,000 characters. Is that correct?

The limit is not strict (it depends on a couple of factors, yet it is enough for short reads).

The workaround in reported case is to use -fm (multifasta format), but in general long reads should be also supported in fasta and fastq format.

python binding

This is not an issue, but do you have any interest in developing a Python binding?
Thank you

Mask rare kmers, instead of filtering.

I am using kmc to check for rare k-mers before genome assembly. I was wondering if there's a way to mask rare k-mers (replace with N) instead of filtering them out or trimming them. Filtering leads to losing more data than needed. Trimming leads to reads of unequal lengths, which makes it difficult to detect positional biases in the reads, if any, after removing rare k-mers.

Error: Wrong input file!

Hi there.

I am currently trying to use KMC to count 36-mers in a bunch of files I have downloaded from the SRA, and for a lot of them KMC just returns the following error:
********Error: Wrong input file!

An example is the file SRR1047856 from the SRA. On one computer with Ubuntu 15.04, I have downloaded its corresponding SRA file and extracted the FASTA file out of it. Then, I ran the command:
/home/gholley/KMC/bin/kmc -k36 -ci3 -fa SRR1047856.fasta SRR1047856_comp .
and obtained in return:
********Error: Wrong input file!
I tried different parameters for k and ci. I tried limiting the number of threads and the RAM-only mode as well as extracting the FASTQ file from the SRA file instead of the FASTA file. Same error.

I thought that my SRA file might have been corrupted during the download, so I re-downloaded the FASTA file directly from the SRA on a different computer (with Ubuntu 16.10). My local KMC branch was up to date with this git repository. I tried the same command and I obtained the same error.

Any help with this would be welcome :)
Thank you!

Best, Guillaume.

kmc_dump option ci and cx are reversed

According to help:

-ci<value> - print k-mers occurring less than <value> times
-cx<value> - print k-mers occurring more of than <value> times

However, it seems like -ci<value> prints k-mers occurring more than <value> times. The same applies, in reverse, to -cx.

how to get counts of selected kmers

What is the best way to obtain a count of specified collection of kmers. Currently, I do a 'dump' and then extract the ones I want. Is there a better way?

bug: kmc_dump -ci and -cx inclusive/exclusive

The help for kmc_dump states that -ci excludes k-mers occurring fewer than the specified number of times and -cx excludes k-mers occurring more than the specified number of times. So to get the k-mers which occur exactly 10 times I should be able to specify -ci10 -cx10. However, this returns nothing.

If I specify -ci10 -cx11, as expected I get a list of kmers occurring 10 or 11 times.

KMC 3 stops during stage 2 when using BFC-corrected reads

Hello,

I'm trying to use KMC v3 with reads previously corrected with BFC. However, KMC stops during stage 2, there is no warning or error message, and the stats table shows only 0s. I ran Jellyfish v2 with the same corrected reads without a problem. Below are the commands that I'm using.

Correct reads
bash -c "bfc -s 200m -k33 -t 16 <(seqtk mergepe reads_1.fastq.gz reads_2.fastq.gz) <(seqtk mergepe reads_1.fastq.gz reads_2.fastq.gz) | gzip -1 > bfc-corrected.fastq.gz"

Count k-mers
kmc -k21 -ci2 -m100 -t12 -v bfc-corrected.fastq.gz bfc-corrected_kmc3 ./tmp

This is an example of a read pair after BFC correction:
@E00476:214:HHLTNALXX:8:1101:21217:1186 ec:Z:0_0:104_0_3:0_0
aTAACATATAATGTTTTTAAATAAATTTTAATTTAATTGGAATACTTATTTATTCAATAAAATTATTAACAATAATTTACCTCTATTTTGGTTTCAATTAAATAAATTTATAgAGAAATAaTAAATAAATAAAGCTTCTAACTTTATAATA
+
&???????????????????????????????????????+??????+??????++??+++???+???????????????+???+?????+????++??+++?+++??????%++++???%++?????+?+???+????+?++????++??
@E00476:214:HHLTNALXX:8:1101:21217:1186 ec:Z:0_0:103_0_3:0_0
aTATATTTTTGTTTATTATTTTAAGTATAGGTTAATTGAAGAATTATTTAATTTATTAAAATTAGATTATTTTGTTTATTATAAAATATTTTATTTTTTTTTTATAATTATAATTTTTTATTATTTTTTATTTgATTAAAATaTATGAATA
+
&?????????????????????????????????????????????????????++????????????????????+??????????+?????????????????++?++++++?????++????????++??%+??++++?#???+++?+

I would really appreciate any help.
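
One visible property of the example reads is that some bases are in lowercase. Whether this is related to the failure is not established in this report. A minimal sketch for ruling it out is to uppercase only the sequence lines before counting. It assumes strict 4-line FASTQ records (no wrapped sequences), and the function name is made up for illustration.

```python
def uppercase_fastq_sequences(lines):
    """Yield FASTQ lines with the 2nd line of every 4-line record uppercased."""
    for index, line in enumerate(lines):
        # Only the sequence line is changed. Header and quality lines stay
        # untouched, because quality strings can contain letters whose case matters.
        yield line.upper() if index % 4 == 1 else line


record = ["@read1 ec:Z:0_0\n", "aTAACgAG\n", "+\n", "&??????a\n"]
print("".join(uppercase_fastq_sequences(record)), end="")
```

If KMC behaves differently on the normalized file, that narrows the problem down to case handling. If not, lowercase bases can be excluded as the cause.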

Option to limit number of reads/kmers processed?

I have use cases where I want to run kmc on a very large FASTQ file but don't want it to read the whole file, as I only need the results for some estimations.

Would you be able to add an option that stops processing after -nr <value> reads (or -nk <value> kmers)?
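
Until such an option exists, one workaround is to cut the input before handing it to kmc. The sketch below writes the first N reads of a (possibly gzipped) FASTQ file to a new file. It assumes strict 4-line records, the helper names are hypothetical, and -nr/-nk above remain a feature request rather than existing switches.

```python
import gzip
from itertools import islice


def open_maybe_gzip(path):
    """Open a text file, transparently handling a .gz suffix."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def first_reads(lines, n_reads):
    """Return an iterator over the lines of the first n_reads FASTQ records."""
    return islice(lines, 4 * n_reads)


def write_subset(src, dst, n_reads):
    """Copy the first n_reads records of src into the uncompressed file dst."""
    with open_maybe_gzip(src) as fin, open(dst, "wt") as fout:
        fout.writelines(first_reads(fin, n_reads))
```

Note that the first reads of a sequencing run are not necessarily representative of the whole file, so estimates based on such a prefix should be treated with care.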

Option to write stdout Stats: to JSON file?

Stats:
   No. of k-mers below min. threshold :     12041315
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :     15114589
   No. of unique counted k-mers       :      3073274
   Total no. of k-mers                :    134782293
   Total no. of reads                 :      1091283
   Total no. of super-k-mers          :     15598454

It would be great if there was a -j <stats.json> option to write the above stdout table in JSON format to a specified file.

This would make it machine readable for pipelines etc.
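
In the meantime, the table can be converted after the fact. This is a minimal sketch that parses the `Stats:` block shown above from captured stdout and emits JSON. It assumes the layout stays `name : integer`, one entry per line. The function name and the choice to use the printed labels as JSON keys are illustrative only.

```python
import json
import re

# One "name : value" entry, e.g. "   Total no. of reads   :   1091283"
STAT_LINE = re.compile(r"^\s*(?P<name>[^:]+?)\s*:\s*(?P<value>\d+)\s*$")


def parse_stats(text):
    """Return the entries that follow the 'Stats:' line as a dict of ints."""
    stats = {}
    in_stats = False
    for line in text.splitlines():
        if line.strip() == "Stats:":
            in_stats = True
            continue
        if in_stats:
            match = STAT_LINE.match(line)
            if match:
                stats[match.group("name")] = int(match.group("value"))
    return stats


captured = """Stats:
   No. of k-mers below min. threshold :     12041315
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :     15114589
   No. of unique counted k-mers       :      3073274
   Total no. of k-mers                :    134782293
   Total no. of reads                 :      1091283
   Total no. of super-k-mers          :     15598454
"""
print(json.dumps(parse_stats(captured), indent=2))
```

Scraping stdout is fragile if the wording of the table changes between versions, which is part of the reason a native JSON option would be preferable.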

g++ (c++14) error: 'modf' is not a member of 'std'

Any suggestions?

g++-5 -Wall -O3 -m64 -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++14 -c kmc_tools/percent_progress.cpp -o kmc_tools/percent_progress.o
In file included from kmc_dump/nc_utils.cpp:15:0:
kmc_dump/nc_utils.h: In static member function 'static int CNumericConversions::Double2PChar(double, int, uchar*)':
kmc_dump/nc_utils.h:124:22: error: 'modf' is not a member of 'std'
   double fractPart = std::modf(val, &ipart);
                      ^
kmc_dump/nc_utils.h:124:22: note: suggested alternative:
In file included from /home/linuxbrew/.linuxbrew/include/features.h:368:0,
                 from /home/linuxbrew/.linuxbrew/Cellar/gcc/5.5.0_1/include/c++/5.5.0/x86_64-unknown-linux-gnu/bits/os_defines.h:39,
                 from /home/linuxbrew/.linuxbrew/Cellar/gcc/5.5.0_1/include/c++/5.5.0/x86_64-unknown-linux-gnu/bits/c++config.h:489,
                 from /home/linuxbrew/.linuxbrew/Cellar/gcc/5.5.0_1/include/c++/5.5.0/string:38,
                 from kmc_dump/nc_utils.h:14,
                 from kmc_dump/nc_utils.cpp:15:
/home/linuxbrew/.linuxbrew/include/bits/mathcalls.h:115:1: note:   'modf'
 __MATHCALL (modf,, (_Mdouble_ __x, _Mdouble_ *__iptr)) __nonnull ((2));
 ^
make: *** [kmc_dump/nc_utils.o] Error 1
make: *** Waiting for unfinished jobs....
In file included from kmc_dump/kmc_dump.cpp:17:0:
kmc_dump/nc_utils.h: In static member function 'static int CNumericConversions::Double2PChar(double, int, uchar*)':
kmc_dump/nc_utils.h:124:22: error: 'modf' is not a member of 'std'
   double fractPart = std::modf(val, &ipart);
                      ^
kmc_dump/nc_utils.h:124:22: note: suggested alternative:
kmc_api/kmc_file.cpp: In member function 'bool CKMCFile::BinarySearch(int64, int64, const CKmerAPI&, uint64&, uint32)':
kmc_api/kmc_file.cpp:1360:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
  if (index_start >= total_kmers)
                  ^
In file included from /home/linuxbrew/.linuxbrew/include/features.h:368:0,
                 from /home/linuxbrew/.linuxbrew/Cellar/gcc/5.5.0_1/include/c++/5.5.0/x86_64-unknown-linux-gnu/bits/os_defines.h:39,
                 from /home/linuxbrew/.linuxbrew/Cellar/gcc/5.5.0_1/include/c++/5.5.0/x86_64-unknown-linux-gnu/bits/c++config.h:489,
                 from /home/linuxbrew/.linuxbrew/Cellar/gcc/5.5.0_1/include/c++/5.5.0/iostream:38,
                 from kmc_dump/kmc_dump.cpp:15:
/home/linuxbrew/.linuxbrew/include/bits/mathcalls.h:115:1: note:   'modf'
 __MATHCALL (modf,, (_Mdouble_ __x, _Mdouble_ *__iptr)) __nonnull ((2));
 ^
make: *** [kmc_dump/kmc_dump.o] Error 1

Weird kmer counting

OK, so I have a little bit of an issue with KMC 3.0.1 on a Linux system. I have multiple fasta files (let's call them F1.fasta, F2.fasta, ..., Fn.fasta) which contain multiple genes each and I ran

kmc -k15 -fm -ci1 -cs1677215 F1.fasta F1.fasta temp/
kmc_dump F1.fasta F1.fasta.15.kmrs

This counts the 15mers within each fasta file. I then ran

cat F*.fasta > all.fasta
kmc -k15 -fm -ci1 -cs1677215 all.fasta all.fasta temp/
kmc_dump all.fasta all.fasta.15.kmrs

This concatenates all the fasta files together and counts the 15mers in there. Now, there is a set of 15mers that are found in the individual fasta files (let's call one of them kmer X) but not in the all.fasta file. This baffles me, as it shouldn't be possible. How can a kmer be found in an individual fasta file, but not when we concatenate the fasta files together?

I have a total of about 5500 fasta files and X appears in them <1 time (typically).

To dig even further, I ran KMC 2.3.0 on the same all.fasta file and got different results. Those results were more in line with those of the individual KMC 3.0.1 runs (X was found in the KMC 2.3.0 run). Additionally, I should note that both KMC 2.3.0 and KMC 3.0.1 find the same number of unique 15mers; however, the 15mers that are flip-flopped (5 in total) do not have the same counts. This makes me think there may be an issue with the way a kmer is encoded inside the database and then decoded in the dump. That is, if I decoded the DB to produce 15mer X, it wasn't X that was encoded there to begin with, but some other 15mer Y (stated differently: encode(Y) = E, decode(E) = X). In any case, something changed between 2.3.0 and 3.0.1 (possibly 3.0.0) to produce this result.

I have the all.fasta file that was used to produce the above results. It's 550MB in size (165MB compressed). Github won't let me attach it here, so if you need it, please do ask (maybe I could email it to you?).

One final note: I first tested this with the executable offered on your website (http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download), which is stated as 3.0. The odd results made me compile 3.0.1 from scratch (I didn't see an available executable on your GitHub), and I got the same results. Because of the bug that has the 3.0.1 kmc executable still printing 3.0.0 as its version, I'm not sure which version you have on your website. If it is 3.0.0, then I tested that as well; if not, I did not test 3.0.0. That said, please fix that little version bug in your next release (I'm sure you already have, as you closed the request for this); from a user perspective, it's really annoying not to know which version we're actually on.

Any insight on this issue would be greatly appreciated!
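
The comparison described above (k-mers present in one dump but not the other, plus k-mers whose counts differ) can be reproduced with a short script. This is a sketch only. It assumes each dump has one `KMER<whitespace>COUNT` pair per line and that both dumps fit in memory. The function names are hypothetical.

```python
def load_dump(lines):
    """Parse dump lines into a {kmer: count} dict, skipping malformed lines."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            table[fields[0]] = int(fields[1])
    return table


def compare_dumps(a, b):
    """Return (only in a, only in b, {kmer: (count_a, count_b)} where counts differ)."""
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    differing = {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}
    return only_a, only_b, differing


old = load_dump(["AAA 3", "AAC 1", "AAG 2"])
new = load_dump(["AAA 3", "AAG 5", "AAT 1"])
print(compare_dumps(old, new))
# prints (['AAC'], ['AAT'], {'AAG': (2, 5)})
```

Applied to the 2.3.0 and 3.0.1 dumps of all.fasta, the first two lists would name the flip-flopped 15mers directly.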

Help: KMC configuration

Hi

We are analysing bioinformatics tools for k-mer counting, and KMC is one of them. We have gone through the readme file and it is very helpful. Because this is an analysis, we want to be sure about the details, so it would be great if you could help us validate the following details of KMC.

Data structure and Sorting Algo:
Array, Priority queue, Radix sort, Counting sort

Approach:
Two disk based, Modified minimum sub-string partitioning (signature).

The limit of k-size : less than 257
Supports online k-mer frequency retrieval : No
Supports compressed file processing : Yes

Thanks
Tarang

asmlib/vectorclass: Licensing issues and clarification

[DISCLAIMER: I am not a lawyer and the following are only my interpretations of the licensing terms -- hence no legal advice but only well-intended suggestions/remarks.]

The README.md states that

KMC software distributed under GNU GPL 2 licence.

yet it uses asmlib (optionally) and vectorclass which are both GPL-3.0+ licensed.
Sadly GPL-2.0 and GPL-3.0 are not compatible, see
https://www.gnu.org/licenses/gpl-faq.html#AllCompatibility.
Hence, to use vectorclass, KMC would have to be made available via GPLv3, i.e., licensed under one of GPL-2.0+, GPL-3.0, or GPL-3.0+.

Furthermore I also find the following from KMC's readme problematic/misleading:

Note: asmlib is free only for non commercial purposes. If needed, you can contact the author of asmlib or compile KMC without asmlib.

Note: for commercial usage of asmlib follow the instructions in 'License conditions' (http://www.agner.org/optimize/asmlib-instructions.pdf) or compile KMC without asmlib. In case of doubt, please consult the original documentations.

vcl is under the licence GNU GPL 3 or higher Node: for commercial usage of vcl follow the instructions in 'License' section (http://www.agner.org/optimize/vectorclass.pdf)

But as asmlib/vectorclass can be used in terms of the GPL-3.0, no restrictions concerning commercial/non-commercial usage should be applicable, see
https://www.gnu.org/licenses/gpl-faq.html#NoMilitary
and
https://www.gnu.org/licenses/gpl.html#section7.
(IMO, the use of the sole term "Commercial licenses" from the asmlib and vectorclass license texts is also misleading, as they (to me) kind of suggest the interpretation GPL=free=non-commercial which is wrong. "Alternative custom/proprietary/??? license" might have been a better choice...)


Just for reference/context the license information for asmlib and vectorclass:

From http://www.agner.org/optimize/asmlib-instructions.pdf:

10 License conditions

These software libraries are free: you can redistribute the software and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the license, or any later version.

Commercial licenses are available on request to www.agner.org/contact.

This software is distributed in the hope that it will be useful, but without any warranty. See
the file license.txt or www.gnu.org/licenses for the license text.

From http://www.agner.org/optimize/vectorclass.pdf:

License

The VCL vector class library has a dual license system. You can use it for free in
open source software, or pay for using it in proprietary software.

You are free to copy, use, redistribute and modify this software under the terms of
the GNU General Public License as published by the Free Software Foundation,
version 3 or any later version. See the file license.txt.

Commercial licenses are available on request.

Union resulting in smaller database than the individual file

Hi,

I was using KMC (3.0.0) for producing kmers of the unitig files (generated by BCALM). The union of the databases produced a smaller resulting data-base.
Is it an anomaly?

Datasets:

  1. ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/004/SRR1291024/SRR1291024_1.fastq.gz
  2. ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/000/SRR1291070/SRR1291070_1.fastq.gz

The unitigs for paired-end (_1 and _2) files were generated using BCALM.

Command to produce individual kmer data-bases for the two files:
./KMC/bin/kmc -k63 -r -ci1 -fa SRR1291024.unitigs.fa SRR1291024.kmers .
./KMC/bin/kmc -k63 -r -ci1 -fa SRR1291024.unitigs.fa SRR1291070.kmers .

Command to produce union:
./KMC/bin/kmc_tools simple SRR1291024.kmers -ci1 SRR1291070.kmers -ci1 union kmers_superset -ci1

Size of the resultant individual data-bases:
SRR1291024.kmers.kmc_pre (66M)
SRR1291024.kmers.kmc_suf (40G)
SRR1291070.kmers.kmc_pre (66M)
SRR1291070.kmers.kmc_suf (40G)

Size of the resultant union data-base:
kmers_superset.kmc_pre (33M)
kmers_superset.kmc_suf (39G)

Are the results correct?

build error: 'modf' is not a member of 'std'

I am building on linux with g++ 5.4.0 and as/binutils 2.2.8.
My build fails with:

In file included from kmc_dump/nc_utils.cpp:15:0:
kmc_dump/nc_utils.h: In static member function 'static int CNumericConversions::Double2PChar(double, int, uchar*)':
kmc_dump/nc_utils.h:124:22: error: 'modf' is not a member of 'std'
   double fractPart = std::modf(val, &ipart);
                      ^
kmc_dump/nc_utils.h:124:22: note: suggested alternative:
In file included from /usr/include/features.h:346:0,
                 from /global/common/genepool/usg/languages/gcc/5.4.0/include/c++/5.4.0/x86_64-unknown-linux-gnu/bits/os_defines.h:39,
                 from /global/common/genepool/usg/languages/gcc/5.4.0/include/c++/5.4.0/x86_64-unknown-linux-gnu/bits/c++config.h:482,
                 from /global/common/genepool/usg/languages/gcc/5.4.0/include/c++/5.4.0/string:38,
                 from kmc_dump/nc_utils.h:14,
                 from kmc_dump/nc_utils.cpp:15:
/usr/include/bits/mathcalls.h:116:1: note:   'modf'
 __MATHCALL (modf,, (_Mdouble_ __x, _Mdouble_ *__iptr));

Any idea what is happening? Thx...

Progress/warnings to STDERR

Could KMC progress/warnings/errors be sent to STDERR instead of STDOUT? KMC would then follow Linux conventions, which would make pipelining KMC commands easier and more intuitive.

Current Behaviour

kmc_tools sends progress to STDOUT:

$ kmc_tools transform my_kmer_db -ci4 dump /dev/stdout | head
in1: 0% AAAAAAAAAAATGATGGGCATTTTAGAAGGGCATTTCAGGTTCATTGAAAAATTATTTTAGTAACCCTAGT 10
AAAAAAAAAAATGATGGGCATTTTAGAAGGGCATTTCAGGTTCATTGAAAATTATTTTAGTAAACCCTAGT 12
AAAAAAAAAATGATGGGCATTTTAGAAGGGCATTTCAGGTTCATTGAAAAATTATTTTAGTAACCCTAGTT 10
AAAAAAAAAATGATGGGCATTTTAGAAGGGCATTTCAGGTTCATTGAAAATTATTTTAGTAAACCCTAGTT 11
AAAAAAAAACCCTAGTCATTTTATCCTAACCTAACGCAGTCGTTAGCTTCGATCCAAAATCCCCTATTGTT 15
AAAAAAAAACGTCCATGACCATTGGTCGTCTAACAGCCACACTGGTAGCTAGTCTTGTACTCCATGCAAAT 16
AAAAAAAAACTAGGAAAAAAATAGACCACAAACAGAGTGGACATCAACTTAGATGTGACATAACTATGTCA 11
AAAAAAAAACTAGGACAAAAAAATAGACCACAAACAGAGTGGACATCAACTTAGATGTGACATAACTATGT 11
AAAAAAAAACTAGGGTTTCGTAGTAGCAATCTTCGCACTCCGGAAATTCTACCGAGGCAAACAATAACTAT 12
AAAAAAAAAGAAAAGAAAAGGTTAGCTACAGACGTGTGATGAATCAAGTGCTTGAGCTAGTTAGCTTTGTT 12

This means that to put kmc_tools into a pipeline, you need to either

  • Do some funky redirection trickery:
$ kmc_tools transform my_kmer_db -ci4 dump /dev/stderr 2>&1 > /dev/null | head
  • Ask kmc_tools to not report progress:
$ kmc_tools -hp transform my_kmer_db -ci4 dump /dev/stdout | head

Desired/Conventional Behaviour

On Linux, the convention is to send progress/warnings/errors to STDERR and results to STDOUT. This is so that the expected output of a command can be easily piped into another command (assuming no seeking is required). This is very powerful and can be used to avoid disk IO.

Therefore a change to sending progress/errors/warnings to STDERR would allow a more simplified approach to pipelining KMC commands:

kmc_tools transform my_kmer_db -ci4 dump /dev/stdout  2>progress.log | head

Issue #23 was where this was originally raised.
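
The requested convention can be illustrated with a toy program that writes results to one stream and progress to another. This is only a sketch of the stdout/stderr split being asked for, not KMC code. The function name and the progress format are made up.

```python
import sys


def dump_with_progress(records, out=None, log=None):
    """Write 'KMER<TAB>COUNT' results to out and progress messages to log."""
    out = sys.stdout if out is None else out
    log = sys.stderr if log is None else log
    total = len(records)
    for done, (kmer, count) in enumerate(records, start=1):
        print(f"{kmer}\t{count}", file=out)                   # pipeable result
        print(f"progress: {100 * done // total}%", file=log)  # diagnostics only


if __name__ == "__main__":
    dump_with_progress([("AAA", 10), ("AAC", 12)])
```

With this split, `script.py 2>progress.log | head` sees only the result lines, which mirrors the pipeline shown above.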

Max. counter value parameters not always respected

Setting Max. counter value equal to my UINT_MAX, AKA -cs4294967295 shows:

********** Used parameters: **********
[...]
Max. counter value           : 4294967295
[...]

However, the max counter remains at the default of 255:

$ cat my_reads.32mers | cut -f2 | sort -n | tail -n1
255

Setting this parameter works up to at least 1000000, which exceeds my USHRT_MAX, so it's not clear what the actual limit is.
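
As a cross-check of the numbers involved, the sketch below finds the largest count in a dump and reports whether it coincides with a common unsigned-integer maximum. The limits are the usual values for 8-, 16- and 32-bit unsigned types, which is an assumption about the platform, and the function names are hypothetical. It says nothing about how KMC stores counters internally.

```python
# Typical maxima of unsigned char, unsigned short and unsigned int.
LIMITS = {"UCHAR_MAX": 2**8 - 1, "USHRT_MAX": 2**16 - 1, "UINT_MAX": 2**32 - 1}


def max_count(lines):
    """Return the largest count found in 'KMER<whitespace>COUNT' lines."""
    return max(int(line.split()[1]) for line in lines if line.strip())


def matching_limits(value):
    """Return the names of the limits that are exactly equal to value."""
    return [name for name, limit in LIMITS.items() if limit == value]


demo = ["AAA\t3", "AAC\t255", "AAG\t17"]
top = max_count(demo)
print(top, matching_limits(top))
# prints 255 ['UCHAR_MAX']
```

An observed maximum of exactly 255 when 4294967295 was requested is consistent with the counter falling back to its default, as described above, rather than with the data genuinely peaking there.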

kmc_tools complex does not accept non-alphanumeric characters in out_db_path

Hi,
Here are some examples that should work but produce errors

operations_definition_file_1 (with dot)

INPUT:
sample0 = sample0.31mers 
sample1 = sample1.31mers 
OUTPUT:
samples_union.31mers=sample0+sample1
kmc_tools complex
Error: wrong line format, line: 5

operations_definition_file_2 (with slash)

INPUT:
sample0 = sample0.31mers 
sample1 = sample1.31mers 
OUTPUT:
samples_union/samples_union=sample0+sample1
kmc_tools complex
Error: wrong line format, line: 5

New point release (3.0.2).

Hi, I was wondering if we could have a point release incorporating the recent updates. The version on our uni's compute cluster won't be updated otherwise (policy) and several of us here are dependent on the new masking feature in kmc_tools.

set all counts to a specified value

@marekkokot Is there a way to set all kmer counts in a database to a specific value? If not, would that be hard to add? Use case: I have a set of samples, and a kmc kmer database for each sample. I want to make a database that, for each kmer, records which samples have it (basically a colored de Bruijn graph). If there are <=64 samples, I can assign the count 2^i to the kmers from sample i; the sum of counts then gives the set of samples. If 64<n_samples<=128, I can represent this with two kmc kmer databases per sample, etc.
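
The encoding proposed above can be sketched in a few lines: sample i contributes the value 2^i to each of its kmers, the contributions are summed, and the set bits of the sum give the samples. This only illustrates the arithmetic on in-memory sets with made-up function names. Whether a KMC database can store counts that large is a separate question not addressed here.

```python
def merge_samples(per_sample_kmers):
    """Sum 2**i over every sample i that contains a kmer."""
    combined = {}
    for i, kmers in enumerate(per_sample_kmers):
        for kmer in kmers:
            combined[kmer] = combined.get(kmer, 0) + (1 << i)
    return combined


def decode_samples(count):
    """Return the sample indices whose bit is set in count."""
    return [i for i in range(count.bit_length()) if (count >> i) & 1]


samples = [{"AAA", "AAC"}, {"AAC"}, {"AAC", "AAG"}]
merged = merge_samples(samples)
print(sorted(merged.items()))
# prints [('AAA', 1), ('AAC', 7), ('AAG', 4)]
print(decode_samples(merged["AAC"]))
# prints [0, 1, 2]
```

The sum equals a bitwise OR only because each sample contributes its power of two at most once per kmer, which is why every count in a sample's database would first have to be set to the same value.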
