
Comments (5)

marekkokot commented on August 24, 2024

Hi,

Thanks for using KMC.

Short answer:

It is possible for the output database of a union operation to be smaller than the sum of the input sizes.

Explanation (long answer):

k-mers are stored in a KMC database in compacted form, and the optimal compaction level can only be determined once the number of k-mers is known. The problem is that the level must be fixed before the first write to the KMC database, but KMC does not know the total number of k-mers until counting is finished, so it makes an estimate, which may or may not be optimal.
On the other hand, when kmc_tools reads a KMC database, it knows exactly how many k-mers the database contains, so it may change the compaction level. kmc_tools still does not know how many k-mers there will be in the output database, so its choice may not be optimal either, but for a union operation it has a quite good chance of guessing the optimal compaction level.
I have made some calculations based on what you sent me, and this seems to be the case here.
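
To make the trade-off above concrete, here is a minimal sketch of a toy cost model, not KMC's actual on-disk format. It assumes a database is one lookup table with 4^p eight-byte entries (p being the prefix length, i.e. the compaction level) plus, for each k-mer, the remaining k - p symbols packed at 2 bits each and a fixed-size counter. The function names, the 4-byte counter, and the search range 0..12 are all assumptions made for illustration; the real format has additional structure (bins, headers) that is ignored here.

```shell
#!/bin/sh
# Toy model (an assumption, not KMC's real layout):
#   size = 8 * 4^p  +  n * (ceil((k - p) / 4) + counter_bytes)
# Arguments: n (number of k-mers), k, p (prefix length), counter_bytes
toy_db_size() (
  n=$1; k=$2; p=$3; c=$4
  lut=$(( 8 * (1 << (2 * p)) ))          # 4^p entries, 8 bytes each
  suffix_bytes=$(( (k - p + 3) / 4 ))    # 2 bits per symbol, rounded up to bytes
  echo $(( lut + n * (suffix_bytes + c) ))
)

# Try every prefix length from 0 to 12 and print the one with the smallest toy size.
# Arguments: n (number of k-mers), k, counter_bytes
best_prefix_len() (
  n=$1; k=$2; c=$3
  best_p=0
  best_size=$(toy_db_size "$n" "$k" 0 "$c")
  p=1
  while [ "$p" -le 12 ]; do
    size=$(toy_db_size "$n" "$k" "$p" "$c")
    if [ "$size" -lt "$best_size" ]; then
      best_p=$p
      best_size=$size
    fi
    p=$(( p + 1 ))
  done
  echo "$best_p"
)

best_prefix_len 1000 63 4        # few k-mers: a short prefix wins
best_prefix_len 100000000 63 4   # many k-mers: a longer prefix wins
```

Under these assumptions the two calls print 3 and 11: the best prefix length depends on the k-mer count, so a level chosen from an estimate can produce a larger file than one chosen from the exact count.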

Still, I would like to check whether this is really the case. Could you please send me the exact command that you used (I have never used BCALM before)?

Final note:

Do you need separate KMC databases for both files (SRR1291024.kmers, SRR1291070.kmers), or do you only need the final result, i.e. kmers_superset?
If you do not need the intermediate files SRR12910*.kmers, but only the final one, it may be faster to count k-mers for both files in a single run, i.e. prepare a file input_files.txt:

SRR1291024.unitigs.fa
SRR1291070.unitigs.fa

and then run:

./KMC/bin/kmc -k63 -r -ci1 -fa @input_files.txt kmers_superset_pattern .

Of course, if you need those files for other computations, your pipeline is OK.
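
The file list for the single-run approach can also be generated from the shell. This is a small sketch: `write_input_list` is a hypothetical helper name, and the kmc command in the trailing comment is the one given above, left commented out so the snippet runs without KMC installed.

```shell
#!/bin/sh
# Hypothetical helper: write each given path on its own line to the list file.
# Arguments: output list file, then one or more input paths
write_input_list() (
  out=$1
  shift
  printf '%s\n' "$@" > "$out"
)

write_input_list input_files.txt SRR1291024.unitigs.fa SRR1291070.unitigs.fa

# Then count k-mers for both files in one run (command from this comment):
# ./KMC/bin/kmc -k63 -r -ci1 -fa @input_files.txt kmers_superset_pattern .
```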
BTW, you may check the correctness of the kmc_tools union operation on your own if you want.
The first step is to run k-mer counting as I showed above, and then run the command:

./KMC/bin/kmc_tools compare kmers_superset kmers_superset_pattern

The compare operation is undocumented and not officially supported, but it works, and we sometimes use it to check the correctness of results.
It simply checks whether both databases are equal (contain the same k-mers with the same counters).

But if you send me all the commands that you ran, I will check it myself and let you know (I have already downloaded the files SRR1291024_1.fastq.gz SRR1291024_2.fastq.gz SRR1291070_1.fastq.gz SRR1291070_2.fastq.gz).

from kmc.

Ritu-Kundu commented on August 24, 2024

Hi,

Thanks a lot for the response and detailed explanation.
Following are the commands that I have used:


For BCALM:
ls -1 SRR1291024*.fastq.gz > SRR1291024 ; ./build/bcalm -nb-cores 12 -in SRR1291024 -kmer-size 64 -abundance-min 4

Same for SRR1291070:
ls -1 SRR1291070*.fastq.gz > SRR1291070 ; ./build/bcalm -nb-cores 12 -in SRR1291070 -kmer-size 64 -abundance-min 4

It produced a unitig file for each -- SRR1291024.unitigs.fa and SRR1291070.unitigs.fa


For KMC:
First produced individual databases (I need them)
./KMC/bin/kmc -k63 -r -ci1 -fa SRR1291024.unitigs.fa SRR1291024.kmers .
./KMC/bin/kmc -k63 -r -ci1 -fa SRR1291070.unitigs.fa SRR1291070.kmers .

Taking union of the individual databases:
./KMC/bin/kmc_tools simple SRR1291024.kmers -ci1 SRR1291070.kmers -ci1 union kmers_superset -ci1


marekkokot commented on August 24, 2024

Hi,
Thanks, but unfortunately the command:

BCALM/bcalm/build/bcalm -nb-cores 12 -in SRR1291024 -kmer-size 64 -abundance-min 4

gives me:

BCALM 2, git commit 4bd3ce7
due to a currently known bug, bcalm with a kmer multiple of 4 is temporarily unavailable. please retry with another k-mer size


Ritu-Kundu commented on August 24, 2024

My bad! k is 63 (not 64). Sorry about that.


marekkokot commented on August 24, 2024

@Ritu-Kundu
No problem :)
I've reproduced your pipeline and checked whether the result is OK.
And it is :)
BTW, if you want to check how many k-mers are in a KMC database, you may use:

./KMC/bin/kmc_tools info SRR1291024.kmers
./KMC/bin/kmc_tools info SRR1291070.kmers
./KMC/bin/kmc_tools info kmers_superset

The output contains a few pieces of information that are not interesting to anyone who is not a KMC developer, such as the database format, the number of bins, or lut_prefix_len (this is the compaction level mentioned in my first response).
But some of the rest may be useful.
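
When several databases need inspecting, the three info commands above can be wrapped in a loop. This is a minimal sketch: `print_db_info` and the `KMC_TOOLS` variable are hypothetical names, and the default path is simply the one used in this thread. The output of `kmc_tools info` is passed through unchanged, since its exact format is not described here.

```shell
#!/bin/sh
# Path to kmc_tools; the default matches the commands above, override as needed.
KMC_TOOLS=${KMC_TOOLS:-./KMC/bin/kmc_tools}

# Hypothetical helper: print a header line and the `info` output for each database.
print_db_info() {
  for db in "$@"; do
    echo "== $db =="
    "$KMC_TOOLS" info "$db"
  done
}

# Example call (requires KMC and the databases from this thread):
# print_db_info SRR1291024.kmers SRR1291070.kmers kmers_superset
```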

Anyway I am closing this issue as it seems everything works correctly.
Thanks again for using KMC. If you have any further doubts, don't hesitate to ask :)
