
Comments (4)

marekkokot commented on August 24, 2024

Hi,
Counting k-mers in a single read is a rather trivial task, so I suppose that I don't fully understand your needs.
What kind of queries would you like to be able to perform on the resulting k-mer database?
Perhaps you need a tool to index k-mers, e.g.: http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0133198&type=printable

from kmc.

roth12 commented on August 24, 2024

Hi marekkokot,
thank you for the reply. Until now, I have always used KMC2 to count k-mers overall (total occurrences of each k-mer in a dataset), using kmer_counter (on Windows) with a command like this:
kmer_counter -fm -k10 -m8 -cs500 input_file name_output_db path
and afterwards kmc_dump on the resulting database to dump the information.

Now I need to process a FASTQ/FASTA file one read at a time, getting as output the occurrences of k-mers for that read.
e.g.
Input FASTQ file:
read_1.
AGCTGAGATTGTCAGTCGGCGCGATCCAGGACCACG...
read_2.
CGAACGATGTCGCACAGCCCGTCTTCTCGTGCCCAG...
....
read_N.
AGTAGGAATGTGATTGCCTGACCCTACACTGCCGAC...

Suppose K=6, I want to get an output like this:
read 1:
AAAAAA #occurrences_in_read1
AAAAAG #occurrences_in_read1
...
TTTCAA #occurrences_in_read1

read 2:
AAAAAA #occurrences_in_read2
AAAAAC #occurrences_in_read2
...
TTTCAA #occurrences_in_read2

...and so on until read_N.
This way, I can dump the database to find out which k-mers are most common in a given read, and process the dumped file in a script.
I'm a novice with these tools, so this is probably really trivial.


marekkokot commented on August 24, 2024

Hi,
Such functionality is not part of KMC. Yet I will try to help you.
Are you sure you should use the -fm switch (it is designed for multi-line FASTA files)? I am asking because you mentioned "reads", while multi-FASTA files usually do not contain reads but longer sequences (e.g. chromosomes). -fm will also work for the plain FASTA format, but -fa is recommended in that case. If you really work with multi-FASTA files, the simplest option would be to split the input file into N files (one file per sequence). If you work with FASTA/FASTQ files that contain reads, which are usually quite short, that approach would also work but does not seem to be the best option. Counting the k-mers present in short sequences like reads is quite a simple task, and basic programming skills should be enough to write a program with the functionality you need (I do not know how important performance is for you; if you need, I can give you some suggestions, time permitting).
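
The "split the input into N files" option could look roughly like the sketch below. It is only an illustration, not part of KMC: the function name `split_multifasta` and the `seq_<n>.fasta` naming scheme are hypothetical, and it assumes every record starts with a `>` header line.

```python
from pathlib import Path
from typing import List


def split_multifasta(input_path: str, output_dir: str) -> List[Path]:
    """Write each record of a multi-FASTA file to its own file.

    Hypothetical helper: output files are named seq_1.fasta, seq_2.fasta, ...
    Lines that appear before the first '>' header are ignored.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    out = None
    try:
        with open(input_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # A new header closes the previous record's file.
                    if out is not None:
                        out.close()
                    path = out_dir / f"seq_{len(written) + 1}.fasta"
                    out = open(path, "w")
                    written.append(path)
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Each resulting file could then be counted separately with the same command used for the whole dataset. For files with many short reads this creates one file per read, which is why it does not seem to be the best option there.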

Maybe if you could give me some context/applications for counting k-mers in each read separately, I could be more helpful. If you have any questions, don't hesitate to ask.
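
For reference, a small program of the kind mentioned above might look like this minimal sketch. It does not use KMC at all; the names `iter_fasta`, `iter_fastq` and `count_kmers` are hypothetical. It assumes FASTQ records occupy exactly four lines each, skips k-mers containing symbols other than A, C, G, T, and counts k-mers exactly as they appear in the read (as far as I know, KMC counts canonical k-mers by default unless -b is given, so the numbers may differ).

```python
import io
from collections import Counter
from typing import Iterator, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA, joining multi-line sequences."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def iter_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs, assuming strict 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality string, unused here
        yield header.strip()[1:], seq


def count_kmers(seq: str, k: int) -> Counter:
    """Count every k-mer of one read; k-mers with non-ACGT symbols are skipped."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory FASTQ example; a real run would open a file instead.
    demo = io.StringIO("@read_1\nACGTACGT\n+\nIIIIIIII\n")
    for name, seq in iter_fastq(demo):
        print(f"{name}:")
        for kmer, n in sorted(count_kmers(seq, 4).items()):
            print(f"{kmer} {n}")
```

This is a plain per-read loop with a hash map, so it is easy to adapt but will not match KMC's speed on very large inputs.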


roth12 commented on August 24, 2024

Yes, I know, it's a very simple task.
I've already written a script for k-mer counting, but my supervisors want me to use KMC2, because with the largest files (whole genomic sequences of about 100 GB, or big datasets in general) my own counting could be slow.
Anyway, I'll run my tests, and if I hit other problems with the tool, I'll ask you here.

Thank you very much for your support.

