Hi, I need to count k-mers for each read in a metagenomic FASTQ or FASTA file. Is

Count K-mers read by read about kmc HOT 4 CLOSED

refresh-bio commented on August 24, 2024

Count K-mers read by read


Comments (4)

marekkokot commented on August 24, 2024

Hi,
Counting k-mers in a single read is a rather trivial task, so I suppose I don't fully understand your needs.
What kind of queries would you like to be able to perform on the resulting k-mer database?
Perhaps you need a tool to index k-mers, e.g.: http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0133198&type=printable


roth12 commented on August 24, 2024

Hi marekkokot,
thank you for the reply. Until now, I have always used KMC2 to count k-mers overall (total occurrences of each k-mer in a dataset), using kmer_counter (on Windows) with a command like this:
kmer_counter -fm -k10 -m8 -cs500 input_file name_output_db path
and afterwards kmc_dump on the resulting database to dump the information.

Now I need to process a FASTQ/FASTA file one read at a time, getting as output the occurrences of k-mers for that read.
e.g.
Input FASTQ file:
read_1.
AGCTGAGATTGTCAGTCGGCGCGATCCAGGACCACG...
read_2.
CGAACGATGTCGCACAGCCCGTCTTCTCGTGCCCAG...
....
read_N.
AGTAGGAATGTGATTGCCTGACCCTACACTGCCGAC...

Suppose K=6, I want to get an output like this:
read 1:
AAAAAA #occurrences_in_read1
AAAAAG #occurrences_in_read1
...
TTTCAA #occurrences_in_read1

read 2:
AAAAAA #occurrences_in_read2
AAAAAC #occurrences_in_read2
...
TTTCAA #occurrences_in_read2

...and so on until read_N.
In this way, I can dump the db to find out which k-mers are most common in a given read, and use the dumped file in a script for further processing.
I'm a novice with these tools, so this is probably really trivial.


marekkokot commented on August 24, 2024

Hi,
Such functionality is not a part of KMC. Yet I will try to help you.
Are you sure you should use the -fm switch (it is designed for multiline FASTA files)? I am asking because you mentioned "reads", while multi-FASTA files usually do not contain reads but longer sequences (e.g. chromosomes). -fm will also work for FASTA format, but -fa is recommended in that case. If you really work with multi-FASTA files, the simplest option would be to split the input file into N files (one file per sequence). If you work with FASTA/FASTQ files that contain reads, which are usually quite short, that approach would also work but does not seem to be the best option. Counting the k-mers present in short sequences like reads is quite a simple task, and basic programming skills should be enough to write a program with the functionality you need (I do not know how important performance is for you; if you need, I can give you some suggestions, time permitting).
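
For the multi-FASTA case, the splitting step suggested above could look roughly like the following Python sketch. It is only an illustration: the function name `split_fasta` and the `seq_000001.fa` output naming are made up for this example and are not part of KMC. Each resulting file could then be counted on its own with the same kmer_counter command used for the whole dataset.

```python
from pathlib import Path
from typing import List


def split_fasta(fasta_path: str, out_dir: str) -> List[Path]:
    """Write each FASTA record to its own numbered file and return the paths.

    Hypothetical helper for illustration; file naming is an assumption.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    current = None  # file handle of the record currently being written
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    # A header line starts a new record: switch output files.
                    if current is not None:
                        current.close()
                    path = out / f"seq_{len(paths) + 1:06d}.fa"
                    current = open(path, "w")
                    paths.append(path)
                if current is not None:
                    current.write(line)
    finally:
        if current is not None:
            current.close()
    return paths
```

With millions of short reads this would create millions of tiny files, which is why it is probably only sensible for a modest number of long sequences.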

Maybe if you could give me some context/applications for counting k-mers in each read separately, I could be more helpful. If you have any questions, don't hesitate to ask.
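
For short reads, the "basic programming" route mentioned above might look like the minimal Python sketch below. It is not KMC code and makes several assumptions: FASTQ records are exactly four lines each, FASTA input starts with a `>` header, windows containing symbols other than A/C/G/T are skipped, and k-mers are counted as they appear on the read (reverse complements are not merged). The function names are hypothetical. The output follows the per-read layout asked for earlier in the thread.

```python
import io
import sys
from collections import Counter
from typing import Iterator, TextIO, Tuple


def read_sequences(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA or 4-line FASTQ stream."""
    first = handle.readline()
    if not first:
        return
    if first.startswith("@"):
        # FASTQ: assumes exactly four lines per record (no wrapped sequences).
        header = first
        while header.strip():
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            yield header[1:].strip(), seq.upper()
            header = handle.readline()
    else:
        # FASTA: a sequence may span several lines.
        name, chunks = first[1:].strip(), []
        for line in handle:
            if line.startswith(">"):
                yield name, "".join(chunks).upper()
                name, chunks = line[1:].strip(), []
            else:
                chunks.append(line.strip())
        yield name, "".join(chunks).upper()


def count_kmers(seq: str, k: int) -> Counter:
    """Count k-mers in one sequence, skipping windows with non-ACGT symbols."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def dump_per_read(handle: TextIO, k: int, out: TextIO) -> None:
    """Write one block per read: the read name, then 'KMER count' lines."""
    for name, seq in read_sequences(handle):
        out.write(f"{name}:\n")
        for kmer, n in sorted(count_kmers(seq, k).items()):
            out.write(f"{kmer} {n}\n")
        out.write("\n")


# Tiny demo on an in-memory FASTA; pass an open file handle for real data.
dump_per_read(io.StringIO(">read_1\nAAAAAAG\n"), 6, sys.stdout)
```

Because only one read is held in memory at a time, memory use stays small even for very large files; speed is a separate question and a pure-Python loop like this may be slow on 100 GB inputs.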


roth12 commented on August 24, 2024

Yes, I know, it's a very, very simple task.
I've already written a script for k-mer counting, but orders from the top are that I use KMC2, because with the largest files (whole genomic sequences of about 100 GB, or big datasets in general) my counting could be slow.
Anyway, I'll run my tests, and in case of other problems with the tool, I'll ask you here.

Thank you very much for your support.
