simonrharris / ska


Split Kmer Analysis

License: MIT License

C++ 94.49% C 3.50% Makefile 2.00%

bacterial-genomes epidemiology genomics


ska's Issues

Publish the thing

This is a really nice and simple tool to use and it's a shame that it's only the two of us using it 🙂

seg fault with "-k" option in align

Ran into a segmentation fault running

ska align -p 0.9 -k -v -o SHFVkrc1_reference_free_prop_0.9 *.skf

ska_test.zip

Understanding split kmers produced from fasta files

In trying to understand split kmers produced from fasta files I produced 3 simple and very non-biologically relevant fasta files

ref.fas (a string of 160 Gs)

As expected this produced 1 split kmer as shown by the humanized summary

The 2nd fasta file (sample1) had just one polymorphism (G->A) at position 90

This file produced 31 split kmers as expected, but I think the kmer that should have had an A as the middle base is reported with an N

The 3rd fasta file (sample2) had a different polymorphism (G->T) at the same position 90

Similar thing with the split kmer file

And when I calculated the distance between sample 1 and 2 using ska distance sample1.skf sample2.skf -o sample1_vs_sample2 the output showed no SNPs
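The counts reported above can be reproduced with a small model of split kmers. This is only an illustrative sketch, not SKA's implementation: it assumes 15-base flanks on each side of the middle base (consistent with the 31 split kmers reported), and it does not canonicalise against the reverse complement. The function name split_kmers is made up for this example.

```python
from collections import defaultdict

def split_kmers(seq, flank=15):
    """Map each (left flank, right flank) pair to the set of middle bases seen."""
    width = 2 * flank + 1
    table = defaultdict(set)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        key = (window[:flank], window[flank + 1:])  # flanks identify the split kmer
        table[key].add(window[flank])               # middle base is the stored value
    return dict(table)

ref = "G" * 160
sample1 = "G" * 89 + "A" + "G" * 70  # G->A at 1-based position 90
sample2 = "G" * 89 + "T" + "G" * 70  # G->T at the same position

all_g = ("G" * 15, "G" * 15)
print(len(split_kmers(ref)))                 # 1
print(len(split_kmers(sample1)))             # 31
print(sorted(split_kmers(sample1)[all_g]))   # ['A', 'G']
```

In this model the all-G flank pair is seen with two different middle bases inside sample1 itself (G from the unchanged windows, A from the window centred on position 90). That is one possible reason for an ambiguous middle base and for no SNP being reported between the two samples, but whether SKA behaves this way is not confirmed here.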

Version the kmers and/or kmerge files

At some point I suspect someone will change the format of the files output by ska and so it might be a good idea to specify a version at the top of the kmers file so that you can tell if this version of the software is compatible with this version of input file.

kmers split by 2 adjacent SNPs for allele-specific primers

I am interested in using SKA to generate allele-specific LNA-modified primers for clinical outbreak investigation. I have successfully run through the tutorial with a test dataset but am trying to solve the following problem: from a primer-design perspective, it may be optimal to identify split kmers which differ by two adjacent nucleotides rather than just one. Is this something that might be possible with SKA, or would it be within the scope of future releases?

Many thanks,

Dustin

Consider using either kmerge or kmers as a file name, not both

We had a chat about having a consistent name for the kmer outputs from ska. I like the idea of everything being *.kmers by default. If possible I'd also prefer you left filenames up to the user, so instead of a prefix they specify the full name for their output.

ska distance - include distance.phy PHYLIP too?

Relaxed PHYLIP format.
Can be lower triangle.
Distances in (0,1).
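A minimal sketch of the requested output, assuming a square distance matrix is already available in memory. The function name lower_triangle_phylip and the six-decimal formatting are choices made for this example, not anything ska currently produces.

```python
def lower_triangle_phylip(names, dist):
    """Return a relaxed-PHYLIP lower-triangular distance matrix as a string.

    dist[i][j] must hold the distance between names[i] and names[j] for j < i.
    """
    lines = [str(len(names))]  # first line: number of taxa
    for i, name in enumerate(names):
        # relaxed PHYLIP: full-length name, then whitespace-separated distances
        row = [name] + [f"{dist[i][j]:.6f}" for j in range(i)]
        lines.append(" ".join(row))
    return "\n".join(lines) + "\n"

names = ["sample1", "sample2", "sample3"]
dist = [
    [0.00, 0.10, 0.25],
    [0.10, 0.00, 0.20],
    [0.25, 0.20, 0.00],
]
print(lower_triangle_phylip(names, dist), end="")
```

The first row carries only the name because the diagonal is omitted in the lower-triangular layout.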

ska fastq vs ska fasta

Hi,

I am getting different results (number of SNPs and SNP distances) when I run ska distance with .skf files generated using ska fastq and ska fasta, with the same genome database. Specifically, the number of SNPs is greater when I use the fasta option than when I use fastq. I wonder if this is associated with error/variation introduced during the assembly process (reads were trimmed and quality assessed), or is there any other issue to consider.

Thanks!

Remove rare kmers from merge file

Please could you add a command so that I can remove rare kmers from a merge file? I've got a couple of merge files which I'd like to merge but I run out of RAM. If possible I'd like to remove all the kmers which are in fewer than 80% of input sequences because I think that'll make the merge (and later alignment) much more efficient.

ska weed -d 0.8 my_data.kmerge
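The filtering being asked for can be sketched as follows. This assumes the merge file has already been loaded into a mapping from kmer to the set of samples containing it; that in-memory layout and the name drop_rare_kmers are hypothetical and say nothing about how ska stores or weeds kmers internally.

```python
def drop_rare_kmers(presence, n_samples, min_prop=0.8):
    """Keep only kmers found in at least min_prop of the samples.

    presence maps each kmer to the set of sample names that contain it.
    """
    threshold = min_prop * n_samples
    return {kmer: samples for kmer, samples in presence.items()
            if len(samples) >= threshold}

presence = {
    "kmer_common": {"s1", "s2", "s3", "s4", "s5"},  # 5/5 samples
    "kmer_border": {"s1", "s2", "s3", "s4"},        # 4/5 = 80%, kept
    "kmer_rare": {"s1", "s2", "s3"},                # 3/5 = 60%, dropped
}
kept = drop_rare_kmers(presence, n_samples=5)
print(sorted(kept))  # ['kmer_border', 'kmer_common']
```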

Update the wiki

It's become horribly out of date

multithreading

Hi, I am wondering if you can multithread ska weed?

One command for `fasta`, `fastq` and `merge`

We had a chat about you combining the following commands:

fasta
fastq
merge

My thought was that you could have one command:

$ ls
lots_of_samples.kmers
sample_a.kmers
sample_b.fasta
sample_c_1.fastq
sample_c_2.fastq

$ ska add -o lots_of_samples.kmers sample_a.kmers sample_b.fasta sample_c_1.fastq:sample_c_2.fastq
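One way the proposed command could sort out its inputs is by file suffix, with ":" joining paired fastq files as in the example above. The sketch below is hypothetical throughout: "ska add" is only a proposal, and the names SUFFIXES, kind_of and classify_inputs are invented for illustration.

```python
# Suffix -> input type, following the file names used in the proposal
SUFFIXES = {".kmers": "kmers", ".fasta": "fasta", ".fastq": "fastq"}

def kind_of(filename):
    """Return the input type implied by a file's suffix."""
    for suffix, kind in SUFFIXES.items():
        if filename.endswith(suffix):
            return kind
    raise ValueError(f"unrecognised input type: {filename}")

def classify_inputs(args):
    """Turn positional arguments into (type, [files]) pairs."""
    plan = []
    for arg in args:
        files = arg.split(":")  # ':' groups the files of one sample
        kinds = {kind_of(f) for f in files}
        if len(kinds) != 1:
            raise ValueError(f"mixed input types in: {arg}")
        kind = kinds.pop()
        if len(files) > 1 and kind != "fastq":
            raise ValueError(f"only fastq files can be paired: {arg}")
        plan.append((kind, files))
    return plan

args = ["sample_a.kmers", "sample_b.fasta", "sample_c_1.fastq:sample_c_2.fastq"]
for kind, files in classify_inputs(args):
    print(kind, files)
```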

ska annotate --- generate a multi-sample VCF ?

Got this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
CP006620        3509    .       T       C       .       .       NS=14;NS5=0,2,0,12,0
CP006620        4634    .       A       T       .       .       NS=14;NS5=12,0,0,2,0
CP006620        4937    .       C       T       .       .       NS=14;NS5=0,12,0,2,0

Expected:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO  SAMPLE1 SAMPLE2 SAMPLE3 ....
CP006620        3509    .       T       C       .       .       NS=14;NS5=0,2,0,12,0 GT 1 0 1 0 0 1 2
etc

fastq -> READS ; fasta -> CONTIGS ?

Due to differences in command line parameters, I assume fastq is actually for short sequence reads, and fasta for contigs?

What if my reads are in FASTA format? Will fastq barf ?

Improve prediction of number of missed alignments

ska type multiple samples at once?

If the user used -f <file> for the loci, could you allow the positional parameters to be multiple *.skf files so we get a table of ST calls?

Less memory for `merge` and `align`

We just had a chat about this. I was trying to merge two kmerge files of about 450-500MB each. I ran out of memory on a 32GB machine. I have a feeling you could do merging with a lot less memory if you did some memory mapping or if you held the kmers in memory but only used a reference to the bitstring for the alleles.

You might be able to do something similar to make the align use a lot less memory but I'm less sure.

SKA Type crashes when no allele is found for a locus

If a sample is missing any of the supplied loci to SKA type it says that there is a mismatch in the number between your alleles and profile. It would be really useful if SKA type could tolerate a missing locus and provide a hyphen in its place since for some schemes the locus could be genuinely missing in some lineages. Most of the time it would be the result of poor quality sequence data but it would be really useful to have the program identify the locus as missing instead of erroring out.

ska 1.0 now in homebrew

🎉 New formula ska in Brewsci/bio for Linux and macOS!
ℹ️ Split Kmer Analysis toolkit
🍺 brew install brewsci/bio/ska
🏡 https://github.com/simonrharris/SKA/releases
🔬 https://github.com/brewsci/homebrew-bio
🐧 http://linuxbrew.sh #bioinformatics

ska distances: adding isolates to database

Is there any way to add isolates to a database of distances without having to recalculate all of the distances in the database? It would be great if this was possible and the original cluster names stayed the same too for easy modelling over a long time period. Thanks

Is there a looping command to perform all SKA tasks

Does anyone have a looping command, e.g. a Python script, to run all the SKA tasks starting from a folder of raw fastq files? I have hundreds of isolates.
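A possible starting point is a script that pairs up read files and builds one command per isolate. This is a sketch under assumptions: reads are named <sample>_1.fastq and <sample>_2.fastq (optionally .gz), and ska fastq accepts -o <prefix> followed by the read files, in the same way -o is used by the other subcommands shown on this page. Check ska fastq -h before relying on it. The function name build_fastq_commands is made up for this example.

```python
import re
from collections import defaultdict

def build_fastq_commands(filenames):
    """Build one 'ska fastq' command (as an argv list) per complete read pair."""
    pairs = defaultdict(dict)
    for name in sorted(filenames):
        match = re.fullmatch(r"(.+)_([12])\.fastq(\.gz)?", name)
        if match:
            pairs[match.group(1)][match.group(2)] = name
    commands = []
    for sample, reads in sorted(pairs.items()):
        if set(reads) == {"1", "2"}:  # skip isolates with a missing mate file
            # assumed syntax: ska fastq -o <output prefix> <read 1> <read 2>
            commands.append(["ska", "fastq", "-o", sample, reads["1"], reads["2"]])
    return commands

files = ["iso1_1.fastq", "iso1_2.fastq", "iso2_1.fastq.gz", "iso2_2.fastq.gz",
         "iso3_1.fastq", "notes.txt"]
for cmd in build_fastq_commands(files):
    print(" ".join(cmd))
```

To run the commands rather than print them, each argv list can be passed to subprocess.run(cmd, check=True), with the file list taken from os.listdir() on the reads folder.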

unique kmer output

Where does the unique.skf file go once we run "ska unique"? Are there any limitations to this that we should know of?
My command runs and does not give me an error. I get the following:
Output will be written to unique.skf
But then, I don't see this file...

What is `XXXXXX_ddl.fa` ?

I found a file *_ddl.fa in my folder, possibly leftover from an interrupted ska process.

If it is from ska, could you please honour $TMPDIR or use mktemp() which honours it?

It will speed up stuff for most people, especially where . is slow NFS.

SKA run out of memory on 32 GB VM

Please help! I'm running out of memory on a 32 GB VM when trying to merge more than 1000 isolates of Mtb. It maxes out the RAM and the whole VM freezes after reading file 389 every time. Can anyone shed some light on this?

"make PREFIX=/my/dir install" still uses /usr/local

make PREFIX=/home/linuxbrew/.linuxbrew/Cellar/ska/1.0 install
cp bin/ska /usr/local/bin/
cp: cannot create regular file ‘/usr/local/bin/ska’: Permission denied
make: *** [Makefile:34: install] Error 1

I can't figure out why this is happening.

This has the advantage that no matter how many alleles in the scheme, the ST will be consistent.

But I understand that ST is not compulsory, and that this could be used for assaying all sorts of amplicons such as AMR genes.