simonrharris / ska
Split Kmer Analysis
License: MIT License
This is a really nice and simple tool to use, and it's a shame that it's only the two of us using it.
Ran into a segmentation fault running
ska align -p 0.9 -k -v -o SHFVkrc1_reference_free_prop_0.9 *.skf
While trying to understand the split kmers produced from fasta files, I made 3 simple and deliberately non-biological fasta files
ref.fas (a string of 160 Gs)
As expected this produced 1 split kmer as shown by the humanized summary
The 2nd fasta file (sample1) had just one polymorphism (G->A) at position 90
This file produced 31 split kmers as expected, but I think the kmer that should have had an A as its middle base is reported with an N
The 3rd fasta file (sample2) had a different polymorphism (G->T) at the same position, 90
Its split kmer file showed the same thing
And when I calculated the distance between sample 1 and 2 using ska distance sample1.skf sample2.skf -o sample1_vs_sample2
the output showed no SNPs
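One possible reading of the N, sketched under loud assumptions: each split kmer is keyed by its two flanking arms (15 bases each here, chosen only because 2*15+1 = 31 matches the count reported above), a key seen with more than one middle base is collapsed to N, and reverse complements are ignored. None of this is taken from SKA's source, and the function names are hypothetical.

```python
from collections import defaultdict

def split_kmers(seq, k=15):
    """Map each (left arm, right arm) pair to the set of middle bases seen."""
    middles = defaultdict(set)
    window = 2 * k + 1
    for i in range(len(seq) - window + 1):
        left = seq[i:i + k]
        mid = seq[i + k]
        right = seq[i + k + 1:i + window]
        middles[(left, right)].add(mid)
    return middles

def summarise(middles):
    """Assumption: conflicting middle bases for the same arms collapse to 'N'."""
    return {arms: (next(iter(bases)) if len(bases) == 1 else "N")
            for arms, bases in middles.items()}

def snp_distance(a, b):
    """Count shared split kmers whose middle bases are unambiguous and differ."""
    return sum(1 for arms, base in a.items()
               if arms in b and base != "N" and b[arms] != "N" and base != b[arms])

ref = "G" * 160
sample1 = "G" * 89 + "A" + "G" * 70   # G->A at 1-based position 90
sample2 = "G" * 89 + "T" + "G" * 70   # G->T at 1-based position 90

s1 = summarise(split_kmers(sample1))
s2 = summarise(split_kmers(sample2))
print(len(summarise(split_kmers(ref))), len(s1), s1[("G" * 15, "G" * 15)])
print(snp_distance(s1, s2))
```

In this toy model the all-G arms occur both around the real G middles and around the polymorphism, so that key becomes ambiguous in each sample and no comparable SNP remains. A non-repetitive test sequence would avoid the collision.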
At some point I suspect someone will change the format of the files output by ska
and so it might be a good idea to specify a version at the top of the kmers file so that you can tell if this version of the software is compatible with this version of input file.
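A minimal sketch of what such a version check could look like. The magic bytes, the version number and the two-byte layout are all hypothetical, not SKA's real file format.

```python
import io
import struct

MAGIC = b"SKF"        # hypothetical magic bytes, not SKA's actual format
FORMAT_VERSION = 1    # hypothetical version number

def write_header(stream, version=FORMAT_VERSION):
    """Write the magic bytes followed by a little-endian 16-bit version."""
    stream.write(MAGIC + struct.pack("<H", version))

def read_header(stream, supported=(FORMAT_VERSION,)):
    """Return the file version, or raise if the file is not readable by this build."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a split kmer file")
    (version,) = struct.unpack("<H", stream.read(2))
    if version not in supported:
        raise ValueError(f"unsupported file version {version}")
    return version

buf = io.BytesIO()
write_header(buf)
buf.seek(0)
print(read_header(buf))
```

Failing early with a clear message is the point; the rest of the file can keep whatever layout it has.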
I am interested in using SKA to generate allele-specific LNA-modified primers for clinical outbreak investigation. I have successfully run through the tutorial with a test dataset but am trying to solve the following problem: from a primer-design perspective, it may be optimal to identify split kmers which differ by two adjacent nucleotides rather than just one. Is this something that might be possible with SKA, or would it be within the scope of future releases?
Many thanks,
Dustin
We had a chat about having a consistent name for the kmer outputs from ska. I like the idea of everything being *.kmers by default. If possible, I'd also prefer that you left filenames up to the user, so that instead of a prefix they specify the full name for their output.
Relaxed PHYLIP format.
Can be lower triangle.
Distances in (0,1).
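For illustration, a sketch of writing a lower-triangle distance matrix in relaxed PHYLIP layout (taxon count on the first line, then one whitespace-separated row per taxon with distances to the preceding taxa only). The function name and the four-decimal formatting are assumptions.

```python
def to_phylip_lower(names, dist):
    """dist[i][j] is the distance between names[i] and names[j]."""
    lines = [str(len(names))]
    for i, name in enumerate(names):
        # Row i lists only the distances to taxa 0..i-1 (lower triangle).
        row = " ".join(f"{dist[i][j]:.4f}" for j in range(i))
        lines.append(f"{name} {row}".rstrip())
    return "\n".join(lines) + "\n"

names = ["a", "b", "c"]
dist = [[0.0, 0.1, 0.2],
        [0.1, 0.0, 0.3],
        [0.2, 0.3, 0.0]]
print(to_phylip_lower(names, dist), end="")
```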
Hi,
I am getting different results (number of SNPs and SNP distances) when I run ska distance with .skf files generated using ska fastq and ska fasta, with the same genome database. Specifically, the number of SNPs is greater when I use the fasta option than when I use fastq. I wonder if this is associated with error/variation introduced during the assembly process (reads were trimmed and quality assessed), or is there any other issue to consider.
Thanks!
Please could you add a command so that I can remove rare kmers from a merge file? I've got a couple of merge files which I'd like to merge but I run out of RAM. If possible I'd like to remove all the kmers which are in fewer than 80% of input sequences because I think that'll make the merge (and later alignment) much more efficient.
ska weed -d 0.8 my_data.kmerge
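The filter being asked for can be sketched as below, assuming the merged data can be viewed as a mapping from kmer to the set of samples containing it. That representation and the function name are hypothetical, and the `-d 0.8` option above is the requester's proposal rather than an existing flag.

```python
def weed_rare(kmer_samples, n_samples, min_proportion=0.8):
    """Keep only kmers present in at least min_proportion of the samples."""
    threshold = min_proportion * n_samples
    return {kmer: samples for kmer, samples in kmer_samples.items()
            if len(samples) >= threshold}

kmers = {
    "AAA": {0, 1, 2, 3, 4},   # present in 5 of 5 samples
    "CCC": {0, 1, 2},         # present in 3 of 5 samples
    "GGG": {0},               # present in 1 of 5 samples
}
print(sorted(weed_rare(kmers, n_samples=5)))
```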
It's become horribly out of date
Hi, I am wondering if you can multithread ska weed?
We had a chat about you combining the following commands:
fasta
fastq
merge
My thought was that you could have one command:
$ ls
lots_of_samples.kmers
sample_a.kmers
sample_b.fasta
sample_c_1.fastq
sample_c_2.fastq
$ ska add -o lots_of_samples.kmers sample_a.kmers sample_b.fasta sample_c_1.fastq:sample_c_2.fastq
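A rough sketch of how a single `add` command might sort its positional arguments by file extension, treating `a:b` as a read pair as in the example above. The `add` command itself is only a proposal here, and the function name and the returned labels are hypothetical.

```python
from pathlib import Path

def classify_inputs(args):
    """Group each positional argument by the kind of input it appears to be."""
    plan = []
    for arg in args:
        parts = arg.split(":")            # "x_1.fastq:x_2.fastq" is a read pair
        suffixes = {Path(p).suffix for p in parts}
        if suffixes == {".kmers"} and len(parts) == 1:
            plan.append(("merge", parts))
        elif suffixes == {".fasta"} and len(parts) == 1:
            plan.append(("fasta", parts))
        elif suffixes == {".fastq"} and len(parts) <= 2:
            plan.append(("fastq", parts))
        else:
            raise ValueError(f"cannot classify input: {arg}")
    return plan

print(classify_inputs(["sample_a.kmers", "sample_b.fasta",
                       "sample_c_1.fastq:sample_c_2.fastq"]))
```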
Got this:
#CHROM POS ID REF ALT QUAL FILTER INFO
CP006620 3509 . T C . . NS=14;NS5=0,2,0,12,0
CP006620 4634 . A T . . NS=14;NS5=12,0,0,2,0
CP006620 4937 . C T . . NS=14;NS5=0,12,0,2,0
Expected:
#CHROM POS ID REF ALT QUAL FILTER INFO SAMPLE1 SAMPLE2 SAMPLE3 ....
CP006620 3509 . T C . . NS=14;NS5=0,2,0,12,0 GT 1 0 1 0 0 1 2
etc
Due to differences in command line parameters, I assume fastq is actually for short sequence reads, and fasta for contigs?
What if my reads are in FASTA format? Will fastq barf?
If the user used -f <file> for the loci, could you allow the positional parameters to be multiple *.skf files so we get a table of ST calls?
We just had a chat about this. I was trying to merge two kmerge files of about 450-500MB each. I ran out of memory on a 32GB machine. I have a feeling you could do merging with a lot less memory if you did some memory mapping or if you held the kmers in memory but only used a reference to the bitstring for the alleles.
You might be able to do something similar to make the align use a lot less memory but I'm less sure.
If a sample is missing any of the loci supplied to SKA type, it reports a mismatch between the number of alleles and the profile. It would be really useful if SKA type could tolerate a missing locus and put a hyphen in its place, since for some schemes the locus could be genuinely missing in some lineages. Most of the time it would be the result of poor quality sequence data, but it would still be better for the program to report the locus as missing instead of erroring out.
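The requested behaviour could look roughly like this. The data structures (a dict of allele calls per locus and a dict from allele tuples to ST) and the function name are assumptions for illustration, not how SKA stores its typing results.

```python
def type_row(calls, loci, profiles):
    """Return allele calls in scheme order plus the ST, using '-' for anything missing.

    calls:    locus name -> allele number, with missing loci absent or None
    profiles: tuple of allele numbers in scheme order -> ST
    """
    alleles = [calls.get(locus) for locus in loci]
    row = ["-" if a is None else str(a) for a in alleles]
    if None in alleles:
        st = "-"                                  # no ST without a full profile
    else:
        st = str(profiles.get(tuple(alleles), "-"))
    return row + [st]

loci = ["adk", "atpA", "ddl"]
profiles = {(1, 2, 3): 17}
print(type_row({"adk": 1, "atpA": 2, "ddl": 3}, loci, profiles))
print(type_row({"adk": 1, "atpA": 2}, loci, profiles))
```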
New formula ska in Brewsci/bio for Linux and macOS!
Split Kmer Analysis toolkit
brew install brewsci/bio/ska
https://github.com/simonrharris/SKA/releases
https://github.com/brewsci/homebrew-bio
http://linuxbrew.sh #bioinformatics
Is there any way to add isolates to a database of distances without having to recalculate all of the distances in the database? It would be great if this was possible and the original cluster names stayed the same too for easy modelling over a long time period. Thanks
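In principle only the distances involving the new isolate need computing. A toy sketch of that idea, with a hypothetical in-memory store and a plain Hamming distance standing in for whatever metric is really used:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def add_isolate(distances, profiles, name, profile, metric=hamming):
    """Add one isolate, computing only the distances that involve it."""
    for other, other_profile in profiles.items():
        distances[frozenset((name, other))] = metric(profile, other_profile)
    profiles[name] = profile
    return distances

profiles = {"a": "ACGT", "b": "ACGA"}
distances = {frozenset(("a", "b")): 1}        # existing entry is left untouched
add_isolate(distances, profiles, "c", "TCGA")
print(distances[frozenset(("a", "c"))], distances[frozenset(("b", "c"))])
```

Keeping cluster names stable is a separate problem, since a new isolate can join two existing clusters together.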
Does anyone have a looping command, e.g. a Python script, to perform the task on a folder of raw fastq files? I have hundreds of isolates.
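One way to do this, as a sketch: group the reads by sample name and build one command per sample. It assumes paired files named `<sample>_1.fastq` and `<sample>_2.fastq` (optionally gzipped) and an invocation of the form `ska fastq -o <prefix> <reads...>`, inferred from the `-o` usage elsewhere on this page, so check both against your data and the tool's own help before running anything.

```python
import re
from collections import defaultdict
from pathlib import Path

def build_commands(fastq_paths):
    """Group paired FASTQ files by sample and build one command list per sample."""
    samples = defaultdict(list)
    for path in sorted(str(p) for p in fastq_paths):
        match = re.match(r"(.+)_[12]\.fastq(\.gz)?$", Path(path).name)
        if match:                      # files not matching the pattern are skipped
            samples[match.group(1)].append(path)
    # Assumed invocation: ska fastq -o <output prefix> <read files>
    return [["ska", "fastq", "-o", sample, *reads]
            for sample, reads in sorted(samples.items())]

for cmd in build_commands(["b_2.fastq", "a_1.fastq.gz", "a_2.fastq.gz", "b_1.fastq"]):
    print(" ".join(cmd))

# To actually run them over a folder (not executed here):
#   import subprocess
#   for cmd in build_commands(Path("raw_fastq").glob("*.fastq*")):
#       subprocess.run(cmd, check=True)
```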
Where does the unique.skf file go once we run "ska unique"? Are there any limitations to this that we should know of?
My command runs and does not give me an error. I get the following:
Output will be written to unique.skf
But then, I don't see this file...
I found a file *_ddl.fa in my folder, possibly left over from an interrupted ska process.
If it is from ska, could you please honour $TMPDIR or use mktemp(), which honours it?
It will speed things up for most people, especially where . is slow NFS.
Please help! I'm running out of memory on a 32GB VM when trying to merge more than 1000 isolates of Mtb. It maxes out the RAM and the whole VM freezes after reading file 389, every time. Can anyone shed some light on this?
make PREFIX=/home/linuxbrew/.linuxbrew/Cellar/ska/1.0 install
cp bin/ska /usr/local/bin/
cp: cannot create regular file ‘/usr/local/bin/ska’: Permission denied
make: *** [Makefile:34: install] Error 1
I can't figure out why this is happening.
Could you tag a release (once ready)? I'll write a conda recipe, which seems to be something I am into at the moment.
Currently
Sample adk atpA ddl gdh gyd pstS purK ST
Any chance of making it compatible with mlst?
Sample ST adk atpA ddl gdh gyd pstS purK
This has the advantage that no matter how many alleles in the scheme, the ST will be consistent.
But I understand that ST is not compulsory, and that this could be used for assaying all sorts of amplicons such as AMR genes.
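The reordering itself is small. A sketch, with a hypothetical helper that moves the ST column to sit directly after Sample whatever the number of loci:

```python
def st_second(header, rows):
    """Reorder columns so that ST directly follows Sample."""
    order = ["Sample", "ST"] + [c for c in header if c not in ("Sample", "ST")]
    idx = [header.index(c) for c in order]
    return order, [[row[i] for i in idx] for row in rows]

header = "Sample adk atpA ddl gdh gyd pstS purK ST".split()
rows = [["s1", "1", "2", "3", "4", "5", "6", "7", "17"]]   # made-up example row
new_header, new_rows = st_second(header, rows)
print(" ".join(new_header))
print(" ".join(new_rows[0]))
```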