Comments (2)
I made a hacky python script to remove any kmer which occurs in fewer than 80% of samples. It ran in between 1m50 and 2m20 on merge files of size 579M and 469M respectivly (around 1200 Salmonella Enterica assemblies)
import math
import sys
INPUT = sys.argv[1]
OUTPUT = sys.argv[2]
THRESHOLD = 0.8
def parse_count(ascii_field):
bits = ''.join(['{0:06b}'.format(ord(c)-33) for c in ascii_field])
return bits.count('1')
kmer_counts = {}
with open(INPUT, 'r') as f_in, open(OUTPUT, 'w') as f_out:
k = int(f_in.readline().strip())
kmer_length = 1 + ((k * 2) // 3)
samples = f_in.readline().strip().split(' ')
nSamples = len(samples)
threshold = nSamples * THRESHOLD
ascii_length = math.ceil(nSamples / 6)
n_kmers = 2
def parse_line(line):
ascii_ = line[:ascii_length]
count = parse_count(line[:ascii_length])
kmers = ((line[i], line[i+1:i+kmer_length]) for i in range(ascii_length, len(line), kmer_length))
return ascii_, count, kmers
for line in f_in:
_, count, kmers = parse_line(line.rstrip())
for _, kmer in kmers:
kmer_counts[kmer] = kmer_counts.get(kmer, 0) + count
n_kmers += 1
if (n_kmers % 1000000) == 0:
print(f"Parsed {n_kmers} kmers")
kmers_to_keep = {kmer for kmer, count in kmer_counts.items() if count > threshold}
print(f"Found {len(kmers_to_keep)} kmers to keep")
f_in.seek(0)
# HEADERS
f_out.write(f_in.readline())
f_out.write(f_in.readline())
for line in f_in:
ascii_, _, kmers = parse_line(line.rstrip())
outline = "".join([ascii_] + [ _ for snp, kmer in kmers for _ in (snp, kmer) if kmer in kmers_to_keep] + ["\n"])
if len(outline) > ascii_length + 1:
f_out.write(outline)
from ska.
This has been added to ska weed.
from ska.
Related Issues (20)
- Improve prediction of number of missed alignments
- Tag a release HOT 3
- "make PREFIX=/my/dir install" still uses /usr/local HOT 1
- ska 1.0 now in homebrew HOT 1
- fastq -> READS ; fasta -> CONTIGS ? HOT 1
- ska type multiple samples at once? HOT 1
- Order of `type` columns HOT 2
- What is `XXXXXX_ddl.fa` ? HOT 1
- ska distance - include distance.phy PHYLIP too?
- ska annotate --- generate a multi-sample VCF ? HOT 1
- Deal with case of 2 or more SNPs within a kmer length
- multithreading HOT 3
- Understanding split kmers produced from fasta files HOT 1
- unique kmer output HOT 1
- Is there a looping command to perform all SKA task HOT 1
- SKA run out of memory on 32 GB VM HOT 1
- kmers split by 2 adjacent SNPs for allele-specific primers
- ska distances: adding isolates to databse
- SKA Type crashes when no allele is found for a locus
- ska fasq vs ska fasta
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from ska.