Code Monkey home page Code Monkey logo

Comments (2)

bewt85 avatar bewt85 commented on June 18, 2024

I made a hacky python script to remove any kmer which occurs in fewer than 80% of samples. It ran in between 1m50 and 2m20 on merge files of size 579M and 469M respectivly (around 1200 Salmonella Enterica assemblies)

import math
import sys

INPUT = sys.argv[1]
OUTPUT = sys.argv[2]
THRESHOLD = 0.8

def parse_count(ascii_field):
    bits = ''.join(['{0:06b}'.format(ord(c)-33) for c in ascii_field])
    return bits.count('1')

kmer_counts = {}

with open(INPUT, 'r') as f_in, open(OUTPUT, 'w') as f_out:
    k = int(f_in.readline().strip())
    kmer_length = 1 + ((k * 2) // 3)
    samples = f_in.readline().strip().split(' ')
    nSamples = len(samples)
    threshold = nSamples * THRESHOLD
    ascii_length = math.ceil(nSamples / 6)
    n_kmers = 2
    
    def parse_line(line):
        ascii_ = line[:ascii_length]
        count = parse_count(line[:ascii_length])
        kmers = ((line[i], line[i+1:i+kmer_length]) for i in range(ascii_length, len(line), kmer_length))
        return ascii_, count, kmers
    
    for line in f_in:
        _, count, kmers = parse_line(line.rstrip())
        for _, kmer in kmers:
            kmer_counts[kmer] = kmer_counts.get(kmer, 0) + count
            n_kmers += 1
            if (n_kmers % 1000000) == 0:
                print(f"Parsed {n_kmers} kmers")
            
    kmers_to_keep = {kmer for kmer, count in kmer_counts.items() if count > threshold}
    print(f"Found {len(kmers_to_keep)} kmers to keep")
    f_in.seek(0)
    # HEADERS
    f_out.write(f_in.readline())
    f_out.write(f_in.readline())
    
    for line in f_in:
        ascii_, _, kmers = parse_line(line.rstrip())
        outline = "".join([ascii_] + [ _ for snp, kmer in kmers for _ in (snp, kmer) if kmer in kmers_to_keep] + ["\n"])
        if len(outline) > ascii_length + 1:
            f_out.write(outline)

from ska.

simonrharris avatar simonrharris commented on June 18, 2024

This has been added to ska weed.

from ska.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.