Code Monkey home page Code Monkey logo

2022-sourmash-uniqify's Introduction

iterative greedy clustering of sourmash signatures

ref sourmash

please post questions and comments and thoughts over at sourmash#1265

inspired by this sourmash issue: sourmash-bio/sourmash#1251

note: this requires sourmash 3.5.x or sourmash 4.x.

obtaining this script

  1. click on the 'raw' link below for sourmash-uniqify.

  2. you can grab the script directly from the repo:

wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/raw/5953c1025bab0c23e7a06f316fc17dbcec299720/sourmash-uniqify.py

- but this may not get the latest version...

  1. perhaps better, you can clone the repository: recommended
git clone https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284
  1. and/or or you can download the repository as a zip, but again, this may not get the latest version.
wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/archive/5953c1025bab0c23e7a06f316fc17dbcec299720.zip

usage:

python3 sourmash-uniqify.py [ multiple filenames, directory paths, or sourmash databases ]

Optional arguments:

  • --prefix - output prefix, e.g. --prefix=output_dir/clustme will put all output in output_dir/clustme.*
  • --threshold - Jaccard similarity threshold above which two signatures will be clustered. default 0.2

sourmash-uniqify shuffles the sequences using a random number generator seeded with --seed. by default, this is fixed at 1, so unless you change the seed you will always get the same output for the same input arguments. set --seed to 0 to change the seed each time.

you can specify -k/--ksize and --moltype to load specific types of signatures; default is k=31, DNA.

note, all signatures loaded must be compatible for comparison in terms of scaled/num; right now there is no way to pick just the scaled signatures, or just the num signatures.

output:

output of this script consists of a founder file for each cluster, a cluster file for each non-singleton cluster, and a summary CSV.

summary CSV

the summary CSV output at '{prefix}.summary.csv' contains the following columns:

  • 'origin_path' - the file or path from which the signature was loaded (corresponds to command-line arguments)
  • 'name' - signature name, as generated by sourmash sketch. may be duplicate.
  • 'filename' - filename of source sequences, as generated by sourmash sketch. may be duplicate.
  • 'md5sum' - ~unique hexadecimal string for each signature. should be unique.
  • 'cluster' - cluster number (integer).
  • 'member_type' - either 'founder' or 'member'. If 'founder', this is the signature to which all others were clustered; if 'member', this is a duplicate signature to the founder.

founder file

for each cluster, there is a founder signature output with the name '{prefix}.cluster.{n}.founder.sig'. This is the founder signature for this cluster, to which all members of this cluster match with similarity >= threshold (as specified with --threshold).

cluster file

for each cluster with more than one signature in it, there will be a cluster file containing 1 or more signatures, under the name '{prefix}.cluster.{n}.cluster.sig'.

These are the signatures who compare to the founder signature with similarity >= threshold (as specified with --threshold).

using this output to cluster actual genomes

In brief, you should be able to:

  • take a directory with a bunch of genome files
  • run sourmash sketch dna * to turn them into sourmash signatures.
  • make a directory for the output: mkdir output
  • then run this script, sourmash-uniqify.py, on the signatures in the directory: sourmash-uniqify.py *.sig --prefix=output/myproj
  • examine output/myproj.summary.csv

To select out the "unique" set of representative genomes, then do:

mkdir unique
ln $(grep ',founder' output/myproj.summary.csv | cut -d, -f2) unique/

et voila!

other notes

this uses an n-squared algorithm so you probably don't want to run this on more than a few thousand signatures.

it's also kinda brute-force dumb, for simplicity of prototype implementation.

file an issue in the sourmash tracker if you'd like it to scale better.

2022-sourmash-uniqify's People

Contributors

ctb avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.