
panaani's Introduction

panaani

(WIP) Pangenome-aware dereplication of bacterial genomes into ANI clusters.

Dependencies

  • Rust >= v1.75 (stable)

Installation

git clone https://github.com/tmaklin/panaani
cd panaani
cargo build --release

License

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.

panaani's People

Contributors

tmaklin

panaani's Issues

Implement initial batching of inputs to `dereplicate`

Sometimes a good initial guess for the panANI clusters is available, but it should not be trusted completely. Full trust is how --external-clustering works: a sequence with an external cluster can never be placed outside that cluster. An initial guess can instead be incorporated into panaani dereplicate by implementing an --initial-batches toggle that batches the inputs so that sequences in the same guessed cluster are placed in the same batch for the first run.
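The batching idea can be sketched as follows. This is only an illustration under stated assumptions, not panaani's implementation: the function name `initial_batches`, the `(path, label)` input layout, and the `batch_size` parameter are all hypothetical. It keeps each guessed cluster together where it fits and splits a cluster only when it is larger than one batch.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: groups input sequences by a guessed cluster label and
/// packs whole groups into batches of at most `batch_size` sequences.
/// A group larger than `batch_size` is split across consecutive batches.
pub fn initial_batches(
    seqs: &[(String, String)], // (sequence path, guessed cluster label)
    batch_size: usize,
) -> Vec<Vec<String>> {
    assert!(batch_size > 0, "batch_size must be positive");

    // BTreeMap keeps the group order deterministic between runs.
    let mut groups: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (path, label) in seqs {
        groups.entry(label.as_str()).or_default().push(path.as_str());
    }

    let mut batches: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    for members in groups.values() {
        // Start a new batch if this group does not fit in the current one.
        if !current.is_empty() && current.len() + members.len() > batch_size {
            batches.push(std::mem::take(&mut current));
        }
        for path in members {
            // Only reached for groups larger than one batch.
            if current.len() == batch_size {
                batches.push(std::mem::take(&mut current));
            }
            current.push(path.to_string());
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

Because the guess only decides the first batches, later rounds remain free to split or merge the guessed clusters, which is the difference from --external-clustering described above.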

Reduce RAM use when calculating ANIs for large inputs

Running ANI calculation for large inputs takes quite a bit of extra RAM. Some ideas for reducing this:

  • All sketches are currently stored in memory before calculating the distances. Changing this to some form of sketch-on-demand or sketching on disk is the most significant change that could be made.
  • Memory-map the results so they are not held in actual RAM when not needed.
  • Stream the output when called from panaani dist (downside: user would need to handle sorting).
  • (probably minor) Hash the input file names and store the output in the format hash1 hash2 dist instead of filepath1 filepath2 dist.
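The streaming and hashing ideas in the list above can be combined in a small sketch. It assumes the distances arrive as an iterator of `(path1, path2, dist)` tuples. The names `fnv1a64` and `stream_distances` are hypothetical, and FNV-1a is used only because it is short and gives the same value on every run; panaani may well prefer a different hash.

```rust
use std::io::{self, Write};

/// 64-bit FNV-1a hash of a byte string (stable across runs and platforms).
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Writes one `hash1 hash2 dist` record per pair directly to `out`,
/// so no vector of results is accumulated in memory.
pub fn stream_distances<W: Write>(
    pairs: impl Iterator<Item = (String, String, f64)>,
    out: &mut W,
) -> io::Result<()> {
    for (a, b, dist) in pairs {
        writeln!(
            out,
            "{:016x} {:016x} {}",
            fnv1a64(a.as_bytes()),
            fnv1a64(b.as_bytes()),
            dist
        )?;
    }
    Ok(())
}
```

A separate table mapping each hash back to its file path would still be needed to interpret the output, and the records come out in computation order, so sorting is left to the user as noted in the list.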

Panic with error "Too many open files" on some file systems

Running panaani dereplicate crashes in the graph building step that calls ggcat when it is run on inputs that result in a large number of graphs to build. The crash is caused by too many open files and seems to affect only Lustre, so it is partially caused by resource limits. Running ulimit -n with a large argument does not seem to solve the issue.

Workaround: run panaani dist and panaani clust on the input, then build the graphs corresponding to the panaani clust output in memory by calling the ggcat executable directly.

Error: Error message: called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }

Related: algbio/ggcat#25

Implement `panaani assign`

New feature to assign a query to an existing database from panaani dereplicate:

  • Calculate distances from query to db.
  • Assign to closest cluster if distance < threshold.

Questions:

  • If the query could be assigned to multiple clusters based on the distance, what should happen?
  • Should the db be rebuilt after an update?
  • If distance > threshold for all clusters, do we create a new cluster?
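One way to keep the open questions above explicit is to make each outcome a separate case and leave the policy to the caller. The sketch below assumes the query-to-cluster distances have already been computed. The `Assignment` enum and the `assign` function are hypothetical names, not an existing panaani API.

```rust
/// Possible outcomes of assigning one query against the database clusters.
#[derive(Debug, PartialEq)]
pub enum Assignment {
    /// Exactly one cluster is below the threshold.
    Assigned(String),
    /// Several clusters are below the threshold, listed closest first.
    Ambiguous(Vec<String>),
    /// No cluster is below the threshold.
    Unassigned,
}

/// `distances` holds one (cluster name, distance) pair per database cluster.
pub fn assign(distances: &[(String, f64)], threshold: f64) -> Assignment {
    // NaN distances fail the `<` comparison and are dropped here.
    let mut hits: Vec<(&str, f64)> = distances
        .iter()
        .filter(|(_, d)| *d < threshold)
        .map(|(name, d)| (name.as_str(), *d))
        .collect();
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));

    match hits.len() {
        0 => Assignment::Unassigned,
        1 => Assignment::Assigned(hits[0].0.to_string()),
        _ => Assignment::Ambiguous(hits.iter().map(|(name, _)| name.to_string()).collect()),
    }
}
```

A caller that always wants the closest cluster can take the first element of `Ambiguous`, while `Unassigned` is the point where creating a new cluster would be decided.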

Converge faster by placing closely related sequences in the same batch

Instead of just processing the inputs sequentially or randomly shuffled in each panaani dereplicate round, convergence could be made faster by doing a quick coarse clustering to inform the next round about which sequences should go in the same batch.

Coarse clustering can be done by using large subsampling rates. This should still identify very closely related sequences, so they get placed in the same batch. Parameters that seem to work:

--clip-tails --kmer-subsampling-rate 2500 --marker-compression-factor 2500

With these, a rough estimate of the distances can be obtained quickly while also using less memory.
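Turning the rough distances into batch hints could look like the sketch below. It uses single-linkage grouping via union-find, which is an assumption chosen for brevity rather than the method panaani uses. The `coarse_groups` name, the index-based input, and the `cutoff` parameter are hypothetical.

```rust
/// Finds the root of `x`, shortening the path along the way (path halving).
fn find(parent: &mut [usize], mut x: usize) -> usize {
    while parent[x] != x {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    x
}

/// Groups `n` sequences (indexed 0..n) so that any pair with a coarse distance
/// below `cutoff` ends up in the same group (single linkage).
/// Panics if a pair refers to an index >= n.
pub fn coarse_groups(n: usize, dists: &[(usize, usize, f64)], cutoff: f64) -> Vec<Vec<usize>> {
    let mut parent: Vec<usize> = (0..n).collect();
    for &(a, b, d) in dists {
        if d < cutoff {
            let ra = find(&mut parent, a);
            let rb = find(&mut parent, b);
            if ra != rb {
                // Attach the larger root to the smaller one so that each
                // root is the smallest index in its group.
                parent[ra.max(rb)] = ra.min(rb);
            }
        }
    }

    // Collect members under their root; groups come out ordered by smallest member.
    let mut groups: Vec<Vec<usize>> = vec![Vec::new(); n];
    for i in 0..n {
        let root = find(&mut parent, i);
        groups[root].push(i);
    }
    groups.retain(|g| !g.is_empty());
    groups
}
```

Each returned group lists sequences that should share a batch in the next round. Single linkage can chain distant sequences together through intermediates, which matters little here because the groups only inform batching and not the final clusters.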

Log messages are printed to stdout

log v0.4.20 seems to print messages from warn! and error! to stdout by default. This is undesirable: investigate whether the messages can be printed to stderr instead, or change the logger implementation.

`dereplicate` batch step affects sequences, not clusters

The batch step in dereplicate is currently applied to the original input sequences in subsequent rounds rather than to the clusters. It should operate on the clusters; operating on the sequences sometimes results in a crash because all inputs are in the same cluster.
