tmaklin / panaani

Pangenome-aware dereplication of bacterial genomes into ANI clusters.

License: Mozilla Public License 2.0

Rust 100.00%

panaani's Introduction

panaani

(WIP) Pangenome-aware dereplication of bacterial genomes into ANI clusters.

Dependencies

Rust >= v1.75 (stable)

Installation

git clone https://github.com/tmaklin/panaani
cd panaani
cargo build --release

License

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.

panaani's Issues

Implement initial batching of inputs to `dereplicate`

Sometimes a good initial guess for the panANI clusters is available but should not be trusted completely. This differs from --external-clustering, where a sequence with an external cluster can never be placed outside its cluster. Such a guess can be incorporated into panaani dereplicate by implementing an --initial-batches option that batches the inputs so that sequences in the same guessed cluster are placed in the same batch for the first run.
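
The batching idea can be illustrated with a minimal sketch. This is not code from panaani: the function name `initial_batches`, the input representation, and the choice to put sequences without a guess into one trailing batch are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Group input sequences into first-run batches according to a guessed
/// clustering. Each element of `guesses` pairs an input path with an optional
/// (hypothetical) cluster label. Inputs without a guess are collected into a
/// single trailing batch, which is an assumption of this sketch.
pub fn initial_batches(guesses: &[(String, Option<String>)]) -> Vec<Vec<String>> {
    // BTreeMap keeps the batches in a deterministic (sorted by label) order.
    let mut by_cluster: BTreeMap<&str, Vec<String>> = BTreeMap::new();
    let mut unassigned: Vec<String> = Vec::new();

    for (path, label) in guesses {
        match label {
            Some(l) => by_cluster.entry(l.as_str()).or_default().push(path.clone()),
            None => unassigned.push(path.clone()),
        }
    }

    let mut batches: Vec<Vec<String>> = by_cluster.into_values().collect();
    if !unassigned.is_empty() {
        batches.push(unassigned);
    }
    batches
}
```

Because the guess only decides the batches for the first run, later rounds would remain free to split or merge the resulting clusters, unlike with --external-clustering.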

Reduce RAM use when calculating ANIs for large inputs

Running the ANI calculation on large inputs takes quite a bit of extra RAM. Some ideas for reducing this:

All sketches are currently stored in memory before calculating the distances. Changing this to some form of sketch-on-demand or sketching on disk is the most significant change that could be made.
Memory-map the results so they are not held in actual RAM when not needed.
Stream the output when called from panaani dist (downside: user would need to handle sorting).
(probably minor) Hash the input file names and store the output in format hash1 hash2 dist instead of filepath1 filepath2 dist.
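
The last idea, storing `hash1 hash2 dist` instead of full file paths, could look roughly like the sketch below. The function names and the record layout are assumptions for illustration; the sketch uses the standard library's `DefaultHasher` and does not handle hash collisions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hash a file name to a fixed-size 64-bit key.
/// Note: the algorithm behind `DefaultHasher` is not guaranteed to stay the
/// same across Rust releases, so these keys should not be persisted long-term.
pub fn hash_name(path: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    hasher.finish()
}

/// Convert `(filepath1, filepath2, dist)` records into `(hash1, hash2, dist)`
/// records plus a lookup table for recovering the original names.
/// Collisions are ignored in this sketch.
pub fn compact_records(
    records: &[(String, String, f32)],
) -> (Vec<(u64, u64, f32)>, HashMap<u64, String>) {
    let mut names: HashMap<u64, String> = HashMap::new();
    let mut compact = Vec::with_capacity(records.len());

    for (a, b, dist) in records {
        let (ha, hb) = (hash_name(a), hash_name(b));
        names.entry(ha).or_insert_with(|| a.clone());
        names.entry(hb).or_insert_with(|| b.clone());
        compact.push((ha, hb, *dist));
    }
    (compact, names)
}
```

Each file name is then stored once in the lookup table rather than once per pairwise record, which is where any saving would come from.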

Pangenome graphs have to be built on temporary disk space

ggcat has a bug that crashes the API when building many graphs within the same program (see algbio/ggcat#40). Until a fix is applied, we fall back to building all graphs entirely on disk.

Add progress bar to dereplicate, dist, and build

ANI estimation and pangenome construction should report progress when --verbose is toggled.

Panic with error "Too many open files" on some file systems

Running panaani dereplicate crashes in the graph building step that calls ggcat when run on inputs that result in a large number of graphs to build. The crash is caused by too many open files and seems to affect only Lustre file systems, so it is partially caused by resource limits. Running ulimit -n with a large argument does not seem to solve the issue.

Workaround: run panaani dist and panaani clust on the input, then build the graphs corresponding to the panaani clust output in memory by calling the ggcat executable directly.

Error: Error message: called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }

Related algbio/ggcat#25

Implement `panaani assign`

New feature to assign a query to an existing database from panaani dereplicate:

Calculate distances from query to db.
Assign to closest cluster if distance < threshold.

Questions:

If query could be assigned to multiple clusters based on the distance, what to do?
Should the db be rebuilt after update?
If distance > threshold for all clusters, do we create a new cluster?
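
The two steps listed above can be sketched as follows. This is only an illustration of the proposed rule, not an implementation from panaani: the function name, the input representation (one precomputed distance per cluster), and the tie-breaking rule (first listed cluster wins) are assumptions, and the sketch deliberately does not answer the open questions.

```rust
/// Assign a query to the closest cluster whose distance is below `threshold`.
///
/// `distances` holds one `(cluster name, distance from query)` pair per
/// cluster in the database. Returns `None` when no cluster is close enough.
/// If several clusters are equally close, the first one listed is returned;
/// this tie-breaking rule is an assumption of the sketch.
pub fn assign_to_cluster<'a>(
    distances: &'a [(String, f64)],
    threshold: f64,
) -> Option<&'a str> {
    distances
        .iter()
        // NaN distances compare as not-less-than and are dropped here.
        .filter(|(_, d)| *d < threshold)
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(name, _)| name.as_str())
}
```

Returning `None` leaves the decision about creating a new cluster to the caller.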

Crates.io compatibility tracker

The current version cannot be published to crates.io because the following dependencies are not available in the registry:

skani v0.2.1 (crates.io has version 0.1.1)
ggcat-api v0.1.0 (no version published in crates.io)

Supplying output prefix `-o` to dereplicate crashes

Using an output prefix with panaani dereplicate crashes because the names aren't handled correctly.

Converge faster by placing closely related sequences in the same batch

Instead of just processing the inputs sequentially or randomly shuffled in each panaani dereplicate round, convergence could be made faster by doing a quick coarse clustering to inform the next round about which sequences should go in the same batch.

Coarse clustering can be done by using large subsampling rates. This should still identify very closely related sequences, so they get placed in the same batch. Parameters that seem to work:

--clip-tails --kmer-subsampling-rate 2500 --marker-compression-factor 2500

With these a rough estimate of the distances can be obtained quickly while also using less memory.
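
One way to turn such rough distances into batches is single-linkage grouping: any two sequences closer than a cutoff end up in the same batch. The sketch below assumes the coarse distances have already been computed and are given as index pairs; the function names and the `max_dist` cutoff are hypothetical, and single linkage is a choice made for this example rather than something the issue prescribes.

```rust
use std::collections::BTreeMap;

/// Find the representative of `x` in the union-find forest, compressing the
/// path to the root along the way.
fn find(parent: &mut [usize], x: usize) -> usize {
    let mut root = x;
    while parent[root] != root {
        root = parent[root];
    }
    let mut cur = x;
    while parent[cur] != root {
        let next = parent[cur];
        parent[cur] = root;
        cur = next;
    }
    root
}

/// Group `n` sequences (indexed 0..n) into batches by single linkage:
/// sequences `i` and `j` share a batch if a chain of pairs with coarse
/// distance <= `max_dist` connects them.
pub fn coarse_batches(
    n: usize,
    dists: &[(usize, usize, f64)],
    max_dist: f64,
) -> Vec<Vec<usize>> {
    let mut parent: Vec<usize> = (0..n).collect();
    for &(i, j, d) in dists {
        if d <= max_dist {
            let (ri, rj) = (find(&mut parent, i), find(&mut parent, j));
            if ri != rj {
                // Keep the smaller index as the root for deterministic output.
                parent[ri.max(rj)] = ri.min(rj);
            }
        }
    }

    let mut batches: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for i in 0..n {
        let root = find(&mut parent, i);
        batches.entry(root).or_default().push(i);
    }
    batches.into_values().collect()
}
```

Sequences with no close neighbour form singleton batches here; in practice these would probably need to be merged into larger batches before the next round.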

Log messages are printed to stdout

log v0.4.20 seems to print messages from warn! and error! to stdout by default. This is undesirable: investigate whether the messages can be printed to stderr instead, or change the logger implementation.

`dereplicate` batch step affects sequences, not clusters

The batch step in dereplicate is currently applied to the original input sequences in subsequent rounds rather than to the clusters. It should be applied to the clusters; applying it to the sequences sometimes results in a crash because all inputs are in the same cluster.
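
The intended behaviour can be illustrated with a small sketch that splits the current clusters, rather than the sequences inside them, into batches. The function name and the representation of a cluster as a list of sequence names are assumptions for illustration, as is the interpretation of the batch step as the number of clusters per batch.

```rust
/// Split the current clusters into batches of at most `batch_step` clusters.
/// Each cluster is a list of sequence names and is always kept whole, so a
/// single large cluster yields one batch instead of being split by sequence.
pub fn batch_clusters(
    clusters: &[Vec<String>],
    batch_step: usize,
) -> Vec<Vec<Vec<String>>> {
    clusters
        // `chunks` panics on a size of 0, so treat 0 as 1.
        .chunks(batch_step.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

When every input already belongs to one cluster, this returns a single batch holding that one cluster, which is the case described above as causing the crash.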