galah's Introduction

Galah

Galah aims to be a more scalable metagenome-assembled genome (MAG) dereplication method. That is, it clusters microbial genomes together based on their average nucleotide identity (ANI), and chooses a single member of each cluster as the representative.

Galah uses a greedy clustering approach to speed up genome dereplication, relative to e.g. dRep, particularly when there are many closely related genomes (i.e. >95% ANI). Generated cluster representatives have 2 properties. If the ANI threshold was set to 99%, then:

  1. Each representative is <99% ANI to each other representative.
  2. All members are >=99% ANI to the representative.

If CheckM genome qualities were specified, then the clusters have an additional property:

  1. Each representative genome has a better quality score than the other members of its cluster. Each genome is assigned a quality score using the formula completeness - 5*contamination - 5*num_contigs/100 - 5*num_ambiguous_bases/100000, which is reduced from a quality formula described in Parks et al. 2020 https://doi.org/10.1038/s41587-020-0501-8.
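
As a worked illustration of the formula above, a minimal Rust sketch follows. The function name `quality_score` and the argument types are hypothetical and not part of galah's API; completeness and contamination are assumed to be percentages as reported by CheckM.

```rust
/// Reduced quality score, following the formula quoted above:
/// completeness - 5*contamination - 5*num_contigs/100 - 5*num_ambiguous_bases/100000.
/// Hypothetical helper for illustration; not galah's own function.
fn quality_score(
    completeness: f64,        // percent, assumed 0..=100
    contamination: f64,       // percent
    num_contigs: u64,         // number of contigs in the genome
    num_ambiguous_bases: u64, // count of ambiguous bases (e.g. N)
) -> f64 {
    completeness
        - 5.0 * contamination
        - 5.0 * num_contigs as f64 / 100.0
        - 5.0 * num_ambiguous_bases as f64 / 100_000.0
}
```

Under these assumptions, a genome with 99% completeness, 0.1% contamination, 60 contigs and no ambiguous bases scores 99 - 0.5 - 3 = 95.5, whereas one with 50% completeness, no contamination and 50 contigs scores 50 - 2.5 = 47.5.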

If CheckM qualities were not provided, then the following holds instead:

  1. Each representative genome was specified to galah before other members of the cluster.

The overall greedy clustering approach was largely inspired by the work of Donovan Parks, as described in Parks et al. 2020. It operates in 2 steps. In the first step, genomes are assigned as representative if no genomes of higher quality are >99% ANI. In the second step, each non-representative genome is assigned to the representative genome it has the highest ANI with.
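
The two steps described above can be sketched in Rust as follows. This is an illustration under stated assumptions, not galah's implementation: `greedy_cluster` is a hypothetical name, the full ANI matrix is assumed to be precomputed, genomes are assumed to be ordered from best to worst quality, and each genome is compared only against representatives already chosen.

```rust
/// Greedy clustering sketch. `ani[i][j]` is the pairwise ANI in percent and
/// genomes are indexed in order of decreasing quality (index 0 is best).
/// Returns clusters as lists of genome indices, representative first.
fn greedy_cluster(ani: &[Vec<f64>], threshold: f64) -> Vec<Vec<usize>> {
    // Step 1: a genome becomes a representative unless an already chosen
    // (higher quality) representative is within the ANI threshold.
    let mut reps: Vec<usize> = Vec::new();
    for i in 0..ani.len() {
        if !reps.iter().any(|&r| ani[i][r] >= threshold) {
            reps.push(i);
        }
    }

    // Step 2: every other genome joins the representative it has the
    // highest ANI with.
    let mut clusters: Vec<Vec<usize>> = reps.iter().map(|&r| vec![r]).collect();
    for i in 0..ani.len() {
        if reps.contains(&i) {
            continue;
        }
        let mut best = 0;
        for (c, &r) in reps.iter().enumerate() {
            if ani[i][r] > ani[i][reps[best]] {
                best = c;
            }
        }
        clusters[best].push(i);
    }
    clusters
}
```

With three genomes where A-B is 99.5%, B-C is 99.2% and A-C is 98%, and a 99% threshold, this sketch picks A and C as representatives and places B with A, because B's ANI to A is the higher of the two.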

Installation

Install through the bioconda package

Galah can be installed through the bioconda conda channel. After initial setup of conda and the bioconda channel, it can be installed with

conda install galah

Galah can also be used indirectly through CoverM via its cluster subcommand, which is also available on bioconda.

Pre-compiled binary

Galah can be installed by downloading statically compiled binaries, available on the releases page.

Third party dependencies listed below are required for this method.

Compiling from source

Galah can also be installed from source, using the cargo build system after installing Rust.

cargo install galah

Third party dependencies listed below are required for this method.

Development

To run an unreleased version of Galah, after installing Rust:

git clone https://github.com/wwood/galah
cd galah
cargo run -- cluster ...etc...

Third party dependencies listed below are required for this method.

Dependencies

For some advanced usage of Galah, 3rd party tools are required, which must be installed separately:

Usage

For clustering a set of genomes at 99% ANI:

galah cluster --genome-fasta-files /path/to/genome1.fna /path/to/genome2.fna --output-cluster-definition clusters.tsv

There are several other options for specifying genomes, ANI cutoffs, etc.

The full usage is described on the manual page, which can be accessed on the command line by running galah cluster --full-help.

Precluster ANI

Similar to dRep, galah operates in two stages. In the first, a fast pre-clustering distance (computed with dashing) is calculated between each pair of genomes. In the second, a genome pair is compared with FastANI, and so considered as potentially belonging to the same cluster, only if its precluster ANI is greater than the specified prethreshold. By default, the precluster ANI is set at 95% and the final ANI is set at 99%.
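
The prefiltering idea can be sketched as a simple filter over approximate ANI values. The function below is a hypothetical illustration, not galah's code: it assumes the fast estimates are already available as a symmetric matrix in percent, and it returns the genome pairs that would go on to the slower, more accurate comparison.

```rust
/// Return the (i, j) pairs, with i < j, whose approximate ANI exceeds the
/// prethreshold. Only these pairs would be passed to the expensive ANI step.
/// `pre_ani` is assumed to be a symmetric matrix of estimates in percent.
fn candidate_pairs(pre_ani: &[Vec<f64>], prethreshold: f64) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..pre_ani.len() {
        for j in (i + 1)..pre_ani.len() {
            if pre_ani[i][j] > prethreshold {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```

When many genomes are distantly related, most pairs fall below the prethreshold and are never compared in the second stage, which is where the time saving comes from.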

License

Galah is made available under GPL3+. See LICENSE.txt for details. Copyright Ben Woodcroft.

Developed by Ben Woodcroft at the Centre for Microbiome Research, Queensland University of Technology.

galah's People

Contributors

apcamargo, aroneys, rhysnewell, wwood

galah's Issues

Wrong genome annotation

I have a case where a genome with 99% completeness, 0.1 contamination and 60 contigs is assigned to a cluster where the cluster representative has only 50% completeness (50 contigs).

I don't understand how this could happen. Unfortunately I cannot share the genomes.

Compilation error

Hello, I was looking to compile galah for the use of the CoverM tool, however I encountered an error consistently that seems to come from the build calling two versions of clap. The error message is attached in the text file - are there any workarounds for this?
galah_compile_error.txt

add bindash support instead of dashing

Hello Ben,

I have been investigating MinHash algorithms heavily over the past several months. For simple MinHash, that is, estimating Jaccard similarity in the traditional manner, b-bit One Permutation MinHash with optimal densification (https://dl.acm.org/doi/abs/10.1145/1772690.1772759, https://proceedings.neurips.cc/paper/2012/file/eaa32c96f620053cf442ad32258076b9-Paper.pdf, http://proceedings.mlr.press/v70/shrivastava17a.html) is the most space- and time-efficient algorithm of them all, including HyperLogLog. It is implemented in the bindash software (https://academic.oup.com/bioinformatics/article/35/4/671/5058094); since Xiaofei left academia, it has not been developed further the way dashing has (dashing 2, for example). However, after several experiments, e.g. all-versus-all distance computation for all NCBI genomes, bindash is the fastest tool I have ever seen (I use k-mer size 16 and sketch size 12000 to get 95% ANI level accuracy), about 2 times faster than dashing. It supports only nucleotide sequences, not amino acids as dashing and Mash do. I would suggest not using finch because it is memory inefficient for large numbers of genomes. What do you think?

Thanks,

Jianshu

checkM table not working

Dear all,
I'm trying to use --checkm-tab-table, which expects 13 columns. Unfortunately, no file in the new release of CheckM matches that description. The column counts I see are:
complete_genomes.tsv: 11
completeness.tsv: 15
contamination.tsv: 14
quality_summary.tsv: 14

Additionally, as part of another bug: I tried running without the CheckM flag and got:

[2023-05-22T12:04:22Z INFO  bird_tool_utils::external_command_checker] Found dashing version v1.0 
thread 'main' panicked at 'Failed to find sufficient version of dashing. You may wish to use the finch precluster method if you are having problems with dashing.: "It appears the available version of dashing is too old (found version v1.0, required is 0.4.0)"', src/external_command_checker.rs:19:10
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::result::unwrap_failed
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/result.rs:1750:5
   3: galah::external_command_checker::check_for_dashing
   4: galah::cluster_argument_parsing::generate_galah_clusterer
   5: galah::cluster_argument_parsing::run_cluster_subcommand
   6: galah::main

But of course v1.0 is not older than 0.4.0.
What should I do?

thanks in advance
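
For reference, a numeric version comparison can be sketched as below. This is a hypothetical illustration, not the check performed in bird_tool_utils, and it makes no claim about the cause of the panic above. It assumes versions look like `v1.0` or `0.4.0`, strips a leading `v`, and compares the dot-separated components as integers.

```rust
/// Parse a version such as "v1.0" or "0.4.0" into numeric components,
/// padded to at least three so that "1.0" and "1.0.0" compare equal.
/// Components that are not plain integers are treated as 0 in this sketch.
fn parse_version(s: &str) -> Vec<u64> {
    let mut parts: Vec<u64> = s
        .trim()
        .trim_start_matches('v')
        .split('.')
        .map(|part| part.parse().unwrap_or(0))
        .collect();
    while parts.len() < 3 {
        parts.push(0);
    }
    parts
}

/// True if `found` is at least `required`, comparing component by component.
fn is_sufficient(found: &str, required: &str) -> bool {
    parse_version(found) >= parse_version(required)
}
```

Under these assumptions, `is_sufficient("v1.0", "0.4.0")` is true because the first component 1 is greater than 0.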

dashing2 instead of dashing1

Hello Ben,

I noticed that dashing1 is based on HyperLogLog and that its sketch size can only be a power of 2. Dashing2 (https://github.com/dnbaker/dashing2) implements Mash, dashing1, and new hash algorithms such as SetSketch, ProbMinHash, et al. However, its output format is not yet settled because it is under rapid development. Personally I prefer Mash because it correlates very well with BLAST-based ANI/FastANI at sketch sizes of 10^5 or 10^6. What do you think?

Thanks,

Jianshu

Galah not clustering MAGs with ANI greater than the threshold

Hi Ben!

I wrote a script to cluster 522 dereplicated MAGs into 329 species clusters (inspired by https://www.nature.com/articles/s41587-020-0501-8). As my algorithm is quite similar to Galah's, I also tried to use it to see if the results would be similar:

galah cluster -t 64 --quality-formula Parks2020_reduced --checkm-tab-table dereplicated_mags_CheckM.txt -f Dereplicated_mags/* --prethreshold-ani 0 --ani 95 --min-aligned-fraction 60 --output-cluster-definition clusters.txt

Unexpectedly, Galah generated 418 clusters. I noticed that there are some MAGs that have more than 97% ANI and more than 90% aligned fraction (according to fastANI) that were not clustered.

What might be going on here? Am I using some parameter wrong?

error: unexpected argument '-d' found

Hi there,

it seems that the short option for --genome-fasta-directory in galah cluster isn't being recognised. The --full-help also shows a missing dash in the option:

       d, --genome-fasta-directory PATH
              Directory containing FASTA files of each genome.

I'm using galah v0.4.0

Cheers

Intermediate output

Hello, would it be possible to make the intermediate output available, e.g. the FastANI comparisons?

--min-aligned-fraction, as fraction?

Hi Ben,

Thanks for this tool.

Just a small question. Should --min-aligned-fraction be specified as a fraction or percentage? The name suggests fraction but the default value is 50 (Min aligned fraction of two genomes for clustering. [default: 50])? Would it be possible to clarify?

Thanks,

Dieter

Regarding galah output

Hi,
Thank you for all the hard work with this great tool.

I successfully ran galah with the following parameters:

galah cluster --genome-fasta-directory binning/mags/setA
--genome-fasta-extension fasta --threads 30
--output-cluster-definition binning/mags/galah.clusters.tsv

However the --output-cluster-definition file just shows the path of each fasta file as follows:

binning/mags/setA/659905.fasta binning/mags/setA/659905.fasta
binning/mags/setA/421950.fasta binning/mags/setA/421950.fasta
binning/mags/setA/113225.fasta binning/mags/setA/113225.fasta

Is this normal?

How can I find out which bins are similar to each other (i.e. a 99% ANI match)?
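
For reading a cluster definition file like the one above, a minimal Rust sketch follows. It assumes each line holds two tab-separated paths, representative first and member second, so a genome that is alone in its cluster is paired with itself. `parse_clusters` is a hypothetical helper, not part of galah.

```rust
use std::collections::BTreeMap;

/// Group the lines of a two-column, tab-separated cluster definition into a
/// map from representative path to the list of member paths.
/// Lines without two columns (e.g. blank lines) are skipped.
fn parse_clusters(tsv: &str) -> BTreeMap<String, Vec<String>> {
    let mut clusters = BTreeMap::new();
    for line in tsv.lines() {
        let mut cols = line.split('\t');
        if let (Some(rep), Some(member)) = (cols.next(), cols.next()) {
            clusters
                .entry(rep.to_string())
                .or_insert_with(Vec::new)
                .push(member.to_string());
        }
    }
    clusters
}
```

Under that assumption, a representative whose list contains only itself had no other genome clustered with it, while longer lists name the genomes that were grouped together.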

Dereplicate MAGs on strain level

Hi, wwood,

Recently, my colleagues and I have been using galah to construct MAG clusters. Most studies only do clustering at the species level. Can galah be used at the strain level? We are concerned not only with the genomes of the selected representatives, but also with the members of each cluster.

Thanks.
Jie

Option to save cluster representative

An option to save cluster representatives to an output directory would simplify most dereplication pipelines. Currently, a user who wants the FASTA files of the dereplicated genomes needs to first parse galah's output and then move or copy the representative genomes to the desired location.

Finch running time

Most of the time spent when clustering with finch+fastani or finch+skani is in this loop:

for (i, sketch1) in sketches.iter().enumerate() {

It looks to be single-threaded. Any way we can parallelise it?
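
One way such an all-against-all loop could be spread across threads is sketched below using only the standard library. Everything here is hypothetical: `pairwise_parallel` is not a galah or finch function, `dist` stands in for whatever comparison the real loop performs, and a work-stealing library may well be a better fit in practice.

```rust
use std::thread;

/// Compute `dist` for every pair (i, j) with i < j, splitting the outer loop
/// across `threads` scoped worker threads. Rows are interleaved between
/// workers because early rows have more pairs than late ones.
fn pairwise_parallel<T, F>(items: &[T], threads: usize, dist: F) -> Vec<(usize, usize, f64)>
where
    T: Sync,
    F: Fn(&T, &T) -> f64 + Sync,
{
    let threads = threads.max(1); // step_by(0) would panic
    let dist = &dist;
    let mut results: Vec<(usize, usize, f64)> = thread::scope(|s| {
        // Spawn all workers first, then join them.
        let handles: Vec<_> = (0..threads)
            .map(|t| {
                s.spawn(move || {
                    let mut local = Vec::new();
                    for i in (t..items.len()).step_by(threads) {
                        for j in (i + 1)..items.len() {
                            local.push((i, j, dist(&items[i], &items[j])));
                        }
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });
    // Restore a deterministic order regardless of which worker finished first.
    results.sort_by_key(|r| (r.0, r.1));
    results
}
```

Each worker only reads the shared slice and writes to its own local vector, so no locking is needed until the results are merged.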

Question about clustering mechanism

Hi Ben,

I just stumbled on this today while looking at coverM and it looks great. dRep can't really handle clustering huge numbers of genomes (~30,000 is the reasonable limit in my experience) because it does not employ greedy clustering, as you mention, making solutions like this a necessity as genome sets increase in size. Also fun to see dRep mentioned by name in the README- much appreciated.

My question is related to how the clustering is actually done in Galah. You mention in the README that:

Generated cluster representatives have 2 properties. If the ANI threshold was set to 99%, then:

1) Each representative is <99% ANI to each other representative.

2) All members are >=99% ANI to the representative.

but in practice this isn't possible to satisfy in many genome sets. In my experience it's really common to hit the scenario where, for example, genome A is >99% ANI to genome B, genome B is >99% to genome C, but genome A is <99% to genome C. The way that dRep handles this is by using the scipy.cluster.hierarchy.linkage function (https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html), which lets the user specify how it should be handled (single, complete, average, etc.), and with the default setting of average, some averaging is done to handle these scenarios as well as possible. What is the algorithm that Galah uses for this secondary clustering?

Thanks in advance,
Matt

Failed to find sufficient version of dashing

Hi
When installing galah through conda, dashing version 1.0 is installed instead of 0.4.0.
Hence I got this error

[2023-06-07T09:03:52Z INFO  galah::cluster_argument_parsing] Read in genome qualities for 338 genomes. 59 passed quality thresholds
[2023-06-07T09:03:52Z INFO  bird_tool_utils::external_command_checker] Found dashing version v1.0
thread 'main' panicked at 'Failed to find sufficient version of dashing. You may wish to use the finch precluster method if you are having problems with dashing.: "It appears the available version of dashing is too old (found version v1.0, required is 0.4.0)"', src/external_command_checker.rs:19:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Best
Greg

Dashing failed

I tested Galah on a small and on a large dataset. In both, I get

[2022-09-14T07:35:31Z INFO  galah::dashing] Running dashing to get approximate distances ..
[2022-09-14T07:35:31Z ERROR bird_tool_utils::command] Error when running dashing process. Exitstatus was : ExitStatus(unix_wait_status(4))
[2022-09-14T07:35:31Z ERROR bird_tool_utils::command] The STDERR was: "Dashing version: v0.4.0\n"
[2022-09-14T07:35:31Z ERROR bird_tool_utils::command] The STDOUT was: ""
[2022-09-14T07:35:31Z ERROR bird_tool_utils::command] Cannot continue after dashing failed.

I can use the finch method as a work around.

I installed galah with conda.
Here is my full command:

galah cluster  --genome-fasta-directory genomes/all_bins --genome-fasta-extension fasta  --genome-info genomes/quality.csv  --ani 97.5  --min-aligned-fraction 0.5    --min-completeness 50  --max-contamination 10  --quality-formula Parks2020_reduced  --threads 8  --output-representative-fasta-directory genomes/dereplicated_genomes  --output-cluster-definition genomes/clustering/allbins2genome_oldname.tsv

[Question] MAG quality score

Hi Ben

In the documentation it is said that "Each representative genome has a better quality score than other members of the cluster". How is the quality score computed? Does it use only contamination/completeness or it also incorporates contiguity metrics (as dRep does)?

Thank you!

Low quality genomes end up as separate species.

Low-quality genomes cluster apart. I used galah on large datasets. What I noticed is that low-completeness genomes are often selected as separate species, even though GTDB-Tk annotates them as the same species as other high-quality genomes.

I have the impression that if a genome has low completeness it will not pass FastANI's minimum coverage and so yields no FastANI report. I don't know how you handle this internally, but I imagine such genomes perturb the clustering.

Would the solution simply be to use a very low --min-aligned-fraction?

If input genomes are short, they don't cluster

Issue is that --fragment-length is 3k by default, so if genomes (e.g. single contig phage genomes) are too short then they silently don't map. Maybe scan through the genomes to determine the shortest genome / contig, and automatically change to 500bp or something when input is short.
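
The suggested scan could look roughly like the following sketch. It is hypothetical and not galah behaviour: `shortest_sequence_len` reads FASTA text already loaded into memory, and `choose_fragment_length` simply falls back to a smaller value (such as the 500 bp mentioned above) when the shortest sequence is below the default.

```rust
/// Length of the shortest sequence in FASTA-formatted text, or None if the
/// text contains no records. Header lines start with '>'.
fn shortest_sequence_len(fasta: &str) -> Option<usize> {
    let mut lengths: Vec<usize> = Vec::new();
    for line in fasta.lines() {
        let line = line.trim();
        if line.starts_with('>') {
            lengths.push(0); // start a new record
        } else if let Some(last) = lengths.last_mut() {
            *last += line.len(); // sequence may span several lines
        }
    }
    lengths.into_iter().min()
}

/// Use `fallback_len` when the shortest input is shorter than `default_len`.
fn choose_fragment_length(shortest: usize, default_len: usize, fallback_len: usize) -> usize {
    if shortest < default_len {
        fallback_len
    } else {
        default_len
    }
}
```

For example, with a default of 3000 and a fallback of 500, an input whose shortest sequence is 2,000 bp would switch to 500, while one whose shortest sequence is 5,000 bp would keep 3000.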

unnecessary fastani check

pub fn generate_galah_clusterer<'a>(
    genome_fasta_paths: &'a Vec<String>,
    clap_matches: &clap::ArgMatches,
    argument_definition: &GalahClustererCommandDefinition,
) -> std::result::Result<GalahClusterer<'a>, String> {
    crate::external_command_checker::check_for_fastani();

check should be in init of the clusterer instead
