
metabinner's Introduction

MetaBinner

GitHub repository for the manuscript "MetaBinner: a high-performance and stand-alone ensemble binning method to recover individual genomes from complex microbial communities". We are glad that MetaBinner achieved top overall performance in the CAMI II Challenge. Please refer to Meyer, F. et al. [1] for the results of the CAMI II Challenge.

MetaBinner consists of two modules: 1) “Component module” includes steps 1-4, developed for generating high-quality, diverse component binning results; and 2) “Ensemble module” includes step 5, developed for recovering individual genomes from the component binning results. MetaBinner is an ensemble binning method, but it does not need the outputs of other individual binners. Instead, MetaBinner generates multiple high-quality component binning results based on the proposed “partial seed” method for further integration. Please see our manuscript for details.

Getting Started

Install MetaBinner via bioconda

conda create -n metabinner_env python=3.7.6
conda activate metabinner_env
conda install -c bioconda metabinner
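
To verify the installation and record the install location (also needed later as metabinner_path; see the usage example below), a quick check might be:

metabinner_path=$(dirname $(which run_metabinner.sh))
ls ${metabinner_path}/run_metabinner.sh ${metabinner_path}/scripts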

or Install MetaBinner via source code

Obtain the code and create an environment: after installing Anaconda (or Miniconda), first clone MetaBinner:

git clone https://github.com/ziyewang/MetaBinner.git

Then simply create an environment to run MetaBinner.

cd MetaBinner
conda env create -f metabinner_env.yaml
conda activate metabinner_env

System Requirements

OS Requirements

MetaBinner is supported and tested on Linux systems.

Preprocessing

The preprocessing steps aim to generate coverage and composition profiles as input to our program.

Several binning methods (such as CONCOCT and MetaWRAP) can generate these two types of information; we provide one way to generate the input files below.

Coverage Profile

The coverage profiles of the contigs for the results in the manuscript were obtained via the MetaWRAP 1.2.1 script "binning.sh".

If users have obtained the coverage (depth) file generated for MaxBin (mb2_master_depth.txt) using MetaWRAP, they can run the following command to generate the input coverage file for MetaBinner:

cat mb2_master_depth.txt | cut -f -1,4- > coverage_profile.tsv

or, to remove contigs of 1000 bp or shorter, run:

cat mb2_master_depth.txt | awk '{if ($2>1000) print $0 }' | cut -f -1,4- > coverage_profile_f1k.tsv
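
For reference, the two commands above assume the standard depth-file layout produced by jgi_summarize_bam_contig_depths (which MetaWRAP uses): column 1 is the contig name, column 2 the contig length, column 3 the total average depth, and columns 4 onward the per-sample depths. Hence the awk filter tests column 2, and `cut -f -1,4-` keeps the contig name plus the per-sample columns. If in doubt, inspect the first columns before filtering:

head -n 2 mb2_master_depth.txt | cut -f 1-5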

To generate coverage from sequencing reads directly, run the following script, slightly modified from MetaWRAP's "binning.sh". The script supports different types of sequencing reads; the default type is "paired" ([readsX_1.fastq readsX_2.fastq ...]). If MetaBinner is installed via bioconda, users can obtain path_to_MetaBinner by running: $(dirname $(which run_metabinner.sh))

cd path_to_MetaBinner
cd scripts

bash gen_coverage_file.sh -a contig_file \
-o output_dir_of_coveragefile \
path_to_sequencing_reads/*fastq

Options (an example invocation follows the list):

        -a STR          metagenomic assembly file
        -o STR          output directory (to save the coverage files)
        -b STR          directory for the bam files (optional)
        -t INT          number of threads (default=1)
        -m INT          amount of RAM available (default=4)
        -l INT          minimum contig length to bin (default=1000bp)
        --single-end    non-paired reads mode (provide *.fastq files)
        --interleaved   input read files contain interleaved paired-end reads
        -f STR          forward read suffix for paired reads (default="_1.fastq")
        -r STR          reverse read suffix for paired reads (default="_2.fastq")
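
For example, a hypothetical paired-end run with 10 threads and a 1000 bp length cutoff (all paths and sample names are placeholders):

bash gen_coverage_file.sh -a assembly.fa \
-o coverage_out \
-t 10 -m 32 -l 1000 \
reads/sampleA_1.fastq reads/sampleA_2.fastq \
reads/sampleB_1.fastq reads/sampleB_2.fastq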

Composition Profile

The composition profile is a vector representation of each contig; we use k-mer frequencies (k=4 in the example) to generate it. To generate the composition profile and keep only contigs longer than contig_length_threshold (e.g., 1000) for binning, run the script as follows:

cd path_to_MetaBinner
cd scripts

python gen_kmer.py test_data/final.contigs_f1k.fa 1000 4 

Here we choose k=4. We usually keep contigs longer than 1000 bp; users can specify a different threshold. The k-mer file will be generated in the directory of the contig file.

Users can then run the following command to keep only the contigs longer than 1000 bp for binning.

cd path_to_MetaBinner
cd scripts

python Filter_tooshort.py test_data/final.contigs_f1k.fa 1000

An example to run MetaBinner:

Test data is available at https://drive.google.com/file/d/1a-IOOpklXQr_C4sgNxjsxGEkx-n-0aa4/view?usp=sharing

#path to MetaBinner
metabinner_path=/home/wzy/MetaBinner
Note: If users install MetaBinner via bioconda, they can set metabinner_path as follows: metabinner_path=$(dirname $(which run_metabinner.sh))

##test data
#path to the input files for MetaBinner and the output dir:
contig_file=/home/wzy/MetaBinner/test_data/final_contigs_f1k.fa
output_dir=/home/wzy/MetaBinner/test_data/output
coverage_profiles=/home/wzy/MetaBinner/test_data/coverage_profile_f1k.tsv
kmer_profile=/home/wzy/MetaBinner/test_data/kmer_4_f1000.csv


bash run_metabinner.sh -a ${contig_file} -o ${output_dir} -d ${coverage_profiles} -k ${kmer_profile} -p ${metabinner_path}

Options (a consolidated end-to-end sketch follows the list):

        -a STR          metagenomic assembly file
        -o STR          output directory
        -d STR          coverage_profile.tsv; the coverage profiles: a table where each row corresponds
                            to a contig and each column corresponds to a sample. All values are tab-separated.
        -k STR          kmer_profile.csv; the composition profiles: a table where each row corresponds to a contig
                            and each column corresponds to a particular k-mer. All values are comma-separated.
        -p STR          path to MetaBinner; e.g. /home/wzy/MetaBinner
        -t INT          number of threads (default=1)
        -s STR          dataset scale; e.g. small, large, huge (default: large). Users can choose "huge" to run
                        MetaBinner on huge datasets with lower memory requirements.
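
Putting the preprocessing and binning steps together, a consolidated run might look like the sketch below. All paths are placeholders, and the derived file names (coverage_profile_f1k.tsv, contigs_kmer_4_f1000.csv, contigs_1000.fa) follow the naming patterns shown elsewhere on this page, so verify them on your system:

metabinner_path=$(dirname $(which run_metabinner.sh))

#1) coverage profile from the reads (see "Coverage Profile" above)
bash ${metabinner_path}/scripts/gen_coverage_file.sh -a $(pwd)/contigs.fa \
-o $(pwd)/coverage_out -t 8 \
$(pwd)/reads/*_1.fastq $(pwd)/reads/*_2.fastq

#2) composition profile: tetramer frequencies for contigs longer than 1000 bp
python ${metabinner_path}/scripts/gen_kmer.py $(pwd)/contigs.fa 1000 4

#3) drop contigs of 1000 bp or shorter
python ${metabinner_path}/scripts/Filter_tooshort.py $(pwd)/contigs.fa 1000

#4) binning (all paths must be absolute)
bash run_metabinner.sh -a $(pwd)/contigs_1000.fa -o $(pwd)/metabinner_out \
-d $(pwd)/coverage_out/coverage_profile_f1k.tsv \
-k $(pwd)/contigs_kmer_4_f1000.csv \
-p ${metabinner_path} -t 8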

#The file "metabinner_result.tsv" in "${output_dir}/metabinner_res" is the final output.
Note: all paths passed to the run_metabinner.sh options should be absolute.
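
The result file has two columns: the contig ID and the cluster (bin) it was assigned to. To write one fasta file per bin, the repository's scripts/gen_bins_from_tsv.py can be used; the flag meanings below (-f assembly, -r result tsv, -o output directory) are inferred from the script's appearance in the issue logs further down this page, so treat this as a sketch rather than documented usage:

#sketch: -f/-r/-o meanings inferred from the gen_bins_from_tsv.py tracebacks below
python ${metabinner_path}/scripts/gen_bins_from_tsv.py \
-f ${contig_file} \
-r ${output_dir}/metabinner_res/metabinner_result.tsv \
-o ${output_dir}/metabinner_res/metabinner_result_bins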

Contacts and bug reports

Please feel free to send bug reports or questions to Ziye Wang: [email protected] and Prof. Shanfeng Zhu: [email protected]

References

[1] Meyer, F., Fritz, A., Deng, ZL. et al. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat Methods (2022). https://doi.org/10.1038/s41592-022-01431-4

[2] Lu, Yang Young, et al. "COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge." Bioinformatics 33.6 (2017): 791-798.

[3] https://github.com/dparks1134/UniteM.

[4] Parks, Donovan H., et al. "CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes." Genome research 25.7 (2015): 1043-1055.

[5] Sieber, C.M.K., Probst, A.J., et al. "Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy." Nature Microbiology (2018). https://doi.org/10.1038/s41564-018-0171-1

[6] Uritskiy, Gherman V., Jocelyne DiRuggiero, and James Taylor. "MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis." Microbiome 6.1 (2018): 1-13.

Citation

Wang, Z., Huang, P., You, R. et al. MetaBinner: a high-performance and stand-alone ensemble binning method to recover individual genomes from complex microbial communities. Genome Biol 24, 1 (2023). https://doi.org/10.1186/s13059-022-02832-6

metabinner's People

Contributors

sjaenick, ziyewang


metabinner's Issues

component_binning.py error

Hi,
I wanted to run MetaBinner on my metagenome assembly. I have already produced the coverage and composition profile files, but when I passed them to run_metabinner.sh, the run ended with an error in the component_binning.py script (see here).

Can you please take a look?

Thanks,
Kika

no pplacer available for macOS

conda cannot find a version of pplacer for macOS, and the pplacer homepage only has a binary for Linux. Is there a workaround? If not, please note on your installation page that it cannot be installed on macOS systems.

Final bin directory?

Hello,

Forgive what is probably a stupid question, but I'm confused about the final output for metabinner. I see the metabinner_result.tsv file in the metabinner_res directory, but I don't understand where the actual fasta files for the final bins reside. I can find some files (e.g. bin_1.fna) within the directories in the ensemble_res directory but it's not clear to me which of these represents the final result I should be using for downstream analyses.

Thanks!

-Andrew

custom temporary path is needed

Dear developers,
I notice that some of my runs are not finishing because they run out of space in the /tmp folder. The last lines of my log:

2024-02-01 13:13:01,499 - run kmeans length weight with:        X_cov_logtrans
2024-02-01 13:13:04,437 - run partial seed kmeans bacar_marker seed length weight with: 1quarter_X_cov_logtrans
2024-02-01 13:13:06,310 - run kmeans length weight with:        X_cov_logtrans
2024-02-01 13:13:06,315 - run partial seed kmeans bacar_marker seed length weight with: 2quarter_X_cov_logtrans
2024-02-01 13:13:07,918 - run kmeans length weight with:        X_cov_logtrans
2024-02-01 13:13:07,919 - run partial seed kmeans bacar_marker seed length weight with: 3quarter_X_cov_logtrans
 
real    7m0.055s
user    7m27.919s
sys     0m29.038s
    Finished processing 0 of 32 (0.00%) bins.
    Finished processing 1 of 32 (3.12%) bins.
    Finished processing 2 of 32 (6.25%) bins.
    Finished processing 3 of 32 (9.38%) bins.
    Finished processing 4 of 32 (12.50%) bins.
Fatal exception (source file p7_hmmfile.c, line 2139):
hmm write failed
system error: No space left on device
sh: line 1: 552458 Aborted                 hmmfetch -f /scratch/project_2007362/software/mambaforge/envs/metabinner/checkm_data/hmms/checkm.hmm /tmp/valensan/20291777/8a09e8c3-25ab-472c-aebe-cb5f7b0c5d3f > /tmp/valensan/20291777/018fa7d2-47e8-4240-8b04-b126fe121ed3
Fatal exception (source file esl_ssi.c, line 1134):
ssi write failed
system error: No space left on device
sh: line 1: 552423 Aborted                 hmmfetch --index /tmp/valensan/20291777/f0ee17b0-eb6e-490d-9445-738a1f8580e9 > /dev/null
 
Error: File format problem in trying to open HMM file /tmp/valensan/20291777/f0ee17b0-eb6e-490d-9445-738a1f8580e9.
a /tmp/valensan/20291777/f0ee17b0-eb6e-490d-9445-738a1f8580e9.ssi file exists (an SSI index), but its SSI format is not recognized

I can see that hmmfetch is crashing, but I have no control over this through run_metabinner.sh.
Is there a way to specify a custom tmp folder for the entire pipeline?

thanks in advance!
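
A possible workaround, assuming the CheckM/HMMER steps honor the standard TMPDIR environment variable (not verified for every tool in the pipeline), is to point temporary files at a filesystem with more space before launching the run:

export TMPDIR=/scratch/$USER/tmp  #hypothetical location with enough free space
mkdir -p "$TMPDIR"
bash run_metabinner.sh -a ${contig_file} -o ${output_dir} \
-d ${coverage_profiles} -k ${kmer_profile} -p ${metabinner_path}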

A bug in gen_kmer.py

MetaBinner is a nice software package and performed very well on my data. However, I found a bug in the script 'gen_kmer.py'.

At line 63, you use both 'os.path.dirname(fasta_file)' and 'os.path.splitext(fasta_file)[0]', which is redundant: os.path.dirname gives the directory of the assembly file, while os.path.splitext(...)[0] already keeps that directory and only removes the suffix.
For example, if fasta_file is '03_connected_assembly/A1.fasta', the outfile will be '03_connected_assembly/03_connected_assembly/A1_kmer_4_f1000.csv'. This is obviously wrong, isn't it? I think you should just use 'os.path.splitext', like this:

outfile = os.path.splitext(fasta_file)[0] + '_kmer_' + str(kmer_len) + '_f' + str(length_threshold) + '.csv'


SafetyError - Incorrect size

Dear Ziye Wang,
Thank you for your work. I had successfully installed MetaBinner via bioconda in the past. Recently, I had to install it in a new conda environment. Unfortunately, I get the following error whether I try to install it through bioconda or through the metabinner_env.yaml file:

SafetyError: The package for checkm-genome located at /home/user/anaconda3/pkgs/checkm-genome-1.1.3-py_1
appears to be corrupted. The path 'site-packages/checkm/DATA_CONFIG'
has an incorrect size.
reported size: 215 bytes
actual size: 242 bytes

When trying the installation through Bioconda I also get the message:

ClobberError: The package 'conda-forge/linux-64::sqlite-3.32.3-hcee41ef_1' cannot be installed due to a
path collision for 'include/sqlite3.h'.
This path already exists in the target prefix, and it won't be removed by
an uninstall action in this transaction. The path appears to be coming from
the package 'conda-forge/linux-64::libsqlite-3.44.2-h2797004_0', which is already installed in the prefix.

Eventually, MetaBinner is not installed properly and cannot run. I have tried to install different versions of checkm-genome by modifying its version in the yaml file (e.g., a version above 1.1.3 such as 1.2.2), but to no avail. Do you know what is causing this and what I can do to fix this error?

Sincerely,
Georgios Filis
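
A generic conda remedy for SafetyError/ClobberError reports of this kind (not specific to MetaBinner) is to purge the corrupted package cache and reinstall into a fresh environment:

conda clean --all  #removes cached package tarballs, including the corrupted checkm-genome one
conda create -n metabinner_env python=3.7.6
conda activate metabinner_env
conda install -c bioconda metabinner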

Something went wrong with running split_hhbins.py.

Dear Ziye Wang,
Thank you for your work. I have successfully installed MetaBinner via bioconda and have run it successfully in the past, but now I have come across a problem which I cannot understand or overcome. I create the necessary files and run MetaBinner with the following commands (after activating the MetaBinner environment through conda):
1.
bash /home/user/anaconda3/envs/metabinner_env/bin/scripts/gen_coverage_file.sh -a /user_path/Contigs/Contigs_Formated.fna -o /user_path/Binning_Results/Coverages -t 6 -m 4 /user_path/Input/*fastq
2.
python /home/user/anaconda3/envs/metabinner_env/bin/scripts/gen_kmer.py /user_path/Contigs/Contigs_Formated.fna 1000 4
3.
python /home/user/anaconda3/envs/metabinner_env/bin/scripts/Filter_tooshort.py /user_path/Contigs/Contigs_Formated.fna 1000
4.
bash run_metabinner.sh -a /user_path/Contigs/Contigs_Formated.fna -o /user_path/Binning_Results/Bins -d /user_path/Binning_Results/Coverages/coverage_profile_f1k.tsv -k /user_path/Contigs/Contigs_Formated_kmer_4_f1000.csv -p /home/user/anaconda3/envs/metabinner_env/bin -t 6

The error I get is based on running "run_metabinner.sh" and it is the following:

  1. Based on the output (at the final lines of the output):
    Traceback (most recent call last):
      File "/home/user/anaconda3/envs/metabinner_env/bin/scripts/split_hhbins.py", line 517, in <module>
        bins, contigs = read_bins_from_one_dir(path)
      File "/home/user/anaconda3/envs/metabinner_env/bin/scripts/split_hhbins.py", line 429, in read_bins_from_one_dir
        bin_ext, count = get_bin_extension(bin_dir)
      File "/home/user/anaconda3/envs/metabinner_env/bin/scripts/metabinner_util.py", line 166, in get_bin_extension
        for f in os.listdir(bin_dir):
    FileNotFoundError: [Errno 2] No such file or directory: '/user_path/Binning_Results/Bins/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_t_logtrans_result.tsv_bins'
    Something went wrong with running split_hhbins.py. Exiting.

  2. The output also includes the following:
    Traceback (most recent call last):
      File "/home/user/anaconda3/envs/metabinner_env/bin/scripts/split_hhbins.py", line 517, in <module>
        bins, contigs = read_bins_from_one_dir(path)
      File "/home/user/anaconda3/envs/metabinner_env/bin/scripts/split_hhbins.py", line 429, in read_bins_from_one_dir
        bin_ext, count = get_bin_extension(bin_dir)
      File "/home/user/anaconda3/envs/metabinner_env/bin/scripts/metabinner_util.py", line 166, in get_bin_extension
        for f in os.listdir(bin_dir):
    FileNotFoundError: [Errno 2] No such file or directory: '/user_path/Binning_Results/Bins/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_1quarter_X_com_logtrans_result.tsv_bins'
    partial_seed_kmeans_bacar_marker_seed_length_weight_1quarter_X_cov_logtrans_result.tsv
    /home/user/anaconda3/envs/metabinner_env/lib/python3.7/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.cluster.k_means_ module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.cluster. Anything that cannot be imported from sklearn.cluster is now part of the private API.

  3. The output also includes (at the beginning) the following:
    2023-10-14 13:26:57,662 - markerCmd failed! Not exist: /home/user/anaconda3/envs/metabinner_env/bin/scripts/../auxiliary/test_getmarker_1quarter.pl /user_path/Contigs/Contigs_Formated.fna.bacar_marker.hmmout /user_path/Contigs/Contigs_Formated.fna 1001 /user_path/Contigs/Contigs_Formated.fna.bacar_marker.1quarter_lencutoff_1001.seed
    ...
    2023-10-14 13:26:57,676 - markerCmd failed! Not exist: /home/user/anaconda3/envs/metabinner_env/bin/scripts/../auxiliary/test_getmarker_2quarter.pl /user_path/Contigs/Contigs_Formated.fna.bacar_marker.hmmout /v/Contigs/Contigs_Formated.fna 1001 /user_path/Contigs/Contigs_Formated.fna.bacar_marker.2quarter_lencutoff_1001.seed
    ...
    2023-10-14 13:26:57,692 - markerCmd failed! Not exist: /home/user/anaconda3/envs/metabinner_env/bin/scripts/../auxiliary/test_getmarker_3quarter.pl /user_path/Contigs/Contigs_Formated.fna.bacar_marker.hmmout /user_path/Contigs/Contigs_Formated.fna 1001 /user_path/Contigs/Contigs_Formated.fna.bacar_marker.3quarter_lencutoff_1001.seed

  4. The result log file includes the following:
    2023-10-14 13:26:57,662 - markerCmd failed! Not exist: /home/user/anaconda3/envs/metabinner_env/bin/scripts/../auxiliary/test_getmarker_1quarter.pl /user_path/Contigs/Contigs_Formated.fna.bacar_marker.hmmout /user_path/Contigs/Contigs_Formated.fna 1001 /user_path/Contigs/Contigs_Formated.fna.bacar_marker.1quarter_lencutoff_1001.seed
    2023-10-14 13:26:57,663 - bacar_marker_1quarter_seed_num: 0
    2023-10-14 13:26:57,663 - exec cmd: /home/user/anaconda3/envs/metabinner_env/bin/scripts/../auxiliary/test_getmarker_2quarter.pl /user_path/Contigs/Contigs_Formated.fna.bacar_marker.hmmout /user_path/Contigs/Contigs_Formated.fna 1001 /user_path/Contigs/Contigs_Formated.fna.bacar_marker.2quarter_lencutoff_1001.seed
    2023-10-14 13:26:57,676 - markerCmd failed! Not exist: /home/user/anaconda3/envs/metabinner_env/bin/scripts/../auxiliary/test_getmarker_2quarter.pl /user_path/Contigs/Contigs_Formated.fna.bacar_marker.hmmout /user_path/Contigs/Contigs_Formated.fna 1001 /user_path/Contigs/Contigs_Formated.fna.bacar_marker.2quarter_lencutoff_1001.seed
    2023-10-14 13:26:57,677 - bacar_marker_2quarter_seed_num: 0
    2023-10-14 13:26:57,677 - exec cmd: /home/user/anaconda3/envs/metabinner_env/bin/scripts/../auxiliary/test_getmarker_3quarter.pl /user_path/Contigs/Contigs_Formated.fna.bacar_marker.hmmout /user_path/Contigs/Contigs_Formated.fna 1001 /user_path/Contigs/Contigs_Formated.fna.bacar_marker.3quarter_lencutoff_1001.seed
    2023-10-14 13:26:57,692 - markerCmd failed! Not exist: /home/user/anaconda3/envs/metabinner_env/bin/scripts/../auxiliary/test_getmarker_3quarter.pl /user_path/Contigs/Contigs_Formated.fna.bacar_marker.hmmout /user_path/Contigs/Contigs_Formated.fna 1001 /user_path/Contigs/Contigs_Formated.fna.bacar_marker.3quarter_lencutoff_1001.seed

  5. The last lines of the post_result.log file are the following:
    2023-10-14 13:27:25,539 - Contig_file: /user_path/Contigs/Contigs_Formated.fna
    2023-10-14 13:27:25,539 - Coverage_profiles: /user_path/Binning_Results/Coverages/coverage_profile_f1k.tsv
    2023-10-14 13:27:25,539 - Composition_profiles: /user_path/Contigs/Contigs_Formated_kmer_4_f1000.csv
    2023-10-14 13:27:25,539 - The binning result file to be handled: /user_path/Binning_Results/Bins/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_t_logtrans_result.tsv_bins
    2023-10-14 13:27:25,539 - The number of threads: 6
    2023-10-14 13:27:25,558 - The number of contigs: 14

  6. The following files are retained in the contigs folder after the error:
    Contigs_Formated.fna.bacar_marker.hmmout
    Contigs_Formated.fna.bacar_marker.hmmout.err
    Contigs_Formated.fna.bacar_marker.hmmout.out
    Contigs_Formated.fna.bacar_marker.frag.err
    Contigs_Formated.fna.bacar_marker.frag.faa
    Contigs_Formated.fna.bacar_marker.frag.ffn
    Contigs_Formated.fna.bacar_marker.frag.gff
    Contigs_Formated.fna.bacar_marker.frag.out

A detailed description of the final output

Dear Ziye:

Thank you for this excellent binning software. I am a little confused about the output file "metabinner_result.tsv": I guess the first column shows the contig, but what do the numbers in the second column mean? Also, I want to confirm whether the bins in the dir "kmeans_length_weight_X_t_logtrans_result.tsv_bins" are the final bins.

All the best

Two bugs to fix for preparing data

  1. In gen_coverage_file.sh, you forgot to apply the user-selected minimum contig length; please check it.
  2. In gen_kmer.py, you forgot to separate the filename from the dirname; please check it.

Unable to allocate XX GiB for an array with shape (XX, XX) error

Hi,
I got this error when trying to bin 900k contigs with coverage from 45+ samples:

Traceback (most recent call last):
  File "/work/sber/Software/miniconda3/envs/metabinner_env/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
    r = call_item()
  File "/work/sber/Software/miniconda3/envs/metabinner_env/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/work/sber/Software/miniconda3/envs/metabinner_env/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/work/sber/Software/miniconda3/envs/metabinner_env/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/work/sber/Software/miniconda3/envs/metabinner_env/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/work/sber/Software/miniconda3/envs/metabinner_env/lib/python3.7/site-packages/sklearn/cluster/_kmeans.py", line 322, in _kmeans_single_elkan
    max_iter=max_iter, verbose=verbose)
  File "sklearn/cluster/_k_means_elkan.pyx", line 236, in sklearn.cluster._k_means_elkan.k_means_elkan
MemoryError: Unable to allocate 3.40 GiB for an array with shape (900786, 506) and data type float64

Any ideas on how to resolve that?
Best
Greg
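
One documented lever for memory pressure is the -s dataset-scale option from the usage section above; assuming the same inputs, a lower-memory run would be:

bash run_metabinner.sh -a ${contig_file} -o ${output_dir} \
-d ${coverage_profiles} -k ${kmer_profile} \
-p ${metabinner_path} -s huge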

Exit with error when running MetaBinner

I ran MetaBinner with the following command line:

run_metabinner.sh \
-a /vol/projects/dzhiluo/outputs/results/flye/S1/assembly.fasta \
-o /vol/projects/dzhiluo/outputs/results/metabinner \
-d /vol/projects/dzhiluo/outputs//results/read_coverage/S1.read_to_contig.metabinner.txt \
-k /vol/projects/dzhiluo/outputs/results/flye/S1/kmer_4_f1000.csv \
-p /home/dzhiluo/miniconda3/envs/binner/bin

but got the error:

Traceback (most recent call last):
  File "/home/dzhiluo/miniconda3/envs/binner/bin/scripts/split_hhbins.py", line 514, in <module>
    bins, contigs = read_bins_from_one_dir(path)
  File "/home/dzhiluo/miniconda3/envs/binner/bin/scripts/split_hhbins.py", line 426, in read_bins_from_one_dir
    bin_ext, count = get_bin_extension(bin_dir)
  File "/home/dzhiluo/miniconda3/envs/binner/bin/scripts/metabinner_util.py", line 166, in get_bin_extension
    for f in os.listdir(bin_dir):
FileNotFoundError: [Errno 2] No such file or directory: '/vol/projects/dzhiluo/outputs/results/metabinner/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_1quarter_X_com_logtrans_result.tsv_bins'
partial_seed_kmeans_bacar_marker_seed_length_weight_1quarter_X_cov_logtrans_result.tsv

The output looks like:

outputs/results/metabinner/metabinner_res/
├── intermediate_result
│   ├── kmeans_length_weight_X_com_logtrans_result.tsv
│   ├── kmeans_length_weight_X_com_logtrans_result.tsv_bins
│   │   ├── 0.fa
│   │   └── 3.fa
│   ├── kmeans_length_weight_X_com_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins
│   │   ├── 0.fa
│   │   └── 3.fa
│   ├── kmeans_length_weight_X_cov_logtrans_result.tsv
│   ├── kmeans_length_weight_X_cov_logtrans_result.tsv_bins
│   │   ├── 0.fa
│   │   └── 15.fa
│   ├── kmeans_length_weight_X_cov_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins
│   │   ├── 0.fa
│   │   └── 15.fa
│   ├── kmeans_length_weight_X_t_logtrans_result.tsv
│   ├── kmeans_length_weight_X_t_logtrans_result.tsv_bins
│   │   ├── 0.fa
│   │   └── 16.fa
│   └── kmeans_length_weight_X_t_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins
│       ├── 0.fa
│       └── 16.fa
├── intermediate_result.res_namelist.tsv
├── post_result.log
├── result.log
└── unitem_profile
    ├── binning_methods
    │   └── X_t_logtrans_ori
    ├── bin_quality_summary.tsv
    ├── bins_dir.tsv
    ├── kmeans_length_weight_X_t_logtrans_result.tsv
    ├── kmeans_length_weight_X_t_logtrans_result.tsv_bins
    │   ├── 0.fa
    │   └── 16.fa
    └── X_t_logtrans_ori_quality.tsv

How should I get it to run?

Replace FragGeneScan?

Hi,

thanks for your work. FragGeneScan is pretty old, and even its successor, FragGeneScanPlus, is
no longer maintained (and has known bugs). There's a Rust-based reimplementation available at

https://github.com/unipept/FragGeneScanRs

Since it's both maintained and way faster, maybe consider this as a replacement?

Binning finishes successfully but with errors

Hi,

I'm running metabinner on a set of assemblies and it works great on all of them except one assembly. For that assembly it shows the Binning Finished! message, but I can see that above it there are some error messages. This is a snippet of the end of the output messages:

2021-12-08 14:36:04,589 - The binning result file to be handled:        /data/san/data0/users/chris/binning/data/processed/bins/random/metabinner/spherical//metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_cov_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins
2021-12-08 14:36:04,589 - The number of threads:        10
2021-12-08 14:36:04,639 - The number of contigs:        2203
partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_t_logtrans_result.tsv
/data/san/data0/users/chris/binning/.snakemake/conda/11cc0ac87d0563d6b0f17d15c2dfb3b9/lib/python3.7/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.cluster.k_means_ module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.cluster. Anything that cannot be imported from sklearn.cluster is now part of the private API.
  warnings.warn(message, FutureWarning)
2021-12-08 14:36:05,325 - Input arguments:
2021-12-08 14:36:05,325 - Contig_file:  /data/san/data0/users/chris/binning/data/processed/assemblies/random/spherical/assembly_1000.fa
2021-12-08 14:36:05,325 - Coverage_profiles:    /data/san/data0/users/chris/binning/data/processed/assemblies/random/spherical/depth1kb.tsv
2021-12-08 14:36:05,325 - Composition_profiles: /data/san/data0/users/chris/binning/data/processed/assemblies/random/spherical/kmer_4_f1000.csv
2021-12-08 14:36:05,325 - The binning result file to be handled:        /data/san/data0/users/chris/binning/data/processed/bins/random/metabinner/spherical//metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_t_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins
2021-12-08 14:36:05,325 - The number of threads:        10
2021-12-08 14:36:05,375 - The number of contigs:        2203
Processing 9 genomes from kmeans_length_weight_X_t_logtrans_result.tsv with extension 'fa'.
Processing 9 genomes from partial_seed_kmeans_bacar_marker_seed_length_weight_1quarter_X_t_logtrans_result.tsv with extension 'fa'.
Processing 9 genomes from partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_t_logtrans_result.tsv with extension 'fa'.
Processing 9 genomes from partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_t_logtrans_result.tsv with extension 'fa'.
Processing 24 genomes from kmeans_length_weight_X_cov_logtrans_result.tsv with extension 'fa'.
Processing 19 genomes from partial_seed_kmeans_bacar_marker_seed_length_weight_1quarter_X_cov_logtrans_result.tsv with extension 'fa'.
Processing 19 genomes from partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_cov_logtrans_result.tsv with extension 'fa'.
Processing 19 genomes from partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_cov_logtrans_result.tsv with extension 'fa'.
Processing 9 genomes from kmeans_length_weight_X_com_logtrans_result.tsv with extension 'fa'.
Processing 9 genomes from partial_seed_kmeans_bacar_marker_seed_length_weight_1quarter_X_com_logtrans_result.tsv with extension 'fa'.
Processing 9 genomes from partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_com_logtrans_result.tsv with extension 'fa'.
Processing 9 genomes from partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_com_logtrans_result.tsv with extension 'fa'.
bin_dir:        /data/san/data0/users/chris/binning/data/processed/bins/random/metabinner/spherical//metabinner_res/ensemble_res/X_t_logtrans_2postprocess/greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins
Get initial quality of bins.
bin_dir:        /data/san/data0/users/chris/binning/data/processed/bins/random/metabinner/spherical//metabinner_res/ensemble_res/X_cov_logtrans_2postprocess/greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins
Get initial quality of bins.
bin_dir:        /data/san/data0/users/chris/binning/data/processed/bins/random/metabinner/spherical//metabinner_res/ensemble_res/X_com_logtrans_2postprocess/greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins
Get initial quality of bins.
Selected 2065 from partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_t_logtrans_result.tsv with quality = 98.3 (comp. = 98.3%, cont. = 0.0%).
Selected 2065 from partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_com_logtrans_result.tsv with quality = 98.3 (comp. = 98.3%, cont. = 0.0%).
Selected 435 from partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_t_logtrans_result.tsv with quality = 98.3 (comp. = 98.3%, cont. = 0.0%).
Selected 430 from kmeans_length_weight_X_com_logtrans_result.tsv with quality = 98.3 (comp. = 98.3%, cont. = 0.0%).
Selected 589 from partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_t_logtrans_result.tsv with quality = 97.9 (comp. = 97.9%, cont. = 0.0%).
Selected 589 from partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_com_logtrans_result.tsv with quality = 97.9 (comp. = 97.9%, cont. = 0.0%).
Selected 1893 from partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_t_logtrans_result.tsv with quality = 97.4 (comp. = 100.0%, cont. = 0.9%).
Selected 1782 from partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_t_logtrans_result.tsv with quality = 97.3 (comp. = 97.3%, cont. = 0.0%).
Selected 1700 from kmeans_length_weight_X_com_logtrans_result.tsv with quality = 97.4 (comp. = 100.0%, cont. = 0.9%).
Selected 1705 from partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_t_logtrans_result.tsv with quality = 96.6 (comp. = 96.6%, cont. = 0.0%).
Selected 1782 from partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_com_logtrans_result.tsv with quality = 97.3 (comp. = 97.3%, cont. = 0.0%).
Selected 455 from partial_seed_kmeans_bacar_marker_seed_length_weight_1quarter_X_t_logtrans_result.tsv with quality = 69.4 (comp. = 69.4%, cont. = 0.0%).
Selected 1705 from partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_com_logtrans_result.tsv with quality = 96.6 (comp. = 96.6%, cont. = 0.0%).
Selected 455 from partial_seed_kmeans_bacar_marker_seed_length_weight_1quarter_X_com_logtrans_result.tsv with quality = 69.4 (comp. = 69.4%, cont. = 0.0%).
mv: cannot stat 'Refined_ABC/Refined': No such file or directory
mv: cannot stat 'Refined_AB/Refined': No such file or directory
mv: cannot stat 'Refined_BC/Refined': No such file or directory
Processing 7 genomes from X_t_logtrans with extension 'fna'.
No bins identified for X_cov_logtrans in /data/san/data0/users/chris/binning/data/processed/bins/random/metabinner/spherical//metabinner_res/ensemble_res/X_cov_logtrans_2postprocess/greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins.
Processing 7 genomes from X_com_logtrans with extension 'fna'.
Input directory does not exists: /data/san/data0/users/chris/binning/data/processed/bins/random/metabinner/spherical//metabinner_res/ensemble_res/greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins/ensemble_3logtrans/Refined_ABC/Refined_ABC

cp: cannot stat '/data/san/data0/users/chris/binning/data/processed/bins/random/metabinner/spherical//metabinner_res/ensemble_res/greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins/ensemble_3logtrans/addrefined2and3comps/greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins_res.tsv': No such file or directory
Binning Finished!

I'm just wondering if this is normal and just MetaBinner's way of saying that it couldn't find any bins, or if something else is going on?

Thanks,

Chris

Error running bin command

I am trying to run MetaBinner and I got this error:

/home/data/mmeg_ngs/rodrigo_taketani/condaenvs/metabinner/lib/python3.7/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.cluster.k_means_ module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.cluster. Anything that cannot be imported from sklearn.cluster is now part of the private API.
warnings.warn(message, FutureWarning)
2021-11-22 20:44:32,204 - Input arguments:
2021-11-22 20:44:32,204 - Contig_file: /home/data/mmeg_ngs_seqs_database/MAG/megahit_out/final.contigs.fa
2021-11-22 20:44:32,204 - Coverage_profiles: /home/data/mmeg_ngs_seqs_database/MAG/coverage_profile.tsv
2021-11-22 20:44:32,204 - Composition_profiles: /home/data/mmeg_ngs_seqs_database/MAG/megahit_out/kmer_4_f1000.csv
2021-11-22 20:44:32,205 - Output file path: /home/data/mmeg_ngs_seqs_database/MAG/metabinner/metabinner_res/result.tsv
2021-11-22 20:44:32,205 - Predefined Clusters: Auto
2021-11-22 20:44:32,205 - The number of threads: 10
Traceback (most recent call last):
  File "component_binning.py", line 403, in <module>
    X_t, namelist, mapObj, X_cov, X_com = gen_X(com_file, cov_file)
  File "component_binning.py", line 81, in gen_X
    compositMat = shuffled_compositMat[covIdxArr]
IndexError: index 94834869609936 is out of bounds for axis 0 with size 1509418

real 1m7.945s
user 0m38.402s
sys 0m4.808s

The coverage file was generated by converting the MaxBin depth file with the command from the README page.

Thanks in advance.

Question with the metabinner_result.tsv

Dear MetaBinner support team,

I got my metabinner_result.tsv already (#The file "metabinner_result.tsv" in the "${output_dir}/metabinner_res" is the final output.)

My metabinner_result.tsv only has two columns, as shown below. But if I want to see the completeness and contamination of the assembled bins or MAGs, which file should I check? I want to know how many bins have >70% completeness and <5% contamination.


Best,

Bing

make gen_coverage_file accept compressed files

Dear developers,
The gen_coverage_file.sh script used to generate the coverages only accepts .fastq files, but my files (more than 2000 of them) are all compressed in .fq.gz format. How can I make the script accept the compressed format?

thanks in advance!
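
Until compressed input is supported, one workaround is to decompress copies up front, since gen_coverage_file.sh expects plain .fastq files. The file names below are hypothetical, and gunzip -k keeps the .gz originals:

for f in reads/*_1.fq.gz reads/*_2.fq.gz; do
    gunzip -k "$f"
    mv "${f%.fq.gz}.fq" "${f%.fq.gz}.fastq"  #match the default *_1.fastq/*_2.fastq suffixes
done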

Feature request: Use `sample name` in `gen_kmer.py` output

Hi,

I was looking into running MetaBinner on multiple samples via Snakemake, and all my sample assembly files are stored in the same directory. While I could move them to individual directories, it may be more efficient to either allow naming the outputs of gen_kmer.py or, alternatively, to use the sample name in the output.

For example, in line 19 of your Filter_tooshort.py script, you have the following line, where the sample name is used as a prefix:

output_file = os.path.splitext(input_file)[0] + '_' + str(k) + '.fa'

Could you please add something similar to line 63 of the gen_kmer.py script? Example below:

 outfile = os.path.join(os.path.dirname(fasta_file), os.path.splitext(fasta_file)[0] + '_kmer_' + str(kmer_len) + '_f' + str(length_threshold) + '.csv')

Thank you,
Susheel

Unnecessary requirements in installation. Remove and publish to conda?

Taking a look at the metabinner_env.yaml, there appears to be a plethora of packages being installed that aren't actually used (it appears that the yaml file was generated from a conda env export -f metabinner_env.yml or the like, and dumped the whole dev environment).

For example: freetype and harfbuzz even though text is neither being manipulated nor added to images, all the R packages even though I do not see R being used in the code-base, libtiff even though images are not being generated, etc. etc.

I imagine only installing what is absolutely necessary (click, biopython, numpy, pandas, mimetypes, scipy, sklearn, biolib, etc.) would make publication on conda/anaconda/bioconda rather straightforward. CheckM itself is on bioconda, so this would further simplify the process.

Let me know if you want any help/pointers in publishing to bioconda!

Got error: out of bounds for axis 0 while running Metabinner.py

Traceback (most recent call last):
  File "/CAMI/docker/binners/MetaBinner/Metabinner.py", line 913, in <module>
    X_t, namelist, mapObj, X_cov_sr, X_com = gen_X(com_file, cov_file)
  File "/CAMI/docker/binners/MetaBinner/Metabinner.py", line 96, in gen_X
    compositMat = shuffled_compositMat[covIdxArr]
IndexError: index 94388951158496 is out of bounds for axis 0 with size 349831

BTW, there were only two coverage files produced by run.sh, even though I used both short reads and long reads for computing coverage. But the example command line has two coverage files: coverage_sr_new.tsv, coverage_pb_new.tsv. Did I get something wrong?

Run metabinner with designated BAM

Hi!

Thank you for the useful tool. I would like to suggest adding an option to gen_coverage_file.sh so that it accepts a designated BAM file. The help manual of gen_coverage_file.sh says "If the output already has the .bam alignments files from previous runs, the module will skip re-aligning the reads", but it is not clear to me how such BAM files should be named, and I do not want to put BAM files and binning output into the same directory.

Sincerely,

Cong
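
Note that the gen_coverage_file.sh option list earlier on this page already includes a -b flag ("directory for the bam files (optional)"); assuming it behaves as described, pre-existing alignments might be supplied like this (paths are placeholders):

bash gen_coverage_file.sh -a assembly.fa \
-o coverage_out \
-b existing_bam_dir \
reads/*_1.fastq reads/*_2.fastq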

Differences between SolidBin and MetaBinner

MetaBinner is a good software package for metagenomic binning analysis. In my work, I am trying to find the best binning software, so I tested several binners. I found that Ziye Wang is the author of both SolidBin and MetaBinner, and the workflows of these two tools are very similar. My question is: what is the difference between the two, and which one is better?

By the way, the CheckM database is very old; is there any plan to replace CheckM1 with CheckM2?

Sincerely,

Qu Liping

error in component_binning.py

Hi,

I'm trying to run MetaBinner on co-assembly data. I seem to have successfully created a coverage file and a k-mer file, but I'm running into an error, see below. I tried removing the underscore from "sklearn.cluster._kmeans" because I thought this might be the problem, but I get the same error. Do you maybe know what is going on?
Do you maybe know what is going on?

Thank you, Heleen

Traceback (most recent call last):
  File "./component_binning.py", line 23, in <module>
    from sklearn.cluster._kmeans import euclidean_distances, stable_cumsum, KMeans, check_random_state, row_norms, MiniBatchKMeans
ImportError: No module named _kmeans

real 0m0.888s
user 0m0.582s
sys 0m1.142s
Something went wrong with generating component binning results. Exiting.

markerCMD failed

Hi, thank you for the tool!

I ran it on a MEGAHIT assembly of a metagenome with a eukaryotic host and many other unicellular eukaryotes, so I'm curious to see how it performed on the prokaryotic and the eukaryotic fractions.

It seems the run finished successfully with some bins:

Get initial quality of bins.
Selected Refined_9 from Refined_AC with quality = 98.3 (comp. = 98.3%, cont. = 0.0%).
Selected Refined_7 from Refined_AC with quality = 98.3 (comp. = 98.3%, cont. = 0.0%).
Selected bin_2 from X_t_logtrans with quality = 98.3 (comp. = 98.3%, cont. = 0.0%).
Selected Refined_10 from Refined_AC with quality = 91.2 (comp. = 91.2%, cont. = 0.0%).
Selected Refined_11 from Refined_AC with quality = 86.2 (comp. = 96.6%, cont. = 3.4%).
Selected bin_4 from X_com_logtrans with quality = 82.9 (comp. = 82.9%, cont. = 0.0%).
Selected Refined_14 from Refined_AC with quality = 76.6 (comp. = 82.8%, cont. = 2.0%).
Selected bin_6 from X_t_logtrans with quality = 76.4 (comp. = 91.9%, cont. = 5.2%).
Selected Refined_13 from Refined_AC with quality = 71.0 (comp. = 82.3%, cont. = 3.8%).
Selected Refined_1 from Refined_AC with quality = 65.3 (comp. = 70.5%, cont. = 1.7%).
Selected Refined_15 from Refined_AC with quality = 63.8 (comp. = 69.0%, cont. = 1.7%).
Selected Refined_1 from Refined_AB with quality = 63.8 (comp. = 63.8%, cont. = 0.0%).
Selected bin_11 from X_com_logtrans with quality = 59.3 (comp. = 70.1%, cont. = 3.6%).
Selected bin_13 from X_t_logtrans with quality = 50.0 (comp. = 83.6%, cont. = 11.2%).
Selected bin_14 from X_com_logtrans with quality = 44.1 (comp. = 60.2%, cont. = 5.3%).
Selected bin_3 from X_cov_logtrans with quality = 41.3 (comp. = 53.5%, cont. = 4.1%).
Selected bin_15 from X_t_logtrans with quality = 33.0 (comp. = 57.3%, cont. = 8.1%).
Selected bin_17 from X_com_logtrans with quality = 26.3 (comp. = 68.2%, cont. = 14.0%).
Binning Finished!

but apart from some sklearn warnings:

partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_cov_logtrans_result.tsv
/.../envs/metabinner-1.4.4_env/lib/python3.7/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.cluster.k_means_ module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.cluster. Anything that cannot be imported from sklearn.cluster is now part of the private API.
  warnings.warn(message, FutureWarning)

there was an error with test_getmarker_3quarter.pl in several spots:

2024-05-07 01:24:45,264 - markerCmd failed! Not exist: /.../envs/metabinner-1.4.4_env/bin/auxiliary/test_getmarker_3quarter.pl /.../metab_out/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_com_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins/37742_reclustered_0.fa.bacar_marker.hmmout /.../metab_out/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_com_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins/37742_reclustered_0.fa 1001 /.../metab_out/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_2quarter_X_com_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins/37742_reclustered_0.fa.bacar_marker.3quarter.seed

2024-05-07 01:25:14,796 - markerCmd failed! Not exist: /.../envs/metabinner-1.4.4_env/bin/auxiliary/test_getmarker_3quarter.pl /.../metab_out/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_com_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins/35559_reclustered_0.fa.bacar_marker.hmmout /.../metab_out/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_com_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins/35559_reclustered_0.fa 1001 /.../metab_out/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_com_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins/35559_reclustered_0.fa.bacar_marker.3quarter.seed

There are a few more of those errors, but they are all similar. Can you make sense of it? Thanks.

Another question:

  • When it comes to (multi sample) cross-assemblies, what is the best strategy to pass the reads to MetaBinner?
    • Give them all as combined_reads_R1/R2.fastq
    • Give all as individual samples sample1_R1/R2.fastq sample2_R1/R2.fastq etc
    • Only use the reads from the sample/assembly we are binning sample1_R1/R2.fastq?
  • Any advantage of subsampling reads before binning or would you not recommend it?

Thank you!

Input file does not exists

Hi,
I am running the pipeline successfully on some samples, but on others the pipeline doesn't finish. I tried to find the error, but the logs look "normal", which makes tracking it down difficult. In this case the missing file is greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins_res.tsv, and I cannot find the script that generates it to check what's happening. What could I do?

last lines of the log:

2024-02-08 15:46:24,892 - Processing file:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/HeP-1057_1_month_metabinner/s2_output/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_com_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins/1399_reclustered_0.fa
2024-02-08 15:46:24,901 - Reading Map:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/HeP-1057_1_month_metabinner/s2_output/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_com_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins/1399_reclustered_0.fa.reclustered.tsv
2024-02-08 15:46:24,902 - Writing bins:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/HeP-1057_1_month_metabinner/s2_output/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_com_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins_post_process_mincomp_70_mincont_50_bins/
/scratch/project_2007362/software/mambaforge/envs/metabinner/lib/python3.7/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.cluster.k_means_ module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.cluster. Anything that cannot be imported from sklearn.cluster is now part of the private API.
  warnings.warn(message, FutureWarning)
2024-02-08 15:46:26,310 - Input arguments:
2024-02-08 15:46:26,310 - Contig_file:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/bams_HeP-1057_1_month/HeP-1057_1_month_500.fa
2024-02-08 15:46:26,310 - Coverage_profiles:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/HeP-1057_1_month_metabinner/coverage_profile_f0.5k.tsv
2024-02-08 15:46:26,310 - Composition_profiles:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/bams_HeP-1057_1_month/HeP-1057_1_month_kmer_4_f500.csv
2024-02-08 15:46:26,310 - The binning result file to be handled:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/HeP-1057_1_month_metabinner/s2_output/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_cov_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins
2024-02-08 15:46:26,311 - The number of threads:	8
2024-02-08 15:46:26,432 - The number of contigs:	3712
/scratch/project_2007362/software/mambaforge/envs/metabinner/lib/python3.7/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.cluster.k_means_ module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.cluster. Anything that cannot be imported from sklearn.cluster is now part of the private API.
  warnings.warn(message, FutureWarning)
2024-02-08 15:46:27,996 - Input arguments:
2024-02-08 15:46:27,996 - Contig_file:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/bams_HeP-1057_1_month/HeP-1057_1_month_500.fa
2024-02-08 15:46:27,997 - Coverage_profiles:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/HeP-1057_1_month_metabinner/coverage_profile_f0.5k.tsv
2024-02-08 15:46:27,997 - Composition_profiles:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/bams_HeP-1057_1_month/HeP-1057_1_month_kmer_4_f500.csv
2024-02-08 15:46:27,997 - The binning result file to be handled:	/scratch/project_2007362/sandro/HeP_samples/2_assembly/HeP-1057_1_month_metabinner/s2_output/metabinner_res/intermediate_result/partial_seed_kmeans_bacar_marker_seed_length_weight_3quarter_X_t_logtrans_result.tsv_bins_post_process_mincomp_70_mincont_50_bins
2024-02-08 15:46:27,997 - The number of threads:	8
2024-02-08 15:46:28,117 - The number of contigs:	3712
Input file does not exists: /scratch/project_2007362/sandro/HeP_samples/2_assembly/HeP-1057_1_month_metabinner/s2_output/metabinner_res/ensemble_res/greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins/ensemble_3logtrans/addrefined2and3comps_bins_dir.tsv

cp: cannot stat '/scratch/project_2007362/sandro/HeP_samples/2_assembly/HeP-1057_1_month_metabinner/s2_output/metabinner_res/ensemble_res/greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins/ensemble_3logtrans/addrefined2and3comps/greedy_cont_weight_3_mincomp_50.0_maxcont_15.0_bins_res.tsv': No such file or directory

thanks in advance!

Update ETA

Hi there,
Congrats on the tool - I'm really interested in using it for a large metagenomics dataset that I pulled from NCBI. First question: when are you planning to release the update? Second: would you recommend running this tool on a dataset that includes unrelated samples?
Thanks in advance and all the best!

error in subscript of component_binning.py

Main problem: "File "./component_binning.py", line 406, in" (see screenshot). I used the test data and command line provided by your team: run_metabinner.sh -a final_contigs_f1k.fa -d coverage_profile_f1k.tsv -t 40 -k kmer_4_f1000.csv -p /home/chenkai/miniconda3/envs/metabinner_env/bin/ -o metabinner.test
[error screenshot]
Thanks

Fastq naming issue

Hi,

I'm using the gen_coverage_file.sh script with the following command:

bash gen_coverage_file.sh -a GL91_DN_999.fa -o $(dirname GL91_DN/coverage_profile_f1k.tsv) GL100_DN_mg.r1.preprocessed.fq GL100_DN_mg.r2.preprocessed.fq GL100_UP_mg.r1.preprocessed.fq GL100_UP_mg.r2.preprocessed.fq 
GL91_DN_mg.r1.preprocessed.fq GL91_DN_mg.r2.preprocessed.fq -t 12 -m 200 -l 999

However, I get the following error; it looks like the script expects the read files to be named in a very specific format (*_1.fastq and *_2.fastq), whereas my files are called sample.{r1,r2}.preprocessed.fq:

Tue Apr 19 09:33:35 CEST 2022

------------------------------------------------------------------------------------------------------------------------
-----                                           Entered read type: paired                                          -----
------------------------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------------------------
-----                  Unable to find proper fastq read pair in the format *_1.fastq and *_2.fastq                 -----
------------------------------------------------------------------------------------------------------------------------


Usage: bash gen_coverage_file.sh [options] -a assembly.fa -o output_dir readsA_1.fastq readsA_2.fastq ... [readsX_1.fastq readsX_2.fastq]
Note1: Make sure to provide all your separately replicate read files, not the joined file.
Note2: You may provide single end or interleaved reads as well with the use of the correct option
Note3: If the output already has the .bam alignments files from previous runs, the module will skip re-aligning the reads

Would it be possible to make it more generic?

Thank you,
Susheel
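
In the meantime, the option list for gen_coverage_file.sh earlier on this page documents -f/-r flags for custom forward/reverse read suffixes; assuming they work as described, something like this might avoid renaming (options placed before the read files):

bash gen_coverage_file.sh -a GL91_DN_999.fa \
-o GL91_DN \
-t 12 -m 200 -l 999 \
-f .r1.preprocessed.fq -r .r2.preprocessed.fq \
*.r1.preprocessed.fq *.r2.preprocessed.fq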

something wrong in the script - Filter_tooshort.py

Hi Ziye,

I ran Filter_tooshort.py on my MEGAHIT-assembled final.contigs.fa. I checked the generated final.contigs_1000.fa: there is no ">" before each sequence name, so the kmer_4_f1000.csv generated by gen_kmer.py is also empty. Maybe you need to check the Filter_tooshort.py script.

[Screenshot of the final.contigs_1000.fa]

[Screenshot of the kmer_4_f1000.csv]

Best,

Bing

run_FragGeneScan.pl error - Hmmsearch failed

Dear Ziye Wang,
I am facing an error and I need your help and insight. I have installed MetaBinner via source code (based on the metabinner_env.yaml file). The processes related to the coverage profile and the composition profile run without a problem. When running "run_metabinner.sh" via the following command:

/home/user/work/rte/PSP_06_03_2024/ps_tools/metabinner/MetaBinner/run_metabinner.sh -a /home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa -o /home/rte/PSP_06_03_2024/results_sample_19/metabinner_results/bins -d /home/rte/PSP_06_03_2024/results_sample_19/metabinner_results/coverages/coverage_profile_f200.tsv -k /home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_kmer_4_f200.csv -p /home/user/work/rte/PSP_06_03_2024/ps_tools/metabinner/MetaBinner -t 8

I get the following error message:

2024-04-01 05:53:38,578 - Input arguments:
2024-04-01 05:53:38,578 - Contig_file:	/home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa
2024-04-01 05:53:38,578 - Coverage_profiles:	/home/rte/PSP_06_03_2024/results_sample_19/metabinner_results/coverages/coverage_profile_f200.tsv
2024-04-01 05:53:38,578 - Composition_profiles:	/home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_kmer_4_f200.csv
2024-04-01 05:53:38,579 - Output file path:	/home/rte/PSP_06_03_2024/results_sample_19/metabinner_results/bins/metabinner_res/result.tsv
2024-04-01 05:53:38,579 - Predefined Clusters:	Auto
2024-04-01 05:53:38,579 - The number of threads:	8
2024-04-01 05:53:39,027 - The number of contigs:	8831
2024-04-01 05:53:39,027 - gen bacar marker seed
2024-04-01 05:53:39,028 - exec cmd: run_FragGeneScan.pl -genome=/home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa -out=/home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa.frag -complete=0 -train=complete -thread=8 1>/home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa.frag.out 2>/home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa.frag.err
2024-04-01 05:54:03,153 - exec cmd: hmmsearch --domtblout /home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa.bacar_marker.hmmout --cut_tc --cpu 8 /home/rte/PSP_06_03_2024/ps_tools/metabinner/MetaBinner/scripts/../auxiliary/bacar_marker.hmm /home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa.frag.faa 1>/home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa.bacar_marker.hmmout.out 2>/home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa.bacar_marker.hmmout.err
2024-04-01 05:54:03,185 - Hmmsearch failed! Not exist: /home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa.bacar_marker.hmmout

real	0m26.021s
user	0m38.726s
sys	0m23.824s
cp: cannot stat '/home/rte/PSP_06_03_2024/results_sample_19/metabinner_results/bins/metabinner_res/intermediate_result/kmeans_length_weight_X_t_logtrans_result.tsv': No such file or directory
Processing file:	/home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa
Reading Map:	/home/rte/PSP_06_03_2024/results_sample_19/metabinner_results/bins/metabinner_res/unitem_profile/kmeans_length_weight_X_t_logtrans_result.tsv
Traceback (most recent call last):
  File "/home/user/work/rte/PSP_06_03_2024/ps_tools/metabinner/MetaBinner/scripts/gen_bins_from_tsv.py", line 74, in <module>
    main(args.f, args.r, args.o)
  File "/home/user/work/rte/PSP_06_03_2024/ps_tools/metabinner/MetaBinner/scripts/gen_bins_from_tsv.py", line 40, in main
    with open(resultfile, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/rte/PSP_06_03_2024/results_sample_19/metabinner_results/bins/metabinner_res/unitem_profile/kmeans_length_weight_X_t_logtrans_result.tsv'
Input directory does not exists: /home/rte/PSP_06_03_2024/results_sample_19/metabinner_results/bins/metabinner_res/unitem_profile/kmeans_length_weight_X_t_logtrans_result.tsv_bins

['X_t_logtrans_ori', '/home/rte/PSP_06_03_2024/results_sample_19/metabinner_results/bins/metabinner_res/unitem_profile/kmeans_length_weight_X_t_logtrans_result.tsv_bins']
Something went wrong with running unitem_profile.py. Please check CheckM installation. Exiting.

I have traced the error back to the "run_FragGeneScan.pl" step, which outputs an empty "contigs_formated_200.fa.frag.faa" file; as a result, no "contigs_formated_200.fa.bacar_marker.hmmout" file is created, and hmmsearch fails. FragGeneScan also writes a "contigs_formated_200.fa.frag.err" file that contains the following:

awk: cmd. line:1: fatal: cannot open file `/home/rte/PSP_06_03_2024/results_sample_19/contigs/contigs_formated_200.fa.frag.out' for reading (No such file or directory)
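
To narrow this down, I re-ran the logged command by hand with the same flags (paths shortened here for readability; the last line assumes run_FragGeneScan.pl dispatches to a compiled FragGeneScan binary that must be on PATH):

run_FragGeneScan.pl -genome=contigs_formated_200.fa -out=contigs_formated_200.fa.frag \
    -complete=0 -train=complete -thread=1
ls -l contigs_formated_200.fa.frag*    # .frag.faa should not be empty
which FragGeneScan                     # sanity check: binary on PATH?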
  1. Do you have any idea what is going on?
  2. In addition, can MetaBinner be installed and run properly in a conda environment with Python versions other than 3.7.6?
  3. Would it be OK if I installed MetaBinner first in the environment and then Python 3.7.6?

Sincerely,
Georgios Filis

Input contains NaN, infinity or a value too large for dtype('float64') in Component Binning Step

Hi Ziye,

I'm unable to figure out where the NaN values are being generated. The coverage and kmer frequency files look fine, with no fully empty rows or columns. The run fails with the following error:

2024-02-12 17:39:30,960 - start estimate_bin_number
Traceback (most recent call last):
  File "./component_binning.py", line 476, in <module>
    bin_number = estimate_bin_number(X_t, candK, dataset_scale=dataset_scale, len_weight=length_weight)
  File "./component_binning.py", line 162, in estimate_bin_number
    kmeans.fit(X_mat, sample_weight=len_weight)
  File "3_miniconda3/envs/metabinner/lib/python3.7/site-packages/sklearn/cluster/_kmeans.py", line 859, in fit
    order=order, copy=self.copy_x)
  File "3_miniconda3/envs/metabinner/lib/python3.7/site-packages/sklearn/utils/validation.py", line 578, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "3_miniconda3/envs/metabinner/lib/python3.7/site-packages/sklearn/utils/validation.py", line 60, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
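
For reference, these are the kinds of checks I ran on the two input files (file names as in the README; both looked clean on my data). Note that the grep can also match contig names that happen to contain "inf", so hits need a manual look:

# literal NaN/Inf tokens anywhere in the profiles
grep -n -i -E 'nan|inf' coverage_profile_f1k.tsv kmer_4_f1000.csv

# contigs with zero coverage in every sample (header row and name column
# skipped); such rows could plausibly become -inf after a log transform
awk -F'\t' 'NR > 1 { s = 0; for (i = 2; i <= NF; i++) s += $i;
                     if (s == 0) print $1 }' coverage_profile_f1k.tsv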

missing minimap2, samtools, bedtools in yaml

I tried to install MetaBinner, but it did not work because of several missing dependencies, including minimap2, samtools, and bedtools. In addition, the paths used in score_reads are hard-coded, which causes issues when running the script.
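
Until the yaml is updated, a workaround for the first problem is to install the missing tools into the environment by hand (all three are available on bioconda):

conda activate metabinner_env
conda install -c bioconda minimap2 samtools bedtools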
