
amber's People

Contributors

abremges, alphasquad, fernandomeyer, graingert, p-hofmann, pbelmann

amber's Issues

Two bugs in v2.0.17-beta

Hello @fernandomeyer,
I have encountered two bugs when using AMBER v2.0.17-beta.

  1. index.html
    [screenshot]
    The worst/medium/best example images fail to load.

  2. purity_completeness_seq.png
    [screenshot]
    purity_completeness_seq.png is blank.

Error creating HTML page

Hello,

I am encountering an error (pasted below) while running AMBER for taxonomic binning on the CAMI medium complexity dataset.

2020-09-08 19:10:25,561 INFO Loading NCBI files
2020-09-08 19:10:43,554 INFO Loading Gold standard
2020-09-08 19:10:43,599 INFO Loading predictions_10
2020-09-08 19:10:43,616 INFO Creating output directories
2020-09-08 19:10:43,696 INFO Evaluating Gold standard (sample gs, taxonomic binning)
2020-09-08 19:11:05,368 INFO Evaluating predictions_10 (sample gs, taxonomic binning)
/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/src/binning_classes.py:306: RuntimeWarning: invalid value encountered in double_scalars
  (utils_labels.F1_SCORE_BP, [2 * self.__precision_avg_bp * self.__recall_avg_bp / (self.__precision_avg_bp + self.__recall_avg_bp)]),
/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/src/binning_classes.py:313: RuntimeWarning: invalid value encountered in double_scalars
  (utils_labels.F1_SCORE_SEQ, [2 * self.__precision_avg_seq * self.__recall_avg_seq / (self.__precision_avg_seq + self.__recall_avg_seq)]),
/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/src/binning_classes.py:319: RuntimeWarning: invalid value encountered in double_scalars
  (utils_labels.F1_SCORE_PER_BP, [2 * self.__precision_weighted_bp * self.__recall_weighted_bp / (self.__precision_weighted_bp + self.__recall_weighted_bp)]),
/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/src/binning_classes.py:320: RuntimeWarning: invalid value encountered in double_scalars
  (utils_labels.F1_SCORE_PER_SEQ, [2 * self.__precision_weighted_seq * self.__recall_weighted_seq / (self.__precision_weighted_seq + self.__recall_weighted_seq)]),
2020-09-08 19:11:22,665 INFO Saving computed metrics
2020-09-08 19:11:22,872 INFO Creating taxonomic binning plots
/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/src/plots.py:343: UserWarning: FixedFormatter should only be used together with FixedLocator
  axs.set_xticklabels(['{:3.0f}'.format(x * 100) for x in vals], fontsize=11)
...
(The warning above is repeated many times.)

2020-09-08 19:11:46,422 INFO Creating HTML page
Traceback (most recent call last):
  File "/cbio/donnees/rmenegaux/miniconda3/envs/amber/bin/amber.py", line 302, in <module>
    main()
  File "/cbio/donnees/rmenegaux/miniconda3/envs/amber/bin/amber.py", line 297, in main
    args.desc)
  File "/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/src/amber_html.py", line 848, in create_html
    metrics_row_t = create_taxonomic_binning_html(df_summary, pd_bins[pd_bins['rank'] != 'NA'], labels, sample_ids_list, options)
  File "/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/src/amber_html.py", line 777, in create_taxonomic_binning_html
    rank_to_sample_to_html[rank].append(create_table_html(pd_mean_rank.T, is_taxonomic=True))
  File "/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/src/amber_html.py", line 450, in create_table_html
    html += df_metrics.style.apply(get_heatmap_colors, df_metrics=df_metrics, axis=1).set_precision(3).set_table_styles(this_style).render()
  File "/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/pandas/io/formats/style.py", line 540, in render
    self._compute()
...
  File "/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/pandas/core/frame.py", line 467, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 283, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 78, in arrays_to_mgr
    index = extract_index(arrays)
  File "/cbio/donnees/rmenegaux/miniconda3/envs/amber/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 397, in extract_index
    raise ValueError("arrays must all be same length")
ValueError: arrays must all be same length

The command that produces this error is the following:

amber.py predictions_10.binning --gold_standard_file ground_truth_10.binning --ncbi_nodes_file nodes.dmp --ncbi_names_file names.dmp --ncbi_merged_file merged.dmp --filter 1 --output_dir output_filter_1

The NCBI dump files are freshly downloaded, and the toy ground truth and prediction files are:

$ cat predictions_10.binning
@Version:0.9.1
@SampleID:gs

@@SEQUENCEID	TAXID
RM2|S1|R0	222805
RM2|S1|R1	187303
RM2|S1|R2	1525
RM2|S1|R3	146919
RM2|S1|R4	1488
RM2|S1|R5	305
$ cat ground_truth_10.binning
@Version:0.9.1
@SampleID:gs

@@SEQUENCEID	BINID	TAXID	_READID	_LENGTH
RM2|S1|R0	1030896	1123266	scaffold00002_27-953956	100
RM2|S1|R1	1220_BD	169973	scaffold9.1_8-4249	100
RM2|S1|R2	1036704	1123003	scaffold00002_48-138142	100
RM2|S1|R3	1285_CK	460257	scaffold2.1_10-583737	100
RM2|S1|R4	evo_1035921.028	745369	contig_5_4-113862	100
RM2|S1|R5	1139_Y	169973	scaffold15.1_21-8412	100

PS: This error does not occur systematically; AMBER ran successfully on some prediction files.
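
For context, the four RuntimeWarnings in the log above come from the F1-score lines of binning_classes.py, where the harmonic mean of average precision and recall evaluates 0/0 whenever both are zero. A minimal reproduction, assuming NumPy float scalars as in that code:

import numpy as np

# When precision and recall are both exactly 0.0, the harmonic mean
# divides 0 by 0; NumPy scalars return NaN with a RuntimeWarning
# instead of raising ZeroDivisionError.
precision_avg = np.float64(0.0)
recall_avg = np.float64(0.0)
f1 = 2 * precision_avg * recall_avg / (precision_avg + recall_avg)
print(f1)  # nan  (RuntimeWarning: invalid value encountered in double_scalars)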

Issue when running add_length_column.py

I am trying to display the script's help with python3 add_length_column.py -h, and I am getting the following output:

Traceback (most recent call last):
  File "/home/users/pnovikova/binning-refinement/scripts/add_length_column.py", line 26, in <module>
    import argparse_parents
ModuleNotFoundError: No module named 'argparse_parents'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/users/pnovikova/binning-refinement/scripts/add_length_column.py", line 30, in <module>
    import argparse_parents
ModuleNotFoundError: No module named 'argparse_parents'

The argparse module is installed, and as far as I can tell from searching, there is no package named argparse_parents. What am I doing wrong?

Thanks!
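
For what it's worth, the site-packages paths in other AMBER tracebacks suggest argparse_parents is a module inside AMBER's own source tree (under src/utils) rather than a PyPI package. A hedged guess at a workaround, with a hypothetical checkout path:

import sys

# Assumption: argparse_parents lives in AMBER's src/utils directory, as the
# paths in other tracebacks suggest; it is not installable from PyPI. The
# path below stands in for a local AMBER checkout.
sys.path.insert(0, "/path/to/AMBER/src/utils")
import argparse_parents  # should resolve once AMBER's sources are on sys.path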

AMBER in Dockerhub

AMBER should be registered on Docker Hub, and the README should be updated accordingly. That way, users do not have to build the image on their own systems but can use it directly. The correctness of the build should also be tested on CircleCI.

Updates for AMBER requirements

Dear developer,
While setting up AMBER via pip, I found that a few more packages with specific versions need to be installed:

  1. scipy=1.8.0 (to be compatible with NumPy 1.18.3)
  2. jinja2=2.10.1
  3. MarkupSafe=2.0.1
  4. Flask=1.0.3

It would be useful to include these in the requirements file for future users.

Thank you.

Best,
Yazhini

Remove warning when building html plots

When building the HTML plots, the following warning is thrown.

Example:

"/home/fmeyer/.local/lib/python3.5/site-packages/bokeh/util/deprecation.py:34:
BokehDeprecationWarning:
Supplying a user-defined data source AND iterable values to glyph
methods is deprecated.

See https://github.com/bokeh/bokeh/issues/2056 for more information.

   warn(message)"

Question about the output

Hi,
I have tried AMBER, but I am confused by the output: there are both Average completeness (bp) and CAMI 1 average completeness (bp). What is the difference between them? Thanks for your suggestions!

gold standard mapping column BINID vs binning assignment file

Hi,

I am trying to evaluate bin quality with AMBER. Unfortunately, I am a bit confused about the naming of the columns in the gold standard mapping file and the binning assignment file.
My gold standard mapping file looks like this:
[screenshot]

Whereas the binning assignment looks like this:
[screenshot]

I am confused about how BINID can refer to different objects in each file. I also tried using the gold standard assignment mapping file with the TAXID column.

When AMBER is executed with the following command:
amber.py -g gold_standard_file.tsv binning_file.tsv -o output_dir
the BINID and genome_id columns show the same value:
[screenshot]

Thank you so much for your help,
Pau

Error creating genome binning plots

Hi,
I get the following error when running AMBER on some binnings of the CAMI medium complexity toy dataset:

2020-02-07 10:00:58,977 INFO done
2020-02-07 10:00:58,979 INFO Computing metrics for Gold standard - genome binning, CAMI_toy_medium...
2020-02-07 10:01:01,114 INFO done
2020-02-07 10:01:01,118 INFO Computing metrics for CONCOCT - genome binning, CAMI_toy_medium...
2020-02-07 10:01:01,628 INFO done
2020-02-07 10:01:01,628 INFO Computing metrics for MaxBin2 - genome binning, CAMI_toy_medium...
2020-02-07 10:01:02,127 INFO done
2020-02-07 10:01:02,128 INFO Computing metrics for MetaBAT2 - genome binning, CAMI_toy_medium...
2020-02-07 10:01:02,619 INFO done
2020-02-07 10:01:02,619 INFO Computing metrics for MetaWrap - genome binning, CAMI_toy_medium...
2020-02-07 10:01:02,950 INFO done
2020-02-07 10:01:02,950 INFO Computing metrics for MetaWrap_ra - genome binning, CAMI_toy_medium...
2020-02-07 10:01:03,245 INFO done
2020-02-07 10:01:03,245 INFO Computing metrics for MetaWrap_qc - genome binning, CAMI_toy_medium...
2020-02-07 10:01:03,575 INFO done
2020-02-07 10:01:03,575 INFO Computing metrics for DAS_tool - genome binning, CAMI_toy_medium...
2020-02-07 10:01:03,900 INFO done
2020-02-07 10:01:03,902 INFO Saving computed metrics...
2020-02-07 10:01:03,980 INFO done
2020-02-07 10:01:03,981 INFO Creating genome binning plots...
Traceback (most recent call last):
  File "../../tools/AMBER/amber.py", line 412, in <module>
    main()
  File "../../tools/AMBER/amber.py", line 398, in main
    plot_genome_binning(sample_id_to_queries_list, df_summary, pd_bins, args.plot_heatmaps, output_dir)
  File "../../tools/AMBER/amber.py", line 269, in plot_genome_binning
    plots.plot_avg_precision_recall(df_summary_g, output_dir)
  File "/mnt/lscratch/users/ohickl/binning/tools/AMBER/src/plots.py", line 262, in plot_avg_precision_recall
    'Average completeness per genome [%]')
  File "/mnt/lscratch/users/ohickl/binning/tools/AMBER/src/plots.py", line 240, in plot_summary
    plt.tight_layout()
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/pyplot.py", line 1352, in tight_layout
    fig.tight_layout(pad=pad, h_pad=h_pad, w_pad=w_pad, rect=rect)
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/figure.py", line 2307, in tight_layout
    pad=pad, h_pad=h_pad, w_pad=w_pad, rect=rect)
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/tight_layout.py", line 349, in get_tight_layout_figure
    pad=pad, h_pad=h_pad, w_pad=w_pad)
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/tight_layout.py", line 114, in auto_adjust_subplotpars
    tight_bbox_raw = union([ax.get_tightbbox(renderer) for ax in subplots
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/tight_layout.py", line 115, in <listcomp>
    if ax.get_visible()])
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/axes/_base.py", line 4198, in get_tightbbox
    bb_xaxis = self.xaxis.get_tightbbox(renderer)
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/axis.py", line 1145, in get_tightbbox
    ticks_to_draw = self._update_ticks(renderer)
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/axis.py", line 1028, in _update_ticks
    tick_tups = list(self.iter_ticks())  # iter_ticks calls the locator
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/axis.py", line 978, in iter_ticks
    minorTicks = self.get_minor_ticks(len(minorLocs))
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/axis.py", line 1415, in get_minor_ticks
    tick = self._get_tick(major=False)
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/axis.py", line 1792, in _get_tick
    return XTick(self.axes, 0, '', major=major, **tick_kw)
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/axis.py", line 178, in __init__
    self.gridline = self._get_gridline()
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/axis.py", line 503, in _get_gridline
    **self._grid_kw)
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/lines.py", line 391, in __init__
    self.set_linestyle(linestyle)
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/lines.py", line 1125, in set_linestyle
    self._us_dashOffset, self._us_dashSeq, self._linewidth)
  File "/home/users/ohickl/anaconda3/envs/amber/lib/python3.7/site-packages/matplotlib/lines.py", line 68, in _scale_dashes
    scaled_offset = offset * lw
TypeError: can't multiply sequence by non-int of type 'float'

Could this be a Python 2 to Python 3 problem?

Best

Oskar

Error creating html - character encoding

Hi, AMBER runs until the HTML phase and creates the expected outputs. In the HTML phase I get the following output:

2020-11-04 19:31:20,810 INFO Creating HTML page
Traceback (most recent call last):
  File "cami-env/bin/amber.py", line 302, in <module>
    main()
  File "cami-env/bin/amber.py", line 297, in main
    args.desc)
  File "/<path>/cami-env/lib/python3.7/site-packages/src/amber_html.py", line 872, in create_html
    f.write(html)
UnicodeEncodeError: 'latin-1' codec can't encode character '\ufffd' in position 775969: ordinal not in range(256)

I am running in a Python 3.7 virtual environment and installed AMBER using pip3 per the instructions on GitHub.

Do you know what causes this error?
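
One plausible reading of the traceback (an assumption, not a confirmed cause): open() without an explicit encoding falls back to the locale codec, here latin-1, which cannot represent the replacement character U+FFFD. A minimal sketch of the failing pattern and the usual fix:

# Writing a string containing '\ufffd' through a latin-1 default codec
# reproduces the UnicodeEncodeError; an explicit encoding avoids it.
html = "<html>\ufffd</html>"  # toy stand-in for the generated report
with open("index.html", "w", encoding="utf-8") as f:
    f.write(html)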

Bug that makes the genome_coverage option unusable

Hello,

After the version update, the genome_coverage option is not usable, since the new code does not work with it. I opened PR #56 with a possible fix; I hope that was okay to do. Could you also test it on your side? I did not see a difference in the output compared with version 2.0.4. After the review, would it be possible to release a new version?

Thank you!

Filter contigs by length

First of all, great work with AMBER!

I ran AMBER on the mouse gut toy dataset, which contains many very small contigs in the GSA.
Some bins exclusively contain small contigs and are not recoverable by common genome binners.

It is already possible to manually exclude a set of genomes; I propose a complementary feature (sketched below):
filter contigs by size (threshold set by the user, default e.g. 2.5 kb) and exclude them from all analyses.
This will also remove some gold standard bins completely (if they contain no longer contigs).
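
A minimal sketch of the proposed filter; the data layout and threshold are illustrative, not AMBER's internals:

MIN_CONTIG_LEN = 2500  # proposed user-settable default of 2.5 kb

def filter_short_contigs(bins, contig_lengths, min_len=MIN_CONTIG_LEN):
    """bins: bin id -> set of contig ids; contig_lengths: contig id -> bp."""
    kept = {bin_id: {c for c in contigs if contig_lengths[c] >= min_len}
            for bin_id, contigs in bins.items()}
    # Bins consisting only of short contigs disappear from the evaluation.
    return {bin_id: contigs for bin_id, contigs in kept.items() if contigs}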

A question about AMBER 2.0.21-beta

I installed AMBER 2.0.21-beta and a team member installed another version. We found that the values of Completeness (bp) and Purity (bp) in AMBER 2.0.21-beta are the same as avg_completeness_per_bp and avg_purity_per_bp in the other version. Have the names of the metrics changed from those used in the paper? And what is the difference between Average purity (bp) in AMBER 2.0.21-beta and avg_purity_per_bp in the older version? Thanks.

Mixed dtype for BINID can cause bins with the same ID to be split into separate bins

If BINID is a mixture of strings and ints, I have noticed that individual int values can be imported as both strings and ints, essentially splitting a single bin into two.

I believe it is an issue with how pandas imports the dataframes, and it may only happen with large files. The issue appears to be resolved by using either all-int or all-string values for BINID.
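
A sketch of the suspected mechanism and a workaround; the read call is illustrative, not AMBER's actual loader:

import io
import pandas as pd

# With large files pandas may infer dtypes per chunk, so an int-looking
# BINID such as 1030896 can be read as both int and str, splitting one bin
# into two. dtype={'BINID': str} makes the column uniformly string-typed.
data = io.StringIO("SEQUENCEID\tBINID\nseq1\t1030896\nseq2\t1220_BD\n")
df = pd.read_csv(data, sep="\t", dtype={"BINID": str})
print(df["BINID"].map(type).unique())  # only <class 'str'>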

Does AMBER only consider bins with completeness > 0?

Hi,
I have a sample of which I only want to bin a fraction. For example, I have 10 bins in the gold standard, and the sample contains many other sequences that I do not want to put into bins. A binning tool creates 20 bins, of which 15 match 5 gold standard bins, i.e. 5 of the binner's bins do not match any gold standard bin. Is there any way for AMBER to handle a situation like this?

(I can send you more details and example data by email if you wish - mine is dpellow post.tau.ac.il)

Old file archives are not correct

Hello,

Sorry for writing the issue here, since I did not find an email to contact any of the CAMI staff. I am currently working on my bachelor thesis, which includes building a workflow on the web server https://usegalaxy.eu/, which hosts many tools from the bioinformatics field. Since AMBER is up there now, I need some benchmarks to test the workflow, and I discovered that you provide the old archives such as CAMI low or mouse gut toy. I have worked with the CAMI low and mouse gut toy archives, but I also want to test the high and medium archives, and here is the problem: I downloaded both tarballs [from http://gigadb.org/dataset/100344] and unzipped them, only to get the samples without any other files, while the gsa and binning files should also be there; they are missing from both tarballs. Is it possible to fix this, or is there another source that contains the correct tarballs?

This would be a great help; thank you in advance, and again, I am sorry if this is the wrong place for this topic!

ImportError: cannot import name 'Markup' from 'jinja2'

Hi there, I ran AMBER a few days ago. Since then I updated some packages to run CAMISIM, because some of them were incompatible, and now I cannot get AMBER to run anymore.

I get the following error message:

Traceback (most recent call last):
  File "/Users/eparisis/miniconda3/envs/amber/bin/amber.py", line 26, in <module>
    from src import amber_html
  File "/Users/eparisis/miniconda3/envs/amber/lib/python3.7/site-packages/src/amber_html.py", line 41, in <module>
    from bokeh.plotting import figure
  File "/Users/eparisis/miniconda3/envs/amber/lib/python3.7/site-packages/bokeh/plotting/__init__.py", line 2, in <module>
    from ..document import Document; Document
  File "/Users/eparisis/miniconda3/envs/amber/lib/python3.7/site-packages/bokeh/document/__init__.py", line 7, in <module>
    from .document import Document ; Document
  File "/Users/eparisis/miniconda3/envs/amber/lib/python3.7/site-packages/bokeh/document/document.py", line 35, in <module>
    from ..core.templates import FILE
  File "/Users/eparisis/miniconda3/envs/amber/lib/python3.7/site-packages/bokeh/core/templates.py", line 20, in <module>
    from jinja2 import Environment, Markup, FileSystemLoader, PackageLoader
ImportError: cannot import name 'Markup' from 'jinja2' (/Users/eparisis/miniconda3/envs/amber/lib/python3.7/site-packages/jinja2/__init__.py)

I get this same error message running it on Linux and on my local Mac. Installing it from scratch in a new conda env also produces this error.

I've looked up some fixes for jinja2 but nothing worked.
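
For background (a known jinja2 change, offered as context rather than a confirmed AMBER fix): Markup was deprecated in jinja2 3.0 and removed in 3.1, where it lives in markupsafe. Older bokeh releases still import it from jinja2's top level, so pinning jinja2 below 3.1, or upgrading bokeh, typically resolves this import:

# Markup moved: jinja2 >= 3.1 no longer re-exports it, markupsafe does.
try:
    from jinja2 import Markup      # works on jinja2 < 3.1 (what bokeh does)
except ImportError:
    from markupsafe import Markup  # its home on newer installs
print(Markup("<b>escaped?</b>"))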

Trying to build the Docker image as in the instructions also didn't work:

$ docker build -t amber:latest .
Sending build context to Docker daemon   74.1MB
Step 1/10 : FROM python:3.7-slim
3.7-slim: Pulling from library/python
3f4ca61aafcd: Pull complete 
3f487a3359db: Pull complete 
e87858cc8912: Pull complete 
471900aadde7: Pull complete 
37bdaa58825f: Pull complete 
Digest: sha256:62209b7fcd75e157220c682de6c81e737a3d36a06ce05f449757c7b9ef271f99
Status: Downloaded newer image for python:3.7-slim
 ---> 74e5f3c48333
Step 2/10 : ADD image /usr/local
 ---> efb47449e5cb
Step 3/10 : ADD *.py /usr/local/bin/
 ---> ce875502bb87
Step 4/10 : ADD src /usr/local/bin/src
 ---> c663e9a0e4ef
Step 5/10 : ADD src/utils /usr/local/bin/src/utils
 ---> 2b80cb51a18f
Step 6/10 : ADD requirements /requirements
failed to export image: failed to create image: failed to get layer sha256:3da0f9e1caa5774c47974ea1948ca723ac5a3fad7bebeaeb513002f3ca3cabc4: layer does not exist

The package versions in requirements.txt are all installed, so I don't know what's wrong.

Evaluating Gold standard encountered RuntimeWarning: overflow

Hello,
Running AMBER on my dataset produced a numerical overflow:

Evaluating Gold standard (sample marine, genome binning)
~/.local/lib/python3.8/site-packages/src/binning_classes.py:306: RuntimeWarning: overflow encountered in long_scalars
  return (n * (n - 1)) / 2.0

This comes from the function compute_rand_index in binning_classes.py. What could be the reason for this warning?

Thanks.

Best,
Yazhini
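
A minimal reproduction, assuming n is a NumPy 64-bit integer as it typically is after a pandas aggregation: n * (n - 1) exceeds the int64 range once n passes roughly 3e9, which base-pair counts easily can:

import numpy as np

# np.int64 arithmetic wraps around with the overflow RuntimeWarning seen
# in the log; converting to Python's arbitrary-precision int (or to float)
# first keeps (n * (n - 1)) / 2.0 exact.
n = np.int64(4_000_000_000)           # e.g. base pairs in a large sample
print((n * (n - 1)) / 2.0)            # overflows, result is wrong
print((int(n) * (int(n) - 1)) / 2.0)  # 7.999999998e+18, correct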

Output metrics: bp vs. seq

Hi,
could you clarify the difference between the AMBER 2.x seq- and bp-based metrics? As you explained in #36, Average purity (bp) stems from equation 6 in the original paper. Is this then the Average purity (bp) under "Quality of bins: all bins have the same weight" in the index.html output? Does the seq-based metric mean that the value is based on the number of contigs from the most abundant gold standard genome in each bin? Does that value deviate from the bp metric because of differing contig lengths, so that a large contig of the "correct" genome increases the bp purity but not the seq purity? If so, why should we care about the seq metric? Wouldn't the bp-based measure always be more informative in showing how close bins are to the gold standard?

Best
Oskar
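
A toy illustration of how the two metrics can diverge, using the assumed definitions (true positives over bin size, counted in base pairs versus in number of sequences): one long contig from the correct genome dominates the bp purity while counting as a single sequence:

# (label, length in bp) for contigs in one bin; 'correct' marks contigs
# from the bin's most abundant gold standard genome.
contigs = [("correct", 20_000), ("wrong", 1_000),
           ("wrong", 1_000), ("wrong", 1_000)]
tp_bp = sum(length for label, length in contigs if label == "correct")
purity_bp = tp_bp / sum(length for _, length in contigs)               # ~0.87
purity_seq = sum(label == "correct" for label, _ in contigs) / len(contigs)  # 0.25
print(purity_bp, purity_seq)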

Improve error message(s)

I ran AMBER using a corrupt gold standard file, where _LENGTH was in the header but the data rows contained only 2 fields. Example:

@Version:0.9.1
@SampleID:gsa

@@SEQUENCEID    BINID    _LENGTH
A    42
B    13

I got a plain IndexError: list index out of range message. I therefore suggest checking that the data rows indeed contain either 2 or 3 fields and printing a sensible error message; a sketch follows below.


For bonus points, I suggest checking for more such edge cases, where better reporting would help the user find the mistake and ultimately give a better user experience.

P.S. Eventually I figured out my mistake (a faulty regex had removed the BINID field) and AMBER worked like a charm! 👍
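
A sketch of the suggested check; the function name and message are hypothetical, not AMBER's parser:

def parse_data_row(line, n_header_fields, line_no):
    """Split a tab-separated data row, failing loudly on a field mismatch."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != n_header_fields:
        raise ValueError(
            f"line {line_no}: expected {n_header_fields} fields as declared "
            f"by the @@ header, found {len(fields)}: {fields!r}"
        )
    return fields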

CHANGELOG

We should maintain a CHANGELOG.md that always records the changes made to the code. By reading this file, a user can always see which bugs were fixed and which features were added.

VERSION

AMBER should be versioned following the semantic versioning conventions: http://semver.org/
We should create a setup.cfg that states the current release/version.
This information could also be parsed by various scripts and included in a PDF/HTML/PNG, etc. AMBER's output should always state the release that produced it.
The config is also necessary for issue #5.

"nan" values in Purity (bp) and Purity (seq)

Hi,

I'm using AMBER to evaluate a set of bins I obtained from a metagenome assembled from the CAMI Toy Mouse Gut Dataset reads. I've noticed that some bins have nan values in the Purity (bp) and Purity (seq) columns. What might be causing that?

To build the gold standard, I aligned the reassembled contigs to the original genomes using BLAST, as described in Vamb's paper:

We removed any hits shorter than 500 bp or with lower nucleotide identity than 95%. If a query (reassembled) contig was aligned to multiple reference (original) contigs, we accepted the reference with the longest alignment, if the alignment was more than twice as long of the next longest. If that was not the case for any reference, we accepted the reference with highest nucleotide identity, if the reference was longer than 10 kbp, had an alignment length of at least 90% of the longest-aligning reference, and had at least 0.05% higher nucleotide identity than the second-highest identity reference. If no reference fit those criteria, they were ignored in the benchmarking.

| Bin ID | Most abundant genome | Purity (bp) | Completeness (bp) | Bin size (bp) | True positives (bp) | True size of most abundant genome (bp) | Purity (seq) | Completeness (seq) | Bin size (seq) | True positives (seq) | True size of most abundant genome (seq) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mouse_gut_5.vamb.83 | denovo8255.1 | 0.659 | 0.983 | 741631 | 488937 | 497562 | 0.655 | 0.980 | 226 | 148 | 151 |
| mouse_gut_5.vamb.9 | 269125.1 | 0.998 | 0.977 | 1809513 | 1806528 | 1848507 | 0.994 | 0.977 | 172 | 171 | 175 |
| mouse_gut_5.vamb.81 | 179513.0 | 1.000 | 0.963 | 807379 | 807379 | 838686 | 1.000 | 0.962 | 280 | 280 | 291 |
| mouse_gut_5.vamb.126 | 228785.0 | 1.000 | 0.960 | 308660 | 308660 | 321520 | 1.000 | 0.954 | 103 | 103 | 108 |
| mouse_gut_5.vamb.72 | 661259.1 | 1.000 | 0.940 | 241230 | 241230 | 256496 | 1.000 | 0.944 | 85 | 85 | 90 |
| mouse_gut_5.vamb.182 | 259993.0 | 1.000 | 0.922 | 647949 | 647949 | 702765 | 1.000 | 0.917 | 211 | 211 | 230 |
| mouse_gut_5.vamb.454 | denovo11208.0 | 0.525 | 0.919 | 1065913 | 559173 | 608144 | 0.531 | 0.895 | 305 | 162 | 181 |
| mouse_gut_5.vamb.11 | 133719.0 | 0.760 | 0.915 | 534069 | 405959 | 443893 | 0.748 | 0.888 | 127 | 95 | 107 |
| mouse_gut_5.vamb.111 | denovo11993.0 | 1.000 | 0.875 | 636280 | 636280 | 727174 | 1.000 | 0.882 | 194 | 194 | 220 |
| mouse_gut_5.vamb.793 | 4471135.0 | 0.992 | 0.863 | 1913155 | 1898009 | 2200037 | 0.990 | 0.845 | 583 | 577 | 683 |
| mouse_gut_5.vamb.333 | denovo12532.0 | nan | 0.857 | 202574 | 199887 | 233197 | nan | 0.852 | 76 | 75 | 88 |
| mouse_gut_5.vamb.51 | denovo2465.0 | nan | 0.848 | 218613 | 218613 | 257907 | nan | 0.863 | 82 | 82 | 95 |
| mouse_gut_5.vamb.115 | denovo1032.0 | nan | 0.816 | 44891 | 24716 | 30297 | nan | 0.750 | 11 | 6 | 8 |
| mouse_gut_5.vamb.71 | denovo10679.0 | 0.333 | 0.787 | 654629 | 218217 | 277372 | 0.341 | 0.622 | 82 | 28 | 45 |
| mouse_gut_5.vamb.451 | denovo11206.0 | 0.893 | 0.761 | 1096925 | 979172 | 1287345 | 0.872 | 0.718 | 298 | 260 | 362 |
| mouse_gut_5.vamb.428 | denovo2609.0 | 0.998 | 0.733 | 1077877 | 1075782 | 1468385 | 0.997 | 0.717 | 325 | 324 | 452 |
| mouse_gut_5.vamb.1036 | 263992.0 | nan | 0.231 | 66518 | 41954 | 181693 | nan | 0.242 | 23 | 15 | 62 |

How does mapping work?

Dear developers,
I am trying to understand the evaluation method more deeply. How do you get (i) the fraction of base pairs of a genome covered by bins and (ii) the overlap in base pairs between a bin and a genome without mapping through sequence alignment? The input for the gold standard and the final bins has only contig names and lengths, but no coordinates. Your explanation is much appreciated.

Thanks.

Best,
Yazhini
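
A sketch of my understanding from the AMBER paper (not its code): no alignment is needed because the gold standard and the predicted bins name the same contigs, so the base-pair overlap between a bin and a genome is the summed length of the sequence IDs they share:

# genome -> {contig id: length}; a predicted bin uses the same contig ids.
gold_genome = {"c1": 1_000, "c2": 500}
predicted_bin = {"c1": 1_000, "c3": 200}

overlap_bp = sum(l for c, l in predicted_bin.items() if c in gold_genome)
completeness = overlap_bp / sum(gold_genome.values())  # fraction of genome covered
purity = overlap_bp / sum(predicted_bin.values())      # fraction of bin that is correct
print(completeness, purity)  # 0.666..., 0.833...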

Increase performance of the produced HTML

The HTML gets quite big and slow.
Its size can be reduced by recomputing the plots on demand; this way, we do not have to embed every plot when the HTML is built.
