biocat's Introduction

BioCAT: Biosynthesis Cluster Analysis Tool

BioCAT is a tool designed to search for potential producers of a given non-ribosomal peptide (NRP). It aligns the given NRP chemical structure against all possible biosynthesis gene clusters (BGCs) found in the given genome and returns a final alignment score ranging from 0 to 1. Alignments with a score higher than 0.5 can be considered successful matches.

Monomers are identified with the rBAN tool (https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0335-x). Nonribosomal peptide synthetase gene clusters are annotated with the antiSMASH 6 software (https://academic.oup.com/nar/article/49/W1/W29/6274535).

Dependencies:

antiSMASH 6
sklearn 0.24.2
pandas
biopython
numpy
rdkit
hmmer

rBAN is included in the BioCAT package.

Installation:

conda create -n biocat_env -c bioconda -c rdkit antismash=6 hmmer rdkit
conda activate biocat_env
pip install biocat

BioCAT usage:

The BioCAT package distributed via pip contains no test data. Several examples are available in the example folder of this repository.

Usage example:

For a minimal run, an NRP structure in SMILES format and a genome in FASTA format must be specified with the -smiles and -genome parameters, respectively. Optionally, you can specify the output directory using the -out parameter and the name of the given NRP using -name.

biocat -smiles '[H][C@@]1(CCC(O)=O)NC(=O)CC(CCCCCCCCCCC)OC(=O)[C@]([H])(CC(C)C)NC(=O)[C@@]([H])(CC(C)C)NC(=O)[C@]([H])(CC(O)=O)NC(=O)[C@@]([H])(NC(=O)[C@@]([H])(CC(C)C)NC(=O)[C@]([H])(CC(C)C)NC1=O)C(C)C' -name 'surfactin' -genome example/Surfactin/GCF_000015785.2_ASM1578v2_genomic.fna -out surfactin_results

In this case, rBAN processes the given chemical structure of surfactin and antiSMASH 6 processes the given genome. The resulting files are then used for the alignment. The main feature of BioCAT is a PSSM-based alignment algorithm that artificially shuffles the PSSMs to calculate the final score, so the alignment can be time-consuming in some cases, although it usually takes less than 1 minute.
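When BioCAT is called from a script, passing the arguments as a list avoids shell-quoting problems with the brackets and parentheses in SMILES strings. The following is a minimal sketch that uses only the -smiles, -genome, -name and -out parameters described above; the helper name build_biocat_command, the placeholder SMILES and the example names are assumptions made for illustration and are not part of BioCAT.

```python
import shlex


def build_biocat_command(smiles, genome, name=None, out=None):
    """Assemble the argument list for a minimal BioCAT run (hypothetical helper)."""
    cmd = ["biocat", "-smiles", smiles, "-genome", genome]
    if name is not None:
        cmd += ["-name", name]
    if out is not None:
        cmd += ["-out", out]
    return cmd


cmd = build_biocat_command(
    smiles="NCC(=O)O",  # placeholder (glycine); substitute the SMILES of your NRP
    genome="example/Surfactin/GCF_000015785.2_ASM1578v2_genomic.fna",
    name="example_nrp",      # hypothetical molecule name
    out="example_results",   # hypothetical output directory
)
print(shlex.join(cmd))  # shell-quoted form, for logging or copy-pasting

# To actually launch the run (requires BioCAT to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

The subprocess call is left commented out so that the sketch can be run without BioCAT installed.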

The output directory surfactin_results contains the rBAN and antiSMASH output files, the PSSMs generated for the BGCs that were aligned during the analysis, and the final file Results.tsv. This file contains detailed information about each possible NRP-to-BGC alignment, but in this minimal example we are interested only in the last two columns, Relative score and Binary. Rows with a relative score above 0.5 are interpreted as successful alignments, meaning the given organism can be considered a potential producer of surfactin.
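The filtering rule above can be sketched in a few lines of Python. This example assumes that Results.tsv is tab-separated with a header row containing a "Relative score" column; the helper name successful_alignments is hypothetical, and the in-memory sample stands in for a real file with made-up values.

```python
import csv
import io


def successful_alignments(tsv_file, threshold=0.5):
    """Return the rows whose 'Relative score' is above the threshold."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row for row in reader if float(row["Relative score"]) > threshold]


# Made-up stand-in for surfactin_results/Results.tsv, reduced to three columns.
sample = io.StringIO(
    "BGC ID\tRelative score\tBinary\n"
    "BGC_1\t0.91\t1\n"
    "BGC_2\t0.12\t0\n"
)

hits = successful_alignments(sample)
for row in hits:
    print(row["BGC ID"], row["Relative score"])
```

For a real run, the sample can be replaced with open("surfactin_results/Results.tsv", newline="").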

Parameters

usage: biocat [-h] [-antismash ANTISMASH] [-genome GENOME] [-name NAME]
              [-smiles SMILES] [-file_smiles FILE_SMILES] [-rBAN RBAN]
              [-NRPS_type NRPS_TYPE] [-iterations ITERATIONS] [-delta DELTA]
              [-threads THREADS] [-out OUT] [-skip SKIP] [--disable_pushing_type_B]
              [--disable_dif_strand] [--disable_exploration]

BioCAT is a tool that estimates the likelihood that a given organism is
capable of producing a given NRP

optional arguments:
  -h, --help            show this help message and exit

Genome arguments:
  -antismash ANTISMASH  antiSMASH *.json output file (either -antismash or
                        -genome parameter should be specified)
  -genome GENOME        Path to the fasta file with nucleotide sequence
                        (either -antismash or -genome parameter should be
                        specified)

Chemical arguments:
  -name NAME            Name of the given molecule (optional)
  -smiles SMILES        NRP chemical structure in the SMILES format (either
                        -smiles or -file_smiles parameter should be specified)
  -file_smiles FILE_SMILES
                        .smi file with one or more NRPs. Each row should
                        contain two columns: name of the NRP and SMILES
                        string. Columns should be separated by tabulation.
                        (Either -smiles or -file_smiles parameter should be
                        specified.)
  -rBAN RBAN            rBAN peptideGraph.json output file
  -NRPS_type NRPS_TYPE  Expected NRPS type (default A+B)

Technical arguments:
  -iterations ITERATIONS
                        Count of shuffling iterations (default 500)
  -delta DELTA          The maximum number of gaps in the molecule (default
                        3). Gaps are assigned as "nan".
  -threads THREADS      Number of threads (default 8)
  -out OUT              Output directory (default ./BioCAT_output)

Advanced arguments:
  -skip SKIP            Count of modules which can be skipped (default 0). Not
                        recommended to use unless the user is sure about
                        module skipping.
  --disable_pushing_type_B
                        By default, the algorithm tries to truncate peptide
                        fragments cutting edge monomers to find all possible
                        identical peptide fragments in the structure. If
                        disabled, only the identity of full peptide fragments
                        is considered.
  --disable_dif_strand  By default, the protoclusters predicted by antiSMASH
                        are subdivided according to the assumption that each
                        cluster should contain only genes located on the same
                        strand of the genome. If disabled, protoclusters
                        annotated by antiSMASH are used as minimal clusters.
  --disable_exploration
                        By default, the algorithm tries to find the optimal
                        alignment combining alignment options in all possible
                        ways. If disabled, alignment is performed strictly
                        according to the given options.
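The -file_smiles input described above (one NRP per row, name and SMILES separated by a tab) can be produced with a short script. This is a minimal sketch: the helper name write_smi, the file name molecules.smi and the two entries are assumptions for illustration, and the entries are simple amino acids used only to show the layout, not real NRPs.

```python
import os
import tempfile


def write_smi(path, molecules):
    """Write (name, SMILES) pairs as tab-separated rows, one molecule per line."""
    with open(path, "w") as handle:
        for name, smiles in molecules:
            handle.write(f"{name}\t{smiles}\n")


# Placeholder entries: simple amino acids, used only to show the layout.
molecules = [
    ("glycine", "NCC(=O)O"),
    ("alanine", "CC(N)C(=O)O"),
]

smi_path = os.path.join(tempfile.mkdtemp(), "molecules.smi")
write_smi(smi_path, molecules)

# Read the file back to confirm that each row has exactly two columns.
with open(smi_path) as handle:
    rows = [line.rstrip("\n").split("\t") for line in handle]
print(rows)
```

The resulting file can then be passed to BioCAT with -file_smiles in place of -smiles.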

Reference

Konanov, D. N., Krivonos, D. V., Ilina, E. N., & Babenko, V. V. (2022). BioCAT: search for biosynthetic gene clusters producing nonribosomal peptides with known structure. Computational and Structural Biotechnology Journal. https://doi.org/10.1016/j.csbj.2022.02.013

biocat's Issues

Interpreting results: Inspecting NRPS A-domain specificities.

Hi,

Thanks for this amazing tool. I think it's pretty useful for searching for hypothesized non-ribosomal peptides.

I am somewhat confused about how to interpret the results. The output has a column for 'putative linearized NRP sequence', which corresponds to the monomeric linearized version of the query peptide derived from rBAN. Since the genome-predicted NRP may not be an exact match to this linearized query, it might be useful to include a column showing what was used as the genome-predicted NRP for the alignment. Below is an example output for your reference:

Chromosome ID                      tig00000001
Coordinates of cluster             [1456645:1572935]
Strand                             -
Substance                          NPA022110
BGC ID                             BGC_cand_3_1
Putative linearized NRP sequence   orn--ser--ser--orn--nan
Biosynthesis profile               Type A
Sln score                          1
Mln score                          0.94
Sdn score                          1
Mdn score                          0.98
Sdt score                          1
Mdt score                          0.9
Slt score                          1
Mlt score                          0.97
Relative score                     0.967
Binary                             1

In the above example, the monomers predicted by antiSMASH/NRPSPredictor2 for this BGC (BGC_cand_3_1) were (ctg1_1495: X, ctg1_1499: X, ctg1_1500: ser, ctg1_1504: ser, ctg1_1507: X). The weak predictions for ctg1_1495, ctg1_1499 and ctg1_1507 were (asp, asn, glu, gln, aad), (hydrophobic-aliphatic) and (dhpg, hpg), respectively. I see that, based on two consecutive Ser in a peptide of 5 monomers, it is a decent match.

BioCAT PSSM output for the BGC_cand_3_1 is below:

Module name tyr ala dhpg val glu orn gly hpg asp phe gln ser ile thr bht cys leu asn pro
nrpspksdomains_ctg1_1507_AMP-binding.1 0.537259615 0.58808933 0.485008818 0.434500649 0.394242068 0.415716857 0.548672566 0.442642643 0.311602871 0.619939577 0.397076023 0.37653127 0.368231047 0.386792453 0.6039953 0.337223587 0.513598988 0.296454768 0.543891958
nrpspksdomains_ctg1_1504_AMP-binding.1 0.037860577 0.035359801 0.032333921 0.031128405 0.055816686 0.047990402 0.041087231 0.03963964 0.042464115 0.04592145 0.161988304 0.754996776 0.029482551 0.041778976 0.034077556 0.028255528 0.044908286 0.034841076 0.042971148
nrpspksdomains_ctg1_1500_AMP-binding.1 0.358173077 0.360421836 0.248677249 0.18612192 0.66039953 0.49430114 0.478508217 0.282282282 0.324760766 0.324471299 0.838011696 0.994197292 0.163056558 0.276280323 0.337250294 0.274570025 0.322580645 0.240831296 0.310006139
nrpspksdomains_ctg1_1499_AMP-binding.1 0.022235577 0.022952854 0.037624927 0.022697795 0.022914219 0.024595081 0.024020228 0.025825826 0.020933014 0.023564955 0.022807018 0.026434558 0.022864019 0.024932615 0.02173913 0.022727273 0.024035421 0.023227384 0.023327195
nrpspksdomains_ctg1_1495_AMP-binding.1 0.010817308 0.00248139 0.016460905 0.004539559 0.017626322 0.00239952 0.022123894 0.009009009 0.005382775 0.012084592 0.016374269 0.012250161 0.019253911 0.005390836 0.019388954 0.018427518 0.006957622 0.01405868 0.005524862

In the PSSM too, ctg1_1504 and ctg1_1500 have a high score for Ser, consistent with the linearized peptide and antiSMASH. Both the antiSMASH and BioCAT scores suggest Dhpg for ctg1_1499; however, one might expect Orn based on the query. All A-domains show very low scores for Orn.
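One way to inspect PSSM rows like those above is to rank the monomers by score for each module. The following minimal sketch does this for two of the rows, with values copied from the table above; the helper name top_monomers and the dictionary layout are assumptions for illustration and are not part of BioCAT.

```python
# Column order as in the PSSM header above.
MONOMERS = (
    "tyr ala dhpg val glu orn gly hpg asp phe gln ser ile thr bht cys leu asn pro"
).split()

# Two rows copied from the PSSM above, keyed by a shortened module name.
ROWS = {
    "ctg1_1504": (
        "0.037860577 0.035359801 0.032333921 0.031128405 0.055816686 "
        "0.047990402 0.041087231 0.03963964 0.042464115 0.04592145 "
        "0.161988304 0.754996776 0.029482551 0.041778976 0.034077556 "
        "0.028255528 0.044908286 0.034841076 0.042971148"
    ),
    "ctg1_1499": (
        "0.022235577 0.022952854 0.037624927 0.022697795 0.022914219 "
        "0.024595081 0.024020228 0.025825826 0.020933014 0.023564955 "
        "0.022807018 0.026434558 0.022864019 0.024932615 0.02173913 "
        "0.022727273 0.024035421 0.023227384 0.023327195"
    ),
}


def top_monomers(scores_line, n=3):
    """Return the n highest-scoring (monomer, score) pairs of one PSSM row."""
    scores = [float(value) for value in scores_line.split()]
    ranked = sorted(zip(MONOMERS, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


for module, line in ROWS.items():
    orn_score = float(line.split()[MONOMERS.index("orn")])
    print(module, top_monomers(line), "orn:", orn_score)
```

Printing the Orn score next to the top-ranked monomers makes it easy to see how far the query monomer is from the best-scoring one in each module.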

In this example, is the high relative score (0.967) driven mainly by the positions of the two serines and the total number of monomers in the peptide, without reflecting that only 2 out of 5 monomers match? Is there any other way you would suggest to inspect the results?

Thanks,
Rahim.

Had an issue importing rdkit

Hi Danil,

There is a small spelling mistake in the example code:
biocat -smiles '[H][C@@]1(CCC(O)=O)NC(=O)CC(CCCCCCCCCCC)OC(=O)[C@]([H])(CC(C)C)NC(=O)[C@@]([H])(CC(C)C)NC(=O)[C@]([H])(CC(O)=O)NC(=O)[C@@]([H])(NC(=O)[C@@]([H])(CC(C)C)NC(=O)[C@]([H])(CC(C)C)NC1=O)C(C)C' -name 'surfactin' -genome example/Surfactine/GCF_000015785.2_ASM1578v2_genomic.fna -out surfactin_results

The path to GCF_000015785.2_ASM1578v2_genomic.fna should contain Surfactin instead of Surfactine for the command to work.
Another thing I ran into was that the rdkit module wasn't found when running biocat. This was solved by calling python3 explicitly with the path to the biocat script itself.

In my case it looked like this:
python3 /home/daan/.local/bin/biocat -smiles '[H][C@@]1(CCC(O)=O)NC(=O)CC(CCCCCCCCCCC)OC(=O)[C@]([H])(CC(C)C)NC(=O)[C@@]([H])(CC(C)C)NC(=O)[C@]([H])(CC(O)=O)NC(=O)[C@@]([H])(NC(=O)[C@@]([H])(CC(C)C)NC(=O)[C@]([H])(CC(C)C)NC1=O)C(C)C' -name 'surfactin' -genome example/Surfactine/GCF_000015785.2_ASM1578v2_genomic.fna -out surfactin_results

I do not know what caused it, but biocat wasn't able to find rdkit even though I was able to import it manually within python and python3 on the command line.

With kind regards,
Daan
