
bigslice's People

Contributors

friederikebiermann, louwersj, mdehollander, pseudopooja, satriaphd, waltersom


bigslice's Issues

os.sched_getaffinity not available on MacOS

Something I noticed while attempting to run bigslice on my Mac: the import of os.sched_getaffinity fails because macOS's Python os module does not provide sched_getaffinity. I expect this would be a problem on Windows as well. Not a big deal for me personally, as I normally run elsewhere, but it may be worth addressing if you want broader compatibility.

$ bigslice --help
Traceback (most recent call last):
  File "/Users/DWUdwary/anaconda3/envs/bigslice/bin/bigslice", line 11, in <module>
    from os import getpid, path, makedirs, remove, sched_getaffinity
ImportError: cannot import name 'sched_getaffinity' from 'os' (/Users/DWUdwary/anaconda3/envs/bigslice/lib/python3.10/os.py)
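A possible workaround (a minimal sketch, not bigslice's actual code; the helper name is made up) would be to fall back to os.cpu_count() on platforms that lack sched_getaffinity:

import os

def usable_cpu_count():
    # sched_getaffinity() is Linux-only; fall back to cpu_count() elsewhere
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1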

question about input folder (clustering mode)

Hello, first of all thanks for this useful tool!

I want to cluster some predicted BGCs from MAGs. With antiSMASH I generated output folders that contain gbk files.

Let's say that one of my MAGs is called genome_1.fa.

  1. antiSMASH makes its predictions of putative BGCs contig by contig, so the resulting gbks in my case are named after the contigs rather than the genome. When preparing BiG-SLiCE's input folder, there will therefore be a genome_1 folder containing an antiSMASH result subfolder with a gbk file called genome_1.gbk plus other gbk files for the regions in which a BGC was inferred (say k141_120083.region001.gbk, where k141_120083 is a contig header name from the genome_1.fa MAG).

As I understand it, BiG-SLiCE expects the genome folder "bigslice_input_folder/genome_1/" to contain gbk files named like "genome_1.region_x.gbk".

Does that mean I have to rename all my k141_<number>.region001.gbk files to genome_1.region_x.gbk?
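If renaming does turn out to be necessary, a minimal sketch of bulk-renaming (hypothetical paths; sequential numbering is used to avoid clashes between regions from different contigs) could look like this:

from pathlib import Path

genome_dir = Path("bigslice_input_folder/genome_1")  # hypothetical location
for i, gbk in enumerate(sorted(genome_dir.glob("*.region*.gbk")), start=1):
    # e.g. k141_120083.region001.gbk -> genome_1.region001.gbk
    gbk.rename(genome_dir / f"genome_1.region{i:03d}.gbk")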

start_server.sh fails on macOS

Running start_server.sh on macOS Big Sur 11.1 returns the following error:

readlink: illegal option -- f
usage: readlink [-n] [file ...]

A possible fix is outlined in this Stack Overflow thread.

Another solution would be to install the GNU version of readlink on macOS, which I personally didn't want to do.

I just commented out the line DIR="$(dirname "$(readlink -f "$0")")" and added DIR="$(dirname "$0")" instead, so it "works" for me.

Broken link for jupyter notebook scripts?

Hi @satriaphd, thanks for creating this amazing tool!
I would like to learn and explore the SQL query data, and I thought maybe you already have some pointers in your Jupyter notebook scripts?

To access BiG-SLiCE's preprocessed data, (advanced) users need to be able to run SQL(ite) queries. Although the learning curve might be steeper compared to the conventional tabular-formatted output files, once familiarized, the SQL database can provide an easy-to-use yet very powerful data wrangling experience. Please refer to our publication manuscript to get an idea of what kind of things are able to be done with the output data. Additionally, you can also download and reuse some jupyter notebook scripts that we wrote to perform all analyses and generate figures for the manuscript.
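For context, a minimal sketch of such a query from Python (result/data.db is the standard output location; the bgc table name is taken from the sqlite dump shown in another issue below, everything else is illustrative):

import sqlite3

con = sqlite3.connect("full_run_result/result/data.db")  # adjust to your output folder
for row in con.execute("SELECT * FROM bgc LIMIT 5"):
    print(row)
con.close()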

Unfortunately, the link at https://bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/scripts/ seems to be broken. Would you kindly fix the link and share the notebooks?

Link to preprocessed NCBI database broken

This package looks great. I'd love to try it out with some clusters annotated by antiSMASH using the database of preprocessed BGCs from NCBI, but the link appears to be broken. Is that preprocessed database available, or will it be available soon? Thanks!

Suggestion: remove interactive prompts

I'd suggest adding a flag to automatically answer 'y' to [Y/N] prompts. To use the full_run_result, at my institution it is necessary to submit a job via Slurm to access a node with sufficient memory, and so the job will fail waiting for that interactive input.

bug: parsed folder paths lose their first letter

sqlite> select * from bgc
...> ;
1|1|C_003888.3/NC_003888.3.region002|as5|0|24764|C_003888.3|NC_003888.3.region002.gbk
2|1|C_003888.3/NC_003888.3.region001|as5|0|53018|C_003888.3|NC_003888.3.region001.gbk

can't use pip install ./bigslice/ command

Hello, I ran into some problems when installing bigslice using pip install ./bigslice/

DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
ERROR: Directory './bigslice/' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.

This error appeared.

I tried pip install --upgrade pip and ran it again, but the same error appeared.

Then I tried pip3 install ./bigslice/

That time,
WARNING: Ignoring invalid distribution -iopython (/usr/local/lib/python3.6/dist-packages)
ERROR: Directory './bigslice/' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
WARNING: Ignoring invalid distribution -iopython (/usr/local/lib/python3.6/dist-packages)
WARNING: Ignoring invalid distribution -iopython (/usr/local/lib/python3.6/dist-packages)

this error appeared.

So I downloaded bigslice manually and ran python setup.py build and python setup.py install.

Is that OK?
Please let me know what I should do.

Thank you

idea: BGC (local) alignments

Based on the construction of domain strings (where combinations of sub_pfam signatures form unique identities for a domain). This should be achievable with an SQL query.

[Question] Pre-processed antiSMASH results from BiG-SLiCE paper

Hi

As a follow up to this question - #40

If I want to use the --query mode to query against the pre-processed GCFs of BiG-SLiCE, I assume I would need the relevant antiSMASH result folders from the pre-processed dataset?

If so, could you provide a data dump of the same in a zipped file format?

Using the --query mode, you can perform a blazing-fast query
of a putative BGC against the pre-processed set of 
Gene Cluster Family (GCF) models that BiG-SLiCE outputs 
(for example, you can use our pre-processed result on 
~1.2M microbial BGCs from the NCBI database -- a 17GB zipped file download)

bigslice --query <antismash_output_folder> --n_ranks <int> <output_folder>

Thanks in advance

Failure to run example input folder

Excuse me if this is a naive mistake but, I ran:

bigslice -i ./bigslice/misc/input_folder_template bigslice_test_run

and got:

creating output folder...
Loading database into memory (this can take a while)...
[0.017683744430541992s] loading sqlite3 database
Loading HMM databases...
[4.839829921722412s] loading hmm databases
processing dataset: dataset_1...
Found 0 BGCs from 0 GBKs, another 2 to be parsed.
Parsing and inserting 2 GBKs...
Inserted 2 new BGCs.
Parsing and inserting taxonomy information...
Added taxonomy info for 0 BGCs...
[0.033670902252197266s] processing dataset: dataset_1
Found 2 BGC(s) from 1 dataset(s)
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0854s
Checking run status of 2 BGCs...
[2.6941299438476562e-05s] checking run status
Doing biosyn_pfam scan on 2 BGCs...
0 BGCs are already scanned in previous run
Preparing fasta files for hmmscans...
Running hmmscans in parallel...
Parsing hmmscans results...
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0808s
[4.027007102966309s] biosyn_pfam scan
run_status is now BIOSYN_SCANNED
Doing sub_pfam scan on 2 BGCs...
0 BGCs are already scanned in previous run
Preparing fasta files for subpfam_scans...
Running subpfam_scans in parallel...
Parsing subpfam_scans results...
[1.0811920166015625s] sub_pfam scan
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0744s
run_status is now SUBPFAM_SCANNED
Extracting features from 2 BGCs...
0 BGCs are already extracted in previous run
Extracting features...
[0.07161593437194824s] features extraction
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0782s
run_status is now FEATURES_EXTRACTED
Building GCF models...
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0865s
[0.16102123260498047s] clustering
run_status is now CLUSTERING_FINISHED
Assigning GCF membership...
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0922s
Traceback (most recent call last):
  File "/Users/schanana/Downloads/Fan_193/.venv/bin/bigslice", line 1571, in <module>
    return_code = main()
  File "/Users/schanana/Downloads/Fan_193/.venv/bin/bigslice", line 1528, in main
    for membership in Membership.assign(
  File "/Users/schanana/Downloads/Fan_193/.venv/lib/python3.8/site-packages/bigslice/modules/clustering/membership.py", line 179, in assign
    dists, centroids_idx = nn.kneighbors(X=bgc_features.values,
  File "/Users/schanana/Downloads/Fan_193/.venv/lib/python3.8/site-packages/sklearn/neighbors/_base.py", line 616, in kneighbors
    raise ValueError(
ValueError: Expected n_neighbors <= n_samples,  but n_samples = 2, n_neighbors = 5

My .venv/bin folder contains both bigslice and hmm* executables. I ran the command stated above from /Users/schanana/Downloads/Fan_193/

Error when generating antismash-like regions

Hi,

When running the exact example for generate_antismash_gbk.py I get:

Parsing region coordinates...
CDS ['1756_1756_2'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_3'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_4'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_5'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_6'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_7'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_8'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_9'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_10'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_11'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_12'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
Traceback (most recent call last):
  File "generate_antismash_gbk.py", line 135, in <module>
    main()
  File "generate_antismash_gbk.py", line 121, in main
    SeqIO.write(
  File "/home/art/.local/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 530, in write
    count = writer_class(handle).write_file(sequences)
  File "/home/art/.local/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 244, in write_file
    count = self.write_records(records, maxcount)
  File "/home/art/.local/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 218, in write_records
    self.write_record(record)
  File "/home/art/.local/lib/python3.8/site-packages/Bio/SeqIO/InsdcIO.py", line 981, in write_record
    self._write_the_first_line(record)
  File "/home/art/.local/lib/python3.8/site-packages/Bio/SeqIO/InsdcIO.py", line 744, in _write_the_first_line
    raise ValueError("missing molecule_type in annotations")
ValueError: missing molecule_type in annotations

...and only one gbk in the output folder.

BioPython: 1.78
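The error suggests the records being written lack the molecule_type annotation that Biopython >=1.78 requires for GenBank output. A minimal, self-contained sketch of the workaround (not the actual generate_antismash_gbk.py code) is to set that annotation before SeqIO.write:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

rec = SeqRecord(Seq("ATGAAATAA"), id="example", name="example")  # stand-in for a region record
rec.annotations["molecule_type"] = "DNA"  # required by Biopython >=1.78 for GenBank output
SeqIO.write([rec], "example.gbk", "genbank")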

overlapping antiSMASH domains

One thing we missed during the construction of the original biosynthetic Pfams is that many of antiSMASH's models (especially the PKS/NRPS core domains -- see the list here) are meant to be overlapping, i.e. to act like sub-Pfams. The consequence is that features from those domains will be artificially enriched (especially since each of them will also get sub-Pfams of its own).

For the next set of reference models, please consider this issue.

Cases where the taxonomy information of the species of interest is only partially available

I am preparing the <taxonomy_X.tsv> files according to the example input folder:
https://github.com/medema-group/bigslice/blob/master/misc/input_folder_template/taxonomy/dataset_1_taxonomy.tsv

I used GTDB-Tk 1.5 as my taxonomy assignment tool and encountered some cases in which GTDB-Tk could not assign a genus or species:
GCA_010156995.1_ASM1015699v1_genomic d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__;f__;g__;s__
GCA_010672345.1_ASM1067234v1_genomic d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Elainellales;f__Elainellaceae;g__;s__
GCA_010672835.1_ASM1067283v1_genomic d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__;f__;g__;s__

I wonder whether you have any suggestions for preparing the <taxonomy_X.tsv> files for these cases. Thank you.

method: use hmm-scanned/searched alignments to RP15 database

...rather than downloading Stockholm files from Pfam.
An initial test of running Pfam-A.biosynthetic.hmm (±1,500 HMMs)
using 20 cores on the RP15 database takes 8 hours of wall time.

This does mean we will use HMM-aligned fragments instead of whole genes as training sets (which may lead to quicker queries?).

This also means that we can easily construct sub_pfams for non-Pfam HMMs (such as the antiSMASH ones).

Link to taxonomy assignment script broken?

Hi

Thanks for the great package

It looks like the links to the taxonomy assignment scripts are broken in two places (italicized and bolded below):

  • BiG-SLiCE provides some python scripts that can be used to assign taxonomy based on the original input genomes (not clustergbks) using the GTDB-toolkit (only for fairly complete archaeal and bacterial genomes, download the script here). Additionally, if the genomes were already assigned an NCBI taxonomy (e.g. it went through NCBI submission process), you can use this script to extract the taxonomy metadata files straight from the genome gbks.

Is it possible to share the scripts?

Thanks in advance

Error in generation of antismash regions using "generate_antismash_gbk.py"

Hi,

I apologise in advance if this is a rather simple question, I am a newbie in this area and my learning process is slow.

I am trying to generate antiSMASH regions from my DeepBGC output. Using the proposed script I managed to convert my first 5 regions. However, for the rest I get this output:

Parsing region coordinates...
CDS ['JJQ97_RS05155'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05160'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05165'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05170'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05175'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05180'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05185'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05185'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
Traceback (most recent call last):
  File "/home/mfrand/.local/lib/python3.8/site-packages/Bio/Seq.py", line 1232, in translate
    table_id = int(table)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "generate_antismash_gbk.py", line 131, in <module>
    main()
  File "generate_antismash_gbk.py", line 111, in main
    feature.translate(region_gbk).seq]
  File "/home/mfrand/.local/lib/python3.8/site-packages/Bio/SeqFeature.py", line 462, in translate
    return feat_seq.translate(
  File "/home/mfrand/.local/lib/python3.8/site-packages/Bio/SeqRecord.py", line 1320, in translate
    self.seq.translate(
  File "/home/mfrand/.local/lib/python3.8/site-packages/Bio/Seq.py", line 1251, in translate
    raise ValueError("Bad table argument")
ValueError: Bad table argument

My first thought was that there was an error in my coordinates table; however, I cannot find one, as its format seems to be the same as example.csv (I attached a sample of my table below; the error appears starting from the row marked in bold). I have tried to modify the script to solve this problem, but I haven't managed to yet.

#record_name,start_loc,end_loc
NZ_CP068034.2,431183,432377
NZ_CP068034.2,477321,504096
NZ_CP068034.2,505213,506620
NZ_CP068034.2,520667,529185
NZ_CP068034.2,603185,636668
NZ_CP068034.2,1171217,1207238
NZ_CP068034.2,1261449,1269122
NZ_CP068034.2,1277741,1294683
NZ_CP068034.2,1329947,1331198
NZ_CP068034.2,1348002,1372261
NZ_CP068034.2,1631019,1638565

Thank you in advance
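One guess at the cause (not verified against the script): GenBank qualifiers are stored as lists, so a transl_table value of ['11'] may be getting passed where an integer is expected, which matches the "int() argument must be ... not 'list'" message above. A hedged sketch of unwrapping the qualifier before translating (function and parameter names are hypothetical):

def translate_cds(feature, parent_record):
    # qualifiers come back as lists, e.g. ['11'], so unwrap before int()
    table = feature.qualifiers.get("transl_table", ["11"])[0]
    return feature.extract(parent_record.seq).translate(table=int(table))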

bigslice output

Hi,
I am running BiG-SLiCE to investigate the novelty of BGCs recovered from my own metagenomic datasets by comparing them with the 1.2M preprocessed BGCs, but the "class" column comes out as "unknown:unknown". In the example on the BiG-FAM website, the "class" column is assigned to specific BGC classes such as NRPS, RiPP, PKS and so on. Why is that?

my script: [screenshot attached]

the results of the BiG-FAM example: [screenshot attached]

typo in readme

Command in readme should be changed from user@local:~$ download_bigslice_hmmdb.py to user@local:~$ download_bigslice_hmmdb

Unpack error

I am running bigslice for the first time. I can't get past this error:
~/bin/BigSlice.dir ~/bin/BigSlice.dir/bigslice/bigslice/bigslice -i /home/kpenn/bin/BigSlice.dir/input_folder --n_ranks 5 ~/bin/BigSlice.dir/Test.output.bigslice.dir
pid 19975's current affinity list: 0-15
pid 19975's new affinity list: 15
pid 19976's current affinity list: 0-15
pid 19976's new affinity list: 14
pid 19977's current affinity list: 0-15
pid 19977's new affinity list: 13
pid 19978's current affinity list: 0-15
pid 19978's new affinity list: 12
pid 19979's current affinity list: 0-15
pid 19979's new affinity list: 11
pid 19980's current affinity list: 0-15
pid 19980's new affinity list: 10
pid 19981's current affinity list: 0-15
pid 19981's new affinity list: 9
pid 19982's current affinity list: 0-15
pid 19982's new affinity list: 8
pid 19983's current affinity list: 0-15
pid 19983's new affinity list: 7
pid 19984's current affinity list: 0-15
pid 19984's new affinity list: 6
pid 19985's current affinity list: 0-15
pid 19985's new affinity list: 5
pid 19986's current affinity list: 0-15
pid 19986's new affinity list: 4
pid 19987's current affinity list: 0-15
pid 19987's new affinity list: 3
pid 19988's current affinity list: 0-15
pid 19988's new affinity list: 2
pid 19989's current affinity list: 0-15
pid 19989's new affinity list: 1
pid 19990's current affinity list: 0-15
pid 19990's new affinity list: 0
pid 19943's current affinity list: 0-15
pid 19943's new affinity list: 0-15
creating output folder...
Loading database into memory (this can take a while)...
[0.008744478225708008s] loading sqlite3 database
Using HMM database version 'bigslice-models-R01' (built using antiSMASH version 5.1.1)
Loading HMM databases...
[2.787898540496826s] loading hmm databases
Dumping in-memory database content into /home/kpenn/bin/BigSlice.dir/Test.output.bigslice.dir/result/data.db... 0.0908s
Traceback (most recent call last):
  File "/home/kpenn/bin/BigSlice.dir/bigslice/bigslice/bigslice", line 1596, in <module>
    return_code = main()
  File "/home/kpenn/bin/BigSlice.dir/bigslice/bigslice/bigslice", line 1074, in main
    args.input_folder, output_db, pool).items():
  File "/home/kpenn/bin/BigSlice.dir/bigslice/bigslice/bigslice", line 224, in process_input_folder
    ds_name, ds_path, ds_taxonomy_path, ds_desc = line.split("\t")
ValueError: too many values to unpack (expected 4)
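Judging from the traceback, every non-comment line of datasets.tsv needs to split into exactly four tab-separated fields (dataset name, dataset folder, taxonomy file, description); an extra column or stray trailing tab would trigger this unpack error. A quick sanity check (hypothetical path) might be:

with open("input_folder/datasets.tsv") as fh:
    for lineno, line in enumerate(fh, start=1):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            print(f"line {lineno}: expected 4 tab-separated columns, found {len(fields)}")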

milestone: implements an interactive (local)web output

Should be run, e.g., using Python's SimpleHTTPServer.
At first, it would simply provide basic information extracted from the sqlite3 database:

  1. PCA-based visualization of BGC families centers
  2. BGCs and their GCF memberships
  3. Distribution (heatmap?) of families and taxonomy
  4. Prototype BGC query function
  5. Free-form SQL command inputter

Missing script for taxonomy assignment

Hi,

I was trying to use the script mentioned in the documentation, but the provided link is broken (404).

In order to help users automate this process, BiG-SLiCE provides some python scripts that can be used to assign taxonomy based on the original input genomes (not clustergbks) using the GTDB-toolkit (only for fairly complete archaeal and bacterial genomes, download the script here).

On the other hand, using this script:

Alternatively, if the genomes were coming from NCBI RefSeq/GenBank (i.e., having GCF_* or GCA_* accessions), you can use this script to extract the taxonomy from the GTDB-API.

...I was unable to generate the result table (the output was empty, just headers) when the first record in the list was not found (No entry),
e.g., for this query

GCF_009710845.1
GCF_000160015.1
GCF_001553315.1
GCF_004166985.1
GCF_009710805.1
GCF_009710825.1

I get:

pid 6314's current affinity list: 0
pid 6314's new affinity list: 0
pid 6308's current affinity list: 0
pid 6308's new affinity list: 0
Querying GTDB-API...
No entry: GCF_009710845.1
Fetched: GCF_000160015.1
Fetched: GCF_001553315.1
Fetched: GCF_004166985.1
No entry: GCF_009710805.1
No entry: GCF_009710825.1
saving output...
Traceback (most recent call last):
  File "fetch_taxonomy_from_api.py", line 134, in <module>
    main()
  File "fetch_taxonomy_from_api.py", line 121, in main
    taxa["domain"],
TypeError: 'NoneType' object is not subscriptable

...and the output file is an empty table.
Thanks for looking into that :)
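A hedged guess at a defensive patch (variable names are hypothetical; the actual script differs): skip accessions for which the API returned nothing instead of indexing into None:

def write_rows(fetched, writer):
    # fetched: {accession: taxa_dict_or_None}, writer: csv.writer for the output table
    for accession, taxa in fetched.items():
        if taxa is None:  # the 'No entry' case from the GTDB API
            print(f"skipping {accession}: no GTDB entry")
            continue
        writer.writerow([accession, taxa["domain"]])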

milestone: implements K-means clustering of features

Semi-supervision could be done by using the MIBiG dataset as initial points?
The K value should not need to be specified by users (e.g. by using X-means or accelerated k-means).
The end result should be fuzzy, e.g. by recalculating the distance of each data point to the centers (see the sketch below).
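A rough sketch of the idea (assuming scikit-learn; the feature matrices are random placeholders, not real BiG-SLiCE features): seed K-means with MIBiG feature vectors and derive a "fuzzy" membership from the distances to the fitted centers.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
bgc_features = rng.random((1000, 300))  # placeholder BGC feature matrix
mibig_seeds = rng.random((50, 300))     # placeholder MIBiG feature vectors used as initial centers

km = KMeans(n_clusters=len(mibig_seeds), init=mibig_seeds, n_init=1).fit(bgc_features)
dists = km.transform(bgc_features)       # distance of every BGC to every center
top3 = np.argsort(dists, axis=1)[:, :3]  # "fuzzy" membership: three nearest centers per BGC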

Trouble querying antiSMASH BGCs

I installed BiG-SLiCE from source and tested it using the example input folder, and everything works great!
I attempted to run some antiSMASH 5.1.1 results through BiG-SLiCE. My output folder contains the main region.gbk files:

GCF_902388275.1_UHGG_MGYG-HGUT-02545/NZ_LR699017.1.region001.gbk
GCF_902388275.1_UHGG_MGYG-HGUT-02545/NZ_LR699017.1.region002.gbk

So I ran the following query:
bigslice --query GCF_902388275.1_UHGG_MGYG-HGUT-02545/ --n_ranks 1 out_bigslice_GCF_902388275.1
I get the following error:

Output folder didn't exists!
BiG-SLiCE run failed.

Not sure what is going on. Any tips would be helpful.

raise FileNotFoundError() while parsing hmmtext

When I tried to run bigslice with my deepBGC predicted output, an error occurred.

Here's the log information (I tried to figure out the reason by modifying the original scripts and printing the intermediate data objects, so you may see some odd messages in the logs).

(BiGSLICE-py3.6) wolfgang@DESKTOP-647U8AG:/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script$
bigslice -i `pwd`/test_wd `pwd`/test_oud

pid 14503's current affinity list: 0-11
pid 14503's new affinity list: 11
pid 14504's current affinity list: 0-11
pid 14504's new affinity list: 10
pid 14505's current affinity list: 0-11
pid 14505's new affinity list: 9
pid 14506's current affinity list: 0-11
pid 14506's new affinity list: 8
pid 14507's current affinity list: 0-11
pid 14507's new affinity list: 7
pid 14508's current affinity list: 0-11
pid 14508's new affinity list: 6
pid 14509's current affinity list: 0-11
pid 14509's new affinity list: 5
pid 14510's current affinity list: 0-11
pid 14510's new affinity list: 4
pid 14511's current affinity list: 0-11
pid 14511's new affinity list: 3
pid 14512's current affinity list: 0-11
pid 14512's new affinity list: 2
pid 14513's current affinity list: 0-11
pid 14513's new affinity list: 1
pid 14514's current affinity list: 0-11
pid 14514's new affinity list: 0
pid 14479's current affinity list: 0-11
pid 14479's new affinity list: 0-11
Folder /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud exists! continue running program (Y/[N])? Y

output_folder /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud

File /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db.bak exists! it will get overwritten, continue (Y/[N])?Y
Loading database into memory (this can take a while)...

data_db_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db

[0.06678962707519531s] loading sqlite3 database
Using HMM database version 'bigslice-models-R01' (built using antiSMASH version 5.1.1)
Loading HMM databases...
[1.1011223793029785s] loading hmm databases

metadata_file /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets.tsv SRR10037259_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037259_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10037259_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd

dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037259_1

SRR10037265_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037265_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10037265_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd

dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037265_1

SRR10037270_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037270_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10037270_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd

dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037270_1

SRR10338929_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338929_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10338929_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd

dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338929_1

SRR10338933_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338933_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10338933_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd

dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338933_1

SRR10338934_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338934_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10338934_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd

dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338934_1

SRR10338936_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338936_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10338936_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd

dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338936_1

SRR10583077_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10583077_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10583077_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd

dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10583077_1

SRR10613871_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10613871_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10613871_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd

dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10613871_1


eligible_regexes [re.compile('^BGC[0-9]{7}$'), re.compile('^.+\\.cluster[0-9]+$'), re.compile('^.+\\.region[0-9]+$')]

Found 8 BGCs in the database.
[0.006556510925292969s] processing dataset: SRR10037259_1
Found 0 BGCs in the database.
[0.00015473365783691406s] processing dataset: SRR10037265_1
Found 0 BGCs in the database.
[0.0001461505889892578s] processing dataset: SRR10037270_1
Found 0 BGCs in the database.
[0.0001442432403564453s] processing dataset: SRR10338929_1
Found 0 BGCs in the database.
[0.00014829635620117188s] processing dataset: SRR10338933_1
Found 0 BGCs in the database.
[0.00015497207641601562s] processing dataset: SRR10338934_1
Found 0 BGCs in the database.
[0.0001513957977294922s] processing dataset: SRR10338936_1
Found 0 BGCs in the database.
[0.00014734268188476562s] processing dataset: SRR10583077_1
Found 0 BGCs in the database.
[0.0001461505889892578s] processing dataset: SRR10613871_1
dataset_name, dataset_bgc_ids SRR10613871_1 []
Found 8 BGC(s) from 9 dataset(s)

self <bigslice.modules.data.database.Database object at 0x7f19acc25cc0>

Dumping in-memory database content into /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db...
self._db_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db

0.0887s
Checking run status of 8 BGCs...
[0.00022101402282714844s] checking run status
Doing biosyn_pfam scan on 8 BGCs...
2 BGCs are already scanned in previous run
Preparing fasta files for hmmscans...
Running hmmscans in parallel...
Parsing hmmscans results...

self <bigslice.modules.data.database.Database object at 0x7f19acc25cc0>

Dumping in-memory database content into /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db...
self._db_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db

0.1545s
Traceback (most recent call last):
  File "/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/bin/bigslice", line 1607, in <module>
    return_code = main()
  File "/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/bin/bigslice", line 1226, in main
    out_result_path, hmm_ids):
  File "/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/lib/python3.6/site-packages/bigslice/modules/data/hsp.py", line 75, in parse_hmmtext
    raise FileNotFoundError()
FileNotFoundError

(BiGSLICE-py3.6) wolfgang@DESKTOP-647U8AG:/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script$

The hsp.py and bigslice scripts seem a bit complicated to me. Would you please take a look and tell me how to solve this?

I created the input folder with the tree structure below, following the input_folder_template. All gbks were formatted by generate_antismash_gbk.py.


test_wd
├── datasets
│   ├── SRR10037259_1
│   │   ├── SRR10037259_1.region001.gbk
│   │   ├── SRR10037259_1.region002.gbk
│   │   ├── SRR10037259_1.region003.gbk
│   │   ├── SRR10037259_1.region004.gbk
│   │   ├── SRR10037259_1.region005.gbk
│   │   ├── SRR10037259_1.region006.gbk
│   │   ├── SRR10037259_1.region007.gbk
│   │   └── SRR10037259_1.region008.gbk
│   ├── SRR10037265_1
│   │   ├── SRR10037265_1.region001.gbk
│   │   ├── SRR10037265_1.region002.gbk
│   │   ├── SRR10037265_1.region003.gbk
│   │   ├── SRR10037265_1.region004.gbk
│   │   └── SRR10037265_1.region005.gbk
│   ├── SRR10037270_1

...

│   │   ├── SRR10583077_1.region004.gbk
│   │   ├── SRR10583077_1.region005.gbk
│   │   └── SRR10583077_1.region006.gbk
│   └── SRR10613871_1
│       ├── SRR10613871_1.region001.gbk
│       ├── SRR10613871_1.region002.gbk
│       ├── SRR10613871_1.region003.gbk
│       ├── SRR10613871_1.region004.gbk
│       ├── SRR10613871_1.region005.gbk
│       ├── SRR10613871_1.region006.gbk
│       ├── SRR10613871_1.region007.gbk
│       ├── SRR10613871_1.region008.gbk
│       ├── SRR10613871_1.region009.gbk
│       ├── SRR10613871_1.region010.gbk
│       ├── SRR10613871_1.region011.gbk
│       ├── SRR10613871_1.region012.gbk
│       ├── SRR10613871_1.region013.gbk
│       ├── SRR10613871_1.region014.gbk
│       ├── SRR10613871_1.region015.gbk
│       ├── SRR10613871_1.region016.gbk
│       ├── SRR10613871_1.region017.gbk
│       └── SRR10613871_1.region018.gbk
├── datasets.tsv
└── taxonomy
    ├── SRR10037259_1.taxonomy.tsv
    ├── SRR10037265_1.taxonomy.tsv
    ├── SRR10037270_1.taxonomy.tsv
    ├── SRR10338929_1.taxonomy.tsv
    ├── SRR10338933_1.taxonomy.tsv
    ├── SRR10338934_1.taxonomy.tsv
    ├── SRR10338936_1.taxonomy.tsv
    ├── SRR10583077_1.taxonomy.tsv
    └── SRR10613871_1.taxonomy.tsv

11 directories, 96 files

BiG-SLiCE output visualization issues: jsonify'd DataTables & antiSMASH6

Visualizing the locally stored bigslice reports after querying a genome's worth of gbk files returns errors on the vast majority of pages.
There were two main errors I observed, seen below:
TL;DR:

  1. DataTable needed to be jsonify'd before being visualized.
  2. The SQL query that looks up the enum_bgc_type table, selects the description column, and filters on the code column doesn't account for antiSMASH 6; the table appears to contain entries only for antiSMASH 4, antiSMASH 5, and MIBiG.

I am using:
Windows Subsystem for Linux, Ubuntu 20.04.3 LTS, anaconda3, antismash-6.1.0, docker (wsl2 backend).

For reference, I ran bigslice using the command:
bigslice --query ~/antismashOutputs/genomeC241/ --n_ranks 5 ~/bigSLICEqueryResults/full_run_result/ --run_id 6 --query_name c241_thresh900_query

Once the run completed, I visualized the output using:
bash ~/bigSLICEqueryResults/full_run_result/start_server.sh 8000

The first issue I ran into was an Error whenever it attempted to load any data:
DataTables warning: table_id=table_reports - Ajax error

In Firefox, I used the web developer tools to track down the specific errors. The issue appeared to be that any data table visualized through the controllers (main.py, etc.) had to have a jsonify'd output.
So, for every instance of return results, I replaced it with return jsonify(results), after importing jsonify with from flask import jsonify.
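In isolation the pattern looks roughly like this (hypothetical route and payload, not the actual controller code):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/table_reports")
def table_reports():
    results = {"data": [[1, "example_report"]]}  # placeholder payload
    return jsonify(results)  # DataTables' Ajax handler expects a proper JSON response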

This solved everything except the other issue in this write-up: the report for an individual region in a genome that I queried wouldn't render the Overview correctly (example page: 127.0.0.1:8000/reports/view/query/21/bgc/5).

For that specific error, the issue was found in the function detail_get_overview():
The code commented "fetch type desc" queries the enum_bgc_type table in the SQL database, selects the description column, and filters it on a sanitized code input, where code is one of three options: antismash4, antismash5, and MIBIG. If you run antiSMASH 6, the whole thing stops working and none of the adjacent information (run name, threshold, query created, etc.) is displayed at all.

I believe it would be sufficient to modify the SQL database so that enum_bgc_type contains a reference to antismash6 that, at a bare minimum, lets the other elements of the overview render (see the sketch below).

Please let me know if you have a better way of doing this.
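A hedged sketch of that idea (the column names are assumed from the description above, and the 'as6' code simply mirrors the 'as5' value visible in the bgc table dump elsewhere in these issues; back up data.db before modifying it):

import sqlite3

con = sqlite3.connect("full_run_result/result/data.db")  # hypothetical path to the result database
con.execute(
    "INSERT INTO enum_bgc_type (code, description) VALUES (?, ?)",
    ("as6", "antiSMASH >= 6 region gbk"),
)
con.commit()
con.close()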

Error: download_bigslice_hmmdb

Hi,

the download_bigslice_hmmdb command raises an error:

pid 7020's current affinity list: 0-29
pid 7020's new affinity list: 29
pid 7021's current affinity list: 0-29
pid 7021's new affinity list: 28
pid 7022's current affinity list: 0-29
pid 7022's new affinity list: 27
pid 7023's current affinity list: 0-29
pid 7023's new affinity list: 26
pid 7024's current affinity list: 0-29
pid 7024's new affinity list: 25
pid 7025's current affinity list: 0-29
pid 7025's new affinity list: 24
pid 7026's current affinity list: 0-29
pid 7026's new affinity list: 23
pid 7027's current affinity list: 0-29
pid 7027's new affinity list: 22
pid 7028's current affinity list: 0-29
pid 7028's new affinity list: 21
pid 7029's current affinity list: 0-29
pid 7029's new affinity list: 20
pid 7030's current affinity list: 0-29
pid 7030's new affinity list: 19
pid 7031's current affinity list: 0-29
pid 7031's new affinity list: 18
pid 7032's current affinity list: 0-29
pid 7032's new affinity list: 17
pid 7033's current affinity list: 0-29
pid 7033's new affinity list: 16
pid 7034's current affinity list: 0-29
pid 7034's new affinity list: 15
pid 7035's current affinity list: 0-29
pid 7035's new affinity list: 14
pid 7036's current affinity list: 0-29
pid 7036's new affinity list: 13
pid 7037's current affinity list: 0-29
pid 7037's new affinity list: 12
pid 7038's current affinity list: 0-29
pid 7038's new affinity list: 11
pid 7039's current affinity list: 0-29
pid 7039's new affinity list: 10
pid 7040's current affinity list: 0-29
pid 7040's new affinity list: 9
pid 7041's current affinity list: 0-29
pid 7041's new affinity list: 8
pid 7042's current affinity list: 0-29
pid 7042's new affinity list: 7
pid 7043's current affinity list: 0-29
pid 7043's new affinity list: 6
pid 7044's current affinity list: 0-29
pid 7044's new affinity list: 5
pid 7045's current affinity list: 0-29
pid 7045's new affinity list: 4
pid 7046's current affinity list: 0-29
pid 7046's new affinity list: 3
pid 7047's current affinity list: 0-29
pid 7047's new affinity list: 2
pid 7048's current affinity list: 0-29
pid 7048's new affinity list: 1
pid 7049's current affinity list: 0-29
pid 7049's new affinity list: 0
pid 6960's current affinity list: 0-29
pid 6960's new affinity list: 0-29
Folder /data2/barak_cytryn/Lachish/bigScape/download_bigslice_hmmdb exists! continue running program (Y/[N])? Y
Loading database into memory (this can take a while)...
Traceback (most recent call last):
  File "/opt/miniconda2/envs/bigslice-v1.0.0/bin/bigslice", line 1560, in <module>
    return_code = main()
  File "/opt/miniconda2/envs/bigslice-v1.0.0/bin/bigslice", line 958, in main
    with Database(data_db_path, use_memory) as output_db:
  File "/opt/miniconda2/envs/bigslice-v1.0.0/lib/python3.8/site-packages/bigslice/modules/data/database.py", line 60, in __init__
    self.schema_ver = re.search(
AttributeError: 'NoneType' object has no attribute 'group'

FYI.

Changing the default temporary directory location

I ran into this problem when running a large bigslice query:

...
Fatal exception (source file p7_alidisplay.c, line 1215):
alignment display write failed
system error: No space left on device
Failed to open output file /tmp/tmpsdr80cnq/bio_848a3a02f3366616b195e4c91b6c35cf.hmmtxt for writing
...
parsing & inserting 41594 GBKs...
Inserted 41594 BGCs!
Preparing fasta files for hmmscans...
Running hmmscans in parallel...
Parsing hmmscans results...
Traceback (most recent call last):
  File "/datadrive/data/bgcflow/.snakemake/conda/d6f4af77174369dbe9ceb7588b4d0d4c/bin/bigslice", line 1596, in <module>
    return_code = main()
  File "/datadrive/data/bgcflow/.snakemake/conda/d6f4af77174369dbe9ceb7588b4d0d4c/bin/bigslice", line 931, in main
    return query_mode(args.query_name, args.query, input_run_id,
  File "/datadrive/data/bgcflow/.snakemake/conda/d6f4af77174369dbe9ceb7588b4d0d4c/bin/bigslice", line 587, in query_mode
    for hsp_object in HSP.parse_hmmtext(
  File "/datadrive/data/bgcflow/.snakemake/conda/d6f4af77174369dbe9ceb7588b4d0d4c/lib/python3.10/site-packages/bigslice/modules/data/hsp.py", line 75, in parse_hmmtext
    raise FileNotFoundError()
FileNotFoundError
...

I'm guessing that I ran out of disk space in my /tmp directory.

Is it possible to add a command line argument to specify where the temporary directory should be created?
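If bigslice relies on Python's tempfile defaults (which the /tmp/tmp... paths above suggest), pointing the TMPDIR environment variable at a larger disk before the run may work as a stopgap; a hypothetical wrapper:

import os
import subprocess

env = dict(os.environ, TMPDIR="/datadrive/tmp")  # hypothetical scratch location with enough space
subprocess.run(["bigslice", "--query", "antismash_out", "query_out"], env=env, check=True)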

[!] algorithm improvement (note to all users!)

Some challenges with taking a spherical, feature-based, linear clustering approach (i.e., what is sacrificed compared to BiG-SCAPE or an all-to-all pairwise comparison approach?):

  1. Shorter BGCs and/or low feature occupancy (e.g. bacteriocins, RiPPs, etc.) result in a higher perceived similarity, so these BGCs will be grouped together more often than bigger BGCs (i.e. t1pks and nrps) under the same threshold parameter
  • Possible solution: increase pHMM coverage for those BGC classes (i.e. include more domains), or dynamically adjust the number of extracted sub-pfam signatures (i.e. for classes like RiPPs, where core enzyme sequence specificity matters much more than domain composition within the region) to 'balance' the feature occupancy level across classes
  2. max()-based domains->genes->BGC feature aggregation (due to the limitation of keeping feature values in the 0-255 range) falls short when comparing multimodular classes like NRPS and t1pks that rely on copy numbers (see the sketch below)
  • Possible solution: use sum()-based aggregation (removing the range limitation) to capture those copy numbers, while being careful not to make it too sensitive to or too biased towards copy numbers
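A toy illustration (made-up bitscores, not actual BiG-SLiCE features) of the max() vs sum() trade-off described above:

import numpy as np

# bitscores of the same biosynthetic domain hit in four modules of a multimodular BGC
module_scores = np.array([180, 175, 190, 185])

print(module_scores.max())  # 190 -> fits in 0-255, but the copy number is invisible
print(module_scores.sum())  # 730 -> copy number is captured, but exceeds the 0-255 range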
