medema-group / bigslice
A highly scalable, user-interactive tool for the large-scale analysis of Biosynthetic Gene Cluster data
License: GNU Affero General Public License v3.0
redo the way we print out (and log!) system messages.
...rather than downloading Stockholm files from Pfam.
An initial test of running Pfam-A.biosynthetic.hmm (±1500 HMMs)
using 20 cores on the RP15 database takes 8 hours of wall time.
However, this means we will use HMM-aligned fragments instead of the whole gene as training sets (which may lead to quicker queries?).
This also means that we can easily construct sub_pfams for non-Pfam HMMs (such as antiSMASH ones).
Visualizing the locally stored bigslice reports after querying a genome's worth of gbk files returns errors on the vast majority of pages.
There were two main errors I observed, seen below:
TL;DR: the overview code queries the enum_bgc_type table, selects the description column, and filters on the code column; it doesn't account for antismash6, as the table appears to only contain entries for antiSMASH4, antiSMASH5, and MIBIG.
I am using:
Windows Subsystem for Linux, Ubuntu 20.04.3 LTS, anaconda3, antismash-6.1.0, docker (wsl2 backend).
For reference, I ran bigslice using the command:
bigslice --query ~/antismashOutputs/genomeC241/ --n_ranks 5 ~/bigSLICEqueryResults/full_run_result/ --run_id 6 --query_name c241_thresh900_query
Once the run completed, I visualized the output using:
bash ~/bigSLICEqueryResults/full_run_result/start_server.sh 8000
The first issue I ran into was an error whenever it attempted to load any data:
DataTables warning: table_id=table_reports - Ajax error
In Firefox, I used the web developer tools to track down the specific errors. The issue appeared to be that any data table visualized through the controllers (main.py, etc.) had to have a jsonify'd output.
So, for every instance of return results, I replaced it with return jsonify(results), after importing jsonify via from flask import jsonify.
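For anyone applying the same workaround, the change in each controller amounts to something like this minimal sketch (the route name and payload here are placeholders for illustration, not the actual bigslice controllers):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/table_reports")  # hypothetical route, for illustration only
def table_reports():
    results = {"data": [], "recordsTotal": 0}  # placeholder DataTables payload
    # was: return results -- serving a bare dict/list breaks the Ajax call
    return jsonify(results)  # always produces a proper JSON response
```

DataTables expects a JSON response with the right Content-Type, which `jsonify()` guarantees regardless of Flask version.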
This solved every issue except the other one in this write-up, where the report for an individual region in a genome that I queried wouldn't visualize the Overview correctly (example issue page: 127.0.0.1:8000/reports/view/query/21/bgc/5).
For that error, the specific issue was found in the function detail_get_overview():
The code commented fetch type desc searches the SQL database for the table enum_bgc_type, selects the column of values containing description, and filters the values based on a sanitized input for code, where code is one of three options: antismash4, antismash5, and MIBIG. If you run antismash6, the entire thing stops working, and none of the adjacent information (run name, threshold, query created, etc.) visualizes at all.
I believe that modifying the SQL database to include in enum_bgc_type some references to antismash6, at a bare minimum allowing the other elements of the overview to visualize, would be sufficient.
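If the table really is just (code, description) pairs, as the controller code described above suggests, the addition could be sketched like this (the column names are assumptions inferred from that description; verify against your data.db before running):

```python
import sqlite3

def add_bgc_type(con: sqlite3.Connection, code: str, description: str) -> None:
    # Assumed schema: enum_bgc_type(code TEXT, description TEXT);
    # check with `.schema enum_bgc_type` in the sqlite3 shell first.
    con.execute(
        "INSERT INTO enum_bgc_type (code, description) VALUES (?, ?)",
        (code, description),
    )
    con.commit()

# e.g., against the query output database:
# with sqlite3.connect("full_run_result/result/data.db") as con:
#     add_bgc_type(con, "antismash6", "antiSMASH v6")
```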
Please let me know if you have a better way of doing this.
When I tried to run bigslice with my deepBGC predicted output, an error occurred.
Here's the log information (I tried to figure out the reason by modifying the original scripts and printing the intermediate data objects, so you may see some odd messages in the logs).
(BiGSLICE-py3.6) wolfgang@DESKTOP-647U8AG:/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script$
bigslice -i `pwd`/test_wd `pwd`/test_oud
pid 14503's current affinity list: 0-11
pid 14503's new affinity list: 11
pid 14504's current affinity list: 0-11
pid 14504's new affinity list: 10
pid 14505's current affinity list: 0-11
pid 14505's new affinity list: 9
pid 14506's current affinity list: 0-11
pid 14506's new affinity list: 8
pid 14507's current affinity list: 0-11
pid 14507's new affinity list: 7
pid 14508's current affinity list: 0-11
pid 14508's new affinity list: 6
pid 14509's current affinity list: 0-11
pid 14509's new affinity list: 5
pid 14510's current affinity list: 0-11
pid 14510's new affinity list: 4
pid 14511's current affinity list: 0-11
pid 14511's new affinity list: 3
pid 14512's current affinity list: 0-11
pid 14512's new affinity list: 2
pid 14513's current affinity list: 0-11
pid 14513's new affinity list: 1
pid 14514's current affinity list: 0-11
pid 14514's new affinity list: 0
pid 14479's current affinity list: 0-11
pid 14479's new affinity list: 0-11
Folder /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud exists! continue running program (Y/[N])? Y
output_folder /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud
File /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db.bak exists! it will get overwritten, continue (Y/[N])?Y
Loading database into memory (this can take a while)...
data_db_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db
[0.06678962707519531s] loading sqlite3 database
Using HMM database version 'bigslice-models-R01' (built using antiSMASH version 5.1.1)
Loading HMM databases...
[1.1011223793029785s] loading hmm databases
metadata_file /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets.tsv SRR10037259_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037259_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10037259_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd
dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037259_1
SRR10037265_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037265_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10037265_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd
dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037265_1
SRR10037270_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037270_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10037270_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd
dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10037270_1
SRR10338929_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338929_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10338929_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd
dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338929_1
SRR10338933_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338933_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10338933_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd
dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338933_1
SRR10338934_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338934_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10338934_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd
dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338934_1
SRR10338936_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338936_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10338936_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd
dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10338936_1
SRR10583077_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10583077_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10583077_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd
dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10583077_1
SRR10613871_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10613871_1
/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/taxonomy/SRR10613871_1.taxonomy.tsv
NA
folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd
dataset_folder_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_wd/datasets/SRR10613871_1
eligible_regexes [re.compile('^BGC[0-9]{7}$'), re.compile('^.+\\.cluster[0-9]+$'), re.compile('^.+\\.region[0-9]+$')]
Found 8 BGCs in the database.
[0.006556510925292969s] processing dataset: SRR10037259_1
Found 0 BGCs in the database.
[0.00015473365783691406s] processing dataset: SRR10037265_1
Found 0 BGCs in the database.
[0.0001461505889892578s] processing dataset: SRR10037270_1
Found 0 BGCs in the database.
[0.0001442432403564453s] processing dataset: SRR10338929_1
Found 0 BGCs in the database.
[0.00014829635620117188s] processing dataset: SRR10338933_1
Found 0 BGCs in the database.
[0.00015497207641601562s] processing dataset: SRR10338934_1
Found 0 BGCs in the database.
[0.0001513957977294922s] processing dataset: SRR10338936_1
Found 0 BGCs in the database.
[0.00014734268188476562s] processing dataset: SRR10583077_1
Found 0 BGCs in the database.
[0.0001461505889892578s] processing dataset: SRR10613871_1
dataset_name, dataset_bgc_ids SRR10613871_1 []
Found 8 BGC(s) from 9 dataset(s)
self <bigslice.modules.data.database.Database object at 0x7f19acc25cc0>
Dumping in-memory database content into /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db...
self._db_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db
0.0887s
Checking run status of 8 BGCs...
[0.00022101402282714844s] checking run status
Doing biosyn_pfam scan on 8 BGCs...
2 BGCs are already scanned in previous run
Preparing fasta files for hmmscans...
Running hmmscans in parallel...
Parsing hmmscans results...
self <bigslice.modules.data.database.Database object at 0x7f19acc25cc0>
Dumping in-memory database content into /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db...
self._db_path /mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script/test_oud/result/data.db
0.1545s
Traceback (most recent call last):
File "/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/bin/bigslice", line 1607, in <module>
return_code = main()
File "/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/bin/bigslice", line 1226, in main
out_result_path, hmm_ids):
**File "/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/lib/python3.6/site-packages/bigslice/modules/data/hsp.py", line 75, in parse_hmmtext**
raise FileNotFoundError()
FileNotFoundError
(BiGSLICE-py3.6) wolfgang@DESKTOP-647U8AG:/mnt/d/Programming/miniconda3/envs/BiGSLICE-py3.6/wd/script$
The hsp.py and bigslice scripts seem a bit complicated to me. Would you please take a look and tell me how to solve it?
I created the input folder with the tree structure below, following the input_folder_template. All gbks were formatted by generate_antismash_gbk.py.
test_wd
├── datasets
│ ├── SRR10037259_1
│ │ ├── SRR10037259_1.region001.gbk
│ │ ├── SRR10037259_1.region002.gbk
│ │ ├── SRR10037259_1.region003.gbk
│ │ ├── SRR10037259_1.region004.gbk
│ │ ├── SRR10037259_1.region005.gbk
│ │ ├── SRR10037259_1.region006.gbk
│ │ ├── SRR10037259_1.region007.gbk
│ │ └── SRR10037259_1.region008.gbk
│ ├── SRR10037265_1
│ │ ├── SRR10037265_1.region001.gbk
│ │ ├── SRR10037265_1.region002.gbk
│ │ ├── SRR10037265_1.region003.gbk
│ │ ├── SRR10037265_1.region004.gbk
│ │ └── SRR10037265_1.region005.gbk
│ ├── SRR10037270_1
...
│ │ ├── SRR10583077_1.region004.gbk
│ │ ├── SRR10583077_1.region005.gbk
│ │ └── SRR10583077_1.region006.gbk
│ └── SRR10613871_1
│ ├── SRR10613871_1.region001.gbk
│ ├── SRR10613871_1.region002.gbk
│ ├── SRR10613871_1.region003.gbk
│ ├── SRR10613871_1.region004.gbk
│ ├── SRR10613871_1.region005.gbk
│ ├── SRR10613871_1.region006.gbk
│ ├── SRR10613871_1.region007.gbk
│ ├── SRR10613871_1.region008.gbk
│ ├── SRR10613871_1.region009.gbk
│ ├── SRR10613871_1.region010.gbk
│ ├── SRR10613871_1.region011.gbk
│ ├── SRR10613871_1.region012.gbk
│ ├── SRR10613871_1.region013.gbk
│ ├── SRR10613871_1.region014.gbk
│ ├── SRR10613871_1.region015.gbk
│ ├── SRR10613871_1.region016.gbk
│ ├── SRR10613871_1.region017.gbk
│ └── SRR10613871_1.region018.gbk
├── datasets.tsv
└── taxonomy
├── SRR10037259_1.taxonomy.tsv
├── SRR10037265_1.taxonomy.tsv
├── SRR10037270_1.taxonomy.tsv
├── SRR10338929_1.taxonomy.tsv
├── SRR10338933_1.taxonomy.tsv
├── SRR10338934_1.taxonomy.tsv
├── SRR10338936_1.taxonomy.tsv
├── SRR10583077_1.taxonomy.tsv
└── SRR10613871_1.taxonomy.tsv
11 directories, 96 files
Hi
Thanks for the great package
It looks like the links to the taxonomy assignment scripts here are broken in 2 places (italicized and bolded below):
Is it possible to share the scripts?
Thanks in advance
This package looks great. I'd love to try it out with some clusters annotated by antiSMASH using the database of preprocessed BGCs from the NCBI, but the link appears to be broken. Is that preprocessed database available, or will it be available soon? Thanks!
The command in the readme should be changed from user@local:~$ download_bigslice_hmmdb.py to user@local:~$ download_bigslice_hmmdb
will be useful for downstream applications; e.g. to design primer sequences.
use a bytes blob for encoding nt so that it takes as little space as possible
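One way to realize this idea (a sketch, not the actual implementation): pack each base into 2 bits, prefixed with the sequence length so the blob can be decoded again. Bases outside ACGT would need an escape scheme.

```python
# 2-bit-per-base packing of a nucleotide string into a bytes blob.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_nt(seq: str) -> bytes:
    """Pack an ACGT string into a 4-byte length prefix plus 2-bit codes."""
    bits = 0
    for i, base in enumerate(seq):
        bits |= CODE[base] << (2 * i)
    nbytes = max(1, (2 * len(seq) + 7) // 8)
    return len(seq).to_bytes(4, "little") + bits.to_bytes(nbytes, "little")

def unpack_nt(blob: bytes) -> str:
    """Decode a blob produced by pack_nt back into the original string."""
    n = int.from_bytes(blob[:4], "little")
    bits = int.from_bytes(blob[4:], "little")
    return "".join(BASES[(bits >> (2 * i)) & 3] for i in range(n))
```

The blob stores 4 bases per byte, so e.g. an 8-base sequence fits in 2 data bytes plus the length prefix.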
Hi Satria,
First of all, I'm excited to try your tool! I was just trying to download the pre-processed dataset for the query mode, but the link leads to this page: https://github.com/medema-group/bigslice/blob/master , which is a 404. Would you mind pointing me to the pre-computed dataset?
Thanks,
Valentin
I'd suggest adding a flag to automatically answer 'y' to [Y/N] prompts. To use the full_run_result, at my institution it is necessary to submit a job via Slurm to access a node with sufficient memory, and so the job will fail waiting for that interactive input.
Running start_server.sh
on macOS Big Sur 11.1 returns the following error:
readlink: illegal option -- f
usage: readlink [-n] [file ...]
A possible fix is outlined in this Stack Overflow thread.
Another solution would be to install the GNU version of readlink
on macOS, which I personally didn't want to do.
I just commented out the line DIR="$(dirname "$(readlink -f "$0")")"
and added DIR="$(dirname "$0")"
instead, so it "works" for me.
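A portable middle ground (a sketch; it avoids GNU readlink while still yielding an absolute path, though unlike readlink -f it does not resolve a symlinked script itself) is to cd into the script's directory and let pwd do the work:

```shell
#!/bin/sh
# Portable replacement for DIR="$(dirname "$(readlink -f "$0")")" on macOS:
# resolve the script's directory with cd/pwd instead of GNU readlink.
DIR="$(cd "$(dirname "$0")" && pwd)"
echo "$DIR"
```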
I think BiG-SLiCE can be added as a conda package by using the conda skeleton
command, since it is already available on PyPI: https://docs.conda.io/projects/conda-build/en/latest/user-guide/tutorials/build-pkgs-skeleton.html
some challenges with taking a spherical, feature-based, linear clustering approach (i.e., what is sacrificed compared to BiG-SCAPE's all-to-all pairwise comparison approach?)
based on the construction of domain strings (where a combination of sub_pfam signatures forms a unique identity for a domain); should be achievable with an SQL query
bump up support for antiSMASH 6.0
is it necessary for antiSMASH's result.gbk to include Pfam results (i.e., must antiSMASH be run with the --pfam2go parameter)?
not only to troubleshoot and reduce bottlenecks, but also an important piece for the paper
I am running bigslice for the first time. I can't get past this error:
~/bin/BigSlice.dir ~/bin/BigSlice.dir/bigslice/bigslice/bigslice -i /home/kpenn/bin/BigSlice.dir/input_folder --n_ranks 5 ~/bin/BigSlice.dir/Test.output.bigslice.dir
pid 19975's current affinity list: 0-15
pid 19975's new affinity list: 15
pid 19976's current affinity list: 0-15
pid 19976's new affinity list: 14
pid 19977's current affinity list: 0-15
pid 19977's new affinity list: 13
pid 19978's current affinity list: 0-15
pid 19978's new affinity list: 12
pid 19979's current affinity list: 0-15
pid 19979's new affinity list: 11
pid 19980's current affinity list: 0-15
pid 19980's new affinity list: 10
pid 19981's current affinity list: 0-15
pid 19981's new affinity list: 9
pid 19982's current affinity list: 0-15
pid 19982's new affinity list: 8
pid 19983's current affinity list: 0-15
pid 19983's new affinity list: 7
pid 19984's current affinity list: 0-15
pid 19984's new affinity list: 6
pid 19985's current affinity list: 0-15
pid 19985's new affinity list: 5
pid 19986's current affinity list: 0-15
pid 19986's new affinity list: 4
pid 19987's current affinity list: 0-15
pid 19987's new affinity list: 3
pid 19988's current affinity list: 0-15
pid 19988's new affinity list: 2
pid 19989's current affinity list: 0-15
pid 19989's new affinity list: 1
pid 19990's current affinity list: 0-15
pid 19990's new affinity list: 0
pid 19943's current affinity list: 0-15
pid 19943's new affinity list: 0-15
creating output folder...
Loading database into memory (this can take a while)...
[0.008744478225708008s] loading sqlite3 database
Using HMM database version 'bigslice-models-R01' (built using antiSMASH version 5.1.1)
Loading HMM databases...
[2.787898540496826s] loading hmm databases
Dumping in-memory database content into /home/kpenn/bin/BigSlice.dir/Test.output.bigslice.dir/result/data.db... 0.0908s
Traceback (most recent call last):
File "/home/kpenn/bin/BigSlice.dir/bigslice/bigslice/bigslice", line 1596, in <module>
return_code = main()
File "/home/kpenn/bin/BigSlice.dir/bigslice/bigslice/bigslice", line 1074, in main
args.input_folder, output_db, pool).items():
File "/home/kpenn/bin/BigSlice.dir/bigslice/bigslice/bigslice", line 224, in process_input_folder
ds_name, ds_path, ds_taxonomy_path, ds_desc = line.split("\t")
ValueError: too many values to unpack (expected 4)
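Judging from the unpacking line in the traceback (ds_name, ds_path, ds_taxonomy_path, ds_desc = line.split("\t")), each data row of datasets.tsv must contain exactly four tab-separated columns. A standalone sketch (not bigslice code) for locating the offending row — stray or trailing tabs are a common culprit:

```python
def check_datasets_tsv(lines):
    """Yield (line_number, problem) for rows of datasets.tsv that don't have
    exactly 4 tab-separated columns (name, dataset folder, taxonomy file,
    description). Comment and blank lines are skipped."""
    for i, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        n = len(line.rstrip("\n").split("\t"))
        if n != 4:
            yield i, f"expected 4 columns, found {n}"

# e.g.: for lineno, msg in check_datasets_tsv(open("test_wd/datasets.tsv")):
#           print(lineno, msg)
```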
Where should the HMM model libraries be put? I get the error:
error: Can't find HMM model libraries!
is semi-supervision done by using the MIBiG dataset as initial points?
the K value should not need to be specified by users (e.g., by using X-means or accelerated k-means)
the end result should be fuzzy, e.g., by recalculating the distances of each data point to the centers
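One way to sketch that idea — fuzzy c-means style membership weights computed from each point's distance to a fixed set of centers (purely illustrative, not bigslice's implementation):

```python
import numpy as np

def soft_membership(points: np.ndarray, centers: np.ndarray,
                    m: float = 2.0) -> np.ndarray:
    """Return an (n_points, n_centers) matrix of fuzzy membership weights,
    derived from each point's distance to every center; rows sum to 1."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                     # avoid division by zero
    inv = d ** (-2.0 / (m - 1.0))                # fuzzifier exponent
    return inv / inv.sum(axis=1, keepdims=True)
```

A point sitting on a center gets a weight near 1 for that center; a point equidistant between two centers gets 0.5 each.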
Hi,
I was trying to use the script mentioned in the documentation, but the provided link is broken (404).
In order to help users automate this process, BiG-SLiCE provides some python scripts that can be used to assign taxonomy based on the original input genomes (not clustergbks) using the GTDB-toolkit (only for fairly complete archaeal and bacterial genomes, download the script here).
On the other hand, using this script:
Alternatively, if the genomes were coming from NCBI RefSeq/GenBank (i.e., having GCF_* or GCA_* accessions), you can use this script to extract the taxonomy from the GTDB-API.
...I was unable to generate the result table (the output was empty, just headers) when the first record in the list was not found (No entry).
e.g., for this query
GCF_009710845.1
GCF_000160015.1
GCF_001553315.1
GCF_004166985.1
GCF_009710805.1
GCF_009710825.1
I get:
pid 6314's current affinity list: 0
pid 6314's new affinity list: 0
pid 6308's current affinity list: 0
pid 6308's new affinity list: 0
Querying GTDB-API...
No entry: GCF_009710845.1
Fetched: GCF_000160015.1
Fetched: GCF_001553315.1
Fetched: GCF_004166985.1
No entry: GCF_009710805.1
No entry: GCF_009710825.1
saving output...
Traceback (most recent call last):
File "fetch_taxonomy_from_api.py", line 134, in <module>
main()
File "fetch_taxonomy_from_api.py", line 121, in main
taxa["domain"],
TypeError: 'NoneType' object is not subscriptable
...and the output file is an empty table.
Thanks for looking into that :)
Hi
As a follow up to this question - #40
If I have to use the --query mode to query against the pre-processed GCFs of BiG-SLiCE, I assume I would need the relevant antiSMASH results folders from the pre-processed dataset?
If so, could you provide a data dump of these in a zipped file format?
Using the --query mode, you can perform a blazing-fast query
of a putative BGC against the pre-processed set of
Gene Cluster Family (GCF) models that BiG-SLiCE outputs
(for example, you can use our pre-processed result on
~1.2M microbial BGCs from the NCBI database -- a 17GB zipped file download)
bigslice --query <antismash_output_folder> --n_ranks <int> <output_folder>
Thanks in advance
I ran into this problem when running a large bigslice query:
...
Fatal exception (source file p7_alidisplay.c, line 1215):
alignment display write failed
system error: No space left on device
Failed to open output file /tmp/tmpsdr80cnq/bio_848a3a02f3366616b195e4c91b6c35cf.hmmtxt for writing
...
parsing & inserting 41594 GBKs...
Inserted 41594 BGCs!
Preparing fasta files for hmmscans...
Running hmmscans in parallel...
Parsing hmmscans results...
Traceback (most recent call last):
File "/datadrive/data/bgcflow/.snakemake/conda/d6f4af77174369dbe9ceb7588b4d0d4c/bin/bigslice", line 1596, in <module>
return_code = main()
File "/datadrive/data/bgcflow/.snakemake/conda/d6f4af77174369dbe9ceb7588b4d0d4c/bin/bigslice", line 931, in main
return query_mode(args.query_name, args.query, input_run_id,
File "/datadrive/data/bgcflow/.snakemake/conda/d6f4af77174369dbe9ceb7588b4d0d4c/bin/bigslice", line 587, in query_mode
for hsp_object in HSP.parse_hmmtext(
File "/datadrive/data/bgcflow/.snakemake/conda/d6f4af77174369dbe9ceb7588b4d0d4c/lib/python3.10/site-packages/bigslice/modules/data/hsp.py", line 75, in parse_hmmtext
raise FileNotFoundError()
FileNotFoundError
...
I'm guessing that I ran out of disk space in my /tmp directory.
Is it possible to add a command line argument to specify where the temporary directory should be created?
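Until such a flag exists, one possible workaround: assuming BiG-SLiCE creates its scratch folders via Python's tempfile module (the /tmp/tmpsdr80cnq naming pattern in the error suggests it does), tempfile honors the TMPDIR environment variable, so pointing TMPDIR at a larger disk before launching the run (e.g. TMPDIR=/datadrive/tmp bigslice --query ...) would redirect the temporaries. A sketch of the mechanism:

```python
import os
import tempfile

# tempfile picks its default directory from TMPDIR (among other candidates),
# so setting it redirects anything created via mkdtemp()/NamedTemporaryFile().
custom = tempfile.mkdtemp()   # stand-in for a directory on a roomier disk
os.environ["TMPDIR"] = custom
tempfile.tempdir = None       # drop the cached default so TMPDIR is re-read
print(tempfile.gettempdir())  # now resolves to `custom`
```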
taking HSP bitscores to generate a feature matrix for each BGC
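As a sketch of that idea with a hypothetical hit table (bigslice's real schema will differ): collect one best-HSP bitscore per (BGC, HMM model) pair and pivot it into a BGC-by-HMM feature matrix, with 0 where a model produced no hit.

```python
import pandas as pd

# Hypothetical parsed hmmscan hits: one row per (BGC, HMM model) best HSP.
hits = pd.DataFrame({
    "bgc_id":   [1, 1, 2],
    "hmm_id":   ["PF00001", "PF00002", "PF00001"],
    "bitscore": [123.4, 56.7, 89.0],
})

# BGC x HMM feature matrix: best bitscore per pair, 0 where no hit was found.
features = (
    hits.pivot_table(index="bgc_id", columns="hmm_id",
                     values="bitscore", aggfunc="max")
        .fillna(0.0)
)
print(features)
```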
Hi,
the download_bigslice_hmmdb command raises an error:
pid 7020's current affinity list: 0-29
pid 7020's new affinity list: 29
pid 7021's current affinity list: 0-29
pid 7021's new affinity list: 28
pid 7022's current affinity list: 0-29
pid 7022's new affinity list: 27
pid 7023's current affinity list: 0-29
pid 7023's new affinity list: 26
pid 7024's current affinity list: 0-29
pid 7024's new affinity list: 25
pid 7025's current affinity list: 0-29
pid 7025's new affinity list: 24
pid 7026's current affinity list: 0-29
pid 7026's new affinity list: 23
pid 7027's current affinity list: 0-29
pid 7027's new affinity list: 22
pid 7028's current affinity list: 0-29
pid 7028's new affinity list: 21
pid 7029's current affinity list: 0-29
pid 7029's new affinity list: 20
pid 7030's current affinity list: 0-29
pid 7030's new affinity list: 19
pid 7031's current affinity list: 0-29
pid 7031's new affinity list: 18
pid 7032's current affinity list: 0-29
pid 7032's new affinity list: 17
pid 7033's current affinity list: 0-29
pid 7033's new affinity list: 16
pid 7034's current affinity list: 0-29
pid 7034's new affinity list: 15
pid 7035's current affinity list: 0-29
pid 7035's new affinity list: 14
pid 7036's current affinity list: 0-29
pid 7036's new affinity list: 13
pid 7037's current affinity list: 0-29
pid 7037's new affinity list: 12
pid 7038's current affinity list: 0-29
pid 7038's new affinity list: 11
pid 7039's current affinity list: 0-29
pid 7039's new affinity list: 10
pid 7040's current affinity list: 0-29
pid 7040's new affinity list: 9
pid 7041's current affinity list: 0-29
pid 7041's new affinity list: 8
pid 7042's current affinity list: 0-29
pid 7042's new affinity list: 7
pid 7043's current affinity list: 0-29
pid 7043's new affinity list: 6
pid 7044's current affinity list: 0-29
pid 7044's new affinity list: 5
pid 7045's current affinity list: 0-29
pid 7045's new affinity list: 4
pid 7046's current affinity list: 0-29
pid 7046's new affinity list: 3
pid 7047's current affinity list: 0-29
pid 7047's new affinity list: 2
pid 7048's current affinity list: 0-29
pid 7048's new affinity list: 1
pid 7049's current affinity list: 0-29
pid 7049's new affinity list: 0
pid 6960's current affinity list: 0-29
pid 6960's new affinity list: 0-29
Folder /data2/barak_cytryn/Lachish/bigScape/download_bigslice_hmmdb exists! continue running program (Y/[N])? Y
Loading database into memory (this can take a while)...
Traceback (most recent call last):
File "/opt/miniconda2/envs/bigslice-v1.0.0/bin/bigslice", line 1560, in <module>
return_code = main()
File "/opt/miniconda2/envs/bigslice-v1.0.0/bin/bigslice", line 958, in main
with Database(data_db_path, use_memory) as output_db:
File "/opt/miniconda2/envs/bigslice-v1.0.0/lib/python3.8/site-packages/bigslice/modules/data/database.py", line 60, in __init__
self.schema_ver = re.search(
AttributeError: 'NoneType' object has no attribute 'group'
FYI.
Hello, first of all thanks for this useful tool!
I want to cluster some predicted BGCs from MAGs. With antiSMASH I generated the output folders that contain gbk files.
Let's say that one of my MAGs is called genome_1.fa.
As I understand it, bigslice needs to know that the folder "bigslice_input_folder/genome_1/" contains gbk files named like "genome_1.region_x.gbk".
Does that mean I have to rename all my k141_number.region001.gbk files to genome_1.region_x.gbk?
should be run, e.g., using Python's SimpleHTTPServer
at first, it would simply provide basic information extracted from the sqlite3 database:
please clean up the ugly monolithic code.
Excuse me if this is a naive mistake, but I ran:
bigslice -i ./bigslice/misc/input_folder_template bigslice_test_run
and got:
creating output folder...
Loading database into memory (this can take a while)...
[0.017683744430541992s] loading sqlite3 database
Loading HMM databases...
[4.839829921722412s] loading hmm databases
processing dataset: dataset_1...
Found 0 BGCs from 0 GBKs, another 2 to be parsed.
Parsing and inserting 2 GBKs...
Inserted 2 new BGCs.
Parsing and inserting taxonomy information...
Added taxonomy info for 0 BGCs...
[0.033670902252197266s] processing dataset: dataset_1
Found 2 BGC(s) from 1 dataset(s)
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0854s
Checking run status of 2 BGCs...
[2.6941299438476562e-05s] checking run status
Doing biosyn_pfam scan on 2 BGCs...
0 BGCs are already scanned in previous run
Preparing fasta files for hmmscans...
Running hmmscans in parallel...
Parsing hmmscans results...
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0808s
[4.027007102966309s] biosyn_pfam scan
run_status is now BIOSYN_SCANNED
Doing sub_pfam scan on 2 BGCs...
0 BGCs are already scanned in previous run
Preparing fasta files for subpfam_scans...
Running subpfam_scans in parallel...
Parsing subpfam_scans results...
[1.0811920166015625s] sub_pfam scan
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0744s
run_status is now SUBPFAM_SCANNED
Extracting features from 2 BGCs...
0 BGCs are already extracted in previous run
Extracting features...
[0.07161593437194824s] features extraction
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0782s
run_status is now FEATURES_EXTRACTED
Building GCF models...
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0865s
[0.16102123260498047s] clustering
run_status is now CLUSTERING_FINISHED
Assigning GCF membership...
Dumping in-memory database content into /Users/schanana/Downloads/Fan_193/bigslice_test_run/result/data.db... 0.0922s
Traceback (most recent call last):
File "/Users/schanana/Downloads/Fan_193/.venv/bin/bigslice", line 1571, in <module>
return_code = main()
File "/Users/schanana/Downloads/Fan_193/.venv/bin/bigslice", line 1528, in main
for membership in Membership.assign(
File "/Users/schanana/Downloads/Fan_193/.venv/lib/python3.8/site-packages/bigslice/modules/clustering/membership.py", line 179, in assign
dists, centroids_idx = nn.kneighbors(X=bgc_features.values,
File "/Users/schanana/Downloads/Fan_193/.venv/lib/python3.8/site-packages/sklearn/neighbors/_base.py", line 616, in kneighbors
raise ValueError(
ValueError: Expected n_neighbors <= n_samples, but n_samples = 2, n_neighbors = 5
My .venv/bin folder contains both the bigslice and hmm* executables. I ran the command stated above from /Users/schanana/Downloads/Fan_193/.
check out our [demo page here](https://github.com/404))
here:
https://github.com/medema-group/bigslice/blame/8197176c5c39a13ac0c8f2645cbead429bb2d0ce/README.md#L64
e.g.:
I installed BiG-SLiCE from source and tested it using the input folder, and everything works great!
I attempted to run some antiSMASH 5.1.1 results through BiG-SLiCE. My output folder contains the main region .gbk files:
GCF_902388275.1_UHGG_MGYG-HGUT-02545/NZ_LR699017.1.region001.gbk
GCF_902388275.1_UHGG_MGYG-HGUT-02545//NZ_LR699017.1.region002.gbk
So I ran the following query:
bigslice --query GCF_902388275.1_UHGG_MGYG-HGUT-02545/ --n_ranks 1 out_bigslice_GCF_902388275.1
I get the following error:
Output folder didn't exists!
BiG-SLiCE run failed.
Not sure what is going on. Any tips would be helpful.
It is up to 14000x faster?
One thing we missed during the construction of the original biosynthetic Pfams is that many antiSMASH domains (especially PKS/NRP core domains --see the list here--) are meant to be overlapping, i.e., acting like a sub-Pfam. This has the consequence that features from those domains will be artificially enriched (especially since each of them will have sub-Pfams of their own).
For the next set of reference models, please consider this issue.
which will enable showing subpfam annotations per aligned biosyn_pfam hits
sqlite> select * from bgc
...> ;
1|1|**C_003888.3/**NC_003888.3.region002|as5|0|24764|C_003888.3|NC_003888.3.region002.gbk
2|1|C_003888.3/NC_003888.3.region001|as5|0|53018|C_003888.3|NC_003888.3.region001.gbk
antiSMASH contains not only Pfam-derived HMMs but also a lot of manually curated ones (such as the bacteriocin family), in addition to the Pfam-filtered biosynthetic Pfams.
Hi! I am trying to work with the output from BiG-SLiCE and was wondering if the report info can be generated as a data frame for further analysis. Thanks!
Hi,
I apologise in advance if this is a rather simple question; I am a newbie in this area and my learning process is slow.
I am trying to generate antiSMASH regions from my DeepBGC output. Using the proposed script I managed to convert my first 5 regions. However, for the rest I get this output:
Parsing region coordinates...
CDS ['JJQ97_RS05155'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05160'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05165'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05170'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05175'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05180'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05185'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['JJQ97_RS05185'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
Traceback (most recent call last):
File "/home/mfrand/.local/lib/python3.8/site-packages/Bio/Seq.py", line 1232, in translate
table_id = int(table)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "generate_antismash_gbk.py", line 131, in <module>
main()
File "generate_antismash_gbk.py", line 111, in main
feature.translate(region_gbk).seq]
File "/home/mfrand/.local/lib/python3.8/site-packages/Bio/SeqFeature.py", line 462, in translate
return feat_seq.translate(
File "/home/mfrand/.local/lib/python3.8/site-packages/Bio/SeqRecord.py", line 1320, in translate
self.seq.translate(
File "/home/mfrand/.local/lib/python3.8/site-packages/Bio/Seq.py", line 1251, in translate
raise ValueError("Bad table argument")
ValueError: Bad table argument
My first thought was that there was an error in my coordinates table; however, I am not able to find it, as its format seems to be the same as example.csv (I attached a sample of my table below; starting from the row marked in bold, it gives the error). I have tried to modify the script to solve this problem, but I haven't succeeded yet.
#record_name,start_loc,end_loc
NZ_CP068034.2,431183,432377
NZ_CP068034.2,477321,504096
NZ_CP068034.2,505213,506620
NZ_CP068034.2,520667,529185
NZ_CP068034.2,603185,636668
NZ_CP068034.2,1171217,1207238
NZ_CP068034.2,1261449,1269122
NZ_CP068034.2,1277741,1294683
NZ_CP068034.2,1329947,1331198
NZ_CP068034.2,1348002,1372261
NZ_CP068034.2,1631019,1638565
Thank you in advance
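For what it's worth, the traceback above shows `int(table)` failing because it received a list: GenBank feature qualifiers are stored as lists, so a `transl_table` qualifier arrives as `['11']`, not `11`. A hedged sketch of a possible workaround (variable names are hypothetical, not from the actual script) is to unwrap the qualifier before translating:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Toy CDS feature whose transl_table qualifier is a list, as in GenBank files
record = SeqRecord(Seq("ATGGCCTAA"), id="demo")
feature = SeqFeature(FeatureLocation(0, 9, strand=1), type="CDS",
                     qualifiers={"transl_table": ["11"]})

# Unwrap the qualifier: ['11'] -> '11' -> 11, then translate explicitly
table = feature.qualifiers.get("transl_table", ["11"])
if isinstance(table, list):
    table = table[0]
protein = feature.extract(record.seq).translate(table=int(table))
print(protein)  # MA*
```

Passing the unwrapped integer avoids the `Bad table argument` / `TypeError` pair seen in the traceback.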
Hi,
When running the exact example for generate_antismash_gbk.py
I get:
Parsing region coordinates...
CDS ['1756_1756_2'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_3'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_4'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_5'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_6'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_7'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_8'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_9'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_10'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_11'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
CDS ['1756_1756_12'] is not translated (don't have 'translation' qualifier)! translating with BioPython (default transl_table=11 if not filled)..
Traceback (most recent call last):
File "generate_antismash_gbk.py", line 135, in <module>
main()
File "generate_antismash_gbk.py", line 121, in main
SeqIO.write(
File "/home/art/.local/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 530, in write
count = writer_class(handle).write_file(sequences)
File "/home/art/.local/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 244, in write_file
count = self.write_records(records, maxcount)
File "/home/art/.local/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 218, in write_records
self.write_record(record)
File "/home/art/.local/lib/python3.8/site-packages/Bio/SeqIO/InsdcIO.py", line 981, in write_record
self._write_the_first_line(record)
File "/home/art/.local/lib/python3.8/site-packages/Bio/SeqIO/InsdcIO.py", line 744, in _write_the_first_line
raise ValueError("missing molecule_type in annotations")
ValueError: missing molecule_type in annotations
..and only one gbk in the output folder.
BioPython: 1.78
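The `missing molecule_type in annotations` error comes from a behaviour change in Biopython 1.78: the GenBank writer now requires a `molecule_type` annotation on every record. A minimal sketch of the workaround is to set it before calling `SeqIO.write()`:

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy record standing in for a region produced by the script
record = SeqRecord(Seq("ATGGCCTAA"), id="demo", name="demo",
                   description="example region")
# Required by the GenBank writer since Biopython 1.78
record.annotations["molecule_type"] = "DNA"

handle = StringIO()
SeqIO.write(record, handle, "genbank")
first_line = handle.getvalue().splitlines()[0]
print(first_line)  # the LOCUS line of the written record
```

Scripts written against Biopython <= 1.77 typically just need this one extra annotation line per record.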
Is there any way to use bigslice with GECCO? https://github.com/zellerlab/GECCO/
e.g. by using a MinHash-based approach
then we can select which datasets to perform analysis with, etc...
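As a rough illustration of the MinHash idea mentioned above (purely a sketch, not part of bigslice or GECCO): each BGC could be represented as a set of feature tokens, e.g. Pfam domain accessions, and compared via short hash signatures instead of full feature vectors, which is what makes many-to-many queries cheap.

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """Build a MinHash signature from a set of string tokens."""
    sig = []
    for i in range(num_hashes):
        # Derive independent hash functions by salting the token with i,
        # then keep only the minimum hash value per function
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{i}:{t}".encode()).digest()[:8], "big")
            for t in tokens))
    return sig

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching minimum hashes estimates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two toy BGCs sharing 2 of 4 distinct domains (true Jaccard = 0.5)
a = minhash_signature({"PF00109", "PF02801", "PF00550"})
b = minhash_signature({"PF00109", "PF02801", "PF08659"})
print(estimate_jaccard(a, b))  # estimate of the true Jaccard similarity
```

More hash functions tighten the estimate at the cost of longer signatures.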
Hi,
I am running BiG-SLiCE to investigate the novelty of BGCs recovered from my own metagenomic datasets by comparing them with the 1.2M preprocessed BGCs, but why is the "class" column "unknown:unknown"? The "class" column of the example on the BiG-FAM website could be assigned to specific BGC classes, such as NRPS, RiPP, PKS, and so on.
I am preparing the <taxonomy_X.tsv> files according to the example input folder:
https://github.com/medema-group/bigslice/blob/master/misc/input_folder_template/taxonomy/dataset_1_taxonomy.tsv
I used GTDB-Tk 1.5 as my taxonomy assignment tool, and I encountered some cases in which GTDB-Tk could not assign genus and species.
GCA_010156995.1_ASM1015699v1_genomic d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__;f__;g__;s__
GCA_010672345.1_ASM1067234v1_genomic d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Elainellales;f__Elainellaceae;g__;s__
GCA_010672835.1_ASM1067283v1_genomic d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__;f__;g__;s__
I wonder whether you have any suggestions for preparing the <taxonomy_X.tsv> files for these cases. Thank you.
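One way to at least get the lineage strings into tsv shape (a hypothetical helper, not an official recommendation on how bigslice treats empty ranks): split the GTDB-Tk lineage into its seven rank columns and leave unassigned ranks as empty fields, so every row keeps the same column count.

```python
def gtdb_to_columns(lineage):
    """Split a GTDB-Tk lineage string into its seven rank values.

    Unassigned ranks (e.g. 'g__', 's__') become empty strings rather
    than being dropped, so every row keeps the same number of columns.
    """
    return [part.split("__", 1)[1] for part in lineage.split(";")]

row = gtdb_to_columns(
    "d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__;f__;g__;s__")
print(row)  # ['Bacteria', 'Cyanobacteria', 'Cyanobacteriia', '', '', '', '']
```

Whether empty fields are acceptable in the tsv is exactly the question here, so this only handles the parsing side.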
Something I noticed while attempting to run bigslice on my Mac: the import of os.sched_getaffinity fails because macOS Python's os module lacks sched_getaffinity. I expect this would be a problem on Windows as well. Not a big deal for me personally, as I normally run elsewhere, but you may want to address it for broader compatibility.
$ bigslice --help
Traceback (most recent call last):
File "/Users/DWUdwary/anaconda3/envs/bigslice/bin/bigslice", line 11, in <module>
from os import getpid, path, makedirs, remove, sched_getaffinity
ImportError: cannot import name 'sched_getaffinity' from 'os' (/Users/DWUdwary/anaconda3/envs/bigslice/lib/python3.10/os.py)
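A possible compatibility sketch (not the actual bigslice code): `os.sched_getaffinity` is Linux-only, so accessing it via the module and falling back to `os.cpu_count()` keeps the thread-count logic working on macOS and Windows.

```python
import os

# os.sched_getaffinity exists only on Linux; on macOS/Windows the
# attribute access raises AttributeError, which the fallback handles.
try:
    num_threads = len(os.sched_getaffinity(0))
except AttributeError:
    num_threads = os.cpu_count() or 1
print(num_threads)
```

The key change is avoiding `from os import sched_getaffinity` at module level, which fails at import time before any fallback can run.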
Hi @satriaphd, thanks for creating this amazing tool!
I would like to learn and explore the SQL query data, and I thought maybe you already have some pointers in your Jupyter notebook scripts?
To access BiG-SLiCE's preprocessed data, (advanced) users need to be able to run SQL(ite) queries. Although the learning curve might be steeper than for the conventional tabular-formatted output files, once familiarized, the SQL database provides an easy-to-use yet very powerful data-wrangling experience. Please refer to our publication manuscript to get an idea of what kinds of things can be done with the output data. Additionally, you can also download and reuse some Jupyter notebook scripts that we wrote to perform all analyses and generate the figures for the manuscript.
Unfortunately, the link at https://bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/scripts/ seems to be broken. Would you kindly fix the link and share the notebooks?
Hello, I ran into some problems when installing bigslice using pip install ./bigslice/
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
ERROR: Directory './bigslice/' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
This was the error.
I tried pip install --upgrade pip
and ran the install again, but got the same error.
Then I tried pip3 install ./bigslice/
and that time:
WARNING: Ignoring invalid distribution -iopython (/usr/local/lib/python3.6/dist-packages)
ERROR: Directory './bigslice/' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
WARNING: Ignoring invalid distribution -iopython (/usr/local/lib/python3.6/dist-packages)
WARNING: Ignoring invalid distribution -iopython (/usr/local/lib/python3.6/dist-packages)
I got this error.
So I downloaded it manually and ran python setup.py install
and python setup.py build.
Is that OK?
Please let me know what I should do.
Thank you
Hi BiG-SLiCE team
I have a question pertaining to the pre-processed results https://github.com/medema-group/bigslice#querying-antismash-bgcs
Question - the isolate_fungal dataset description
states "BGCs from contig/scaffold-level Bacterial refseqs taken at 2020-02-12 03:19 UTC+1".
Shouldn't the description state fungal refseqs?
it is faster for many-to-many queries