medema-group / big-map

License: Other

Python 100.00%

big-map's Introduction

The Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP)

This is the GitHub repository for the Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP). Metabolic Gene Clusters (MGCs) are responsible for the synthesis of small molecules that are known to have a major impact on the host. To evaluate their contribution to a given host phenotype, it is important to assess the MGC abundance and expression profile in a given dataset. We therefore built BiG-MAP, a bioinformatic tool that profiles the abundance and expression levels of gene clusters across metagenomic and metatranscriptomic data and evaluates their differential abundance and expression between conditions. BiG-MAP is composed of 4 modules:

BiG-MAP.download.py
BiG-MAP.family.py
BiG-MAP.map.py
BiG-MAP.analyse.py

For information on how to install and run BiG-MAP, check this tutorial. Please note that this tool has only been tested on Ubuntu.

Installation

Install the BiG-MAP dependencies using conda. Conda can be installed from miniconda_link. First clone the BiG-MAP repository from GitHub:

~$ git clone https://github.com/medema-group/BiG-MAP.git

Then install the dependencies from the BiG-MAP_process.yml and BiG-MAP_analyse.yml files with:

# For BiG-MAP.download.py, BiG-MAP.family.py and BiG-MAP.map.py
~$ conda env create -f BiG-MAP_process.yml -n BiG-MAP_process
~$ conda activate BiG-MAP_process

# For BiG-MAP.analyse.py
~$ conda env create -f BiG-MAP_analyse.yml -n BiG-MAP_analyse
~$ conda activate BiG-MAP_analyse

This step is optional, but to make use of the second redundancy filtering step in the BiG-MAP.family module, download BiG-SCAPE using:

~$ git clone https://github.com/medema-group/BiG-SCAPE/

Further below you will find more information on how to install BiG-SCAPE and its dependencies, but you can also check the wiki page for more information: https://github.com/medema-group/BiG-SCAPE/wiki.

Overview and example run

The typical workflow for BiG-MAP consists of the following steps:

Download WGS data using BiG-MAP.download.py (Optional)
Group gene clusters onto gene cluster families (GCFs) and housekeeping gene families (HGFs) using BiG-MAP.family.py
Assess gene cluster (and HGF representatives) abundance and expression profiles using BiG-MAP.map.py
Perform differential abundance/expression analysis and visualize the output using BiG-MAP.analyse.py

The four modules are described in more detail below, together with the commands to run them.

1) BiG-MAP.download.py

This script is created to easily download the metagenomic and/or metatranscriptomic samples available in the SRA repository (https://www.ncbi.nlm.nih.gov/sra). First, the samples are downloaded in .SRA format, and then they are converted into .fastq using fastq-dump.

conda activate BiG-MAP_process
python3 BiG-MAP.download.py -h
python3 BiG-MAP.download.py [Options]* -A [accession_list_file] -O [path_to_outdir]

To download the samples, go to the SRA run selector and look up the BioProject record of interest. Next, from the resulting page, download the Accession list. This file contains a list of sample accessions and is used as input for the BiG-MAP.download module. In this tutorial, the IBD cohort of Schirmer et al. (2018) is used, so the Accession list of BioProject PRJNA389280 was downloaded. To simplify the tutorial and speed up the analysis, 8 metagenomic samples were chosen: 4 samples from patients with Crohn's disease (CD) and 4 from patients with ulcerative colitis (UC). Create a file with the following 8 sample accessions and run the BiG-MAP.download command to get the fastq files:

Acc_list.txt:
SRR5947837
SRR5947861
SRR5947824
SRR5947881
SRR5947836
SRR5947862
SRR5947855
SRR5947841

python3 BiG-MAP.download.py -A Acc_list.txt -O /usr001/fastq/schirmer/
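Before starting a long download, it can help to check the accession list for typos. The snippet below is a minimal sketch and not part of BiG-MAP: the helper name read_accessions is hypothetical, and it assumes the list holds one run accession per line using the SRR/ERR/DRR prefixes.

```python
import re

# Assumption: run accessions start with SRR, ERR or DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def read_accessions(text):
    """Return the unique run accessions found in text, in their original order."""
    accessions = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        accession = line.strip()
        if not accession:
            continue  # ignore blank lines
        if not RUN_ACCESSION.match(accession):
            raise ValueError(f"line {line_number}: not a run accession: {accession!r}")
        if accession not in accessions:
            accessions.append(accession)  # drop duplicates, keep order
    return accessions


tutorial_list = """SRR5947837
SRR5947861
SRR5947824
SRR5947881
SRR5947836
SRR5947862
SRR5947855
SRR5947841
"""

print(len(read_accessions(tutorial_list)), "accessions look valid")
```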

2) BiG-MAP.family.py

The main purpose of this module is to group gene clusters into GCFs using sequence similarity. The first redundancy filtering step is performed by MASH, which by default uses a sequence similarity cut-off of 0.8 that can be changed as desired. Additionally, a second round of redundancy filtering can be performed using BiG-SCAPE. We strongly recommend using BiG-SCAPE for more accurate redundancy filtering. See the BiG-SCAPE wiki for installation instructions: https://git.wageningenur.nl/medema-group/BiG-SCAPE/-/wikis/installation. To run BiG-SCAPE, you will also need the latest (processed) Pfam database, Pfam-A.hmm.gz, available from the Pfam FTP website (https://pfam.xfam.org/). Once the Pfam-A.hmm.gz file is downloaded, uncompress it and process it using the hmmpress command from the HMMER suite (http://hmmer.org/).

BiG-MAP.family takes as input the output directories of any antiSMASH or gutSMASH run. Given a set of genomes, gutSMASH/antiSMASH can predict multiple gene clusters, so the output folders containing the predicted gene clusters for each genome are used as input for this module. Please note that antiSMASH or gutSMASH must therefore be run beforehand. To follow this tutorial without running either tool first, you can find 10 example gutSMASH output folders here: example data folder.

To make use of the tutorial gutSMASH folders, first extract the files using:

tar -xf BiG-MAP_tutorial_genomes.tar.gz

The general usage of BiG-MAP.family is:

conda activate BiG-MAP_process
python3 BiG-MAP.family.py -h
python3 BiG-MAP.family.py [Options]* -D [input dir(s)] -O [output dir]

Check the command below to see how to run this module with the tutorial samples and BiG-SCAPE:

python3 BiG-MAP.family.py -D /usr001/BiG-MAP_tutorial_genomes/ -b /usr001/BiG-SCAPE_location/ -pf /usr001/pfam_files_location/ -O /usr001/results_family/

This yields:
BiG-MAP.GCF.bed = Bedfile to extract core regions in BiG-MAP.map.py
BiG-MAP.GCF.fna = Reference file to map the WGS reads to
BiG-MAP.GCs.json = Dictionary that contains the GCFs
BiG-MAP.GCF.json = Dictionary that contains the BiG-SCAPE GCFs
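To get a quick impression of the families, the JSON output can be summarised in a few lines. The sketch below assumes, without confirmation from the README, that the dictionary maps each representative gene cluster to a list of its family members; the helper name family_sizes and the example identifiers are hypothetical. Inspect the real file first and adapt the code if its layout differs.

```python
import json


def family_sizes(gcf_dict):
    """Return {representative: number of listed members} for a GCF dictionary.

    Assumption: keys are representatives, values are lists of member clusters.
    """
    return {representative: len(members) for representative, members in gcf_dict.items()}


# Made-up example with the assumed layout.
example = {
    "clusterA.region001": ["clusterB.region002", "clusterC.region001"],
    "clusterD.region003": [],
}

# With real output, load the file instead (path taken from the tutorial):
# with open("/usr001/results_family/BiG-MAP.GCF.json") as handle:
#     example = json.load(handle)

for representative, size in sorted(family_sizes(example).items()):
    print(representative, size)
```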

In general, the anti- or gutSMASH-output folder should contain the results of at least several runs. Optional flags to run this module include:

-tg: Fraction between 0 and 1; the similarity threshold that determines when the protein sequences of the gene clusters can be considered similar. If the threshold is set to zero, all gene clusters will form their own gene cluster families, whereas a threshold of one will result in one large family containing all gene clusters. Default = 0.8.

-th: Fraction between 0 and 1; the similarity threshold that determines when the protein sequences of the housekeeping genes can be considered similar. Default = 0.1

-f: Specify here the number of genes that are flanking the core genes of the gene cluster. 0 –> only the core, n –> n genes included that flank the core. Default = 0

-g: Output whole genome fasta files for the MASH filtered gene clusters as well. This uses more disk space in the output directory. ‘True’ | ‘False’. Default = False

-p: Number of used parallel threads in the BiG-SCAPE filtering step. Default = 6

NOTE: the number of predicted MGCs may exceed the maximum number of sequences that MASH (“sketch” and “dist” functions) is able to compare, leading to an error. In this scenario, the family module can be run in batches or the code can be slightly modified to manually run the MASH analysis and MASH “paste” function (more information is available in their documentation at https://mash.readthedocs.io/en/latest/) and pick up the analysis again from that step onwards.
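Running the family module in batches first requires splitting the input directories into groups. The following is a minimal sketch of that splitting step only; the helper name and the batch size are illustrative assumptions, and how to merge the per-batch results afterwards is not covered here.

```python
def split_into_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[start:start + batch_size] for start in range(0, len(items), batch_size)]


# Hypothetical antiSMASH/gutSMASH output folders.
input_dirs = [f"genome_{number:03d}" for number in range(1, 8)]

for batch_number, batch in enumerate(split_into_batches(input_dirs, 3), start=1):
    # Each batch could be passed to a separate BiG-MAP.family.py run via -D.
    print(batch_number, " ".join(batch))
```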

3) BiG-MAP.map.py

This module is designed to align the WGS (paired or unpaired) reads to the reference representatives of each GCF and HGF using bowtie2. The following will be computed: RPKM, coverage, core coverage. The coverage is calculated using Bedtools, and the read count values using Samtools. The general usage is:
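For reference, RPKM is conventionally defined as reads per kilobase of reference sequence per million reads. The function below sketches that standard formula; it is not taken from BiG-MAP.map.py, and which read total the module uses as the denominator (all reads in the sample or only mapped reads) is left as an assumption for the caller.

```python
def rpkm(read_count, length_bp, total_reads):
    """Reads per kilobase per million reads: count * 1e9 / (length * total)."""
    if length_bp <= 0 or total_reads <= 0:
        raise ValueError("length_bp and total_reads must be positive")
    return read_count * 1e9 / (length_bp * total_reads)


# 500 reads on a 10 kb gene cluster in a sample of 2 million reads.
print(rpkm(500, 10_000, 2_000_000))
```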

conda activate BiG-MAP_process
python3 BiG-MAP.map.py -h
python3 BiG-MAP.map.py {-I1 [mate-1s] -I2 [mate-2s] | -U [samples]} {-F [family] | -P [pickled file]} -O [outdir] -b [metadata] [Options*]

To map the 8 samples from Schirmer et al. (2018) to the GCF reference representatives, and correct for the BiG-SCAPE GCF size, run:

NOTE: It is important for downstream analysis to also use the -b flag. Also, if you prefer the averaged number of reads mapped per GCF (instead of the sum), include the -a or --average flag in the command below.

python3 BiG-MAP.map.py -b /usr001/results/schirmer_metadata.txt -I1 /usr001/fastq/schirmer/*pass_1* -I2 /usr001/fastq/schirmer/*pass_2* -O /usr001/results_mapping/ -F /usr001/results_family/


The schirmer_metadata.txt file is set up as follows (tab-delimited):
#run.ID	host.ID	SampleType	DiseaseStatus
SRR5947837	M2026C2_MGX	METAGENOMIC	UC
SRR5947861	M2026C3_MGX	METAGENOMIC	UC
SRR5947824	M2026C4_MGX	METAGENOMIC	UC
SRR5947881	M2026C7_MGX	METAGENOMIC	UC
SRR5947836	M2027C1_MGX	METAGENOMIC	CD
SRR5947862	M2027C2_MGX	METAGENOMIC	CD
SRR5947855	M2027C3_MGX	METAGENOMIC	CD
SRR5947841	M2027C5_MGX	METAGENOMIC	CD

Note the '#' that marks the header row.
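Problems in the metadata file tend to surface only later, in the analyse module, so a quick check before mapping can save time. The parser below is a sketch, not BiG-MAP code: the function name is hypothetical, and treating run.ID and SampleType as required columns is an assumption based on the example header above.

```python
# Assumption: these two columns from the example header must be present.
REQUIRED_COLUMNS = ("run.ID", "SampleType")


def parse_metadata(text):
    """Parse a tab-delimited metadata table whose header row starts with '#'."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("#"):
        raise ValueError("the header row must start with '#'")
    header = lines[0][1:].split("\t")
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")
    rows = []
    for line in lines[1:]:
        fields = line.split("\t")
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(fields)}: {line!r}")
        rows.append(dict(zip(header, fields)))
    return rows


example = (
    "#run.ID\thost.ID\tSampleType\tDiseaseStatus\n"
    "SRR5947837\tM2026C2_MGX\tMETAGENOMIC\tUC\n"
    "SRR5947836\tM2027C1_MGX\tMETAGENOMIC\tCD\n"
)
for row in parse_metadata(example):
    print(row["run.ID"], row["SampleType"], row["DiseaseStatus"])
```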

4) BiG-MAP.analyse.py

This module performs a statistical analysis on the metagenomic/metatranscriptomic samples. First, the script normalizes and filters the data. The best covered gene clusters can then be inspected using the --explore flag. Next, the Kruskal-Wallis test and the fitZIG model are used to identify differentially abundant/expressed gene clusters, and Benjamini-Hochberg FDR correction accounts for multiple hypothesis testing. The script outputs several heatmaps in PDF format.
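To make the multiple-testing step concrete, here is a plain Python sketch of the standard Benjamini-Hochberg adjustment. It illustrates the general procedure only and is not the code used by BiG-MAP.analyse.py; the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    count = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(count), key=lambda index: p_values[index])
    adjusted = [0.0] * count
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(count, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * count / rank)
        adjusted[index] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print([round(value, 4) for value in benjamini_hochberg(raw)])
```

Gene clusters whose adjusted p-value falls below the chosen FDR level (for example 0.05) would then be reported as differentially abundant or expressed.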

To run the script, the BiG-MAP_analyse conda environment should be activated. The general usage is:

conda activate BiG-MAP_analyse
python3 BiG-MAP.analyse.py -h
python3 BiG-MAP.analyse.py --explore --compare -B [biom_file] -T [metagenomic/metatranscriptomic] -M [metagroup] -O [outdir] [Options*]

Example command for the explore heatmap:
python3 BiG-MAP.analyse.py --explore -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -O /usr001/results_analysis

Example command for the compare heatmap:
python3 BiG-MAP.analyse.py --compare -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -g UC CD -O /usr001/results_analysis

Example command including both the explore and the compare heatmap:
python3 BiG-MAP.analyse.py --explore --compare -B /usr001/results_mapping/biom-results/BiG-MAP.map.metacore.dec.biom -T metagenomic -M DiseaseStatus -g UC CD -O /usr001/results_analysis

Note: You can choose either BiG-MAP.map.metacore.dec.biom or BiG-MAP.mapcore.metacore.dec.biom as the -B input file, depending on whether you want to plot the results for the whole gene clusters or only for the core genomic region of the gene clusters, respectively.

Output: 
explore_heatmap.pdf & explore_heatmap.eps -> contains the top 20 best covered gene clusters
UCvsCD_fz.pdf & UCvsCD.eps -> comparison between UC and CD using the fitZIG model
UCvsCD_kw.pdf & UCvsCD_kw.eps -> comparison between UC and CD using the Kruskal-Wallis test
tsv-results -> directory containing tsv files with the raw data

Snakemake workflow

This Snakemake workflow allows for a more automated and streamlined running of the separated BiG-MAP modules. For more information on how to install and run the Snakemake version of BiG-MAP, check the instructions below.

Installation and run overview

Install the BiG-MAP dependencies using conda. Conda can be installed from miniconda_link. First clone the BiG-MAP repository from GitHub:

~$ git clone https://github.com/medema-group/BiG-MAP.git

Install Snakemake with the following command:

~$ conda create -n snakemake -c conda-forge -c bioconda snakemake=6.0.2
~$ conda activate snakemake

Next, copy the BiG-MAP_snakemake folder to the preferred output location:

~$ cp -r BiG-MAP/BiG-MAP_snakemake/ /path/to/output/location/

Navigate to the BiG-MAP_snakemake folder and adjust the config.yaml file. In this file, set the locations of files and folders according to the desired BiG-MAP run settings. After adjusting the config file, use the following command to start the BiG-MAP run:

~$ snakemake --use-conda --cores 10

NOTE: It is recommended to point --conda-prefix to a folder in which the BiG-MAP conda environments can be installed (--conda-prefix path/to/snakemake/conda/envs). Additionally, the number of cores used by BiG-MAP can be adjusted with the --cores flag.

Requirements

Input data:

antiSMASH v5.0 or higher
gutSMASH

Software:

Python 3+
R statistics
fastq-dump
Mash
HMMer
Bowtie2
Samtools
Bedtools
biom
BiG-SCAPE=20191011

Packages:

Python

BioPython
pandas

R

metagenomeSeq
biomformat
ComplexHeatmap=2.0.0
viridisLite
RColorBrewer
tidyverse

big-map's Issues

Readme: unclear steps for BiG-MAP.family.py

In the readme tutorial, at the BiG-MAP.family.py - in the example a pfam folder is provided but no mention as to 1) where this came from and 2) what contents there should be there. Perhaps just a short sentence could clarify on what this folder is and does.

Possible to make a release for bioconda recipe?

Hello,

currently I'm working to create a wrapper for BiG-MAP which then should be added to Galaxy (https://usegalaxy.org), an open source tool collection.

For creating a wrapper I need to also write a bioconda recipe, but for the recipe I need to have a release from this GitHub.
Is it possible to make a release which I can use for the recipe?

Thank you in advance!

SameFileError while running BiG-MAP.family.py

HI,
Ran into the following error while running the latest commit:

home/mcs/soft/BiG-MAP/src/BiG-MAP.family.py -p 30 -D 3.Function/antismash/ -b /home/mcs/soft/BiG-SCAPE/bigscape.py -pf /home/mcs/soft/BiG-SCAPE/ -O 3.Function/antismash/big_map_antismash -g True
Extracting fasta files
Preparing BiG-SCAPE input_
Traceback (most recent call last):
File "/home/mcs/soft/BiG-MAP/src/BiG-MAP.family.py", line 1236, in <module>
main()
File "/home/mcs/soft/BiG-MAP/src/BiG-MAP.family.py", line 1117, in main
movegbk(args.outdir, gbk_file, list_gbkfiles)
File "/home/mcs/soft/BiG-MAP/src/BiG-MAP.family.py", line 828, in movegbk
shutil.copy(gbk_file, os.path.join(path, genome))
File "/home/mcs/miniconda3/envs/BiG-MAP_process/lib/python3.6/shutil.py", line 241, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/home/mcs/miniconda3/envs/BiG-MAP_process/lib/python3.6/shutil.py", line 104, in copyfile
raise SameFileError("{!r} and {!r} are the same file".format(src, dst))
shutil.SameFileError: '3.Function/antismash/big_map_antismash/gbk_files/c_000000000001.region001.gbk' and '3.Function/antismash/big_map_antismash/gbk_files/c_000000000001.region001.gbk' are the same file

Seems like problem with shutil copy function
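Judging from the command, the output directory sits inside the input directory, so a re-run may pick up .gbk files it copied earlier and then try to copy them onto themselves; this is a plausible reading rather than a confirmed diagnosis. A guard like the one sketched below avoids the exception. The helper name is hypothetical and this is not the fix applied in the repository.

```python
import os
import shutil
import tempfile


def copy_unless_same(src, dst_dir):
    """Copy src into dst_dir, skipping the copy when src already is that file."""
    os.makedirs(dst_dir, exist_ok=True)
    dst = os.path.join(dst_dir, os.path.basename(src))
    if os.path.exists(dst) and os.path.samefile(src, dst):
        return dst  # source and destination are the same file, nothing to do
    return shutil.copy(src, dst)


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    gbk_dir = os.path.join(workdir, "gbk_files")
    os.makedirs(gbk_dir)
    gbk_file = os.path.join(gbk_dir, "c_000000000001.region001.gbk")
    with open(gbk_file, "w") as handle:
        handle.write("LOCUS placeholder\n")
    # shutil.copy(gbk_file, gbk_file) would raise SameFileError here.
    print(os.path.basename(copy_unless_same(gbk_file, gbk_dir)))
```

Pointing -O at a directory outside the -D input tree would sidestep the situation altogether.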

ERROR: could not open "01.identity_0.8/mash_sketch.msh" for reading.

Hi,

Thanks for the fancy tool. I would like to use BiG-MAP to reduplicate my BGCs with the command below

nohup python3 /data3/zhangdw/01.Toolkit/BiG-MAP/src/BiG-MAP.family.py -D ../03.All_antismash_output/ -b /usr/local/bin/bigscape -p 20 -tg 0.8 -O 01.identity_0.8 &

But I got an error:

ERROR: could not open "01.identity_0.8/mash_sketch.msh" for reading.
___________Extracting fasta files__________
________Preparing BiG-SCAPE input__________
__________Running BiG-SCAPE________________

I put several antismash outputs under the folder "../03.All_antismash_output/".

$ ll ../03.All_antismash_output/
total 0
drwxrwxr-x. 3 dwzhang dwzhang 129 Oct 23 16:38 01.RefSeq_isolate
drwxrwxr-x. 4 dwzhang dwzhang  80 Oct 23 19:18 02.PATRIC_isolate
drwxrwxr-x. 4 dwzhang dwzhang  79 Oct 23 19:22 03.IMG_M_isolate
drwxrwxr-x. 4 dwzhang dwzhang 175 Oct 23 19:22 04.Human_gut_isolate
drwxrwxr-x. 4 dwzhang dwzhang  79 Oct 23 19:23 05.IMG_M_MGAs
drwxrwxr-x. 4 dwzhang dwzhang 104 Oct 23 19:23 06.Human_gut_MGAs
drwxrwxr-x. 4 dwzhang dwzhang  79 Oct 23 19:24 07.Food_MGAs

It works well when I run BiG-MAP individually, but got the aforementioned error when running overall. Any suggestion would be greatly appreciated.

Run BiG-MAP.map.py to see if the TPM results can be generated directly from this step

Hi,
is it possible to run BiG-MAP.map.py in this step to generate TPM comparison results directly and no longer generate RPKM comparison results?
Thank you.
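Until the module offers this directly, TPM values can be derived from the RPKM values of a sample, because TPM is RPKM rescaled so that each sample sums to one million. The sketch below shows that standard conversion on made-up numbers; it is not part of BiG-MAP.

```python
def rpkm_to_tpm(rpkm_values):
    """Rescale the RPKM values of one sample so that they sum to one million."""
    total = sum(rpkm_values)
    if total == 0:
        return [0.0 for _ in rpkm_values]  # nothing mapped in this sample
    return [value / total * 1e6 for value in rpkm_values]


sample_rpkm = [1.0, 3.0]
print(rpkm_to_tpm(sample_rpkm))
```

The conversion has to be applied per sample, over all gene clusters and housekeeping genes that were quantified together.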

Error in BiG-MAP.family.py with --metatranscriptomes option

Hello, I've come across an error when using the --metatranscriptomes option in BiG-MAP.family.py.

Code run:

python3 ../software/BiG-MAP/src/BiG-MAP.family.py -D results/metagenomics/09_BGC/ANTISMASH/prodigal_bins \
-O results/metagenomics/09_BGC/ANTISMASH/big-map/results_family \
-b ../software/BiG-SCAPE \
-pf ../software/BiG-SCAPE \
-p 16 --metatranscriptomes

Where the input -D is the path to a collection of AntiSMASH output directories.

The error message is below:

Running BiG-MAP.family.py
_Extracting fasta files
Adding housekeeping genes
Traceback (most recent call last):
File "../software/BiG-MAP/src/BiG-MAP.family.py", line 1237, in <module>
main()
File "../software/BiG-MAP/src/BiG-MAP.family.py", line 1178, in main
organism_name = organism[organism.index(".") + 1:]
ValueError: substring not found

Could be similar to issue #5?

Thank you for your help!
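The traceback shows that str.index raises ValueError when the name contains no "." at all. A tolerant variant is sketched below using str.partition; the function name is hypothetical, and returning the unchanged name as a fallback is an assumption about what the script should do in that case.

```python
def strip_prefix(organism):
    """Return the text after the first '.', or the whole name if there is none."""
    _head, separator, tail = organism.partition(".")
    return tail if separator else organism


print(strip_prefix("gb.KB291615.1.region001"))
print(strip_prefix("NODE_37978_length_1104"))
```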

2 errors with using BiG-MAP snakemake

Hi, congrats on developing this tool. It is going to make an impact on the microbiome field.

I installed the Snakemake pipeline successfully. As the pipeline ran for the first time, there was no error when the conda environments were set up.
However, the next step with the 'downloader' job produced an error:

rule downloader:
    input: /home/users/astar/bmsi/simck1/scratch/bigmap/acc_list.txt
    output: output/download
    jobid: 0

Activating conda environment: /scratch/users/astar/bmsi/simck1/bigmap/BiG-MAP_snakemake/.snakemake/conda/c477bb20
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.

The 'family' job also produced an error:

rule family:
    input: /home/users/astar/bmsi/simck1/scratch/bigmap/BiG-MAP_tutorial_genomes
    output: output/family/BiG-MAP.pickle
    jobid: 0
    threads: 24

Activating conda environment: /scratch/users/astar/bmsi/simck1/bigmap/BiG-MAP_snakemake/.snakemake/conda/c477bb20
Traceback (most recent call last):
  File "/home/users/astar/bmsi/simck1/scratch/bigmap/src/BiG-MAP.family.py", line 27, in <module>
    from Bio import SeqIO
ModuleNotFoundError: No module named 'Bio'

Thanks for any help!

BiG-MAP.download.py can't download test-set: ftp site is decommissioned

Upon trying to download the test project as provided in the readme, an error occurs claiming that the site cannot be accessed (No such directory ‘sra/sra-instant/reads/ByRun/sra/SRR/SRR013/SRR013549’.) - see below for full error code.
Trying manually with wget gives the same error.
Found online an explanation (https://ncbiinsights.ncbi.nlm.nih.gov/2019/10/17/users-of-the-sra-ftp-site-try-the-sra-toolkit/) - the ncbi ftp site is decommissioned since December 2019.

Solve:
Try incorporating the SRA-tool in BiG-MAP.download.py

Good luck.

Error:
Namespace(acclist='../SRR_Acc_List.txt', outdir='../../scratch/')
--2020-10-19 12:24:37-- ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR013/SRR013549/SRR013549.sra
=> ‘../../scratch/SRR013549.sra’
Resolving ftp-trace.ncbi.nih.gov (ftp-trace.ncbi.nih.gov)... 130.14.250.11, 2607:f220:41e:250::10, 2607:f220:41e:250::13, ...
Connecting to ftp-trace.ncbi.nih.gov (ftp-trace.ncbi.nih.gov)|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /sra/sra-instant/reads/ByRun/sra/SRR/SRR013/SRR013549 ...
No such directory ‘sra/sra-instant/reads/ByRun/sra/SRR/SRR013/SRR013549’.

Traceback (most recent call last):
File "BiG-MAP.download.py", line 194, in <module>
main()
File "BiG-MAP.download.py", line 188, in main
downloadSRA(acc, args.outdir)
File "BiG-MAP.download.py", line 127, in downloadSRA
res_download = subprocess.check_output(cmd_download, shell=True)
File "/opt/miniconda3/lib/python3.6/subprocess.py", line 356, in check_output
**kwargs).stdout
File "/opt/miniconda3/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR013/SRR013549/SRR013549.sra -P ../../scratch/' returned non-zero exit status 8.
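Following the suggestion above to use the SRA Toolkit, the wget call could be replaced by fastq-dump, which can fetch a run by accession. The snippet below only builds and prints the commands; it is a sketch, not the change made in BiG-MAP.download.py, and the resulting file names may differ from the *pass_1* pattern used later in the tutorial.

```python
def fastq_dump_command(accession, outdir):
    """Build a fastq-dump call that writes separate FASTQ files per read."""
    return ["fastq-dump", "--split-files", "--outdir", outdir, accession]


for accession in ["SRR5947837", "SRR5947861"]:
    command = fastq_dump_command(accession, "/usr001/fastq/schirmer/")
    print(" ".join(command))
    # To actually download, run the command, for example with:
    # subprocess.run(command, check=True)
```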

Hi BiG-MAP team,

I am having a problem with running the BiG-MAP analysis module. The problem seems to be related the metadata 'sample type'
Kindly assist.

#my command
python3 BiG-MAP.analyse.py --explore --compare -B /home/hildaha/BiG-MAP/Analysis/results_Mapping/biom-results/BiG-MAP.map.meta.dec.biom -T metagenomic -M Amendment -g CF NCF -O /home/hildaha/BiG-MAP/Analysis/results_analysis/

Attached also is the metadata file I used in mapping module

metadata.txt

#Here is the error I get.
Loading biom file_____________
Traceback (most recent call last):
File "BiG-MAP.analyse.py", line 1229, in <module>
main()
File "BiG-MAP.analyse.py", line 1061, in main
biom_dict, args.metagroup)
File "BiG-MAP.analyse.py", line 153, in get_sample_type
elif sample_type == (sample["metadata"]["SampleType"]).upper():
KeyError: 'SampleType'

BiG-MAP.map.py type error

I'm trying to run BiG-MAP on a set of metatranscriptome samples. The family module is prepared from antiSMASH input of isolate genomes, which I am trying to map my RNA reads to. I had to rename my fastq files based on a previous issue with underscores, so they are named 1_R1.fastq, 1_R2.fastq etc.
The mapping stage begins and bowtie does start mapping, but quickly aborts. The error reads below:

Loading big-map/230322
Loading requirement: python/3.9.4 R/4.0.5 gsl/2.6 capnproto-c++/0.9.1 mash/2.3
fasttree/2.1.11 hmmer/3.4.0 scikit-learn/1.1.2-py39 big-scape/1.1.5
Building a SMALL index
Renaming /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.3.bt2.tmp to /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.3.bt2
Renaming /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.4.bt2.tmp to /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.4.bt2
Renaming /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.1.bt2.tmp to /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.1.bt2
Renaming /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.2.bt2.tmp to /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.2.bt2
Renaming /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.rev.1.bt2.tmp to /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.rev.1.bt2
Renaming /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.rev.2.bt2.tmp to /scratch3/dow17n/RNA_pilot/big-map/big-map_map_output_sorangium/BiG-MAP.GCF_HGF.rev.2.bt2
Fastq-files_______________________
5_R1.fastq
6_R1.fastq
7_R1.fastq
8_R1.fastq
9_R1.fastq

5_R2.fastq
6_R2.fastq
7_R2.fastq
8_R2.fastq
9_R2.fastq
Mapping reads using bowtie
Dealing with sample 5
Traceback (most recent call last):
File "/apps/big-map/230322/bin/BiG-MAP.map.py", line 1129, in <module>
main()
File "/apps/big-map/230322/bin/BiG-MAP.map.py", line 1027, in main
core_RPKM_avg = familycorrect(core_RPKM_avg, BGCF)
File "/apps/big-map/230322/bin/BiG-MAP.map.py", line 731, in familycorrect
for HGF_member in family[GC]:
TypeError: string indices must be integers

The bowtie2 log is as follows:
#5
71632055 reads; of these:
71632055 (100.00%) were paired; of these:
71602250 (99.96%) aligned concordantly 0 times
29011 (0.04%) aligned concordantly exactly 1 time
794 (0.00%) aligned concordantly >1 times
----
71602250 pairs aligned concordantly 0 times; of these:
173 (0.00%) aligned discordantly 1 time
----
71602077 pairs aligned 0 times concordantly or discordantly; of these:
143204154 mates make up the pairs; of these:
143201511 (100.00%) aligned 0 times
2565 (0.00%) aligned exactly 1 time
78 (0.00%) aligned >1 times
0.04% overall alignment rate

Anyone got any bright ideas?
Lachlan

Running python3 BiG-MAP.family.py reports an error BiopythonParserWarning: Attempting to parse malformed locus line:

Hi,
I am running antismash 7.0, big-map analysis on metagenomic data and the program comes out with the following result when I run python3 BiG-MAP.family.py.
Is this the right data to get?
/home/mcs/miniconda3/envs/BiG-MAP_process/lib/python3.6/site-packages/Bio/GenBank/Scanner.py:1401: BiopythonParserWarning: Attempting to parse malformed locus line:
'LOCUS NODE_37978_length_1104_cov_1.148414 1104 bp DNA linear UNK 01-JAN-1980\n'
Found locus 'NODE_37978_length_1104_cov_1.148414' size '1104' residue_type 'DNA'
Some fields may be wrong.
BiopythonParserWarning)
/home/mcs/miniconda3/envs/BiG-MAP_process/lib/python3.6/site-packages/Bio/GenBank/Scanner.py:1401: BiopythonParserWarning: Attempting to parse malformed locus line:
'LOCUS NODE_38054_length_1101_cov_0.890144 1101 bp DNA linear UNK 01-JAN-1980\n'
Found locus 'NODE_38054_length_1101_cov_0.890144' size '1101' residue_type 'DNA'
Some fields may be wrong.
BiopythonParserWarning)

thanks

Problem with reading results into pd.dataframe during BiG-MAP.map

Hello,

I'm running BiG-MAP.map on a collection of ~140 metagenome samples. During the section where it prints out "Dealing with sample 32712-5#8", there were three times when it said "Unable to run bowtie" after the "Dealing with..." message.

Then, I got this error:

Traceback (most recent call last):
  File "/home/phil/programs/BiG-MAP/src/BiG-MAP.map.py", line 1129, in <module>
    main()
  File "/home/phil/programs/BiG-MAP/src/BiG-MAP.map.py", line 1043, in main
    df = pd.DataFrame(results)
  File "/home/phil/miniconda3/envs/BiG-MAP_process/lib/python3.6/site-packages/pandas/core/frame.py", line 330, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/home/phil/miniconda3/envs/BiG-MAP_process/lib/python3.6/site-packages/pandas/core/frame.py", line 461, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/home/phil/miniconda3/envs/BiG-MAP_process/lib/python3.6/site-packages/pandas/core/frame.py", line 6163, in _arrays_to_mgr
    index = extract_index(arrays)
  File "/home/phil/miniconda3/envs/BiG-MAP_process/lib/python3.6/site-packages/pandas/core/frame.py", line 6211, in extract_index
    raise ValueError('arrays must all be same length')
ValueError: arrays must all be same length

When I removed the three samples that seemed to spark the "Unable to run bowtie" message, the program executed successfully.
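The ValueError is what pandas raises when a dictionary of lists with unequal lengths is turned into a DataFrame, which fits the observation that samples without bowtie output trigger it. The sketch below finds such ragged entries before the DataFrame is built; the layout of the results dictionary is an assumption and the helper is not part of BiG-MAP.

```python
from collections import Counter


def find_ragged_columns(results):
    """Return {key: length} for lists whose length differs from the most common one."""
    lengths = {key: len(values) for key, values in results.items()}
    if not lengths:
        return {}
    expected = Counter(lengths.values()).most_common(1)[0][0]
    return {key: length for key, length in lengths.items() if length != expected}


# Made-up example: the third sample has a missing value.
results = {
    "sample_1": [1.2, 0.0, 3.4],
    "sample_2": [0.5, 0.1, 2.2],
    "sample_3": [0.7, 0.3],
}
print(find_ragged_columns(results))
```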

Interpreting big-map outputs

Hello,

I've run BiG-MAP to identify the gene clusters in some microbiome samples comparing health and disease.

There seem to be interesting patterns in my dataset of particular gene cluster types being associated with health or disease:

Do you have any suggestions on approaches to test whether these patterns are statistically significant?

Each gene cluster example is statistically significantly associated with e.g. health. For example, gb.KB291615.1.region001.GC_DNA..Entryname.acetate2butyrate..OS.Clostridium_celatum_DSM_1785_genomic_scaffold..SMASHregion.region001..NR.1 was associated with health.

But then, what about at the gene cluster level? Could just look at whether the counts in the screenshot above are significant, but that seems to be discarding a lot of information.

I was also thinking about whether the RPKMs could be "summed" at the "gene cluster type" level (e.g. acetate2butyrate), and compared between health and disease.

Any thoughts welcome!

Thanks,

Phil

I have two countries in my study, so I've narrowed down the list of hits by filtering for only pathways that are consistently associated with health/disease in both coun

normalization to control group for log2 fold change

Hi, thanks for the wonderful tool.
In the differential abundance analysis in BiG-MAP.analyse.py, it may be nice to be able to specify which is the control group so that the log2 fold change is correctly normalized to the control group.

BiG-MAP.analyse- ERROR reading BIOM file

Hi,

Thanks for modifying the BiG-MAP.family script. I created the GCF files and mapped it via the BiG-MAP.map.py script. When I run the BiG-MAP.analyse.py for plotting, it shows error while loading the biom file regarding the SampleType column.

Here is the shell output:

python3 /home/mcs/soft/BiG-MAP/src/BiG-MAP.analyse.py --explore --compare -B 3.Function/antismash/big_map_antismash/mapping/biom-results/BiG-MAP.mapcore.metacore.dec.biom -T Metagenomic -M Status -O 3.Function/antismash/big_map_antismash/results/core_genes
Loading biom file_____________
Traceback (most recent call last):
File "/home/mcs/soft/BiG-MAP/src/BiG-MAP.analyse.py", line 1173, in <module>
main()
File "/home/mcs/soft/BiG-MAP/src/BiG-MAP.analyse.py", line 1037, in main
biom_dict, args.metagroup)
File "/home/mcs/soft/BiG-MAP/src/BiG-MAP.analyse.py", line 144, in get_sample_type
elif sample_type == (sample["metadata"]["SampleType"]).upper():
TypeError: 'NoneType' object is not subscriptable

Here is the output biom file and the metadata file used to create the biom file
metadata.txt
BiG-MAP.mapcore.metacore.dec.biom.txt
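Both this TypeError and a KeyError on 'SampleType' point at sample metadata that is absent or incomplete in the BIOM file. The check below is a sketch that lists the affected samples in a JSON-formatted BIOM table; it assumes samples are stored under "columns" with "id" and "metadata" entries, as in the BIOM 1.0 JSON layout, and the helper name is hypothetical.

```python
import json


def samples_missing_field(biom_dict, field="SampleType"):
    """Return the ids of samples whose metadata is absent or lacks the field."""
    missing = []
    for sample in biom_dict.get("columns", []):
        metadata = sample.get("metadata") or {}  # metadata may be null
        if field not in metadata:
            missing.append(sample["id"])
    return missing


# Made-up table: one complete sample, one without metadata, one without SampleType.
example = {
    "columns": [
        {"id": "S1", "metadata": {"SampleType": "METAGENOMIC", "Status": "UC"}},
        {"id": "S2", "metadata": None},
        {"id": "S3", "metadata": {"Status": "CD"}},
    ]
}

# With real output, load the JSON-formatted biom file instead:
# with open("BiG-MAP.mapcore.metacore.dec.biom") as handle:
#     example = json.load(handle)

print(samples_missing_field(example))
```

If every sample is listed, the metadata file passed to the -b flag during mapping is the first thing to check, in particular its '#' header row and the SampleType column.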

metagenome_contigs

Hi there,

Thank you for the awesome tool. I am analysing my shotgun metagenome dataset and have a couple of questions?

As I constructed metagenome assembled genomes from the dataset, do you have a recommendation of which MAGs that I should use for BiG-MAP regarding to quality and completeness (HG MAG - 90% with less than 5% contaminants or medium quality is acceptable?)
Can we use metagenome contigs instead of genome as an input for big-map?

Best wishes

Wisnu

environment dependencies (yaml-files) are not downloadable

The yaml files are a specific dump of dependencies which renders some packages not to be found.

Error:
ResolvePackageNotFound:
- openssl==1.1.1=h7b6447c_0

Solve:
Provide top-level dependencies.

Make output directory if it doesn't already exist

Hi,

Just a very small thing - would be good if bigmap.map made the output directory if it doesn't already exist.

Thanks,

Phil
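In Python this is a single call to os.makedirs. The sketch below shows the idea; the helper name is hypothetical and this is not code from BiG-MAP.map.py.

```python
import os
import tempfile


def ensure_outdir(path):
    """Create the output directory (and any parents) if needed; return its absolute path."""
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return os.path.abspath(path)


# Demonstration in a temporary location.
with tempfile.TemporaryDirectory() as workdir:
    outdir = ensure_outdir(os.path.join(workdir, "results_mapping", "biom-results"))
    print(os.path.isdir(outdir))
```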

ERROR: could not open "mash_sketch.msh" for reading.

Hi, I am running the command:

python3 BiG-MAP/src/BiG-MAP.family.py \
	 -D gutsmash_out \
	 -O BiGMAP_family \
	 -p $CPUS \
	 -b BiG-MAP/BiG-SCAPE \
	 -pf BiG-MAP/BiG-SCAPE \
	 --metatranscriptomes

___________Extracting fasta files__________
ERROR: could not open "BiGMAP_family/mash_sketch.msh" for reading.
_________Adding housekeeping genes_________
[]
________Preparing BiG-SCAPE input__________
__________Running BiG-SCAPE________________

any ideas what the problem is with mash_sketch.msh?
Thanks!

Add one in, have to re-run BiG-MAP.map.py?

Hello,

I have an existing big-map analysis of ~150 metagenomes, and I want to add 10 more metagenomes into the analysis

Do I have to run BiG-MAP.map.py with all 160 metagenomes in order to create a unified biom file for analysis with big-map.analyse?

Or is there a way to combine the bioms from different runs into a single file?

Thanks,

Phil

stringent protein similarity results in less clusters

Hi,

I would like to set a stringent protein similarity to deduplicate my BGCs. I set the parameter -tg from 0.5 to 0.9, which should be expected to generate less similar clusters and more representative cluster sequences. However, the result is the opposite. I check the code of script BiG-MAP.family.py and found the overlap was defined as overlap = 1-(float(no1)/float(no2)), making me quite confused. The last column from MASH distance file refers to "Matching-hashes", I think "(float(no1)/float(no2)" should be the similarity between two clusters? Or do I misunderstand anything?

$ grep -c ">" 0.5/BiG-MAP.GCF.fna
40
$ grep -c ">" 0.9/BiG-MAP.GCF.fna
32
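
To make the two quantities in the question concrete, here is a small sketch that parses a "shared/total" field of the kind found in the last column of Mash distance output and computes both the ratio and one minus the ratio. The field value is made up, and how -tg is compared against either number is not shown in this thread.

```python
def shared_fraction(matching_hashes: str) -> float:
    """Parse a 'shared/total' field such as '900/1000' into a fraction."""
    shared, total = matching_hashes.split("/")
    return float(shared) / float(total)

field = "900/1000"                   # made-up value for illustration
similarity = shared_fraction(field)  # fraction of sketch hashes the two clusters share
quoted = 1 - similarity              # the expression quoted from BiG-MAP.family.py

print(round(similarity, 3), round(quoted, 3))  # 0.9 0.1
```

If a cutoff were applied as "value at or below -tg" to the second quantity, raising -tg would merge more clusters rather than fewer. Whether that is what the script does would need to be confirmed against the source.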

Using BiG-MAP.map.py on metatranscriptome data

Hi BiG-MAP team,

Thanks for this great tool! I'm trying to map metatranscriptome assemblies onto previously generated BGCs with BiG-MAP.map.py. I have two questions:

1) My metatranscriptome assemblies are composed of contigs within a fasta file. Is there a way to display in the output which contig maps to which GCF instead of only seeing the total reads mapped onto the GCF by the entire metatranscriptome sample?
2) I would like to use a fasta file input instead of a fastq file input. Is this possible within your package?

Thanks a lot!
Zinka

make_sketch() missing required arguments: 'kmer' and 'sketch'

Hello - I've just downloaded the tool and am getting started. Tried a test run but ran into the error below. Adding -k 16 -s 5000 did not solve the issue:

python BiG-MAP/src/BiG-MAP.family.py -p 8 -D bgcs -O test

_Extracting fasta files
Traceback (most recent call last):
  File "BiG-MAP/src/BiG-MAP.family.py", line 1219, in <module>
    main()
  File "BiG-MAP/src/BiG-MAP.family.py", line 1061, in main
    make_sketch(args.outdir + os.sep)
TypeError: make_sketch() missing 2 required positional arguments: 'kmer' and 'sketch'
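
The traceback shows the call site passing only the output directory, so adding -k and -s on the command line cannot help until the call itself forwards those values. A minimal reproduction follows, assuming a three-parameter signature inferred from the error message; the function body is a hypothetical stand-in, not the BiG-MAP implementation.

```python
def make_sketch(outdir, kmer, sketch):
    """Hypothetical stand-in; the real function body is not shown in this thread."""
    return (outdir, kmer, sketch)

try:
    make_sketch("test/")  # mirrors the one-argument call shown in the traceback
except TypeError as err:
    message = str(err)
    print(message)

# Forwarding both values, here the ones tried with -k 16 -s 5000, satisfies the signature
result = make_sketch("test/", 16, 5000)
print(result)
```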

BiG-MAP.map.py Biom error

Hi BiG-MAP developer

Thanks for developing the tool for the community.

I get an error when running BiG-MAP.map on both metagenome and metatranscriptome data. One example is shown below. Any idea how to fix it?

python3 $HOME/BiG-MAP/src/BiG-MAP.map.py -b $HOME/bigmap_input/mtx_S.txt -U /scratch1/users/MTX_filtered/*fastq.gz -O /scratch1/users/bigmap_output/results_mapping_MTX_hmg_ref_cf02_S/ -F $HOME/bigmap_output/results_family_genomes_cf02_mtx/ -th 64 -a True

Adding metadeta to biom and converting files into json format
Traceback (most recent call last):
  File "/usr/users/BiG-MAP/src/BiG-MAP.map.py", line 1124, in <module>
    main()
  File "/usr/users/BiG-MAP/src/BiG-MAP.map.py", line 1088, in main
    biom_out1 = decoratebiom(biomfile, args.outdir, args.biom_output)
  File "/usr/users/BiG-MAP/src/BiG-MAP.map.py", line 780, in decoratebiom
    res_add = subprocess.check_output(cmd_sample, shell=True)
  File "/usr/users/.conda/envs/BiG-MAP_process/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/usr/users/.conda/envs/BiG-MAP_process/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'biom add-metadata -i /scratch1/users/bigmap_output/results_mapping_MTX_hmg_ref_cf02_S/BiG-MAP.map.biom -o /scratch1/users/bigmap_output/results_mapping_MTX_hmg_ref_cf02_S/BiG-MAP.map.meta.biom -m /usr/users/bigmap_input/mtx_S.txt --output-as-json' returned non-zero exit status 1
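
The traceback reports only the exit status of the biom command, not what the command printed. Running the quoted biom add-metadata command by hand, or capturing its stderr as in the sketch below, would show the underlying message. The command used here is a stand-in that fails on purpose, not the biom invocation.

```python
import subprocess
import sys

# Stand-in command that writes a reason to stderr and exits with status 1
cmd = [sys.executable, "-c",
       "import sys; sys.stderr.write('reason for failure'); sys.exit(1)"]

# stdout/stderr pipes plus universal_newlines also work on Python 3.6
proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                      universal_newlines=True)
if proc.returncode != 0:
    print("exit status:", proc.returncode)
    print("stderr:", proc.stderr)
```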

fastq naming pattern hard coded?

Hello,

Thanks for developing this tool for the community. I'm running BiG-MAP.map on some data and I had a problem getting the fastq sample names from the sample sheet to match up with the fastq paths given on the command line.

I eventually solved this problem by looking at the source code and seeing that you get the sample name from the fastq file name by splitting on underscores and taking the zeroth item. Unfortunately my fastqs have underscores in the name, like this: 32580_8#24_R2.fastq.gz. I could get BiG-MAP.map to run by renaming my fastqs to be like 32580-8#24_R2.fastq.gz, but perhaps some other users might not dig in this far?

I know it's a nightmare dealing with all the different ways that people write the sample name in their fastq file names. Perhaps you could include the path to the fastqs as an entry in the sample sheet?

Best,

Phil
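
The behaviour described above can be illustrated with the file name from the post. The second approach is only a sketch of one alternative: it assumes read tags of the form _R1/_R2 or _1/_2 directly before the fastq extension, which will not cover every naming scheme.

```python
import re

filename = "32580_8#24_R2.fastq.gz"

# Behaviour described in the post: split on underscores and keep the first piece
naive = filename.split("_")[0]

# Alternative sketch: strip only a trailing read tag and the fastq extension
read_suffix = re.compile(r"_R?[12]\.f(?:ast)?q(?:\.gz)?$")
sample = read_suffix.sub("", filename)

print(naive)   # 32580
print(sample)  # 32580_8#24
```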

error while running BiG-MAP.family.py: ValueError: substring not found

Hi,

After successfully installing BiG-MAP and BiG-SCAPE, I tried to run it on the antiSMASH output for my metagenomic contigs but ran into the following error:

python3 /home/mcs/soft/BiG-MAP/src/BiG-MAP.family.py -p 30 -D 3.Function/antismash/ -b /home/mcs/soft/BiG-SCAPE/bigscape.py -pf /home/mcs/soft/BiG-SCAPE/ -O 3.Function/antismash/big_map_antismash
_Extracting fasta files
Traceback (most recent call last):
  File "/home/mcs/soft/BiG-MAP/src/BiG-MAP.family.py", line 1223, in <module>
    main()
  File "/home/mcs/soft/BiG-MAP/src/BiG-MAP.family.py", line 1034, in main
    prot_file, orgID, fasta_header = writefasta(AAseq, "GC_PROT", GC, organism, f, args.outdir)
  File "/home/mcs/soft/BiG-MAP/src/BiG-MAP.family.py", line 252, in writefasta
    organism = organism[organism.index('_') + 1:]
ValueError: substring not found

For the antiSMASH run I used --cb-general --cb-knownclusters --cb-subclusters --asf --pfam2go --genefinding-tool none --genefinding-gff3 dir/containing/gff
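
The failing line uses str.index, which raises ValueError when the string contains no underscore. The sketch below reproduces that with a made-up identifier and shows a tolerant alternative based on str.partition. Whether leaving such names unchanged is the right behaviour for BiG-MAP is an assumption, and the helper name is hypothetical.

```python
name = "contig00001"  # made-up identifier without an underscore

try:
    trimmed = name[name.index("_") + 1:]  # the expression from the traceback
except ValueError as err:
    error_text = str(err)
    print(error_text)

def after_first_underscore(text: str) -> str:
    """Return the part after the first underscore, or the input unchanged if there is none."""
    _, sep, rest = text.partition("_")
    return rest if sep else text

print(after_first_underscore(name))          # contig00001
print(after_first_underscore("abc_def_gh"))  # def_gh
```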
