
metaworks's Introduction

MetaWorks: A Multi-Marker Metabarcode Pipeline


MetaWorks generates exact sequence variants and/or operational taxonomic units and taxonomically assigns them. It supports a number of popular metabarcoding markers: COI, rbcL, ITS, SSU rRNA, and 12S SSU mtDNA. See the MetaWorks website for quickstart guides, additional pipeline details, FAQs, and a step-by-step tutorial that includes installation.

Installation

MetaWorks runs at the command line on Linux x86-64 in a conda environment (provided).

Instructions for installing conda (if not already installed).

Instructions for installing ORFfinder if pseudogene-filtering will be run (optional).

Instructions for installing MetaWorks and activating the MetaWorks conda environment.

Instructions on where to find custom-trained classifiers that can be used with MetaWorks.
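
As a minimal sketch of the steps above, assuming the release layout and environment name that appear elsewhere on this page (an unpacked MetaWorks1.12.0 directory, an environment.yml file, and a MetaWorks_v1.12.0 environment); the exact names depend on the release you download:

    # after downloading a release archive from the GitHub releases page
    unzip MetaWorks1.12.0.zip          # archive name is illustrative; adjust to your version
    cd MetaWorks1.12.0

    # create and activate the provided conda environment
    conda env create -f environment.yml
    conda activate MetaWorks_v1.12.0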

Documentation

A quickstart guide to various workflows.

A detailed explanation of MetaWorks workflows.

A tutorial provides step-by-step instructions on how to prepare your environment and get started quickly using the provided test data.
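
For orientation, a typical tutorial run on the provided COI test data looks something like the following (the snakefile and config names are the ones reported by users in the issues below; treat this as an example invocation rather than the authoritative one):

    # from the unpacked MetaWorks directory, with the conda environment active
    conda activate MetaWorks_v1.12.0
    snakemake --jobs 2 --snakefile snakefile_ESV --configfile config_testing_COI_data.yaml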

NEW: Answers to some frequently asked questions (FAQs) about MetaWorks and data analysis have been added to the MetaWorks website.

How to cite

If you use this dataflow or any of the provided scripts, please cite the MetaWorks paper:
Porter, T. M., & Hajibabaei, M. (2022). MetaWorks: A flexible, scalable bioinformatic pipeline for high-throughput multi-marker biodiversity assessments. PLOS ONE, 17(9), e0274260. doi: 10.1371/journal.pone.0274260

You can also cite this repository: Teresita M. Porter. (2020, June 25). MetaWorks: A Multi-Marker Metabarcode Pipeline (Version v1.10.0). Zenodo. http://doi.org/10.5281/zenodo.4741407

If you use this dataflow for making COI taxonomic assignments, please cite the COI classifier publication:
Porter, T. M., & Hajibabaei, M. (2018). Automated high throughput animal CO1 metabarcode classification. Scientific Reports, 8, 4226.

If you use the pseudogene filtering methods, please cite the pseudogene publication: Porter, T.M., & Hajibabaei, M. (2021). Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets. BMC Bioinformatics, 22: 256.

If you use the RDP classifier, please cite the publication:
Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267. doi:10.1128/AEM.00062-07

Last updated: September 30, 2022

metaworks's People

Contributors

terrimporter


metaworks's Issues

Stuck during MetaWorks_v1.12.0 Tutorial

Hi! I’m also experiencing difficulty with the MetaWorks_v1.12.0 tutorial similar to the following two posts:

  1. #5
  2. #8

I tried following the solutions posted in those two threads but I haven't been able to figure it out and the tutorial still isn't working. I'm new to Linux and MetaWorks. I'm using Ubuntu with WSL2. My laptop has 10 cores and 16 GB of RAM installed.

A. When I typed echo $SHELL during the tutorial I got

/bin/bash

so I ran conda init bash to initialize conda for that step.

B. When I go to the directory containing the rRNAClassifier.properties file (in the Linux terminal) and enter pwd I get:

/home/ocean/MetaWorks1.12.0/mydata

So I updated my path in config_testing_COI_data.yaml to be:

t: "/home/ocean/MetaWorks1.12.0/mydata/rRNAClassifier.properties"

C. After following the rest of the steps in the tutorial I tried to run snakemake --jobs 2 --snakefile snakefile_ESV --configfile config_testing_COI_data.yaml but I got the same "java heap space" memory error as this post (1.#5) so I also tried changing -Xmx8g to -Xmx16g in config_testing_COI_data.yaml.

D. Now when I run snakemake --jobs 2 --snakefile snakefile_ESV --configfile config_testing_COI_data.yaml with -Xmx16g I get the following message:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job stats:
job                                    count    min threads    max threads
add_good_orf_sequences_to_taxonomy         1              1              1
all                                        1              1              1
consolidate_orfs                           1              1              1
filter_ESV_table                           1              1              1
filter_rdp                                 1              1              1
get_orfs_aa                                1              1              1
get_orfs_nt                                1              1              1
get_results                                1              1              1
hmmscan                                    1              1              1
subset_ESV_sequences_by_taxon              1              1              1
subset_taxonomy_by_taxon1                  1              1              1
taxonomic_assignment                       1              1              1
total                                     12              1              1

Select jobs to execute...

[Fri Aug 25 18:49:48 2023]
rule taxonomic_assignment:
input: COI/cat.denoised.nonchimeras
output: COI/rdp.out.tmp
jobid: 37
reason: Missing output files: COI/rdp.out.tmp
resources: tmpdir=/tmp

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Select jobs to execute...
/home/ocean/miniconda3/envs/MetaWorks_v1.12.0/bin/rdp_classifier: line 57: 84 Killed /home/ocean/miniconda3/envs/MetaWorks_v1.12.0/bin/java -Xmx16g -jar /home/ocean/miniconda3/envs/MetaWorks_v1.12.0/share/rdp_classifier-2.13-1/classifier.jar classify -t /home/ocean/MetaWorks1.12.0/mydata/rRNAClassifier.properties -o COI/rdp.out.tmp COI/cat.denoised.nonchimeras
[Fri Aug 25 18:50:15 2023]
Error in rule taxonomic_assignment:
jobid: 0
input: COI/cat.denoised.nonchimeras
output: COI/rdp.out.tmp

RuleException:
CalledProcessError in file /home/ocean/MetaWorks1.12.0/snakefile_ESV, line 343:
Command 'set -euo pipefail; rdp_classifier -Xmx16g classify -t /home/ocean/MetaWorks1.12.0/mydata/rRNAClassifier.properties -o COI/rdp.out.tmp COI/cat.denoised.nonchimeras' returned non-zero exit status 137.
File "/home/ocean/MetaWorks1.12.0/snakefile_ESV", line 343, in __rule_taxonomic_assignment
File "/home/ocean/miniconda3/envs/MetaWorks_v1.12.0/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job taxonomic_assignment since they might be corrupted:
COI/rdp.out.tmp
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-08-25T184946.566586.snakemake.log
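
For context: "Killed" together with exit status 137 (128 + signal 9, SIGKILL) usually means the Linux kernel's out-of-memory killer terminated the Java process, so raising -Xmx on a 16 GB laptop, where WSL2 by default does not expose all of the host RAM to Linux, can make the failure more likely rather than less. A rough sketch of how to check this, assuming standard utilities inside the WSL2/Ubuntu shell:

    # how much memory does the WSL2/Ubuntu environment actually see?
    free -h
    # check whether the kernel OOM killer ended the classifier job (may require sudo)
    sudo dmesg | grep -iE 'killed process|out of memory'
    # choose an -Xmx value comfortably below the "available" figure reported by free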

Can MetaWorks Start with Contigs?

Hi,
Can MetaWorks start with contigs? I know BOLDigger can start with contigs, but only with 50 sequences at a time. Running MetaWorks locally could let this be done within a pipeline.

Can MetaWorks accommodate merged Illumina reads?

Hi there,

I have reads that are demultiplexed but the R1 and R2 reads have already been merged. Our script merges them before demultiplexing because the primers are dual indexed and both barcodes are required to assign the sequence to the correct sample.

Is it okay to run merged demultiplexed reads through the single read ESV workflow? Or is there a better way to process this type of data? Will the presence of both the forward and reverse primers on the sequences mess up any of the downstream steps?

Thank you kindly for your help.

Difficulty with MetaWorks 1.12.0 Tutorial

Hi there!

I am also experiencing difficulty with the Tutorial. I am new to coding and MetaWorks. I have edited the path to the classifier in the config_testing_COI_data.yaml file, but am still getting the following error:

Error in rule taxonomic_assignment:
jobid: 0
input: COI/cat.denoised.nonchimeras
output: COI/rdp.out.tmp

RuleException:
CalledProcessError in file /home/innovation-admin/MetaWorks1.12.0/snakefile_ESV, line 343:
Command 'set -euo pipefail; rdp_classifier -Xmx8g classify -t /home/innovation-admin/MetaWorks1.12.0/mydata/rRNAClassifier.properties -o COI/rdp.out.tmp COI/cat.denoised.nonchimeras' returned non-zero exit status 1.
File "/home/innovation-admin/MetaWorks1.12.0/snakefile_ESV", line 343, in __rule_taxonomic_assignment
File "/home/innovation-admin/miniconda3/envs/MetaWorks_v1.12.0/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job taxonomic_assignment since they might be corrupted:
COI/rdp.out.tmp

Any suggestions on what might be going wrong?
Thank you for your help.

Originally posted by @Jess-Schultz in #5 (comment)
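
As a first check for this kind of failure (exit status 1 rather than an out-of-memory kill), it can help to confirm that the trained classifier files the config points to actually exist and are readable. A minimal sketch using the path from the log above; rRNAClassifier.properties normally sits alongside the trained classifier files it references:

    # verify the properties file exists at the path set in config_testing_COI_data.yaml
    ls -l /home/innovation-admin/MetaWorks1.12.0/mydata/rRNAClassifier.properties
    # list the rest of the trained-classifier directory
    ls /home/innovation-admin/MetaWorks1.12.0/mydata/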

Issue with OTU making

Hi There
Not really sure why I got this error when running the clustering process. Can somebody please shed some light on this?

Thanks
Aji

RuleException in line 449 of /data/home/awahyu/MetaWorks1.9.4/snakefile_OTU:
NameError: The name 'SED' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
File "/data/home/awahyu/miniconda3_2018/envs/MetaWorks_v1.9.4/lib/python3.8/site-packages/snakemake/executors/init.py", line 136, in run_jobs
File "/data/home/awahyu/miniconda3_2018/envs/MetaWorks_v1.9.4/lib/python3.8/site-packages/snakemake/executors/init.py", line 441, in run
File "/data/home/awahyu/miniconda3_2018/envs/MetaWorks_v1.9.4/lib/python3.8/site-packages/snakemake/executors/init.py", line 230, in _run
File "/data/home/awahyu/miniconda3_2018/envs/MetaWorks_v1.9.4/lib/python3.8/site-packages/snakemake/executors/init.py", line 156, in _run
File "/data/home/awahyu/miniconda3_2018/envs/MetaWorks_v1.9.4/lib/python3.8/site-packages/snakemake/executors/init.py", line 162, in printjob

Problem with column "SampleName" in results.csv

Hello! I’m having a problem with the column “SampleName” not working in the results.csv file after running MetaWorks.

  • When I ran the pipeline (with the Default ESV Workflow), the results.csv "SampleName" column would contain sample names in the format "S1_PrimerName", "S2_PrimerName"… corresponding to the {sample} wildcards from the input *.fastq.gz files plus the primer name from the adapters_anchored.fasta file.
  • But now when I run it, all samples are named "results_PrimerName" or "MetaWorks1_PrimerName", and there's no way for me to see or organize the data from the different samples.
  • I've noticed this happening in results.csv while using the Default ESV Workflow with the testing data, with multiple classifiers including 16S_vertebrates, COI, and 12S_vertebrates, and with both MetaWorks1.12.0 and MetaWorks1.13.0. Attached is an example of when "SampleName" was not working.

I'm having difficulty troubleshooting this and can't figure out if something is wrong with my file names or somewhere else. Any help or advice would be appreciated!

[Screenshot attached (2024-04-10): results.csv showing the SampleName issue]
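
When troubleshooting this, a quick way to see exactly which values ended up in the SampleName column is to pull its unique values out of results.csv. A minimal sketch that assumes SampleName is the first comma-separated column; adjust the field number if it is not:

    # list the distinct SampleName values in the results file
    cut -d, -f1 results.csv | sort -u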

Is there a version of this pipeline we can use on Windows?

Hi!
I really want to try this pipeline with my sequences, but I only have a laptop, so I can't install Linux on it. I was wondering if there is any version of this pipeline that we can use on Windows? I know we can install Python, Conda & Miniconda on Windows.
Thanks a lot for your response!
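
One route other users on this page report using is the Windows Subsystem for Linux (WSL2) rather than a native Windows port. A minimal sketch of setting up Ubuntu under WSL2; the first command is run from an administrator PowerShell or Command Prompt on Windows 10/11, after which conda and MetaWorks are installed inside the Ubuntu shell as on any other Linux system:

    # run from an administrator PowerShell/Command Prompt on Windows
    wsl --install -d Ubuntu
    # after the reboot, open the Ubuntu shell and proceed with the Linux installation steps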

vsearch versioning

The version of vsearch included in the MetaWorks_v1.11.0 conda environment (v2.15.2) does not implement --fastx_uniques, which is used by the dereplicate rule in the snakefile_ESV file. This option was not implemented in vsearch until version 2.20.0 (see torognes/vsearch#348).
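
A quick way to confirm which vsearch is active inside the MetaWorks environment and whether it recognizes the newer option; this is a diagnostic sketch, not a documented MetaWorks fix:

    # check the vsearch version bundled in the activated conda environment
    vsearch --version
    # see whether this build lists the newer dereplication option (or only the older --derep_fulllength)
    vsearch --help 2>&1 | grep -E 'fastx_uniques|derep_fulllength'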

No rRNAClassifier.properties files for 12SvertebrateClassifier and 16SvertebrateClassifier?

Hello! When I downloaded new classifiers for MetaWorks like the 18SClassifier (https://github.com/terrimporter/18SClassifier) from the list of classifiers (https://terrimporter.github.io/MetaWorksSite/#classifier_table) I was able to successfully follow the quickstart guidelines and locate its rRNAClassifier.properties file in order to run the new classifier with my own practice data.


However, when I tried to download both the 12SvertebrateClassifier (https://github.com/terrimporter/12SvertebrateClassifier) and the 16SvertebrateClassifier (https://github.com/terrimporter/16SvertebrateClassifier), there was no quick start guide on the GitHub page. When I downloaded and unzipped the .zip source code file, the .tar.gz source code file, and the mydata_training.tar.gz files, I was unable to locate the rRNAClassifier.properties file.


Additional guidance on how to use the 12SvertebrateClassifier and 16SvertebrateClassifier with MetaWorks and the RDP classifier would be appreciated! I'm new to MetaWorks and Linux, and I don't understand why setting up these two classifiers is different from setting up the other MetaWorks classifiers I've tried so far. Thank you!
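
When a downloaded classifier archive does not obviously contain the properties file, a recursive search after unpacking can at least confirm whether it is present anywhere in the download. A minimal sketch, with the archive name taken from the post above:

    # unpack the trained-classifier archive and search it for the properties file
    tar -xzf mydata_training.tar.gz
    find . -name 'rRNAClassifier.properties'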

snakemake version

Hi,
Does the snakemake version matter for running MetaWorks (e.g. v1.11.2)? That is, can I use snakemake v6.7.0? Later versions of snakemake cause issues when running through a Docker container.

Cheers,
Sten
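
If a particular snakemake version turns out to matter, it can be checked and pinned inside the MetaWorks conda environment. A rough sketch; v6.7.0 is simply the version asked about above, not a version the MetaWorks documentation on this page endorses:

    # check the snakemake version currently active in the environment
    snakemake --version
    # pin a specific version if needed
    conda install -c conda-forge -c bioconda snakemake=6.7.0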

Issues in MetaWorks 1.10.0 tutorial

Dear all,

I am a beginner user of MetaWorks. I was working through the tutorial provided on the webpage (https://terrimporter.github.io/MetaWorksSite/tutorial/), but I found some issues that I could not fix on my own. I hope you can help me.

  1. When I enter the following step in my Linux console:
    "t: "/path/to/CO1Classifier/v4/mydata_trained/rRNAClassifier.properties""
    I get "t:: order not found" (command not found). I do not know how to fix this issue (or what its source is).

  2. I tried to run the following step, because I am working with the training data and I guessed it could work without specifying a special path for the classifier:
    "snakemake --jobs 2 --snakefile snakefile --configfile config_testing_COI_data.yaml"
    The console said "Error: Snakefile "snakefile" not found", so the program is not working. Which directory do I need to be in to run this command?

  3. The link for downloading the most recent version of the program is not working, so I worked with files downloaded manually from https://github.com/terrimporter/MetaWorks/releases/tag/v1.10.0 (I downloaded "MetaWorks1.10.0.zip") and everything seemed to be working fine. Is that correct?

I hope you could help me to understand the usage of this interesting program.

Regards.
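
Two details that these questions hinge on, offered as a sketch rather than official guidance: the t: line is a setting inside the YAML config file, not a command to type at the console, and snakemake must be launched from the directory that contains the snakefile (or be given its path). Illustrative commands, with the classifier path being a placeholder exactly as in the tutorial:

    # 1. the classifier path belongs in the config file, e.g. open it in an editor:
    nano config_testing_COI_data.yaml
    #    and set:  t: "/path/to/CO1Classifier/v4/mydata_trained/rRNAClassifier.properties"

    # 2. run snakemake from inside the unpacked MetaWorks directory
    cd ~/MetaWorks1.10.0
    ls snakefile*        # confirm the snakefile name shipped with this release
    snakemake --jobs 2 --snakefile snakefile --configfile config_testing_COI_data.yaml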

Create conda environment with environment.yml problem: ResolvePackageNotFound

The problem might be naive, but I can't seem to find a solution online. I've tried conda env create with both the 1.9.6 and 1.10.0 versions of MetaWorks, and a ResolvePackageNotFound error always occurs.

Solving environment: failed

ResolvePackageNotFound: 
  - giflib==5.2.1=h36c2ea0_2
  - biopython==1.78=py38h7b6447c_0
  - isa-l==2.30.0=h7f98852_1
.
.

Since it looked like an issue with build numbers, I manually edited the package names by removing everything after = in the environment.yml files, and the ResolvePackageNotFound errors were reduced to these particular packages:

ResolvePackageNotFound:
- libstdcxx-ng
- alsa-lib
- _openmp_mutex
- libgomp
- libgcc-ng
- seqprep
- glibc214

I then moved these names under a - pip: line that I added (in the environment.yml files):

- dependencies:
  - OTHER_PACKAGES...
  - pip:
    - alsa-lib
    - _openmp_mutex
    - libgomp
    - seqprep
    - glibc214
    - libstdcxx-ng
    - libgcc-ng

The conda environment was successfully created, but these packages were never installed in the environment; instead I got

CondaEnvException: Pip failed

which caused errors later on.

In order to install these packages, after anaconda search -t conda seqprep and anaconda show bioconda/seqprep, I've tried either conda install --channel https://conda.anaconda.org/bioconda seqprep or conda install -c bioconda seqprep in the environment, but the error remained.

PackagesNotFoundError: The following packages are not available from current channels:

  - libgcc-ng
  - _openmp_mutex
  - glibc214
  - libgomp
  - seqprep
  - alsa-lib
  - libstdcxx-ng

Current channels:

  - https://conda.anaconda.org/conda-forge/osx-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/osx-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch

But seqprep does seem to be available on that channel.

The same installation problems appeared for the other six ResolvePackageNotFound packages.

My conda version is 4.13.0 and I am using macOS Monterey 12.4.
Maybe it's because the packages are only available for linux-64 on that channel?
Please let me know how to fix this problem. Thank you!
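
That guess is consistent with the README statement above that MetaWorks runs on Linux x86-64: packages such as libgcc-ng, libstdcxx-ng, alsa-lib, and _openmp_mutex are Linux-only builds, so conda on macOS, which resolves against the osx-64 channels shown in the list above, cannot find them. A quick sanity check of the platform conda is solving for, as a sketch:

    # show the operating system and architecture of the current machine
    uname -sm
    # show the platform conda resolves packages for (MetaWorks expects linux-64)
    conda info | grep -i platform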

Issues in MetaWorks 1.12.0 Tutorial

Hello, everyone!

I am having trouble with the tutorial, apparently due to some classifier-related issue. I have edited the path to the classifier in the config_testing_COI_data.yaml file, and I also tried to increase the memory allocated to the RDP classifier in the config file, but it is still not working. I am getting the following messages:

/home/labecmar/miniconda3/envs/MetaWorks_v1.12.0/bin/rdp_classifier: line 57: 8489 Killed /home/labecmar/miniconda3/envs/MetaWorks_v1.12.0/bin/java -Xmx8g -jar /home/labecmar/miniconda3/envs/MetaWorks_v1.12.0/share/rdp_classifier-2.13-1/classifier.jar classify -t mydata/rRNAClassifier.properties -o COI/rdp.out.tmp COI/cat.denoised.nonchimeras

[Thu Aug 17 12:03:34 2023]
Error in rule taxonomic_assignment:
jobid: 0
input: COI/cat.denoised.nonchimeras
output: COI/rdp.out.tmp

RuleException:
CalledProcessError in file /home/labecmar/MetaWorks1.12.0/snakefile_ESV, line 343:
Command 'set -euo pipefail; rdp_classifier -Xmx8g classify -t mydata/rRNAClassifier.properties -o COI/rdp.out.tmp COI/cat.denoised.nonchimeras' returned non-zero exit status 137.
File "/home/labecmar/MetaWorks1.12.0/snakefile_ESV", line 343, in __rule_taxonomic_assignment
File "/home/labecmar/miniconda3/envs/MetaWorks_v1.12.0/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job taxonomic_assignment since they might be corrupted:
COI/rdp.out.tmp
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-08-17T120233.625360.snakemake.log

Could you please help me solve this issue?
Thank you very much!
