aomlomics / tourmaline
Amplicon sequence processing workflow using QIIME 2 and Snakemake
License: BSD 3-Clause "New" or "Revised" License
From @nvpatin:
There are just two commands I normally run for DEICODE, and the second one is the regular Emperor biplot command. I use an output folder called "02-beta", which I create manually, but the outputs could easily go into an existing folder from the Tourmaline workflow.
For the DEICODE rpca command, the parameter --p-n-components can be added and changed from its default of four. I've done this when trying to see more spread in my plots, but it risks overfitting the data. Alternatively, rpca can be replaced with auto-rpca, which runs a preliminary step to optimize the number of components. I never saw any difference between plots generated with auto-rpca and rpca with default parameters, so I don't think it matters much. I would probably just use the DEICODE defaults and let users decide if they want to experiment afterwards.
# Initial DEICODE command which will generate a distance matrix and an ordination plot (qza file).
qiime deicode rpca --i-table 00-table-repseqs/table-filt.qza --p-min-feature-count 10 --o-biplot 02-beta/deicode-ordination-filt.qza --o-distance-matrix 02-beta/deicode-distance-filt.qza
# Make an Emperor biplot from the DEICODE output
qiime emperor biplot --i-biplot deicode-ordination-filt.qza --m-sample-metadata-file ../../00-data/metadata.tsv --m-feature-metadata-file ../01-taxonomy/taxonomy.qza --o-visualization deicode-filt-biplot.qzv --p-number-of-features 4
Install Clustal-Omega:
conda install -c bioconda clustalo
Install MUSCLE:
conda install -c bioconda muscle
Tried running the command to generate DAGs, but Graphviz is not installed by default with Docker or the conda environment.
snakemake dada2_pe_report_unfiltered --dag | dot -Tpdf -Grankdir=LR -Gnodesep=0.1 -Granksep=0.1 > dag.pdf
bash: dot: command not found
Need to add documentation to wiki to install Graphviz if folks want to generate DAGs.
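A minimal sketch of what that wiki section could check before generating DAGs; the conda-forge package name in the install hint is an assumption to verify.

```shell
# Check that Graphviz's dot is available before generating DAGs; it is not
# bundled with the Docker image or the conda environment.
require_cmd() {  # usage: require_cmd <command> <install hint>
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "missing: install with: $2"
  fi
}

# Example usage (install hint assumes the conda-forge graphviz package):
# require_cmd dot "conda install -c conda-forge graphviz"
# snakemake dada2_pe_report_unfiltered --dag | dot -Tpdf -Grankdir=LR -Gnodesep=0.1 -Granksep=0.1 > dag.pdf
```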
It may be useful to filter out samples that did not sequence well, have contamination, or are missing certain metadata during the filtered steps before running the diversity and report steps. This could be implemented in a similar way as feature filtering: by providing a list of samples to exclude, or by filtering on some statistic (e.g., number of repseqs).
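As a sketch, the bookkeeping for an exclude-list is just set subtraction; the real rule would presumably call qiime feature-table filter-samples with --p-exclude-ids (check the current q2-feature-table docs), but the shape is:

```shell
# Toy illustration of "provide a list of samples to exclude":
# prints every sample in the first file that is not listed in the second.
filter_samples() {  # usage: filter_samples all_samples.txt exclude.txt
  grep -v -x -F -f "$2" "$1"
}
```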
The bioconductor-odseq package can cause problems with other dependencies and may need to be re-installed in the conda environment.
To facilitate uploading data to OBIS/GBIF, add a function to combine the ASV file (table.biom), sample data file (metadata.tsv), and taxonomy table (taxonomy.qza) into a shape that can be uploaded to Darwin Core (DwC). This could also include converting taxa to WoRMS identifiers (if marine).
A first pass at this doesn't have to check whether you have the correct DwC headers; it is more about putting your files in the right shape.
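A toy sketch of that "right shape" step, assuming the .qza/.biom files have already been exported to flat TSVs (file names and column layouts below are assumptions: featureid<TAB>sampleid<TAB>count and featureid<TAB>taxon):

```shell
# Join taxonomy onto a long-format ASV table on featureid, yielding one row
# per occurrence that can later be mapped onto DwC occurrence fields.
join_taxonomy() {  # usage: join_taxonomy asv_long.tsv taxonomy.tsv
  sort -k1,1 "$1" > "$1.sorted"
  sort -k1,1 "$2" > "$2.sorted"
  join -t "$(printf '\t')" "$1.sorted" "$2.sorted"
  rm -f "$1.sorted" "$2.sorted"
}
```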
Pipeline in development that does this a bit using R:
https://github.com/iobis/PacMAN-pipeline
Jon Sanders has database/solution for testing.
Also:
00-data/test_539_R1.fastq.gz
Dear Luke,
I am trying to install the tourmaline pipeline natively on Linux. I have downloaded the .yml file from https://data.qiime2.org/distro/core/qiime2-2023.2-py38-linux-conda.yml, but it appears the qiime installation was not successful.
When I activate the environment, I get:
QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
Illegal instruction (core dumped)
and when I try to invoke qiime I get the same error.
Could this issue be related to the dependencies/versions specified in the .yml file that I used to build the environment?
Just out of curiosity, I checked on conda and there is already a newer version of qiime2: https://anaconda.org/qiime2/qiime2. I tried downloading the env from this URL, but still got the same error.
Below is the complete information on my OS
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Thanks in advance
Need a new rule.
For example, Sarah Hu's Snakemake workflow for QIIME 2: https://forum.qiime2.org/t/qiime2-snakemake-workflow-tutorial-18s-16s-tag-sequencing/11334
Add parameters for:
Need a new rule.
From Luke: Let's cross-reference with the config.yaml file in Tourmaline to make sure all of the information we request there is reflected here. Then we could add a note to the config file or Tourmaline docs pointing users to this file (if they have it) to find that info.
Fragment insertion (SEPP) in addition to de novo tree.
Also sometimes happens when parameters are changed that shouldn't affect denoising (DADA2) output. This shouldn't happen if the workflow is working properly, and it adds a lot of time to the workflow. Check the Snakefile and DAG.
@ksilnoaa I found that the Snakefile was missing a param definition at line 375:
classifymethod=config["classify_method"]
This was resulting in an error on my run as follows:
RuleException in rule check_inputs_params_se in file /Users/rachaelkarns/Desktop/sedtrap_tourmaline_16S/Snakefile, line 371:
AttributeError: 'Params' object has no attribute 'classifymethod', when formatting the following:
if [ -r '01-imported/fastq_se.qza' ]; then echo 'OK: FASTQ archive 01-imported/fastq_se.qza found; FASTQ manifest file 00-data/manifest_se.csv not required.'; elif [ -r '00-data/manifest_se.csv' ]; then echo 'OK: FASTQ manifest file 00-data/manifest_se.csv found; it will be used to create FASTQ archive 01-imported/fastq_se.qza.'; else echo 'Error: FASTQ sequence data not found; either 00-data/manifest_se.csv or 01-imported/fastq_se.qza is required.' && exit 1; fi; if [ {params.classifymethod} = naive-bayes ]; then if [ -r '01-imported/classifier.qza' ]; then echo 'OK: Reference sequences classifier 01-imported/classifier.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.'; elif [ -r '01-imported/refseqs.qza' ]; then echo 'OK: Reference sequences archive 01-imported/refseqs.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.'; elif [ -r '00-data/refseqs.fna' ]; then echo 'OK: Reference sequences FASTA file 00-data/refseqs.fna found; it will be used to create reference sequences archive 01-imported/refseqs.qza.'; else echo 'Error: Reference sequences not found; either 01-imported/classifier.qza or 00-data/refseqs.fna or 01-imported/refseqs.qza is required.' && exit 1; fi; elif [ -r '01-imported/refseqs.qza' ]; then echo 'OK: Reference sequences archive 01-imported/refseqs.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.'; elif [ -r '00-data/refseqs.fna' ]; then echo 'OK: Reference sequences FASTA file 00-data/refseqs.fna found; it will be used to create reference sequences archive 01-imported/refseqs.qza.'; else echo 'Error: Reference sequences not found; either 00-data/refseqs.fna or 01-imported/refseqs.qza is required.' 
&& exit 1; fi; if [ {params.classifymethod} != naive-bayes ]; then if [ -r '01-imported/reftax.qza' ]; then echo 'OK: Reference taxonomy archive 01-imported/reftax.qza found; reference taxonomy file 00-data/reftax.tsv not required.'; elif [ -r '00-data/reftax.tsv' ]; then echo 'OK: Reference taxonomy file 00-data/reftax.tsv found; it will be used to create reference taxonomy archive 01-imported/reftax.qza.'; else echo 'Error: Reference taxonomy not found; either 00-data/reftax.tsv or 01-imported/reftax.qza is required.' && exit 1; fi; fi; if grep -q ^{params.column}$ {input}; then echo 'OK: Metadata contains the column "{params.column}" that is specified as beta_group_column in config.yaml.'; else echo 'Error: Metadata does not contain the column "{params.column}" that is specified as beta_group_column in config.yaml.' && exit 1; fi
Currently supports:
Tetramer   Amplicon locus   Forward primer
TACG       16S rRNA prok    515f (EMP)
GTAG       ITS rRNA euk     ITS1f (EMP)
GCT[AC]    18S rRNA euk     Euk1391f (EMP)
GTCG       12S rRNA mt      MiFish-U-F
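The table above amounts to a small lookup on the first four bases of a read; a hypothetical helper mirroring it:

```shell
# Guess the amplicon locus and forward primer from a read's first four bases,
# per the tetramer table above.
guess_locus() {  # usage: guess_locus <read sequence> -> locus string or "unknown"
  case "$1" in
    TACG*)       echo "16S rRNA prok, 515f (EMP)" ;;
    GTAG*)       echo "ITS rRNA euk, ITS1f (EMP)" ;;
    GCTA*|GCTC*) echo "18S rRNA euk, Euk1391f (EMP)" ;;
    GTCG*)       echo "12S rRNA mt, MiFish-U-F" ;;
    *)           echo "unknown" ;;
  esac
}
```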
While running snakemake dada2_pe_denoise, I ran into an error running DADA2. This occurred during the filtering step and was traced back to the version of Matrix in the conda environment (1.3_2) being different from the R version (1.3.3). To solve this, I reinstalled Matrix version 1.3.2 in the R environment. First I had to install devtools in the conda environment, which I've also included in the code below:
$ brew install libgit2
$ pip3 install pygit2
$ R
remove.packages("Matrix")
install.packages('devtools')
devtools::install_version("Matrix", version = "1.3.2", repos = "http://cran.us.r-project.org")
quit()
$ snakemake dada2_pe_denoise  # should work!
Error received during initial denoise run:
Export conda environments (Mac, Linux) and share on GitHub. M1 and Intel both needed?
Suggestion from @Yixiangzhang1996: improve documentation of the input data types and the recommended reference sequence databases. For example: what should dada2pe_trunc_len_f/r be for paired-end 2x250 data? Should the dada2pe_trim_left_f/r parameters be set according to the length of the primers? And what if some reads have barcodes?
Basic workflow:
Code:
cd tourmaline/02-denoised/dada2-pe
# metablast and store as ASN.1 (convertible to any format)
blastn \
-query representative_sequences.fasta \
-db $EXTDRIVE/databases/blast/nt/nt \
-negative_gilist $DB/blast/nt/env_metag_unclassified.gi \
-task megablast \
-max_target_seqs 5 \
-max_hsps 1 \
-num_threads 24 \
-outfmt 11 \
-out representative_sequences_vs_nt.asn
# convert ASN.1 to XML
blast_formatter \
-archive representative_sequences_vs_nt.asn \
-outfmt 5 \
-out representative_sequences_vs_nt.xml
# parse XML to TSV (requires BioPython)
parse_blast_xml.py \
representative_sequences_vs_nt.xml 1 > \
representative_sequences_vs_nt.tsv
# map accessions to taxids (from A to Z),
# lookup lineages, and
# format as qiime2 taxonomy
# RUN: blastn_tsv_to_taxonomy_tsv.ipynb
# OUTPUT: representative_sequences_vs_nt_tax.tsv, representative_sequences_taxonomy.tsv
# import as qiime2 taxonomy artifact
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format TSVTaxonomyFormat \
--input-path representative_sequences_taxonomy.tsv \
--output-path representative_sequences_taxonomy.qza
I'm following the instructions for Apple Silicon and not quite sure how to proceed. Which parts do those specific commands replace?
For all installations, do I activate the snakemake-minimal environment before proceeding with the other installations? When I want to run the workflow, do I activate both snakemake-minimal and qiime2-2023.5?
For systems that don't have Docker due to cost or security reasons
Rule not working properly:
# mac: md5 / linux: md5sum
rule check_repseq_md5:
input:
"02-output-{method}/00-table-repseqs/representative_sequences.fasta"
output:
"02-output-{method}/00-table-repseqs/representative_sequences_md5_status.txt"
shell:
"seq1id=`head -n 1 {input} | sed 's/^>//'`; "
"seq1md5=`head -n 2 {input} | tail -n 1 | md5 -q`; "
"echo -n First representative sequence ID..... > {output}; "
"echo $seq1id >> {output}; "
"echo -n First representative sequence MD5.... >> {output}; "
"echo $seq1md5 >> {output}; "
"if [ \"$seq1id\" = \"$seq1md5\" ]; "
"then echo 'First representative sequence has ID matching MD5 :)'; "
"else echo 'First representative sequence has ID *not* matching MD5 :('; "
"fi >> {output}; "
"seq2id=`tail -n 2 {input} | head -n 1 | sed 's/^>//'`; "
"seq2md5=`tail -n 1 {input} | md5 -q`; "
"echo -n Last representative sequence ID..... >> {output}; "
"echo $seq2id >> {output}; "
"echo -n Last representative sequence MD5.... >> {output}; "
"echo $seq2md5 >> {output}; "
"if [ \"$seq2id\" = \"$seq2md5\" ]; "
"then echo 'Last representative sequence has ID matching MD5 :)'; "
"else echo 'Last representative sequence has ID *not* matching MD5 :('; "
"fi >> {output}"
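A standalone, portable sketch of the same check (md5sum on Linux, md5 -q on mac). One assumption worth verifying: QIIME 2's hashed feature IDs are the MD5 of the bare sequence, so the trailing newline must be stripped before hashing.

```shell
# MD5 of stdin, portable across Linux (md5sum) and mac (md5 -q).
hash_stdin() {
  if command -v md5sum >/dev/null 2>&1; then md5sum | cut -d' ' -f1; else md5 -q; fi
}

# Compare the first record's ID against the MD5 of its sequence.
check_first_seq() {  # usage: check_first_seq repseqs.fasta -> prints match|mismatch
  cfs_id=$(head -n 1 "$1" | sed 's/^>//')
  cfs_md5=$(head -n 2 "$1" | tail -n 1 | tr -d '\n' | hash_stdin)
  if [ "$cfs_id" = "$cfs_md5" ]; then echo match; else echo mismatch; fi
}
```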
Hi,
I've installed the tourmaline git repository to see whether tourmaline would run smoothly with the test data on my HPC (Linux), but somehow errors are showing up when I run snakemake dada2_pe_denoise. Please see below.
R version 4.0.2 (2020-06-22)
Loading required package: Rcpp
DADA2: 1.18.0 / Rcpp: 1.0.10 / RcppParallel: 5.1.6
1) Filtering Error in validObject(.Object) :
invalid class “SRFilterResult” object: superclass "Mnumeric" not defined in the environment of the object's class
Execution halted
Traceback (most recent call last):
File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 264, in denoise_paired
run_commands([cmd])
File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 36, in run_commands
subprocess.run(cmd, check=True)
File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run_dada_paired.R', '/tmp/tmpn1wfffzw/forward', '/tmp/tmpn1wfffzw/reverse', '/tmp/tmpn1wfffzw/output.tsv.biom', '/tmp/tmpn1wfffzw/track.tsv', '/tmp/tmpn1wfffzw/filt_f', '/tmp/tmpn1wfffzw/filt_r', '240', '190', '0', '0', '2.0', '2.0', '2', 'independent', 'consensus', '1.0', '1', '1000000']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/commands.py", line 329, in __call__
results = action(**arguments)
File "<decorator-gen-522>", line 2, in denoise_paired
File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
output_views = self._callable(**view_args)
File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 279, in denoise_paired
" and stderr to learn more." % e.returncode)
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.
Plugin error from dada2:
An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.
See above for debug info.
Error in job denoise_dada2_pe while creating output files 02-output-dada2-pe-unfiltered/00-table-repseqs/table.qza, 02-output-dada2-pe-unfiltered/00-table-repseqs/repseqs.qza, 02-output-dada2-pe-unfiltered/00-table-repseqs/dada2_stats.qza.
RuleException:
CalledProcessError in line 549 of /tourmaline/Snakefile:
Command 'qiime dada2 denoise-paired --i-demultiplexed-seqs 01-imported/fastq_pe.qza --p-trunc-len-f 240 --p-trunc-len-r 190 --p-trim-left-f 0 --p-trim-left-r 0 --p-max-ee-f 2 --p-max-ee-r 2 --p-trunc-q 2 --p-pooling-method independent --p-chimera-method consensus --p-min-fold-parent-over-abundance 1 --p-n-reads-learn 1000000 --p-n-threads 1 --p-hashed-feature-ids --o-table 02-output-dada2-pe-unfiltered/00-table-repseqs/table.qza --o-representative-sequences 02-output-dada2-pe-unfiltered/00-table-repseqs/repseqs.qza --o-denoising-stats 02-output-dada2-pe-unfiltered/00-table-repseqs/dada2_stats.qza --verbose' returned non-zero exit status 1.
File "/tourmaline/Snakefile", line 549, in __rule_denoise_dada2_pe
File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
Can you please advise?
Thanks
Command:
qiime fragment-insertion sepp
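A fuller invocation might look like the following (untested sketch; the input paths and reference database file name are assumptions — q2-fragment-insertion provides marker-specific reference databases to download):

```shell
# Insert representative sequences into a reference tree with SEPP
# (file names below are placeholders, not Tourmaline outputs).
qiime fragment-insertion sepp \
  --i-representative-sequences repseqs.qza \
  --i-reference-database sepp-refs-gg-13-8.qza \
  --p-threads 4 \
  --o-tree insertion-tree.qza \
  --o-placements insertion-placements.qza
```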
This would be helpful for downstream data analysis and uploading to OBIS. Challenge is that every reference database will have a different number and types of taxonomic ranks. How to automate?
Either add back a field where the database file (name and version) is specified, or request users to put the database name/version in the metadata file.
Hi,
I tried to install using the current qiime2 version, but it failed due to an icu version conflict.
Any chance you will provide an install scenario for Python 3.8 and the current qiime2 package?
Thanks in advance
Grant said: I ran into an error on the sequence alignment step, and the culprit was the muscle package. After a little digging, it turns out that a couple of arguments have changed in muscle since it moved to v5 (https://drive5.com/muscle5/manual/commands.html). After downgrading to Muscle v3.8.31 manually, everything worked fine, but the Anaconda package manager downloads v5.1 by default.
"-in" has changed to "-align", "-out" has changed to "-output", "-maxiters" has changed to "-consiters" and there doesn't appear to be a diagonal optimization option in the most recent version (So, no "-diag"). When I made these changes to the snakefile, it worked fine with muscle v5.1. Chances are, you may already know about this, but I thought I should let you know regardless!
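For reference, the renames Grant describes, captured as a small helper (mapping taken from his report; verify against the muscle5 manual before relying on it):

```shell
# Translate a muscle v3 flag to its v5 equivalent per the report above;
# unknown flags pass through unchanged, and -diag has no v5 counterpart.
muscle5_flag() {  # usage: muscle5_flag <v3 flag> -> v5 flag (or empty)
  case "$1" in
    -in)       echo "-align" ;;
    -out)      echo "-output" ;;
    -maxiters) echo "-consiters" ;;
    -diag)     echo "" ;;  # no diagonal optimization option in v5
    *)         echo "$1" ;;
  esac
}
```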
I have been using the --cores __ argument in snakemake for dada2_pe_denoise and consistently got non-zero exit statuses from the unzip rules, which invoke commands like unzip, mv, and rm. After some debugging, I found that this is caused by a race condition on the temp folder: multiple processes try to write to, unzip into, and delete the same folder. Creating a separate folder for each rule (temp0, temp1, etc.) seemed to fix this. Please consider modifying the Snakefile to improve parallel performance and stability.
shell:
"mkdir temp0; "
"unzip -o {input} -d temp0; "
"mv -v temp0/*/data/feature-table.biom {output}; "
"rm -rf temp0; "
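A collision-free variant of that shell block using mktemp, so each invocation gets its own unique scratch directory (sketch; input/output names are placeholders):

```shell
# Extract feature-table.biom from a .qza archive via a per-invocation temp dir,
# so parallel Snakemake jobs cannot race on a shared "temp" folder.
extract_biom() {  # usage: extract_biom input.qza output.biom
  eb_tmp=$(mktemp -d) || return 1
  unzip -o "$1" -d "$eb_tmp" >/dev/null &&
    mv "$eb_tmp"/*/data/feature-table.biom "$2"
  eb_status=$?
  rm -rf "$eb_tmp"
  return $eb_status
}
```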
Getting errors like invalid escape sequence '\/', invalid escape sequence '\[', and invalid escape sequence '\|' at the start of a snakemake run.
This may be related to Python 3.12, which turns invalid escape sequences in string literals into a SyntaxWarning: https://docs.python.org/dev/whatsnew/3.12.html#other-language-changes. The likely fix is to use raw strings (r"...") or double the backslashes in the affected Snakefile strings.
After installing the qiime2-2023.2 version from a custom YAML file, the pipeline works, but a warning message is printed while running jobs:
Failed to solve scheduling problem with ILP solver. Falling back to greedy solver. Run Snakemake with --verbose to see the full solver output for debugging the problem.
This does not happen when installing with the regular conda instructions. (The warning appears harmless; passing --scheduler greedy to snakemake should avoid the ILP fallback message.)
rule summarize_fastq_demux_pe:
input: 01-imported/fastq_pe.qza, 01-imported/check_inputs_params_pe.done
output: 01-imported/fastq_summary.qzv
jobid: 4
reason: Missing output files: 01-imported/fastq_summary.qzv; Input files updated by another job: 01-imported/check_inputs_params_pe.done, 01-imported/fastq_pe.qza
threads: 4
resources: tmpdir=/tmp
Saved Visualization to: 01-imported/fastq_summary.qzv
[Mon May 8 17:36:14 2023]
Finished job 4.
6 of 18 steps (33%) done
Select jobs to execute...
Failed to solve scheduling problem with ILP solver. Falling back to greedy solver. Run Snakemake with --verbose to see the full solver output for debugging the problem.