
tourmaline's People

Contributors

7yl4r, cuttlefishh, ksilnoaa, lukenoaa


tourmaline's Issues

Add support for DEICODE

From @nvpatin:

There are just two commands I normally run for DEICODE, and the second one is the regular Emperor biplot command. I use an output folder called "02-beta" which I make manually, but the outputs could easily go into an existing folder from the Tourmaline workflow.

For the DEICODE rpca command, the parameter --p-n-components can be added and changed from the default of four. I've done this when trying to see more spread in my plots, but it risks overfitting the data. Alternatively, rpca can be replaced with auto-rpca, which runs a preliminary step to optimize the number of components. I never saw any difference between plots generated with auto-rpca and rpca with default parameters, so I don't think it matters much. I would probably just use the DEICODE defaults and let users decide if they want to experiment afterwards.

# Initial DEICODE command which will generate a distance matrix and an ordination plot (qza file).
qiime deicode rpca \
--i-table 00-table-repseqs/table-filt.qza \
--p-min-feature-count 10 \
--o-biplot 02-beta/deicode-ordination-filt.qza \
--o-distance-matrix 02-beta/deicode-distance-filt.qza

# Make an Emperor biplot from the DEICODE output
qiime emperor biplot \
--i-biplot deicode-ordination-filt.qza \
--m-sample-metadata-file ../../00-data/metadata.tsv \
--m-feature-metadata-file ../01-taxonomy/taxonomy.qza \
--o-visualization deicode-filt-biplot.qzv \
--p-number-of-features 4
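As noted above, --p-n-components can be passed to rpca, or auto-rpca can be used to choose the number of components automatically. A hedged sketch of both variants, reusing the paths from the commands above (the value of --p-n-components is purely illustrative):

# rpca with an explicit number of components
qiime deicode rpca \
--i-table 00-table-repseqs/table-filt.qza \
--p-min-feature-count 10 \
--p-n-components 5 \
--o-biplot 02-beta/deicode-ordination-filt.qza \
--o-distance-matrix 02-beta/deicode-distance-filt.qza

# auto-rpca estimates the number of components itself
qiime deicode auto-rpca \
--i-table 00-table-repseqs/table-filt.qza \
--p-min-feature-count 10 \
--o-biplot 02-beta/deicode-ordination-filt.qza \
--o-distance-matrix 02-beta/deicode-distance-filt.qza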

Graphviz not installed, so DAG command won't work

Tried running the command to generate DAGs, but Graphviz is not installed by default with Docker or the conda environment.

snakemake dada2_pe_report_unfiltered --dag | dot -Tpdf -Grankdir=LR -Gnodesep=0.1 -Granksep=0.1 > dag.pdf

bash: dot: command not found

We need to add documentation to the wiki on installing Graphviz for folks who want to generate DAGs.
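A minimal sketch of how Graphviz could be installed into the active conda environment (assuming conda-forge is reachable); on Debian/Ubuntu, apt-get install graphviz would also work:

# Install Graphviz, which provides the dot executable used by the DAG command above
conda install -c conda-forge graphviz

# Verify that dot is now on the PATH
dot -V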

Add snakemake rule to filter out samples

It may be useful to filter out samples that did not sequence well, are contaminated, or are missing certain metadata during the filtered steps, before running the diversity and report steps. This could be implemented in a similar way to feature filtering: provide a list of samples to exclude, or filter based on a statistic (e.g., number of repseqs).
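One way this could look, sketched with qiime feature-table filter-samples (the exclusion-list file name and the filtered output path are hypothetical):

# Drop samples listed in a user-provided exclusion file (sample IDs in the first column)
qiime feature-table filter-samples \
--i-table 02-output-dada2-pe-unfiltered/00-table-repseqs/table.qza \
--m-metadata-file 00-data/samples_to_exclude.tsv \
--p-exclude-ids \
--o-filtered-table 02-output-dada2-pe-filtered/00-table-repseqs/table.qza

# Or drop samples below a minimum total feature count
qiime feature-table filter-samples \
--i-table 02-output-dada2-pe-unfiltered/00-table-repseqs/table.qza \
--p-min-frequency 1000 \
--o-filtered-table 02-output-dada2-pe-filtered/00-table-repseqs/table.qza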

Add rule to prepare output format for uploading to DwC

To facilitate uploading data to OBIS/GBIF, add a function to combine the ASV table (table.biom), sample data file (metadata.tsv), and taxonomy table (taxonomy.qza) in a format that can be uploaded to DwC. This could also include matching taxa to WoRMS (if marine).

A first pass at this doesn't have to check whether you have the correct DwC headers; it is mainly about getting your files into the right shape.

A pipeline in development that does some of this using R:
https://github.com/iobis/PacMAN-pipeline
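For reference, a minimal sketch of the export/flattening steps, assuming the QIIME 2 and biom CLIs are available; the actual join into DwC occurrence and DNA-derived-data tables would still need a script along the lines of the PacMAN pipeline:

# Export the taxonomy artifact to plain TSV (writes taxonomy.tsv into dwc-export/)
qiime tools export --input-path taxonomy.qza --output-path dwc-export

# Convert the BIOM feature table to TSV so it can be melted to one row per ASV-sample pair
biom convert -i table.biom -o dwc-export/table.tsv --to-tsv

# metadata.tsv can then be joined on sample ID, and taxa matched to WoRMS for marine datasets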

hejej

00-data/test_539_R1.fastq.gz

qiime not successfully installed (Linux)

Dear Luke,

I am trying to install the Tourmaline pipeline natively on Linux. I have downloaded the .yml file from https://data.qiime2.org/distro/core/qiime2-2023.2-py38-linux-conda.yml, but it appears the QIIME 2 installation was not successful.

When I activate the environment, I get:

QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
Illegal instruction (core dumped)

and when I try to invoke qiime I get the same error.

Could this issue be related to the dependency versions specified in the .yml file that I used to build the environment?

Just out of curiosity, I checked on conda and there is already a newer version of QIIME 2: https://anaconda.org/qiime2/qiime2. I tried downloading the environment from that URL, but still got the same error.

Below is the complete information on my OS

NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Thanks in advance

Relate config file with AOML-MIMARKS metadata terms

From Luke: Let's cross-reference with the config.yaml file in Tourmaline to make sure all of the information we request there is reflected here. Then we could add a note to the config file or the Tourmaline docs pointing users to this file (if they have it) to find that info.

rule check_inputs_params_se: missing param

@ksilnoaa I found that the snakefile was missing a param definition at line 375:

classifymethod=config["classify_method"]

This was resulting in an error on my run, as follows:

RuleException in rule check_inputs_params_se in file /Users/rachaelkarns/Desktop/sedtrap_tourmaline_16S/Snakefile, line 371:
AttributeError: 'Params' object has no attribute 'classifymethod', when formatting the following:
if [ -r '01-imported/fastq_se.qza' ]; then     echo 'OK: FASTQ archive 01-imported/fastq_se.qza found; FASTQ manifest file 00-data/manifest_se.csv not required.'; elif [ -r '00-data/manifest_se.csv' ]; then     echo 'OK: FASTQ manifest file 00-data/manifest_se.csv found; it will be used to create FASTQ archive 01-imported/fastq_se.qza.'; else echo 'Error: FASTQ sequence data not found; either 00-data/manifest_se.csv or 01-imported/fastq_se.qza is required.' && exit 1; fi; if [ {params.classifymethod} = naive-bayes ]; then     if [ -r '01-imported/classifier.qza' ]; then         echo 'OK: Reference sequences classifier 01-imported/classifier.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.';     elif [ -r '01-imported/refseqs.qza' ]; then         echo 'OK: Reference sequences archive 01-imported/refseqs.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.';     elif [ -r '00-data/refseqs.fna' ]; then         echo 'OK: Reference sequences FASTA file 00-data/refseqs.fna found; it will be used to create reference sequences archive 01-imported/refseqs.qza.';     else echo 'Error: Reference sequences not found; either 01-imported/classifier.qza or 00-data/refseqs.fna or 01-imported/refseqs.qza is required.' && exit 1;     fi; elif [ -r '01-imported/refseqs.qza' ]; then     echo 'OK: Reference sequences archive 01-imported/refseqs.qza found; reference sequences FASTA file 00-data/refseqs.fna not required.'; elif [ -r '00-data/refseqs.fna' ]; then     echo 'OK: Reference sequences FASTA file 00-data/refseqs.fna found; it will be used to create reference sequences archive 01-imported/refseqs.qza.'; else echo 'Error: Reference sequences not found; either 00-data/refseqs.fna or 01-imported/refseqs.qza is required.' && exit 1; fi; if [ {params.classifymethod} != naive-bayes ]; then     if [ -r '01-imported/reftax.qza' ]; then         echo 'OK: Reference taxonomy archive 01-imported/reftax.qza found; reference taxonomy file 00-data/reftax.tsv not required.';     elif [ -r '00-data/reftax.tsv' ]; then         echo 'OK: Reference taxonomy file 00-data/reftax.tsv found; it will be used to create reference taxonomy archive 01-imported/reftax.qza.';     else echo 'Error: Reference taxonomy not found; either 00-data/reftax.tsv or 01-imported/reftax.qza is required.' && exit 1;     fi; fi; if grep -q ^{params.column}$ {input}; then     echo 'OK: Metadata contains the column "{params.column}" that is specified as beta_group_column in config.yaml.'; else echo 'Error: Metadata does not contain the column "{params.column}" that is specified as beta_group_column in config.yaml.' && exit 1; fi
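For context, a hedged sketch of how the params block of check_inputs_params_se might look once the missing definition is added (the column entry is inferred from the {params.column} reference in the error above):

rule check_inputs_params_se:
    params:
        # missing definition that triggered the AttributeError
        classifymethod=config["classify_method"],
        column=config["beta_group_column"]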

DADA2 Filtering Error

While running snakemake dada2_pe_denoise, I ran into an error running DADA2. This occurred during the filtering step and was traced back to the version of Matrix in the conda environment (1.3_2) being different from the R version (1.3.3). To solve this, I reinstalled Matrix in the R environment at version 1.3.2. First I had to install devtools in the conda environment, which I've also included in the code below:

$ brew install libgit2
$ pip3 install pygit2
$ R

remove.packages("Matrix")
install.packages('devtools')
devtools::install_version("Matrix", version = "1.3.2", repos = "http://cran.us.r-project.org")
quit()

$ snakemake dada2_pe_denoise [should work!]

Error received during initial denoise run:

1) Filtering Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'i' in selecting a method for function '[': invalid class “SRFilterResult” object: superclass "Mnumeric" not defined in the environment of the object's class
Execution halted
Traceback (most recent call last):
  File "/Users/xxxx/opt/miniconda3/envs/qiime2-2021.2b/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 181, in _denoise_single
    run_commands([cmd])
  File "/Users/xxxx/opt/miniconda3/envs/qiime2-2021.2b/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 36, in run_commands
    subprocess.run(cmd, check=True)
  File "/Users/xxxx/opt/miniconda3/envs/qiime2-2021.2b/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run_dada_single.R', '/var/folders/_c/wsbmy73951n1kf7l4vj4b63r0000gp/T/qiime2-archive-6quibl8x/9a10ad70-8aee-4daf-82b4-a075b02bc750/data', '/var/folders/_c/wsbmy73951n1kf7l4vj4b63r0000gp/T/tmpl3a9qnml/output.tsv.biom', '/var/folders/_c/wsbmy73951n1kf7l4vj4b63r0000gp/T/tmpl3a9qnml/track.tsv', '/var/folders/_c/wsbmy73951n1kf7l4vj4b63r0000gp/T/tmpl3a9qnml', '107', '0', '2.0', '2', 'Inf', 'independent', 'consensus', '1.0', '1', '1000000', 'NULL', '16']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/xxxx/opt/miniconda3/envs/qiime2-2021.2b/lib/python3.6/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "", line 2, in denoise_single
  File "/Users/xxxx/opt/miniconda3/envs/qiime2-2021.2b/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    output_types, provenance)
  File "/Users/xxxx/opt/miniconda3/envs/qiime2-2021.2b/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/Users/xxxx/opt/miniconda3/envs/qiime2-2021.2b/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 218, in denoise_single
    band_size='16')
  File "/Users/xxxx/opt/miniconda3/envs/qiime2-2021.2b/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 192, in _denoise_single
    " and stderr to learn more." % e.returncode)
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

Improve documentation for trim and trunc recommendations and taxonomic databases

Suggestion from @Yixiangzhang1996: improve the documentation of expected input data types and of the recommended reference databases. Specific questions (a hedged config sketch follows this list):

  • What is the recommended value of dada2pe_trunc_len_f/r for paired-end 2x250 data?
  • If primers are present in the sequences, should the dada2pe_trim_left_f/r parameters be set according to the primer lengths? And what if some reads have barcodes?
  • For reference sequence selection in the taxonomy step, I suggest adding to the README some reference databases corresponding to the targeted variable region, which would improve the flexibility of the tool.
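A hedged config.yaml sketch for paired-end 2x250 reads with primers still attached; the trim values assume typical 515F/806R primer lengths, the truncation values match the test data defaults, and all of them should be chosen from the run's quality profiles (e.g., the fastq_summary.qzv visualization):

dada2pe_trim_left_f: 19    # forward primer length, if primers are still in the reads
dada2pe_trim_left_r: 20    # reverse primer length, if primers are still in the reads
dada2pe_trunc_len_f: 240   # truncate where forward read quality drops off
dada2pe_trunc_len_r: 190   # truncate where reverse read quality drops off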

Add generic taxonomy assignment using GenBank nt

Basic workflow:

  • BLASTN with -outfmt 11 (ASN.1)
  • blast_formatter to get the desired format
  • Ignore "uncultured" hits (by keyword or taxid)
  • Report NCBI taxonomy

Code:

cd tourmaline/02-denoised/dada2-pe

# megablast and store as ASN.1 (convertible to any format)
blastn \
-query representative_sequences.fasta \
-db $EXTDRIVE/databases/blast/nt/nt \
-negative_gilist $DB/blast/nt/env_metag_unclassified.gi \
-task megablast \
-max_target_seqs 5 \
-max_hsps 1 \
-num_threads 24 \
-outfmt 11 \
-out representative_sequences_vs_nt.asn

# convert ASN.1 to XML
blast_formatter \
-archive representative_sequences_vs_nt.asn \
-outfmt 5 \
-out representative_sequences_vs_nt.xml

# parse XML to TSV (requires BioPython)
parse_blast_xml.py \
representative_sequences_vs_nt.xml 1 > \
representative_sequences_vs_nt.tsv

# map accessions to taxids (from A to Z),
# lookup lineages, and
# format as qiime2 taxonomy
# RUN: blastn_tsv_to_taxonomy_tsv.ipynb
# OUTPUT: representative_sequences_vs_nt_tax.tsv, representative_sequences_taxonomy.tsv

# import as qiime2 taxonomy artifact
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format TSVTaxonomyFormat \
--input-path representative_sequences_taxonomy.tsv \
--output-path representative_sequences_taxonomy.qza

Documentation updates

I'm following the instructions for Apple Silicon and am not quite sure how to proceed. Which parts do those specific commands replace?

For all installations, do I activate the snakemake-minimal environment before proceeding with the other installations? When I want to run the workflow, do I activate both snakemake-minimal and qiime2-2023.5?

Calculate MD5 checksums of sequences and compare to DADA2/Deblur seqids

Rule not working properly:

# mac: md5 / linux: md5sum
rule check_repseq_md5:
    input:
        "02-output-{method}/00-table-repseqs/representative_sequences.fasta"
    output:
        "02-output-{method}/00-table-repseqs/representative_sequences_md5_status.txt"
    shell:
        "seq1id=`head -n 1 {input} | sed 's/^>//'`; "
        "seq1md5=`head -n 2 {input} | tail -n 1 | md5 -q`; "
        "echo -n First representative sequence ID..... > {output}; "
        "echo $seq1id >> {output}; "
        "echo -n First representative sequence MD5.... >> {output}; "
        "echo $seq1md5 >> {output}; "
        "if [ '$seq1id' == '$seq1md5' ]; "
        "then echo 'First representative sequence has ID matching MD5 :)'; "
        "else echo 'First representative sequence has ID *not* matching MD5 :('; "
        "fi >> {output}; "
        "seq2id=`tail -n 2 {input} | head -n 1 | sed 's/^>//'`; "
        "seq2md5=`tail -n 1 {input} | md5 -q`; "
        "echo -n Last representative sequence ID..... >> {output}; "
        "echo $seq2id >> {output}; "
        "echo -n Last representative sequence MD5.... >> {output}; "
        "echo $seq2md5 >> {output}; "
        "if [ '$seq2id' == '$seq2md5' ]; "
        "then echo 'Last representative sequence has ID matching MD5 :)'; "
        "else echo 'Last representative sequence has ID *not* matching MD5 :('; "
        "fi >> {output}"

issues running with test dataset

Hi,

I've cloned the tourmaline git repository to check whether Tourmaline would run smoothly with the test data on my HPC (Linux), but errors show up when I run snakemake dada2_pe_denoise. Please see below.

R version 4.0.2 (2020-06-22) 
Loading required package: Rcpp
DADA2: 1.18.0 / Rcpp: 1.0.10 / RcppParallel: 5.1.6 
1) Filtering Error in validObject(.Object) : 
  invalid class “SRFilterResult” object: superclass "Mnumeric" not defined in the environment of the object's class
Execution halted
Traceback (most recent call last):
  File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 264, in denoise_paired
    run_commands([cmd])
  File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 36, in run_commands
    subprocess.run(cmd, check=True)
  File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run_dada_paired.R', '/tmp/tmpn1wfffzw/forward', '/tmp/tmpn1wfffzw/reverse', '/tmp/tmpn1wfffzw/output.tsv.biom', '/tmp/tmpn1wfffzw/track.tsv', '/tmp/tmpn1wfffzw/filt_f', '/tmp/tmpn1wfffzw/filt_r', '240', '190', '0', '0', '2.0', '2.0', '2', 'independent', 'consensus', '1.0', '1', '1000000']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "<decorator-gen-522>", line 2, in denoise_paired
  File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    output_types, provenance)
  File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_dada2/_denoise.py", line 279, in denoise_paired
    " and stderr to learn more." % e.returncode)
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

Plugin error from dada2:

  An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

See above for debug info.
Error in job denoise_dada2_pe while creating output files 02-output-dada2-pe-unfiltered/00-table-repseqs/table.qza, 02-output-dada2-pe-unfiltered/00-table-repseqs/repseqs.qza, 02-output-dada2-pe-unfiltered/00-table-repseqs/dada2_stats.qza.
RuleException:
CalledProcessError in line 549 of /tourmaline/Snakefile:
Command 'qiime dada2 denoise-paired --i-demultiplexed-seqs 01-imported/fastq_pe.qza --p-trunc-len-f 240 --p-trunc-len-r 190 --p-trim-left-f 0 --p-trim-left-r 0 --p-max-ee-f 2 --p-max-ee-r 2 --p-trunc-q 2 --p-pooling-method independent --p-chimera-method consensus --p-min-fold-parent-over-abundance 1 --p-n-reads-learn 1000000 --p-n-threads 1 --p-hashed-feature-ids --o-table 02-output-dada2-pe-unfiltered/00-table-repseqs/table.qza --o-representative-sequences 02-output-dada2-pe-unfiltered/00-table-repseqs/repseqs.qza --o-denoising-stats 02-output-dada2-pe-unfiltered/00-table-repseqs/dada2_stats.qza --verbose' returned non-zero exit status 1.
  File "/tourmaline/Snakefile", line 549, in __rule_denoise_dada2_pe
  File "/miniconda/yes/envs/qiime2-2021.2/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

Can you please advise?

Thanks

Report: record database version

Either add back a field where the database file (name and version) is specified, or request users to put the database name/version in the metadata file.
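A hedged sketch of what the first option could look like in config.yaml (field names and values are hypothetical):

# Reference database used for taxonomy assignment, recorded for the report
database_name: silva-138-99-seqs
database_version: "138.1"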

New Muscle version is not backward compatible

Grant said: I ran into an error on the sequence alignment step, and the culprit was the muscle package. After a little digging, it turns out that a couple of arguments have changed in muscle since it moved to v5 (https://drive5.com/muscle5/manual/commands.html). After downgrading to Muscle v3.8.31 manually, everything worked fine, but the Anaconda package manager downloads v5.1 by default.

"-in" has changed to "-align", "-out" has changed to "-output", "-maxiters" has changed to "-consiters" and there doesn't appear to be a diagonal optimization option in the most recent version (So, no "-diag"). When I made these changes to the snakefile, it worked fine with muscle v5.1. Chances are, you may already know about this, but I thought I should let you know regardless!

Unzip of table.biom and representative_sequences.fasta breaks the process when parallelizing

I have been using the --cores __ argument in snakemake for dada2_pe_denoise and consistently got non-zero exit statuses from the unzip rules, which invoke commands like unzip, mv, and rm. After some debugging I found that this is caused by a race condition on the temp folder: multiple processes try to write to, unzip into, and delete the same folder. Creating a separate folder for each rule (temp0, temp1, etc.) seemed to fix this. Please consider modifying the Snakefile to improve parallel performance and stability.

    shell:
        "mkdir temp0; "
        "unzip -o {input} -d temp0; "
        "mv -v temp0/*/data/feature-table.biom {output}; "
        "rm -rf temp0; "

ILP solver warning on Qiime2-2023.2 branch when installed from yaml

After installing the qiime2-2023.2 version from a custom yaml file, the pipeline works but a warning message is printed after running jobs:

Failed to solve scheduling problem with ILP solver. Falling back to greedy solver. Run Snakemake with --verbose to see the full solver output for debugging the problem.

This does not happen when installing with regular conda instructions.

rule summarize_fastq_demux_pe:
    input: 01-imported/fastq_pe.qza, 01-imported/check_inputs_params_pe.done
    output: 01-imported/fastq_summary.qzv
    jobid: 4
    reason: Missing output files: 01-imported/fastq_summary.qzv; Input files updated by another job: 01-imported/check_inputs_params_pe.done, 01-imported/fastq_pe.qza
    threads: 4
    resources: tmpdir=/tmp

Saved Visualization to: 01-imported/fastq_summary.qzv
[Mon May  8 17:36:14 2023]
Finished job 4.
6 of 18 steps (33%) done
Select jobs to execute...
Failed to solve scheduling problem with ILP solver. Falling back to greedy solver. Run Snakemake with --verbose to see the full solver output for debugging the problem.
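The fallback appears harmless for correctness, but if the message is noisy, a possible workaround (hedged; it sidesteps rather than fixes the underlying ILP solver issue) is to request the greedy scheduler explicitly:

# Target and core count are illustrative
snakemake dada2_pe_denoise --cores 4 --scheduler greedy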
