
covid-19-signal's Introduction

SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL)

This is a complete standardized workflow for the assembly and subsequent analysis of short-read viral sequencing data. The core workflow is compatible with the illumina artic nf pipeline and produces consensus and variant calls using iVar (1.3) (Grubaugh, 2019) and FreeBayes (Garrison, 2012). However, it performs far more extensive quality control and visualisation of results, including an interactive HTML summary of run results.

Briefly, raw reads undergo QC using FastQC (Andrews) before removal of host-related reads by competitive mapping against a composite human and viral reference with BWA-MEM (0.7.5) (Li, 2013), samtools (1.9) (Li, 2009), and a custom script. This ensures that data as close to raw as possible can be deposited in central databases. After this, reads undergo adapter trimming and further QC with trim-galore (0.6.5) (Martin). Residual TruSeq sequencing adapters are then removed by another custom script. Reads are then mapped to the viral reference with BWA-MEM, and amplicon primer sequences are trimmed using iVar (1.3) (Grubaugh, 2019). FastQC is then used to perform a QC check on the reads that map to the viral reference. After this, iVar is used to generate a consensus genome, and variants are called using both ivar variants and breseq (0.35) (Deatherage, 2014). FreeBayes may optionally be run alongside iVar, generating a second set of consensus genome(s) and variant call(s), with comparisons made between iVar and FreeBayes to highlight differences in mutation calls. Coverage statistics are calculated using bedtools before a final QC via quast and a kraken2 taxonomic classification of mapped reads. Finally, data from all samples are collated via a post-processing script into an interactive summary for exploration of results and quality control. Optionally, users can run ncov-tools to generate additional quality control and summary plots and statistics.
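For orientation, the core per-sample steps resemble the commands below (a minimal sketch only, assuming an indexed viral reference MN908947.3.fasta and a primer scheme BED file primer_scheme.bed; the actual workflow rules add host removal, QC, logging, and tuned parameters):

    # Map trimmed reads to the viral reference and sort by coordinate
    bwa mem -t 4 MN908947.3.fasta sample_R1.fastq.gz sample_R2.fastq.gz \
        | samtools sort -o sample.sorted.bam -
    samtools index sample.sorted.bam

    # Trim amplicon primer sequences with iVar, then re-sort the trimmed BAM
    ivar trim -i sample.sorted.bam -b primer_scheme.bed -p sample.primertrimmed
    samtools sort -o sample.primertrimmed.sorted.bam sample.primertrimmed.bam

    # Generate a consensus genome (0.75/20 thresholds as seen in SIGNAL consensus headers)
    samtools mpileup -aa -A -d 0 -Q 0 sample.primertrimmed.sorted.bam \
        | ivar consensus -p sample.consensus -t 0.75 -q 20

    # Call variants against the same reference
    samtools mpileup -aa -A -d 0 -B -Q 0 sample.primertrimmed.sorted.bam \
        | ivar variants -p sample_variants -r MN908947.3.fasta -q 20 -t 0.03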

If you use this software please cite:

Nasir, Jalees A., Robert A. Kozak, Patryk Aftanas, Amogelang R. Raphenya, Kendrick M. Smith, Finlay Maguire, Hassaan Maan et al. "A Comparison of Whole Genome Sequencing of SARS-CoV-2 Using Amplicon-Based Sequencing, Random Hexamers, and Bait Capture." Viruses 12, no. 8 (2020): 895.
https://doi.org/10.3390/v12080895

Contents:

Setup:

0. Clone the git repository (--recursive only needed to run ncov-tools postprocessing)

    git clone --recursive https://github.com/jaleezyy/covid-19-signal

1. Install conda and snakemake (version >5) e.g.

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh # follow instructions
    source $(conda info --base)/etc/profile.d/conda.sh
    conda create -n signal -c conda-forge -c bioconda -c defaults snakemake pandas conda mamba
    conda activate signal

There are some issues with conda failing to install newer versions of snakemake, so alternatively install mamba and use that (snakemake has beta support for it within the workflow):

    conda install -c conda-forge mamba
    mamba create -c conda-forge -c bioconda -n signal snakemake pandas conda mamba
    conda activate signal
    # mamba activate signal is equivalent
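Before continuing, it's worth a quick sanity check that the environment resolved correctly:

    snakemake --version            # should report a version >5
    python -c "import pandas"      # should exit without error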

Additional software dependencies are managed directly by snakemake using conda environment files:

  • trim-galore 0.6.5 (docs)
  • kraken2 2.1.1 (docs)
  • quast 5.0.2 (docs)
  • bwa 0.7.17 (docs)
  • samtools 1.7/1.9 (docs)
  • bedtools 2.26.0 (docs)
  • breseq 0.35.0 (docs)
  • ivar 1.3 (docs)
  • freebayes 1.3.2 (docs)
  • pangolin (latest; version can be specified by user) (docs)
  • pangolin-data (latest; version can be specified by user; required for Pangolin v4+) (docs)
  • pangolearn (latest; version can be specified by user) (docs)
  • constellations (latest; version can be specified by user) (docs)
  • scorpio (latest; version can be specified by user) (docs)
  • pango-designation (latest; version can be specified by user) (docs)
  • nextclade (v1.11.0) (docs)
  • ncov-tools postprocessing scripts require additional dependencies (see file).

SIGNAL Help Screen:

Using the provided signalexe.py script, the majority of SIGNAL functions can be accessed easily.

To display the help screen:

python signalexe.py -h

usage: signalexe.py [-h] [-c CONFIGFILE] [-d DIRECTORY] [--cores CORES] [--config-only] [--remove-freebayes] [--add-breseq] [-neg NEG_PREFIX] [--dependencies] [--data DATA] [-ri] [-ii] [--unlock]
                    [-F] [-n] [--quiet] [--verbose] [-v]
                    [all ...] [postprocess ...] [ncov_tools ...] [install ...]

SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL) aims to take Illumina short-read sequences and perform consensus assembly + variant calling for ongoing surveillance and research efforts towards
the emergent coronavirus: Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2).

positional arguments:
  all                   Run SIGNAL with all associated assembly rules. Does not include postprocessing. '--configfile' or '--directory' required. The latter will automatically generate a
                        configuration file and sample table. If both are provided, then '--configfile' will take priority
  postprocess           Run SIGNAL postprocessing on completed SIGNAL run. '--configfile' is required but will be generated if '--directory' is provided
  ncov_tools            Generate configuration file and filesystem setup required and then execute ncov-tools quality control assessment. Requires 'ncov-tools' submodule! '--configfile' is required
                        but will be generated if '--directory' is provided
  install               Install individual rule environments and ensure SIGNAL is functional. The only parameters operable will be '--data' and '--unlock'. Will override other operations!

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIGFILE, --configfile CONFIGFILE
                        Configuration file (i.e., config.yaml) for SIGNAL analysis
  -d DIRECTORY, --directory DIRECTORY
                        Path to directory containing reads. Will be used to generate sample table and configuration file
  --cores CORES         Number of cores. Default = 1
  --config-only         Generate sample table and configuration file (i.e., config.yaml) and exit. '--directory' required
  --remove-freebayes    Configuration file generator parameter. Set flag to DISABLE freebayes variant calling (improves overall speed)
  --add-breseq          Configuration file generator parameter. Set flag to ENABLE optional breseq step (will take more time for analysis to complete)
  -neg NEG_PREFIX, --neg-prefix NEG_PREFIX
                        Configuration file generator parameter. Comma-separated list of negative control sample name(s) or prefix(es). For example, 'Blank' will cover Blank1, Blank2, etc.
                        Recommended if running ncov-tools. Will be left empty, if not provided
  --dependencies        Download data dependencies (under a created 'data' directory) required for SIGNAL analysis and exit. Note: Will override other parameters! (~10 GB storage required)
  --data DATA           SIGNAL install and data dependencies parameter. Set location for data dependencies. If '--dependencies' is run, a folder will be created in the specified directory. If '--
                        config-only' or '--directory' is used, the value will be applied to the configuration file. (Upcoming feature): When used with 'SIGNAL install', any tests run will use the
                        dependencies located at this directory. Default = 'data'
  -ri, --rerun-incomplete
                        Snakemake parameter. Re-run any incomplete samples from a previously failed run
  -ii, --ignore-incomplete
                        Snakemake parameter. Do not check for incomplete output files
  --unlock              Snakemake parameter. Remove a lock on the working directory after a failed run
  -F, --forceall        Snakemake parameter. Force the re-run of all rules regardless of prior output
  -n, --dry-run         Snakemake parameter. Do not execute anything and only display what would be done
  --quiet               Snakemake parameter. Do not output any progress or rule information. If used with '--dry-run', it will just display a summary of the DAG of jobs
  --verbose             Snakemake parameter. Display snakemake debugging output
  -v, --version         Display version number

Summary:

signalexe.py simplifies the execution of all functions of SIGNAL. At its simplest, SIGNAL can be run with one line, provided only the directory of sequencing reads.

# Download dependencies (only needs to be run once; ~10GB of storage required)
# --data flag allows you to rename and relocate dependencies directory
python signalexe.py --data data --dependencies

# Generate configuration file and sample table
# --neg-prefix can be used to note negative controls
# --data can be used to specify location of data dependencies
python signalexe.py --config-only --directory /path/to/reads

# Execute pipeline (step-by-step; --cores defaults to 1 if not provided)
# --data can be used to specify location of data dependencies
python signalexe.py --configfile config.yaml --cores NCORES all
python signalexe.py --configfile config.yaml --cores NCORES postprocess
python signalexe.py --configfile config.yaml --cores NCORES ncov_tools

# ALTERNATIVE
# Execute pipeline (one line)
# --data can be used to specify location of data dependencies
python signalexe.py --configfile config.yaml --cores NCORES all postprocess ncov_tools

# ALTERNATIVE
# Execute pipeline (one line; no prior configuration file or sample table steps)
# --directory can be used in place of --configfile to automatically generate a configuration file
# --data can be used to specify location of data dependencies
python signalexe.py --directory /path/to/reads --cores NCORES all postprocess ncov_tools

Each of the steps in SIGNAL can be run manually by accessing the individual scripts or running snakemake.

# Download dependencies (only needs to be run once; ~10GB of storage required)
bash scripts/get_data_dependencies.sh -d data -a MN908947.3

# Generate sample table
# Modify existing 'example_config.yaml' for your configuration file
bash scripts/generate_sample_table.sh -d /path/to/reads -n sample_table.csv

# Execute pipeline (step-by-step)
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda all
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda postprocess
snakemake -kp --configfile config.yaml --cores NCORES --use-conda --conda-prefix=$PWD/.snakemake/conda ncov_tools

Detailed setup and execution:

1. Download necessary database files:

The pipeline requires:

  • Amplicon primer scheme sequences

  • SARS-CoV-2 reference fasta

  • SARS-CoV-2 reference gbk

  • SARS-CoV-2 reference gff3

  • kraken2 viral database

  • Human GRCh38 reference fasta (for composite human-viral BWA index)

     python signalexe.py --dependencies
     # defaults to a directory called `data` in repository root
     # --data can be used to rename and relocate the resultant directory
    
     OR
    
     bash scripts/get_data_dependencies.sh -d data -a MN908947.3
     # allows you to rename and relocate the resultant directory
    

Note: Downloading the database files requires ~10GB of storage, with up to ~35GB required for all temporary downloads!
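Once the download finishes, it's worth a quick check that everything landed (a sketch assuming the default data directory name; exact filenames can vary between releases):

     du -sh data    # expect on the order of ~10GB after temporary files are removed
     ls data        # look for the reference fasta/gbk/gff3, Kraken2 database, and composite index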

1.5. Prepare per-rule conda environments (optional, but recommended):

SIGNAL uses controlled conda environments for individual steps in the workflow. These are generally produced upon first execution of SIGNAL with input data; however, an option to install the per-rule environments is available through the signalexe.py script.

   python signalexe.py install

   # Will install per-rule environments
   # Later versions of SIGNAL will include a testing module with curated data to ensure function

2. Generate configuration file:

You can use the --config-only flag to generate both config.yaml and sample_table.csv. The directory provided will be used to auto-generate a name for the run.

python signalexe.py --config-only --directory /path/to/reads

# Outputs: 'reads_config.yaml' and 'reads_sample_table.csv'
# --data can be used to specify the location of data dependencies

You can also create the configuration file through modifying the example_config.yaml to suit your system.

Note: Regardless of method, double-check your configuration file to ensure the information is correct!
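For example, to quickly eyeball fields that commonly need attention (key names as referenced elsewhere in this README; verify the full set against example_config.yaml):

grep -E 'results?_dir|run_freebayes|phylo_include_seqs' reads_config.yaml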

3. Specify your samples in CSV format:

See the example table example_sample_table.csv for an idea of how to organise this table.

Using the --config-only flag, both configuration file and sample table will be generated (see above in step 2) from a given directory path to reads.

Alternatively, you can attempt to use generate_sample_table.sh to circumvent manual creation of the table.

bash scripts/generate_sample_table.sh

Output:
You must specify a data directory containing fastq(.gz) reads.

ASSUMES FASTQ FILES ARE NAMED AS <sample_name>_L00#_R{1,2}*.fastq(.gz)

Flags:
    -d  :  Path to directory containing sample fastq(.gz) files (Absolute paths preferred for consistency, but can use relative paths)
    -n  :  Name or file path for final sample table (with extension) (default: 'sample_table.csv') - will overwrite if file exists
    -e  :  Name or file path for an existing sample table - will append to the end of the provided table

Select one of '-n' (new sample table) or '-e' (existing sample table).
If neither provided, a new sample table called 'sample_table.csv' will be created (or overwritten) by default.

General usage:

# Create new sample table 'sample_table.csv' given path to reads directory
bash scripts/generate_sample_table.sh -d /path/to/reads -n sample_table.csv

# Append to existing sample table 'sample_table.csv' given path to a directory with additional reads
bash scripts/generate_sample_table.sh -d /path/to/more/reads -e sample_table.csv
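For reference, a generated table looks roughly like this (hypothetical rows following the file-naming assumption above; consult example_sample_table.csv for the authoritative column layout):

head -3 sample_table.csv
# sample,r1_path,r2_path
# sampleA,/path/to/reads/sampleA_L001_R1.fastq.gz,/path/to/reads/sampleA_L001_R2.fastq.gz
# sampleB,/path/to/reads/sampleB_L001_R1.fastq.gz,/path/to/reads/sampleB_L001_R2.fastq.gz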

4. Execute pipeline:

For the main signalexe.py script, positional arguments determine which rules of the pipeline to execute, with flags supplying input parameters.

The main rules of the pipeline are as follows:

  • all = Sequencing pipeline. i.e., take a set of paired reads, perform reference-based assembly to generate a consensus, run lineage assignment, etc.
  • postprocess = Summarize the key results, including pangolin lineage, specific mutations, etc., after running all
  • ncov_tools = Create the required conda environment, generate the necessary configuration file, and link needed result files within the ncov-tools directory. ncov-tools will then be executed with output found within the SIGNAL directory.

The generated configuration file from the above steps can be used as input. To run the general pipeline:

python signalexe.py --configfile config.yaml --cores 4 all

is equivalent to running

snakemake -kp --configfile config.yaml --cores 4 --use-conda --conda-prefix=$PWD/.snakemake/conda all

You can run the snakemake command as written above, but note that if --conda-prefix is not set this way (i.e., $PWD/.snakemake/conda), then all environments will be reinstalled each time you change the results_dir in the config.yaml.

Alternatively, you can skip the above configuration and sample table generation steps by simply providing the directory of reads to the main script (see step 2):

python signalexe.py --directory /path/to/reads --cores 4 all

A configuration file and sample table will automatically be generated prior to running SIGNAL all.

FreeBayes variant calling and BreSeq mutational analysis are technically optional tools within the workflow. When using the --directory flag, FreeBayes will run by default and BreSeq will not. These defaults can be changed using the --remove-freebayes and --add-breseq flags, respectively.
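For example, to auto-generate the configuration and run the pipeline with FreeBayes disabled and BreSeq enabled (flags as documented in the help screen above):

python signalexe.py --directory /path/to/reads --remove-freebayes --add-breseq --cores 4 all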

5. Postprocessing analyses:

As with the general pipeline, the generated configuration file from the above steps can be used as input. To run postprocess which summarizes the SIGNAL results:

python signalexe.py --configfile config.yaml --cores 1 postprocess

is equivalent to running

snakemake -kp --configfile config.yaml --cores 1 --use-conda --conda-prefix=$PWD/.snakemake/conda postprocess

After postprocessing finishes, you'll see the following summary files:

  - summary.html                top-level summary, with links to per-sample summaries
  - {sample_name}/sample.html   per-sample summaries, with links for more detailed info
  - {sample_name}/sample.txt    per-sample summaries, in text-file format instead of HTML
  - summary.zip                 zip archive containing all of the above summary files.

Note that the pipeline postprocessing (snakemake postprocess) is separated from the rest of the pipeline (snakemake all). This is because in a multi-sample run, it's likely that at least one pipeline stage will fail. The postprocessing script should handle failed pipeline stages gracefully, by substituting placeholder values when expected pipeline output files are absent. However, this confuses snakemake's dependency tracking, so there seems to be no good alternative to separating pipeline processing and postprocessing into all and postprocess targets.

Related: because pipeline stages can fail, we run (and recommend running if using the snakemake command to run SIGNAL) snakemake all with the -k flag ("Go on with independent jobs if a job fails").

Additionally, SIGNAL can prepare output and execute @jts' ncov-tools to generate phylogenies and alternative summaries.

python signalexe.py --configfile config.yaml --cores 1 ncov_tools

is equivalent to running

snakemake -kp --configfile config.yaml --cores 1 --use-conda --conda-prefix=$PWD/.snakemake/conda ncov_tools

SIGNAL manages installing the dependencies (within the conda_prefix) and will generate the necessary hard links to required SIGNAL input files for ncov-tools, provided ncov-tools has been cloned as a submodule (if not found, the script will attempt to pull the submodule) and a fasta containing sequences to include in the tree has been specified using phylo_include_seqs: in the main SIGNAL config.yaml. If run_freebayes is set to True, then SIGNAL will attempt to link the FreeBayes consensus FASTA and variant files, if found. Otherwise, the corresponding iVar files will be used instead.

SIGNAL will then execute ncov-tools and the output will be found within the SIGNAL results directory, specified in SIGNAL's configuration file, under ncov-tools-results. Refer to the ncov-tools documentation for information regarding specific output.

Multiple operations:

Using signalexe.py positional arguments, you can specify SIGNAL to perform multiple rules in succession.

python signalexe.py --configfile config.yaml --cores NCORES all postprocess ncov_tools

In the above command, SIGNAL all, postprocess, and ncov_tools will run using the provided configuration file as input, which links to a sample table.

Note: Regardless of order for positional arguments, or placement of other parameter flags, SIGNAL will always run in the set order priority: all > postprocess > ncov_tools!

Note: If install is provided as input, it will override all other positional arguments!

If no configuration file or sample table was generated for a run, you can provide --directory with the path to sequencing reads and SIGNAL will auto-generate both required inputs prior to running any rules.

python signalexe.py --directory /path/to/reads --cores NCORES all postprocess ncov_tools

Overall, this simplifies executing SIGNAL to one line!

Docker:

Alternatively, the pipeline can be deployed using Docker (see resources/Dockerfile_pipeline for specification). To pull from dockerhub:

    docker pull finlaymaguire/signal

Download data dependencies into a data directory that already contains your reads (data in this example, but use whatever name you wish):

    mkdir -p data && docker run -v $PWD/data:/data finlaymaguire/signal:latest bash scripts/get_data_dependencies.sh -d /data

Generate your config.yaml and sample_table.csv (with paths to the readsets underneath /data) and place them into the data directory:

    cp config.yaml sample_table.csv $PWD/data

WARNING: result_dir in config.yaml must be within /data, e.g., /data/results, for results to automatically be copied to your host system. Otherwise they will be automatically deleted when the container finishes running (unless docker is run interactively).

Then execute the pipeline:

    docker run -v $PWD/data:/data finlaymaguire/signal conda run -n snakemake snakemake --configfile /data/config.yaml --use-conda --conda-prefix /covid-19-signal/.snakemake/conda --cores 8 all
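To debug, or to keep full control over which outputs are copied back, you can instead start an interactive shell in the container using standard docker flags (a sketch; the snakemake environment name matches the command above):

    docker run -it -v $PWD/data:/data finlaymaguire/signal bash
    # then, inside the container:
    conda run -n snakemake snakemake --configfile /data/config.yaml --use-conda \
        --conda-prefix /covid-19-signal/.snakemake/conda --cores 8 all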

Data Summaries:

Convenient extraction script:

SIGNAL produces several output files and directories on its own alongside the output for ncov-tools. Select files from the output can be copied or transferred for easier parsing using a provided convenience bash script:

bash scripts/get_signal_results.sh

Usage:
bash get_signal_results.sh -s <SIGNAL_results_dir> -d <destination_dir> [-m] [-c]

This script aims to copy (rsync by default, or cp) or move (mv) select output from SIGNAL 'all', 'postprocess', and 'ncov_tools'.

The following files will be transferred over to the specified destination directory (if found):
SIGNAL 'all' & 'postprocess':
-> signal-results/<sample>/<sample>_sample.txt
-> signal-results/<sample>/core/<sample>.consensus.fa
-> signal-results/<sample>/core/<sample>_ivar_variants.tsv
-> signal-results/<sample>/freebayes/<sample>.consensus.fasta
-> signal-results/<sample>/freebayes/<sample>.variants.norm.vcf

SIGNAL 'ncov_tools':
-> ncov-tools-results/qc_annotation/<sample>.ann.vcf
-> ncov-tools-results/qc_reports/<run_name>_ambiguous_position_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_mixture_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_ncov_watch_variants.tsv
-> ncov-tools-results/qc_reports/<run_name>_negative_control_report.tsv
-> ncov-tools-results/qc_reports/<run_name>_summary_qc.tsv

Flags:
        -s  :  SIGNAL results directory
        -d  :  Directory where summary will be outputted
        -m  :  Invoke 'mv' move command instead of 'rsync' copying of results. Optional
        -c  :  Invoke 'cp' copy command instead of 'rsync' copying of results. Optional

The script uses rsync to produce accurate copies of select output files, organized into signal-results and ncov-tools-results within a provided destination directory (which must exist). If -c is provided, cp will be used instead of rsync to produce copies. Similarly, if -m is provided, mv will be used instead (WARNING: any interruption during mv could result in data loss).
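For example (flag behaviour as described above; the destination directory must already exist):

# Copy select results with rsync (the default)
bash scripts/get_signal_results.sh -s /path/to/signal/results -d /path/to/summary

# Same, but using cp instead of rsync
bash scripts/get_signal_results.sh -s /path/to/signal/results -d /path/to/summary -c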

Pipeline details:

For a step-by-step walkthrough of the pipeline, see pipeline/README.md.

A diagram of the workflow is shown below (update pending).

[Workflow diagram, version 8]

covid-19-signal's People

Contributors

fmaguire, hkeward, jaleezyy, kmsmith137, nknox, nodrogluap, pvanheus, raphenya


covid-19-signal's Issues

Version 2 for long-reads

Bigger milestone, likely to start as a separate repo: a version of the pipeline to support MinION and PacBio, with test data for both

Move to standardised "core"

From discussion in teams, consider moving towards a standardised set of core tools (with our additional QC and analyses):

trim_galore -> bwa -> ivar trim -> ivar variant -> ivar consensus

Uploader code

Write accessory scripts to upload clean reads as biosamples and final assembly to nextstrain/GISAID

missing dependencies in c19_postprocess

Maybe add to dependencies install script:

Traceback (most recent call last):
  File "scripts/c19_postprocess.py", line 13, in <module>
    import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'
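Until the install script covers this, a manual workaround is to install the missing module into the active environment (assuming conda; pip would work equally well):

conda install -c conda-forge matplotlib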

RuntimeWarning for generated figures

scripts/c19_postprocess.py:1075: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (matplotlib.pyplot.figure) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam figure.max_open_warning).

location/BED file based primer trimming

Using a primer scheme:

        ivarCmd = "ivar trim -e"
    } else {
        ivarCmd = "ivar trim"
    }
        """
        samtools view -F4 -o ${sampleName}.mapped.bam ${bam}
        samtools index ${sampleName}.mapped.bam
        ${ivarCmd} -i ${sampleName}.mapped.bam -b ${bedfile} -m ${params.illuminaKeepLen} -q ${params.illuminaQualThreshold} -p ivar.out
        samtools sort -o ${sampleName}.mapped.primertrimmed.sorted.bam ivar.out.bam

Add primer QC check

Do a QC for primers after trimming to make sure there aren't any remaining

Running pipeline via docker

  • Running the pull command on both Mac OS and Linux, I'm getting the following error:
docker pull finlaymaguire/pipeline
Using default tag: latest
Error response from daemon: pull access denied for finlaymaguire/pipeline, repository does not exist or may require 'docker login'
  • Running build on Mac OS using docker build -f Dockerfile_pipeline . gave error:
Solving environment: ...working... The command '/bin/sh -c conda create --name snakemake --channel conda-forge --channel bioconda snakemake=5.11.2' returned a non-zero code: 137

Error running get_data_dependencies.sh

I ran bash pipeline/scripts/get_data_dependencies.sh -d data -a MN908947.3 and got the following error:

Downloading nucleotide gb accession to taxon map...rsync: failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.12): Connection timed out (110)
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::11): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(127) [Receiver=3.1.3]

Looking through the script, I narrowed it down to line 45: kraken2-build --download-taxonomy --db Kraken2/db --threads 10. When I run that at the command line, I get the same error as above, which led me to DerrickWood/kraken2#38. Adding --use-ftp solved my issue.

It would also be helpful to add a comment on the estimated disk space needed for these databases & dependencies.
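For reference, the workaround described above amounts to adding the flag to that kraken2-build call:

kraken2-build --download-taxonomy --db Kraken2/db --threads 10 --use-ftp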

simplify env files

Ensuring full conda env ymls (i.e. from conda env export) work long-term has been a nightmare for me recently.

I'd suggest we simplify the conda env yaml files down to the core tools we actually want to install and let the installation of all the dependencies be automatic.

Slightly less reproducible but a lot more maintainable.

c19_postprocess script needs updating

Need to swap cutadapt parsing for trim galore and ivar trim. Also need to change the directory where Fastqc files are found post-trimming. May want to run fastqc after ivar trim instead.

coverage plot

consider creating a read coverage plot as an alternative to that provided by BreSeq, based on the step 13 HiSAT2 results, with the y-axis in log scale?

IonTorrent support

Implement support for IonTorrent data in the pipeline; this should largely be the same as Illumina, with some quirks around quality score issues

Snakemake/Conda install issues (temporary solution?)

In attempting to create the conda environment with snakemake:

$ conda create -c conda-forge -c bioconda -n snakemake snakemake pandas
$ conda activate snakemake
(snakemake) $ snakemake --version
5.3.0
(snakemake) $

So firstly, the channels need to be specified or else it fails to find snakemake.
Secondly, the "latest" version of snakemake found is 5.3.0 (5.11.2 not even found, hence why the version is not specified in the above command...which may or may not be a typo...).
Apparently this is a known problem.

The following can be seen at the installation instructions in the Snakemake documentation:

conda install -c conda-forge mamba # install workaround mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake # create conda environment using mamba, installing snakemake
conda activate snakemake 

Temporary rewording of instructions to install snakemake through this alternate method? ...at least until conda catches up?

pandas dependency error

The installation script needs to install the pandas module. I had to install it separately using:
conda install -c anaconda pandas

Update documentation

Since the big PR, the documentation needs updating, including the figure.

@agmcarthur can the SVG for the overview figure be added to the repo?

new summary visualization - %SARS versus completeness

I've been using this pipeline and making this summary plot based on the output statistics. The label for the sample is "Sample Name, Average Fold Coverage".

[Screenshot: %SARS versus completeness summary plot, labelled with sample name and average fold coverage]

It would be great if this summary figure was generated automatically.

Use LMAT conda env instead of LMAT docker

We're currently calling LMAT through Fin's docker container, via the wrapper script lmat_wrapper.py which addresses some nuisance issues such as file permissions. Fin has also made an LMAT conda recipe (conda create -n lmat -c fmaguire lmat) which should remove these nuisance issues and be more convenient.

Breseq failing for some samples

Breseq failing for some samples even after re-runs, see error log:

grep "FATAL ERROR" */breseq/breseq.log -A 20
S278/breseq/breseq.log:!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S278/breseq/breseq.log-Attempt to translate codon without three bases.
S278/breseq/breseq.log-FILE: reference_sequence.cpp   LINE: 2383
S278/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S278/breseq/breseq.log-Backtrace with 9 stack frames.
S278/breseq/breseq.log-breseq(+0x1b176) [0x56276e113176]
S278/breseq/breseq.log-breseq(+0x1844d2) [0x56276e27c4d2]
S278/breseq/breseq.log-breseq(+0x1868c5) [0x56276e27e8c5]
S278/breseq/breseq.log-breseq(+0x189860) [0x56276e281860]
S278/breseq/breseq.log-breseq(+0x18c222) [0x56276e284222]
S278/breseq/breseq.log-breseq(+0x4c2f1) [0x56276e1442f1]
S278/breseq/breseq.log-breseq(+0xc652) [0x56276e104652]
S278/breseq/breseq.log-/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fc60065bb97]
S278/breseq/breseq.log-breseq(+0x1a629) [0x56276e112629]
S278/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--
S384/breseq/breseq.log:!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S384/breseq/breseq.log-Attempt to translate codon without three bases.
S384/breseq/breseq.log-FILE: reference_sequence.cpp   LINE: 2383
S384/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S384/breseq/breseq.log-Backtrace with 9 stack frames.
S384/breseq/breseq.log-breseq(+0x1b176) [0x55c171d16176]
S384/breseq/breseq.log-breseq(+0x1844d2) [0x55c171e7f4d2]
S384/breseq/breseq.log-breseq(+0x1868c5) [0x55c171e818c5]
S384/breseq/breseq.log-breseq(+0x189860) [0x55c171e84860]
S384/breseq/breseq.log-breseq(+0x18c222) [0x55c171e87222]
S384/breseq/breseq.log-breseq(+0x4c2f1) [0x55c171d472f1]
S384/breseq/breseq.log-breseq(+0xc652) [0x55c171d07652]
S384/breseq/breseq.log-/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fd978c3db97]
S384/breseq/breseq.log-breseq(+0x1a629) [0x55c171d15629]
S384/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--
S56/breseq/breseq.log:!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S56/breseq/breseq.log-Attempt to translate codon without three bases.
S56/breseq/breseq.log-FILE: reference_sequence.cpp   LINE: 2383
S56/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
S56/breseq/breseq.log-Backtrace with 9 stack frames.
S56/breseq/breseq.log-breseq(+0x1b176) [0x560f40897176]
S56/breseq/breseq.log-breseq(+0x1844d2) [0x560f40a004d2]
S56/breseq/breseq.log-breseq(+0x1868c5) [0x560f40a028c5]
S56/breseq/breseq.log-breseq(+0x189860) [0x560f40a05860]
S56/breseq/breseq.log-breseq(+0x18c222) [0x560f40a08222]
S56/breseq/breseq.log-breseq(+0x4c2f1) [0x560f408c82f1]
S56/breseq/breseq.log-breseq(+0xc652) [0x560f40888652]
S56/breseq/breseq.log-/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f31a3734b97]
S56/breseq/breseq.log-breseq(+0x1a629) [0x560f40896629]
S56/breseq/breseq.log-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

I'm suspecting the fastqs didn't end properly.

adjust Trimmomatic options

There are short sequences in the fastqs after the Trimmomatic step; we should update the options to specify lengths:

ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
LEADING: Cut bases off the start of a read, if below a threshold quality
TRAILING: Cut bases off the end of a read, if below a threshold quality
CROP: Cut the read to a specified length
HEADCROP: Cut the specified number of bases from the start of the read
MINLEN: Drop the read if it is below a specified length
TOPHRED33: Convert quality scores to Phred-33
TOPHRED64: Convert quality scores to Phred-64

run_bed_primer_trim rule fails

Pipeline consistently fails at this step. I can run the cmd manually (samtools view -F / ivar trim) and it works - might be a relative path issue?

Error: It can't generate and find reference.mapped.primertrimmed.sorted.bam

ncov-tools failing due to same headers in the consensus.fasta

Error message:

grep error ncovtools.log -B 20
Finished job 6.
1 of 7 steps (14%) done

[Wed May 27 14:33:06 2020]
rule make_msa:
    input: qc_analysis/default_consensus.fasta
    output: qc_analysis/default_aligned.fasta
    jobid: 5
    wildcards: prefix=default

Detected duplicate input strains "Consensus_virus.consensus_threshold_0.75_quality_20" but the sequences are different.
[Wed May 27 14:33:08 2020]
Error in rule make_msa:
    jobid: 5
    output: qc_analysis/default_aligned.fasta
    shell:
        augur align --sequences qc_analysis/default_consensus.fasta --reference-sequence /workspace/raphenar/sars-cov-2/data/MN908947.3.fasta --output qc_analysis/default_aligned.fasta --fill-gaps
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Need to rename the headers, i.e.:

$ head -1 S99.consensus.fasta
>Consensus_virus.consensus_threshold_0.75_quality_20
$ head -1 SB2.consensus.fasta
>Consensus_virus.consensus_threshold_0.75_quality_20
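A one-liner along these lines would make the headers unique before rerunning ncov-tools (a sketch; adjust to your own naming scheme):

for f in *.consensus.fasta; do
    sed -i "s|^>.*|>${f%%.*}|" "$f"    # rewrite the header to the sample prefix, e.g. >S99
done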

Output details on README

Generate a verbose README on the output files from individual assemblies, summaries among assemblies, ncov-tools, etc. Consider making a corresponding YouTube video.

KeyError phylo_include_seqs for ncov-tools integration

Activating conda environment: /workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/conda/afb830cb
Writing config for ncov to ncov-tools/config.yaml
Traceback (most recent call last):
File "/workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/scripts/tmptora1qx_.ncov-tools.py", line 59, in
set_up()
File "/workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/scripts/tmptora1qx_.ncov-tools.py", line 24, in set_up
'tree_include_consensus': f"'{os.path.abspath(snakemake.config['phylo_include_seqs'])}'",
KeyError: 'phylo_include_seqs'
[Tue May 26 22:37:32 2020]
Error in rule ncov_tools:
jobid: 0
conda-env: /workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/conda/afb830cb

RuleException:
CalledProcessError in line 120 of /workspace/raphenar/sars-cov-2/covid-19-sequencing/Snakefile:
Command 'source /home/raphenar/miniconda3/bin/activate '/workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/conda/afb830cb'; set -euo pipefail; python /workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/scripts/tmptora1qx_.ncov-tools.py' returned non-zero exit status 1.
File "/workspace/raphenar/sars-cov-2/covid-19-sequencing/Snakefile", line 120, in __rule_ncov_tools
File "/home/raphenar/miniconda3/envs/snakemake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /workspace/raphenar/sars-cov-2/covid-19-sequencing/.snakemake/log/2020-05-26T223716.884897.snakemake.log

ivar failing with ARTIC bed

ivar trim -e -i Blank1/core/reference.mapped.bam -b /workspace/raphenar/sars-cov-2/covid-19-sequencing/resources/primer_schemes/nCoV-2019_v3.bed -m 20 -q 20 -p Blank1/core/reference.mapped.primertrimmed 2> Blank1/core/ivar_trim.log
iVar uses the standard 6 column BED format as defined here - https://genome.ucsc.edu/FAQ/FAQformat.html#format1.
It requires the following columns delimited by a tab: chrom, chromStart, chromEnd, name, score, strand
$ head /workspace/raphenar/sars-cov-2/covid-19-sequencing/resources/primer_schemes/nCoV-2019_v3.bed
MN908947.3	30	54	nCoV-2019_1_LEFT	nCoV-2019_1	+
MN908947.3	385	410	nCoV-2019_1_RIGHT	nCoV-2019_1	-
MN908947.3	320	342	nCoV-2019_2_LEFT	nCoV-2019_2	+
MN908947.3	704	726	nCoV-2019_2_RIGHT	nCoV-2019_2	-
MN908947.3	642	664	nCoV-2019_3_LEFT	nCoV-2019_1	+
MN908947.3	1004	1028	nCoV-2019_3_RIGHT	nCoV-2019_1	-
MN908947.3	943	965	nCoV-2019_4_LEFT	nCoV-2019_2	+
MN908947.3	1312	1337	nCoV-2019_4_RIGHT	nCoV-2019_2	-
MN908947.3	1242	1264	nCoV-2019_5_LEFT	nCoV-2019_1	+
MN908947.3	1623	1651	nCoV-2019_5_RIGHT	nCoV-2019_1	-

$ head /workspace/raphenar/sars-cov-2/covid-19-sequencing/resources/primer_schemes/Wuhan_liverpool_primers_28.01.20.bed 
MN908947.3	248	269	Wuhan_1_LEFT	1	+
MN908947.3	1184	1203	Wuhan_1_RIGHT	1	-
MN908947.3	944	963	Wuhan_2_LEFT	2	+
MN908947.3	2137	2156	Wuhan_2_RIGHT	2	-
MN908947.3	1912	1931	Wuhan_3_LEFT	1	+
MN908947.3	3146	3165	Wuhan_3_RIGHT	1	-
MN908947.3	2936	2957	Wuhan_4_LEFT	2	+
MN908947.3	4180	4199	Wuhan_4_RIGHT	2	-
MN908947.3	4052	4071	Wuhan_5_LEFT	1	+
MN908947.3	5324	5347	Wuhan_5_RIGHT	1	-

The 5th column needs to be a score

Snakemake running error

/usr/local/lib/python3.6/dist-packages/snakemake/workflow.py:18: FutureWarning: read_table is deprecated, use read_csv instead.
  
NameError in line 159 of /data0/fwhelan/mcarthur/covid-19-sequencing/pipeline/Snakefile.master:
name 'multiext' is not defined
  File "/data0/fwhelan/mcarthur/covid-19-sequencing/pipeline/Snakefile.master", line 159, in <module>

Output postprocessing

Writing python script to postprocess a pipeline run and populate a .csv file with high-level summary info from Andrew's spreadsheet

typo in install instructions

Hi all- Andrew asked if I could do an independent test of your workflow (install and usage). I'll drop any issues I find here.

I think the install instructions should read bash pipeline/scripts/get_data_dependencies.sh -d data -a MN908947.3 instead of bash scripts/get_data_dependencies.sh -d data -a MN908947.3
