Wastewater analysis

Introduction

ww_nf_minimal is a bioinformatics analysis pipeline that performs initial quality control and variant analysis on wastewater sequencing samples. It supports Illumina short reads prepared with the Nimagen primer scheme and sequenced on a range of platforms (NovaSeq, NextSeq, MiSeq).

Pipeline summary

  1. Merge sequencing FASTQ files (pigz)
  2. Adapter trimming (fastp)
  3. Variant calling
    1. Read alignment (bwa mem)
    2. Sort and index alignments (Samtools)
    3. Primer sequence removal (BAMClipper)
    4. Genome-wide and amplicon coverage (mosdepth, Samtools ampliconstats)
    5. Variant calling (freyja variants/demix; samples with coverage too low for this step are omitted from further analysis in the pipeline, but are not excluded from its other outputs)
    6. Extract WHO and pango lineages (collate_results.py, collate_lineages.py)
    7. Aggregate all sample outputs (xsv)

Quickstart

This pipeline uses conda for environment and package management (miniconda is recommended).

Initialise environment

With [mini]conda installed:

git clone https://github.com/LooseLab/ww_nf_minimal
cd ww_nf_minimal
conda env create -f environment.yml

Run test profile

conda activate ww_minimal
nextflow run main.nf -profile test

Running an actual run

After successfully running the test subset you can attempt a run on other samples. Read the input section for how to set up the FASTQ directory and sample sheet. Once these are in place the pipeline can be run like so:

nextflow run /path/to/main.nf --readsdir <FASTQ INPUT DIRECTORY> --sample_sheet <SAMPLE SHEET CSV> -with-report report.html

If nextflow crashes while running, you can add the -resume flag to the previous command; nextflow will then reuse cached jobs so the entire pipeline does not need to be re-run.

Input and Output

Input

There are two required user-supplied inputs: the sample sheet and the FASTQ reads directory. These can be supplied either by editing the nextflow.config file, adding the sample_sheet and readsdir attributes to the params block, or on the command line using --sample_sheet and --readsdir. In addition, three static inputs are provided with the workflow (these may change in the future as the primer scheme changes): the reference genome, the paired-end primer file, and the amplicon primer file.
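For example, the params block in nextflow.config might be edited along these lines (the paths here are placeholders; the file's other existing settings should be left intact):

```groovy
params {
    // User-supplied inputs (placeholder paths)
    sample_sheet = "/path/to/sample_sheet.csv"
    readsdir     = "/path/to/fastq_input"
}
```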

FASTQ

This pipeline expects FASTQ files to be structured inside an input directory with subfolders for each sequencing lab and then further subfolders for each run ID. For most labs the share directory can be used directly, however samples from Exeter require symlinking. An example input directory structure can be seen below:

input
├── <LAB1>
│  ├── <RUN1>
│  │  ├── SAMPLE_R1_L002_001.fastq.gz
│  │  └── SAMPLE_R2_L002_001.fastq.gz
│  └── <RUN2>
│     ├── SAMPLE_R1_L002_001.fastq.gz
│     └── SAMPLE_R2_L002_001.fastq.gz
└── <LAB2>
   └── <RUN1>
      ├── SAMPLE_L001_R1_001.fastq.gz
      ├── SAMPLE_L001_R2_001.fastq.gz
      ├── ...
      ├── SAMPLE_L004_R1_001.fastq.gz
      └── SAMPLE_L004_R2_001.fastq.gz
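The two labs above illustrate that the FASTQ naming can differ (R1/R2 before or after the lane number), but both follow <readsdir>/<LAB>/<RUN>/<files>. A minimal sketch of how a glob pattern for a sample's paired reads could be built from those components (the function name and pattern here are illustrative, not the pipeline's actual pattern-building code):

```python
from pathlib import Path

def fastq_pattern(readsdir, lab, run_id, sample_id):
    """Build a glob pattern for a sample's paired-end FASTQ files under
    <readsdir>/<lab>/<run_id>/. The [12] character class matches both
    R1 and R2, wherever they sit relative to the lane number."""
    return str(Path(readsdir) / lab / run_id / f"{sample_id}*_R[12]*.fastq.gz")

pattern = fastq_pattern("input", "LAB1", "RUN1", "SAMPLE")
```

Matching files this loosely means both naming layouts shown above resolve with one pattern.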

Sample sheet CSV

The CSV sample sheet is required, as it tells the pipeline which samples should be analysed. It currently requires six fields:

  1. sample_id
  2. sample_site_code
  3. timestamp_sample_collected
  4. sequencing_lab_code
  5. sequencing_sample_id
  6. sequencing_run_id

These are used to find the input FASTQ files in the readsdir. All fields are passed through to the aggregation steps at the end of the pipeline.
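A minimal sketch of parsing the sample sheet and checking that the six required fields are present (assuming the header names exactly as listed above; this is not the pipeline's own parsing code):

```python
import csv
import io

REQUIRED = [
    "sample_id", "sample_site_code", "timestamp_sample_collected",
    "sequencing_lab_code", "sequencing_sample_id", "sequencing_run_id",
]

def read_sample_sheet(handle):
    """Parse the sample sheet CSV, failing early if a required field is absent."""
    reader = csv.DictReader(handle)
    missing = [f for f in REQUIRED if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"sample sheet missing fields: {missing}")
    return list(reader)

# Example with an in-memory sheet containing one sample:
sheet = io.StringIO(
    "sample_id,sample_site_code,timestamp_sample_collected,"
    "sequencing_lab_code,sequencing_sample_id,sequencing_run_id\n"
    "S1,SITE01,2021-06-01T00:00:00,LAB1,SEQ_S1,RUN1\n"
)
rows = read_sample_sheet(sheet)
```

Validating the header up front avoids silent KeyError-style failures later, when the lab code and run ID are used to locate FASTQ files.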

Output

Output files are written, by default, to a results directory in the location the pipeline is called from. This folder is organised by sequencing lab, run, and pipeline step, like so:

results
├── aggregated.csv
├── all_lineages.csv
├── <LAB1>
│  ├── <RUN1>
│  │  ├── alignments
│  │  ├── ampliconstats
│  │  ├── bamclipper
│  │  ├── freyja
│  │  ├── mosdepth
│  │  ├── stats_csv
│  │  └── trimmed
│  └── <RUN2>
│     ├── alignments
│     ├── ampliconstats
│     ├── bamclipper
│     ├── freyja
│     ├── mosdepth
│     ├── stats_csv
│     └── trimmed
└── <LAB2>
   └── <RUN1>
      ├── alignments
      ├── ampliconstats
      ├── bamclipper
      ├── freyja
      ├── mosdepth
      ├── stats_csv
      └── trimmed

Outputs organised in directories under a <RUN ID> are the raw outputs from the steps in the pipeline summary. The aggregated outputs are placed at the top level, as these combine data from all of the sequencing labs and runs.

aggregated.csv

This CSV file aggregates the WHO lineages, their frequencies, and sequencing depths for all samples that complete analysis. As multiple lineages may be present in a sample, multiple rows can be returned for a single sample.

| Column | Description |
| --- | --- |
| amplicon_mean | Mean coverage over all amplicons, including zeros |
| non_zero_amplicon_mean | Mean coverage over amplicons, excluding zeros |
| amplicon_median | Median coverage over all amplicons, including zeros |
| non_zero_amplicon_median | Median coverage over amplicons, excluding zeros |
| count_gte_20 | Count of amplicons with at least (≥) 20× coverage |
| count_lt_20 | Count of amplicons with less than (<) 20× coverage |
| stdev | Standard deviation of coverage over all amplicons |
| non_zero_stdev | Standard deviation of coverage over all amplicons, excluding zeros |
| lineage | WHO lineage assigned by Freyja |
| abundance | Abundance of this WHO lineage |
| mean_genome_coverage | Mean coverage over the whole genome, from mosdepth |
| sample_id | Sample ID used in the pipeline |
| sample_site_code | Sample site location code |
| timestamp_sample_collected | Timestamp the sample was collected |
| sequencing_lab_code | Sequencing lab |
| original_sample_id | Original metadata sample ID |
| sequencing_sample_id | Sample ID used in the pipeline |
| sequencing_run_id | Run ID for the sample |
| amplicon_001_mean_depth | Mean coverage over this individual amplicon |
| ... | ... |
| amplicon_154_mean_depth | Repeated for all 154 amplicons |
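The amplicon summary columns can be reproduced from the per-amplicon mean depths alone. A sketch of how they relate (an illustrative reimplementation following the column descriptions above, not the pipeline's own collate_results.py; whether the pipeline uses population or sample standard deviation is an assumption here):

```python
import statistics

def amplicon_summary(depths):
    """Summarise per-amplicon mean depths into the aggregated.csv
    coverage columns described above."""
    non_zero = [d for d in depths if d > 0]
    return {
        "amplicon_mean": statistics.mean(depths),
        "non_zero_amplicon_mean": statistics.mean(non_zero) if non_zero else 0,
        "amplicon_median": statistics.median(depths),
        "non_zero_amplicon_median": statistics.median(non_zero) if non_zero else 0,
        "count_gte_20": sum(1 for d in depths if d >= 20),
        "count_lt_20": sum(1 for d in depths if d < 20),
        # Population stdev assumed; the pipeline may use sample stdev instead.
        "stdev": statistics.pstdev(depths),
        "non_zero_stdev": statistics.pstdev(non_zero) if non_zero else 0,
    }

# Four amplicons, one with zero coverage:
summary = amplicon_summary([0, 10, 30, 40])
```

Note how the zero-depth amplicon pulls amplicon_mean down relative to non_zero_amplicon_mean, which is why both variants are reported.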

all_lineages.csv

This CSV file aggregates Pango lineages that are assigned by Freyja. It is a more fine-grained breakdown of the sample composition than the WHO lineages.

| Column | Description |
| --- | --- |
| lineage | Pango lineage assigned by Freyja |
| abundance | Abundance of this lineage |
| sample_id | Sample ID used in the pipeline |
| sample_site_code | Sample site location code |
| timestamp_sample_collected | Timestamp the sample was collected |
| sequencing_lab_code | Sequencing lab |
| original_sample_id | Original metadata sample ID |
| sequencing_sample_id | Sample ID used in the pipeline |
| sequencing_run_id | Run ID for the sample |
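Because each sample contributes one row per detected lineage, a useful sanity check on this file is that the abundances for a sample sum to (at most) 1. A sketch using hypothetical rows in the shape of all_lineages.csv:

```python
from collections import defaultdict

# Hypothetical rows in the shape of all_lineages.csv
rows = [
    {"sample_id": "S1", "lineage": "BA.1", "abundance": 0.7},
    {"sample_id": "S1", "lineage": "BA.2", "abundance": 0.3},
    {"sample_id": "S2", "lineage": "B.1.617.2", "abundance": 1.0},
]

# Sum abundances per sample to confirm each composition is complete.
totals = defaultdict(float)
for row in rows:
    totals[row["sample_id"]] += row["abundance"]
```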

ww_nf_minimal's People

Contributors: alexomics
Watchers: Rory Munro

ww_nf_minimal's Issues

Some problems with finding files

Hi Loose Lab! I'm based with the team in Edinburgh doing the SARS-CoV-2 and other waste water sequencing, and thought I'd give this pipeline a go at the suggestion of Anish.

I've noticed a bit of a bug (for us at least, presumably you have things upstream that mean this isn't an issue), when setting up the input data structure it looks something like input/EDIN_UNI/RUN001/fastqs as suggested in the docs. The sample sheet then asks for the sequencing_lab_code and sequencing_run_id, in this case EDIN_UNI and RUN001. So far so good.

However, the metadata handling "lowercases" the sequencing_lab_code, so when the path is built with getInputFilePattern it outputs input/edin_uni/RUN001/fastqs, and then no files are found when the pipeline looks. Not finding a sample doesn't raise any warnings, and the pipeline goes on to complete "successfully".

I realised something was up because none of the samples had anything in them after completing. What could be more risky is that if just one sample in a run produced no output because of an incorrect name entry in the sample sheet, I don't think you would necessarily know. Apart from that, though, all seems to be working well!

All the best,
Danny
