Wastewater analysis

Introduction

ww_nf_minimal is a bioinformatics analysis pipeline that performs initial quality control and variant analysis on wastewater sequencing samples. It supports Illumina short reads prepared with the Nimagen primer scheme and sequenced on a range of platforms (NovaSeq, NextSeq, MiSeq).

Pipeline summary

  1. Merge sequencing FASTQ files (pigz)
  2. Adapter trimming (fastp)
  3. Variant calling
    1. Read alignment (bwa mem)
    2. Sort and index alignments (Samtools)
    3. Primer sequence removal (BAMClipper)
    4. Genome-wide and amplicon coverage (mosdepth, Samtools ampliconstats)
    5. Variant calling (freyja variants/demix; samples with coverage too low for this step are omitted from further analysis in the pipeline, but are not excluded from its other outputs)
    6. Extract WHO and pango lineages (collate_results.py, collate_lineages.py)
    7. Aggregate all sample outputs (xsv)

Quickstart

This pipeline uses conda for environment and package management (miniconda is recommended).

Initialise environment

With [mini]conda installed:

git clone https://github.com/LooseLab/ww_nf_minimal
cd ww_nf_minimal
conda env create -f environment.yml

Run test profile

conda activate ww_minimal
nextflow run main.nf -profile test

Running an actual run

After successfully running the test subset you can attempt a run on other samples. Read the input section for how to set up the FASTQ directory and sample sheet. Once these are in place the pipeline can be run like so:

nextflow run /path/to/main.nf --readsdir <FASTQ INPUT DIRECTORY> --sample_sheet <SAMPLE SHEET CSV> -with-report report.html

If nextflow crashes while running, you can add the -resume flag to the previous command; nextflow will then reuse cached jobs so the entire pipeline does not need to be re-run.

Input and Output

Input

There are two required user-supplied inputs: the sample sheet and the FASTQ reads directory. These can be supplied either by editing the nextflow.config file, adding the sample_sheet and readsdir attributes to the params block, or on the command line using --sample_sheet and --readsdir. In addition, three static inputs are provided with the workflow (these may change in the future as the primer scheme changes): the reference genome, the paired-end primer file, and the amplicon primer file.
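For example, the params block in nextflow.config might be edited along these lines (the paths here are placeholders; the file's other existing settings should be left intact):

```groovy
params {
    // User-supplied inputs (placeholder paths)
    sample_sheet = "/path/to/sample_sheet.csv"
    readsdir     = "/path/to/fastq_input"
}
```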

FASTQ

This pipeline expects FASTQ files to be structured inside an input directory with subfolders for each sequencing lab and then further subfolders for each run ID. For most labs the share directory can be used directly, however samples from Exeter require symlinking. An example input directory structure can be seen below:

input
├── <LAB1>
│  ├── <RUN1>
│  │  ├── SAMPLE_R1_L002_001.fastq.gz
│  │  └── SAMPLE_R2_L002_001.fastq.gz
│  └── <RUN2>
│     ├── SAMPLE_R1_L002_001.fastq.gz
│     └── SAMPLE_R2_L002_001.fastq.gz
└── <LAB2>
   └── <RUN1>
      ├── SAMPLE_L001_R1_001.fastq.gz
      ├── SAMPLE_L001_R2_001.fastq.gz
      ├── ...
      ├── SAMPLE_L004_R1_001.fastq.gz
      └── SAMPLE_L004_R2_001.fastq.gz
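The two labs above illustrate that the FASTQ naming can differ (R1/R2 before or after the lane number), but both follow <readsdir>/<LAB>/<RUN>/<files>. A minimal sketch of how a glob pattern for a sample's paired reads could be built from those components (the function name and pattern here are illustrative, not the pipeline's actual pattern-building code):

```python
from pathlib import Path

def fastq_pattern(readsdir, lab, run_id, sample_id):
    """Build a glob pattern for a sample's paired-end FASTQ files under
    <readsdir>/<lab>/<run_id>/. The [12] character class matches both
    R1 and R2, wherever they sit relative to the lane number."""
    return str(Path(readsdir) / lab / run_id / f"{sample_id}*_R[12]*.fastq.gz")

pattern = fastq_pattern("input", "LAB1", "RUN1", "SAMPLE")
```

Matching files this loosely means both naming layouts shown above resolve with one pattern.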

Sample sheet CSV

The CSV sample sheet is required, as it tells the pipeline which samples should be analysed. It currently requires six fields:

  1. sample_id
  2. sample_site_code
  3. timestamp_sample_collected
  4. sequencing_lab_code
  5. sequencing_sample_id
  6. sequencing_run_id

These are used to find the input FASTQ files in the readsdir. All fields are passed through to the aggregation steps at the end of the pipeline.
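A minimal sketch of parsing the sample sheet and checking that the six required fields are present (assuming the header names exactly as listed above; this is not the pipeline's own parsing code):

```python
import csv
import io

REQUIRED = [
    "sample_id", "sample_site_code", "timestamp_sample_collected",
    "sequencing_lab_code", "sequencing_sample_id", "sequencing_run_id",
]

def read_sample_sheet(handle):
    """Parse the sample sheet CSV, failing early if a required field is absent."""
    reader = csv.DictReader(handle)
    missing = [f for f in REQUIRED if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"sample sheet missing fields: {missing}")
    return list(reader)

# Example with an in-memory sheet containing one sample:
sheet = io.StringIO(
    "sample_id,sample_site_code,timestamp_sample_collected,"
    "sequencing_lab_code,sequencing_sample_id,sequencing_run_id\n"
    "S1,SITE01,2021-06-01T00:00:00,LAB1,SEQ_S1,RUN1\n"
)
rows = read_sample_sheet(sheet)
```

Validating the header up front avoids silent KeyError-style failures later, when the lab code and run ID are used to locate FASTQ files.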

Output

Output files are written, by default, to a results directory in the location the pipeline is called from. This folder is organised by sequencing lab, run, and pipeline step, like so:

results
├── aggregated.csv
├── all_lineages.csv
├── <LAB1>
│  ├── <RUN1>
│  │  ├── alignments
│  │  ├── ampliconstats
│  │  ├── bamclipper
│  │  ├── freyja
│  │  ├── mosdepth
│  │  ├── stats_csv
│  │  └── trimmed
│  └── <RUN2>
│     ├── alignments
│     ├── ampliconstats
│     ├── bamclipper
│     ├── freyja
│     ├── mosdepth
│     ├── stats_csv
│     └── trimmed
└── <LAB2>
   └── <RUN1>
      ├── alignments
      ├── ampliconstats
      ├── bamclipper
      ├── freyja
      ├── mosdepth
      ├── stats_csv
      └── trimmed

Outputs organised in directories under a <RUN ID> are the raw outputs from the steps in the pipeline summary. The aggregated outputs are placed at the top level, as these combine data from all of the sequencing labs and runs.

aggregated.csv

This CSV file aggregates the WHO lineages, their frequencies, and sequencing depths for all samples that complete analysis. As multiple lineages may be present in a sample, multiple rows can be returned for a single sample.

| Column | Description |
| --- | --- |
| amplicon_mean | Mean coverage over all amplicons, including zeros |
| non_zero_amplicon_mean | Mean coverage over amplicons, excluding zeros |
| amplicon_median | Median coverage over all amplicons, including zeros |
| non_zero_amplicon_median | Median coverage over amplicons, excluding zeros |
| count_gte_20 | Count of amplicons with at least (≥) 20× coverage |
| count_lt_20 | Count of amplicons with less than (<) 20× coverage |
| stdev | Standard deviation of coverage over all amplicons |
| non_zero_stdev | Standard deviation of coverage over all amplicons, excluding zeros |
| lineage | WHO lineage assigned by Freyja |
| abundance | Abundance of this WHO lineage |
| mean_genome_coverage | Mean coverage over the whole genome, from mosdepth |
| sample_id | Sample ID used in the pipeline |
| sample_site_code | Sample site location code |
| timestamp_sample_collected | Timestamp the sample was collected |
| sequencing_lab_code | Sequencing lab |
| original_sample_id | Original metadata sample ID |
| sequencing_sample_id | Sample ID used in the pipeline |
| sequencing_run_id | Run ID for the sample |
| amplicon_001_mean_depth | Mean coverage over this individual amplicon |
| ... | ... |
| amplicon_154_mean_depth | Repeated for all 154 amplicons |
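The amplicon summary columns can be reproduced from the per-amplicon mean depths alone. A sketch of how they relate (an illustrative reimplementation following the column descriptions above, not the pipeline's own collate_results.py; whether the pipeline uses population or sample standard deviation is an assumption here):

```python
import statistics

def amplicon_summary(depths):
    """Summarise per-amplicon mean depths into the aggregated.csv
    coverage columns described above."""
    non_zero = [d for d in depths if d > 0]
    return {
        "amplicon_mean": statistics.mean(depths),
        "non_zero_amplicon_mean": statistics.mean(non_zero) if non_zero else 0,
        "amplicon_median": statistics.median(depths),
        "non_zero_amplicon_median": statistics.median(non_zero) if non_zero else 0,
        "count_gte_20": sum(1 for d in depths if d >= 20),
        "count_lt_20": sum(1 for d in depths if d < 20),
        # Population stdev assumed; the pipeline may use sample stdev instead.
        "stdev": statistics.pstdev(depths),
        "non_zero_stdev": statistics.pstdev(non_zero) if non_zero else 0,
    }

# Four amplicons, one with zero coverage:
summary = amplicon_summary([0, 10, 30, 40])
```

Note how the zero-depth amplicon pulls amplicon_mean down relative to non_zero_amplicon_mean, which is why both variants are reported.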

all_lineages.csv

This CSV file aggregates Pango lineages that are assigned by Freyja. It is a more fine-grained breakdown of the sample composition than the WHO lineages.

| Column | Description |
| --- | --- |
| lineage | Pango lineage assigned by Freyja |
| abundance | Abundance of this lineage |
| sample_id | Sample ID used in the pipeline |
| sample_site_code | Sample site location code |
| timestamp_sample_collected | Timestamp the sample was collected |
| sequencing_lab_code | Sequencing lab |
| original_sample_id | Original metadata sample ID |
| sequencing_sample_id | Sample ID used in the pipeline |
| sequencing_run_id | Run ID for the sample |
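Because each sample contributes one row per detected lineage, a useful sanity check on this file is that the abundances for a sample sum to (at most) 1. A sketch using hypothetical rows in the shape of all_lineages.csv:

```python
from collections import defaultdict

# Hypothetical rows in the shape of all_lineages.csv
rows = [
    {"sample_id": "S1", "lineage": "BA.1", "abundance": 0.7},
    {"sample_id": "S1", "lineage": "BA.2", "abundance": 0.3},
    {"sample_id": "S2", "lineage": "B.1.617.2", "abundance": 1.0},
]

# Sum abundances per sample to confirm each composition is complete.
totals = defaultdict(float)
for row in rows:
    totals[row["sample_id"]] += row["abundance"]
```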

ww_nf_minimal's People

Contributors: alexomics
Watchers: Rory Munro

ww_nf_minimal's Issues

Some problems with finding files

Hi Loose Lab! I'm based with the team in Edinburgh doing the SARS-CoV-2 and other waste water sequencing, and thought I'd give this pipeline a go at the suggestion of Anish.

I've noticed a bit of a bug (for us at least, presumably you have things upstream that mean this isn't an issue), when setting up the input data structure it looks something like input/EDIN_UNI/RUN001/fastqs as suggested in the docs. The sample sheet then asks for the sequencing_lab_code and sequencing_run_id, in this case EDIN_UNI and RUN001. So far so good.

However, the metadata handling "lowercases" the sequencing_lab_code, so when the path is built with getInputFilePattern it outputs input/edin_uni/RUN001/fastqs, and then no files are found when the pipeline looks. Not finding a sample doesn't raise any warnings, and the pipeline goes on to complete "successfully".

I realised something was up because none of the samples had anything in them after completing. What could be more risky is that if just one sample in a run produced no output because of an incorrect name entry in the sample sheet, I don't think you would necessarily know. Apart from that, though, all seems to be working well!

All the best,
Danny
