Metagenomics standard operating procedure - binning

Created by Jakob Nybo Nissen, Technical University of Denmark, 2018-01-09, based on a similar pipeline by Simon Rasmussen, Technical University of Denmark. Please contact [email protected] for bug fixes and feature requests.

It is part of the mgsop pipeline, along with the assembly module.

This script bundles together various command line tools semi-intelligently using Snakemake. The input is a set of FASTQ files from metagenomic sequencing (assumed to be Illumina, although it probably doesn't matter). The output is a directory with the outputs and log files from the assembly, binning and intermediate steps.

Module binning

1. Run module assembly unless the contigs are already present

2. Concatenate all contigs to one file (the contig catalogue)

3. Run contigAssembler to deduplicate and merge the contig catalogue

4. BWA index the contig catalogue to enable mapping

5. BWA mem the reads from each experiment to the contig catalogue

6. Various samtools steps to remove PCR duplicates and sort the BAM files

7. For all samples with more than one experiment, merge the sorted BAM files to sample level

8. Calculate the mapping depths across samples with METABAT2

9. Bin the contigs using METABAT2 (see the command sketch after this list)
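
For orientation, the mapping, depth and binning steps (4-9) correspond roughly to the commands sketched below. This is an illustration only; the exact flags, file names and resource settings used by the snakefiles may differ, and the PCR-duplicate removal in step 6 is omitted here (see remove_duplicates under Configuration options).

    # Illustrative sketch of steps 4-9, not the pipeline's exact commands
    bwa index contigs/contigs.fna                                       # 4. index the contig catalogue
    bwa mem -t 28 contigs/contigs.fna exp1.fw.fq.gz exp1.rv.fq.gz \
        | samtools sort -o exp1.sorted.bam -                            # 5-6. map reads and sort the BAM
    samtools merge sample1.bam exp1.sorted.bam exp2.sorted.bam          # 7. merge experiments to sample level
    jgi_summarize_bam_contig_depths --outputDepth depths.txt *.bam      # 8. depth table (tool ships with MetaBAT2)
    metabat2 -i contigs/contigs.fna -a depths.txt -o bins/bin -m 2500   # 9. bin the contigs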

Quickstart

See the Workflow in detail section for the definitions of sample, experiment and run.

In all cases, make sure you are running Python 3.5 or newer and that the command snakemake is in your PATH. (On Computerome, you can load the module anaconda3/4.0.0.)
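
A quick way to check (the module name below is the Computerome one mentioned above; adjust for your own system):

    module load anaconda3/4.0.0   # on Computerome; provides Python 3 and snakemake
    python --version              # should report 3.5 or newer
    which snakemake               # should print a path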

  1. Create a configuration file in YAML format with the output directories and data specified, plus any additional optional configuration. Use the dummy.yaml file as a template. It should look something like this:
assembly_workdir: /path/to/output/assembly
binning_workdir: /path/to/output/binning

data:
    - run1  exp1    sample1 /path/to/data/sample1.forward.fq.gz /path/to/data/sample1.reverse.fq.gz
    - run2  exp2    sample2 /path/to/data/sample2.forward.fq.gz /path/to/data/sample2.reverse.fq.gz
  2. Run the bin.py script WITHOUT the --commit flag. Look through all the planned commands. Does it look okay?

    python /path/to/script/bin.py my_config.yaml my_groupname --chain > planned_commands.txt

  3. If it looks good, run the same command with the --commit flag. The pipeline will now execute.

    python /path/to/script/bin.py my_config.yaml my_groupname --chain --commit 2> status.txt

Workflow in detail

To understand the workflow, you need to understand the terminology borrowed from the European Nucleotide Archive regarding samples, experiments and runs:

In an analysis, different samples refer to data generated from different biological material, or data which for other reasons must be considered separate "sources" of data throughout the workflow.

A sample contains one or more experiments. Different files from the same experiment have been generated by the exact same experimental procedure, at the same time, from the same sample, et cetera. In contrast, if there are any relevant differences between two sets of data, such as extraction protocol, sequencing machine or anything else, they are different experiments. Hence, all data within the same experiment is considered homogeneous.

Lastly, different runs are simply different files representing parts of the same data. They can be merged or split at will. One run refers to either one FASTQ file (for single end) or a pair of FASTQ files (for paired end).

Installation

Download the directory. If you also want to run the other modules (like assembly), put the binning directory and the assembly directory in one common directory. This ensures that one module can access the others.

In the config.yaml file, make sure that the number of CPU cores ("cores") is the correct maximum per job for the machine/cluster you are using, and that all the paths to the executables are absolute and indeed refer to working executables.

Currently, this script assumes access to a cluster where jobs can be submitted with qsub. If this is not the case, the "template" string in bin.py must be changed to the desired Snakemake command.

Configuration options

This script uses sane defaults, so running without configuration will usually give decent results. However, in some circumstances the results can be improved by passing options to the script.

This is done by modifying the config file or, alternatively, by adding key=value pairs after the -c argument, like so:

python /path/to/script/bin.py my_config.yaml my_groupname --chain -c cores=18 assembler=megahit

Find below a complete list of options that can be passed:

cores=[28]

Number of CPU cores to use per job on the system you are running on.

adapterremoval_mm=[3]

Maximum number of mismatches between an observed sequence and an adapter for it to be recognized as such. Should generally not be changed.

assembler=[spades] (Can be spades or megahit)

SPAdes gives better results than MEGAHIT on almost all data. MEGAHIT is preferable if you are interested in micro-diversity, as it preserves this better than SPAdes. MEGAHIT is also much quicker than SPAdes and uses less memory, so you can use it if you hit the hard RAM limit on your machine with SPAdes.

assemble_contigs=[true]

Assembles the contigs from different samples further by BLASTing them against each other, then merging contigs with high identity at the termini (98% identity over 100 bp).
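
For illustration, the underlying comparison is an all-against-all nucleotide BLAST of the catalogue against itself, roughly as sketched below; the actual invocation and the merging logic live in contigAssembler and may use different parameters.

    makeblastdb -in contigs/contigs.fna -dbtype nucl -out contigs_db
    blastn -query contigs/contigs.fna -db contigs_db -outfmt 6 -perc_identity 98 -out self_hits.tsv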

spades_kmers=[21,33,55,77,99]

K-mer sizes for the SPAdes De Bruijn graph iterations. Too small k-mers give complicated graphs from which contigs cannot easily be extracted, leading to short contigs. Any overlap between reads smaller than k will not be assembled, so too large values are not suitable either. Longer reads and higher sequencing depth favor larger k-mers. The default is suitable for ~150 bp reads and metagenomic data (which implies low coverage of most sequences).

megahit_kmers=[21,33,55,77,99]

K-mer sizes for the MEGAHIT De Bruijn graph iterations. See spades_kmers.

megahit_mink_1pass=[false]

Change to "true" for 30% usage memory reduction when assembling with MEGAHIT. This is about 10% slower and a few percent worse in terms of assembly quality. This is only recommended to be true when memory is a problem even with MEGAHIT.

min_contig_length=[2500]

Minimum contig length for contig filtering and for Metabat.

A higher threshold stabilizes the tetranucleotide frequencies and contig depths and improves binning quality, but more data is excluded from the binning. There is no single best setting here: Metabat recommends an absolute minimum of 1500, and a higher number will increase the quality of the bins at the cost of throwing more data away. For best results, try different settings from 1500 to 2500.

remove_duplicates=[true]

It's common to PCR amplify the fragmented DNA before sequencing if there is not a lot of material. This causes some fragments to be copied and sequenced multiple times, biasing the result. With the default setting of true, if multiple read pairs map to the same 5' positions, only the best-scoring pair is kept.
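
For reference, a common way to do this with samtools alone looks like the sketch below (requires samtools 1.6 or newer; the snakefiles may use a different combination of commands):

    samtools sort -n -o exp1.namesorted.bam exp1.bam           # sort by read name
    samtools fixmate -m exp1.namesorted.bam exp1.fixmate.bam   # add mate-score tags used by markdup
    samtools sort -o exp1.possorted.bam exp1.fixmate.bam       # sort by coordinate
    samtools markdup -r exp1.possorted.bam exp1.dedup.bam      # drop duplicates, keeping one read pair per position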

More settings can fairly easily be added - if you find yourself needing one, please send me a mail and I might incorporate it into the snakefiles!

Advanced usage

Adding new samples/experiments

Simply update your YAML config file with the new data and rerun the pipeline.

Partial runs

Invoke bin.py with the -t flag to stop the pipeline early. The pipeline will then only run until the creation of the target file.

Manually changing some of the output files

Let's say you want to do some custom filtering of the contig catalogue. First, run the pipeline up to and including the creation of the contig catalogue by passing -t contigs/contigs.fna. Then make the manual changes you wish. Finally, rerun the pipeline without the -t flag, as shown in the sketch below.
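
Put together, and reusing the flags and file names from the quickstart example, the sequence might look like this:

    # 1. Run up to and including the contig catalogue
    python /path/to/script/bin.py my_config.yaml my_groupname --chain -t contigs/contigs.fna --commit
    # 2. Edit contigs/contigs.fna by hand as desired
    # 3. Rerun the full pipeline without -t
    python /path/to/script/bin.py my_config.yaml my_groupname --chain --commit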

Setting non-settable parameters

You need to change things in the snakefile itself. If it's just replacing one number with another, it's not too bad.

Requirements

The script requires Snakemake, which is included in Anaconda3, or can be found on PyPI and installed with pip. Snakemake requires Python 3.
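
For example, to install it from PyPI:

    pip install --user snakemake    # use the pip that belongs to your Python 3 installation
    snakemake --version             # verify it ends up on your PATH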

It also requires the following programs to be installed:

  • adapterremoval
  • bwa
  • samtools
  • SPAdes OR MEGAHIT
  • metabat2
  • blastn (currently the shell command blastn must work)

As well as basic UNIX tools such as cat, paste etc.

Good to know

  • For safety, you can't execute multiple snakemake runs in the same directory.
  • To save disk space, you can delete intermediate files after each run.
  • A .snakemake directory is created when you run Snakemake. This can grow big after many runs and might need to be regularly deleted.
  • 99 GB of RAM is allocated for assembly, the maximum a thin node can support. If this is not enough, you must manually assemble on a fat node.

Troubleshooting

The pipeline won't start and instead raises an obscure exception

I've tried to write helpful error messages for each error. Make sure to read the entire error output carefully, beginning from the top. Often there will be a plain (white) printed message before the red error text.

My Snakemake run won't start, and there's no printed message.

It's likely that a previous run exited while holding a lock on your output directory. Check the stderr to see if that's the case. Unlock by running snakemake with the --unlock argument, then rerun the pipeline.
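
A minimal sketch of the unlock step, assuming you call snakemake directly against the pipeline's working directory (the snakefile path here is illustrative; use whichever snakefile the pipeline invokes):

    snakemake --unlock -s /path/to/script/binning.snakefile -d /path/to/output/binning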

The pipeline crashed somewhere in the middle

Go through the stderr of the pipeline and find the command that failed. Then find the corresponding error message and find out what went wrong.

The assembly failed.

Check if it ran out of memory by looking at the assembly output/log files. MEGAHIT will raise error code -9 near the end; SPAdes will specifically state that it ran out of memory.

How can I reduce assembly memory consumption?

Try the following steps:

  • Run Megahit instead of SPAdes (check the results on one sample first)
  • Choose option megahit_mink_1pass=true with Megahit.
  • Split your experiments and runs (splits assembly into more parts).

Snakemake raises ProtectedOutputException

You don't have write access to the relevant output. This prevents Snakemake from re-creating or updating the output. Make sure you have write access.
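
For example, to give yourself write access to the binning output directory from the configuration example above:

    chmod -R u+w /path/to/output/binning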

Samtools sort crashed with error code 137 and no helpful error message

This is likely an out-of-memory (OOM) error caused by a bug in samtools sort where it uses more memory than it is allowed to. It should be fixed in samtools v1.5. If you encounter the problem with a newer version, mail me.
