torch-consortium / magma

A pipeline for comprehensive genomic analyses of Mycobacterium tuberculosis with a focus on clinical decision making as well as research

Home Page: https://doi.org/10.1371/journal.pcbi.1011648

License: GNU General Public License v3.0

Nextflow 70.57% Makefile 0.25% Shell 1.22% Dockerfile 0.72% Python 27.23%
genomics pipeline mycobacterium-tuberculosis tuberculosis nextflow-pipeline nextflow

magma's Introduction

MAGMA

MAGMA (Maximum Accessible Genome for Mtb Analysis) is a pipeline for comprehensive genomic analyses of Mycobacterium tuberculosis with a focus on clinical decision making as well as research.

Salient features of the implementation

  • Fine-grained control over resource allocation (CPU/Memory/Storage)
  • Reliance on Bioconda for installing packages, ensuring reproducibility
  • Ease of use on a range of infrastructure (cloud, on-prem HPC clusters, servers, or local machines)
  • Resumability for failed processes
  • Centralized locations for specifying analysis parameters and hardware requirements
    • MAGMA parameters (default_parameters.config, which can be overridden using a params.yaml file)
    • Hardware requirements (conf/server.config or conf/pbs.config or conf/low_memory.config)
    • Execution (software) requirements (conf/docker.config or conf/conda_local.config or conf/podman.config)

Prerequisites

Nextflow

  • git: the version control system used by the pipeline.
  • Java 11 or Java 17 (LTS release preferred)

⚠️ Check your Java version! The Java version should NOT be an internal JDK release. You can check the release via java -version

$ java -version
openjdk version "17.0.7" 2023-04-18 LTS
OpenJDK Runtime Environment (build 17.0.7+7-LTS)
OpenJDK 64-Bit Server VM (build 17.0.7+7-LTS, mixed mode, sharing)
  • Download Nextflow
$ curl -s https://get.nextflow.io | bash
  • Make Nextflow executable
$ chmod +x nextflow
  • Add nextflow to your path (for example /usr/local/bin/)
$ mv nextflow /usr/local/bin
  • Sanity check for nextflow installation
$ nextflow info

  Version: 23.04.1 build 5866
  Created: 15-04-2023 06:51 UTC (08:51 SAST)
  System: Mac OS X 12.6.5
  Runtime: Groovy 3.0.16 on OpenJDK 64-Bit Server VM 17.0.7+7-LTS
  Encoding: UTF-8 (UTF-8)

✔️ With this, you're all set with Nextflow. Next stop: conda or docker - pick one!

Samplesheet

A dummy samplesheet is provided here

The samplesheet structure should have the following fields.

Study,Sample,Library,Attempt,R1,R2,Flowcell,Lane,Index Sequence
Study_Name,S0001,1,1,full_path_to_directory_of_fastq_files/S0001_01_R1.fastq.gz,full_path_to_directory_of_fastq_files/S0001_01_R2.fastq.gz,1,1,1
Study_Name,S0002,1,1,full_path_to_directory_of_fastq_files/S0002_01_R1.fastq.gz,full_path_to_directory_of_fastq_files/S0002_01_R2.fastq.gz,1,1,1
Study_Name,S0003,1,1,full_path_to_directory_of_fastq_files/S0003_01_R1.fastq.gz,full_path_to_directory_of_fastq_files/S0003_01_R2.fastq.gz,1,1,1
Study_Name,S0004,1,1,full_path_to_directory_of_fastq_files/S0004_01_R1.fastq.gz,full_path_to_directory_of_fastq_files/S0004_01_R2.fastq.gz,1,1,1

Here's a formatted version of the CSV above

Study Sample Library Attempt R1 R2 Flowcell Lane Index Sequence
Study_Name S0001 1 1 full_path_to_directory_of_fastq_files/S0001_01_R1.fastq.gz full_path_to_directory_of_fastq_files/S0001_01_R2.fastq.gz 1 1 1
Study_Name S0002 1 1 full_path_to_directory_of_fastq_files/S0002_01_R1.fastq.gz full_path_to_directory_of_fastq_files/S0002_01_R2.fastq.gz 1 1 1
Study_Name S0003 1 1 full_path_to_directory_of_fastq_files/S0003_01_R1.fastq.gz full_path_to_directory_of_fastq_files/S0003_01_R2.fastq.gz 1 1 1
Study_Name S0004 1 1 full_path_to_directory_of_fastq_files/S0004_01_R1.fastq.gz full_path_to_directory_of_fastq_files/S0004_01_R2.fastq.gz 1 1 1

Customization

Note: We are currently working on the transition to the nf-core standard (see #188), which will add standardized configurations and pipeline structure and allow MAGMA to benefit from the nf-core/modules and nf-core/configs projects.

The pipeline parameters are distinct from Nextflow parameters; it is therefore recommended to provide them via a YAML file, as shown below.

# Sample contents of my_parameters_1.yml file

input_samplesheet: /path/to/your_samplesheet.csv
only_validate_fastqs: true
conda_envs_location: /path/to/folder/with/conda_envs

When running the pipeline, use profiles to ensure smooth execution on your computing system. The pipeline employs two types of profiles: execution environment and memory/computing requirements.

Execution environment profiles:

  • conda_local
  • docker
  • podman

Memory/computing profiles:

  • pbs (good for high performance computing clusters)
  • server (good for local servers)
  • low_memory (this can be run on a laptop, even one limited to 8 cores and 8 GB of RAM)

Advanced users: The MAGMA pipeline has default parameters for the minimum QC thresholds that samples must reach to be included in the cohort analysis. These default parameters are listed in default_params.config. Users wishing to adjust these parameters should specify the adjustments in the params.yml file supplied when launching the pipeline. An example of adjusted parameters is shown below:

# Sample contents of my_parameters_1.yml file

input_samplesheet: /path/to/your_samplesheet.csv
only_validate_fastqs: true
conda_envs_location: /path/to/folder/with/conda_envs
median_coverage_cutoff: 5
breadth_of_coverage_cutoff: 0.95
rel_abundance_cutoff: 0.65
ntm_fraction_cutoff: 0.40
site_representation_cutoff: 0.80

Note: The -profile mechanism is used to enable infrastructure-specific settings of the pipeline. The example below assumes you are using a conda-based setup.

The parameters file can be provided to the pipeline using the -params-file option, as shown below.

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' \
		 -profile conda_local,server \
		 -r v1.1.1 \
		 -params-file  my_parameters_1.yml

Analysis

Running MAGMA using Nextflow Tower

You can also use Seqera Platform (aka Nextflow Tower) to run the pipeline on any of the supported cloud platforms and to monitor the pipeline execution.

Please refer to the Tower docs for further information.
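For example, a rough sketch of launching MAGMA with Tower monitoring enabled (the access token value is a placeholder you generate in your Tower/Seqera account; the profiles and params file mirror the examples elsewhere in this README):

export TOWER_ACCESS_TOKEN="<your-tower-access-token>"

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' \
         -profile docker,server \
         -r v1.1.1 \
         -params-file my_parameters_1.yml \
         -with-tower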

Running MAGMA using conda

⚠️⚠️⚠️ We discourage running MAGMA via conda, as it is prone to challenging-to-reproduce errors

You can run the pipeline using the Conda, Mamba or Micromamba package managers to install all the prerequisite software from popular repositories such as bioconda and conda-forge.

ℹ️ Conda environments and cheatsheet:
You can find out the location of conda environments using conda env list. Here's a useful cheatsheet for conda operations.

You can use the conda-based setup of the pipeline to run MAGMA

  • On a local Linux machine (e.g. your laptop or a university server)
  • On an HPC cluster (e.g. SLURM, PBS) in case you don't have access to container systems like Singularity, Podman or Docker

All the requisite software is provided as conda recipes (i.e. yml files)

These files can be downloaded using the following commands

wget https://raw.githubusercontent.com/TORCH-Consortium/MAGMA/master/conda_envs/magma-env-2.yml
wget https://raw.githubusercontent.com/TORCH-Consortium/MAGMA/master/conda_envs/magma-env-1.yml

The conda environments are expected by the conda_local profile of the pipeline; it is recommended that you create them prior to using the pipeline, with the following commands. Note that if you have mamba (or micromamba) available, you can rely upon that instead of conda.

$ conda env create -n magma-env-1 --file magma-env-1.yml

$ conda env create -n magma-env-2 --file magma-env-2.yml

Once the environments are created, you can make use of the pipeline parameter conda_envs_location to inform the pipeline of the names and location of the conda envs.
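For example, a rough sketch of creating the environments inside the folder you intend to pass as conda_envs_location (/path/to/conda_envs is a placeholder; this assumes the pipeline looks the environments up by name inside that folder):

conda env create --prefix /path/to/conda_envs/magma-env-1 --file magma-env-1.yml
conda env create --prefix /path/to/conda_envs/magma-env-2 --file magma-env-2.yml

# and then, in your params.yml
# conda_envs_location: /path/to/conda_envs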

Next, you need to load the WHO resistance catalogue into tb-profiler; these are essentially the same instructions that are used to build the provided containers.

  1. Download resistance_db_who_v1.zip and unzip it
wget https://github.com/TORCH-Consortium/MAGMA/files/14559680/resistance_db_who_v1.zip

unzip resistance_db_who_v1.zip
  2. Activate magma-env-1, which has tb-profiler
conda activate magma-env-1
  3. Move inside that folder and use the tb-profiler load_library functionality to load the database
cd resistance_db_who

tb-profiler load_library ./resistance_db_who

A successful load would look like this image

Running MAGMA using docker

✔️✔️✔️This is the recommended execution strategy

We provide two Docker containers with the pipeline so that you can simply download them and run the pipeline. There is NO need to build any Docker containers yourself - just download them and enable the docker profile.

🚧 Container build script: The script used to build these containers is provided here.

Although you don't need to pull the containers manually, should you need to, you can use the following commands to pull the pre-built containers

docker pull ghcr.io/torch-consortium/magma/magma-container-1:1.1.1

docker pull ghcr.io/torch-consortium/magma/magma-container-2:1.1.1

📝 Have singularity or podman instead?:
If you do have access to Singularity or Podman, then owing to their compatibility with Docker, you can still use the provided docker containers.
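For example, a rough sketch of pulling the same images with Singularity (this converts them into local .sif files; adjust the version tag to the release you are using):

singularity pull docker://ghcr.io/torch-consortium/magma/magma-container-1:1.1.1
singularity pull docker://ghcr.io/torch-consortium/magma/magma-container-2:1.1.1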

Here's the nextflow command that should be used

nextflow run 'https://github.com/torch-consortium/magma' \
		 -params-file my_parameters_2.yml \
		 -profile docker,pbs \
		 -r v1.1.1 

💡 Hint:
You can use the -r option of Nextflow to work with any specific version/branch of the pipeline.

Running MAGMA on HPC and cloud executors

  1. For HPC-based execution of MAGMA, please refer to this doc.
  2. For cloud batch (AWS/Google/Azure) based execution of MAGMA, please refer to this doc.

MAGMA samplesheets

In order to run the MAGMA pipeline, you must provide a samplesheet as input. The structure of the samplesheet should match that of the example samplesheet provided in the repository.

⚠️ Make sure to use full (absolute) paths! (A quick sanity-check sketch is provided after the field descriptions below.)

  • Library
Certain samples may have had multiple libraries prepared.
This field allows the pipeline to distinguish between
different libraries of the same sample.
  • Attempt
Certain libraries may need to be sequenced multiple times.
This field allows the pipeline to distinguish between
different attempts of the same library.
  • Flowcell/Lane/Index Sequence
Providing this information may allow the VQSR filtering step
to better distinguish between true variants and sequencing
errors. Including these is optional; if unknown or irrelevant,
just fill them in with a '1', as shown in example_MAGMA_samplesheet.csv.
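As a quick sanity check before launching the pipeline, here is a rough sketch (not part of MAGMA; check_samplesheet.sh is a hypothetical name) that verifies the R1/R2 columns contain absolute paths to existing, non-empty files:

#!/usr/bin/env bash
# Hypothetical helper: check the R1/R2 columns of a MAGMA samplesheet
set -euo pipefail

samplesheet="${1:?Usage: check_samplesheet.sh <samplesheet.csv>}"

# Skip the header, then inspect columns 5 (R1) and 6 (R2)
tail -n +2 "$samplesheet" | while IFS=',' read -r study sample library attempt r1 r2 rest; do
    for f in "$r1" "$r2"; do
        [[ "$f" = /* ]] || echo "Not an absolute path: $f (sample $sample)"
        [[ -s "$f" ]]   || echo "Missing or empty file: $f (sample $sample)"
    done
done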

(Optional) GVCF datasets

We also provide some reference GVCF files which you could use for specific use-cases.

  • For small datasets (20 samples or less), we recommend that you download the EXIT_RIF GVCF files from https://zenodo.org/record/8054182, which contain a GVCF reference dataset of ~600 samples for augmenting smaller datasets

  • To include Mtb lineages and an outgroup (M. canettii) in the phylogenetic tree, you can download the LineagesAndOutgroup files from https://zenodo.org/record/8233518

# Parameters controlling the use of the reference GVCF dataset; set use_ref_exit_rif_gvcf
# to true and point the paths below at the downloaded files to enable it
use_ref_exit_rif_gvcf = false
ref_exit_rif_gvcf =  "/path/to/FILE.g.vcf.gz"
ref_exit_rif_gvcf_tbi =  "/path/to/FILE.g.vcf.gz.tbi"

💡 Custom GVCF dataset:
For creating a custom GVCF dataset, you can refer to the discussion here.

Tutorials and Presentations

Tim Huepink and Lennert Verboven created an in-depth tutorial on the variant-calling features of MAGMA:

Video

We have also included a presentation (in PDF format) of the logic and workflow of the MAGMA pipeline, as well as posters that have been presented at conferences. Please refer to the docs folder.

Interpretation

The results directory produced by MAGMA is as follows:

/path/to/results_dir/
├── QC_statistics
├── analyses
├── cohort
├── libraries
└── samples

QC Statistics Directory

In this directory you will find files related to the quality control carried out by the MAGMA pipeline. The structure is as follows:

/path/to/results_dir/QC_statistics
├── cohort
├── coverage
└── mapping
  • Cohort

Here you will find the joint.merged_cohort_stats.tsv, which contains the QC statistics for all samples in the samplesheet and allows users to determine why certain samples failed to be incorporated in the cohort analysis steps.

  • Coverage

Contains the GATK WGSMetrics outputs for each of the samples in the samplesheet

  • Mapping

Contains the FlagStat and samtools stats for each of the samples in the samplesheet

Analysis Directory

/path/to/results_dir/analysis
├── cluster_analysis
├── drug_resistance
├── phylogeny
└── snp_distances
  • Cluster analysis

Contains files related to clustering based on 5SNP and 12SNP cutoffs. The .figtree files can be imported directly into FigTree for visualisation.

  • Drug Resistance

Organised based on the different types of variants as well as combined results:

/path/to/results_dir/analysis/drug_resistance
├── combined_results
├── major_variants
├── minor_variants
└── structural_variants

Each of the directories containing results related to the different variants (major | minor | structural) has text files that can be used to annotate the .treefiles produced by MAGMA in iTOL (https://itol.embl.de).

The combined resistance results file contains a per-sample drug resistance summary based on the WHO Catalogue of Mtb mutations (https://www.who.int/publications/i/item/9789240082410)

MAGMA also notes the presence of all variants in tier 1 and tier 2 drug resistance genes.

  • Phylogeny

Contains the outputs of the IQTree phylogenetic tree construction.

📝 By default we recommend that you use the ExDRIncComplex files as MAGMA was optimized to be able to accurately call positions on the edges of complex regions in the Mtb genome

  • SNP distances

Contains the SNP distance tables.

📝 By default we recommend that you use the ExDRIncComplex files as MAGMA was optimized to be able to accurately call positions on the edges of complex regions in the Mtb genome

Cohort Directory

/path/to/results_dir/cohort
├── combined_variant_files
├── minor_variants
├── multiple_alignment_files
├── raw_variant_files
├── snp_variant_files
└── structural_variants
  • Combined variant files

Contains the cohort GVCFs based on major variants detected by the MAGMA pipeline

  • Minor variants

Merged VCFs of all samples, generated by LoFreq

  • Multiple alignment files

FASTA files for the generation of phylogenetic trees by IQTree

  • Raw variant files

Unfiltered indels and SNPs detected by the MAGMA pipeline

  • SNP variant files

Filtered SNPs detected by the MAGMA pipeline

  • Structural variant files

Unfiltered structural variants detected by the MAGMA pipeline

Libraries Directory

Contains files related to FASTQ validation and FastQC analysis

Samples Directory

Contains VCF files for major/minor/structural variants for each individual sample

Citations

The MAGMA paper has been published here: https://doi.org/10.1371/journal.pcbi.1011648

The XBS variant calling core was published here: https://doi.org/10.1099%2Fmgen.0.000689

Contributions and interactions with us

Contributions are warmly accepted! We encourage you to interact with us using the Discussions and Issues features of GitHub.

License

Please refer to the GPL 3.0 LICENSE file.

Here's a quick TL;DR of the license terms.

magma's People

Contributors

abhi18av, biosharp-ou, dependabot[bot], lennertverboven, timhhh, torch-uantwerpen, vrennie


magma's Issues

Feedback for PBS testing

Hey Tim,

I'm updating the modules for the tool paths on the cluster - as soon as I'm done, I'll ping you here so that you can start testing.

setup_conda_envs.sh odd rm command

@abhi18av @LennertVerboven
please see https://github.com/TORCH-Consortium/xbs-nf/blob/master/conda_envs/setup_conda_envs.sh line 22
rm -rf resistance_db_who ./
note that at this point in the script you are located in xbs-nf/conda_envs/resistance_db_who. This command specifies the removal of two things: 1. the current directory ./ , which is odd as this is not possible (you would have to go a dir up), and 2. what is presumably a directory resistance_db_who, which would mean a dir exists in this location with exactly the same name as the parent dir; I guess this is not what was meant.

Implement check for fastq integrity.

I am just running some other dataset at the moment and encountering quite a few issues with corrupted fastq files.
It would be great if we could check the file integrity of each fastq.gz with gzip -t [file] somewhere before fastqc.
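As a rough sketch (not part of the pipeline; /path/to/fastq_dir is a placeholder), such a check could be as simple as:

find /path/to/fastq_dir -name '*.fastq.gz' | while read -r fq; do
    # gzip -t exits non-zero if the archive is truncated or corrupted
    gzip -t "$fq" 2>/dev/null || echo "CORRUPT: $fq"
done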

Allow users to use pre-downloaded conda environments

Since conda is often a cause of problems on a cluster, we should update the pipeline to have:

  • a script to download and setup the necessary conda environments within the specified folder

  • a conda_local profile which relies on this conda environment

Refactor the usage of tbprofiler vcf_profile

MERGE_WF:RESISTANCE_ANALYSIS:TBPROFILER_VCF_PROFILE__LOFREQ crashes because, when run in parallel, multiple processes are reading and writing the same files in /conda/xbs-nf-env-1-d99876e5fea88a1c4bd18887d111ae27/share/tbprofiler/

A temporary solution is to keep restarting the run, each time completing more samples. Decreasing queueSize and increasing errorStrategy retries is likely to be a good temporary solution as well.

GATK to handle rare variant types

This is primarily a reminder for myself to include code so that GATK will correctly handle MIXED and MNP variants; we've encountered this rare issue elsewhere.

Add initial stubs for the individual tools used

This task is about identifying all the individual tools which have been used and adding NF stubs for them.

So far, I've identified the following:

(base) bash-5.0$ tree -L 2
.
├── bcftools
│   └── view
├── bwa
│   └── mem
├── clusterpicker
├── delly
│   └── call
├── fastqc
├── gatk
│   ├── apply_vqsr
│   ├── base_recalibrator
│   ├── collect_wgs_metrics
│   ├── combine_gvcfs
│   ├── flag_stat
│   ├── genotype_gvcfs
│   ├── haplotype_caller
│   ├── index_feature_file
│   ├── mark_duplicates
│   ├── merge_vcfs
│   ├── select_variants
│   ├── variant_recalibrator
│   └── variants_to_table
├── iqtree
├── lofreq
│   ├── call
│   └── indelqual
├── samtools
│   ├── index
│   ├── merge
│   └── stats
├── snpdists
├── snpeff
├── snpsites
├── tbprofiler
│   ├── collate
│   └── vcf_profile
└── utils
    └── gunzip

Publish all logs for each process

This would be useful information for traceability of analyses and results.

process TEST {
    output:
    ...
    path(".*command.{sh,out,err,log}")

    ...
}

And in Nextflow (>= v21.10.0) we can specify this in the config file as shown below

params {
    outdir = "${projectDir}/results/"
}




process {
    withName: "TEST" {
        publishDir = [[path: "${params.outdir}/test/", enabled: 'true', pattern: "temp.txt"],
                      [path: "${params.outdir}/test/logs/", enabled: 'true', pattern: "*command.{sh,out,err,log}"]]
    }
}

GATK VQSR optimal variant annotations

One major issue we have to tackle before release is the unstable VQSR, which can crash or work sub-optimally given variant annotations with insufficient variation.
Solution:

  1. Run VQSR with all annotations.
  2. Extract informativeness of each annotation.
  3. Rerun several times, each time excluding the weakest annotation (a rough sketch is shown after this list).
  4. Pick the best model based on 99.9 sensitivity tranche VQSLOD score.
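A rough bash sketch of the rerun loop in step 3 (hypothetical file names: ref.fasta, cohort.vcf.gz and truth.vcf.gz are placeholders, and the annotation set is only an example); each run drops one annotation and writes its own .recal/.tranches pair so the 99.9 sensitivity tranches can be compared afterwards:

annotations=("DP" "AS_MQ" "AS_QD" "AS_FS")

for drop in "" "${annotations[@]}"; do
    # Build the -an arguments, leaving out the annotation to be dropped ("" = keep all)
    an_args=()
    for an in "${annotations[@]}"; do
        [[ "$an" == "$drop" ]] || an_args+=(-an "$an")
    done
    tag="${drop:-all}"
    gatk VariantRecalibrator \
        -R ref.fasta \
        -V cohort.vcf.gz \
        --resource:truthset,known=false,training=true,truth=true,prior=15.0 truth.vcf.gz \
        "${an_args[@]}" \
        -mode SNP \
        --max-gaussians 4 \
        -O "cohort.${tag}.recal" \
        --tranches-file "cohort.${tag}.tranches"
done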

Proposal for replacing QuantTB with TBprofiler

Hello @everyone (@abhi18av @vrennie @LennertVerboven ).

At the moment we identify multiple infections with QuantTB so that such cases can be excluded from subsequent analyses. Unfortunately we see that this program is not working as well as we would like it to, resulting in false positive multiple infection calls and therefore perfectly good samples being excluded from the rest of the analyses.

After some discussion here at Antwerp it has become clear that QuantTB needs to be replaced by a better process, rather than fixing QuantTB, which seems a can of worms. The good thing is that we already produce a better output for inferring multiple infections: TBprofiler, when run on the LoFreq output, produces a JSON with frequencies of the identified (sub)lineages.

Lennert has already written a script to extract the necessary information from this json: https://github.com/TORCH-Consortium/xbs-nf/blob/master/bin/multiple_infection_filter.py

Note that in this particular scenario the multiple infection information is reported by TBprofiler, which in turn is run on the LoFreq output, which in turn is run on the mapped (bam) files.

Additional advantages of this proposal:

  • QuantTB is not compatible with many OS's, excluding it would make our pipeline easier to set-up on different systems.
  • By the time the pipeline has inferred the multiple infection with TBprofiler all the other stats for selection criteria have also been inferred, it is then possible to report all selection criteria in one summary file (rather than the 4 we have now)
  • This in turn makes the whole workflow of the pipeline much cleaner as all samples go through the same flow up to the selection criteria assessment, there is no need for a rejected workflow.
  • The output from TBprofiler is much cleaner and easier to interpret: there is clear lineage identification (rather than cryptic QuantTB codes) and associated percentages.
  • If this proposal is combined with the TBprofiler fix (#104), then both flows can be incorporated at once, solving all major issues for release.

If all stats for selection criteria can be collated and assessed at one point then the best would be to do this before haplotype caller.

@abhi18av could you please have a look at this proposal so that we can discuss?

Tim

Add FastQ file checkpoint

As discussed in the TORCH meeting, sometimes files are empty or corrupted.

We should add a tool that checks file size and corruption.

Problems with tb-profiler WHO database

Hi @TimHHH and @LennertVerboven ,

Continuing the discussion from #80 (comment) about using custom database in tb-profiler, I am currently testing on Azure/AWS with the custom containers for XBS-nf (https://github.com/abhi18av/xbs-nf/tree/master/containers).

Have you seen this problem before?


Using gff file: /opt/conda/share/tbprofiler/resistance_db_who.gff
Using ref file: /opt/conda/share/tbprofiler/resistance_db_who.fasta
Using barcode file: /opt/conda/share/tbprofiler/resistance_db_who.barcode.bed
Using bed file: /opt/conda/share/tbprofiler/resistance_db_who.bed
Using json_db file: /opt/conda/share/tbprofiler/resistance_db_who.dr.json
Using version file: /opt/conda/share/tbprofiler/resistance_db_who.version.json
Using variables file: /opt/conda/share/tbprofiler/resistance_db_who.variables.json


Can't find /opt/conda/share/tbprofiler/resistance_db_who.variables.json // <--- CORE PROBLEM

The resistance_db_who.variables.json in question is present here

TBProfiler Database

We need to load the TBProfiler database into TBProfiler when creating its virtual environment, using the following commands

To create the database from a list of variants
tb-profiler create_db --prefix <new_library_name>

Load the newly created database into TBProfiler
tb-profiler load_library --prefix <new_library_name>

QuantTB results are out of sync due to SAMTOOLS_MERGE

Hi @TimHHH and @LennertVerboven ,

As a result of #29, the current state of the pipeline has two stages

  • before the sample bams are merged, samples with different libraries
[b5/73822f] process > TEST:QUANTTB_QUANT (EXIT-RIF.FS031-2.L2.A464.1.1.1)        [100%] 13 of 13, cached: 13 ✔
[29/8dd134] process > TEST:MAP_WF:FASTQC (EXIT-RIF.FS031-2.L2.A464.1.1.1)        [100%] 13 of 13, cached: 13 ✔
[d7/425f90] process > TEST:MAP_WF:BWA_MEM (EXIT-RIF.FS031-2.L3.A464.1.1.1)       [100%] 13 of 13, cached: 13 ✔
  • after the bams are merged via SAMTOOLS_MERGE
[81/803656] process > TEST:CALL_WF:SAMTOOLS_MERGE (EXIT-RIF.FS031-2)             [100%] 10 of 10, cached: 10 ✔
[c0/b3165c] process > TEST:CALL_WF:GATK_MARK_DUPLICATES (EXIT-RIF.FS007-2)       [100%] 10 of 10, cached: 10 ✔
[be/4095fc] process > TEST:CALL_WF:SAMTOOLS_INDEX (EXIT-RIF.EC090-209)           [100%] 10 of 10, cached: 10 ✔
[4e/775254] process > TEST:CALL_WF:GATK_HAPLOTYPE_CALLER (EXIT-RIF.FS031-2)      [100%] 10 of 10, cached: 10 ✔
[27/e96f03] process > TEST:CALL_WF:LOFREQ_CALL__NTM (EXIT-RIF.EC550A-330)        [100%] 10 of 10, cached: 10 ✔
[19/c59a69] process > TEST:CALL_WF:LOFREQ_INDELQUAL (EXIT-RIF.EC083-202)         [100%] 10 of 10, cached: 10 ✔
[5e/7d8b8d] process > TEST:CALL_WF:SAMTOOLS_INDEX__LOFREQ (EXIT-RIF.EC083-202)   [100%] 10 of 10, cached: 10 ✔
[c9/b90c8c] process > TEST:CALL_WF:LOFREQ_CALL (EXIT-RIF.EC083-202)              [100%] 10 of 10, cached: 10 ✔
[50/db759f] process > TEST:CALL_WF:LOFREQ_FILTER (EXIT-RIF.EC088-207)            [100%] 10 of 10, cached: 10 ✔
[32/edb65b] process > TEST:CALL_WF:DELLY_CALL (EXIT-RIF.FS007-2)                 [100%] 10 of 10, cached: 10 ✔
[d0/e68561] process > TEST:CALL_WF:BCFTOOLS_VIEW (EXIT-RIF.FS031-2)              [100%] 10 of 10, cached: 10 ✔
[3a/26bbd3] process > TEST:CALL_WF:GATK_INDEX_FEATURE_FILE (EXIT-RIF.FS007-2)    [100%] 10 of 10, cached: 10 ✔
[6a/d9b220] process > TEST:CALL_WF:SAMTOOLS_STATS (EXIT-RIF.FS007-2)             [100%] 10 of 10, cached: 10 ✔
[14/76d561] process > TEST:CALL_WF:GATK_COLLECT_WGS_METRICS (EXIT-RIF.EC090-209) [100%] 10 of 10, cached: 10 ✔
[d0/b21681] process > TEST:CALL_WF:GATK_FLAG_STAT (EXIT-RIF.EC550A-330)          [100%] 10 of 10, cached: 10 ✔

The problem that arises is that now the output of QUANTTB_QUANT has a different shape than that of any process after the SAMTOOLS_MERGE - as shown below

# QuantTB
[EXIT-RIF.EC083-202.L1.A254.1.1.1, /home/labbactfiocruz/projects/xbs-nf/work/8b/fc36dbf6a1cd9682e84b1f25d3d787/output/EXIT-RIF.EC083-202.L1.A254.1.1.1.quant.txt]

# WgsMetrics
[EXIT-RIF.EC090-209, /home/labbactfiocruz/projects/xbs-nf/work/14/76d561b1a276bb56ffa272e0af8942/EXIT-RIF.EC090-209.WgsMetrics.txt]

# FlagStat
[EXIT-RIF.EC083-202, /home/labbactfiocruz/projects/xbs-nf/work/0d/2cf006fe1def80c1006f13cf4a5bea/EXIT-RIF.EC083-202.FlagStat.txt]

# SamtoolStats
[EXIT-RIF.EC090-209, /home/labbactfiocruz/projects/xbs-nf/work/72/9508ce671b9a145552de3cf64c33a6/EXIT-RIF.EC090-209.SamtoolStats.txt]

Which means that the code used to merge all of the stats outputs based on sampleID is not working; see:

https://github.com/abhi18av/xbs-nf/blob/master/workflows/call_wf.nf#L137-L141

[-        ] process > TEST:CALL_WF:UTILS_SAMPLE_STATS                            -
[-        ] process > TEST:CALL_WF:UTILS_COHORT_STATS                            -

What is the course of action you suggest?

Feedback for v0.5 with on-premise testing

This release relates to #11 and includes the following workflows

  1. MAP_WF
  2. QUANTTB_QUANT
  3. CALL_WF

Once we have understood the behavior of the platform and permissions (on-the-fly conda env) on the cluster, I will give the finishing touches and push the MERGE_WF as well.

Complete the stub based version for next iteration.

With #4, the basic stubs have been added, but they need to be customized for each module.

The optimal way to complete the next iteration is to make sure all the process and file stubs are correct and working as per the original design of XBS.

Quanttb relative abundance filter too early

The samples which do not pass the relative abundance filter are eliminated from analysis too soon. They are currently removed from analysis before mapping; they should instead be run through the entire map and call workflows and only be eliminated before merging (the same as samples that are eliminated due to insufficient coverage etc.).

Supply EXIT-RIF expansion set

I have regenerated the EXIT-RIF expansion set for those working with few or clonal samples.
On our system it can be found here: /home/shared_data_ghi/CWGSMTB/xbs-nf_output/EXIT-RIF_expansion_set/
We'll have to find a place online or on GitHub where we can deposit this 191 MB file - any ideas?

Results large run EXIT-RIF

I finally had a chance to have a closer look at the EXIT-RIF run.
See Tower.nf

Good to see that the whole thing now runs on pbs, only 10 failed jobs, and I don't think that had anything to do with XBS-nf, but rather a system issue (storage was likely full).

As for the run/results, I have to say that, despite the smallish issues remaining, things look excellent now; the run was surprisingly fast (<2 days) and there were no major issues in the general data flow of the pipeline. 👍

The nature of the remaining issues also mean that these can all be tested on smaller datasets, no need to run this big data set again, which is great.

Issues remaining:

  1. 🤷 The run ended with a tower communication error, hence tower thinks the run is still ongoing:
Duration    : 1d 20h 22m 26s
CPU hours   : 4'822.2 (3.6% cached, 0% failed)
Succeeded   : 8'393
Cached      : 93
Failed      : 10

WARN: Unexpected HTTP response.
Failed to send message to https://api.tower.nf -- received
- status code : 400
- response msg: Oops... Unable to process request - Error ID: 4Jrmw1EpnkVWYOkEBSx4hx
- error cause : Oops... Unable to process request - Error ID: 4Jrmw1EpnkVWYOkEBSx4hx
  2. VQSR, two issues here:

    • ✅ Only 1 Gaussian was used; this should be 4 unless VQSR crashes.
    • 🕵️ VQSR was performed for several annotation combinations but the best one was not selected based on the VQSLOD score (-156.6738 for -an DP -an AS_MQ -an AS_QD -an AS_FS)
    • *.R.pdf and *.tranches.pdf are not published to the results directory
  3. joint.quanttb_cohort_stats.tsv, two issues:

    • 🚧 the refname entry is not transferred from the quanttb output to this file, hence the columns/data are unaligned
    • 🚧 like in joint.cohort_stats.tsv there should be a column MULTIPLE_INFECTION_THRESHOLD_MET
    • 🚧 for a sample like quanttb/quant/output/EXIT-RIF.EC104-256.L1.A271.1.1.1.quant.txt there are actually two output lines; both should be reported, rather than one. Do make sure that this does not mess up the selection criteria.
  4. ✅ TBprofiler is identifying lineage but not Drug Resistance, likely some database issue.

So, these should be very fixable. Let us know who will tackle which issues.
Thanks.

GATK version

GATK has seen some important updates, particularly in light of Log4Shell; it would be good to update our version to the latest, 4.2.4.1.

Plan out the documentation

  • Choose a doc framework (public website)
  • Make no assumptions about the user's background
  • Provide a script for generating the initial samplesheet (a rough sketch is shown below)
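A rough sketch of such a generator script (hypothetical; it assumes paired files named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz in a single directory and GNU readlink for absolute paths):

#!/usr/bin/env bash
# Hypothetical helper: emit an initial MAGMA samplesheet for a directory of paired fastq.gz files
set -euo pipefail
shopt -s nullglob

fastq_dir="${1:?Usage: make_samplesheet.sh <fastq_dir> <study_name>}"
study="${2:?Usage: make_samplesheet.sh <fastq_dir> <study_name>}"

echo "Study,Sample,Library,Attempt,R1,R2,Flowcell,Lane,Index Sequence"
for r1 in "$fastq_dir"/*_R1.fastq.gz; do
    r2="${r1/_R1.fastq.gz/_R2.fastq.gz}"
    sample="$(basename "$r1" _R1.fastq.gz)"
    echo "${study},${sample},1,1,$(readlink -f "$r1"),$(readlink -f "$r2"),1,1,1"
done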

Integrated SAMTOOLS_MERGE for multiple bam files

Lennert is correct; this is for merging multiple libraries from the same sample. This may happen when one library is not too good and a better one is also available, which happens sometimes, but it can also happen when we decide we simply want more/deeper data for a sample; this also happens sometimes. And there are more scenarios. It is important enough that this step should be implemented now, please - we will encounter this issue in the dataset for the paper. Please see my post in the testing section on the impact of library/sample names.
cheers.

Originally posted by @TimHHH in #26 (comment)
