
mtbseq-nf's Introduction

Introduction

mycobactopia-org/MTBseq-nf is a bioinformatics pipeline that wraps the MTBseq analysis suite in Nextflow for processing Mycobacterium tuberculosis sequencing data. The steps include:

  1. Read QC (FastQC)
  2. Present QC for raw reads (MultiQC)

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Now, you can run the pipeline using:

nextflow run mycobactopia-org/MTBseq-nf \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR>
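For reference, a minimal samplesheet sketch, assuming the common nf-core column layout of sample, fastq_1, fastq_2; the actual column names for this pipeline may differ, and the sample names are only illustrations:

    sample,fastq_1,fastq_2
    AA01,AA01_R1.fastq.gz,AA01_R2.fastq.gz
    AA02,AA02_R1.fastq.gz,AA02_R2.fastq.gz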

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

Credits

mycobactopia-org/MTBseq-nf was originally written by Abhinav Sharma (@abhi18av) and Davi Marcon (@mxrcon).

We thank the following people for their extensive assistance in the development of this pipeline:

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

mtbseq-nf's People

Contributors

@abhi18av, @emilyncosta, @mxrcon


mtbseq-nf's Issues

Add optional trimmomatic module

  • Add the new trimmomatic module (a sketch follows this list)
  • Add a boolean parameter, trim_raw_reads (a simple on/off toggle in the UI)
  • Update the config files
  • Update the workflows to wire in the new parameter
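A minimal sketch of what this could look like, assuming a plain Trimmomatic paired-end call behind a params.trim_raw_reads toggle; the process name, channel shape, and trimming settings are illustrative, not the final module:

    nextflow.enable.dsl = 2

    params.trim_raw_reads = false

    process TRIMMOMATIC {
        tag "${genomeName}"

        input:
        tuple val(genomeName), path(reads)

        output:
        tuple val(genomeName), path("*_trimmed_R{1,2}.fastq.gz"), emit: reads

        script:
        """
        trimmomatic PE ${reads[0]} ${reads[1]} \\
            ${genomeName}_trimmed_R1.fastq.gz ${genomeName}_unpaired_R1.fastq.gz \\
            ${genomeName}_trimmed_R2.fastq.gz ${genomeName}_unpaired_R2.fastq.gz \\
            SLIDINGWINDOW:4:20 MINLEN:36
        """
    }

    workflow PREPROCESS {
        take:
        reads_ch    // tuple( genomeName, [read_1, read_2] )

        main:
        if( params.trim_raw_reads ) {
            TRIMMOMATIC(reads_ch)
            trimmed_ch = TRIMMOMATIC.out.reads
        }
        else {
            trimmed_ch = reads_ch
        }

        emit:
        reads = trimmed_ch
    }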

Benchmark with the vanilla `mtbseq` tool to document run-time

We need to benchmark the pipeline against vanilla MTBseq to judge how similar the results are and to capture the effects of the following scenarios (a profile sketch for these settings follows the list):

  1. Rerunning the analysis on the same platform (conda / docker via k8s) with the same CPU/memory settings.

     conda (2 CPU / 8 GB) - run twice (conda vs conda)
     k8s/docker (2 CPU / 8 GB) - run twice (k8s/docker vs k8s/docker)

  2. Rerunning the analysis on a different executor with the same CPU/memory settings.

     conda vs k8s/docker

  3. Rerunning the analysis on a different executor, scaling the CPU/memory settings.

     conda (2 CPU / 8 GB) vs conda (8 CPU / 20 GB)
     k8s/docker (2 CPU / 8 GB) vs k8s/docker (8 CPU / 20 GB)
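A hypothetical nextflow.config profile sketch for these runs; the profile names are made up and the resource values simply mirror the scenarios above:

    profiles {
        conda_small {
            conda.enabled  = true
            process.cpus   = 2
            process.memory = '8 GB'
        }
        conda_large {
            conda.enabled  = true
            process.cpus   = 8
            process.memory = '20 GB'
        }
        k8s_small {
            docker.enabled   = true
            process.executor = 'k8s'
            process.cpus     = 2
            process.memory   = '8 GB'
        }
        k8s_large {
            docker.enabled   = true
            process.executor = 'k8s'
            process.cpus     = 8
            process.memory   = '20 GB'
        }
    }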

Publish on bioRxiv!

[BUG] cannot create regular file ‘GenomeAnalysisTK.jar’ File exists - conda executor

Hey, while I've been testing the latest version, I found this error. I'm providing the GATK jar as requested and using a fresh conda env:

  ENV_PREFIX /data/mariliaconceicao/Davi/mtbseq-nf-master/work/conda/env-9bc281415c98bcad7a1079004336496d
  Processing GenomeAnalysisTK.jar as *.jar
  jar file specified matches expected version
  Copying GenomeAnalysisTK.jar to /data/mariliaconceicao/Davi/mtbseq-nf-master/work/conda/env-9bc281415c98bcad7a1079004336496d/opt/gatk-3.8

Command error:
  cp: cannot create regular file ‘/data/mariliaconceicao/Davi/mtbseq-nf-master/work/conda/env-9bc281415c98bcad7a1079004336496d/opt/gatk-3.8/GenomeAnalysisTK.jar’: File exists

I'll test some solutions and reply here if I find anything new.

Kindly, Davi

Allow the users to specify their own MTB references

A first attempt at the problem was made in #31.

However, referring to the MTBseq manual, it doesn't seem that an arbitrary file path can be passed:

--ref This OPTION sets the reference genome for the read mapping. By default, the genome of Mycobacterium tuberculosis H37Rv (NC_000962.3) is set as reference. User supplied FASTA files for other reference genomes should be placed in the directory /MTBseq_source/var/ref/, and the respective name given without .fasta extension.
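If full custom references were supported, the manual text above implies a staging step roughly like the sketch below; the process name is made up and MTBSEQ_REF_DIR is a hypothetical placeholder for wherever the active install keeps var/ref:

    process STAGE_USER_REFERENCE {
        input:
        path user_fasta                                 // e.g. my_strain.fasta

        output:
        val("${user_fasta.baseName}"), emit: ref_name   // later passed to MTBseq as --ref

        script:
        """
        # the FASTA has to sit next to the bundled references (see the manual text above)
        cp ${user_fasta} \${MTBSEQ_REF_DIR}/
        """
    }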

Therefore, for the time being, we'll only accommodate the annotations file.

Single-end reads

Hello,

Does the pipeline work on single-end reads?

Thanks

Best,
Aroob

Accommodate the global options for customizing MTBseq

These were mentioned here
#8 (comment)

@Mxrcon, reading the codebase for the current PR, I'm not sure if we have made use of these features at all.

This means, for example, that any user would only be able to run MTBseq with the bundled reference FASTA file.

Similarly, the absence of --basecalib affects the base-call calibration step:

The calibration list is stored in the directory "var/res/MTB_Base_Calibration_List.vcf" of the package. For other reference genomes, the file needs to be specified with the --basecalib OPTION or this step will be skipped.

Basically, the actual design challenge of this workflow is not the DSL2 modules etc, but the treatment of parameters. This is where you need to use creativity and experience 😉
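As one illustration of that parameter treatment, here is a sketch of exposing --basecalib (quoted above) as an optional pipeline parameter that is only appended when the user sets it; the process name, its inputs, and the abbreviated MTBseq invocation are assumptions, not the actual module:

    params.basecalib = null

    process TBVARIANTS {
        input:
        tuple val(genomeName), path(samples)

        script:
        // pass --basecalib only when a calibration VCF was provided by the user
        def basecalib_arg = params.basecalib ? "--basecalib ${params.basecalib}" : ''
        """
        MTBseq --step TBvariants --threads ${task.cpus} ${basecalib_arg}
        """
    }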

Automatically download and extract the GATK jar

We should consider adding a process which automatically downloads GATK 3.8.0 from the public URL

https://storage.googleapis.com/gatk-software/package-archive/gatk/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2

and then extracts it via tar -xf GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2, passing the resulting GenomeAnalysisTK.jar on to all downstream processes (see the sketch below).

This has some overlap with #54
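A rough sketch of such a process, using the public URL and tar command from above; the process name is made up and the extracted archive is assumed to contain the jar one directory level down:

    process DOWNLOAD_GATK_JAR {
        output:
        path "GenomeAnalysisTK.jar", emit: jar

        script:
        """
        wget https://storage.googleapis.com/gatk-software/package-archive/gatk/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2
        mkdir gatk_extract
        tar -xf GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2 -C gatk_extract
        # copy the jar out of the extracted top-level directory into the work dir
        cp gatk_extract/*/GenomeAnalysisTK.jar .
        """
    }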

[BUG] conda package name

Hey there 👋, during my tests using conda I found a small typo in conf/conda.

The conda package is currently specified as bioconda::mtbseq:1.0.3, and Nextflow throws this error:

    UnavailableInvalidChannel: The channel is not accessible or is invalid.
      channel name: bioconda:
      channel url: https://conda.anaconda.org/bioconda:
      error code: 404

This could be fixed by writing the conda package name as bioconda::mtbseq=1.0.3 (channel and package separated by ::, version pinned with =).

I'll take some time to implement this change ASAP on the master branch, as it seems urgent!
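For reference, a minimal sketch of the corrected specification in a config file; the withName selector is only an example of where it might go:

    process {
        withName: 'TBBWA' {
            conda = 'bioconda::mtbseq=1.0.3'
        }
    }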

Error related to log4j2 (with gatk `v3.8.0`)

I'm trying to run the pipeline on a PBS cluster and during the TBBWA step, I'm getting this very odd error:

ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/laboratorio/sabmi/karla.lima/sabmi_sra_marilia/mtbseq_nf_test/mtbseq-nf/conda_envs/mtbseq-nf-env/opt/gatk-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  16:18:38,339 GenomeAnalysisEngine - Deflater: JdkDeflater 
INFO  16:18:38,340 GenomeAnalysisEngine - Inflater: JdkInflater 
INFO  16:18:38,341 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:18:38,499 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 10000 
INFO  16:18:38,510 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:18:38,587 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07 
##### ERROR ------------------------------------------------------------------------------------------

I'm using a fresh conda environment created from the recipe in this repository, and I'm using the correct GATK jar.

Add QC_REPORTS workflow

In some recent projects we lost quite a lot of time on analyses where the input genomes weren't even making it through basic QC checks such as size.

I think it's worth adding an optional QC_REPORTS workflow, which takes the initial genomes (before renaming) and prepares FastQC and MultiQC reports for them (see the sketch after the list below).

The user should be able to:

  • Run only the QC_REPORTS workflow
  • Run QC_REPORTS along with the current setup (the default)
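A minimal sketch of such a subworkflow, assuming stand-alone FastQC and MultiQC modules; the process/workflow names and channel shape are placeholders:

    process FASTQC {
        tag "${genomeName}"

        input:
        tuple val(genomeName), path(reads)

        output:
        path "*_fastqc.zip", emit: zip

        script:
        """
        fastqc ${reads}
        """
    }

    process MULTIQC {
        input:
        path fastqc_zips

        output:
        path "multiqc_report.html", emit: report

        script:
        """
        multiqc .
        """
    }

    workflow QC_REPORTS {
        take:
        reads_ch    // tuple( genomeName, [read_1, read_2] ), before any renaming

        main:
        FASTQC(reads_ch)
        MULTIQC(FASTQC.out.zip.collect())
    }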

Refactor the pipeline to rely upon the nf-core template

Due to the larger scope of the organization, we need to follow a standard structure for our pipelines, and it's time to move MTBseq-nf and MAGMA to the nf-core template so that the user experience stays the same across all pipelines.

Allow users to start from intermediate steps

Adding excerpts of discussion between @Mxrcon and @abhi18av

We can accept only the required file as input, or accept all the output files from the other processes. What do you think works better: a more restrictive input, or a more open one that relies on MTBseq rather than Nextflow to catch input problems?

I think that we could assume that the user would start from the first step of the workflow and reach the last, so we can focus only on that.

To facilitate the advanced use-case where people want to use the intermediate steps, maybe we can add another workflow which starts from, say, step 2 or step 3, just to show them how to stitch it together.

This would need to be explained in README.md.

Publish Logs

  • Task names come out like this: PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBBWA_AA01_out.log

Create a standard pipeline params file including all options for (sub-)modules

Then, let's try to use the following pattern

TBBWA {
    // global params

    // module params
    // (module-level params from the tbbwa.nf file, which a user can override from a central location)
    resultsDir    = "${params.outdir}/tbbwa"
    saveMode      = 'copy'
    shouldPublish = true
}

I think we're at a stage where we can try some things out; I'd be curious to see how params.outdir behaves in this context.

Originally posted by @abhi18av in #19 (comment)
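A sketch of how a module could consume such a block, assuming the pattern above lives under the params scope (so it is reachable as params.TBBWA.*); the publishDir wiring is the point here, while the abbreviated MTBseq call and the output glob are assumptions:

    process TBBWA {
        // publish settings come from the central params block above
        publishDir params.TBBWA.resultsDir,
            mode:    params.TBBWA.saveMode,
            enabled: params.TBBWA.shouldPublish

        input:
        tuple val(genomeName), path(reads)

        output:
        path "Bam", emit: bam_dir   // output layout assumed

        script:
        """
        MTBseq --step TBbwa --threads ${task.cpus}
        """
    }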

GATK/Nextflow Java environment differences

GATK 3.8 requires Java 8, but Nextflow needs at least Java 11. You can't launch Nextflow with the mtbseq-env activated, but when you run the pipeline from the base environment it runs up to BATCH_ANALYSIS:TBFULL and then errors at:
Command executed:
gatk-register GenomeAnalysisTK.jar

Test all global params per module

We need to make sure that we accommodate the global_parameters mentioned here: https://github.com/mtb-bioinformatics/mtbseq-nf/blob/59f7650f521b95cba59774aaac1a165515c11c47/conf/global_params.config#L140

  • We can add the reference files to this repo itself and substitute the docker/conda-based hard-coded paths in the global config

https://github.com/ngs-fzb/MTBseq_source/tree/master/var

  • Create a channel for these references (see the sketch after this list)
  • Accommodate all the base modules for these references
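A sketch of the channel wiring, assuming the reference files get vendored under a resources/ directory in this repo; the paths are placeholders, while the file name MTB_Base_Calibration_List.vcf comes from the MTBseq documentation quoted earlier:

    // value channels so every module receives the references explicitly,
    // instead of relying on hard-coded docker/conda paths
    ref_fasta_ch = Channel.value( file("${projectDir}/resources/var/ref/NC_000962.3.fasta") )
    basecalib_ch = Channel.value( file("${projectDir}/resources/var/res/MTB_Base_Calibration_List.vcf") )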
