
mtbseq-nf's Introduction

Introduction

mycobactopia-org/MTBseq-nf is a bioinformatics pipeline that wraps the MTBseq analysis suite in Nextflow for processing Mycobacterium tuberculosis sequencing data. The steps include:

  1. Read QC (FastQC)
  2. Present QC for raw reads (MultiQC)

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Now, you can run the pipeline using:

nextflow run mycobactopia-org/MTBseq-nf \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR>
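For reference, a minimal samplesheet sketch, assuming the common nf-core column layout of sample, fastq_1, fastq_2; the actual column names for this pipeline may differ, and the sample names are only illustrations:

    sample,fastq_1,fastq_2
    AA01,AA01_R1.fastq.gz,AA01_R2.fastq.gz
    AA02,AA02_R1.fastq.gz,AA02_R2.fastq.gz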

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

Credits

mycobactopia-org/MTBseq-nf was originally written by Abhinav Sharma (@abhi18av) and Davi Marcon (@mxrcon).

We thank the following people for their extensive assistance in the development of this pipeline:

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

mtbseq-nf's People

Contributors

@abhi18av, @emilyncosta, @mxrcon


mtbseq-nf's Issues

Add optional trimmomatic module

  • Add the new trimmomatic module (a sketch follows this list)
  • Add a boolean parameter, trim_raw_reads (a simple on/off toggle in the UI)
  • Update the config files
  • Update the workflows to wire in the new parameter
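A minimal sketch of what this could look like, assuming a plain Trimmomatic paired-end call behind a params.trim_raw_reads toggle; the process name, channel shape, and trimming settings are illustrative, not the final module:

    nextflow.enable.dsl = 2

    params.trim_raw_reads = false

    process TRIMMOMATIC {
        tag "${genomeName}"

        input:
        tuple val(genomeName), path(reads)

        output:
        tuple val(genomeName), path("*_trimmed_R{1,2}.fastq.gz"), emit: reads

        script:
        """
        trimmomatic PE ${reads[0]} ${reads[1]} \\
            ${genomeName}_trimmed_R1.fastq.gz ${genomeName}_unpaired_R1.fastq.gz \\
            ${genomeName}_trimmed_R2.fastq.gz ${genomeName}_unpaired_R2.fastq.gz \\
            SLIDINGWINDOW:4:20 MINLEN:36
        """
    }

    workflow PREPROCESS {
        take:
        reads_ch    // tuple( genomeName, [read_1, read_2] )

        main:
        if( params.trim_raw_reads ) {
            TRIMMOMATIC(reads_ch)
            trimmed_ch = TRIMMOMATIC.out.reads
        }
        else {
            trimmed_ch = reads_ch
        }

        emit:
        reads = trimmed_ch
    }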

Benchmark with the vanilla `mtbseq` tool to document run-time

We need to benchmark the pipeline against vanilla MTBseq to judge how similar the results are and to capture the effects of the following scenarios (a profile sketch for these settings follows the list):

  1. Rerunning the analysis on the same platform (conda / docker via k8s) with the same CPU/memory settings.

     conda (2 CPU / 8 GB) - run twice (conda vs conda)
     k8s/docker (2 CPU / 8 GB) - run twice (k8s/docker vs k8s/docker)

  2. Rerunning the analysis on a different executor with the same CPU/memory settings.

     conda vs k8s/docker

  3. Rerunning the analysis on a different executor, scaling the CPU/memory settings.

     conda (2 CPU / 8 GB) vs conda (8 CPU / 20 GB)
     k8s/docker (2 CPU / 8 GB) vs k8s/docker (8 CPU / 20 GB)
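A hypothetical nextflow.config profile sketch for these runs; the profile names are made up and the resource values simply mirror the scenarios above:

    profiles {
        conda_small {
            conda.enabled  = true
            process.cpus   = 2
            process.memory = '8 GB'
        }
        conda_large {
            conda.enabled  = true
            process.cpus   = 8
            process.memory = '20 GB'
        }
        k8s_small {
            docker.enabled   = true
            process.executor = 'k8s'
            process.cpus     = 2
            process.memory   = '8 GB'
        }
        k8s_large {
            docker.enabled   = true
            process.executor = 'k8s'
            process.cpus     = 8
            process.memory   = '20 GB'
        }
    }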

Publish on bioRxiv!

[BUG] cannot create regular file ‘GenomeAnalysisTK.jar’ File exists - conda executor

Hey, while I've been testing the latest version, I found this error. I'm providing the GATK jar as requested and using a fresh conda env:

  ENV_PREFIX /data/mariliaconceicao/Davi/mtbseq-nf-master/work/conda/env-9bc281415c98bcad7a1079004336496d
  Processing GenomeAnalysisTK.jar as *.jar
  jar file specified matches expected version
  Copying GenomeAnalysisTK.jar to /data/mariliaconceicao/Davi/mtbseq-nf-master/work/conda/env-9bc281415c98bcad7a1079004336496d/opt/gatk-3.8

Command error:
  cp: cannot create regular file ‘/data/mariliaconceicao/Davi/mtbseq-nf-master/work/conda/env-9bc281415c98bcad7a1079004336496d/opt/gatk-3.8/GenomeAnalysisTK.jar’: File exists

I'll test some solutions and reply here if I find anything new.

Kindly, Davi

Allow the users to specify their own MTB references

A first attempt at the problem was made in #31.

However, referring to the MTBseq manual, it doesn't seem that an arbitrary file path can be passed:

--ref This OPTION sets the reference genome for the read mapping. By default, the genome of Mycobacterium tuberculosis H37Rv (NC_000962.3) is set as reference. User supplied FASTA files for other reference genomes should be placed in the directory /MTBseq_source/var/ref/, and the respective name given without .fasta extension.
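If full custom references were supported, the manual text above implies a staging step roughly like the sketch below; the process name is made up and MTBSEQ_REF_DIR is a hypothetical placeholder for wherever the active install keeps var/ref:

    process STAGE_USER_REFERENCE {
        input:
        path user_fasta                                 // e.g. my_strain.fasta

        output:
        val("${user_fasta.baseName}"), emit: ref_name   // later passed to MTBseq as --ref

        script:
        """
        # the FASTA has to sit next to the bundled references (see the manual text above)
        cp ${user_fasta} \${MTBSEQ_REF_DIR}/
        """
    }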

Therefore, for the time being, we'll only accommodate the annotations file.

Single-end reads

Hello,

Does the pipeline work on single-end reads?

Thanks

Best,
Aroob

Accommodate the global options for customizing MTBseq

These were mentioned here
#8 (comment)

@Mxrcon, reading the codebase for the current PR, I'm not sure if we have made use of these features at all.

This means, for example, that any user would only be able to run MTBseq with the bundled reference FASTA file.

Similarly, the absence of --basecalib affects the base-call calibration step:

The calibration list is stored in the directory "var/res/MTB_Base_Calibration_List.vcf" of the package. For other reference genomes, the file needs to be specified with the --basecalib OPTION or this step will be skipped.

Basically, the actual design challenge of this workflow is not the DSL2 modules etc, but the treatment of parameters. This is where you need to use creativity and experience 😉
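As one illustration of that parameter treatment, here is a sketch of exposing --basecalib (quoted above) as an optional pipeline parameter that is only appended when the user sets it; the process name, its inputs, and the abbreviated MTBseq invocation are assumptions, not the actual module:

    params.basecalib = null

    process TBVARIANTS {
        input:
        tuple val(genomeName), path(samples)

        script:
        // pass --basecalib only when a calibration VCF was provided by the user
        def basecalib_arg = params.basecalib ? "--basecalib ${params.basecalib}" : ''
        """
        MTBseq --step TBvariants --threads ${task.cpus} ${basecalib_arg}
        """
    }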

Automatically download and extract the GATK jar

We should consider adding a process which automatically downloads GATK 3.8.0 from the public URL

https://storage.googleapis.com/gatk-software/package-archive/gatk/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2

and then extracts it via tar -xf GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2, passing the resulting GenomeAnalysisTK.jar on to all downstream processes (see the sketch below).

This has some overlap with #54
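A rough sketch of such a process, using the public URL and tar command from above; the process name is made up and the extracted archive is assumed to contain the jar one directory level down:

    process DOWNLOAD_GATK_JAR {
        output:
        path "GenomeAnalysisTK.jar", emit: jar

        script:
        """
        wget https://storage.googleapis.com/gatk-software/package-archive/gatk/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2
        mkdir gatk_extract
        tar -xf GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2 -C gatk_extract
        # copy the jar out of the extracted top-level directory into the work dir
        cp gatk_extract/*/GenomeAnalysisTK.jar .
        """
    }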

[BUG] conda package name

Hey there 👋, during my tests using conda I found a small typo in conf/conda.

The conda package is currently specified as bioconda::mtbseq:1.0.3, and Nextflow throws this error:

    UnavailableInvalidChannel: The channel is not accessible or is invalid.
      channel name: bioconda:
      channel url: https://conda.anaconda.org/bioconda:
      error code: 404

This could be fixed by writing the conda package name as bioconda::mtbseq=1.0.3 (channel and package separated by ::, version pinned with =).

I'll take some time to implement this change ASAP on the master branch, as it seems urgent!
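For reference, a minimal sketch of the corrected specification in a config file; the withName selector is only an example of where it might go:

    process {
        withName: 'TBBWA' {
            conda = 'bioconda::mtbseq=1.0.3'
        }
    }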

Error related to log4j2 (with gatk `v3.8.0`)

I'm trying to run the pipeline on a PBS cluster and during the TBBWA step, I'm getting this very odd error:

ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/laboratorio/sabmi/karla.lima/sabmi_sra_marilia/mtbseq_nf_test/mtbseq-nf/conda_envs/mtbseq-nf-env/opt/gatk-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  16:18:38,339 GenomeAnalysisEngine - Deflater: JdkDeflater 
INFO  16:18:38,340 GenomeAnalysisEngine - Inflater: JdkInflater 
INFO  16:18:38,341 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:18:38,499 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 10000 
INFO  16:18:38,510 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:18:38,587 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07 
##### ERROR ------------------------------------------------------------------------------------------

I'm using a fresh conda environment created from the recipe in this repository, and I'm using the correct GATK jar.

Add QC_REPORTS workflow

In some recent projects we lost quite a lot of time on analyses where the input genomes weren't even making it through basic QC checks such as size.

I think it's worth adding an optional QC_REPORTS workflow, which takes the initial genomes (before renaming) and prepares FastQC and MultiQC reports for them (see the sketch after the list below).

The user should be able to:

  • Run only the QC_REPORTS workflow
  • Run QC_REPORTS along with the current setup (the default)
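A minimal sketch of such a subworkflow, assuming stand-alone FastQC and MultiQC modules; the process/workflow names and channel shape are placeholders:

    process FASTQC {
        tag "${genomeName}"

        input:
        tuple val(genomeName), path(reads)

        output:
        path "*_fastqc.zip", emit: zip

        script:
        """
        fastqc ${reads}
        """
    }

    process MULTIQC {
        input:
        path fastqc_zips

        output:
        path "multiqc_report.html", emit: report

        script:
        """
        multiqc .
        """
    }

    workflow QC_REPORTS {
        take:
        reads_ch    // tuple( genomeName, [read_1, read_2] ), before any renaming

        main:
        FASTQC(reads_ch)
        MULTIQC(FASTQC.out.zip.collect())
    }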

Refactor the pipeline to rely upon the nf-core template

Due to the larger scope of the organization, we need to follow a standard structure for our pipelines, and it's time to move MTBseq-nf and MAGMA to the nf-core template so that the user experience stays the same across all pipelines.

Allow users to start from intermediate steps

Adding excerpts of discussion between @Mxrcon and @abhi18av

We can accept only the required file as input, or accept all the output files from the other processes. What do you think works better: a more restrictive input, or a more open one that relies on MTBseq rather than Nextflow to catch input problems?

I think that we could assume that the user would start from the first step of the workflow and reach the last, so we can focus only on that.

To facilitate the advanced use-case where people want to use the intermediate steps, maybe we can add another workflow which starts from, say, step 2 or step 3, just to show them how to stitch it together.

This would need to be explained in README.md.

Publish Logs

  • Task names come out like this: PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBBWA_AA01_out.log

Create a standard pipeline params file including all options for (sub-)modules

Then, let's try to use the following pattern

TBBWA {
    // global params

    // module params
    // (module-level params from the tbbwa.nf file, which a user can override from a central location)
    resultsDir    = "${params.outdir}/tbbwa"
    saveMode      = 'copy'
    shouldPublish = true
}

I think we're at a stage where we can try some things out; I'd be curious to see how params.outdir behaves in this context.

Originally posted by @abhi18av in #19 (comment)
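A sketch of how a module could consume such a block, assuming the pattern above lives under the params scope (so it is reachable as params.TBBWA.*); the publishDir wiring is the point here, while the abbreviated MTBseq call and the output glob are assumptions:

    process TBBWA {
        // publish settings come from the central params block above
        publishDir params.TBBWA.resultsDir,
            mode:    params.TBBWA.saveMode,
            enabled: params.TBBWA.shouldPublish

        input:
        tuple val(genomeName), path(reads)

        output:
        path "Bam", emit: bam_dir   // output layout assumed

        script:
        """
        MTBseq --step TBbwa --threads ${task.cpus}
        """
    }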

GATK/Nextflow Java environment differences

GATK 3.8 requires Java 8, but Nextflow needs at least Java 11. You can't launch Nextflow with the mtbseq-env activated, but when you run the pipeline from the base environment it runs up to BATCH_ANALYSIS:TBFULL and then errors at:
Command executed:
gatk-register GenomeAnalysisTK.jar

Test all global params per module

We need to make sure that we accommodate the global_parameters mentioned here: https://github.com/mtb-bioinformatics/mtbseq-nf/blob/59f7650f521b95cba59774aaac1a165515c11c47/conf/global_params.config#L140

  • We can add the reference files to this repo itself and substitute the docker/conda-based hard-coded paths in the global config

https://github.com/ngs-fzb/MTBseq_source/tree/master/var

  • Create a channel for these references (see the sketch after this list)
  • Accommodate all the base modules for these references
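A sketch of the channel wiring, assuming the reference files get vendored under a resources/ directory in this repo; the paths are placeholders, while the file name MTB_Base_Calibration_List.vcf comes from the MTBseq documentation quoted earlier:

    // value channels so every module receives the references explicitly,
    // instead of relying on hard-coded docker/conda paths
    ref_fasta_ch = Channel.value( file("${projectDir}/resources/var/ref/NC_000962.3.fasta") )
    basecalib_ch = Channel.value( file("${projectDir}/resources/var/res/MTB_Base_Calibration_List.vcf") )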
