dragonflye-nf

A generic pipeline for creating long-read assemblies. Supports long-read-only (Oxford Nanopore) assemblies, or hybrid assemblies that combine short and long reads. Optionally annotates genes. Collects quality information on both incoming and outgoing datasets.

Analyses

Usage

By default, dragonflye will be used for assembly of long reads, and no gene annotation will be run:

nextflow run BCCDC-PHL/dragonflye-nf \
  --long_only \
  --fastq_input_long <long-read fastq input directory> \
  --outdir <output directory>

...or a hybrid assembly can be generated by supplying short reads:

nextflow run BCCDC-PHL/dragonflye-nf \
  --hybrid \
  --fastq_input <short-read fastq input directory> \
  --fastq_input_long <long-read fastq input directory> \
  --outdir <output directory>

Prokka and/or Bakta annotation can be enabled with the --prokka and --bakta flags:

nextflow run BCCDC-PHL/dragonflye-nf \
  --long_only \
  --fastq_input_long <long-read fastq input directory> \
  --prokka \
  --bakta \
  --outdir <output directory>

The pipeline also supports a 'samplesheet input' mode. Pass a samplesheet.csv file with the headers ID, R1, R2, LONG via the --samplesheet_input flag:

nextflow run BCCDC-PHL/dragonflye-nf \
  --samplesheet_input <samplesheet.csv> \
  --outdir <output directory>

For example:

ID,R1,R2,LONG
sample-01,/path/to/sample-01_R1.fastq.gz,/path/to/sample-01_R2.fastq.gz,/path/to/sample-01_RL.fastq.gz
sample-02,/path/to/sample-02_R1.fastq.gz,/path/to/sample-02_R2.fastq.gz,/path/to/sample-02_RL.fastq.gz
sample-03,/path/to/sample-03_R1.fastq.gz,/path/to/sample-03_R2.fastq.gz,/path/to/sample-03_RL.fastq.gz

By default, dragonflye will tag circularized contigs with a circular=Y annotation in the fasta header, and linear contigs with circular=N. For example:

>contig00001 len=5202987 cov=191.0 origname=contig_1_polypolish polish=racon:1 round(s);polypolish:short_reads,1 round(s); sw=dragonflye-flye/1.1.0 date=20230912 circular=Y
>contig00002 len=3964 cov=155.0 origname=contig_4_polypolish polish=racon:1 round(s);polypolish:short_reads,1 round(s); sw=dragonflye-flye/1.1.0 date=20230912 circular=N
...

In contrast, unicycler adds a circular=true tag to circularized contigs and no circularization tag to linear contigs. For example:

>1 length=5202987 depth=191.0x circular=true
>2 length=3964 depth=155.0

Both this pipeline and our BCCDC-PHL/routine-assembly pipeline edit the fasta header to add the sample ID to the front. In addition, this pipeline accepts a --use_unicycler_circularization_tag flag that converts circular=Y to circular=true and removes circular=N. A sketch of this header rewrite is shown after the command below.

nextflow run BCCDC-PHL/dragonflye-nf \
  --hybrid \
  --use_unicycler_circularization_tag \
  --fastq_input <short-read fastq input directory> \
  --fastq_input_long <long-read fastq input directory> \
  --outdir <output directory>
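
The header rewrite itself is simple. The following is a minimal sketch of the conversion, assuming the sample-ID prefix is joined to the original header with an underscore (the exact separator used by the pipeline is not shown here); it is an illustration rather than the pipeline's actual script:

# Illustrative sketch only, not the pipeline's own script.
# Assumes the sample ID is prefixed with an underscore separator.
import sys

def rewrite_header(line, sample_id):
    header = line[1:].rstrip()
    header = header.replace("circular=Y", "circular=true")
    # Drop the circular=N tag entirely for linear contigs
    header = " ".join(token for token in header.split() if token != "circular=N")
    return ">" + sample_id + "_" + header + "\n"

with open(sys.argv[1]) as fasta:
    for line in fasta:
        sys.stdout.write(rewrite_header(line, sys.argv[2]) if line.startswith(">") else line)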

Flye's --meta flag can be passed through by adding the --flye_meta flag to the pipeline. The --meta flag is intended for metagenome assemblies, but may be helpful in other scenarios such as plasmid assembly:

nextflow run BCCDC-PHL/dragonflye-nf \
  --hybrid \
  --flye_meta \
  --fastq_input <short-read fastq input directory> \
  --fastq_input_long <long-read fastq input directory> \
  --outdir <output directory>

Output

An output directory will be created for each sample under the directory provided with the --outdir flag. It will be named by sample ID, which is inferred from the fastq filenames (all characters before the first underscore) or taken from the ID field of the samplesheet, if one is used.
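
A minimal sketch of that inference rule:

import os

def infer_sample_id(fastq_path):
    # All characters before the first underscore in the fastq filename
    return os.path.basename(fastq_path).split("_")[0]

infer_sample_id("/data/fastq/sample-01_R1.fastq.gz")  # -> "sample-01"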

If we have sample-01_R{1,2}.fastq.gz in our --fastq_input directory, the output directory will be:

sample-01
├── sample-01_20211125165316_provenance.yml
├── sample-01_fastp.csv
├── sample-01_fastp.json
├── sample-01_dragonflye_short.fa
├── sample-01_dragonflye_short.log
└── sample-01_dragonflye_short_quast.csv

Including the tool name suffixes on output files allows re-analysis of the same sample with multiple tools without conflicting output filenames:

sample-01
├── sample-01_20211125165316_provenance.yml
├── sample-01_20211128122118_provenance.yml
├── sample-01_fastp.csv
├── sample-01_fastp.json
├── sample-01_dragonflye_hybrid_bakta.gbk
├── sample-01_dragonflye_hybrid_bakta.gff
├── sample-01_dragonflye_hybrid_bakta.json
├── sample-01_dragonflye_hybrid_bakta.log
├── sample-01_dragonflye_hybrid_bandage.png
├── sample-01_dragonflye_hybrid_prokka.gbk
├── sample-01_dragonflye_hybrid_prokka.gff
├── sample-01_dragonflye_hybrid_quast.csv
├── sample-01_dragonflye_hybrid.fa
├── sample-01_dragonflye_hybrid.gfa
├── sample-01_dragonflye_hybrid.log
├── sample-01_dragonflye_short_bakta.gbk
├── sample-01_dragonflye_short_bakta.gff
├── sample-01_dragonflye_short_bakta.json
├── sample-01_dragonflye_short_bakta.log
├── sample-01_dragonflye_short_bandage.png
├── sample-01_dragonflye_short_prokka.gbk
├── sample-01_dragonflye_short_prokka.gff
├── sample-01_dragonflye_short_quast.csv
├── sample-01_dragonflye_short.fa
├── sample-01_dragonflye_short.gfa
└── sample-01_dragonflye_short.log

If the --versioned_outdir flag is used, then a sub-directory will be created below each sample's output directory, named with the pipeline name and minor version:

sample-01
    └── dragonflye-nf-v0.4-output
        ├── sample-01_20220216172238_provenance.yml
        ├── sample-01_fastp.csv
        ├── sample-01_fastp.json
        ├── sample-01_dragonflye_short.fa
        ├── sample-01_dragonflye_short.log
        ├── sample-01_dragonflye_short_prokka.gbk
        ├── sample-01_dragonflye_short_prokka.gff
        └── sample-01_dragonflye_short_quast.csv

This is provided as a way of combining the outputs of several different pipelines, or of re-analysing with future versions of this pipeline:

sample-01
    ├── dragonflye-nf-v0.4-output
    │   ├── sample-01_20220216172238_provenance.yml
    │   ├── sample-01_fastp.csv
    │   ├── sample-01_fastp.json
    │   ├── sample-01_dragonflye_short.fa
    │   ├── sample-01_dragonflye_short.log
    │   ├── sample-01_dragonflye_short_prokka.gbk
    │   ├── sample-01_dragonflye_short_prokka.gff
    │   └── sample-01_dragonflye_short_quast.csv
    └── dragonflye-nf-v0.5-output
        ├── sample-01_20220612091224_provenance.yml
        ├── sample-01_fastp.csv
        ├── sample-01_fastp.json
        ├── sample-01_dragonflye_short.fa
        ├── sample-01_dragonflye_short.log
        ├── sample-01_dragonflye_short_prokka.gbk
        ├── sample-01_dragonflye_short_prokka.gff
        └── sample-01_dragonflye_short_quast.csv

Provenance files

For each pipeline invocation, each sample will produce a provenance.yml file with the following contents:

- pipeline_name: BCCDC-PHL/dragonflye-nf
  pipeline_version: 0.4.0
- timestamp_analysis_start: 2022-08-16T13:22:11.553143
- input_filename: sample-01_R1.fastq.gz
  sha256: 4ac3055ac5f03114a005aff033e7018ea98486cbebdae669880e3f0511ed21bb
  file_type: fastq-input
- input_filename: sample-01_R2.fastq.gz
  sha256: 8db388f56a51920752319c67b5308c7e99f2a566ca83311037a425f8d6bb1ecc
  file_type: fastq-input
- process_name: fastp
  tools:
    - tool_name: fastp
      tool_version: 0.23.1
- process_name: dragonflye
  tools:
    - tool_name: dragonflye
      tool_version: 1.1.0
- process_name: prokka
  tools:
    - tool_name: prokka
      tool_version: 1.14.5
      parameters:
        - parameter: --compliant
          value: null
- process_name: quast
  tools:
    - tool_name: quast
      tool_version: 5.0.2
      parameters:
        - parameter: --space-efficient
          value: null
        - parameter: --fast
          value: null

The filename of the provenance file includes a timestamp with format YYYYMMDDHHMMSS to ensure that re-analysis of the same sample will create a unique provenance.yml file.
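
Because the provenance file records the sha256 checksum of each input file, the inputs can be re-verified later. The following is a minimal sketch, assuming the YAML structure shown above and that the recorded filenames are resolvable from the current working directory (requires PyYAML):

import hashlib
import yaml

def verify_inputs(provenance_path):
    with open(provenance_path) as f:
        records = yaml.safe_load(f)
    for record in records:
        if "input_filename" in record:
            # Recompute the checksum and compare against the recorded value
            with open(record["input_filename"], "rb") as fq:
                digest = hashlib.sha256(fq.read()).hexdigest()
            status = "OK" if digest == record["sha256"] else "MISMATCH"
            print(record["input_filename"], status)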

dragonflye-nf's Issues

Quast output includes N90 / L90 metrics but parse_quast_report.py expects N75 / L75 metrics

Hi,
thanks for providing this nextflow wrapper for dragonflye. I ran into an issue where the QUAST output has N90 / L90 metrics but the parse_quast_report.py script expects N75 / L75 metrics and thus falls over.

Command error:
  Traceback (most recent call last):
    File "/home/astroehlein/bin/dragonflye-nf/bin/parse_quast_report.py", line 128, in <module>
      main()
    File "/home/astroehlein/bin/dragonflye-nf/bin/parse_quast_report.py", line 120, in main
      report = parse_transposed_quast_report(args.transposed_quast_report)
    File "/home/astroehlein/bin/dragonflye-nf/bin/parse_quast_report.py", line 69, in parse_transposed_quast_report
      r[field_lookup[f]] = row[f]
  KeyError: 'N75'

~/.conda/envs/dragonflye-nf/bin/quast --version
QUAST v5.2.0

I'll submit a PR to resolve this.
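
One way to make the parser tolerant of QUAST versions that report different Nx/Lx metrics is to skip fields that are absent rather than indexing them unconditionally. This is a hypothetical sketch, not the submitted PR, and the output key names are illustrative:

import csv

# Illustrative field mapping; the real parse_quast_report.py uses its own names.
FIELD_LOOKUP = {
    "N50": "assembly_N50",
    "N75": "assembly_N75",
    "L50": "assembly_L50",
    "L75": "assembly_L75",
}

def parse_transposed_quast_report(path):
    with open(path) as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    parsed = []
    for row in rows:
        record = {}
        for field, key in FIELD_LOOKUP.items():
            # Skip metrics that this QUAST version did not report,
            # instead of raising a KeyError.
            if field in row:
                record[key] = row[field]
        parsed.append(record)
    return parsed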

Optionally convert `circular=Y` to `circular=true` in fasta headers

Unicycler marks circularized contigs with a circular=true tag in the fasta header. Some tools (such as mob-suite) look for that tag specifically and handle the contigs differently if they know that they are circular.

But dragonflye uses a circular=Y tag to mark contigs as circular. For compatibility, we should optionally convert circular=Y to circular=true, using the --use_unicycler_circularization_tag flag.

Add support for 'collecting' csv/tsv outputs

We produce several outputs on a per-sample basis that are simply a .csv file with a header and one line for the sample. Very often we want to collect those into a single .csv file that includes all samples in a particular analysis.

Add a --collect_outputs flag that produces collected outputs for any sample-specific csv or tsv files. Also support a --collected_outputs_prefix flag to set the filename prefix for the collected output files, with default value collected.
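
The collection step itself amounts to concatenating single-row CSVs that share a header. A minimal sketch, with illustrative filenames and glob pattern, not the pipeline's actual implementation:

import csv
import glob

def collect_csvs(pattern, output_path):
    rows, fieldnames = [], None
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            reader = csv.DictReader(f)
            # Assume all per-sample files share the same header
            fieldnames = fieldnames or reader.fieldnames
            rows.extend(reader)
    with open(output_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

collect_csvs("*/*_quast.csv", "collected_quast.csv")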

Add plasmid rotation, similar to Unicycler

Unicycler has a nice feature in its finalisation stage where it will attempt to 'rotate' any closed circular replicons to start at either dnaA or repA.

As far as we know, flye and dragonflye don't support this feature directly.

Add a step to identify plasmids and rotate them to start at dnaA or repA.
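
The rotation itself is straightforward once the start coordinate of a dnaA or repA hit is known; identifying that hit (for example, by searching the contig against a reference gene) is the harder part and is not shown here. A minimal sketch of the rotation step only:

def rotate_circular_contig(seq, start):
    # Rotate a circular sequence so the base at 0-based index `start`
    # becomes the first base of the output.
    start = start % len(seq)
    return seq[start:] + seq[:start]

rotate_circular_contig("GGGATGAAACCC", 3)  # -> "ATGAAACCCGGG"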

Simplify contig ids

The contig ids produced by dragonflye are quite long and detailed. It would be preferable to separate the contig ID from the detailed info, so that tools like abricate can include a short/simple contig ID in their output.
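
One possible approach, sketched below with illustrative filenames: keep only the first whitespace-separated token as the fasta ID and move the remaining key=value details to a side table. This is an assumption about how it might be done, not a decided design:

def split_headers(fasta_path, out_fasta_path, out_tsv_path):
    with open(fasta_path) as fa, \
         open(out_fasta_path, "w") as out_fa, \
         open(out_tsv_path, "w") as out_tsv:
        for line in fa:
            if line.startswith(">"):
                # Keep a short ID in the fasta; park the details in a TSV
                contig_id, _, details = line[1:].rstrip().partition(" ")
                out_fa.write(">" + contig_id + "\n")
                out_tsv.write(contig_id + "\t" + details + "\n")
            else:
                out_fa.write(line)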
