labshengli / nanome

29.0 29.0 7.0 131.12 MB

NANOME pipeline (Nanopore long-read sequencing data consensus DNA methylation detection)

Home Page: https://www.jax.org/research-and-faculty/faculty/sheng-li

License: MIT License

Python 72.08% Shell 1.50% Nextflow 14.96% Dockerfile 0.27% R 1.29% HTML 0.33% JavaScript 8.90% CSS 0.67%

bioinformatics dna-methylation long-read-sequencing methylation-calling nanopore-sequencing pipeline

nanome's People

panziwei bennuru stain karutl 00mjk parthomayo dagyemang2904

nanome's Issues

Open clusterOptions in command line

Very nice workflow! A small thing:
I would suggest modifying some of the Slurm parameters:
--qos should not be mandatory, as not every cluster needs it (ours doesn't).
It would be very helpful to have a command-line option that exposes the Nextflow clusterOptions directive (https://www.nextflow.io/docs/latest/process.html#process-clusteroptions). For example, our Slurm configuration has a mandatory account parameter (-A), and other sites will have their own requirements, so this would make the workflow easier to configure on other HPC setups.

Kudos for the workflow; it is very well written, and I was able to run it on our HPC after the clusterOptions modification. If you want, I can make the change and open a PR.

Add dynamic resource (disk size) allocation based on input file size

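#!/usr/bin/env nextflow

// Pair each input file with its size in bytes
Channel
     .fromPath('hello.txt')
     .map { [it, it.size()] }
     .set { input_ch }

process foo {
  // Dynamic directive: evaluated per task, so the input variable x is referenced without '$'
  disk { x.size() < 600.GB ? 400.GB : 700.GB }
  input:
  set file(x), val(sz) from input_ch
  """
  your_command --input $x
  """
}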

In human words:

disk { x.size() < 600.GB ? 400.GB : 700.GB }
"Is the input file size < 600 GB?"

If true, allocate disk of size 400.GB in the task
If false, allocate disk of size 700.GB in the task

Add errorStrategy retry for common google machine failures

    errorStrategy = { task.attempt == process.maxRetries ? 'ignore' : task.exitStatus in [3,9,10,14,143,137,104,134,139] ? 'retry' : 'ignore' }
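
The nested ternary above can be hard to read at a glance. As a rough illustration only, the same decision can be written out in Python. The function name and its parameters are hypothetical stand-ins for task.attempt, process.maxRetries and task.exitStatus; the exit codes are copied from the closure.

```python
# Exit codes copied from the errorStrategy closure above.
RETRYABLE_EXIT_CODES = {3, 9, 10, 14, 143, 137, 104, 134, 139}


def error_strategy(attempt: int, max_retries: int, exit_status: int) -> str:
    """Return 'retry' or 'ignore', mirroring the closure's two-step test."""
    # On the attempt that equals max_retries, stop retrying whatever the exit code.
    if attempt == max_retries:
        return "ignore"
    # On earlier attempts, retry only for the listed exit codes.
    return "retry" if exit_status in RETRYABLE_EXIT_CODES else "ignore"


if __name__ == "__main__":
    print(error_strategy(1, 3, 137))  # listed code, attempts left
    print(error_strategy(3, 3, 137))  # listed code, attempt equals max_retries
    print(error_strategy(1, 3, 1))    # code not in the list
```

Read this way, a task is retried only when its exit code is in the list and the current attempt number is not equal to the retry limit; every other case is ignored.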

Redirect Broken pipe error message to stderr to trigger task failure and retry

Parse the tombo log and, if grep finds "Broken pipe", echo the message and redirect it to stderr.
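
One possible shape for that check, sketched in Python rather than grep. The function names are hypothetical, and returning a non-zero code is an assumption added here: a message on stderr is not enough on its own to make a task's exit status non-zero, so the sketch returns 1 when a match is found and leaves it to the caller to exit with that value.

```python
import sys
from typing import Iterable, List

PATTERN = "Broken pipe"  # the string named in the issue


def find_broken_pipe_lines(lines: Iterable[str]) -> List[str]:
    """Return every log line containing the pattern, without its trailing newline."""
    return [line.rstrip("\n") for line in lines if PATTERN in line]


def check_log(path: str) -> int:
    """Scan one log file, echo matches to stderr, and return a shell-style exit code."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        hits = find_broken_pipe_lines(handle)
    for hit in hits:
        print(hit, file=sys.stderr)
    # Assumption: a non-zero code is what the caller uses to fail the task.
    return 1 if hits else 0
```

A wrapper script could end with sys.exit(check_log(log_path)), where log_path is wherever the tombo log is written in the process; that location is not specified in this issue.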

No such file

(nanome) [poultrylab1@pbsnode01 nanome]$ nextflow run TheJacksonLaboratory/nanome -profile test,docker
N E X T F L O W  ~  version 21.10.0
Launching `TheJacksonLaboratory/nanome` [chaotic_hamilton] - revision: c181f907e9 [master]
NANOME - NF PIPELINE (v1.3.6)
by Li Lab at The Jackson Laboratory
https://github.com/TheJacksonLaboratory/nanome
=================================
dsname              : CIEcoli
input               : https://github.com/TheJacksonLaboratory/nanome/raw/master/test_data/ecoli_ci_test_fast5.tar.gz
genome              : ecoli

Running settings   : --------
processors          : 2
chrSet              : NC_000913.3
dataType            : ecoli
runBasecall         : Yes
runNanopolish       : Yes
runMegalodon        : Yes
runDeepSignal       : Yes
runGuppy            : Yes

Pipeline settings  : --------
Working dir         : /storage-04/chicken/ont_methylation/nanome/work
Output dir          : outputs
Launch dir          : /storage-04/chicken/ont_methylation/nanome
Script dir          : /storage-01/poultrylab1/.nextflow/assets/TheJacksonLaboratory/nanome
User                : poultrylab1
Profile             : test,docker
Config Files        : /storage-01/poultrylab1/.nextflow/assets/TheJacksonLaboratory/nanome/nextflow.config
Pipeline Release    : master
Container           : docker - liuyangzzu/nanome:v1.2
=================================
executor >  local (1)
[6e/af606b] process > EnvCheck (EnvCheck) [100%] 1 of 1 ✔
[-        ] process > Untar               -
[-        ] process > Basecall            -
[-        ] process > QCExport            -
[-        ] process > Resquiggle          -
[-        ] process > Nanopolish          -
[-        ] process > NplshComb           -
[-        ] process > Megalodon           -
[-        ] process > MgldnComb           -
[-        ] process > DeepSignal          -
[-        ] process > DpSigComb           -
[-        ] process > Guppy               -
[-        ] process > GuppyComb           -
[-        ] process > Report              -
No such file: https://github.com/TheJacksonLaboratory/nanome/raw/master/test_data/ecoli_ci_test_fast5.tar.gz

Add .github/workflows/ci.yml for implementing CI/CD

NOTE: GitHub Actions requires additional configuration to run on GPU instances.
It is recommended to implement the CPU-only mode first.

GitHub Actions also limits the resources available for testing:

Hardware specification for Linux virtual machines (used by default)
https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources

Max cpus: 2-core CPU
Max memory: 7 GB of RAM
Max disk size: 14 GB of SSD disk space

Adding CI/CD successfully therefore requires a minimal test dataset.

Here is a template file that implements both Docker and Singularity CI.
It must be created at .github/workflows/ci.yml in the root of the repo (the folder names are reserved; the file name can be changed, e.g. from ci.yml to continuous-integration.yml).

name: splicing-pipelines-nf CI
# This workflow is triggered on pushes and PRs to the repository.
on: [pull_request]

jobs:
  docker:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        nxf_ver: ['20.01.0', '']
    steps:
      - uses: actions/checkout@v1
      - name: Install Nextflow
        run: |
          export NXF_VER=${{ matrix.nxf_ver }}
          wget -qO- get.nextflow.io | bash
          sudo mv nextflow /usr/local/bin/
      - name: Basic workflow tests
        run: |
          nextflow run ${GITHUB_WORKSPACE} --config conf/test.config
  singularity:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        singularity_version: ['3.6.4']
        nxf_ver: ['20.01.0', '']
    steps:
      - uses: actions/checkout@v1
      - uses: eWaterCycle/setup-singularity@v6
        with:
          singularity-version: ${{ matrix.singularity_version }}
      - name: Install Nextflow
        run: |
          export NXF_VER=${{ matrix.nxf_ver }}
          wget -qO- get.nextflow.io | bash
          sudo mv nextflow /usr/local/bin/
      - name: Basic workflow tests
        run: |
          nextflow run ${GITHUB_WORKSPACE}  --config conf/test.config

Many configurations can be tested using the matrix strategy.

Add dynamic resource allocation based on error exit status (increase memory and cpus)

This is implemented here; we can add it to nanome in the same way:
https://github.com/lifebit-ai/templates/blob/322299f35c354f1b8d86dd5f4848db93850a9288/inst/templates/nextflow/conf/base.config#L28-L66

// Contents of nextflow.config

// Specify increasing resources on failure for a specific process
process {
    withName: 'my_process' {
        disk = "50 GB"
        cpus = { check_max(2 * task.attempt, 'cpus') }
        memory = { check_max(2.GB * task.attempt, 'memory') }
    }
}

// Ready-to-copy-paste function; place it at the END of the nextflow.config file

// Function to ensure that resource requirements don't go beyond
// a maximum limit
def check_max(obj, type) {
  if (type == 'memory') {
    try {
      if (obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1)
        return params.max_memory as nextflow.util.MemoryUnit
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max memory '${params.max_memory}' is not valid! Using default value: $obj"
      return obj
    }
  } else if (type == 'time') {
    try {
      if (obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
        return params.max_time as nextflow.util.Duration
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max time '${params.max_time}' is not valid! Using default value: $obj"
      return obj
    }
  } else if (type == 'cpus') {
    try {
      return Math.min( obj, params.max_cpus as int )
    } catch (all) {
      println "   ### ERROR ###   Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
      return obj
    }
  }
}
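
To make the effect of combining task.attempt with check_max concrete, here is a small Python sketch of the arithmetic. It is a simplified stand-in, not a port: it omits the Groovy function's type handling and error messages, and the limits used in the loop are hypothetical values chosen only for illustration.

```python
def check_max(requested, limit):
    """Clamp a request to a ceiling; simplified stand-in for the Groovy check_max."""
    return min(requested, limit)


def resources_for_attempt(attempt, max_cpus, max_memory_gb):
    """Scale cpus and memory by 2 per attempt, then cap both at their limits."""
    return {
        "cpus": check_max(2 * attempt, max_cpus),
        "memory_gb": check_max(2 * attempt, max_memory_gb),
    }


if __name__ == "__main__":
    # Hypothetical limits: 4 cpus and 6 GB of memory.
    for attempt in range(1, 5):
        print(attempt, resources_for_attempt(attempt, max_cpus=4, max_memory_gb=6))
```

With these limits the request grows on each retry until it reaches the ceiling (cpus at attempt 2, memory at attempt 3) and then stays there, which is the behaviour the params.max_* settings are meant to guarantee.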

Pin DeepMod in specific version for reproducibility

Consider replacing git clone with either staging the release tar.gz bundle from GitHub https://github.com/WGLab/DeepMod/archive/refs/tags/v0.1.3.tar.gz via a channel or simply using wget.

https://github.com/liuyangzzu/nanome/blob/3c344b67c01ab68b541a5c2add856b3ce4ae9cc2/main.nf#L113

Use simpleName to trim the tar.gz suffix

https://github.com/liuyangzzu/nanome/blob/3c344b67c01ab68b541a5c2add856b3ce4ae9cc2/main.nf#L198

If you want to use the basename without any suffix, you can use the simpleName method.

Example:

     guppy_basecaller --output_path ${x.simpleName}_basecalled_folder \

Convert the whole folder to a tar to pass to the next process.
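
For readers unfamiliar with the difference between simpleName and baseName on a multi-suffix file such as a .tar.gz, the following Python sketch approximates both. The helper names are hypothetical, and the sketch only mimics the documented behaviour for ordinary file names; it is not Nextflow's implementation.

```python
from pathlib import PurePosixPath


def simple_name(path: str) -> str:
    """File name up to the first dot, e.g. 'reads.tar.gz' -> 'reads'."""
    return PurePosixPath(path).name.split(".", 1)[0]


def base_name(path: str) -> str:
    """File name with only the last extension removed, e.g. 'reads.tar.gz' -> 'reads.tar'."""
    return PurePosixPath(path).stem


if __name__ == "__main__":
    example = "test_data/ecoli_ci_test_fast5.tar.gz"
    print(simple_name(example))
    print(base_name(example))
```

For a tar.gz input, the baseName-style result still carries the .tar suffix, which is why simpleName is the better fit for building the output folder name.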

Replace generic conda base image, with guppy to have all dependencies in 1 container

https://github.com/liuyangzzu/nano-compare/blob/7b52f908d29c985bbc3996bfd1ec72773bbaaded/Dockerfile#L1

This is a requirement for the process named Megalodon, so it is a good universal solution.
We may need to add Miniconda on top.

reference genomes

Hi, I was wondering whether you are considering opening nanome to other references. I have been running most of the tools included in nanome on my own datasets and references, but separately, so it would really help to have a tool such as nanome for this.
Thank you,
Best Regards
P
