isugifnf / polishCLR
A nextflow pipeline for polishing CLR assemblies
Home Page: https://isugifnf.github.io/polishCLR/
It would be beneficial to get meaningful errors when required inputs are not provided. For instance, the absence of the mitogenome when using a FALCON-Unzip assembly causes the workflow to fail with no useful error message.
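A minimal sketch of such a guard in Nextflow (the parameter name `mito_assembly` is hypothetical, not the pipeline's actual identifier):

```groovy
// Sketch only: fail fast with a readable message instead of a mid-run crash.
// 'mito_assembly' is a made-up parameter name for the mitogenome input.
if (params.falcon_unzip && !params.mito_assembly) {
    exit 1, "ERROR: --falcon_unzip requires a mitogenome (--mito_assembly), but none was provided."
}
```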
If PacBio data is passed in as a FASTA file, the @RG annotation is lost, and gcpp (Arrow) will complain:
gcpp ERROR: [pbbam] read group ERROR: basecaller version is too short
Maybe we can check for a .fasta or .fa extension and print a warning. Or hack in an acceptable @RG annotation...
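A rough sketch of the extension check in the workflow entry point (regex only; warning text is an assumption based on the gcpp behavior described above):

```groovy
// Sketch: warn when PacBio reads arrive as FASTA, since the @RG read group
// that gcpp needs is only carried in the subreads BAM.
if (params.pacbio_reads =~ /\.(fa|fasta)(\.gz)?$/) {
    log.warn "PacBio input looks like FASTA; the @RG annotation is lost and gcpp may fail. Prefer the subreads BAM."
}
```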
Reviewer's request: it would be good to have some extra metrics, at least for the final version of the assembly (e.g. misassemblies estimated with FRC_align/QUAST).
Add this to environment.yml
or rst, or ascii-docs. Review which one to pick.
Hi,
Thank you all for this easy-to-use, one-step polishing tool.
My current assembly is for an algal species; I am trying to process a Canu (v2.2) assembly as Case 1.
The latest polishCLR is set up on a local Ubuntu 22.04 system with nextflow (23.04.1.586) and conda (23.3.1) using the .yml file, following your guidance.
The command I used is below.
nextflow run isugifNF/polishCLR -r main --primary_assembly "asm.contigs.fasta" --illumina_reads "PE..fq.gz" --pacbio_reads "m.bam" --step 1 --falcon_unzip false -resume -latest
The run terminated at the "process > ARROW_02:merge_consensus" stage with the following error message.
May-24 09:31:02.345 [Task submitter] DEBUG n.executor.local.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
May-24 09:31:02.347 [Task submitter] INFO nextflow.Session - [73/61bd41] Submitted process > ARROW_02:merge_consensus (1)
May-24 09:31:03.054 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 26931; name: >ARROW_02:merge_consensus (1); status: COMPLETED; exit: 139; error: -; workDir: /home/gdrg1/data/labor3/21_polishCLR/work/73/61bd410320c4ccf58791c88254687e]
May-24 09:31:03.062 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=ARROW_02:merge_consensus (1); work-dir=/home/gdrg1/data/labor3/21_polishCLR/work/73/61bd410320c4ccf58791c88254687e
error [nextflow.exception.ProcessFailedException]: Process ARROW_02:merge_consensus (1) terminated with an error exit status (139)
May-24 09:31:03.084 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /home/gdrg1/data/labor3/21_polishCLR/work/73/61bd410320c4ccf58791c88254687e/.command.out
May-24 09:31:03.085 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /home/gdrg1/data/labor321_polishCLR/work/73/61bd410320c4ccf58791c88254687e/.command.err
May-24 09:31:03.093 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'ARROW_02:merge_consensus (1)'
Caused by:
Process ARROW_02:merge_consensus (1) terminated with an error exit status (139)
Command executed:
#! /usr/bin/env bash
OUTNAME=$(echo "Step_1/01_ArrowPolish" | sed 's:/:_:g')
cat species_name_assembly_pri_tig00000007_0-36040.fasta species_name_assembly_pri_tig00000003_0-44741.fasta
..... (skipping a lengthy list of files) ... 9568.fasta > ${OUTNAME}_consensus.fasta
Command exit status:
139
Command output:
(empty)
Work dir:
/home/gdrg1/data/labor3/21_polishCLR/work/73/61bd410320c4ccf58791c88254687e
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh
So I visited the work path and found that it lacks the fasta files that neighboring work paths contain, and executing .command.run outputs a "Segmentation fault (core dumped)" error.
I hope you may have some advice for this trouble.
After install, I get this:
(/home/cbfgws6/Programs/nextflow/env/polishCLR_env) cbfgws6@trinity:~/Programs/nextflow$ nextflow run isugifNF/polishCLR -r main --check_software
/home/cbfgws6/Programs/nextflow/env/polishCLR_env/lib/jvm/bin/java: symbol lookup error: /home/cbfgws6/Programs/nextflow/env/polishCLR_env/lib/jvm/bin/java: undefined symbol: JLI_StringDup
NOTE: Nextflow is trying to use the Java VM defined by the following environment variables:
JAVA_CMD: /home/cbfgws6/Programs/nextflow/env/polishCLR_env/lib/jvm/bin/java
NXF_OPTS:
Yet, openjdk 11 is installed in the conda environment.
How do I fix this?
Either as a small genome, or one of several simulated genome options
Option 1: Near ideal case, no repeated sequences in whole genome
ACGT AACCGGTT AAACCCGGGTTT... (avoid short reads mapping to multiple locations, near ideal case)
Option 2: Same as option 1, but introduce random errors
Option 3: Same as option 1, but introduce polyploidy
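A throwaway sketch for generating such test genomes with coreutils only (the genome size, file names, and the 1% "error" rate are arbitrary choices for illustration, not the pipeline's test setup):

```shell
# Option 1: a small random genome (random ACGT rarely contains long exact repeats)
tr -dc 'ACGT' < /dev/urandom | head -c 10000 > seq.txt
{ printf '>chr1\n'; fold -w 60 seq.txt; } > ideal.fa

# Option 2: the same genome with ~1% of bases replaced by 'N' as crude errors
awk 'BEGIN{srand(42)} /^>/{print;next}
     {o=""; for(i=1;i<=length($0);i++){c=substr($0,i,1); if(rand()<0.01) c="N"; o=o c} print o}' \
    ideal.fa > errors.fa
```

Option 3 (polyploidy) could then be simulated by concatenating two or more mutated copies under different headers.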
Fix the internal filename conflict when passing through paternal_assembly.fasta and maternal_assembly.fasta separately.
if (params.illumina_reads =~ /bz2?$/) {
    println("bzip2-compressed reads detected")   // pick the decompression channel
} else {
    println("reads are not bzip2-compressed")    // pick the pass-through channel
}
Noting here, so I don't forget:
Is there an “unrecognized parameter” catch-all kind of error we could provide?
Notes:
Either:
Rebuilding the meryl database from the Illumina reads can be time-consuming, especially if you're swapping in a new primary assembly. Add a parameter to pass in a pre-built meryl database; if it is not set, build the meryl database as usual.
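One possible shape for that parameter in the workflow (the names `meryl_db`, `meryl_count`, and `ch_illumina_reads` are assumptions for this sketch, not the pipeline's actual identifiers):

```groovy
// Sketch: reuse a pre-built meryl database when provided, otherwise build it.
ch_merylDB = params.meryl_db
    ? Channel.fromPath(params.meryl_db, checkIfExists: true)
    : meryl_count(ch_illumina_reads)   // existing build step, as usual
```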
Hopefully switching from slurm to sge is just executor = "sge", but double-check whether the "clusterOptions" are equivalent.
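A sketch of what the SGE profile might look like (the `penv` name and resource string are site-specific guesses to double-check, not known-good values):

```groovy
// configs/sge.config sketch -- verify against the target cluster's docs
process {
    executor       = 'sge'
    penv           = 'smp'           // SGE parallel environment; name varies by site
    clusterOptions = '-l h_vmem=8G'  // not 1:1 with the slurm clusterOptions
}
```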
Check if we need to add bbsuite stats on the output
Remove or move any slurm-specific module load calls in the scripts listed below:
Or at least move them to one of the following, depending on which HPC is relevant:
Clarify how to combine primary and alternative assemblies for arrow, purgedups, and freebayes.
primary -----\                                          /--> purgedups (primary?)
              > cat (with different headers) -> arrow -> (separate)
alternative -/                                          \--> combine with alt -> purgedups?
@Sivanandan Tag, you're it :)
Thanks for bringing up the singularity issue
Since it's easy to confuse double-hyphen with single-hyphen parameters. Also check if nextflow provides a way to throw an error on undefined parameters.
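Nextflow does not error on unknown `--params` by default; a manual allow-list check is one workaround (the list below is abbreviated and assumed, not the pipeline's full parameter set):

```groovy
// Sketch: flag parameters that are not in a known-good list.
def knownParams = ['primary_assembly', 'illumina_reads', 'pacbio_reads',
                   'step', 'falcon_unzip']   // abbreviated; extend with the real set
params.keySet().findAll { !(it in knownParams) }.each { p ->
    log.warn "Unrecognized parameter: --${p} (single vs. double hyphen typo?)"
}
```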
Probably a good way to test the docker image as well
Running on atlas brings up a bunch of issues.
Traceback (most recent call last):
File "/project/ag100pest/software/purge_dups/scripts/hist_plot.py", line 4, in <module>
import matplotlib as mpl
queue='service' in configs/atlas.config
or direct to an offline directory, e.g. --offline --download_path /project/ag100pest/busco_datasets
.command.sh: line 28: 40309 Segmentation fault (core dumped) get_seqs -e p_dups.bed p_Step_1_01_ArrowPolish_consensus.fasta -p primary
...
.command.sh: line 51: 40982 Segmentation fault (core dumped) get_seqs -e h_dups.bed h_a_Step_1_01_ArrowPolish_consensus.fasta -p haps
Running the commands line by line on the HPC worked using the Ceres module purge_dups=1.2.5. The error only showed up when we used the singularity image, which also contained purge_dups=1.2.5.
Concat the assembled MT contig Pgos/RawData/MT_Contig/Pgos_mtDNA_contig.fasta to the primary assembly for polishing in arrow.
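The concatenation itself is a one-liner; here is a self-contained sketch with stand-in files (the real inputs would be the primary assembly and Pgos_mtDNA_contig.fasta):

```shell
# Stand-in inputs for illustration only
printf '>tig00000001\nACGTACGT\n' > primary.fasta
printf '>MT_contig\nAACCGGTT\n'   > Pgos_mtDNA_contig.fasta

# Append the MT contig to the primary assembly before Arrow polishing
cat primary.fasta Pgos_mtDNA_contig.fasta > primary_plus_MT.fasta
grep -c '^>' primary_plus_MT.fasta   # 2 sequences
```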
I can bypass this error by loading the java module on ceres.
/project/ag100pest/conda_envs/polishCLR/lib/jvm/bin/java: symbol lookup error: /project/ag100pest/conda_envs/polishCLR/lib/jvm/bin/java: undefined symbol: JLI_StringDup
NOTE: Nextflow is trying to use the Java VM defined by the following environment variables:
JAVA_CMD: /project/ag100pest/conda_envs/polishCLR/lib/jvm/bin/java
NXF_OPTS:
Does environment.yml need a fix?
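If the conda-bundled JVM is broken, pointing Nextflow at a system JDK via NXF_JAVA_HOME is one workaround (the path below is an example, not the actual ceres path):

```shell
# Use a system/module JDK instead of the conda environment's jvm.
# Example path only; on a cluster, 'module load java' and then use its JAVA_HOME.
export NXF_JAVA_HOME=/usr/lib/jvm/java-11-openjdk
```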
We faced this problem with the many contigs of Striacosta, and ultimately aborted that species, but we should try to fix this anyway at some point.
templates/merge_consensus.sh was generating a huge .command.run file when ${windows_fasta} expanded to each individual contig:
cat ${windows_fasta} > ${outdir}_consensus.fasta
If .command.run is larger than a certain default value (> 5MB), the cluster scheduler produces an error:
sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long
Another example of the same issue in a different pipeline
More generally discussed here
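One way to keep the expanded file list out of the generated script is to stream the filenames from a file instead of interpolating them. A sketch with stand-in files (the real template would write the list from the channel's staged inputs):

```shell
# Stand-in window fastas for illustration
mkdir -p windows
printf '>w1\nAC\n' > windows/w1.fasta
printf '>w2\nGT\n' > windows/w2.fasta

# Keep .command.run small: filenames live in a file, not in the script text
ls windows/*.fasta > fasta_list.txt
xargs cat < fasta_list.txt > consensus.fasta
```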
I was confused about how to direct the pipeline to skip 02_Arrow if the Falcon assembly has already been polished. Currently you would supply --falcon_unzip true to skip to purge_dups. I suggest renaming --falcon_unzip to --falcon_polish for clarity.
I think there may be some other renaming, e.g. steptwo, that could be clearer as well.