isugifnf / polishCLR
A nextflow pipeline for polishing CLR assemblies
Home Page: https://isugifnf.github.io/polishCLR/
It would be beneficial to get meaningful errors when required inputs are not provided. For instance, the absence of the mitogenome when using a FALCON-Unzip assembly causes the workflow to fail with no useful error message.
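A minimal sketch of such a guard in Nextflow (the parameter name `mito_assembly` is hypothetical, not the pipeline's actual identifier):

```groovy
// Sketch only: fail fast with a readable message instead of a mid-run crash.
// 'mito_assembly' is a made-up parameter name for the mitogenome input.
if (params.falcon_unzip && !params.mito_assembly) {
    exit 1, "ERROR: --falcon_unzip requires a mitogenome (--mito_assembly), but none was provided."
}
```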
If PacBio data is passed in as a FASTA file, the @RG annotation is lost, and gcpp (Arrow) will complain:
gcpp ERROR: [pbbam] read group ERROR: basecaller version is too short
Maybe we can check for a .fasta or .fa extension and print a warning. Or hack in an acceptable @RG annotation...
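A rough sketch of the extension check in the workflow entry point (regex only; warning text is an assumption based on the gcpp behavior described above):

```groovy
// Sketch: warn when PacBio reads arrive as FASTA, since the @RG read group
// that gcpp needs is only carried in the subreads BAM.
if (params.pacbio_reads =~ /\.(fa|fasta)(\.gz)?$/) {
    log.warn "PacBio input looks like FASTA; the @RG annotation is lost and gcpp may fail. Prefer the subreads BAM."
}
```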
Reviewer's request: it would be good to have some extra metrics, at least for the final version of the assembly (e.g. misassemblies estimated with FRC_align/QUAST).
Add this to environment.yml
or rst, or ascii-docs. Review which one to pick.
Hi,
Thank you all for this easy-to-use, one-step polishing tool.
My current assembly is for an algal species; I am trying to process a Canu (v2.2) assembly as Case 1.
The latest polishCLR is set up on a local Ubuntu 22.04 system with nextflow (23.04.1.586) and conda (23.3.1) using the .yml file, following your guidance.
The command I used is below.
nextflow run isugifNF/polishCLR -r main --primary_assembly "asm.contigs.fasta" --illumina_reads "PE..fq.gz" --pacbio_reads "m.bam" --step 1 --falcon_unzip false -resume -latest
The run terminated at the "process > ARROW_02:merge_consensus" stage with the following error message.
May-24 09:31:02.345 [Task submitter] DEBUG n.executor.local.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
May-24 09:31:02.347 [Task submitter] INFO nextflow.Session - [73/61bd41] Submitted process > ARROW_02:merge_consensus (1)
May-24 09:31:03.054 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 26931; name: >ARROW_02:merge_consensus (1); status: COMPLETED; exit: 139; error: -; workDir: /home/gdrg1/data/labor3/21_polishCLR/work/73/61bd410320c4ccf58791c88254687e]
May-24 09:31:03.062 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=ARROW_02:merge_consensus (1); work-dir=/home/gdrg1/data/labor3/21_polishCLR/work/73/61bd410320c4ccf58791c88254687e
error [nextflow.exception.ProcessFailedException]: Process ARROW_02:merge_consensus (1) terminated with an error exit status (139)
May-24 09:31:03.084 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /home/gdrg1/data/labor3/21_polishCLR/work/73/61bd410320c4ccf58791c88254687e/.command.out
May-24 09:31:03.085 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /home/gdrg1/data/labor321_polishCLR/work/73/61bd410320c4ccf58791c88254687e/.command.err
May-24 09:31:03.093 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'ARROW_02:merge_consensus (1)'
Caused by:
Process ARROW_02:merge_consensus (1) terminated with an error exit status (139)
Command executed:
#! /usr/bin/env bash
OUTNAME=$(echo "Step_1/01_ArrowPolish" | sed 's:/:_:g')
cat species_name_assembly_pri_tig00000007_0-36040.fasta species_name_assembly_pri_tig00000003_0-44741.fasta
..... (skipping a lengthy list of files) ... 9568.fasta > ${OUTNAME}_consensus.fasta
Command exit status:
139
Command output:
(empty)
Work dir:
/home/gdrg1/data/labor3/21_polishCLR/work/73/61bd410320c4ccf58791c88254687e
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh
So I visited the work path and found that it lacks the fasta files that neighboring work paths contain, and executing .command.run outputs a "Segmentation fault (core dumped)" error.
I hope you may have some advice for this trouble.
After install, I get this:
(/home/cbfgws6/Programs/nextflow/env/polishCLR_env) cbfgws6@trinity:~/Programs/nextflow$ nextflow run isugifNF/polishCLR -r main --check_software
/home/cbfgws6/Programs/nextflow/env/polishCLR_env/lib/jvm/bin/java: symbol lookup error: /home/cbfgws6/Programs/nextflow/env/polishCLR_env/lib/jvm/bin/java: undefined symbol: JLI_StringDup
NOTE: Nextflow is trying to use the Java VM defined by the following environment variables:
JAVA_CMD: /home/cbfgws6/Programs/nextflow/env/polishCLR_env/lib/jvm/bin/java
NXF_OPTS:
Yet, openjdk 11 is installed in the conda environment.
How do I fix this?
Either as a small genome, or one of several simulated genome options
Option 1: Near ideal case, no repeated sequences in whole genome
ACGT AACCGGTT AAACCCGGGTTT... (avoid short reads mapping to multiple locations, near ideal case)
Option 2: Same as option 1, but introduce random errors
Option 3: Same as option 1, but introduce polyploidy
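A throwaway sketch for generating such test genomes with coreutils only (the genome size, file names, and the 1% "error" rate are arbitrary choices for illustration, not the pipeline's test setup):

```shell
# Option 1: a small random genome (random ACGT rarely contains long exact repeats)
tr -dc 'ACGT' < /dev/urandom | head -c 10000 > seq.txt
{ printf '>chr1\n'; fold -w 60 seq.txt; } > ideal.fa

# Option 2: the same genome with ~1% of bases replaced by 'N' as crude errors
awk 'BEGIN{srand(42)} /^>/{print;next}
     {o=""; for(i=1;i<=length($0);i++){c=substr($0,i,1); if(rand()<0.01) c="N"; o=o c} print o}' \
    ideal.fa > errors.fa
```

Option 3 (polyploidy) could then be simulated by concatenating two or more mutated copies under different headers.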
Fix the internal filename conflict when passing through paternal_assembly.fasta and maternal_assembly.fasta separately.
if (params.illumina_reads =~ /bz2?$/) {
    println("bzip2-compressed reads detected")   // pick the decompression channel
} else {
    println("reads are not bzip2-compressed")    // pick the pass-through channel
}
Noting here, so I don't forget:
Is there an “unrecognized parameter” catch-all kind of error we could provide?
Notes:
Either:
Rebuilding the meryl database from the Illumina reads can be time-consuming, especially if you're swapping in a new primary assembly. Add a parameter to pass in a pre-built meryl database; if it is not set, build the meryl database as usual.
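One possible shape for that parameter in the workflow (the names `meryl_db`, `meryl_count`, and `ch_illumina_reads` are assumptions for this sketch, not the pipeline's actual identifiers):

```groovy
// Sketch: reuse a pre-built meryl database when provided, otherwise build it.
ch_merylDB = params.meryl_db
    ? Channel.fromPath(params.meryl_db, checkIfExists: true)
    : meryl_count(ch_illumina_reads)   // existing build step, as usual
```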
Hopefully switching from slurm to sge is just executor = "sge", but double-check whether the "clusterOptions" are equivalent.
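A sketch of what the SGE profile might look like (the `penv` name and resource string are site-specific guesses to double-check, not known-good values):

```groovy
// configs/sge.config sketch -- verify against the target cluster's docs
process {
    executor       = 'sge'
    penv           = 'smp'           // SGE parallel environment; name varies by site
    clusterOptions = '-l h_vmem=8G'  // not 1:1 with the slurm clusterOptions
}
```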
Check if we need to add bbsuite stats on the output
Remove or move any slurm-specific module load calls in the scripts listed below:
Or at least move them to one of the following, depending on which HPC is relevant:
Clarify how to combine primary and alternative assemblies for arrow, purgedups, and freebayes.
primary -----\                                          /--> purgedups (primary?)
              > cat (with different headers) -> arrow -> (separate)
alternative -/                                          \--> combine with alt -> purgedups?
@Sivanandan Tag, you're it :)
Thanks for bringing up the singularity issue
Since it's easy to confuse double-hyphen with single-hyphen parameters. Also check if nextflow provides a way to throw an error on undefined parameters.
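Nextflow does not error on unknown `--params` by default; a manual allow-list check is one workaround (the list below is abbreviated and assumed, not the pipeline's full parameter set):

```groovy
// Sketch: flag parameters that are not in a known-good list.
def knownParams = ['primary_assembly', 'illumina_reads', 'pacbio_reads',
                   'step', 'falcon_unzip']   // abbreviated; extend with the real set
params.keySet().findAll { !(it in knownParams) }.each { p ->
    log.warn "Unrecognized parameter: --${p} (single vs. double hyphen typo?)"
}
```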
Probably a good way to test the docker image as well
Running on atlas brings up a bunch of issues.
Traceback (most recent call last):
File "/project/ag100pest/software/purge_dups/scripts/hist_plot.py", line 4, in <module>
import matplotlib as mpl
queue='service' in configs/atlas.config
or direct to an offline directory, e.g. --offline --download_path /project/ag100pest/busco_datasets
.command.sh: line 28: 40309 Segmentation fault (core dumped) get_seqs -e p_dups.bed p_Step_1_01_ArrowPolish_consensus.fasta -p primary
...
.command.sh: line 51: 40982 Segmentation fault (core dumped) get_seqs -e h_dups.bed h_a_Step_1_01_ArrowPolish_consensus.fasta -p haps
Running the commands line by line on the HPC worked using the Ceres module purge_dups=1.2.5. The error only showed up when we used the singularity image, which also contained purge_dups=1.2.5.
Concat the assembled MT contig Pgos/RawData/MT_Contig/Pgos_mtDNA_contig.fasta to the primary assembly for polishing in arrow.
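The concatenation itself is a one-liner; here is a self-contained sketch with stand-in files (the real inputs would be the primary assembly and Pgos_mtDNA_contig.fasta):

```shell
# Stand-in inputs for illustration only
printf '>tig00000001\nACGTACGT\n' > primary.fasta
printf '>MT_contig\nAACCGGTT\n'   > Pgos_mtDNA_contig.fasta

# Append the MT contig to the primary assembly before Arrow polishing
cat primary.fasta Pgos_mtDNA_contig.fasta > primary_plus_MT.fasta
grep -c '^>' primary_plus_MT.fasta   # 2 sequences
```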
I can bypass this error by loading the java module on ceres.
/project/ag100pest/conda_envs/polishCLR/lib/jvm/bin/java: symbol lookup error: /project/ag100pest/conda_envs/polishCLR/lib/jvm/bin/java: undefined symbol: JLI_StringDup
NOTE: Nextflow is trying to use the Java VM defined by the following environment variables:
JAVA_CMD: /project/ag100pest/conda_envs/polishCLR/lib/jvm/bin/java
NXF_OPTS:
Does environment.yml need a fix?
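If the conda-bundled JVM is broken, pointing Nextflow at a system JDK via NXF_JAVA_HOME is one workaround (the path below is an example, not the actual ceres path):

```shell
# Use a system/module JDK instead of the conda environment's jvm.
# Example path only; on a cluster, 'module load java' and then use its JAVA_HOME.
export NXF_JAVA_HOME=/usr/lib/jvm/java-11-openjdk
```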
We faced this problem with the many contigs of Striacosta, and ultimately aborted that species, but we should try to fix this anyway at some point.
templates/merge_consensus.sh was generating a huge .command.run file when ${windows_fasta} expanded to each individual contig:
cat ${windows_fasta} > ${outdir}_consensus.fasta
If .command.run is larger than a certain default value (> 5MB), the cluster scheduler produces an error:
sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long
Another example of the same issue in a different pipeline
More generally discussed here
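One way to keep the expanded file list out of the generated script is to stream the filenames from a file instead of interpolating them. A sketch with stand-in files (the real template would write the list from the channel's staged inputs):

```shell
# Stand-in window fastas for illustration
mkdir -p windows
printf '>w1\nAC\n' > windows/w1.fasta
printf '>w2\nGT\n' > windows/w2.fasta

# Keep .command.run small: filenames live in a file, not in the script text
ls windows/*.fasta > fasta_list.txt
xargs cat < fasta_list.txt > consensus.fasta
```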
I was confused about how to direct the pipeline to skip 02_Arrow if the Falcon assembly has already been polished. Currently you would supply --falcon_unzip true to skip to purge_dups. I suggest renaming --falcon_unzip to --falcon_polish for clarity.
I think there may be some other renaming, e.g. steptwo, that could be clearer as well.