
wengan's Introduction


Wengan

An accurate and ultra-fast genome assembler

Version: 0.2 (18/05/2020)


SYNOPSIS

# Assembling Oxford Nanopore and Illumina reads with WenganM
 wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000

# Assembling PacBio reads and Illumina reads with WenganA
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and BGI reads with WenganM
 wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000

# Hybrid long-read only assembly of PacBio Circular Consensus Sequence and Nanopore data with WenganM
 wengan.pl -x ccsont -a M -l ont.fastq.gz -b ccs.fastq.gz -p asm4 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and Illumina reads with WenganD (needs a high-memory machine, ~600GB of RAM)
 wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000

# Assembling pacraw reads with pre-assembled short-read contigs from Minia3
 wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa

# Assembling pacraw reads with pre-assembled short-read contigs from Abyss
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa

# Assembling pacraw reads with pre-assembled short-read contigs from DiscovarDenovo
 wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa

Description

Wengan is a new genome assembler that, unlike most current long-read assemblers, entirely avoids the all-vs-all read comparison. The key idea behind Wengan is that long-read alignments can be inferred by building paths on a sequence graph. To achieve this, Wengan builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges. Another distinctive feature of Wengan is that it performs self-validation by following the read information: Wengan identifies misassemblies at different steps of the assembly process. For more information about the algorithmic ideas behind Wengan, please read the preprint available on bioRxiv.

Short-read assembly

Wengan uses a de Bruijn graph assembler to build the assembly backbone from short-read data. Currently, Wengan can use Minia3, Abyss2 or DiscoVarDenovo. The recommended short-read coverage is 50-60X of 2 x 150bp or 2 x 250bp reads.
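
If your libraries are substantially deeper than this, they can be down-sampled before running Wengan. Below is a hedged sketch using the seqtk bundled in Wengan's bin directory; the sampling fraction, seed and file names are illustrative only:

# keep the same seed (-s) for both mates so read pairing is preserved (fraction 0.5 is illustrative)
seqtk sample -s100 lib1.fwd.fastq.gz 0.5 | gzip > lib1.fwd.sub.fastq.gz
seqtk sample -s100 lib1.rev.fastq.gz 0.5 | gzip > lib1.rev.sub.fastq.gz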

WenganM [M]

This Wengan mode uses the Minia3 short-read assembler. This is the fastest mode of Wengan and can assemble a complete human genome in less than 210 CPU hours (~50GB of RAM).

WenganA [A]

This Wengan mode uses the Abyss2 short-read assembler. This is the lowest memory mode of Wengan and can assemble a complete human genome with less than 40GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.

WenganD [D]

This Wengan mode uses the DiscovarDenovo short-read assembler. This is the most memory-hungry mode of Wengan; assembling a complete human genome needs about 600GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.

Long-read presets

The presets define several variables of the Wengan pipeline execution and depend on the long-read technology used to sequence the genome. The recommended long-read coverage is 30X.

ontlon

preset for raw ultra-long-reads from Oxford Nanopore, typically with an N50 > 50kb.

ontraw

preset for raw Nanopore reads typically with an N50 ~[15kb-40kb].

pacraw

preset for raw long-reads from Pacific Bioscience (PacBio) typically with an N50 ~[8kb-60kb].

pacccs (experimental)

preset for Circular Consensus Sequences from Pacific Bioscience (PacBio) typically with an N50 ~[15kb]. This type of data is not fully supported yet.

Wengan demo

The repository wengan_demo contains a small dataset and instructions to test Wengan v0.2.

#fetch the demo dataset
git clone https://github.com/adigenova/wengan_demo.git

Wengan benchmark

Genome  | Long reads          | Short reads                  | Wengan mode | NG50 (Mb) | CPU (h) | RAM (GB) | Fasta file
NA12878 | ONT 35X (rel5)      | 2x150bp 50X (GIAB: rs1, rs2) | WenganA     | 25.99     | 725     | 45       | asm
NA12878 | ONT 35X (rel5)      | 2x150bp 50X (GIAB: rs1, rs2) | WenganM     | 17.23     | 203     | 53       | asm
NA12878 | ONT 35X (rel5)      | 2x250bp 60X (ENA: rs1, rs2)  | WenganD     | 35.31     | 589     | 622      | asm
HG00073 | PAC 90X (ENA: rl1)  | 2x250bp 63X (ENA: rs1, rs2)  | WenganD     | 32.35     | 936     | 644      | asm
NA24385 | ONT 60X (GIAB: rl1) | 2x250bp 70X (GIAB: rs1)      | WenganD     | 50.59     | 963     | 651      | asm
CHM13   | ONT 50X (T2T: rel3) | 2x250bp 66X (ENA: rs1, rs2)  | WenganD     | 69.72     | 1198    | 646      | asm

The assemblies generated using Wengan (v0.2) can be downloaded from Zenodo. All the assemblies were run as described in the Wengan manuscript. NG50 was computed using a genome size of 3.08Gb.
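
The exact evaluation command from the manuscript is not reproduced here, but as a hedged sketch, NG50 for a Wengan assembly can be recomputed with QUAST by passing the same expected genome size (flag availability may vary between QUAST versions; the output directory name is a placeholder):

# estimate NG50 against a 3.08Gb expected genome size
quast.py -t 20 --large --est-ref-size 3080000000 -o quast_asm asm.SPolished.asm.wengan.fasta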

Wengan components

Getting the latest source code

Instructions

It is recommended to download and use the latest binary release (Linux) from: https://github.com/adigenova/wengan/releases
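
A hedged sketch of fetching and unpacking a release follows; the tarball name is an assumption, so check the releases page for the exact file name:

# download and unpack the Linux binary release (archive name is an assumption)
wget https://github.com/adigenova/wengan/releases/download/v0.2/wengan-v0.2-bin-Linux.tar.gz
tar xzf wengan-v0.2-bin-Linux.tar.gz
ls wengan-v0.2-bin-Linux/bin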

Containers

To facilitate the execution of Wengan, we provide Docker/Singularity containers. Wengan images are hosted on Docker Hub and can be downloaded with the command:

docker pull adigenova/wengan:v0.2
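
As a hedged sketch of a Docker run, mounting the current directory into the container (the read file names are placeholders; the wengan.pl path inside the image is the one used in the Singularity example below):

docker run -v $PWD:/data adigenova/wengan:v0.2 \
 perl /wengan/wengan-v0.2-bin-Linux/wengan.pl \
 -x pacraw -a M \
 -s /data/short.R1.fastq.gz,/data/short.R2.fastq.gz \
 -l /data/pacbio.clr.fastq.gz \
 -p /data/asm_wengan -t 20 -g 3000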

Alternatively, using singularity:

export TMPDIR=/tmp
singularity pull docker://adigenova/wengan:v0.2

Run WenganM using singularity

#using singularity
CONTAINER=/path_to_container/wengan_v0.2.sif

#location of wengan in the container
WENGAN=/wengan/wengan-v0.2-bin-Linux/wengan.pl

#run WenganM with singularity exec
singularity exec $CONTAINER perl ${WENGAN} \
 -x pacraw \
 -a M \
 -s short.R1.fastq.gz,short.R2.fastq.gz \
 -l pacbio.clr.fastq.gz \
 -p asm_wengan -t 20 -g 3000
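
Singularity mounts only a few host directories (typically $HOME, /tmp and the current directory) by default. If the reads live elsewhere, a hedged sketch is to bind those paths explicitly; all paths and file names below are placeholders:

#bind the host data directory into the container before running
singularity exec -B /path/to/data $CONTAINER perl ${WENGAN} \
 -x pacraw -a M \
 -s /path/to/data/short.R1.fastq.gz,/path/to/data/short.R2.fastq.gz \
 -l /path/to/data/pacbio.clr.fastq.gz \
 -p asm_wengan -t 20 -g 3000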

Building Wengan from source

To compile Wengan run the following command:

#fetch Wengan and its components
git clone --recursive https://github.com/adigenova/wengan.git wengan

There are specific instructions for each Wengan component. After compilation you have to copy the binaries to wengan-dir/bin.
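
A hedged sketch of that final copy step; the source paths depend on how each component was built and are placeholders here, and the short-read assemblers (e.g. abyss-pe, DiscovarExp) also need to end up in the same directory:

#gather the compiled component binaries into wengan's bin directory
mkdir -p wengan/bin
cp /path/to/fastmin-sg /path/to/intervalmiss /path/to/liger /path/to/minia /path/to/seqtk wengan/bin/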

Requirements

C++ compiler; compilation was tested with GCC 7.3.0-2.30 (Linux) and clang-1000.11.45.5 (Mac OSX). CMake 3.2+ is also required.

Specific component source code versions used to build Wengan v0.2

  1. abyss commit d4b4b5d
  2. discovarexp-51885 commit f827bab
  3. minia commit 017d23e
  4. fastmin-sg commit 861b061
  5. intervalmiss commit 11be8b42
  6. liger commit 63a044b0
  7. seqtk commit 2efd0c8

Limitations

1. Genomes larger than 4Gb are not supported yet.

About the name

Wengan is a Mapudungun word. Mapudungun is the language of the Mapuche people, the largest group of indigenous inhabitants of south-central Chile. Wengan means "Making the path".

Citation

Di Genova, A., Buena-Atienza, E., Ossowski, S. and Sagot, M.-F. Efficient hybrid de novo assembly of human genomes with WENGAN. Nature Biotechnology (2020), link

wengan's People

Contributors

adigenova, seppi333


wengan's Issues

Problem running Wengan with singularity

Hi,

I originally tried building Wengan from source as suggested; however, after running into many of the same issues as issue #14 (and following many of the suggested fixes with no luck), I decided to switch to running Wengan using Singularity. I ran the following command:

singularity exec $CONTAINER perl ${WENGAN} \
 -x ontraw \
 -a M \
 -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz \
 -l ecoli/reads/EC.ONT.30X.fa.gz \
 -p wengan_output/ec_Wm_or1 -t 10 -g 5

However, I got this output in my error file:

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
make: *** [wengan_output/ec_Wm_or1.minia.41.contigs.fa] Error 1

And this output in ec_Wm_or1.minia.41.err:

HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5F.c line 604 in H5Fopen(): unable to open file
major: File accessibilty
minor: Unable to open file
#1: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5Fint.c line 1087 in H5F_open(): unable to read superblock
major: File accessibilty
minor: Read failed
#2: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5Fsuper.c line 277 in H5F_super_read(): file signature not found
major: File accessibilty
minor: Not an HDF5 file
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5G.c line 454 in H5Gopen2(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#1: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5L.c line 813 in H5Lexists(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#1: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5G.c line 287 in H5Gcreate2(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#1: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5L.c line 813 in H5Lexists(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#1: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5D.c line 165 in H5Dcreate2(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#1: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5D.c line 460 in H5Dget_space(): not a dataset
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5S.c line 883 in H5Sget_simple_extent_dims(): not a dataspace
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5S.c line 392 in H5Sclose(): not a dataspace
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5A.c line 401 in H5Aopen(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#1: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5Gloc.c line 253 in H5G_loc(): invalid object ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5A.c line 677 in H5Aget_space(): not an attribute
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5S.c line 883 in H5Sget_simple_extent_dims(): not a dataspace
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
#000: /home/adigenova/HSCAFF/HUMAN/CHM13/HIFI/20kb/minia/thirdparty/gatb-core/gatb-core/thirdparty/hdf5/src/H5A.c line 634 in H5Aread(): not an attribute
major: Invalid arguments to routine
minor: Inappropriate type
EXCEPTION: Unable to open bank 'wengan_output/ec_Wm_or1.minia_reads.41.txt' (if it is a list of files, perhaps some of the files inside don't exist)

This is my first time using Singularity so apologies in advance if my issue is something trivial.

Thanks in advance for any help/suggestions, I'm happy to provide additional information as needed!

/bin/sh: DiscovarExp: command not found

I got this error when I tried running this command:

perl wengan.pl -x ontlon -a D -s /mnt/pharmngs_a/groups/pharmngs/production/WORKDIR/VIRILIS/data/SRR1536175_1.fastq.gz,/mnt/pharmngs_a/groups/pharmngs/production/WORKDIR/VIRILIS/data/SRR1536175_2.fastq.gz -l /mnt/pharmngs_a/groups/pharmngs/production/WORKDIR/VIRILIS/data/SRR7167958_1.fastq.gz -p virilis -t 1 -g 3000

The error log states this:

/bin/sh: DiscovarExp: command not found
-bash-4.2$ more virilis.Disco_denovo.log
Performing re-exec to adjust stack size.

--------------------------------------------------------------------------------
Wed Nov 27 11:22:57 2019 run on aristotle1, pid=22039 [Nov  5 2019 06:43:49 R51885 ]
DiscovarExp                                                                    \
            READS="/mnt/pharmngs_a/groups/pharmngs/production/WORKDIR/VIRILIS/ \
            data/SRR1536175_1.fastq.gz,/mnt/pharmngs_a/groups/pharmngs/product \
            ion/WORKDIR/VIRILIS/data/SRR1536175_2.fastq.gz"                    \
            OUT_DIR=/tmp/virilisD NUM_THREADS=1
--------------------------------------------------------------------------------

You must specify a command-line value for DEGLOOP_MIN_DIST.
Aborting.

core dump when using docker

Hi,
I ran Wengan via Docker with this command:

docker run -it -v $PWD:/data adigenova/wengan:v0.2 \
 perl /wengan/wengan-v0.2-bin-Linux/wengan.pl \
 -x pacraw \
 -a M \
 -s /data/r1.fq.gz,/data/r2.fq.gz \
 -l /data/pacbio.fastq.gz \
 -p data/asm_wengan -t 20 -g 3000

then met the following error in the file asm_wengan.fml.err:

LOG: Mapping mode =pacraw H=1 k=20 w=5 L=2000 l=250 q=40 m=150 c=65 r=300 t=20 o=/data/asm_wengan I=500,1000,2000,3000,4000,5000,6000,7000,8000,10000,15000,20000 s=1
Building contig index
[M::mm_idx_gen::1625562708.768*0.00] collected minimizers
[M::mm_idx_gen::1625562708.815*0.00] sorted minimizers
[M::mm_mapopt_update::1625562708.842*0.00] mid_occ = 7
[M::mm_idx_stat] kmer size: 20; skip: 5; is_hpc: 1; #seq: 126
[M::mm_idx_stat::1625562708.859*0.00] distinct minimizers: 1220735 (99.52% are singletons); average occurrences: 1.008; average spacing: 4.077
Index construction time: 0.901258 seconds for 126 target sequence(s)
Segmentation fault (core dumped)

Could you tell me what's going wrong?

Thanks.

Recommendation for Illumina + PacBio HiFi + ultralong ONT?

Hi there,

Congratulations on a fantastic, useful tool. I have high coverage ultralong ONT (~70x), Illumina WGS (30x), and PacBio HiFi (currently ~10x). When I run in ccsont mode with just the ONT and HiFi reads, I get pretty good results. Is there a way to utilize all three types of data to potentially improve the assemblies further, perhaps by reducing homopolymer indels a bit through use of the Illumina reads? When I try to pass all three types of data to wengan using the --ccsont preset, it seems like the Illumina reads are ignored.

Thanks!

make: *** [asm1.SPolished.asm.wengan.fasta] Error 134

Hi,
I tried to assemble a 3GB allotetraploid plant genome. However, it crashed on 5.5 GB of RAM.

export MALLOC_PER_THREAD=1
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/DiscovarExp READS="1740D-43-06_S0_L001_R1.fastq.gz,1740D-43-06_S0_L001_R2.fastq.gz" OUT_DIR=/tmp/asm1D NUM_THREADS=14 2> asm1.Disco_denovo.err > asm1.Disco_denovo.log
cp -a /tmp/asm1D asm1D
ln -s asm1D/a.final/a.lines.fasta asm1.contigs-disco.fa
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk cutN -n 1   asm1.contigs-disco.fa | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk seq -L 200 -  | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk iupac2bases -  | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk rename - D | /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/seqtk seq -l 60 - > asm1.contigs.disco.fa
[L::iupac2bases] A total of 0 bases were changed.
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/fastmin-sg shortr -c 50 -k 21 -w 10 -q 20 -r 50000 -t 14 asm1.contigs.disco.fa asm1.fms.txt 2>asm1.fms.err >asm1.fms.log
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/fastmin-sg pacraw -k 20 -w 5 -q 40 -m 150 -r 300 -I 500,1000,2000  -t 14 asm1.contigs.disco.fa asm1.fml.im.txt 2>asm1.fml.im.err >asm1.fml.im.log
rm -f longreads.asm1.im.1.fa
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/intervalmiss -d 1 --clib 1  -t 14 -s asm1.fms.sams.txt -c  asm1.contigs.disco.fa -p  asm1 2>asm1.im.err >asm1.im.log
grep ">" asm1.MBC1.msplit.fa | sed 's/>//' | awk '{print $1" "$2}' | sed 's/>//g' > asm1.MBC1.msplit.cov.txt
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/fastmin-sg pacraw -k 20 -w 5 -q 40 -m 150 -r 300 -t  14 -p asm1 -I  500,1000,2000,3000,4000,5000,6000,7000,8000,10000,15000,20000 asm1.MBC1.msplit.fa asm1.fml.txt 2>asm1.fml.err >asm1.fml.log
/lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/liger  --mlp 10000  --mit 20000000  -t 14 -c  asm1.MBC1.msplit.fa -l  longreads.asm11.fa -d asm1.MBC1.msplit.cov.txt -p asm1 -s asm1.sams.txt 2>asm1.liger.err >asm1.liger.log
/bin/sh: line 1: 271749 Aborted                 (core dumped) /lustre/work-lustre/waterhouse_team/apps/wengan-v0.2-bin-Linux/bin/liger --mlp 10000 --mit 20000000 -t 14 -c asm1.MBC1.msplit.fa -l longreads.asm11.fa -d asm1.MBC1.msplit.cov.txt -p asm1 -s asm1.sams.txt 2> asm1.liger.err > asm1.liger.log
asm1.mk:50: recipe for target 'asm1.SPolished.asm.wengan.fasta' failed
make: *** [asm1.SPolished.asm.wengan.fasta] Error 134

How is it possible to reduce the memory requirements?

Thank you in advance,

Michal

Need help assembling a highly heterozygous genome!

Hi Alex,
I have tried to assemble the ~4G genome, but found it cannot be assembled well: the N50 is only 275,951 bp and the assembly is 2.77G in size. I estimated the genome with 17-mers; the results show it is a highly heterozygous genome with 0.9% heterozygosity.

The distribution of the 17-mers is as follows:

(attached: 17-mer distribution plot)

Could you give me some advice on parameter settings?

Thanks a lot.

Best,
Bob

wengan v01 vs v02

Hello Wengan users,

I am currently trying to assemble a genome using ~50X Illumina reads and ~5X PacBio reads.
I used both Wengan v0.1 and Wengan v0.2, and I found rather interesting differences between the assemblies when using exactly the same dataset.
The "best" assemblies are obtained with both versions when using WenganD. However, I am surprised by the fact that, based on statistics alone (obtained through assemblathon), v0.1 gives noticeably better results than v0.2 (see attached log files).

Is there a reason for that? Should I still use v0.2 rather than v0.1?

Cheers,

Alexandre

WenganD_V1.txt
WenganD_V2.txt

Possibility of integrating Megahit as a new member for SR assembly?

Just came across some very interesting findings based on an N=1 fish genome (another one is assembling now). The N50 from Megahit looks pretty good compared to Abyss2 for this sample: N50 9 kb (Abyss2) vs N50 42 kb (Megahit). I wonder whether Megahit could be the "intermediate" between the fast/memory-saving Minia3/Abyss2 and the all-mighty but memory-greedy Discovar?

In other words, I wonder if you would be happy to do a quick benchmark using Megahit as the contigger.

Regards,
Ming

liger step failing with error "double free or corruption (out)"

Hi,

I'm trying to run WenganD using Nanopore and Illumina data. The assembly gets as far as the liger step, runs for about an hour, and then fails:

/shared/wengan-v0.1-bin-Linux/bin/liger --mlp 10000 -t 40 -c acacia_wengan.MBC1.msplit.fa -l longreads.acacia_wengan1.fa -d acacia_wengan.MBC1.msplit.cov.txt -p acacia_wengan -s acacia_wengan.sams.txt 2>acacia_wengan.liger.err >acacia_wengan.liger.log

make: *** [acacia_wengan.mk:58: acacia_wengan.SPolished.asm.wengan.fasta] Error 134

A tail of the acacia_wengan.liger.log gives:

DELETED EDGE ARRAY: EID=15431196493 LQI_S=116457 LQI_E=119473 EID=15431196493 E_S=116456 E_E=119474 5 NLR=4
DELETED EDGE ARRAY: EID=16083793129 LQI_S=241763 LQI_E=247250 EID=16083793129 E_S=241762 E_E=247251 5 NLR=3
Total lines examinated 3720
Total fragments 184228
Average fragment length 11378
Total Examinated bases 679058849
Total Validated bases by at least 1 fragments 674098760
Percentaje of validated bases 99.2696
A total of 55 edges were deleted in the validation step
Time spent in lines validation 4.34629 secs

And the content of acacia_wengan.liger.err is:

double free or corruption (out)

Can you suggest any way to troubleshoot this issue? WenganA ran without an error on the same data, incidentally.

Thanks very much,

Chris

No rule to make target 'asm.MBC1.msplit.cov.txt', needed by 'asm1.fa'.

Hi!
First, thanks for the tool!

So I was trying to assemble a genome using WenganD with both short reads and long reads on a computer with 1 TB of RAM and 45 CPUs.

At one point I get this error (a file that is needed for the creation of another one is missing):
make: *** No rule to make target 'Proasellus_coiffaiti_HYB_wangan_D.MBC1.msplit.cov.txt', needed by 'longreads.Proasellus_coiffaiti_HYB_wangan_D1.fa'. Stop.

So I don't really know what I can do to correct this.

Thanks in advance

Here are the error, script, and make files:

assembly_wengan_D_PC.e.txt

assembly_wengan_PC.sh.txt

Proasellus_coiffaiti_HYB_wangan_D.mk.txt

New PacBio BAM input

The latest PacBio instrument (Sequel II) only produces a BAM file. Although it can be converted to FASTQ using samtools, all the QUAL values are '!'.

Is such an input format also supported? Can I directly use such a FASTQ with no QUAL values?

Replacement of hard-coded use of /tmp/

We ran into a problem using WenganD on our HPC when /tmp/ ran out of space. The problem lines appear to be in /apps/wengan/0.1/perl/Wengan/Bruijn/Disco.pm:

/apps/wengan/0.1/perl/Wengan/Bruijn/Disco.pm:      my $param="NUM_THREADS=$self->{cores} OUT_DIR=/tmp/$self->{prefix}D";
/apps/wengan/0.1/perl/Wengan/Bruijn/Disco.pm:      "OUT_DIR=/tmp/$self->{prefix}D",
/apps/wengan/0.1/perl/Wengan/Bruijn/Disco.pm:      $cmd="cp -a /tmp/$self->{prefix}D $self->{prefix}D";

My sysadmin tells me that if /tmp is replaced with $ENV{'TMPDIR'} then the problem should not occur.
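
If Disco.pm is patched as suggested to read $ENV{'TMPDIR'}, a hedged usage sketch would be to point TMPDIR at a filesystem with enough space before launching Wengan; the scratch path below is a placeholder:

#use a larger scratch area instead of the default /tmp
mkdir -p /scratch/$USER/wengan_tmp
export TMPDIR=/scratch/$USER/wengan_tmp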

Error 139 in running both Minia and Abyss mode

Hi, I'm trying out Wengan by using public data from NA12878 GIAB: https://github.com/genome-in-a-bottle/giab_data_indexes/tree/master/NA12878

  • 1 long-read file of PacBio
  • Illumina short reads subsampled from 300X to 15X, concatenated into 2 paired-end fastq files.

I'm running with this command

wengan.pl -x pacraw \
    -a M \
    -s ~/NA12878_R1_15X_merge.fastq.gz,~/15X/NA12878_R2_15X_merge.fastq.gz \
    -l ~/NA12878_10X_ccs.fastq.gz \
    -p asm_wengan \
    -t 14 \
    -g 3000 

I also tried to replace M by A to use Abyss instead of Minia.

However, in both cases, they resulted in Error 139:

/home/nguyen/Exec/wengan-v0.2-bin-Linux/bin/fastmin-sg pacraw -k 20 -w 5 -q 40 -m 150 -r 300 -t  14 -p asm_wengan -I  500,1000,2000,3000,4000,5000,6000,7000,8000,10000,15000,20000 asm_wengan.MBC7.msplit.fa asm_wengan.fml.txt 2>asm_wengan.fml.err >asm_wengan.fml.log
make: *** [asm_wengan.mk:58: longreads.asm_wengan1.fa] Error 139
make: *** Deleting file 'longreads.asm_wengan1.fa'
/home/nguyen/Exec/wengan-v0.2-bin-Linux/bin/fastmin-sg pacraw -k 20 -w 5 -q 40 -m 150 -r 300 -I 500,1000,2000  -t 14 asm_wengan.abyss2.contigs.fa asm_wengan.fml.im.txt 2>asm_wengan.fml.im.err >asm_wengan.fml.im.log
make: *** [asm_wengan.mk:17: asm_wengan.im.1.I500.fm.sam] Error 139
make: *** Deleting file 'asm_wengan.im.1.I500.fm.sam'

I checked asm_wengan.fml.err and found a segmentation fault at the end:

...
Thread_info:Partial LONGREADS=37375 MAPPED_PAIRS=307743 TOTAL_PAIRS=13413009 RL=250 MINQ=40 KS=20 MW=150
Thread_info:Partial LONGREADS=37076 MAPPED_PAIRS=297916 TOTAL_PAIRS=13293275 RL=250 MINQ=40 KS=20 MW=150
Thread_info:Partial LONGREADS=37674 MAPPED_PAIRS=314972 TOTAL_PAIRS=13509172 RL=250 MINQ=40 KS=20 MW=150
Thread_info:Partial LONGREADS=37674 MAPPED_PAIRS=297842 TOTAL_PAIRS=13526340 RL=250 MINQ=40 KS=20 MW=150
Segmentation fault (core dumped)

What problem am I encountering? And how can I solve this? Thank you very much.

I also ran ulimit -s unlimited before running the main command, following issue #19.

binaries v01 / v02

Hello wengan users,

Are the needed binaries (e.g. abyss-pe, minia, DiscovarExp, liger...) from Wengan v0.2 the same as those used in Wengan v0.1? If yes, can I simply copy my bin folder from v0.1 to v0.2?

Thank you for the help,

Alexandre

Short reads of length 100bp

I noticed that short reads of length 100 bp result in poor assemblies. I'm assuming this is because Minia is run with k-mer sizes 41, 81, and 121 regardless of the length of the short reads.
What are your recommendations for running Wengan with short reads of length 100 bp?

Why the assembly size is smaller than expected?

Dear authors:
Thanks a lot for your valuable advice. I have now achieved a 16.3 Mbp N50 for my genome assembly. But I still have a question: why does the assembly seem a little smaller in size than what I estimated by k-mer analysis?
The parameters I used are as follows:
wengan.pl -x pacraw -a D -c yinhuang.contigs-disco.fa -i 500,1000,2000,3000,4000,5000,6000,7000,8000,10000,15000,20000,30000,40000,50000,60000
-m 40 -d 5 -N 3 -P 8000

Could you give me some more advice?

Thanks again.

Best wishes.

Reading long-read FASTA input, using 10x chromium linked-read information

Dear @adigenova

Thank you for making this software available to the community. It looks stunning and the Perl code is a pleasure to read!

I have a feature request. Is it technically possible to perform the assembly from long-reads without using the associated quality scores in the fastq files? In such a case, could you consider implementing a FASTA reader for the long-reads as well?

This could help in testing, simulations or evaluating the effect of various external read polishers (although I understand that polishing is not necessary for this pipeline itself).

Just a thought: would it be possible to incorporate and leverage 10x Chromium linked-read information into the assembler? (I notice that one of the assembly options uses Abyss, which can perform some error correction with 10x info via Tigmint, so this might be indirectly possible.)

ERROR: read file XYZ.fastq.gz don't exist or have 0 size.

Hello, I'm running into trouble with the Wengan Singularity container. Very often (but not consistently; about 9 out of 10 tries) my calculations end with the error:

ERROR: read file Eudiplozoon_A7KL0_R1.fastq.gz don't exist or have 0 size.

Is there some problem with reading the input data? Sometimes I was able to run the same calculation (command) successfully, if I remember correctly. My script is attached.

Thank you.
Best regards,
Jiri
Wengan_run1_test.txt

How can I set the output directory?

1. As the title says, how can I set the output directory?
2. Should I run some correction program before running the software when the input is ONT reads?

Perl error Can't use an undefined value as an ARRAY reference

I am attempting to run wengan with an assembly generated from minia in the past (well, gatb-minia pipeline). I am running Wengan on a cluster with the Linux release (I will try building from source soon, but I have another run from the raw reads going and am waiting to see if it fails too).

The command I used is:
wengan.pl -x pacraw -a M -c assembly.fasta -l path/to/pacbio_subreads.fastq.gz -p asm_hybrid_contigsdone -t 12 -g 500
and the output I get is:
Can't use an undefined value as an ARRAY reference at /full/path/to/Wengan/Scheduler/Local.pm line 118.

Any idea what might resolve this issue?

racon and pilon

When I finished the assembly with Wengan, the result was very good, and the BUSCO score was 90%.
Then I used Racon to polish my genome, and the score dropped to 63%.
May I ask whether Racon and Pilon should not be used?

What's the difference between GPolished and SPolished?

I'm doing a hybrid assembly of a ~2G genome. When I saw a file called foo.GPolished.asm.wengan.fasta generated, I thought all the assembly steps had finished (as I couldn't see the liger process using the top command) and then moved all the files to other directories.

In fact, after referring to the demo dataset, I found I was wrong: SPolished rather than GPolished should be the final result.

(1) What's the difference between GPolished and SPolished? I suggest adding this as a note in the README.
(2) Since this pipeline usually lasts 1-2 days, is there any way to resume it if it is stopped by accident?
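
On (2): Wengan appears to drive its steps through a generated <prefix>.mk makefile (see the make messages quoted in other issues), so a hedged sketch for resuming is to re-invoke make on that file and let targets whose output files already exist be skipped; the prefix foo is a placeholder:

#attempt to re-run only the missing targets of a previous run
make -f foo.mk foo.SPolished.asm.wengan.fasta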

Need help assembling a highly heterozygous genome with Platanus2

Hi Alex,
I have tried Platanus2 for assembling the genome, but found it not powerful enough. The results show the raw assembly is very fragmented, with a contig N50 of only 507 bp. When I fed Wengan the contig assembly, it could only assemble a ~2G genome (the estimated size is ~4G).
The command used for Platanus2 is as follows:
platanus_allee assemble -u 0.2 -m 800 -k 29

Could you give me some more suggestions?
Thanks

Best,
Bob

ERROR: read file don't exist or have 0 size

Hi there,

I am running Wengan using singularity on our institute's cluster using the following submission script (note path names generalized for privacy):

#! /bin/bash

#BSUB -n 10
#BSUB -W 250:00
#BSUB -R "rusage[mem=50GB]"
#BSUB -e error_file%J
#BSUB -J job_name

module load htslib/1.4.1
module load gcc/10.2.0
module load perl/5.24.1

#using singularity
CONTAINER=/home/k001y/programs/wengan_v0.2.sif

#location of wengan in the container
WENGAN=/wengan/wengan-v0.2-bin-Linux/wengan.pl

#Setting location of illumina short read fastq files 
SHORT_READ=/path/to/short/read/fastas

#Setting location of ONT raw reads
LONG_READ=/path/to/long/read/fastas

#run WenganM with singularity exec
singularity exec $CONTAINER perl ${WENGAN} \
 -x ontraw \
 -a M \
 -s ${SHORT_READ}/R1.fastq.gz,${SHORT_READ}/R2.fastq.gz \
 -l ${LONG_READ}/longread.merged.fastq \
 -p sample -t 10 -g 5

However, I keep getting the following error:

ERROR: read file 
/path/to/short/read/fastas/R1.fastq.gz don't exist or have 0 size.

I've double-checked, and the paths and file names are indeed correct, so I don't know why I am getting this error.
The short-read and long-read fastq files are in different places; do the files themselves need to be in the same directory or a subdirectory as the submission script? I tried creating symlinks to the data in a folder where I was executing Wengan, but that didn't seem to help. Unfortunately, the files are too large to copy or move into the execution directory. Am I doing something else wrong here?

Thanks in advance for your help!

Parameter recommendation for low-coverage Nanopore reads (5X; N50 ~ 5kb)

Hello,

I am hoping to use Wengan as an alternative to MaSuRCA. However, my initial assembly using 50X PCR-free Illumina (2x150bp) and 5X Nanopore has been somewhat suboptimal with the default v0.2 settings. Specifically, the inclusion of long-read data didn't improve the assembly significantly. The current N50 is around 8kb using short reads alone and is still hovering around 8kb with the long reads included (Abyss).

Going through the discussions, I can see that reducing -N to 2 is helpful for tackling low-coverage long reads.
Is there an additional parameter that can be tweaked to account for the shorter Nanopore read length? I see that the default assumes a read-length distribution of 15kb.

A combination of pore blocking and old DNA unfortunately led to the reduced Nanopore yield and read length.

Regards,
Ming

Congratulations

Congratulations, Wengan was accepted by Nature Biotechnology! ★,°:.☆( ̄▽ ̄)/$:.°★

Evaluate the assembly results

Hi, I have got the assembly results for a human genome with Wengan, and I want to assess the [Prefix].SPolished.asm.wengan.fasta by running QUAST. Could you share the QUAST command line used in your article?

make: *** [myse.SPolished.asm.wengan.fasta] Error 134

Hi,
I am trying to use Wengan with the following dataset:
~65X of 100 bp short-read sequencing
~15X of Nanopore reads, with an N50 of ~3 kb and read lengths up to 400 kb

My run had the following configuration:
singularity exec /fs/scratch/PHS0338/wengan_v0.2.sif perl ${WENGAN}
-x ontlon
-a M
-s reads_1.fq.gz,reads_2.fq.gz
-l RataCor100.fasta.gz
-p mzz -t 48 -g 2500
and the SPolished.asm.wengan.fasta was only 192M. My genome size is about 2.5 Gb,
and I gave the job 740 GB of RAM and 48 cores.

Could you please help me with what parameter configuration I should use? I ran a MaSuRCA assembly with the same files and got a final fasta of 900 MB.

Also, I ran another assembly but with
-x ontlon
-a A
and I got the following error messages:
[L::iupac2bases] A total of 18 bases were changed.
make: *** [myse.SPolished.asm.wengan.fasta] Error 134

Thanks;

Final fasta file

Hi!
What is the final fasta file of the created genome? Is it the *.SPolished.asm.wengan.fasta or the *.GPolished.asm.wengan.fasta? What's the difference between these? Is the final assembly somewhere else?

Could you give some advice for assembling a 2G genome with 60X PacBio and 60X Illumina data?

Dear authors:

Thanks very much for developing such a useful program; it facilitates my research a lot. Now I have some questions.
I am trying to assemble a 2G genome with 60X PacBio and 60X Illumina reads. Using default arguments just results in a draft genome with an NG50 of 8,401,827 bp. I then changed the arguments to "-k 15 -m 100 -N 3 -f 0.12 -M 500", but it has taken more than 10 days with 150 Intel(R) Xeon(R) Platinum 8160H CPUs and is still not finished; I do not know when it will be done. Could you give me some advice for improving the assembly in a shorter time?

Thanks again.
Looking forward to your timely help.

Pre-generated short read genome input

Hiya,

I've been having a play with Wengan and it assembled one of my datasets fantastically! However, my second dataset isn't quite as good quality and didn't perform well with the Minia assembler. I had some trouble running the Abyss2 assembler through Wengan; however, I did manage to assemble a contig-level assembly separately. I then encountered some weird issues when trying to input that into Wengan with the -C and -a A options. In parallel to that, I already have some short-read assemblies created with SPAdes. Could you shed a little light on what happens when a pre-constructed assembly is input via -C, so that I can maybe troubleshoot what's causing my errors? Would it be possible for me to use my SPAdes assemblies at all? SPAdes is my standard go-to assembler for my genomes; it's nowhere near as fast as Minia, but it produces more contiguous genomes with better BUSCO scores. For example, the Minia-Wengan assembly came out with a low 70% complete BUSCO and was half the size it should be, in contrast to a solid ~96% complete via SPAdes.

Otherwise, Wengan is fantastic! It smashed out a better genome than my Canu + RaconX2 + PilonX2 genome in a fraction of the time!

Wengan paper short read assemblies

Hi, thanks for Wengan.
I am eager to try Wengan on our data, but the DISCOVAR DENOVO memory requirement seems to be quite high.
I was wondering if any of the short-read NA24385/NA12878 DISCOVAR DENOVO assemblies are available anywhere for download?

Cheers,
Kevin

How to combine mate-pair reads?

Hi Alex,

Is it possible to combine information from mate-pair reads in a Wengan run, for example using Fast-SG? I'm asking because I saw the contig N50 could still be increased after subsequent steps of scaffolding and gap closing (the contigs were based on ~220X paired-end + ~20X PacBio data, and I added mate-pair reads using the Platanus assembler for scaffolding and gap closing).

PS: Thanks for your previous suggestions. I ran WenganD mode with adjusted parameters ("-N 3 -P 8000 -L 1000 -Q 1000 -i 500,1000,2000,3000,4000,5000,6000,7000,8000,9000,10000,11000,12000,13000,14000,15000,20000"). Now the contig N50 has increased to ~400k! :-)

Best,
Minky

Crash after finding links between unitigs

I ran Wengan on a machine with 1TB of RAM and 20 cores, using both Illumina short reads and PromethION data with the Minia3 assembler. After 1 day and 9 hours the process crashed after finding links between unitigs.

This is the bottom of asm1.minia.41.log:

step 1 pass 7                                   15:47:07     memory [current, maxRSS]: [16847, 70610] MB 
step 2 (91448815kmers/217943625extremities)        16:07:05     memory [current, maxRSS]: [20032, 70610] MB 
gathering links from disk                       16:32:06     memory [current, maxRSS]: [17727, 70610] MB 
Done finding links between unitigs              17:48:17     memory [current, maxRSS]: [7857, 70610] MB 
loading unitigs from disk to memory
Stats:
Number of unitigs: 2213985986
Average number of incoming/outcoming neighbors: 0.0/0.0
Total number of nucleotides in unitigs: 116610855409

Memory usage:
   23581 MB keys in incoming vector
   23600 MB keys in outcoming vector
   1019 MB keys in dag_incoming_map vector
   1017 MB keys in dag_outcoming_map vector
   31566 MB packed unitigs (incl. 2721 MB delimiters)
   16384 MB unitigs lengths
   8445 MB unitigs abundances
   527 MB deleted/visited bitvectors
Estimated total: 106143.5 MB
iterating on 3755914903 nodes on disk
unexpected error: li.unitig=4611686016299118776, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016283862430, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016343682979, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016344738064, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016332944866, unitig_deleted.size()=unexpected error: li.unitig=unexpected error: li.unitig=46116860163007477422213985986, unitig_deleted.size()=46116860163007218602213985986
, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016343714144, unitig_deleted.size()=2213985986

unexpected error: li.unitig=4611686016284879238, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016289026266, unitig_deleted.size()=2213985986
unexpected error: li.unitig=unexpected error: li.unitig=4611686016291656030, unitig_deleted.size()=22139859864611686016293551128
, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016332374618, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016312758977, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016311400272, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016335807901, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016311031310, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016292558110, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016311674943, unitig_deleted.size()=2213985986
unexpected error: li.unitig=4611686016289764357, unitig_deleted.size()=2213985986

This is the bottom of asm1.minia.41.log:

[Building BooPHF]  98.4 %   elapsed:   3 min 48 sec   remaining:   0 min 4  sec
[Building BooPHF]  98.5 %   elapsed:   3 min 49 sec   remaining:   0 min 3  sec

[removing tips,    pass  1               ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [106828, 106828, 128825] MB *** Error in `/crex/proj/uppstore2018098/nobackup/andreas/devel/wengan/wengan-v0.1-bin-Linux/bin/minia': double free or corruption (!prev): 0x0000000002737040 ***

My own error log on the machine indicates that the program died with a segmentation fault. There is also a peculiar "make" error:

make: *** [<MYPATH>/asm1.minia.41.contigs.fa] Error 139

I used the precompiled binaries to run Wengan.

This machine is a CentOS Linux 7 (Core) system with gcc (GCC) 4.8.5 and GNU libc 2.17, which is different from what was used to compile the tools that Wengan wraps around. To get Minia3 to start, I loaded a recent version of GCC (gcc/9.2.0) via our lmod system.

Was Wengan about to switch to a different binary at this point?
Could there be some GCC/glibc issue here?
Assuming I can fix the issue, is it possible to resume the assembly at this point or do I have to restart it?

Cheers!

Not in gzip format

Hello,

I tried to assemble a fungal genome with Illumina and ONT data using the following command:

wengan.pl -x ontraw -a M -s A1.illumina.fwd.fastq, A1.illumina.rev.fastq -l ont_trimQ10.fq -p asm1 -t 8 -g 36

But I got the following error message:

gzip: A1.illumina.fwd.fastq: not in gzip format
Illegal division by zero at /gpfs1/projects/shaobin.zhong/wengan/wengan-v0.2-bin-Linux/perl/Wengan/Reads.pm line 150.

I have the sequence reads in fastq format. Does Wengan accept fastq files?

Thanks for your help!

Shaobin
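
A hedged workaround sketch for the error above: the gzip message suggests Wengan expects gzip-compressed reads, and the -s value should not contain a space after the comma. Compressing the inputs (file names taken from the command above) and rerunning might look like this:

gzip A1.illumina.fwd.fastq A1.illumina.rev.fastq ont_trimQ10.fq
wengan.pl -x ontraw -a M -s A1.illumina.fwd.fastq.gz,A1.illumina.rev.fastq.gz -l ont_trimQ10.fq.gz -p asm1 -t 8 -g 36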

Problem with wengan-v0.2-bin-Linux versus wengan-v0.1-bin-Linux for CCS reads

Hi,

I can run the following using the Wengan v0.1 release:

perl /genetics/elbers/wengan-v0.1-bin-Linux/wengan.pl \
-x pacccs \
-a M \
-s ../wm5_trim_rm_optical_dups_combined_seq_replicates_dedupe_bbduk-k31-speed15-R1.cor.fq.gz,../wm5_trim_rm_optical_dups_combined_seq_replicates_dedupe_bbduk-k31-speed15-R2.cor.fq.gz \
-l ../wm3-pacbio-subreads-3-icecreamfinder-bbduk-k31-speed15-corrected-with-wm5-lighter-k17-680000000-0.1-bcalm2-k65-abund3-clipped.fastq.gz \
-c /genetics/fly/genome/denovo/haslr5/sr_k65_a3.contigs.nooverlap.200.fa.gz -g 680 -p clipped > wengan-minia-clipped.log 2>&1 &

But the updated command for the Wengan v0.2 release,

perl /genetics/elbers/wengan-v0.2-bin-Linux/wengan.pl \
-x ccspac \
-a M \
-s ../wm5_trim_rm_optical_dups_combined_seq_replicates_dedupe_bbduk-k31-speed15-R1.cor.fq.gz,../wm5_trim_rm_optical_dups_combined_seq_replicates_dedupe_bbduk-k31-speed15-R2.cor.fq.gz \
-l ../wm3-pacbio-subreads-3-icecreamfinder-bbduk-k31-speed15-corrected-with-wm5-lighter-k17-680000000-0.1-bcalm2-k65-abund3-clipped.fastq.gz \
-b ../wm3-pacbio-subreads-3-icecreamfinder-bbduk-k31-speed15-corrected-with-wm5-lighter-k17-680000000-0.1-bcalm2-k65-abund3-clipped.fastq.gz \
-c /genetics/fly/genome/denovo/haslr5/sr_k65_a3.contigs.nooverlap.200.fa.gz -g 680 -p clipped > wengan-minia-clipped.log 2>&1 &

results in

cat clipped.fms.err

LOG: Mapping mode =pacccs H=0 k=21 w=10 L=2000 l=250 q=30 m=150 c=65 r=500 t=1 o=FM I=500,1000,2000 s=1
Building contig index
[M::mm_idx_gen::1591173570.185*0.00] collected minimizers
[M::mm_idx_gen::1591173585.914*0.00] sorted minimizers
[M::mm_mapopt_update::1591173588.800*0.00] mid_occ = 68
[M::mm_idx_stat] kmer size: 21; skip: 10; is_hpc: 0; #seq: 1040074
[M::mm_idx_stat::1591173590.686*0.00] distinct minimizers: 100098122 (91.83% are singletons); average occurrences: 1.212; average spacing: 5.703
Index construction time: 58.593850 seconds for 1040074 target sequence(s)
gzopen of 'clipped.ccs.ec.fa' failed: No such file or directory.


cat wengan-minia-clipped.log

ln -s /genetics/fly/genome/denovo/haslr5/sr_k65_a3.contigs.nooverlap.200.fa.gz clipped.minia.121.contigs.fa
/genetics/elbers/wengan-v0.2-bin-Linux/bin/seqtk seq -L 200 clipped.minia.121.contigs.fa  | /genetics/elbers/wengan-v0.2-bin-Linux/bin/seqtk seq -l 60 - > clipped.minia.contigs.fa
grep ">" clipped.minia.contigs.fa | sed 's/km:f://' | awk '{print $1" "$4}' | sed 's/>//g' > clipped.minia.contigs.cov.txt
/genetics/elbers/wengan-v0.2-bin-Linux/bin/fastmin-sg pacccs -k 21 -w 10 -q 30 -r 500 -I 500,1000,2000  -t 1 clipped.minia.contigs.fa clipped.fms.txt 2>clipped.fms.err >clipped.fms.log
clipped.mk:15: recipe for target 'clipped1.ccs.I500.fm.sam' failed
make: *** [clipped1.ccs.I500.fm.sam] Error 1

The makefile does not seem to generate clipped1.ccs and clipped.ccs.ec.fa

Best,
Jean

liger maxpos Assertion error

Hi, thanks for releasing this software, it looks to be very useful. I tried running WENGAN on my dataset with the following options:

perl ./wengan-v0.2-bin-Linux/wengan.pl -x ontraw -a A -s ../illumina/SRR8327096_1.fastq.gz,../illumina/SRR8327096_2.fastq.gz -l ../nanoporecombined.fastq.gz -p asm1 -t 10 -g 60

It seems to run most of the way through; however, it dies with an error in the liger step. The asm1.liger.log contains:
Sorting libs by insert size:
asm11.I500.fm.sam 0 476 47
asm11.I1000.fm.sam 1 937 93
asm11.I2000.fm.sam 2 1788 178
asm11.I3000.fm.sam 3 2598 259
asm11.I4000.fm.sam 4 3593 359
asm11.I15000.fm.sam 5 3694 369
asm11.I10000.fm.sam 6 3870 387
asm11.I5000.fm.sam 7 4387 438
asm11.I6000.fm.sam 8 5100 510
asm11.I8000.fm.sam 9 5140 514
asm11.I7000.fm.sam 10 5470 547
asm11.I20000.fm.sam 11 20000 2000
asm11.I30000.fm.sam 12 30000 3000
asm11.I40000.fm.sam 13 40000 4000
asm11.I50000.fm.sam 14 50000 5000
Max allowed coverage 4112.09
AVG:2741.39 STD:4005.75 Limit: 4112.09 #REP(ctg):8
Number of BCC: 46
STARTING TRANSITIVE REDUCTION:
A total of 1 edges can be reduced, the total weight is 1845 ,whit a total lengh of 2614 bases
NO paths founds for edge: 36 245638 1845 2614 12 NODES(e): 370 505
Number of iterations in edge reduction: 3
REDUCED GRAPH 3198 ORIGINAL GRAPH 3198
ENDING TRANSITIVE REDUCTION
Matching Done
Total Edges :2940 input M 2 Edges in matching: 2 MCW: 7714 Total W:7714 % 1
the graph has cicles and we should remove some edges
Deleting edge: 5 4 1845 855
Matching COVER GRAPH 259 ORIGINAL GRAPH 3198
Number of CC 257
Number of CC 257
Total Reduced edges >= 10000 (bp) = 0
A total of 0 (-nan%) reduced edges were converted to fragments
The Physical Coverage represented in the Paths is: -nan
Scaffolds length larger than >= 10000 = 0
Max scaffold length =0

while asm1.liger.err contains:
liger: src/ValidateBackbone.cpp:112: int ValidateBackbone::validate_backbone(int, int, GraphS*): Assertion `maxpos > 0 and maxpos < cov.size()' failed.

wengan M output

Hello,
Please, which is the final fasta file?

mucuna.GPolished.asm.wengan.fasta
mucuna.SPolished.asm.wengan.fasta

mucuna.Unpolished.asm.wengan.fasta

Does it run on a computer cluster, and how can I continue running unfinished tasks?

Hello, I am trying to assemble a plant genome using different genome assembly software. I am very interested in Wengan, but I have some questions.

  1. Does Wengan support computer clusters (e.g. SGE) and continuing unfinished tasks?
  2. Do you support other assembler or alignment pipelines in Wengan, so that Wengan then integrates the data and produces the final assembled genome? I guess that would be faster than the default pipeline, especially if resuming unfinished tasks is not supported.
  3. A range of read coverage (for both short reads and long reads) is recommended. Is this the best range? If the actual data exceeds this range, will the assembly result be worse?

Output file details?

Hiya,

Could you give some info regarding the output files? I notice a GPolish and an SPolish, but what's the difference?

Cheers

short read assembly only

Hi,

I only have short-read data for my assembly, which Wengan seems to support, but I don't know what to put in the -l flag since I don't have long reads. What is the suggested command if one doesn't have long-read sequences?

Thanks!

How to set the -g parameter?

Dear author,

I'm very happy with the result from wengan!

  • However, I'm a little worried about the -g parameter.
  • We have to set this parameter in each execution, but how strongly will this parameter affect our result?
  • The genome size estimate for my target animal is 106 Mb from GenomeScope (using NGS data). Unfortunately, there is some contamination in my Nanopore data, so the "genome size" of my Nanopore dataset is most likely larger than 106 Mb.
  • In such a case, how should I set the -g parameter?

Looking forward to your answer!

Best
Shaolin
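
On the -g question: in the README examples, -g looks like the expected genome size in megabases (e.g. -g 3000 for the ~3.08Gb human genome in the benchmark), so a hedged sketch for a ~106 Mb genome would simply pass the GenomeScope estimate; contamination in the long reads should not change this value. Read and prefix names below are placeholders:

wengan.pl -x ontraw -a M -s short.R1.fastq.gz,short.R2.fastq.gz -l ont.fastq.gz -p asm_106mb -t 20 -g 106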

A liger error was reported, need help!

Dear authors,
Thanks for your valuable previous advice. Wengan is a very cost-effective assembler, although it may lose some repeats. When trying to assemble a genome of nearly 4Gb, I met a problem. The liger software reported:

"src/BFilling.cpp:73: BFilling::BFilling(GraphS*, std::__cxx11::string, std::__cxx11::string): Assertion `lorder.size() == longreadDB->getTotalseqs()' failed."

and stopped.
I do not know why it occurred. The parameters used are as follows:
perl wengan.pl -x pacraw -a D -c contigs-disco.fa -s "LAB2020188-a_L2_184184.R1.fastq.gz,LAB2020188-a_L2_184184.R2.fastq.gz,LAB2020188-a_L4_183183.R1.fastq.gz,LAB2020188-a_L4_183183.R2.fastq.gz" -l "m64064_200806_090946.subreads.bam.fq.gz,m64065_200814_093113.subreads.bam.fq.gz,m64066_200810_083047.subreads.bam.fq.gz,m64066_200811_145542.subreads.bam.fq.gz" -t 60 -g 4000 -i 500,1000,2000,3000,4000,5000,6000,7000,8000,10000,15000,20000,30000,40000,50000,60000,70000,80000 -m 100 -d 5 -N 3 -P 20000 -M 2500

Looking forward to your timely help!

Thanks a lot!

Best,

core dumped at liger step

Hello,

I had a problem running Wengan at the liger step; it seems that the program can't handle ambiguous nucleotides, but I'm not sure.

Here is the reported error in the log files:
(core dumped) /home/minky/Software/wengan-v0.2-bin-Linux/bin/liger --mlp 10000 -t 64 -c asm8.MBC7.msplit.fa -l longreads.asm81.fa -d asm8.MBC7.msplit.cov.txt -p asm8 -s asm8.sams.txt 2> asm8.liger.err > asm8.liger.log

In asm8.liger.err:
terminate called after throwing an instance of 'std::invalid_argument'
what(): invalid DNA base found in DnaBitset class

In the last lines of asm8.liger.log:
Warning:Invalid DNA base found :N
Warning: Changing base to A

Any suggestions on how to solve the issue?

Best,
Minky
