mahulchak / quickmerge

A simple and fast metassembler and assembly gap filler designed for long molecule based assemblies.

License: GNU General Public License v3.0

quickmerge's Issues

bad ELF interpreter

Hello,

I have an error when using quickmerge. The compilation looks like it worked.

~/progs/quickmerge/merger 
(master)>make 
make: `quickmerge' is up to date.

Upon trying to launch it:

~/progs/quickmerge/merger 
(master)>./quickmerge
-bash: ./quickmerge: /software/lib64/ld-linux-x86-64.so.2: bad ELF interpreter: No such file or directory

Is quickmerge a 32-bit application? Is there a way to force compilation against the interpreter that is available to me? (It's a cluster, so I can't easily install libraries on it myself.)
Thanks!
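
A quick way to answer the 32-bit question is to read the ELF header of the binary. The sketch below checks the EI_CLASS byte (offset 4 of the file), which is 1 for 32-bit and 2 for 64-bit ELF files. The function name elf_class is made up for illustration. Note that the loader named in the error, ld-linux-x86-64.so.2, is the usual name of the 64-bit x86 loader, so the problem is more likely the loader path baked in at link time than the word size.

```python
def elf_class(path):
    """Return 32 or 64 depending on the EI_CLASS byte of an ELF file."""
    with open(path, "rb") as fh:
        ident = fh.read(5)
    # Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'.
    if len(ident) < 5 or ident[:4] != b"\x7fELF":
        raise ValueError(f"{path} is not an ELF file")
    classes = {1: 32, 2: 64}  # EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64
    if ident[4] not in classes:
        raise ValueError(f"{path} has an unknown ELF class {ident[4]}")
    return classes[ident[4]]
```

Calling elf_class("./quickmerge") in the build directory would report the word size of the compiled binary.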

Redundancy introduced by Quickmerge?

Hi,

When a reference assembly is used to improve the hybrid assembly (e.g., contig A in the reference assembly improves contig B in the hybrid assembly), does quickmerge also check whether there is a genomic equivalent of contig A in the hybrid assembly? If not, quickmerge will introduce redundancy.

Best,
Danshu

Add --version or -V flag

% quickmerge -V
quickmerge 0.2

% echo $?
0

(printed to stdout, with a clean exit code)

the result

scaffold100000|size4398 C27495846419.0 4398 287
1003 1288 287 1 1 1 0-2400
scaffold100000|size4398 C27567557919.0 4398 322
1800 2026 322 95 2 2 0-1000
scaffold100000|size4398 scaffold13097019.4 4398 3412
2129 2283 167 10 4 4 0-112-1-250
scaffold100000|size4398 scaffold44576520.6 4398 5548
1458 1695 3065 2826 3 3 0-202-3203867 4222 1115 757 7 7 0-14124-71-2332-58-60
scaffold100001|size4398 C2721590907.0 4398 209
4144 4248 140 36 0 0 00
scaffold100001|size4398 scaffold133571412.7 4398 1391
3481 4011 809 1347 28 28 0-216-1-1-1-1-1-1-1166-80
scaffold100001|size4398 scaffold4287048.3 4398 1847
525 792 269 4 7 7 025-1054511-460
scaffold100001|size4398 scaffold86268516.5 4398 1028
1381 1726 281 627 8 8 0-24-1720
scaffold100002|size4398 C27479571713.0 4398 281
584 858 281 7 0 0 00
scaffold100003|size4398 C27185330824.0 4398 203
931 1133 1 203 1 1 00
##############################################333

This is the result in my merged.fasta. What does it mean?
I think there may be a problem here.

terminate called after throwing an instance of 'std::out_of_range' what(): basic_string::substr: __pos (which is 1047464) > this->size() (which is 0) Aborted

Hi,

When I try to run quickmerge using assembly A as reference (Illumina assembly ~746MB) and assembly B (nanopore assembly ~846MB) as query, it works fine.

However, when I swap them, using assembly A as query and assembly B as reference, it throws an error: terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr: __pos (which is 1047464) > this->size() (which is 0)
Aborted

In both cases I changed the file format (before running quickmerge) using the following command:
merge_wrapper.py nanopore_assembly.fasta illumina_assembly.fasta --no_nucmer --no_delta --clean_only

Could you please give me some insight into this issue?

Regards,
Niraj

terminate called after throwing an instance of 'std::logic_error'

Hi,

I have an issue after running merge_wrapper.py

4: FINISHING DATA
0       quickmerge
1       -d
2       out.rq.delta
3       -q
4       hybrid_oneline.fa
5       -r
6       self_oneline.fa
7       -hco
8       5.0
9       -c
10      1.5
11      -l
12      0
13      -ml
14      5000
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid

merged.fasta the same as hybrid_assembly.fasta

Hi,

I'm trying to run quickmerge with assemblies generated with dbg2olc and canu. I ran both the wrapper script and the individual steps as described in the readme. I don't receive any errors, but the resulting merged.fasta has exactly the same content as hybrid_assembly.fasta. I can also see that, while summaryOut.txt and aln_summary.tsv seem normal, the anchor_summary.txt file is empty except for a header line. Please let me know if any additional info would be useful in determining why the merging doesn't seem to be working.

terminate called after throwing an instance of 'std::out_of_range'

Hello,

I am facing a problem related to 'std::out_of_range'. When I ran quickmerge the first time, it worked fine with command 1 below. But when I swap the reference and query (command 2), it shows the std::out_of_range error. I am using the latest build. Please let me know how I can solve the issue.

command 1
nucmer -l 100 -prefix out New_CspixV2gen.fa 10x_spix.fa
delta-filter -i 95 -r -q out.delta > out.rq.delta
quickmerge -d out.rq.delta -r New_CspixV2gen.fa -q 10x_spix.fa -hco 5.0 -c 1.5 -l 1000000 -ml 6000

command 2
nucmer -l 100 -prefix out 10x_spix.fa New_CspixV2gen.fa
delta-filter -i 95 -r -q out.delta > out.rq.delta
quickmerge -d out.rq.delta -r 10x_spix.fa -q New_CspixV2gen.fa -hco 5.0 -c 1.5 -l 1000000 -ml 6000

merge uniq contigs

Would it be possible to merge the unique contigs in the reference assembly into the query assembly? By "uniq" I mean contigs that have little overlap with any contig in the query assembly.

Possible reason for the following big scaffold got discarded

Hi @mahulchak ,

We used quickmerge to merge PacBio + Dovetail + Bionano (pb+dt+bn) scaffolds with an ONT assembly.

I found that an 83290869 bp scaffold was discarded in the quickmerge results, and this scaffold is NOT in the anchor summary output. I could not figure out the reason. I suspect a misassembly, but cannot find proof. It would be much appreciated if you could take a look at the info I attached and give me some suggestions.

I have attached an Excel sheet with the NUCmer alignment of this scaffold to the quickmerge assembly (converted from .delta to .paf format with Li Heng's paftools.js for readability).

Super-Scaffold_7308.xlsx

For the file header

header	details
query_id	Sequence id in our pb + dt + bn scaffolds
query_length	pb + dt + bn scaffold length
query_start	start of alignment on pb + dt + bn scaffold
query_end	end of alignment on pb + dt + bn scaffold
relative_strandness	relative strandness
quickmerge_id	Sequence id in quickmerge result
quickmerge_length	quickmerge sequence length
quickmerge_start	start of alignment on quickmerge sequence
quickmerge_end	end of alignment on quickmerge sequence

For quickmerge_id, there are two possibilities:

Super-Scaffold_XXX/ScIPGVn_XXX_obj_pilon, pb+dt+bn scaffolds ids (used as query)
utgXXX_pilon, ONT assembly ids (used as reference)

Thank you very much!

better naming of output (intermediate) files

Hi,

You have great software here. I just wanted to bring something up that might make it even better.
I think it would be useful to name the files the program creates more uniformly, that is, to name them all using the prefix given on the command line. I notice that some files have a generic name (e.g. self_oneline.fa or the merged.fasta output file), which causes them to be overwritten when I run several jobs at once, even though each job uses a different prefix.

Thanks, and keep up the good work.

merge_wrapper.py run errors

Hi all,

I get an error when I run merge_wrapper.py. Can anyone suggest how to fix it?
I used the following command:

/home/tg484/quickmerge/merge_wrapper.py -p Dbia_merge_wrapper Dbia_min1000_Illumina.fasta Dbia_min1000_nanopore.fasta -hco 5.0 -c 1.5 -l 2791184 -ml 15000
Error: Multiple query file is only supported with the SAM output format
Usage: nucmer [options] ref:path qry:path+
Use --help for more information
ERROR: Could not parse delta file, Dbia_merge_wrapper.delta
error no: 400
Traceback (most recent call last):
  File "/home/tg484/quickmerge/merge_wrapper.py", line 176, in <module>
    subprocess.call(mergercall)
  File "/home/tg484/anaconda3/lib/python3.6/subprocess.py", line 287, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/tg484/anaconda3/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/home/tg484/anaconda3/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'quickmerge': 'quickmerge'

Thank you,
Thiru
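
The final FileNotFoundError above means Python could not find an executable named quickmerge on PATH when the wrapper tried to launch it. A minimal pre-flight check, assuming the wrapper calls the three tools by bare name (the tool list and the function name find_missing_tools are illustrative, not part of merge_wrapper.py):

```python
import shutil

def find_missing_tools(tools=("nucmer", "delta-filter", "quickmerge")):
    """Return the names in `tools` that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in tools if shutil.which(name) is None]
```

If find_missing_tools() returns a non-empty list, adding the directory that holds those binaries to PATH before running the wrapper is one thing to try. This check does not address the earlier nucmer usage error in the same log.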

HELP!!!!!! terminate called after throwing an instance of 'std::logic_error' what(): basic_string::_S_construct null not valid

quickmerge -d out.rq_1.delta -q /ufrc/pelzstelinski/sneupane/Analysis_of_Demultiplexed_10_wDi_DNA_HMW_Unpure_1st_Elution/wDi_canu_seqtk45_CCS_1000_corr0.015_500_100_GOOD/uni_corrected/assembly_edited.fasta -r /ufrc/pelzstelinski/sneupane/Analysis_of_Demultiplexed_10_wDi_DNA_HMW_Unpure_1st_Elution/wDi_canu_seqtk45_CCS_1000_corr0.015_500_100_GOOD/wDi.contigs.fasta_headeredited_spaceremoved.fa -hco 5.0 -c 1.5 -l 1250000 -ml 10000

0 quickmerge
1 -d
2 out.rq_1.delta
3 -q
4 /ufrc/pelzstelinski/sneupane/Analysis_of_Demultiplexed_10_wDi_DNA_HMW_Unpure_1st_Elution/wDi_canu_seqtk45_CCS_1000_corr0.015_500_100_GOOD/uni_corrected/assembly_edited.fasta
5 -r
6 /ufrc/pelzstelinski/sneupane/Analysis_of_Demultiplexed_10_wDi_DNA_HMW_Unpure_1st_Elution/wDi_canu_seqtk45_CCS_1000_corr0.015_500_100_GOOD/wDi.contigs.fasta_headeredited_spaceremoved.fa
7 -hco
8 5.0
9 -c
10 1.5
11 -l
12 1250000
13 -ml
14 10000
terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid
Aborted

terminate called after throwing an instance of 'std::out_of_range'

Code: fix merge_wrapper.py

Hi,

Code improvements to fix errors with merge_wrapper.py:

#if args.length_minimum:
mergercall.append('-ml')
mergercall.append(str(length_minimum))
#if args.prefix:
mergercall.append('-p')
mergercall.append(str(prefix))

Changes: use -ml instead of -lm, and always pass the prefix (-p), which quickmerge requires.

nico ;)

Segmentation fault with larger heterozygous input

Hi,

I'm getting a segmentation fault at the quickmerge stage after using merge_wrapper. Nucmer and everything before it worked fine and created normal outputs, including the oneline.fa files.

-rw-r--r-- 1 guerrer QGGP    2456765 Jul 22 13:23 anchor_summary_out.txt
-rw-r--r-- 1 guerrer QGGP 1049205219 Jul 22 13:22 hybrid_oneline.fa
-rw-r--r-- 1 guerrer QGGP          0 Jul 22 13:23 merged_out.fasta
-rw-r--r-- 1 guerrer QGGP   91857760 Jul 22 13:22 out.delta
-rw-r--r-- 1 guerrer QGGP    9563554 Jul 22 13:22 out.rq.delta
-rw-r--r-- 1 guerrer QGGP    3982704 Jul 22 13:23 param_summary_out.txt
-rw-r--r-- 1 guerrer QGGP 1343593069 Jul 22 13:22 self_oneline.fa

The quickmerge error must be related to my input self_oneline.fa, since it is the only thing that changed.

Now, instead of rerunning the whole wrapper, I'm just running this command:

quickmerge -d out.rq.delta -q hybrid_oneline.fa -r self_oneline.fa -hco 5.0 -c 1.5 -l 0 -ml 5000 -p out

Results:

File                 Size (bytes)   First line (FASTA header)   Result
hybrid_oneline.fa    1049205219     >1
self_oneline.fa      845417800      >tig00035160                Success!

hybrid_oneline.fa    1049205219     >1
self_oneline.fa      1343593069     >tig00000003_pilon          Segmentation fault (core dumped)

The only difference is the size and origin of the self file. The successful one is a canu assembly after using purge haplotigs to eliminate haplotigs (alternative diploid contigs). The unsuccessful one is a canu assembly without purge haplotigs (so it is heterozygous/diploid).

Is my problem derived from the input sizes? Or maybe from heterozygosity?

hybrid assembly

Hi,
We have 50x Illumina paired-end and 7x ONT reads. Is QuickMerge able to produce a hybrid assembly out of these data?

Thank you in advance.

Michal

some issues

Hi,

some things I noticed while trying quickmerge:

make_merger.sh has wrong compilation instructions
should be "g++ -Wall -o quickmerge quickmerge.cpp qmergelib.cpp -I." instead of "g++ -Wall work_in_prog_temp.cpp exp_testlib.cpp -o merger"

MUMmer compilation might fail because the folder aux_bin isn't created.

Running the quickmerge wrapper just prints all the scaffolds and contigs to stdout. The headers are printed twice, then the sequence itself.

Chris

merged fasta is empty - where to look to troubleshoot?

Hi Mahul,

Thanks for writing and maintaining a terrific program. I've used your software successfully on a previous genome by merging a pair of 10x and Nanopore assemblies. I thought I'd give it a shot on a different genome (from a different organism), but this time the output of quickmerge is an empty fasta file. It appears that the program has run without any errors, so I wasn't sure where to start troubleshooting (or how to interpret an empty fasta file).

I ran the following code using the wrapper:

QMERGEPY=/path/to/merge_wrapper.py
$QMERGEPY -pre mrgd nanopore.fasta hic.scaffolds.fasta

The following files were present in the directory where quickmerge was executed (after the job had completed):

-rw-r--r-- 1 dro49 cluster 705K Dec 16 22:51 anchor_summary_mrgd.txt
-rw-r--r-- 1 dro49 cluster 2.8M Dec 16 22:50 param_summary_mrgd.txt
-rw-r--r-- 1 dro49 cluster  15M Dec 16 22:50 aln_summary_mrgd.tsv
-rw-r--r-- 1 dro49 cluster    0 Dec 16 22:50 merged_mrgd.fasta
-rw-r--r-- 1 dro49 cluster  60M Dec 16 22:50 hic.rq.delta
-rw-r--r-- 1 dro49 cluster 1.1K Dec 16 22:50 qmerge.log
-rw-r--r-- 1 dro49 cluster 159M Dec 16 22:49 hic.delta
-rw-r--r-- 1 dro49 cluster 1.9G Dec 16 18:10 self_oneline.fa
-rw-r--r-- 1 dro49 cluster 1.9G Dec 16 18:08 nanopore.fasta
-rw-r--r-- 1 dro49 cluster 1.9G Dec 16 17:44 hic.scaffolds.fasta

Happy to share any further info that is useful for troubleshooting. I appreciate your help and insights,

Devon

Output as 80 character standard FASTA

Hello! Mahul

I am using the merge_wrapper.py approach.
I noticed that a second round of quickmerge, run on the output of the first round, would not complete.
It stops after generating out.delta, without generating the final merged.fasta.

I found that after converting the FASTA from one-line sequences to the standard 80 characters per line, the script completes.

May I ask if I have missed anything?
Also, would it be possible to write merged.fasta with 80 characters per line, or to let quickmerge accept one-line sequences?

Many thanks!
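
The conversion described above can be sketched as follows. This is a stand-alone helper, not part of quickmerge; the function name wrap_fasta is made up, and it holds one record's sequence in memory at a time.

```python
def wrap_fasta(lines, width=80):
    """Yield FASTA lines with each record's sequence re-wrapped to `width` columns."""
    seq_parts = []

    def flush():
        # Emit the sequence collected so far in chunks of `width` characters.
        seq = "".join(seq_parts)
        seq_parts.clear()
        for start in range(0, len(seq), width):
            yield seq[start:start + width]

    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield from flush()   # finish the previous record first
            yield line           # headers are passed through unchanged
        elif line:
            seq_parts.append(line.strip())
    yield from flush()           # finish the last record
```

A typical use would be to open the one-line FASTA for reading, open a new file for writing, and write each yielded line followed by a newline.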

Is there a way to run quickmerge wrapper in multi-threads?

Hi,

I am dealing with a large genome, so running quickmerge takes a very long time. Is there a way to run the quickmerge wrapper with multiple threads to speed it up? I checked the wrapper and nucmer help output, and there seems to be no multi-threading option.

Shu

About option: -c

Hi,
Could you go into more detail on option "-c"?

-hco: controls the overlap cutoff used in selection of anchor contigs. Default is 5.0.
-c: controls the overlap cutoff for contigs used for extension of the anchor contig. Default is 1.5.

According to the thesis,
-hco=overlapping region/non-overlapping region

What is the difference between "-c" and "-hco"?
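
To make the quoted ratio concrete, here is a small worked example with made-up lengths. It assumes that both -hco and -c are compared against the same overlap/non-overlap ratio and that each cutoff acts as a lower bound; neither assumption is confirmed in this thread.

```python
def overlap_ratio(overlap_len, non_overlap_len):
    """Ratio of overlapping to non-overlapping length, as quoted for -hco."""
    if non_overlap_len == 0:
        return float("inf")  # nothing outside the overlap
    return overlap_len / non_overlap_len

HCO = 5.0  # default -hco (anchor selection)
C = 1.5    # default -c (extension of the anchor)

# Hypothetical alignment: 30 kb overlapping, 10 kb not overlapping.
ratio = overlap_ratio(30_000, 10_000)

# Assumption: a contig passes a cutoff when its ratio is at least the cutoff.
passes_hco = ratio >= HCO
passes_c = ratio >= C
```

Under these assumptions the example alignment would be too weak to seed an anchor but strong enough to extend one, which is one way to read the two defaults.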

errors on libc++abi.dylib

Hi there,

I tried to run quickmerge to combine hybrid assembly (using spades) with pacbio only assembly (using canu) and got the following error message: libc++abi.dylib: terminating with uncaught exception of type std::out_of_range: basic_string

The first two steps look fine, without any error messages. The error was produced at the last step, when running quickmerge. Do you have any idea how I can fix this?

I also tried with the python wrapper script and got a different error message.

I attached the full log file and the commands used if you want to have a look.
merge_wrapper_errors.txt
quickmerge_errors.txt

Thanks
Tuan

running quickmerge using multiple threads

Is there any parameter to speed up quickmerge progress, for example multiple threads?
Thanks

segmentation fault

Hi,

I am trying to run quickmerge with two WGS assemblies of a 1.4 Gb genome but run into a segmentation fault in the merge step. The N50 of the assemblies is about 200 Kb.

my commands:

nucmer -l  100 -prefix out pex_1.ctg.fasta pex_2.ctg.fasta

delta-filter -i 95 -r -q out.delta > out.rq.delta

quickmerge -d out.rq.delta -q pex_2.ctg.fasta -r pex_1.ctg.fasta -hco 5 -c 1.5 -l 200000

The stderr output is unfortunately not very informative:

/opt/sge/default/spool/binfservas12/job_scripts/116904: line 14: 99325 Segmentation fault      /home/mmoser/quickmerge/merger/quickmerge -d out.rq.delta -q pex_2.ctg.fasta -r pex_1.ctg.fasta -hco 5 -c 1.5 -l 200000

Stdout contains a list of 37 contigs present in anchor_summary.txt (which has anchors for 1172 contigs).

I hope to get a clue about how to resolve this problem. I ran everything on an SGE cluster. RAM usage didn't seem to be too high when the job terminated.

Thank you,
Michel

Reg: mummer's newer version and alternatives

Hi,

This is more of an enhancement/feature request than an issue!

Is there any reason that you are shipping mummer version 3 rather than the 4th one?

Also, given that I work with genomes > 2Gb normally, I have this query. Have you tried minimap2-> sam -> delta and then using those for quickmerge?

fatal error: iostream: No such file or directory

[root@localhost Quick1]# sh make_merger.sh
g++ -O3 -Wall -o quickmerge quickmerge.cpp qmergelib.cpp -I.
quickmerge.cpp:8:19: fatal error: iostream: No such file or directory
#include <iostream>
^
compilation terminated.
qmergelib.cpp:1:19: fatal error: iostream: No such file or directory
#include <iostream>
^
compilation terminated.
make: *** [quickmerge] Error 1
mkdir: cannot create directory ‘aux_bin’: File exists
check complete
cd /home/tools/Quick1/MUMmer3.23/src/kurtz; make mummer
make[1]: Entering directory `/home/tools/Quick1/MUMmer3.23/src/kurtz'
cd libbasedir; make libbase.a
make[2]: Entering directory `/home/tools/Quick1/MUMmer3.23/src/kurtz/libbasedir'
/usr/local/bin/gcc -O3 -c -o cleanMUMcand.o cleanMUMcand.c
In file included from types.h:13:0,
from cleanMUMcand.c:11:
/usr/include/sys/types.h:146:20: fatal error: stddef.h: No such file or directory
#include <stddef.h>
^
compilation terminated.
make[2]: *** [cleanMUMcand.o] Error 1
make[2]: Leaving directory `/home/tools/Quick1/MUMmer3.23/src/kurtz/libbasedir'
make[1]: *** [mummer] Error 2
make[1]: Leaving directory `/home/tools/Quick1/MUMmer3.23/src/kurtz'
make: *** [kurtz] Error 2
[root@localhost Quick1]# ls
quickmerge-0.3.tar.gz
[root@localhost Quick1]# tar -xjvf quickmerge-0.3.tar.gz
bzip2: (stdin) is not a bzip2 file.
tar: Child returned status 2
tar: Error is not recoverable: exiting now
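
The last error in this log is separate from the compiler errors: tar's -j flag selects bzip2, while a .tar.gz archive is gzip-compressed. As an illustration, Python's tarfile module can detect the compression itself with mode "r:*". The function name extract_archive is made up, and the archive should come from a trusted source because extractall writes whatever paths the archive contains.

```python
import tarfile

def extract_archive(archive_path, dest="."):
    """Extract a tar archive, letting tarfile detect gzip/bzip2/xz compression."""
    # Mode "r:*" opens for reading with transparent compression detection.
    with tarfile.open(archive_path, mode="r:*") as tar:
        tar.extractall(path=dest)
```

For the file in the log this would be called as extract_archive("quickmerge-0.3.tar.gz").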

Please help. No output, no error, exit 0

Hi Mahul,

I saw your talk at the PacBio UGM at Stanford, read your paper, and wanted to give quickmerge a try. However, I cannot seem to get the program to work and I was hoping you could help me. Probably unrelated: when I compile quickmerge I get the following warnings:

$ make
g++ -O3 -Wall -o quickmerge quickmerge.cpp qmergelib.cpp -I.
qmergelib.cpp: In function 'void nOvlStoreCalculator(asmMerge&)':
qmergelib.cpp:367:44: warning: 'noRovl' may be used uninitialized in this function [-Wuninitialized]
qmergelib.cpp:367:44: warning: 'noLovl' may be used uninitialized in this function [-Wuninitialized]
qmergelib.cpp: In function 'void discAnchor(std::string&, asmMerge&, std::string&, double)':
qmergelib.cpp:1761:3: warning: 'cutoff' may be used uninitialized in this function [-Wuninitialized]

Since these are not critical errors, I tried to run quickmerge anyway on a pre-generated nucmer delta file and its associated fasta files. quickmerge creates the expected output files with only headers, emits no error messages, and exits with code 0. My command line was:

$ ~/tools/bin/quickmerge/quickmerge -l 1200000 -hco 5 -c 1.5 -ml 50 -d in.delta -r r.fasta -q q.fasta

My de novo contigs were generated by Canu and my hybrid assembly by DBG2OLC (with some custom contig breaking along the way). I made sure to remove any whitespace in the fasta headers and reformat the fasta sequences to occupy a single line (no line wrapping). Running merge_wrapper.py --clean_only creates no output (the code block following the conditional statement on line 112 is not evaluated).

A snippet of my delta file:

$ grep -A 1 '^>' in.delta | head 
>tig00002811 Backbone_1000_1_621250 588440 621250
271694 369472 10 97855 296 296 0
--
>tig00007630 Backbone_1000_1_621250 278801 621250
1 20162 316551 336715 80 80 0
--
>tig00008053 Backbone_1000_1_621250 240074 621250
6546 7428 609882 610823 126 126 0
--
>tig00001422 Backbone_1000_1344501_3092750 1131566 1748250

A snippet of my query file:

$ head -2 q.fasta | cut -c1-30
>Backbone_1_1_343250
TCTTTTAAACAAAGTGGAGAACAAAAACTA...

A snippet of my ref file:

$ head -2 r.fasta | cut -c1-30
>tig00000005
ATCATCATGGAAGTTCAGCTAGAGGAGTTA...
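
One thing that can be checked with snippets like these is whether every sequence ID named in the delta file also exists in the FASTA files passed with -r and -q. In the delta format each record header is ">refID qryID refLen qryLen", so the two ID sets can be collected and compared as sketched below. The function names are made up, and because the post shows only the first few records of each file, a mismatch in these heads alone does not prove a mismatch in the full files.

```python
def delta_ids(delta_lines):
    """Collect (reference IDs, query IDs) from the '>' header lines of a delta file."""
    refs, queries = set(), set()
    for line in delta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if len(fields) >= 2:
                refs.add(fields[0])     # first field: reference sequence ID
                queries.add(fields[1])  # second field: query sequence ID
    return refs, queries

def fasta_ids(fasta_lines):
    """Collect sequence IDs (first word of each header) from FASTA lines."""
    return {line[1:].split()[0] for line in fasta_lines
            if line.startswith(">") and len(line.strip()) > 1}

def missing_from_fasta(ids_from_delta, fasta_lines):
    """Return the delta IDs that have no matching FASTA header."""
    return ids_from_delta - fasta_ids(fasta_lines)
```

Passing open file handles for in.delta, r.fasta and q.fasta works, since all three functions only iterate over lines.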

nucmer parameters for large genome

Hi,

I am running quickmerge on a eukaryotic genome of around 1 Gb. I have two assemblies, from Canu (960 Mb) and Miniasm (870 Mb).
With the default quickmerge parameters, the merged assembly is ~1 Gb with an N50 of 3 Mb, and I am quite satisfied with these metrics. However, as I would like to study the chromosome structure of this species, it is really important that the orientation of the merged contigs is as accurate as possible.
Could you suggest nucmer parameters that help to exclude mismerges for such a big genome? I am re-running quickmerge with nucmer --maxmatch -c 500 -l 100, but it is still difficult to determine how stringent the parameters should be.

Thanks,

Query/reference OR donor/acceptor OR hybrid/self OR hybrid/pacbio

Which is first on the command line?

This may sound trivial, but to someone like me who is trying to figure out which of my assemblies I should give first on the command line and which second, it isn't helping that every time the assemblies are discussed they have different names.

The NAR manuscript uses both donor/acceptor (which to me seems the most informative) and reference/query. It seems to mostly use reference/query in discussion but figure 4 only uses donor/acceptor.

In description of how to run the wrapper, the main readme here calls them hybrid and self. I have yet to figure out which of those is the donor (a.k.a reference) and which is acceptor (a.k.a. query).

The manuscript also says it merges a hybrid assembly and a PacBio assembly. If the paper says which one was the donor, it has eluded me.

The quickmerge wiki gives some advice for deciding which of my assemblies should be the query, and which reference. But, it fails to tell me which order these should appear on the command line.

Full run command & expected output

Is there anywhere with a full run command including all the parameters that need to be set? Also, what is the nature and format of the output?

The -h information is pretty limited. I can see that I need to set -l seed_length_cutoff -ml merging_length_cutoff but don't really know how these are used to make an educated guess as to what to set.
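
One command elsewhere on this page passes -l 200000 for assemblies whose N50 is about 200 Kb, so an N50-based value is one starting point people have used for -l. Whether that is the right choice for a given dataset is not established in this thread. The sketch below only shows how to compute N50 from a FASTA file; the function names are made up.

```python
def contig_lengths(fasta_lines):
    """Return the sequence length of each record, in file order."""
    lengths = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # start a new record
        elif line and lengths:
            lengths[-1] += len(line)   # add this sequence line to the current record
    return lengths

def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

With an open FASTA file handle fh, n50(contig_lengths(fh)) gives the value in base pairs.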

#!/usr/bin/env python

Hi Mahul,

Just a note -- I had to change this in merge_wrapper.py:

#!/usr/bin/python

to

#!/usr/bin/env python

for it to work.

It might be good to change this in your code as the latter will figure out the correct location of python on anyone's system.

best,

John

Duplications

Hi Mahul,

Based on BUSCO analysis, quickmerge is introducing duplications into my genome sequence. The original versions of my assemblies have 17 and 27 duplicated BUSCOs but, after quickmerge, there are 78 or 398 duplicated BUSCOs, depending on merge order. I tried the two-stage approach but the number of duplicated BUSCOs just keeps getting higher with each iteration. Any idea why this is happening or how I can remedy the situation? I have been running merge_wrapper.py with -hco 7 -c 2 -lm 5000 -l 5000000. My genome is from a bee and is ~400Mb (though the assemblies are closer to 300Mb) and known to be somewhat repetitive. My original assemblies are both pretty complete and are only missing ~2% of 4,415 BUSCOs.

Any help would be much appreciated.

Thanks,
Ben

Request: Description of Output Files

Hi,

Is it possible to include a description of all the output files to expect from running merge_wrapper.py ?
Or did I overlook that?

best,

John

merging self pacbio assembly with illumina-based one

Hello,

I'm trying to merge two assemblies of the same individual using different approaches: one refers to a previously generated illumina draft assembly based on very high coverage available, the other is a canu assembly produced from self corrected reads and polished with quiver and pilon. My pacbio coverage is modest, as after error correction I was just able to use 45% of the data (~30X coverage).

I believe my best draft is the Illumina one because I think it captures a broader portion of the genome. Although it is a little more fragmented than the canu assembly (~23k scaffolds vs ~18k contigs), the N50 of the Illumina one is much higher (~450 kb vs ~91 kb). Therefore, following your recommendations and my own judgment, I understand that using the Illumina draft as the query (the hybrid assembly positional argument in the quickmerge wrapper) should be the best solution (the PacBio assembly will help close regions that the short-read assembly didn't capture). I also tried the other approach (PacBio self assembly as query).

The quast output displays an improvement in both cases (file attached), with best metrics achieved when using illumina draft as query (best N50, less scaffolds, best genome size).

As I understand it, quickmerge mostly outputs sequences from the query genome that were joined by the reference genome, as well as the query sequences that remained unaligned. The reference sequences are not included in the output, and if I want them, I should follow the recommendations in issue #11. However, I see fasta headers from both assemblies in the output. Furthermore, comparing their lengths in the merged file and in the original assemblies, I see that the Illumina scaffolds (which I think served as the query) have exactly the same length as in the original draft, and the PacBio-based contigs (the reference) are either longer or the same as before.

My questions are:
a) Given the following command, which sequences will serve as queries?
merge_wrapper.py -pre draftAsQuery -l 1000 illumina.fasta pacbio.fasta
In the alignment summary file I see in the 1st columns (REF), sequences coming from pacbio assembly as I expected, but in the merged fasta I see sequences from both, particularly contig extensions in the reference sequence.

b) Is it OK to give quickmerge scaffolds (with Ns) instead of contigs?

c) Could you comment on the applicability of the tool in my case, where I have a full Illumina assembly?

Thanks in advance,
Pedro Barbosa
quickmergeResults.txt

'std::out_of_range' error

Hi Mahul,
I am trying to use quickmerge but am receiving the following error:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18446744073709536653) > this->size() (which is 342465)

Looking at some of the other issues, I've seen this error come up a few other times. However, it looked like the culprit was fasta files with whitespace in the header names, or sequences not on one line. I do not believe this to be the issue in this case, as I started with the merge_wrapper.py script. I ran the command as follows:

merge_wrapper.py ../scaff10x_rounds2/renamed.sspace_scaff10x.2.fasta ../canu_assembly/asm/AM.contigs.fasta

I can see that it correctly creates the files hybrid_oneline.fa and self_oneline.fa in my current working directory. If I look at the first few headers in each file:

cat hybrid_oneline.fa|grep ">"|head -n 5
>1
>2
>3
>4
>5

cat self_oneline.fa|grep ">"|head -n 5
>tig00000004_len=34946_reads=29_covStat=35.85_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000005_len=26830_reads=11_covStat=22.77_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000007_len=146883_reads=146_covStat=247.16_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000009_len=142320_reads=139_covStat=238.60_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no
>tig00000013_len=39096_reads=25_covStat=60.84_gappedBases=no_class=contig_suggestRepeat=no_suggestCircular=no

Everything looks correct. I have also tried cutting a lot of the unnecessary text in the headers for self_oneline.fa, leaving the headers as ">tigXXXX" in a file called renamed_self.fa. If I try running the quickmerge command, following the order of arguments as on the wiki

quickmerge -d out.rq.delta -q hybrid_oneline.fa -r renamed_self.fa -hco 5 -c 1.5 -l 200000 -ml 5000

I still get the same error. What else, if anything, besides wrongly formatted fasta files could be throwing this error? Thanks for any information/insight and I hope to get this working!

core dumped: terminate called after throwing an instance of 'std::invalid_argument'

Hello:
I ran quickmerge with the following commands:
nucmer -l 100 -prefix out contig.fasta pacbio_assemble.fasta
delta-filter -r -q -l 10000 out.delta > outrq.delta
quickmerge -d outrq.delta -q contig.fasta -r pacbio_assemble.fasta -hco 5 -c 1 -l n -ml m -p prefix
But in the third step I got an error and nothing was produced, like this:

n -ml m -p prefix
0 quickmerge
1 -d
2 yylrq.delta
3 -q
4 /mnt/data/liyunxia/4-project/mito/YYL_Mtctgs-zhu2-uniq.fasta
5 -r
6 /mnt/data/liyunxia/4-project/mito/Mashmap/fmap/YYL.pass.fa@fmlrcor@mashmap2-idseq
7 -hco
8 5
9 -c
10 1
11 -l
12 n
13 -ml
14 m
15 -p
16 prefix
terminate called after throwing an instance of 'std::invalid_argument'
what(): stoi
Aborted (core dumped)

I checked the core file with gdb -c core.317488 and got this:

Missing separate debuginfo for the main executable file
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/9a/5d301e005e22924e5bf88f3b95641c2490a441
Core was generated by `quickmerge -d out.delta -q /pwd/contig.fasta'.
Program terminated with signal 6, Aborted.
#0 0x00007fe7c0ef2207 in ?? ()
(gdb) where
#0 0x00007fe7c0ef2207 in ?? ()
#1 0x00007fe7c0ef38f8 in ?? ()
#2 0x0000000000000020 in ?? ()
#3 0x0000000000000000 in ?? ()

So, what is the problem with this run?
Hope for your reply.
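
In the argument dump above, -l and -ml receive the literal placeholders n and m, and the exception message names stoi, the C++ string-to-integer conversion. A small pre-flight check along these lines would catch that before launching quickmerge. It assumes that -l and -ml are the integer-valued options, and the function name is made up. Python's int() is not identical to std::stoi (for example, it rejects "5.0"), so treat this as a rough guard only.

```python
def check_int_options(argv, int_flags=("-l", "-ml")):
    """Return a list of (flag, value) pairs whose value is not a plain integer."""
    bad = []
    # Look at every argument that is followed by another argument.
    for i, arg in enumerate(argv[:-1]):
        if arg in int_flags:
            value = argv[i + 1]
            try:
                int(value)
            except ValueError:
                bad.append((arg, value))
    return bad
```

Replacing n and m with actual numbers (for example the values used in other commands on this page, such as -l 200000 -ml 5000) makes the list come back empty.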

bioconda version

Greetings of the season.
I would like to suggest developing a conda version of this amazing tool.

Thank you.
Yedomon

BUG for int type range

Because the range of the int type is -2147483648 to 2147483647, quickmerge throws an error like "std::out_of_range" when the merged genome size exceeds 2147483647 bp (2.15 Gb).

From BerryGenomics zhk.
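
The limit described above can be illustrated with a 32-bit signed integer in Python via ctypes. This only demonstrates the wraparound that commonly happens when a value one past the maximum is stored in 32 bits; whether quickmerge stores genome positions in an int is the reporter's claim and is not verified here.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit value

def as_int32(value):
    """Reinterpret an integer as a signed 32-bit value (two's complement wraparound)."""
    # ctypes integer types do no overflow checking; extra high bits are dropped.
    return ctypes.c_int32(value).value

print(as_int32(INT32_MAX))      # 2147483647
print(as_int32(INT32_MAX + 1))  # -2147483648
```

A position that wraps to a negative number like this would be an invalid string offset, which fits the kind of error reported.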

An error when running quickmerge

Hello,
I had a problem when running quickmerge. The error message is shown below:
"
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr: __pos (which is 3152) > this->size() (which is 0)
run.sh: line 5: 26095 Aborted quickmerge -d out.rq.delta -q secondary_contigs.fasta -r pb.only.fasta -hco 5.0 -c 1.5 -l n
"
I know that you already have an explanation for this issue: it is probably caused by a misformatted fasta header line containing whitespace.
However, I checked my two fasta files and there is no whitespace in the header lines, so I don't think a misformatted fasta file is the cause in my case.
Here is what my files look like:

less pb.only.fasta

000000F
CACCTCGTCGGGGAAGGAGATAGCTTCCTCACGCCAT
less hy.asm.fasta
scf7180000332062
GAGGAGACACCGTGCTACTAGGTGGTTGTGCCACCGGAGCAGCCACACCCTTTAACAGGT

Looking forward to your suggestion!
Thanks a lot!

merge_wrapper.py run errors

Dear author,
When I use the quickmerge software, I get some errors that have confused me for a while. Here is the error:

4: FINISHING DATA
Traceback (most recent call last):
  File "/public1/home/testuser/genome_Aessmblysoftware/quickmerge-master/merge_wrapper.py", line 174, in <module>
    subprocess.call(mergercall)
  File "/public1/home/testuser/miniconda2/lib/python2.7/subprocess.py", line 168, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/public1/home/testuser/miniconda2/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/public1/home/testuser/miniconda2/lib/python2.7/subprocess.py", line 1024, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

My command is: merge_wrapper.py p_ctg.fa SUNSET.contigs.fasta >merge.fasta
Can someone help me to solve it?
Thanks
Alex
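
The traceback above ends in subprocess.call raising "No such file or directory", which usually means the program being launched could not be found. A minimal Python 3 sketch for checking this follows; the list of tool names is an assumption about what the wrapper launches (only nucmer and quickmerge appear in the error messages in these issues), and find_tools is a hypothetical helper.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Assumed tool list; adjust it to whatever your copy of the wrapper calls.
    for name, path in find_tools(["nucmer", "delta-filter", "quickmerge"]).items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as missing, adding its directory to PATH before running the wrapper is the usual remedy.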

terminate called after throwing an instance of 'std::out_of_range'

Hi,
When running quickmerge on a large genome, I get the following error:

...
ctg7180000048769        ctg7180000048768        1        Backbone_3247  -1      ctg7180000048769        1
ctg7180000049599         Backbone_10703 1       ctg7180000049599        -1
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18446744073709551359) > this->size() (which is 118601)
Aborted

The number of sequences output by the quickmerge run is smaller than the number of lines in the anchor_summary.txt file, so the error must occur during merging. Also, no fasta file gets created.

When I cut the .rq.delta file down to half its size, the error disappears and I get normal output (of course missing some valuable merged contigs).

I am using commit 3f950d8.

Shall I send you my files?

Thank you,
michel
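
One observation on the number in the message above: a position as large as 18446744073709551359 is what a small negative value looks like when it is reinterpreted as an unsigned 64-bit integer. The sketch below shows the arithmetic; it assumes the position type is 64 bits wide, and as_signed_64 is a hypothetical helper name.

```python
UINT64_MODULUS = 2**64  # number of distinct values in an unsigned 64-bit integer


def as_signed_64(reported: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit value."""
    if reported >= 2**63:
        return reported - UINT64_MODULUS
    return reported


if __name__ == "__main__":
    pos = 18446744073709551359  # __pos from the error message
    print(as_signed_64(pos))    # prints -257
```

Under that assumption, the substr call received a start position of -257, which points to a coordinate computation going negative during merging rather than to a sequence that is merely too short.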

terminate called after throwing an instance of std invalid argument

I am trying to merge two fasta files and I keep getting this error!

fasta parsing problems

Hello!
On some of my genome fastas I get this error:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 764951) > this->size() (which is 755524)
-bash: line 166: 10789 Aborted                 (core dumped)

My sequences are renamed with numbers like >1, >2 ... >n in both genome fastas. I checked for special characters in sequences by grepping:
' ' - space: none in the file
'\t' - tabs: none in the file
'[^ACTGN]' - anything non-nucleotide: I only get the headers.
So my fastas contain '>', ints, A, C, G, T, N and $.
Considering quickmerge works perfectly with my other datasets, I am pretty sure the issue is my fasta files. I am also careful to use the same genome as query in both nucmer and quickmerge.

Would you have any suggestion for other possible errors in my fasta files or other reasons why this error would appear?

Thanks,
Alex

edit: I am trying with the wrapper. Saw similar issue below.
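
Since the reported position (764951) is larger than the sequence size (755524), one thing worth ruling out is a delta file that was produced from different sequences than the FASTA files given to quickmerge. The sketch below compares the lengths recorded in the delta header lines, which have the form ">ref_id qry_id ref_len qry_len", with the lengths found in the FASTA files. The function names are hypothetical, and this is a diagnostic aid rather than anything quickmerge does itself.

```python
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence id: length}, using the first word of each header as the id."""
    lengths: Dict[str, int] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def delta_mismatches(delta_lines: Iterable[str],
                     ref_lengths: Dict[str, int],
                     qry_lengths: Dict[str, int]) -> List[str]:
    """List sequences whose length in the delta file disagrees with the FASTA."""
    problems: List[str] = []
    seen = set()
    for raw in delta_lines:
        if not raw.startswith(">"):
            continue  # skip the path line, the NUCMER line and alignment records
        fields = raw[1:].split()
        if len(fields) != 4:
            continue
        checks = (
            ("reference", fields[0], int(fields[2]), ref_lengths),
            ("query", fields[1], int(fields[3]), qry_lengths),
        )
        for role, seq_id, expected, table in checks:
            if (role, seq_id) in seen:
                continue
            seen.add((role, seq_id))
            actual = table.get(seq_id)
            if actual is None:
                problems.append(f"{role} {seq_id}: not in FASTA")
            elif actual != expected:
                problems.append(
                    f"{role} {seq_id}: delta says {expected}, FASTA has {actual}")
    return problems
```

An empty result means every sequence named in the delta file exists in the matching FASTA with the same length, so the cause would have to lie elsewhere.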

make clean required for 0.2

After struggling with other people's paths showing up in MUMmer files (/Home/mmoser/... etc.), I realized I needed to do a 'make clean' in the MUMmer directory. Should that be added to the make_merger.sh script?

slightly off-topic -- curious about your finisherSC recommendation

Hello,

Thanks for the great tool.

I was curious -- since you recommend finisherSC, I was wondering if you have done any evaluations on the results? Do you have a feel for whether it introduces mis-assemblies and at what rate? etc.

I have read the finisherSC paper -- but I am curious about independent feedback to see if I should include it as part of a de novo assembly I am doing...

Best,

John

Increasing contiguity further

Hello,
I am trying to create an extremely contiguous (chromosome-size) genome for a Drosophila, using the pipeline proposed in your 2016 paper. It has done very well so far: I have 3 chromosomes almost fully resolved, but the 4th one is in 3 fragments. I am wondering whether there is a way to stitch those 3 fragments together, for instance by rerunning quickmerge with less stringent parameters on the final assembly (or just the 3 fragments extracted from it) and one of the two original assemblies (likely the PB-only assembly, since I guess those gaps are present in the hybrid assembly anyway). Do you have experience with or advice on such a procedure? Would it decrease the quality of the assembly in parts other than the ends of the fragments that need to be stitched together? How should I choose the parameters in this context, or should I just play around and compare the outputs?

Alternatively, I was thinking about using the raw reads again on those specific fragment ends to try to extend them through a consensus-calling method. Do you have any tools or programs in mind that would do that?

Thank you very much!
Best regards,
Coline

Where have all the buscos gone?

Hi,

While trying to decide which order I should merge assemblies, I realized there is something happening with quickmerge which I don't understand.

As you recommended on the wiki, I used the "best" assembly as the query. (best was based just on n50 and busco content). After merging, the number of missing busco genes goes up.

merge_wrapper.py query.fasta ref.fasta -l 90000 -lm 5000

query:
C:94.5%[S:93.1%,D:1.4%],F:3.3%,M:2.2%,n:1658

ref:
C:92.6%[S:91.9%,D:0.7%],F:2.2%,M:5.2%,n:1658

quick_merged:
C:94.0%[S:90.4%,D:3.6%],F:1.4%,M:4.6%,n:1658

I looked in the full output table from busco to get a handle on what is happening. There are indeed genes that are complete in the query assembly but are absent in the merged assembly. Basically all transitions between complete, duplicated, fragmented and missing are occurring during the merging process.

From your publication and your explanation in issue #22, it seems that it should be impossible to lose genes from the query if the reference sequences are only used in gaps to stitch together query contigs, and unaligned query contigs make it to the merged.fasta. Am I missing something?

Thanks,
Earl
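
To put the percentages above into gene counts, here is a small sketch that parses BUSCO short-summary strings like the ones quoted in this issue. parse_busco is a hypothetical helper, and because the percentages are already rounded to one decimal, the counts are approximate.

```python
import re
from typing import Dict


def parse_busco(summary: str) -> Dict[str, int]:
    """Convert a BUSCO short summary string into approximate gene counts."""
    total = int(re.search(r"n:(\d+)", summary).group(1))
    percents = re.findall(r"([CSDFM]):(\d+(?:\.\d+)?)%", summary)
    counts = {key: round(float(value) * total / 100) for key, value in percents}
    counts["n"] = total
    return counts


if __name__ == "__main__":
    query = parse_busco("C:94.5%[S:93.1%,D:1.4%],F:3.3%,M:2.2%,n:1658")
    merged = parse_busco("C:94.0%[S:90.4%,D:3.6%],F:1.4%,M:4.6%,n:1658")
    for key in "CSDFM":
        print(key, query[key], "->", merged[key])
```

On these numbers, missing genes go from roughly 36 in the query to roughly 76 in the merged assembly, and duplicated genes from roughly 23 to roughly 60.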

Error : Multiple query file is only supported with the SAM output format

How can I fix the following error?

python merge_wrapper.py Hdata/extended_10K.fa Hdata/final.genome.scf.fasta
Error: Multiple query file is only supported with the SAM output format
Usage: nucmer [options] ref:path qry:path+
Use --help for more information
ERROR: Could not parse delta file, out.delta
error no: 400
Traceback (most recent call last):
File "merge_wrapper.py", line 174, in
subprocess.call(mergercall)
File "/home/urbe/anaconda3/lib/python3.6/subprocess.py", line 267, in call
with Popen(*popenargs, **kwargs) as p:
File "/home/urbe/anaconda3/lib/python3.6/subprocess.py", line 709, in init
restore_signals, start_new_session)
File "/home/urbe/anaconda3/lib/python3.6/subprocess.py", line 1344, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'quickmerge': 'quickmerge'

After fixing the quickmerge executable

The following error occurs:
python merge_wrapper.py Hdata/extended_10K.fa Hdata/final.genome.scf.fasta
Error: Multiple query file is only supported with the SAM output format
Usage: nucmer [options] ref:path qry:path+
Use --help for more information
ERROR: Could not parse delta file, out.delta
error no: 400

maximum reference limited

I ran this command: /psd/biosoft/quickmerge/merge_wrapper.py -pre test_canu -hco 5.0 -c 1.5 -l 310000 -lm 7000 DBG2OLC.fasta canu.sub.fasta
and got an error about the maximum reference length:
###############################################
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS

reading input file "test_canu.ntref" of length 1107738543

construct suffix tree for sequence of length 1107738543

(maximum reference length is 536870908)

(maximum query length is 4294967295)

process 11077385 characters per dot

/psd/biosoft/MUMmer3.23/mummer: suffix tree construction failed: textlen=1107738543 larger than maximal textlen=536870908
ERROR: mummer and/or mgaps returned non-zero
ERROR: Could not parse delta file, test_canu.delta
error no: 400
Traceback (most recent call last):
File "/psd/biosoft/quickmerge/merge_wrapper.py", line 174, in
subprocess.call(mergercall)
File "/usr/local/Python-2.7.9/lib/python2.7/subprocess.py", line 522, in call
return Popen(*popenargs, **kwargs).wait()
File "/usr/local/Python-2.7.9/lib/python2.7/subprocess.py", line 710, in init
errread, errwrite)
File "/usr/local/Python-2.7.9/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
###############################################

How can I solve this error? Should I increase the maximum reference length, and if so, how?
Thank you for your attention.
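
For reference, the limit printed in the log (536870908) equals 2**29 - 4, and the input (1107738543) is about twice that. Below is a minimal sketch for checking a FASTA file against that limit before starting a run. The constant is copied from the log above, the function names are hypothetical, and the base count is a lower bound on the length mummer reports, on the assumption that its concatenated input also contains separators between sequences.

```python
from typing import Iterable

# Limit copied from the mummer log above; numerically it equals 2**29 - 4.
MUMMER3_MAX_REFERENCE = 536_870_908


def total_bases(lines: Iterable[str]) -> int:
    """Count sequence characters in FASTA lines, ignoring header lines."""
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


def exceeds_limit(length: int, limit: int = MUMMER3_MAX_REFERENCE) -> bool:
    """True if a sequence length is larger than the reported maximum."""
    return length > limit


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        n = total_bases(handle)
    print(f"{n} bases; exceeds limit: {exceeds_limit(n)}")
```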
