lh3 / unimap
An EXPERIMENTAL fork of minimap2 optimized for assembly-to-reference alignment
License: MIT License
Hi,
I am trying to align highly divergent hifiasm contigs to hg38 (divergence time ~36 MYA, SNV divergence ~10%).
I first tried the parameters (--eqx -ax asm20 --secondary=no -z 10000,50 -r 50000 --end-bonus=100 -O 5,56 -E 4,1 -B 5) and got highly contiguous mapped segments (see the red blocks in P1.pdf; please ignore the blue blocks, which come from another assembly). [P1.pdf]
As you can see, many segments are missing on chrX, chr16, and chr19 (p arm).
Then I tried the parameters (--eqx -ax asm20 --secondary=no) and got more fragmented mapped segments (see the red blocks in P2.pdf; the three layers show segments >50 kbp, >10 kbp, and <10 kbp).
As you can see, the 'missing sequences' now align to hg38, but I cannot get larger contiguous aligned segments. [P2.pdf]
I have a couple of questions about my mapping strategy:
Why are so many sequences missing with the parameters (--eqx -ax asm20 --secondary=no -z 10000,50 -r 50000 --end-bonus=100 -O 5,56 -E 4,1 -B 5) compared to the parameters (--eqx -ax asm20 --secondary=no)?
Could you recommend parameters that would give more contiguous aligned segments with no missing sequences?
Thank you in advance.
--Yafei
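A minimal sketch for quantifying the 'missing' sequence between the two runs above, assuming each run's alignments are available in PAF format (the runs in the post used -a, which emits SAM, so this would need PAF output or a conversion step). The function name and the toy records are hypothetical. It sums, per reference sequence, the bases covered by at least one alignment, so the per-chromosome totals from the two parameter sets can be compared side by side.

```python
from collections import defaultdict


def covered_bases_per_target(paf_lines):
    """Return {target name: target bases covered by at least one alignment}."""
    intervals = defaultdict(list)
    for line in paf_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # PAF columns (0-based): 5 = target name, 7 = target start, 8 = target end
        intervals[fields[5]].append((int(fields[7]), int(fields[8])))

    covered = {}
    for name, ivs in intervals.items():
        ivs.sort()
        total = 0
        cur_start, cur_end = ivs[0]
        for start, end in ivs[1:]:
            if start > cur_end:
                # Gap between alignments: close the current merged interval.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                # Overlapping or adjacent: extend the current merged interval.
                cur_end = max(cur_end, end)
        total += cur_end - cur_start
        covered[name] = total
    return covered


# Toy records (hypothetical values, not taken from the post).
toy_paf = [
    "ctg1\t1000\t0\t200\t+\tchrX\t5000\t100\t300\t190\t200\t60",
    "ctg2\t1000\t0\t150\t+\tchrX\t5000\t250\t400\t140\t150\t60",
    "ctg3\t1000\t0\t100\t-\tchr16\t4000\t0\t100\t95\t100\t60",
]
print(covered_bases_per_target(toy_paf))
```

Running the same function on the output of each parameter set and subtracting the totals gives a rough per-chromosome measure of what the stricter settings drop. It does not say why the segments are dropped.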
unimap does a great job aligning complex regions (e.g. satellites), even using reads. We are using it to detect variants at telomeric regions using PacBio HiFi sequencing data.
We have evidence from one of these regions that there should be 2 separate deletions, corresponding to two separate groups of monomers from a satellite, each of a different size. However, unimap finds one large deletion instead, close to several variants that suggest mismapping. I assume this behavior comes from the default settings, which may favour one large deletion over several small ones. The program was run with unimap -a -x asm5 -x hifi --cs. I attach a snapshot of this region below.
I have been trying to tune the settings to penalize nucleotide mismatches and to allow more than one deletion where appropriate. However, I ran into some problems:
-B 500 -O 1 -E 2: increased mismatch penalty and lowered gap costs. This results in a "Segmentation fault" error.
-B 500: does not seem to solve the problem; the alignment remains the same.
-O 1 -E 2: this makes some improvements, and for some reads mismatches disappear in favour of deletions (snapshot below), but not for the majority of reads.
I cannot make progress from here, however, since using "-B > 6" causes the segfault, lower values of -B do not change anything, and -O and -E are near their minimum values (they cannot be zero). I wonder if there are other options worth exploring here. Any hints on how to deal with this situation? Thank you
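To compare parameter settings without relying on snapshots, one option is to tally the deletions in each read's CIGAR string from the SAM output. The sketch below is only an illustration: the function names, the length threshold, and the toy CIGAR strings are hypothetical, and it reads nothing beyond the standard SAM columns (read name in column 1, CIGAR in column 6).

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def deletion_lengths(cigar, min_len=1):
    """Return the lengths of all deletions (D) in a CIGAR string that are >= min_len."""
    return [int(n) for n, op in CIGAR_OP.findall(cigar) if op == "D" and int(n) >= min_len]


def deletions_per_read(sam_lines, min_len=50):
    """Map each read name to the list of its deletion lengths >= min_len."""
    result = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] == "*":
            continue  # unmapped read or malformed record
        result.setdefault(fields[0], []).extend(deletion_lengths(fields[5], min_len))
    return result


# Toy SAM records (hypothetical): one read with two deletions, one with a single large one.
toy_sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t100M342D50M171D80M\t*\t0\t0\t*\t*",
    "read2\t0\tchr1\t100\t60\t100M513D130M\t*\t0\t0\t*\t*",
]
for name, dels in deletions_per_read(toy_sam).items():
    print(name, len(dels), dels)
```

Counting the reads that show two deletions versus one under each setting gives a simple number to track while tuning -O and -E.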
Dear Li,
Is there any way to align a genome 1-to-3 or more?
I would like to check for whole-genome duplication using unimap, but it seems to work only 1-to-1.
Thanks.
Won
Hi and thanks for the tool.
As instructed, I want to use it to map a de novo genome assembly from PacBio long reads to a reference genome. I'm not sure which preset to use in this case. Is there one specific to PacBio, as there seems to be for Nanopore? I guess I should use asm5 since I expect the assembly to be similar to the reference, but what about asm10 and asm20? When are these recommended?
Your advice would be much appreciated, thank you for your time and your support
Hi @lh3,
given that this is a minimap2 (+minigraph) fork, I'd assume you intended this to be MIT-licensed as well, but I couldn't find a license here.
I'm currently in the process of packaging dipcall/unimap/bedtk for Bioconda and in need of a license information file so we are able to distribute unimap :).
Cheers,
Marcel
Hello Heng,
Unimap seems like a very useful tool!
Is it possible to provide an mm_idx_str-like interface for index construction in unimap (one that takes "const char **seq" instead of a path to the reference file)?
Thank you!
Hi,
This looks like a very useful tool! Does it extend the ~1-1 chaining algorithm down to base pair level alignment? i.e. does this improve 1-1 mappings in very complex regions?
Thanks in advance!
Mitchell