Comments (6)
I'm not sure I have a full explanation for why this is happening, but one thing that I would recommend is dropping the base qualities on the Ns. If an N comes off of a sequencing machine, the base is usually given a base quality of #
, whereas you've given them a near-perfect base quality. If this still occurs with more realistic base qualities, it would also help to have the full alignments to look at (the path
field in the JSON).
from vg.
With one N, you still get full-length alignments from gapless extension, because the full-length bonus is larger than the penalty for a single mismatch. With two or more Ns, the trailing Ns get trimmed, and Giraffe does tail alignment with Dozeu. What does Dozeu do when the read and/or the graph contain Ns? (Gapless extension masks all non-ACGT characters in the read with X, assuming that the graph does not contain character X.)
from vg.
I believe that Dozeu will treat all Ns as 0 score mismatches (even if both read and reference are N). The full length bonus is not applied during dynamic programming, but it is applied when choosing the optimal traceback location and in the final score.
from vg.
I think I know what is happening. Let's concentrate on the read with two trailing Ns.
There are three equally good alignments in the graph, with the problematic one being path >45>47>48>49>50
in the following subgraph:
H VN:Z:1.1 RS:Z:GRCh38
S 44686345 AGGATCGCTTGAGGCCAGGAGGCGGATGTTGCAGTGAGCCATGATCAT
S 44686346 A
S 44686347 G
S 44686348 ACACTGTA
S 44686349 C
S 44686350 TCCAGCCTGGGTGACAGAGCGAAATTCCAGCAAAGAAAGGGAGAGAGGAAGGGAGAGAGGGAGTAGGGGAGAGAGAGAGAGAGAG
L 44686345 + 44686346 + 0M
L 44686345 + 44686347 + 0M
L 44686346 + 44686348 + 0M
L 44686347 + 44686348 + 0M
L 44686348 + 44686349 + 0M
L 44686349 + 44686350 + 0M
W GRCh38 0 chr5 176090011 176090154 >44686345>44686346>44686348>44686349>44686350 WT:i:51
W unknown 1 chr5 0 57 >44686345>44686347>44686348 WT:i:1
W unknown 2 chr5 0 143 >44686345>44686347>44686348>44686349>44686350 WT:i:36
W unknown 3 chr5 0 85 >44686350 WT:i:1
The first (internal) N corresponds to node 49. Haplotype unknown 1
ends at node 48, because it would have a rare allele at the position of node 49. Meanwhile, haplotype unknown 2
would be the correct alignment. We get one right-maximal extension with no mismatches over unknown 1
and another with 10 additional matches, 4 additional mismatches (all Ns), and a full-length bonus over unknown 2
. Because the extension over unknown 1
scores higher, we select it as the best extension.
The best extension has an unaligned 14 bp tail, which we align using Dozeu. Because Dozeu does not use the mismatch penalty for Ns, this alignment has a higher score than the equivalent alignments elsewhere in the graph.
from vg.
A couple of questions:
-
Do the reads automatically get trimmed if there are at least 2 Ns. Or is it that in this example the 1-N read just happens to be small enough to get the full-length bonus?
-
Do the Ns receive a penalty during the right-maximal extension?
-
Is there anyway we can tell Dozeu to use a non-0 penalty for Ns?
from vg.
Giraffe uses two alignment algorithms. First it tries to find an alignment without indels. If it can't find a full-length alignment with a few mismatches, it trims the partial alignments it found, as mismatches near the ends of an alignment often indicate the presence of indels. Then it aligns the tails of the trimmed alignment while allowing indels.
In the first (gapless extension) algorithm, Ns are mismatches with a normal mismatch penalty against everything. In the second algorithm (Dozeu), Ns are zero-cost mismatches against everything. This is effectively a rare bug that requires a complex fix, because the above behaviors are inherent to the way the alignment algorithms have been implemented.
from vg.
Related Issues (20)
- when i do the code :vg stats -a var1.gam > var1.stats ,show me a bug HOT 2
- vg surject and pack fail on GraphAligner produced .gam HOT 4
- About the usage of vg deconstruct HOT 5
- Right method to annotate genes on pangenome graph HOT 1
- vg call report missing allele HOT 2
- vg call error HOT 2
- How do different methods affect precision? HOT 4
- Giraffe alignment is very slow and produces warning[vg::Watchdog] messages unless rescue is disabled HOT 13
- Giraffe needs to better deal with systems where multiple processes opening the distance index causes slowdowns
- Any option like BWA MEM '-C'? HOT 1
- vg convert failes while converting gbz to xg with signal 6 error HOT 8
- Release vg v1.60.0
- `vg convert --vg-algorithm` loses start coordinates of paths in its W lines HOT 3
- The Amazing Uninjectable SAM HOT 1
- Throw an error "error: trying to access a snarl tree node of the wrong type" in `haplotypes` HOT 11
- Get input file error HOT 5
- vg pack error
- Negative or "*" end-position for named-coordinates gaf with giraffe
- vg autoindex vg giraffe pggb gfa HOT 1
- Hi, I have a question about vg augment. HOT 5
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from vg.