lh3 / gfatools Goto Github PK
View Code? Open in Web Editor NEWTools for manipulating sequence graphs in the GFA and rGFA formats
Tools for manipulating sequence graphs in the GFA and rGFA formats
Directly copying the description from GFA1 repo https://github.com/lh3/gfa1/issues/7, since it seems to be still there.
I believe there is a bug in bubble popping algorithm that leads to a removal of bubbles with (seemingly arbitrary long) tips of any orientation. See attached gfa (called my.txt to trick github attach dialogue).
my.txt
Despite looking like this in Bandage it is being simplified into a single unambiguous path by the -b option of gfaview.
Dear @lh3,
Can you please provide a comment on how the subgraph extraction with a given radius works?
I tried using this function and was getting weird results. In particular the resulting graph had some non-branching paths that to the best of my knowledge shouldn't have been there and my own implementation wasn't producing them.
I will start with a simple case: compact de Bruijng graph (cDBG). We can construct a cDBG for GRCh38 and keep chromosome paths (more exactly walks). This is a lossless representation of GRCh38, but will we prefer this over the linear genome? No. This cDBG discards the intuition of linearity. A unitig may corresponds to multiple regions in GRCh38, which unnecessarily complicates analyses. cDBG is not a good reference model.
Path model can encode such a cDBG. It semantically allows graphs overcomplicated for for reference. This over-flexibility is a problem. Another issue with the path model is that it doesn't have a concept of sample. Having two similar paths from two samples indicate orthology; having two paths from one sample indicate paralogy/segdup. Without the concept of sample, we can't distinguish the two cases.
Starting with all-pair alignment, seqwish will build a graph akin to A-bruijn, which is affected by the same issues with cDBG. From a practical point of view, a much bigger problem is that it assumes all sequences should be kept in the graph. However, assemblies and alignments have many errors and these errors will contaminate the reference. On the other hand, a reference may be incomplete, but the content in the reference shall be accurate. I have closely worked with GRC and firmly share their view that accuracy is critically important to a reference.
rGFA imposes a strong constraint on a reference model: a linear reference sequence doesn't collapse onto a segment. This constraint keeps the intuition of linearity and simplifies graph operations. It is exactly what we want.
rGFA can be extended to a constrained path model with the addition of Walk lines. It explicitly exposes the concept of sample in this case. However, such an extended model will be more complicated. We don't have algorithms to build such a graph. Building an extended rGFA to base accuracy again assumes contigs are perfectly accurate, but we won't get there in a few years.
Another discussion thread. It is probably too early to implement libraries now, but it would be good to start thinking about the topic.
Currently, gfatools comes with very preliminary APIs to read rGFA into memory. The memory layout is described in gfa.h. It largely follows the model of string graphs. I quite like model and will stick with it. However, I guess general devs will feel uncomfortable with this representation. I won't have the bandwidth to implement the more general path model any time soon, either. In addition, it is also preferable to have two independent implementations (e.g. samtools vs picard vs bamtools). I wonder if you (@ekg and @benedictpaten) are interested in implementing a standalone library to work with GFA. You already have in vg a GFA parser, an in-memory model and a serialization format. You can isolate the relevant code and expose stable C and C++ APIs to other devs. I know vg has APIs, but I guess other devs will prefer a more focused lightweight library that is easier to build.
Walk/W-lines are used to keep sequence walks through the graph. There are a few options:
Don't add W-lines as we already have Path/P-lines in GFA1. However, there are multiple problems with P-lines. a) the most important one is that it doesn't allow tags as the last field is optional. We need to attach information (see below) in addition to path names. b) naming is not appropriate. Strictly speaking, a path disallows cycles. c) P-lines use comma as a field separator. Comma is a common symbol in reference names.
Define W-lines as:
<W-line> <- 'W' <ctgId> <oriSeg>+ <tag>*
<oriSeg> <- ( '>' | '<' ) <segId>
We put all the other information in tags. Note that without a list of CIGARs, W-lines are unable to distinguish multiple edges between the same pair of segments. We may define an optional tag to keep the list of CIGARs for general cases.
Define W-lines as:
<W-line> <- 'W' <sampleId> <ctgId> <oriSeg>+ <tag>*
Define W-lines as:
<W-line> <- 'W' <sampleId> <ctgId> <ctgStart> <ctgEnd> <oriSeg>+ <tag>*
With <ctgStart>
and <ctgEnd>
, a walk doesn't have to encode an entire contig.
The differences between 2/3/4 are about what fields should be made mandatory. My mild preference may be 4 (I am not even sure); 2/3 are also good to me. I strongly think P-lines are inadequate for a reference model. With an optional tag to keep a list of CIGARs, W-lines are more general. Internally, a GFA parser can parse both W-lines and P-lines into the same data structure.
CC @ekg and @benedictpaten.
Hello,
I'm having trouble running gfatools bubble
on a GFA file that has been processed with with gfatools asm
.
I'm working with a subgraph that I extracted from minigraph
output using gfatools view
. If I run gfatools bubble
on this subgraph file, I get the expected BED file:
$ gfatools bubble test_subgraph.gfa | head -n 1
V 19985854 19985854 2 1 0 0 0 -1 -1 -1 s54242,s54243 * *
And everything looks as expected in bandage.
However, if I then pop bubbles in this graph using gfatools asm
(command: gfatools asm -b 10000 -u test_subgraph.gfa >test_subgraph.b10000u.gfa
), and then try to run gfatools bubble
on the output, I get nothing:
$ gfatools bubble test_subgraph.b10000u.gfa | head -n 1
This GFA looks as expected in bandage (same structure as before, but fewer segments/bubbles), and I can generate a FASTA using gfatools gfa2fa
. However, in addition to gfatools bubble
not working, gfatools gfa2bed
also produces nothing.
Is there anyway to get this second file to work with gfatools bubble
? Is this something to do with GFA vs rGFA?
Thanks for your help!
I've attached the example files as .txt as github wouldn't accept my GFA files:
test_subgraph.b10000u.gfa.txt
test_subgraph.gfa.txt
And I'm using the following version of gfatools:
$ gfatools version
gfa.h: 0.5-r244-dirty
gfatools: 0.5-r238-dirty
Hello Heng,
I'm getting myself familiar with the GAF format.
One site of interest to me (an INS onto GRCh38), has several reads with the following column 6 (broken down for easier viewing):
>CHM13#0#chr18:67654357-67665821
<HG02145#2#h2tg000315l:763450-763480
>CHM13#0#chr18:67666067-67666203
>HG00735#1#h1tg000016l:20472026-20472231
>CHM13#0#chr18:67666258-67666722
<HG00621#2#h2tg000001l:12875529-12875687
>CHM13#0#chr18:67666724-67672585
>chr18:67453985-67471633
which translates, by hand (I did it, so there might be errors) to the segment coordinate
>s132736>s132737>s132738>s132739<s204376>s132742>s164264>s164265>s132745>s132746>s132747>s132748<s150020>s132750>s132751>chr18:67453985-67471633
I know that minigraph
can output GAF in the second format via --vc
.
I am wondering if there's any tool in gfatools
that does this translation between the two coordinate systems? This way a sound round of mapping can be avoided.
Thanks!
Steve
While my original gfa file has sequences in it, but asm -u
command results in a gfa having * for all the segments.
Is there an option to produce the unitig graph with sequences?
Hi Heng Li,
Huge fan here, apologies but I have been unable to get what I deem a perfectly fine gfa file from miniasm to convert to fasta using gfatools
S utg000001l * LN:i:127742
a utg000001l 0 m64247_210928_214804/49612954/ccs:6-11010 + 3901
a utg000001l 3901 m64247_210928_214804/99615607/ccs:3-9113 + 2765
a utg000001l 6666 m64247_210928_214804/157876436/ccs:1-6871 + 50
From what I can tell the tool gfatools would need the original reads file with those read headers convert it to an actual sequence fasta file right?
$MINIASM -c 2 ava.reads.paf.gz > output.gfa ;
$GFATOOLS gfa2fa -s output.gfa
[M::main] Version: 0.4-r214-dirty
[M::main] CMD: /home/BIOTECH/dmsanders/progs/gfatools/gfatools gfa2fa -s output.gfa
[M::main] Real time: 0.000 sec; CPU: 0.001 sec
This occurs but there is no output...
without -s it segfaults...
And the traditional command:
awk '/^S/{print ">"$2"\n"$3}' output.gfa | fold > out.fa
gives me:
utg000001l
utg000002l
ideas?
Dear Heng,
I just have a simple questions regarding the gfa2bed format (when run with -s)
Test-2 0 227208972 * * * * * *
Test-1 769027 769267 < Test-2 7108396 < Test-2 7108066
The example tells me that Test-1 (from 769027 769267) is inserted/replacing Test-2 at 7108396 to 7108066.
Am I getting this right?
Thanks!
Hello,
I'm testing a few SV detection methods on a small rGFA graph I produced with 5 individuals using minigraph. It seems like the 'bubble' tool would be useful for detecting small to medium sized (<100kb) structural variants... at least simple ones. I was able to discern from the code the different fields in the output, except for b->cf_min, b->cf_max, b->cf_ref. In my test case these fields are always -1. If these are meaningful fields, could you please explain what they mean?
Thank you
Hello,
I create a recipes for gfatools in bioconda bioconda/bioconda-recipes#26073 but gfatools actually didn't have license could you add one ?
Thanks for your work.
It seems that gfatools asm -u
just discards isolated segments.
For the attached file with a single segment the output is empty.
a.txt
Hi Ekim. I have noticed that there is a "--bf" flag in the config. But I didn't find any explanation about the bloom filter in the original paper. I would like to know how it works and what would happen if I apply this flag or not. Thank you very much!
Hello,
I wish I could use your tool 'Online GfaViewer on gfaServer' (https://lh3.github.io/gfatools/) to show the message in GFA file, but I can't find the source code of this tool. Could you please tell me how to find the message?
Thanks for your help.
Hello - is there a way gfatools
can be used to convert fastg
file to gfa
? One of your older programs has a misc folder with a script called fastg2gfa.c
, but I don't know how to run this script and that program's github page directs users to gfatools
. Thus, I was curious if gfatools
can help me accomplish my goal of creating a gfa
file from fastg
.
Please let me know. Thank you!
I would like draw the assembly graph after having removed the alternative haplotype (extracted using purge_dups) from an assembly graph gfa file. Could you add a feature to gfatools blacklist to produce a new gfa from an input gfa and a list of regions to remove?
The region could be given as a simple one column file, with one region per line, or comma separated in a command line option.
I am trying to figure out the aligned output file .gfa,
I learned the meaning of these columns
read 71 0 71 + >1>2>4 87 3 73 67 72 60
Please enlighten me what other columns stand for
NM:i:5 AS:f:56.3 dv:f:0.0694444 id:f:0.930556 cg:Z:4=1X2=1I38=1D5=1I5=1X13=
What is the format of the "A" line produced by gfatools asm
?
S utg0000001l * LN:i:25977 RC:i:371 lc:i:25977
A utg0000001l 0 + 82117 0 401
A utg0000001l 16 - 82119 0 488
A utg0000001l 172 + 82121 0 395
A utg0000001l 200 + 82123 0 390
A utg0000001l 203 - 82125 0 394
A utg0000001l 221 - 82127 0 389
A utg0000001l 304 - 82129 0 317
A utg0000001l 312 - 82131 0 443
A utg0000001l 315 + 82133 0 464
...
Hi
Thank you for your work on gfatools
,
I have an assembly graph that I want to process with gfatools asm
(to simplify the graph, like popping the bubbles, etc) and output the scaffolds.
The graph is based on draft assembly contigs and some connections I inferred based on long reads, either edges or gaps based on if the estimated distance/gap size is negative or positive.
I wanted to get some information/advice on what I should provide gfatools asm
with to get the best out of it. Like in my analysis, each connection (edge or gap) is based on supporting long reads so I have weights (number of supports) for the connections that may be useful, and I format that in the tag currently (FC:i: / see the example below) but not sure if gfatools
will consider that?
I also have the gap size estimates that I wanted to output on G-lines for both gaps and edges (as I don't have alignments but only gap estimates for overlapping contigs too), but I found out negative distances, for overlapping contigs, are not allowed on G-lines, so I will have to somehow format that as an E-line? if so, then are the start/end positions important for the analysis or I can put some fake values?
Toy example:
H VN:Z:2.0
graph [scaf_num=None]
S 1 49057 *
S 2 33803 *
S 3 22222 *
G * 2- 1- 3340 * FC:i:20
# 20 reads support the above gap and the gap size is 334
G * 1+ 3- 4000 * FC:i:6
# 6 reads support the above gap and the gap size is 400
G * 1- 2- -300 * FC:i:15
# 15 reads support this connection and the contigs overlap by 300 bp, but this seems like an invalid G-line and should probably be converted to an E-line?
And lastly: is there any additional information that gfatools
can benefit from? I can potentially prepare and provide those too.
Hello,
I'm working with a large pan-genome and starting to explore options of working with it in a graph context. I created an rGFA graph using minigraph (testing with just 3 samples for now). Now I'd like to derive from it a set of sample-coherent paths. Reading the discussion you had in issue #1 it sounds like at the time you were in the process of formalizing this feature in gfatools. Has there been much progress in this area? It seems like one should be able to pick a vertex and do a BFS where only edges to ranks equal or lower than the starting segment (sample) are considered. Am I interpreting the meaning of rank correctly here?
Thanks!
gfatools gfa2fa -s ./test/MT.gfa > MT.fa
cat MT.fa | grep "^>" | less
MT_human
MT_orang_3426_3927 <MT_human:4502 <MT_human:4001
MT_orang_8961_9463 >MT_human:9505 >MT_human:9505
I don't understand >MT_orang_3426_3927 <MT_human:4502 <MT_human:4001
MT_orang_8961_9463 >MT_human:9505 >MT_human:9505 Anybody can tell me about this?Thank you very much!
Got the following error when trying to run make:
gfa-util.c:65:12: error: conflicting types for ‘gfa_gfa2sfa’
gfa_sfa_t *gfa_gfa2sfa(const gfa_t *g, int32_t *n_sfa_, int32_t write_seq)
^~~~~~~~~~~
In file included from gfa-util.c:2:0:
gfa-priv.h:61:12: note: previous declaration of ‘gfa_gfa2sfa’ was here
gfa_seg_t *gfa_gfa2sfa(const gfa_t *g, int32_t *n_sfa_, int32_t write_seq);
^~~~~~~~~~~
Makefile:18: recipe for target 'gfa-util.o' failed
Saw in the source code that lots of recently changed from gfa_seg_t to gfa_sfa_t, so I changed the type in gfa-priv.h to gfa_sfa_t. Make worked after that, but I have no idea what that affects.
It seems that running gfatools ed -c
produces invalid GAF as only 10 entries are present instead of 12 mandatory per spec
E.g.:
NA21309#1 119129 0 119129 + >s60779>s60780>s60781>s60782>s60783<s227791>s60785>s60786 119130 0 119126 82
Before using the command:
gfatools view -l 303726,303738 -r 20 Assembly-BothStrands.gfa > Assembly-BothStrands_303726-303738_r20.gfa
my GFA has the following elements:
S 5924062 GTTTCAAAAAAAAAAAAAGAGCATGGCTCTGG RC:i:400
L 5209751 + 5924062 + 18M1D6M
L 5420387 + 5924062 + 18M1D6M
After running the command, they look like this:
S 5924062 GTTTCAAAAAAAAAAAAAGAGCATGGCTCTGG LN:i:32 RC:i:400
L 5209751 + 5924062 + 24M L1:i:33 L2:i:8
L 5420387 + 5924062 + 24M L1:i:33 L2:i:8
These overlap Cigars are incorrect:
Cigar: 24M
5924062 1 GTTTCAAAAAAAAAAAAAGAGCATGGCTCTGG
|||||||||||||||||||||||||
5420387 34 ACTGCATTCCAGCCTGGGTGATAGAGTGAGGCTGTTTCAAAAAAAAAAAAAAGAGCAT
Compared to a local alignment with Smith-Waterman:
Cigar: 5M1I19M
5924062 1 GTTTC-AAAAAAAAAAAAAGAGCATGGCTCTGG
||||| |||||||||||||||||||
5420387 34 ACTGCATTCCAGCCTGGGTGATAGAGTGAGGCTGTTTCAAAAAAAAAAAAAAGAGCAT
Dear Heng Li,
I recently tried the gfa view
command with options -l <SEGMENT_NAME> -r <RADIUS>
.
However, the results are not the one which I would expect. Since the tools do not have a very clear documentation yet, I am not sure if I maybe misunderstood completely the commands - or if there is some kind of bug instead.
E.g. when I use the test file included in the repository (MT.gfa). This is the file displayed using my gfaviz tool. I gave a different color to each of what I would expect to be the definition of different radius.
I would expect to extract, when using -l MTh0 -r <radius>
the following:
What I actually get is:
@lh3, I'm curious to know , whether we can convert GFA
to rGFA
? . As per rGFA format, we need to maintain 3 additional information along with each segment lines (S
). So,
Assuming we have GFA
with segments links;
S s1 AAT
S s2 T
can't we just add ;
S s1 AAT LN:i:3 SN:Z:chr1 SO:i:90374744 SR:i:0
S s2 T LN:i:1 SN:Z:chr1 SO:i:176753158 SR:i:0
Will it be valid rGFA
?
When using your minigraph for alignment, sometimes I want to extract the path sequence in the graph to observe the alignment result, it would be great if gfatools can extract the path sequence based on the path name。Thank you for the efficient tool!
Hi, sir:
I wanted to subset a gfa file from a region (eg: Chr01:1-10000) , and I thought the parameter -R might help, but got a empty file, may your help me with the right code:
The code I used:
[path to]/gfa view -R Chr01:1-10000 -r 1 my.gfa >1_10k.gfa
# I tried -r 1 , -r 10, or without -r
# I also tried -l Chr01:1-10000 with or without -r parameter
It output a empty file 1_10k.gfa
the example code was successful:
./gfatools view -l MTh4502 -r 1 test/MT.gfa >test/sub.gfa
Thanks for your time.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.