Code Monkey home page Code Monkey logo

gfatools's Introduction

Getting Started

git clone https://github.com/lh3/gfatools
cd gfatools && make
# Extract a subgraph
./gfatools view -l MTh4502 -r 1 test/MT.gfa > sub.gfa
# Convert GFA to segment FASTA
./gfatools gfa2fa test/MT.gfa > MT-seg.fa
# Convert rGFA to stable FASTA or BED
./gfatools gfa2fa -s test/MT.gfa > MT.fa
./gfatools gfa2bed -m test/MT.gfa > MT.bed

Introduction

gfatools is a set of tools for manipulating sequence graphs in the GFA or the rGFA format. It has implemented parsing, subgraph and conversion to FASTA/BED. More functionality may be added in future.

gfatools's People

Contributors

lh3 avatar zwets avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

gfatools's Issues

Is there support for translation between stable coordinate and segment/local coordinate for GAF?

Hello Heng,

I'm getting myself familiar with the GAF format.
One site of interest to me (an INS onto GRCh38), has several reads with the following column 6 (broken down for easier viewing):

>CHM13#0#chr18:67654357-67665821
<HG02145#2#h2tg000315l:763450-763480
>CHM13#0#chr18:67666067-67666203
>HG00735#1#h1tg000016l:20472026-20472231
>CHM13#0#chr18:67666258-67666722
<HG00621#2#h2tg000001l:12875529-12875687
>CHM13#0#chr18:67666724-67672585
>chr18:67453985-67471633

which translates, by hand (I did it, so there might be errors) to the segment coordinate

>s132736>s132737>s132738>s132739<s204376>s132742>s164264>s164265>s132745>s132746>s132747>s132748<s150020>s132750>s132751>chr18:67453985-67471633

I know that minigraph can output GAF in the second format via --vc.

I am wondering if there's any tool in gfatools that does this translation between the two coordinate systems? This way a sound round of mapping can be avoided.

Thanks!
Steve

paths from rGFA

Hello,
I'm working with a large pan-genome and starting to explore options of working with it in a graph context. I created an rGFA graph using minigraph (testing with just 3 samples for now). Now I'd like to derive from it a set of sample-coherent paths. Reading the discussion you had in issue #1 it sounds like at the time you were in the process of formalizing this feature in gfatools. Has there been much progress in this area? It seems like one should be able to pick a vertex and do a BFS where only edges to ranks equal or lower than the starting segment (sample) are considered. Am I interpreting the meaning of rank correctly here?
Thanks!

Conflicting types when running make

Got the following error when trying to run make:

gfa-util.c:65:12: error: conflicting types for ‘gfa_gfa2sfa’
 gfa_sfa_t *gfa_gfa2sfa(const gfa_t *g, int32_t *n_sfa_, int32_t write_seq)
            ^~~~~~~~~~~
In file included from gfa-util.c:2:0:
gfa-priv.h:61:12: note: previous declaration of ‘gfa_gfa2sfa’ was here
 gfa_seg_t *gfa_gfa2sfa(const gfa_t *g, int32_t *n_sfa_, int32_t write_seq);
            ^~~~~~~~~~~
Makefile:18: recipe for target 'gfa-util.o' failed

Saw in the source code that lots of recently changed from gfa_seg_t to gfa_sfa_t, so I changed the type in gfa-priv.h to gfa_sfa_t. Make worked after that, but I have no idea what that affects.

Can I extract path sequences from GFA or rGFA by pathname?

When using your minigraph for alignment, sometimes I want to extract the path sequence in the graph to observe the alignment result, it would be great if gfatools can extract the path sequence based on the path name。Thank you for the efficient tool!

About "--bf" the bloom filter

Hi Ekim. I have noticed that there is a "--bf" flag in the config. But I didn't find any explanation about the bloom filter in the original paper. I would like to know how it works and what would happen if I apply this flag or not. Thank you very much!

gfatools view to subset a gfa file

Hi, sir:

I wanted to subset a gfa file from a region (eg: Chr01:1-10000) , and I thought the parameter -R might help, but got a empty file, may your help me with the right code:
The code I used:

[path to]/gfa view -R Chr01:1-10000  -r 1 my.gfa >1_10k.gfa
 # I tried -r 1 , -r 10, or without -r
# I also tried -l Chr01:1-10000 with or without -r parameter

It output a empty file 1_10k.gfa

I used vg to construct the graph genome .vg, then got the gfa file.

the example code was successful:

./gfatools view -l MTh4502 -r 1 test/MT.gfa >test/sub.gfa

Thanks for your time.

blacklisting region from a region file

I would like draw the assembly graph after having removed the alternative haplotype (extracted using purge_dups) from an assembly graph gfa file. Could you add a feature to gfatools blacklist to produce a new gfa from an input gfa and a list of regions to remove?
The region could be given as a simple one column file, with one region per line, or comma separated in a command line option.

gfatools bubble doesn't work on GFA after bubble popping with gfatools asm

Hello,

I'm having trouble running gfatools bubble on a GFA file that has been processed with with gfatools asm.

I'm working with a subgraph that I extracted from minigraph output using gfatools view. If I run gfatools bubble on this subgraph file, I get the expected BED file:

$ gfatools bubble test_subgraph.gfa | head -n 1
V       19985854        19985854        2       1       0       0       0       -1      -1      -1      s54242,s54243   *       *

And everything looks as expected in bandage.

However, if I then pop bubbles in this graph using gfatools asm (command: gfatools asm -b 10000 -u test_subgraph.gfa >test_subgraph.b10000u.gfa), and then try to run gfatools bubble on the output, I get nothing:

$ gfatools bubble test_subgraph.b10000u.gfa | head -n 1

This GFA looks as expected in bandage (same structure as before, but fewer segments/bubbles), and I can generate a FASTA using gfatools gfa2fa. However, in addition to gfatools bubble not working, gfatools gfa2bed also produces nothing.

Is there anyway to get this second file to work with gfatools bubble? Is this something to do with GFA vs rGFA?

Thanks for your help!


I've attached the example files as .txt as github wouldn't accept my GFA files:

test_subgraph.b10000u.gfa.txt
test_subgraph.gfa.txt

And I'm using the following version of gfatools:

$ gfatools version
gfa.h: 0.5-r244-dirty
gfatools: 0.5-r238-dirty

Cigar strings altered after extracting subgraph

Before using the command:

gfatools view -l 303726,303738 -r 20 Assembly-BothStrands.gfa > Assembly-BothStrands_303726-303738_r20.gfa

my GFA has the following elements:

S       5924062 GTTTCAAAAAAAAAAAAAGAGCATGGCTCTGG        RC:i:400
L       5209751 +       5924062 +       18M1D6M
L       5420387 +       5924062 +       18M1D6M

After running the command, they look like this:

S	5924062	GTTTCAAAAAAAAAAAAAGAGCATGGCTCTGG	LN:i:32	RC:i:400
L	5209751	+	5924062	+	24M	L1:i:33	L2:i:8
L	5420387	+	5924062	+	24M	L1:i:33	L2:i:8

These overlap Cigars are incorrect:

Cigar: 24M

5924062    1                                   GTTTCAAAAAAAAAAAAAGAGCATGGCTCTGG
                                               |||||||||||||||||||||||||
5420387    34 ACTGCATTCCAGCCTGGGTGATAGAGTGAGGCTGTTTCAAAAAAAAAAAAAAGAGCAT

Compared to a local alignment with Smith-Waterman:

Cigar:  5M1I19M

5924062    1                                   GTTTC-AAAAAAAAAAAAAGAGCATGGCTCTGG
                                               ||||| |||||||||||||||||||
5420387    34 ACTGCATTCCAGCCTGGGTGATAGAGTGAGGCTGTTTCAAAAAAAAAAAAAAGAGCAT

conversion of GFA to rGFA

@lh3, I'm curious to know , whether we can convert GFA to rGFA? . As per rGFA format, we need to maintain 3 additional information along with each segment lines (S). So,
Assuming we have GFA with segments links;

S	s1	AAT
S	s2	T

can't we just add ;

S	s1	AAT	LN:i:3	SN:Z:chr1	SO:i:90374744	SR:i:0
S	s2	T	LN:i:1	SN:Z:chr1	SO:i:176753158	SR:i:0

Will it be valid rGFA ?

Logic of taking the subset within a given radius

Dear @lh3,
Can you please provide a comment on how the subgraph extraction with a given radius works?
I tried using this function and was getting weird results. In particular the resulting graph had some non-branching paths that to the best of my knowledge shouldn't have been there and my own implementation wasn't producing them.

Bug in bubble popping algorithm

Directly copying the description from GFA1 repo https://github.com/lh3/gfa1/issues/7, since it seems to be still there.

I believe there is a bug in bubble popping algorithm that leads to a removal of bubbles with (seemingly arbitrary long) tips of any orientation. See attached gfa (called my.txt to trick github attach dialogue).
my.txt

Despite looking like this in Bandage it is being simplified into a single unambiguous path by the -b option of gfaview.

image

How to convert fastG to GFA?

Hello - is there a way gfatools can be used to convert fastg file to gfa? One of your older programs has a misc folder with a script called fastg2gfa.c, but I don't know how to run this script and that program's github page directs users to gfatools. Thus, I was curious if gfatools can help me accomplish my goal of creating a gfa file from fastg.

Please let me know. Thank you!

Format clarification

Dear Heng,

I just have a simple questions regarding the gfa2bed format (when run with -s)

Test-2  0       227208972       *       *       *       *       *       *
Test-1       769027  769267  <       Test-2  7108396 <       Test-2  7108066

The example tells me that Test-1 (from 769027 769267) is inserted/replacing Test-2 at 7108396 to 7108066.

Am I getting this right?

Thanks!

Defining Walk/W-lines in GFA1

Walk/W-lines are used to keep sequence walks through the graph. There are a few options:

  1. Don't add W-lines as we already have Path/P-lines in GFA1. However, there are multiple problems with P-lines. a) the most important one is that it doesn't allow tags as the last field is optional. We need to attach information (see below) in addition to path names. b) naming is not appropriate. Strictly speaking, a path disallows cycles. c) P-lines use comma as a field separator. Comma is a common symbol in reference names.

  2. Define W-lines as:

    <W-line> <- 'W' <ctgId> <oriSeg>+ <tag>*
    <oriSeg> <- ( '>' | '<' ) <segId>
    

    We put all the other information in tags. Note that without a list of CIGARs, W-lines are unable to distinguish multiple edges between the same pair of segments. We may define an optional tag to keep the list of CIGARs for general cases.

  3. Define W-lines as:

    <W-line> <- 'W' <sampleId> <ctgId> <oriSeg>+ <tag>*
    
  4. Define W-lines as:

    <W-line> <- 'W' <sampleId> <ctgId> <ctgStart> <ctgEnd> <oriSeg>+ <tag>*
    

    With <ctgStart> and <ctgEnd>, a walk doesn't have to encode an entire contig.

The differences between 2/3/4 are about what fields should be made mandatory. My mild preference may be 4 (I am not even sure); 2/3 are also good to me. I strongly think P-lines are inadequate for a reference model. With an optional tag to keep a list of CIGARs, W-lines are more general. Internally, a GFA parser can parse both W-lines and P-lines into the same data structure.

CC @ekg and @benedictpaten.

gfa view subgraph radius definition

Dear Heng Li,
I recently tried the gfa view command with options -l <SEGMENT_NAME> -r <RADIUS>.

However, the results are not the one which I would expect. Since the tools do not have a very clear documentation yet, I am not sure if I maybe misunderstood completely the commands - or if there is some kind of bug instead.

E.g. when I use the test file included in the repository (MT.gfa). This is the file displayed using my gfaviz tool. I gave a different color to each of what I would expect to be the definition of different radius.
colored-segments
I would expect to extract, when using -l MTh0 -r <radius> the following:

  • r=0 --> red segment only
  • r=1 --> red segment & blue segments and links between them
  • r=2 --> ...& yellow
  • r=3 --> ...& magenta
  • r>=4 --> whole graph.

What I actually get is:

  • r=0 => as expected
  • r=1 => as expected
  • r=2 => as expected
  • r=3 => as expected, BUT:
  • r=4 => same result as r=2, which I do not understand...
  • r=5 => same result as r=3 (unexpected)

Invalid GAF is produced

It seems that running gfatools ed -c produces invalid GAF as only 10 entries are present instead of 12 mandatory per spec

E.g.:

NA21309#1       119129  0       119129  +       >s60779>s60780>s60781>s60782>s60783<s227791>s60785>s60786       119130  0       119126  82

Subgraph option question

I am having an issue with gfatools view -l.
While running on
my.txt
(which is actually a gfa) with the list of nodes
list.txt I am getting an empty answer.
Is the format of list.txt wrong?

GFA2 format from miniasm not properly converting to fasta

Hi Heng Li,
Huge fan here, apologies but I have been unable to get what I deem a perfectly fine gfa file from miniasm to convert to fasta using gfatools

The top few lines are as follows

S utg000001l * LN:i:127742
a utg000001l 0 m64247_210928_214804/49612954/ccs:6-11010 + 3901
a utg000001l 3901 m64247_210928_214804/99615607/ccs:3-9113 + 2765
a utg000001l 6666 m64247_210928_214804/157876436/ccs:1-6871 + 50

From what I can tell the tool gfatools would need the original reads file with those read headers convert it to an actual sequence fasta file right?

My commands are as below:

AVA alignment

$MINIMAP2 -x ava-pb -t$(($Threads-1)) $READS $READS | gzip -1 > ava.reads.paf.gz

Miniasm run with -c 2

$MINIASM -c 2 ava.reads.paf.gz > output.gfa ;

GFAtools conversion

$GFATOOLS gfa2fa -s output.gfa

[M::main] Version: 0.4-r214-dirty
[M::main] CMD: /home/BIOTECH/dmsanders/progs/gfatools/gfatools gfa2fa -s output.gfa
[M::main] Real time: 0.000 sec; CPU: 0.001 sec

This occurs but there is no output...
without -s it segfaults...

And the traditional command:
awk '/^S/{print ">"$2"\n"$3}' output.gfa | fold > out.fa
gives me:

utg000001l

utg000002l

ideas?

Standalone libraries to work with GFA

Another discussion thread. It is probably too early to implement libraries now, but it would be good to start thinking about the topic.

Currently, gfatools comes with very preliminary APIs to read rGFA into memory. The memory layout is described in gfa.h. It largely follows the model of string graphs. I quite like model and will stick with it. However, I guess general devs will feel uncomfortable with this representation. I won't have the bandwidth to implement the more general path model any time soon, either. In addition, it is also preferable to have two independent implementations (e.g. samtools vs picard vs bamtools). I wonder if you (@ekg and @benedictpaten) are interested in implementing a standalone library to work with GFA. You already have in vg a GFA parser, an in-memory model and a serialization format. You can isolate the relevant code and expose stable C and C++ APIs to other devs. I know vg has APIs, but I guess other devs will prefer a more focused lightweight library that is easier to build.

Fields in bubble output

Hello,
I'm testing a few SV detection methods on a small rGFA graph I produced with 5 individuals using minigraph. It seems like the 'bubble' tool would be useful for detecting small to medium sized (<100kb) structural variants... at least simple ones. I was able to discern from the code the different fields in the output, except for b->cf_min, b->cf_max, b->cf_ref. In my test case these fields are always -1. If these are meaningful fields, could you please explain what they mean?

Thank you

Using `gfatools asm`

Hi
Thank you for your work on gfatools,
I have an assembly graph that I want to process with gfatools asm (to simplify the graph, like popping the bubbles, etc) and output the scaffolds.
The graph is based on draft assembly contigs and some connections I inferred based on long reads, either edges or gaps based on if the estimated distance/gap size is negative or positive.
I wanted to get some information/advice on what I should provide gfatools asm with to get the best out of it. Like in my analysis, each connection (edge or gap) is based on supporting long reads so I have weights (number of supports) for the connections that may be useful, and I format that in the tag currently (FC:i: / see the example below) but not sure if gfatools will consider that?
I also have the gap size estimates that I wanted to output on G-lines for both gaps and edges (as I don't have alignments but only gap estimates for overlapping contigs too), but I found out negative distances, for overlapping contigs, are not allowed on G-lines, so I will have to somehow format that as an E-line? if so, then are the start/end positions important for the analysis or I can put some fake values?
Toy example:

H	VN:Z:2.0
graph [scaf_num=None]
S	1	49057	*
S	2	33803	*
S	3	22222	*
G	*	2-	1-	3340	*	FC:i:20
# 20 reads support the above gap and the gap size is 334
G	*	1+	3-	4000	*	FC:i:6
# 6 reads support the above gap and the gap size is 400
G	*	1-	2-	-300	        *	FC:i:15
# 15 reads support this connection and the contigs overlap by 300 bp, but this seems like an invalid G-line and should probably be converted to an E-line? 

And lastly: is there any additional information that gfatools can benefit from? I can potentially prepare and provide those too.

The question about format of fasta using gfatoos gfa2fa

gfatools gfa2fa -s ./test/MT.gfa > MT.fa

cat MT.fa | grep "^>" | less

MT_human
MT_orang_3426_3927 <MT_human:4502 <MT_human:4001
MT_orang_8961_9463 >MT_human:9505 >MT_human:9505

I don't understand >MT_orang_3426_3927 <MT_human:4502 <MT_human:4001

MT_orang_8961_9463 >MT_human:9505 >MT_human:9505 Anybody can tell me about this?Thank you very much!

"A" line produced by gfatools asm

What is the format of the "A" line produced by gfatools asm?

S       utg0000001l     *       LN:i:25977      RC:i:371        lc:i:25977
A       utg0000001l     0       +       82117   0       401
A       utg0000001l     16      -       82119   0       488
A       utg0000001l     172     +       82121   0       395
A       utg0000001l     200     +       82123   0       390
A       utg0000001l     203     -       82125   0       394
A       utg0000001l     221     -       82127   0       389
A       utg0000001l     304     -       82129   0       317
A       utg0000001l     312     -       82131   0       443
A       utg0000001l     315     +       82133   0       464
...

rGFA vs vg's path model

I will start with a simple case: compact de Bruijng graph (cDBG). We can construct a cDBG for GRCh38 and keep chromosome paths (more exactly walks). This is a lossless representation of GRCh38, but will we prefer this over the linear genome? No. This cDBG discards the intuition of linearity. A unitig may corresponds to multiple regions in GRCh38, which unnecessarily complicates analyses. cDBG is not a good reference model.

Path model can encode such a cDBG. It semantically allows graphs overcomplicated for for reference. This over-flexibility is a problem. Another issue with the path model is that it doesn't have a concept of sample. Having two similar paths from two samples indicate orthology; having two paths from one sample indicate paralogy/segdup. Without the concept of sample, we can't distinguish the two cases.

Starting with all-pair alignment, seqwish will build a graph akin to A-bruijn, which is affected by the same issues with cDBG. From a practical point of view, a much bigger problem is that it assumes all sequences should be kept in the graph. However, assemblies and alignments have many errors and these errors will contaminate the reference. On the other hand, a reference may be incomplete, but the content in the reference shall be accurate. I have closely worked with GRC and firmly share their view that accuracy is critically important to a reference.

rGFA imposes a strong constraint on a reference model: a linear reference sequence doesn't collapse onto a segment. This constraint keeps the intuition of linearity and simplifies graph operations. It is exactly what we want.

rGFA can be extended to a constrained path model with the addition of Walk lines. It explicitly exposes the concept of sample in this case. However, such an extended model will be more complicated. We don't have algorithms to build such a graph. Building an extended rGFA to base accuracy again assumes contigs are perfectly accurate, but we won't get there in a few years.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.