Code Monkey home page Code Monkey logo

vg's Introduction

vg

Join the chat at https://gitter.im/vgteam/vg Latest Release Doxygen API Documentation

variation graph data structures, interchange formats, alignment, genotyping, and variant calling methods

Variation graph

Variation graphs provide a succinct encoding of the sequences of many genomes. A variation graph (in particular as implemented in vg) is composed of:

  • nodes, which are labeled by sequences and ids
  • edges, which connect two nodes via either of their respective ends
  • paths, describe genomes, sequence alignments, and annotations (such as gene models and transcripts) as walks through nodes connected by edges

This model is similar to sequence graphs that have been used in assembly and multiple sequence alignment.

Paths provide coordinate systems relative to genomes encoded in the graph, allowing stable mappings to be produced even if the structure of the graph is changed. The variation graph model makes this embedding explicit and essential. Tools in vg maintain paths as immutable during transformations of the graph. They use paths to project graph-relative data into reference-relative coordinate spaces. Paths provide stable coordinates for graphs built in different ways from the same input sequences.

example variation graph

Citing VG

Please cite:

Support

We maintain a support forum on biostars: https://www.biostars.org/tag/vg/

Installation

Download Releases

The easiest way to get vg is to download one of our release builds for Linux. We have a 6-week release cadence, so our builds are never too far out of date.

Download Button
Download the latest vg release for Linux

For MacOS, see Building on MacOS.

Building on Linux

If you don't want to or can't use a pre-built release of vg, or if you want to become a vg developer, you can build it from source instead.

First, obtain the repo and its submodules:

git clone --recursive https://github.com/vgteam/vg.git
cd vg

Then, install VG's dependencies. You'll need the protobuf and jansson development libraries installed, and to run the tests you will need: * jq, bc, rs, and parallel * hexdump and column from bsdmainutils * npm for testing documentation examples). On Ubuntu, you should be able to do:

make get-deps

On other distros, you will need to perform the equivalent of:

sudo apt-get install build-essential git cmake pkg-config libncurses-dev libbz2-dev  \
                     protobuf-compiler libprotoc-dev libprotobuf-dev libjansson-dev \
                     automake gettext autopoint libtool jq bsdmainutils bc rs parallel \
                     npm curl unzip redland-utils librdf-dev bison flex gawk lzma-dev \
                     liblzma-dev liblz4-dev libffi-dev libcairo-dev libboost-all-dev \
                     libzstd-devel pybind11-dev python3-pybind11

Note that Ubuntu 16.04 does not ship a sufficiently new Protobuf; vg requires Protobuf 3 which will have to be manually installed.

At present, you will need GCC version 4.9 or greater, with support for C++14, to compile vg. (Check your version with gcc --version.) GCC up to 11.2.0 is supported.

Other libraries may be required. Please report any build difficulties.

Note that a 64-bit OS is required. Ubuntu 20.04 should work.

When you are ready, build with . ./source_me.sh && make, and run with ./bin/vg.

Note that vg can take anywhere from 10 minutes to more than an hour to compile depending on your machine and the number of threads used.

You can also produce a static binary with make static, assuming you have static versions of all the dependencies installed on your system.

Building on MacOS

Clone VG

The first step is to clone the vg repository:

git clone --recursive https://github.com/vgteam/vg.git
cd vg

Install Dependencies

VG depends on a number of packages being installed on the system where it is being built. Dependencies can be installed using either MacPorts or Homebrew.

Using MacPorts

You can use MacPorts to install VG's dependencies:

sudo port install libtool protobuf3-cpp jansson jq cmake pkgconfig autoconf automake libtool coreutils samtools redland bison gperftools md5sha1sum rasqal gmake autogen cairo libomp boost zstd pybind11
Using Homebrew

Homebrew provides another package management solution for OSX, and may be preferable to some users over MacPorts. VG ships a Brewfile describing its Homebrew dependencies, so from the root vg directory, you can install dependencies, and expose them to vg, like this:

# Install all the dependencies in the Brewfile
brew bundle

Build

With dependencies installed, VG can now be built:

. ./source_me.sh && make

Note that static binaries cannot yet be built for Mac.

Our team has successfully built vg on Mac with GCC versions 4.9, 5.3, 6, 7, and 7.3, as well as Clang 9.0.

Migrating to ARM Macs

The Mac platform is moving to ARM, with Apple's M1, M1 Pro, M1 Max, and subsequent chip designs. The vg codebase supports ARM on Mac as well as on Linux. The normal installation instructions work on a factory-fresh ARM Mac.

However, it is easy to run into problems when migrating a working vg build environment or migrating Macports or Homebrew from x86_64 to ARM. The ARM machine can successfully run x86_64 tools installed via Macports or Homebrew on the old machine, but vg can only build properly on ARM if you are using ARM versions of the build tools, like make and CMake.

So, after migrating to an ARM Mac using e.g. Apple's migration tools:

  1. Uninstall Macports and its packages, if they were migrated from the old machine. Only an ARM Macports install can be used to provide dependencies for vg on ARM.
  2. Uninstall Homebrew and its packages, if they were migrated. Similarly, only an ARM Homebrew install will work.
  3. Reinstall one of Macports or Homebrew. Make sure to use the M1 or ARM version.
  4. Use the package manager you installed to install system dependencies of vg, such as CMake, as documented above.
  5. Clean vg with make clean. This should remove all build artefacts.
  6. Build vg again with make.

If you still experience build problems after this, delete the whole checkout and check out the code again; make clean is not under CI test and is not always up to date with the rest of the build system.

Whether or not that helps, please then open an issue so we can help fix the build or fix make clean.

Usage

Variation graph construction

From VCF

Note See the vg autoindex examples below for how to use that tool in place of vg construct to build and index graphs in a single step.

One way to build a graph with vg is to construct it from variant calls using a reference FASTA file and VCF file. If you're working in vg's test/ directory:

vg construct -r small/x.fa -v small/x.vcf.gz >x.vg

Note that to build a graph, an index of the VCF file is required. The VCF index file can be generated using the tabix command provided by SAMtools (e.g. tabix -p vcf x.vcf.gz on the command line).

From Assemblies

You can also build a graph (and indexes for mapping with vg) from a set of genome assemblies (FASTA), as opposed to variant calls as described above, using Minigraph-Cactus.

Importing and exporting different graph formats

vg supports many formats, the three most important are:

  • PackedGraph (.vg) : This is vg's native format. It supports edits of all kinds (to topology and paths), but can be inefficient at large scales, especially with many paths.
  • GFA (.gfa) : GFA is standard text-based format and usually the best way to exchange graphs between vg and other pangenome tools. vg can also operate on (uncompressed) GFA files directly, by way of using a PackedGraph representation in memory (and therefore shares that format's scaling concerns and edit-ability).
  • GBZ (.gbz) : GBZ is a highly-compressed format that uses much less space to store paths than the above formats, but at the cost of not allowing general edits to the graph.

You can query the format of any graph using vg stats -F.

Importing

In general, you will build and index vg graphs using vg autoindex (from GFA or VCF) or Minigraph-Cactus (FASTAs). You can also import GFA files from other tools such as ODGI and PGGB using vg convert -g.

Exporting

You can convert any graph to GFA using vg convert -f. By default, vg uses GFA v1.1 where paths are represented as W-lines. To use P-lines instead (GFA v1.0), use vg convert -fW.

Path Types

The GBZ format makes the distinction between REFERENCE and HAPLOTYPE paths. REFERENCE paths can be used as coordinate systems but are more expensive to store. HAPLOTYPE paths are highly compressed but cannot be used for position lookups. In the HPRC graphs for example, contigs from GRCh38 and CHM13(T2T) are REFERENCE paths and all other samples HAPLOTYPE paths.

The distinction between REFERENCE and HAPLOTYPE paths is carried over into the other formats such as .vg and .gfa to facilitate conversion and inter-operation. In .gfa, REFERENCE paths are P-Lines, or W-lines whose sample names are flagged in the header. W-lines whose names are not flagged in the header are HAPLOTYPE paths. In .vg they are denoted using a naming convention.

See the Path Metadata WIKI for more details.

Warning GBZ is the only format that supports efficient loading large numbers of HAPLOTYPE paths in vg. You may run into issues trying to load whole-genome graphs with thousands of HAPLOTYPE from .vg or .gfa files. vg convert -H can be used to drop HAPLOTYPE paths, allowing the graph to be more easily loaded in other formats.

Viewing

Note It is best to use the newer vg convert tool (described above) for GFA conversion

vg view provides a way to convert the graph into various formats:

# GFA output
vg view x.vg >x.gfa

# dot output suitable for graphviz
vg view -d x.vg >x.dot

# And if you have a GAM file
cp small/x-s1337-n1.gam x.gam

# json version of binary alignments
vg view -a x.gam >x.json

Mapping

If you have more than one sequence, or you are working on a large graph, you will want to map rather than merely aligning.

There are multiple read mappers in vg:

  • vg giraffe is designed to be fast for highly accurate short reads, against graphs with haplotype information.
  • vg map is a general-purpose read mapper.
  • vg mpmap does "multi-path" mapping, to allow describing local alignment uncertainty. This is useful for transcriptomics.

Mapping with vg giraffe

To use vg giraffe to map reads, you will first need to prepare indexes. This is best done using vg autoindex. In order to get vg autoindex to use haplotype information from a VCF file, you can give it the VCF and the associated linear reference directly.

# construct the graph and indexes (paths below assume running from `vg/test` directory)
vg autoindex --workflow giraffe -r small/x.fa -v small/x.vcf.gz -p x

# simulate a bunch of 150bp reads from the graph, into a GAM file of reads aligned to a graph
vg sim -n 1000 -l 150 -x x.giraffe.gbz -a > x.sim.gam
# now re-map these reads against the graph, and get BAM output in linear space
# FASTQ input uses -f instead of -G.
vg giraffe -Z x.giraffe.gbz -G x.sim.gam -o BAM > aln.bam

More information on using vg girafe can be found on the vg wiki.

Mapping with vg map

If your graph is large, you want to use vg index to store the graph and vg map to align reads. vg map implements a kmer based seed and extend alignment model that is similar to that used in aligners like novoalign or MOSAIK. First an on-disk index is built with vg index which includes the graph itself and kmers of a particular size. When mapping, any kmer size shorter than that used in the index can be employed, and by default the mapper will decrease the kmer size to increase sensitivity when alignment at a particular k fails.

# construct the graph (paths below assume running from `vg/test` directory)
vg construct -r small/x.fa -v small/x.vcf.gz > x.vg

# store the graph in the xg/gcsa index pair
vg index -x x.xg -g x.gcsa -k 16 x.vg

# align a read to the indexed version of the graph
# note that the graph file is not opened, but x.vg.index is assumed
vg map -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG -x x.xg -g x.gcsa > read.gam

# simulate a bunch of 150bp reads from the graph, one per line
vg sim -n 1000 -l 150 -x x.xg > x.sim.txt
# now map these reads against the graph to get a GAM
vg map -T x.sim.txt -x x.xg -g x.gcsa > aln.gam

# surject the alignments back into the reference space of sequence "x", yielding a BAM file
vg surject -x x.xg -b aln.gam > aln.bam

# or alternatively, surject them to BAM in the call to map
vg sim -n 1000 -l 150 -x x.xg > x.sim.txt
vg map -T x.sim.txt -x x.xg -g x.gcsa --surject-to bam > aln.bam

Augmentation

Variation from alignments can be embedded back into the graph. This process is called augmentation and can be used for de novo variant calling, for example (see below).

Warning Using vg augment for variant calling remains very experimental. It is not at all recommended for structural variant calling, and even for small variants, you will often get much more accurate results (at least on human) by projecting your alignment to BAM and running a linear variant caller such as DeepVariant.

# augment the graph with all variation from the GAM except that implied by soft clips, saving to aug.vg.  aug.gam contains the same reads as aln.gam but mapped to aug.vg
vg augment x.vg aln.gam -A aug.gam > aug.vg

# augment the graph with all variation from the GAM, saving each mapping as a path in the graph.
# softclips of alignment paths are preserved (`-S`).
# Note, this can be much less efficient than the above example if there are many alignments in the GAM
vg augment x.vg aln.gam -i -S > aug_with_paths.vg

Variant Calling

Note More information can be found in the WIKI.

Calling variants using read support

The following examples show how to generate a VCF with vg using read support. They depend on output from the Mapping and Augmentation examples above. Small variants and SVs can be called using the same approach. Currently, it is more accuracte for SVs.

Call only variants that are present in the graph:

# Compute the read support from the gam
# -Q 5: ignore mapping and base qualitiy < 5
vg pack -x x.xg -g aln.gam -Q 5  -o aln.pack

# Generate a VCF from the support.  
vg call x.xg -k aln.pack > graph_calls.vcf

By default, vg call omits 0/0 variants and tries to normalize alleles to make the VCF more compact. Both these steps can make it difficult to compare the outputs from different samples as the VCFs will have different coordinates even though they were created using the same graph. The -a option addresses this by calling every snarl using the same coordinates and including reference calls. Outputs for different samples can be combined with bcftools merge -m all.

vg call x.xg -k aln.pack -a > snarl_genotypes.vcf

In order to also consider novel variants from the reads, use the augmented graph and gam (as created in the "Augmentation" example using vg augment -A):

Warning Using vg augment for variant calling remains very experimental. It is not at all recommended for structural variant calling, and even for small variants, you will often get much more accurate results (at least on human) by projecting your alignment to BAM and running a linear variant caller such as DeepVariant.

# Index our augmented graph
vg index aug.vg -x aug.xg

# Compute the read support from the augmented gam (ignoring qualitiy < 5, and 1st and last 5bp of each read)
vg pack -x aug.xg -g aug.gam -Q 5 -s 5 -o aln_aug.pack

# Generate a VCF from the support
vg call aug.xg -k aln_aug.pack > calls.vcf

A similar process can by used to genotype known variants from a VCF. To do this, the graph must be constructed from the VCF with vg construct -a (graphs from other sources such as vg autoindex and Minigraph-Cactus cannot be used):

# Re-construct the same graph as before but with `-a`
vg construct -r small/x.fa -v small/x.vcf.gz -a > xa.vg

# Index the graph with `-L' to preserve alt paths in the xg
vg index xa.vg -x xa.xg -L

# Compute the support (we could also reuse aln.pack from above)
vg pack -x xa.xg -g aln.gam -o aln.pack

# Genotype the VCF (use -v)
vg call xa.xg -k aln.pack -v small/x.vcf.gz > genotypes.vcf

Pre-filtering the GAM before computing support can improve precision of SNP calling:

# filter secondary and ambiguous read mappings out of the gam
vg filter aln.gam -r 0.90 -fu -m 1 -q 15 -D 999 -x x.xg > aln.filtered.gam

# then compute the support from aln.filtered.gam instead of aln.gam in above etc.
vg pack -x xa.xg -g aln.filtered.gam -o aln.pack
vg call xa.xg -k aln.pack -v small/x.vcf.gz > genotypes.vcf

For larger graphs, it is recommended to compute snarls separately:

vg snarls x.xg > x.snarls

# load snarls from a file instead of computing on the fly
vg call x.xg -k aln.pack -r x.snarls > calls.vcf

Note: vg augment, vg pack, vg call and vg snarls can now all be run on directly on any graph format (ex '.gbz', '.gfa', .vg, .xg (except augment) or anything output by vg convert). Operating on .vg or '.gfa' uses the most memory and is not recommended for large graphs. The output of vg pack can only be read in conjunction with the same graph used to create it, so vg pack x.vg -g aln.gam -o x.pack then vg call x.xg -k x.pack will not work.

Calling variants from paths in the graph

Infer variants from from alignments implied by paths in the graph. This can be used, for example, to call SVs directly from a variation graph that was constructed from a multiple alignment of different assemblies:

# create a graph from a multiple alignment of HLA haplotypes (from vg/test directory)
vg msga -f GRCh38_alts/FASTA/HLA/V-352962.fa -t 1 -k 16 | vg mod -U 10 - | vg mod -c - > hla.vg

# index it
vg index hla.vg -x hla.xg

# generate a VCF using gi|568815592:29791752-29792749 as the reference contig.  The other paths will be considered as haploid samples
vg deconstruct hla.xg -e -p "gi|568815592:29791752-29792749" > hla_variants.vcf

Variants can also be inferred strictly from topology by not using -e, though unlike the above example, cycles are not supported. "Deconstruct" the VCF variants that were used to construct the graph. The output will be similar but identical to small/x.vcf.gz as vg construct can add edges between adjacent alts and/or do some normalization:

# using the same graph from the `map` example
vg deconstruct x.xg -p x > x.vcf

Haplotype paths from .gbz or .gbwt indexes input can be considered using -z and `-g', respectively.

As with vg call, it is best to compute snarls separately and pass them in with -r when working with large graphs.

Transcriptomic analysis

vg has a number of tools to support transcriptomic analyses with spliced graphs (i.e. graphs that have annotated splice junctions added as edges into the graph). These edges can be added into an existing graph using vg rna. We can then perform splice-aware mapping to these graphs using vg mpmap. vg developers have also made a tool for haplotype-aware transcript quantification based on these tools in rpvg. The easiest way to start this pipeline is to use the vg autoindex subcommand to make indexes for vg mpmap. vg autoindex creates indexes for mapping from common interchange formats like FASTA, VCF, and GTF.

More information is available in the wiki page on transcriptomics.

Working from the test/ directory the following example shows how to create a spliced pangenome graph and indexes using vg autoindex with 4 threads:

# Create spliced pangenome graph and indexes for vg mpmap
vg autoindex --workflow mpmap -t 4 --prefix vg_rna --ref-fasta small/x.fa --vcf small/x.vcf.gz --tx-gff small/x.gtf

RNA-seq reads can be mapped to the spliced pangenome graph using vg mpmap with 4 threads:

# Map simulated RNA-seq reads using vg mpmap
vg mpmap -n rna -t 4 -x vg_rna.spliced.xg -g vg_rna.spliced.gcsa -d vg_rna.spliced.dist -f small/x_rna_1.fq -f small/x_rna_2.fq > mpmap.gamp

This will produce alignments in the multipath format. For more information on the multipath alignment format and vg mpmap see wiki page on mpmap. Running the two commands on the small example data using 4 threads should on most machines take less than a minute.

Alignment

If you have a small graph, you can align a sequence to the whole graph, using a full-length partial order alignment:

vg align -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG x.vg

Note that you don't have to store the graph on disk at all, you can simply pipe it into the local aligner:

vg construct -r small/x.fa -v small/x.vcf.gz | vg align -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG -

Most commands allow the streaming of graphs into and out of vg.

Command line interface

A variety of commands are available:

  • autoindex: construct graphs and indexes for other tools from common interchange file formats
  • construct: graph construction
  • index: index features of the graph in a disk-backed key/value store
  • map: mapp reads to a graph
  • giraffe: fast, haplotype-based mapping of reads to a graph
  • mpmap: short read mapping and multipath alignment (optionally spliced)
  • surject: project graph alignments onto a linear reference
  • augment: adds variation from aligned reads into the graph
  • call: call variants from an augmented graph
  • rna: construct splicing graphs and pantranscriptomes
  • convert: convert graph and alignment formats
  • combine: combine graphs
  • chunk: extract or break into subgraphs
  • ids: node ID manipulation
  • sim: simulate reads by walking paths in the graph
  • prune: prune graphs to restrict their path complexity
  • snarls: find bubble-like motifs in a graph
  • mod: various graph transformations
  • filter: filter reads out of an alignment
  • deconstruct: create a VCF from variation in the graph
  • paths: traverse paths in the graph
  • stats: metrics describing graph properties

Implementation notes

vg is a collection of tools based on a common data model (the variation graph) that is described by a protobuf schema (vg.proto). Data objects defined in vg.proto may be serialized via a stream pattern defined in stream.hpp. It is not necessary to write code in vg in order to interface with the algorithms defined here. Rather, it is sometimes simpler to write an external algorithm that reads and writes the same data formats.

License

MIT

vg's People

Contributors

6br avatar abeconnelly avatar adamnovak avatar alexandersavelyev avatar alexjironkin avatar apregier avatar buske avatar code-s-witch avatar edawson avatar ekg avatar emikoifish avatar glennhickey avatar jeizenga avatar jervenbolleman avatar jltsiren avatar jmonlong avatar jonassibbesen avatar kevyin avatar ktym avatar ldenti avatar mlin avatar mr-c avatar mzueva avatar ocxtal avatar richarddurbin avatar stephenhwang avatar tanessav avatar xchang1 avatar yhoogstrate avatar yoheirosen avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

vg's Issues

Assertion `table' failed in sparsehash

I just got this very weird assertion error after completing whole-genome kmer indexing:

vg.hgi: sparsehash/build/include/sparsehash/internal/densehashtable.h:782: void google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::clear_to_size(google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type) [with Value = std::pair<const std::pair<long int, long int>, vg::Edge*>; Key = std::pair<long int, long int>; HashFcn = std::hash<std::pair<long int, long int> >; ExtractKey = google::dense_hash_map<std::pair<long int, long int>, vg::Edge*, std::hash<std::pair<long int, long int> >, std::equal_to<std::pair<long int, long int> >, google::libc_allocator_with_realloc<std::pair<const std::pair<long int, long int>, vg::Edge*> > >::SelectKey; SetKey = google::dense_hash_map<std::pair<long int, long int>, vg::Edge*, std::hash<std::pair<long int, long int> >, std::equal_to<std::pair<long int, long int> >, google::libc_allocator_with_realloc<std::pair<const std::pair<long int, long int>, vg::Edge*> > >::SetKey; EqualKey = std::equal_to<std::pair<long int, long int> >; Alloc = google::libc_allocator_with_realloc<std::pair<const std::pair<long int, long int>, vg::Edge*> >; google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type = long unsigned int]: Assertion `table' failed.

I'm trying to reproduce it on a smaller set. It'd be nice to know what's going on.

Defining personal genomes on a variation graph

Hi,
For some analyses, like Allelic Specific Expression, one wants to have a personal version of the reference. That is the reference modified to represent the known sequence of a individual. It seems this would be trivial with variation graphs if one has a set of variants for which a individual is hom/het. Question: is it possible to align the reads against a subset of the paths of the graph? like define a subset of the graph removing paths for alleles not present on a personal genome (it is hom at the site) and keeping only paths for het sites?
Thanks in advance

unitig mapping: Assertion `p1mp->position().node_id() == p2mp->position().node_id()' failed

I'm running vg map on some paired-end reads now, appears to be going smoothly.

I'd previously tried it on some @lh3 fermikit unitigs and hit the following:

$ time vg/vg map -f mlin_unitigs.mag.gz -d wg.index.k27e11 -t 32 -FX 1.9 >aln.gam
vg: path.cpp:510: vg::Path vg::merge_paths(const vg::Path&, const vg::Path&, int&, int&): Assertion `p1mp->position().node_id() == p2mp->position().node_id()' failed.
Aborted

real    0m12.258s
user    0m11.355s
sys     0m2.657s

Here are the first couple unitigs from mlin_unitigs.mag.gz:

@374566:656953435       108     .       1079017880,56;1596667180,64;2213272247,65;2213272249,65;
AAAAAAAGAAAGAAAAAGAGAGAGAGAAAATAAAAGAAAATTAATATCATTGGCTGTTTTTAAGTTCATCTTTCCCTCCTCTGTCATCTCACAGGTATTAGTAAGAACCGCTGTTACACTGCGTGCCACACTGAATTTCAACTATCCCTCTATCTGCTTTGTCTTCTCTCCCAGCCAGTAAGCTACTAAATGATTTTGGATGAATAAATAAACATCTAGGAATGGGAAAGAGAGCAAAATTGAACAAATAGTAATGAATTAGAGTAATCCTTTAAAAGGTGGAAATTATTGGAACAGATATGCAGTTTAAATAAGTTGCAGACTAAGATAGCAGCATAAAACATACAGGAATATGGCCGGGCGCGGTGGCTCAAGCCTGTAATCCCAGCACTTTGG
+
'''''((()*****++++++++,,,-..///////012234556667777888899:::;;;<<<<<<=====>>>>??@@ABBCCCDDDEEDEEEEEEFFFFFFEEEDCCCCDCDDEEEEEDEEDCCBBBBCCDCBAA@?>>===<<<<;;;<;;;;;::;::;<<<;<<<<;;;<;;:::9988887766667776665677888889::::::;::;<<==>>???@@A@@?@ABBBCDEEEEEEEEFGGGHHHIIJJIIIHGGHHGGGGGGGFGGGFFFGGGGFFFFFFFFGFGGFFFEEDEEEFEDEEDEDCDDCBCBA@@???>>>>>>=<;;;:988888888766544322111111110000000000////..----------,+*
@2445285:1581873033     5       1497293684,98;1884040499,93;    189839833,99;434191570,89;1562090913,91;1633644090,82;
AAAAAAAAAAAGTCATGGGAGAGGATGGTAAAGCTAAGTATCTTTTGCACCTACTCCCCAGCCCCACCACTGCAGAAGCTGAAGGGGTTCCTAGAGGCTTCTTCTGCC
+
""###$%%&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&%%$$$#""

round trip from .vg to db and back

It should be possible to dump a .vg format file from a vg database, and also to read it back in. Could this be done quickly using something like the node range query in vg find?

Steps to compile vg locally without needing sudo access

Hi Erik,

I wrote up the following steps, in case anyone would like to install vg without needing sudo access. Feel free to change/integrate them any way you prefer :) Below are the steps:

  1. Download and install jansson in your home directory, and please replace YOUR_USERNAME with your username on the system:
git clone https://github.com/akheron/jansson
autoreconf -i
./configure --prefix=/home/YOUR_USERNAME/apps/jansson
make
make install
cd ..
  1. Download vg and enter its directory:
git clone --recursive https://github.com/ekg/vg.git
cd vg
  1. Update INCLUDES and LDFLAGS in Makefile:

Add to INCLUDES the following, with YOUR_USERNAME being set appropriately:

-I/home/YOUR_USERNAME/apps/jansson/include

Add to LDFLAGS the following, with YOUR_USERNAME being set appropriately:

-L/home/YOUR_USERNAME/apps/jansson/lib

They should look something like this, with YOUR_USERNAME being set appropriately:

INCLUDES=-I./ -I/home/YOUR_USERNAME/apps/jansson/include -Icpp -I$(VCFLIB)/src -I$(VCFLIB) -Ifastahack -Igssw/src -Iprotobuf/build/include -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache -Ihtslib -Isha1
LDFLAGS=-L./ -L/home/YOUR_USERNAME/apps/jansson/lib -Lvcflib -Lgssw/src -Lprotobuf -Lsnappy -Lrocksdb -Lprogressbar -Lhtslib -lvcflib -lgssw -lprotobuf -lhts -lpthread -ljansson -lncurses -lrocksdb -lsnappy -lz -lbz2
  1. Next compile vg:
make
  1. For the final step, just update your LD_LIBRARY_PATH with jansson, and add it to you .bashrc file to keep it persistent between sessions. Just as before, YOUR_USERNAME would need to be replaced with your specific username on the system:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/YOUR_USERNAME/apps/jansson/lib

Hope it helps others as well,
Paul

use many alignments to modify a graph

Due to the destructive nature of the changes on the internal representation of the graph, it's not trivial to take many graph alignments and include them in the graph. However, it should be possible to do with a little care, and in particular is not hard if we can handle all the mappings to a single node at once. This means sorting the changes we want to add up-front, modifying the graph in one step for each node.

vg surject: unhygienic output for unmapped reads

Unmapped reads (RNAME=*) in the BAM output from vg surject seem to have an arbitrary value filled in for POS and perhaps other fields as well. These could be dirty values left over in some data structure from a previous use.

$ samtools view aln.surn.bam
C2KC2ACXX_1:6:1101:3304:0/1     6       *       11075625        0       1S99M   *       11075625
        0       TGGGTTGATGCCATGGAAAGGGGCAGTAACTTCCTGATGTTACCATGGCAACAGTAAACTAACATGGCACACTGGTGTCTAATG
GGGGAGGTGCTTCTGC    <84><84><84><88><88><88><88><88><88><88><88><88><88><8B><8B><8B><8B><8B><8B><8B>
<8B><8B><8B><88><8B><8B><84><88><84><88><8B><8B><8B><8B><8B><8B><88><88><8B><8B><8B><88><8B><88><88>
<88><88><88><84><88><8B><8B><8B><88><88><84><88><88><8B><8B><88><88><88><88><88><88><88><88><84><88>
<88><88><88><88><88><88><84><84><84><88><88><84><88><88><88><88><88><88><84><88><88>~<84><84><84>
<84><84><88><88><88>
C2KC2ACXX_1:6:1101:3573:0/1     22      *       11077961        0       100M    *       11077961
        0       AGCAGCAGTGTTTCTGAACAGCTTCAGGAAGAGCTTGCCACTTTCAGGCTCTCACAAATGGAGAGACTTCTTATTAATCTCTTT
CTCTCCACTGCAGGCA    <84><84><84><88><88><88><88><88><88><88><88><88><88><88><88><88><88><88><88><88>
<88><88><84><88><88><88><88><88><8B><8B><88><8B><88><8B><88><88><88><88><88><8B><8B>~<88><88><8B>
<88><88><88><88><88><88><88><8B><8B><8B><8B><8B><88><8B><8B>~<88><84><84><88><88><88><8B><8B><8B>
<8B><88><8B><84><88><88><88><88><88><88>~<88><88><88>~<84><88><88><84><88><84><84><84><84><84><84>
<84><84><84><84>
C2KC2ACXX_1:6:1101:3928:0/1     22      *       11030553        0       1S99M   *       11030553
        0       GGGTAGTCTGAAAGAGCTTGTTCCTCCCCGCCTCTCTCTCTCTCTTGCTCTCTCTCTTGCCATGTAACATTCAGGCTCCTCCTT
CACCTTCCAACATGGT    <84><84><84><88><88><88><88><88><88><88><88><88><88><8B>~<88><88><88><8B><88>
<8B><8B><8B><88><88><8B><8B><88><8B><8B><8B><8B><8B><88><8B><8B><8B><8B><8B><8B><8B><8B><8B><8B><88>
<8B><8B><88><8B><8B><8B><8B><8B><8B><8B><84><88><88><8B><8B><88><88><8B><88><88><88><88><88><88><88>
<84><84><88><84>riiiriyririirrr<84>iiriririyi

Make include guards unique

I find that include guards like "INDEX_H" and "PATH_H" are too short for the safe reuse of your header files (when they belong to an application programming interface).

paths should include edges

Right now these are implicit, which is causing problems that are obvious as soon as one uses vg view -d to convert a graph to dot format. Non-reference sequences are colored as red, and in some cases we see that deletions are colored as black (implying they are part of a known path). This can be corrected by examining the path membership of the nodes they connect--- if there is a gap we'll know that we are not in the path. However, perhaps it is better to add edges to paths? Unclear.

mapping qualities

We need mapping qualities. They can perhaps be estimated by looking at the size of the kmer space of the read vs. the kmer space of the entire index. A more nuanced approach is to build a ML model that calculates mapping accuracy given some parameters we can extract from the alignment (akin to Mosaik). But simple may be enough in this case.

Possibility for MPI integration...

Hi Erik,

Just out of curiosity, have you tried to integrate with MPI to work across multiple machines. MPI plays really well with OpenMP, and would allow access to even more cores to speed up things.

Just a thought,
Paul

Protobuf "info" map fields take too much memory

Every VG protobuf message has a "map<string, Info> info" field. On my system, this field, takes 88 bytes in memory when empty, and amounts to more than half of the in-memory protobuf Graph representation. And probably matches a good chunk of the memory used by the various hash_map indexes (if we assume 2 * 8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of it if possible, or moving it into Graph-level maps - ie instead of having an info record per Node, have a map in Graph that maps node id's to info records. This seems to me to be a low-hanging-fruit way to cut memory usage by about half. Something I can do, but would like feedback on as I don't really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string, ::vg::Info, ::google::protobuf::internal::WireFormatLite::TYPE_STRING, ::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn

GCSA2 output problem

@jltsiren reports:

The forward and backward edges don't always match in the example.

In the first case, kmer AAGAATACAAG starts at position 35:4 and has A as a predecessor. If I follow the backward edge on A, I reach kmer AAAGAATACAA, which can be found at position 35:3. However, because AAAGAATACAA does not have G as a successor, I can't reach the original kmer AAGAATACAAG by following a forward edge.

In all cases, the successor positions of the two kmers seem to differ by more than 1, so one of them is probably wrong.

Reverse: Node AAAGAATACAA is missing successor(G): AAGAATACAAG
AAAGAATACAA 35:3 G A 37:1
AAGAATACAAG 35:4 A A 37:3

Reverse: Node TACTCCACATC is missing successor(A): ACTCCACATCA
TACTCCACATC 200:0 C,G C 202:0
ACTCCACATCA 200:1 T A 202:2

Reverse: Node GATGCTTGTGA is missing successor(A): ATGCTTGTGAA
GATGCTTGTGA 77:24 T T 79:0
ATGCTTGTGAA 77:25 G G 82:2

Reverse: Node CATTGTCAACA is missing successor(C): ATTGTCAACAC
CATTGTCAACA 107:1 T A 109:0
ATTGTCAACAC 107:2 C A 109:2

Reverse: Node TCTCTTCACTG is missing successor(C): CTCTTCACTGC
TCTCTTCACTG 153:2 A G 155:0
CTCTTCACTGC 153:3 T C 155:2

Reverse: Node TGGGTCCTGGT is missing successor(G): GGGTCCTGGTG
TGGGTCCTGGT 15:9 C T 20:0
GGGTCCTGGTG 15:10 T C 20:2

Reverse: Node TGGTTCCTGGT is missing successor(G): GGTTCCTGGTG
TGGTTCCTGGT 15:9 C T 20:0
GGTTCCTGGTG 15:10 T C 20:2

Reverse: Node GTGATGCTTGT is missing successor(A): TGATGCTTGTA
GTGATGCTTGT 77:22 G G 78:1
TGATGCTTGTA 77:23 G A 82:0

edges should be included in paths

The protobuf format doesn't currently have a way to represent when edges are part of paths.

Case in point (from the test directory):

test git:(master) ✗ vg construct -r tiny/tiny.fa >t.vg; vg align -s CAAATAAGGCTTGGAAATTATATTCCAACTCTCTT -Q query t.vg | vg mod -i - t.vg | vg view -
H       HVN:Z:1.0
S       2       CAAATAAGGCTTGGAAATT
P       2       x       +       19M
L       2       -       5       +       0M
L       2       -       4       +       0M
S       4       TTCTGGAGTTCTATT
P       4       x       +       15M
L       4       -       5       +       0M
S       5       ATATTCCAACTCTCTG
P       5       x       +       16M

Nothing in the GFA output (or other output) can refer to the added path query. This is a problem with the schema itself.

shared-lib errors

I just installed vg on a CentOS 6.4 box (after failing on OSX in #3); it seemed to compile fine but when I try to run it on a simple chr1 fasta file:

$ ./vg construct -r /hpc/users/willir31/data/refs/hg19.chr1.fasta
./vg: /usr/lib64/libgomp.so.1: version `GOMP_4.0' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.18' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.5' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by ./vg)
./vg: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.19' not found (required by ./vg)

I don't really know what to make of these errors. Any ideas?

Break banded alignment

I'm looking for failing unitig/contig alignments.

Reads past -B bases in length are aligned in "banded" mode in which the read is broke into overlapping subreads, each subread is mapped independently, and the results are merged by finding common points in the paths of the subreads.

This method does work, but there are certainly problems with my implementation. It might be much better to implement banded alignment in gssw. Either way I am looking for tests.

paired-end reads

Is there any way to keep the separation distance of paired-end reads or is that information lost in graph format?

circular graphs

vg can now support circular graphs, but there are no tests of this functionality. Make one, and verify this can work for at least some operations.

path storage change

To dynamically modify paths, they should be stored in-memory as linked lists: std::list<Mapping>, rather than literal protobuf path objects (which are only efficient in append-only mode).

Long deletion simplification

Paths imply edges. However, we do not simplify paths where many nodes are skipped by deletions. Each node will be referenced in the path with a mapping to_length of 0. This is redundant and we can simply skip these nodes when describing the path. Downstream this will help when extending the graph with a set of alignments.

Installation-from-source issues on OSX

Hi there, I've hit a few snags trying to install from source on OSX 10.10.2. All of my cmds / output are in this gist but I'll link to specific files / sections within it below:

That's as far as I can get right now.

I've installed "real" gcc/g++ via brew and they are the only ones on my path. I did this due to earlier errors like:

In file included from vg.cpp:1:
./vg.hpp:9:10: fatal error: 'omp.h' file not found
#include <omp.h>
         ^
1 error generated.

Some googling led me to believe that this was the result of using clang, cf. this SO answer.

When running in to the pb2json linker errors above, I thought they might be due to my having run brew install protobuf-c when I was still using clang, so I brew remove'd and re-brew install'd it, and that actually got me further, to errors in index.cpp:

 Tue 19:41:36  ryan@mbp: vg:master$ make
g++ -std=c++11 -fopenmp -g  -O3 -c -o vg.o vg.cpp -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
g++ -std=c++11 -fopenmp -g  -O3 -c -o cpp/vg.pb.o cpp/vg.pb.cc -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
g++ -std=c++11 -fopenmp -g  -O3 -c -o main.o main.cpp -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
g++ -std=c++11 -fopenmp -g  -O3 -c -o index.o index.cpp -I./ -Ipb2json -Icpp -Ivcflib/src -Ivcflib -Ifastahack -Igssw/src -Irocksdb/include -Iprogress_bar -Isparsehash/build/include -Ilru_cache
index.cpp: In member function 'const string vg::Index::key_for_node(int64_t)':
index.cpp:119:20: error: 'htobe64' was not declared in this scope
     id = htobe64(id);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_edge_from_to(int64_t, int64_t)':
index.cpp:133:20: error: 'htobe64' was not declared in this scope
     to = htobe64(to);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_edge_to_from(int64_t, int64_t)':
index.cpp:151:20: error: 'htobe64' was not declared in this scope
     to = htobe64(to);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_kmer(const string&, int64_t)':
index.cpp:168:20: error: 'htobe64' was not declared in this scope
     id = htobe64(id);
                    ^
index.cpp: In member function 'const string vg::Index::key_for_node_path(int64_t, int64_t, int64_t)':
index.cpp:182:30: error: 'htobe64' was not declared in this scope
     node_id = htobe64(node_id);
                              ^
index.cpp: In member function 'const string vg::Index::key_for_path_position(int64_t, int64_t, int64_t)':
index.cpp:202:30: error: 'htobe64' was not declared in this scope

For unknown reasons, I can't even get back to this point after having removed everything and started over; I'm currently stuck at the aforementioned pb2json linker errors.

Just putting all of this here in case people have ideas / anyone else hits the same issues.

FastG output

Hi,

Would you be interested in outputting into fastg format to then view it in @rrwick Bandage tool?

update mapping algorithm

A big performance win lies at the other end of a small optimization. If we can guess what the best mapping target is on the basis of kmer matches, we can try to run fewer gssw alignments. If this can happen we can run a lot quicker when mapping.

One idea would be to measure the informativeness of each kmer, and then build a conditional entropy metric for each mapping target. We want to evaluate the mapping targets where our kmers hits are rare, and where there are many hits, before mapping against a huge complex of lower quality hits.

graph can imply unobserved sequences

What are the implications of the graph encoding (i.e. implying the existence of) sequences which have never been observed? e.g. variants at different locations which are not seen in a single individual but are on a valid path through the graph. Is there a way to encode that sort of contextual data in the graph?

vg on RNA-seq data

Hi,
This is great work and I look forwards to try it! I had some questions and I thought it would be better to ask rather than not to.

  • Would vg work on RNA-seq data?
  • would it be able to to spliced alignment directly?
  • could the known introns be coded as indels and the re-interpret them as introns when mapped back onto the reference space?
  • would it be possible to build a graph just with transcripts and move on from there?

thanks in advance,

Inti

vg surject: Assertion `alignment.has_path()' failed.

aln.gam (16GiB) has one lane of HiSeq 2500 100x2 reads mapped with `vg map'.

$ time vg/vg surject -d wg.index.k27e11 -t 32 -p 17 -b aln.gam > aln.surj.bam
vg: alignment.cpp:462: std::string vg::cigar_against_path(const vg::Alignment&): Assertion `alignment.has_path()' failed.
Aborted  

real    0m12.513s  
user    0m21.291s
sys     0m2.333s  
'''

GFA output is not really GFA

It's come to my attention that vg is not outputting correct GFA format.

That said, it's not clear yet to me if GFA can support vg graphs as @adamnovak 's recent commits enable, as such graphs have four kinds of edges.

Any suggestions about how to deal with this would be welcome. I think GFA is a useful format as it enables text processing of graphs on the command line.

adding variation with map / mod does not seem to work

I am trying to use map and mod to add sequences to the graph (as new paths). This does not work as expected on some simple examples (derived from existing unit tests). (originally mentioned a while ago in email, but adding to Github where it should have been in 1st place for posteriority, and if Adam wants to take a look). .

Using your test tiny/tiny.fa, I tried to make a point mutation (A->G 2nd base).

vg construct -r tiny/tiny.fa >t.vg
vg index -s -k 11 t.vg
vg view t.vg
H HVN:Z:1.0
S 1 CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG
P 1 x + 50M

vg map -s CGAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTT t.vg | vg mod -i - t.vg | vg view -
H HVN:Z:1.0
S 1 CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG
P 1 x + 50M
(no change)

Shouldn't I see a bubble in the graph? Same deal if I insert GGG at same position:

vg map -s CGGGAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTT t.vg | vg mod -i - t.vg | vg view -
H HVN:Z:1.0
S 1 CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG
P 1 x + 50M
(no change)

Inserting GGG at position 20 seems to work

vg map -s CAAATAAGGCTTGGAAATTTGGGTCTGGAGTTCTATTATATTCCAACTCTCTG t.vg | vg mod -i - t.vg | vg view -
H HVN:Z:1.0
S 2 CAAATAAGGCTTGGAAATTT
P 2 x + 20M
L 2 - 3 + 0M
L 2 - 4 + 0M
S 3 TCTGGAGTTCTATTATATTCCAACTCTCTG
P 3 x + 30M
S 4 GGG
L 4 - 3 + 0M

but I only see the one path for the sequence "x" in tiny.fa. I'd like to have a 2nd path be added that includes the insertion.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.