Code Monkey home page Code Monkey logo

rs-gfatovcf's Introduction

rs-gfatovcf

rs-gfatovcf is a tool that finds variants in a Variation Graph. Refer to this link for further details.

How to run this tool

First, clone this repo and move to the resulting directory:

git clone https://github.com/HopedWall/rs-gfatovcf
cd rs-gfatovcf

To compile the program, run:

cargo build --release

In order to run the program, use the following command:

cargo run --release -- -i {input_file.gfa} -o {output_file.vcf}

Where:

  • input_file.gfa is the gfa file that will be used to build the graph
  • output_file.vcf is the vcf file that will be used to store the variants

You can also add one or more of the following arguments:

  • -v (or --verbose) to display various debug messages.
  • -p {path_ids} (or --reference-paths {path_ids}) to specify which paths from the GFA should be used as reference. By default, all paths are used.
  • -m {integer} to set the max number of edges to be traversed, default is 100. WARNING: Greater values may require a very large amount of memory!
  • -j {path} (or --json {path}) to store both the starting graph and its bfs-tree in json format.

How it works

GFAtoVCF performs two main tasks during its execution:

  1. Bubble Detection
  2. Variant Identification

Bubble Detection

A bubble consists of multiple directed unipaths (a path in which all internal vertices are of degree 2) between two vertices. Bubbles are commonly caused by a small number of errors in the centre of reads. More information can be found in this paper

Detecting bubbles is important for converting a file from GFA to VCF since Variants will be found exclusively inside bubbles. In order to detect bubbles, we first build a support tree from the starting graph; then, by looking at how many nodes are at the each level of the tree, we can easily understand whether or not a bubble has been found. The approach will now be described in more detail:

  1. Run a BFS on the graph starting from a node X with no incoming edges, by doing so we obtain a tree rooted in X. We chose a BFS instead of a DFS since we wanted the order of the edges to not matter, and BFS achieves exactly that.
  2. Detect, for each level of the tree, how many nodes are at that level, this will provide crucial information for detecting bubbles in the next step.
  3. Detect bubbles by analyzing the data structure obtained in the previous step; basically, we explore the tree level-by-level. The following situations are possible:
    • no bubble has been detected yet, a level with only 1 node is found (and this node has at least 2 outgoing edges) -> OPEN the bubble
    • a bubble has already been detected, a level with multiple nodes is found -> EXTEND the bubble
    • a bubble has already been detected, a level with only 1 node is found -> CLOSE the bubble

Each bubble is represented internally as a tuple of NodeIDs: (Bubble_Start, Bubble_End), this will allow us to easily explore the nodes in the graph inside the bubble. Note that this script is currently unable to detect nested bubbles, also known as superbubbles.

Variant Identification

Variant Identification consists in finding Variants, which are differences in terms of sequences from a reference genome. In the graph, all genomes, including the reference ones, are described as paths. This second steps mostly invoves exploring different paths and comparing them to the references; instead of comparing all possible paths in the graph, we only focus on paths inside each bubble (since Variants can only be found there). The approach will now be described in more detail:

  1. Consider a reference path: remember that a path is a list of nodes, so ref = [r1,r2...rn]
  2. Compute all possible paths between Bubble_Start and Bubble_End: each path will be represented as: path = [p1,p2...pm]
  3. Compare reference path to each possible path. Consider a generic index i s.t. i <= min(n,m):
    • if ref[i] and path[i] are equal -> no variant found
    • if ref[i+1] and path[i] are equal -> DEL
    • if ref[i] and path[i+1] are equal -> INS
    • if ref[i] and path[i] are different -> SNV

This will be repeated for each reference path and for each bubble.

rs-gfatovcf's People

Contributors

hopedwall avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Forkers

dduchen

rs-gfatovcf's Issues

empty output

Hi @HopedWall,

I'm running the tool, and getting a VCF file that contains a header, but no variants were called. The graph was generated by the latest version of minigraph. We do get calls when running monograph call. The verbose output does produce information about the traversal.

gfa_to_vcf --input family.gfa --output family-out.vcf

Does your tool generate a multi-sample vcf with genotypes?

Errors 308 and 277 during compilation

Hello,

I get 27 of these errors during compilation. Do you know the reason for it? Should I update some parameters with the last versions of rust?

Thanks.

error[E0308]: mismatched types
--> /home/pio/.cargo/registry/src/github.com-1ecc6299db9ec823/lexical-core-0.7.4/src/atof/algorithm/bhcomp.rs:62:24
|
62 | let bytes = bits / Limb::BITS;
| ^^^^^^^^^^ expected usize, found u32

error[E0277]: cannot divide usize by u32
--> /home/pio/.cargo/registry/src/github.com-1ecc6299db9ec823/lexical-core-0.7.4/src/atof/algorithm/bhcomp.rs:62:22
|
62 | let bytes = bits / Limb::BITS;
| ^ no implementation for usize / u32
|
= help: the trait Div<u32> is not implemented for usize

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.