gfa2bin

Convert a GFA graph to a PLINK file. The tool can also read compressed coverage files produced by packing.

Installation

git clone https://github.com/MoinSebi/gfa2bin
cd gfa2bin
cargo build --release
./target/release/gfa2bin -h 

Subcommands

graph - Converting from graphs directly

Convert a graph in GFA format to PLINK or BIMBAM format. The feature used as the entry is selected with -f. Supported features are nodes (1), edges (1+2+), and directed nodes (1+). Paths can be merged into samples using the PanSN-spec, which is highly recommended.
You can ignore certain paths using the --path option. Nodes can also be ignored with the gfa2bin mod command after the graph has been converted to PLINK format.
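To illustrate how paths can be merged into samples, here is a minimal Python sketch of PanSN-style grouping. It is not the tool's implementation. The function name, the example path names, and the fallback for names that do not follow the pattern are assumptions made for illustration.

```python
from collections import defaultdict

def group_paths_by_sample(path_names, delimiter="#"):
    """Group path names by PanSN sample and haplotype.

    PanSN names have the form sample<delim>haplotype<delim>contig.
    Names that do not match are kept as their own sample with
    haplotype "0" (an assumption of this sketch).
    """
    groups = defaultdict(lambda: defaultdict(list))
    for name in path_names:
        parts = name.split(delimiter, 2)
        if len(parts) == 3:
            sample, haplotype = parts[0], parts[1]
        else:
            sample, haplotype = name, "0"
        groups[sample][haplotype].append(name)
    # Convert the nested defaultdicts into plain dicts
    return {sample: dict(haps) for sample, haps in groups.items()}

# Hypothetical path names
paths = ["HG002#1#chr1", "HG002#2#chr1", "HG003#1#chr1", "ref_chr1"]
grouped = group_paths_by_sample(paths)
print(grouped)
```

With this grouping, both HG002 paths end up under one sample with two haplotypes, which is the information needed for the ploidy handling described further down.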

Workflow

We count how often each feature occurs in each path/sample by iterating through the graph. Depending on the output format, the result is a presence/absence matrix (PLINK) or a normalized matrix (BIMBAM). You can either provide an absolute threshold with -a, which is applied to all entries, or a method (mean, median, percentile), which is computed per row (feature). The value computed by the method is then multiplied by the relative threshold. If no threshold is provided, you receive a presence/absence PLINK file or a max-value-normalized BIMBAM file.
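The thresholding step can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: the relative threshold is treated as a percentage, a count passes when it is greater than or equal to the threshold, and the percentile method is left out. All names are hypothetical.

```python
from statistics import mean, median

def presence_absence(counts, method=None, relative=100.0, absolute=None):
    """Turn per-feature count rows into 0/1 calls.

    counts: dict mapping feature name -> list of per-sample counts.
    absolute: one threshold used for every entry.
    method: "mean" or "median", computed per row and scaled by
            `relative` (assumed here to be a percentage).
    With neither given, any non-zero count is called present.
    """
    calls = {}
    for feature, row in counts.items():
        if absolute is not None:
            threshold = absolute
        elif method == "mean":
            threshold = mean(row) * relative / 100.0
        elif method == "median":
            threshold = median(row) * relative / 100.0
        elif method is None:
            threshold = 1
        else:
            raise ValueError(f"unsupported method: {method}")
        # A zero count is never called present, even if the threshold is 0
        calls[feature] = [1 if c > 0 and c >= threshold else 0 for c in row]
    return calls

counts = {"node1": [0, 2, 4], "node2": [1, 1, 1]}
by_mean = presence_absence(counts, method="mean", relative=50)
by_abs = presence_absence(counts, absolute=3)
plain = presence_absence(counts)
print(by_mean, by_abs, plain)
```

For node1 the row mean is 2, so a relative threshold of 50 gives a cutoff of 1.0 and the calls 0, 1, 1.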

Diploid

Ploidy information can be taken into account. It is auto-detected when using the PanSN-spec. In PLINK files, diploid genotypes are represented as 11, 01, 10, or 00. In a BIMBAM file, we use the average of both "normalized" haplotypes.
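As a small illustration of the PLINK representation, the following Python sketch pairs the presence/absence calls of two haplotypes into two-character codes. The function name and the input layout (one list of 0/1 calls per haplotype) are assumptions for this example.

```python
def diploid_genotypes(hap1, hap2):
    """Pair per-feature 0/1 calls of two haplotypes into codes like '10'."""
    if len(hap1) != len(hap2):
        raise ValueError("both haplotypes need the same number of features")
    return [f"{a}{b}" for a, b in zip(hap1, hap2)]

codes = diploid_genotypes([1, 0, 1, 0], [1, 1, 0, 0])
print(codes)  # ['11', '01', '10', '00']
```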

BIMBAM

In BIMBAM format, the value used for normalization corresponds to 2.0 in the 0.0 to 2.0 range. As noted above, the two haplotypes are averaged after the "0-2" normalization. We are not sure whether this is the right approach.
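The averaging described here can be written out as a short Python sketch. It assumes that counts above the normalization value are clipped to 2.0, which the text does not state. The function and parameter names are hypothetical.

```python
def bimbam_dosage(hap1_count, hap2_count, norm_value):
    """Scale each haplotype count to 0.0-2.0, then average both values.

    norm_value is the count that maps to 2.0. Larger counts are
    clipped to 2.0 (an assumption of this sketch).
    """
    if norm_value <= 0:
        raise ValueError("norm_value must be positive")

    def scale(count):
        return min(count / norm_value, 1.0) * 2.0

    return (scale(hap1_count) + scale(hap2_count)) / 2.0

print(bimbam_dosage(4, 0, norm_value=4))  # 1.0
print(bimbam_dosage(2, 2, norm_value=4))  # 1.0
print(bimbam_dosage(8, 4, norm_value=4))  # 2.0
```

The first two calls show the consequence of averaging: one haplotype at the maximum and one at zero gives the same value as two haplotypes at half the maximum.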

Comment:

In our experience there is no need for column (path) normalization, since samples/paths can contain extreme counts of single features, which distort the normalization. If wanted, I can implement this in the future.

Example usage:

gfa2bin graph -g input.gfa -o output -f node --bimbam 
gfa2bin graph -g input.gfa -o output -f dirnode -m mean -r 50 --pansn '#'

align - Using graph alignment

Convert coverage information from sequence-to-graph alignments to PLINK bed files. You can either use plain pack files directly (which consumes a large amount of memory) or use one of the custom coverage file formats from the packing repository as input. The packing repository helps to reduce storage and can perform pre-processing at the sample level.
As with the graph subcommand, additional normalization can be run when using value-based input. This normalization is run at the feature level (e.g. nodes or edges).

Example usage:

Remove

Remove samples or entries from the plink files (bed, bim, fam).

Example usage:

./target/release/gfa2bin remove -b input --samples samples.txt --genotypes genotypes_names.txt -o output_plink

Tip: Use gretl or any other tool to get a list of samples or entries with a specific statistic.

Filter

Filter entries or samples from a plink file.

Samples can be filtered by:

Genotypes can be filtered by:

  • MAF (major allele frequency)
  • maf (minor allele frequency)
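For reference, the two frequencies can be computed from two-character genotype codes as in the Python sketch below. How the tool itself counts alleles is not described here, so the input format and function name are assumptions.

```python
def allele_frequencies(genotypes):
    """Return (minor, major) allele frequency for one entry.

    genotypes: list of two-character codes such as '10', one per sample.
    """
    alleles = [int(ch) for code in genotypes for ch in code]
    if not alleles:
        raise ValueError("no genotypes given")
    freq_one = sum(alleles) / len(alleles)
    return min(freq_one, 1 - freq_one), max(freq_one, 1 - freq_one)

minor, major = allele_frequencies(["11", "10", "00", "00"])
print(minor, major)  # 0.375 0.625
```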

Split and merge

Split

Split a single PLINK file set (bed, bim, fam) into multiple parts of the same size. This may be preferred if the test data set is very large, the GWAS takes a long time, and multiprocessing is not possible.
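One way to divide the entries into equally sized parts is sketched below in Python. It assumes the split is made over entries (rows of the bim file) while every part keeps all samples. The function name is hypothetical.

```python
def split_ranges(n_entries, n_parts):
    """Return (start, end) index ranges covering n_entries in n_parts chunks.

    Chunk sizes differ by at most one; `end` is exclusive.
    """
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    base, extra = divmod(n_entries, n_parts)
    ranges = []
    start = 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional entry
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

print(split_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```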

Merge

Merge multiple PLINK files back together, either from the split above or from any other splitting operation. All input files must have the same sample order (same fam order and names).
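The sample-order requirement can be checked before merging, as in this Python sketch. It works on already parsed name lists rather than on real fam and bim files, and all names are made up for illustration.

```python
def merge_parts(parts):
    """Merge (samples, entries) pairs that share one sample order.

    samples: sample names in fam order.
    entries: entry names in bim order.
    Raises ValueError if any part has a different sample list.
    """
    if not parts:
        return [], []
    reference_samples = parts[0][0]
    merged_entries = []
    for samples, entries in parts:
        if samples != reference_samples:
            raise ValueError("sample order or names differ between inputs")
        merged_entries.extend(entries)
    return reference_samples, merged_entries

part_a = (["s1", "s2"], ["node1", "node2"])
part_b = (["s1", "s2"], ["node3"])
samples, entries = merge_parts([part_a, part_b])
print(samples, entries)
```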

View

Convert a PLINK bed file to a VCF-like file format. This can be useful for a general check of the generated genotypes. Depending on the input, the output file can be very large.
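For readers who want to inspect a bed file by hand, the following Python sketch decodes the standard variant-major PLINK 1 binary layout: three header bytes, then two bits per sample with four samples per byte, lowest bits first. It returns the raw 2-bit codes only. How gfa2bin maps its genotypes onto these codes is not covered here, and the function name is an assumption.

```python
def decode_bed(data, n_samples):
    """Decode variant-major PLINK 1 .bed bytes into lists of 2-bit codes.

    Raw codes follow the PLINK 1 convention:
    0 = homozygous first allele, 1 = missing,
    2 = heterozygous, 3 = homozygous second allele.
    """
    if data[:3] != b"\x6c\x1b\x01":
        raise ValueError("not a variant-major PLINK 1 bed file")
    bytes_per_variant = (n_samples + 3) // 4
    body = data[3:]
    if bytes_per_variant == 0 or len(body) % bytes_per_variant != 0:
        raise ValueError("body length does not match the sample count")
    variants = []
    for offset in range(0, len(body), bytes_per_variant):
        block = body[offset:offset + bytes_per_variant]
        codes = []
        for i in range(n_samples):
            # Sample i sits in byte i // 4, at bit position 2 * (i % 4)
            codes.append((block[i // 4] >> (2 * (i % 4))) & 0b11)
        variants.append(codes)
    return variants

# One variant, three samples with raw codes 0, 2, 3 -> byte 0x38
raw = bytes([0x6C, 0x1B, 0x01, 0x38])
decoded = decode_bed(raw, n_samples=3)
print(decoded)  # [[0, 2, 3]]
```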

Example

Find

Given a list of genotypes (e.g. significant nodes or edges) and a graph, return the positions (in BED format) on those paths where the genotypes can be found. Each genotype is listed as additional information in the BED file. If more than the exact position is needed, --length can be added, which returns larger intervals by adding the additional length to each site.
The output is intended for extracting the sequence from the original sequences and BLASTing it against a database to learn more about the selected DNA segment (overlap with genes or other regions of interest).
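The interval extension can be pictured with the Python sketch below. It assumes the extra length is added on both sides of the hit and that intervals are clipped to the path boundaries. Neither detail is confirmed by the text above, and the path and genotype names are made up.

```python
def extend_interval(start, end, length, path_length):
    """Widen a 0-based, half-open interval by `length` on both sides,
    clipped to [0, path_length]."""
    return max(0, start - length), min(path_length, end + length)

def bed_line(path, start, end, genotype):
    """Format one tab-separated BED record with the genotype as name."""
    return f"{path}\t{start}\t{end}\t{genotype}"

start, end = extend_interval(100, 150, length=200, path_length=1000)
line = bed_line("sampleA#1#chr1", start, end, "node42")
print(line)
```

Here the start is clipped to 0 because 100 - 200 would fall before the beginning of the path, giving the interval 0 to 350.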

Nearest node

Return the closest reference node with respect to the input node. The reference node is the closest node that can be found on any given reference path. The result additionally returns the reference position of this node. An example is shown below.
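Conceptually, such a search can be done with a breadth-first traversal, as in the Python sketch below. It is an illustration and not the tool's algorithm: distance is counted in node steps rather than base pairs, edge direction is ignored, and the reference position is not reported. All names and the toy graph are assumptions.

```python
from collections import deque

def nearest_reference_node(start, edges, reference_nodes):
    """Breadth-first search for the closest node lying on a reference path.

    edges: iterable of (node_a, node_b) pairs, treated as undirected.
    reference_nodes: set of node ids found on the reference paths.
    Returns (node, steps), or None if no reference node is reachable.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, steps = queue.popleft()
        if node in reference_nodes:
            return node, steps
        # Sorted for a deterministic visiting order
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, steps + 1))
    return None

edges = [(1, 2), (2, 3), (2, 4), (4, 5), (3, 5)]
reference = {1, 5}
hit = nearest_reference_node(3, edges, reference)
print(hit)  # (5, 1)
```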

Example usage

Example output
