
vcfgo's Introduction


vcfgo is a golang library to read, write and manipulate files in the variant call format.

vcfgo

-- import "github.com/brentp/vcfgo"

Package vcfgo implements a Reader and Writer for the variant call format (VCF). It eases reading, filtering, and modifying VCFs, even if they are not to spec. Example:

Usage

f, err := os.Open("examples/test.auto_dom.no_parents.vcf")
if err != nil {
    panic(err)
}
defer f.Close()
rdr, err := vcfgo.NewReader(f, false)
if err != nil {
    panic(err)
}
for {
    // Read returns nil once there are no more variants
    variant := rdr.Read()
    if variant == nil {
        break
    }
    fmt.Printf("%s\t%d\t%s\t%v\n", variant.Chromosome, variant.Pos, variant.Ref(), variant.Alt())
    // Info values come back as interface{}; check the error before asserting the type
    dp, err := variant.Info().Get("DP")
    if err != nil {
        panic(err)
    }
    fmt.Printf("depth: %v\n", dp.(int))
    sample := variant.Samples[0]
    // we can get the PL field as a list (-1 is default in case of missing value)
    PL, err := variant.GetGenotypeField(sample, "PL", -1)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%v\n", PL)
    _ = sample.DP
}
// report any parse errors accumulated while reading
fmt.Fprintln(os.Stderr, rdr.Error())

Status

vcfgo is well-tested, but still in development. It tries to tolerate errors while still reporting them: after every rdr.Read() call, the caller can check rdr.Error() for feedback on any problems, and parsing does not stop unless the caller explicitly requests it.

Info and sample fields are pre-parsed and stored as map[string]interface{}, so callers must type-assert values to the appropriate type upon retrieval.
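As a rough illustration of the type assertion this implies, the sketch below uses a plain map[string]interface{} in place of the library's InfoMap. The helper name intField is hypothetical and not part of vcfgo; the comma-ok form avoids a panic when the stored value has a different type.

```go
package main

import "fmt"

// intField is a hypothetical helper (not part of vcfgo). It looks up key in an
// InfoMap-like map and returns the value as an int. The second result is false
// when the key is absent or the value holds another type.
func intField(m map[string]interface{}, key string) (int, bool) {
	v, ok := m[key]
	if !ok {
		return 0, false
	}
	n, ok := v.(int) // comma-ok assertion: no panic on a type mismatch
	return n, ok
}

func main() {
	info := map[string]interface{}{"DP": 35, "AF": []float32{0.5}}
	if dp, ok := intField(info, "DP"); ok {
		fmt.Println("depth:", dp)
	}
	if _, ok := intField(info, "AF"); !ok {
		fmt.Println("AF is not a single int")
	}
}
```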

type Header

type Header struct {
	SampleNames   []string
	Infos         map[string]*Info
	SampleFormats map[string]*SampleFormat
	Filters       map[string]string
	Extras        map[string]string
	FileFormat    string
	// contig id maps to a map of length, URL, etc.
	Contigs map[string]map[string]string
}

Header holds all the type and format information for the variants.

func NewHeader

func NewHeader() *Header

NewHeader returns a Header with the requisite allocations.

type Info

type Info struct {
	Id          string
	Description string
	Number      string // A G R . ''
	Type        string // STRING INTEGER FLOAT FLAG CHARACTER UNKNOWN
}

Info holds the Info and Format fields

func (*Info) String

func (i *Info) String() string

String returns a string representation.

type InfoMap

type InfoMap map[string]interface{}

InfoMap holds the parsed Info field which can contain floats, ints and lists thereof.

func (InfoMap) String

func (m InfoMap) String() string

String returns a string that matches the original info field.

type Reader

type Reader struct {
	Header *Header

	LineNumber int64
}

Reader holds information about the current line number (for errors) and the VCF header that indicates the structure of records.

func NewReader

func NewReader(r io.Reader, lazySamples bool) (*Reader, error)

NewReader returns a Reader.

func (*Reader) Clear

func (vr *Reader) Clear()

Clear empties the cache of errors.

func (*Reader) Error

func (vr *Reader) Error() error

Error() aggregates the multiple errors that can occur into a single object.

func (*Reader) Read

func (vr *Reader) Read() *Variant

Read returns a pointer to a Variant. After each read, the caller is expected to check Reader.Error().

type SampleFormat

type SampleFormat Info

SampleFormat holds the type info for Format fields.

func (*SampleFormat) String

func (i *SampleFormat) String() string

String returns a string representation.

type SampleGenotype

type SampleGenotype struct {
	Phased bool
	GT     []int
	DP     int
	GL     []float32
	GQ     int
	MQ     int
	Fields map[string]string
}

SampleGenotype holds the information about a sample. Several fields are pre-parsed, but all fields are kept in Fields as well.

func NewSampleGenotype

func NewSampleGenotype() *SampleGenotype

NewSampleGenotype allocates the internals and returns a SampleGenotype

func (*SampleGenotype) String

func (sg *SampleGenotype) String(fields []string) string

String returns the string representation of the sample field.

type VCFError

type VCFError struct {
	Msgs  []string
	Lines []int64
}

VCFError satisfies the error interface and allows multiple errors. This is useful because, for example, on a single line, every sample may have a field that doesn't match the description in the header. We want to keep parsing but also let the caller know about the error.

func NewVCFError

func NewVCFError() *VCFError

NewVCFError allocates the needed ingredients.

func (*VCFError) Add

func (e *VCFError) Add(err error, line int64)

Add adds an error and the line number within the vcf where the error took place.

func (*VCFError) Clear

func (e *VCFError) Clear()

Clear empties the Messages

func (*VCFError) Error

func (e *VCFError) Error() string

Error returns a string with all errors delimited by newlines.

func (*VCFError) IsEmpty

func (e *VCFError) IsEmpty() bool

IsEmpty returns true if there are no errors stored.
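The accumulate-and-continue idea behind VCFError can be sketched with a small standalone type. This is not the library's implementation: multiError and its methods are hypothetical names, and ignoring nil errors in add is an assumption made so that callers can pass results through without checking them first.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// multiError is a hypothetical stand-in for an accumulating error type.
type multiError struct {
	msgs  []string
	lines []int64
}

// add records err together with its line number. Nil errors are ignored
// (an assumption of this sketch).
func (e *multiError) add(err error, line int64) {
	if err == nil {
		return
	}
	e.msgs = append(e.msgs, err.Error())
	e.lines = append(e.lines, line)
}

// isEmpty reports whether no errors have been recorded.
func (e *multiError) isEmpty() bool { return len(e.msgs) == 0 }

// Error joins all recorded messages, one per line, so that *multiError
// satisfies the error interface.
func (e *multiError) Error() string {
	out := make([]string, len(e.msgs))
	for i, m := range e.msgs {
		out[i] = fmt.Sprintf("line %d: %s", e.lines[i], m)
	}
	return strings.Join(out, "\n")
}

func main() {
	e := &multiError{}
	e.add(nil, 1)                  // ignored
	e.add(errors.New("bad DP"), 2) // parsing would continue after each of these
	e.add(errors.New("bad GQ"), 2)
	if !e.isEmpty() {
		fmt.Println(e.Error())
	}
}
```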

type Variant

type Variant struct {
	Chromosome string
	Pos        uint64
	Id         string
	Ref        string
	Alt        []string
	Quality    float32
	Filter     string
	Info       InfoMap
	Format     []string
	Samples    []*SampleGenotype
	Header     *Header
	LineNumber int64
}

Variant holds the information about a single site. It is analogous to a row in a VCF file.

func (*Variant) GetGenotypeField

func (v *Variant) GetGenotypeField(g *SampleGenotype, field string, missing interface{}) (interface{}, error)

GetGenotypeField uses the information from the header to parse the correct type from a genotype field. It returns an interface that can be asserted to the expected type.

func (*Variant) String

func (v *Variant) String() string

String gives a string representation of a variant

type Writer

type Writer struct {
	io.Writer
	Header *Header
}

Writer allows writing VCF files.

func NewWriter

func NewWriter(w io.Writer, h *Header) (*Writer, error)

NewWriter returns a writer after writing the header.

func (*Writer) WriteVariant

func (w *Writer) WriteVariant(v *Variant)

WriteVariant writes a single variant

vcfgo's People

Contributors

brentp, chapmanb, codelingobot, revl, tyhullinger


vcfgo's Issues

vcfgo seems to be stricter than the VCF spec

The VCF spec Section 1.2 appears to allow arbitrary meta-information lines (starting with ##).

vcfgo fails to open VCFs that contain unexpected meta-information fields that nevertheless seem spec compliant (e.g., "##filtering_status=These calls have been filtered by FilterMutectCalls to label false positives with a list of failed filters and true positives with PASS.").

It seems that this check should either be liberalized or perhaps should go away.

vcfgo/reader.go

Lines 132 to 142 in bdb8e83

} else if strings.HasPrefix(line, "#CHROM") {
	var err error
	h.SampleNames, err = parseSampleLine(line)
	verr.Add(err, LineNumber)
	//h.Validate(verr)
	break
} else {
	e := fmt.Errorf("unexpected header line: %s", line)
	return nil, e
}
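One way the check could be liberalized is to keep unrecognized "##key=value" lines instead of returning an error. The sketch below is an assumption about how that might look, not the library's code: parseExtra is a hypothetical helper, the local extras map merely mirrors the Extras field of Header, and the helper is meant to run only after structured lines (INFO, FORMAT, FILTER, contig) have been handled elsewhere.

```go
package main

import (
	"fmt"
	"strings"
)

// parseExtra is a hypothetical helper that splits an unstructured
// "##key=value" meta-information line at the first "=".
// ok is false when the line is not of that form.
func parseExtra(line string) (key, value string, ok bool) {
	if !strings.HasPrefix(line, "##") {
		return "", "", false
	}
	rest := line[2:]
	i := strings.IndexByte(rest, '=')
	if i <= 0 { // no "=" at all, or an empty key
		return "", "", false
	}
	return rest[:i], rest[i+1:], true
}

func main() {
	extras := map[string]string{} // plays the role of Header.Extras
	lines := []string{
		"##fileformat=VCFv4.2",
		"##filtering_status=These calls have been filtered.",
		"##malformed",
	}
	for _, l := range lines {
		if k, v, ok := parseExtra(l); ok {
			extras[k] = v
		} else {
			// report the line and keep going rather than aborting the read
			fmt.Println("skipping:", l)
		}
	}
	fmt.Println(len(extras), "extras kept")
}
```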

Handling a FORMAT that is inaccurate

Interestingly, my genotypes are not parsing, but it turns out this is an issue with the FORMAT. The FORMAT indicates GT:AD:DP, but the only data actually provided for each sample is the genotype (e.g., 0/0).

This feels like it is against spec, which is why it's not parsing. If I manually fix one of these lines, the genotype parses correctly. Weirdly, though, vcf-validate (from vcftools) does not complain about this FORMAT/data mismatch. I would interpret the spec the same way you did, though I'm a bit perplexed that vcf-validate is OK with this. Just want to make sure this is the right approach (I think it is).

Genotype parsing does not handle Haploid/Triploid

Per the VCF specification, the genotype field for a sample can be diploid, haploid, or triploid. Your unit tests appear to cover mainly diploid calls, but there is an issue in parsing haploid genotypes: right now an index out of range error is thrown.

The X, Y and MT chromosomes may contain only haploid calls. Is there any way that you can adjust your logic? If you are unable to fix this, I can try to find some time to do so.
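A ploidy-agnostic parse could look roughly like the following sketch. It is not the library's parser: parseGT is a hypothetical function, and encoding a missing allele (".") as -1 is an assumption of this example. Splitting on the separator rather than indexing fixed positions is what avoids the out-of-range access for haploid calls.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGT is a hypothetical helper that parses a GT string of any ploidy,
// such as "1", "0/1" or "0|1|2". Missing alleles (".") are returned as -1,
// which is an assumption of this sketch.
func parseGT(gt string) (alleles []int, phased bool, err error) {
	phased = strings.Contains(gt, "|")
	// Treat both separators alike, then split into one part per allele.
	parts := strings.Split(strings.ReplaceAll(gt, "|", "/"), "/")
	alleles = make([]int, len(parts))
	for i, p := range parts {
		if p == "." {
			alleles[i] = -1
			continue
		}
		n, convErr := strconv.Atoi(p)
		if convErr != nil {
			return nil, false, fmt.Errorf("bad allele %q in genotype %q", p, gt)
		}
		alleles[i] = n
	}
	return alleles, phased, nil
}

func main() {
	for _, gt := range []string{"1", "0/1", "0|1|2", "./."} {
		alleles, phased, err := parseGT(gt)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(gt, alleles, phased)
	}
}
```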

vcfgo stricter than VCF spec in terms of sample filters

vcfgo seems to expect the sample filter to be the same for all samples:

vcfgo/header.go

Lines 69 to 73 in aeb512d

func (h *Header) parseSample(format []string, s string) (*SampleGenotype, []error) {
	values := strings.Split(s, ":")
	if len(format) != len(values) {
		return NewSampleGenotype(), []error{fmt.Errorf("bad sample string: %s", s)}
	}

However, the VCF spec is more permissive: "Trailing fields can be dropped (with the exception of the GT field, which should always be present if specified in the FORMAT field)." See page 6.

I believe that, with the exception of GT, trailing fields should be allowed to be missing.
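A more permissive length check might be sketched as below. This is only an illustration under stated assumptions: splitSample is a hypothetical function that returns a plain map rather than a SampleGenotype, and filling dropped fields with "." is a choice made here. Because strings.Split always returns at least one element, the first FORMAT key (GT, when present) always receives a value.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSample is a hypothetical helper that maps FORMAT keys to sample values.
// Dropped trailing fields are filled with "." (missing). A sample that carries
// more values than there are FORMAT keys is still rejected.
func splitSample(format []string, sample string) (map[string]string, error) {
	values := strings.Split(sample, ":")
	if len(values) > len(format) {
		return nil, fmt.Errorf("bad sample string: %s", sample)
	}
	fields := make(map[string]string, len(format))
	for i, key := range format {
		if i < len(values) {
			fields[key] = values[i]
		} else {
			fields[key] = "." // trailing field was dropped
		}
	}
	return fields, nil
}

func main() {
	format := []string{"GT", "AD", "DP"}
	fields, err := splitSample(format, "0/0")
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["GT"], fields["AD"], fields["DP"])
}
```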

Demo code in README.md is not valid go?

The README has the following:

f, _ := os.Open("examples/test.auto_dom.no_parents.vcf")
rdr, err := vcfgo.NewReader(f, false)
if err != nil {
	panic(err)
}
for {
	variant := rdr.Read()
	if variant == nil {
		break
	}
	fmt.Printf("%s\t%d\t%s\t%s\n", variant.Chromosome, variant.Pos, variant.Ref, variant.Alt)
	fmt.Printf("%s", variant.Info["DP"].(int) > 10)
	sample := variant.Samples[0]
	// we can get the PL field as a list (-1 is default in case of missing value)
	fmt.Println("%s", variant.GetGenotypeField(sample, "PL", -1))
	_ = sample.DP
}
fmt.Fprintln(os.Stderr, rdr.Error())

However, variant.Info is a function and is not indexable with ["DP"]. Also, variant.GetGenotypeField returns two values and so cannot be used in fmt.Println which is a single-value context.

INFO with `Number=.,Type=String` parsed as `string` or `[]string`.

I parse INFO.CSQ with header line ##INFO=<ID=CSQ,Number=.,Type=String,Description=".
I found that variant.Info_.Get("CSQ") returns an interface{} that holds either a string or a []string.
I use a type switch to handle this uncertainty.

switch csqInfos.(type){
case string:
  // handle as string
case []string:
  // handle as []string
}

I wonder if there is a simpler and more efficient way.
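One possible simplification is to hide the type switch in a single helper that always returns a []string, so calling code handles only one shape. The name asStrings is hypothetical and the helper is not part of vcfgo; it is a sketch of the normalization, not a confirmed recommendation from the maintainers.

```go
package main

import "fmt"

// asStrings is a hypothetical helper that normalizes a value that may be a
// string or a []string into a []string.
func asStrings(v interface{}) ([]string, error) {
	switch t := v.(type) {
	case string:
		return []string{t}, nil
	case []string:
		return t, nil
	default:
		return nil, fmt.Errorf("unexpected type %T", v)
	}
}

func main() {
	// Both shapes that an INFO field with Number=. may take.
	values := []interface{}{
		"A|missense",
		[]string{"A|missense", "T|synonymous"},
	}
	for _, v := range values {
		csq, err := asStrings(v)
		if err != nil {
			panic(err)
		}
		fmt.Println(len(csq), csq)
	}
}
```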

Possible opt: splitGT fmt.Sprintf -> strconv.Itoa

The line in question is ai := fmt.Sprintf("%d", i), which could be replaced with strconv.Itoa(i). Not sure if this is a better solution.

func splitGT(m interface{}, i int) ([]interface{}, error) {
	ml := m.([]interface{})
	out := make([]interface{}, len(ml))
	ai := fmt.Sprintf("%d", i)
	for i, allelei := range ml {
		allele := allelei.(string)
		if allele == "0" {
			out[i] = "0"
		}
		if ai == allele {
			out[i] = "1"
		} else {
			out[i] = "."
		}
	}
	return out, nil
}
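For reference, a minimal check of the proposed substitution: for an int, strconv.Itoa produces the same decimal text as fmt.Sprintf("%d", i), without parsing a format string or boxing the argument in an interface. The wrapper name alleleString is hypothetical and exists only to make the comparison explicit.

```go
package main

import (
	"fmt"
	"strconv"
)

// alleleString is a hypothetical wrapper that converts an allele index to its
// decimal string form using strconv.Itoa.
func alleleString(i int) string {
	return strconv.Itoa(i)
}

func main() {
	// Compare both conversions for a few allele indexes.
	for _, i := range []int{0, 1, 2, 10} {
		fmt.Println(alleleString(i) == fmt.Sprintf("%d", i))
	}
}
```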
