
bdg-formats's Issues

Streamline genotype/variant annotations

  • Rename DatabaseVariantAnnotation to VariantAnnotation
  • Rename VariantCallingAnnotations to GenotypeAnnotation
  • Roll TranscriptEffect under VariantAnnotation.

Rename "end" fields

I have trouble accessing the "end" fields (e.g. AlignmentRecord.end, Variant.end) with Spark SQL, because end is a reserved keyword there and conflicts with the field names.

I was wondering: Is it possible to assign different names to those fields?
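
Until the fields are renamed, one workaround is to quote the colliding identifiers when building query strings. The sketch below is illustrative only: the RESERVED set and both helpers are hypothetical, not part of bdg-formats or Spark, and the backtick convention follows Spark SQL's identifier quoting.

```python
# Hypothetical helpers for building Spark SQL query strings against
# schemas whose field names collide with SQL keywords (e.g. "end").
# The RESERVED set is an illustrative subset, not an exhaustive list.
RESERVED = {"end", "order", "group", "select"}

def quote_field(name):
    """Backtick-quote a field name if it collides with a SQL keyword."""
    return f"`{name}`" if name.lower() in RESERVED else name

def select_sql(table, fields):
    """Build a SELECT statement with reserved field names safely quoted."""
    cols = ", ".join(quote_field(f) for f in fields)
    return f"SELECT {cols} FROM {table}"
```

Renaming the schema fields would of course remove the need for quoting entirely.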

Rename 'Read' back to 'Record' in 0.1.2

I remember there was some discussion of renaming 'ADAMRecord' to 'Read' at the same time as we removed the 'ADAM' prefix from many of these schemas -- although I can't find that discussion now.

However, can I vote that we change the name of 'Read' back to 'Record'? If 'Record' is too generic, maybe something more specific like 'AlignmentRecord'.

The point is that the 'Read' schema actually bears a many-to-one relationship with the reads themselves (which will become clear if we start parsing FASTQ files directly in any context where we're parsing the raw data), and a '[something]Record' name will continue to evoke the association with SAMRecord that is so clearly implied by the presence of fields with similar names and semantics.

Thoughts?

Code style and doc fixes

Minor code style and doc fixes:

  • Use UPPERCASE_WITH_UNDERSCORES for enum values
  • Remove @see tags that cause javadoc warnings
  • Remove is from boolean field names
  • Fix whitespace and doc formatting
  • Add doc where missing
  • Remove unused Base enum

Add OP/OC flags

The original position and original cigar flags are useful for describing the alignment of a read prior to realignment.

Re-organize the Feature schema

This is a ticket to capture our collective thinking around the re-organization of the Feature schema. The Feature schema needs to be edited to satisfy the following (additional) requirements:

  • it needs an explicit 'type' field
  • it should be less file-format specific (i.e. fields like 'qValue' and 'signalValue' could be moved to an attributes field)
  • we need to make sure it's as memory efficient as possible (and some benchmarking would be nice, too)

I can't find Pileup class file in org.bdgenomics.formats.avro package.

I can't find the Pileup class file in the org.bdgenomics.formats.avro package. When I run:

val pileups = reads.adamRecords2Pileup().cache()

I get this error:

:32: error: value adamRecords2Pileup is not a member of org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.AlignmentRecord]

How can I get the file and resolve the error?

Thanks!

jack xu

@davidonlaptop @hammer @jey @massie @heuermh

Why was alternateAllele in TranscriptEffect removed?

I can't remember the part of the discussion around #90 where alternateAllele in TranscriptEffect was dropped. As far as I can tell, this is still necessary when reading in variants to associate an effect with a specific alternate allele.

Reference allele value for Number=R VCF INFO attributes

The documentation for VariantAttribute fields states that Number=R VCF INFO attribute values are split for multi-allelic sites

  /**
   Total read depth, VCF INFO reserved key AD, Number=R, split for multi-allelic
   sites.
   */
  union { null, int } readDepth = null;

  /**
   Forward strand read depth, VCF INFO reserved key ADF, Number=R, split for
   multi-allelic sites.
   */
  union { null, int } forwardReadDepth = null;

  /**
   Reverse strand read depth, VCF INFO reserved key ADR, Number=R, split for
   multi-allelic sites.
   */
  union { null, int } reverseReadDepth = null;

...

  /**
   Additional variant attributes that do not fit into the standard fields above.
   The values are stored as strings, even for flag, integer, and float types. VCF
   INFO key values with Number=., Number=0, Number=1, and Number=[n] are shared across
   all alternate alleles in the same VCF record. VCF INFO key values with Number=A and
   Number=R are split for multi-allelic sites.
   */
  map<string> attributes = {};

When converting VCF records to VariantAnnotation records, we assume that the first index of the array is the reference allele value, and use the alternate allele index to extract the value for the alternate allele.

##INFO=<ID=MY,Number=R,Type=Integer,Description="My Number=R attribute">
1  1024  .  G  A,T  .  PASS  MY=10,20,30
{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "A", "annotation":{"attributes": "MY=20"}}

{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "T", "annotation":{"attributes": "MY=30"}}

When converting from VariantAnnotation back to VCF records, we no longer have access to the reference allele value. I can think of at least four ways to handle this:

  1. Add a default value based on the Type
1  1024  .  G  A  .  PASS  MY=-1,20
1  1024  .  G  T  .  PASS  MY=-1,30
  2. Add the missing value
1  1024  .  G  A  .  PASS  MY=.,20
1  1024  .  G  T  .  PASS  MY=.,30
  3. Write as Number=R with the wrong cardinality
1  1024  .  G  A  .  PASS  MY=20
1  1024  .  G  T  .  PASS  MY=30
  4. Create Variant and VariantAnnotation records for the reference allele when splitting multi-allelic sites
{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "<*>", "annotation":{"attributes": "MY=10"}}

{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "A", "annotation":{"attributes": "MY=20"}}

{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "T", "annotation":{"attributes": "MY=30"}}
1  1024  .  G  <*>  .  PASS  MY=10
1  1024  .  G  A  .  PASS  MY=20
1  1024  .  G  T  .  PASS  MY=30
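
Option 2 above is the simplest to sketch: since the reference allele value is lost after splitting, emit the VCF missing value "." in its place when writing back out. The rejoin_number_r helper is hypothetical, for illustration only.

```python
def rejoin_number_r(alt_value):
    """Rebuild a Number=R INFO value for a biallelic output record,
    substituting the VCF missing value '.' for the unknown reference
    allele value (option 2 above)."""
    return f".,{alt_value}"
```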


For the Number=R VCF INFO attribute values that map to fields (currently AD, ADF, ADR) we have a couple more options:

1  1024  .  G  A,T  .  PASS  AD=10,20,30
  5. Use array<int> for the field type with cardinality 2
  array<int> readDepth = [];
  array<int> forwardReadDepth = [];
  array<int> reverseReadDepth = [];
{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "A", "annotation":{"readDepth": 10,20}}

{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "T", "annotation":{"readDepth": 10,30}}
1  1024  .  G  A  .  PASS  AD=10,20
1  1024  .  G  T  .  PASS  AD=10,30
  6. Add new reference value fields
  union { null, int } readDepth = null;
  union { null, int } forwardReadDepth = null;
  union { null, int } reverseReadDepth = null;

  union { null, int } referenceReadDepth = null;
  union { null, int } referenceForwardReadDepth = null;
  union { null, int } referenceReverseReadDepth = null;
{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "A", "annotation":{"readDepth": 20, "referenceReadDepth": 10}}

{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "A", "annotation":{"readDepth": 30, "referenceReadDepth": 10}}
1  1024  .  G  A  .  PASS  AD=10,20
1  1024  .  G  T  .  PASS  AD=10,30


A new option, combining option 5 above with retaining the reference allele value in the split attributes, as proposed below:

##INFO=<ID=MY,Number=R,Type=Integer,Description="My Number=R attribute">
1  1024  .  G  A,T  .  PASS  AD=5,15,25;MY=10,20,30
{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "A", "annotation":{"readDepth": [5,15], "attributes": "MY=10,20"}}

{"contigName": "1", "start": 1024, "referenceAllele": "G",
 "alternateAllele": "T", "annotation":{"readDepth": [5,25], "attributes": "MY=10,30"}}
1  1024  .  G  A  .  PASS  AD=5,15;MY=10,20
1  1024  .  G  T  .  PASS  AD=5,25;MY=10,30
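
Option 5's field split can be sketched as keeping a two-element [reference, alternate] array per alternate allele, so the reference value survives the round trip. The split_ad helper is hypothetical, for illustration only.

```python
def split_ad(values, alt_index):
    """Split an AD-style Number=R field into [ref_value, alt_value]
    for the given zero-based alternate allele index (option 5 above)."""
    return [values[0], values[alt_index + 1]]

# AD=5,15,25 for REF=G, ALT=A,T
ad = [5, 15, 25]
```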

Release fails under Java 8

Releasing bdg-formats fails under Java 8 because certain Javadoc warnings became errors in Java 8. We should look more closely to see whether the issues come from the comments we've written inline in our avdl, or from the Javadoc generated by Avro. As a workaround, we can cut releases using Java 7.

Proposal: VariantCallingAnnotations to be moved to Variant

This stems from the -onlyvariants flag we added to vcf2adam, which only writes out the variant information. IMO, annotations associated with a variant should be packaged with the variant, not the genotype. If you're denormalizing the variant information into the Genotype, you shouldn't denormalize these two pieces separately. This is annoying from the -onlyvariants perspective because, at the moment, it ends up storing minimal info on the variants, when what I really want to do is analyze the metadata on the variants. Thoughts?

Rename to bigdatagenomics/formats

Small nit: I feel like bigdatagenomics/bdg-formats is redundant. Would anyone be opposed to renaming to bigdatagenomics/formats? I will leave this open for a week and make the changes if no one is opposed.

Publish C/C++ artifacts

Publish the C/C++ artifacts, generated from the latest JSON (via bigdatagenomics.github.io?).

Harmonize Variant/VariantCallingAnnotations filters

Related to bigdatagenomics/adam#194, and #108. Specifically, this is a subset of #108 that I'd like to get into 0.10.0.

Variant has:

  /**
   True if filters were applied for this variant. VCF column 7 "FILTER" any value other
   than the missing value.
   */
  union { null, boolean } filtersApplied = null;

  /**
   True if all filters for this variant passed. VCF column 7 "FILTER" value PASS.
   */
  union { null, boolean } filtersPassed = null;

  /**
   Zero or more filters that failed for this variant. VCF column 7 "FILTER" shared across
   all alleles in the same VCF record.
   */
  array<string> filtersFailed = [];

While VariantCallingAnnotations has:

  // FILTER: True or false implies that filters were applied and this variant PASSed or not,
  // while 'null' implies that no filters were applied.
  union { null, boolean } variantIsPassing = null;
  array <string> variantFilters = [];

I'm going to make VariantCallingAnnotations match Variant.
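
The proposed harmonization amounts to a mapping from the old VariantCallingAnnotations fields (variantIsPassing, variantFilters) onto the Variant-style trio (filtersApplied, filtersPassed, filtersFailed). Field names follow the avdl excerpts above; the function itself is an illustrative sketch, not part of bdg-formats.

```python
def harmonize_filters(variant_is_passing, variant_filters):
    """Translate legacy VariantCallingAnnotations filter fields to the
    Variant-style filter fields.

    A variant_is_passing of None means no filters were applied.
    """
    if variant_is_passing is None:
        # No filters applied: all three fields stay at their defaults.
        return {"filtersApplied": None,
                "filtersPassed": None,
                "filtersFailed": []}
    return {"filtersApplied": True,
            "filtersPassed": variant_is_passing,
            "filtersFailed": list(variant_filters)}
```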

We need a proper README for bdg-formats

We should leave some top-level documentation in bdg-formats, as part of the README.

This should explain, among other things:

  • the .avdl file is the central set of data structures for ADAM and other downstream projects,
  • the actual classes are auto-generated by the Maven build,
  • the history of moving this out of adam-formats and into its own repository, and
  • links to adam, avocado, and any other projects that might use the data structures defined here.

Rename Contig objects

I was writing code that munged NucleotideContigFragments and realized that Contig should be changed to Reference and NucleotideContigFragment should be renamed to Contig or ContigFragment.

Revert back to 0.9.0

@heuermh will break out the variant/genotype changes into smaller chunks, so that we can roll them into ADAM downstream incrementally.

Avro version is out of sync with ADAM

ADAM is on Avro 1.7.7, while bdg-formats is on 1.7.4. I think this is causing some weird behavior with the Spark 1.5 stream of releases which pull in Avro 1.7.7.

Generate UML diagrams from source

Having a diagram of all the data structures in bdg-formats would help newcomers get started with the project.

Many tools, such as the open-source UMLGraph, can generate such a diagram automatically from Java sources.

I'll submit a pull request shortly that adds this feature.

Improve gVCF support

Specifically, better schema level support for:

  • Symbolic alts
  • Quality score ranges
  • Info END field

Revisit VariantCallingAnnotations

I wanted to revisit VariantCallingAnnotations. A lot of these fields seem GATK-specific, and the schema seems hard to extend with new annotations or variant calling output.

Might it make more sense to move most of these to attributes and improve VCF output from attributes (which doesn't seem to work currently)?
