bigdatagenomics / bdg-formats
Open source formats for scalable genomic processing systems using Avro. Apache 2 licensed.
License: Apache License 2.0
DatabaseVariantAnnotation to VariantAnnotation
VariantCallingAnnotations to GenotypeAnnotation
TranscriptEffect under VariantAnnotation
I have trouble accessing those "end" fields (e.g. AlignmentRecord.end, Variant.end) with Spark SQL, because end is a reserved keyword there and it conflicts with the field names.
I was wondering: is it possible to assign different names to those fields?
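As a side note, Spark SQL can escape reserved identifiers with backticks (e.g. SELECT `end` FROM variants). If renaming the field is preferred, a minimal sketch of one possible shape (the new name is purely illustrative, not an agreed-upon choice):

```avdl
/** Exclusive end position, renamed from "end" to avoid the Spark SQL
    reserved keyword; the name endPosition is only a suggestion. */
union { null, long } endPosition = null;
```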
Similar to 95f4b5b, we want to move the Contig record out of Variant and replace it with a string contigName field.
Seems like both of these are still hanging around, was there any resolution on this?
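A minimal sketch of the proposed shape, with the unrelated Variant fields elided:

```avdl
record Variant {
  /** The name of the reference sequence, replacing the embedded Contig record. */
  union { null, string } contigName = null;
  // ... remaining Variant fields unchanged ...
}
```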
I remember there was some discussion of converting the name of 'ADAMRecord' to 'Read' at the same time as we made the move to remove the 'ADAM' prefix from a lot of these schemas -- although I can't find that discussion now.
However, can I vote that we change the name of 'Read' back to 'Record'? If 'Record' is too generic, maybe something more specific like 'AlignmentRecord'.
The point is, the 'Read' schema actually bears a many-to-one relationship with the reads themselves (which will become clear, if we start parsing the FASTQ files directly in any context where we're parsing the raw data), and using a '[something]Record' name will (continue to) evoke the association with SAMRecord which is so clearly implied by the actual presence of fields with similar names and semantics.
Thoughts?
In bigdatagenomics/adam#815 we decided that the normalization provided by the Sequences in the Fragment record wasn't that useful and was somewhat hard to reason about.
Minor code style and doc fixes: Base enum.
The original position and original cigar flags are useful for describing the alignment of a read prior to realignment.
Along with #9
In the 0.1.1 implementation, the ADAMFeature contains a single field 'trackName', which the comments imply (for GFF/GTF files) contains both the 'feature' and 'source' values from the original record.
However, for parsing out GFF and GTF files into hierarchical or structured gene models, we're going to need to represent those two fields as separate values.
This is a ticket to capture our collective thinking around the re-organization of the Feature schema. The Feature schema needs to be edited to satisfy the following (additional) requirements:
I can't find the Pileup class file in the org.bdgenomics.formats.avro package. Then there is this error:
val pileups = reads.adamRecords2Pileup().cache()
:32: error: value adamRecords2Pileup is not a member of org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.AlignmentRecord]
val pileups = reads.adamRecords2Pileup().cache()
Please, how can I get the file and resolve the error?
Thanks!
jack xu
Currently we just have fragmentStartPosition and fragmentLength. To perform predicate filtering by a ReferenceRegion, we need a fragmentEndPosition field.
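A sketch of the addition; the types of the existing fields are assumed here, and the new field's name comes from the issue text:

```avdl
/** Existing fields (types assumed). */
union { null, long } fragmentStartPosition = null;
union { null, int } fragmentLength = null;

/** Proposed: exclusive end position, so a ReferenceRegion predicate can
    bound the fragment without computing start + length at query time. */
union { null, long } fragmentEndPosition = null;
```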
I can't remember the part of the discussion around #90 where alternateAllele in TranscriptEffect was dropped. As far as I can tell, this is still necessary when reading in variants to associate an effect with a specific alternate allele.
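A sketch of restoring the field (type assumed; other fields elided):

```avdl
record TranscriptEffect {
  /** Proposed: restore the alternate allele this effect applies to, so an
      effect can be matched to a specific allele at multi-allelic sites. */
  union { null, string } alternateAllele = null;
  // ... remaining TranscriptEffect fields unchanged ...
}
```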
The reference between Variant and VariantAnnotation should be forward so that it can be projected away.
The documentation for VariantAttribute fields states that Number=R VCF INFO attribute values are split for multi-allelic sites:
/**
Total read depth, VCF INFO reserved key AD, Number=R, split for multi-allelic
sites.
*/
union { null, int } readDepth = null;
/**
Forward strand read depth, VCF INFO reserved key ADF, Number=R, split for
multi-allelic sites.
*/
union { null, int } forwardReadDepth = null;
/**
Reverse strand read depth, VCF INFO reserved key ADR, Number=R, split for
multi-allelic sites.
*/
union { null, int } reverseReadDepth = null;
...
/**
Additional variant attributes that do not fit into the standard fields above.
The values are stored as strings, even for flag, integer, and float types. VCF
INFO key values with Number=., Number=0, Number=1, and Number=[n] are shared across
all alternate alleles in the same VCF record. VCF INFO key values with Number=A and
Number=R are split for multi-allelic sites.
*/
map<string> attributes = {};
When converting VCF records to VariantAnnotation records, we assume that the first index of the array is the reference allele value, and use the alternate allele index to extract the value for the alternate allele.
##INFO=<ID=MY,Number=R,Type=Integer,Description="My Number=R attribute">
1 1024 . G A,T . PASS MY=10,20,30
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"attributes": "MY=20"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"attributes": "MY=30"}}
When converting from VariantAnnotation back to VCF records, we no longer have access to the reference allele value. I can think of at least four ways to handle this:
1 1024 . G A . PASS MY=-1,20
1 1024 . G T . PASS MY=-1,30
1 1024 . G A . PASS MY=.,20
1 1024 . G T . PASS MY=.,30
1 1024 . G A . PASS MY=20
1 1024 . G T . PASS MY=30
Variant and VariantAnnotation records for the reference allele when splitting multi-allelic sites:
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "<*>", "annotation":{"attributes": "MY=10"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"attributes": "MY=20"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"attributes": "MY=30"}}
1 1024 . G <*> . PASS MY=10
1 1024 . G A . PASS MY=20
1 1024 . G T . PASS MY=30
For the Number=R VCF INFO attribute values that map to fields (currently AD, ADF, ADR) we have a couple more options:
1 1024 . G A,T . PASS AD=10,20,30
Use array<int> for the field type, with cardinality 2:
array<int> readDepth = [];
array<int> forwardReadDepth = [];
array<int> reverseReadDepth = [];
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": 10,20}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"readDepth": 10,30}}
1 1024 . G A . PASS AD=10,20
1 1024 . G T . PASS AD=10,30
union { null, int } readDepth = null;
union { null, int } forwardReadDepth = null;
union { null, int } reverseReadDepth = null;
union { null, int } referenceReadDepth = null;
union { null, int } referenceForwardReadDepth = null;
union { null, int } referenceReverseReadDepth = null;
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": 20, "referenceReadDepth": 10}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": 30, "referenceReadDepth": 10}}
1 1024 . G A . PASS AD=10,20
1 1024 . G T . PASS AD=10,30
A new option, combining with option 5 above, as proposed below:
##INFO=<ID=MY,Number=R,Type=Integer,Description="My Number=R attribute">
1 1024 . G A,T . PASS AD=5,15,25;MY=10,20,30
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "A", "annotation":{"readDepth": [5,15], "attributes": "MY=10,20"}}
{"contigName": "1", "start": 1024, "referenceAllele": "G",
"alternateAllele": "T", "annotation":{"readDepth": [5,25], "attributes": "MY=10,30"}}
1 1024 . G A . PASS AD=5,15;MY=10,20
1 1024 . G T . PASS AD=5,25;MY=10,30
Releasing bdg-formats fails under Java 8 because certain Javadoc warnings have changed to errors in Java 8. We should look closer to see if the issues are from the comments we've written inline in our avdl, or if it is caused by the Javadoc generated by Avro. To work around, we can just cut the releases using Java 7.
This stems from the -onlyvariants flag we added to vcf2adam, which only writes out the variant information. IMO, annotations associated with a variant should be packaged with the variant, not a genotype. If you're denormalizing the variant information into the Genotype, you shouldn't denormalize these two pieces separately. This is annoying from the -onlyvariants perspective, because at the moment this ends up storing minimal info on the variants, when what I really want to do is analyze the metadata on the variants. Thoughts?
We should add fields for:
We should also add a tag array.
See discussion on bigdatagenomics/adam#1103
I just noticed this field getting lost when converting into and out of BAMs with ADAM. Should we/I add it? Should we just infer it in ADAM? Or continue not supporting it?
See TLEN in the SAM spec. htsjdk puts it in SAMRecords as "inferred insert size".
Small nit: I feel like bigdatagenomics/bdg-formats is redundant. Would anyone be opposed to renaming to bigdatagenomics/formats? I will leave this open for a week and make the changes if no one is opposed.
@arahuja implied there had been some discussion around this in the past.
AFAICT there is no good way right now to capture there being two non-reference alleles at one locus, having one Variant per Genotype.
In the immediate term I will work around this by emitting two lines / Variants in my VCFs, but I'm curious whether we should support the other way here. Thanks!
Publish the C/C++ artifacts, generated from the latest JSON (via bigdatagenomics.github.io?).
To support things like random barcodes, drop-seq, etc.
Followup to #44.
Related to bigdatagenomics/adam#194, and #108. Specifically, this is a subset of #108 that I'd like to get into 0.10.0.
Variant has:
/**
True if filters were applied for this variant. VCF column 7 "FILTER" any value other
than the missing value.
*/
union { null, boolean } filtersApplied = null;
/**
True if all filters for this variant passed. VCF column 7 "FILTER" value PASS.
*/
union { null, boolean } filtersPassed = null;
/**
Zero or more filters that failed for this variant. VCF column 7 "FILTER" shared across
all alleles in the same VCF record.
*/
array<string> filtersFailed = [];
While VariantCallingAnnotations has:
// FILTER: True or false implies that filters were applied and this variant PASSed or not,
// while 'null' implies no filters were applied.
union { null, boolean } variantIsPassing = null;
array <string> variantFilters = [];
I'm going to make VariantCallingAnnotations match Variant.
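Mirroring the Variant fields quoted above, the aligned VariantCallingAnnotations would look like (other fields elided):

```avdl
record VariantCallingAnnotations {
  /** True if filters were applied for this genotype call. */
  union { null, boolean } filtersApplied = null;
  /** True if all filters for this genotype call passed. */
  union { null, boolean } filtersPassed = null;
  /** Zero or more filters that failed for this genotype call. */
  array<string> filtersFailed = [];
  // ... remaining VariantCallingAnnotations fields unchanged ...
}
```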
We should leave some top-level documentation in bdg-formats, as part of the README.
This should explain, among other things, that
May contain incompatible API changes.
Source diff: apache/avro@release-1.7.7...release-1.8.0
I was writing code that munged NucleotideContigFragments and realized that Contig should be changed to Reference and NucleotideContigFragment should be renamed to Contig or ContigFragment.
@heuermh will break out the variant/genotype changes into smaller chunks, so that we can roll them into ADAM downstream incrementally.
ADAM is on Avro 1.7.7, while bdg-formats is on 1.7.4. I think this is causing some weird behavior with the Spark 1.5 stream of releases which pull in Avro 1.7.7.
Having a diagram of all the data structures of bdg-formats would help a newcomer get started in the project.
Many tools allow us to generate such a diagram automatically from Java sources, like UMLGraph, which is open source.
I'll submit a pull request shortly that achieves this feature.
Similar to the VariantCallingAnnotations record (see #51), I think the DatabaseVariantAnnotation record needs some TLC.
#103 removed the StructuralVariant and StructuralVariantType records; this issue suggests we might want to reconsider that decision.
Specifically, better schema-level support for the END field.
Am I missing this somewhere, or is the sample name not stored in AlignmentRecord? I'm looking for something equivalent to SAMRecord.getReadGroup.getSample from https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMReadGroupRecord.java#L70
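One possible shape, denormalizing the read group's sample onto the read (the field name here is only a suggestion):

```avdl
/** The sample this read belongs to, copied from its read group; analogous
    to SAMRecord.getReadGroup.getSample in htsjdk. */
union { null, string } recordGroupSample = null;
```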
per discussion at bigdatagenomics/adam#1290
I wanted to revisit VariantCallingAnnotations. It seems a lot of these fields are GATK-specific, and it seems hard to extend to add new annotations or variant calling output.
Might it make more sense to move most of these to attributes, and improve VCF output from attributes (which it doesn't seem like it does currently)?
As raised on bigdatagenomics/adam#815, the readNum field of AlignmentRecord is a bit vaguely named. E.g., readNum could be read as a UUID, instead of the number of the read in the fragment.
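One possible clearer name, sketched below (the name and type are illustrative only):

```avdl
/** The 0-based position of this read within its sequenced fragment, e.g.
    0 for the first read of a pair; clearer than readNum, which could be
    mistaken for an identifier. */
union { null, int } readInFragment = null;
```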
The last revision to the Pileup schema removed too many fields; we need the sampleId field (or its equivalent) added back in.
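A sketch of restoring the field (type assumed; other Pileup fields elided):

```avdl
record Pileup {
  /** Proposed: restore the identifier of the sample this pileup observation
      came from. */
  union { null, string } sampleId = null;
  // ... remaining Pileup fields ...
}
```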