Code Monkey home page Code Monkey logo

Comments (32)

fnothaft avatar fnothaft commented on August 16, 2024

+1

Jeff Hammerbacher [email protected] wrote:

The scope of work that's happening within the ADAM project has grown to encompass the entire pipeline of genomics data management, from unmapped short reads through to annotated variants. Within this new context, the name "ADAMRecord" is too generic. Names like "ADAMPileup" and "ADAMVariant" are more descriptive and useful for end users.

Alternatively, given that Avro supports namespaces, we may want to consider dropping the "ADAM" prefix for all of these types. That's a larger conversation, of course. I'm curious first to get opinions on the specific issue of "ADAMRecord".


Reply to this email directly or view it on GitHub:
#95

The scope of work that's happening within the ADAM project has grown to encompass the entire pipeline of genomics data management, from unmapped short reads through to annotated variants. Within this new context, the name "ADAMRecord" is too generic. Names like "ADAMPileup" and "ADAMVariant" are more descriptive and useful for end users.

Alternatively, given that Avro supports namespaces, we may want to consider dropping the "ADAM" prefix for all of these types. That's a larger conversation, of course. I'm curious first to get opinions on the specific issue of "ADAMRecord".


Reply to this email directly or view it on GitHub.

from adam.

hammer avatar hammer commented on August 16, 2024

Thanks for the quick response, Frank. I'll handle the code changes once I hear from a few more people. I need to get a commit into this repo, and this refactoring is one I can handle.

from adam.

tdanford avatar tdanford commented on August 16, 2024

-0

Obviously the name doesn't have any functional significance, so if everyone
feels differently I won't object too loudly. But FWIW, (a) I think
ADAMRecord nicely mirrors the "SAMRecord" class whose interface it mirrors,
and (b) I think the "Aligned" element would be confusing, since it's
entirely possible to have un-aligned records. It's even reasonable to have
a BAM file with no alignments whatsoever, we build those in our pipeline
semi-frequently.

But if others feel differently, like I said, no blockers from me.

On Mon, Feb 10, 2014 at 7:40 PM, Jeff Hammerbacher <[email protected]

wrote:

The scope of work that's happening within the ADAM project has grown to
encompass the entire pipeline of genomics data management, from unmapped
short reads through to annotated variants. Within this new context, the
name "ADAMRecord" is too generic. Names like "ADAMPileup" and "ADAMVariant"
are more descriptive and useful for end users.

Alternatively, given that Avro supports namespaces, we may want to
consider dropping the "ADAM" prefix for all of these types. That's a larger
conversation, of course. I'm curious first to get opinions on the specific
issue of "ADAMRecord".


Reply to this email directly or view it on GitHubhttps://github.com//issues/95
.

from adam.

nealsid avatar nealsid commented on August 16, 2024

+1
IMO, the difference is that ADAM encompasses much more than SAM does and
isn't suitable for a prefix for something that doesn't clarify it further.
Maybe we could call it ADAMRead, which could encompass both aligned &
unaligned reads?

Neal

On Mon, Feb 10, 2014 at 7:47 PM, Timothy Danford
[email protected]:

-0

Obviously the name doesn't have any functional significance, so if
everyone
feels differently I won't object too loudly. But FWIW, (a) I think
ADAMRecord nicely mirrors the "SAMRecord" class whose interface it
mirrors,
and (b) I think the "Aligned" element would be confusing, since it's
entirely possible to have un-aligned records. It's even reasonable to have
a BAM file with no alignments whatsoever, we build those in our pipeline
semi-frequently.

But if others feel differently, like I said, no blockers from me.

On Mon, Feb 10, 2014 at 7:40 PM, Jeff Hammerbacher <
[email protected]

wrote:

The scope of work that's happening within the ADAM project has grown to
encompass the entire pipeline of genomics data management, from unmapped
short reads through to annotated variants. Within this new context, the
name "ADAMRecord" is too generic. Names like "ADAMPileup" and
"ADAMVariant"
are more descriptive and useful for end users.

Alternatively, given that Avro supports namespaces, we may want to
consider dropping the "ADAM" prefix for all of these types. That's a
larger
conversation, of course. I'm curious first to get opinions on the
specific
issue of "ADAMRecord".


Reply to this email directly or view it on GitHub<
https://github.com/bigdatagenomics/adam/issues/95>
.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34715792
.

from adam.

hammer avatar hammer commented on August 16, 2024

@tdanford I'm happy to hear alternative names for this type. The name "SAMRecord" makes sense within the context of the SAMtools project. ADAM's scope is larger than the SAMtools project, and we have many kinds of Records, so I don't think the name should persist.

What is the advantage of a BAM file with no alignments? Is it simply because BAM files have compression and indexing? I suspect you are abusing the BAM container format, something that we don't need to do in ADAM because of Avro and Parquet.

from adam.

tdanford avatar tdanford commented on August 16, 2024

Like I said, I won't stand in a way of a name change that other folks think
is appropriate. If you guys all think it should be something else, that's
good enough for me.

I suspect you are abusing the BAM container format

Yup, but it's not a choice I made :-)

something that we don't need to do in ADAM because of Avro and Parquet.

Agreed, yet another reason why I'm excited about ADAM.

On Mon, Feb 10, 2014 at 7:52 PM, Jeff Hammerbacher <[email protected]

wrote:

@tdanford https://github.com/tdanford I'm happy to hear alternative
names for this type. The name "SAMRecord" makes sense within the context of
the SAMtools project. ADAM's scope is larger than the SAMtools project, and
we have many kinds of Records, so I don't think the name should persist.

What is the advantage of a BAM file with no alignments? Is it simply
because BAM files have compression and indexing? I suspect you are abusing
the BAM container format, something that we don't need to do in ADAM
because of Avro and Parquet.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34716087
.

from adam.

laserson avatar laserson commented on August 16, 2024

+1 for removing ADAM prefix. All these types are inside the ADAM project.
+1 for changing Record. It's too broad, and any of the other types is, in principle, a "record". "Read" is a good suggestion.

from adam.

tdanford avatar tdanford commented on August 16, 2024

{SAM,ADAM}Records aren't in 1-1 correspondence with reads though, right?

On Mon, Feb 10, 2014 at 8:38 PM, Uri Laserson [email protected]:

+1 for removing ADAM prefix. All these types are inside the ADAM project.
+1 for changing Record. It's too broad, and any of the other types is, in
principle, a "record". "Read" is a good suggestion.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34718611
.

from adam.

jey avatar jey commented on August 16, 2024

+0

I am in favor of the overall objective of having clear naming and an expressive type system. I'm against renaming "ADAMRecord" to "ADAMAlignedRead" because these records are not necessarily aligned, but something like "ADAMRead" or (with namespacing) "Read" would be fine by me.

I don't think we should feel too influenced by the existing naming/design conventions in other projects such as SAMtools -- let's be inspired by their features and learn from their bugs. ;-)

Somewhat unrelated, but I've started writing a wrapper class for Reads that abstracts away some of the implementation details of the on-disk serialization format, so that code dealing with reads can be written in a more semantically-oriented (rather than data-oriented) manner.

from adam.

jey avatar jey commented on August 16, 2024

@tdanford, yeah, there are a couple edge cases, such as "duplicate" records corresponding to a single read, e.g. if the aligner found multiple candidate alignments for a given read. I still think calling it a "Read" is a pretty good compromise.

from adam.

tdanford avatar tdanford commented on August 16, 2024

I'm not sure these are edge-cases, they're a pretty big part of dealing
consistently with ADAMRecords as currently constituted. That's what the
SingleReadBucket and ReadBucket classes are there for, and I usually think
of those as being synonymous with "a read." Carl and I spent a fair bit
of time thinking about the relationship between reads and records for the
comparison PR, and I know others have too.

I'm all for consistent naming and (again) I really don't care if people
want to rename ADAMRecord. I do find myself having to explain what the
name means to (e.g.) people here at Genome Bridge who've heard of ADAM but
aren't familiar with Picard or other existing libraries for SAM/BAM
manipulation -- so maybe changing it would make that job easier!

I don't think we should feel too influenced by the existing naming/design
conventions in other projects such as SAMtools

Completely agree, although in this case we should be pretty clear that
ADAMRecord is basically a straight-up reimplementation of SAMRecord, most
of the field names are identical or nearly so (modulo folding in the seq
dict, flipping the sign on a few of the bit flags, the handling of the
cigar strings where we just use the corresponding Picard types and codecs,
etc.).

On Mon, Feb 10, 2014 at 8:45 PM, Jey Kottalam [email protected]:

@tdanford https://github.com/tdanford, yeah, there are a couple edge
cases, such as "duplicate" records corresponding to a single read, e.g. if
the aligner found multiple candidate alignments for a given read. I still
think calling it a "Read" is a pretty good compromise.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34718995
.

from adam.

massie avatar massie commented on August 16, 2024

Timothy is correct. I chose the name to mimic the name "SAMRecord".
Actually, the name ADAM was a play on SAM too, e.g. Sequence Alignment/Map
(SAM), Avro Datafile Alignment/Map (ADAM). We'll have to come up with a
backronym for ADAM later since we've outgrown it.

While it's true that Java provides namespace, I still think we should keep
the ADAM prefix. It's likely the names of our datatypes will collide with
other systems, e.g. Record, Alignment, Base, Variant, etc. These collisions
will require an explicit namespace in the body of the source not just in
the imports, e.g. "adam.Record", "adam.Alignment" which is, in my opinion,
less discernible from ADAMRecord, ADAMAlignment to read.

I'm ok with changing ADAMRecord to ADAMRead but I don't see a strong reason
to do that.

I do see value in rethinking the way we're storing aligned reads if that's
what's driving this. But whatever changes we make, we need to make it easy
for people to rerun aligners on the all the reads again (since usually ~1%
of reads can't be aligned and realigning data is common). Keeping aligned
and unaligned reads together requires either (1) redefine a top-level
nested schema to hold both aligned and unaligned reads or (2) write aligned
and unaligned reads to two different locations (e.g. partition file
prefixes or directories). If we split up the data this way, then could have
alignment specific definitions instead of relying on the flags. However,
keep this in mind -- there are also primary and secondary alignments -- do
we split those up too or just use a flag? Oddly, there's also the
possibility of multiple secondary alignments.

-Matt

On Mon, Feb 10, 2014 at 4:54 PM, Timothy Danford
[email protected]:

Like I said, I won't stand in a way of a name change that other folks
think
is appropriate. If you guys all think it should be something else, that's
good enough for me.

I suspect you are abusing the BAM container format

Yup, but it's not a choice I made :-)

something that we don't need to do in ADAM because of Avro and Parquet.

Agreed, yet another reason why I'm excited about ADAM.

On Mon, Feb 10, 2014 at 7:52 PM, Jeff Hammerbacher <
[email protected]

wrote:

@tdanford https://github.com/tdanford I'm happy to hear alternative
names for this type. The name "SAMRecord" makes sense within the context
of
the SAMtools project. ADAM's scope is larger than the SAMtools project,
and
we have many kinds of Records, so I don't think the name should persist.

What is the advantage of a BAM file with no alignments? Is it simply
because BAM files have compression and indexing? I suspect you are
abusing
the BAM container format, something that we don't need to do in ADAM
because of Avro and Parquet.

Reply to this email directly or view it on GitHub<
https://github.com/bigdatagenomics/adam/issues/95#issuecomment-34716087>
.

Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34716197
.

from adam.

massie avatar massie commented on August 16, 2024

Just to be a little more concrete here, the FASTQ files which are output by sequencers, provide the read name, sequence and qualities. If we're ok with a nested schema (breaks integration), we could process these FASTQ reads (matching pairs) and add an addition array of "alignment" types to each (PE) read. Zero alignments would imply the read(s) is not aligned and one implies only a single (primary) alignment. The "alignment" type would contain the reference ID and position of the alignment.

@tdanford points out an important point that the SingleReadBucket and ReadBucket play the role of this higher-level structure for now.

from adam.

tdanford avatar tdanford commented on August 16, 2024

In the interest of adding something constructive and not just being the
only consistently negative voice on this thread...

If we were intent on renaming ADAMRecord, I'd vote for something along the
lines of, renaming it to AlignmentRecord (or ADAMAlignmentRecord, by Matt's
namespacing conventions), since it does seem to me to logically be "a
record of an alignment."

Then we should do some work to unify "SingleReadBucket" and "ReadBucket"
into a single, "Read" or "ADAMRead" class. Backing that with a separate,
nested Avro schema seems like the next logical step. We could still
de-normalize the read-level data (like the sequence, qualities, etc) down
into the AlignmentRecords too for convenience, as we have now.

That'd put us in a situation where our tools could either take a flat
collection of AlignmentRecords, or a nested collection of Reads with
AlignmentRecords grouped within them. This seems (to me) to be analogous to
existing tools that either take the BAM in read- or coordinate-sorted order.

On Mon, Feb 10, 2014 at 9:24 PM, Matt Massie [email protected]:

Just to be a little more concrete here, the FASTQ fileshttp://en.wikipedia.org/wiki/FASTQ_formatwhich are output by sequencers, provide the read name, sequence and
qualities. If we're ok with a nested schema (breaks integration), we could
process these FASTQ reads (matching pairs) and add an addition array of
"alignment" types to each (PE) read. Zero alignments would imply the
read(s) is not aligned and one implies only a single (primary) alignment.
The "alignment" type would contain the reference ID and position of the
alignment.

@tdanford https://github.com/tdanford points out an important point
that the SingleReadBucket and ReadBucket play the role of this higher-level
structure for now.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34720884
.

from adam.

massie avatar massie commented on August 16, 2024

In the interest of adding something constructive and not just being the
only consistently negative voice on this thread...

I didn't see one bit of negativity in your responses. Please feel free to keep your informed and constructive objections coming!

from adam.

fnothaft avatar fnothaft commented on August 16, 2024

Then we should do some work to unify "SingleReadBucket" and "ReadBucket"
into a single, "Read" or "ADAMRead" class. Backing that with a separate,
nested Avro schema seems like the next logical step. We could still
de-normalize the read-level data (like the sequence, qualities, etc) down
into the AlignmentRecords too for convenience, as we have now.

I agree with this proposal. It’s especially good for things like RNA-seq data, and etc. It’d be even nicer if we could refactor things hierarchically to do what Uri talked about a bit back (having specific alphabets).

I'm ok with changing ADAMRecord to ADAMRead but I don't see a strong reason
to do that.

When working with data in all representations, it’s really counterintuitive to refer to the read data as “Record” data.

Frank Austin Nothaft
[email protected]
202-340-0466

On Feb 10, 2014, at 6:32 PM, Timothy Danford [email protected] wrote:

In the interest of adding something constructive and not just being the
only consistently negative voice on this thread...

If we were intent on renaming ADAMRecord, I'd vote for something along the
lines of, renaming it to AlignmentRecord (or ADAMAlignmentRecord, by Matt's
namespacing conventions), since it does seem to me to logically be "a
record of an alignment."

Then we should do some work to unify "SingleReadBucket" and "ReadBucket"
into a single, "Read" or "ADAMRead" class. Backing that with a separate,
nested Avro schema seems like the next logical step. We could still
de-normalize the read-level data (like the sequence, qualities, etc) down
into the AlignmentRecords too for convenience, as we have now.

That'd put us in a situation where our tools could either take a flat
collection of AlignmentRecords, or a nested collection of Reads with
AlignmentRecords grouped within them. This seems (to me) to be analogous to
existing tools that either take the BAM in read- or coordinate-sorted order.

On Mon, Feb 10, 2014 at 9:24 PM, Matt Massie [email protected]:

Just to be a little more concrete here, the FASTQ fileshttp://en.wikipedia.org/wiki/FASTQ_formatwhich are output by sequencers, provide the read name, sequence and
qualities. If we're ok with a nested schema (breaks integration), we could
process these FASTQ reads (matching pairs) and add an addition array of
"alignment" types to each (PE) read. Zero alignments would imply the
read(s) is not aligned and one implies only a single (primary) alignment.
The "alignment" type would contain the reference ID and position of the
alignment.

@tdanford https://github.com/tdanford points out an important point
that the SingleReadBucket and ReadBucket play the role of this higher-level
structure for now.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34720884
.


Reply to this email directly or view it on GitHub.

from adam.

fnothaft avatar fnothaft commented on August 16, 2024

Anywho, a modest proposal for a schema for this new read:

record ADAMRead {

   array<Base> forwardStrandSequence = [];
   array<byte> forwardStrandQualities = [];
   array<Base> reverseStrandSequence = [];
   array<byte> reverseStrandQualities = [];

   array<ADAMAlignment> forwardStrandAlignments = [];
   array<ADAMAlignment> reverseStrandAlignments = [];

   // record group identifer from sequencing run
   union { null, string } recordGroupSequencingCenter = null;
   union { null, string } recordGroupDescription = null;
   union { null, long } recordGroupRunDateEpoch = null;
   union { null, string } recordGroupFlowOrder = null;
   union { null, string } recordGroupKeySequence = null;
   union { null, string } recordGroupLibrary = null;
   union { null, int } recordGroupPredictedMedianInsertSize = null;
   union { null, string } recordGroupPlatform = null;
   union { null, string } recordGroupPlatformUnit = null;
   union { null, string } recordGroupSample = null;

   // flags
   union { boolean, null } failedVendorQualityChecks = false;
   union { boolean, null } duplicateRead = false;
   union { null, string } attributes = null;

}

record ADAMAlignment {
array alignmentBlocks;
}

enum Alignment {
Match,
Insertion,
Deletion,
Clip
}

record ADAMAlignmentBlock {

   // alignment location parameters
   union { null, int } alignmentPosition = null;
   union { null, string } referenceName = null;
   union { null, int } referenceId = null;
   union { null, long } referenceLength = null;
   union { null, string } referenceUrl = null;

   // alignment description
   union { null, Alignment } alignmentType = null;
   map<Base> changedPositions = [];

}

Thoughts? Attributes like whether a read has a mate pair or is mapped would be computed dynamically by a rich class. By having alignments which are constructed out of alignment blocks, we would:

  1. Support split alignments —> which would allow us to cleanly support both split aligned reads and RNA-seq data
  2. Eliminate both CIGARs and MD tags.

Frank Austin Nothaft
[email protected]
202-340-0466

On Feb 10, 2014, at 6:57 PM, Frank Austin Nothaft [email protected] wrote:

Then we should do some work to unify "SingleReadBucket" and "ReadBucket"
into a single, "Read" or "ADAMRead" class. Backing that with a separate,
nested Avro schema seems like the next logical step. We could still
de-normalize the read-level data (like the sequence, qualities, etc) down
into the AlignmentRecords too for convenience, as we have now.

I agree with this proposal. It’s especially good for things like RNA-seq data, and etc. It’d be even nicer if we could refactor things hierarchically to do what Uri talked about a bit back (having specific alphabets).

I'm ok with changing ADAMRecord to ADAMRead but I don't see a strong reason
to do that.

When working with data in all representations, it’s really counterintuitive to refer to the read data as “Record” data.

Frank Austin Nothaft
[email protected]
202-340-0466

On Feb 10, 2014, at 6:32 PM, Timothy Danford [email protected] wrote:

In the interest of adding something constructive and not just being the
only consistently negative voice on this thread...

If we were intent on renaming ADAMRecord, I'd vote for something along the
lines of, renaming it to AlignmentRecord (or ADAMAlignmentRecord, by Matt's
namespacing conventions), since it does seem to me to logically be "a
record of an alignment."

Then we should do some work to unify "SingleReadBucket" and "ReadBucket"
into a single, "Read" or "ADAMRead" class. Backing that with a separate,
nested Avro schema seems like the next logical step. We could still
de-normalize the read-level data (like the sequence, qualities, etc) down
into the AlignmentRecords too for convenience, as we have now.

That'd put us in a situation where our tools could either take a flat
collection of AlignmentRecords, or a nested collection of Reads with
AlignmentRecords grouped within them. This seems (to me) to be analogous to
existing tools that either take the BAM in read- or coordinate-sorted order.

On Mon, Feb 10, 2014 at 9:24 PM, Matt Massie [email protected]:

Just to be a little more concrete here, the FASTQ fileshttp://en.wikipedia.org/wiki/FASTQ_formatwhich are output by sequencers, provide the read name, sequence and
qualities. If we're ok with a nested schema (breaks integration), we could
process these FASTQ reads (matching pairs) and add an addition array of
"alignment" types to each (PE) read. Zero alignments would imply the
read(s) is not aligned and one implies only a single (primary) alignment.
The "alignment" type would contain the reference ID and position of the
alignment.

@tdanford https://github.com/tdanford points out an important point
that the SingleReadBucket and ReadBucket play the role of this higher-level
structure for now.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34720884
.


Reply to this email directly or view it on GitHub.

from adam.

hammer avatar hammer commented on August 16, 2024

Thanks for all of the thoughtful feedback, everyone.

There are three issues being discussed right now:

  1. Renaming ADAMRecord type.
  2. Removing the "ADAM" prefix from all types.
  3. Refactoring the ADAMRecord, SingleReadBucket, and ReadBucket into ADAMAlignment and ADAMRead types.

Matt's made it clear that 2 is not a good idea. So, we can resolve that discussion.

I haven't examined the SingleRead types so I can't contribute to 3.

As for 2: from my perspective, there's no value to the word "Record" in our type names. It's like appending "Object" to a Java class name, or "Struct" to a C struct variable--it's redundant. If we want to store information about an alignment, we should call the type "ADAMAlignment", not "ADAMAlignmentRecord".

from adam.

nealsid avatar nealsid commented on August 16, 2024

Thanks for putting that together, Frank.
Just a small nit/FYI (and to add another issue into the many we're
discussing here :)), Michael & I thought it'd be better to move away from
array, towards representing strands/alleles as strings to take
advantage of string interning and common string operations for alleles.
Thoughts?

Neal

On Tue, Feb 11, 2014 at 1:14 PM, Frank Austin Nothaft <
[email protected]> wrote:

Anywho, a modest proposal for a schema for this new read:

record ADAMRead {

array forwardStrandSequence = [];
array forwardStrandQualities = [];
array reverseStrandSequence = [];
array reverseStrandQualities = [];

array forwardStrandAlignments = [];
array reverseStrandAlignments = [];

// record group identifer from sequencing run
union { null, string } recordGroupSequencingCenter = null;
union { null, string } recordGroupDescription = null;
union { null, long } recordGroupRunDateEpoch = null;
union { null, string } recordGroupFlowOrder = null;
union { null, string } recordGroupKeySequence = null;
union { null, string } recordGroupLibrary = null;
union { null, int } recordGroupPredictedMedianInsertSize = null;
union { null, string } recordGroupPlatform = null;
union { null, string } recordGroupPlatformUnit = null;
union { null, string } recordGroupSample = null;

// flags
union { boolean, null } failedVendorQualityChecks = false;
union { boolean, null } duplicateRead = false;
union { null, string } attributes = null;

}

record ADAMAlignment {
array alignmentBlocks;
}

enum Alignment {
Match,
Insertion,
Deletion,
Clip
}

record ADAMAlignmentBlock {

// alignment location parameters
union { null, int } alignmentPosition = null;
union { null, string } referenceName = null;
union { null, int } referenceId = null;
union { null, long } referenceLength = null;
union { null, string } referenceUrl = null;

// alignment description
union { null, Alignment } alignmentType = null;
map changedPositions = [];
}

Thoughts? Attributes like whether a read has a mate pair or is mapped
would be computed dynamically by a rich class. By having alignments which
are constructed out of alignment blocks, we would:

  1. Support split alignments —> which would allow us to cleanly support
    both split aligned reads and RNA-seq data
  2. Eliminate both CIGARs and MD tags.

Frank Austin Nothaft
[email protected]
202-340-0466

On Feb 10, 2014, at 6:57 PM, Frank Austin Nothaft <
[email protected]> wrote:

Then we should do some work to unify "SingleReadBucket" and
"ReadBucket"
into a single, "Read" or "ADAMRead" class. Backing that with a
separate,
nested Avro schema seems like the next logical step. We could still
de-normalize the read-level data (like the sequence, qualities, etc)
down
into the AlignmentRecords too for convenience, as we have now.

I agree with this proposal. It’s especially good for things like RNA-seq
data, and etc. It’d be even nicer if we could refactor things
hierarchically to do what Uri talked about a bit back (having specific
alphabets).

I'm ok with changing ADAMRecord to ADAMRead but I don't see a strong
reason
to do that.

When working with data in all representations, it’s really
counterintuitive to refer to the read data as “Record” data.

Frank Austin Nothaft
[email protected]
202-340-0466

On Feb 10, 2014, at 6:32 PM, Timothy Danford [email protected]
wrote:

In the interest of adding something constructive and not just being the
only consistently negative voice on this thread...

If we were intent on renaming ADAMRecord, I'd vote for something along
the
lines of, renaming it to AlignmentRecord (or ADAMAlignmentRecord, by
Matt's
namespacing conventions), since it does seem to me to logically be "a
record of an alignment."

Then we should do some work to unify "SingleReadBucket" and
"ReadBucket"
into a single, "Read" or "ADAMRead" class. Backing that with a
separate,
nested Avro schema seems like the next logical step. We could still
de-normalize the read-level data (like the sequence, qualities, etc)
down
into the AlignmentRecords too for convenience, as we have now.

That'd put us in a situation where our tools could either take a flat
collection of AlignmentRecords, or a nested collection of Reads with
AlignmentRecords grouped within them. This seems (to me) to be
analogous to
existing tools that either take the BAM in read- or coordinate-sorted
order.

On Mon, Feb 10, 2014 at 9:24 PM, Matt Massie [email protected]:

Just to be a little more concrete here, the FASTQ files<
http://en.wikipedia.org/wiki/FASTQ_format>which are output by sequencers,
provide the read name, sequence and
qualities. If we're ok with a nested schema (breaks integration), we
could
process these FASTQ reads (matching pairs) and add an addition array
of
"alignment" types to each (PE) read. Zero alignments would imply the
read(s) is not aligned and one implies only a single (primary)
alignment.
The "alignment" type would contain the reference ID and position of
the
alignment.

@tdanford https://github.com/tdanford points out an important
point
that the SingleReadBucket and ReadBucket play the role of this
higher-level
structure for now.


Reply to this email directly or view it on GitHub<
https://github.com/bigdatagenomics/adam/issues/95#issuecomment-34720884>
.


Reply to this email directly or view it on GitHub.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34786137
.

from adam.

hammer avatar hammer commented on August 16, 2024

Also I'd suggest opening new issues for the "Refactoring the ADAMRecord, SingleReadBucket, and ReadBucket into ADAMAlignment and ADAMRead types" and the "representing strands/alleles as strings" discussions, so that we can focus this discussion on the ADAMRecord renaming.

from adam.

fnothaft avatar fnothaft commented on August 16, 2024

Neal,

I’m not picky—strings are also easier to logically query. My only strong preference is that whatever we pick, we are consistent in it’s use.

Regards,

Frank Austin Nothaft
[email protected]
202-340-0466

On Feb 11, 2014, at 11:38 AM, Neal [email protected] wrote:

Thanks for putting that together, Frank.
Just a small nit/FYI (and to add another issue into the many we're
discussing here :)), Michael & I thought it'd be better to move away from
array, towards representing strands/alleles as strings to take
advantage of string interning and common string operations for alleles.
Thoughts?

Neal

On Tue, Feb 11, 2014 at 1:14 PM, Frank Austin Nothaft <
[email protected]> wrote:

Anywho, a modest proposal for a schema for this new read:

record ADAMRead {

array forwardStrandSequence = [];
array forwardStrandQualities = [];
array reverseStrandSequence = [];
array reverseStrandQualities = [];

array forwardStrandAlignments = [];
array reverseStrandAlignments = [];

// record group identifer from sequencing run
union { null, string } recordGroupSequencingCenter = null;
union { null, string } recordGroupDescription = null;
union { null, long } recordGroupRunDateEpoch = null;
union { null, string } recordGroupFlowOrder = null;
union { null, string } recordGroupKeySequence = null;
union { null, string } recordGroupLibrary = null;
union { null, int } recordGroupPredictedMedianInsertSize = null;
union { null, string } recordGroupPlatform = null;
union { null, string } recordGroupPlatformUnit = null;
union { null, string } recordGroupSample = null;

// flags
union { boolean, null } failedVendorQualityChecks = false;
union { boolean, null } duplicateRead = false;
union { null, string } attributes = null;

}

record ADAMAlignment {
array alignmentBlocks;
}

enum Alignment {
Match,
Insertion,
Deletion,
Clip
}

record ADAMAlignmentBlock {

// alignment location parameters
union { null, int } alignmentPosition = null;
union { null, string } referenceName = null;
union { null, int } referenceId = null;
union { null, long } referenceLength = null;
union { null, string } referenceUrl = null;

// alignment description
union { null, Alignment } alignmentType = null;
map changedPositions = [];
}

Thoughts? Attributes like whether a read has a mate pair or is mapped
would be computed dynamically by a rich class. By having alignments which
are constructed out of alignment blocks, we would:

  1. Support split alignments —> which would allow us to cleanly support
    both split aligned reads and RNA-seq data
  2. Eliminate both CIGARs and MD tags.

Frank Austin Nothaft
[email protected]
202-340-0466

On Feb 10, 2014, at 6:57 PM, Frank Austin Nothaft <
[email protected]> wrote:

Then we should do some work to unify "SingleReadBucket" and
"ReadBucket"
into a single, "Read" or "ADAMRead" class. Backing that with a
separate,
nested Avro schema seems like the next logical step. We could still
de-normalize the read-level data (like the sequence, qualities, etc)
down
into the AlignmentRecords too for convenience, as we have now.

I agree with this proposal. It’s especially good for things like RNA-seq
data, and etc. It’d be even nicer if we could refactor things
hierarchically to do what Uri talked about a bit back (having specific
alphabets).

I'm ok with changing ADAMRecord to ADAMRead but I don't see a strong
reason
to do that.

When working with data in all representations, it’s really
counterintuitive to refer to the read data as “Record” data.

Frank Austin Nothaft
[email protected]
202-340-0466

On Feb 10, 2014, at 6:32 PM, Timothy Danford [email protected]
wrote:

In the interest of adding something constructive and not just being the
only consistently negative voice on this thread...

If we were intent on renaming ADAMRecord, I'd vote for something along
the
lines of, renaming it to AlignmentRecord (or ADAMAlignmentRecord, by
Matt's
namespacing conventions), since it does seem to me to logically be "a
record of an alignment."

Then we should do some work to unify "SingleReadBucket" and
"ReadBucket"
into a single, "Read" or "ADAMRead" class. Backing that with a
separate,
nested Avro schema seems like the next logical step. We could still
de-normalize the read-level data (like the sequence, qualities, etc)
down
into the AlignmentRecords too for convenience, as we have now.

That'd put us in a situation where our tools could either take a flat
collection of AlignmentRecords, or a nested collection of Reads with
AlignmentRecords grouped within them. This seems (to me) to be
analogous to
existing tools that either take the BAM in read- or coordinate-sorted
order.

On Mon, Feb 10, 2014 at 9:24 PM, Matt Massie [email protected]:

Just to be a little more concrete here, the FASTQ files<
http://en.wikipedia.org/wiki/FASTQ_format>which are output by sequencers,
provide the read name, sequence and
qualities. If we're ok with a nested schema (breaks integration), we
could
process these FASTQ reads (matching pairs) and add an addition array
of
"alignment" types to each (PE) read. Zero alignments would imply the
read(s) is not aligned and one implies only a single (primary)
alignment.
The "alignment" type would contain the reference ID and position of
the
alignment.

@tdanford https://github.com/tdanford points out an important
point
that the SingleReadBucket and ReadBucket play the role of this
higher-level
structure for now.


Reply to this email directly or view it on GitHub<
https://github.com/bigdatagenomics/adam/issues/95#issuecomment-34720884>
.


Reply to this email directly or view it on GitHub.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34786137
.


Reply to this email directly or view it on GitHub.

from adam.

tdanford avatar tdanford commented on August 16, 2024

Guys, if this is actually going to happen, can whoever is actually going to
create this change and file it as a pull request also include a bump to the
ADAM version number in that request?

That way, those of us who have downstream dependencies on ADAM can handle
the shift in a semi-asynchronous way.

On Tue, Feb 11, 2014 at 2:41 PM, Jeff Hammerbacher <[email protected]

wrote:

Also I'd suggest opening new issues for the "Refactoring the ADAMRecord,
SingleReadBucket, and ReadBucket into ADAMAlignment and ADAMRead types" and
the "representing strands/alleles as strings" discussions, so that we can
focus this discussion on the ADAMRecord renaming.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34796802
.

from adam.

carlyeks avatar carlyeks commented on August 16, 2024

+1 to bumping the version. It's a breaking API change...

from adam.

hammer avatar hammer commented on August 16, 2024

Hey @tdanford, if you and @massie are cool with the ADAMRecord -> ADAMAlignment renaming, I can make it happen soon, and I'll be happy to bump the version.

Out of curiosity, what downstream dependencies do you have on ADAM? One of the reasons I've been so vocal about not bringing on new users or writing papers about our work is that we've got a lot of exploration left and I'd like to retain the flexibility to break dependencies often for at least the next few months.

from adam.

fnothaft avatar fnothaft commented on August 16, 2024

I can't speak for @tdanford, but avocado depends extensively on ADAM. I imagine that Timothy also has some downstream analysis pipe that he's building.

I've taken it as a granted that I will build things just to tear them up and rebuild several months later. I agree though that we would need a soft "cliff", and also preferably, more frequent releases.

Jeff Hammerbacher [email protected] wrote:

Hey @tdanford, if you and @massie are cool with the ADAMRecord -> ADAMAlignment renaming, I can make it happen soon, and I'll be happy to bump the version.

Out of curiosity, what downstream dependencies do you have on ADAM? One of the reasons I've been so vocal about not bringing on new users or writing papers about our work is that we've got a lot of exploration left and I'd like to retain the flexibility to break dependencies often for at least the next few months.


Reply to this email directly or view it on GitHub:
#95 (comment)

Hey @tdanford, if you and @massie are cool with the ADAMRecord -> ADAMAlignment renaming, I can make it happen soon, and I'll be happy to bump the version.

Out of curiosity, what downstream dependencies do you have on ADAM? One of the reasons I've been so vocal about not bringing on new users or writing papers about our work is that we've got a lot of exploration left and I'd like to retain the flexibility to break dependencies often for at least the next few months.


Reply to this email directly or view it on GitHub.

from adam.

tdanford avatar tdanford commented on August 16, 2024

Jeff, I think the name change is ultimately an aesthetic issue.
"ADAMRecord" doesn't bother me, but I'll be fine with whatever name you
think is more appropriate. Just please (do) include the version update.

In re: downstream dependencies, I know there's avocado, and there are also
some GenomeBridge-specific repos that Carl and I sometimes update and
maintain, and that depend on adam-core or adam-format. Most of the code
that goes in those downstream projects either eventually migrates into ADAM
itself (that's where the comparison work originally incubated) or is so
GenomeBridge-specific that it wouldn't be useful for anyone else.

I don't think the existence of dependencies should stop you or anyone else
from making the changes you think are necessary, to ADAM itself, though!
For the most part, we work hard to keep downstream data and code up-to-date
with respect to the changes going on in the latest ADAM master, and for
bigger changes that it will take a while to synch up with ... well, that's
why we have version numbers. I do think that renaming ADAMRecord is a
big-enough change that bumping the version is necessary, that's all.

On Tue, Feb 11, 2014 at 3:06 PM, Jeff Hammerbacher <[email protected]

wrote:

Hey @tdanford https://github.com/tdanford, if you and @massiehttps://github.com/massieare cool with the ADAMRecord -> ADAMAlignment renaming, I can make it
happen soon, and I'll be happy to bump the version.

Out of curiosity, what downstream dependencies do you have on ADAM? One of
the reasons I've been so vocal about not bringing on new users or writing
papers about our work is that we've got a lot of exploration left and I'd
like to retain the flexibility to break dependencies often for at least the
next few months.


Reply to this email directly or view it on GitHubhttps://github.com//issues/95#issuecomment-34799656
.

from adam.

hammer avatar hammer commented on August 16, 2024

Cool. As long as downstream dependencies are primarily developed by ADAM contributors, breaking changes to ADAM are not so bad.

from adam.

tdanford avatar tdanford commented on August 16, 2024

I think we eventually came to a decision here ... but we didn't actually rename anything. @hammer do you still want to make this change?

from adam.

laserson avatar laserson commented on August 16, 2024

I would vote to make these changes. Along these lines, I would also be in favor of completely eliminating the VIP status that reads/bam files have, i.e.,

  • adamLoad to adamBAMLoad or equiv
  • ADAMContext to ADAMReadsContext or equiv
  • create an o.b.a.rdd.reads package analogous to o.b.a.rdd.variation
  • etc.

This would make things more consistent I believe.

from adam.

fnothaft avatar fnothaft commented on August 16, 2024

+1 to @laserson's suggestions; it'd be good to do the rename as one fell swoop.

from adam.

fnothaft avatar fnothaft commented on August 16, 2024

This will be closed by #342 when it merges.

from adam.

fnothaft avatar fnothaft commented on August 16, 2024

Fixed by #342.

from adam.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.