Comments (10)
I agree.
from ga4gh-schemas.
So to expand on this, we can either encode baseQuality as an array of int, relying on documentation to define the limits:
/** Represents the quality of each base in this read, using the Phred-scale from 0-93. */
array<int> baseQuality = [];
or we can go the extra mile and create an enumeration for Phred quality scores:
enum GAPhredQualityScore {
  Q00,
  Q01,
  // ...
  Q93
}
record GARead {
  // ...
  /** Represents the quality of each base in this read, using the Phred-scale defined by the GAPhredQualityScore enumeration. */
  array<GAPhredQualityScore> qualityScore = [];
  // ...
}
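To make the trade-off concrete, here is a minimal Python sketch of what each option asks of client code. It is not part of the schema: the helper names `validate_qualities`, `phred_symbol` and `phred_value` are hypothetical, and only the 0-93 range and the Q00..Q93 symbol names come from the proposal above.

```python
MAX_PHRED = 93  # upper limit stated in the doc comment above


def validate_qualities(qualities):
    """Option 1: plain ints; the range is enforced by code, not by the type."""
    for q in qualities:
        if not 0 <= q <= MAX_PHRED:
            raise ValueError("Phred score out of range 0-%d: %r" % (MAX_PHRED, q))
    return list(qualities)


# Option 2: one enum symbol per score, Q00 through Q93.
PHRED_SYMBOLS = ["Q%02d" % q for q in range(MAX_PHRED + 1)]


def phred_symbol(q):
    """Convert an int score to its enum symbol name (hypothetical helper)."""
    validate_qualities([q])
    return PHRED_SYMBOLS[q]


def phred_value(symbol):
    """Convert an enum symbol name back to its int score."""
    return PHRED_SYMBOLS.index(symbol)
```

With the int array the client only needs the range check; with the enumeration every consumer also needs a conversion in both directions before it can do arithmetic on the scores.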
I'm partial to the simple int array.
The 0-93 enumeration needs 8 bits per value vs. 32 bits per int value. That is a significant in-memory difference. On-disk column compression should be about the same, though.
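As a rough illustration of the in-memory difference, the sketch below packs the same scores once as 8-bit values and once as 32-bit ints and compares the byte counts. The function name `packed_sizes` is made up for this example, and it says nothing about how any particular backend actually stores the data.

```python
import struct


def packed_sizes(qualities):
    """Return (bytes used as 8-bit values, bytes used as 32-bit ints)."""
    for q in qualities:
        if not 0 <= q <= 93:
            raise ValueError("Phred score out of range 0-93: %r" % (q,))
    as_uint8 = bytes(qualities)  # one byte per score
    # "<" selects standard sizes, so "i" is exactly 4 bytes.
    as_int32 = struct.pack("<%di" % len(qualities), *qualities)
    return len(as_uint8), len(as_int32)
```

For a 100-base read this is 100 bytes against 400 bytes, which is the factor of four implied by 8 vs. 32 bits.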
And by extension, we can enumerate the originalBases attribute as an array of sequence bases.
/**
An enumeration of possible nucleic acid codes.
| Nucleic Acid Code | Meaning                          | Mnemonic                |
|-------------------|----------------------------------|-------------------------|
| A                 | A                                | Adenine                 |
| C                 | C                                | Cytosine                |
| G                 | G                                | Guanine                 |
| T                 | T                                | Thymine                 |
| U                 | U                                | Uracil                  |
| R                 | A or G                           | puRine                  |
| Y                 | C, T or U                        | pYrimidines             |
| K                 | G, T or U                        | bases which are Ketones |
| M                 | A or C                           | bases with aMino groups |
| S                 | C or G                           | Strong interaction      |
| W                 | A, T or U                        | Weak interaction        |
| B                 | not A (i.e. C, G, T or U)        | B comes after A         |
| D                 | not C (i.e. A, G, T or U)        | D comes after C         |
| H                 | not G (i.e. A, C, T or U)        | H comes after G         |
| V                 | neither T nor U (i.e. A, C or G) | V comes after U         |
| N                 | A, C, G, T or U                  | Nucleic acid            |
| X                 | masked                           |                         |
*/
enum GANucleicAcid {
  A,
  C,
  G,
  T,
  U,
  R,
  Y,
  K,
  M,
  S,
  W,
  B,
  D,
  H,
  V,
  N,
  X
}
and
record GARead {
  // ...
  /** The list of bases that this read represents (e.g. 'CATCGA'). (SEQ) */
  array<GANucleicAcid> originalBases = [];
  /** Represents the quality of each base in this read, using the Phred-scale defined by the GAPhredQualityScore enumeration. */
  array<GAPhredQualityScore> qualityScore = [];
  // ...
}
If we want to get away from the character string, then the base qualities should be some sort of int. This is because they are numbers (values are -10 log10 of the error probability), not arbitrary symbols. Adding qualities is a meaningful operation, equivalent to multiplying probabilities.
They could be 8-bit ints if that is supported (in C one can use unsigned char for this), but if they will be compressed in the data store and one does not need to hold too many in memory, this is perhaps not so important.
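The point about arithmetic can be checked with a short sketch. The function names are illustrative only; the formula is the standard Phred definition, Q = -10 log10(p), where p is the error probability.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


# Adding two qualities corresponds to multiplying their probabilities.
combined = phred_to_error_prob(20) * phred_to_error_prob(30)
same_as_sum = math.isclose(combined, phred_to_error_prob(20 + 30))
```

Here Q20 and Q30 are error probabilities of 0.01 and 0.001, and their product matches the probability for Q50 up to floating-point rounding. An enumeration symbol such as Q20 would have to be converted back to a number before any of this is possible.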
For the nucleic acids I am happier with an enumeration, but alongside this we should provide a mapping to characters giving the standard symbol, and three mappings to integers plus their inverses:
1) ordinal: A, C, G, T or U, N to 0, 1, 2, 3, 4 and all others to 4
2) 4-bit binary: A, C, G, T or U to 1, 2, 4, 8 and others to the bitwise-or implied by this, e.g. R to 5, B to 14
3) 2-bit binary: A, C, G, T or U to 0, 1, 2, 3 and anything else undefined (could throw an exception if a function call)
These are all commonly used, and it is good to standardise them and provide them in the interface. It is also important to have the inverse mappings from arrays of characters, integers or packed integers into GANucleicAcid.
An alternative is to use ints as in (2) above, with the decision whether to map 8 to T or U based on whether the molecule is DNA (default) or RNA. In this notation it can be useful to also allow 0, which translates into '-', the pad character representing a gap in an alignment.
Richard
On 21 May 2014, at 17:24, Angel Pizarro wrote:
> And by extension, we can enumerate the originalBases attribute as an array of sequence bases. [...]
I'm -0 with regard to swapping out the string of bases for an enum. We've tried this on the ADAM project: the pro of this approach is that you're explicitly checking for valid character values; the cons are that you don't improve performance or storage efficiency, and you may need code to box/unbox the enums. Practically, I don't think we've seen any cases where explicitly checking character values has caught bad sequence data.
I'm -0.5 with regard to enumerating the Phred-scaled values, unless anyone has quantitative data showing that it improves performance. I think this is a micro-optimization that adds hassle to packing/unpacking data but doesn't give us much of a performance gain (as backend stores may already optimize anyway).
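For what it's worth, keeping the bases as a string does not rule out the validity check; it just moves it from the type into a helper. A minimal sketch, assuming the 17-letter alphabet from the enum proposed earlier in the thread (the names `IUPAC_CODES` and `invalid_positions` are made up for illustration and are not ADAM or GA4GH API):

```python
# Alphabet taken from the GANucleicAcid enum proposed earlier in the thread.
IUPAC_CODES = frozenset("ACGTURYKMSWBDHVNX")


def invalid_positions(bases):
    """Return the indices of characters that are not valid base codes."""
    return [i for i, ch in enumerate(bases) if ch not in IUPAC_CODES]
```

A caller that cares can reject a read when the returned list is non-empty, and callers that don't care pay nothing for boxing or unboxing.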
I'm with Frank - I prefer array<int> for quality, and string for bases.
@fnothaft thanks for the real-life experience with ADAM. I agree that if there are no practical benefits to enumeration of base pairs, then we should stick to strings. I also agree with @cassiedoll and @richarddurbin that baseQuality should be array<int>.
addressed by #62