Comments (10)

delagoya avatar delagoya commented on August 11, 2024

I agree.

from ga4gh-schemas.

delagoya avatar delagoya commented on August 11, 2024

So to expand on this, we can either encode baseQuality as an array of int, relying on documentation to define the limits:

/** Represents the quality of each base in this read, using the Phred-scale from 0-93. */ 
array<int> baseQuality = [];

or we can go the extra mile and create an enumeration for Phred quality scores:

enum GAPhredQualityScore {
  Q00, 
  Q01,
  // ...
  Q93
}

record GARead {
     // ... 

    /** Represents the quality of each base in this read, using the Phred-scale defined by the GAPhredQualityScore enumeration. */ 
    array<GAPhredQualityScore> qualityScore = [];

    // ...
}
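
As a rough illustration of the first option, where the 0-93 limit lives in documentation rather than in the type, a client could enforce the range itself. The sketch below is Python rather than Avro IDL. The helper names are hypothetical, and the Phred+33 ASCII offset is the usual Sanger FASTQ convention, assumed here rather than stated in this thread.

```python
from typing import List

PHRED_MIN, PHRED_MAX = 0, 93  # range given in the proposed doc comment
ASCII_OFFSET = 33             # assumption: Sanger / Phred+33 text encoding


def validate_qualities(scores: List[int]) -> None:
    """Raise ValueError if any score falls outside the documented range."""
    for score in scores:
        if not PHRED_MIN <= score <= PHRED_MAX:
            raise ValueError(f"quality {score} outside {PHRED_MIN}-{PHRED_MAX}")


def decode_qualities(quality_string: str) -> List[int]:
    """Turn a Phred+33 quality string into a plain list of ints."""
    scores = [ord(ch) - ASCII_OFFSET for ch in quality_string]
    validate_qualities(scores)
    return scores


def encode_qualities(scores: List[int]) -> str:
    """Turn a list of int scores back into a Phred+33 quality string."""
    validate_qualities(scores)
    return "".join(chr(score + ASCII_OFFSET) for score in scores)
```

With the enumeration option the same check would be implied by the type, at the cost of converting between symbols such as Q20 and the number 20.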

cassiedoll avatar cassiedoll commented on August 11, 2024

I'm partial to the simple int array.

delagoya avatar delagoya commented on August 11, 2024

An enumeration covering 0-93 fits in 8 bits, versus 32 bits per int value. That is a significant in-memory difference. On-disk column compression should be about the same, though.
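
The in-memory difference can be seen with a small Python sketch using the standard `array` module. This only illustrates one-byte versus C-int storage in a Python process; it is an assumption-laden stand-in and says nothing about how a particular serializer or backend lays out either type.

```python
from array import array

# 100 hypothetical base qualities, all Q30
qualities = [30] * 100

as_bytes = array("B", qualities)  # unsigned char: 1 byte per value
as_ints = array("i", qualities)   # C int: platform dependent, commonly 4 bytes

bytes_size = as_bytes.itemsize * len(as_bytes)
ints_size = as_ints.itemsize * len(as_ints)

print("8-bit storage:", bytes_size, "bytes")
print("int storage:  ", ints_size, "bytes")
```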

delagoya avatar delagoya commented on August 11, 2024

And by extension, we can change the originalBases attribute to an array of enumerated sequence bases.

/**
An enumeration of possible nucleic acid codes.

| Nucleic Acid Code | Meaning                          | Mnemonic                |
|-------------------|----------------------------------|-------------------------|
| A                 | A                                | Adenine                 |
| C                 | C                                | Cytosine                |
| G                 | G                                | Guanine                 |
| T                 | T                                | Thymine                 |
| U                 | U                                | Uracil                  |
| R                 | A or G                           | puRine                  |
| Y                 | C, T or U                        | pYrimidines             |
| K                 | G, T or U                        | bases which are Ketones |
| M                 | A or C                           | bases with aMino groups |
| S                 | C or G                           | Strong interaction      |
| W                 | A, T or U                        | Weak interaction        |
| B                 | not A (i.e. C, G, T or U)        | B comes after A         |
| D                 | not C (i.e. A, G, T or U)        | D comes after C         |
| H                 | not G (i.e. A, C, T or U)        | H comes after G         |
| V                 | neither T nor U (i.e. A, C or G) | V comes after U         |
| N                 | A, C, G, T or U                  | Nucleic acid            |
| X                 | masked                           |                         |

*/
enum GANucleicAcid {
  A,
  C,
  G,
  T,
  U,
  R,
  Y,
  K,
  M,
  S,
  W,
  B,
  D,
  H,
  V,
  N,
  X
}

and

record GARead {
    // ...

    /** The list of bases that this read represents (e.g. 'CATCGA'). (SEQ) */
    array<GANucleicAcid> originalBases = [];

    /** Represents the quality of each base in this read, using the Phred-scale defined by the GAPhredQualityScore enumeration. */
    array<GAPhredQualityScore> qualityScore = [];

    // ...
}

richarddurbin avatar richarddurbin commented on August 11, 2024

If we want to get away from the character string, then the base qualities should be some sort of int. This is because they are numbers (values are -10 log_10 of the error probability), not arbitrary symbols. Adding qualities is a meaningful operation, equivalent to multiplying probabilities. They could be 8-bit ints if that is supported (in C one can use unsigned char for this), but if they will be compressed in the data store and one does not need to hold too many in memory, this is perhaps not so important.
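
The point that qualities are numbers can be made concrete with a short Python sketch. It assumes the standard Phred definition Q = -10 log_10(p), with p the error probability; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred-scaled quality q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality for an error probability p."""
    return -10 * math.log10(p)


# Adding two qualities corresponds to multiplying their error probabilities.
q_sum = 20 + 30
p_product = phred_to_error_prob(20) * phred_to_error_prob(30)

print(phred_to_error_prob(q_sum), p_product)
```

No comparable arithmetic is defined on enumeration symbols without first converting them back to ints.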

For the nucleic acids I am happier with an enumeration, but alongside this we should provide a mapping to characters giving the standard symbol, and three mappings to integers plus their inverses:
1) ordinal: A, C, G, T or U, N to 0, 1, 2, 3, 4, and all others to 4
2) 4-bit binary: A, C, G, T or U to 1, 2, 4, 8, and others to the bitwise-or implied by this, e.g. R to 5, B to 14
3) 2-bit binary: A, C, G, T or U to 0, 1, 2, 3, and anything else undefined (could throw an exception if a function call)
These are all commonly used, and it is good to standardise them and provide them in the interface. It is also important to have the inverse mappings from arrays of characters, integers or packed integers into GANucleicAcid.
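
A minimal Python sketch of the three forward mappings might look like the following. The function and table names are hypothetical, the ambiguity sets follow the table earlier in the thread, and X (masked) is left out of the 4-bit mapping because the thread does not say which bits it should set.

```python
# Bases covered by each code, with U folded into T ("T or U").
AMBIGUITY_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
ORDINAL = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}
ONE_HOT = {"A": 1, "C": 2, "G": 4, "T": 8}


def to_ordinal(code: str) -> int:
    """Mapping 1: A, C, G, T or U to 0-3; N and everything else to 4."""
    return ORDINAL.get(code, 4)


def to_four_bit(code: str) -> int:
    """Mapping 2: bitwise-or of the one-hot values of the covered bases."""
    value = 0
    for base in AMBIGUITY_SETS[code]:  # KeyError for X: undefined in this sketch
        value |= ONE_HOT[base]
    return value


def to_two_bit(code: str) -> int:
    """Mapping 3: only A, C, G, T or U are defined; anything else raises."""
    if code not in ORDINAL:
        raise ValueError(f"{code!r} has no 2-bit encoding")
    return ORDINAL[code]
```

For example, `to_four_bit("R")` gives 5 and `to_four_bit("B")` gives 14, matching the values quoted above.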

An alternative is to use ints as in (2) above, with the decision whether to map 8 to T or U based on whether the molecule is DNA (default) or RNA.
In this notation it can be useful to also allow 0, which translates into '-', the pad character representing a gap in an alignment.
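
A sketch of this int-based alternative in Python, decoding 4-bit values back to symbols: the `rna` flag and the names are hypothetical, and the table is derived from A, C, G, T or U being 1, 2, 4, 8, with 0 as the pad character.

```python
# 4-bit value -> symbol, with 8 shown as T (DNA is the default).
FOUR_BIT_SYMBOLS = {
    0: "-",
    1: "A", 2: "C", 4: "G", 8: "T",
    5: "R", 10: "Y", 12: "K", 3: "M", 6: "S", 9: "W",
    14: "B", 13: "D", 11: "H", 7: "V", 15: "N",
}


def decode_bases(codes, rna=False):
    """Turn a sequence of 4-bit ints into a string; 8 becomes U for RNA."""
    symbols = []
    for code in codes:
        symbol = FOUR_BIT_SYMBOLS[code]
        if rna and symbol == "T":
            symbol = "U"
        symbols.append(symbol)
    return "".join(symbols)


print(decode_bases([2, 1, 8, 0, 4]))            # DNA reading
print(decode_bases([2, 1, 8, 0, 4], rna=True))  # RNA reading
```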

Richard


fnothaft avatar fnothaft commented on August 11, 2024

I'm -0 with regard to swapping out the string of bases for an enum. We've tried this on the ADAM project: the pro of this approach is that you're explicitly checking for valid character values; the cons are that you don't improve performance or storage efficiency, and you may need code to box/unbox the enums. Practically, I don't think we've seen any cases where explicitly checking character values has caught bad sequence data.
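
If bases stay a string, the validity check that an enum would give implicitly can still be done explicitly where it is wanted. A minimal Python sketch, assuming the 17 codes from the GANucleicAcid proposal as the alphabet (the function name is hypothetical, and this is not ADAM's code):

```python
# Alphabet taken from the proposed GANucleicAcid enumeration.
VALID_CODES = frozenset("ACGTURYKMSWBDHVNX")


def invalid_positions(bases: str):
    """Return (index, character) pairs for characters outside the alphabet."""
    return [(i, ch) for i, ch in enumerate(bases) if ch not in VALID_CODES]


print(invalid_positions("CATCGA"))
print(invalid_positions("CAZ?"))
```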

I'm -0.5 with regard to enumerating the Phred-scaled values, unless anyone has quantitative data showing that it improves performance. I think this is a micro-optimization that adds hassle to packing/unpacking data but doesn't give us much of a performance gain (as backend stores may already optimize anyway).

cassiedoll avatar cassiedoll commented on August 11, 2024

I'm with Frank - I prefer array<int> for quality, and string for bases.

delagoya avatar delagoya commented on August 11, 2024

@fnothaft thanks for the real-life experience with ADAM. I agree that if there are no practical benefits to enumerating bases, then we should stick to strings. I also agree with @cassiedoll and @richarddurbin that baseQuality should be array<int>.

cassiedoll avatar cassiedoll commented on August 11, 2024

addressed by #62
