Interpreting GA4GHRead.baseQuality requires some funny string manipulation. It might b

Should GA4GHRead.baseQuality be something other than a string? (ga4gh-schemas issue, closed)

ga4gh commented on August 11, 2024

Should GA4GHRead.baseQuality be something other than a string?

from ga4gh-schemas.

Comments (10)

delagoya commented on August 11, 2024

I agree.

delagoya commented on August 11, 2024

So to expand on this, we can either encode baseQuality as an array of int relying on documentation to define the limits:

/** Represents the quality of each base in this read, using the Phred-scale from 0-93. */ 
array<int> baseQuality = [];

or we can go the extra mile and create an enumeration for Phred quality scores:

enum GAPhredQualityScore {
  Q00, 
  Q01,
  // ...
  Q93
}

record GARead {
    // ...

    /** Represents the quality of each base in this read, using the Phred-scale defined by the GAPhredQualityScore enumeration. */
    array<GAPhredQualityScore> qualityScore = [];

    // ...
}
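
For context, here is a minimal Python sketch of the string manipulation that an int array would make unnecessary. It assumes the current string field follows the SAM/FASTQ Phred+33 ASCII convention, which this thread does not state explicitly, and the function names are hypothetical.

```python
PHRED_OFFSET = 33  # assumption: SAM/FASTQ-style Phred+33 ASCII encoding
MAX_PHRED = 93     # upper limit quoted in the proposal above


def decode_quality_string(quality: str) -> list:
    """Convert an ASCII quality string into a list of Phred scores."""
    scores = [ord(ch) - PHRED_OFFSET for ch in quality]
    for score in scores:
        if not 0 <= score <= MAX_PHRED:
            raise ValueError(f"Phred score out of range: {score}")
    return scores


def encode_quality_string(scores: list) -> str:
    """Inverse mapping: a list of Phred scores back to an ASCII string."""
    for score in scores:
        if not 0 <= score <= MAX_PHRED:
            raise ValueError(f"Phred score out of range: {score}")
    return "".join(chr(score + PHRED_OFFSET) for score in scores)


decoded = decode_quality_string("!5I~")
round_trip = encode_quality_string(decoded)
```

With array<int> in the schema, a client would receive the decoded list directly and neither helper would be needed.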

cassiedoll commented on August 11, 2024

I'm partial to the simple int array.

delagoya commented on August 11, 2024

That is 8 bits per value for the 0-93 enumeration vs. 32 bits per int value, which is a significant in-memory difference. On-disk column compression should be about the same, though.
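
As a rough illustration of the in-memory difference, here is a small Python sketch using the standard-library array module. The score values are made up, and the size of a C int is platform dependent (typically 4 bytes), so the code reads the item size rather than assuming it.

```python
from array import array

scores = [0, 20, 40, 93] * 25  # 100 hypothetical Phred scores

as_bytes = array("B", scores)  # unsigned char: 1 byte per score
as_ints = array("i", scores)   # C int: typically 4 bytes per score

# Payload size in bytes for each representation (excluding object overhead).
payload_u8 = len(as_bytes) * as_bytes.itemsize
payload_int = len(as_ints) * as_ints.itemsize

print(f"8-bit payload: {payload_u8} bytes, int payload: {payload_int} bytes")
```

This only measures the raw payload; it says nothing about how a particular backend store or serialization format would pack the values.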

delagoya commented on August 11, 2024

And by extension, we can change the originalBases attribute to an array of enumerated sequence bases.

/**
An enumeration of possible nucleic acid codes.

| Nucleic Acid Code | Meaning                          | Mnemonic                |
|-------------------|----------------------------------|-------------------------|
| A                 | A                                | Adenine                 |
| C                 | C                                | Cytosine                |
| G                 | G                                | Guanine                 |
| T                 | T                                | Thymine                 |
| U                 | U                                | Uracil                  |
| R                 | A or G                           | puRine                  |
| Y                 | C, T or U                        | pYrimidines             |
| K                 | G, T or U                        | bases which are Ketones |
| M                 | A or C                           | bases with aMino groups |
| S                 | C or G                           | Strong interaction      |
| W                 | A, T or U                        | Weak interaction        |
| B                 | not A (i.e. C, G, T or U)        | B comes after A         |
| D                 | not C (i.e. A, G, T or U)        | D comes after C         |
| H                 | not G (i.e. A, C, T or U)        | H comes after G         |
| V                 | neither T nor U (i.e. A, C or G) | V comes after U         |
| N                 | A C G T U                        | Nucleic acid            |
| X                 | masked                           |                         |

*/
enum GANucleicAcid {
  A,
  C,
  G,
  T,
  U,
  R,
  Y,
  K,
  M,
  S,
  W,
  B,
  D,
  H,
  V,
  N,
  X
}

and

record GARead {
    // ...

    /** The list of bases that this read represents (e.g. 'CATCGA'). (SEQ) */
    array<GANucleicAcid> originalBases = [];

    /** Represents the quality of each base in this read, using the Phred-scale defined by the GAPhredQualityScore enumeration. */
    array<GAPhredQualityScore> qualityScore = [];

    // ...
}

richarddurbin commented on August 11, 2024

If we want to get away from the character string, then the base qualities should be some sort of int. This is because they are numbers (each value is -10 log_10 of an error probability), not arbitrary symbols. Adding qualities is a meaningful operation, equivalent to multiplying probabilities. They could be 8-bit ints if that is supported (in C one can use unsigned char for this), but if they will be compressed in the data store and one does not need to hold too many in memory, this is perhaps not so important.
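
The point that adding qualities corresponds to multiplying probabilities can be checked with a short Python sketch. The helper names are hypothetical; the conversion is the standard Phred definition Q = -10 log_10(p).

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


q1, q2 = 20, 30

# Multiply the two error probabilities, then convert back to the Phred scale.
combined_prob = phred_to_error_prob(q1) * phred_to_error_prob(q2)
combined_q = error_prob_to_phred(combined_prob)

# The result matches the plain sum of the two scores (up to rounding).
print(combined_q, q1 + q2)
```

No comparable arithmetic is defined on enumeration symbols such as Q20 and Q30, which is the argument for an integer type.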

For the nucleic acids I am happier with an enumeration, but alongside this we should provide a mapping to characters giving the standard symbol, and three mappings to integers plus their inverses:
1) ordinal: A, C, G, T or U, N to 0, 1, 2, 3, 4, and all others to 4
2) 4-bit binary: A, C, G, T or U to 1, 2, 4, 8, and others to the bitwise-or implied by this, e.g. R to 5, B to 14
3) 2-bit binary: A, C, G, T or U to 0, 1, 2, 3, and anything else undefined (could throw an exception if a function call)
These are all commonly used, and it is good to standardise them and provide them in the interface. It is also important to have the inverse mappings from arrays of characters, integers or packed integers into GANucleicAcid.

An alternative is to use ints as in (2) above, with the decision whether to map 8 to T or U based on whether the molecule is DNA (default) or RNA.
In this notation it can be useful to also allow 0, which translates into '-', the pad character representing a gap in an alignment.
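
The three mappings and the 4-bit inverse described above could look roughly like the following Python sketch. All names are hypothetical, the ambiguity sets come from the nucleic acid table earlier in the thread, and X (masked) is left out of the 4-bit table because no value is defined for it here.

```python
# 4-bit codes: one bit per base, with T and U sharing the same bit.
BASE_BITS = {"A": 1, "C": 2, "G": 4, "T": 8, "U": 8}

# Ambiguity codes expressed as the DNA bases they stand for.
AMBIGUITY = {
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Mapping (2): each ambiguity code is the bitwise-or of its member bases.
FOUR_BIT = dict(BASE_BITS)
for code, members in AMBIGUITY.items():
    value = 0
    for base in members:
        value |= BASE_BITS[base]
    FOUR_BIT[code] = value

# Inverse of mapping (2), excluding T/U, which need the molecule type.
INVERSE_FOUR_BIT = {v: k for k, v in FOUR_BIT.items() if k not in "TU"}

ORDINAL = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}


def to_ordinal(base: str) -> int:
    """Mapping (1): A, C, G, T/U to 0-3; N and everything else to 4."""
    return ORDINAL.get(base, 4)


def to_four_bit(base: str) -> int:
    """Mapping (2): raises KeyError for codes without a defined value."""
    return FOUR_BIT[base]


def to_two_bit(base: str) -> int:
    """Mapping (3): only A, C, G, T/U are defined; anything else raises."""
    if base not in ORDINAL:
        raise ValueError(f"no 2-bit code for {base!r}")
    return ORDINAL[base]


def from_four_bit(value: int, rna: bool = False) -> str:
    """Inverse of (2): 0 is the gap character, 8 is T (DNA) or U (RNA)."""
    if value == 0:
        return "-"
    if value == 8:
        return "U" if rna else "T"
    if value not in INVERSE_FOUR_BIT:
        raise ValueError(f"not a 4-bit base code: {value}")
    return INVERSE_FOUR_BIT[value]
```

For example, to_four_bit("R") gives 5 and to_four_bit("B") gives 14, matching the values quoted in the list above.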

Richard

(Sent by email on 21 May 2014, 17:24, in reply to Angel Pizarro's preceding comment on enumerating originalBases.)

fnothaft commented on August 11, 2024

I'm -0 with regard to swapping out the string of bases for an enum. We've tried this on the ADAM project. The pro of this approach is that you're explicitly checking for valid character values; the cons are that you don't improve performance or storage efficiency, and you may need code to box/unbox the enums. Practically, I don't think we've seen any cases where explicitly checking character values has caught bad sequence data.

I'm -0.5 with regard to enumerating the Phred-scaled values, unless anyone has quantitative data showing that it improves performance. I think this is a micro-optimization that adds hassle to packing/unpacking data but doesn't give us much of a performance gain (as backend stores may already optimize anyway).
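
If explicit checking of character values is ever wanted while keeping a plain string, it can be done separately from the schema. Below is a minimal Python sketch; the function name is hypothetical and the allowed alphabet is taken from the nucleic acid table earlier in this thread.

```python
# Codes listed in the nucleic acid table proposed earlier in the thread.
VALID_BASE_CODES = frozenset("ACGTURYKMSWBDHVNX")


def invalid_positions(bases: str) -> list:
    """Return the 0-based positions of characters outside the alphabet."""
    return [i for i, ch in enumerate(bases) if ch not in VALID_BASE_CODES]


ok = invalid_positions("CATCGA")
bad = invalid_positions("CAZ-GA")
```

An empty result means every character is a recognised code; callers that do not care about validation can skip the check entirely, with no boxing or unboxing involved.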

cassiedoll commented on August 11, 2024

I'm with Frank - I prefer array<int> for quality, and string for bases.

delagoya commented on August 11, 2024

@fnothaft, thanks for the real-life experience with ADAM. I agree that if enumerating bases has no practical benefit, then we should stick to strings. I also agree with @cassiedoll and @richarddurbin that baseQuality should be array<int>.

cassiedoll commented on August 11, 2024

addressed by #62
