The NEF specification in Overview.md says that sequence_code is a string and not an in

residue identifiers - sequence_code vs. sequential integers about nef HOT 6 CLOSED

nmrexchangeformat commented on August 24, 2024

residue identifiers - sequence_code vs. sequential integers

from nef.

Comments (6)

rhfogh commented on August 24, 2024

This is for the community to discuss. But my personal opinion is that it would be better to keep the format simple and avoid duplicate information. Using both a sequential integer and a unique author ID opens the risk that programs might confuse them and refer to the wrong residue in cases where the two numberings do not match. And keeping two parallel numbering systems is additional work.

In any case, the sequence description may contain sequences that some programs are unable to deal with: multiple chains, for a start, as well as non-linear polymers, crosslinks, chain breaks ...

NMR-STAR (as indeed the CCPN data model) was designed to store all possible information, which made the format very powerful and very complicated. For a deposition format that was the right thing to do, but the NEF format has different needs.


kumar-physics commented on August 24, 2024

I would suggest it's better to stick to the current standard (sequence_code as int). Defining sequence_code as a string follows the outdated PDB format, and the PDB file format itself is outdated now. Validating NEF data against an mmCIF coordinate file will be difficult if NEF tries to support the outdated PDB format.


rhfogh commented on August 24, 2024

Dear Kumaran,

That would take some discussion, since it would be a major change in the
current NEF agreement. It would certainly break the way CCPN programs
use NEF, and it would require additional columns in the data.

The main use case from our point of view is not old PDB files, but NMR
assignment. While assigning you do not (yet) know the exact residue you
are assigning to, and having a string field allows you flexibility in
having a single identifier for various assignment-in-process residues.

Yours,

Rasmus


Dr. Rasmus H. Fogh

Department of Biochemistry, University of Leicester,
Henry Wellcome Building, Lancaster Road, Leicester, LE1 7RH, UK

Winner Times Higher Education Award 2007-2013
Elite without being elitist


kumar-physics commented on August 24, 2024

Dear Rasmus,
Thanks for the explanation. I agree it certainly allows some flexibility, but I don’t see any advantage in carrying this flexibility up to the chemical shift list level. You may have the flexibility in the peak list section (with a peak label or something like that), but the chemical shift list should be consistent with the sequence numbering in the structure (mmCIF). Once the chemical shift data is extracted from the spectrum, it has to be consistent with the standards, because a user may wish to deposit it in BMRB/PDB.

Best,
Kumaran


Kumaran Baskaran PhD
Assistant Scientist
Department of Biochemistry
433 Babcock Dr
Madison WI 53706
Office +1 (608) 265 5657
http://www.bmrb.wisc.edu


davidarndt commented on August 24, 2024

kumar-physics wrote:

I would suggest its better to stick to the current standard(sequence_code as int).

As far as I can tell, the current standard is sequence_code as a string, as stated in specification/Overview.md.

Given that sequence_code is a string, my original concern was that NEF format is hard to parse. If all someone wants is a protein sequence with its corresponding chemical shifts, they cannot simply parse the _nef_chemical_shift table. In the Commented_Example.nef file, for example:

     A     14      TRP   HB2    3.4     0   
     A     14      TRP   HB3    3.2     0   
     A     14      TRP   HE1    9.9     0   
     A     14      TRP   NE1    135.6   0   
     A     15      GLY   H      7.8     0   
     A     15      GLY   N      106.5   0   
     A     15      GLY   QA     3.42    0.02
     A     17      VAL   CG%    22.1    0.3 
     A     17      VAL   HG%    0.73    0.02
     ...

Since sequence_code is an arbitrary string, there is no guarantee that residue 15 follows immediately after residue 14, or that there is a gap between residues 15 and 17. One would have to consult another source of information, such as the _nef_sequence table, to know for sure (requiring additional coding). But I can tell you from experience that most programmers who would be writing a parsing script in an academic lab will not consult the specifications, and will simply (mistakenly) assume that sequence_code is an int and is sequentially numbered. This will mostly work, but for some files it will fail. Practically speaking, this will likely result in decreased usability for NEF, because it is not as developer-friendly as it could be.
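To make the point concrete, here is a minimal Python sketch of the "additional coding" involved: it decides whether two residues are neighbours by looking up their position in the sequence table instead of doing integer arithmetic on sequence_code. The `sequence_rows` list is a hypothetical stand-in for an already parsed _nef_sequence table (residue 16 ALA is invented for illustration), the helper names are my own, and the sketch ignores chain breaks and non-linear polymers.

```python
# Hypothetical stand-in for a parsed _nef_sequence table, in chain order:
# (chain_code, sequence_code, residue_name). Residue "16" is made up.
sequence_rows = [
    ("A", "14", "TRP"),
    ("A", "15", "GLY"),
    ("A", "16", "ALA"),
    ("A", "17", "VAL"),
]


def build_index(rows):
    """Map (chain_code, sequence_code) to its position in the sequence table."""
    return {(chain, code): pos for pos, (chain, code, _name) in enumerate(rows)}


def are_adjacent(index, res_a, res_b):
    """True if both residues are known, share a chain, and sit next to each other.

    sequence_code is compared only as a string key, never converted to int.
    """
    pos_a = index.get(res_a)
    pos_b = index.get(res_b)
    if pos_a is None or pos_b is None:
        return False  # residue not present in the sequence table
    return res_a[0] == res_b[0] and abs(pos_a - pos_b) == 1


index = build_index(sequence_rows)
print(are_adjacent(index, ("A", "14"), ("A", "15")))  # neighbours in the table
print(are_adjacent(index, ("A", "15"), ("A", "17")))  # separated by residue 16
```

With this kind of lookup the shift table can use any string as sequence_code, at the cost of one extra table read.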

On the other hand, if sequence_code consisted of sequentially numbered ints, this problem would be solved, but then a whole new problem would arise: inconsistency with the data in some old PDB-formatted files. I work in Dr. David Wishart's lab, and of the couple dozen or so applications we have that use protein coordinate data, I don't know of any that use mmCIF. I would expect many other academic labs to be similar. But if NEF format were to be compatible only with mmCIF and not with PDB, then any application that requires input of both coordinate data and chemical shift data would (1) require mmCIF input from the user if their shifts are in NEF format, and (2) require the developer to support mmCIF in addition to NEF. (1) might inconvenience users, and (2) would make developer support for NEF more onerous. Or, many developers may not realize that this hypothetical NEF format would be incompatible with PDB format, resulting in buggy behavior when using NEF files.

NMR-STAR 2.1 and 3.1 have columns for each approach to labeling residues, so the above problems do not arise. Though the presentation is more cluttered and initially confusing, the information is all in one table, so developers can easily use whichever column they need without much additional coding.

So those are my 2 cents. No solution is perfect, and hopefully whatever approach is taken will be a good compromise between alternatives.


rhfogh commented on August 24, 2024

Dear David,

That is a very good overview of the various considerations.

But I think there is one more thing to consider: how friendly the format
is to people who produce the data, as opposed to people who read the
final results.

If you want chemical shifts it is clearly easier if you have a single
table, with integer sequence codes reflecting the official sequence, all
relevant information, and a guarantee that all the columns are filled
in, with correct values. For a deposition database that is not that big
a problem - deposition is a one-off event, and there are people who are
paid to validate and curate entries. For application programmers, or
users, there is the question of who should be doing the work. The
spectroscopists might prefer to use the numbering system of the parent
molecule, even if they are actually working on a deletion or insertion
mutant, or might prefer that the numbering should go '..., -2, -1, 1, 2,
...'. Programs might have their own reasons not to use sequential
integer numbering that matches the official sequence. They might combine
multiple chains into a single chain for calculation with dummy linking
residues (CYANA?), or they might operate with partially assigned
residues, alternative conformations or unknown impurities (as CCPN does).
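
As a small illustration of why such numbering schemes break integer arithmetic, here is a Python sketch. The list of codes follows the '..., -2, -1, 1, 2, ...' example above; the label "unassigned_1" is a purely hypothetical placeholder for a not-yet-assigned residue, not something any particular program is known to write, and the function names are my own.

```python
def naive_is_next(code_a, code_b):
    """The shortcut a quick parsing script might take: treat codes as ints."""
    return int(code_b) - int(code_a) == 1


# Numbering that skips zero, given in chain order.
codes = ["-2", "-1", "1", "2"]
position = {code: pos for pos, code in enumerate(codes)}


def ordered_is_next(code_a, code_b):
    """Use the order in which the sequence lists the codes, not their values."""
    return position[code_b] - position[code_a] == 1


print(naive_is_next("-1", "1"))    # integer difference is 2, so reported as not next
print(ordered_is_next("-1", "1"))  # consecutive entries in the list

try:
    naive_is_next("2", "unassigned_1")  # hypothetical non-numeric code
except ValueError as err:
    print("naive parser fails:", err)
```

The order-based check keeps working whatever strings the writer chose, which is the trade-off being argued for here.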

Restraint deposition has been plagued by depositions that are
inconsistent, out of sync, incomplete, or badly converted. I suspect
that is largely because people who write files are prone to take the
same kinds of shortcuts as you list for people who parse them. The
minimum requirement of the NEF (so far) is that people must use a single
consistent set of atom identifiers of their own choosing, and must give
the sequence separately. As the PDB format shows, there is a real risk
that more stringent requirements might be ignored in practice.

Yours,

Rasmus

