nmrexchangeformat / nef

NMR Exchange Format official specification

License: GNU Lesser General Public License v3.0

Python 92.34% Shell 7.66%

nef's Introduction

NEF (NMR Exchange Format)

Project Organisation

specifications contains the format specifications. Currently this consists of the Commented_example.nef example file, the Overview.md explanatory comments, and the mmcif_nef.dic specification file.

The data_0_2 directory contains sample and test data up to version 0.20. The data_1_0 directory contains sample and test data up to version 1.0.

For input/output tests we use the naming convention that a program that reads a file prepends its name to the file name when writing output. So if, e.g., Cyana reads the file ARIA-CCPN_CASD155.nef, the result would be output to Cyana-ARIA-CCPN_CASD155.nef.
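The naming convention above can be sketched in a few lines of Python. The function name `output_name` is hypothetical and only illustrates the rule; it is not part of any NEF tool.

```python
from pathlib import PurePosixPath


def output_name(program: str, input_path: str) -> str:
    """Prepend the reading program's name to the file name, keeping the directory."""
    path = PurePosixPath(input_path)
    return str(path.with_name(f"{program}-{path.name}"))


print(output_name("Cyana", "ARIA-CCPN_CASD155.nef"))
# Cyana-ARIA-CCPN_CASD155.nef
```

Applying the function repeatedly yields the chain of programs a file has passed through, reading from left to right in reverse order of processing.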

For the top-level files, LICENSE contains the license for the NEF files (we propose GNU LGPL v. 3), Charter.md is the (proposed) charter and founding document for the NEF consortium, and Changes.md is a change log.

nef's Issues

Should we not use OXT/HXT for C-terminal carboxylates in proteins?

Generally, the wwPDB uses atom names that match IUPAC recommendations, at least as I understand it. Based on recent updates, it appears that the wwPDB nomenclature of OXT/HXT for a C-terminal carboxylate is an exception to this general rule.

Our software (Amber) is used for many tasks, not just NMR refinement, and it is universal to follow the wwPDB standard with regards to these names. My view is that the wwPDB conventions should take precedence over any IUPAC rules that conflict with them, since the main point of the NEF standard is to have useful restraint files that can be used in conjunction with coordinate files from the wwPDB.

In this particular case, it doesn't make much difference, since assignments and restraints for these two nuclei are quite rare. But this issue also arises elsewhere, e.g. for hydroxy-proline: here the wwPDB hydrogen names are inconsistent with proline itself. Still, the wwPDB standard is widely used and easy to find; I'm not sure how I would determine the correct "IUPAC" names for a modified amino acid or a ligand. So my vote would be in favor of the wwPDB names, as codified in its "components" library. Otherwise, all our software needs to (a) be able to figure out what the IUPAC names are, and (b) maintain a mapping between those names and the wwPDB names (since users will expect PDB formatted files to adhere to the wwPDB conventions.)

[In cases where there is no wwPDB standard, then IUPAC rules would seem to make the most sense.]

Degenerate chemical shifts

How are degenerate chemical shifts handled in NEF? Are they treated as ambiguous chemical shifts or as uniquely assigned chemical shifts? I see both representations in the same example file:

A 372 TYR HD1 7.239 0.02
A 372 TYR HD2 7.239 0.02

and

A 375 HIS HBX 3.037 0.02
A 375 HIS HBY 3.037 0.02

Is there any rule for these cases?

Validation and atom nomenclature

Validation and atom nomenclature are really two different things.

The rule that an atom indicator must contain chain code, sequence code, atom name and (optionally empty) residue type is a matter of validity, as is the rule that a given chain code and sequence code must be paired with the same residue type throughout the file, or the rule that identifiers that differ only by case are not allowed (e.g. HA and Ha). Break those and the file is invalid.
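A minimal sketch of how a reader might check the three validity rules named above. The function `check_atom_rows` and its row layout are assumptions made for illustration, and the case rule is applied to atom names only, following the HA/Ha example.

```python
def check_atom_rows(rows):
    """Check (chain_code, sequence_code, residue_type, atom_name) rows.

    Returns a list of human-readable error messages; an empty list means
    the rows pass the three rules sketched here.
    """
    errors = []
    residue_types = {}  # (chain, seq) -> residue type first seen
    seen_names = {}     # (chain, seq, upper-cased atom) -> atom name first seen
    for chain, seq, residue, atom in rows:
        # Rule 1: chain code, sequence code and atom name must be present.
        if not chain or not seq or not atom:
            errors.append(f"incomplete atom indicator: {(chain, seq, residue, atom)!r}")
            continue
        # Rule 2: one residue type per (chain code, sequence code) pair.
        known = residue_types.setdefault((chain, seq), residue)
        if known != residue:
            errors.append(f"{chain} {seq}: residue type {residue!r} conflicts with {known!r}")
        # Rule 3: atom names must not differ only by case.
        first = seen_names.setdefault((chain, seq, atom.upper()), atom)
        if first != atom:
            errors.append(f"{chain} {seq}: atom names {first!r} and {atom!r} differ only by case")
    return errors


rows = [
    ("A", "372", "TYR", "HA"),
    ("A", "372", "TYR", "Ha"),       # differs from HA only by case
    ("A", "372", "HIS", "HB2"),      # residue type conflicts with TYR
    ("A", "373", "GLY", "?@295-1"),  # unusual but valid name
]
for message in check_atom_rows(rows):
    print(message)
```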

An atom name like "?@295-1" is perfectly valid, on the other hand. It cannot (of course) be converted into an IUPAC atom name, no matter what the context, but the format rules say that names that cannot be so mapped are allowed, and are treated as anonymous assignment names.

As I understand the system, the actual atoms present follow from the sequence description, through the residue types ('three letter codes'). Atom names are matched to these, using three different types of wildcards:
'%' : "a sequence of digits" or "[0-9]+"
'*' : "any string" or ".*"
'X' and 'Y', which signify one of two alternative branches (but we do not know which one) for atoms and groups with ambiguity codes 2 or 3.

So, there are two things to say about 'valid atom names':

It is a matter not so much of valid NEF files, but of what names other programs can interpret - so maybe the best place to put it is in a file reader - e.g. a BMRB one.
I do not think you can decide what is interpretable without comparing it to a known set of templates with known atom names. Whether e.g. 'HB%' can be interpreted (and how) depends on whether the relevant residue is ALA, GLU, THR, GLY - or DNA.
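To illustrate why interpretation depends on the residue template, here is a hedged sketch that turns the '%' (digits) and '*' (any string) wildcards into regular expressions and matches them against per-residue atom lists. The helper names and the hydrogen-only templates are assumptions for the example, and the 'X'/'Y' wildcards are deliberately not handled.

```python
import re


def wildcard_to_regex(name):
    """Translate '%' to one or more digits and '*' to any string; escape the rest."""
    parts = []
    for char in name:
        if char == "%":
            parts.append("[0-9]+")
        elif char == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts))


def expand(name, template_atoms):
    """Return the template atoms that the wildcard name matches completely."""
    pattern = wildcard_to_regex(name)
    return [atom for atom in template_atoms if pattern.fullmatch(atom)]


# Illustrative hydrogen-only templates; a real reader would load full templates.
TEMPLATES = {
    "ALA": ["H", "HA", "HB1", "HB2", "HB3"],
    "GLU": ["H", "HA", "HB2", "HB3", "HG2", "HG3"],
    "GLY": ["H", "HA2", "HA3"],
}

for residue, atoms in TEMPLATES.items():
    print(residue, expand("HB%", atoms))
```

The same name 'HB%' expands to a methyl group for ALA, a methylene pair for GLU, and nothing at all for GLY, which is the point made above.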

.nef is a bad filename suffix...

.nef is already used as the extension of a common file format, NEF, which stands for Nikon Electronic Format. This clash causes complications, as programs such as the OS X Finder and mail clients assume that all files with this extension are images, which leads to unexpected errors such as being unable to attach a .nef file to an e-mail (Microsoft Outlook).

I would suggest that .nef be deprecated as a file extension and that .neff be used instead [nmr exchange format file...]

Multiline comments in example file

Our STAR parser produced an error while reading Commented_Example.nef.
The following form is not allowed:

; Optional item
For comments.
Any multiline text can be put here
;

The correct format according to the original STAR specification should be like this:

;
Optional item
For comments.
Any multiline text can be put here
;

Please correct it.

Test Data

We need more test data, from more varied sources, with more varied contents (complexes, multimers, nucleic acids, unusual topologies, assigned peak lists, ...).

For now I shall finish adding the ten CASD_NMR 2013 projects. With the conversion code I had to write to make those I should be able to make a test file from any DOCR project without too much hassle. So if anyone has a good suggestion for a test project, I can take a look. Meanwhile I shall see if I can convert at least one of the dimers in Guys X-ray/NMR data set.

Looking forward to seeing more data...

Validating NEF file

How can a NEF file be validated to make sure it is consistent with the NEF dictionary? Will the NEF developers provide validation tools?

"Wildcards 'X' and 'Y' always exists in pairs"- Is this true?

I need some clear documentation for the usage of the wildcards 'X' and 'Y'. Do they always exist in pairs, or can one of them be used alone?
If atoms X and Y are involved in two independent restraints 'a' and 'b' respectively, which statement is true?

X satisfies both 'a' and 'b' AND Y satisfies both 'a' and 'b'
X satisfies 'a' and Y satisfies 'b' OR X satisfies 'b' and Y satisfies 'a'
X satisfies both 'a' and 'b' OR Y satisfies both 'a' and 'b'

If there is only one restraint, say 'c', that involves only atom 'X', which of the following statements is true?

X satisfies c OR Y satisfies c
X satisfies c AND Y satisfies c
X satisfies c OR Y satisfies c but NOT both at the same time

In the chemical shift section, how should I represent the case where only one of the stereospecific chemical shifts is observed? Should I use 'X' or 'Y'?

It would be helpful if you could clarify these cases and improve the documentation for the use of wildcards.
Thanks

Residue Variants

The current residue variant specification uses a small subset of the RCSB variant codes. Do we want to keep it like this? If we want specific, simple codes just for (de)protonable amino acids the RCSB system is rather complex (DYANA codes would do). If we want something that can be extended we would need to consider how (and the RCSB system does not fit well with the other information we have). Specifically:

The RCSB variant codes were made to specify also backbone configuration. We use them only for protonation state, for which they are rather complex - and require special codes for N and C terminal residues.
RCSB codes are based on a fully protonated default state, whereas NEF uses a pH 7 default state, which makes things less intuitive.
There is no RCSB code for cis peptide bonds, the protonation state of backbone NH3, or distinguishing CYS-SH from deprotonated CYS.
There is no system for extending the codes beyond the 20 standard amino acids, should we ever want to.

Is there any appetite to change this?

Why is residue_name called residue_type in NEF?

Is there any special reason to call residue_name residue_type? residue_type may have many different meanings, such as standard vs. non-standard, hydrophobic, acidic, aromatic, etc.
The NEF dictionary also describes this tag as "The author provided residue name.", but it is still called residue_type.
Why not just call it residue_name? If at some point in the future you want to include a tag for the residue type, you will need to rename the existing tag or call the residue type residue_something_else. If there is no good reason to call this tag residue_type, it is better to change the tag now.

Numeric precision

Can we agree on a global numerical precision that will not lose information, and can we set it as a minimum for the standard?

Ideally we should maybe set the required precision on a tag-by-tag basis, but that would be too
demanding in practice. The current settings in the CCPN I/O will give the shortest string that represents
the value exactly (thus avoiding values like 2.319999999999999999994 in favour of 2.32). Otherwise, we use '%9g', mixed float/scientific notation with nine significant figures. The lowest precision we can contemplate is probably '%6g', which gives values like '124.371' and (regrettably) '0.617332' for chemical shifts. What is the smallest precision we can accept?
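The options discussed above can be compared directly in Python; this is only a sketch, not the CCPN I/O code. Python's `repr` gives the shortest string that round-trips a float. Note that in Python's printf-style formatting the number of significant figures is set with a precision such as `'%.9g'`; a bare `'%9g'` sets the field width and keeps the default of six significant figures. The sample shift value is made up for the example.

```python
raw = float("2.319999999999999999994")
shortest = repr(raw)             # shortest string that reads back to the same float
assert float(shortest) == raw    # no information is lost

shift = 124.3712345678           # hypothetical chemical shift value

nine_figures = "%.9g" % shift    # nine significant figures
six_figures = "%.6g" % shift     # six significant figures
width_only = "%9g" % shift       # width 9, still six significant figures

print(shortest)
print(nine_figures)
print(six_figures)
print(repr(width_only))
```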

NEF and ccpn assign have different meanings for axis codes

NEF uses axis codes as isotopes, while CCPN has separate axis codes (axis names) and isotope codes (isotopes).

Wildcard atom sets for non-terminal Threonine

This table looks very strange to me. The main purpose of NEF is to exchange data between software packages.
Let's say program 'A' writes out a chemical shift for HG%; how does another program 'B' interpret that? According to your table there are multiple ways to interpret it. Which interpretation is correct? If the data pass through several programs with unclear and inconsistent definitions, the final output will be questionable. This type of misinterpretation might propagate and amplify from one program to the next, and as a result data quality will be lost.

How to expand wildcard nomenclature for a modified residue?

One of the example files, 2mtv, contains a modified residue 6MZ. Without the topology information or a coordinate file, how do I expand 'HHX%'? I looked into the cif file downloaded from the PDB and I couldn't find an atom named 'HH' followed by two numbers for the residue 6MZ.
Is 'HHX%' a typo or a limitation of NEF?

Question: which version of the spec should I target?

Hello,

I am a developer writing a NEF file reader. Should I be targeting spec version 1.0 or spec version 1.1 in my reader? Is it easy to determine the differences? I assume a 1.1 reader will read 1.0 files just fine. Is that a good assumption? I have already completed my 1.0 reader and just need to maybe enhance it for 1.1 support.

dummy file required

It would be helpful to have a dummy NEF file with all possible saveframes and tags. Does this example contain all saveframes and tags?

BMRB ambiguity codes

Official discussion of how to treat BMRB ambiguity codes 4-9 has been postponed till after version 1.0 of NEF. Nevertheless I had been giving the matter some thought, and I attach my current proposal for discussion, in case anybody is interested.
NEF - ambiguity codes.docx

What happens to NEF metadata when files are merged?

When NEF files are merged, if there was a save_nef_nmr_meta_data frame, what should it contain? Effectively the _nef_run_history loop needs an extra field that says where the history comes from. I propose something like the following:

   loop_
      _nef_run_history.run_number
      _nef_run_history.stream_id
      _nef_run_history.program_name
      _nef_run_history.program_version
      _nef_run_history.script_name

      1 1 NEFPipelines     1.1       header.py 
      2 1 NEFPipelines     1.1       nmrview_peaks.py
      3 2 CCPN_assign      3.0.3     .
      4 2 Aria             2.3       .
      5 2 NefPipelines     2.3       aria_noe_restraints.py

   stop_

where NefPipelines merged in data from CCPN_assign and Aria using the script aria_noe_restraints.py
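As a rough sketch of the proposed stream_id column, the function below concatenates several run histories, renumbers the runs, and tags each row with the stream it came from. The function name and data layout are hypothetical, and which stream the merging run itself belongs to is a choice left to the proposal; here it is simply listed as the last entry of the second stream, as in the table above.

```python
def merge_run_histories(streams):
    """Merge per-file run histories into one numbered list.

    streams: list of histories, one per input file, each a list of
    (program_name, program_version, script_name) tuples in run order.
    Returns rows of (run_number, stream_id, program, version, script).
    """
    merged = []
    for stream_id, history in enumerate(streams, start=1):
        for program, version, script in history:
            merged.append((len(merged) + 1, stream_id, program, version, script))
    return merged


streams = [
    [("NEFPipelines", "1.1", "header.py"),
     ("NEFPipelines", "1.1", "nmrview_peaks.py")],
    [("CCPN_assign", "3.0.3", "."),
     ("Aria", "2.3", "."),
     ("NefPipelines", "2.3", "aria_noe_restraints.py")],
]
for row in merge_run_histories(streams):
    print(*row)
```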

[Proposal] The NEF frame nef_nmr_meta_data history should include uuids

The nef_nmr_meta_data frame includes a uuid but doesn't include it in the _nef_run_history loop. It is suggested that this is included in the _nef_run_history loop as shown below using the tag _nef_run_history.uuid. This would allow files to be tracked as they are modified.

data_new

save_nef_nmr_meta_data
   _nef_nmr_meta_data.sf_category      nef_nmr_meta_data
   _nef_nmr_meta_data.sf_framecode     nef_nmr_meta_data
   _nef_nmr_meta_data.format_name      nmr_exchange_format
   _nef_nmr_meta_data.format_version   1.1
   _nef_nmr_meta_data.program_name     NEFPipelines
   _nef_nmr_meta_data.program_version  0.0.1
   _nef_nmr_meta_data.script_name      header.py
   _nef_nmr_meta_data.creation_date    2021-06-19T21:13:32.548158
   _nef_nmr_meta_data.uuid             NEFPipelines-2021-06-19T21:13:32.548158-0485797022

   loop_
      _nef_run_history.run_number
      _nef_run_history.program_name
      _nef_run_history.program_version
      _nef_run_history.script_name
      _nef_run_history.uuid

      1 NEFPipelines 1.1 header.py NEFPipelines-2021-06-19T21:13:32.548158-0485797022

   stop_

save_
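For illustration only, an identifier with the same shape as the one in the example (program name, ISO timestamp with microseconds, ten-digit number) could be generated as follows. That the trailing number is random is an assumption read off the example, and `make_uuid` is a hypothetical helper, not part of any NEF library.

```python
import random
from datetime import datetime


def make_uuid(program_name, now=None, rng=None):
    """Build '<program>-<ISO timestamp>-<10 digits>' (format assumed from the example)."""
    if now is None:
        now = datetime.now()
    if rng is None:
        rng = random.Random()
    stamp = now.isoformat(timespec="microseconds")
    return f"{program_name}-{stamp}-{rng.randrange(10**10):010d}"


print(make_uuid("NEFPipelines"))
```

Each program that modifies the file would append a history row carrying the uuid of the file as it wrote it, so that a chain of edits can be followed.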

Software specific tag in nmr_meta_data saveframe

Commented_Example.nef file contains a software specific tag in the nmr_meta_data saveframe

# program-specific parameter
_nef_program_script.cyana_parameter_1

It is defined as a loop tag along with other tags (program_name, script_name, script). Consider a situation where more than one program is used (for example CYANA and XPLOR). The cyana_parameter_1 column then becomes meaningless for a row containing XPLOR data.

"cyana_parameter_1" is a software specific tag, which should go into software specific saveframe.

Thanks

Is this Duplicate restraint (or) Ambiguous restraint?

From the example Example NEF/data_1_1/CCPN_2mqq_docr.nef

8 7 . A 373 GLY HAx A 427 TYR HE% 1 . . . . 5.83 .
9 8 . A 373 GLY HAy A 427 TYR HE% 1 . . . . 5.83 .
12 10 . A 373 GLY HAx A 427 TYR HE% 1 . . . . 5.52 .
13 10 . A 373 GLY HAy A 427 TYR HE% 1 . . . . 5.52 .

The above four rows (index 8, 9, 12, 13) refer to the same set of atoms. The first two rows have different restraint IDs and the last two rows share the same restraint ID, and in both cases the distance values differ. Is this an error or does it mean something? What I understand from this is that there is an ambiguous restraint between the GLY HAs and the TYR HEs, but I don't understand the fine details of why this is expressed as three sets of restraints (according to the restraint IDs) with different distance values.
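The grouping described above can be made explicit with a short script. It assumes, as the text does, that the first two fields are the row index and the restraint ID, that fields 4 to 11 identify the two atoms, and that the 17th field is the distance value in question; the loop header is not shown in the issue, so these positions are assumptions.

```python
from collections import defaultdict

ROWS = """\
8 7 . A 373 GLY HAx A 427 TYR HE% 1 . . . . 5.83 .
9 8 . A 373 GLY HAy A 427 TYR HE% 1 . . . . 5.83 .
12 10 . A 373 GLY HAx A 427 TYR HE% 1 . . . . 5.52 .
13 10 . A 373 GLY HAy A 427 TYR HE% 1 . . . . 5.52 .
"""


def group_by_restraint(text):
    """Map restraint ID -> list of (index, atom 1, atom 2, distance) tuples."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        index, restraint_id = int(fields[0]), int(fields[1])
        atom_1 = " ".join(fields[3:7])
        atom_2 = " ".join(fields[7:11])
        distance = float(fields[16])  # assumed position of the distance value
        groups[restraint_id].append((index, atom_1, atom_2, distance))
    return dict(groups)


for restraint_id, members in group_by_restraint(ROWS).items():
    print(restraint_id, members)
```

Restraints 7 and 8 each contain a single row, while restraint 10 contains both the HAx and the HAy row, which is the three-way split the question refers to.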

Test data and mmCIF

When version 1.0 is frozen we shall need everybody (please?) to resubmit existing test data and more to represent your program output. For a complete test, any program that exports structures should give an mmCIF file with the structures in addition to the .nef file. The mmCIF file contains the author-to-IUPAC name mapping, and is necessary for testing for the RCSB, BMRB, and others who make use of structures together with the NMR data.

Line width, phase and decay rate in peak list.

The current NEF dictionary has no tags to specify line width, phase and decay rate in the peak list. While updating the BMRB database to NMR-STAR v 3.2, I realized that not all of the data stored in old saveframes can be translated into a NEF peak list without adding new tags. There are about 50 BMRB entries that contain line width information, and a few that contain decay rates and other valuable information.
BMRB is working to provide all entries in v 3.2, which supports a 'NEF like' peak list. The only major difference between NEF and NMR-STAR v 3.2 is the atom nomenclature; otherwise NEF could easily be considered a subset of NMR-STAR v 3.2. Our updated dictionary (v. 3.2) provides tags to specify line width, phase and decay rates in a 'NEF like' peak list.
Now my question is whether the NEF consortium and the software developers are interested in adding these tags to the NEF dictionary.

Discussion about NEF format

Hi,

I'm David Arndt, from Dr. David Wishart's lab. I'm updating some of our lab's servers to handle the new NEF format. Dr. Wishart forwarded an email to me from Dr. Rasmus Fogh, which read in part, "We hope to convince you to use the account on GitHub for communication, proposals, and sharing test results … ." So I'm wondering whether the intent is for people to raise questions or make comments on the issues tracker on GitHub, or whether some other means of communication is preferred.

One question I have is: Is it part of the specification for _nef_chemical_shift.value_uncertainty to optionally use scientific notation, as in the example file CCPN_H1GI_alt.nef? Or was it an oversight that led to the inclusion of those values?

Thanks

residue identifiers - sequence_code vs. sequential integers

The NEF specification in Overview.md says that sequence_code is a string and not an integer. But for some applications, a program may want to number residues sequentially with integers (as is the case with some existing programs in our lab). With NMR-STAR 2.1 format, one could easily choose to use _Residue_seq_code (unique integer) instead of _Residue_author_seq_code. With NMR-STAR 3.1 format, one could use _Atom_chem_shift.Comp_index_ID. But with NEF format, things are trickier, since there is no column in the chemical shifts table with such sequential integers. Since the _nef_chemical_shift table may omit residues that have no chemical shift data, my approach was to match chain_code and sequence_code in the _nef_chemical_shift table to the same in the _nef_sequence table, and then derive an integer based on the position of a residue in the _nef_sequence table (first residue is 1, and so on).

Now, my concern is that it is somewhat cumbersome to write this code, and if implementers are not sufficiently attentive to things they may assume that sequence_code is an integer (and their parsers will fail on some input), or assume that the chemical shifts table contains all residues in a chain, which is incorrect. Might it be better to include a column in the chemical shifts table that will indicate unique and sequential integers for each residue? Thanks.
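The approach described above, deriving integers from the position in the _nef_sequence table and looking them up by chain_code and sequence_code, could look roughly like this. The row layouts and names are assumptions for illustration; sequence codes are kept as strings so that values such as '372A' work.

```python
def sequence_positions(sequence_rows):
    """Map (chain_code, sequence_code) to its 1-based position in _nef_sequence order."""
    positions = {}
    for position, (chain_code, sequence_code) in enumerate(sequence_rows, start=1):
        positions[(chain_code, str(sequence_code))] = position
    return positions


# Hypothetical data: the sequence table lists every residue ...
nef_sequence = [("A", "371"), ("A", "372"), ("A", "372A"), ("A", "373")]
# ... while the shift table may omit residues without shifts.
shift_rows = [("A", "372A", "HIS", "HA"), ("A", "373", "GLY", "HA2")]

positions = sequence_positions(nef_sequence)
numbered = [
    (positions[(chain, seq)], chain, seq, residue, atom)
    for chain, seq, residue, atom in shift_rows
]
print(numbered)
```

Numbering per chain instead of per file would only require resetting the counter whenever the chain code changes.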

wildcard for TYR

In this file, the side-chain ambiguities for residue 49 TYR are expressed with 'X' and 'Y' in the chemical shift saveframe, but in the restraint section they are represented with '%'. Is there an explanation for this, or is it just a typo?

Sequence with one-letter codes?

Sometimes people may want to get the sequence of one or more chains from a NEF file. In NMR-STAR 2.1 and 3.1 formats, this is easy since one just finds _Mol_residue_sequence or _Entity.Polymer_seq_one_letter_code, respectively (or one can visually inspect the file without having to remember these labels). But in the NEF examples, there is no such field. Will a PDB ID or other external ID always be present that would be used to easily get a FASTA file? If not, to get the sequence (in a format using one-letter codes), one would need to either write a script or use text manipulation tricks and use something like the Three to One form at http://bioinformatics.org/sms2/three_to_one.html. Not everyone will have the expertise for this, or think of the right tricks to use. Should perhaps the sequence(s) of chains be written out in NEF files using one-letter codes?
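A script of the kind mentioned above is short. The sketch below converts (chain_code, residue name) pairs, as they might be read from the _nef_sequence loop, into one-letter sequences per chain; only the 20 standard amino acids are mapped and anything else becomes 'X', which is a choice made for this example.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def one_letter_sequences(sequence_rows):
    """Return {chain_code: one-letter sequence} from (chain_code, residue_name) rows."""
    chains = {}
    for chain_code, residue_name in sequence_rows:
        letter = THREE_TO_ONE.get(residue_name.upper(), "X")  # 'X' for unknown residues
        chains.setdefault(chain_code, []).append(letter)
    return {chain: "".join(letters) for chain, letters in chains.items()}


rows = [("A", "GLY"), ("A", "TYR"), ("A", "HIS"), ("B", "ALA"), ("B", "6MZ")]
print(one_letter_sequences(rows))
```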

Consider : relation with NMReDATA assignment of small molecules

I wonder if NEF is interested in taking the NMReDATA format into account.
nmredata.org

Dictionary issue 1 (gathered from wwPDB team)

_category.parent_category_id is not defined in DDL. We have parent/child relationships between attributes in categories. We expect there to be parent/child relationships between attributes between the categories as there is no way to know how to map two categories together.

should nef_nmr_meta_data loop _nef_run_history have the current program in it

When you first create a nmr_meta_data_frame should the history contain the current program? I would suggest that it does as this imposes less of a burden on the developer side since the developer doesn't have to record previous values and append them to the list on each modification. All he would need to do is add to the list. So the proposal is that an initial nmr_meta_data_frame should look as follows

save_nef_nmr_meta_data
   _nef_nmr_meta_data.sf_category      nef_nmr_meta_data
   _nef_nmr_meta_data.sf_framecode     nef_nmr_meta_data
   _nef_nmr_meta_data.format_name      nmr_exchange_format
   _nef_nmr_meta_data.format_version   1.1
   _nef_nmr_meta_data.program_name     NEFPipelines
   _nef_nmr_meta_data.program_version  0.0.1
   _nef_nmr_meta_data.script_name      header.py
   _nef_nmr_meta_data.creation_date    2021-06-19T21:13:32.548158
   _nef_nmr_meta_data.uuid             NEFPipelines-2021-06-19T21:13:32.548158-0485797022

   loop_
      _nef_run_history.run_number
      _nef_run_history.program_name
      _nef_run_history.program_version
      _nef_run_history.script_name

      1 NEFPipelines 1.1 header.py 

   stop_

save_

_nef_sequence.index vs _nef_sequence.sequence_code

According to definition '_nef_sequence.index' is a sequential number starting from 1 and '_nef_sequence.sequence_code' is the author specified sequence position (old PDB like numbering).
My question is "Can I safely assume that '_nef_sequence.index' will match the '_entity_poly_seq.num' or '_atom_site.label_seq_id ' in the corresponding mmCIF file?"

Thanks

The NEF frame nef_nmr_meta_data doesn't include a script name in its tags

The NEF frame nef_nmr_meta_data doesn't include a script name in its tags, but it is part of the loop of changes.

Suggestion: add script_name as part of the nef_nmr_meta_data tags, as _nef_nmr_meta_data.script_name

save_nef_nmr_meta_data
   _nef_nmr_meta_data.sf_category      nef_nmr_meta_data
   _nef_nmr_meta_data.sf_framecode     nef_nmr_meta_data
   _nef_nmr_meta_data.format_name      nmr_exchange_format
   _nef_nmr_meta_data.format_version   1.1
   _nef_nmr_meta_data.program_name     NEFPipelines
   _nef_nmr_meta_data.program_version  0.0.1
   _nef_nmr_meta_data.script_name      header.py
   _nef_nmr_meta_data.creation_date    2021-06-19T21:13:32.548158
   _nef_nmr_meta_data.uuid             NEFPipelines-2021-06-19T21:13:32.548158-0485797022

   loop_
      _nef_run_history.run_number
      _nef_run_history.program_name
      _nef_run_history.program_version
      _nef_run_history.script_name

      1 NEFPipelines 1.1 header.py 

   stop_

save_

What is mandatory

Recently, I have seen test data with at least three mandatory elements missing:

  _nef_nmr_meta_data.format_version

the 'index' columns on restraints and peaks.

and

save_nef_chemical_shift_list

Clearly people cannot be prevented from (ab)using the format in whatever way they prefer, internally. The question, especially for the BMRB, is what will happen if any of these mandatory elements are missing.

The format_version is clearly necessary - otherwise we do not know which reader to use.

The index columns are less certain. Since they are in effect line numbers they can be inferred from the file.
Will the BMRB refuse to accept files without indices? Or should we reconsider whether they are mandatory?

Finally, the chemical_shift_list, which is mandatory in order to get a list of the atoms used, instead of having to dig them out of restraints lists. The thing is that it is not mandatory that it be complete (that would be impossible to enforce), and people who do not have shifts will understandably be slow to add shift lists containing no useful shift values. Any comments?

CSROSETTA peak list problem

The CSROSETTA example file contains a 4D peak list, but in the loop only 3 chemical shift values are provided. Is it an error or is it a reduced dimensionality experiment?

PDB validation reports

I am not on this project any more, but I had a look at the new PDB validation reports, for old times' sake.

Congratulations on a very nice-looking validation report.

I see one problem, though. Just checking a single report, 2juw, there were a remarkable number of 'duplicate shifts' and 'unmapped shifts' warnings, and a total of only 66% of atoms assigned. It could be a quite badly assigned structure, of course, but it looks suspiciously like ambiguous assignments (e.g. TYR HB%) are double counted somehow, and non-stereospecific assignments (e.g. ASP HBx, ASP HBy) are ignored or rejected. If so, any NMR spectroscopist would see this report as misleading, and any non-NMR-spectroscopist would get an unduly bad impression of the quality of the structure (only 66% assigned??), and maybe of the entire technique.

The spectroscopically correct way of looking at shifts assigned to e.g. ASP HBx, ASP HBy is to consider them both assigned and count assignment percentages accordingly. If desired you can give them a special category, 'non-stereospecifically assigned', but it would be misleading at best to describe them only as 'not mapped to atoms'.

Maybe you are doing this correctly, 2juw is a rather badly assigned structure, and I am just misunderstanding the situation - it was a quick look, and I am not going to spend hours of analysis on a project where I have not been invited to contribute. But I think it might be worth your while to have another look at this.

Pseudo atom vs wildcard

In the CSROSETTA example file, pseudo atoms were used for VAL (QG1, QG2) in the chemical shift section and wildcards (HG%) were used in the restraints sections. I guess these representations each have their own meaning and purpose. I would have expected it the other way round: pseudo atoms in restraints and wildcards in chemical shifts. Is there any rule saying when to use pseudo atoms and when to use wildcards?
