project-gemmi / gemmi Goto Github PK
View Code? Open in Web Editor NEWmacromolecular crystallography library and utilities
Home Page: https://project-gemmi.github.io/
License: Mozilla Public License 2.0
macromolecular crystallography library and utilities
Home Page: https://project-gemmi.github.io/
License: Mozilla Public License 2.0
It is great that the SEQRES records were added. Is there any change you could add the DBREF and SEQADV records as well, since these relate strongly to the SEQRES record?
_struct_ref_seq
_struct_ref_seq_dif
In particular, I'd like the conversion in the to_pdb.hpp file
Hi @wojdyr , I tried the superposition algorithm from python, but it seems to crash often.
import gemmi
x = gemmi.read_structure('2vta_updated.cif')[0]['A'].get_polymer()
y = gemmi.read_structure('6q49_updated.cif')[0]['A'].get_polymer()
r = gemmi.calculate_superposition(x, y, gemmi.PolymerType.PeptideL, gemmi.SupSelect.CaP) # this line fails
What I could get out if that is:
[1] 25560 segmentation fault
MemoryError: std::bad_alloc
there is additional line of code that crashes (again segmentation fault) I could find in the documentation and that is: x.check_polymer_type()
, that is why I use gemmi.PolymerType.PeptideL
in the code.
Interestingly, when I use archive files (https://www.ebi.ac.uk/pdbe/entry-files/download/2vta.cif and https://www.ebi.ac.uk/pdbe/entry-files/download/6q49.cif) I dont get the memmory error, but nothing is computed, only thing I can see is message More than 50 iterations needed!
So my natural question is how can I increase this limit and is there anything I can do to provide more info so that you guys can track down the root of the problem?
Thank you for looking into this!
Right now there is no obvious way to map an unmerged miller index to the reciprocal space asymmetric unit from Python. At the C++ level, the closest thing seems to be the UnmergedHklMover in mtz.hpp. It would be really great if we could have access to this object through Pybind11. @JBGreisman and I have reimplemented this functionality with numpy, but we would like to be able to compare it to Gemmi's implementation for testing purposes.
More generally, might it make sense to refactor UnmergedHklMover into symm.hpp and make its constructor take a SpaceGroup instead of an Mtz. This would require a bit more work, but I think it makes sense to have a less specialized class for this task.
On an unrelated note, thanks for your work on this project. It has drastically reduced the overhead involved in processing crystallographic data in my research.
gemmi.FloatGrid
seems to have a method for setting its corresponding unit cell (FloatGrid.set_unit_cell()
), but there is not a corresponding method for setting the grid's spacegroup. This seems useful for being able to get the correct ASU mask from the FloatGrid.asu()
method.
This is the current behavior:
grid = gemmi.FloatGrid(100, 100, 100)
grid.spacegroup = gemmi.SpaceGroup(19)
print(grid.spacegroup)
returns:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-142-5698f3495619> in <module>
----> 1 print(grid.spacegroup)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 20: invalid start byte
Is there a way to set the spacegroup for a FloatGrid
object that was not instantiated directly from a call to gemmi.transform_f_phi_grid_to_map()
?
Thanks,
Jack
Opening refmac mmcif by gemmi.read_structure("refmac.mmcif")
and saving it by make_mmcif_document().write_file("foo.mmcif")
results in loss of some records. Can gemmi keep them? If it is too much trouble I can work on it. Currently, following records are lost.
_atom_site.auth_atom_id
_atom_site.auth_comp_id
_atom_site.calc_flag
_atom_site.group_pdb
_atom_site.pdbx_pdb_atom_name
_atom_site.pdbx_pdb_residue_name
_atom_site.pdbx_pdb_residue_no
_atom_type.n_electrons
_atom_type.scat_cromer_mann_a1
_atom_type.scat_cromer_mann_a2
_atom_type.scat_cromer_mann_a3
_atom_type.scat_cromer_mann_a4
_atom_type.scat_cromer_mann_b1
_atom_type.scat_cromer_mann_b2
_atom_type.scat_cromer_mann_b3
_atom_type.scat_cromer_mann_b4
_atom_type.scat_cromer_mann_c
_atom_type.scat_z
_ccp4_form_factor.scat_method
_ccp4_refine.details
_ccp4_refine_ls.jelly_body
_ccp4_refine_ls.jelly_body_dist
_ccp4_refine_ls.jelly_body_sigma
_entity_poly_seq.hetero
_refine.aniso_b[1][1]
_refine.aniso_b[1][2]
_refine.aniso_b[1][3]
_refine.aniso_b[2][2]
_refine.aniso_b[2][3]
_refine.aniso_b[3][3]
_refine.b_iso_mean
_refine.correlation_coeff_fo_to_fc
_refine.correlation_coeff_fo_to_fc_free
_refine.details
_refine.ls_number_reflns_r_free
_refine.ls_number_reflns_r_work
_refine.ls_percent_reflns_r_free
_refine.ls_r_factor_all
_refine.ls_r_factor_r_free
_refine.ls_r_factor_r_work
_refine.ls_wr_factor_r_free
_refine.ls_wr_factor_r_work
_refine.overall_su_b
_refine.overall_su_ml
_refine.pdbx_average_fsc_free
_refine.pdbx_average_fsc_overall
_refine.pdbx_average_fsc_work
_refine.pdbx_overall_esu_r
_refine.pdbx_overall_esu_r_free
_refine.pdbx_solvent_ion_probe_radii
_refine.pdbx_solvent_shrinkage_radii
_refine.pdbx_solvent_vdw_probe_radii
_refine.solvent_model_details
_refine_ls_restr.dev_ideal
_refine_ls_restr.dev_ideal_target
_refine_ls_restr.number
_refine_ls_restr.pdbx_refine_id
_refine_ls_restr.type
_refine_ls_shell.d_res_high
_refine_ls_shell.d_res_low
_refine_ls_shell.number_reflns_all
_refine_ls_shell.number_reflns_r_free
_refine_ls_shell.number_reflns_r_work
_refine_ls_shell.pdbx_fsc_free
_refine_ls_shell.pdbx_fsc_work
_refine_ls_shell.pdbx_refine_id
_refine_ls_shell.pdbx_total_number_of_bins_used
_refine_ls_shell.percent_reflns_obs
_refine_ls_shell.r_factor_all
_refine_ls_shell.r_factor_r_free
_refine_ls_shell.r_factor_r_work
_refine_ls_shell.wr_factor_r_work
_software.contact_author
_software.contact_author_email
_software.description
Thank you very much for the time taken in developing the parser and especially the python bindings.
I have a feature request for set_mmcif_category to take a table. Looping and modifying a list of dicts is easier to work with sometimes. The current functionality is also very useful (dict of lists).
I am seeing a memory issue when using gemmi::make_small_structure_from_block
. When running under valgrind, I get:
$ cat a.cpp
#include <gemmi/cif.hpp>
#include <gemmi/smcif.hpp>
int main (void) {
auto block = gemmi::cif::read_file("1544173.cif").sole_block();
gemmi::SmallStructure SiC = gemmi::make_small_structure_from_block(block);
}
$ g++ a.cpp -I $PWD/gemmi-0.4.1/include
$ valgrind ./a.out
==2764260== Memcheck, a memory error detector
==2764260== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==2764260== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==2764260== Command: ./a.out
==2764260==
==2764260== Conditional jump or move depends on uninitialised value(s)
==2764260== at 0x11B874: void gemmi::split_element_and_charge<gemmi::SmallStructure::Site>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, gemmi::SmallStructure::Site*) (in /home/coudert/a.out)
==2764260== by 0x11781B: gemmi::make_small_structure_from_block(gemmi::cif::Block const&) (in /home/coudert/a.out)
==2764260== by 0x10BBAF: main (in /home/coudert/a.out)
==2764260==
==2764260==
==2764260== HEAP SUMMARY:
==2764260== in use at exit: 0 bytes in 0 blocks
==2764260== total heap usage: 849 allocs, 849 frees, 367,416 bytes allocated
==2764260==
==2764260== All heap blocks were freed -- no leaks are possible
==2764260==
==2764260== For counts of detected and suppressed errors, rerun with: -v
==2764260== Use --track-origins=yes to see where uninitialised values come from
==2764260== ERROR SUMMARY: 50 errors from 1 contexts (suppressed: 0 from 0)
With another valgrind version, a slightly different message but I think it is the same issue:
==17170== Memcheck, a memory error detector
==17170== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==17170== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==17170== Command: ./cif
==17170==
==17170== Invalid read of size 1
==17170== at 0x6B545C: void gemmi::split_element_and_charge<gemmi::SmallStructure::Site>(std::string const&, gemmi::SmallStructure::Site*) (small.hpp:77)
==17170== by 0x6B1CB4: gemmi::make_small_structure_from_block(gemmi::cif::Block const&) (smcif.hpp:73)
==17170== by 0x6A4AD5: chemfiles::CIFFormat::init_() (CIF.cpp:67)
==17170== by 0x5E6EB1: chemfiles::CIFFormat::CIFFormat(std::string, chemfiles::File::Mode, chemfiles::File::Compression) (CIF.hpp:36)
Line 172 in f804883
When loading ccp4 map from MemoryStream, the position of the stream is not advanced, since the Stream is passed by value. This breaks the loader since the header bytes are read again as grid data later.
Passing the Stream f by reference fixes this issue.
Getting the "PDB" header (make_pdb_headers) from 6Y2E doesn't work because the length of the chain ID is longer than a single character. I suspect the below is the problem:
gemmi/include/gemmi/to_pdb.hpp
Line 476 in d59ef39
I see a similar problem with HELIX/SHEET.
I think the real problem here is that the multi-character chain IDs (AAA, BBB), cause incompatibility with the PDB spec and thus these sections.
At minimum perhaps detect the use of multi-character chain IDs and perhaps just don't write these sections.
We've recently gotten some segfaults reading certain mmCIF files. I've attached an example here.
6j40.cif.txt
We see this in the trace:
terminate called after throwing an instance of 'tao::pegtl::parse_error'
what(): string:107819:0(9901474): Wrong number of values in the loop
It seems to be due to badly formed MMCIF files, because in some cases when we download a revised version of those same MMCIF files the problem goes away.
Is there a way we can safeguard the parser to avoid this?
ATOM 1 N VAL 1 -16.669 -10.959 28.434 1.00 0.00
ATOM 2 H1 VAL 1 -17.186 -11.857 28.339 1.00 0.00
ATOM 3 H2 VAL 1 -17.171 -10.340 29.102 1.00 0.00
ATOM 4 H3 VAL 1 -16.611 -10.493 27.506 1.00 0.00
ATOM 5 CA VAL 1 -15.289 -11.229 28.944 1.00 0.00
ATOM 6 HA VAL 1 -15.385 -11.932 29.771 1.00 0.00
ATOM 7 CB VAL 1 -14.375 -11.856 27.866 1.00 0.00
ATOM 8 HB VAL 1 -14.217 -11.118 27.080 1.00 0.00
ATOM 9 CG1 VAL 1 -13.033 -12.232 28.471 1.00 0.00
ATOM 10 HG11 VAL 1 -12.398 -12.673 27.702 1.00 0.00
ATOM 11 HG12 VAL 1 -12.552 -11.340 28.872 1.00 0.00
ATOM 12 HG13 VAL 1 -13.185 -12.954 29.274 1.00 0.00
ATOM 13 CG2 VAL 1 -15.028 -13.077 27.254 1.00 0.00
ATOM 14 HG21 VAL 1 -14.368 -13.502 26.498 1.00 0.00
ATOM 15 HG22 VAL 1 -15.215 -13.818 28.031 1.00 0.00
ATOM 16 HG23 VAL 1 -15.973 -12.791 26.791 1.00 0.00
ATOM 17 C VAL 1 -14.653 -9.919 29.386 1.00 0.00
ATOM 18 O VAL 1 -14.750 -8.913 28.681 1.00 0.00
ATOM 19 N VAL 2 -14.017 -9.937 30.554 1.00 0.00
ATOM 20 H VAL 2 -13.943 -10.768 31.141 1.00 0.00
ATOM 21 CA VAL 2 -13.360 -8.760 31.109 1.00 0.00
ATOM 22 HA VAL 2 -13.357 -7.980 30.347 1.00 0.00
ATOM 23 CB VAL 2 -14.104 -8.228 32.362 1.00 0.00
ATOM 24 HB VAL 2 -14.122 -9.019 33.112 1.00 0.00
ATOM 25 CG1 VAL 2 -13.372 -7.021 32.952 1.00 0.00
ATOM 26 HG11 VAL 2 -13.910 -6.663 33.830 1.00 0.00
ATOM 27 HG12 VAL 2 -12.362 -7.313 33.239 1.00 0.00
ATOM 28 HG13 VAL 2 -13.322 -6.226 32.208 1.00 0.00
ATOM 29 CG2 VAL 2 -15.533 -7.860 32.012 1.00 0.00
ATOM 30 HG21 VAL 2 -16.041 -7.489 32.902 1.00 0.00
ATOM 31 HG22 VAL 2 -15.532 -7.086 31.245 1.00 0.00
ATOM 32 HG23 VAL 2 -16.054 -8.741 31.637 1.00 0.00
ATOM 33 C VAL 2 -11.949 -9.145 31.526 1.00 0.00
ATOM 34 O VAL 2 -11.743 -10.200 32.132 1.00 0.00
ATOM 35 N VAL 3 -10.980 -8.318 31.150 1.00 0.00
ATOM 36 H VAL 3 -11.139 -7.480 30.591 1.00 0.00
ATOM 37 CA VAL 3 -9.584 -8.537 31.507 1.00 0.00
ATOM 38 HA VAL 3 -9.459 -9.513 31.977 1.00 0.00
ATOM 39 CB VAL 3 -8.664 -8.525 30.259 1.00 0.00
ATOM 40 HB VAL 3 -8.780 -7.564 29.758 1.00 0.00
ATOM 41 CG1 VAL 3 -7.208 -8.688 30.664 1.00 0.00
ATOM 42 HG11 VAL 3 -6.579 -8.677 29.774 1.00 0.00
ATOM 43 HG12 VAL 3 -6.919 -7.868 31.322 1.00 0.00
ATOM 44 HG13 VAL 3 -7.080 -9.636 31.187 1.00 0.00
ATOM 45 CG2 VAL 3 -9.062 -9.632 29.304 1.00 0.00
ATOM 46 HG21 VAL 3 -8.408 -9.613 28.432 1.00 0.00
ATOM 47 HG22 VAL 3 -8.970 -10.596 29.805 1.00 0.00
ATOM 48 HG23 VAL 3 -10.094 -9.485 28.987 1.00 0.00
ATOM 49 C VAL 3 -9.203 -7.386 32.446 1.00 0.00
ATOM 50 O VAL 3 -9.320 -6.210 32.076 1.00 0.00
This fragment was generated by amber and seems to be complaint with PDB 3.0 nomenculature
Hi,
Is there a way to make a copy of a gemmi.Structure object in python? I want to write a function that takes a structure as input, makes a copy (so the original is not affected), makes some modifications (e.g. adding residues) and returns the modified structure.
I've tried copy.copy
and copy.deepcopy
but both give me the error:
TypeError: can't pickle gemmi.Structure objects
Any help would be appreciated.
Cheers,
Paul
Hi,
first of all thank you for this great lib.
I have a question about the order of operations (triplets) e.g. for P 1 21/n 1 gemmi gives:
x,y,z (identity) -x,-y,-z -x+1/2,y+1/2,-z+1/2 x+1/2,-y+1/2,z+1/2
which is fine but not in the order given by the tables, where the order would be
(e.g. http://img.chem.ucl.ac.uk/sgp/large/014hy1.htm)
x,y,z -x+1/2,y+1/2,-z+1/2 -x,-y,-z x+1/2,-y+1/2,z+1/2
so how can I get the order corresponding to the tables?
(this may be an obvious/stupid question, but I am not a crystallographer)
Regards
bm
Being able to move grid objects between python processes would be great for enabling multiprocessing using normal python libraries.
Would also enable pickleing, which may be useful to some.
I faced a problem reading .pdb
files generated by tleap
(from amber md package).
.pdb
file (4 protein chains):CRYST1 116.952 70.342 107.488 90.00 90.00 90.00 P 1 1
ATOM 1 N VAL 1 -16.669 -10.959 28.434 1.00 0.00
ATOM 2 H1 VAL 1 -17.186 -11.857 28.339 1.00 0.00
TER
ATOM 2081 N VAL 139 -23.085 -10.959 -28.434 1.00 0.00
ATOM 2082 H1 VAL 139 -22.568 -11.857 -28.339 1.00 0.00
TER
ATOM 4161 N VAL 277 23.085 12.972 28.434 1.00 0.00
ATOM 4162 H1 VAL 277 22.569 12.074 28.339 1.00 0.00
TER
ATOM 6241 N VAL 415 16.670 12.972 -28.434 1.00 0.00
ATOM 6242 H1 VAL 415 17.186 12.074 -28.339 1.00 0.00
TER
END
Sample script:
import gemmi
st = gemmi.read_pdb("test.pdb")
for model in st:
for chain in model:
print(chain)
Expected:
<gemmi.Chain with 1 res>
<gemmi.Chain with 1 res>
<gemmi.Chain with 1 res>
<gemmi.Chain with 1 res>
Actual:
<gemmi.Chain with 4 res>
PDB parsing is not an easy task. Most of troubles come from plenty of slightly varying (often undocumented) .pdb
flavors. Every library choose subset of this zoo to support. Support of amber .pdb
files may be outside of gemmi area of interest.
When a Grid
representing a reciprocal space is generated by get_f_phi_on_grid
, it applies a periodic boundary condition like a real space grid. I found this confusing. Is this intended?
Another related issue is that symmetrize_*
applies real space, space group operations, not reciprocal space, point group operations (and relevant phase relationships).
Hi guys, thanks for this tool. I've encountered segmentation faults for a couple of entries with the latest version (as of writing this post) of gemmi from the master branch and I wonder if anyone could have a look what is wrong. I'm using it from python.
OS: MacOS: 10.15.4
python: 3.7.7
import gemmi
path = '1to4-assembly-1_atom_site.cif.gz '
st = gemmi.read_structure(path)
st.setup_entities()
st.remove_ligands_and_waters()
cs = gemmi.ContactSearch(distance)
cs gemmi.ContactSearch.Ignore.SameChain
sub_cells = gemmi.SubCells(st[0], st.cell, 6)
sub_cells.populate() # here it blows up
It crashes for the assembly id 1 for all of the following structures:
1to4, 2gpc, 4xvv, 2hk3, 3n77, 3fbt, 3chi, 3esf, 2r3v, 4yiw, 2x4i, 5hm4, 3kbf, 4f2n, 3mux, 4whj
3zrd, 3glv, 3mnd, 4u04, 2goj, 1e59, 1tz3, 3clk, 2pmr, 3toe, 1y7y, 2xhf, 2wcu, 3p9x
You can download the assembly file from here: http://www.ebi.ac.uk/pdbe/static/entry/download/1to4-assembly-1_atom_site.cif.gz
Does the validation only ever report the first instance of an issue for a given item? E.g. I know that a mmCIF file has several new _pdbx_diffrn_data_section_contents.content_type values, but only ever get one problem report back.
Is that intential or a buglet?
Could we have wheel releases on pypi?
I am currently setting up CI builds for a new project which will be using gemmi, and it looks like building gemmi takes about 4 minutes, which adds up to some significant time with a build matrix of 4 python environments on 3 platforms.
Hi,
Thank you for this great package, I find it really useful. I was wondering if there is an easy way to find the matthews coefficient and the solvent content for a given mtz file (assuming we know the sequence). Basically I'm interested in estimating the number of copies in the ASU.
Thanks
Hi,
I am trying to parse a cif file 1ue4. My code looks like this:
doc = cif.read("1ue4.cif")
By other cif files it works normally, but by this file it raises an exception: Wrong number of values in the loop
(In the previous version it worked.)
Image processing real space density for many datasets is much easier when they are sampled to the same grid, but currently there is no easy way to do this that I can find: transform_f_phi_to_map often errors when given an exact size that is too small, and there isn't an easy way to find out what grid a map would bee fft'd to anyhow.
Also, because sometimes their preferred size is large, so being able to fft to the lowest common resolution would be helpful anyway when dealing with many datasets!
PanDDA in particular relies heavily on truncating datasets down to the lowest common resolution for its event finding, so this would be useful there!
File: smcif.hpp
Function: make_atomic_structure_from_block
In the case, when there is no occupancy pair defined in the cif file, it leads to the occ
member of the AtomicStructure::Site
being undefined / uninitialized. Maybe, for the sake of the high-level access, it would be better to set it to NAN
by default? For now, I am checking if the file has _atom_site_occupancy
.
Example:
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
H 1.0 1.0 1.0
ConnectionType.None
cannot be used from python since None
is a keyword.
Example:
import gemmi
print(gemmi.ConnectionType.Covale)
print(gemmi.ConnectionType.Disulf)
print(gemmi.ConnectionType.Hydrog)
print(gemmi.ConnectionType.MetalC)
print(gemmi.ConnectionType(4))
# print(gemmi.ConnectionType.None)
Syntax error (uncomment last line):
print(gemmi.ConnectionType.None)
^
SyntaxError: invalid syntax
Current work-around: ConnectionType.None
can be constructed as ConnectionType(4)
Having a function to calculate the RMSD between small molecules whose atoms are not necessarily listed in the same order seems like it would be potentially very useful, although it would need some graph matching and symmetry shenanigans!
Hi --
I have been writing a package for representing and manipulating reflection tables in pandas dataframes, and have been using gemmi
as a backend for some of the features related to IO of MTZ files and space group symmetry operations.
Currently, gemmi.Op.apply_to_hkl()
takes HKL indices as a list/tuple; however, gemmi.Op.phase_shift()
takes HKL indices as 3 subsequent int
arguments. Should these two call signatures be made consistent by changing phase_shift()
to take HKL indices as a list/tuple?
i.e. changing this:
phase_shift(self: gemmi.Op, h: int, k: int, l: int) -> float
to:
phase_shift(self: gemmi.Op, hkl: List[int[3]]) -> float
On a related note, it seems like it would be useful to vectorize these operations to work on lists/np.ndarrays of HKL indices. While it is easy to write a for-loop to handle it, it seems like a common enough use case that it might warrant overloading the functions to also handle call signatures like:
apply_to_hkl(self: gemmi.Op, hkl: List[ List[ int[3] ] ]) -> List[ List[ int[3] ] ]
and
phase_shift(self: gemmi.Op, hkl: List[ List[ int[3] ] ]) -> List[ float ]
These are really more of feature requests than "issues," but I look forward to your thoughts--
Thanks,
Jack
Hi @wojdyr, thanks for the package, I find it really useful. I was wondering if there is a way to change the spacegroup of a gemmi.Mtz
instance in python (including the reindexing of the reflections). For example, if I had a MTZ in P321
and I want to change it to P 31 2 1
I know it would possible to do mtz.spacegroup = gemmi.SpaceGroup('P 31 2 1')
but I have the impression this is just changing some attribute in gemmi.Mtz
and not reindexing the reflections. Thanks!
How can I cite gemmi in a scientific publication?
Thanks
Hi, I noticed that the following program causes segmentation fault in Python 3.6.4.
import gemmi
fn_mtz = "4MAA_map_coeffs.mtz"
mtz = gemmi.read_mtz_file(fn_mtz)
grid = mtz.get_f_phi_on_grid('2Fo-Fc', 'PH2Fo-Fc', mtz.get_size_for_hkl(sample_rate=3.0))
real_space = gemmi.transform_f_phi_grid_to_map(grid)
coord = gemmi.Position(0, 7.57895, 56.0842)
print(real_space.interpolate_value(coord))
The problematic MTZ file was created as:
phenix.fetch_pdb --mtz 4MAA
phenix.maps reflection_data.file_name=4MAA.mtz 4MAA.pdb map_coefficients.map_type=2Fo-Fc
It is available here: https://drive.google.com/open?id=1p87TDu4n9N15EVljGGu8U1YbunefEOmj
I am unsure if this is a bug in Gemmi or a bug in Sgtbx. Either way it is probably academic.
The following script can be used to confirm that Gemmi sometimes calculates an epsilon factor without centering which is twice the value given by Sgtbx. Oddly, this only applies to reflection 0,0,0
and affects all centric space groups.
#!/usr/bin/env cctbx.python
from cctbx import sgtbx
import gemmi
h = [0, 0, 0]
for s in list(sgtbx.space_group_symbol_iterator()):
xhm = s.universal_hermann_mauguin()
sg = sgtbx.space_group(s)
sgtbx_epsilon = sg.epsilon(h)
gemmi_go = gemmi.SpaceGroup(xhm).operations()
gemmi_epsilon = gemmi_go.epsilon_factor_without_centering(h)
if gemmi_epsilon != sgtbx_epsilon:
centric = gemmi_go.is_centric()
print(f"{xhm:>12}\t{centric}\t{gemmi_epsilon}\t{sgtbx_epsilon}")
Hi Marcin,
Thanks for the recent updates -- they have made our testing much easier. Would it be possible to push the most recent version to PyPI? I've seen it is on conda
, but I was hoping to make sure that pip
can reach it as well.
I also know that you just pushed a new release very recently, but please let me know if you are planning on pushing a v0.3.8 soon with the updates to gemmi.SpaceGroup
.
Thanks again,
Jack
Hi @wojdyr ,
First of all, thanks for this awesome tool !
Since two days ago I was reading the documentation and doing some testing in RCSB mmCIF files.
And when a was trying to get a particular model chain a few doubts came to me.
At the moment I did this to get all models:
doc = gemmi.cif.read_file('2mfe.cif')
block = doc.sole_block()
models = block.find('_atom_site.', ['group_PDB', 'id', 'type_symbol', 'label_atom_id', 'label_alt_id', 'label_comp_id', 'label_asym_id', 'label_entity_id', 'label_seq_id', 'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy', 'B_iso_or_equiv', 'pdbx_formal_charge', 'auth_seq_id', 'auth_comp_id', 'auth_asym_id', 'auth_atom_id', 'pdbx_PDB_model_num'])
models
Out: <gemmi.cif.Table 65280 x 21>
models = block.find_mmcif_category('_atom_site.')
models
Out: <gemmi.cif.Table 65280 x 21>
models = block.find_loop('_atom_site.group_PDB').get_loop()
models
<gemmi.cif.Loop 65280 x 21>
Now I got a problem...
How I could save in mmCIF or PDB (style PDBx) format those models from chain "D" in separate files?
I'm stack with this poorly solution:
for row in models:
if row[6] is 'D':
print('%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s ' % tuple(row))
On the other hand, I was trying Biopython (v1.76) PDB class, which they get above result doing this:
from Bio.PDB import MMCIFParser() as p
from Bio.PDB import MMCIFIO() as io
file_path = "2mfe.cif"
pdb_code = file_path.split('.')[0]
chain = "D"
file = open(file_path, "r")
structure_file = p.get_structure(pdb_code, file)
for model in structure_file:
filename = (pdb_code + "_" + str(model.id) + "_" + chain + ".cif")
io.set_structure(model[chain])
io.save(filename)
Biopython's results only have this loop data (missing '_atom_site.pdbx_formal_charge ', '_atom_site.auth_comp_id' and '_atom_site.auth_atom_id'):
data_pdb
#
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.auth_seq_id
_atom_site.auth_asym_id
_atom_site.pdbx_PDB_model_num
ATOM ....
...
HETATM ....
#
Hi,
Based on the descriptions in this paper, see top left column on pp.3 for more info, I learned the following fact:
The (010) surface in C2/c is equivalent to the (001) surface in I2/b. And these two space groups are related through a change-of-basis transformation. An I2/b cell is with tetragonal lattice while a C2/c cell is related to the conventional monoclinic cell.
But The above description is very abstract and difficult to understand, so I want to know whether I can have a better understanding of the problem based on the help of gemmi?
Regards,
HY
The CCDs NH2, NH, GTE, DML and TML fail to parse.
gemmi seems to fall over on the _pdbx_chem_comp_identifier.identifier
which has an unquoted string like $l^{2}-azane
.
gemmi returns either "Wrong number of values in the loop" if the value is in a loop or "expected value" if in a pair.
I suspect it has to do with the unquoted string but since I am on the python side, all I get is a RunTimeError at gemmi.cif.read(file_path)
method and nothing more.
Molecular <parent>.add_<child>
functions apparently invalidate early stored references. While invalidation is common in C++ realm it might be quite surprising behavior to python users. Probably it deserves to be documented.
import gemmi
pdb_string = """\
ATOM 1 N LYS A 12 8.016 44.382 113.565 1.00 33.28 N
"""
st = gemmi.read_pdb_string(pdb_string)
model = st[0]
chain = model[0]
residue = chain[0]
atom = residue[0]
st.add_model(model)
# print(model) # uncomment to SIGSEGV
model.add_chain(chain)
# print(chain) # uncomment to SIGSEGV
residue.add_atom(atom)
# print(atom) # uncomment to SIGSEGV
find_shortest_path in chemcomp.hpp fails to find shortest path between ("C4", "O6", {"C5", "C6"}) and ("C5", "C2", {"C6", "O6"}) for acedrg-generated KDA, KDO, KDR, KME and KOT. Fix not urgent - can use workaround for now.
As @kmdalton mentioned in #28, we have been implementing a few tests to make sure that our vectorized implementations of certain spacegroup-related operations in reciprocalspaceship
are correct. In testing some code that maps Miller indices to the reciprocalspace ASU, we noticed some differences between our "reference implementations" in cctbx
and gemmi
.
We iterate over spacegroups using sgtbx.space_group_symbol_iterator()
, which has 530 spacegroup settings, and use the extended Hermann-Mauguin for instantiating the gemmi.SpaceGroup
. We are finding inconsistencies for 47/530 settings. This list of 47 spacegroups can be generated using the example below:
from cctbx import sgtbx
from cctbx import miller
import gemmi
import numpy as np
def build_emptyMTZ(sg):
"""Build empty gemmi.MTZ"""
m = gemmi.Mtz()
m.spacegroup = sg
m.add_dataset('crystal')
m.add_column('H', 'H')
m.add_column('K', 'H')
m.add_column('L', 'H')
m.add_column('M/ISYM', 'Y')
return m
def gemmi_map2asu(H, symbol):
"""Map HKLs to reciprocal space ASU using gemmi"""
xhm = symbol.universal_hermann_mauguin()
sg = gemmi.SpaceGroup(xhm)
m = build_emptyMTZ(sg)
data = np.append(H, np.zeros((len(H), 1)), 1)
m.set_data(data)
m.switch_to_asu_hkl()
h = m.column_with_label('H').array.astype(int)
k = m.column_with_label('K').array.astype(int)
l = m.column_with_label('L').array.astype(int)
return np.vstack([h, k, l]).T
def cctbx_map2asu(H, symbol):
"""Map HKLs to reciprocal space ASU using cctbx"""
sg = sgtbx.space_group(symbol)
asu = sgtbx.reciprocal_space_asu(sg.type())
H_asu = np.vstack([miller.asym_index(sg, asu, i.tolist()).h() for i in H])
return H_asu
hmin,hmax = -5, 5
H = np.mgrid[hmin:hmax+1:1,hmin:hmax+1:1,hmin:hmax+1:1].reshape((3, -1)).T
for symbol in sgtbx.space_group_symbol_iterator():
xhm = symbol.universal_hermann_mauguin()
H_cctbx = cctbx_map2asu(H, symbol)
H_gemmi = gemmi_map2asu(H, symbol)
if not np.array_equal(H_cctbx, H_gemmi):
print(xhm)
I 1 2 1
I 1 1 2
I 2 1 1
P 1 n 1
P 1 1 n
P n 1 1
I 1 m 1
I 1 1 m
I m 1 1
A 1 n 1
I 1 a 1
I 1 c 1
B 1 1 n
I 1 1 b
A 1 1 n
I 1 1 a
C n 1 1
I c 1 1
B n 1 1
I b 1 1
I 1 2/m 1
I 1 1 2/m
I 2/m 1 1
P 1 2/n 1
P 1 1 2/n
P 2/n 1 1
P 1 21/n 1
P 1 1 21/n
P 21/n 1 1
A 1 2/n 1
I 1 2/a 1
I 1 2/c 1
B 1 1 2/n
I 1 1 2/b
A 1 1 2/n
I 1 1 2/a
C 2/n 1 1
I 2/c 1 1
B 2/n 1 1
I 2/b 1 1
R 3 :R
R -3 :R
R 3 2 :R
R 3 m :R
R 3 c :R
R -3 m :R
R -3 c :R
symbol = sgtbx.space_group_symbols("I 1 2 1")
hkl = np.array([[-5, -5, 1]])
print(cctbx_map2asu(hkl, symbol)) # [[-5 5 1]]
print(gemmi_map2asu(hkl, symbol)) # [[ 5 5 -1]]
Is this expected behavior, or should gemmi
and sgtbx
both be using the same conventions for the ASU?
A consequence of this issue is removal of TER
records in .pdb
round trip which cannot be overwritten by any flags.
import gemmi
st = gemmi.read_pdb_string("""\
ATOM 1 N THR A 1 137.600 117.719 224.191 1.00107.30 N
ATOM 2 CA THR A 1 136.249 117.607 224.728 1.00121.14 C
ATOM 3 C THR A 1 136.208 116.647 225.914 1.00114.05 C
ATOM 4 O THR A 1 135.371 116.784 226.806 1.00103.45 O
ATOM 5 CB THR A 1 135.254 117.130 223.655 1.00127.19 C
ATOM 6 H1 THR A 1 137.645 118.519 223.528 1.00107.30 H
ATOM 7 H2 THR A 1 138.271 117.877 224.970 1.00107.30 H
ATOM 8 H3 THR A 1 137.849 116.841 223.692 1.00107.30 H
ATOM 9 HA THR A 1 135.957 118.603 225.061 1.00121.14 H
ATOM 10 N ALA A 2 137.110 115.672 225.916 1.00117.01 N
ATOM 11 CA ALA A 2 137.250 114.762 227.047 1.00123.58 C
ATOM 12 C ALA A 2 138.018 115.456 228.166 1.00131.44 C
ATOM 13 O ALA A 2 137.725 115.279 229.359 1.00121.63 O
ATOM 14 CB ALA A 2 137.957 113.491 226.619 1.00112.08 C
ATOM 15 H ALA A 2 137.755 115.491 225.147 1.00117.01 H
ATOM 16 HA ALA A 2 136.260 114.489 227.414 1.00123.58 H
ATOM 17 HB1 ALA A 2 138.054 112.823 227.475 1.00112.08 H
ATOM 18 HB2 ALA A 2 137.379 112.999 225.837 1.00112.08 H
ATOM 19 HB3 ALA A 2 138.948 113.737 226.237 1.00112.08 H
TER
""")
print(st[0][0][0].entity_type)
print(st[0][0][1].entity_type)
output:
EntityType.Unknown
EntityType.Unknown
Hi Marcin,
I am having an issue with Gemmi to read/write mmcif files and I was wondering if you could tell me what I am doing wrong.
I have included a script that does the following:
import gemmi
def test(in_filename, out_filename):
# Read the structure
structure = gemmi.read_structure(in_filename)
# Write the structure
structure.make_mmcif_document().write_file(out_filename)
if __name__ == '__main__':
test("4v5d.cif", "out1.cif")
test("out1.cif", "out2.cif")
When I do this, I get the following error:
$ python test.py
Traceback (most recent call last):
File "test.py", line 36, in <module>
test("out1.cif", "out2.cif")
File "test.py", line 6, in test
structure = gemmi.read_structure(in_filename)
RuntimeError: out1.cif:149:0(2233): Wrong number of values in the loop
I have attached the script and the initial cif file here. file_and_script.zip
Thanks in advance for your help!
The set_mmcif_category has this code section
for (size_t col = 0; col != w; ++col) {
size_t idx = col;
for (auto handle : values[col]) {
std::string& val = loop.values[idx];
PyObject* ptr = handle.ptr();
if (handle.is_none()) {
val = "?";
} else if (ptr == Py_False) {
val = ".";
} else if (ptr == Py_True) {
throw py::value_error("unexpected value True");
} else if (raw || PyFloat_Check(ptr) || PyLong_Check(ptr)) {
val = py::str(handle);
} else {
val = quote(py::str(handle));
}
idx += w;
}
}
For add_row, there is no such binding forcing the List to be a List[str]. Could we allow for automatic conversion if raw=False (raw can be defaulted to True to avoid breaking existing usage).
Gemmi stops parsing mmcif file when encountered no values in loop. Below is the example from Refmac's output. This may be a violation of cif format and Refmac will be fixed, but can gemmi support reading this?
...
loop_
_chem_comp_tor.comp_id
_chem_comp_tor.id
_chem_comp_tor.atom_id_1
_chem_comp_tor.atom_id_2
_chem_comp_tor.atom_id_3
_chem_comp_tor.atom_id_4
_chem_comp_tor.value_angle
_chem_comp_tor.value_angle_esd
_chem_comp_tor.period
var_1 FES FE1 S2 FE2 S1 0.000 20.000 1
var_2 FES FE2 S2 FE1 S1 0.000 20.000 1
var_3 FES FE2 S1 FE1 S2 0.000 20.000 1
#
loop_
_chem_comp_chir.comp_id
_chem_comp_chir.id
_chem_comp_chir.atom_id
_chem_comp_chir.volume_flag
_chem_comp_chir.volume_three
loop_
_chem_comp_chir_atom.comp_id
_chem_comp_chir_atom.chir_id
_chem_comp_chir_atom.atom_id
#
data_comp_MA6
....
I'm quite new to gemmi and crystal summetry, so I apologize in advance is the following issue is just a non-sense but, when I retrieve retrieved sym_ops operations using gemmi for space group 141:
ops = gemmi.find_spacegroup_by_number(141).operations()
sym_ops = ops.sym_ops
Retrieved operations are:
x,y,z
-y,x+1/2,z+1/4
-x+1/2,-y+1/2,z+1/2
y+1/2,-x,z+3/4
x,-y+1/2,-z+1/4
-y,-x,-z
-x+1/2,y,-z+3/4
y+1/2,x+1/2,-z+1/2
-x,-y+1/2,-z+1/4
y,-x,-z
x+1/2,y,-z+3/4
-y+1/2,x+1/2,-z+1/2
-x,y,z
y,x+1/2,z+1/4
x+1/2,-y+1/2,z+1/2
-y+1/2,-x,z+3/4
However, those operations do not match with those listed for instance in the bilbao crystallographic database (https://www.cryst.ehu.es/cgi-bin/cryst/programs/nph-wp-list):
(x,y,z) (-x+1/2,-y,z+1/2) (-y+1/4,x+3/4,z+1/4) (y+1/4,-x+1/4,z+3/4)
(-x+1/2,y,-z+1/2) (x,-y,-z) (y+1/4,x+3/4,-z+1/4) (-y+1/4,-x+1/4,-z+3/4)
(-x,-y,-z) (x+1/2,y,-z+1/2) (y+3/4,-x+1/4,-z+3/4) (-y+3/4,x+3/4,-z+1/4)
(x+1/2,-y,z+1/2) (-x,y,z) (-y+3/4,-x+1/4,z+3/4) (y+3/4,x+3/4,z+1/4)
I was wondering if this is an error, as my calculation runs smoothly with bilbao's but not gemmi operations, or I am missing something important.
Thanks very much in advance.
I ran into some confusion regarding the R 3
family of spacegroups when writing tests related to spacegroups in reciprocalspaceship
. I understand that spacegroups such as R 3
can be represented in two different ways -- rhombohedral / hexagonal. However, when they are specified explicitly, it seems that GroupOps.find_centering()
does not return a value for hexagonal representations:
import gemmi
r3 = gemmi.SpaceGroup("R 3")
r3r = gemmi.SpaceGroup("R 3:R")
r3h = gemmi.SpaceGroup("R 3:H")
print(r3.operations().find_centering()) # prints '\x00'
print(r3r.operations().find_centering()) # prints 'P'
print(r3h.operations().find_centering()) # prints'\x00'
I am only showing this for R 3:H
, but it is also true for the 6 other related spacegroups. It seems to me that there should be a returned centering in these cases, although I can understand not including a centering in the slightly ambiguous case of R 3
.
This also may be a misunderstanding of these spacegroups on my part, but the following symmetry operations are found for R 3:H
: [<gemmi.Op("x,y,z")>, <gemmi.Op("-y,x-y,z")>, <gemmi.Op("-x+y,-x,z")>]
. However, those symops match the rhombohedral setting here: http://img.chem.ucl.ac.uk/sgp/large/146bz1.htm, and vice versa for R 3:R
. I am uncertain whether these two are mixed-up in gemmi
, or whether I am mixing them up in looking at the space group diagrams.
Thanks for your help!
Jack
Curretly, python side functions like transform_f_phi_to_map return gemmi.FloatGrid, a python class rather than the underlying C++ Grid.
This seems problematic as there doesn't seem to be a easy way to pass grids generated on the python side back to C++ for more intensive operations like resampling.
I think this could be fixed relatively easily by making the underlying C++ object accessible as a property?
Is it possible to get a list of categories in a block without looping? This is not so much a must have as it is a convenience function.
Use case: was trying to add a unit test for a function that removes categories and wanted to see categories before and after and confirm all the categories in the list have been removed. I had to use a list comprehension but thought it should be useful enough to have a convenience function
in the write_remarks function in to_pdb.hpp , could you add the REVDAT entry. I've pasted some code that should work below.
std::string token;
std::istringstream revNum(st.get_info("_pdbx_audit_revision_history.ordinal"));
std::vector<std::string> revNums;
while(std::getline(revNum, token, ';'))
{
revNums.push_back(token);
}
std::istringstream revDate(st.get_info("_pdbx_audit_revision_history.revision_date"));
std::vector<std::string> revDates;
while(std::getline(revDate, token, ';'))
{
token.erase(std::remove_if(token.begin(), token.end(), ::isspace), token.end());
revDates.push_back(token);
}
for(int i = (int) revNums.size() -1; i >= 0; --i)
{
WRITEU("REVDAT %3s %-9s %-51s",
revNums[i].c_str(), revDates[i].c_str(), st.get_info("_entry.id").c_str());
}
and in mmcif.hpp
std::string old_revnum_tag = "_database_PDB_rev.num";
std::string new_revnum_tag = "_pdbx_audit_revision_history.ordinal";
add_info(old_revnum_tag);
add_info(new_revnum_tag);
if (st.info.count(old_revnum_tag) == 1 && st.info.count(new_revnum_tag) == 0)
st.info[new_revnum_tag] = st.info[old_revnum_tag];
std::string old_revdate_tag = "_database_PDB_rev.date";
std::string new_revdate_tag = "_pdbx_audit_revision_history.revision_date";
add_info(old_revdate_tag);
add_info(new_revdate_tag);
if (st.info.count(old_revdate_tag) == 1 && st.info.count(new_revdate_tag) == 0)
st.info[new_revdate_tag] = st.info[old_revdate_tag];
For the gemmi.Mtz
class, would it be possible to have min_1_d2
and max_1_d2
(and hence resolution_low
and resolution_high
) calculated using the current data? Currently, if I create a new MTZ object they return nan
. It is possible to get the values by other means, but it isn't obvious why some methods don't work as expected. I'm guessing this also applies if you read an MTZ and modify the data.
Here is some code to reproduce:
import gemmi
import numpy
def create_mtz():
"""Create a new MTZ object as described in the documentation"""
mtz = gemmi.Mtz()
mtz.spacegroup = gemmi.find_spacegroup_by_name("P 21 21 2")
mtz.cell.set(77.7, 149.5, 62.4, 90, 90, 90)
mtz.add_dataset("HKL_base")
for label in ["H", "K", "L"]:
mtz.add_column(label, "H")
mtz.add_dataset("synthetic")
mtz.add_column("F", "F")
mtz.add_column("SIGF", "Q")
mtz.add_column("FREE", "I", dataset_id=0, pos=3)
data = numpy.array(
[[4, 13, 8, 1, 453.9, 19.12], [4, 13, 9, 0, 102.0, 27.31]], numpy.float
)
mtz.set_data(data)
return mtz
def test():
"""Print different MTZ properties"""
mtz = create_mtz()
print(mtz.nreflections) # 2
print(mtz.make_d_array()) # [6.1258473 5.678323 ]
print(mtz.min_1_d2) # nan
print(mtz.max_1_d2) # nan
print(mtz.resolution_low()) # nan
print(mtz.resolution_high()) # nan
if __name__ == "__main__":
test()
When checking against a controlled vocabulary, the list of allowed values is given as a plain, comma-separated list. This makes it very hard to read and doesn't take into account that the allowed values themselves might have a comma (if they are character items).
It would be nice to get (shortened)
some.cif:16 in data_1: _pdbx_diffrn_data_section_contents.content_type: ''X-Ray structure factor amplitudes(+)'' is not one of:
"X-ray structure factor amplitudes"
"X-ray structure factor intensities, unmerged"
(each allowed value on new line) instead of the current one-liner:
some.cif:16 in data_1: _pdbx_diffrn_data_section_contents.content_type: ''X-Ray structure factor amplitudes(+)'' is not one of: X-ray structure factor amplitudes, X-ray structure factor intensities, unmerged.
Hello, I would like to ask a question about the as_json method. I believe the default behaviour is for it to lowercase the names (line 24 in to_json.hpp) but I was wondering if there is any way in python to be able to set this behaviour to false? I see there is the mmjson value but this also changes other settings? Thanks in advance for any tips advice you can provide.
Hi @wojdyr,
I wonder if there is any plan on considering supporting multithreading in python. Presently, when I try to use Python's multiprocessing module it crashes with message similar to:
TypeError: can't pickle gemmi.ResidueSpan objects
I triad approach described here to add custom pickling to the class shown here: , but only managed to make another step before it crashed with
TypeError: gemmi.ResidueSpan: No constructor defined!
So I'm not sure if this is something that needs to be supportd by the library, or if I'm doing something wrong. Any help would be greatly appreciated!
edit: I've just noticed #13, but have not seen any comments there...
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.