biotite-dev / biotite Goto Github PK
View Code? Open in Web Editor NEWA comprehensive library for computational molecular biology
Home Page: https://www.biotite-python.org
License: BSD 3-Clause "New" or "Revised" License
A comprehensive library for computational molecular biology
Home Page: https://www.biotite-python.org
License: BSD 3-Clause "New" or "Revised" License
biotite.structure.info.residue()
raises an IndexError
for some residues, for example cyanide (CN
):
Traceback (most recent call last):
File "/home/kunzmann/data/coding/biotite-test/test.py", line 36, in <module>
info.residue("CN")
File "/home/kunzmann/data/coding/biotite/src/biotite/structure/info/atoms.py", line 108, in residue
array.bonds = BondList(
File "src/biotite/structure/bonds.pyx", line 157, in biotite.structure.bonds.BondList.__init__
self._bonds = _to_positive_index_array(bonds, atom_count) \
File "src/biotite/structure/bonds.pyx", line 789, in biotite.structure.bonds._to_positive_index_array
raise IndexError(
IndexError: Index 3 is out of range for an atom count of 3
When running Biotite's setup, when either Cython
or numpy
is not installed, causes a ModuleNotFoundError
. This is due to the import
statement prior to the point the setup is run.
Sometimes it is neccessary to access and/or alter the atom ids in a pdb file. Right now this is not possible because atom ids are not stored when reading from pdb files and are assigned starting from 1when writing pdb files.
Imaging the following case: I want to read a protein structure from PDB and only save its CA atoms to a new file, but without altering their atom ids. Right now atom ids are reassigned starting from 1 and there is no way to alter this.
Possible implementation would be to add another field to biopython.structure.Atom
which is initialized from the PDB file. biopython.structure.io.pdb.PDBFile#write
could then allow to renumber atom ids or reuse the old ones:
pdb_file.set_structure(stack)
pdb_file.write("1l2y.pdb", renumber=True) # writes atom ids starting from 1 (current default)
pdb_file.write("1l2y.pdb") # keeps atom ids of the AtomArray
If a BondList
is indexed with an index that contains no atoms, e.g. a boolean mask with false-only entries, the following exception is raised:
ValueError: zero-size array to reduction operation maximum which has no identity
Hi,
While plotting the feature map for the operon (as show in your example), i would like to remove the arrow head. How do i do this?
For the plot_plasmid_map, i did figure out that i could use arrow_head_length=0 for removing the arrow head.
Is there a possibility to plot the plasmid map as linear rather than circular?
Thanks in advance!!
The current limitation of the pdb format is that it only supports serial numbers up to 99999 atoms and 9999 residues. An enhanced version of the pdb format called "hybrid-36 pdb" overcomes this by switching to a hex representation for those ids.
Example
base pdb format
ATOM 1 O 1 0.251 -0.360 -0.046 1.00 0.00
ATOM 2 H 1 0.249 0.684 0.231 1.00 0.00
ATOM 3 H 1 0.586 -0.954 0.791 1.00 0.00
...
ATOM 99997 O 1 0.251 -0.360 -0.046 1.00 0.00
ATOM 99998 H 1 0.249 0.684 0.231 1.00 0.00
ATOM 99999 H 1 0.586 -0.954 0.791 1.00 0.00
ATOM 1 O 1 0.251 -0.360 -0.046 1.00 0.00
ATOM 2 H 1 0.249 0.684 0.231 1.00 0.00
ATOM 3 H 1 0.586 -0.954 0.791 1.00 0.00
hybrid-36 pdb
ATOM 1 O 1 0.251 -0.360 -0.046 1.00 0.00
ATOM 2 H 1 0.249 0.684 0.231 1.00 0.00
ATOM 3 H 1 0.586 -0.954 0.791 1.00 0.00
...
ATOM 99997 O 1 0.251 -0.360 -0.046 1.00 0.00
ATOM 99998 H 1 0.249 0.684 0.231 1.00 0.00
ATOM 99999 H 1 0.586 -0.954 0.791 1.00 0.00
ATOM A0000 O 1 0.251 -0.360 -0.046 1.00 0.00
ATOM A0001 H 1 0.249 0.684 0.231 1.00 0.00
ATOM A0002 H 1 0.586 -0.954 0.791 1.00 0.00
As you can see, the big advantage of this format is that it keeps backward compatibility if the residue and serial id are below the threshold for hex numbers - which is a clear advantage over other efforts to overcome the limitations of pdb.
This format was recently implemented in Modeller 9.22 and is also available an various other software packages like MDAnalysis. I think an implementation of this for biotite would not require a lot of code changes: The pdb reader can do this internally when parsing the serial and residue field. And for output, it could be either a modified version of PDBFile or a simple flag to alter the output numbering.
I'd like to put this up for diskussion here before making the effort to implement it. What's your opinion? Yay or nay ;)
structure.rmsd() supports AtomArrays as input for the reference and subject. It should support ndarrays as well, making it easier to use with PDBFile().get_coord.
When gro files with more then 100.000 atoms are written, the atom id field becomes to large, resulting in files that cannot be read by biotite anymore and also not displayed by PyMOL.
Steps to reproduce:
Create a file with more then 100.000 atoms. For examle:
gmx editconf -f spc216.gro -o huge_box.gro -box 20 20 20
gmx solvate -cp huge_box.gro -cs spc216.gro -o huge_box.gro
head huge_box.gro
> 216H2O,WATJP01,SPC216,SPC-MODEL,300K,BOX(M)=1.86206NM,WFVG,MAR. 1984
> 786570
> 1SOL OW 1 10.221 10.629 10.122
> 1SOL HW1 2 10.128 10.627 10.159
> 1SOL HW2 3 10.222 10.590 10.030
Now if we read in this box and write it out again, we cant read it in a second time because of a file format violation at line 100002 of thefile:
import biotite.structure.io as io
s = io.load_structure("huge_box.gro")
print(len(s)) # 786570
io.save_structure("huge_box2.gro", s)
s = io.load_structure("huge_box2.gro")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bauer/conda/lib/python3.7/site-packages/biotite/structure/io/general.py", line 78, in load_structure
array = file.get_structure()
File "/home/bauer/conda/lib/python3.7/site-packages/biotite/structure/io/gro/file.py", line 190, in get_structure
array.coord[m,i,0] = float(line[20:28])*10
ValueError: could not convert string to float: '0 3.70'
huge_box2.gro
33332SOL HW299996 2.494 7.113 15.071
33333SOL OW99997 3.512 7.306 16.291
33333SOL HW199998 3.565 7.316 16.207
33333SOL HW299999 3.485 7.396 16.325
33334SOL OW100000 3.703 5.761 15.860
33334SOL HW1100001 3.742 5.676 15.824
33334SOL HW2100002 3.605 5.763 15.841
33335SOL OW100003 2.125 5.912 15.616
33335SOL HW1100004 2.046 5.963 15.582
The correct behaviour would be to start writing atom ids with 1 after atom #99.999.
The function database.entrez.efetch_single_file() should have the possibility to append to an existing MultiGenBankFile to enable the user to store information from different fetch jobs in one file, as it is not possible to obtain information for more than about 300 ids at the same time. Also, the possibility to automatically delete an already obtained file should be added.
When throwing a error while creating an AtomArrayStack out of many structures, it is not shown which structure did cause the error.
Hey,
is there anything available or planned for pdb or cif header parsing and analysis?
Cheers
Peter
Since extensions modules do not provide its source code, the doctest
module cannot extract the docstring via the getsourcefile()
function.
Consequently, the doctests in these docstrings are not executed by tests/test_doctest.py
.
The get_structure()
method of PDBFile
raises an IndexError
if a single model is loaded from a PDB file with models of unequal length
When a BondList
is indexed with an index array, e.g. due to AtomArray
indexing, the order of the indices is not retained.
The reason is that an index array is implicitly converted into a boolean mask:
biotite/src/biotite/structure/bonds.pyx
Line 537 in d5e7207
This symptoms are completely non-sense bonds, as the atoms are properly reordered, but the atom indices of the bonds still point to the old position.
When using DsspApp.annotate_sse
on an extracted chain, if the chain has more than one character the program raises an error.
annotate_sse
should work no matter what the chain id.
Minimal code to reproduce
from tempfile import gettempdir
import biotite.structure as struc
import biotite.structure.io.mmtf as mmtf
import biotite.database.rcsb as rcsb
import biotite.application.dssp as dssp
file_name = rcsb.fetch("4YBB", "mmtf", gettempdir())
mmtf_file = mmtf.MMTFFile.read(file_name)
array = mmtf.get_structure(mmtf_file, model=1)
tk_dimer = array[struc.filter_amino_acids(array)]
bb_chain= tk_dimer[tk_dimer.chain_id == "BB"]
sse = dssp.DsspApp.annotate_sse(bb_chain)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "biotite/application/dssp/app.py", line 130, in annotate_sse
app.join()
File "biotite/application/application.py", line 58, in wrapper
return func(*args, **kwargs)
File "biotite/application/application.py", line 153, in join
self.evaluate()
File "biotite/application/dssp/app.py", line 71, in evaluate
super().evaluate()
File "biotite/application/localapp.py", line 236, in evaluate
raise SubprocessError(
subprocess.SubprocessError: 'mkdssp' returned with exit code 1: DSSP could not be created due to an error: bad lexical cast: source type value could not be interpreted as target
Under windows (10)
in src/biotite/application/msaapp.py
There is a line
self._out_file = NamedTemporaryFile("r", suffix=".fa")
This should be
self._out_file = NamedTemporaryFile("w+t", suffix=".fa", delete=False)
otherwise one gets a permission denied error. Of course this means the file will not automatically be deleted, so in the function cleanup() near line 155,
self._in_file.close()
should be something like
self._in_file.close()
os.remove(self.get_output_file_path())
I don't know if this will cause tears under linux and I see that you lose the elegance of having a file automatically deleted.
Hi,
import biotite.structure.io as io
s = io.load_structure("1l2y.pdb", extra_fields="asdasd")
s.asdasd
> AttributeError: 'AtomArray' object has no attribute 'asdasd'
In the above example, load_structure
does not throw an error for the passed extra_field even though its not a valid one. I think an error thrown during load_structure would be a nicer way to handle this because the caller immediately knows something went wrong.\
Strictly speaking, the elemet in a PDB structure (column 77-78) is part of the format specification and should always be set. Nevertheless, a lot of programs ignore this column and omit it in the output or only write it for backbone atoms leading to this column be only partially filled (happened to us).
Biotite on the other hand, requires this column for various functions including selection of atoms by their types which leads to an Error when these functions are called. The circumstance that this column is often not set can be worked arround by trying to guess the element from the the atom column (line 13-16), possibly in combination with a warning to the user.
If you are interested in a solution like that, I can make a pull request for this.
The GenBankFile object should also contain the ORGANISM field as a subfield of SOURCE, as otherwise the taxonomy of an entry can not be retrieved.
Right now, atoms can not be used as keys in hashmaps because they are not hashable. I.e the following code does not work:
struct = # some struct
my_map = {}
for atom in struct:
if atom.chain_id == 'A':
my_map[atom] = 'beatiful atom'
else:
my_map[atom] = 'not so beatiful atom'
my_atom = # some atom
print("This atom is %s" % my_map[my_atom])
TypeError: unhashable type: 'Atom'
The example 'Conservation of LexA DNA-binding site', looks different in 0.8.0 compared to 0.7.0. Especially M. tuberculosis is repeated often.
It seems like rcsb fasta fetch api has been removed/replaced by a new one on 9. Dec. Biotite still uses the old api and therefore fails to return a correct fasta file, but writes the HTML error page to the file instead:
Minimal working code:
# download fasta
import biotite.database.rcsb as rcsb
file = rcsb.fetch("5u6p", "fasta", ".", overwrite=True, verbose=True)
# Fetching file 1 / 1 (5u6p)...
# Done
The resulting 5u6p.fasta contains no fasta, but a html file:
# reading file content
open("./5u6p.fasta", "r").read()
#'<!DOCTYPE html><html lang="en"><head><script src=.....<lots of more stuff..>.'
Opening the file with a webbrowser reveals the following content:
We're sorry, the page you requested is unavailable:
http://www.rcsb.org/pdb/download/downloadFastaFiles.do?structureIdList=5u6p&compressionType=uncompressed
It has been permanently removed as part of our announced shutdown on December 9th, 2020.
See the Legacy API Shutdown December 9 announcement for further details.
Hi,
I wondered whether support for gzipped pdb files and others is possible right now or not. I tried to do this:
pdbfile: biotite.structure.io.pdb.PDBFile = pdb.PDBFile.read(gzip.open("5po6.pdb.gz")) structure = pdb.get_structure(pdbfile)
but this fails in the get structure function.
PDBFile.read really only supports filenames, not handles?
Cheers
Peter
Peculiarly, the data type of the charge annotation delivered by the function 'structure.info.residue' is a tupel instead of a NumPy array, and the individual elements are strings. One would expect the charge annotation to be a NumPy array and the individual elements to be integers.
fluoromethane = info.residue("CF0")
fluoro_charges = fluoromethane.charge
print(fluoro_charges)
The result is a tuple with strings as elements.
While parsing a larger FASTA file, I hit an error for a specific protein sequence (see below).
I could reproduce this behaviour with smaller sequences down to FU
(not, however, MU
). It seems, some sequences with char U
may be recognized as NucleotideSequence instead of ProteinSequence.
# Download the sequence
fasta_file = fasta.FastaFile.read('sequence.fasta')
avidin_seq, streptavidin_seq = fasta.get_sequences(fasta_file).values()
---------------------------------------------------------------------------
AlphabetError Traceback (most recent call last)
~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in _convert_to_sequence(seq_str)
~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/alphabet.py in encode_multiple(self, symbols, dtype)
src/biotite/sequence/codec.pyx in biotite.sequence.codec.encode_chars()
AlphabetError: Symbol 'F' is not in the alphabet
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-223-c3874679f080> in <module>
----> 1 biotite_fasta.get_sequences(s0).values()
~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in get_sequences(fasta_file)
~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in _convert_to_sequence(seq_str)
ValueError: FASTA data cannot be converted either to 'NucleotideSequence' nor to 'ProteinSequence'
Creating a new atomarray from atoms containing extra fields produces an error. Please see the example below
from biotite.database.rcsb import fetch
import biotite.structure.io as io
from biotite.structure import array
# get a pdb to test
file = fetch("5u6p", "pdb", ".")
# load an array without extra annotation and create a new arr from it's atoms
structure = io.load_structure(file)
atoms = [atom for atom in structure]
new_arr = array(atoms) # works
# load an array with extra annotation
structure = io.load_structure(file, extra_fields=["atom_id"])
atoms = [atom for atom in structure]
new_arr = array(atoms) # error
# > Traceback (most recent call last):
# > File "test.py", line 16, in <module>
# > new_arr = array(atoms)
# > File "/home/๐/projects/biotite/src/biotite/structure/atoms.py", line 1175, in array
# > array._annot[name][i] = atoms[i]._annot[name]
# > KeyError: 'atom_id'
Popular sequence alignment programs like muscle use the _
(underscore) symbol to represent gaps in the alignment. However, Biotite only recognices the *
symbol. For convience, fasta io should work with both formats and silently convert the different gap formats.
The pdb specification allows pdb files with multiple different models. biotite can read these files, but there is no way to write them.
As of now, I only detect this kind of file when parsing model by model
was needed due to the models having an unequal amount of atoms.
There are different behaviors for Biotite's parsers.
The PDB and MMTF parsers treat files as though their first model ID is 1, i.e.,
when passing a model ID to pdb.get_structure
and mmtf.get_structure
, it will be
treated as an ordinal number, e.g., ID 7 gets the 7th model, no matter what its
actual ID is.
When using a model ID that goes over the total model count, the corresponding
ValueError
is issued:
"The file has n models, the given model m does not exist"
This is correctly showing the number of models in the file.
The PDBx parser treats files as though they have as many models as the highest
model ID suggests.
The model ID passed to pdbx.get_structure
is not treated as an ordinal, but as
an actual model ID, e.g., in a file that has models 21 to 40, passing 1 to 20
for the model ID will return empty AtomArray
instances, while the present
models can be reached by their corresponding IDs.
When using a model ID that goes over the highest model ID, again the
ValueError
is issued, but with n not standing for the actual count of models,
but for the highest model ID.
Since I'm building an MMTF database based on PDBx files, I can't safely use the
model IDs inferred while parsing.
For now I'm building a blacklist of PDB IDs, but I don't know if there are
problematic files that I'm not catching with my current method.
For my use case, it would be preferable if all parsers behaved like the PDBx
parser.
On a related note, there are two things that I would find useful, that I
currently don't know how to do with Biotite:
AtomArray
instances upon parsingExamples (model IDs):
Similar to "biotite.structure.connect_via_distances", "biotite.structure.io.mmtf.get_structure" seems to have difficulties registering bonds between individual residues. The example is exactly the same as in issue #266:
import biotite.database.rcsb as rcsb
from biotite.structure.io.mmtf import MMTFFile
from biotite.structure.io.mmtf import get_structure
file_path = rcsb.fetch("1l2y", "mmtf")
mmtf_file = MMTFFile.read(file_path)
mini_protein = get_structure(pdb_file, model = 1, include_bonds=True)
first_res = mini_protein[0:16]
second_res = mini_protein[16:35]
res_one_and_two = first_res + second_res
res_one_and_two_without_hyd = first_res_without_hyd +
second_res_without_hyd
print(res_one_and_two_without_hyd.bonds)
[[ 0 1 1]
[ 1 2 1]
[ 2 3 2]
[ 1 4 1]
[ 4 5 1]
[ 5 6 2]
[ 5 7 1]
[ 8 9 1]
[ 9 10 1]
[10 11 2]
[ 9 12 1]
[12 13 1]
[13 14 1]
[13 15 1]]
Here again, the bond [2 8 1] representing the bond between the nitrogen atom of the amino group of the first residue and the carbon atom of the carboxyl group of the second residue is missing, which must be a mistake.
The zipped HTML documentation is going to be uploaded in this issue continuously.
The xtc input reader intrinistically checks if the read in trajectory fits into memory before loading it - and throws a MemoryError if it does not fit. However, this check also fails when loading only parts of a trajectory:
import biotite.structure.io.xtc as xtc
f = XTCFile()
# this fails correctly because the file is too large
f.read("md.xtc")
# this should not fail because an xtc reduced to every10th frame works - but it does
f.read("md.xtc", step=10)
# also fails even though it only loads 10 models!
f.read("md.xtc", stop=10)
Functions like get_sequence()
may use the extended nucleotide alphabet, instead of the intended protein alphabet. This is because a few protein sequences have only symbols that are also valid for the extended nucleotide alphabet.
Can occur for some models, if the models in the file have unequal amount of atoms and I have to parse single models. As far as I can tell, the first model is always parsed properly.
It looks like atom coordinates from the last residues appear in the first one and all other atom coordinates are shifted further back in the AtomArray
. When exporting the models back into a file and looking at them in Chimera, the wrong atoms are bonded to each other.
Examples
Right now, the GenBankFile object stores the molecule type and strandedness in the LOCUS section as one entry in the database. This should rather be separated.
A lot of programs do not write element names in pdb files. Gro files miss that column entirely. Right now, biotite tries to "guess" the element for these rows based on the atom name. However, due to the large varity of force fields and programs out there, this is very limited and fails often, i.e when:
I suggest to limit the element detection to cases where the element can be predicted accurately (i.e "CA" almost always matches to element="C", "H*" and "[0-3]H" being "H" and "O*" is "O" is shared among almost every biological software). In less obvious cases, I suggest to place an empty string instead.
biotite/src/biotite/sequence/io/fasta/file.py
Line 106 in c8046d5
str
. Shouldn't FastaFile accept Sequence as value? It would code if we use biotite Sequence methods such as complement()
.The current function parameters for biotite.structure.dihedral_backbone
require to supply a chain id to calculate backbone torsion angles. Because some file formats (.gro) do not support chain ids, this can make it a bit cumberstone to use this function. What is your opinion on making chain_id
an optional parameter and calculate dihedrals on the whole protein if chain_id
is None
? This would also make it a lot easier to calculate dihedrals for multimers because we can call the function once to get dihedrals for all chains instead of calling it multiple times.
def dihedral_backbone(atom_array, chain_id=None):
do stuff
The function structure.base_pairs()
gives incorrect results, in my experience missing base pairs, if the input nucleic acid structure has an order for their atoms within each residue that is different from the RCSB standard.
The reason is that a standard base is superimposed on each nucleotide:
biotite/src/biotite/structure/basepairs.py
Lines 851 to 855 in d5e7207
Even though missing atoms are checked before the superimposition, the correct order is not.
This means that atoms can be superimposed onto each other, that are not equivalent.
Version: 0.22.0
The mmCIF and MMTF parsers only keep the first 3 characters of chain IDs.
This makes handling structures like 6n1d impractical, since information is lost
and duplicate chain IDs are produced.
The biotite.database.rcsb.fetch function should support the download of fasta files from the RCSB PDB.
I am trying to align 99 sequences with an avg sequence length of 865 via muscle, but the process seems to be either locked or blocked in some way (muscle is visible in htop but no CPU is used).
Example code, mainly copy-pasta from the biotite gallery/tutorial:
import biotite.sequence.io.fasta as fasta
import biotite.sequence as seq
import biotite.database.entrez as entrez
import biotite.application.blast as blast
import biotite.application.muscle as muscle
# sequence
file = fasta.FastaFile.read("5u6p.fasta")
seq = seq.ProteinSequence(list(file.items())[0][1])
# Find homologous proteins using NCBI Blast
# Search only the UniProt/SwissProt database
blast_app = blast.BlastWebApp("blastp", seq, "swissprot", obey_rules=True)
blast_app.start()
blast_app.join()
alignments = blast_app.get_alignments()
# Get hit IDs for hits with score > 200
hits = []
for ali in alignments:
if ali.score > 200:
hits.append(ali.hit_id)
# Get the sequences from hit IDs
hit_seqs = []
for hit in hits:
file_name = entrez.fetch(hit, ".", "fa", "protein", "fasta")
fasta_file = fasta.FastaFile.read(file_name)
hit_seqs.append(fasta.get_sequence(fasta_file))
print(len(hit_seqs)) # 99
app = muscle.MuscleApp(hit_seqs)
app.start()
print(app.get_command()) # muscle -in /tmp/tmpdx_2xr27.fa -out /tmp/tmpk_1i3huh.fa -tree1 /tmp/tmp9qre27e5.tree -tree2 /tmp/tmpe4h0rl3k.tree -seqtype protein
app.join()
alignment = app.get_alignment()
print(alignment)
Interestingly, when I run the command from app.get_command()
manually, this produces the expected output file within ~15 sec, while the biotite script still didn't finish after 15 minutes. Finally, switching from MuscleApp
to ClustalOmegaApp
in the script works and produces an alignment. Therefore, this issue seems to be related to MuscleApp
only.
Hopefully, this issue is reproducible on other machines at all! ;)
When a Biotite object, like an AtomArray
, is indexed with a NumPy integral type object (e.g. int64
),
the value is not treated as integer. Hence, the indexing operation fails or gives unintentional results.
Example:
import numpy as np
import biotite.structure as struc
import biotite.structure.io as strucio
array = strucio.load_structure("tests/structure/data/1l2y.mmtf")[0]
atom = array[np.int64(0)]
Raised Exception:
Traceback (most recent call last):
File "/home/kunzmann/Documents/coding/biotite/test.py", line 20, in <module>
atom = array[np.int64(0)]
File "/home/kunzmann/Documents/coding/biotite/src/biotite/structure/atoms.py", line 594, in __getitem__
return self._subarray(index)
File "/home/kunzmann/Documents/coding/biotite/src/biotite/structure/atoms.py", line 206, in _subarray
new_length = new_coord.shape[-2]
IndexError: tuple index out of range
This is caused by the isinstance(x, int)
check, that evaluates to false for NumPy integral types.
A solution would be isinstance(x, numbers.Integral)
.
It would be nice if we could add the __repr__
method to biotite objects such as sequences, etc. Representations are especially useful when working in Jupiter notebook.
"biotite.structure.connect_via_distances" seems to have difficulties registering bonds between individual residues. In this example, it is a peptide bond, i. e. the bond linking two amino acids. The first two residues of the mini protein 1l2y (asparagine and leucine) are extracted, hydrogen atoms are removed and their BondList is printed out based on "connect_via_distances":
file_path = rcsb.fetch("1l2y", "pdb")
pdb_file = PDBFile.read(file_path)
mini_protein = get_structure(pdb_file, model = 1)
mini_protein.bonds = connect_via_distances(mini_protein)
bond_list = mini_protein.bonds
first_res = mini_protein[0:16]
second_res = mini_protein[16:35]
res_one_and_two = first_res + second_res
res_one_and_two_without_hyd = res_one_and_two[res_one_and_two.element != "H"]
carbon_index_of_asn = np.where(res_one_and_two.atom_name == "C")[0][0]
print(carbon_index_of_asn)
2
nitr_index_of_leu = np.where(res_one_and_two_without_hyd.atom_name == "N")[0][1]
print(nitr_index_of_leu)
8
print(res_one_and_two_without_hyd.bonds)
[[ 0 1 0]
[ 1 2 0]
[ 2 3 0]
[ 1 4 0]
[ 4 5 0]
[ 5 6 0]
[ 5 7 0]
[ 8 9 0]
[ 9 10 0]
[10 11 0]
[ 9 12 0]
[12 13 0]
[13 14 0]
[13 15 0]]
The peptide bond, i. e. the bond between the carbon atom of the hydroxyl group of residue 1 (asparagine) and the nitrogen atom of the amino group of residue 2 (leucine) does not occur in the BondList (the entry [2 8 0] can't be found), which must be a mistake.
When using hbond on a structure where no hydrogen atoms have been added an error should be thrown, informing you
of the total absence of hydrogen.
This is not the case, as can be seen, e.g., using hbond on https://www.rcsb.org/structure/1E7U.
In PDB mmcif files have a single white space at the end of _atom_site labels. Not having these white spaces makes the cif files created using biotite, sometimes results in errors in some other programs (MMAlign, TMAlign as far as I know). Just adding that space at the end of the label solves the problem.
Hi I just tried to figure out how small molecules are currently handled within the structure objects (AtomArrayStack).
If I extract for instance all residues using
residues: tuple=biotite.structure.get_residues(mystructure)
then I get a nice tuple containing water, ions and amino-acids, but I don't see the small molecule in this list for instance on structure 5po6 (residue name 8SS).
How can I access these?
With NetworkX becoming a Biotite dependency in the next update (#283), we could think about replacing the functions for resolving conflicting regions in biotite.structure.pseudoknots()
with graph based solutions:
_find_non_conflicting_regions()
-> networkx.isolates()
_conflict_cliques()
-> networkx.connected_components()
This could make the code cleaner and even increase the performance (although this need to be tested).
@tomtomhdx What is your opinion about this?
Currently, Biotite uses the classical setuptools
with a setup.py
for building the package. However, with the changes introduced by PEP517, we can now use more modern and cleaner tools, like Poetry.
This branch contains the current effort of moving Biotite from setuptools
to Poetry.
These are the issues that need to be resolved:
pyproject.toml
with the necessary informationconda
(python-poetry/poetry#105, python-poetry/poetry#190, python-poetry/poetry#1432)pytest
/sphinx
invocation (Poetry does not use a setup.py
)A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.