biotite-dev / biotite Goto Github PK

View Code? Open in Web Editor NEW

581.0 17.0 92.0 181.9 MB

A comprehensive library for computational molecular biology

Home Page: https://www.biotite-python.org

License: BSD 3-Clause "New" or "Revised" License

Python 76.82% Cython 23.18%

bioinformatics computational-biology structural-biology sequence-analysis pdb-files

biotite's People

Contributors

Stargazers

Watchers

Forkers

padix-key bud-graziano inambioinfo dr-moreb aspirincode avestamh aneeshpanoli sudarshangc clementleong simoneperazzoli t0mdavid-m aaabioinfo thomasnevolianis rajash6 sailfish009 r614 mariaisal danielmuthama sefgit jacobanter tetukas ebetica martinjakopec babasaraki matthieurouland floscha minghao2016 maxgreil discngine ovaibhav-m ajshedivy alex123012 entropybit bethsauer samhsarni mojusko tinchogb wilgroves claudejrogers f-allain ammounaliza gezmi rnaimehaom fasemoreakinyemi maldil jappy0 wankiwi jhkru aurelg magnushhoie edikedik dongcf croydon-brixton ncoish micro-mega fengsanyunyan alirezaomidi jonfunk21 aariam genostack foldx-ai salmaneunus27 anaka nemstudios ferrervi dacarlin bradyajohnston ilnann fidan666 deonnez kseniaiii gigafurz dominiquefastus ordaosbio 5htdp orpowell gauravcodepro agitw dlnp2 0ut0fcontrol mrz-zz gabiaxel maick-code guruace brunoscaglione trichter tcjaffe junincloud tjmier ljarosch

biotite's Issues

'structure.info.residue()' fails for some residues

biotite.structure.info.residue() raises an IndexError for some residues, for example cyanide (CN):

Traceback (most recent call last):
  File "/home/kunzmann/data/coding/biotite-test/test.py", line 36, in <module>
    info.residue("CN")
  File "/home/kunzmann/data/coding/biotite/src/biotite/structure/info/atoms.py", line 108, in residue
    array.bonds = BondList(
  File "src/biotite/structure/bonds.pyx", line 157, in biotite.structure.bonds.BondList.__init__
    self._bonds = _to_positive_index_array(bonds, atom_count) \
  File "src/biotite/structure/bonds.pyx", line 789, in biotite.structure.bonds._to_positive_index_array
    raise IndexError(
IndexError: Index 3 is out of range for an atom count of 3

Setup does complain about missing NumPy and Cython package

When running Biotite's setup, when either Cython or numpy is not installed, causes a ModuleNotFoundError. This is due to the import statement prior to the point the setup is run.

Make atom ids available (and alterable) in biopython.structure

Sometimes it is neccessary to access and/or alter the atom ids in a pdb file. Right now this is not possible because atom ids are not stored when reading from pdb files and are assigned starting from 1when writing pdb files.

Imaging the following case: I want to read a protein structure from PDB and only save its CA atoms to a new file, but without altering their atom ids. Right now atom ids are reassigned starting from 1 and there is no way to alter this.

Possible implementation would be to add another field to biopython.structure.Atom which is initialized from the PDB file. biopython.structure.io.pdb.PDBFile#write could then allow to renumber atom ids or reuse the old ones:

pdb_file.set_structure(stack)
pdb_file.write("1l2y.pdb", renumber=True) # writes atom ids starting from 1 (current default)
pdb_file.write("1l2y.pdb") # keeps atom ids of the AtomArray

Error when indexing 'BondList' with empty selection

If a BondList is indexed with an index that contains no atoms, e.g. a boolean mask with false-only entries, the following exception is raised:

ValueError: zero-size array to reduction operation maximum which has no identity

How to remove the arrow head while plotting a feature map?

Hi,
While plotting the feature map for the operon (as show in your example), i would like to remove the arrow head. How do i do this?
For the plot_plasmid_map, i did figure out that i could use arrow_head_length=0 for removing the arrow head.
Is there a possibility to plot the plasmid map as linear rather than circular?

Thanks in advance!!

Add support for hybrid serial numbers in PDB format?

The current limitation of the pdb format is that it only supports serial numbers up to 99999 atoms and 9999 residues. An enhanced version of the pdb format called "hybrid-36 pdb" overcomes this by switching to a hex representation for those ids.

Example
base pdb format

ATOM      1  O           1       0.251  -0.360  -0.046  1.00  0.00
ATOM      2  H           1       0.249   0.684   0.231  1.00  0.00
ATOM      3  H           1       0.586  -0.954   0.791  1.00  0.00
...
ATOM  99997  O           1       0.251  -0.360  -0.046  1.00  0.00
ATOM  99998  H           1       0.249   0.684   0.231  1.00  0.00
ATOM  99999  H           1       0.586  -0.954   0.791  1.00  0.00
ATOM      1  O           1       0.251  -0.360  -0.046  1.00  0.00
ATOM      2  H           1       0.249   0.684   0.231  1.00  0.00
ATOM      3  H           1       0.586  -0.954   0.791  1.00  0.00

hybrid-36 pdb

ATOM      1  O           1       0.251  -0.360  -0.046  1.00  0.00
ATOM      2  H           1       0.249   0.684   0.231  1.00  0.00
ATOM      3  H           1       0.586  -0.954   0.791  1.00  0.00
...
ATOM  99997  O           1       0.251  -0.360  -0.046  1.00  0.00
ATOM  99998  H           1       0.249   0.684   0.231  1.00  0.00
ATOM  99999  H           1       0.586  -0.954   0.791  1.00  0.00
ATOM  A0000  O           1       0.251  -0.360  -0.046  1.00  0.00
ATOM  A0001  H           1       0.249   0.684   0.231  1.00  0.00
ATOM  A0002  H           1       0.586  -0.954   0.791  1.00  0.00

As you can see, the big advantage of this format is that it keeps backward compatibility if the residue and serial id are below the threshold for hex numbers - which is a clear advantage over other efforts to overcome the limitations of pdb.

This format was recently implemented in Modeller 9.22 and is also available an various other software packages like MDAnalysis. I think an implementation of this for biotite would not require a lot of code changes: The pdb reader can do this internally when parsing the serial and residue field. And for output, it could be either a modified version of PDBFile or a simple flag to alter the output numbering.

I'd like to put this up for diskussion here before making the effort to implement it. What's your opinion? Yay or nay ;)

structure.rmsd() should support ndarray as input

structure.rmsd() supports AtomArrays as input for the reference and subject. It should support ndarrays as well, making it easier to use with PDBFile().get_coord.

GRO writer atom_id overflow

When gro files with more then 100.000 atoms are written, the atom id field becomes to large, resulting in files that cannot be read by biotite anymore and also not displayed by PyMOL.

Steps to reproduce:
Create a file with more then 100.000 atoms. For examle:

gmx editconf -f spc216.gro -o huge_box.gro -box 20 20 20
gmx solvate -cp huge_box.gro -cs spc216.gro -o huge_box.gro
head huge_box.gro
> 216H2O,WATJP01,SPC216,SPC-MODEL,300K,BOX(M)=1.86206NM,WFVG,MAR. 1984
> 786570
>     1SOL     OW    1  10.221  10.629  10.122
>     1SOL    HW1    2  10.128  10.627  10.159
>     1SOL    HW2    3  10.222  10.590  10.030

Now if we read in this box and write it out again, we cant read it in a second time because of a file format violation at line 100002 of thefile:

import biotite.structure.io as io
s = io.load_structure("huge_box.gro")
print(len(s)) # 786570
io.save_structure("huge_box2.gro", s)
s = io.load_structure("huge_box2.gro")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bauer/conda/lib/python3.7/site-packages/biotite/structure/io/general.py", line 78, in load_structure
    array = file.get_structure()
  File "/home/bauer/conda/lib/python3.7/site-packages/biotite/structure/io/gro/file.py", line 190, in get_structure
    array.coord[m,i,0] = float(line[20:28])*10
ValueError: could not convert string to float: '0   3.70'

huge_box2.gro

33332SOL    HW299996   2.494   7.113  15.071
33333SOL     OW99997   3.512   7.306  16.291
33333SOL    HW199998   3.565   7.316  16.207
33333SOL    HW299999   3.485   7.396  16.325
33334SOL     OW100000   3.703   5.761  15.860
33334SOL    HW1100001   3.742   5.676  15.824
33334SOL    HW2100002   3.605   5.763  15.841
33335SOL     OW100003   2.125   5.912  15.616
33335SOL    HW1100004   2.046   5.963  15.582

The correct behaviour would be to start writing atom ids with 1 after atom #99.999.

database.entrez.efetch_single_file can not append to existing file

The function database.entrez.efetch_single_file() should have the possibility to append to an existing MultiGenBankFile to enable the user to store information from different fetch jobs in one file, as it is not possible to obtain information for more than about 300 ids at the same time. Also, the possibility to automatically delete an already obtained file should be added.

Error printing in biotite.structure.stack lacking information about where it fails

When throwing a error while creating an AtomArrayStack out of many structures, it is not shown which structure did cause the error.

Header parsing

Hey,
is there anything available or planned for pdb or cif header parsing and analysis?

Cheers

Peter

Doctests ignore docstrings in extension modules

Since extensions modules do not provide its source code, the doctest module cannot extract the docstring via the getsourcefile() function.
Consequently, the doctests in these docstrings are not executed by tests/test_doctest.py.

Cannot load single model from PDB file

The get_structure() method of PDBFile raises an IndexError if a single model is loaded from a PDB file with models of unequal length

`BondList` indexing does not work for unsorted index arrays

When a BondList is indexed with an index array, e.g. due to AtomArray indexing, the order of the indices is not retained.
The reason is that an index array is implicitly converted into a boolean mask:

biotite/src/biotite/structure/bonds.pyx

Line 537 in d5e7207

mask = _to_bool_mask(index, length=copy._atom_count)

This symptoms are completely non-sense bonds, as the atoms are properly reordered, but the atom indices of the bonds still point to the old position.

DsspApp unable to work on multicharacter chain identifiers

The issue

When using DsspApp.annotate_sse on an extracted chain, if the chain has more than one character the program raises an error.

Expected behaviour

annotate_sse should work no matter what the chain id.

Steps to reproduce

Minimal code to reproduce

from tempfile import gettempdir
import biotite.structure as struc
import biotite.structure.io.mmtf as mmtf
import biotite.database.rcsb as rcsb
import biotite.application.dssp as dssp


file_name = rcsb.fetch("4YBB", "mmtf", gettempdir())
mmtf_file = mmtf.MMTFFile.read(file_name)
array = mmtf.get_structure(mmtf_file, model=1)
tk_dimer = array[struc.filter_amino_acids(array)]
bb_chain= tk_dimer[tk_dimer.chain_id == "BB"]
sse = dssp.DsspApp.annotate_sse(bb_chain)

Actual behaviour

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "biotite/application/dssp/app.py", line 130, in annotate_sse
    app.join()
  File "biotite/application/application.py", line 58, in wrapper
    return func(*args, **kwargs)
  File "biotite/application/application.py", line 153, in join
    self.evaluate()
  File "biotite/application/dssp/app.py", line 71, in evaluate
    super().evaluate()
  File "biotite/application/localapp.py", line 236, in evaluate
    raise SubprocessError(
subprocess.SubprocessError: 'mkdssp' returned with exit code 1: DSSP could not be created due to an error: bad lexical cast: source type value could not be interpreted as target

temp file in windows in biotite/application/msaapp.py

Under windows (10)
in src/biotite/application/msaapp.py
There is a line
self._out_file = NamedTemporaryFile("r", suffix=".fa")
This should be
self._out_file = NamedTemporaryFile("w+t", suffix=".fa", delete=False)
otherwise one gets a permission denied error. Of course this means the file will not automatically be deleted, so in the function cleanup() near line 155,
self._in_file.close()
should be something like
self._in_file.close()
os.remove(self.get_output_file_path())

I don't know if this will cause tears under linux and I see that you lose the elegance of having a file automatically deleted.

pdb.get_structure should throw an error for unknown extra fields

Hi,

import biotite.structure.io as io
s = io.load_structure("1l2y.pdb", extra_fields="asdasd")
s.asdasd
> AttributeError: 'AtomArray' object has no attribute 'asdasd'

In the above example, load_structure does not throw an error for the passed extra_field even though its not a valid one. I think an error thrown during load_structure would be a nicer way to handle this because the caller immediately knows something went wrong.\

PDB element symbol should be guessed from atom name if not present

Strictly speaking, the elemet in a PDB structure (column 77-78) is part of the format specification and should always be set. Nevertheless, a lot of programs ignore this column and omit it in the output or only write it for backbone atoms leading to this column be only partially filled (happened to us).

Biotite on the other hand, requires this column for various functions including selection of atoms by their types which leads to an Error when these functions are called. The circumstance that this column is often not set can be worked arround by trying to guess the element from the the atom column (line 13-16), possibly in combination with a warning to the user.
If you are interested in a solution like that, I can make a pull request for this.

sequence.io.genbank.GenBankFile should also parse the Organism line

The GenBankFile object should also contain the ORGANISM field as a subfield of SOURCE, as otherwise the taxonomy of an entry can not be retrieved.

Atoms are not hashable

Right now, atoms can not be used as keys in hashmaps because they are not hashable. I.e the following code does not work:

struct = # some struct
my_map = {}
for atom in struct:
    if atom.chain_id == 'A':
        my_map[atom] = 'beatiful atom'
    else:
        my_map[atom] = 'not so beatiful atom'

my_atom = # some atom
print("This atom is %s" % my_map[my_atom])
          TypeError: unhashable type: 'Atom'

Unexpected change in output of 'Conservation of LexA DNA-binding site' example

The example 'Conservation of LexA DNA-binding site', looks different in 0.8.0 compared to 0.7.0. Especially M. tuberculosis is repeated often.

RCSB fasta fetch tries to use a removed api

It seems like rcsb fasta fetch api has been removed/replaced by a new one on 9. Dec. Biotite still uses the old api and therefore fails to return a correct fasta file, but writes the HTML error page to the file instead:

Minimal working code:

# download fasta
import biotite.database.rcsb as rcsb
file = rcsb.fetch("5u6p", "fasta", ".", overwrite=True, verbose=True)
# Fetching file 1 / 1 (5u6p)...
# Done

The resulting 5u6p.fasta contains no fasta, but a html file:

# reading file content
open("./5u6p.fasta", "r").read()
#'<!DOCTYPE html><html lang="en"><head><script src=.....<lots of more stuff..>.'

Opening the file with a webbrowser reveals the following content:

We're sorry, the page you requested is unavailable:
http://www.rcsb.org/pdb/download/downloadFastaFiles.do?structureIdList=5u6p&compressionType=uncompressed

It has been permanently removed as part of our announced shutdown on December 9th, 2020.

See the Legacy API Shutdown December 9 announcement for further details.

pdb get_structure issues with gzip opened files

Hi,

I wondered whether support for gzipped pdb files and others is possible right now or not. I tried to do this:

pdbfile: biotite.structure.io.pdb.PDBFile = pdb.PDBFile.read(gzip.open("5po6.pdb.gz")) structure = pdb.get_structure(pdbfile)

but this fails in the get structure function.

PDBFile.read really only supports filenames, not handles?

Cheers

Peter

Charge annotation's data type incorporated in 'structure.info.residue' is a tuple with the elements as string

Peculiarly, the data type of the charge annotation delivered by the function 'structure.info.residue' is a tupel instead of a NumPy array, and the individual elements are strings. One would expect the charge annotation to be a NumPy array and the individual elements to be integers.

fluoromethane = info.residue("CF0")
fluoro_charges = fluoromethane.charge
print(fluoro_charges)

The result is a tuple with strings as elements.

FASTA parsing error

While parsing a larger FASTA file, I hit an error for a specific protein sequence (see below).

I could reproduce this behaviour with smaller sequences down to FU (not, however, MU). It seems, some sequences with char U may be recognized as NucleotideSequence instead of ProteinSequence.

Steps to reproduce

# Download the sequence
fasta_file = fasta.FastaFile.read('sequence.fasta')
avidin_seq, streptavidin_seq = fasta.get_sequences(fasta_file).values()

Error

---------------------------------------------------------------------------
AlphabetError                             Traceback (most recent call last)
~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in _convert_to_sequence(seq_str)

~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/alphabet.py in encode_multiple(self, symbols, dtype)

src/biotite/sequence/codec.pyx in biotite.sequence.codec.encode_chars()

AlphabetError: Symbol 'F' is not in the alphabet

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-223-c3874679f080> in <module>
----> 1 biotite_fasta.get_sequences(s0).values()

~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in get_sequences(fasta_file)

~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in _convert_to_sequence(seq_str)

ValueError: FASTA data cannot be converted either to 'NucleotideSequence' nor to 'ProteinSequence'

Creating an AtomArray from atoms with optional annotations fails

Creating a new atomarray from atoms containing extra fields produces an error. Please see the example below

from biotite.database.rcsb import fetch
import biotite.structure.io as io
from biotite.structure import array

# get a pdb to test
file = fetch("5u6p", "pdb", ".")

# load an array without extra annotation and create a new arr from it's atoms
structure = io.load_structure(file)
atoms = [atom for atom in structure]
new_arr = array(atoms)  # works

# load an array with extra annotation
structure = io.load_structure(file, extra_fields=["atom_id"])
atoms = [atom for atom in structure]
new_arr = array(atoms)  # error

# > Traceback (most recent call last):
# >   File "test.py", line 16, in <module>
# >     new_arr = array(atoms)
# >   File "/home/😉/projects/biotite/src/biotite/structure/atoms.py", line 1175, in array
# >     array._annot[name][i] = atoms[i]._annot[name]
# > KeyError: 'atom_id'

Biotite is unable to read alignments from Fasta files

Popular sequence alignment programs like muscle use the _ (underscore) symbol to represent gaps in the alignment. However, Biotite only recognices the * symbol. For convience, fasta io should work with both formats and silently convert the different gap formats.

Support writing multiple `AtomArray` to structure file

The pdb specification allows pdb files with multiple different models. biotite can read these files, but there is no way to write them.

Incongruent model IDs for Structure files that don't begin with model ID 1

As of now, I only detect this kind of file when parsing model by model
was needed due to the models having an unequal amount of atoms.

There are different behaviors for Biotite's parsers.

The PDB and MMTF parsers treat files as though their first model ID is 1, i.e.,
when passing a model ID to pdb.get_structure and mmtf.get_structure, it will be
treated as an ordinal number, e.g., ID 7 gets the 7th model, no matter what its
actual ID is.
When using a model ID that goes over the total model count, the corresponding
ValueError is issued:
"The file has n models, the given model m does not exist"
This is correctly showing the number of models in the file.

The PDBx parser treats files as though they have as many models as the highest
model ID suggests.
The model ID passed to pdbx.get_structure is not treated as an ordinal, but as
an actual model ID, e.g., in a file that has models 21 to 40, passing 1 to 20
for the model ID will return empty AtomArray instances, while the present
models can be reached by their corresponding IDs.
When using a model ID that goes over the highest model ID, again the
ValueError is issued, but with n not standing for the actual count of models,
but for the highest model ID.

Since I'm building an MMTF database based on PDBx files, I can't safely use the
model IDs inferred while parsing.
For now I'm building a blacklist of PDB IDs, but I don't know if there are
problematic files that I'm not catching with my current method.
For my use case, it would be preferable if all parsers behaved like the PDBx
parser.

On a related note, there are two things that I would find useful, that I
currently don't know how to do with Biotite:

Read out the available model IDs of a file before parsing a model from it
Attach a model ID attribute to AtomArray instances upon parsing

Examples (model IDs):

1crr (21 to 40)
1ezc (18 to 34)
1ezd (35 to 50)

"biotite.structure.io.mmtf.get_structure" doesn't register peptide bonds

Similar to "biotite.structure.connect_via_distances", "biotite.structure.io.mmtf.get_structure" seems to have difficulties registering bonds between individual residues. The example is exactly the same as in issue #266:

import biotite.database.rcsb as rcsb
from biotite.structure.io.mmtf import MMTFFile
from biotite.structure.io.mmtf import get_structure

file_path = rcsb.fetch("1l2y", "mmtf")
mmtf_file = MMTFFile.read(file_path)
mini_protein = get_structure(pdb_file, model = 1, include_bonds=True)
first_res = mini_protein[0:16]
second_res = mini_protein[16:35]
res_one_and_two = first_res + second_res
res_one_and_two_without_hyd = first_res_without_hyd +
second_res_without_hyd
print(res_one_and_two_without_hyd.bonds)

[[ 0 1 1]
[ 1 2 1]
[ 2 3 2]
[ 1 4 1]
[ 4 5 1]
[ 5 6 2]
[ 5 7 1]
[ 8 9 1]
[ 9 10 1]
[10 11 2]
[ 9 12 1]
[12 13 1]
[13 14 1]
[13 15 1]]

Here again, the bond [2 8 1] representing the bond between the nitrogen atom of the amino group of the first residue and the carbon atom of the carboxyl group of the second residue is missing, which must be a mistake.

Documentation

The zipped HTML documentation is going to be uploaded in this issue continuously.

xtc reader throws memory error even though it shouldn't

The xtc input reader intrinistically checks if the read in trajectory fits into memory before loading it - and throws a MemoryError if it does not fit. However, this check also fails when loading only parts of a trajectory:

import biotite.structure.io.xtc as xtc
f = XTCFile()

# this fails correctly because the file is too large
f.read("md.xtc")

# this should not fail because an xtc reduced to every10th frame works - but it does
f.read("md.xtc", step=10) 

# also fails even though it only loads 10 models!
f.read("md.xtc", stop=10)

Fasta convenience loader functions using extendened nucleotide alphabet instead of protein alphabet

Functions like get_sequence() may use the extended nucleotide alphabet, instead of the intended protein alphabet. This is because a few protein sequences have only symbols that are also valid for the extended nucleotide alphabet.

Shifted atom coordiantes when parsing single models from MMTF

Can occur for some models, if the models in the file have unequal amount of atoms and I have to parse single models. As far as I can tell, the first model is always parsed properly.
It looks like atom coordinates from the last residues appear in the first one and all other atom coordinates are shifted further back in the AtomArray. When exporting the models back into a file and looking at them in Chimera, the wrong atoms are bonded to each other.

Examples

2axd
1iey
1jy6

type and strandedness should be separated in GenBankFile.get_locus()

Right now, the GenBankFile object stores the molecule type and strandedness in the LOCUS section as one entry in the database. This should rather be separated.

Structure submodule: Replace errorprone element detection

A lot of programs do not write element names in pdb files. Gro files miss that column entirely. Right now, biotite tries to "guess" the element for these rows based on the atom name. However, due to the large varity of force fields and programs out there, this is very limited and fails often, i.e when:

atom name does not match the element name (1H, 2H)
Correct element abbreviation has more than one letter
For virtual sites in general (gromacs)

I suggest to limit the element detection to cases where the element can be predicted accurately (i.e "CA" almost always matches to element="C", "H*" and "[0-3]H" being "H" and "O*" is "O" is shared among almost every biological software). In less obvious cases, I suggest to place an empty string instead.

FastaFiles should accept Sequence

biotite/src/biotite/sequence/io/fasta/file.py

Line 106 in c8046d5

raise TypeError("'FastaFile' only supports sequence strings "

There is a check if FastaFile value is str. Shouldn't FastaFile accept Sequence as value? It would code if we use biotite Sequence methods such as complement().

backbone dihedral calculation without chain ids

The current function parameters for biotite.structure.dihedral_backbone require to supply a chain id to calculate backbone torsion angles. Because some file formats (.gro) do not support chain ids, this can make it a bit cumberstone to use this function. What is your opinion on making chain_id an optional parameter and calculate dihedrals on the whole protein if chain_id is None? This would also make it a lot easier to calculate dihedrals for multimers because we can call the function once to get dihedrals for all chains instead of calling it multiple times.

def dihedral_backbone(atom_array, chain_id=None):
    do stuff

'base_pairs()' does not work for structures with non-standard atom-order

The function structure.base_pairs() gives incorrect results, in my experience missing base pairs, if the input nucleic acid structure has an order for their atoms within each residue that is different from the RCSB standard.

The reason is that a standard base is superimposed on each nucleotide:

biotite/src/biotite/structure/basepairs.py

Lines 851 to 855 in d5e7207

    
           # Match the selected std_base to the base. 
        
           fitted, transformation = superimpose( 
        
               nucleotide[np.isin(nucleotide.atom_name, std_base.atom_name)], 
        
               std_base[np.isin(std_base.atom_name, nucleotide.atom_name)] 
        
           )

Even though missing atoms are checked before the superimposition, the correct order is not.
This means that atoms can be superimposed onto each other, that are not equivalent.

Chain IDs reduced to first 3 characters

Version: 0.22.0
The mmCIF and MMTF parsers only keep the first 3 characters of chain IDs.
This makes handling structures like 6n1d impractical, since information is lost
and duplicate chain IDs are produced.

biotite.database.rcsb.fetch should support fasta download

The biotite.database.rcsb.fetch function should support the download of fasta files from the RCSB PDB.

MuscleApp takes extremely long/does not finish

I am trying to align 99 sequences with an avg sequence length of 865 via muscle, but the process seems to be either locked or blocked in some way (muscle is visible in htop but no CPU is used).

Example code, mainly copy-pasta from the biotite gallery/tutorial:

import biotite.sequence.io.fasta as fasta
import biotite.sequence as seq
import biotite.database.entrez as entrez
import biotite.application.blast as blast
import biotite.application.muscle as muscle

# sequence
file = fasta.FastaFile.read("5u6p.fasta")
seq = seq.ProteinSequence(list(file.items())[0][1])

# Find homologous proteins using NCBI Blast
# Search only the UniProt/SwissProt database
blast_app = blast.BlastWebApp("blastp", seq, "swissprot", obey_rules=True)
blast_app.start()
blast_app.join()
alignments = blast_app.get_alignments()

# Get hit IDs for hits with score > 200
hits = []
for ali in alignments:
    if ali.score > 200:
        hits.append(ali.hit_id)
# Get the sequences from hit IDs
hit_seqs = []
for hit in hits:
    file_name = entrez.fetch(hit, ".", "fa", "protein", "fasta")
    fasta_file = fasta.FastaFile.read(file_name)
    hit_seqs.append(fasta.get_sequence(fasta_file))
print(len(hit_seqs)) # 99

app = muscle.MuscleApp(hit_seqs)
app.start()
print(app.get_command())  # muscle -in /tmp/tmpdx_2xr27.fa -out /tmp/tmpk_1i3huh.fa -tree1 /tmp/tmp9qre27e5.tree -tree2 /tmp/tmpe4h0rl3k.tree -seqtype protein
app.join()
alignment = app.get_alignment()
print(alignment)

Interestingly, when I run the command from app.get_command() manually, this produces the expected output file within ~15 sec, while the biotite script still didn't finish after 15 minutes. Finally, switching from MuscleApp to ClustalOmegaApp in the script works and produces an alignment. Therefore, this issue seems to be related to MuscleApp only.

Hopefully, this issue is reproducible on other machines at all! ;)

Biotite objects do no treat NumPy integral types as integer index

When a Biotite object, like an AtomArray, is indexed with a NumPy integral type object (e.g. int64),
the value is not treated as integer. Hence, the indexing operation fails or gives unintentional results.

Example:

import numpy as np
import biotite.structure as struc
import biotite.structure.io as strucio
array = strucio.load_structure("tests/structure/data/1l2y.mmtf")[0]
atom = array[np.int64(0)]

Raised Exception:

Traceback (most recent call last):
  File "/home/kunzmann/Documents/coding/biotite/test.py", line 20, in <module>
    atom = array[np.int64(0)]
  File "/home/kunzmann/Documents/coding/biotite/src/biotite/structure/atoms.py", line 594, in __getitem__
    return self._subarray(index)
  File "/home/kunzmann/Documents/coding/biotite/src/biotite/structure/atoms.py", line 206, in _subarray
    new_length = new_coord.shape[-2]
IndexError: tuple index out of range

This is caused by the isinstance(x, int) check, that evaluates to false for NumPy integral types.
A solution would be isinstance(x, numbers.Integral).

String representation of objects

It would be nice if we could add the __repr__ method to biotite objects such as sequences, etc. Representations are especially useful when working in Jupiter notebook.

"biotite.structure.connect_via_distances" doesn't register peptide bonds

"biotite.structure.connect_via_distances" seems to have difficulties registering bonds between individual residues. In this example, it is a peptide bond, i. e. the bond linking two amino acids. The first two residues of the mini protein 1l2y (asparagine and leucine) are extracted, hydrogen atoms are removed and their BondList is printed out based on "connect_via_distances":

file_path = rcsb.fetch("1l2y", "pdb")
pdb_file = PDBFile.read(file_path)
mini_protein = get_structure(pdb_file, model = 1)
mini_protein.bonds = connect_via_distances(mini_protein)
bond_list = mini_protein.bonds
first_res = mini_protein[0:16]
second_res = mini_protein[16:35]
res_one_and_two = first_res + second_res
res_one_and_two_without_hyd = res_one_and_two[res_one_and_two.element != "H"]
carbon_index_of_asn = np.where(res_one_and_two.atom_name == "C")[0][0]
print(carbon_index_of_asn)

nitr_index_of_leu = np.where(res_one_and_two_without_hyd.atom_name == "N")[0][1]
print(nitr_index_of_leu)

print(res_one_and_two_without_hyd.bonds)

[[ 0 1 0]
[ 1 2 0]
[ 2 3 0]
[ 1 4 0]
[ 4 5 0]
[ 5 6 0]
[ 5 7 0]
[ 8 9 0]
[ 9 10 0]
[10 11 0]
[ 9 12 0]
[12 13 0]
[13 14 0]
[13 15 0]]

The peptide bond, i. e. the bond between the carbon atom of the hydroxyl group of residue 1 (asparagine) and the nitrogen atom of the amino group of residue 2 (leucine) does not occur in the BondList (the entry [2 8 0] can't be found), which must be a mistake.

Using 'hbond()' on structures without hydrogen

When using hbond on a structure where no hydrogen atoms have been added an error should be thrown, informing you
of the total absence of hydrogen.
This is not the case, as can be seen, e.g., using hbond on https://www.rcsb.org/structure/1E7U.

Writing mmcif (pdbx) files needs whitespace at the end of _atom_site labels

In PDB mmcif files have a single white space at the end of _atom_site labels. Not having these white spaces makes the cif files created using biotite, sometimes results in errors in some other programs (MMAlign, TMAlign as far as I know). Just adding that space at the end of the label solves the problem.

Altloc ID handling does not work properly

Hi I just tried to figure out how small molecules are currently handled within the structure objects (AtomArrayStack).

If I extract for instance all residues using
residues: tuple=biotite.structure.get_residues(mystructure)

then I get a nice tuple containing water, ions and amino-acids, but I don't see the small molecule in this list for instance on structure 5po6 (residue name 8SS).

How can I access these?

NetworkX for resolving conflicting pseudoknot cliques

With NetworkX becoming a Biotite dependency in the next update (#283), we could think about replacing the functions for resolving conflicting regions in biotite.structure.pseudoknots() with graph based solutions:

_find_non_conflicting_regions() -> networkx.isolates()
_conflict_cliques() -> networkx.connected_components()

This could make the code cleaner and even increase the performance (although this need to be tested).

@tomtomhdx What is your opinion about this?

Switching setup to Poetry

Currently, Biotite uses the classical setuptools with a setup.py for building the package. However, with the changes introduced by PEP517, we can now use more modern and cleaner tools, like Poetry.

This branch contains the current effort of moving Biotite from setuptools to Poetry.
These are the issues that need to be resolved:

Filling the pyproject.toml with the necessary information
Adding a build script for compilation of extension modules
Resolve interaction with conda (python-poetry/poetry#105, python-poetry/poetry#190, python-poetry/poetry#1432)
Stabilize build system for C-extensions (python-poetry/poetry#2740)
Document changes to the build system
Document changes pytest/sphinx invocation (Poetry does not use a setup.py)
Update CI for Travis CI
Update CI for AppVeyor

	# Match the selected std_base to the base.
	fitted, transformation = superimpose(
	nucleotide[np.isin(nucleotide.atom_name, std_base.atom_name)],
	std_base[np.isin(std_base.atom_name, nucleotide.atom_name)]
	)

biotite-dev / biotite Goto Github PK

biotite's People

Contributors

Stargazers

Watchers

Forkers

biotite's Issues

The issue

Expected behaviour

Steps to reproduce

Actual behaviour

Steps to reproduce

Error

Recommend Projects

Recommend Topics

Recommend Org