Awesome Structural BioInformatics

A curated list of awesome structural bioinformatics frameworks, libraries, software and resources.

So let it not look strange if I claim that it is much easier to explain the movement of the giant celestial bodies than to interpret in mechanical terms the origination of just a single caterpillar or a tiny grass. - Immanuel Kant, Natural History and the Theory of Heaven, 1755

Books on Cheminformatics, Bioinformatics, Quantum Chemistry strangle the subject to sleep 😴 and command a wild price 🤑 for the naps they induce.

Want a better way to learn than some random repo on github?

Spend 4-12 years of your life and hundreds of thousands of dollars chasing a paper with a stamp on it 🥇.

Or feed yourself 🍼.

Information should be cheap, fast, enjoyable, silly, shared, disproven, contested, and most of all free.

Knowledge hodlers and innovation stiflers are boring and old. This is for the young of mind and young of spirit 🚼 that love to dock & fold.

Protein BioInformatics

Protein Folding

Structure-function relationships are the fundamental object of knowledge in protein chemistry; they allow us to rationally design drugs, engineer proteins with new functions, and understand why mutations cause disease. - On The Origin of Proteins

There is now a testable explanation for how a protein can fold so quickly: A protein solves its large global optimization problem as a series of smaller local optimization problems, growing and assembling the native structure from peptide fragments, local structures first. - The Protein Folding Problem

The protein folding problem consists of three closely related puzzles:

(a) What is the folding code?
(b) What is the folding mechanism?
(c) Can we predict the native structure of a protein from its amino acid sequence? source

Data Sources

CATH/Gene3D - 151 Million Protein Domains Classified into 5,481 Superfamilies

NCBI Conserved Domains Database - resource for the annotation of functional units in proteins

Protein Data Bank

Scop 2 - Structural Classification of Proteins

UniProt - comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

Fold@Home

Deep Learning Protein Folding

AlphaFold 13

💾 Code
💾 Code - Prospr - Open Source Implementation
📖 Prospr Paper
AlphaFold @ CASP13: What Just Happened?

MiniFold - Open Source toy example of AlphaFold 13 algorithm

The DeepMind work presented @ CASP was not a technological breakthrough (they did not invent any new type of AI) but an engineering one: they applied well-known AI algorithms to a problem along with lots of data and computing power and found a great solution through model design, feature engineering, model ensembling and so on...

Based on the premise exposed before, the aim of this project is to build a model suitable for protein 3D structure prediction inspired by AlphaFold and many other AI solutions that may appear and achieve SOTA results.

Two different residual neural networks (ResNets) are used to predict the angles between adjacent amino acids (AAs) and the distance between every pair of AAs of a protein. For distance prediction a 2D ResNet was used, while for angle prediction a 1D ResNet was used.

PDNet

As deep learning algorithms drive the progress in protein structure prediction, a lot remains to be studied at this merging superhighway of deep learning and protein structure prediction. Recent findings show that inter-residue distance prediction, a more granular version of the well-known contact prediction problem, is a key to predicting accurate models. However, deep learning methods that predict these distances are still in the early stages of their development. To advance these methods and develop other novel methods, a need exists for a small and representative dataset packaged for faster development and testing. In this work, we introduce protein distance net (PDNET), a framework that consists of one such representative dataset along with the scripts for training and testing deep learning methods. The framework also includes all the scripts that were used to curate the dataset, and generate the input features and distance maps.

💾 GitHub

📖 Paper

📼 YouTube
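The relationship between distance maps and contact maps mentioned in the PDNet abstract can be sketched in a few lines of plain Python. This is only an illustration, not PDNet's code: the coordinates are made up, and the 8 Å cutoff is a commonly used convention that is assumed here rather than taken from the source.

```python
import math

def distance_map(coords):
    """Return the symmetric matrix of Euclidean distances between all residue pairs."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def contact_map(dmap, threshold=8.0):
    """Binarize a distance map: 1 where two residues are closer than the threshold."""
    return [[1 if d < threshold else 0 for d in row] for row in dmap]

# Hypothetical C-alpha coordinates (in angstroms) for a four-residue toy chain.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

dmap = distance_map(ca_coords)   # real-valued target, as in distance prediction
cmap = contact_map(dmap)         # coarser binary target, as in contact prediction
```

The contact map throws away how far apart two residues are once they pass the cutoff, which is why distance prediction is described as the more granular problem.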

Protein - Ligand Docking

"Docking is a method which predicts the preferred orientation of one molecule to a second when bound to each other to form a stable complex. Knowledge of the preferred orientation in turn may be used to predict the strength of association or binding affinity between two molecules using scoring functions."

Pose - A conformation of the receptor and ligand molecules showing some intermolecular interactions (which may include hydrogen bonds as well as hydrophobic contacts).
Posing - The process of searching for a pose in which there are favorable interactions between the receptor and the ligand molecules.
Scoring - The process of evaluating a particular pose using descriptive features such as the number of intermolecular interactions, including hydrogen bonds and hydrophobic contacts.
The best docking algorithm should be the one with the best scoring function and the best searching algorithm. source
No single docking method performs well for all targets, and the quality of docking results is highly dependent on the ligand and binding site of interest. source

In the early 1990s, many HIV protease inhibitors that were later approved for treating HIV infection were developed using structure-based molecular docking. source

Saquinavir
Amprenavir

Scoring functions in molecular docking can be categorized into:

knowledge based - statistical potentials, frequency of interaction occurrence, Boltzmann distribution, dataset dependent
force-field based - energy functions via molecular mechanics, Coulombic interactions, van der Waals interactions (Lennard-Jones potential)
- CHARMM (Chemistry at HARvard Macromolecular Mechanics)
- AMBER (Assisted Model Building with Energy Refinement)
empirical - binding free energy calculated as the weighted sum of uncorrelated terms (for example hydrogen bonds, hydrophobicity); regression analysis finds the best weights for each term
- HYDE (part of BioSolveIT tools)
- ChemScore
- SCORE
consensus - combines several scoring function types into an ensemble
- X-CSCORE
- MultiScore
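To make two of these categories concrete, here is a minimal sketch of a force-field style pair energy (Lennard-Jones plus Coulomb) and an empirical weighted-sum score. Every parameter value, term name and weight below is invented for illustration; none of them is taken from CHARMM, AMBER, ChemScore or any other real scoring function.

```python
def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy for one atom pair separated by distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def coulomb(r, q1, q2, k=332.06):
    """Coulomb energy in kcal/mol; charges in units of e, r in angstroms.

    k is the approximate electrostatic constant in kcal*angstrom/(mol*e^2).
    """
    return k * q1 * q2 / r

def force_field_score(atom_pairs):
    """Sum van der Waals and electrostatic terms over receptor-ligand atom pairs."""
    return sum(
        lennard_jones(r, eps, sigma) + coulomb(r, q1, q2)
        for r, eps, sigma, q1, q2 in atom_pairs
    )

def empirical_score(terms, weights):
    """Weighted sum of interaction terms; real weights would come from regression."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical atom pairs: (distance, epsilon, sigma, charge_1, charge_2).
pairs = [(3.5, 0.1, 3.4, 0.4, -0.4), (4.0, 0.15, 3.6, 0.0, 0.2)]
ff = force_field_score(pairs)

# Hypothetical term counts for one pose and made-up regression weights.
terms = {"hydrogen_bonds": 3, "hydrophobic_contacts": 5, "rotatable_bonds": 2}
weights = {"hydrogen_bonds": -1.0, "hydrophobic_contacts": -0.2, "rotatable_bonds": 0.3}
emp = empirical_score(terms, weights)
```

A consensus score would then combine the outputs of several such functions, for example by averaging the rank each one assigns to a pose.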

One of the first appearances of Molecular Docking is said to have been 1982's

A Geometric Approach to MacroMolecule Ligand Interactions

They tell us Molecular Docking = "To position two molecules so that they interact favorably with one another..."

How???

Our approach is to reduce the number of degrees of freedom using simplifying assumptions that still retain some correspondence to a situation of biochemical interest. Specifically, we treat the geometric (hard sphere) interactions of two rigid bodies, where one body (the “receptor”) contains “pockets” or “grooves” that form binding sites for the second object, which we will call the “ligand”. Our goal is to fix the six degrees of freedom (3 translations and 3 orientations) that determine the best relative positions of the two objects.
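The six rigid-body degrees of freedom and the hard-sphere test described in this quote can be sketched as follows. This is an illustrative toy, not the 1982 algorithm: the Euler-angle convention, the single shared sphere radius and the atom coordinates are all assumptions made for the example.

```python
import math

def rotation_matrix(rx, ry, rz):
    """Rotation about x, then y, then z (angles in radians), i.e. R = Rz * Ry * Rx."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def place_ligand(atoms, pose):
    """Apply a rigid-body pose (tx, ty, tz, rx, ry, rz) to ligand atom centers."""
    tx, ty, tz, rx, ry, rz = pose
    rot = rotation_matrix(rx, ry, rz)
    placed = []
    for x, y, z in atoms:
        placed.append((
            rot[0][0] * x + rot[0][1] * y + rot[0][2] * z + tx,
            rot[1][0] * x + rot[1][1] * y + rot[1][2] * z + ty,
            rot[2][0] * x + rot[2][1] * y + rot[2][2] * z + tz,
        ))
    return placed

def count_clashes(receptor, ligand, radius=1.5):
    """Count receptor-ligand sphere pairs whose centers are closer than 2 * radius."""
    return sum(1 for a in receptor for b in ligand if math.dist(a, b) < 2 * radius)

# Hypothetical atom centers in angstroms.
receptor = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]

overlapping = count_clashes(receptor, place_ligand(ligand, (0, 0, 0, 0, 0, 0)))
moved = count_clashes(receptor, place_ligand(ligand, (2.0, 3.5, 0.0, 0, 0, math.pi / 2)))
```

Because both bodies stay rigid, a pose is fully described by those six numbers, which is what keeps the search space small enough to explore.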

Does the program reproduce known ligand-receptor geometries? If so, does it also provide alternative structures that are geometrically reasonable? To these ends, we have examined two systems for which the ligand receptor geometry has been established by crystallographic means.

What is the result of this Docking?

(1) Structures quite near the “correct” structures are readily recovered and identified as feasible solutions. (2) Other families of structures are found that are geometrically reasonable and that can be tested by simple scoring schemes, chemical intuition, or visual inspection with computer graphics.

Without allowing molecular flexibility, many aspects of ligand-receptor interactions are not properly described.

A common approach to docking combines a scoring function with an optimization algorithm. The scoring function quantifies the favorability of the protein-ligand interactions in a single pose, which can be conceptualized as a point in a continuous conformation space. A stochastic global optimization algorithm is used to explore and sample this conformation space. Then, local optimization is employed on the sampled points, usually by iteratively adjusting the pose in search of a local extremum of the scoring function. Ideally, the scoring function is differentiable to support efficient gradient-based optimization.
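A minimal sketch of that two-stage loop is given below. It is not how any particular docking program works: the two-dimensional "pose", the toy scoring function with one deep and one shallow well, and the use of plain random restarts for the global stage are all simplifying assumptions.

```python
import math
import random

def score(pose):
    """Toy differentiable score: global minimum near (1, -2), local minimum near (-2, 1)."""
    x, y = pose
    deep = -2.0 * math.exp(-((x - 1.0) ** 2 + (y + 2.0) ** 2))
    shallow = -1.0 * math.exp(-((x + 2.0) ** 2 + (y - 1.0) ** 2))
    return deep + shallow

def numerical_gradient(f, x, h=1e-5):
    """Central-difference gradient, standing in for an analytic derivative."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def local_optimize(f, start, step=0.1, iterations=300):
    """Plain gradient descent from one sampled pose."""
    x = list(start)
    for _ in range(iterations):
        g = numerical_gradient(f, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

def dock(f, bounds, n_samples=200, seed=0):
    """Stochastic global sampling followed by local refinement of every sample."""
    rng = random.Random(seed)
    best_pose, best_score = None, math.inf
    for _ in range(n_samples):
        start = [rng.uniform(lo, hi) for lo, hi in bounds]
        pose = local_optimize(f, start)
        value = f(pose)
        if value < best_score:
            best_pose, best_score = pose, value
    return best_pose, best_score

best_pose, best_score = dock(score, bounds=[(-4.0, 4.0), (-4.0, 4.0)])
```

Starts that land in the shallow well converge to the local minimum, which is why the global sampling stage is needed in addition to the gradient step.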

The information obtained from docking can be used to estimate the binding energy, free energy and stability of complexes. At present, docking is used to predict the tentative binding parameters of a ligand-receptor complex before it is studied experimentally.

Various databases offer information on small ligand molecules, such as CSD (Cambridge Structural Database), ACD (Available Chemicals Directory), MDDR (MDL Drug Data Report) and NCI (National Cancer Institute Database).

Scoring Function

There are two common approaches to building a score function:

potentials of mean force
- often called statistics- or Boltzmann-based force fields
- measuring distance as a reflection of statistical tendencies within proteins
- One takes a large set of proteins, collects statistics and converts them to a score function. One then expects this function to work well for proteins not included in its parameterisation.
an optimization calculation
- select underlying basis function
  - quasi-Lennard-Jones
  - various sigmoidal functions
- We can say that the correct structure is whatever is given in the protein data bank, but unfortunately, there is almost an infinity of incorrect structures for a sequence and one would like the score function to penalize all of them
- One way to encode this idea is to adopt a statistical approach and try to consider the distribution of incorrect structures source
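The first approach, converting collected statistics into a score, is often written as an inverse Boltzmann relation, E(r) = -kT ln(P_obs(r) / P_ref(r)). The sketch below assumes that form; the histogram counts, the pseudocount and the value of kT are illustrative choices, not values from the cited source.

```python
import math

def inverse_boltzmann(observed_counts, reference_counts, kT=0.593, pseudocount=1.0):
    """Turn per-bin distance counts into energies via E = -kT * ln(P_obs / P_ref).

    kT defaults to roughly its room-temperature value in kcal/mol.
    The pseudocount keeps empty bins from producing log(0).
    """
    n_bins = len(observed_counts)
    obs_total = sum(observed_counts) + pseudocount * n_bins
    ref_total = sum(reference_counts) + pseudocount * n_bins
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = (obs + pseudocount) / obs_total
        p_ref = (ref + pseudocount) / ref_total
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies

# Hypothetical counts of one residue-pair type in two distance bins.
observed = [30, 10]    # seen in a set of known structures
reference = [20, 20]   # expected if there were no preference
energies = inverse_boltzmann(observed, reference)
```

Bins that occur more often than the reference get a negative (favorable) energy, and bins that occur less often get a positive one.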

Allowing gaps and insertions at any position and of any length leads to a combinatorial explosion of possibilities. The calculation can be made tractable by restricting the search space and forbidding gaps except in recognised loops in template structures.
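One way to picture this restriction is a dynamic-programming alignment in which a gap may only be opened at template positions flagged as loops. The sketch below is an assumption-laden toy: the match, mismatch and gap scores, the loop flags and the sequences are invented, and real threading programs use far richer score functions.

```python
import math

def align_to_template(query, template, is_loop, match=2.0, mismatch=-1.0, gap=-1.0):
    """Global alignment score of query onto template with gaps allowed only at loops.

    is_loop[j] is True when template position j may be skipped or may be
    followed by an unaligned query residue.
    Returns -inf when no alignment satisfies the restriction.
    """
    n, m = len(query), len(template)
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            value = -math.inf
            if i > 0 and j > 0:
                pair = match if query[i - 1] == template[j - 1] else mismatch
                value = max(value, best[i - 1][j - 1] + pair)
            if j > 0 and is_loop[j - 1]:
                # Skip template position j (a deletion inside a loop).
                value = max(value, best[i][j - 1] + gap)
                if i > 0:
                    # Leave query residue i unaligned next to loop position j.
                    value = max(value, best[i - 1][j] + gap)
            best[i][j] = value
    return best[n][m]

# Hypothetical template whose third and fourth positions form a loop.
template = "ACDEFG"
loops = [False, False, True, True, False, False]
restricted = align_to_template("ACFG", template, loops)
no_gaps_allowed = align_to_template("ACFG", template, [False] * 6)
```

With the loop flags set, the only admissible alignment skips the two loop residues; with no loops flagged, a four-residue query cannot be placed on a six-residue template at all.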

There is a score function and a fast method for producing the best possible sequence to structure alignments and thus the best models possible. Unfortunately, the problem is still not solved.

Protein - Ligand Docking Tools

Tools for exploring how two or more molecular structures fit together

AutoDock - suite of automated docking tools designed to predict how small molecules bind to a receptor of known 3D structure

AutoDock Vina - significantly improves the average accuracy of the binding mode predictions compared to AutoDock

📖 Paper

Gnina - deep learning framework for molecular docking - inside deepchem (/dock/pose_generation.py)

GOMoDo - GPCR online modeling and docking server

Smina - a fork of AutoDock Vina geared toward scoring and minimization (local_only) rather than docking. For minimization it is much easier to use than Vina and 10-20x faster. Docking performance is about the same, since partial charge calculation and file I/O are not a big part of docking run time.

Appendix

Useful References

(2020) High-Throughput Docking Using Quantum Mechanical Scoring

(2020) Deep Learning Methods in Protein Structure Prediction

(2019) From Machine Learning to Deep Learning: Advances in scoring functions for protein-ligand docking

(2019) The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference, Molecular Biology and Evolution

(2018) DeepFam: deep learning based alignment-free method for protein family modeling and prediction

(2018) Derivative-free neural network for optimizing the scoring functions associated with dynamic programming of pairwise-profile alignment

(2017) Protein-Ligand Scoring with CNN

(2017) Quantum-chemical insights from deep tensor neural networks

(2014) MRFalign: Protein Homology Detection through Alignment of Markov Random Fields

(2012) Molecular Docking: A powerful approach for structure-based drug discovery

(2011) The structural basis for agonist and partial agonist action on a β(1)-adrenergic receptor

(2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading

(2009) Amphipol-Assisted in Vitro Folding of G Protein-Coupled Receptors

(2005) GPCR Folding and Maturation from The G Protein-Coupled Receptors Handbook

(1982) A Geometric Approach to MacroMolecule Ligand Interactions

Brief Explanation of AlphaFold Jax Architecture

AlphaFold2 (AF2) is DeepMind's state-of-the-art protein structure prediction model.

AF2 predicts the 3D coordinates of all atoms of a protein from its amino acid sequence and an alignment of homologous sequences.

PreProcessing
- Input Sequence
- Multiple Sequence Alignments
- Structural Templates
Transformer (EvoFormer)
Recycling
Structure Module -> 3D coordinates

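import jax
import jax.numpy as jnp

def softmax_cross_entropy(logits, labels):
  # Cross entropy between one-hot labels and the softmax of the logits.
  loss = -jnp.sum(labels * jax.nn.log_softmax(logits), axis=-1)
  return jnp.asarray(loss)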

If you didn't know JAX's nn.log_softmax, AF2's implementation would not mean much to you.

So going down the rabbit hole in JAX's nn we have the log_softmax function:

(The LogSoftmax function rescales elements to the range $(-\infty, 0)$.)

def log_softmax(x: Array, axis: Optional[Union[int, Tuple[int, ...]]] = -1) -> Array:  
  shifted = x - lax.stop_gradient(x.max(axis, keepdims=True))
  return shifted - jnp.log(jnp.sum(jnp.exp(shifted), axis, keepdims=True))

The accepted arguments are:

x : input array
axis: the axis or axes along which the log_softmax should be computed. Either an integer or a tuple of integers.

and an array is returned.
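To see why the maximum is subtracted before exponentiating, here is a plain-Python restatement of the same arithmetic for a single list of logits. It is a sketch for intuition only: it ignores the axis argument and gradients, and it is not JAX code.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    largest = max(xs)
    shifted = [x - largest for x in xs]            # largest entry becomes 0
    log_norm = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_norm for s in shifted]

def softmax_cross_entropy(logits, labels):
    """Cross entropy between one-hot labels and softmax(logits)."""
    return -sum(l * lp for l, lp in zip(labels, log_softmax(logits)))

# Large logits: exponentiating 1000.0 directly would overflow a float,
# but the shifted values are all <= 0, so math.exp stays in range.
stable = log_softmax([1000.0, 1000.0])

# Two equally likely classes give a loss of ln(2).
loss = softmax_cross_entropy([0.0, 0.0], [1.0, 0.0])
```

Subtracting the maximum does not change the result, because the shift cancels between the numerator and the normalizing sum.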

Inside this function we go further down the lane to:

lax.stop_gradient - is the identity function, that is, it returns argument x unchanged. However, stop_gradient prevents the flow of gradients during forward or reverse-mode automatic differentiation.

def stop_gradient(x):
  def stop(x):
    if (dtypes.issubdtype(_dtype(x), np.floating) or
        dtypes.issubdtype(_dtype(x), np.complexfloating)):
      return ad_util.stop_gradient_p.bind(x)
    else:
      return x  # only bind primitive on inexact dtypes, to avoid some staging
  return tree_map(stop, x)

This in turn relies upon tree_map

def tree_map(f: Callable[..., Any], tree: Any, *rest: Any,
                    is_leaf: Optional[Callable[[Any], bool]] = None) -> Any:
  
  leaves, treedef = tree_flatten(tree, is_leaf)
  all_leaves = [leaves] + [treedef.flatten_up_to(r) for r in rest]
  return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
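The idea behind tree_map, applying a function to every leaf of a nested container while keeping the container's shape, can be sketched without JAX. The version below is a simplified recursive stand-in, not JAX's flatten/unflatten implementation, and it only handles plain dicts, lists and tuples.

```python
def tree_map(f, tree):
    """Apply f to every leaf of a nested dict/list/tuple, preserving the structure."""
    if isinstance(tree, dict):
        return {key: tree_map(f, value) for key, value in tree.items()}
    if isinstance(tree, (list, tuple)):
        return type(tree)(tree_map(f, value) for value in tree)
    return f(tree)  # anything else is treated as a leaf

def halve_floats(leaf):
    """Act only on float leaves and pass every other leaf through unchanged."""
    return leaf / 2 if isinstance(leaf, float) else leaf

# Hypothetical parameter tree mixing float leaves with an integer counter.
params = {"w": [1.0, 2.0], "b": (4.0,), "step": 3}
halved = tree_map(halve_floats, params)
```

The leaf function here mirrors the shape of stop above, which binds the primitive only for floating or complex leaves and returns other leaves untouched.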

jnp.log
jnp.sum
jnp.exp

Automatic Differentiation Lecture Slides

Gans in Jax

Jax MD

Open Smiles - Get those old smiles out of here and imagine the wind in your hair in the driver's seat of open source smiles. The only problem - this project hasn't been updated in five years?

Fusion Proteins

ChimPipe - ChimPipe is a computational method for the detection of novel transcription-induced chimeric transcripts and fusion genes from Illumina Paired-End RNA-seq data. It combines junction spanning and paired-end read information to accurately detect chimeric splice junctions at base-pair resolution.

DeepNF - Deep network fusion for protein function prediction | 📖 paper

DeepPrior - predicts the probability of a gene fusion being a driver of an oncogenic process by directly exploiting the amino acid sequence of the fused protein, and it can prioritize gene fusions from different tumors. Unlike state-of-the-art tools, it also supports easy retraining and re-adaptation of the model | 📖 paper

DeFuse - gene fusion discovery using RNA-Seq data. The software uses clusters of discordant paired end alignments to inform a split read alignment analysis for finding fusion boundaries | 📖 paper

FusionCatcher - Finder of somatic fusion-genes in RNA-seq data

Jaffa - JAFFA is a multi-step pipeline that takes either raw RNA-Seq reads, or pre-assembled transcripts, then searches for gene fusions

StarFusion | 📖 paper
