
protein-sequence-models's Introduction

Pytorch modules and utilities for modeling biological sequence data.

Here we will demonstrate the application of several tools we hope will help with modeling biological sequences.

Installation

$ pip install sequence-models
$ pip install git+https://github.com/microsoft/protein-sequence-models.git  # bleeding edge, current repo main branch

Loading pretrained models

Models require PyTorch. We tested on v1.9.0, v1.11.0, and v1.12.0. If you installed into a clean conda environment, you may also need to install pandas, scipy, and wget.
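
For example, in a clean environment these extras can be installed with pip:

$ pip install pandas scipy wget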

To load a model:

from sequence_models.pretrained import load_model_and_alphabet

model, collater = load_model_and_alphabet('carp_640M')

Available models are

  • carp_600k
  • carp_38M
  • carp_76M
  • carp_640M
  • mif
  • mifst
  • bigcarp_esm1bfinetune
  • bigcarp_esm1bfrozen
  • bigcarp_random

Convolutional autoencoding representations of proteins (CARP)

We make available pretrained CNN protein sequence masked language models of various sizes. All of these have a ByteNet encoder architecture and are pretrained on the March 2020 release of UniRef50 using the same masked language modeling task as in BERT and ESM-1b.

CARP is described in this preprint.

You can also download the weights manually from Zenodo.

To encode a batch of sequences:

seqs = [['MDREQ'], ['MGTRRLLP']]
x = collater(seqs)[0]  # (n, max_len)
rep = model(x)  # (n, max_len, d_model)
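
If you want a single embedding per sequence, one minimal option (a sketch; it averages over every position, including padding) is to mean-pool the per-token representation:

seq_embeddings = rep.mean(dim=1)  # (n, d_model), averaged over max_len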

CARP also supports computing representations from arbitrary layers and the final logits.

rep = model(x, repr_layers=[0, 2, 32], logits=True)

Compute embeddings in bulk from FASTA

We provide a script that efficiently extracts embeddings in bulk from a FASTA file. A cuda device is optional and will be auto-detected. The following command extracts the final-layer embedding for a FASTA file from the CARP_640M model:

$ python scripts/extract.py carp_640M examples/some_proteins.fasta \
    examples/results/some_proteins_emb_carp_640M/ \
    --repr_layers 0 32 33 logits --include mean per_tok

Directory examples/results/some_proteins_emb_carp_640M/ now contains one .pt file per extracted embedding; use torch.load() to load them. scripts/extract.py has flags that determine what .pt files are included:

--repr_layers (default: final layer only) selects which layers to include embeddings from; 0 is the input embedding, and logits gives the per-token logits.

--include specifies what embeddings to save. You can use the following:

  • per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
  • mean includes the embeddings averaged over the full sequence, per layer (only valid for representations).
  • logp computes the average log probability per sequence and stores it in a csv (only valid for logits).

scripts/extract.py also has --batchsize and --device flags. For example, to use GPU 2 on a multi-GPU machine, pass --device cuda:2. The defaults are a batch size of 1 and cpu if cuda is not detected, or cuda:0 if it is.
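
A minimal sketch for reading the saved embeddings back (the exact file names and the structure of each .pt file depend on the model and the --include options you chose):

import glob
import torch

out_dir = 'examples/results/some_proteins_emb_carp_640M/'
for fpath in sorted(glob.glob(out_dir + '*.pt')):
    emb = torch.load(fpath)  # tensor or dict of tensors written by scripts/extract.py
    print(fpath, type(emb))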

Masked Inverse Folding (MIF) and Masked Inverse Folding with Sequence Transfer (MIF-ST)

We make available pretrained masked inverse folding models with and without sequence pretraining transfer from CARP-640M.

MIF and MIF-ST are described in this preprint.

You can also download the weights manually from Zenodo.

To encode a sequence with its structure:

from sequence_models.pretrained import load_model_and_alphabet
from sequence_models.pdb_utils import parse_PDB, process_coords
import torch

model, collater = load_model_and_alphabet('mifst')  # or 'mif'
coords, wt, _ = parse_PDB('examples/gb1_a60fb_unrelaxed_rank_1_model_5.pdb')
coords = {
    'N': coords[:, 0],
    'CA': coords[:, 1],
    'C': coords[:, 2]
}
dist, omega, theta, phi = process_coords(coords)
batch = [[wt, torch.tensor(dist, dtype=torch.float),
          torch.tensor(omega, dtype=torch.float),
          torch.tensor(theta, dtype=torch.float),
          torch.tensor(phi, dtype=torch.float)]]
src, nodes, edges, connections, edge_mask = collater(batch)
# can use result='repr' or result='logits'. Default is 'repr'.
rep = model(src, nodes, edges, connections, edge_mask)
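
To get per-residue logits instead of representations, a sketch using the result='logits' option noted in the comment above (the shape annotation is an assumption):

logits = model(src, nodes, edges, connections, edge_mask, result='logits')  # assumed (n, len, n_tokens)
log_probs = torch.log_softmax(logits, dim=-1)  # per-residue log-probabilities over the alphabet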

Compute embeddings in bulk from csv

We provide a script that efficiently extracts embeddings in bulk from a csv file. A cuda device is optional and will be auto-detected. The following command extracts final-layer embeddings for a csv file using the mifst model:

$ python scripts/extract_mif.py mifst examples/gb1s.csv \
    examples/ \
    examples/results/some_proteins_mifst/ \
    repr --include mean per_tok

Directory examples/results/some_proteins_mifst/ now contains one .pt file per extracted embedding; use torch.load() to load them. scripts/extract_mif.py has flags that determine what .pt files are included:

The syntax is:

$ python scripts/extract_mif.py <model> <csv_fpath> <pdb_dir> <out_dir> <result> --include <pooling options>

The input csv should have columns for name, sequence, and pdb. The script looks in pdb_dir for the filenames in the pdb column.
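
For example, a minimal input csv could be written with pandas (the output file name, sequence, and name entries below are placeholders; the pdb filename must exist in pdb_dir):

import pandas as pd

df = pd.DataFrame({
    'name': ['wt'],
    'sequence': ['MDREQ'],  # placeholder sequence
    'pdb': ['gb1_a60fb_unrelaxed_rank_1_model_5.pdb'],
})
df.to_csv('my_proteins.csv', index=False)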

The options for result are repr or logits.

--include specifies what embeddings to save. You can use the following:

  • per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
  • mean includes the embeddings averaged over the full sequence, per layer (only valid for representations).
  • logp computes the average log probability per sequence and stores it in a csv (only valid for logits).

scripts/extract_mif.py also has a --device flag. For example, to use GPU 2 on a multi-GPU machine, pass --device cuda:2. The default is cpu if cuda is not detected, or cuda:0 if it is.
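
For example, following the syntax above, average per-sequence log probabilities can be computed with the logits result and logp pooling:

$ python scripts/extract_mif.py mifst examples/gb1s.csv \
    examples/ \
    examples/results/some_proteins_mifst/ \
    logits --include logp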

Biosynthetic gene cluster CARP (BiGCARP)

We make available pretrained CNN Pfam domain masked language models of BGCs. All of these have a ByteNet encoder architecture and are pretrained on antiSMASH using the same masked language modeling task as in BERT and ESM-1b.

BiGCARP is described in this preprint. Training code is available here.

You can also download the weights and datasets manually from Zenodo.

To encode a batch of sequences:

bgc = [['#;PF07690;PF06609;PF00083;PF00975;PF12697;PF00550;PF14765'],
       ['t3pks;PF07690;PF06609;PF00083;PF00975;PF12697;PF00550;PF14765;PF00698']]
model, collater = load_model_and_alphabet('bigcarp_esm1bfinetune')
x = collater(bgc)[0]
rep = model(x)

Sequence Datasets and Dataloaders

In samplers.py, you will find two Pytorch sampler classes: SortishSampler, a sampler that sorts sequences of similar length into length-defined buckets; and ApproxBatchSampler, a batch sampler that draws sequences from the length-defined buckets until the batch reaches the specified approximate maximum number of tokens or tokens squared.

from sequence_models.samplers import SortishSampler, ApproxBatchSampler
from torch.utils.data import DataLoader
import numpy as np

# grab datasets
ds = dataset # your sequence dataset

# build dataloaders
len_ds = np.array([len(i[0]) for i in ds]) # lengths of the sequences in the dataset (in order)
bucket_size = 1000 # number of length-defined buckets
max_tokens = 8000 # max number of tokens per batch
max_batch_size = 100 # max number of samples per batch
sortish_sampler = SortishSampler(len_ds, bucket_size)
batch_sampler = ApproxBatchSampler(sortish_sampler, max_tokens, max_batch_size, len_ds)
collater = collater # your collater function
dl = DataLoader(ds, collate_fn=collater, batch_sampler=batch_sampler, num_workers=16)

Pre-implemented Models

  • Struct2SeqDecoder (GNN)

The Struct2SeqDecoder model was adapted from Ingraham et al. This model uses protein structural information encoded as a graph, with nodes representing the structural information of each amino acid residue and edges representing their relations to each other.

If you already have node features, edge features, connections between nodes, encoded sequences (src), and an edge mask (edge_mask), you can use the Struct2SeqDecoder directly, as demonstrated below:

from sequence_models.constants import trR_ALPHABET
from sequence_models.gnn import Struct2SeqDecoder

num_letters = len(trR_ALPHABET) # length of your amino acid alphabet  
node_features = 10 # number of node features
edge_features = 11 # number of edge features
hidden_dim =  128 # your choice of hidden layer dimension
num_decoder_layers = 3 # your choice of number of decoder layers to use
dropout = 0.1 # dropout used by decoder layer
use_mpnn = False # if True, use MPNN layer, else use Transformer layer for decoder 
direction = 'bidirectional' # direction of information flow/masking: forward, backward or bidirectional 

model = Struct2SeqDecoder(num_letters, node_features, edge_features, hidden_dim,
            num_decoder_layers, dropout, use_mpnn, direction)
out = model(nodes, edges, connections, src, edge_mask)

If you do not have prepared inputs, but have 2d maps representing the distance between residues (dist) and the dihedral angles between residues (omega, theta, and phi), you can use our preprocessing functions to generate nodes, edges, and connections as demonstrated below:

from sequence_models.gnn import get_node_features, get_k_neighbors, get_edge_features, \
    get_mask, replace_nan
import numpy as np

# process features
node = get_node_features(omega, theta, phi) # generate nodes
dist = dist.fill_diagonal_(np.nan) # if the diagonal of the dist tensor is not already filled with nans,
                                   # it should be, to prevent selecting self when getting the k nearest
                                   # residues in the next step
n_connections = 20 # number of nearest neighbors to keep per residue
connections = get_k_neighbors(dist, n_connections) # get connections
edge = get_edge_features(dist, omega, theta, phi, connections) # generate edges
edge_mask = get_mask(edge) # get edge mask (in case edge features are missing between neighbors)
edge = replace_nan(edge) # replace nans with 0s
node = replace_nan(node)

Alternatively, we have also prepared StructureCollater, a collater function found in collaters.py that also performs this task:

from sequence_models.collaters import StructureCollater
from torch.utils.data import DataLoader

n_connections = 20 # number of connections per amino acid residue
collater = StructureCollater(n_connections=n_connections)
ds = dataset # Dataset must return sequences, dists, omegas, thetas, phis
dl = DataLoader(ds, collate_fn=collater)
  • ByteNet

The ByteNet model was adapted from Kalchbrenner et al.. ByteNet uses stacked convolutional encoder and decoder layers to preserve temporal resolution of sequential data.

from sequence_models.convolutional import ByteNet
from sequence_models.constants import trR_ALPHABET
import torch

n_tokens = len(trR_ALPHABET) # number of tokens in token dictionary
d_embedding = 128 # dimension of embedding
d_model = 128 # dimension to use within ByteNet model, //2 every layer
n_layers = 3 # number of layers of ByteNet block
kernel_size = 3 # the kernel width
r = 128 # used to calculate dilation factor (example value; choose for your model)
padding_idx = trR_ALPHABET.index('-') # location of padding token in ordered alphabet
causal = True # if True, chooses MaskedCausalConv1d() over MaskedConv1d()
dropout = 0.1

x = torch.randint(n_tokens, (32, 128)) # input token indices (n samples, len of seqs)
input_mask = torch.ones(32, 128, 1) # mask (n samples, len of seqs, 1)
model = ByteNet(n_tokens, d_embedding, d_model, n_layers, kernel_size, r,
            padding_idx=padding_idx, causal=causal, dropout=dropout)
out = model(x, input_mask)

We have also implemented versions of ByteNet that accept 2d inputs (ByteNet2d) and that act as a language model (ByteNetLM):

from sequence_models.convolutional import ByteNet2d, ByteNetLM
import torch

d_in = 64 # feature dimension of the 2d input
x = torch.randn(32, 128, 128, d_in) # input (n samples, len of seqs, len of seqs, feature dimension)
input_mask = torch.ones(32, 128, 128, 1) # mask (n samples, len of seqs, len of seqs, 1), optional
model = ByteNet2d(d_in, d_model, n_layers, kernel_size, r, dropout=0.0)
out = model(x, input_mask)

x = torch.randint(n_tokens, (32, 128)) # input token indices (n samples, len of seqs)
input_mask = torch.ones(32, 128, 1) # mask (n samples, len of seqs, 1)
model = ByteNetLM(n_tokens, d_embedding, d_model, n_layers, kernel_size, r,
                    padding_idx=None, causal=False, dropout=0.0)
out = model(x, input_mask)
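
As a sketch of a masked-language-model training step with ByteNetLM (the output shape (n samples, len of seqs, n_tokens) and the random targets are assumptions for illustration):

import torch.nn.functional as F

targets = torch.randint(n_tokens, (32, 128)) # illustrative targets; use real (masked) tokens in practice
loss = F.cross_entropy(out.reshape(-1, n_tokens), targets.reshape(-1)) # out assumed (32, 128, n_tokens)
loss.backward()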

  • trRosetta

The trRosetta model was implemented according to Yang et al. In this model, multiple sequence alignments (MSAs) are used to predict distances between amino acid residues as well as their dihedral angles (omega, theta, phi). Predictions are returned as bins: omega, theta, and phi angles are binned into 24, 24, and 12 bins, respectively, with 15-degree segments and one no-contact bin. Yang et al. have pretrained five models (model ids: 'a', 'b', 'c', 'd', 'e'). To run a single model:

from sequence_models.trRosetta_utils import trRosettaPreprocessing, parse_a3m
from sequence_models.trRosetta import trRosetta
from sequence_models.constants import trR_ALPHABET

msas = parse_a3m(path_to_msa) # load in msas in a3m format
alphabet = trR_ALPHABET # load your alphabet order
tr_preprocessing = trRosettaPreprocessing(alphabet) # setup preprocessor for msa
msas_processed = tr_preprocessing.process(msas)

n2d_layers = 61 # keep at 61 if you want to use pretrained version
model_id = 'a' # choose your pretrained model id
decoder = True # if True, return 2d structure maps, else returns hidden layer
p_dropout = 0.0
model = trRosetta(n2d_layers, model_id, decoder, p_dropout)
out = model(msas_processed) # returns dist_probs, theta_probs, phi_probs, omega_probs

To run an ensemble of models:

from sequence_models.trRosetta_utils import trRosettaPreprocessing, parse_a3m
from sequence_models.trRosetta import trRosetta, trRosettaEnsemble
from sequence_models.constants import trR_ALPHABET

msas = parse_a3m(path_to_msa) # load in msas in a3m format
alphabet = trR_ALPHABET # load your alphabet order
tr_preprocessing = trRosettaPreprocessing(alphabet) # setup preprocessor for msa
msas_processed = tr_preprocessing.process(msas)

n2d_layers = 61 # keep at 61 if you want to use pretrained version
model_ids = 'abcde' # choose your pretrained model id
decoder = True # if True, return 2d structure maps, else returns hidden layer
p_dropout = 0.0
base_model = trRosetta
model = trRosettaEnsemble(base_model, n2d_layers, model_ids)
out = model(msas_processed)

If you would like to convert bin prediction into actual values, use probs2value. Here is an example of converting distance bin predictions into values:

from sequence_models.trRosetta_utils import probs2value

dist_probs, theta_probs, phi_probs, omega_probs = model(x) # structure predictions (batch, # of bins, len of seq, len of seq)
property = 'dist' # choose between 'dist', 'theta', 'phi', or 'omega'
mask = mask # your 2d mask (batch, len of seq, len of seq)
dist_values = probs2value(dist_probs, property, mask)

protein-sequence-models's People

Contributors

hyeh20, louisranjard, microsoft-github-policy-service[bot], nityathakkar, sarahalamdari, wukevin, yangkky


protein-sequence-models's Issues

Old Zenodo link in readme

The GitHub readme uses an old Zenodo link and doesn't match the URL in the paper, which does resolve.

Weights for MIF-ST and MIF-GVP

Hello, I see the pre-trained MIF-ST model has input node features of size [node features + alphabet size], not [node features + CARP embedding size]. So I'm assuming this means you concatenated the logits (not the corrupted/masked one-hot encodings) at each position from CARP to the node features as your input?

Also, is the MIF-GVP model different from the pretrained inverse-folding model in the GVP paper? I see you got higher sequence recovery than they did, so I wasn't sure. If so, would you be able to upload the weights for that model as well?

Great paper, by the way! As someone working on training multi-modal models for protein property prediction, I was very excited to find a pretrained structure+language model, as previously I could only find one or the other.

extract_mif.py Tensor Size runtime error

Hello,

I'm trying to calculate mifst log p values for ~20,000 proteins using your extract_mif.py script. All necessary packages were installed successfully, and a csv was created containing sequence information and file names as specified. The model weights were also successfully downloaded. I've copied the error I get when running the script below. Any help you can provide in resolving this issue would be greatly appreciated. Thanks!

$ python extract_mif.py mifst rbd_model_seqs.csv pgcn_rbd_decoys/ mif_scores/ logits --include=logp
Loading model...
Loading data...
0%| | 0/21808 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/scratch/ja961/extract_mif.py", line 69, in <module>
    src, nodes, edges, connections, edge_mask = collater(batch)
  File "/home/ja961/anaconda3/envs/pyr/lib/python3.10/site-packages/sequence_models/collaters.py", line 314, in __call__
    nodes[i, :ell] = V
RuntimeError: The expanded size of the tensor (791) must match the existing size (597) at non-singleton dimension 0. Target sizes: [791, 10]. Tensor sizes: [597, 10]

No such file or directory: 'examples/results/some_proteins_mifst/wt_mifst_mean.pt'

When I run the demo, it raises an exception:

~/code/protein-sequence-models » python scripts/extract_mif.py mifst examples/gb1s.csv \
    examples/ \
    examples/results/some_proteins_mifst/ \
    repr --include mean per_tok
Loading model...
Loading data...
  0%|                                                                                                                                 | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "scripts/extract_mif.py", line 78, in <module>
    torch.save(rep.mean(dim=0).detach().cpu(),
  File "/data/wenyuhao/anaconda3/envs/fastText/lib/python3.8/site-packages/torch/serialization.py", line 376, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "/data/wenyuhao/anaconda3/envs/fastText/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/data/wenyuhao/anaconda3/envs/fastText/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'examples/results/some_proteins_mifst/wt_mifst_mean.pt'

Please add:

if not os.path.exists(args.out_dir):
    os.mkdir(args.out_dir)

at line 78 of scripts/extract_mif.py.

Could you publish the training code for MIF?

Hi!
I want to train the MIF model on my own dataset. Is there usable code for training it directly with little modification? If not, could you please publish the training code for MIF so I can train the model for further use?
Thank you in advance; I'm looking forward to your reply!

calculate stability and melting temperature from protein sequence

I am reading the bioRxiv preprint, and I understand that I could use the CARP models to predict stability and melting temperature from a query protein sequence. Would it be possible to have a utils script in this codebase to do that?

In Facebook's ESM, for example, there is a likelihood-score calculator script that does something similar:

esm/examples/inverse_folding/score_log_likelihoods.py ./examples/inverse_folding/data/5YH2.pdb     ./examples/inverse_folding/data/5YH2_mutated_seqs.fasta --chain C    --outpath output/5YH2_mutated_seqs_scores.csv

Would it be possible to have an equivalent for CARP? Thanks in advance.

Confusing results taken from FLIP paper

In the paper it is mentioned that "Values for models other than CARP-640M are taken from Dallago et al. (2021)"; however, the reported values do not seem to appear in any of the FLIP tables. What is even more confusing is that there are error bars coming from different random runs, while in the FLIP paper the results seem to be from a single run.

Another confusing part is the fine-tuning label in the results table. Does this mean that the language models are fine-tuned on the tasks, or only the head added on top? I am asking because the FLIP paper mentions that the embedding models are kept frozen.

DDP with sampler

Hello, when using DDP with the sampler, the dynamic batch size means that different ranks are allocated different numbers of batches, which prevents the ranks from communicating in DDP. Are there other solutions?

Applying BiGCARP to other genomes

Hi,

I'm interested in using BiGCARP and understanding its output. I read the publication, but the README is hard to follow, especially for someone without expertise in deep learning. Are there any demonstrations of how to use BiGCARP on genomes of interest, similar to what DeepBGC did?

Thanks

zero-shot inference and inverse folding with fixed residues

Hello,

Thank you for sharing your model.
I have a few questions on how to best use MIF-ST please.

In your example script, the model is provided with features derived from the PDB coordinates and sequence ("wt").

For zero-shot inference of e.g. single point mutants, should we replace the input sequence "wt" with a mask token "#" at the mutated residue for computing the MMP score as in ESM (only taking the logits at that residue and computing the difference of log probabilities)? Or should we forward coords+wt and coords+mutant through the model and score them independently with e.g. the average log-likelihood over all residue logits?

For inverse folding, e.g. redesigning certain residues in a sequence, can I use MIF-ST? And would it be reasonable to mask all the residues to mutate at once, forward coords+masked_seq through the model, and then sample from the logits at the masked residues?

Thanks!

TrRosetta example doesn't work

The example code provided in the README.md for TrRosetta doesn't work and throws the following exception:

Traceback (most recent call last):
  File "...", line 1, in <module>
    msas_processed = tr_preprocessing.process(msas)
  File ".venv/lib/python3.10/site-packages/sequence_models/trRosetta_utils.py", line 254, in process
    x = F.one_hot(x, len(trR_ALPHABET)).unsqueeze(0).float()
TypeError: one_hot(): argument 'input' (position 1) must be Tensor, not list

code:

from sequence_models.trRosetta_utils import trRosettaPreprocessing, parse_a3m
from sequence_models.trRosetta import trRosetta
from sequence_models.constants import trR_ALPHABET

msas = parse_a3m(path_to_msa) # load in msas in a3m format
alphabet = trR_ALPHABET # load your alphabet order
tr_preprocessing = trRosettaPreprocessing(alphabet) # setup preprocessor for msa
msas_processed = tr_preprocessing.process(msas)

Do the omega, theta, and phi angles correspond to the main-chain torsion angles omega, psi, and phi?

Hi authors, I am trying to use your MIF-ST model and am now preparing the node/edge features. Just to be sure I have the features right, I was wondering whether the omega, theta, and phi angles in your paper refer to the torsion angles used in peptide/protein structure (definitions can be found here).

According to Figure 2 in your MIF paper, the answer is no. But Figure 2 was a bit confusing to me, since omega, theta, and phi are all related to the beta carbons; however, not all amino acids have beta carbons. I also checked the paper from Ingraham et al. (2019), which is cited in your MIF paper for this part. In that paper they didn't give many details but used the terms "omega", "psi", and "phi". Could you help me with this? Great thanks in advance!
