dauparas / proteinmpnn
License: MIT License
Code for the ProteinMPNN paper
Hi,
I am trying to score structures, however I am not sure why the network does not give the same score with the same input structure, noise, temperature and seed.
e.g
>protein_unrelaxed_rank_1_model_5, score=1.5509, global_score=1.5065, fixed_chains=['B', 'C', 'D'], designed_chains=['E'], CA_model_name=v_48_020, git_hash=905b0081f3aa3404d7ac22174bc25710f9995ff8, seed=565
>protein_unrelaxed_rank_1_model_5, score=1.5227, global_score=1.5124, fixed_chains=['B', 'C', 'D'], designed_chains=['E'], CA_model_name=v_48_020, git_hash=905b0081f3aa3404d7ac22174bc25710f9995ff8, seed=565
These two runs were submitted on the exact same structure (coming from AF2), with the same fixed positions, noise, temperature, model and seed. Is this expected behaviour? What else would I need to set to get the same score?
Thank you so much!
The code currently silently ignores --use_soluble_model if --ca_only is specified. I only discovered this because I noticed that I was getting the same sequences designed with and without --use_soluble_model. Ideally, there should be an error if you try to use both --use_soluble_model and --ca_only.
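A minimal sketch of the kind of check requested here. The reduced parser below is hypothetical and only declares the two flags named in this report; it is not the repository's actual argument parser, and the error message wording is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, reduced parser: only the two flags discussed above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--ca_only", action="store_true")
    parser.add_argument("--use_soluble_model", action="store_true")
    return parser


def check_flag_conflict(args: argparse.Namespace) -> None:
    # Fail loudly instead of silently ignoring one of the two flags.
    if args.ca_only and args.use_soluble_model:
        raise ValueError("--use_soluble_model cannot be combined with --ca_only")
```

Calling check_flag_conflict(build_parser().parse_args()) at the start of a run would turn the silent fallback into an explicit error.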
Hello!
Context: I used ProteinMPNN to generate 500 designed sequences for a loop region in my structure. The info on the first three sequences is shown below. After the first 5 or so sequences, score and global_score no longer seem to be in any rank order, so I am assuming the sequences in the output .fa file do not get ranked and reorganized?
>T=0.1, sample=1, score=1.1318, global_score=1.6505, seq_recovery=0.1250
>T=0.1, sample=2, score=1.1756, global_score=1.6445, seq_recovery=0.1250
>T=0.1, sample=3, score=1.1776, global_score=1.6387, seq_recovery=0.1875
Question: Which metric should I be using to rank these sequences: score or global_score? Is a higher number better, or a lower number? Or should I be using an external metric for ranking? I do not have access to a GPU for accelerated AlphaFold modeling, so I need to narrow down the sequences to model beforehand.
Many thanks for your help!!
Morgan
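A minimal sketch for sorting the output headers yourself, since the samples do not appear to be ordered. It assumes that lower values are better because the scores are described in the repository README as average negative log-probabilities; that assumption is worth verifying. The function names are illustrative.

```python
import re

# Matches key=number pairs such as "score=1.1318" or "T=0.1".
HEADER_FIELD = re.compile(r"(\w+)=([-+]?\d*\.?\d+)")


def parse_header(header: str) -> dict:
    """Extract the numeric key=value fields from one FASTA header line."""
    return {key: float(value) for key, value in HEADER_FIELD.findall(header)}


def rank_headers(headers, key="score"):
    """Return parsed headers sorted ascending by the chosen field."""
    return sorted((parse_header(h) for h in headers), key=lambda f: f[key])
```

For the three headers quoted above, ranking by score puts sample 1 first, while ranking by global_score puts sample 3 first, so the choice of metric does change the order.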
Sort of like the omit_AAs dictionary, except for individual positions?
Hello! Can you specify exactly what score and global score mean in the context of the original PDB sequence and the designed sequence? I understand the recovery as the probability of sequence identity to the original PDB string. Thanks!
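A minimal sketch of how the two numbers could differ, assuming (as the repository README describes, and worth verifying) that score is the average negative log-probability over the designed residues and global_score is the same average over all residues in all chains. The function name, the toy two-letter alphabet and the masks are illustrative only.

```python
import math


def mean_nll(log_probs, sequence_idx, mask):
    """Average negative log-probability over positions where mask is 1.

    log_probs[pos][aa] is the log-probability of amino acid index aa at pos,
    sequence_idx[pos] is the index of the amino acid actually present.
    """
    total = sum(-lp[aa] * m for lp, aa, m in zip(log_probs, sequence_idx, mask))
    return total / sum(mask)


# Toy example: two positions, two-letter alphabet.
toy_log_probs = [[math.log(0.5), math.log(0.5)],
                 [math.log(0.25), math.log(0.75)]]
toy_sequence = [0, 1]
designed_only = mean_nll(toy_log_probs, toy_sequence, [0, 1])  # "score"-like
all_positions = mean_nll(toy_log_probs, toy_sequence, [1, 1])  # "global_score"-like
```

Under this reading, the only difference between the two values is which positions enter the average.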
When designing a sequence for a complex, the target protein has its full backbone, but the partner ("pair") proteins to be designed are generated by a separate method, so they contain only N, CA, C, and O.
In this case, I think I should use the CA model, but it might be better to use the "vanilla_model_weights" model by adding "CB" coordinates to the partner proteins in some appropriate way. What do you think? (There is also the problem of how to add the "CB" coordinates.)
Great work! Can you please provide processed training/validation/testing datasets as well?
Hello,
We are facing the following error when running the examples that take jsonl files as inputs.
Examples that run the inference on the pdb (with parse_pdbs and StructureDatasetPDB()) run without errors.
Does the code require different formats/parsing for jsonl files? I think it is currently using the parsing from the StructureDataset() class.
Thank you for the attention!
Hi,
I would like to fine-tune on certain complexes, always knowing which chain I want to be masked. How should I input this?
How do list.csv, test_cluster and valid_cluster change for multimers, if at all? I could not find anything about this in the training directory. What does that mean?
Thank you!
Basically, is there a way to limit the total differences from the input sequence? Could one limit the total mutations to <5 per 100 residues of chain length or something similar?
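One workaround that does not depend on any ProteinMPNN option is to generate more sequences than needed and filter them afterwards. A minimal sketch, assuming the designs have the same length as the input sequence; the function names and the threshold default are illustrative.

```python
def mutations_per_100(native: str, designed: str) -> float:
    """Number of differing positions, normalised to 100 residues."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    diffs = sum(a != b for a, b in zip(native, designed))
    return 100.0 * diffs / len(native)


def filter_designs(native: str, designs, max_per_100: float = 5.0):
    """Keep only designs with fewer than max_per_100 mutations per 100 residues."""
    return [d for d in designs if mutations_per_100(native, d) < max_per_100]
```

This only discards designs after sampling; it does not constrain the sampler itself.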
Hello,
I am wondering if you have a list of PDB Ids + chains used to train and validate the model. These would be useful for comparisons to other methods and general benchmarking. I could not find these listed in the supplemental.
The list of proteins/chains would be the 25k training clusters, 402 monomer backbones, and then the last set of 690 monomers, 732 homomers (with less than 2000 residues), and 98 heteromers described in the paper.
Would it be possible to provide these? Thanks!
Hello ProteinMPNN team,
Thank you for your work!
In your paper you refold some native protein and compare with alphafold.
("We reasoned that ProteinMPNN might generate sequences for native backbones more strongly encoding the structures than the original native sequences, as evolution in most cases does not optimize for stability, and completely redesigned a set of 396 native structures").
Best
Barthelemy
Hi, I was wondering if there is a way to have a safety filter in silico to be sure that the generated protein sequence with ProteinMPNN is "safe" to use. For example that it is not a toxin, a prion or a cell penetrating peptide?
Do you think a simple blast would be sufficient?
Are there better tools available (AI based or not) to safely design proteins sequences?
Hi @dauparas ,
This is a layman's question inspired by news related to chatGPT and GPT4.
There might be a way to count it but i do not know how.
Can you please let me know?
Thanks a lot.
Hello,
I read your ProteinMPNN paper. Impressive work!
I tested model v_48_002 on the CATH 4.2 test set with sampling_temp 0.001 and num_seqs 4 (all other settings followed quickdemo.ipynb). The mean and median results over 1120 chains are 42.19 and 45.41, respectively. In the paper it is 47.9.
Is that the same model weight? Or is there any modification to the default settings that I missed?
One more thing: would you be willing to upload the model weights trained without backbone noise?
Thank you!
I did not see an implementation of dot-product attention in the ProteinMPNN code. According to the ProteinMPNN paper, I think it should be dot-product attention, just like in the canonical transformer. So does the code implement additive attention instead? Did I miss something in the code, or is it a mismatch between the code and the paper description?
Hi ProteinMPNN team,
I read the functional design part of the paper.
I wonder whether ProteinMPNN can be used to design a PPI at a given interface.
I tried it on one side of the interface, designing the residues within 5 Å of the other side and keeping everything else fixed.
The structures of the designed sequences were predicted and docked to the other side,
but the ddG calculated by InterfaceAnalyzer with beta_nov16 is positive, whereas the WT value is strongly negative.
Is there anything to discuss about such a situation?
I am trying to make single point mutations in a sequence, where I will mutate the single position to all 20 canonical amino acids. I am trying to use design_only_positions and omit_AA, where the first list will include only one position and omit_AA will include all AA except for one.
I am getting an error regarding omit_AA. Is there an example of how to use omit_AA correctly? Thank you!
Hi,
Thank you so much for your work on ProteinMPNN!
I have the examples scripts working and I was modifying them (submit_example_4.sh in particular) with my own inputs.
However, in the parsed_pdbs.json file that is created my different sequences have both N_chain_A and O_chain_A and every value on the O_chain_A has NaN for all the coordinates. ProteinMPNN still runs and generates outputs for all the sequences but the score, global score, and seq_recovery all = nan (which I am assuming is connected to the O_chain_A=NaN).
I imagine that this is due to my pdb files.
Please would you be able to explain why this might be happening, what the difference between N_chain_A and O_chain_A is, and how I might be able to fix it?
Please do direct me towards any further reading as well!
Thank you very much!
Adam
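One thing that can be checked directly in the input files is whether every residue actually contains atoms named N, CA, C and O. A minimal sketch, assuming standard fixed-column PDB ATOM records; whether missing or differently named O atoms are the cause in this particular case is a guess, and the function name is illustrative.

```python
BACKBONE = ("N", "CA", "C", "O")


def residues_missing_backbone(pdb_lines, chain_id):
    """Map (resSeq, iCode, resName) -> list of backbone atom names not found.

    Only ATOM records of the requested chain are inspected; residues with a
    complete backbone are left out of the result.
    """
    seen = {}
    for line in pdb_lines:
        if len(line) < 27 or not line.startswith("ATOM") or line[21] != chain_id:
            continue
        # PDB columns: atom name 13-16, residue name 18-20, resSeq 23-26, iCode 27.
        key = (int(line[22:26]), line[26].strip(), line[17:20].strip())
        seen.setdefault(key, set()).add(line[12:16].strip())
    return {
        key: [atom for atom in BACKBONE if atom not in atoms]
        for key, atoms in seen.items()
        if not set(BACKBONE) <= atoms
    }
```

Usage would be along the lines of residues_missing_backbone(open("my_structure.pdb"), "A"), where the file name is a placeholder; an empty result means every parsed residue of that chain has all four backbone atom names.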
{"original": {"H": [26,27,28,29,30,35,36,37,38,56,57,58,59,62,63,64,65,105,106,111,112,113,114,115,116,117], "L": [27,28,29,30,36,37,38,56,57,65,105,106,107,108]}}
I'm trying to fix the CDR regions so that they are not changed by ProteinMPNN. However, I don't seem to have the right numbering (see the .jsonl above).
Is the numbering just the actual position in the string? Does it start at 0 or 1? Is it Chothia? Is it Kabat?
What would you recommend for the number of sequences to create for design? What have you used for production runs? In the examples, these are usually one or two, but what is generally recommended?
Hi,
thank you for providing the proteinMPNN code. I am interested in trying out variations of the clustering thresholds, is it possible for you to share the code that produces the train/test/validation clusters used for multi-chain training, i.e how could I generate my own version of list.csv, train_clusters.txt and test_clusters.txt?
Thanks a lot in advance!
Olivia
Dear ProteinMPNN team,
Thank you very much for your great efforts. I want to know why the designed amino acid sequence has a long string of A (e.g. APGDAAAALAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA). It looks unusual.
Thank you for your help in advance.
I am trying to design an interface, but the amino acids I am getting are almost completely polar, which is weird for an interface. Am I missing something? Is there something one should consider when designing interfaces? I am quite sure that the distance I am using between the two chains is adequate for an interface.
I'm having a little trouble getting the code to run with a PSSM as input. I'm also not 100% sure how PSSM biasing is influenced by the pssm_multi, pssm_threshold, pssm_log_odds_flag, and pssm_bias_flag flags. It would be great to have an example with input files and flag usage when including a PSSM.
Thanks!
I have a small, monomeric beta-barrel transmembrane protein that I would like to redesign for soluble expression. I ran ProteinMPNN twice, using either the standard model or the soluble model (--use_soluble_model flag) as input (generating 20 sequences each, T=0.1).
The parental sequence has ~53% hydrophobic amino acids. After running the standard model, the output sequences range from 53-57% hydrophobic. Using the soluble model the output sequences range from 54-59% hydrophobic, so they're more hydrophobic than the parental sequence. Alphafold agrees that the top scoring sequences will fold into the correct structure, however I can still see large hydrophobic patches on the surface of the soluble model generated structure in the transmembrane region.
I haven't tried expressing this protein yet, but it definitely does not look soluble to me. I will try biasing against hydrophobic residues next, however I'm surprised that the soluble model didn't automatically redesign the transmembrane region to be less hydrophobic. Has anybody had similar experiences, or know of any other tricks for redesigning membrane proteins for soluble expression? Has anybody previously optimized amino-acid bias values for soluble expression (otherwise I'll just guess bias values).
I am attempting to follow the work flow recommended in the RFDiffusion paper, https://www.biorxiv.org/content/10.1101/2022.12.09.519842v2.
I am obtaining low amino acid sequence diversity in my ProteinMPNN outputs. My problem is not as severe as the one shown in an earlier reported issue (#46), but it is problematic. Here is a typical example. I am executing ProteinMPNN as shown in ProteinMPNN/examples/submit_example_1.sh.
MIYKHAGYYNAKKGKGKGYTFSTGAKGKGYTKRFKKFSVGKGKATDKETLRAMLTLGGIIFEIDKKKKNKWKGYSTDKGLTAGYSTGKGTKALGYQITPNFGVGYAYNKKPYFGVSYQTKDGSVGVGYNFGLRIVSVSYGNPKTGKGAGYSYKA
{ A : 6.5%
C : 0.0%
D : 2.6%
E : 1.3%
F : 4.5%
G : 18.2%
H : 0.6%
I : 3.9%
K : 18.2%
L : 3.9%
M : 1.3%
N : 3.9%
P : 1.9%
Q : 1.3%
R : 1.9%
S : 5.8%
T : 8.4%
V : 4.5%
W : 0.6%
Y : 10.4% }
There is a surprisingly large number of G and K residues. I also wonder about the high abundance of Y. The calculated isoelectric point is 10.08. I generated 10 candidate sequences from this particular structure. They were all pretty similar to this one.
A response to the earlier issue report was as follows:
Hello! This might happen if the model is uncertain about the prediction, or the input backbone is of low quality. You could try adding negative alanine bias.
Originally posted by @dauparas in #46 (comment)
I can of course attempt to apply negative biases to certain amino acids, as recommended in the earlier post. Before I do this, I would like to ask whether there are any criteria we can use to measure the "quality" of input backbones.
My PDB input files are being generated by RFDiffusion. I specify a partial scaffold, and RFDiffusion hallucinates the rest. At least in PyMol, the secondary structures of the RFDiffusion output files look reasonable. The automated secondary structure assignment algorithm in PyMol is identifying regions of alpha helix and beta sheet. That doesn't mean that I don't have issues with my RFDiffusion outputs, but I don't know what to look for.
Thanks for any information you can provide.
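The composition table and the "all pretty similar" observation in this post can be quantified with a few lines of Python, which makes it easier to compare runs before and after applying a bias. A minimal sketch; the function names are illustrative, and mean pairwise identity is only one possible diversity measure.

```python
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq: str) -> dict:
    """Percentage of each of the 20 canonical amino acids in seq."""
    counts = Counter(seq)
    return {aa: 100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS}


def mean_pairwise_identity(seqs) -> float:
    """Average fraction of identical positions over all pairs of equal-length sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    total = 0.0
    for a, b in pairs:
        if len(a) != len(b):
            raise ValueError("sequences must have the same length")
        total += sum(x == y for x, y in zip(a, b)) / len(a)
    return total / len(pairs)
```

A mean pairwise identity close to 1.0 across the sampled designs would confirm the low diversity numerically.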
Hey Justas, great work with ProteinMPNN. Out of curiosity, is someone already working on translating your repo into a pip-installable package?
Traceback (most recent call last):
File "E:\GitHub\ProteinMPNN\protein_mpnn_run.py", line 466, in <module>
main(args)
File "E:\GitHub\ProteinMPNN\protein_mpnn_run.py", line 162, in main
pdb_dict_list = parse_PDB(args.pdb_path, ca_only=args.ca_only)
File "E:\GitHub\ProteinMPNN\protein_mpnn_utils.py", line 166, in parse_PDB
xyz, seq = parse_PDB_biounits(biounit, atoms=sidechain_atoms, chain=letter)
File "E:\GitHub\ProteinMPNN\protein_mpnn_utils.py", line 85, in parse_PDB_biounits
for line in open(x,"rb"):
OSError: [Errno 22] Invalid argument: 'inputs/*pdb'
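The traceback suggests that the literal string 'inputs/*pdb' reached open(). Windows shells typically do not expand wildcards before starting a program, so the pattern has to be expanded in Python or a single file path has to be passed. A minimal sketch of the expansion step; the helper name is illustrative and not part of the repository.

```python
import glob


def expand_pdb_paths(pattern: str) -> list:
    """Expand a wildcard pattern into a sorted list of existing file paths."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return paths
```

Each returned path could then be passed to the run script one at a time.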
Hi,
Wonderful tool; however, it is a bit hard to use when trying to design an interface. When I have only a couple of AAs to design and lots of fixed ones, it is tedious to list all the fixed ones. Is it possible to include a script to which we only provide a list of enabled positions?
Thank you!
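Until such a script exists, the fixed list can be derived as the complement of the designable positions. A minimal sketch, assuming positions are counted from 1 along the chain (an assumption to check against the helper scripts); the function name is illustrative.

```python
def fixed_from_designable(chain_length: int, designable) -> list:
    """Return every 1-based position of the chain that is NOT in designable."""
    designable = set(designable)
    out_of_range = [p for p in designable if not 1 <= p <= chain_length]
    if out_of_range:
        raise ValueError(f"positions outside 1..{chain_length}: {sorted(out_of_range)}")
    return [p for p in range(1, chain_length + 1) if p not in designable]
```

The resulting list can be joined with spaces to form the position string that the existing fixed-positions helper expects.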
Hi @dauparas
Thanks for the good job done, I am already using ProteinMPNN and would like to train or finetune it on a custom dataset.
Are there some scripts on how to process PDB files into the format compatible with your training script please?
PDBID_CHAINID.pt - I am not sure how to create the values for mask, bfac and occ.
And some metadata will not always be available since I would like to use a mixture of experimental and predicted structures at the input ... Can some metadata be missing from the training files?
Thanks for any hints!
Dear ProteinMPNN authors,
Thank you very much for your great efforts. I am having some trouble conceptually understanding the implementation. Per my reading of the paper, the design task should be formulated as taking only the backbone structure as input and returning the designed sequences.
However, the provided example (e.g. submit_example_3.sh, directly from the .pdb path) actually takes both backbones and sequences as inputs and returns the re-sampled sequences, if I do not misunderstand the code:
ProteinMPNN/protein_mpnn_run.py
Line 302 in e61ecb7
ProteinMPNN/protein_mpnn_utils.py
Line 1057 in e61ecb7
It would be appreciated if you could comment on whether I have the correct understanding. If yes, which script should I use if I would like to input the backbone structure only and get back the designed sequences?
I haven't yet had a chance to start editing the code (and maybe I could streamline this), but why is this needed as opposed to putting these options into the single final run script?
python ../helper_scripts/parse_multiple_chains.py --input_path=$folder_with_pdbs --output_path=$path_for_parsed_chains
python ../helper_scripts/assign_fixed_chains.py --input_path=$path_for_parsed_chains --output_path=$path_for_assigned_chains --chain_list "$chains_to_design"
python ../helper_scripts/make_fixed_positions_dict.py --input_path=$path_for_parsed_chains --output_path=$path_for_fixed_positions --chain_list "$chains_to_design" --position_list "$fixed_positions"
python ../protein_mpnn_run.py \
--jsonl_path $path_for_parsed_chains \
--chain_id_jsonl $path_for_assigned_chains \
--fixed_positions_jsonl $path_for_fixed_positions \
--out_folder $output_dir \
--num_seq_per_target 2 \
--sampling_temp "0.1" \
--seed 37 \
Hi Justin! I just looked around the repo and couldn't find a .yaml file for setting up the conda environment (mlfold in the examples). Did I miss it, or does that still need to be added?
Very excited to try out some sequence design with ProteinMPNN!
I'd like to use different PDB files to train with, was wondering if there is a script that was used to auto generate the two .pt files mentioned in the issue header. Thanks :).
Hi @dauparas !
Would you mind telling us which variables are the last layer representations of the encoder and decoder?
I want to output those representations.
Thank you!
Dear Authors, Hugging Face has some issues running the code.
Here is the main error from the Colab notebook.
ValueError Traceback (most recent call last)
in <cell line: 2>()
25 S_sample = sample_dict["S"]
26 else:
---> 27 sample_dict = model.tied_sample(X, randn_2, S, chain_M, chain_encoding_all, residue_idx, mask=mask, temperature=temp, omit_AAs_np=omit_AAs_np, bias_AAs_np=bias_AAs_np, chain_M_pos=chain_M_pos, omit_AA_mask=omit_AA_mask, pssm_coef=pssm_coef, pssm_bias=pssm_bias, pssm_multi=pssm_multi, pssm_log_odds_flag=bool(pssm_log_odds_flag), pssm_log_odds_mask=pssm_log_odds_mask, pssm_bias_flag=bool(pssm_bias_flag), tied_pos=tied_pos_list_of_lists_list[0], tied_beta=tied_beta, bias_by_res=bias_by_res_all)
28 # Compute scores
29 S_sample = sample_dict["S"]
2 frames
/usr/local/lib/python3.10/dist-packages/opt_einsum/contract.py in contract_path(*operands, **kwargs)
236 size_dict[char] = dim
237 elif dim not in (1, size_dict[char]):
--> 238 raise ValueError("Size of label '{}' for operand {} ({}) does not match previous "
239 "terms ({}).".format(char, tnum, size_dict[char], dim))
240 else:
ValueError: Size of label 'i' for operand 1 (348) does not match previous terms (462).
When I run the "example from vanilla_proteinmpnn/examples/ to design a single PDB file", the following error occurred: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
What should I do to solve the problem?
Thanks for your reply.
Hi,
Great work. I wonder if we could use one command line to add many different sets of fixed residues to one jsonl file, as it is a little tiring to generate fixed-residue jsonl files for different fixed residues one by one.
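A minimal sketch of building such a file in one go. It assumes the layout visible in the .jsonl quoted earlier on this page, that is, one JSON object mapping a structure name to a mapping of chain ID to a list of positions; the helper name is hypothetical.

```python
import json


def fixed_positions_line(per_pdb: dict) -> str:
    """Serialise {pdb_name: {chain_id: [positions]}} as a single JSON line.

    Positions are de-duplicated and sorted per chain.
    """
    cleaned = {
        name: {chain: sorted(set(positions)) for chain, positions in chains.items()}
        for name, chains in per_pdb.items()
    }
    return json.dumps(cleaned)
```

Writing the returned string to a file gives one line that covers every structure in the dictionary.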
Hello.
Terrific work, and thanks for making the code available. I’m wondering if there is currently any code for producing new model weights? I would find it interesting to train the model on a subset of the PDB.
Thanks. Take care.
I was curious why the parse multichain script allows sequence jumps as well as NaN coordinates (that is, it keeps the missing residues), while the newly released script that parses CIF to .pt omits the missing coordinates and sequence. I know it is for training availability, but why not make them consistent?
I am trying to design a sequence for a multi-state protein. It has four available conformations that differ from each other. How can I tie all residues between these four PDBs so that I can design a sequence optimised for all four structures? I couldn't find an easy way to do this, having looked through the examples.
Dear ProteinMPNN team,
Thank you for building such a fantastic method to generate sequences from structure. I am thinking of getting a PSSM logo from a known structure. How can I get the predicted PSSM from a structure?
After checking the code, it looks like I can use log_probs in this line, and I also need to run this many times (I am not sure how many runs are needed for convergence) because the ProteinMPNN decoder uses a random decoding order. Is this correct?
Thank you for your help in advance.
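A minimal sketch of the averaging step described in this question, assuming the per-position log-probabilities from each run have already been collected as nested lists. Averaging in probability space is one possible choice, not necessarily what the authors would recommend, and the number of runs needed is not established here.

```python
import math


def average_profile(log_prob_runs):
    """Average per-position probabilities over several runs.

    log_prob_runs[run][pos][aa] holds log-probabilities; the result is a
    [pos][aa] table of probabilities averaged over runs.
    """
    n_runs = len(log_prob_runs)
    n_pos = len(log_prob_runs[0])
    n_aa = len(log_prob_runs[0][0])
    profile = [[0.0] * n_aa for _ in range(n_pos)]
    for run in log_prob_runs:
        for pos, row in enumerate(run):
            for aa, log_p in enumerate(row):
                profile[pos][aa] += math.exp(log_p) / n_runs
    return profile
```

Convergence could be judged by repeating the average with more runs and checking whether the profile still changes.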
Hi, thanks for the great work!
I have been trying to reproduce the results reported in the ProteinMPNN paper. However, I can only obtain a test perplexity of ~5.7, which is a little higher than the results in the paper (4.74/5.25 with and without adding noise during training). I also loaded the v_48_002.pt weights and obtained a test perplexity of 5.29, which is almost the same as the result reported in the paper.
So I wonder whether you can release the weights trained without added noise? Or whether you can share your training code? Thank you so much!
Hello,
I am just starting to use your model; it seems like very good work!
When I follow your quickdemo.ipynb example but use my own PDB file, the parsed sequences don't match.
The sequence in the PDB file has 297 residues (made up only of the 20 natural AAs), while pdb_dict_list[0][f"seq_chain_{chain}"] is 296 residues long. The first residue, an M, has been cropped.
Unfortunately I don't think I can share the PDB file, but I wonder if there's anything I can check that could cause the first residue to be missing after parsing the PDB with your utils...
Thanks for any hints!
Is it possible to fix positions while using the homooligomer example?
Hi, I have read through the ProteinMPNN preprint but don't really understand how the temperature parameter is involved in ProteinMPNN design.
Would you mind describing briefly how temperature is involved in ProteinMPNN?
Thank you!
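A minimal sketch of the usual way a sampling temperature enters an autoregressive model: the logits are divided by the temperature before the softmax, so low temperatures concentrate probability on the top-scoring amino acid. This is the generic technique and an assumption about ProteinMPNN, not a transcription of the repository code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With the same logits, a temperature of 0.1 gives a much sharper distribution than a temperature of 1.0, which is why low values produce less diverse sequences.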
Hi, I'm learning your ProteinMPNN framework. When going through your script, protein_mpnn_utils.py, I am confused by the variable order_mask_backward, which is defined and used in line 1085 and line 1086, respectively.
order_mask_backward = torch.einsum('ij, biq, bjp->bqp',(1-torch.triu(torch.ones(mask_size,mask_size, device=device))), permutation_matrix_reverse, permutation_matrix_reverse)
mask_attend = torch.gather(order_mask_backward, 2, E_idx).unsqueeze(-1)
As I understand it (according to line 1086, i.e., the second line above), order_mask_backward should be the tensor that records, for each residue, which residues are decoded before it in the reverse order (with the value of 1, meaning that these residues can be seen). In that case, the indices along dim=-1 and dim=-2 both represent residue positions, so that order_mask_backward can be gathered by E_idx (E_idx is a tensor that records, for each residue, which residues are recognized as neighbors, with a shape of [num_batch, num_residues, num_neighbors]).
However, to my understanding, order_mask_backward as defined in the first line above records, for each decoding-order pair (q, p), whether there exists a corresponding residue pair (i, j) subject to i > j: if one exists, the value is 1, else 0. Here, q and p are the indices along dim=-2 and dim=-1 of the tensor order_mask_backward, respectively, and i and j are the positions of the residues in the sequence.
To clarify, take an easy example as follows.
import torch
import torch.nn.functional as F
num = 4 # num of residues
a = torch.Tensor([2,3,0,1]).long() # random decoding order, i.e., a[position_of_residue] = value of decoding order
one_hot_a = F.one_hot(a, num_classes=num).float()
one_hot_a = one_hot_a.unsqueeze(0)
result = torch.einsum('ij, biq, bjp->bqp', (1-torch.triu(torch.ones(num, num))), one_hot_a, one_hot_a) # given by line 1085
result
tensor([[[0., 0., 1., 1.],
[1., 0., 1., 1.],
[0., 0., 0., 0.],
[0., 0., 1., 0.]]])
For example, result[0][0][2] = 1, meaning that there exists a residue pair (i, j), with i > j, that has a decoding order of (0, 2). In fact, residue 2 and residue 0 constitute such a pair. This example suggests that my understanding of the variable order_mask_backward given by line 1085 may be right. However, in that case, order_mask_backward does not go along with E_idx in line 1086, because the indices along dim=-2 and dim=-1 do not represent residue positions.
Modification of the equation in torch.einsum() as follows may solve that problem.
torch.einsum('ji, bqi, bpj->bqp', (1-torch.triu(torch.ones(num, num)), one_hot_a, one_hot_a))
tensor([[[0., 1., 0., 0.],
[0., 0., 0., 0.],
[1., 1., 0., 1.],
[1., 1., 0., 0.]]])
In the above result, result[0][0][1] = 1, meaning that residue 1 is decoded before residue 0 in the reverse order, i.e., decoding backwards (their decoding orders are 3 and 2, respectively), so that residue 1 can be seen when decoding residue 0.
I'm not sure if my understanding is correct and if it has an influence on the model training result.
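The two contractions in this post can be reproduced without torch by writing the sums out explicitly, which makes the index conventions easier to inspect. A minimal sketch with the batch dimension dropped; the function names are illustrative. Which reading is correct depends on whether the rows of the repository's permutation_matrix_reverse are indexed by residue position (as in the example in this post) or by decoding step. If the one-hot matrix is built from an argsort output, so that rows are decoding steps and columns are positions, the first contraction would already be indexed by position; that is an assumption to verify against the repository.

```python
def one_hot(indices, n):
    """Row r has a single 1 in column indices[r]."""
    return [[1 if col == idx else 0 for col in range(n)] for idx in indices]


def contract_ij_iq_jp(a):
    """'ij,iq,jp->qp' with a strictly lower-triangular first operand (i > j)."""
    n = len(a)
    h = one_hot(a, n)
    return [[sum(h[i][q] * h[j][p] for i in range(n) for j in range(i))
             for p in range(n)] for q in range(n)]


def contract_ji_qi_pj(a):
    """'ji,qi,pj->qp' with a strictly lower-triangular first operand (j > i)."""
    n = len(a)
    h = one_hot(a, n)
    return [[sum(h[q][i] * h[p][j] for j in range(n) for i in range(j))
             for p in range(n)] for q in range(n)]
```

For a = [2, 3, 0, 1] these functions return the two matrices printed in this post. Because that particular a is its own inverse permutation, the two results are transposes of each other, so a permutation without this property (for example [1, 2, 0]) is a more informative test case.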
Hey ProteinMPNN team! Big fan of the work and thank you for the excellent guides that are runnable in Colab.
I'm trying to work through the code base + util file to better understand how to score sequences (the neg. log probability of the sequence given a backbone). My question is, given a set of sequences and a PDB file for my backbone, how can I a) featurize the backbone and b) then compute the neg. log probability of my set of sequences with respect to the backbone?
For reference, here is a similar example of how to do so with ESM-IF1.
Thanks for your help!