dauparas / proteinmpnn

Code for the ProteinMPNN paper

License: MIT License

Shell 0.11% Python 15.50% Jupyter Notebook 84.39%

proteinmpnn's Issues

What does scoring depend on?

Hi,

I am trying to score structures, however I am not sure why the network does not give the same score with the same input structure, noise, temperature and seed.

e.g.

>protein_unrelaxed_rank_1_model_5, score=1.5509, global_score=1.5065, fixed_chains=['B', 'C', 'D'], designed_chains=['E'], CA_model_name=v_48_020, git_hash=905b0081f3aa3404d7ac22174bc25710f9995ff8, seed=565

>protein_unrelaxed_rank_1_model_5, score=1.5227, global_score=1.5124, fixed_chains=['B', 'C', 'D'], designed_chains=['E'], CA_model_name=v_48_020, git_hash=905b0081f3aa3404d7ac22174bc25710f9995ff8, seed=565

These two runs were submitted on the exact same structure (coming from AF2), with the same fixed positions, noise, temperature, model and seed. Is this expected behaviour? What else would I need to set to get the same score?

Thank you so much!

Disallow --use_soluble_model if --ca_only is specified

The code currently silently ignores --use_soluble_model if --ca_only is specified. I only discovered this because I noticed that I was getting the same sequences designed with and without --use_soluble_model. Ideally, there should be an error if you try to use both --use_soluble_model and --ca_only.

Criteria for selecting top-ranked sequence?

Hello!

Context: I used ProteinMPNN to generate 500 designed sequences for a loop region in my structure. The info on the first three sequences is shown below. After the first 5 or so sequences, score and global_score no longer seem to be in any rank order, so I am assuming the sequences in the output .fa file do not get ranked and reorganized?

>T=0.1, sample=1, score=1.1318, global_score=1.6505, seq_recovery=0.1250
>T=0.1, sample=2, score=1.1756, global_score=1.6445, seq_recovery=0.1250
>T=0.1, sample=3, score=1.1776, global_score=1.6387, seq_recovery=0.1875

Question: Which metric should I be using to rank these sequences? score? global_score? Is a higher number better, or a lower number? Or should I be using an external metric for ranking? I do not have access to a GPU for accelerated AlphaFold modeling, so I need to narrow down the sequences to model beforehand.

Many thanks for your help!!
Morgan
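
A minimal sketch of one way to rank the records, assuming (as the repository README describes the fields) that score is the average negative log-probability over the designed residues and global_score the same average over all residues, so that lower values mean more probable sequences. The helper names parse_header and rank_designs are hypothetical; the header lines are copied from the post above.

```python
import re

# Matches key=value pairs such as "score=1.1318" in a FASTA header line.
HEADER_FIELD = re.compile(r"(\w+)=([^,\s]+)")

def parse_header(header):
    """Turn '>T=0.1, sample=1, score=1.1318, ...' into a dict of floats."""
    return {key: float(value) for key, value in HEADER_FIELD.findall(header)}

def rank_designs(headers, key="score"):
    """Sort header records by `key`, lowest first (best under the assumption above)."""
    return sorted((parse_header(h) for h in headers), key=lambda rec: rec[key])

headers = [
    ">T=0.1, sample=1, score=1.1318, global_score=1.6505, seq_recovery=0.1250",
    ">T=0.1, sample=2, score=1.1756, global_score=1.6445, seq_recovery=0.1250",
    ">T=0.1, sample=3, score=1.1776, global_score=1.6387, seq_recovery=0.1875",
]
best = rank_designs(headers)[0]
print(int(best["sample"]), best["score"])
```

The parser only handles purely numeric fields, so the first record of an output file (which also lists chains and a model name) would need separate handling.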

Abnormal results

Hello! I found that my result includes some abnormal amino acid X. I know that it means an undefined amino acid, but how did this happen? When I run the sample, the result is normal. (The length of my protein is 1366 aa.)

Any way to specify per-residue AA makeups?

Sort of like the omit_AAs dictionary, except for individual positions?
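
A hedged sketch of what a per-position restriction could look like as data. The run script appears to accept a per-residue bias file (a --bias_by_res_jsonl option); the layout assumed here, one row of 21 bias values per residue in the order ACDEFGHIKLMNPQRSTVWYX, keyed by structure name and chain, is an assumption that should be checked against the helper scripts. The function name per_residue_bias and the structure name my_structure are hypothetical.

```python
import json

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # assumed 21-letter ordering

def per_residue_bias(chain_length, allowed_by_position, penalty=-1e6):
    """Build a chain_length x 21 table of biases.

    allowed_by_position maps a 1-based position to the letters permitted
    there; every other letter at that position gets `penalty`.
    Positions that are not listed stay unbiased (all zeros).
    """
    table = [[0.0] * len(ALPHABET) for _ in range(chain_length)]
    for position, allowed in allowed_by_position.items():
        row = table[position - 1]
        for column, letter in enumerate(ALPHABET):
            if letter not in allowed:
                row[column] = penalty
    return table

# Position 2 restricted to D/E, position 4 restricted to K/R (illustrative only).
bias = {"my_structure": {"A": per_residue_bias(5, {2: "DE", 4: "KR"})}}
line = json.dumps(bias)  # one JSON object per line in a .jsonl file
```

A large negative bias is used instead of a hard mask so that the same table format can also express softer preferences.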

Clarification about score and global_score?

Hello! Can you specify exactly what score and global score means in the context of the original PDB sequence and the designed sequence? I understand the recovery as the probability of sequence identity to the original PDB string. Thanks!
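
As a reading aid, and assuming the definitions given in the repository README (average negative log-probability of the sampled amino acids), the three numbers can be written as below. The symbols are introduced here for illustration: D is the set of designed positions, V the set of all residues in all chains, X the backbone, s the sampled sequence and s^native the input sequence.

```latex
\begin{aligned}
\text{score} &= -\frac{1}{|D|} \sum_{i \in D} \log p\left(s_i \mid X,\ \text{previously decoded residues}\right) \\
\text{global\_score} &= -\frac{1}{|V|} \sum_{i \in V} \log p\left(s_i \mid X,\ \text{previously decoded residues}\right) \\
\text{seq\_recovery} &= \frac{1}{|D|} \sum_{i \in D} \mathbf{1}\left[s_i = s_i^{\text{native}}\right]
\end{aligned}
```

Under this reading, lower score and global_score mean a more probable sequence, and seq_recovery is a fraction of identical positions rather than a probability.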

Should I use the CA model or the full model?

I am designing sequences for a complex. The target protein has its complete backbone, but the partner ("pair") proteins to be designed were generated by a certain method and contain only the N, CA, C, and O atoms.
In this case, I think I should use the CA model, but it might be better to use the "vanilla_model_weights" model after adding "CB" coordinates to the partner proteins in some appropriate way. What do you think? (There is also the problem of how to add the "CB" coordinates.)

processed training datasets?

Great work! Can you please provide processed training/validation/testing datasets as well?

Error running the examples with jsonl inputs

Hello,

We are facing the following error when running the examples that take jsonl files as inputs.

Examples that run the inference on the pdb (with parse_pdbs and StructureDatasetPDB()) run without errors.

Does the code require different formats/parsing for jsonl files? I think it is currently using the parsing from the StructureDataset() class.

Thank you for the attention!

How to prepare input files for fine-tuning with multimers?

Hi,

I would like to fine-tune on certain complexes, always knowing which chain I want to be masked. How should I input this?

How do the list.csv, test_cluster and valid_cluster change for multimers, if at all? I could not find anything about this in the training directory. What does that mean?

Thank you!

Is there a way to set a maximum number of mutations per designed sequence?

Basically, is there a way to limit the total differences from the input sequence? Could one limit the total mutations to <5 per 100 residues of chain length or something similar?
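
One workaround that needs no change to the model is to oversample and filter afterwards. The sketch below is such a post-filter, not a ProteinMPNN option; the function names and the toy sequences are hypothetical, and the budget follows the "<5 per 100 residues" wording above.

```python
def count_mutations(native, design):
    """Number of positions at which two equal-length sequences differ."""
    if len(native) != len(design):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(native, design))

def within_budget(native, designs, max_per_100=5):
    """Keep designs with fewer than max_per_100 mutations per 100 residues."""
    limit = max_per_100 * len(native) / 100
    return [d for d in designs if count_mutations(native, d) < limit]

native = "A" * 100
designs = ["G" * 4 + "A" * 96, "G" * 5 + "A" * 95]
kept = within_budget(native, designs)  # only the first design stays under the limit
```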

Lists of PDB Ids for training and testing

Hello,

I am wondering if you have a list of PDB Ids + chains used to train and validate the model. These would be useful for comparisons to other methods and general benchmarking. I could not find these listed in the supplemental.

The list of proteins/chains would be the 25k training clusters, 402 monomer backbones, and then the last set of 690 monomers, 732 homomers (with less than 2000 residues), and 98 heteromers described in the paper.

Would it be possible to provide these? Thanks!

Choosing the redesigned structure

Hello ProteinMPNN team,

Thank you for your work!
In your paper you refold some native proteins and compare with AlphaFold.
("We reasoned that ProteinMPNN might generate sequences for native backbones more strongly encoding the structures than the original native sequences, as evolution in most cases does not optimize for stability, and completely redesigned a set of 396 native structures").

If I understood correctly, the refolding part is sequence only (no template, no MSA), which sounds like the fairest way to use AlphaFold in your case. Am I right?
How did you choose these 396 structures (and can you share the list)? For the few proteins I have tried at random so far, AlphaFold with no template and no MSA is just too inaccurate to be used as an assessment, whereas in your plot a large part of the 396 native proteins seem to have reasonable LDDT after refolding.

Best

Barthelemy

Safety of generated sequences

Hi, I was wondering if there is a way to apply an in silico safety filter to be sure that a protein sequence generated with ProteinMPNN is "safe" to use, for example that it is not a toxin, a prion or a cell-penetrating peptide?
Do you think a simple BLAST search would be sufficient?
Are there better tools available (AI-based or not) to safely design protein sequences?

How many parameters are there in ProteinMPNN models?

Hi @dauparas,

This is a layman's question inspired by news related to ChatGPT and GPT-4.
There might be a way to count them, but I do not know how.

Can you please let me know?
Thanks a lot.
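
For any PyTorch model the count can be obtained by summing the element counts of its parameter tensors. The sketch below relies only on the standard torch.nn.Module interface (parameters(), numel(), requires_grad); the helper name count_parameters is hypothetical, and how the ProteinMPNN model object is constructed and loaded is left to the repository's own scripts.

```python
def count_parameters(model, trainable_only=False):
    """Sum element counts over model.parameters().

    Works for any object exposing the torch.nn.Module interface.
    With trainable_only=True, frozen parameters are skipped.
    """
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )

# Usage, once a model instance has been built and its weights loaded:
#   print(count_parameters(model))
```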

Request for the best model weight for CATH 4.2 testset

Hello,
I read your ProteinMPNN paper. Impressive work!
I tested model v_48_002 on the CATH 4.2 test set with sampling_temp 0.001 and num_seqs 4 (all other settings followed quickdemo.ipynb). The mean and median results over 1120 chains are 42.19 and 45.41, respectively. In the paper the value is 47.9.
Is that the same model weight? Or is there any modification to the default settings that I missed?
One more thing: would you be willing to upload the model weights trained without backbone noise?
Thank you!

Discrepancy between paper and code on attention module

I did not see an implementation of dot-product attention in the ProteinMPNN code. According to the ProteinMPNN paper, I expected dot-product attention as in the canonical transformer. Does the code implement additive attention instead? Did I miss something in the code, or is there a mismatch between the code and the paper's description?

About PPI design

Hi ProteinMPNN team,
I read the functional design part of the paper.
I wonder whether ProteinMPNN can be used to design a PPI at a specific interface.
I tried it with the residues on one side of the interface that lie within 5 Å of the other side set as designable and everything else fixed.
The structures of the designed sequences were predicted and docked to the other side,
but the ddG calculated by InterfaceAnalyzer with beta_nov16 is positive, whereas the WT value is strongly negative.
Is there anything to discuss about this situation?

Example of using omit_AA?

I am trying to make single point mutations in a sequence, where I will mutate the single position to all 20 canonical amino acids. I am trying to use design_only_positions and omit_AA, where the first list will include only one position and omit_AA will include all AA except for one.

I am getting an error regarding omit_AA. Is there an example of how to use omit_AA correctly? Thank you!
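
A small sketch of the bookkeeping described above: for each target amino acid, build the string of every other letter to exclude. It assumes the global omit option takes a plain string of one-letter codes; the option's exact name and accepted format should be checked against the argument parser in protein_mpnn_run.py. The helper name omit_string_for is hypothetical.

```python
CANONICAL = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids

def omit_string_for(target):
    """Letters to exclude so that only `target` remains allowed (X is always excluded)."""
    if target not in CANONICAL:
        raise ValueError(f"unknown amino acid: {target!r}")
    return "".join(aa for aa in CANONICAL if aa != target) + "X"

# One exclusion string per single-point mutation to try at the chosen position.
omit_by_target = {aa: omit_string_for(aa) for aa in CANONICAL}
```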

O_chain_A = NaN and Global_Score=NaN

Hi,
Thank you so much for your work on ProteinMPNN!
I have the examples scripts working and I was modifying them (submit_example_4.sh in particular) with my own inputs.
However, in the parsed_pdbs.json file that is created my different sequences have both N_chain_A and O_chain_A and every value on the O_chain_A has NaN for all the coordinates. ProteinMPNN still runs and generates outputs for all the sequences but the score, global score, and seq_recovery all = nan (which I am assuming is connected to the O_chain_A=NaN).
I imagine that this is due to my pdb files.
Please would you be able to explain why this might be happening, what the difference between N_chain_A and O_chain_A is, and how I might be able to fix it?
Please do direct me towards any further reading as well!
Thank you very much!
Adam
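
One thing worth checking, offered as a guess rather than a diagnosis, is whether the PDB files actually contain a backbone atom named O for every residue (some tools omit it or name the carbonyl oxygen differently). The sketch below scans ATOM records using the fixed-column PDB layout and reports residues lacking any of N, CA, C or O; the function name is hypothetical.

```python
BACKBONE = ("N", "CA", "C", "O")

def missing_backbone_atoms(pdb_lines):
    """Map (chain, residue number + insertion code) to the backbone atoms it lacks."""
    seen = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        atom = line[12:16].strip()             # atom name, columns 13-16
        key = (line[21], line[22:27].strip())  # chain ID, residue number + insertion code
        seen.setdefault(key, set()).add(atom)
    return {
        key: [name for name in BACKBONE if name not in atoms]
        for key, atoms in seen.items()
        if any(name not in atoms for name in BACKBONE)
    }

# Usage:
#   with open("my_structure.pdb") as handle:
#       print(missing_backbone_atoms(handle))
```

An empty result means every residue with ATOM records has all four backbone atoms under these names.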

What's the numbering scheme for fixed_residues?

{"original": {"H": [26,27,28,29,30,35,36,37,38,56,57,58,59,62,63,64,65,105,106,111,112,113,114,115,116,117], "L": [27,28,29,30,36,37,38,56,57,65,105,106,107,108]}}

I'm trying to fix the CDR regions so that they are not changed by ProteinMPNN. However, I don't seem to have the right numbering (see the .jsonl above).

Is the numbering just the actual position in the string? Does it start at 0 or 1? Is it Chothia? Is it Kabat?
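
If the position lists turn out to be 1-based indices into the parsed chain sequence rather than Kabat or Chothia labels (an assumption here; helper_scripts/make_fixed_positions_dict.py is the place to confirm it), a translation table from PDB residue labels could be built as sketched below. The function names are hypothetical. The sketch numbers residues in order of first appearance and does not account for placeholder positions a parser may insert where the PDB numbering has gaps.

```python
def sequential_index_map(pdb_lines, chain):
    """Map PDB residue labels (number + insertion code) of `chain` to 1-based indices."""
    mapping = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[21] == chain:
            label = line[22:27].strip()  # e.g. "27" or "27A"
            if label not in mapping:
                mapping[label] = len(mapping) + 1
    return mapping

def to_sequential(labels, mapping):
    """Translate labels such as 27 or '27A' into sequential indices."""
    return [mapping[str(label)] for label in labels]
```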

How many sequences to create?

What would you recommend for the number of sequences to create for design? What have you used for production runs? In the examples, these are usually one or two, but what is generally recommended?

mmseq clustering logic

Hi,

Thank you for providing the ProteinMPNN code. I am interested in trying out variations of the clustering thresholds. Is it possible for you to share the code that produces the train/test/validation clusters used for multi-chain training, i.e., how could I generate my own version of list.csv, train_clusters.txt and test_clusters.txt?

Thanks a lot in advance!
Olivia

Amino acid sequence has too many "A"

Dear ProteinMPNN team,
Thank you very much for your great efforts. I want to know why the designed amino acid sequence has a long string of A (e.g., APGDAAAALAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA). It looks unusual.
Thank you for your help in advance.

Designing Interface Residues

I am trying to design an interface, but the amino acids I am getting are almost completely polar, which is weird for an interface. Am I missing something? Is there something one should consider when designing interfaces? I am quite sure that the distance I am using between the two chains is adequate for an interface.

An example demonstrating PSSM biasing

I'm having a little trouble getting the code to run with a PSSM as input. I'm also not 100% sure how PSSM biasing is influenced by the pssm_multi, pssm_threshold, pssm_log_odds_flag, and pssm_bias_flag flags. It would be great to have an example with input files and flag usage when including a PSSM.

Thanks!

General Soluble Model Question

I have a small, monomeric beta-barrel transmembrane protein that I would like to redesign for soluble expression. I ran ProteinMPNN twice, using either the standard model or the soluble model (--use_soluble_model flag) as input (generating 20 sequences each, T=0.1).

The parental sequence has ~53% hydrophobic amino acids. After running the standard model, the output sequences range from 53-57% hydrophobic. Using the soluble model the output sequences range from 54-59% hydrophobic, so they're more hydrophobic than the parental sequence. Alphafold agrees that the top scoring sequences will fold into the correct structure, however I can still see large hydrophobic patches on the surface of the soluble model generated structure in the transmembrane region.

I haven't tried expressing this protein yet, but it definitely does not look soluble to me. I will try biasing against hydrophobic residues next, however I'm surprised that the soluble model didn't automatically redesign the transmembrane region to be less hydrophobic. Has anybody had similar experiences, or know of any other tricks for redesigning membrane proteins for soluble expression? Has anybody previously optimized amino-acid bias values for soluble expression (otherwise I'll just guess bias values).

What constitutes a "low quality" backbone?

I am attempting to follow the work flow recommended in the RFDiffusion paper, https://www.biorxiv.org/content/10.1101/2022.12.09.519842v2.

I am obtaining low amino acid sequence diversity in my ProteinMPNN outputs. My problem is not as severe as the one shown in an earlier reported issue (#46), but it is problematic. Here is a typical example. I am executing ProteinMPNN as shown in ProteinMPNN/examples/submit_example_1.sh.

MIYKHAGYYNAKKGKGKGYTFSTGAKGKGYTKRFKKFSVGKGKATDKETLRAMLTLGGIIFEIDKKKKNKWKGYSTDKGLTAGYSTGKGTKALGYQITPNFGVGYAYNKKPYFGVSYQTKDGSVGVGYNFGLRIVSVSYGNPKTGKGAGYSYKA
{ A : 6.5%
  C : 0.0%
  D : 2.6%
  E : 1.3%
  F : 4.5%
  G : 18.2%
  H : 0.6%
  I : 3.9%
  K : 18.2%
  L : 3.9%
  M : 1.3%
  N : 3.9%
  P : 1.9%
  Q : 1.3%
  R : 1.9%
  S : 5.8%
  T : 8.4%
  V : 4.5%
  W : 0.6%
  Y : 10.4% }

There is a surprisingly large number of G and K residues. I also wonder about the high abundance of Y. The calculated isoelectric point is 10.08. I generated 10 candidate sequences from this particular structure. They were all pretty similar to this one.
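
A composition table like the one above can be recomputed for every candidate with a few lines of Python, which makes it easy to flag skewed designs automatically. The helper names and the 15% threshold below are arbitrary illustrations, not recommended cut-offs.

```python
from collections import Counter

def composition(sequence):
    """Percentage of each residue type present in `sequence`, keyed by one-letter code."""
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}

def flag_overrepresented(sequence, threshold=15.0):
    """Residue types whose share exceeds `threshold` percent."""
    return [aa for aa, pct in composition(sequence).items() if pct > threshold]
```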

A response to the earlier issue report was as follows:

Hello! This might happen if the model is uncertain about the prediction, or the input backbone is of low quality. You could try adding negative alanine bias.

Originally posted by @dauparas in #46 (comment)

I can of course attempt to apply negative biases to certain amino acids, as recommended in the earlier post. Before I do this, I would like to ask whether there are any criteria we can use to measure the "quality" of input backbones.

My PDB input files are being generated by RFDiffusion. I specify a partial scaffold, and RFDiffusion hallucinates the rest. At least in PyMol, the secondary structures of the RFDiffusion output files look reasonable. The automated secondary structure assignment algorithm in PyMol is identifying regions of alpha helix and beta sheet. That doesn't mean that I don't have issues with my RFDiffusion outputs, but I don't know what to look for.

Thanks for any information you can provide.

New feature: Python package

Hey Justas, great work with ProteinMPNN. Out of curiosity, is someone already working on translating your repo into a pip-installable package?

OSError: [Errno 22] Invalid argument: 'inputs/*pdb'. How do I fix this bug?

Traceback (most recent call last):
File "E:\GitHub\ProteinMPNN\protein_mpnn_run.py", line 466, in <module>
main(args)
File "E:\GitHub\ProteinMPNN\protein_mpnn_run.py", line 162, in main
pdb_dict_list = parse_PDB(args.pdb_path, ca_only=args.ca_only)
File "E:\GitHub\ProteinMPNN\protein_mpnn_utils.py", line 166, in parse_PDB
xyz, seq = parse_PDB_biounits(biounit, atoms=sidechain_atoms, chain=letter)
File "E:\GitHub\ProteinMPNN\protein_mpnn_utils.py", line 85, in parse_PDB_biounits
for line in open(x,"rb"):
OSError: [Errno 22] Invalid argument: 'inputs/*pdb'
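
The traceback shows the literal pattern 'inputs/*pdb' reaching open(), which is what happens when the shell does not expand wildcards (the Windows command prompt, for instance, does not). One possible workaround, sketched here under that assumption, is to expand the pattern in Python and pass one concrete file path at a time as the pdb path argument. The helper name expand_pdb_pattern is hypothetical.

```python
import glob
import os

def expand_pdb_pattern(pattern):
    """Return the sorted list of files matching `pattern`, or raise if there are none."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return paths

# Usage:
#   for path in expand_pdb_pattern(os.path.join("inputs", "*pdb")):
#       ...run the design script once per concrete path...
```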

Adding support for residues to design

Hi,

Wonderful tool, however it is a bit hard to use when trying to design an interface. When I have only a couple of AAs to design and a lot of fixed ones, it is tedious to list all the fixed ones. Is it possible to include a script to which we only provide a list of enabled positions?
Thank you!
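
Until such a script exists, the inversion is easy to do by hand. The sketch below turns a short list of designable positions into the complementary list of fixed positions, assuming 1-based positions within a chain of known length; the function name is hypothetical.

```python
def fixed_from_designable(chain_length, designable):
    """Return every 1-based position in the chain that is NOT in `designable`."""
    designable = set(designable)
    out_of_range = sorted(p for p in designable if not 1 <= p <= chain_length)
    if out_of_range:
        raise ValueError(f"positions outside 1..{chain_length}: {out_of_range}")
    return [p for p in range(1, chain_length + 1) if p not in designable]

fixed = fixed_from_designable(10, [3, 7])
```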

training with a custom dataset

Hi @dauparas

Thanks for the good job done, I am already using ProteinMPNN and would like to train or finetune it on a custom dataset.
Are there some scripts on how to process PDB files into the format compatible with your training script please?

PDBID_CHAINID.pt - I am not sure how to create the values for mask, bfac and occ.

And some metadata will not always be available since I would like to use a mixture of experimental and predicted structures at the input ... Can some metadata be missing from the training files?

Thanks for any hints!

Why sequence design takes protein sequences as inputs?

Dear ProteinMPNN authors,

Thank you very much for your great efforts. I am having some trouble understanding the implementation conceptually. From my reading of the paper, the design task should be formulated as taking only the backbone structure as input and returning the designed sequences.

However, the provided example (e.g. submit_example_3.sh - directly from the .pdb path) actually takes both backbones and sequences as inputs and returns the re-sampled sequences, if I have not misunderstood the code:

ProteinMPNN/protein_mpnn_run.py

Line 302 in e61ecb7

    
           log_probs = model(X, S, mask, chain_M*chain_M_pos, residue_idx, chain_encoding_all, randn_1)

ProteinMPNN/protein_mpnn_utils.py

Line 1057 in e61ecb7

    
           def forward(self, X, S, mask, chain_M, residue_idx, chain_encoding_all, randn, use_input_decoding_order=False, decoding_order=None):

It would be appreciated if you could comment on whether my understanding is correct. If so, which script should I use to input the backbone structure only and get back the designed sequences?

Why the difficult interface to specify residues?

I haven't yet had a chance to start editing the code (and maybe I could streamline this), but why is this needed as opposed to putting these options into the single final run script?

python ../helper_scripts/parse_multiple_chains.py --input_path=$folder_with_pdbs --output_path=$path_for_parsed_chains

python ../helper_scripts/assign_fixed_chains.py --input_path=$path_for_parsed_chains --output_path=$path_for_assigned_chains --chain_list "$chains_to_design"

python ../helper_scripts/make_fixed_positions_dict.py --input_path=$path_for_parsed_chains --output_path=$path_for_fixed_positions --chain_list "$chains_to_design" --position_list "$fixed_positions"

python ../protein_mpnn_run.py \
        --jsonl_path $path_for_parsed_chains \
        --chain_id_jsonl $path_for_assigned_chains \
        --fixed_positions_jsonl $path_for_fixed_positions \
        --out_folder $output_dir \
        --num_seq_per_target 2 \
        --sampling_temp "0.1" \
        --seed 37

Setting up conda environment

Hi Justas! I just looked around the repo and couldn't find a .yaml file for setting up the conda environment (mlfold in the examples). Did I miss it, or does that still need to be added?

Very excited to try out some sequence design with ProteinMPNN!

scripts to generate PDBID_CHAINID.pt and PDBID.pt

I'd like to use different PDB files to train with, was wondering if there is a script that was used to auto generate the two .pt files mentioned in the issue header. Thanks :).

Getting last layer representations of the encoder and decoder?

Hi @dauparas !
Would you mind telling us which variables are the last layer representations of the encoder and decoder?
I want to output those representations.
Thank you!

Error in running colab version.

Dear Authors, the Hugging Face version has some issues running the code.

Here is the main error from the Colab notebook.

Generating sequences...

ValueError Traceback (most recent call last)
in <cell line: 2>()
25 S_sample = sample_dict["S"]
26 else:
---> 27 sample_dict = model.tied_sample(X, randn_2, S, chain_M, chain_encoding_all, residue_idx, mask=mask, temperature=temp, omit_AAs_np=omit_AAs_np, bias_AAs_np=bias_AAs_np, chain_M_pos=chain_M_pos, omit_AA_mask=omit_AA_mask, pssm_coef=pssm_coef, pssm_bias=pssm_bias, pssm_multi=pssm_multi, pssm_log_odds_flag=bool(pssm_log_odds_flag), pssm_log_odds_mask=pssm_log_odds_mask, pssm_bias_flag=bool(pssm_bias_flag), tied_pos=tied_pos_list_of_lists_list[0], tied_beta=tied_beta, bias_by_res=bias_by_res_all)
28 # Compute scores
29 S_sample = sample_dict["S"]

2 frames
/usr/local/lib/python3.10/dist-packages/opt_einsum/contract.py in contract_path(*operands, **kwargs)
236 size_dict[char] = dim
237 elif dim not in (1, size_dict[char]):
--> 238 raise ValueError("Size of label '{}' for operand {} ({}) does not match previous "
239 "terms ({}).".format(char, tnum, size_dict[char], dim))
240 else:

ValueError: Size of label 'i' for operand 1 (348) does not match previous terms (462).

CUBLAS_STATUS_NOT_INITIALIZED

When I run the "example from vanilla_proteinmpnn/examples/ to design a single PDB file", I get RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle).
What should I do to solve the problem?
Thanks for your reply.

some issues about jsonl files

Hi,

Great work. I wonder if we could use a single command to add many different sets of fixed residues to one jsonl file. It is a little tiring to generate the fixed-residue jsonl files for different fixed residues one by one.
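
A sketch of writing fixed positions for many structures in one go. It assumes the file holds a single JSON object on one line, keyed by structure name and then by chain, which matches the shape of the {"original": {"H": [...], "L": [...]}} example quoted in the numbering question; whether the run script expects exactly this should be verified. The function name and the structure names in the usage comment are hypothetical.

```python
import json

def write_fixed_positions(path, fixed_by_structure):
    """Write {structure name: {chain: positions}} as one JSON line and return it."""
    payload = {
        name: {chain: sorted(set(positions)) for chain, positions in chains.items()}
        for name, chains in fixed_by_structure.items()
    }
    with open(path, "w") as handle:
        handle.write(json.dumps(payload) + "\n")
    return payload

# Usage:
#   write_fixed_positions("fixed_pdbs.jsonl", {
#       "design_1": {"A": [1, 2, 3], "B": []},
#       "design_2": {"A": [10, 11], "B": [5]},
#   })
```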

Re-training?

Hello.

Terrific work, and thanks for making the code available. I'm wondering if there is currently any code for producing new model weights? I would find it interesting to train the model on a subset of the PDB.

Thanks. Take care.

question about dataset

I was curious why the parse-multichain script allows sequence jumps and NaN coordinates (that is, it keeps the missing residues), whereas the newly released script that parses CIF into .pt omits the missing coordinates and sequence. I know this is for the sake of training, but why not make the two consistent?

Using multiple input structures for a single sequence output

I try to design a sequence for a multi-state protein. It has four available conformations that are different from each other. How can I tie all residues between these four PDBs so that I can design a sequence optimised by all of four structures? I couldn't find an easy way to do this, having looked through the examples.

How to get the predicted PSSM from structure?

Dear ProteinMPNN team,

Thank you for building such a fantastic method to generate sequences from structure. I am thinking of getting a PSSM logo from a known structure. How can I get the predicted PSSM from a structure?

After checking the code, it looks like I can use log_probs in this line, and I also need to run this many times (I am not sure how many runs are needed to converge) because the ProteinMPNN decoder uses a random decoding order. Is this correct?

Thank you for your help in advance.
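
If log_probs is indeed a per-position table of log-probabilities, the averaging described above could look like the sketch below: exponentiate each run, then take the mean over runs. This is a generic illustration in plain Python with nested lists, not code from the repository; the function name average_profile is hypothetical, and how many runs are enough is left open.

```python
import math

def average_profile(log_prob_runs):
    """Average per-position probabilities over several runs.

    log_prob_runs: list of runs, each a list of rows (one row per position),
    each row a list of log-probabilities over the amino-acid alphabet.
    """
    n_runs = len(log_prob_runs)
    length = len(log_prob_runs[0])
    width = len(log_prob_runs[0][0])
    profile = [[0.0] * width for _ in range(length)]
    for run in log_prob_runs:
        for i, row in enumerate(run):
            for k, log_p in enumerate(row):
                profile[i][k] += math.exp(log_p) / n_runs
    return profile
```

Averaging probabilities rather than log-probabilities keeps each row of the profile summing to one when every input row does.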

Weights without adding noise during training or training code

Hi, thanks for the great work!
I have been trying to reproduce the results reported in the ProteinMPNN paper. However, I can only obtain a test perplexity of ~5.7, which is somewhat higher than the paper's 4.74/5.25 with and without added noise during training. I also loaded the v_48_002.pt weights and obtained a test perplexity of 5.29, which is almost the same as the result reported in the paper.
So I wonder whether you could release the weights trained without added noise, or share your training code? Thank you so much!

parse_PDB returns shorter sequence than pdb file

Hello,

I am just starting to use your model; it seems like very good work!
When I follow your quickdemo.ipynb example but use my own pdb file, the parsed sequence does not match.

The sequence in the pdb file has 297 residues (only the 20 natural AAs) and pdb_dict_list[0][f"seq_chain_{chain}"] is 296 residues long. The first residue, an M, has been cropped.

Unfortunately I don't think I can share the pdb file, but I wonder if there's anything I can check that could cause the first residue to be missing after parsing the pdb with your utils...

Thanks for any hints!

Homooligomer while also fixing residues

Is it possible to fix positions while using the homooligomer example?

How is the temperature parameter involved in ProteinMPNN design?

Hi, I have read through the ProteinMPNN preprint but don't really understand how the temperature parameter is involved in ProteinMPNN design.

Would you mind describing briefly how temperature is involved in ProteinMPNN?

Thank you!
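
In autoregressive samplers generally, temperature divides the logits before the softmax, so values near zero make sampling almost greedy and larger values flatten the distribution. The sketch below shows that standard scheme in plain Python, assuming ProteinMPNN follows it; it is not code from the repository, and the function names are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Probabilities proportional to exp(logit / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_index(logits, temperature, rng=random):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.1)  # mass concentrates on the top logit
flat = softmax_with_temperature(logits, 1.0)   # ordinary softmax
```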

About the variable "order_mask_backward" in protein_mpnn_utils.py, line 1085

Hi, I'm learning your ProteinMPNN framework. When going through your script, protein_mpnn_utils.py, I am confused with the variable order_mask_backward, which is defined and used in line 1085 and line 1086, respectively.

order_mask_backward = torch.einsum('ij, biq, bjp->bqp',(1-torch.triu(torch.ones(mask_size,mask_size, device=device))), permutation_matrix_reverse, permutation_matrix_reverse)
mask_attend = torch.gather(order_mask_backward, 2, E_idx).unsqueeze(-1)

As I understand it (according to line 1086, i.e., the second line above), order_mask_backward should be the tensor that records: for each residue, which residues are decoded before it in the reverse order (with the value of 1, meaning that these residues can be seen). In that case, index along dim=-1 and dim=-2 both represent the position of residues so that order_mask_backward can be gathered by E_idx (E_idx is a tensor that records for each residue, which residues are recognized as neighbors, with a shape of [num_batch, num_residues, num_neighbors]).

However, to my understanding, order_mask_backward as defined in the first line above records, for each decoding pair (q, p), whether there exists a corresponding residue pair (i, j) such that i > j. If one exists, the value is 1, else 0. Here, q and p are the indices along dim=-2 and dim=-1 of the tensor order_mask_backward respectively, and i and j are the positions of the residues in the sequence.

To clarify, take an easy example as follows.

import torch
import torch.nn.functional as F

num = 4 # num of residues
a = torch.Tensor([2,3,0,1]).long() # random decoding order, i.e., a[position_of_residue] = value of decoding order

one_hot_a = F.one_hot(a, num_classes=num).float()
one_hot_a = one_hot_a.unsqueeze(0)
result = torch.einsum('ij, biq, bjp->bqp', (1-torch.triu(torch.ones(num, num))), one_hot_a, one_hot_a) #  given by line 1085
result

tensor([[[0., 0., 1., 1.],
         [1., 0., 1., 1.],
         [0., 0., 0., 0.],
         [0., 0., 1., 0.]]])

For example, result[0][0][2] = 1, meaning that there exists such residue pair (i, j, s.t. i > j) that has a decoding order of (0, 2). In fact, residue 2 and residue 0 constitute such pair that could satisfy the above conditions. This example proves that my understanding of the variable order_mask_backward given by line 1085 may be right. However, in that case, order_mask_backward does not go along with E_idx in line 1086 because the index along dim=-2 and dim=-1 does not represent the position of residues.

Modification of the equation in torch.einsum() as follows may solve that problem.

torch.einsum('ji, bqi, bpj->bqp', (1-torch.triu(torch.ones(num, num)), one_hot_a, one_hot_a))

tensor([[[0., 1., 0., 0.],
         [0., 0., 0., 0.],
         [1., 1., 0., 1.],
         [1., 1., 0., 0.]]])

In the above result, result[0][0][1] = 1, meaning that the decoding order of residue 1 is before (in the reverse order, i.e., decoding backwards) residue 0 (3 and 2, respectively), so that residue 1 can be seen when decoding residue 0.

I'm not sure if my understanding is correct and if it has an influence on the model training result.
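
One way to probe this reading is to evaluate the original einsum by hand under the opposite convention, namely that the one-hot input is built from an argsort-style list in which entry t is the residue decoded at step t (whether protein_mpnn_utils.py constructs permutation_matrix_reverse that way should be checked in the source). The pure-Python sketch below does this with explicit loops; the function name is hypothetical. Note that the permutation [2, 3, 0, 1] used above is its own inverse, so it cannot distinguish the two conventions; a permutation such as [2, 0, 3, 1] can.

```python
def order_mask_backward(decoding_order):
    """Evaluate 'ij,iq,jp->qp' with a strictly lower-triangular first operand.

    decoding_order[t] is taken to be the residue decoded at step t.
    """
    n = len(decoding_order)
    lower = [[1.0 if i > j else 0.0 for j in range(n)] for i in range(n)]
    one_hot = [[1.0 if decoding_order[i] == q else 0.0 for q in range(n)]
               for i in range(n)]
    return [
        [
            sum(lower[i][j] * one_hot[i][q] * one_hot[j][p]
                for i in range(n) for j in range(n))
            for p in range(n)
        ]
        for q in range(n)
    ]

order = [2, 0, 3, 1]  # step 0 decodes residue 2, step 1 residue 0, and so on
mask = order_mask_backward(order)
step_of = {residue: step for step, residue in enumerate(order)}
```

Under this convention mask[q][p] is 1 exactly when residue p is decoded at an earlier step than residue q, so both axes index residue positions and gathering with E_idx would be consistent.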

Is there an example on how to score a set of sequences w/ a given backbone (PDB)?

Hey ProteinMPNN team! Big fan of the work and thank you for the excellent guides that are runnable in Colab.

I'm trying to work through the code base + util file to better understand how to score sequences (the negative log-probability of the sequence given a backbone). My question is, given a set of sequences and a PDB file for my backbone, how can I a) featurize the backbone and b) then compute the negative log-probability of each sequence with respect to the backbone?

For reference, here is a similar example of how to do so with ESM-IF1.

Thanks for your help!
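
Step b) reduces to a lookup once a table of per-position log-probabilities is available. The sketch below shows only that arithmetic on a toy table; producing the table from a backbone is the model's job and is not shown. The function name, the two-letter alphabet and the numbers are illustrative, and positions are 0-based here.

```python
import math

def mean_negative_log_prob(log_probs, sequence, alphabet, positions=None):
    """Average of -log p(residue) over the chosen positions (all positions by default)."""
    if positions is None:
        positions = range(len(sequence))
    positions = list(positions)
    total = -sum(log_probs[i][alphabet.index(sequence[i])] for i in positions)
    return total / len(positions)

alphabet = "AG"
log_probs = [
    [math.log(0.5), math.log(0.5)],    # position 0
    [math.log(0.25), math.log(0.75)],  # position 1
]
score_ag = mean_negative_log_prob(log_probs, "AG", alphabet)
score_aa = mean_negative_log_prob(log_probs, "AA", alphabet)  # less probable, higher score
```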
