protein-dpo

Introduction

This repository holds inference and training code for ProteinDPO (Protein Direct Preference Optimization), a preference-optimized, structure-conditioned protein language model based on ESM-IF1. We describe ProteinDPO in the paper “Aligning Protein Generative Models with Experimental Fitness via Direct Preference Optimization”.

Getting Started

  1. Clone this repository:
git clone https://github.com/evo-design/protein-dpo.git
  2. Navigate to the repository directory:
cd protein-dpo
  3. Use conda and pip to install the required dependencies:

Use the environment.yml file provided in this repository to create and activate a Conda environment with all the necessary dependencies.

conda env create -f environment.yml
conda activate <environment_name>

Install the most recent esm package from its GitHub repository with pip.

pip install git+https://github.com/facebookresearch/esm.git
  4. Download the model weights:

Download the ProteinDPO model weights from the Zenodo repository and place them in the weights folder.

Download the vanilla ESM-IF1 model weights into the weights directory with the following commands:

cd weights/
wget https://dl.fbaipublicfiles.com/fair-esm/models/esm_if1_gvp4_t16_142M_UR50.pt

Sampling

The sampling script is a lightly modified version of the one in the ESM-IF1 GitHub repository. Note that stabilizing a protein backbone with ProteinDPO is not guaranteed to preserve its function, so we strongly recommend preserving functional or heavily conserved residues with the --fixed_pos argument.

  1. Run the sampling script:
python sample.py --pdbfile <path_to_input_pdb> --weights_path <path_to_model_weights> [additional_arguments]

If no --weights_path is provided, the script defaults to the vanilla model weights.

Additional arguments:

--temperature: sampling temperature; lower temperatures produce less diverse samples
--outpath: path for the sampled sequence output
--num-samples: desired number of samples
--fixed_pos: positions to keep fixed during sampling; positions are 1-indexed (the first residue is 1, not 0)
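
To illustrate why a lower --temperature reduces diversity, here is a minimal sketch of temperature-scaled softmax sampling probabilities. It assumes the sampler divides logits by the temperature before the softmax, which is the usual convention; the logits and function names below are hypothetical and are not taken from sample.py.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher entropy means more diverse samples."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for four candidate amino acids at one position.
logits = [2.0, 1.0, 0.5, 0.1]
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top probability={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

At low temperature nearly all probability mass falls on the top-scoring residue, so repeated samples tend to agree; at higher temperature the distribution flattens and samples diverge.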

Scoring

  1. Prepare your dataset as a CSV file with the following columns:
aa_seq: amino acid sequence of the mutant variant
WT_Name: path to the native PDB file
<feature>: scalar label of the feature being optimized
wt_seq: amino acid sequence of the native protein
mut_type: mutation string (e.g. <native_aa><pos><mutant_aa>); separate simultaneous mutations with colons (e.g. <native_aa><pos><mutant_aa>:<native_aa><pos><mutant_aa>:...)

The file fireprot_homologue_free.csv is provided as an example. Note that to score this CSV, the PDB files must be downloaded and the 'WT_Name' column updated with their respective paths.
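
When building such a CSV by hand, it can help to check that mut_type, wt_seq, and aa_seq agree. The sketch below parses a mut_type string and applies it to a wild-type sequence. It assumes positions in mut_type are 1-indexed, which this README does not state, and the function names and example sequence are hypothetical.

```python
def parse_mut_type(mut_type):
    """Split a string such as 'K2R:A4G' into (native_aa, position, mutant_aa) tuples."""
    mutations = []
    for token in mut_type.split(":"):
        token = token.strip()
        native, pos, mutant = token[0], int(token[1:-1]), token[-1]
        mutations.append((native, pos, mutant))
    return mutations


def apply_mutations(wt_seq, mut_type):
    """Return the mutant sequence, checking each native residue against wt_seq.

    Positions are treated as 1-indexed (an assumption, not confirmed by the README).
    """
    seq = list(wt_seq)
    for native, pos, mutant in parse_mut_type(mut_type):
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position {pos} is outside the sequence")
        if seq[pos - 1] != native:
            raise ValueError(
                f"expected {native} at position {pos}, found {seq[pos - 1]}"
            )
        seq[pos - 1] = mutant
    return "".join(seq)


# Hypothetical eight-residue wild type with two simultaneous mutations.
wt_seq = "MKTAYIAK"
print(apply_mutations(wt_seq, "K2R:A4G"))  # MRTGYIAK
```

If a row raises an error here, the mut_type and wt_seq columns disagree, or the dataset uses a different indexing convention than the one assumed.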

  2. Run the scoring script:
python score.py --dataset_path <path_to_sequences_csv> --weights_path <path_to_model_weights> [additional_arguments]

Replace <path_to_model_weights> with the path to the trained ProteinDPO model or any ESM-IF1-compatible weights of your choice. If no --weights_path is provided, the script defaults to the vanilla model weights.

Additional arguments:

--normalize: pass to normalize the likelihood by that of the wild-type sequence
--whole_seq: pass to use the likelihood of the entire sequence, not just the mutated residue(s)
--sum: pass to sum likelihoods instead of averaging them
--out_path: path for the output CSV
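
The flags above can be read as choices about how per-residue log-likelihoods are combined into one score. The sketch below is one plausible interpretation, not the code in score.py. In particular, it assumes --normalize subtracts the wild-type score from the mutant score, and all names and numbers are hypothetical.

```python
def aggregate(log_probs, mutated_positions, whole_seq=False, use_sum=False):
    """Combine per-residue log-likelihoods into a single number.

    log_probs: one log-likelihood per residue
    mutated_positions: 1-indexed positions that differ from the wild type
    """
    if whole_seq:
        selected = list(log_probs)
    else:
        selected = [log_probs[p - 1] for p in mutated_positions]
    if not selected:
        raise ValueError("no residues selected for scoring")
    total = sum(selected)
    return total if use_sum else total / len(selected)


def score(mut_lp, wt_lp, mutated_positions,
          normalize=False, whole_seq=False, use_sum=False):
    """Score a mutant, optionally relative to the wild type (assumed subtraction)."""
    mutant = aggregate(mut_lp, mutated_positions, whole_seq, use_sum)
    if not normalize:
        return mutant
    wild_type = aggregate(wt_lp, mutated_positions, whole_seq, use_sum)
    return mutant - wild_type


# Hypothetical per-residue log-likelihoods for a four-residue protein
# with a single mutation at position 2.
mut_lp = [-0.5, -2.0, -0.3, -1.0]
wt_lp = [-0.5, -1.0, -0.3, -1.0]
print(score(mut_lp, wt_lp, [2]))                  # mutated residue only
print(score(mut_lp, wt_lp, [2], whole_seq=True))  # average over the whole sequence
print(score(mut_lp, wt_lp, [2], normalize=True))  # relative to the wild type
```

Under this reading, a negative normalized score means the model finds the mutant less likely than the wild type at the scored positions.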
  3. Analyze the results:

The CSV written to the path given by the --out_path argument contains the specified model likelihood for each sequence.
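
One common way to analyze such an output is to rank-correlate the model likelihood with the experimental label. The sketch below computes a Spearman correlation using only the standard library. The column names log_likelihood and ddG, and the inline example data, are hypothetical; substitute the actual column names from your output CSV.

```python
import csv
import io


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_from_csv(handle, score_col, label_col):
    """Read two numeric columns from an open CSV and rank-correlate them."""
    rows = list(csv.DictReader(handle))
    scores = [float(row[score_col]) for row in rows]
    labels = [float(row[label_col]) for row in rows]
    return spearman(scores, labels)


# Inline stand-in for an output CSV; replace with open("<out_path>").
example = io.StringIO(
    "aa_seq,ddG,log_likelihood\n"
    "MRTAYIAK,1.0,-0.2\n"
    "MKTGYIAK,-2.0,-1.5\n"
    "MKTAYIAR,0.1,-0.9\n"
)
print(spearman_from_csv(example, "log_likelihood", "ddG"))
```

A correlation near 1 indicates that sequences the model scores as more likely also tend to have higher labels; whether a higher label is desirable depends on the feature being optimized.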

Citation

Please cite the following preprint when referencing ProteinDPO.

@article {widatalla2024aligning,
	author = {Widatalla, Talal and Rafailov, Rafael and Hie, Brian},
	title = {Aligning protein generative models with experimental fitness via Direct Preference Optimization},
	year = {2024},
	doi = {10.1101/2024.05.20.595026},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/05/21/2024.05.20.595026},
	journal = {bioRxiv}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

protein-dpo's Issues

sample.py doesn't run

There are some stray quotes in there that cause the script to fail. I also added an option to use cpu instead of gpu, and to run from outside the repo. I tried submitting a pull request, but I don't have permissions to create the branch...

Installation error

Could you update the environment file?

Here is the error message from the code: $conda env create -f environment.yml

Pip subprocess error:
ERROR: Ignored the following versions that require a different python version: 0.41.0 Requires-Python >=3.10; 0.41.1 Requires-Python >=3.10; 2.6.0 Requires-Python <4,>=3.10; 2.6.1 Requires-Python <4,>=3.10; 2.6.2 Requires-Python <4,>=3.10; 2.7.0 Requires-Python <4,>=3.10; 3.0.0b1 Requires-Python >=3.10
ERROR: Could not find a version that satisfies the requirement fair-esm==2.0.1 (from versions: 0.1.0, 0.1.1, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 0.4.2, 0.5.0, 1.0.0, 1.0.2, 1.0.3, 2.0.0)
ERROR: No matching distribution found for fair-esm==2.0.1
