
Knowledge-aware Reinforced Language Models for Protein Directed Evolution

The official implementation of the ICML 2024 paper "Knowledge-aware Reinforced Language Models for Protein Directed Evolution".

Environments

To set up the environment for running KnowRLM, run pip install -r requirements.txt. The code used in Steps 1 and 2 has its own dependencies; follow the environment setup instructions provided by the respective libraries. Creating a separate virtual environment for each step is recommended to ensure compatibility and avoid conflicts.

Getting started

Step 1: Obtain the Initial Candidate Library

Acquire the initial set of 96 candidate sequences using the CLADE package and add them to the candidate sequence library. The CLADE package is available here. Refer to the paper: Yuchi Qiu, Jian Hu, and Guo-Wei Wei. "Cluster learning-assisted directed evolution." Nature Computational Science (2021).
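The snippet below is a minimal sketch (not part of the repository) of collecting the CLADE-selected variants into a working candidate library; the file names and column names ("AACombo", "Fitness") are assumptions and should be matched to the actual CLADE output format.

```python
# Minimal sketch: collect the 96 CLADE-selected variants into the candidate
# library. File names and column names ("AACombo", "Fitness") are placeholders
# and must be adapted to the actual CLADE output format.
import pandas as pd

clade_out = pd.read_csv("clade_output.csv")              # hypothetical CLADE result file
library = clade_out[["AACombo", "Fitness"]].head(96)     # initial 96 candidates
library.to_csv("data/candidate_library.csv", index=False)
print(f"Initial library size: {len(library)}")
```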

Step 2: Train the Reward Model

Train the reward function using the candidate sequence library. The reward function outputs predictions for the entire protein space, which are saved as a CSV file; the reward value for each protein can then be retrieved directly from this table. The reward function can be found here. Refer to the paper: Bruce J. Wittmann, Yisong Yue, and Frances H. Arnold. "Informed training set design enables efficient machine learning-assisted directed protein evolution." Cell Systems (2021).
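As a rough illustration of how the prediction CSV can serve as a reward table, here is a minimal sketch; the file name and column names ("AACombo", "PredictedFitness") are assumptions, not the reward model's actual output schema.

```python
# Minimal sketch: turn the reward model's full-space predictions (saved as a
# CSV) into a lookup table used as the RL reward. The file name and column
# names ("AACombo", "PredictedFitness") are assumptions.
import pandas as pd

preds = pd.read_csv("predictions_full_space.csv")        # hypothetical prediction file
reward_table = dict(zip(preds["AACombo"], preds["PredictedFitness"]))

def reward(variant: str) -> float:
    """Return the predicted fitness used as the reward for a four-site variant."""
    return reward_table.get(variant, 0.0)

print(reward("VDGV"))   # e.g. the wild-type combination at the four mutated GB1 sites
```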

Step 3: Obtain Candidate Sequences Through Reinforcement Learning

Run the corresponding script from the script folder, using python GB1_env.py or python PhoQ_env.py, to obtain the top 96 candidate sequences, and add them to the candidate sequence library. Repeat Steps 2 and 3 for n iterations, where n ∈ [1, 3], depending on the experimental setup. The data sources for the amino acid knowledge graph used in this project can be found here. Refer to the paper: Breimann, S., Kamp, F., Steiner, H., and Frishman, D. "AAontology: An ontology of amino acid scales for interpretable machine learning." bioRxiv (2023).
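A minimal sketch of this outer loop is given below; the per-round proposal files and the "reward" column are placeholders for the actual outputs of GB1_env.py / PhoQ_env.py and the retrained reward model.

```python
# Minimal sketch of the outer loop over Steps 2-3: after each RL round, append
# the 96 highest-reward proposed variants to the candidate library. File names
# and the "reward" column are placeholders for the real GB1_env.py / PhoQ_env.py
# outputs and the retrained reward model.
import pandas as pd

library = pd.read_csv("data/candidate_library.csv")
n_rounds = 3                                              # n in [1, 3] in the paper's setup

for i in range(1, n_rounds + 1):
    proposals = pd.read_csv(f"round_{i}_proposals.csv")   # hypothetical RL-round output
    top96 = proposals.nlargest(96, "reward")
    library = (pd.concat([library, top96], ignore_index=True)
                 .drop_duplicates(subset="AACombo"))
    library.to_csv("data/candidate_library.csv", index=False)
    print(f"Round {i}: library size = {len(library)}")
```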

Following the steps outlined above, the candidate sequence libraries obtained from our experimental runs are stored in /data/96, /data/192, /data/288, and /data/384, with the numbers indicating the quantity of candidate sequences. If you find the environment setup too cumbersome, you may consider using these pre-generated libraries directly.

Step 4: Predict Candidate Sequences Using the Predictor

The predictor is the same model as the reward model. Train the predictor using the candidate sequence library obtained from the previous steps, then predict the globally optimal 96 candidate sequences and add them to the candidate sequence library. The final library contains 96 + 96×n + 96 sequences. The experimental results are stored in the results directory; these results can also be used directly as the output of the reward function in Step 2 to provide rewards for the next round of reinforcement learning.
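For example, with n = 3 RL rounds the final library holds 96 + 96×3 + 96 = 480 sequences. Below is a minimal sketch of this final selection step, again using assumed file and column names rather than the repository's actual interface.

```python
# Minimal sketch: append the 96 variants with the highest predicted fitness
# (excluding those already in the library) to form the final library.
# For n = 3 RL rounds the final library holds 96 + 96*3 + 96 = 480 sequences.
# File and column names are placeholders.
import pandas as pd

preds = pd.read_csv("predictions_full_space.csv")         # hypothetical predictor output
library = pd.read_csv("data/candidate_library.csv")

already_selected = set(library["AACombo"])
new_top = preds[~preds["AACombo"].isin(already_selected)].nlargest(96, "PredictedFitness")

final_library = pd.concat([library, new_top], ignore_index=True)
final_library.to_csv("data/final_library.csv", index=False)
print(f"Final library size: {len(final_library)}")        # 480 when n = 3
```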

Step 5: Evaluation

Run the evaluation scripts from the script folder using python mean_max_3.py or python NDCG.py.
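For reference, NDCG can be computed as in the generic sketch below; this is an illustration of the metric itself, not the repository's NDCG.py.

```python
# Generic NDCG sketch (illustration of the metric, not the repository's NDCG.py):
# score the ranking induced by predicted fitness against the ideal ranking by
# true fitness.
import numpy as np

def dcg(relevances: np.ndarray) -> float:
    """Discounted cumulative gain of relevances given in ranked order."""
    positions = np.arange(1, len(relevances) + 1)
    return float(np.sum(relevances / np.log2(positions + 1)))

def ndcg(true_fitness: np.ndarray, predicted_fitness: np.ndarray) -> float:
    """NDCG of the prediction-induced ranking relative to the ideal ranking."""
    order = np.argsort(-predicted_fitness)
    ideal = np.sort(true_fitness)[::-1]
    return dcg(true_fitness[order]) / dcg(ideal)

# Toy usage with made-up fitness values:
true_f = np.array([3.2, 0.1, 1.7, 2.5])
pred_f = np.array([2.9, 0.3, 2.1, 1.0])
print(f"NDCG = {ndcg(true_f, pred_f):.3f}")
```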

Note: This method involves data sampling rather than training, which introduces randomness, so running multiple experiments is recommended. The results reported in the paper are the best outcomes obtained.

Reference

If you use our repository, please cite the following paper:

@inproceedings{wangknowledge,
  title={Knowledge-aware Reinforced Language Models for Protein Directed Evolution},
  author={Wang, Yuhao and Zhang, Qiang and Qin, Ming and Zhuang, Xiang and Li, Xiaotong and Gong, Zhichen and Wang, Zeyuan and Zhao, Yu and Yao, Jianhua and Ding, Keyan and others},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}
