plus's Introduction

Pre-training of deep bidirectional protein sequence representations with structural information (IEEE Access 2021)

Official PyTorch implementation of PLUS | Paper

Abstract

To bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences, several studies have adopted semi-supervised learning for protein sequence modeling. In these studies, models were pre-trained with a substantial amount of unlabeled data, and the representations were transferred to various downstream tasks. Most pre-training methods rely solely on language modeling and often exhibit limited performance. In this paper, we introduce a novel pre-training scheme called PLUS, which stands for Protein sequence representations Learned Using Structural information. PLUS consists of masked language modeling and a complementary protein-specific pre-training task, namely same-family prediction. PLUS can be used to pre-train various model architectures. In this work, we use PLUS to pre-train a bidirectional recurrent neural network and refer to the resulting model as PLUS-RNN. Our experimental results demonstrate that PLUS-RNN outperforms other models of similar size pre-trained solely with language modeling in six out of seven widely used protein biology tasks. Furthermore, we present the results of our qualitative interpretation analyses to illustrate the strengths of PLUS-RNN. PLUS provides a novel way to exploit evolutionary relationships among unlabeled proteins and is broadly applicable across a variety of protein biology tasks. We expect that the gap between the numbers of unlabeled and labeled proteins will continue to grow exponentially, and that the proposed pre-training method will play a larger role.
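
As a rough illustration of the two pre-training tasks named in the abstract, the sketch below masks residues for masked language modeling and labels sequence pairs for same-family prediction. The 15% masking rate, the `MASK` symbol, the helper names, and the toy sequences and family IDs are all assumptions made for this example; none of them are taken from the paper or from this repository's code.

```python
import random

MASK = "#"  # hypothetical mask symbol, not the repository's actual token


def mask_sequence(seq, rate=0.15, rng=None):
    """Replace roughly `rate` of the residues with MASK.

    Returns the masked sequence and a {position: original residue} dict,
    which is what a masked language model would be trained to recover.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, residue in enumerate(seq):
        if rng.random() < rate:
            masked.append(MASK)
            targets[i] = residue
        else:
            masked.append(residue)
    return "".join(masked), targets


def same_family_label(family_a, family_b):
    """Return 1 if two sequences share a family label, else 0."""
    return int(family_a == family_b)


# Toy data: (sequence, family id). Both are made up for this example.
proteins = [
    ("MKTAYIAKQR", "FAM_A"),
    ("MKTAHIAKQR", "FAM_A"),
    ("GSSGSSGLVP", "FAM_B"),
]

rng = random.Random(0)
masked, targets = mask_sequence(proteins[0][0], rate=0.15, rng=rng)

# Every unordered pair of sequences, labeled 1 when the families match.
pairs = [
    (a[0], b[0], same_family_label(a[1], b[1]))
    for i, a in enumerate(proteins)
    for b in proteins[i + 1:]
]
```

In a real pipeline the two objectives would be computed on the same batch of sequence pairs; here they are kept separate only to make each task easy to read.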

Data & Pre-trained Models

  • Data: Pfam, Homology, Solubility, Localization, Stability, Fluorescence, SecStr, Transmembrane
  • Pre-trained Models: PLUS-RNN_BASE, PLUS-RNN_LARGE, PLUS-TFM

How to Run

Example:

python plus_embedding.py --data-config config/data/embedding.json --model-config config/model/plus-rnn_large.json --run-config config/run/embedding.json --pretrained-model pretrained_models/PLUS-RNN_LARGE.pt --device 0 --output-path results/plus-rnn_large
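
The same invocation can be assembled programmatically, which may help when scripting runs. This is a minimal sketch: the helper name `build_embedding_command` is hypothetical, and the naming pattern it relies on (config stem in lowercase, checkpoint file as the upper-cased stem plus `.pt`) is only inferred from the single example above and may not hold for the other models.

```python
def build_embedding_command(model_name, device=0):
    """Assemble the plus_embedding.py argument list for one model.

    `model_name` is the lowercase config stem, e.g. "plus-rnn_large".
    Assumption: the checkpoint is the upper-cased stem plus ".pt",
    as in the one example given in this README.
    """
    return [
        "python", "plus_embedding.py",
        "--data-config", "config/data/embedding.json",
        "--model-config", f"config/model/{model_name}.json",
        "--run-config", "config/run/embedding.json",
        "--pretrained-model", f"pretrained_models/{model_name.upper()}.pt",
        "--device", str(device),
        "--output-path", f"results/{model_name}",
    ]


cmd = build_embedding_command("plus-rnn_large")
print(" ".join(cmd))  # reproduces the example command as one shell line

# To launch it from the repository root (not executed here):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues if a path ever contains spaces.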

Requirements

  • Python >=3.6
  • PyTorch 1.3.1
  • Numpy 1.17.4
  • SciPy 1.4.1
  • Pandas 1.1.1
  • Pillow 7.0.0
  • Scikit-learn 0.22.1
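
To check an existing environment against the versions listed above, something like the following sketch could be used. The mapping from the bullet names to pip distribution names (for example PyTorch to `torch`) is an assumption, the helper names are hypothetical, and `importlib.metadata` needs Python 3.8 or newer, which is stricter than the Python >=3.6 requirement.

```python
PINNED = {
    "torch": "1.3.1",
    "numpy": "1.17.4",
    "scipy": "1.4.1",
    "pandas": "1.1.1",
    "Pillow": "7.0.0",
    "scikit-learn": "0.22.1",
}


def find_mismatches(pinned, installed):
    """Return {package: (wanted, found)} for every package that differs.

    `found` is None when the package is not installed at all.
    """
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pinned.items()
        if installed.get(name) != wanted
    }


def installed_versions(names):
    """Look up installed versions via importlib.metadata (Python 3.8+)."""
    from importlib import metadata

    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    report = find_mismatches(PINNED, installed_versions(PINNED))
    for name, (wanted, found) in report.items():
        print(f"{name}: expected {wanted}, found {found}")
```

The comparison is an exact string match, so a build-tagged version string (a local suffix after the version number) would be reported as a mismatch even if the base version agrees.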

plus's People

Contributors

dependabot[bot], konstin, seonwoo-min


plus's Issues

Embedding a sequence without finetuning

Hi,

I'm working on bio_embeddings (poster, talk), a tool that embeds sequences using various methods, and we'd like to include PLUS-RNN as an embedding method.

For bio_embeddings, we generate embeddings from sequences without fine-tuning. Do you have an example of how to turn a sequence of amino acids into a sequence of embeddings with PLUS-RNN? Unfortunately, I could only find examples that involve a Trainer.

I'm also working on making PLUS a publishable Python package; you can see the current state here.

Paper Appendix

Very interesting work!
I am curious about the details of pre-training and fine-tuning, but I could not find the supplementary information. Where can I get the paper appendix?

No pr model

Hi, thanks for your great work!
However, I could not find a pre-trained pr_model for the evaluation tasks. Is something missing? I also used the pre-trained P-ELMo and lm models. When I set the pr model to 'P-ELMo_Homology.pt', I got the error message below:

Missing key(s) in state_dict: "hidden.weight", "hidden.bias", "output.weight", "output.bias". 
Unexpected key(s) in state_dict: "x_embed.weight", "fc_lm.weight", "fc_lm.bias", "rnn.weight_ih_l0", "rnn.weight_hh_l0", "rnn.bias_ih_l0", "rnn.bias_hh_l0", "rnn.weight_ih_l0_reverse", "rnn.weight_hh_l0_reverse", "rnn.bias_ih_l0_reverse", "rnn.bias_hh_l0_reverse", "rnn.weight_ih_l1", "rnn.weight_hh_l1", "rnn.bias_ih_l1", "rnn.bias_hh_l1", "rnn.weight_ih_l1_reverse", "rnn.weight_hh_l1_reverse", "rnn.bias_ih_l1_reverse", "rnn.bias_hh_l1_reverse", "rnn.weight_ih_l2", "rnn.weight_hh_l2", "rnn.bias_ih_l2", "rnn.bias_hh_l2", "rnn.weight_ih_l2_reverse", "rnn.weight_hh_l2_reverse", "rnn.bias_ih_l2_reverse", "rnn.bias_hh_l2_reverse", "fc.weight", "fc.bias", "ordinal_weight", "ordinal_bias". 

Should I train a pr model myself? Thanks!
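
For readers hitting the same error: every key the pr model expects is reported missing and every key in the file is reported unexpected, which is the pattern you would see if the checkpoint holds weights for a different module than the one being loaded. The sketch below reproduces that comparison on the key names from the message. The helper name `diff_state_dict_keys` is hypothetical, and the checkpoint key list is abbreviated.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Mimic the two lists PyTorch reports when load_state_dict fails.

    missing: keys the model expects but the checkpoint lacks.
    unexpected: keys in the checkpoint that the model does not define.
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Key names copied from the error message above (checkpoint list abbreviated).
pr_model_keys = ["hidden.weight", "hidden.bias", "output.weight", "output.bias"]
checkpoint_keys = [
    "x_embed.weight", "fc_lm.weight", "fc_lm.bias",
    "rnn.weight_ih_l0", "fc.weight", "fc.bias",
]

missing, unexpected = diff_state_dict_keys(pr_model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)

# With PyTorch installed, the same check on real objects would look like:
#   ckpt = torch.load(path, map_location="cpu")
#   diff_state_dict_keys(model.state_dict().keys(), ckpt.keys())
# assuming the file stores a plain state_dict rather than a wrapper dict.
```

An empty overlap between the two key sets, as here, indicates that the file and the module do not correspond at all, rather than a partial version mismatch.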
