
ProGen - (wip)

Implementation and replication of ProGen, Language Modeling for Protein Generation, in Pytorch and Jax (the weights will be made easily transferable between the two). You can think of this as GPT for protein sequences.

Requirements

We are going to use Poetry to manage the dependencies for this project, so first install it with its one-line bash installer.
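If Poetry isn't installed yet, the official install script (check the Poetry documentation in case the URL has moved) can be piped straight to Python:

$ curl -sSL https://install.python-poetry.org | python3 -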

Next, git clone the project and install the dependencies

$ git clone git@github.com:lucidrains/progen
$ cd progen
$ poetry install

For training on GPUs, you may need to rerun pip install with the correct CUDA version. You can follow the official JAX installation instructions.

# ex. CUDA 11.1
$ pip install --upgrade "jax[cuda111]" -f https://storage.googleapis.com/jax-releases/jax_releases.html

When running any of the scripts, note that the commands are always prefixed with poetry run.

Usage

from jax import random
from haiku import PRNGSequence
from progen_transformer import ProGen

model = ProGen(
    num_tokens = 256,
    dim = 512,
    seq_len = 1024,
    window_size = 256,       # local attention window size
    depth = 12,              # number of layers
    heads = 8,               # attention heads
    dim_head = 64,           # dimension per head
    ff_glu = True,           # use GLU in feedforward, from Noam's paper
    global_mlp_depth = 2     # last N global gmlp layers
)

rng = PRNGSequence(42)
seq = random.randint(next(rng), (1024,), 0, 256)

params = model.init(next(rng), seq)
logits = model.apply(params, next(rng), seq) # (1024, 256)
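The apply call returns per-position logits over the token vocabulary. As a minimal, illustrative sketch (not the repo's sampling code, which lives in sample.py), the logits at the last position can be decoded greedily or sampled:

import jax.numpy as jnp

next_token = jnp.argmax(logits[-1])                     # greedy pick at the last position
sampled    = random.categorical(next(rng), logits[-1])  # or sample with a fresh PRNG key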

Training

Download UniRef50 from UniProt and place uniref50.fasta in the root directory.
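For example, it can be fetched from the UniProt FTP mirror (verify the path on the UniProt downloads page, as it may change):

$ wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz
$ gunzip uniref50.fasta.gz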

$ poetry run python generate_data.py

You should see a lot of green if everything succeeds. Then

$ poetry run python train.py

By default, the script will checkpoint and resume automatically, but if you wish to clear your progress and restart, just add a --new flag

$ poetry run python train.py --new

Model checkpoints will be saved periodically to ./ckpts

Finally, to sample from your checkpoint, just do

$ poetry run python sample.py

You can pass a priming string with --prime. Either pass the annotations, followed by #, to get a generated sequence, or pass the sequence (also followed by #) to get generated annotations.

$ poetry run python sample.py --prime "[Tax=Mammalia] #"

Mixed Precision

To use mixed precision training, you'll need to install the latest Haiku with the following command

$ pip install git+https://github.com/deepmind/dm-haiku

Then make sure to set the --mixed_precision flag when invoking the training script

$ poetry run python train.py --mixed_precision
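Under the hood, Haiku's mixed precision support is policy based. Roughly, the flag enables something like the following (an illustrative sketch only, assuming ProGen is or wraps a Haiku module; see train.py for the actual setup):

import jmp
import haiku as hk
from progen_transformer import ProGen

# compute in half precision, keep parameters and outputs in full precision
policy = jmp.get_policy('params=float32,compute=float16,output=float32')
hk.mixed_precision.set_policy(ProGen, policy)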

Todo

  • model parallelism with pjit
  • join in GO annotations with pandas dataframe
  • setup annotation -> template string system, all configuration driven, find easy way to test. offer two types of annotations, one parsed from uniref descriptions, the other from GO annotation presence
  • add multiple data sources (check out trembl)
  • when sampling, prime with entire sequence prior to the pound sign (intersection of sequence and annotation)
  • utilize all cores when processing data
  • save all training settings in the checkpoints too
  • bfloat16 on xla
  • resume from correct place in tfrecord even if batch size is changed in between runs, display number of sequences processed
  • train compressed gzip tfrecords from google cloud storage path
  • remove tfrecord package and just use tfrecordwriter with gzip (see the sketch after this list)
  • generate validation tfrecords
  • checkpoint and resume from a google cloud storage path
  • use jinja2 for wandb html sample logging
  • manage experimental tracker state, and also allow ability to turn it off by piping to noop
  • add a confirmation before clearing a folder for --new run
  • engineer mask in cross entropy loss so that padding can be reused as end-of-string token
  • flip seq # annotation order with prob set in config
  • keep N last checkpoints
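For the TFRecordWriter item above, a minimal sketch of writing gzip-compressed records directly with TensorFlow (the field name and the data iterable are hypothetical):

import tensorflow as tf

options = tf.io.TFRecordOptions(compression_type = 'GZIP')

with tf.io.TFRecordWriter('train.tfrecord.gz', options = options) as writer:
    for seq in sequences:  # sequences: an iterable of byte-encoded protein strings
        example = tf.train.Example(features = tf.train.Features(feature = {
            'seq': tf.train.Feature(bytes_list = tf.train.BytesList(value = [seq]))
        }))
        writer.write(example.SerializeToString())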

Acknowledgements

Many thanks go out to Ben Wang, who showed that this type of large-scale training can be achieved with GPT-J.

Citations

@misc{madani2020progen,
    title   = {ProGen: Language Modeling for Protein Generation}, 
    author  = {Ali Madani and Bryan McCann and Nikhil Naik and Nitish Shirish Keskar and Namrata Anand and Raphael R. Eguchi and Po-Ssu Huang and Richard Socher},
    year    = {2020},
    eprint  = {2004.03497},
    archivePrefix = {arXiv},
    primaryClass = {q-bio.BM}
}
@misc{su2021roformer,
    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
    year    = {2021},
    eprint  = {2104.09864},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    eprint  = {2002.05202},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}


Issues

protein bert uniref90 dataset

(discussed on Discord)

After running the first step (create_uniref_db) of https://github.com/nadavbra/protein_bert, I got a 24GB file, "uniref_proteins_and_annotations.db".
It seems it could be useful for generating sequences for this project, so I'm sharing the links here.

CREATE TABLE "protein_annotations" (
    "index"    INTEGER,
    "tax_id"    REAL,
    "uniprot_name"    TEXT,
    "go_annotations"    TEXT,
    "flat_go_annotations"    TEXT,
    "n_go_annotations"    INTEGER,
    "complete_go_annotation_indices"    TEXT,
    "n_complete_go_annotations"    INTEGER
);

A sample looks like this:

index tax_id uniprot_name go_annotations flat_go_annotations n_go_annotations complete_go_annotation_indices n_complete_go_annotations
0 0 1.57204e+06 A0A5A9P0L4_9TELE {"GO Molecular Function": ["GO:0003755", "GO:0005524", "GO:0004672", "GO:0005509"], "GO Biological Process": [], "GO Cellular Component": []} ["GO:0003755", "GO:0004672", "GO:0005509", "GO:0005524"] 4 [2761, 3561, 4193, 4205] 4
1 1 648755 UPI0016133188 {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} [] 0 [] 0
2 2 1.93059e+06 A0A410P257_9BACT {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} [] 0 [] 0
3 3 519421 UPI0019403D63 {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} [] 0 [] 0
4 4 72004 A0A6B0RPA5_9CETA {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": []} ["GO:0004672", "GO:0005524"] 2 [3561, 4205] 2
5 5 375764 A0A672ZWI7_9TELE {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} [] 0 [] 0
6 6 1.41558e+06 A0A6P7YNV3_9AMPH {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": ["GO:0005886"]} ["GO:0004672", "GO:0005524", "GO:0005886"] 3 [3561, 4205, 4526] 3
7 7 240159 A0A4U5TZD8_COLLU {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": ["GO:0016021", "GO:0005886"]} ["GO:0004672", "GO:0005524", "GO:0005886", "GO:0016021"] 4 [3561, 4205, 4526, 10019] 4
8 8 146911 UPI00074FFD9C {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} [] 0 [] 0
9 9 260995 A0A6P8RG40_GEOSA {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": ["GO:0005886"]} ["GO:0004672", "GO:0005524", "GO:0005886"] 3 [3561, 4205, 4526] 3
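A quick, illustrative way to inspect this table with pandas, assuming the .db file is a standard SQLite database:

import sqlite3
import pandas as pd

conn = sqlite3.connect('uniref_proteins_and_annotations.db')
df = pd.read_sql_query('SELECT * FROM protein_annotations LIMIT 10', conn)
print(df[['uniprot_name', 'flat_go_annotations', 'n_go_annotations']])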

OOM Error when training the model

I get this Out Of Memory error (jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 134217728 bytes.) every time I try to train the model, no matter whether I use mixed precision, enable wandb or not, or change the config parameters to use a smaller subset of the database for training.

I have tried many "solutions" found online, but none seem to work. Does anyone have any idea what might be going wrong?
I am training on two Nvidia GeForce GPUs.

Question on Checkpoints

Hi, thank you for sharing the code.
I'm wondering if you have provided the pretrained checkpoints somewhere.
