
ProGen - (wip)

Implementation and replication of ProGen, Language Modeling for Protein Generation, in Pytorch and Jax (the weights will be made easily transferable between the two). You can think of this as GPT for protein sequences.

Requirements

We are going to use Poetry to manage the dependencies for this project, so first install it with its one-line bash installer.
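If Poetry isn't installed yet, the official install script (check the Poetry documentation in case the URL has moved) can be piped straight to Python:

$ curl -sSL https://install.python-poetry.org | python3 -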

Next, git clone the project and install the dependencies

$ git clone git@github.com:lucidrains/progen
$ cd progen
$ poetry install

For training on GPUs, you may need to rerun pip install with the correct CUDA version. You can follow the official JAX installation instructions.

# ex. CUDA 11.1
$ pip install --upgrade "jax[cuda111]" -f https://storage.googleapis.com/jax-releases/jax_releases.html

When running any of the scripts, note that the commands are always prefixed with poetry run.

Usage

from jax import random
from haiku import PRNGSequence
from progen_transformer import ProGen

model = ProGen(
    num_tokens = 256,
    dim = 512,
    seq_len = 1024,
    window_size = 256,       # local attention window size
    depth = 12,              # number of layers
    heads = 8,               # attention heads
    dim_head = 64,           # dimension per head
    ff_glu = True,           # use GLU in feedforward, from Noam's paper
    global_mlp_depth = 2     # last N global gmlp layers
)

rng = PRNGSequence(42)
seq = random.randint(next(rng), (1024,), 0, 256)

params = model.init(next(rng), seq)
logits = model.apply(params, next(rng), seq) # (1024, 256)
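The apply call returns per-position logits over the token vocabulary. As a minimal, illustrative sketch (not the repo's sampling code, which lives in sample.py), the logits at the last position can be decoded greedily or sampled:

import jax.numpy as jnp

next_token = jnp.argmax(logits[-1])                     # greedy pick at the last position
sampled    = random.categorical(next(rng), logits[-1])  # or sample with a fresh PRNG key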

Training

Download UniRef50 from UniProt and place uniref50.fasta in the root directory.
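For example, it can be fetched from the UniProt FTP mirror (verify the path on the UniProt downloads page, as it may change):

$ wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz
$ gunzip uniref50.fasta.gz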

$ poetry run python generate_data.py

You should see a lot of green if everything succeeds. Then

$ poetry run python train.py

By default, the script will checkpoint and resume automatically, but if you wish to clear your progress and restart, just add a --new flag

$ poetry run python train.py --new

Model checkpoints will be saved periodically to ./ckpts

Finally, to sample from your checkpoint, just do

$ poetry run python sample.py

You can pass a priming string with --prime. Either pass the annotations, followed by #, to get a generated sequence, or pass the sequence (also followed by #) to get generated annotations.

$ poetry run python sample.py --prime "[Tax=Mammalia] #"

Mixed Precision

To use mixed precision training, you'll need to install the latest Haiku with the following command

$ pip install git+https://github.com/deepmind/dm-haiku

Then make sure to set the --mixed_precision flag when invoking the training script

$ poetry run python train.py --mixed_precision
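Under the hood, Haiku's mixed precision support is policy based. Roughly, the flag enables something like the following (an illustrative sketch only, assuming ProGen is or wraps a Haiku module; see train.py for the actual setup):

import jmp
import haiku as hk
from progen_transformer import ProGen

# compute in half precision, keep parameters and outputs in full precision
policy = jmp.get_policy('params=float32,compute=float16,output=float32')
hk.mixed_precision.set_policy(ProGen, policy)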

Todo

  • model parallelism with pjit
  • join in GO annotations with pandas dataframe
  • setup annotation -> template string system, all configuration driven, find easy way to test. offer two types of annotations, one parsed from uniref descriptions, the other from GO annotation presence
  • add multiple data sources (check out trembl)
  • when sampling, prime with entire sequence prior to the pound sign (intersection of sequence and annotation)
  • utilize all cores when processing data
  • save all training settings in the checkpoints too
  • bfloat16 on xla
  • resume from correct place in tfrecord even if batch size is changed in between runs, display number of sequences processed
  • train compressed gzip tfrecords from google cloud storage path
  • remove tfrecord package and just use tfrecordwriter with gzip (see the sketch after this list)
  • generate validation tfrecords
  • checkpoint and resume from a google cloud storage path
  • use jinja2 for wandb html sample logging
  • manage experimental tracker state, and also allow ability to turn it off by piping to noop
  • add a confirmation before clearing a folder for --new run
  • engineer mask in cross entropy loss so that padding can be reused as end-of-string token
  • flip seq # annotation order with prob set in config
  • keep N last checkpoints
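For the TFRecordWriter item above, a minimal sketch of writing gzip-compressed records directly with TensorFlow (the field name and the data iterable are hypothetical):

import tensorflow as tf

options = tf.io.TFRecordOptions(compression_type = 'GZIP')

with tf.io.TFRecordWriter('train.tfrecord.gz', options = options) as writer:
    for seq in sequences:  # sequences: an iterable of byte-encoded protein strings
        example = tf.train.Example(features = tf.train.Features(feature = {
            'seq': tf.train.Feature(bytes_list = tf.train.BytesList(value = [seq]))
        }))
        writer.write(example.SerializeToString())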

Acknowledgements

Many thanks go out to Ben Wang, who showed that this type of large-scale training can be achieved with GPT-J.

Citations

@misc{madani2020progen,
    title   = {ProGen: Language Modeling for Protein Generation}, 
    author  = {Ali Madani and Bryan McCann and Nikhil Naik and Nitish Shirish Keskar and Namrata Anand and Raphael R. Eguchi and Po-Ssu Huang and Richard Socher},
    year    = {2020},
    eprint  = {2004.03497},
    archivePrefix = {arXiv},
    primaryClass = {q-bio.BM}
}
@misc{su2021roformer,
    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
    year    = {2021},
    eprint  = {2104.09864},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    eprint  = {2002.05202},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}


Issues

protein bert uniref90 dataset

(discussed on Discord)

After running the first step (create_uniref_db) of https://github.com/nadavbra/protein_bert, I got a 24GB file, "uniref_proteins_and_annotations.db".
It seems it could be useful for generating sequences for this project, so I'm sharing the links here.

CREATE TABLE "protein_annotations" (
    "index"    INTEGER,
    "tax_id"    REAL,
    "uniprot_name"    TEXT,
    "go_annotations"    TEXT,
    "flat_go_annotations"    TEXT,
    "n_go_annotations"    INTEGER,
    "complete_go_annotation_indices"    TEXT,
    "n_complete_go_annotations"    INTEGER
);

A sample looks like this:

index tax_id uniprot_name go_annotations flat_go_annotations n_go_annotations complete_go_annotation_indices n_complete_go_annotations
0 0 1.57204e+06 A0A5A9P0L4_9TELE {"GO Molecular Function": ["GO:0003755", "GO:0005524", "GO:0004672", "GO:0005509"], "GO Biological Process": [], "GO Cellular Component": []} ["GO:0003755", "GO:0004672", "GO:0005509", "GO:0005524"] 4 [2761, 3561, 4193, 4205] 4
1 1 648755 UPI0016133188 {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} [] 0 [] 0
2 2 1.93059e+06 A0A410P257_9BACT {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} [] 0 [] 0
3 3 519421 UPI0019403D63 {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} [] 0 [] 0
4 4 72004 A0A6B0RPA5_9CETA {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": []} ["GO:0004672", "GO:0005524"] 2 [3561, 4205] 2
5 5 375764 A0A672ZWI7_9TELE {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} [] 0 [] 0
6 6 1.41558e+06 A0A6P7YNV3_9AMPH {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": ["GO:0005886"]} ["GO:0004672", "GO:0005524", "GO:0005886"] 3 [3561, 4205, 4526] 3
7 7 240159 A0A4U5TZD8_COLLU {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": ["GO:0016021", "GO:0005886"]} ["GO:0004672", "GO:0005524", "GO:0005886", "GO:0016021"] 4 [3561, 4205, 4526, 10019] 4
8 8 146911 UPI00074FFD9C {"GO Molecular Function": [], "GO Biological Process": [], "GO Cellular Component": []} [] 0 [] 0
9 9 260995 A0A6P8RG40_GEOSA {"GO Molecular Function": ["GO:0005524", "GO:0004672"], "GO Biological Process": [], "GO Cellular Component": ["GO:0005886"]} ["GO:0004672", "GO:0005524", "GO:0005886"] 3 [3561, 4205, 4526] 3
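A quick, illustrative way to inspect this table with pandas, assuming the .db file is a standard SQLite database:

import sqlite3
import pandas as pd

conn = sqlite3.connect('uniref_proteins_and_annotations.db')
df = pd.read_sql_query('SELECT * FROM protein_annotations LIMIT 10', conn)
print(df[['uniprot_name', 'flat_go_annotations', 'n_go_annotations']])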

OOM Error when training the model

I get this Out Of Memory error (jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 134217728 bytes.) every time I try to train the model, no matter whether I use mixed precision, enable wandb or not, or change the config parameters to use a smaller subset of the database for training.

I have tried many "solutions" found online, but none seem to work. Does anyone have any idea what might be going wrong?
I am training on two Nvidia GeForce GPUs.

Question on Checkpoints

Hi, thank you for sharing the code.
I'm wondering if you have provided the pretrained checkpoints somewhere.
