eternagame / eternabrain
Deep learning to solve RNA design puzzles
Home Page: https://software.eternagame.org/
License: Other
# this is for only 1 randomly generated sequence
[4 2 4 4 2 2 3 1 2]
[1 2 4 4 2 2 3 1 2] [2 2 4 4 2 2 3 1 2] [3 2 4 4 2 2 3 1 2] [4 2 4 4 2 2 3 1 2]
# calculate reward
[4 2 4 4 2 2 3 1 2]
[4 1 4 4 2 2 3 1 2] [4 2 4 4 2 2 3 1 2] [4 3 4 4 2 2 3 1 2] [4 4 4 4 2 2 3 1 2]
# calculate reward
[4 2 4 4 2 2 3 1 2]
[4 2 1 4 2 2 3 1 2] [4 2 2 4 2 2 3 1 2] [4 2 3 4 2 2 3 1 2] [4 2 4 4 2 2 3 1 2]
# calculate reward
[4 2 4 4 2 2 3 1 2]
[4 2 4 1 2 2 3 1 2] [4 2 4 2 2 2 3 1 2] [4 2 4 3 2 2 3 1 2] [4 2 4 4 2 2 3 1 2]
# calculate reward
Eventually, the entire list becomes 4s.
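The mutate-and-score loop above can be sketched in Python. The `reward` function here is a hypothetical stand-in for the real folding-based score; it simply prefers 4s so that the loop reproduces the "entire list becomes 4s" behavior.

```python
def reward(sequence):
    # Hypothetical stand-in for the real reward (e.g., how closely the
    # folded structure matches the target); here it just counts 4s.
    return sum(1 for base in sequence if base == 4)

def greedy_mutate(sequence):
    # At each position, try all four base codes (1-4) and keep the
    # candidate with the highest reward (the "# calculate reward" step).
    for i in range(len(sequence)):
        candidates = [sequence[:i] + [b] + sequence[i + 1:] for b in (1, 2, 3, 4)]
        sequence = max(candidates, key=reward)
    return sequence

print(greedy_mutate([4, 2, 4, 4, 2, 2, 3, 1, 2]))  # -> [4, 4, 4, 4, 4, 4, 4, 4, 4]
```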
Sequence: GGGAUAACCU Structure: (((....))) Locks: oooxxxxooo
[0,0,1,0],[0,0,1,0],[0,0,1,0],[1,0,0,0],[1,0,0,0],[1,0,0,0],[1,0,0,0],[0,0,0,1],[0,0,0,1],[0,1,0,0]
[0,1,0,0],[0,1,0,0],[0,1,0,0],[1,0,0,0],[1,0,0,0],[1,0,0,0],[1,0,0,0],[0,0,1,0],[0,0,1,0],[0,0,1,0]
[9,0,0,0],[8,0,0,0],[7,0,0,0],[-1,0,0,0],[-1,0,0,0],[-1,0,0,0],[-1,0,0,0],[2,0,0,0],[1,0,0,0],[0,0,0,0]
[0,0,0,0],[0,0,0,0],[0,0,0,0],[1,0,0,0],[0,1,0,0],[1,0,0,0],[1,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]
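The second row (a one-hot of the dot-bracket structure) and the third row (the pairmap) can be generated like this. This is a sketch; the exact encodings EternaBrain uses for the other rows may differ.

```python
def pairmap(structure):
    # Map each index to its paired index, or -1 if unpaired,
    # matching the first entries of the third row above.
    stack, pairs = [], [-1] * len(structure)
    for i, ch in enumerate(structure):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs

def one_hot_structure(structure):
    # '.' -> [1,0,0,0], '(' -> [0,1,0,0], ')' -> [0,0,1,0],
    # matching the second row above (the fourth slot is unused here).
    codes = {'.': [1, 0, 0, 0], '(': [0, 1, 0, 0], ')': [0, 0, 1, 0]}
    return [codes[ch] for ch in structure]

print(pairmap('(((....)))'))  # -> [9, 8, 7, -1, -1, -1, -1, 2, 1, 0]
```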
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction,labels=y))
Logits and labels not of same size.
net = tflearn.input_data(shape=[None, 4716])
Shape size is incorrect for input matrix.
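Both errors come down to tensor shapes that do not line up: `softmax_cross_entropy_with_logits` requires `logits` and `labels` to have the same `[batch, num_classes]` shape, and `input_data(shape=[None, 4716])` requires every feature vector to be exactly 4716 long. A NumPy sketch of the first constraint (a simplified stand-in, not the TensorFlow implementation):

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Both inputs must share the same [batch, num_classes] shape;
    # a mismatch is what triggers the size error above.
    assert logits.shape == labels.shape, "logits/labels shape mismatch"
    shifted = logits - logits.max(axis=1, keepdims=True)  # for stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

logits = np.array([[2.0, 1.0, 0.1, 0.1]])  # one prediction over 4 bases
labels = np.array([[1.0, 0.0, 0.0, 0.0]])  # one-hot target, same shape
print(softmax_cross_entropy(logits, labels))  # roughly [0.51]
```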
Add target energy to list of features
Hi Rohan,
Here are the getting started puzzles I was talking about: [6892343, 6892344, 6892345, 6892346, 6892347, 6892348, 7254756, 7254757, 7254758, 7254759, 7254760, 7254761]
I uploaded the problems file to the Dropbox folder - GitHub didn't like its size when I tried to commit it directly. User data will be there too. Hope everything went well with the SAT tests!
Can look at EteRNAbot rules for more strategies.
for i in pairmap: get index of last pair before -1, get index of paired base, and change bases at those indices to G-C
if num_unpaired_bases_in_a_row >= 3 then boost with G
for every paired base: if '(' has one '.' following it and its complementary ')' has one '.' preceding it, then G-G boost
for every paired base: if '(' has two '.'s following it and its complementary ')' has two '.'s preceding it, then UGUG boost
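The first rule above might be sketched like this. The helper name and the letter-based sequence are illustrative assumptions; the real EternaBot strategies are more involved.

```python
def close_loops_with_gc(sequence, structure):
    # Find each '(' that is the last pair before a run of unpaired
    # bases ('.') and set that pair and its partner to G-C.
    seq = list(sequence)
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            pairs[stack.pop()] = i
    for i, j in pairs.items():
        if structure[i + 1] == '.':  # innermost pair closing a loop
            seq[i], seq[j] = 'G', 'C'
    return ''.join(seq)

print(close_loops_with_gc('AAAAUAACUU', '(((....)))'))  # -> 'AAGAUAACUU'
```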
Resets base sequence to starting sequence each time rather than adding previous base changes to overall sequence
Use BEAR notation as additional structural feature
For comparisons, use BEAR instead of dot-bracket notation
[[sequence],[current structure],[target structure],[energy]]
structure_and_energy_at_current_time
Currently predicting only the base.
Options:
num_locations x num_bases matrix (or just 85 x 4)
Eterna uses Vienna1 for calculating folding; EternaBrain is currently using Vienna2.
After a certain number of epochs, the Keras model's loss becomes NaN and accuracy drops to 0.2698 (roughly 1/4)
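A NaN loss usually means the arithmetic inside the loss overflowed or hit log(0); once the weights are NaN, the model degenerates to chance, which over four bases is about 1/4, matching the 0.2698 accuracy. A small NumPy illustration of the overflow and the standard log-sum-exp fix (other common causes, such as a too-high learning rate, are not shown):

```python
import numpy as np

logits = np.array([1000.0, 999.0, 998.0, 997.0])  # large unnormalized outputs

with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(logits) / np.exp(logits).sum()  # exp overflows: inf/inf -> NaN

shifted = np.exp(logits - logits.max())  # log-sum-exp shift
stable = shifted / shifted.sum()         # finite, sums to 1

print(np.isnan(naive).any(), np.isnan(stable).any())  # -> True False
```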
Currently, structure_and_energy_at_current_time works with only one puzzle ID. Supporting multiple IDs would reduce the number of pickles and the amount of time spent unpickling when training.
Instead of copying and pasting structures from the Eterna website, read them from a .txt file.
Occurring only on puzzles 6892343, 7254758.
Encode location similar to bases
[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0] # location 10; number of bases = 15
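A sketch of that location encoding, assuming 1-indexed locations as in the example:

```python
def one_hot_location(location, num_bases):
    # A 1 at the location's slot in a vector of length num_bases,
    # e.g. location 10 of 15 -> a 1 at the tenth slot.
    vec = [0] * num_bases
    vec[location - 1] = 1
    return vec

print(one_hot_location(10, 15))  # -> [0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
```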
For multi-state puzzles, the features should look like this:
[sequence, current_structure, target_structure, current_energy,
target_energy, reporter, A, B, C] # A, B, C are other oligos
Some puzzles have bases which cannot be mutated; this constraint might also need to be encoded as a feature.
I was wondering whether the training, validation, and test datasets for EternaBrain are publicly available. In particular, I am looking for a dataset of RNA sequences and contact maps. I don't really need the specific player moves.
Thank you for your help, and sorry if this is off topic.
Checks if base prediction already matches existing base in that location.
Example:
base_sequence = [1,1,1,1]
Prediction = [1630.97,1630.88,1630.56,1630.30]
argmax(Prediction) == 0  # index 0, i.e. base A (code 1)
location = 3
if base_sequence[location] == argmax(Prediction) + 1:  # compare base codes, not indices
    SmartPredictor()  # takes the 2nd-highest probability and uses that as the base
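Putting the check together, a minimal sketch; the `smart_predict` helper is hypothetical, and it assumes prediction indices 0-3 map to base codes 1-4.

```python
def smart_predict(prediction, base_sequence, location):
    # Rank base indices by predicted score, highest first, and return the
    # best base that differs from the one already at `location`; this
    # yields the 2nd-highest choice when the top choice is already placed.
    ranked = sorted(range(len(prediction)), key=lambda i: prediction[i], reverse=True)
    for idx in ranked:
        if idx + 1 != base_sequence[location]:  # index 0..3 -> base code 1..4
            return idx + 1
    return ranked[0] + 1  # fallback, unreachable with four candidates

print(smart_predict([1630.97, 1630.88, 1630.56, 1630.30], [1, 1, 1, 1], 3))  # -> 2
```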
Came across the recent bioRxiv posting of this work - awesome stuff! I should probably comment through the official Disqus comment thread, but whatever.
It was interesting to note how you narrowed down the input data and implemented a couple of regularization strategies to improve the accuracy of the tandem-CNN model. I'm curious whether you have experimented with different loss functions? Though the predicted probability vectors are short, some probability theory can still be applied, I think. See this for some insight on choosing a loss function. If you know, or can reason out, the distribution of the noise, a tailored loss function may improve the accuracy.
Some other off-the-shelf loss functions: https://www.tensorflow.org/api_docs/python/tf/losses.
Best of luck!