Issues for eternabrain, from eternagame

Update encode_movesets_style

Properly encode resetting or pasting sequences

bioRxiv paper comments

Came across the recent bioRxiv posting of this work - awesome stuff! I should probably comment through the official Disqus comment thread, but whatever.

It was interesting to note how you honed down the input data and implemented a couple of regularization strategies to improve the accuracy of the tandem-CNN model. I'm curious whether you have tried different loss functions. Though the predicted probability vectors are short, some probability theory can still be applied, I think. See this for some insight on choosing a loss function. If you know, or can reason out, the distribution of the noise, a tailored loss function may improve the accuracy.

Some other off-the-shelf loss functions: https://www.tensorflow.org/api_docs/python/tf/losses

Best of luck!

Base sequences at current time

Resets base sequence to starting sequence each time rather than adding previous base changes to overall sequence

Current structure and energy lists not of same length as base sequence

[[sequence],[current structure],[target structure],[energy]]

Sequence and target are of length 85 for pid 6892348, but current structure and energy are of length 84 (only occurs for datapoints above 431)
Most likely due to bug in structure_and_energy_at_current_time

Encoded sequences not of same length

Get data by puzzle ID and by player experience

read player experience data file for single-state puzzles
set a minimum threshold for experience
select players above threshold
feed uid into function to read movesets from a specific set of puzzle IDs and user IDs
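The steps above can be sketched roughly as follows. This is only an illustration: it assumes the experience file has already been read into a dict mapping user ID to an experience value, and that each moveset is a record with `pid` and `uid` keys. Both function names and the record layout are hypothetical, not taken from the repository.

```python
def select_experienced_players(experience_by_uid, min_experience):
    """Return the set of user IDs whose experience meets the threshold."""
    return {uid for uid, exp in experience_by_uid.items() if exp >= min_experience}


def filter_movesets(movesets, puzzle_ids, user_ids):
    """Keep only movesets from the given puzzles and the given players."""
    puzzle_ids, user_ids = set(puzzle_ids), set(user_ids)
    return [m for m in movesets if m["pid"] in puzzle_ids and m["uid"] in user_ids]


# Toy data standing in for the real experience and moveset files (hypothetical values).
experience = {101: 5, 102: 250, 103: 1200}
movesets = [
    {"pid": 6892348, "uid": 101, "moves": []},
    {"pid": 6892348, "uid": 103, "moves": []},
    {"pid": 7254756, "uid": 102, "moves": []},
]
experts = select_experienced_players(experience, min_experience=200)
selected = filter_movesets(movesets, [6892348], experts)
```

Keeping the threshold step separate from the moveset filter lets the same set of user IDs be reused across different puzzle ID lists.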

Memory Error

Python running out of memory when clustering movesets (not on reduced dimension data)
8GB
Too many features for clustering algorithm
6892344, 6892345, 7254756, 7254759

Neural Net Predictions

Currently predicting only base

Options:

Predict location using sequence, structures, and energy, then feed location prediction into separate neural net along with sequence, structures, and energy to predict base
Predict location and base together, in a num_locations x num_bases matrix (or just 85 x 4)
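For the second option, decoding a joint prediction might look like the sketch below. It assumes the network output is available as a nested list of scores with one row per location and one column per base, and that the columns are ordered A, U, G, C. Both the column order and the function name are assumptions made for illustration.

```python
BASES = "AUGC"  # assumed column order; adjust to match the real encoding


def decode_joint_prediction(scores):
    """Return (location, base) for the largest entry of a locations x bases matrix."""
    best_loc, best_base = max(
        ((loc, b) for loc in range(len(scores)) for b in range(len(scores[loc]))),
        key=lambda lb: scores[lb[0]][lb[1]],
    )
    return best_loc, BASES[best_base]


# Toy 3 x 4 score matrix; a real one would be 85 x 4 for an 85-base puzzle.
scores = [
    [0.01, 0.02, 0.03, 0.04],
    [0.05, 0.40, 0.10, 0.05],
    [0.10, 0.05, 0.10, 0.05],
]
location, base = decode_joint_prediction(scores)
```

The returned location is 0-indexed here, so it would need shifting by one if locations are stored 1-indexed.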

Keras model has NaNs in losses

After a certain number of epochs, the Keras model's loss becomes NaN and accuracy drops to 0.2698 (roughly 1/4)
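One common source of NaN losses is taking the log of a predicted probability that has collapsed to exactly 0; whether that is the cause here has not been confirmed. The sketch below shows the usual mitigation, clipping probabilities before the log, in plain Python rather than Keras. The function name and the `eps` value are illustrative choices.

```python
import math


def cross_entropy(probs, one_hot_label, eps=1e-7):
    """Cross-entropy with probabilities clipped into [eps, 1 - eps]."""
    total = 0.0
    for p, y in zip(probs, one_hot_label):
        p = min(max(p, eps), 1.0 - eps)  # keeps log() away from 0
        total -= y * math.log(p)
    return total


# The true class was assigned probability 0; the clipped loss stays finite.
loss = cross_entropy([0.0, 1.0, 0.0, 0.0], [1, 0, 0, 0])
```

If clipping does not help, lowering the learning rate or clipping gradients are other things worth trying.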

Use BEAR notation

CNN

Use BEAR notation as additional structural feature

SAP

For comparisons, use BEAR instead of dot-bracket notation

Incorrect shape size for NN input data

Tensorflow

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction,labels=y))

Logits and labels not of same size.

TFLearn

net = tflearn.input_data(shape=[None, 4716])

Shape size is incorrect for input matrix.

Add domain-specific fixes

Things DSP should do after CNN and base structure selector:

Can look at EteRNAbot rules for more strategies.

Correct base pairings
Make all bases in loop A's
Change end pairs to G-C pairs
- for i in pairmap: get index of last pair before -1, get index of paired base, and change bases at those indices to G-C

Boosting
- Hairpins - boost with G
  - if num_unpaired_bases_in_a_row >= 3 then boost with G
- Internal - boost with opposite G's
  - if number of unpaired on either side is within +-3, then put Gs on either side
- Single-bond stack - G-G in the loop
  - for every paired base: if '(' has one '.' following it and its complementary ')' has one '.' preceding it, then G-G boost
- Bulges
  - http://eternagame.wikia.com/wiki/Zigzag
- U-G-U-G superboost
  - for every paired base: if '(' has two '.'s following it and its complementary ')' has two '.'s preceding it, then UGUG boost
- http://eternagame.wikia.com/wiki/Boosting
If bases aren't pairing correctly:
- Flip orientation of base pairs nearby
- Change pairs to G-C pairs
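A minimal sketch of two of the rules above (loop bases to A, end pairs to G-C) is shown below. It interprets "end pair" as a pair whose opening base is immediately followed by an unpaired base, which is an assumption. The function names are hypothetical, and the boosting rules are not covered.

```python
def pairmap_from_dotbracket(structure):
    """Return a list where entry i is the index paired with i, or -1 if unpaired."""
    pairmap = [-1] * len(structure)
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairmap[i], pairmap[j] = j, i
    return pairmap


def apply_basic_rules(sequence, structure):
    """Set unpaired bases to A and pairs that close a loop to G-C."""
    seq = list(sequence)
    pairmap = pairmap_from_dotbracket(structure)
    for i, partner in enumerate(pairmap):
        if partner == -1:
            seq[i] = "A"
        elif i < partner and i + 1 < len(seq) and pairmap[i + 1] == -1:
            # Opening base of the last pair before an unpaired stretch.
            seq[i], seq[partner] = "G", "C"
    return "".join(seq)


result = apply_basic_rules("UUUUUUUUUU", "(((....)))")
```

Only the pair closing the hairpin is changed in this example; the two outer pairs keep their original bases.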

Add SmartPredictor

Checks if base prediction already matches existing base in that location.
Example:

base_sequence = [1,1,1,1] # 1 = base A
Prediction = [1630.97,1630.88,1630.56,1630.30]
argmax(Prediction) == 0 # index 0 -> base 1 (A)
location = 3
if base_sequence[location] == argmax(Prediction) + 1: # compare 1-based base codes
   SmartPredictor() # takes 2nd highest probability and uses that as base
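A runnable sketch of this idea is given below. It assumes bases are coded 1 to 4, so that prediction index 0 corresponds to base 1 (A), and it generalizes "take the 2nd highest" to "take the highest-scoring base that differs from the current one". The function name `smart_predict` is hypothetical.

```python
def smart_predict(base_sequence, location, prediction):
    """Return the highest-scoring base (1-4) that differs from the current base."""
    ranked = sorted(range(len(prediction)), key=lambda i: prediction[i], reverse=True)
    for index in ranked:
        base = index + 1  # assumed encoding: index 0 -> base 1 (A)
        if base != base_sequence[location]:
            return base
    return base_sequence[location]  # only reached if there are no alternatives


base_sequence = [1, 1, 1, 1]
prediction = [1630.97, 1630.88, 1630.56, 1630.30]
new_base = smart_predict(base_sequence, 3, prediction)
```

Here the top prediction (base 1) matches the base already at location 3, so the second-ranked base is returned instead.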

Fix base mutations

Problems

Base mutations are only to 4 (C)
Previous mutated sequence carries over to next iteration

Example:

Should be:

# this is for only 1 randomly generated sequence
[4 2 4 4 2 2 3 1 2]
[1 2 4 4 2 2 3 1 2]  [2 2 4 4 2 2 3 1 2]  [3 2 4 4 2 2 3 1 2]  [4 2 4 4 2 2 3 1 2]
# calculate reward
[4 2 4 4 2 2 3 1 2]
[4 1 4 4 2 2 3 1 2]  [4 2 4 4 2 2 3 1 2]  [4 3 4 4 2 2 3 1 2]  [4 4 4 4 2 2 3 1 2]
# calculate reward

Currently is:

[4 2 4 4 2 2 3 1 2]
[4 2 4 4 2 2 3 1 2]  [4 2 4 4 2 2 3 1 2]  [4 2 4 4 2 2 3 1 2]  [4 2 4 4 2 2 3 1 2]
# calculate reward
[4 2 4 4 2 2 3 1 2]
[4 4 4 4 2 2 3 1 2]  [4 4 4 4 2 2 3 1 2]  [4 4 4 4 2 2 3 1 2]  [4 4 4 4 2 2 3 1 2]
# calculate reward

Eventually, the entire list becomes 4s.
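The expected behaviour can be sketched as below. A plausible cause of the carry-over, though not confirmed here, is that the same list object is mutated in place on every iteration; the sketch avoids that by copying the sequence for each candidate. The function name is hypothetical.

```python
def single_base_mutations(sequence, location, bases=(1, 2, 3, 4)):
    """Return one mutated copy of `sequence` per candidate base at `location`."""
    candidates = []
    for base in bases:
        mutated = list(sequence)  # fresh copy, so earlier mutations do not carry over
        mutated[location] = base
        candidates.append(mutated)
    return candidates


sequence = [4, 2, 4, 4, 2, 2, 3, 1, 2]
first = single_base_mutations(sequence, 0)
second = single_base_mutations(sequence, 1)
```

The original `sequence` is left untouched, so each location starts from the same base sequence when rewards are calculated.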

TensorBoard Scalars not loading

Use Vienna 1.8.5 for folding

Eterna uses Vienna1 for calculating folding; EternaBrain is currently using Vienna2

TensorFlow 2 Compatibility

Are the training / validation / test datasets publicly available?

I was wondering whether the training, validation, and test datasets for EternaBrain are publicly available. In particular, I am looking for a dataset of RNA sequences and contact maps. I don't really need the specific player moves.

Thank you for your help, and sorry if this is off topic.

Encode locked bases

Some puzzles have bases which cannot be mutated - this feature might need to be encoded

Not changing certain end pairs to G-C

Where the next consecutive base is paired, the DSP does not change the last base pair in a stack to G-C.

More metrics for neural net

Precision
Recall
F1 score
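For reference, these three metrics can be computed per class from true and predicted base labels as in the sketch below. It is written in plain Python for clarity; the function name and the toy labels are illustrative only.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels using the 1-4 base codes.
y_true = [1, 1, 2, 3, 4, 4]
y_pred = [1, 2, 2, 3, 4, 1]
precision, recall, f1 = per_class_metrics(y_true, y_pred, label=1)
```

Averaging the per-class values over all four bases gives macro-averaged scores, which are less forgiving than accuracy when one base dominates the predictions.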

Encode base sequence, locks, structures, pairmaps as one-hot

Puzzle Info

Sequence: GGGAUAACCU Structure: (((....))) Locks: oooxxxxooo

Base sequence

[0,0,1,0],[0,0,1,0],[0,0,1,0],[1,0,0,0],[0,1,0,0],[1,0,0,0],[1,0,0,0],[0,0,0,1],[0,0,0,1],[0,1,0,0]

Structure

[0,1,0,0],[0,1,0,0],[0,1,0,0],[1,0,0,0],[1,0,0,0],[1,0,0,0],[1,0,0,0],[0,0,1,0],[0,0,1,0],[0,0,1,0]

Pairmap

[9,0,0,0],[8,0,0,0],[7,0,0,0],[-1,0,0,0],[-1,0,0,0],[-1,0,0,0],[-1,0,0,0],[2,0,0,0],[1,0,0,0],[0,0,0,0]

Locks

[0,0,0,0],[0,0,0,0],[0,0,0,0],[1,0,0,0],[1,0,0,0],[1,0,0,0],[1,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]
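An encoder along these lines might look like the sketch below. It assumes a column order of A, U, G, C for bases and '.', '(', ')' for structure, with every row padded to width 4. Those orders and the function names are assumptions, and locks are left out because it is not stated which of `o` and `x` marks a locked base.

```python
BASE_ORDER = "AUGC"      # assumed: A=[1,0,0,0], U=[0,1,0,0], G=[0,0,1,0], C=[0,0,0,1]
STRUCTURE_ORDER = ".()"  # assumed: '.'=[1,0,0,0], '('=[0,1,0,0], ')'=[0,0,1,0]


def one_hot(symbols, order, width=4):
    """One-hot encode each symbol by its position in `order`, padded to `width`."""
    rows = []
    for s in symbols:
        row = [0] * width
        row[order.index(s)] = 1
        rows.append(row)
    return rows


def pairmap_rows(structure, width=4):
    """Put each base's partner index (or -1) in column 0 and pad with zeros."""
    partners = [-1] * len(structure)
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partners[i], partners[j] = j, i
    return [[p] + [0] * (width - 1) for p in partners]


bases = one_hot("GGGAUAACCU", BASE_ORDER)
structure = one_hot("(((....)))", STRUCTURE_ORDER)
pairmap = pairmap_rows("(((....)))")
```

Padding everything to the same width lets the four feature rows be stacked into a single array per puzzle state.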

X and y not of same length

Actual length should be 29971
Features of 6892348 - length 30092
Labels of 6892348 - length 29830
Difference of 262

Malformed String

Occurring only on puzzles 6892343, 7254758

Update encode_movesets

Needs to work for moves with multiple base changes in one move
Figure out a way to encode number of moves needed to complete the puzzle

Run TensorFlow models on multiple GPUs

Add target energy to features

Add target energy to list of features

Use ViennaRNA to get energy given target structure of puzzle

Data too high-dimensional for clustering

The encoded moveset data has 4 dimensions, which is too many for sklearn clustering models
Need to find a way to reduce dimensions/PCA before fitting data to a model

Use pickle to save GMM instead of running each time
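A small load-or-fit helper would cover this. The sketch below uses only the standard library; `load_or_fit` and its arguments are hypothetical names, and `fit_fn` stands for whatever call actually fits the GMM.

```python
import os
import pickle


def load_or_fit(path, fit_fn):
    """Load a pickled model from `path` if it exists; otherwise fit and save it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    model = fit_fn()
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model


# Example call, assuming scikit-learn and a feature matrix X are available:
# gmm = load_or_fit("gmm.pkl", lambda: GaussianMixture(n_components=4).fit(X))
```

The cached file has to be deleted by hand whenever the training data or model settings change, since the helper does not detect stale pickles.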

PCA before or after Clustering

Encode location

Encode location similar to bases

[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0] # location 10; number of bases = 15
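A helper producing this encoding could be as simple as the sketch below. It treats locations as 1-indexed, matching the example above; the function name is hypothetical.

```python
def encode_location(location, num_bases):
    """One-hot encode a 1-indexed location over `num_bases` positions."""
    if not 1 <= location <= num_bases:
        raise ValueError("location out of range")
    vec = [0] * num_bases
    vec[location - 1] = 1
    return vec


encoded = encode_location(10, 15)
```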

Add oligo information to training data

For multi-state puzzles, the features should look like this:

[sequence, current_structure, target_structure, current_energy, 
target_energy, reporter, A, B, C] # A, B, C are other oligos

Expedite EternaBrain benchmarking

Instead of copying and pasting structures from Eterna website:

Copy and paste all structures into a .txt file
Read in structures and attempt to solve each one
Record number of puzzles solved
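The loop described above might be sketched as follows. It assumes one dot-bracket structure per line in the text file and takes the solver as a callable that returns True on success. Both `benchmark` and `solve` are hypothetical names, not existing EternaBrain functions.

```python
def benchmark(structures_path, solve):
    """Try `solve` on every structure in the file; return (solved, total)."""
    with open(structures_path) as f:
        structures = [line.strip() for line in f if line.strip()]  # skip blank lines
    solved = sum(1 for s in structures if solve(s))
    return solved, len(structures)
```

Returning the total alongside the solved count makes it easy to report a solve rate across different structure files.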

Add multi-GPU support for CNN

On 1 GPU, limited to 10 convolutional layers
When parallelized across multiple GPUs, can add more layers as more memory is available

Make structure_and_energy function work with multiple puzzles

Currently structure_and_energy_at_current_time works only with 1 puzzle ID. Would reduce the number of pickles and the amount of time unpickling when training.

TensorFlow accuracy

Change learning rate (below 0.0001)
Change dropout rate (below 0.5)
Increase number of epochs
Change number of layers
Change number of nodes

Puzzles to look at first

Hi Rohan,

Here are the getting started puzzles I was talking about: [6892343, 6892344, 6892345, 6892346, 6892347, 6892348, 7254756, 7254757, 7254758, 7254759, 7254760, 7254761]

I uploaded the problems file to the dropbox folder - github didn't like its size when I tried to commit it directly. User data will be there too. Hope everything went well with the SAT tests!

Get current structure and free energy from Eterna/Vienna API

Currently using Vienna web interface
Will train faster when using Eterna or Vienna locally
