eternabrain's Issues

bioRxiv paper comments

Came across the recent bioRxiv posting of this work - awesome stuff! I should probably comment through the official Disqus comment thread but whatever.

It was interesting to note how you honed down the input data and implemented a couple of regularization strategies to improve the accuracy of the tandem-CNN model. I'm curious if you have messed around with different loss functions? Though the probability vectors at hand that are being predicted are short, some probability theory can still be applied - I think. See this for some insight on choosing a loss function. If you know, or can reason out, the distribution of the noise, a tailored loss function may improve the accuracy.

Some other off the shelf loss functions: https://www.tensorflow.org/api_docs/python/tf/losses.

Best of luck!
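
To illustrate the point about matching the loss to the noise distribution, here is a minimal sketch (not taken from the paper). It assumes additive noise on the targets: Gaussian noise leads to a squared-error loss, Laplace noise to an absolute-error loss. The function names and the toy vectors are hypothetical.

```python
import numpy as np


def gaussian_nll(y_true, y_pred, sigma=1.0):
    """Mean negative log-likelihood assuming Gaussian noise with std sigma."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(0.5 * np.log(2 * np.pi * sigma**2) + r**2 / (2 * sigma**2)))


def laplace_nll(y_true, y_pred, b=1.0):
    """Mean negative log-likelihood assuming Laplace noise with scale b."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.log(2 * b) + np.abs(r) / b))


# Toy probability vectors (made up for illustration).
y_true = np.array([0.1, 0.7, 0.1, 0.1])
y_pred = np.array([0.2, 0.5, 0.2, 0.1])

mse = float(np.mean((y_true - y_pred) ** 2))
mae = float(np.mean(np.abs(y_true - y_pred)))
```

Up to additive constants, the Gaussian case is the mean squared error scaled by 1/(2*sigma^2) and the Laplace case is the mean absolute error scaled by 1/b, so the choice of noise model decides which off-the-shelf loss is the natural one.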

Get data by puzzle ID and by player experience

  1. read player experience data file for single-state puzzles
  2. set a minimum threshold for experience
  3. select players above threshold
  4. feed uid into function to read movesets from a specific set of puzzle IDs and user IDs
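
The four steps above could be sketched roughly as follows. The in-memory dictionaries stand in for the real experience and moveset files, whose format is not shown here; the function names, the `pid`/`uid`/`moves` keys, and the choice of `>=` for the threshold are all assumptions.

```python
def select_experienced_players(experience_by_uid, min_experience):
    """Return sorted user IDs whose experience is at or above the threshold."""
    return sorted(
        uid for uid, exp in experience_by_uid.items() if exp >= min_experience
    )


def filter_movesets(movesets, puzzle_ids, user_ids):
    """Keep movesets whose puzzle ID and user ID are both in the requested sets."""
    puzzle_ids, user_ids = set(puzzle_ids), set(user_ids)
    return [m for m in movesets if m["pid"] in puzzle_ids and m["uid"] in user_ids]


# Toy data (hypothetical) standing in for the experience and moveset files.
experience = {101: 5, 102: 250, 103: 1200}
movesets = [
    {"pid": 6892344, "uid": 101, "moves": [1, 2]},
    {"pid": 6892344, "uid": 102, "moves": [3]},
    {"pid": 7254756, "uid": 103, "moves": [4, 5]},
    {"pid": 1111111, "uid": 103, "moves": [6]},
]

experienced = select_experienced_players(experience, min_experience=200)
selected = filter_movesets(movesets, [6892344, 7254756], experienced)
```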

Memory Error

  • Python running out of memory when clustering movesets (not on reduced dimension data)
  • 8GB
  • Too many features for clustering algorithm
  • 6892344, 6892345, 7254756, 7254759

Neural Net Predictions

Currently predicting only base

Options:

  • Predict location using sequence, structures, and energy, then feed location prediction into separate neural net along with sequence, structures, and energy to predict base
  • Predict location and base together, in a num_locations x num_bases matrix (or just 85 x 4)
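
For the second option, a joint output could be decoded along these lines. This is a sketch only: the base ordering in `BASES` and the helper name are assumptions, and the score matrix is made up.

```python
import numpy as np

NUM_LOCATIONS, NUM_BASES = 85, 4
BASES = "AUGC"  # assumed ordering of the four base columns


def decode_joint_prediction(scores):
    """Return (location, base) indices of the highest score in a 2-D matrix."""
    scores = np.asarray(scores)
    location, base = np.unravel_index(np.argmax(scores), scores.shape)
    return int(location), int(base)


# Toy score matrix with a single clear winner.
scores = np.zeros((NUM_LOCATIONS, NUM_BASES))
scores[12, 2] = 0.9
location, base = decode_joint_prediction(scores)
```

A single argmax over the whole matrix picks the location and the base together, which is what distinguishes this option from the two-network pipeline.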

Use BEAR notation

CNN

Use BEAR notation as additional structural feature

SAP

For comparisons, use BEAR instead of dot-bracket notation

Incorrect shape size for NN input data

Tensorflow

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction,labels=y))

Logits and labels not of same size.
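
One way to catch this early is to check the two shapes before computing the loss. The sketch below is a NumPy stand-in for the cross-entropy calculation, not the TensorFlow op itself; the function name is hypothetical.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch; raises if shapes differ."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    if logits.shape != labels.shape:
        raise ValueError(
            f"logits {logits.shape} and labels {labels.shape} must have the same shape"
        )
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(labels * log_probs).sum(axis=1).mean())


# Uniform logits over 4 classes with a one-hot label.
loss = softmax_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0, 0.0]])
```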

TFLearn

net = tflearn.input_data(shape=[None, 4716])

Shape size is incorrect for input matrix.

Add domain-specific fixes

Things DSP should do after CNN and base structure selector:

Can look at EteRNAbot rules for more strategies.

  • Correct base pairings
  • Make all bases in loop A's
  • Change end pairs to G-C pairs
    • for i in pairmap: get index of last pair before -1, get index of paired base, and change bases at those indices to G-C
  • Boosting
    • Hairpins - boost with G
      • if num_unpaired_bases_in_a_row >= 3 then boost with G
    • Internal - boost with opposite G's
      • if number of unpaired bases on either side is within ±3, then put Gs on either side
    • Single-bond stack - G-G in the loop
      • for every paired base: if '(' has one '.' following it and its complementary ')' has one '.' preceding it, then G-G boost
    • U-G-U-G superboost
      • for every paired base: if '(' has two '.'s following it and its complementary ')' has two '.'s preceding it, then UGUG boost
  • If bases aren't pairing correctly:
    • Flip orientation of base pairs nearby
    • Change pairs to G-C pairs
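
Two of the rules above (A's in loops, G-C for the last pair before an unpaired run) might look roughly like this. This is one interpretation of the bullets, not the project's implementation; the function names are hypothetical, and the pairmap uses 0-based indices with -1 for unpaired bases.

```python
def dotbracket_to_pairmap(structure):
    """Map each index to its paired index, or -1 if the base is unpaired."""
    stack, pairmap = [], [-1] * len(structure)
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairmap[i], pairmap[j] = j, i
    return pairmap


def apply_basic_fixes(sequence, structure):
    """Set unpaired bases to A and the pair just before a loop to G-C."""
    seq = list(sequence)
    pairmap = dotbracket_to_pairmap(structure)
    for i, partner in enumerate(pairmap):
        if partner == -1:
            seq[i] = "A"  # loop base
        elif i < partner and i + 1 < len(seq) and pairmap[i + 1] == -1:
            seq[i], seq[partner] = "G", "C"  # last pair before the loop
    return "".join(seq)


pairmap = dotbracket_to_pairmap("(((....)))")
fixed = apply_basic_fixes("UUUUUUUUUU", "(((....)))")
```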

Add SmartPredictor

Checks if base prediction already matches existing base in that location.
Example:

base_sequence = [1,1,1,1] # bases encoded 1-4; 1 is A
Prediction = [1630.97,1630.88,1630.56,1630.30]
argmax(Prediction) == 0 # index 0 corresponds to base 1 (A)
location = 3
if base_sequence[location] == argmax(Prediction) + 1: # shift 0-based index to 1-based base code
   SmartPredictor() # takes 2nd highest probability and uses that as base
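
A possible shape for the fallback described above, as a sketch. It assumes bases are encoded 1-4 and that index 0 of the prediction vector corresponds to base 1 (A); `smart_predict` is a hypothetical name.

```python
import numpy as np


def smart_predict(prediction, current_base):
    """Return a 1-based base code, using the runner-up if the top choice is already in place."""
    ranked = np.argsort(prediction)[::-1]  # indices from highest to lowest score
    best = int(ranked[0]) + 1
    if best != current_base:
        return best
    return int(ranked[1]) + 1


base_sequence = [1, 1, 1, 1]
prediction = [1630.97, 1630.88, 1630.56, 1630.30]
location = 3
new_base = smart_predict(prediction, base_sequence[location])
```

Here the top prediction is base 1, which already sits at the chosen location, so the second-ranked base is returned instead.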

Fix base mutations

Problems

  1. Base mutations are only to 4 (C)
  2. Previous mutated sequence carries over to next iteration

Example:

Should be:

# this is for only 1 randomly generated sequence
[4 2 4 4 2 2 3 1 2]
[1 2 4 4 2 2 3 1 2]  [2 2 4 4 2 2 3 1 2]  [3 2 4 4 2 2 3 1 2]  [4 2 4 4 2 2 3 1 2]
# calculate reward
[4 2 4 4 2 2 3 1 2]
[4 1 4 4 2 2 3 1 2]  [4 2 4 4 2 2 3 1 2]  [4 3 4 4 2 2 3 1 2]  [4 4 4 4 2 2 3 1 2]
# calculate reward

Currently is:

[4 2 4 4 2 2 3 1 2]
[4 2 4 4 2 2 3 1 2]  [4 2 4 4 2 2 3 1 2]  [4 2 4 4 2 2 3 1 2]  [4 2 4 4 2 2 3 1 2]
# calculate reward
[4 2 4 4 2 2 3 1 2]
[4 4 4 4 2 2 3 1 2]  [4 4 4 4 2 2 3 1 2]  [4 4 4 4 2 2 3 1 2]  [4 4 4 4 2 2 3 1 2]
# calculate reward

Eventually, the entire list becomes 4s.
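
A sketch of the intended behavior, assuming the carry-over comes from mutating one shared list in place (the actual cause is not confirmed in this issue). Each candidate is built from a fresh copy and gets a different base; `candidate_mutations` is a hypothetical name.

```python
def candidate_mutations(sequence, position, bases=(1, 2, 3, 4)):
    """Return one copy of the sequence per base, mutated only at `position`."""
    candidates = []
    for base in bases:
        mutated = list(sequence)  # fresh copy so the original is never modified
        mutated[position] = base
        candidates.append(mutated)
    return candidates


sequence = [4, 2, 4, 4, 2, 2, 3, 1, 2]
first = candidate_mutations(sequence, 0)
second = candidate_mutations(sequence, 1)
```

Because every candidate starts from a copy, the sequence passed in is unchanged after each call and nothing leaks into the next iteration.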

Are the training / validation / test datasets publicly available?

I was wondering if the training, validation, and test datasets for EternaBrain were publicly available? In particular, I am looking for a dataset of RNA sequences and contact maps. I don't really need the specific player moves.

Thank you for your help, and sorry if this is off topic.

Encode locked bases

Some puzzles have bases which cannot be mutated - this feature might need to be encoded

Encode base sequence, locks, structures, pairmaps as one-hot

Puzzle Info

Sequence: GGGAUAACCU Structure: (((....))) Locks: oooxxxxooo

Base sequence

[0,0,1,0],[0,0,1,0],[0,0,1,0],[1,0,0,0],[0,1,0,0],[1,0,0,0],[1,0,0,0],[0,0,0,1],[0,0,0,1],[0,1,0,0]

Structure

[0,1,0,0],[0,1,0,0],[0,1,0,0],[1,0,0,0],[1,0,0,0],[1,0,0,0],[1,0,0,0],[0,0,1,0],[0,0,1,0],[0,0,1,0]

Pairmap

[9,0,0,0],[8,0,0,0],[7,0,0,0],[-1,0,0,0],[-1,0,0,0],[-1,0,0,0],[-1,0,0,0],[2,0,0,0],[1,0,0,0],[0,0,0,0]

Locks

[0,0,0,0],[0,0,0,0],[0,0,0,0],[1,0,0,0],[0,1,0,0],[1,0,0,0],[1,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]
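
The encodings above could be produced by something like the following sketch. The index mappings (A, U, G, C and '.', '(', ')') are inferred from the example rows, and the lock encoding assumes a locked position carries the one-hot of its base while unlocked positions are all zeros; the function names are hypothetical.

```python
BASE_INDEX = {"A": 0, "U": 1, "G": 2, "C": 3}
STRUCT_INDEX = {".": 0, "(": 1, ")": 2}


def one_hot(index, width=4):
    """Return a list of zeros of length `width` with a 1 at `index`."""
    vec = [0] * width
    vec[index] = 1
    return vec


def encode_sequence(sequence):
    return [one_hot(BASE_INDEX[b]) for b in sequence]


def encode_structure(structure):
    return [one_hot(STRUCT_INDEX[c]) for c in structure]


def encode_locks(sequence, locks):
    """Locked ('x') positions get the base one-hot; unlocked ('o') get zeros."""
    return [
        one_hot(BASE_INDEX[b]) if lock == "x" else [0, 0, 0, 0]
        for b, lock in zip(sequence, locks)
    ]


seq_enc = encode_sequence("GGGAUAACCU")
struct_enc = encode_structure("(((....)))")
lock_enc = encode_locks("GGGAUAACCU", "oooxxxxooo")
```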

X and y not of same length

  • Actual length should be 29971
  • Features of 6892348 - length 30092
  • Labels of 6892348 - length 29830
  • Difference of 262

Update encode_movesets

  • Needs to work for moves with multiple base changes in one move
  • Figure out a way to encode number of moves needed to complete the puzzle

Data too high-dimensional for clustering

  • The encoded moveset data is a 4-dimensional array, but sklearn clustering models expect 2-D input (n_samples, n_features)
  • Need to find a way to reduce dimensions/PCA before fitting data to a model
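
A minimal sketch of the flatten-then-PCA idea, using NumPy's SVD as a stand-in for `sklearn.decomposition.PCA`. The array shape and the function name are made up for illustration.

```python
import numpy as np


def flatten_and_reduce(data, n_components):
    """Flatten each sample to 1-D, then project onto the top principal components."""
    data = np.asarray(data, dtype=float)
    flat = data.reshape(data.shape[0], -1)  # (n_samples, n_features)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Toy 4-D array standing in for the encoded movesets.
rng = np.random.default_rng(0)
encoded = rng.random((20, 3, 10, 4))
reduced = flatten_and_reduce(encoded, n_components=5)
```

The reduced array is 2-D, so it can be passed to a clustering model, and the smaller feature count should also ease memory pressure.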

Encode location

Encode location similar to bases

[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0] # location 10; number of bases = 15
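
A small sketch of this encoding, assuming locations are counted from 1 as in the example; `encode_location` is a hypothetical name.

```python
def encode_location(location, num_bases):
    """One-hot encode a 1-based location in a puzzle with `num_bases` bases."""
    if not 1 <= location <= num_bases:
        raise ValueError(f"location {location} is outside 1..{num_bases}")
    vec = [0] * num_bases
    vec[location - 1] = 1
    return vec


encoded_location = encode_location(10, 15)
```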

Add oligo information to training data

For multi-state puzzles, the features should look like this:

[sequence, current_structure, target_structure, current_energy, 
target_energy, reporter, A, B, C] # A, B, C are other oligos

Expedite EternaBrain benchmarking

Instead of copying and pasting structures from Eterna website:

  • Copy and paste all structures into a .txt file
  • Read in structures and attempt to solve each one
  • Record number of puzzles solved
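
The workflow above might be sketched like this. The file layout (one structure per line) and the `solve` callable are assumptions; the placeholder solver below only exists so the example runs and says nothing about how EternaBrain solves a puzzle.

```python
import tempfile
from pathlib import Path


def read_structures(path):
    """Read one dot-bracket structure per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def benchmark(structures, solve):
    """Return (number solved, total), where solve(structure) returns True on success."""
    solved = [s for s in structures if solve(s)]
    return len(solved), len(structures)


# Demo with a temporary file and a placeholder solver (hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "structures.txt"
    path.write_text("(((....)))\n\n((..))\n....\n")
    structures = read_structures(path)


def placeholder_solver(structure):
    return "(" in structure


num_solved, total = benchmark(structures, placeholder_solver)
```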

Add multi-GPU support for CNN

  • On 1 GPU, limited to 10 convolutional layers
  • When parallelized across multiple GPUs, can add more layers as more memory available

TensorFlow accuracy

  • Change learning rate (below 0.0001)
  • Change dropout rate (below 0.5)
  • Increase number of epochs
  • Change number of layers
  • Change number of nodes

Puzzles to look at first

Hi Rohan,

Here are the getting started puzzles I was talking about: [6892343, 6892344, 6892345, 6892346, 6892347, 6892348, 7254756, 7254757, 7254758, 7254759, 7254760, 7254761]

I uploaded the problems file to the dropbox folder - github didn't like its size when I tried to commit it directly. User data will also be there. Hope everything went well with the SAT tests!
