unsupervised-features-learning-for-binary-similarity's Issues

Question

I can't create my own dataset; something goes wrong at `cfg = json.loads(self.r2.cmd('agfj ' + str(func[s])))`. What can I do to make it execute successfully?
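For reference, radare2's `agfj` prints an empty string when the target function has not been analyzed (or the address is wrong), and `json.loads('')` then raises `JSONDecodeError`. A minimal sketch of a guard, assuming `self.r2` and `func[s]` are the r2pipe handle and function address from the snippet above:

```python
import json

def parse_agfj(out):
    """Parse the JSON printed by radare2's `agfj` command.

    `agfj` prints nothing for unanalyzed functions, which makes a bare
    json.loads raise JSONDecodeError. Return None in that case instead
    of crashing.
    """
    out = out.strip()
    if not out:
        return None  # function not analyzed, or bad address
    return json.loads(out)

# Hypothetical call site, mirroring the snippet in the question:
#   self.r2.cmd("aaa")  # make sure analysis ran before asking for a CFG
#   cfg = parse_agfj(self.r2.cmd("agfj " + str(func[s])))
```

Running full analysis (`aaa`) before querying `agfj`, and putting a space between the command and the address, are the usual fixes when the command returns nothing.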

PairFactory.py

    while i < number_of_pairs:
        if chunk * int(number_of_pairs / 2) + i > data_len:
            break

        # True pair: fetch both functions' features and label the pair +1.
        p = true_pairs_id[chunk * int(number_of_pairs / 2) + i]
        q0 = cur.execute("SELECT " + self.feature_type + " FROM " + self.feature_type + " WHERE id=?", (p[0],))
        if self.feature_type == 'acfg':
            adj0, node0, lenghts0 = self.get_data_from_acfg(json_graph.adjacency_graph(json.loads(q0.fetchone()[0])))
        elif self.feature_type == 'lstm_cfg':
            adj0, node0, lenghts0 = self.get_data_from_cfg(json_graph.adjacency_graph(json.loads(q0.fetchone()[0])))

        q1 = cur.execute("SELECT " + self.feature_type + " FROM " + self.feature_type + " WHERE id=?", (p[1],))
        if self.feature_type == 'acfg':
            # note: the original read q0.fetchone() here -- a copy-paste bug
            adj1, node1, lenghts1 = self.get_data_from_acfg(json_graph.adjacency_graph(json.loads(q1.fetchone()[0])))
        elif self.feature_type == 'lstm_cfg':
            adj1, node1, lenghts1 = self.get_data_from_cfg(json_graph.adjacency_graph(json.loads(q1.fetchone()[0])))

        pairs.append(((adj0, node0), (adj1, node1)))
        lenghts.append([lenghts0, lenghts1])
        labels.append(+1)

        # False pair: same lookups against false_pairs_id, labelled -1.
        p = false_pairs_id[chunk * int(number_of_pairs / 2) + i]
        q0 = cur.execute("SELECT " + self.feature_type + " FROM " + self.feature_type + " WHERE id=?", (p[0],))
        if self.feature_type == 'acfg':
            adj0, node0, lenghts0 = self.get_data_from_acfg(json_graph.adjacency_graph(json.loads(q0.fetchone()[0])))
        elif self.feature_type == 'lstm_cfg':
            adj0, node0, lenghts0 = self.get_data_from_cfg(json_graph.adjacency_graph(json.loads(q0.fetchone()[0])))

        q1 = cur.execute("SELECT " + self.feature_type + " FROM " + self.feature_type + " WHERE id=?", (p[1],))
        if self.feature_type == 'acfg':
            # note: the original read q0.fetchone() here as well
            adj1, node1, lenghts1 = self.get_data_from_acfg(json_graph.adjacency_graph(json.loads(q1.fetchone()[0])))
        elif self.feature_type == 'lstm_cfg':
            adj1, node1, lenghts1 = self.get_data_from_cfg(json_graph.adjacency_graph(json.loads(q1.fetchone()[0])))

        pairs.append(((adj0, node0), (adj1, node1)))
        lenghts.append([lenghts0, lenghts1])
        labels.append(-1)

        i += 2
        
Since `i` increases by 2 each iteration, shouldn't the index use `int(i/2)`? Otherwise every other pair id is skipped:

    p = true_pairs_id[chunk * int(number_of_pairs/2) + i]

should become

    p = true_pairs_id[chunk * int(number_of_pairs/2) + int(i/2)]
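A standalone sketch of the indexing (with a hypothetical small `number_of_pairs`, not the real dataset) shows the difference:

```python
number_of_pairs = 8   # hypothetical; the real value comes from PairFactory
base = 0              # chunk * int(number_of_pairs / 2) with chunk = 0

with_i, with_half_i = [], []
i = 0
while i < number_of_pairs:
    with_i.append(base + i)                # index used by the current code
    with_half_i.append(base + int(i / 2))  # index after the proposed fix
    i += 2

print(with_i)       # [0, 2, 4, 6] -- odd ids are never visited
print(with_half_i)  # [0, 1, 2, 3] -- consecutive ids
```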

Error

    graph1, graph2 = zip(*pairs)
    ValueError: not enough values to unpack (expected 2, got 0)
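That ValueError is exactly what `zip(*pairs)` raises when `pairs` is empty, which happens whenever the loop's `break` on the index bound fires before any pair is appended. A minimal reproduction:

```python
pairs = []  # what the loop yields when it breaks on the first iteration

try:
    graph1, graph2 = zip(*pairs)
except ValueError as e:
    # zip() of no iterables yields nothing, so unpacking into two names fails
    print(e)  # -> not enough values to unpack (expected 2, got 0)
```

Guarding with `if pairs:` before unpacking, or fixing the indexing so the loop actually produces pairs, avoids the crash.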

Question about training accuracy

Hi!

Thanks for your excellent work; I'm excited about it.

I have two small questions about the training/validation accuracy, though.

For the binary similarity task

I use an RNN as the block embedding structure and get a validation AUC of around 0.83 after 50 epochs. I did not change any file except train.sh. Is this normal behavior?
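For anyone checking their own runs: the AUC over ±1-labelled validation pairs equals the fraction of (similar, dissimilar) pairs whose similarity scores are ranked correctly. A self-contained sketch with hypothetical scores (not the repo's evaluation code):

```python
def pairwise_auc(labels, scores):
    """ROC AUC computed as the Mann-Whitney statistic: the fraction of
    (positive, negative) score pairs ranked correctly, ties counting 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == +1]
    neg = [s for l, s in zip(labels, scores) if l == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical cosine similarities for 3 similar and 3 dissimilar pairs:
labels = [+1, +1, +1, -1, -1, -1]
scores = [0.9, 0.4, 0.8, 0.1, 0.35, 0.6]
print(round(pairwise_auc(labels, scores), 3))  # -> 0.889 (8 of 9 pairs ranked correctly)
```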

Attached is my train.sh

#!/bin/sh

# Type of the network to use

#NETWORK_TYPE="Attention_Mean"
#NETWORK_TYPE="Arith_Mean"
NETWORK_TYPE="RNN"
#NETWORK_TYPE="Annotations"

# Root path for the experiment
MODEL_PATH=experiments/

# Path to the sqlite db with disassembled functions
DB_PATH=../data/OpenSSL_dataset.db

# Path to embedding matrix
EMBEDDING_MATRIX=../data/i2v/embedding_matrix.npy

# Path to instruction2id dictionary
INS2ID=../data/i2v/word2id.json

# Add this argument to train.py to use random instructions embeddings
RANDOM_EMBEDDINGS="-r"

# Add this argument to train.py to use trainable instructions embeddings
TRAINABLE_EMBEDDINGS="-te"

python3 train.py -o $MODEL_PATH -n $DB_PATH -nn $NETWORK_TYPE -e $EMBEDDING_MATRIX -j $INS2ID

For the compiler provenance task

Similarly, I use an RNN and try to predict COMPILER+OPT. The final accuracy is around 74%.

#!/bin/sh

# Type of the network to use
# NETWORK_TYPE="Attention_Mean"
# NETWORK_TYPE="Arith_Mean"
NETWORK_TYPE="RNN"
# NETWORK_TYPE="Annotations"

# What to classify:
# CLASSIFICATION_KIND="Family"      # Compiler Family
# CLASSIFICATION_KIND="Compiler"      # Compiler Family + Version
CLASSIFICATION_KIND="Compiler+Opt"   # Compiler Family + Version + Optimization
# CLASSIFICATION_KIND="Opt"      # Optimization


# Root path for the experiment
MODEL_PATH=experiments/

# Path to the sqlite db with disassembled functions
DB_PATH=../data/restricted_compilers_dataset.db

# Path to embedding matrix
EMBEDDING_MATRIX=../data/i2v/embedding_matrix.npy

# Path to instruction2id dictionary
INS2ID=../data/i2v/word2id.json

# Add this argument to train.py to use random instructions embeddings
RANDOM_EMBEDDINGS="-r"

# Add this argument to train.py to use trainable instructions embeddings
TRAINABLE_EMBEDDINGS="-te"

python3 train.py -o $MODEL_PATH -n $DB_PATH -nn $NETWORK_TYPE -e $EMBEDDING_MATRIX -j $INS2ID -cl $CLASSIFICATION_KIND

I suspect this is normal behavior, since COMPILER+OPT is a much more difficult task.

To be honest, I don't think it's a big deal, but I would appreciate it if anyone could take a look in case I made a mistake.

Thanks in advance for taking the time to answer my questions.
