unsupervised-features-learning-for-binary-similarity's Issues

Question

I can't create my own dataset; something goes wrong at `cfg = json.loads(self.r2.cmd('agfj ' + str(func[s])))`. What can I do to make it execute successfully?
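For reference, radare2's `agfj` prints an empty string when the target function has not been analyzed (or the address is wrong), and `json.loads('')` then raises `JSONDecodeError`. A minimal sketch of a guard, assuming `self.r2` and `func[s]` are the r2pipe handle and function address from the snippet above:

```python
import json

def parse_agfj(out):
    """Parse the JSON printed by radare2's `agfj` command.

    `agfj` prints nothing for unanalyzed functions, which makes a bare
    json.loads raise JSONDecodeError. Return None in that case instead
    of crashing.
    """
    out = out.strip()
    if not out:
        return None  # function not analyzed, or bad address
    return json.loads(out)

# Hypothetical call site, mirroring the snippet in the question:
#   self.r2.cmd("aaa")  # make sure analysis ran before asking for a CFG
#   cfg = parse_agfj(self.r2.cmd("agfj " + str(func[s])))
```

Running full analysis (`aaa`) before querying `agfj`, and putting a space between the command and the address, are the usual fixes when the command returns nothing.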

PairFactory.py

    while i < number_of_pairs:
        if chunk * int(number_of_pairs / 2) + i > data_len:
            break

        # True pair: fetch both functions' features and label the pair +1.
        p = true_pairs_id[chunk * int(number_of_pairs / 2) + i]
        q0 = cur.execute("SELECT " + self.feature_type + " FROM " + self.feature_type + " WHERE id=?", (p[0],))
        if self.feature_type == 'acfg':
            adj0, node0, lenghts0 = self.get_data_from_acfg(json_graph.adjacency_graph(json.loads(q0.fetchone()[0])))
        elif self.feature_type == 'lstm_cfg':
            adj0, node0, lenghts0 = self.get_data_from_cfg(json_graph.adjacency_graph(json.loads(q0.fetchone()[0])))

        q1 = cur.execute("SELECT " + self.feature_type + " FROM " + self.feature_type + " WHERE id=?", (p[1],))
        if self.feature_type == 'acfg':
            # note: the original read q0.fetchone() here -- a copy-paste bug
            adj1, node1, lenghts1 = self.get_data_from_acfg(json_graph.adjacency_graph(json.loads(q1.fetchone()[0])))
        elif self.feature_type == 'lstm_cfg':
            adj1, node1, lenghts1 = self.get_data_from_cfg(json_graph.adjacency_graph(json.loads(q1.fetchone()[0])))

        pairs.append(((adj0, node0), (adj1, node1)))
        lenghts.append([lenghts0, lenghts1])
        labels.append(+1)

        # False pair: same lookups against false_pairs_id, labelled -1.
        p = false_pairs_id[chunk * int(number_of_pairs / 2) + i]
        q0 = cur.execute("SELECT " + self.feature_type + " FROM " + self.feature_type + " WHERE id=?", (p[0],))
        if self.feature_type == 'acfg':
            adj0, node0, lenghts0 = self.get_data_from_acfg(json_graph.adjacency_graph(json.loads(q0.fetchone()[0])))
        elif self.feature_type == 'lstm_cfg':
            adj0, node0, lenghts0 = self.get_data_from_cfg(json_graph.adjacency_graph(json.loads(q0.fetchone()[0])))

        q1 = cur.execute("SELECT " + self.feature_type + " FROM " + self.feature_type + " WHERE id=?", (p[1],))
        if self.feature_type == 'acfg':
            # note: the original read q0.fetchone() here as well
            adj1, node1, lenghts1 = self.get_data_from_acfg(json_graph.adjacency_graph(json.loads(q1.fetchone()[0])))
        elif self.feature_type == 'lstm_cfg':
            adj1, node1, lenghts1 = self.get_data_from_cfg(json_graph.adjacency_graph(json.loads(q1.fetchone()[0])))

        pairs.append(((adj0, node0), (adj1, node1)))
        lenghts.append([lenghts0, lenghts1])
        labels.append(-1)

        i += 2
        
Since `i` increases by 2 each iteration, shouldn't the index use `int(i/2)`? Otherwise every other pair id is skipped:

    p = true_pairs_id[chunk * int(number_of_pairs/2) + i]

should become

    p = true_pairs_id[chunk * int(number_of_pairs/2) + int(i/2)]
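A standalone sketch of the indexing (with a hypothetical small `number_of_pairs`, not the real dataset) shows the difference:

```python
number_of_pairs = 8   # hypothetical; the real value comes from PairFactory
base = 0              # chunk * int(number_of_pairs / 2) with chunk = 0

with_i, with_half_i = [], []
i = 0
while i < number_of_pairs:
    with_i.append(base + i)                # index used by the current code
    with_half_i.append(base + int(i / 2))  # index after the proposed fix
    i += 2

print(with_i)       # [0, 2, 4, 6] -- odd ids are never visited
print(with_half_i)  # [0, 1, 2, 3] -- consecutive ids
```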

Error

    graph1, graph2 = zip(*pairs)
    ValueError: not enough values to unpack (expected 2, got 0)
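That ValueError is exactly what `zip(*pairs)` raises when `pairs` is empty, which happens whenever the loop's `break` on the index bound fires before any pair is appended. A minimal reproduction:

```python
pairs = []  # what the loop yields when it breaks on the first iteration

try:
    graph1, graph2 = zip(*pairs)
except ValueError as e:
    # zip() of no iterables yields nothing, so unpacking into two names fails
    print(e)  # -> not enough values to unpack (expected 2, got 0)
```

Guarding with `if pairs:` before unpacking, or fixing the indexing so the loop actually produces pairs, avoids the crash.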

Question about training accuracy

Hi!

Thanks for your excellent work; I'm excited about it.

I have two small questions about the training/validation accuracy, though.

For the binary similarity task

I use an RNN as the block embedding structure and get a validation AUC of around 0.83 after 50 epochs. I did not change any file except train.sh. Is this normal behavior?
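For anyone checking their own runs: the AUC over ±1-labelled validation pairs equals the fraction of (similar, dissimilar) pairs whose similarity scores are ranked correctly. A self-contained sketch with hypothetical scores (not the repo's evaluation code):

```python
def pairwise_auc(labels, scores):
    """ROC AUC computed as the Mann-Whitney statistic: the fraction of
    (positive, negative) score pairs ranked correctly, ties counting 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == +1]
    neg = [s for l, s in zip(labels, scores) if l == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical cosine similarities for 3 similar and 3 dissimilar pairs:
labels = [+1, +1, +1, -1, -1, -1]
scores = [0.9, 0.4, 0.8, 0.1, 0.35, 0.6]
print(round(pairwise_auc(labels, scores), 3))  # -> 0.889 (8 of 9 pairs ranked correctly)
```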

Attached is my train.sh

#!/bin/sh

# Type of the network to use

#NETWORK_TYPE="Attention_Mean"
#NETWORK_TYPE="Arith_Mean"
NETWORK_TYPE="RNN"
#NETWORK_TYPE="Annotations"

# Root path for the experiment
MODEL_PATH=experiments/

# Path to the sqlite db with disassembled functions
DB_PATH=../data/OpenSSL_dataset.db

# Path to embedding matrix
EMBEDDING_MATRIX=../data/i2v/embedding_matrix.npy

# Path to instruction2id dictionary
INS2ID=../data/i2v/word2id.json

# Add this argument to train.py to use random instructions embeddings
RANDOM_EMBEDDINGS="-r"

# Add this argument to train.py to use trainable instructions embeddings
TRAINABLE_EMBEDDINGS="-te"

python3 train.py -o $MODEL_PATH -n $DB_PATH -nn $NETWORK_TYPE -e $EMBEDDING_MATRIX -j $INS2ID

For the compiler provenance task

Similarly, I use an RNN and try to predict COMPILER+OPT. The final accuracy is around 74%.

#!/bin/sh

# Type of the network to use
# NETWORK_TYPE="Attention_Mean"
# NETWORK_TYPE="Arith_Mean"
NETWORK_TYPE="RNN"
# NETWORK_TYPE="Annotations"

# What to classify:
# CLASSIFICATION_KIND="Family"      # Compiler Family
# CLASSIFICATION_KIND="Compiler"      # Compiler Family + Version
CLASSIFICATION_KIND="Compiler+Opt"   # Compiler Family + Version + Optimization
# CLASSIFICATION_KIND="Opt"      # Optimization


# Root path for the experiment
MODEL_PATH=experiments/

# Path to the sqlite db with disassembled functions
DB_PATH=../data/restricted_compilers_dataset.db

# Path to embedding matrix
EMBEDDING_MATRIX=../data/i2v/embedding_matrix.npy

# Path to instruction2id dictionary
INS2ID=../data/i2v/word2id.json

# Add this argument to train.py to use random instructions embeddings
RANDOM_EMBEDDINGS="-r"

# Add this argument to train.py to use trainable instructions embeddings
TRAINABLE_EMBEDDINGS="-te"

python3 train.py -o $MODEL_PATH -n $DB_PATH -nn $NETWORK_TYPE -e $EMBEDDING_MATRIX -j $INS2ID -cl $CLASSIFICATION_KIND

I suspect this is normal behavior, since COMPILER+OPT is a much more difficult task.

To be honest, I don't think it's a big deal, but I would appreciate it if anyone could take a look in case I made a mistake.

Thanks in advance for taking the time to answer my questions.
