Hi!
Thanks for your excellent work; I am excited about it.
I have two little questions about the training/validation accuracy, though.
For the binary similarity task
I used the RNN as the block embedding structure and got a validation AUC of around 0.83 after 50 epochs. I did not change any file except train.sh.
Is this normal behavior?
Attached is my train.sh:
#!/bin/sh
# Type of the network to use
#NETWORK_TYPE="Attention_Mean"
#NETWORK_TYPE="Arith_Mean"
NETWORK_TYPE="RNN"
#NETWORK_TYPE="Annotations"
# Root path for the experiment
MODEL_PATH=experiments/
# Path to the sqlite db with disassembled functions
DB_PATH=../data/OpenSSL_dataset.db
# Path to embedding matrix
EMBEDDING_MATRIX=../data/i2v/embedding_matrix.npy
# Path to instruction2id dictionary
INS2ID=../data/i2v/word2id.json
# Add this argument to train.py to use random instructions embeddings
RANDOM_EMBEDDINGS="-r"
# Add this argument to train.py to use trainable instructions embeddings
TRAINABLE_EMBEDDINGS="-te"
python3 train.py --o $MODEL_PATH -n $DB_PATH -nn $NETWORK_TYPE -e $EMBEDDING_MATRIX -j $INS2ID
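To rule out a path problem on my side, this is a minimal sanity check I could run before training, assuming the same relative paths as in the script above. `check_inputs` is a hypothetical helper name of my own, not part of the repository; it only verifies that each input file exists and is non-empty.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): report every input
# file that is missing or empty, and return non-zero if any was found.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}

# Intended use inside train.sh, before the python3 call (assumption):
# check_inputs "$DB_PATH" "$EMBEDDING_MATRIX" "$INS2ID" || exit 1
```

If the check passes, the result above is at least not caused by an empty or misplaced dataset or embedding file.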
For the compiler provenance task
Similarly, I used the RNN and tried to predict Compiler+Opt. The final accuracy is around 74%.
#!/bin/sh
# Type of the network to use
# NETWORK_TYPE="Attention_Mean"
# NETWORK_TYPE="Arith_Mean"
NETWORK_TYPE="RNN"
# NETWORK_TYPE="Annotations"
# What to classify:
# CLASSIFICATION_KIND="Family" # Compiler Family
# CLASSIFICATION_KIND="Compiler" # Compiler Family + Version
CLASSIFICATION_KIND="Compiler+Opt" # Compiler Family + Version + Optimization
# CLASSIFICATION_KIND="Opt" # Optimization
# Root path for the experiment
MODEL_PATH=experiments/
# Path to the sqlite db with disassembled functions
DB_PATH=../data/restricted_compilers_dataset.db
# Path to embedding matrix
EMBEDDING_MATRIX=../data/i2v/embedding_matrix.npy
# Path to instruction2id dictionary
INS2ID=../data/i2v/word2id.json
# Add this argument to train.py to use random instructions embeddings
RANDOM_EMBEDDINGS="-r"
# Add this argument to train.py to use trainable instructions embeddings
TRAINABLE_EMBEDDINGS="-te"
python3 train.py -o $MODEL_PATH -n $DB_PATH -nn $NETWORK_TYPE -e $EMBEDDING_MATRIX -j $INS2ID -cl $CLASSIFICATION_KIND
I suspect this is normal behavior, since Compiler+Opt is a much more difficult task than the other classification kinds.
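One way I could check this suspicion is to train the same RNN on all four classification kinds and compare the accuracies. The sketch below only prints the commands (a dry run) using the flags from my script; the function name `print_train_cmds` and the per-task output directories under experiments/ are my own assumptions, not repository conventions.

```shell
#!/bin/sh
# Dry run: print one train.py command per classification kind.
# $1 = sqlite db, $2 = embedding matrix, $3 = instruction2id dictionary.
# The experiments/<kind>/ output layout is an assumption for illustration.
print_train_cmds() {
    for kind in Family Compiler Compiler+Opt Opt; do
        echo "python3 train.py -o experiments/$kind/ -n $1 -nn RNN -e $2 -j $3 -cl $kind"
    done
}

print_train_cmds ../data/restricted_compilers_dataset.db \
    ../data/i2v/embedding_matrix.npy ../data/i2v/word2id.json
```

If the easier kinds such as Family score clearly higher with the same setup, the 74% would more likely reflect task difficulty than a mistake in my configuration.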
To be honest, I don't think it is a big deal, but I would appreciate it if anyone could take a look in case I made a mistake.
Thanks in advance for taking the time to answer my (silly) questions.