Using code2vec for binary classification about code2vec HOT 4 CLOSED

tech-srl commented on July 28, 2024

Using code2vec for binary classification

from code2vec.

Comments (4)

urialon commented on July 28, 2024

Hi @hsellik ,
First, you should probably use code2seq instead, even for binary classification.

One option is to select the argmax among [true, false], ignoring the PAD_OR_OOV.
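A minimal sketch of this option in plain Python, not the code2vec implementation. The label order is an assumption for illustration (only PAD_OR_OOV at index zero comes from this thread), and `predict_ignoring_oov` is a hypothetical helper name. It picks the best class among the non-ignored labels and renormalizes its probability over those labels only:

```python
# Assumed label order: PAD_OR_OOV at index 0 (per this thread);
# the positions of "true" and "false" are hypothetical.
LABELS = ["PAD_OR_OOV", "true", "false"]


def predict_ignoring_oov(probs, labels=LABELS, ignored="PAD_OR_OOV"):
    """Return (label, renormalized probability) of the best non-ignored class."""
    kept = [(label, p) for label, p in zip(labels, probs) if label != ignored]
    total = sum(p for _, p in kept)
    if total == 0:
        raise ValueError("all probability mass is on the ignored label")
    best_label, best_p = max(kept, key=lambda pair: pair[1])
    # Dividing by the kept mass makes the remaining probabilities sum to 1.
    return best_label, best_p / total


# Probabilities similar to those reported in the question: true 34%, false 33%.
label, confidence = predict_ignoring_oov([0.33, 0.34, 0.33])
print(label, round(confidence, 3))  # true 0.507
```

This leaves the model untouched and only changes how its output is read.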
Second, I think that if you train on a large enough dataset for long enough, you will see that the PAD_OR_OOV probability becomes almost zero (because the model won't see any OOVs in the training set).
If you wish to completely remove the option of assigning any probability to PAD_OR_OOV, you can mask it before the softmax:

Here, take the logits tensor (these are the scores for every possible class, before applying softmax) and add "minus infinity" to the index that represents PAD_OR_OOV, which is supposed to be index zero.

This should be something like (TensorFlow 1.x API; in TensorFlow 2.x, tf.log is tf.math.log):

logits = logits + tf.log(1-tf.one_hot(indices=[0], depth=tf.shape(logits)[-1]))

Explanation: tf.one_hot will create a vector like [1, 0, 0, 0, 0]. Then 1-tf.one_hot is [0, 1, 1, 1, 1]. Finally, applying tf.log turns this vector into [-inf, 0, 0, 0, 0]. Adding this to the original logits keeps all values the same, except for the first column, which corresponds to the PAD_OR_OOV symbol and becomes -inf.
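The arithmetic can be checked without TensorFlow. The following is a framework-free sketch, not the code2vec code: `softmax` and `mask_index` are hypothetical helpers, and the logits are made-up numbers. It shows that adding the [-inf, 0, 0] mask drives the PAD_OR_OOV probability to exactly zero while the remaining classes still sum to one:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats; -inf entries map to 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_index(logits, index=0):
    """Add log(1 - one_hot(index)) to the logits: -inf at `index`, 0 elsewhere."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(logits))]
    # math.log(0.0) raises ValueError in Python, so -inf is written explicitly.
    mask = [-math.inf if v == 1.0 else math.log(1.0 - v) for v in one_hot]
    return [x + m for x, m in zip(logits, mask)]


# Made-up scores in the assumed order [PAD_OR_OOV, true, false].
logits = [2.0, 1.0, 1.0]
probs = softmax(mask_index(logits))
print(probs)  # [0.0, 0.5, 0.5]
```

Without the mask, the same logits would give PAD_OR_OOV the largest share of the probability.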

This will not allow the model to assign any probability to PAD_OR_OOV at training time.
At test time, do the same for the scores tensor here.

Let me know if this works.


hsellik commented on July 28, 2024

Thank you for this thorough explanation!

I am planning to try out code2vec first, then code2seq, and see to what degree the results improve. As the preprocessing / training pipelines are very well documented and similar, I think it'll be interesting to compare.


urialon commented on July 28, 2024

Hi,
I am guessing this has to do with the small size of the dataset, which is smaller than some of the batch sizes.
Try decreasing config.READING_BATCH_SIZE here to around 2000.

However, in general, I am doubtful that it will work with such a small dataset.
You can also try code2seq with target sequences of length 1. It is a better model and less sparse than code2vec.
The modifications will be very similar to the modifications you have done so far.

Best,
Uri


hsellik commented on July 28, 2024

Hi @urialon,

I am also trying to use code2vec for binary classification. Right now I am playing around with a very small dataset, but I observed that I get results like True (34%), False (33%). Since they do not add up to 100%, I started debugging and noticed that there is a PAD_OR_OOV entry in raw_prediction_results which takes the rest of the probability.

Am I supposed to select the highest percentage among these values, or is there a way to keep the OOV value from affecting my labels?

I have edited the JavaExtractor to output either True or False as the label and also changed MAX_TARGET_VOCAB_SIZE to 2.

Thanks in advance,
Hendrig
