
Comments (4)

urialon commented on July 28, 2024

Hi @hsellik ,
First, you would probably be better off using code2seq, even for binary classification.

  1. One option is to select the argmax among [true, false], ignoring the PAD_OR_OOV.
  2. I think that if you train on a large enough dataset for long enough, you will see the PAD_OR_OOV probability drop to almost zero (because the model won't see any OOVs in the training set).
  3. If you wish to completely mask the option to assign any probability to PAD_OR_OOV - you can mask it before the softmax:

Here, take the logits tensor (these are the scores for every possible class, before applying softmax) and add "minus infinity" to the index that represents PAD_OR_OOV, which is supposed to be index zero.

This should be something like:

logits = logits + tf.log(1-tf.one_hot(indices=[0], depth=tf.shape(logits)[-1]))

Explanation: tf.one_hot will create a vector like [1, 0, 0, 0, 0]. Then 1-tf.one_hot is [0, 1, 1, 1, 1]. Finally, applying tf.log turns this vector into [-inf, 0, 0, 0, 0]. Adding this to the original logits keeps all values the same, except for the first column, which corresponds to the PAD_OR_OOV symbol and becomes -inf, so the softmax assigns it zero probability.
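To see the effect of the mask numerically, here is a minimal pure-Python sketch (no TensorFlow). The example logits and the helper names `softmax` and `mask_index` are illustrative assumptions, not part of code2vec. The mask is built with `float("-inf")` directly, because Python's `math.log(0)` raises an error rather than returning -inf as tf.log does.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mask_index(logits, index=0):
    """Return a copy of logits with -inf added at `index` (assumed to be PAD_OR_OOV)."""
    mask = [float("-inf") if i == index else 0.0 for i in range(len(logits))]
    return [l + m for l, m in zip(logits, mask)]

# Hypothetical logits for the classes [PAD_OR_OOV, True, False]
logits = [0.1, 0.3, 0.2]
probs_before = softmax(logits)             # PAD_OR_OOV gets a nonzero share
probs_after = softmax(mask_index(logits))  # PAD_OR_OOV gets exactly zero
print(probs_before)
print(probs_after)
```

After masking, the remaining classes share the whole probability mass, and their relative order is unchanged.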

This will not allow the model to assign any probability to PAD_OR_OOV at training time.
At test time, do the same for the scores tensor here.
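The first option above (taking the argmax among [true, false] and ignoring PAD_OR_OOV) can also be applied purely as post-processing on the predicted probabilities. The following is a rough sketch under assumptions: the function name `predict_ignoring_oov`, the label list, and the example probabilities are hypothetical, and PAD_OR_OOV is assumed to sit at index zero.

```python
def predict_ignoring_oov(probs, labels, oov_index=0):
    """Pick the most probable label other than PAD_OR_OOV and renormalize its score."""
    kept = [(p, label) for i, (p, label) in enumerate(zip(probs, labels))
            if i != oov_index]
    total = sum(p for p, _ in kept)
    best_p, best_label = max(kept, key=lambda pair: pair[0])
    # Renormalize so the kept classes sum to 1
    return best_label, best_p / total

# Hypothetical probabilities for [PAD_OR_OOV, True, False]
labels = ["PAD_OR_OOV", "True", "False"]
probs = [0.33, 0.34, 0.33]
label, confidence = predict_ignoring_oov(probs, labels)
print(label, round(confidence, 3))
```

This leaves the model untouched, so it is the least invasive choice, but unlike masking the logits it does not stop the model from spending probability mass on PAD_OR_OOV during training.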

Let me know if this works.

from code2vec.

hsellik commented on July 28, 2024

Thank you for this thorough explanation!

I am planning to try out code2vec, then code2seq, and see to what degree the results improve. As the preprocessing / training pipelines are very well documented and similar, I think it will be an interesting comparison.


urialon commented on July 28, 2024

Hi,
I am guessing this has something to do with the small size of the dataset, which is smaller than some of the batch sizes.
Try decreasing config.READING_BATCH_SIZE here to around 2000.

However, in general, I am doubtful that it will work with such a small dataset.
You can also try code2seq with target sequences of length 1. It is a better model and less sparse than code2vec.
The modifications will be very similar to the modifications you have done so far.

Best,
Uri


hsellik commented on July 28, 2024

Hi @urialon,

I am also trying to use code2vec for binary classification. Right now I am playing around with a very small dataset, and I observed that I get results like True (34%), False (33%). Since they do not add up to 100%, I started debugging and noticed that there is a PAD_OR_OOV entry in raw_prediction_results which takes the rest of the probability.

Am I supposed to select the highest percentage among these values, or is there a way to prevent the OOV value from affecting my labels?

I have edited the JavaExtractor to output either True or False as the label and also changed MAX_TARGET_VOCAB_SIZE to 2.

Thanks in advance,
Hendrig
