Comments (4)
Hi @hsellik ,
First, you would probably be better off using code2seq, even for binary classification.
- One option is to select the argmax among [true, false], ignoring the PAD_OR_OOV.
- Second, I think that if you train on a large enough dataset for long enough, you will see that the PAD_OR_OOV probability becomes almost zero (because the model won't see any OOVs in the training set).
- If you wish to completely prevent the model from assigning any probability to PAD_OR_OOV, you can mask it before the softmax:
Take the logits tensor (these are the scores for every possible class, before applying softmax) and add "minus infinity" to the index that represents PAD_OR_OOV, which is supposed to be index zero.
This should be something like:
logits = logits + tf.log(1-tf.one_hot(indices=[0], depth=tf.shape(logits)[-1]))
Explanation: tf.one_hot will create a vector like [1, 0, 0, 0, 0]. Then 1-tf.one_hot is [0, 1, 1, 1, 1]. Finally, applying tf.log will turn this vector into [-inf, 0, 0, 0, 0]. Adding this to the original logits will keep all values the same, except for the first column, which corresponds to the PAD_OR_OOV symbol.
This will not allow the model to assign any probability to PAD_OR_OOV at training time.
At test time, do the same for the scores tensor.
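For illustration, the masking idea can be sketched in plain Python, outside of TensorFlow. This is only a sketch under stated assumptions: the logit values are made up, the class order [PAD_OR_OOV, true, false] and the helper names softmax and mask_index are hypothetical, and PAD_OR_OOV is assumed to sit at index zero as described above. It also shows that, at prediction time, masking before the softmax gives the same result as renormalizing the unmasked probabilities over the remaining classes.

```python
import math

PAD_OR_OOV_INDEX = 0  # assumption: PAD_OR_OOV is at index zero


def softmax(logits):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_index(logits, index=PAD_OR_OOV_INDEX):
    """Add log(1 - one_hot(index)) to the logits: -inf at `index`, 0 elsewhere."""
    mask = [-math.inf if i == index else 0.0 for i in range(len(logits))]
    return [x + m for x, m in zip(logits, mask)]


# Hypothetical logits for the classes [PAD_OR_OOV, true, false].
logits = [0.30, 0.35, 0.32]

unmasked = softmax(logits)            # PAD_OR_OOV takes roughly a third of the mass
masked = softmax(mask_index(logits))  # PAD_OR_OOV gets exactly zero probability

# Renormalizing the unmasked probabilities over the kept classes
# yields the same distribution as masking before the softmax.
kept_mass = sum(unmasked[1:])
renormalized = [0.0] + [p / kept_mass for p in unmasked[1:]]

# Index of the predicted class after masking (1 = true, 2 = false here).
predicted = max(range(len(masked)), key=masked.__getitem__)
```

With the mask in place, the two remaining probabilities sum to one, so the argmax over [true, false] can be read directly from the masked output.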
Let me know if this works.
from code2vec.
Thank you for this thorough explanation!
I am planning to try out code2vec, then code2seq, and see to what degree the results improve. As the preprocessing / training pipelines are very well documented and similar, I think it'll be interesting to see.
Hi,
I am guessing this has something to do with the small size of the dataset, which is smaller than some of the batch sizes.
Try decreasing config.READING_BATCH_SIZE to around 2000.
However, in general, I am doubtful that it will work with such a small dataset.
You can also try code2seq with target sequences of length 1. It is a better model and less sparse than code2vec.
The modifications will be very similar to the modifications you have done so far.
Best,
Uri
Hi @urialon,
I am also trying to use code2vec for binary classification. Right now I am playing around with a very small dataset, but I observed that I get results like True (34%), False (33%). Since they do not add up to 100%, I started debugging and noticed that there is a PAD_OR_OOV entry in raw_prediction_results which takes the rest of the probability.
Am I supposed to select the best % from these values or is there a way to avoid the OOV value having an effect on my labels?
I have edited the JavaExtractor to output either True/False as label and also changed MAX_TARGET_VOCAB_SIZE to 2.
Thanks in advance,
Hendrig