Comments (5)
Hi @hsellik ,
Thank you for your interest in code2seq and for your kind words. I am happy to hear that you find this project useful.
Multi-class classification is very simple. It is actually a special case of generating sequences: the classes can all be 1-word sequences, and no modifications to the network are necessary. The only required changes are on the input side, where you should provide your labels rather than the method names.
Answers:
- config.TARGET_VOCAB_MAX_SIZE - right. You can set this value to some very large number (e.g., 999999) as a sanity check, just to verify that the code doesn't find more classes in the data than you thought there should be.
- config.MAX_TARGET_PARTS to 1 - exactly
- Change the first field in JavaExtractor to output my desired label - exactly
- Train the model as specified in the README.md - exactly
- This should have no effect, but sometimes newer versions of JavaParser change the AST a little, so I recommend checking manually whether the paths look roughly the same with the new version. Note that there are no hashes in code2seq (hashes belong to code2vec).
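The steps above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: `config` here is a stand-in object carrying only the two fields named above, and `relabel` and `count_labels` are hypothetical helpers. It assumes each extracted line has the form `<target> <context> <context> ...` with fields separated by spaces and target sub-parts separated by `|`.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in for the project's config object (assumption); only the two
# fields discussed above are set.
config = SimpleNamespace(TARGET_VOCAB_MAX_SIZE=999999, MAX_TARGET_PARTS=1)


def relabel(line, label):
    """Replace the first space-separated field (the target) with a class label."""
    _, sep, contexts = line.rstrip("\n").partition(" ")
    return label + sep + contexts


def count_labels(lines):
    """Count distinct targets, to check that no unexpected classes appear."""
    return Counter(line.split(" ", 1)[0] for line in lines if line.strip())


# Made-up example lines; the paths P1..P3 are placeholders.
lines = [
    relabel("get|name a,P1,b c,P2,d", "bug"),
    relabel("set|x a,P3,b", "ok"),
]
counts = count_labels(lines)

# Every label should be a single part, matching MAX_TARGET_PARTS = 1.
assert all("|" not in label for label in counts)
print(counts)
```

Counting the labels before training serves the same purpose as the large-vocabulary sanity check: if more classes show up than expected, the labeling step is the first place to look.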
Good luck!
from code2seq.
Hi @hsellik ,
Congratulations on the accepted paper, feel free to share a link.
As far as I know, machine learning with imbalanced classes is a known problem that does not have a standard solution. If your training data contains only a small number of "positive" examples, a common practice is to up-weight these positive examples in the computation of the loss. If you can get more negative examples "for free", it might help to add them to the training data and up-weight the positive examples.
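As an illustration of up-weighting, here is a minimal sketch of a weighted binary cross-entropy in plain Python. It is a simplified stand-in rather than the loss code2seq actually uses, and `weighted_bce` and `pos_weight` are names chosen for this example. TensorFlow's `tf.nn.weighted_cross_entropy_with_logits` applies the same idea to logits.

```python
import math


def weighted_bce(y_true, p_pred, pos_weight):
    """Mean binary cross-entropy where each positive example counts pos_weight times."""
    eps = 1e-12  # clamp probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


# Made-up labels and predicted probabilities: one positive among four examples.
labels = [1, 0, 0, 0]
probs = [0.3, 0.2, 0.1, 0.4]

print(weighted_bce(labels, probs, pos_weight=1.0))  # unweighted baseline
print(weighted_bce(labels, probs, pos_weight=3.0))  # the missed positive costs more
```

A common starting point for `pos_weight` is the ratio of negative to positive examples in the training data, though the best value is usually found by validation.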
Maybe you could modify the existing AST paths so that they better capture the errors you are trying to detect. For example, instead of the current, standard AST paths, you might want only paths where one of the leaves serves as a loop or array index, or something like that. Maybe you could examine ~10-100 examples where the model was mistaken, and think about which (existing or non-existing) path could have helped the model avoid the mistake.
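One way to experiment with this idea is to filter the extracted path contexts after extraction. The sketch below is only an assumption-laden illustration: it assumes each context has the form `left,path,right` with path nodes separated by `|`, and the node-type names in `INDEX_LIKE_NODES` are hypothetical placeholders, not the extractor's real vocabulary. It also uses a looser criterion than the one described above: it keeps a context if its path passes through a node type of interest, rather than checking that a leaf actually serves as an index.

```python
# Hypothetical node-type names; replace with the names your extractor emits.
INDEX_LIKE_NODES = {"ArrayAccess", "For"}


def keep_context(context, wanted=INDEX_LIKE_NODES):
    """Keep a 'left,path,right' context if its path passes through a wanted node type."""
    parts = context.split(",")
    if len(parts) != 3:
        return False
    return any(node in wanted for node in parts[1].split("|"))


def filter_line(line):
    """Drop all contexts of one example line except those accepted by keep_context."""
    fields = line.split()
    if not fields:
        return ""
    target, contexts = fields[0], fields[1:]
    return " ".join([target] + [c for c in contexts if keep_context(c)])


# Made-up example line with two contexts; only the first touches an array access.
line = "bug i,Name|ArrayAccess|Name,arr x,Name|Assign|Name,y"
print(filter_line(line))  # prints "bug i,Name|ArrayAccess|Name,arr"
```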
I don't see how training the model to predict both the description and the binary label might help, but you never know.
Here is the paper.
Hmm, the AST path selection idea seems quite an interesting one which I might even try out. Thank you for the feedback!
Thank you!
I've been playing around with the code2seq model for a while now. I'm getting around 0.8 precision and recall when training and testing on balanced data for the problem of detecting off-by-one errors in Java code. I also got a paper published at the DeepTest2020 workshop (co-located with ICSE), which uses code2vec in pretty much the same way.
However, precision drops sharply when testing on imbalanced data with a more natural distribution of buggy and normal code. I tried to mitigate this by also training on a dataset with a smaller ratio of bugs, which raised precision at the cost of some recall, but it's still not good enough for production use due to the high number of false positives. Hence, I'm thinking of ways to improve.
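The size of this drop follows from simple arithmetic. The sketch below assumes the classifier keeps the same recall and false-positive rate on both test sets: 0.8 precision and recall on balanced data imply a false-positive rate of 0.2. The 1% bug prevalence is an illustrative assumption, not a measured value.

```python
def precision_at(prevalence, recall, false_positive_rate):
    """Expected precision when a fraction `prevalence` of the examples are bugs."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Balanced test set: half of the examples are bugs.
print(round(precision_at(0.5, 0.8, 0.2), 3))   # 0.8

# Assumed natural distribution: 1 bug per 100 examples.
print(round(precision_at(0.01, 0.8, 0.2), 3))  # 0.039
```

With the classifier itself unchanged, most flagged examples on the imbalanced set are false positives simply because normal code vastly outnumbers bugs.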
Is there any difference on how the model will get implemented?
Since the model was originally used for describing methods, it was suggested to me that I let the model output some of the description and then append the binary label to it. The aim would be for the model to run through a longer sequence and hence get a better result.
I don't know if this makes any sense to you; this thing is a bit of a dark arts section for me 😜. Is there any point in trying such an approach?