Training for Custom Multi Class Classification (code2seq issue, closed)

tech-srl commented on June 17, 2024

from code2seq.

urialon commented on June 17, 2024

Hi @hsellik ,
Thank you for your interest in code2seq and for your kind words. I am happy to hear that you find this project useful.

Multi-class classification is very simple. It is actually a special case of generating sequences: each class can be a 1-word sequence, so no modifications to the network are necessary. The only required changes are on the input side, where you should supply your labels in place of the method names.

Answers:

config.TARGET_VOCAB_MAX_SIZE - right. You can set this value to some very large number (e.g., 999999) as a sanity check, just to verify that the code doesn't find more classes in the data than you thought there should be.
config.MAX_TARGET_PARTS to 1 - exactly
Change the first field in JavaExtractor to output my desired label - exactly
Train the model as specified in the README.md - exactly

This should have no effect, but newer versions of JavaParser sometimes change the AST a little, so I recommend checking manually that the paths look roughly the same with the new version. Note that there are no hashes in code2seq (hashes belong to code2vec).
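
A minimal sketch of the data-side change described in the answers above, for the case where already-extracted data is relabeled instead of editing JavaExtractor. It assumes each line of extractor output is a target followed by space-separated contexts. The function names, the sample lines, and the label strings are hypothetical and not part of code2seq.

```python
from collections import Counter


def relabel(line: str, label: str) -> str:
    """Replace the first space-separated field (the target) with a class label."""
    if not label or any(ch in label for ch in " |,"):
        # Assumption: a label has to be one token so that it fits
        # within MAX_TARGET_PARTS = 1.
        raise ValueError(f"label must be a single token: {label!r}")
    _target, _sep, contexts = line.rstrip("\n").partition(" ")
    return f"{label} {contexts}".rstrip()


def count_labels(lines) -> Counter:
    """Count distinct targets, in the spirit of the TARGET_VOCAB_MAX_SIZE sanity check."""
    return Counter(line.split(" ", 1)[0] for line in lines if line.strip())


# Hypothetical extractor output: "<method name> <context> <context> ..."
raw_lines = [
    "get|name this,Nm0|Fld|Ret,name i,Nm1|Lt|Nm2,length",
    "sum|all i,Nm0|Lt|Nm1,n arr,Nm0|Idx|Nm1,i",
]
labels = ["correct", "off_by_one"]  # hypothetical class names

relabeled = [relabel(line, label) for line, label in zip(raw_lines, labels)]
label_counts = count_labels(relabeled)
print(label_counts)
```

If `label_counts` contains more distinct keys than the number of classes you expect, the labels were probably not written where you intended.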

Good luck!

urialon commented on June 17, 2024

Hi @hsellik ,
Congratulations on the accepted paper! Feel free to share a link.

As far as I know, machine learning with imbalanced classes is a known problem that does not have a standard solution. If your training data contains only a small number of "positive" examples, a common practice is to up-weight these positive examples in the computation of the loss. If you can get more negative examples "for free", it might help to add them to the training data and up-weight the positive examples.
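
To illustrate the up-weighting idea, here is a minimal sketch of a weighted binary cross-entropy in plain Python. It is not code2seq's loss (code2seq trains a sequence decoder), and the function names and the inverse-frequency heuristic for choosing the weight are assumptions made for illustration.

```python
import math


def inverse_frequency_weight(labels) -> float:
    """One common heuristic: weight positives by (#negatives / #positives)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples to weight")
    return negatives / positives


def weighted_cross_entropy(probs, labels, pos_weight: float = 1.0) -> float:
    """Mean binary cross-entropy where each positive example counts pos_weight times."""
    eps = 1e-12  # keeps log() away from zero
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        w = pos_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical predictions: one badly scored positive among confident negatives.
probs = [0.2, 0.1, 0.1, 0.1, 0.1]
labels = [1, 0, 0, 0, 0]

plain = weighted_cross_entropy(probs, labels)
weighted = weighted_cross_entropy(probs, labels, inverse_frequency_weight(labels))
print(plain, weighted)
```

With the weight applied, the missed positive dominates the loss, which is the effect the up-weighting is meant to have.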

Maybe you could modify the existing AST paths so that they better capture the errors you are trying to detect. For example, instead of the current, standard AST paths, you might want only paths where one of the leaves serves as a loop or array index, or something like that. Maybe you could examine ~10-100 examples where the model was mistaken, and think about which (existing or non-existing) path could have helped the model avoid the mistake.
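
One rough way to experiment with this idea is to filter contexts after extraction. The sketch below assumes each context has the form `left,path,right` with path nodes separated by `|`. The node names (`Idx`, `For`) and the sample line are hypothetical placeholders, so check your extractor's actual output before relying on them.

```python
def filter_contexts(line: str, wanted_nodes: set) -> str:
    """Keep only contexts whose path passes through one of the wanted node types."""
    target, *contexts = line.split()
    kept = []
    for context in contexts:
        parts = context.split(",")
        if len(parts) != 3:
            continue  # skip malformed contexts
        path_nodes = set(parts[1].split("|"))
        if path_nodes & wanted_nodes:
            kept.append(context)
    return " ".join([target, *kept])


# Hypothetical line: a label followed by three contexts.
line = "off_by_one i,Nm0|Lt|For,n arr,Nm0|Idx|Nm1,i x,Nm0|Plus|Nm1,y"
filtered = filter_contexts(line, {"Idx", "For"})
print(filtered)
```

Inspecting which contexts survive on a handful of misclassified examples is a cheap way to judge whether such a filter keeps the paths that matter.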

I don't see how training the model to predict both the description and the binary label would help, but you never know.

hsellik commented on June 17, 2024

Here is the paper: https://deeptestconf.github.io/pdfs/2020-Briem-DeepTest.pdf

Hmm, the AST path selection idea is quite an interesting one, which I might even try out. Thank you for the feedback!

hsellik commented on June 17, 2024

Thank you!

hsellik commented on June 17, 2024

@urialon

I've been playing around with the code2seq model for a while now. I'm getting around 0.8 precision and recall when training and testing on balanced data for the problem of detecting off-by-one errors in Java code. I also got a paper published at the DeepTest 2020 workshop (co-located with ICSE), which uses code2vec in pretty much the same way.

However, precision drops by a huge amount when testing on imbalanced data with a more natural distribution of buggy and normal code. I tried to mitigate this by also training on a dataset with a smaller ratio of bugs, which bumped precision at the cost of some recall, but it's still not good enough for production use due to the high number of false positives. Hence, I'm thinking of ways to improve.
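
The size of this drop can be estimated with a little arithmetic. The sketch below assumes that recall and the false-positive rate measured on balanced data carry over unchanged to the imbalanced setting. The 1% bug prevalence is a hypothetical figure, and `precision_at` is an illustrative helper, not part of any library.

```python
def precision_at(recall: float, false_positive_rate: float, prevalence: float) -> float:
    """Expected precision when a fraction `prevalence` of the examples are bugs."""
    true_positives = recall * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    denominator = true_positives + false_positives
    return true_positives / denominator if denominator > 0 else 0.0


# On balanced data, 0.8 precision and 0.8 recall imply a false-positive rate of 0.2.
balanced = precision_at(recall=0.8, false_positive_rate=0.2, prevalence=0.5)

# Hypothetical natural distribution: 1 buggy method per 100.
natural = precision_at(recall=0.8, false_positive_rate=0.2, prevalence=0.01)
print(balanced, natural)
```

Under these assumed numbers, the same classifier that reaches 0.8 precision on balanced data would reach only about 0.04 at 1% prevalence, because the false positives from the much larger pool of normal code swamp the true positives.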

Is there any difference in how the model would be implemented?

Since the model was originally used for describing methods, it was suggested to me that I let the model output some of the description and then append the binary label to it. The aim would be for the model to run through a longer sequence and hence get a better result.

I don't know if this makes any sense to you; this is a bit of a dark art for me 😜. Is there any point in trying such an approach?
