
Comments (5)

urialon commented on June 17, 2024

Hi @hsellik ,
Thank you for your interest in code2seq and for your kind words. I am happy to hear that you find this project useful.

Multi-class classification is very simple. It is actually a special case of sequence generation: each class can be a sequence that is just one word long, so no modifications to the network are necessary. The only required changes are on the input side, where you should provide your labels instead of the method names.

  1. Answers:
  • config.TARGET_VOCAB_MAX_SIZE - right. You can set this value to some very large number (e.g., 999999) as a sanity check, just to verify that the code doesn't find more classes in the data than you thought there should be.
  • config.MAX_TARGET_PARTS to 1 - exactly
  • Change the first field in JavaExtractor to output my desired label - exactly
  • Train the model as specified in the README.md - exactly
  2. This should have no effect, but newer versions of JavaParser sometimes change the AST slightly, so I recommend manually checking that the paths look roughly the same with the new version. Note that there are no hashes in code2seq (hashes belong to code2vec).
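
The input-side change described in the answers above can be sketched roughly as follows. This is a minimal sketch, assuming each extracted example is one line whose first space-separated field is the target and whose remaining fields are path contexts. `relabel_line` and `count_labels` are hypothetical helpers, not part of code2seq, and the sample lines are made up.

```python
from collections import Counter


def relabel_line(line: str, label: str) -> str:
    """Replace the first space-separated field (the target) with a class label."""
    _, _, contexts = line.rstrip("\n").partition(" ")
    return f"{label} {contexts}" if contexts else label


def count_labels(lines):
    """Count distinct targets, to compare against the expected number of classes."""
    return Counter(line.split(" ", 1)[0] for line in lines if line.strip())


# Hypothetical extractor output: method name first, then path contexts.
raw = [
    "get|name a,P1,b c,P2,d",
    "set|index i,P3,n",
]
labeled = [relabel_line(raw[0], "normal"), relabel_line(raw[1], "bug")]
label_counts = count_labels(labeled)
```

With single-word labels like these, every target is one part long, which matches setting config.MAX_TARGET_PARTS to 1. The number of keys in `label_counts` should equal the number of classes you expect, which is the same sanity check that a very large config.TARGET_VOCAB_MAX_SIZE gives you.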

Good luck!

from code2seq.

urialon commented on June 17, 2024

Hi @hsellik ,
Congratulations on the accepted paper, feel free to share a link.

As far as I know, machine learning with imbalanced classes is a known problem that does not have a standard solution. If your training data contains only a small number of "positive" examples, a common practice is to up-weight these positive examples in the computation of the loss. If you can get more negative examples "for free", it might help to add them to the training data and up-weight the positive examples.
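
To make the up-weighting idea concrete, here is a minimal sketch on a plain binary case. It is not code2seq's loss; `weighted_bce` and the `pos_weight` value are assumptions chosen for illustration, and the same idea would apply as a per-example weight on whatever loss the model actually uses.

```python
import math


def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy with each positive example's loss scaled by pos_weight."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(labels)


# Made-up predicted probabilities of the positive class and their true labels.
probs = [0.9, 0.2, 0.1, 0.3]
labels = [1, 0, 0, 1]
plain = weighted_bce(probs, labels, pos_weight=1.0)
upweighted = weighted_bce(probs, labels, pos_weight=5.0)
```

With `pos_weight` above 1, a missed positive (such as the last example) costs more than a comparable mistake on a negative, so training is pushed toward not ignoring the rare class.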

Maybe you could modify the existing AST paths so that they better capture the errors you are trying to detect. For example, instead of the current, standard AST paths, maybe you would want only paths where one of the leaves serves as a loop or array index, or something like that? You could also examine ~10-100 examples where the model was mistaken, and think about which (existing or non-existing) path could have helped the model avoid the mistake.
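
One way such a path selection could look is sketched below. It assumes each context has the form `left,path,right` with path nodes separated by `|`; the node names in `WANTED` and the sample line are hypothetical and would need to be replaced with the node types your extractor really emits.

```python
def keep_context(context: str, wanted_nodes: frozenset) -> bool:
    """True if the path part of a 'left,path,right' context mentions a wanted node type."""
    parts = context.split(",")
    if len(parts) != 3:
        return False  # skip malformed contexts
    path_nodes = parts[1].split("|")
    return any(node in wanted_nodes for node in path_nodes)


def filter_contexts(line: str, wanted_nodes: frozenset) -> str:
    """Keep the target field and only those contexts whose path passes keep_context."""
    target, *contexts = line.split(" ")
    kept = [c for c in contexts if keep_context(c, wanted_nodes)]
    return " ".join([target, *kept])


# Hypothetical node type names standing in for loop and array-index constructs.
WANTED = frozenset({"ForStmt", "ArrayAccessExpr"})
line = "bug i,Name|ForStmt|Name,n x,Name|Assign|Name,y a,Name|ArrayAccessExpr|Name,i"
filtered = filter_contexts(line, WANTED)
```

Here the middle context is dropped because its path touches neither a loop nor an array access. Whether a filter like this helps is an empirical question; inspecting misclassified examples first would show which node types are worth keeping.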

I don't see how training the model to predict both the description and the binary label would help, but you never know.

hsellik commented on June 17, 2024

Here is the paper.

Hmm, the AST path selection idea seems quite an interesting one which I might even try out. Thank you for the feedback!

hsellik commented on June 17, 2024

Thank you!

hsellik commented on June 17, 2024

@urialon

I've been playing around with the code2seq model for a while now. I'm getting around 0.8 precision and recall when training and testing on balanced data for the problem of detecting off-by-one errors in Java code. I also got a paper published at the DeepTest 2020 workshop (co-located with ICSE), which uses code2vec in pretty much the same way.

However, precision drops by a huge amount when testing on imbalanced data with a more natural distribution of buggy and normal code. I tried to mitigate this by also training on a dataset with a smaller ratio of bugs, which raised precision at the cost of some recall, but it's still not good enough for production use due to the high number of false positives. Hence, I'm thinking of ways to improve.
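
The size of that drop follows from the arithmetic of precision. As a rough sketch: 0.8 precision and 0.8 recall on balanced data imply a false positive rate of 0.2, and if recall and false positive rate stay fixed, precision depends only on how common bugs are. The 1% bug prevalence below is a hypothetical figure, not a measurement from this thread.

```python
def precision_at_prevalence(recall, false_positive_rate, prevalence):
    """Precision implied by a fixed recall and false positive rate at a given class prevalence."""
    true_positives = recall * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Balanced test set: half of the examples are bugs.
balanced = precision_at_prevalence(0.8, 0.2, 0.5)
# Hypothetical natural distribution: 1 in 100 examples is a bug.
rare = precision_at_prevalence(0.8, 0.2, 0.01)
```

Under these assumptions the same classifier goes from 0.8 precision on balanced data to under 0.04 at 1% prevalence, because the false positives drawn from the large normal class swamp the true positives.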

Is there any difference in how the model would be implemented?

Since the model was originally used for describing methods, it was suggested to me that I let the model output some of the description and then append the binary label to it. The aim would be for the model to run through a longer sequence and hence get a better result.

I don't know if this makes any sense to you; this is a bit of a dark art for me 😜. Is there any point in trying such an approach?
