Hi, thanks for this great work! Can you clarify the below issue? This is the Stack tra

Invalid Argument Error? about code2seq HOT 7 CLOSED

tech-srl commented on September 26, 2024

Invalid Argument Error?

from code2seq.

Comments (7)

urialon commented on September 26, 2024

My understanding of the extraction step is that I specify the target as, say, the method name or a caption, and in the list of contexts I can specify any type of component suitable for my problem

That's right! There are several things to notice:

The words in target should be split by |, i.e.: print|bmp|to|file
The 3-tuple type_of_statement|token_1|token_2 should be split by comma (,) rather than |, and each of them internally should be split by |.
The network reads the 1st and 3rd fields as a set of subtokens, and the 2nd field as a sequence (using an LSTM). So I would suggest switching the order to make type_of_statement the middle field, and setting config.MAX_PATH_LENGTH = 1. The final format would then look like:
print|bmp|to|file subtoken1|subtoken2|subtoken3,type_of_statement,subtoken4|subtoken5|subtoken6

Where subtoken1|subtoken2|subtoken3 are the components of token_1 in your example,
and subtoken4|subtoken5|subtoken6 are the components of token_2 from your example.
Since type_of_statement is a single value (rather than a sequence of symbols), you can set config.MAX_PATH_LENGTH = 1, and training will be faster because the LSTM will not be used.
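
For anyone writing their own extractor, the format described above can be produced with a few lines of Python. This is only a minimal sketch: format_example is a hypothetical helper, not part of code2seq, and it assumes each context is given as (subtokens of token_1, type_of_statement, subtokens of token_2).

```python
def format_example(target_words, contexts):
    """Build one raw example line: the target, then space-separated contexts.

    Hypothetical helper (not part of code2seq). Each context is a tuple of
    (left_subtokens, statement_type, right_subtokens).
    """
    # Words of the target are joined with '|'.
    target = "|".join(target_words)
    fields = []
    for left_subtokens, statement_type, right_subtokens in contexts:
        # The three fields are joined with ',', and the subtokens inside
        # the first and third fields are joined with '|'.
        fields.append(",".join([
            "|".join(left_subtokens),
            statement_type,
            "|".join(right_subtokens),
        ]))
    # The target and the contexts are separated by single spaces.
    return " ".join([target] + fields)


line = format_example(
    ["print", "bmp", "to", "file"],
    [(["subtoken1", "subtoken2", "subtoken3"],
      "type_of_statement",
      ["subtoken4", "subtoken5", "subtoken6"])],
)
print(line)
```

The printed line should match the example row above, with one such line per extracted example in train.raw.txt, val.raw.txt and test.raw.txt.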

urialon commented on September 26, 2024

Basically yes. See also Section 2 of the code2vec paper, where it is explained more thoroughly:
https://arxiv.org/abs/1803.09473

PankajB1997 commented on September 26, 2024

On a related note, could you please explain the role of config.MAX_PATH_LENGTH in a bit more detail? I am not familiar with the model, so still trying to figure out this error, which seems to be related to this constant.

urialon commented on September 26, 2024

Hi Pankaj,
Did you run the model on a dataset that you preprocessed yourself, i.e., not our preprocessed dataset? Did you preprocess your dataset with a non-default max_path_length value? Or did you decrease the default value in config.MAX_PATH_LENGTH?
In general, config.MAX_PATH_LENGTH in the model should be greater by 1 than the max_path_length value of the preprocessing. This is indeed confusing.

config.MAX_PATH_LENGTH is the number of nodes in each "path".
For legacy reasons, in the JavaExtractor, max_path_length is the number of edges and is set to 8 by default. This is the reason that the default value for config.MAX_PATH_LENGTH is 8+1.
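
The edges-versus-nodes relationship can be checked on a custom dataset with a short script. This is a rough sketch under assumptions: the helper names are made up for illustration, and it assumes each context has the form left,path,right with the path's nodes separated by |, as in the format discussed in this thread.

```python
# Per the explanation above: the extractor counts edges, the model counts nodes.
EXTRACTOR_MAX_PATH_LENGTH = 8                          # edges (JavaExtractor default)
MODEL_MAX_PATH_LENGTH = EXTRACTOR_MAX_PATH_LENGTH + 1  # nodes (the 8+1 default)


def count_path_nodes(context):
    """Return the number of nodes in the middle (path) field of one context."""
    _left, path, _right = context.split(",")
    return len(path.split("|"))


def contexts_exceeding(line, max_nodes):
    """Return the contexts of one raw line whose path has more than max_nodes nodes."""
    _target, *contexts = line.split(" ")
    return [c for c in contexts if c and count_path_nodes(c) > max_nodes]


# Hypothetical raw line: one 3-node path and one 1-node path.
example = "get|name a,N1|N2|N3,b x,N1,y"
print(contexts_exceeding(example, max_nodes=2))
```

Running a check like this over the raw files before preprocess.sh shows whether any path is longer than the value the model is configured with.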

PankajB1997 commented on September 26, 2024

Hello, thank you for the response!

Yes, I'm using another dataset for which I wrote my own extractor, and then I ran preprocess.sh on just the extracted result (i.e. my self-created train.raw.txt, val.raw.txt, test.raw.txt). I guess my mistake is that I did not take the max_path_length property into account in my extraction code.

My understanding of the extraction step is that I specify the target as, say, the method name or a caption, and in the list of contexts I can specify any type of component suitable for my problem. My extracted rows deal with code lines individually and are of the form target type_of_statement|token_1|token_2 ..., where type_of_statement is chosen from a set of 25 possible values indicating the type of code statement, and the tokens are similar to your example.

So just to clarify, how would you suggest I account for max_path_length in my extraction code?

PankajB1997 commented on September 26, 2024

Thank you for your help, this clarified a lot!! :)

PankajB1997 commented on September 26, 2024

Btw, I wanted to understand your usage of the Abstract Syntax Tree in the extraction step. Quoting from the paper:

Given the AST of a code snippet, we consider all pairwise paths between terminals, and represent them as sequences of terminal and nonterminal nodes. We then use these paths with their terminals’ values to represent the code snippet itself.

Does this mean that given the AST, you are extracting all possible terminal-to-terminal paths from the tree and extracting contexts in the form terminal node token, path of intermediate non-terminal nodes, terminal node token?
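
To make the question concrete, here is a minimal sketch of pairwise terminal-to-terminal path extraction on a toy tree. It is an illustration only, not the JavaExtractor's implementation: the tree encoding (a nonterminal is a (label, children) tuple, a terminal is a plain string) and the function names are assumptions, and it applies no limits on path length or width.

```python
from itertools import combinations


def collect_leaves(node, trail=(), position=()):
    """Yield (token, trail) per terminal; trail lists the (position, label)
    of every nonterminal from the root down to the terminal's parent."""
    if isinstance(node, str):
        yield node, trail
        return
    label, children = node
    here = trail + ((position, label),)
    for i, child in enumerate(children):
        yield from collect_leaves(child, here, position + (i,))


def pairwise_contexts(tree):
    """Yield (left_token, path_labels, right_token) for every pair of terminals."""
    leaves = list(collect_leaves(tree))
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(leaves, 2):
        # Count the ancestors both terminals share, starting at the root.
        shared = 0
        while (shared < min(len(trail_a), len(trail_b))
               and trail_a[shared] == trail_b[shared]):
            shared += 1
        up = [label for _, label in reversed(trail_a[shared:])]
        top = trail_a[shared - 1][1]  # lowest common ancestor
        down = [label for _, label in trail_b[shared:]]
        yield tok_a, up + [top] + down, tok_b


# Toy AST for something like "x = y + 1" (labels are made up).
tree = ("Assign", [("Name", ["x"]),
                   ("BinOp", [("Name", ["y"]), ("Num", ["1"])])])
for left, path, right in pairwise_contexts(tree):
    print(",".join([left, "|".join(path), right]))
```

Each printed row has the shape terminal token, path of nonterminal nodes, terminal token, with the path going up from the first terminal to the lowest common ancestor and back down to the second.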
