Comments (12)
Okay, I know why this happens.
Because I use a batch size of 2, the reader randomly chooses two target labels into a batch;
And your project focuses on one function, rather than one file. (For example, AboutPage.java has two functions.)
Thank you.
from code2seq.
Hi @lishi0927 ,
Please provide the exact command line that you ran.
Please also mention if you have changed anything in the code compared to the default (git diff).
Uri
Thank you for your reply.
My command line is "python3 reader.py";
I only changed "self.TRAIN_PATH = java-small/java-small" at line 206 of reader.py.
Oh I see, running reader.py as the main file is mostly meant for debugging.
To use it with a real dataset you'll need to load the dataset's settings from the .dict.c2s file, and create a vocabulary out of them.
You'll need to copy lines 33-55 from model.py and paste them at line 220 in reader.py (possibly with some minor adaptation of variable names).
The issue is that in the test code in reader.py, settings like config.DATA_MAX_CONTEXTS and dictionaries like nodes_to_index are hard-coded.
In a real training run (as it is in model.py), these settings are loaded from the dataset's dictionary file, and the dictionaries like nodes_to_index are created based on them.
Let me know if you tried that and experienced problems.
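For concreteness, the loading step described above can be sketched as follows. The order of the five pickled objects follows the pickle.load sequence used elsewhere in this thread; the tiny counts and the temp-file path here are made-up stand-ins, not the real java-small dictionary:

```python
import pickle
import tempfile
import os

# Made-up miniature stand-ins for the three count dictionaries and two
# scalars that a real <dataset>.dict.c2s file contains.
subtoken_to_count = {'get': 10, 'name': 7}
node_to_count = {'MethodDeclaration': 5}
target_to_count = {'get|name': 3}
max_contexts = 200
num_training_examples = 2

# Write five pickled objects back-to-back into one file, mimicking the
# dictionary file's layout.
path = os.path.join(tempfile.mkdtemp(), 'demo.dict.c2s')
with open(path, 'wb') as f:
    for obj in (subtoken_to_count, node_to_count, target_to_count,
                max_contexts, num_training_examples):
        pickle.dump(obj, f)

# Read them back in the same order, as model.py does before building
# vocabularies from the counts.
with open(path, 'rb') as f:
    loaded_subtokens = pickle.load(f)
    loaded_nodes = pickle.load(f)
    loaded_targets = pickle.load(f)
    loaded_max_contexts = pickle.load(f)
    loaded_num_examples = pickle.load(f)

print(loaded_max_contexts)  # → 200
```

In the real script you would open '{}.dict.c2s'.format(config.TRAIN_PATH) instead of a temp file, then feed the three count dictionaries to Common.load_vocab_from_dict.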
I have tried your advice, but it still has the same bug.
Here is the code of the main function in my reader.py. I use the java-small dataset, so java-small/java-small.dict.c2s is in the same folder as the train and test files.
target_word_to_index = {Common.PAD: 0, Common.UNK: 1, Common.SOS: 2,
                        'a': 3, 'b': 4, 'c': 5, 'd': 6, 't': 7}
subtoken_to_index = {Common.PAD: 0, Common.UNK: 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5}
node_to_index = {Common.PAD: 0, Common.UNK: 1, '1': 2, '2': 3, '3': 4, '4': 5}
import numpy as np

class Config:
    def __init__(self):
        self.SAVE_EVERY_EPOCHS = 1
        self.TRAIN_PATH = self.TEST_PATH = 'java-small/java-small'
        self.BATCH_SIZE = 2
        self.TEST_BATCH_SIZE = self.BATCH_SIZE
        self.READER_NUM_PARALLEL_BATCHES = 1
        self.READING_BATCH_SIZE = 2
        self.SHUFFLE_BUFFER_SIZE = 100
        self.MAX_CONTEXTS = 4
        self.DATA_NUM_CONTEXTS = 4
        self.MAX_PATH_LENGTH = 3
        self.MAX_NAME_PARTS = 2
        self.MAX_TARGET_PARTS = 4
        self.RANDOM_CONTEXTS = True
        self.CSV_BUFFER_SIZE = None
        self.SUBTOKENS_VOCAB_MAX_SIZE = 190000
        self.TARGET_VOCAB_MAX_SIZE = 27000

config = Config()
with open('{}.dict.c2s'.format(config.TRAIN_PATH), 'rb') as file:
    subtoken_to_count = pickle.load(file)
    node_to_count = pickle.load(file)
    target_to_count = pickle.load(file)
    max_contexts = pickle.load(file)
    num_training_examples = pickle.load(file)
print('Dictionaries loaded.')
if config.DATA_NUM_CONTEXTS <= 0:
    config.DATA_NUM_CONTEXTS = max_contexts
subtoken_to_index, index_to_subtoken, subtoken_vocab_size = \
    Common.load_vocab_from_dict(subtoken_to_count, add_values=[Common.PAD, Common.UNK],
                                max_size=config.SUBTOKENS_VOCAB_MAX_SIZE)
print('Loaded subtoken vocab. size: %d' % subtoken_vocab_size)
target_to_index, index_to_target, target_vocab_size = \
    Common.load_vocab_from_dict(target_to_count, add_values=[Common.PAD, Common.UNK, Common.SOS],
                                max_size=config.TARGET_VOCAB_MAX_SIZE)
print('Loaded target word vocab. size: %d' % target_vocab_size)
node_to_index, index_to_node, nodes_vocab_size = \
    Common.load_vocab_from_dict(node_to_count, add_values=[Common.PAD, Common.UNK], max_size=None)
print('Loaded nodes vocab. size: %d' % nodes_vocab_size)
epochs_trained = 0
reader = Reader(subtoken_to_index, target_word_to_index, node_to_index, config, False)
output = reader.get_output()
target_index_op = output[TARGET_INDEX_KEY]
target_string_op = output[TARGET_STRING_KEY]
target_length_op = output[TARGET_LENGTH_KEY]
path_source_indices_op = output[PATH_SOURCE_INDICES_KEY]
node_indices_op = output[NODE_INDICES_KEY]
path_target_indices_op = output[PATH_TARGET_INDICES_KEY]
valid_context_mask_op = output[VALID_CONTEXT_MASK_KEY]
path_source_lengths_op = output[PATH_SOURCE_LENGTHS_KEY]
path_lengths_op = output[PATH_LENGTHS_KEY]
path_target_lengths_op = output[PATH_TARGET_LENGTHS_KEY]
path_source_strings_op = output[PATH_SOURCE_STRINGS_KEY]
path_strings_op = output[PATH_STRINGS_KEY]
path_target_strings_op = output[PATH_TARGET_STRINGS_KEY]
sess = tf.InteractiveSession()
tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.tables_initializer()).run()
reader.reset(sess)
try:
    while True:
        target_indices, target_strings, target_lengths, path_source_indices, \
            node_indices, path_target_indices, valid_context_mask, path_source_lengths, \
            path_lengths, path_target_lengths, path_source_strings, path_strings, \
            path_target_strings = sess.run(
                [target_index_op, target_string_op, target_length_op, path_source_indices_op,
                 node_indices_op, path_target_indices_op, valid_context_mask_op, path_source_lengths_op,
                 path_lengths_op, path_target_lengths_op, path_source_strings_op, path_strings_op,
                 path_target_strings_op])
        print('Target strings: ', Common.binary_to_string_list(target_strings))
        print('Context strings: ', Common.binary_to_string_3d(
            np.concatenate([path_source_strings, path_strings, path_target_strings], -1)))
        print('Target indices: ', target_indices)
        print('Target lengths: ', target_lengths)
        print('Path source strings: ', Common.binary_to_string_3d(path_source_strings))
        print('Path source indices: ', path_source_indices)
        print('Path source lengths: ', path_source_lengths)
        print('Path strings: ', Common.binary_to_string_3d(path_strings))
        print('Node indices: ', node_indices)
        print('Path lengths: ', path_lengths)
        print('Path target strings: ', Common.binary_to_string_3d(path_target_strings))
        print('Path target indices: ', path_target_indices)
        print('Path target lengths: ', path_target_lengths)
        print('Valid context mask: ', valid_context_mask)
        # target_indices = sess.run(target_index_op)
        # print('Target indices: ', target_indices)
except tf.errors.OutOfRangeError:
    print('Done training, epoch reached')
Right, I see.
In the line self.DATA_NUM_CONTEXTS = 4, set the initialization value to 0 instead of 4.
This will signal the reader to load this value from the dictionary, rather than using the hard-coded value.
Sorry for not noticing this difference before.
Thank you for your patient reply.
It makes sense, but I tried changing several configuration parameters:
def __init__(self):
    self.SAVE_EVERY_EPOCHS = 1
    self.TRAIN_PATH = self.TEST_PATH = 'java-small/java-small'
    self.BATCH_SIZE = 2
    self.TEST_BATCH_SIZE = self.BATCH_SIZE
    self.READER_NUM_PARALLEL_BATCHES = 1
    self.READING_BATCH_SIZE = 2
    self.SHUFFLE_BUFFER_SIZE = 100
    self.MAX_CONTEXTS = 0
    self.DATA_NUM_CONTEXTS = 0
    self.MAX_PATH_LENGTH = 0
    self.MAX_NAME_PARTS = 0
    self.MAX_TARGET_PARTS = 0
    self.RANDOM_CONTEXTS = True
    self.CSV_BUFFER_SIZE = None
    self.SUBTOKENS_VOCAB_MAX_SIZE = 190000
    self.TARGET_VOCAB_MAX_SIZE = 27000
Let me know if there is still any problem.
I used a simple example, but the result is:
Target strings: ['content', 'render']
Context strings: [[], []]
Target indices: [[0]
[0]]
Target lengths: [0 0]
Path source strings: [[], []]
Path source indices: []
Path source lengths: []
Path strings: [[], []]
Node indices: []
Path lengths: []
Path target strings: [[], []]
Path target indices: []
Path target lengths: []
Valid context mask: []
Target strings: ['pre|head']
Context strings: [[]]
Target indices: [[0]]
Target lengths: [0]
Path source strings: [[]]
Path source indices: []
Path source lengths: []
Path strings: [[]]
Node indices: []
Path lengths: []
Path target strings: [[]]
Path target indices: []
Path target lengths: []
Valid context mask: []
Done training, epoch reached
Why are there so many empty outputs?
I use two Java files from the java-small dataset; the test_dataset.train.raw.txt is as follows:
test_dataset.train.raw.txt
I think this is because you zeroed too many config parameters; you should have zeroed only DATA_NUM_CONTEXTS.
All of the following should not be zeroed. Please set them to these values (as in config.py):
self.MAX_CONTEXTS = 200
self.DATA_NUM_CONTEXTS = 0
self.MAX_PATH_LENGTH = 9
self.MAX_NAME_PARTS = 5
self.MAX_TARGET_PARTS = 6
Of course, you can reduce these numbers if, for example, 200 paths is too many to observe at once.
Let me know if you experience any additional problems.
Thank you.
It makes sense, but I still have a question.
I know that the features of all the different Java files are extracted into the raw files, where each line stores (target name, path, padding). But when I run reader.py, the target labels are assigned to different tensors.
For example, the function in AboutBlock.java is [render], and the functions in AboutPage.java are [prehead, content].
Hence the train.c2s is as follows:
content ...
render...
pre|head...
But the outputs are:
Target strings: ['render', 'content']
Target strings: ['pre|head']
How are these target labels grouped? Why not ['render'] or ['content', 'pre|head']?
Great!
If you pass is_evaluating=True to the Reader object initialization (here), the target labels will not be shuffled (here) and will appear in the same order as in the textual file.
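To see why the order changes during training, here is a rough plain-Python stand-in for the buffered shuffling that tf.data performs in the reader; the function name, buffer size, and seed below are illustrative only, not the repo's actual defaults:

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Approximates tf.data-style shuffling: hold up to `buffer_size`
    elements in a buffer and emit a randomly chosen one at each step."""
    rng = random.Random(seed)
    buffer, out = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            out.append(buffer.pop(rng.randrange(len(buffer))))
    while buffer:  # drain the remaining buffer in random order
        out.append(buffer.pop(rng.randrange(len(buffer))))
    return out

# The three target labels from this thread: same labels come out,
# but potentially in a different order than in the .c2s file.
targets = ['content', 'render', 'pre|head']
shuffled = buffered_shuffle(targets, buffer_size=100, seed=7)
print(shuffled)
```

With is_evaluating=True the reader skips this shuffle step, which is why the labels then come out in file order.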