
code-docstring-corpus's Introduction

code-docstring-corpus

This repository contains preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.

Paper: https://arxiv.org/abs/1707.02275

Update

The code-docstring-corpus version 2, with class declarations, class methods, module docstrings and commit SHAs, is now available in the V2 directory.

Installation

The dependencies can be installed using pip:

pip install -r requirements.txt

The extraction scripts require AST Unparser ( https://github.com/simonpercivall/astunparse ); NMT tokenization requires the Moses tokenizer scripts ( https://github.com/moses-smt/mosesdecoder ).

Details

We release a parallel corpus of 150370 triples of function declarations, function docstrings and function bodies. We include multiple corpus splits, and an additional "monolingual" code-only corpus with corresponding synthetically generated docstrings.

The corpora were assembled by scraping open-source GitHub repositories with the GitHub scraper used by Bhoopchand et al. (2016), "Learning Python Code Suggestion with a Sparse Pointer Network" (paper: https://arxiv.org/abs/1611.08307 - code: https://github.com/uclmr/pycodesuggest ).

The Python code was then preprocessed to normalize the syntax, extract top-level functions, remove comments and semantically irrelevant whitespace, and separate the declarations, the docstrings (if present) and the bodies. We did not extract classes and their methods.
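To illustrate the kind of extraction involved, here is a simplified sketch using the ast module and AST Unparser (not the actual pipeline in the scripts directory; the real preprocessing also strips the docstring from the body and escapes newlines and indentation with special tokens, and on Python 3.9+ ast.unparse can replace astunparse):

```python
import ast
import astunparse  # https://github.com/simonpercivall/astunparse

def extract_functions(source):
    """Yield (declaration, docstring, body) for each top-level function.

    Simplified sketch of the extraction described above; not the exact
    corpus-building scripts."""
    for node in ast.parse(source).body:
        if not isinstance(node, ast.FunctionDef):
            continue  # only top-level functions; classes and methods are excluded
        docstring = ast.get_docstring(node) or ""
        # Re-serialize the function with normalized syntax, then split the
        # declaration (first line) from the body (remaining lines).
        code = astunparse.unparse(node).strip()
        decl, _, body = code.partition("\n")
        yield decl, docstring, body
```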

| Directory | Description |
| --- | --- |
| parallel-corpus | Main parallel corpus with a canonical split into 109108 training triples, 2000 validation triples and 2000 test triples. Each triple is annotated with metadata (repository owner, repository name, source file and line number). Also contains two versions of the corpus reassembled into pairs, (declaration+body, docstring) and (declaration+docstring, body), for code documentation and code generation tasks, respectively. Refer to the Readme in this folder for a description of the escape tokens. |
| code-only-corpus | A code-only corpus of 161630 pairs of function declarations and function bodies, annotated with metadata. |
| backtranslations-corpus | A corpus of docstrings automatically generated from the code-only corpus using Neural Machine Translation, to enable data augmentation by "back-translation". |
| nmt-outputs | Test and validation outputs of the baseline Neural Machine Translation models. |
| repo_split.parallel-corpus | An alternative train/validation/test split of the parallel corpus which is "repository-consistent": no repository is split across the training, validation and test sets. |
| repo_split.code-only-corpus | A "repository-consistent" filtered version of the code-only corpus: it only contains fragments from repositories that appear in the training set of the above split. |
| scripts | Preprocessing scripts used to generate the corpora. |
| V2 | code-docstring-corpus version 2, with class declarations, class methods, module docstrings and commit SHAs. |
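The parallel files are line-aligned, so the i-th line of each file refers to the same function. A minimal reading sketch (the file names below are assumptions based on the data_ps.* naming; check the Readme in parallel-corpus for the exact names and for how to undo the escape tokens):

```python
import itertools

# Line-aligned parallel files: line i of each file belongs to the same function.
# Paths/names are illustrative; adjust to the actual files in parallel-corpus.
paths = ["parallel-corpus/data_ps.declarations.train",
         "parallel-corpus/data_ps.bodies.train",
         "parallel-corpus/data_ps.descriptions.train"]

files = [open(p, encoding="utf-8") for p in paths]
for decl, body, desc in itertools.islice(zip(*files), 3):
    print(decl.strip())
    print(desc.strip())
    print(body.strip())
    print("---")
```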

Baseline results

In order to compute baseline results, the data from the canonical split (the parallel-corpus directory) was further sub-tokenized using Byte Pair Encoding (Sennrich et al. 2016, paper: https://arxiv.org/abs/1508.07909 - code: https://github.com/rsennrich/subword-nmt ). Finally, we trained baseline Neural Machine Translation models for both the code2doc and the doc2code tasks using Nematus (Sennrich et al. 2017, paper: https://arxiv.org/abs/1703.04357 - code: https://github.com/rsennrich/nematus ).
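For reference, BPE segmentation with the subword-nmt package looks roughly like this (a sketch; the merge count and the exact invocation used for the baselines may differ):

```python
# Learn a BPE model on one side of the training data and apply it to a line.
# Assumes `pip install subword-nmt`; the merge count (10000) is illustrative.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

with open("data_ps.descriptions.train", encoding="utf-8") as infile, \
     open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe(infile, codes_out, 10000)

with open("bpe.codes", encoding="utf-8") as codes_in:
    bpe = BPE(codes_in)

print(bpe.process_line("Return the length of the given list ."))
```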

Baseline outputs are available in the nmt-outputs directory.

We also used the code2doc model to generate the synthetic docstring corpus from the code-only corpus; it is available in the backtranslations-corpus directory.

| Model | Validation BLEU | Test BLEU |
| --- | --- | --- |
| declbodies2desc.baseline | 14.03 | 13.84 |
| decldesc2bodies.baseline | 10.32 | 10.24 |
| decldesc2bodies.backtransl | 10.85 | 10.90 |

BLEU scores are computed using the Moses multi-bleu.perl script.
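The script takes the tokenized reference file as an argument and reads the hypotheses from standard input, for example (paths are illustrative):

perl mosesdecoder/scripts/generic/multi-bleu.perl reference.tok < hypothesis.tok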

Reference

If you use this corpus for a scientific publication, please cite: Miceli Barone, A. V. and Sennrich, R. (2017). "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation". arXiv:1707.02275, https://arxiv.org/abs/1707.02275

code-docstring-corpus's People

Contributors

avmb · suryabhupa


code-docstring-corpus's Issues

dataset

Hi,

I want to know: in this project, do we train two NMT models?

  1. code2doc
  2. doc2code

help required

Can you please tell me how to run this code? I have downloaded it, but I am unable to run it. Could you please guide me with the directions, if possible?

Waiting for a positive reply.

Thank you so much.

Recover original code snippets from `data_ps.all.*`

I found that the source code and descriptions are encoded in data_ps.all.*; for example, d is replaced with qz. May I know if there is a script that I can use to convert the encoded code/descriptions back into their original format? Thanks a bunch!
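A best-effort way to undo the escaping (an untested sketch; it assumes that 'DCNL' encodes a newline and each leading 'DCSP' encodes one indentation step, so check the Readme in parallel-corpus for the exact convention):

```python
def unescape(snippet, indent="    "):
    """Best-effort reversal of the DCNL/DCSP escaping (assumption:
    'DCNL' marks a newline and each leading 'DCSP' marks one
    indentation step; see the parallel-corpus Readme)."""
    lines = []
    for raw in snippet.split(" DCNL "):
        depth = 0
        # Count and strip leading DCSP tokens to rebuild the indentation.
        while raw.startswith("DCSP "):
            raw = raw[len("DCSP "):]
            depth += 1
        lines.append(indent * depth + raw)
    return "\n".join(lines)
```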

Examples?

Can you provide some examples of this working?

parallel-corpus

In the parallel-corpus directory, is the data already tokenized, or do we need to tokenize it using the Moses tokenizer and the BPE tokenizer?
Also, since this data is code, how should it be read? When I write the code as

def load_doc(filename):
    file = codecs.open(filename, mode='rb', encoding='utf-8', errors='ignore')
    text = file.read()
    return text

the output does not show the text contained in the file, but instead gives output like
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00.

why?

Performance of SOTA model on this dataset

Thank you for your interesting paper "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation." You said you didn't try AST-based generation in the paper because that was not its purpose, but have you tried any stronger model on the dataset since then, to see what the current SOTA would be? Have you been trying to create a larger dataset?

Idea for generating test cases

Train seq2seq RNNs to generate syntactically valid inputs to gold code, and then use the gold code as an oracle to get the output for the generated input (if the input has correct syntax).

  • Inside a try/except, the seq2seq receives the gold code as input and has to generate an input for the gold code that doesn't set off an exception.

  • Reward the seq2seq +1 if there is no exception, -1 if there is an exception, and add entropy (or minibatch discrimination?) to the loss so it produces varied generations (see the sketch after this list).
  • To learn which syntax error it made, have the seq2seq predict which exception it set off and lessen the negative reward if the prediction is correct.

  • Maybe SL-pretrain with already provided test cases (if we have any) before the RL stage (entropy in the RL loss will ensure the RL stage stays varied).
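A minimal sketch of the reward step (hypothetical helper names: `gold_fn` is the gold function, `generated_args` is the model-generated input):

```python
def exception_reward(gold_fn, generated_args):
    """Run the gold code on a generated input and return (reward, exception name).

    +1 if the call raises no exception, -1 otherwise; the exception name can be
    used as the auxiliary prediction target described above."""
    try:
        gold_fn(*generated_args)
        return 1.0, None
    except Exception as exc:
        return -1.0, type(exc).__name__
```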

BLEU score

Hi, I followed all your steps, took your dataset and did all the preprocessing and tokenization, but I did not get the same results as you. I got a BLEU score of 0. Can you please tell me why? @Avmb @rsennrich

Can I get the source code as body directly without syntax parsing?

Thank you for sharing your code. In your corpus, the body part is represented by a preprocessed form of the source code that contains many symbols such as 'DCNL' and 'DCSP'. Is it possible to get the source code directly, without this preprocessing? Thanks.

Question about creating a dataset format for NeuralCodeSum

Hello @Avmb,

I have a question about the dataset format of NeuralCodeSum.
When I checked https://github.com/wasiahmad/NeuralCodeSum/tree/master/data, the dataset was from this repository, since you supplied the dataset to NeuralCodeSum.

Could I know how you made the dataset format for NeuralCodeSum? It looks like a list of tokens, without underscores and other symbols. If there is a script, or some other way, to parse the code into that dataset format, I would like to know it.

Thank you:)

Are all the descriptions used in the baseline methods?

Hi, this is really a great corpus!
I noticed that some descriptions contain parameter definitions/explanations, which may be hard to generate, so I am wondering what your target sequence is in the baseline methods: the first sentence of the description (which may describe the functionality), or the entire description paragraph? Thanks!

tokenization of the data

Hi,

When I run the command below on this particular file, the error shown underneath arises. Can you please tell me why this happens only for this particular file?

File name: data_ps.decldesc.train

All the other files were tokenized without any problem, but this one was not. Why?

Error:
gauravs-MBP:~ g$ /Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en </code-docstring-corpus/parallel-corpus/data_ps.decldesc.train >~/code-docstring-corpus/parallel-corpus/data_ps.decldesc.train1
Tokenizer Version 1.1
Language: en
Number of threads: 1
utf8 "\xFF" does not map to Unicode at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 180, line 1133.
Malformed UTF-8 character: \xff\xff\xff\xff\x5c\x27\x29\x20\x44\x43\x4e\x4c\x20 (overflows) in substitution (s///) at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 240, line 1133.
Malformed UTF-8 character: \xff\xff\xff\xff\x5c\x27\x29\x20\x44\x43\x4e\x4c\x20 (unexpected non-continuation byte 0xff, immediately after start byte 0xff; need 13 bytes, got 1) in substitution (s///) at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 240, line 1133.
Malformed UTF-8 character (fatal) at /Users/gaurav/Downloads/mosesdecoder/scripts/tokenizer/tokenizer.perl line 240, line 1133.

Thank you so much.
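One possible workaround for the malformed bytes reported above (a sketch, not part of the corpus scripts: it rewrites the file as UTF-8, replacing any undecodable bytes; paths are illustrative):

```python
import codecs

src = "parallel-corpus/data_ps.decldesc.train"
dst = "parallel-corpus/data_ps.decldesc.train.clean"

# Decode with errors='replace' so stray non-UTF-8 bytes become U+FFFD
# instead of crashing the Moses tokenizer.
with codecs.open(src, "r", encoding="utf-8", errors="replace") as fin, \
     codecs.open(dst, "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(line)
```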

How much memory does this require while training?

We are running this experiment on Google Colab Pro (16 GB machines) and are receiving out-of-memory errors with the same vocabulary size (89500). How much memory would you recommend for running this experiment?

Parallel corpus V2 possibly incorrect

Hello,

I have tried to use your V2 dataset and found an interesting thing:

  • The number of lines in parallel_methods_desc is 397241
  • The number of lines in parallel_methods_bodies is 397225
  • The number of lines in parallel_desc is 148620
  • The number of lines in parallel_body is 148603

It seems that the dataset is possibly corrupted, or that some descriptions span several lines in the files.

Could you explain the correct approach to match bodies and descriptions?

Thank you.

Edited (June 9th, 2018):

The same problem is encountered with the first dataset version:

  • The number of lines in data_ps.bodies.train is 109109
  • The number of lines in data_ps.descriptions.train is 109130
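For reference, the counts above can be reproduced with a few lines of Python (file names as reported in this issue; adjust the paths as needed):

```python
# Count lines per file to spot mismatches between supposedly parallel files.
for name in ["parallel_methods_desc", "parallel_methods_bodies",
             "parallel_desc", "parallel_body"]:
    with open(name, encoding="utf-8", errors="replace") as f:
        print(name, sum(1 for _ in f))
```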
