
SPT-Code

Requirements

Minimal requirements

The list of minimal requirements can be found in requirements.txt.

Additional requirements

If you need to reprocess the raw dataset or use your own dataset, you will also need to install the following packages.

tree_sitter==0.19.0
antlr4-python3-runtime==4.9.2

In addition, ANTLR 4 itself needs to be installed; installation guidance can be found here.

If you encounter errors about my-languages.so when preprocessing the dataset, please run sources/data/asts/build_lib.py first.
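
If you are curious what that build script does, a minimal sketch using the tree_sitter 0.19 API looks like the following; the output path and the vendor/ grammar checkouts are assumptions for illustration, not the script's actual paths.

from tree_sitter import Language

# Build the shared library that tree-sitter uses to parse each language.
# The vendor/ directories are assumed to be local clones of the grammar
# repos, e.g. https://github.com/tree-sitter/tree-sitter-java
Language.build_library(
    'my-languages.so',
    [
        'vendor/tree-sitter-java',
        'vendor/tree-sitter-python',
    ],
)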

Datasets and Tokenizers

We provide pre-processed datasets, saved as pickle binary files, which can be loaded directly as instances.

The pre-processed datasets can be downloaded here: (OneDrive, iCloud, Google Drive). Put the downloaded dataset pickle file into {dataset_root}/dataset_saved/ (defaults to .../dataset/dataset_saved); the program will automatically detect and use it.
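
Because these are plain pickle binaries, a file can also be inspected directly. A minimal sketch (the file name below is hypothetical; actual names depend on the task and language):

import pickle

# Hypothetical file name; actual names depend on the task and language.
with open('../dataset/dataset_saved/pre_train.pk', 'rb') as f:
    dataset = pickle.load(f)

print(type(dataset))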

It is also possible to use a custom dataset: either place it in the location expected by the relevant settings in the source code, or modify the corresponding dataset loading code. The dataset loading code is located in sources/data/data.py and sources/data/data_utils.py; a rough starting point is sketched below.
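
For orientation only, a custom loader could look roughly like this before being wired into those files; the function name and the tab-separated file format are purely illustrative assumptions, not the repository's actual interface.

def load_custom_dataset(path):
    # Illustrative only: read one tab-separated (source, target) pair per line.
    sources, targets = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            src, tgt = line.rstrip('\n').split('\t')
            sources.append(src)
            targets.append(tgt)
    return sources, targets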

Pre-trained Tokenizers and Models

Custom tokenizers (which we call "vocabs") can be downloaded here: (OneDrive, iCloud, Google Drive). Extract them into a directory, then either specify the trained_vocab argument of main.py to point to that location, or put them in {dataset_root}/vocab_saved (defaults to .../dataset/vocab_saved).

You may pre-train SPT-Code yourself, or use the pre-trained models we provide here. Extract them into a directory, then specify the trained_model argument in the same way as the tokenizers above.

Runs

Run main.py to start pre-training, fine-tuning, or testing. All arguments are defined in args.py; specify whatever you need.

Some example scripts are shown below.

# pre-training
python main.py \
--do-pre-train \
--pre-train-tasks cap,mass,mng \
--batch-size 64 \
--eval-batch-size 64 \
--cuda-visible-devices 0,1,2,3 \
--fp16 \
--model-name pre_train

# summarization on pre-trained model and vocab
python main.py \
--do-fine-tune \
--task summarization \
--summarization-language java \
--model-name summarization_java \
--trained_vocab '../pre_trained/vocabs/' \
--trained_model '../pre_trained/models/all/'

# bug fixing without pre-training
python main.py \
--do-fine-tune \
--train-from-scratch \
--task bug_fix \
--bug_fix_scale medium

# only test on translation
python main.py \
--only-test \
--task translation \
--translation-source-language java \
--translation-target-language c_sharp \
--trained_vocab '../pre_trained/vocabs/' \
--trained_model '../outputs/translation_java_c_sharp_20210826_052653/models/'


Issues

The class BartForClassificationAndGeneration's forward function in bart.py

The code snippets below are simplified to highlight only the control flow around set_model_mode(): the first task, TASK_CODE_AST_PREDICTION, changes the mode to MODEL_MODE_CLS, while the next two tasks switch it to MODEL_MODE_GEN.

When running the program, MODEL_MODE_GEN worked well for those two tasks. However, MODEL_MODE_CLS, as configured by TASK_CODE_AST_PREDICTION, caused a runtime error raised from the forward() function of BartForClassificationAndGeneration. In this function, only the MODEL_MODE_GEN branch of the if expression (CODE LINK) dispatches to forward_gen() without any error; any other mode raises an exception.

pre_train.py

def pre_train():
    for task in tasks:
        if task == enums.TASK_CODE_AST_PREDICTION:
            model.set_model_mode(enums.MODEL_MODE_CLS)
        elif task == enums.TASK_MASS:
            model.set_model_mode(enums.MODEL_MODE_GEN)
        elif task == enums.TASK_METHOD_NAME_PREDICTION:
            model.set_model_mode(enums.MODEL_MODE_GEN)
    return model, (code_vocab, ast_vocab, nl_vocab)

bart.py

class BartForClassificationAndGeneration(BartForConditionalGeneration):
    def forward(self, input_ids=None, **kwargs):
        if self.mode == enums.MODEL_MODE_GEN:
            return self.forward_gen(input_ids=input_ids, **kwargs)
        else:
            raise ValueError
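
For comparison, handling the classification mode would presumably require an extra branch along the lines sketched below; forward_cls() is a hypothetical name for whatever classification-specific forward pass is intended, not a method confirmed to exist in the repository.

        if self.mode == enums.MODEL_MODE_GEN:
            return self.forward_gen(input_ids=input_ids, **kwargs)
        elif self.mode == enums.MODEL_MODE_CLS:
            # Hypothetical: dispatch to a classification-specific forward pass
            return self.forward_cls(input_ids=input_ids, **kwargs)
        else:
            raise ValueError(f'Unknown model mode: {self.mode}')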

Environment:

Ubuntu 22.04.4 LTS, Python 3.8.17, and conda 23.5.2

Command:

python main.py --do-pre-train --pre-train-tasks cap --batch-size 16 --eval-batch-size 32 --cuda-visible-devices 0 --fp16 --model-name pre_train

Error message:

Running configurations initialized successfully
----------------------------------------------------------------------------------------------------
Start pre-training task: cap
  0%|          | 0/1770 [00:00<?, ?it/s]Traceback (most recent call last):
  File "main.py", line 100, in <module>
    main(main_args)
  File "main.py", line 22, in main
    model, vocab = pre_train(args=args)
  File "/home/[email protected]/Documents/0-research-spt-code/spt-code/sources/pre_train.py", line 218, in pre_train
    cap_result = trainer.train()
  File "/my-conda-path/envs/spt-code/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/my-conda-path/envs/spt-code/lib/python3.8/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/my-conda-path/envs/spt-code/lib/python3.8/site-packages/transformers/trainer.py", line 2902, in training_step
    loss = self.compute_loss(model, inputs)
  File "/my-conda-path/envs/spt-code/lib/python3.8/site-packages/transformers/trainer.py", line 2925, in compute_loss
    outputs = model(**inputs)
  File "/my-conda-path/envs/spt-code/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/my-conda-path/envs/spt-code/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/my-conda-path/envs/spt-code/lib/python3.8/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/my-conda-path/envs/spt-code/lib/python3.8/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/my-conda-path/envs/spt-code/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/[email protected]/Documents/0-research-spt-code/spt-code/sources/models/bart.py", line 85, in forward
    raise ValueError
ValueError

As shown below, there is only one location in the source files that sets the model mode to MODEL_MODE_CLS; all the others use MODEL_MODE_GEN via set_model_mode().

/my-program-path/sources $ grep -r set_model_mode . --include=*.py
./downstream_tasks/completion.py:    model.set_model_mode(enums.MODEL_MODE_GEN)
./downstream_tasks/translation.py:    model.set_model_mode(enums.MODEL_MODE_GEN)
./downstream_tasks/summarization.py:    model.set_model_mode(enums.MODEL_MODE_GEN)
./downstream_tasks/search.py:    model.set_model_mode(enums.MODEL_MODE_GEN)
./downstream_tasks/bug_fix.py:    model.set_model_mode(enums.MODEL_MODE_GEN)
./models/bart.py:            self.set_model_mode(mode)
./models/bart.py:    def set_model_mode(self, mode):
./pre_train.py:            model.set_model_mode(enums.MODEL_MODE_CLS)
./pre_train.py:            model.set_model_mode(enums.MODEL_MODE_GEN)
./pre_train.py:            model.set_model_mode(enums.MODEL_MODE_GEN)

I want to understand the case where TASK_CODE_AST_PREDICTION changes the mode to MODEL_MODE_CLS in pre_train.py (line 153). I wonder if there is something I missed; the other two pre-training tasks worked well.

Questions about the baselines in the paper

Hi,

This is a great and interesting work!
I am curious whether you have had a chance to compare against the model proposed in CodeT5, which was evaluated on datasets similar to yours.

Thanks.

Bug

Hello, I found that when using the source code you provide and calling

nl = extract_nl_from_code(source=source, root=root, lang=lang, name=name)

the extracted method call names are mostly wrong. Is there a way to correct this error and obtain the correct names?
Some examples of the erroneous output follow:
['wip', 'wip.compareAn', 'f (q.is', ' ', ' if', ' ', ' ', ' ', ' QueueDr']
['er.requireNonN', 'urn to', 't().toObserv', 'e()', 'ons.listSo', 'n)).flatMapIter', '>iden']
['er.requireNonN', 'urn to', 't().toObserv', 'e()', 'ons.listSo', 'n)).flatMapIter', '>iden']
