
transcoder's Introduction

This repository is deprecated; please refer to https://github.com/facebookresearch/CodeGen instead.

TransCoder

Original PyTorch implementation of TransCoder from the paper Unsupervised Translation of Programming Languages.

Dependencies

  • Python 3
  • NumPy
  • PyTorch
  • fastBPE (generate and apply BPE codes)
  • Moses (scripts to clean and tokenize text only - no installation required)
  • Apex (for fp16 training)
  • libclang (for C++ tokenization)
  • submitit (to run the preprocessing pipeline on a remote machine)
  • six
  • sacrebleu (pip install sacrebleu=="1.2.11")

If your libclang.so is not in /usr/lib/llvm-7/lib/, replace the path with the correct one in clang.cindex.Config.set_library_path('path_to_libclang') in code_tokenizer.py.
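
For example, this is a minimal sketch of that call, assuming libclang.so lives in /usr/lib/llvm-10/lib (adjust the directory to your system):

from clang import cindex

# Point the clang Python bindings at the directory containing libclang.so
# before any C++ tokenization happens. The path below is only an example.
cindex.Config.set_library_path('/usr/lib/llvm-10/lib')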

If you run the data preprocessing pipeline, you will have to compile fastBPE: go to XLM/tools/fastBPE and carry out the steps described in its README.

Translate

Transcompilation between Java, C++ and Python 3.

The model can be tested on new inputs using the translate.py script.

For instance, python translate.py --src_lang cpp --tgt_lang java --model_path trained_model.pth < input_code.cpp will translate the C++ code contained in input_code.cpp to Java.
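
If you need to translate many files, one simple approach is to call the documented command line once per file. A minimal sketch, assuming a folder cpp_inputs/ of .cpp files and the checkpoint name used above (both illustrative):

import subprocess
from pathlib import Path

# Translate every C++ file in cpp_inputs/ to Java by piping it through the
# translate.py command line shown above; output goes to <name>.java.out.
for cpp_file in sorted(Path('cpp_inputs').glob('*.cpp')):
    with open(cpp_file) as src, open(f'{cpp_file.stem}.java.out', 'w') as out:
        subprocess.run(
            ['python', 'translate.py',
             '--src_lang', 'cpp', '--tgt_lang', 'java',
             '--model_path', 'trained_model.pth'],
            stdin=src, stdout=out, check=True)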

Download a pre-trained model

Model checkpoints (.pth files) are provided. We used the validation set to select the best checkpoint for each language pair and to choose which model to use to compute the test scores. We selected:

  • this model for C++ -> Java, Java -> C++ and Java -> Python
  • this model for C++ -> Python, Python -> C++ and Python -> Java

Run an evaluation

  • Download the test and validation data and unzip it. In that folder, the test and validation data are preprocessed (tokenized, BPE applied) and binarized, so they can be used directly in XLM and to test the released model. We also release the raw data here.
  • Put all the binarized data into data/XLM-cpp-java-python-with-comments.
  • Run XLM/train.py in eval_only mode. For instance:
python XLM/train.py 
--n_heads 8 
--bt_steps 'python_sa-cpp_sa-python_sa,cpp_sa-python_sa-cpp_sa,java_sa-cpp_sa-java_sa,cpp_sa-java_sa-cpp_sa,python_sa-java_sa-python_sa,java_sa-python_sa-java_sa' # The evaluator will use this parameter to infer the languages to test on 
--max_vocab '-1'  
--word_blank '0.1' 
--n_layers 6  
--generate_hypothesis true 
--max_len 512 
--bptt 256  
--fp16 true 
--share_inout_emb true 
--tokens_per_batch 6000 
--has_sentences_ids true 
--eval_bleu true  
--split_data false  
--data_path 'path_to_TransCoder_folder/data/XLM-cpp-java-python-with-comments'  
--eval_computation true 
--batch_size 32 
--reload_model 'model_1.pth,model_1.pth'  
--amp 2  
--max_batch_size 128 
--ae_steps 'cpp_sa,python_sa,java_sa' 
--emb_dim 1024 
--eval_only True 
--beam_size 10 
--retry_mistmatching_types 1 
--dump_path '/tmp/' 
--exp_name='eval_final_model_wc_30' 
--lgs 'cpp_sa-java_sa-python_sa' 
--encoder_only=False

Train a new model

Data needed

Data you need to pretrain a model with MLM:

  • training data (monolingual): source code in each language, e.g. train.python.pth (in practice there are 8 of these, train.python.[0..7].pth, because the data is split across 8 GPUs)
  • test / valid data (monolingual): source code in each language to test the perplexity of the model, e.g. test.python.pth / valid.python.pth

Data you need to train AE and BT:

  • training data (monolingual): standalone functions in each language, e.g. train.python_sa.[0..7].pth
  • test / valid data (monolingual + parallel):
    • monolingual functions to test the perplexity of the model, e.g. test.python_sa.pth / valid.python_sa.pth
    • parallel functions to test the translation model (with BLEU and unit tests), e.g. test.python_sa-cpp_sa.pth / valid.python_sa-cpp_sa.pth

All of this data should be contained in a single folder, whose path is given in the --data_path argument. We provide the parallel test and validation data: see the sections Run an evaluation and Validation and Test Sets Release. To obtain all the monolingual data (all code / functions, train / test / valid), see the following section.

NB: In our case, the training data is too large to fit on a single machine, so we split it into 8 binarized files and split the data across 8 GPUs at training time. If your training data fits on a single machine, regroup all of it into a single file, for instance train.python.pth.
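
If you do need to regroup shards, here is a minimal sketch. It assumes the shards follow XLM's binarized format (a dict with 'dico', 'positions', 'sentences' and 'unk_words' saved with torch.save) and should be run from the repository root so the pickled dictionary class can be loaded; inspect one of your files before relying on it.

import numpy as np
import torch

# Load the 8 per-GPU shards and merge them into a single train.python.pth.
shards = [torch.load(f'train.python.{i}.pth') for i in range(8)]

sentences, positions, unk_words = [], [], {}
offset = 0
for shard in shards:
    sentences.append(shard['sentences'])
    positions.append(shard['positions'] + offset)  # shift indices into the merged token stream
    offset += len(shard['sentences'])
    for word, count in shard['unk_words'].items():
        unk_words[word] = unk_words.get(word, 0) + count

merged = {
    'dico': shards[0]['dico'],             # the dictionary is shared across shards
    'positions': np.concatenate(positions),
    'sentences': np.concatenate(sentences),
    'unk_words': unk_words,
}
torch.save(merged, 'train.python.pth', pickle_protocol=4)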

Download/preprocess data

To get the monolingual data, first download the cpp / java / python source code from Google BigQuery (https://cloud.google.com/blog/products/gcp/github-on-bigquery-analyze-all-the-open-source-code). To run our preprocessing pipeline, you need to download the raw source code onto your machine in json format and put each programming language in a dedicated folder. A sample is given in data/test_dataset. The pipeline extracts source code from the json files, tokenizes it, extracts functions, applies BPE, binarizes the data and creates symlinks with the appropriate names to be used directly in XLM. The folder that ends with .XLM-syml is the data path you give for XLM training. You will have to add the parallel test and valid data we provide in "Run an evaluation" to that folder.

To test the pipeline, run pytest preprocessing/test_preprocess.py; you will see the pipeline output in the data/test_dataset folder.

To run the pipeline (either locally or on a remote machine), here is an example command:

python -m preprocessing.preprocess 
absolute_path_to_TransCoder/data/test_dataset # path to the root folder where you have the json
--lang1 java # languages to preprocess
--lang2 python #
--lang3 cpp # can be None if you want to preprocess only 2 languages
--keep_comments True # True to keep code comments in your code, False to remove them
--bpe_train_size 0 # size of the training data subset on which the BPE codes are trained; 0 disables the parameter and all training data is used
--test_size 10 # size of the test/validation sets, usually 1000; here 10 to test the command on the json samples
--local True # True to launch the pipeline locally, False to launch it on a remote machine (in that case it uses submitit)

If you want to preprocess another programming language, you have to implement the functions tokenize_newlang, detokenize_newlang, extract_function_newlang and get_function_name_newlang in preprocessing/src/code_tokenizer.py, then run the pipeline with newlang (a skeleton is sketched below).
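
The skeleton below lists those hooks; the signatures are assumptions modeled on the existing cpp/java/python helpers, so check code_tokenizer.py for the exact interfaces.

# Hypothetical skeleton for adding a language called "newlang" to
# preprocessing/src/code_tokenizer.py. Signatures are assumptions, not the
# repository's exact ones.

def tokenize_newlang(code, keep_comments=False):
    """Turn a newlang source string into a list of tokens."""
    raise NotImplementedError

def detokenize_newlang(tokens):
    """Turn a token sequence back into compilable newlang source code."""
    raise NotImplementedError

def extract_function_newlang(tokenized_code):
    """Extract the functions contained in a tokenized newlang file."""
    raise NotImplementedError

def get_function_name_newlang(tokenized_function):
    """Return the name of a tokenized newlang function."""
    raise NotImplementedError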

NB: If you run the pipeline for cpp/java/python with --keep_comments True, you don't need to train the BPE codes and vocab; they are provided in data/bpe.cpp-java-python.with_comments. In the folder where you have your json folders, simply add a folder cpp-java-python.with_comments and copy the codes and vocab files into it. The pipeline will detect them and skip the BPE training step.

Pretrain a model with MLM

Example:

python XLM/train.py 
--n_heads 8 
--bt_steps '' 
--max_vocab 64000 
--word_mask_keep_rand '0.8,0.1,0.1' 
--word_blank 0 
--data_path 'path_to_TransCoder_folder/data/XLM-cpp-java-python-with-comments' 
--save_periodic 0 
--bptt 512 
--lambda_clm 1 
--ae_steps '' 
--fp16 true 
--share_inout_emb true 
--lambda_mlm 1 
--sinusoidal_embeddings false 
--word_shuffle 0 
--mlm_steps 'cpp,java,python' 
--attention_dropout 0 
--split_data false 
--length_penalty 1 
--max_epoch 100000 
--stopping_criterion '_valid_mlm_ppl,10' 
--lambda_bt 1 
--dump_path '/output_folder_path' 
--lambda_mt 1 
--epoch_size 100000 
--early_stopping false 
--gelu_activation false 
--n_layers 6 
--optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0003,weight_decay=0.01' 
--validation_metrics _valid_mlm_ppl 
--eval_bleu false 
--dropout '0.1' 
--mt_steps '' 
--reload_emb '' 
--batch_size 32 
--context_size 0 
--word_dropout 0 
--reload_model '' 
--min_count 0 
--lgs 'cpp-java-python' 
--sample_alpha 0 
--word_pred '0.15' 
--amp 2 
--max_batch_size 0 
--clip_grad_norm 5 
--emb_dim 1024 
--encoder_only true 
--beam_size 1 
--clm_steps '' 
--exp_name mlm_cpp_java_python_with_coms 
--lambda_ae 1 
--lg_sampling_factor '-1' 
--eval_only false

Train a model with the denoising auto-encoder and back-translation objectives

Example:

python XLM/train.py 
--n_heads 8 
--bt_steps 'python_sa-cpp_sa-python_sa,cpp_sa-python_sa-cpp_sa,java_sa-cpp_sa-java_sa,cpp_sa-java_sa-cpp_sa,python_sa-java_sa-python_sa,java_sa-python_sa-java_sa' 
--max_vocab '-1' 
--word_mask_keep_rand '0.8,0.1,0.1' 
--gen_tpb_multiplier 1 
--word_blank '0.1' 
--n_layers 6 
--save_periodic 1 
--dump_path '/output_folder_path' 
--max_len 512 
--bptt 256 
--lambda_clm 1 
--ae_steps 'cpp_sa,python_sa,java_sa' 
--fp16 true 
--share_inout_emb true 
--lambda_mlm 1 
--sinusoidal_embeddings false 
--mlm_steps '' 
--word_shuffle 3 
--tokens_per_batch 6000 
--has_sentences_ids true 
--attention_dropout 0 
--split_data false 
--length_penalty 1 
--max_epoch 10000000 
--stopping_criterion '' 
--lambda_bt 1 
--generate_hypothesis true 
--lambda_mt 1 
--epoch_size 30000 
--data_path 'path_to_TransCoder_folder/data/XLM-cpp-java-python-with-comments' 
--gelu_activation false 
--split_data_accross_gpu global 
--optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' 
--eval_computation true 
--validation_metrics '' 
--eval_bleu true 
--dropout '0.1' 
--mt_steps '' 
--reload_emb '' 
--batch_size 32 
--context_size 0 
--word_dropout '0.1' 
--reload_model 'path_to_MLM_checkpoint,path_to_MLM_checkpoint' 
--min_count 0 
--eval_bleu_test_only false 
--group_by_size true 
--early_stopping false 
--sample_alpha 0 
--word_pred '0.15' 
--amp 2 
--max_batch_size 128 
--clip_grad_norm 5 
--emb_dim 1024 
--encoder_only false 
--lgs 'cpp_sa-java_sa-python_sa' 
--clm_steps '' 
--exp_name bt_with_comments_sa_final_modif_test 
--beam_size 1 
--lambda_ae '0:1,100000:0.1,300000:0' 
--lg_sampling_factor '-1' 
--eval_only false

Train on multiple GPUs

To train a model on multiple GPUs, replace python XLM/train.py with:

export NGPU=2; python -m torch.distributed.launch --nproc_per_node=$NGPU XLM/train.py

Validation and Test Sets Release

We release our validation and test dataset. You can download the raw data here.

The format of each line in each file is <FUNCTION_ID> | <function>. The functions are tokenized; you can detokenize them with the script preprocessing/detokenize.py. You can extract the function ID and use it to find the corresponding test script in data/evaluation/geeks_for_geeks_successful_test_scripts/<language>, if it exists.

For instance, for the line COUNT_SET_BITS_IN_AN_INTEGER_3 | <function> in the file test.cpp.shuf.valid.tok, the corresponding test script can be found in data/evaluation/geeks_for_geeks_successful_test_scripts/cpp/COUNT_SET_BITS_IN_AN_INTEGER_3.cpp. If the script is missing, it means there was an issue with our automatically created tests for the corresponding function.

The code generated by your model can be tested by injecting it where the TO_FILL comment is in the test script.
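
As an illustration, the sketch below parses one released line, looks up the matching test script, and splices a generated function over the TO_FILL comment. The file names and the exact marker handling are assumptions; adapt them to the files you downloaded.

from pathlib import Path

SCRIPT_DIR = Path('data/evaluation/geeks_for_geeks_successful_test_scripts/cpp')

# One line of the released data: "<FUNCTION_ID> | <tokenized function>".
line = Path('test.cpp.shuf.valid.tok').read_text().splitlines()[0]
function_id, tokenized_function = (part.strip() for part in line.split(' | ', 1))

script_path = SCRIPT_DIR / f'{function_id}.cpp'
if script_path.exists():
    generated_code = '...'  # detokenized output of your model for this function
    # Replace the line containing the TO_FILL marker with the generated code.
    filled = '\n'.join(generated_code if 'TO_FILL' in row else row
                       for row in script_path.read_text().splitlines())
    Path(f'{function_id}.filled.cpp').write_text(filled)
else:
    print(f'No automatic test script for {function_id}')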

Little guide to download Github from Google Big Query

  • Create a Google Cloud Platform account (you get around $300 of free credit, which is sufficient for GitHub)
  • Create a Google Big Query project here
  • In this project, create a dataset
  • In this dataset, create one table per programming language. The results of each SQL request (one per language) will be stored in these tables.
  • Before running your SQL request, make sure you change the query settings to save the query results in the dedicated table (more -> Query Settings -> Destination -> table for query results -> put table name)
  • Run your SQL request (one per language, and don't forget to change the destination table for each request)
  • Export your results to Google Cloud Storage:
    • In Google Cloud Storage, create a bucket and one folder per language in it
    • Export your table to this bucket (EXPORT -> Export to GCS -> export format JSON, compression GZIP)
  • To download the bucket to your machine, use the gsutil tool:
    • pip install gsutil
    • gsutil config to configure gsutil with your Google account
    • gsutil -m cp -r gs://name_of_bucket/name_of_folder . to copy your bucket onto your machine

Example query for Python:

SELECT 
 f.repo_name,
 f.ref,
 f.path,
 c.copies,
 c.content
FROM `bigquery-public-data.github_repos.files` as f
  JOIN `bigquery-public-data.github_repos.contents` as c on f.id = c.id
WHERE 
  NOT c.binary
  AND f.path like '%.py'

For more info, see the Google documentation here.
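
After downloading, you can sanity-check one exported shard: each line should be a JSON object with the fields selected in the query above (repo_name, ref, path, copies, content). A minimal sketch, with an illustrative file name:

import gzip
import json

# Peek at the first record of one gzipped JSON export to confirm that the
# fields the preprocessing pipeline relies on are present. The path is an example.
with gzip.open('data/test_dataset/python/000000000000.json.gz', 'rt') as f:
    first = json.loads(next(f))
    print(first['repo_name'], first['path'], len(first['content']), 'characters')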

References

This code was used to train and evaluate the TransCoder model. The paper was published at NeurIPS 2020:

[1] B. Roziere*, M.A. Lachaux*, L. Chanussot, G. Lample. Unsupervised Translation of Programming Languages.

* Equal Contribution

@article{roziere2020unsupervised,
  title={Unsupervised translation of programming languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Chanussot, Lowik and Lample, Guillaume},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

License

TransCoder is released under the Creative Commons Attribution-NonCommercial 4.0 International license. See LICENSE for more details.

transcoder's People

Contributors

brozi, malachaux, quentinn42


transcoder's Issues

How to get parallel data to train AE and BT

Hi,
I am trying to preprocess another programming language to train a new model, but I cannot figure out how to get parallel data for training AE & BT, e.g. test.python_sa-cpp_sa.pth. I would appreciate it very much if you could help me.

how to run translate.py without GPU?

Is it possible to run using only CPU? I have tried to run it by changing all instances of

device='cuda:0'

to device='cpu'

or setting CUDA_VISIBLE_DEVICES="" etc

but I am nevertheless not able to get rid of all the errors, regardless of what I try. What's the easy way to run it on a CPU-only machine?

How to get word embeddings from the trained TransCoder model?

I am trying to extract word embeddings for various tokenized (.tok) files. I have preprocessed the datasets using the preprocessing pipeline suggested in TransCoder. I have also trained the model, and I can use the pretrained TransCoder model to extract the embedding matrix and the embedding vectors of the tokens in a tokenized file.
The authors plotted a t-SNE visualization of cross-lingual token embeddings, obtained by encoding programming language tokens through TransCoder's lookup table.
Can the authors explain how they did that? I also want to extract embeddings for these tokens.

Training time

Hello. How much time did it take to train the model on 32 V100 GPUs?

ablation on datasize

Hi, I appreciate your great work on code translation!
I wonder if you have done an ablation study on the data size, since the unsupervised model needs far more training data (over 500M functions) than existing code PLMs such as CodeT5 (8.35M functions).
How does TransCoder perform if less data is provided?

How to run preprocess 

I ran the command as instructed, but it doesn't work.

python -m preprocessing.preprocess ./data/test_dataset --lang1 java --lang2 python --lang3 cpp --keep_comments True --bpe_train_size 0 --test_size 10 --local True

cpp: process ...
cpp: tokenizing 2 json files ...
java: process ...
java: tokenizing 2 json files ...
100% 50/50 [00:01<00:00, 29.15it/s]
100% 50/50 [00:01<00:00, 40.85it/s]
java: split train, test and valid ...
python: process ...
python: tokenizing 2 json files ...
100% 50/50 [00:03<00:00, 13.14it/s]
100% 50/50 [00:04<00:00, 12.36it/s]
python: split train, test and valid ...
100% 100/100 [00:18<00:00, 5.47it/s]
100% 150/150 [00:10<00:00, 14.43it/s]
cpp: split train, test and valid ...
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "TransCoder/preprocessing/src/dataset.py", line 87, in process
nlines, size_gb = job.result()
File "TransCoder/preprocessing/src/utils.py", line 263, in result
self._result = self.func(*self.args, **self.kwargs)
File "TransCoder/preprocessing/src/dataset.py", line 51, in split_train_test_valid
size_gb = all_tok.stat().st_size
File "/usr/lib/python3.6/pathlib.py", line 1158, in stat
return self._accessor.stat(self)
File "/usr/lib/python3.6/pathlib.py", line 387, in wrapped
return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: 'data/test_dataset/cpp/all.with_comments.tok'
"""

what is all.with_comments.tok?
Where is all.with_comments.tok located?

add online test

I would like to be able to demo/test or use this online, possibly via Emscripten. I was hoping someone has some URLs to send code to.

extracting docstring

I was trying to extract docstrings using the preprocessing pipeline, but it seems this is not implemented. Is that correct? Do you plan to implement docstring extraction?

Showing error about transcoder

The error below is shown:
Traceback (most recent call last):
  File "TransCoder/translate.py", line 171, in <module>
    translator = Translator(params)
  File "TransCoder/translate.py", line 83, in __init__
    encoder, decoder = build_model(self.reloaded_params, self.dico)
  File "/content/TransCoder/XLM/src/model/__init__.py", line 181, in build_model
    enc_path, map_location=lambda storage, loc: storage.cuda(params.local_rank))
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 830, in restore_location
    result = map_location(storage, location)
  File "/content/TransCoder/XLM/src/model/__init__.py", line 181, in <lambda>
    enc_path, map_location=lambda storage, loc: storage.cuda(params.local_rank))
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 71, in _cuda
    with torch.cuda.device(device):
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 225, in __enter__
    self.prev_idx = torch.cuda.current_device()
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 432, in current_device
    _lazy_init()
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

test_dataset.py missing

In the readme it says the following:

"To test the pipeline run pytest preprocessing/test_dataset.py, you will see the pipeline output in data/test_dataset folder."

I'm unable to find the test_dataset.py file. Am I missing something?

Which part of the code should I modify when I have fewer than 8 GPUs?

After preprocessing, I got the files train.{lang}.[0..7].pth. But when I try to pretrain a model with MLM, I get an error saying 'no file called train.{lang}.pth exists'.
I think that might be because I have to regroup my dataset into a single file, but which part of the code should I modify, or which extra parameters should I pass, to achieve that?

Multiple translation

Is there any way to modify the translate.py file to translate a batch of code files to a target language using this model?

"AttributeError: type object 'Callable' has no attribute '_abc_registry'" always shows

I installed the packages mentioned here https://github.com/facebookresearch/TransCoder#dependencies
When I run

python translate.py --src_lang java --tgt_lang python --model_path trained_model.pth < all.java

It shows

Traceback (most recent call last):
  File "translate.py", line 22, in <module>
    import torch
  File "/opt/conda/lib/python3.7/site-packages/torch/__init__.py", line 280, in <module>
    from .functional import *
  File "/opt/conda/lib/python3.7/site-packages/torch/functional.py", line 2, in <module>
    import torch.nn.functional as F
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F401
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/__init__.py", line 2, in <module>
    from .linear import Identity, Linear, Bilinear
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 5, in <module>
    from .. import functional as F
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 15, in <module>
    from .._jit_internal import boolean_dispatch, List
  File "/opt/conda/lib/python3.7/site-packages/torch/_jit_internal.py", line 257, in <module>
    import typing
  File "/home/jovyan/lib/python3/site-packages/typing.py", line 1359, in <module>
    class Callable(extra=collections_abc.Callable, metaclass=CallableMeta):
  File "/home/jovyan/lib/python3/site-packages/typing.py", line 1007, in __new__
    self._abc_registry = extra._abc_registry
AttributeError: type object 'Callable' has no attribute '_abc_registry'

And no matter what I type on the command line, 'AttributeError: type object 'Callable' has no attribute '_abc_registry'' always shows.
Following solutions found on Google, I uninstalled the typing package by deleting typing.py, because the same error shows when I use pip uninstall typing. The new error looks like this:

Traceback (most recent call last):
  File "translate.py", line 24, in <module>
    import preprocessing.src.code_tokenizer as code_tokenizer
  File "/home/jovyan/TransCoder-master/preprocessing/src/code_tokenizer.py", line 21, in <module>
    from sacrebleu import tokenize_v14_international
ImportError: cannot import name 'tokenize_v14_international' from 'sacrebleu' (/home/jovyan/lib/python3/site-packages/sacrebleu/__init__.py)

Then I lowered the version of sacrebleu (a solution from Google), and 'AttributeError: type object 'Callable' has no attribute '_abc_registry'' shows again...

I wonder how to solve this problem. Thanks!

pipeline run error

I ran the pipeline on new data for python, java and cpp obtained with this BigQuery query:

SELECT
f.repo_name,
f.ref,
f.path,
c.content
FROM bigquery-public-data.github_repos.files as f
JOIN bigquery-public-data.github_repos.contents as c on f.id = c.id
WHERE
NOT c.binary
AND f.path like '%.py' #.cpp, .java
AND c.content IS NOT NULL
limit 50000

But when I run the pipeline, I get an error and all the .tok files have no data at all.

File "/home/kit/transcoder/preprocessing/src/utils.py", line 178, in learn_bpe_file
assert process.returncode == 0, f"failed to learn bpe on {str(file_path)}"

If I use the sample data that comes with the GitHub repo and run the pipeline, there is data in the *.tok files.

What is wrong with my query?
Do you have the query that you used to create the sample data in /transcoder/data/test_dataset?

Thanks

regroup and select data for training bpe in /home/kit/transcoder/data/test_dataset/cpp-java-python.with_comments/train.with_comments.tok.NoneGB ...
adding 0 lines of /home/kit/transcoder/data/test_dataset/cpp/train.with_comments.2.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/cpp/train.with_comments.6.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/cpp/train.with_comments.5.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/cpp/train.with_comments.4.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/cpp/train.with_comments.1.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/cpp/train.with_comments.3.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/cpp/train.with_comments.7.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/cpp/train.with_comments.0.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/java/train.with_comments.2.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/java/train.with_comments.6.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/java/train.with_comments.5.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/java/train.with_comments.4.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/java/train.with_comments.1.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/java/train.with_comments.3.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/java/train.with_comments.7.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/java/train.with_comments.0.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/python/train.with_comments.2.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/python/train.with_comments.6.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/python/train.with_comments.5.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/python/train.with_comments.4.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/python/train.with_comments.1.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/python/train.with_comments.3.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/python/train.with_comments.7.tok
adding 0 lines of /home/kit/transcoder/data/test_dataset/python/train.with_comments.0.tok
training bpe on /home/kit/transcoder/data/test_dataset/cpp-java-python.with_comments/train.with_comments.tok.NoneGB...
Traceback (most recent call last):
File "/home/kit/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/kit/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/kit/transcoder/preprocessing/preprocess.py", line 113, in
preprocess(args.root, args.lang1, args.lang2, args.keep_comments, args.local,
File "/home/kit/transcoder/preprocessing/preprocess.py", line 66, in preprocess
dataset.train_bpe(ncodes=ncodes, size_gb=size_gb)
File "/home/kit/transcoder/preprocessing/src/dataset.py", line 195, in train_bpe
learn_bpe_file(data_train_bpe, ncodes, self.codes)
File "/home/kit/transcoder/preprocessing/src/utils.py", line 178, in learn_bpe_file
assert process.returncode == 0, f"failed to learn bpe on {str(file_path)}"
AssertionError: failed to learn bpe on /home/kit/transcoder/data/test_dataset/cpp-java-python.with_comments/train.with_comments.tok.NoneGB

Why Truncate Test and Validation Data

Hi, I'm just wondering why we want to truncate the test and validation data to the length of the shortest possible line. It seems like this would result in translation of incomplete code?

for split in ['test', 'valid']:
    for f_type in ['functions_standalone', 'functions_class']:
        truncate_files(l.folder.joinpath(
            f'{split}{self.suffix}.{f_type}.tok') for l in self.langs)

Scores do not match with paper

Hi @malachaux, I ran the evaluation on the provided test and validation sets using both the model_1.pth and model_2.pth checkpoints, but the scores do not match the paper. What could be the possible reason for this?

Thanks,
Kunal Pagarey

File "/transcoder/XLM/src/data/loader.py", line 313, in check_data_params assert all([all([os.path.isfile(p) or os.path.isfile(p.replace('pth', '0.pth'))

Got this error: File "/transcoder/XLM/src/data/loader.py", line 313, in check_data_params
assert all([all([os.path.isfile(p) or os.path.isfile(p.replace('pth', '0.pth'))

when running train.py with --eval_only false, using this command:

python XLM/train.py --n_heads 8 --bt_steps '' --max_vocab 64000 --word_mask_keep_rand '0.8,0.1,0.1' --word_blank 0 --data_path '/home/kit/transcoder/data/XLM-cpp-java-python-with-comments' --save_periodic 0 --bptt 512 --lambda_clm 1 --ae_steps '' --fp16 true --share_inout_emb true --lambda_mlm 1 --sinusoidal_embeddings false --word_shuffle 0 --mlm_steps 'cpp,java,python' --attention_dropout 0 --split_data false --length_penalty 1 --max_epoch 100000 --stopping_criterion '_valid_mlm_ppl,10' --lambda_bt 1 --dump_path '/home/kit/transcoder/output' --lambda_mt 1 --epoch_size 100000 --early_stopping false --gelu_activation false --n_layers 6 --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0003,weight_decay=0.01' --validation_metrics _valid_mlm_ppl --eval_bleu false --dropout '0.1' --mt_steps '' --reload_emb '' --batch_size 1 --context_size 0 --word_dropout 0 --reload_model '' --min_count 0 --lgs 'cpp-java-python' --sample_alpha 0 --word_pred '0.15' --amp 2 --max_batch_size 0 --clip_grad_norm 5 --emb_dim 1024 --encoder_only true --beam_size 1 --clm_steps '' --exp_name mlm_cpp_java_python_with_coms --lambda_ae 1 --lg_sampling_factor '-1' --eval_only false

"None option is not viable in processing.py"

Hello,

The readme states that one can select the third language as None in the preprocessing pipeline. However, this option does not work: running the pipeline on your test data with lang1=python, lang2=cpp, lang3=None returns:

Traceback (most recent call last):
  File "/private/home/lawrencestewart/.conda/envs/transcoder/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/private/home/lawrencestewart/.conda/envs/transcoder/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/private/home/lawrencestewart/DeepTyper/preprocessing/preprocess.py", line 154, in <module>
    lang3=args.lang3, size_gb=args.bpe_train_size, test_size=args.test_size)
  File "/private/home/lawrencestewart/DeepTyper/preprocessing/preprocess.py", line 51, in preprocess
    test_size=test_size, lang3=lang3)
  File "/private/home/lawrencestewart/DeepTyper/preprocessing/src/dataset.py", line 148, in __init__
    self.lang1 = Language(root, lang1)
  File "/private/home/lawrencestewart/DeepTyper/preprocessing/src/dataset.py", line 25, in __init__
    ), f"failed to initalize Language {self.l}, there is no directory {str(self.folder)}"
AttributeError: 'Language' object has no attribute 'l'

Not high priority but I thought it would be of interest! Cheers :)

Multi-bleu.perl file not found

In evaluator.py, the code now tries to download a multi-bleu.perl from BLEU_SCRIPT_URL = 'https://raw.githubusercontent.com/facebookresearch/XLM/master/src/evaluation/multi-bleu.perl'. However, there isn't such a file in this repo. I found multi-bleu.perl from the Moses repo here. Is there any difference between the TransCoder one and the original Moses one?

How to get parallel dataset from already shared raw tokenized data ?

Hi, I have looked into the raw tokenized parallel data, which is in .tok format, downloaded from https://dl.fbaipublicfiles.com/transcoder/TransCoder_tokenized_test_set_functions.zip. It seems the same methods are written in all 3 languages: C++, Python and Java. I need to know the generation process of the binarized .pth files like "python_sa-cpp_sa-python_sa", "cpp_sa-python_sa-cpp_sa", ...
Please help. Any help would be much appreciated.

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

After following the instructions for downloading GitHub data from Google BigQuery, I encountered the following issue while running the pipeline. The exception is raised, but the pipeline still runs for a while before it stops with no other error.
I am kind of a noob in Python; I tried to print out s and err.value in the raw_decode function, but nothing was printed.
I am using Python 3.7.

Has anyone else encountered the same issue?

root@56c7f8eec716:/ghome/yuy/transcoder/TransCoder# . preprocess_test.sh
cpp: process ...
java: process ...
python: process ...
cpp: tokenizing 189 json files ...
s= repo_name,ref,path,copies,content
err.value= 0
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/python-3.7.5/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/ghome/yuy/transcoder/TransCoder/preprocessing/src/dataset.py", line 71, in process
self.process_json_and_tok(keep_comments, tok_executor)
File "/ghome/yuy/transcoder/TransCoder/preprocessing/src/dataset.py", line 36, in process_json_and_tok
job.result()
File "/ghome/yuy/transcoder/TransCoder/preprocessing/src/utils.py", line 278, in result
self._result = self.func(*self.args, **self.kwargs)
File "/ghome/yuy/transcoder/TransCoder/preprocessing/src/utils.py", line 81, in process_and_tokenize_json_file
x = json.loads(line)
File "/usr/local/python-3.7.5/lib/python3.7/json/init.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/local/python-3.7.5/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/python-3.7.5/lib/python3.7/json/decoder.py", line 358, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/python-3.7.5/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/local/python-3.7.5/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/ghome/yuy/transcoder/TransCoder/preprocessing/preprocess.py", line 119, in
lang3=args.lang3, size_gb=args.bpe_train_size, test_size=args.test_size)
File "/ghome/yuy/transcoder/TransCoder/preprocessing/preprocess.py", line 65, in preprocess
lang_executor=mp_executor, tok_executor=cluster_ex1, split_executor=cluster_ex2)
File "/ghome/yuy/transcoder/TransCoder/preprocessing/src/dataset.py", line 165, in process_languages
self.sizes[lang.l] = jobs[i].result()
File "/usr/local/python-3.7.5/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/usr/local/python-3.7.5/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

signal only works in main thread

I got this error
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2464, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2450, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/flask_cors/extension.py", line 161, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1867, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/flask_cors/extension.py", line 161, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/TransCoder/server.py", line 115, in translate
    result = run_model_1(input_string, check_source_value, check_target_value, userDirName)
  File "/TransCoder/server.py", line 145, in run_model_1
    output = translator1.translate(input_dir, src_lang, tgt_lang)
  File "/TransCoder/translate.py", line 115, in translate
    tokens = [t for t in tokenizer(input_file)]
  File "/TransCoder/preprocessing/src/code_tokenizer.py", line 338, in tokenize_cpp
    tokens_and_types = get_cpp_tokens_and_types(s)
  File "/TransCoder/preprocessing/src/timeout.py", line 32, in wrapper
    signal.SIGALRM, partial(_handle_timeout, 0))
  File "/usr/lib/python3.6/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

I made a Flask server, then uploaded a cpp file and converted it to Java. What's the problem?

How to download github repos from BigQuery?

Can you provide some details on how to download the data in JSON format? It seems we can download a fraction of the data free of cost. However, it is not easy to understand from the documentation how to download the data in JSON format, so some details would be very helpful.

Data path (os.path.isdir(path) returning false when it exists) using FB XLM

I'm trying to run an evaluation with the below configuration,

python XLM/train.py
--n_heads 8
--bt_steps 'python_sa-cpp_sa-python_sa,cpp_sa-python_sa-cpp_sa,java_sa-cpp_sa-java_sa,cpp_sa-java_sa-cpp_sa,python_sa-java_sa-python_sa,java_sa-python_sa-java_sa' # The evaluator will use this parameter to infer the languages to test on
--max_vocab '-1'
--word_blank '0.1'
--n_layers 6
--generate_hypothesis true
--max_len 512
--bptt 256
--fp16 true
--share_inout_emb true
--tokens_per_batch 6000
--has_sentences_ids true
--eval_bleu true
--split_data false
--data_path '/content/drive/MyDrive/TransCoder/repository/TransCoder/data/XLM-cpp-java-python-with-comments/'
--eval_computation true
--batch_size 32
--reload_model 'model_1.pth,model_1.pth'
--amp 2
--max_batch_size 128
--ae_steps 'cpp_sa,python_sa,java_sa'
--emb_dim 1024
--eval_only True
--beam_size 10
--retry_mistmatching_types 1
--dump_path '/tmp/'
--exp_name='eval_final_model_wc_30'
--lgs 'cpp_sa-java_sa-python_sa'
--encoder_only=False

Looks like the data_path param is not getting updated when passed as an argument; because of this, the assertion in loader.py fails:

assert os.path.isdir(params.data_path), params.data_path

Train Dataset

Can you provide the monolingual data in a form that can be downloaded directly (without using Google Cloud)?

RuntimeError: CUDA error: device-side assert triggered

Traceback (most recent call last):
File "TransCoder/translate.py", line 179, in
input, lang1=params.src_lang, lang2=params.tgt_lang, beam_size=params.beam_size)
File "TransCoder/translate.py", line 138, in translate
min(self.reloaded_params.max_len, 3 * len1.max().item() + 10)),
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

C++ (Boost + OpenMP) to Python

Hi!
I have a C++ project that uses the Boost library and OpenMP #pragmas, spread across several C++ files.
Is it possible to convert this C++ project to Python? And if it is possible, what should I do in this case?

AssertionError: The path to the BPE tokens is incorrect

Hi,

I'm running the following command:
python translate.py --src_lang cpp --tgt_lang python --model_path model_2.pth < code_to_be_translated.cc
But I get the error:

Traceback (most recent call last):
  File "TransCoder/translate.py", line 166, in <module>
    params.BPE_path), f"The path to the BPE tokens is incorrect: {params.BPE_path}"
AssertionError: The path to the BPE tokens is incorrect: data/BPE_with_comments_codes

I installed the dependencies, including fastBPE, following the readme commands (in XLM/tools).
Should I provide a BPE_path in my command? I didn't provide any, since none is provided in the readme example. If yes, what is the BPE_path that should be given (what to install, where to find it, ...)?

Thanks !

Error on line 43 for translation

I got the error below when trying to run the translation command (python translate.py --src_lang python --tgt_lang java --model_path model_1.pth < count_ten.py). Please advise what causes it and how to fix it.

File "translate.py", line 43
help=f"Source language, should be either {', '.join(SUPPORTED_LANGUAGES[:-1])} or {SUPPORTED_LANGUAGES[-1]}")
^
SyntaxError: invalid syntax

Effective batch size and number of epochs on the full data

Hi, I am curious what the effective batch size was in your experiments. Does the batch size impact training stability? In the paper, you mention that the batch size was set to 32 (sequences of length 512) and that 32 V100 GPUs were used. So, does that mean the effective batch size was 1024 (32x32)?

Also, it seems you trained for a maximum of 100,000 steps on MLM. Since you have more than 700M functions in 3 languages, were you able to pass through the entire data once with 100,000 max steps? My calculation says that with 100k steps and an effective batch size of 1024, you covered around 100M examples (number of functions). May I know approximately how many epochs were completed on the whole data with your training setup?

AssertionError: failed to binarize for XLM the file

Hi,

When I try to run the pipeline, I get this AssertionError indicating that it failed to binarize a file for XLM:
AssertionError: failed to binarize for XLM the file

I don't know why this happens, and I tried to read the source code but couldn't find a clue. Can anyone help me out?

Thank you.

Just a small readme error

Hi,

just a small thing. In the readme it says:

If you run the data preprocessing pipeline, you will have to compile fastBPE. Go in preprocessing/tools/fastBPE and run: g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

The correct path would have to be XLM/tools/fastBPE, and you would have to carry out the steps described in XLM/tools/README.md first.

Thank you very much for uploading the code!

no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp

I ran translation using this command:

python translate.py --src_lang cpp --tgt_lang java --model_path model_1.pth < ADD_1_TO_A_GIVEN_NUMBER.cpp

I am using VMware with Ubuntu 16.04. I got this error:

RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:47

How can I resolve this issue?
Can I run it on a VM, or do I have to run it on a physical machine with a GPU installed?

Is there any installation documentation available, such as OS version, package versions, etc.?

Thanks

OSError: /usr/lib/llvm-7/lib//libclang-11.so

I am trying to set up the code in a Google Cloud VM. I have installed all the packages mentioned on the dependencies page, and I have installed the latest Cython version (0.29.21). I am getting the error below when I try to run the translate.py file.
Command run:

python translate.py --src_lang cpp --tgt_lang python --model_path model_2.pth < input_code.cpp

Error -

Traceback (most recent call last):
  File "/home/sudheerbabu_pakanati/transcoder/lib/python3.7/site-packages/clang/cindex.py", line 4173, in get_cindex_library
    library = cdll.LoadLibrary(self.get_filename())
  File "/usr/lib/python3.7/ctypes/__init__.py", line 434, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.7/ctypes/__init__.py", line 356, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/llvm-7/lib//libclang-11.so: cannot open shared object file: No such file or directory

TransCoder on C++

Please, could someone transcode TransCoder and its Python dependencies to C++?

Using the encoder for a downstream task

Thank you very much for your excellent work!

In the file "translate.py" line 128:

enc1 = self.encoder('fwd', x=x1, lengths=len1,
                                langs=langs1, causal=False) 

I noticed that when the shape of x1 is (n, 1), the shape of enc1 is (1, n, 1024), where n is the number of input tokens (len1).

My question is about enc1: does it represent the sequence of hidden states at the output of the last layer of the encoder model?

For example, can I use the encoder output enc1 as an input to a bidirectional LSTM network to perform some kind of source code classification, or is there a better way?

Moreover, the decoder takes the enc1 as input along with len1, and target language as follows

self.decoder.generate(enc1, len1, lang2_id, ...

Accordingly, I assume that the decoder network maintains its Q, K, V weights to learn how to attend to enc1 of shape (1, n, 1024), which represents the input sequence of length n; in this case, the enc1 vectors represent the values V, right?

Best Regards

Java Errors

Thank you for providing all the code/data!

I have one question regarding the translations to Java. When I run the evaluation, I either obtain success (only in the case where the translation is exact and thus not executed at all) or error: /bin/bash: module: command not found. Since I obtain the latter as soon as something is executed, it must be the general command for Java (e.g. the Python translations run fine). Do you maybe have an idea what I might be missing (in the system configuration or similar)?
