kolkir / cppminer

cppminer produces code2seq-compatible datasets from C++ code bases.

License: MIT License

C++ 4.03% Python 88.58% CMake 0.60% Shell 6.79%

cppminer's Introduction

cppminer

cppminer produces code2seq-compatible datasets from C++ code bases.

Experimental C++ dataset mined from the Chromium project sources.

This tool consists of three scripts, which should be run in sequence.

1. Miner

miner.py is the main utility. It traverses C++ sources, parses them, and produces raw dataset files.

It has the following command-line interface:

usage: miner.py [-h] [-c contexts-number] [-l path-length] [-d ast-depth] [-p processes-number] [-e libclang-path] path out

positional arguments:
  path                  the path sources directory
  out                   the output path

optional arguments:
  -h, --help            show this help message and exit
  -c contexts-number, --max_contexts_num contexts-number
                        maximum number of contexts per sample
  -l path-length, --max_path_len path-length
                        maximum path length (0 - no limit)
  -d ast-depth, --max_ast_depth ast-depth
                        maximum depth of AST (0 - no limit)
  -p processes-number, --processes_num processes-number
                        number of parallel processes
  -e libclang-path, --libclang libclang-path
                        path to libclang.so file

The input path is traversed recursively, and all files with the extensions c, cc, and cpp are parsed. It is recommended to use a C++ compilation database, which provides all the compilation flags required for the project files.
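
The traversal described above can be sketched in a few lines. This is only an illustration of a recursive search by extension, not the code miner.py actually uses; the function name find_sources and the constant SOURCE_EXTENSIONS are hypothetical.

```python
from pathlib import Path

# Extensions listed in the README; the constant name is hypothetical.
SOURCE_EXTENSIONS = {".c", ".cc", ".cpp"}


def find_sources(root):
    """Return all files under root (recursively) with a matching extension."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in SOURCE_EXTENSIONS
    )


if __name__ == "__main__":
    for source in find_sources("."):
        print(source)
```

Header files are skipped by this filter, which matches the extension list given above.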

The raw dataset files produced by the miner have the following format:

Each row is an example.
Each example is a space-delimited list of fields, where:
1. The first field is the target label, internally delimited by the "|" character (for example: compare|ignore|case)
2. Each of the following fields is a context, where each context has three components separated by commas (","). None of these components can include spaces or commas.

A context's components are a token, a path, and another token.

Each token component is a token in the code, split to subtokens using the "|" character.

Each path is a path between two tokens, split to path nodes using the "|" character. Example for a context:

my|key,StringExression|MethodCall|Name,get|value

Here my|key and get|value are tokens, and StringExression|MethodCall|Name is the syntactic path that connects them.
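
As a minimal sketch of how a row in this format can be read back, the snippet below splits one example into its target label and its contexts. The function name parse_example is hypothetical and is not part of cppminer; the sample row is built from the examples given above.

```python
def parse_example(line):
    """Split one dataset row into (label_subtokens, contexts).

    Each context is a (start_token, path, end_token) triple, and every
    element of the triple is a list of "|"-separated parts.
    """
    fields = line.strip().split(" ")
    label = fields[0].split("|")
    contexts = []
    for field in fields[1:]:
        if not field:
            continue  # tolerate empty fields between consecutive spaces
        start, path, end = field.split(",")
        contexts.append((start.split("|"), path.split("|"), end.split("|")))
    return label, contexts


row = "compare|ignore|case my|key,StringExression|MethodCall|Name,get|value"
label, contexts = parse_example(row)
print(label)     # target label sub-tokens
print(contexts)  # list of (token, path, token) triples
```

The first space-delimited field is always the target label; everything after it is a context.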

2. Merge

merge.py is the utility that concatenates all raw files, shuffles them, and produces three files, dataset.train.c2s, dataset.test.c2s and dataset.val.c2s, in the given directory. It can also clear the resource files after merging. An important setting is map_file_size, which defines the size of the database file used for merging; increase the default value of 6 GiB (6442450944 bytes) for large datasets.

merge.py has the following command-line interface:

usage: merge.py [-h] [-c clear_resources_flag] [-m map_file_size] path

merge resources generated by cppminer to a code2seq dataset

positional arguments:
  path                  the dataset sources path

optional arguments:
  -h, --help            show this help message and exit
  -c clear_resources_flag, --clear_resources clear_resources_flag
                        if True clear resource files
  -m map_file_size, --map_size map_file_size
                        size of the DB file, default(6442450944 bytes)
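
The concatenate, shuffle, and split steps of the merge stage can be sketched as follows. This is a simplified in-memory illustration, not the implementation in merge.py, which works through an on-disk database file. The function names, the seed, and the 10% test and validation fractions are assumptions made for the example; only the three output file names come from the description above.

```python
import os
import random


def merge_and_split(raw_files, seed=0, test_fraction=0.1, val_fraction=0.1):
    """Concatenate raw files, shuffle the rows, and split them three ways.

    The split fractions are assumptions; the README does not state them.
    """
    samples = []
    for path in raw_files:
        with open(path, encoding="utf-8") as f:
            samples.extend(line.rstrip("\n") for line in f if line.strip())
    random.Random(seed).shuffle(samples)
    n_test = int(len(samples) * test_fraction)
    n_val = int(len(samples) * val_fraction)
    return {
        "test": samples[:n_test],
        "val": samples[n_test:n_test + n_val],
        "train": samples[n_test + n_val:],
    }


def write_splits(splits, out_dir):
    """Write each split to dataset.<name>.c2s inside out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for name, rows in splits.items():
        out_path = os.path.join(out_dir, f"dataset.{name}.c2s")
        with open(out_path, "w", encoding="utf-8") as f:
            f.writelines(row + "\n" for row in rows)
```

Holding every row in memory is only practical for small datasets, which is presumably why the real tool relies on a database file whose size is set with map_file_size.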

3. Code2seq preprocess

The third utility is preprocess.sh from the code2seq folder. It is a modified version of the script from the original project, and it generates a dataset in the format suitable for the code2seq model. In general, it creates new files in which the number of paths for each example is truncated or padded to a fixed size.
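
The truncation and padding idea can be illustrated with a short sketch. It assumes that surplus contexts are dropped by random sampling and that missing contexts are filled with empty space-delimited fields; both choices, and the function name fit_contexts, are assumptions for illustration and may differ from what preprocess.sh actually does.

```python
import random


def fit_contexts(line, max_contexts, rng=random):
    """Return the row with exactly max_contexts context fields.

    Rows with too many contexts are randomly subsampled; rows with too
    few are padded with empty fields (an assumption for this sketch).
    """
    fields = line.strip().split(" ")
    label, contexts = fields[0], fields[1:]
    if len(contexts) > max_contexts:
        contexts = rng.sample(contexts, max_contexts)
    padding = " " * (max_contexts - len(contexts))
    return " ".join([label] + contexts) + padding


if __name__ == "__main__":
    row = "get|value a,P|Q,b c,R,d e,S|T,f"
    print(repr(fit_contexts(row, 2, random.Random(0))))
    print(repr(fit_contexts(row, 5)))
```

After this step every row splits on spaces into the same number of fields, one label plus max_contexts contexts, some of which may be empty.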

cppminer's Issues

UnpicklingError

File "D:\Masters\industrialproject\C++attempt1\code2vec\code2vec\vocabularies.py", line 224, in _load_word_freq_dict
token_to_count = pickle.load(file)
_pickle.UnpicklingError: invalid load key, '\x1f'.

I am getting this error when I am trying to use the Chromium preprocessed data with Code2vec.

target label and context content

@Kolkir
Hi. I hope you are doing well. I am confused about the target label and the context path. After running miner.py, how can I differentiate between the target label and the context path? Also, if I have my own label, how can I use it?

Thanks

miner.py

Hey

When I run miner.py with an input path and an output path, I get the following error:

if len(self.samples) > 0:
AttributeError: 'AstParser' object has no attribute 'samples'

How to fix this?

ValueError: ctypes objects containing pointers cannot be pickled

Can anyone find a solution for this?

Traceback (most recent call last):
  File "/Users/lsiddiqsunny/Documents/cppminer/src/miner.py", line 144, in <module>
    main()
  File "/Users/lsiddiqsunny/Documents/cppminer/src/miner.py", line 119, in main
    p.start()
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
ValueError: ctypes objects containing pointers cannot be pickled

LIBCLANG TOOLING ERROR

When I run python3 miner.py -c 1000 ../data path/to on Ubuntu 16.04, I get this error. The output looks like this:
Parallel processes num: 4
Max contexts num: 1000
Max path length: 0
Max sub-tokens num: 5
Max AST depth: 0
Input path: /home/cleowang/code2seq/cppminer/data
Output path: /home/cleowang/code2seq/cppminer/src/path/to
Parsing files ...
LIBCLANG TOOLING ERROR: fixed-compilation-database: Error while opening fixed database: No such file or directory
json-compilation-database: Error while opening JSON database: No such file or directory

LIBCLANG TOOLING ERROR: fixed-compilation-database: Error while opening fixed database: No such file or directory
json-compilation-database: Error while opening JSON database: No such file or directory

How can I fix it? Thank you!

ModuleNotFoundError

Hi,

I am trying to train the model and I am getting this error message.

error:
No module named 'cppminer.cpp_parser'

I am not able to resolve the issue.