cppminer

cppminer produces code2seq-compatible datasets from C++ code bases.

Experimental C++ dataset mined from the Chromium project sources.

This tool consists of three scripts, which should be run in sequence.

1. Miner

miner.py is the main utility: it traverses C++ sources, parses them, and produces raw dataset files.

It has the following command-line interface:

usage: miner.py [-h] [-c contexts-number] [-l path-length] [-d ast-depth] [-p processes-number] [-e libclang-path] path out

positional arguments:
  path                  the path sources directory
  out                   the output path

optional arguments:
  -h, --help            show this help message and exit
  -c contexts-number, --max_contexts_num contexts-number
                        maximum number of contexts per sample
  -l path-length, --max_path_len path-length
                        maximum path length (0 - no limit)
  -d ast-depth, --max_ast_depth ast-depth
                        maximum depth of AST (0 - no limit)
  -p processes-number, --processes_num processes-number
                        number of parallel processes
  -e libclang-path, --libclang libclang-path
                        path to libclang.so file

The input path is traversed recursively, and all files with the extensions c, cc, and cpp are parsed. It is recommended to use a C++ compilation database, which provides all the compilation flags required for the project files.

The raw dataset files produced by the miner have the following format:
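The recursive traversal described above can be illustrated with a minimal Python sketch. It is not the code from miner.py; the function name `find_sources` and the case-insensitive extension matching are assumptions made for illustration.

```python
from pathlib import Path

# Extensions named in the text above. Case-insensitive matching is an
# assumption of this sketch, not a documented behavior of miner.py.
SOURCE_EXTENSIONS = {".c", ".cc", ".cpp"}


def find_sources(root):
    """Return every C/C++ source file under root, sorted for stable output."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SOURCE_EXTENSIONS
    )
```

Header files are skipped here because only the three extensions listed above are matched.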

  • Each row is an example.

  • Each example is a space-delimited list of fields, where:

    1. The first field is the target label, internally delimited by the "|" character (for example: compare|ignore|case)
    2. Each of the following fields is a context, where each context has three components separated by commas (","). None of these components may include spaces or commas.

A context's components are a token, a path, and another token.

Each token component is a token in the code, split into subtokens using the "|" character.

Each path is a path between two tokens, split into path nodes using the "|" character. Example of a context:

my|key,StringExpression|MethodCall|Name,get|value

Here my|key and get|value are tokens, and StringExpression|MethodCall|Name is the syntactic path that connects them.
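To make the row format concrete, here is a small Python sketch that splits one row into its target label and its contexts. The names `Context` and `parse_example` are hypothetical helpers for illustration and are not part of cppminer.

```python
from typing import List, NamedTuple


class Context(NamedTuple):
    start_token: List[str]  # sub-tokens of the first token
    path: List[str]         # AST node names along the path
    end_token: List[str]    # sub-tokens of the second token


def parse_example(line):
    """Split one dataset row into (label sub-tokens, list of contexts)."""
    fields = line.split()  # fields are space-delimited
    if not fields:
        raise ValueError("empty row")
    label = fields[0].split("|")  # the first field is the target label
    contexts = []
    for field in fields[1:]:
        parts = field.split(",")
        if len(parts) != 3:
            raise ValueError(f"malformed context: {field!r}")
        start, path, end = (part.split("|") for part in parts)
        contexts.append(Context(start, path, end))
    return label, contexts


label, contexts = parse_example(
    "compare|ignore|case my|key,StringExpression|MethodCall|Name,get|value"
)
print(label)             # ['compare', 'ignore', 'case']
print(contexts[0].path)  # ['StringExpression', 'MethodCall', 'Name']
```

The target label is always the first space-delimited field; everything after it is a context.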

2. Merge

merge.py concatenates all raw files, shuffles them, and writes three files, dataset.train.c2s, dataset.test.c2s, and dataset.val.c2s, into the given directory. It can also delete the raw resource files after merging. An important setting is map_file_size, which defines the size of the database file used for merging; increase the default value of 6 GiB for large datasets.

It has the following command-line interface:
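The shuffle-and-split step can be sketched in a few lines of Python. This is an in-memory illustration only: merge.py uses an on-disk database, and the split fractions, the seed, and the function names `split_rows` and `write_splits` are assumptions rather than the tool's actual values.

```python
import random
from pathlib import Path


def split_rows(rows, test_fraction=0.1, val_fraction=0.1, seed=0):
    """Shuffle rows and split them into train, test, and val lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded for reproducibility
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    return {
        "train": rows[n_test + n_val:],
        "test": rows[:n_test],
        "val": rows[n_test:n_test + n_val],
    }


def write_splits(splits, out_dir):
    """Write each split to dataset.<name>.c2s in out_dir, one row per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out_dir / f"dataset.{name}.c2s", "w", encoding="utf-8") as f:
            f.writelines(row + "\n" for row in rows)
```

Holding every row in memory does not scale to large code bases, which is presumably why the real utility merges through a database file.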

usage: merge.py [-h] [-c clear_resources_flag] [-m map_file_size] path

merge resources generated by cppminer to a code2seq dataset

positional arguments:
  path                  the dataset sources path

optional arguments:
  -h, --help            show this help message and exit
  -c clear_resources_flag, --clear_resources clear_resources_flag
                        if True clear resource files
  -m map_file_size, --map_size map_file_size
                        size of the DB file, default(6442450944 bytes)
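The default map size shown in the help text, 6442450944 bytes, is exactly 6 GiB. When choosing a larger value for the -m option, a small helper such as the one below can compute the byte count; `map_size_bytes` is a hypothetical name used only for this illustration.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def map_size_bytes(gibibytes):
    """Convert a size in GiB to the byte count expected by the -m option."""
    return int(gibibytes * GIB)


print(map_size_bytes(6))   # 6442450944, the documented default
print(map_size_bytes(32))  # 34359738368
```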

3. Code2seq preprocess

The third utility is preprocess.sh from the code2seq folder. It is a modified version of the script from the original project, and it generates a dataset in the format expected by the code2seq model. In general, it creates new files in which the number of paths for each example is truncated or padded.
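The truncate-and-pad idea can be illustrated with the following Python sketch. It is not the preprocessing script itself: the function name `fit_contexts`, the random sampling used for truncation, and the use of trailing spaces (empty fields) as padding are all assumptions made for this example.

```python
import random


def fit_contexts(row, max_contexts, seed=0):
    """Return row with exactly max_contexts context slots after the label."""
    label, *contexts = row.split()
    if len(contexts) > max_contexts:
        # Assumption: keep a random subset rather than the first N contexts.
        contexts = random.Random(seed).sample(contexts, max_contexts)
    # Assumption: each missing context is padded with one extra space,
    # which yields an empty field when the row is split on single spaces.
    padding = " " * (max_contexts - len(contexts))
    return " ".join([label] + contexts) + padding
```

With this convention every output row has the same number of space-delimited slots, so a reader can split on single spaces and treat empty fields as padding.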

cppminer's People

Contributors

kolkir

cppminer's Issues

UnpicklingError

File "D:\Masters\industrialproject\C++attempt1\code2vec\code2vec\vocabularies.py", line 224, in _load_word_freq_dict
token_to_count = pickle.load(file)
_pickle.UnpicklingError: invalid load key, '\x1f'.

I am getting this error when I am trying to use the Chromium preprocessed data with Code2vec.

target label and context content

@Kolkir
Hi, I hope you are doing well. I am confused about the target label and the context path. After running miner.py, how can I differentiate between the target label and the context path? Also, if I have my own labels, how can I use them?

Thanks

miner.py

Hey

When I run miner.py with an input path and an output path, I get the following error:

if len(self.samples) > 0:
AttributeError: 'AstParser' object has no attribute 'samples'

How to fix this?

ValueError: ctypes objects containing pointers cannot be pickled

Can anyone find a solution for this?

Traceback (most recent call last):
  File "/Users/lsiddiqsunny/Documents/cppminer/src/miner.py", line 144, in <module>
    main()
  File "/Users/lsiddiqsunny/Documents/cppminer/src/miner.py", line 119, in main
    p.start()
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/lsiddiqsunny/.pyenv/versions/3.9.4/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
ValueError: ctypes objects containing pointers cannot be pickled

LIBCLANG TOOLING ERROR

When I run python3 miner.py -c 1000 ../data path/to on Ubuntu 16.04, I get this error. The output looks like this:
Parallel processes num: 4
Max contexts num: 1000
Max path length: 0
Max sub-tokens num: 5
Max AST depth: 0
Input path: /home/cleowang/code2seq/cppminer/data
Output path: /home/cleowang/code2seq/cppminer/src/path/to
Parsing files ...
LIBCLANG TOOLING ERROR: fixed-compilation-database: Error while opening fixed database: No such file or directory
json-compilation-database: Error while opening JSON database: No such file or directory

LIBCLANG TOOLING ERROR: fixed-compilation-database: Error while opening fixed database: No such file or directory
json-compilation-database: Error while opening JSON database: No such file or directory

How can I fix it? Thank you!

ModuleNotFoundError

Hi,

I am trying to train the model and I am getting this error message.

error:
No module named 'cppminer.cpp_parser'

I am not able to resolve the issue.

Thanks

LIBCLANG TOOLING ERROR

LIBCLANG TOOLING ERROR: fixed-compilation-database: Error while opening fixed database: no such file or directory
json-compilation-database: Error while opening JSON database: no such file or directory
