opened's Introduction

OPENED Extraction Tool

LPC 2022 blurb describing the goal of the tool and an initial prototype is here: https://lpc.events/event/16/contributions/1370/

Dependencies

  1. Works on a) kernel version 5.4.0-131, Ubuntu 22.04, Intel x86 architecture b) the Dockerfile works on Windows 10 with WSL2 + Docker Desktop and the Ubuntu 22.04 app from the MS Store. There is a known issue on Apple Silicon based MacBooks with installing a) gcc-multilib b) TXL and c) CodeQuery, described in the issue "ARM support for dependency setup (TXL and codequery)".
  2. git
  3. Docker

Download

  1. Run git clone git@github.com:eBPFDevSecTools/opened.git
  2. cd opened
  3. Run git submodule update --init --recursive
  4. To update the submodules later: a) git submodule update --remote --merge b) cd codequery; git pull

Install

Process 1: Docker

  1. mkdir op to store the output of the extraction phase (any other folder name works as well)
  2. docker build . -t opened/extract:0.01

Process 2: On Host

  1. For now: You will need to parse the Dockerfile and execute the installation steps on your host system.
  2. In future we will provide a script for on-host installation (Issue #24).

Updating local branch

  1. Run git pull
  2. Run git submodule update --recursive
  3. If you installed using Docker, you are done.
  4. If you installed on the host, you will need to re-install codequery by running the relevant instructions from the Dockerfile.

Extraction code and artefacts

Code extraction consists of three phases: 1) determining the necessary functions and data structures to be copied, 2) (manual) disambiguation of the target set of functions identified in the previous step, and 3) extracting the required code from source files to generate an independently compilable module.

Phase I: Determining necessary functions and data-structures for extracting specific functionality

  1. Run the annotated function call graph extraction phase:
python3 src/extraction_runner.py --help
usage: extraction_runner.py [-h] -s SRC_DIR -f FUNCTION_NAME [-d DB_FILE_NAME] [-g FUNCTION_CALL_GRAPH_PATH] -r REPO_NAME

optional arguments:
  -h, --help            show this help message and exit
  -s SRC_DIR, --src_dir SRC_DIR
                        directory with source code
  -f FUNCTION_NAME, --function_name FUNCTION_NAME
                        function name to be extracted
  -d DB_FILE_NAME, --db_file_name DB_FILE_NAME
                        Optional sqlite3 database with cqmakedb info
  -g FUNCTION_CALL_GRAPH_PATH, --function_call_graph_path FUNCTION_CALL_GRAPH_PATH
                        directory to put function and map dependency call graph file. Output of phase I
  -r REPO_NAME, --repo_name REPO_NAME
                        Project repository name

NOTE: example is given in run2.sh.
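
The invocation can also be assembled programmatically. The following is a minimal sketch that builds the argument list for src/extraction_runner.py using only the flags shown in the help text above. The helper name build_phase1_command and all example values (examples/katran, katran, op) are hypothetical; run2.sh remains the reference example.

```python
import shlex


def build_phase1_command(src_dir, function_name, repo_name,
                         db_file=None, call_graph_dir=None):
    """Build the argv list for src/extraction_runner.py (flags taken from its --help)."""
    cmd = ["python3", "src/extraction_runner.py",
           "-s", src_dir, "-f", function_name, "-r", repo_name]
    if db_file is not None:          # optional: -d DB_FILE_NAME
        cmd += ["-d", db_file]
    if call_graph_dir is not None:   # optional: -g FUNCTION_CALL_GRAPH_PATH
        cmd += ["-g", call_graph_dir]
    return cmd


# Hypothetical values, for illustration only.
cmd = build_phase1_command("examples/katran", "connection_table_lookup", "katran",
                           call_graph_dir="op")
print(shlex.join(cmd))
# To execute from the repository root: subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell quoting problems when paths contain spaces.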

Phase II

  1. Phase I writes an annotated function call graph to a file named func.out. Note that func.out may contain duplicate function definitions. Open func.out and remove the duplicate function and struct definitions. We expect the developer to disambiguate and identify the required set of functions to be extracted in this phase.

Phase III: Extracting Required Code

  1. Run the function extractor to extract and dump required functions and map definitions.
python3 src/function-extractor.py -h
usage: function-extractor.py [-h] -o OPDIR -c CODEQUERYOUTPUTFILE -e EXTRACTEDFILENAME -t STRUCT_INFO -f FUNC_INFO -s SRCDIR -b BASEDIR [--isCilium]

Function Extractor

optional arguments:
  -h, --help            show this help message and exit
  -o OPDIR, --opdir OPDIR
                        directory to dump extracted files to
  -c CODEQUERYOUTPUTFILE, --codequeryOutputFile CODEQUERYOUTPUTFILE
                        Function and Map dependency output from codequery
  -e EXTRACTEDFILENAME, --extractedFileName EXTRACTEDFILENAME
                        Output file with extracted function
  -t STRUCT_INFO, --struct_info STRUCT_INFO
                        json file containing struct definitions in the repo
  -f FUNC_INFO, --func_info FUNC_INFO
                        json file containing function definitions in the repo
  -s SRCDIR, --srcdir SRCDIR
                        Directory containing source files for function to be extraced from
  -b BASEDIR, --basedir BASEDIR
                        Base Directory path relative to which directory structure in opdir will be created
  --isCilium            whether repository is cilium

Note that STRUCT_INFO and FUNC_INFO are generated using the annotator script in the eBPF-projects-annotations repo.

Note that extracted.c may contain duplicate eBPF map definitions within an ATTENTION section. We expect the developer to choose the right map definition and delete the offending definition.
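
As with Phase I, the Phase III command line can be put together in a small helper. This is only a sketch based on the flags listed in the help text above; the helper name build_phase3_command and every path in the example are hypothetical placeholders, not files the repository is known to ship.

```python
def build_phase3_command(opdir, codequery_output, extracted_file,
                         struct_info, func_info, srcdir, basedir,
                         is_cilium=False):
    """Build the argv list for src/function-extractor.py (flags taken from its -h output)."""
    cmd = ["python3", "src/function-extractor.py",
           "-o", opdir,
           "-c", codequery_output,
           "-e", extracted_file,
           "-t", struct_info,
           "-f", func_info,
           "-s", srcdir,
           "-b", basedir]
    if is_cilium:
        cmd.append("--isCilium")   # only for the cilium repository
    return cmd


# Hypothetical paths, for illustration only.
phase3_cmd = build_phase3_command(
    opdir="op/katran",
    codequery_output="func.out",
    extracted_file="extracted.c",
    struct_info="op/katran.struct_file_list.json",
    func_info="op/katran.function_file_list.json",
    srcdir="examples/katran",
    basedir="examples/katran",
)
print(" ".join(phase3_cmd))
```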

Compilation

Run make to compile the extracted code.

opened's People

Contributors

dushyantbehl, lcastanheira-1, palanik1, pkodeswaran, sdsen


opened's Issues

Handle duplicate function definitions in .c files guarded by macro guards

Is your feature request related to a problem? Please describe.
Consider sock4_update_revnat in cilium/bpf_sock.c, which is defined at lines 168 and 193 of the same file. Currently, the developer has to pick one. By default, if there are multiple definitions, we process the first one.
Three alternatives are:

  1. Possibly alert the developer and exit processing.
  2. Pick the first definition, as in current practice.
  3. Pick all definitions and protect them with macro guards. In principle this should work, since this is already done while extracting functions.

This is not an issue with .h files, since we include them as is. I think the single function definition requirement matters only when there are multiple definitions across different files.

Tracepoint_Probe Bug

The tool is failing for the following files:

  1. https://github.com/iovisor/bcc/blob/master/examples/tracing/biolatpcts.py
  2. https://github.com/iovisor/bcc/blob/master/examples/tracing/hello_fields.py
  3. https://github.com/iovisor/bcc/blob/master/examples/tracing/kvm_hypercall.py
  4. https://github.com/iovisor/bcc/blob/master/examples/tracing/trace_fields.py
  5. https://github.com/iovisor/bcc/blob/master/examples/tracing/urandomread.py

They all have tracepoint_probe/raw_tracepoint_probe as functions.

We created a 'bcc' folder in 'opened_extraction/examples' and added all the segregated bcc .c files and the respective python loader files. Then created two folders in 'opened_extraction/op' named 'commented_bcc' and 'txl_bcc'.
Inside the docker container on running python3 src/annotator.py -o op/bcc/txl_bcc -s examples/bcc -c op/bcc/commented_bcc -t op/bcc/bcc.function_file_list.json -u op/bcc/bcc.struct_file_list.json we are getting the following error:

Traceback (most recent call last):
  File "/root/src/annotator.py", line 306, in <module>
    create_code_comments(txl_func_file, bpf_helper_file, cmt_op_dir, isCilium)
  File "/root/src/annotator.py", line 170, in create_code_comments
    cmt.parseTXLFunctionOutputFileForComments(xmlFile, opFile, srcFile, helperdict, map_update_fn, map_read_fn)
  File "/root/src/code_commentor.py", line 143, in parseTXLFunctionOutputFileForComments
    output= funcName.split('(')[-2].split(" ")[-2]
IndexError: list index out of range

@sdsen @palanik1
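
A small reproduction may help narrow this down. The sketch below assumes that funcName holds a function header string and that the macro-style probes arrive as something like TRACEPOINT_PROBE(random, urandom_read); both sample strings and the helper name token_before_name are hypothetical. Under that assumption, the expression from the traceback fails because nothing precedes the name before the first parenthesis, so the inner split yields a single element.

```python
def token_before_name(signature):
    """Return the token just before the function name, or None if there is none."""
    head, paren, _ = signature.partition("(")
    tokens = head.split()
    if not paren or len(tokens) < 2:
        return None
    return tokens[-2]


normal = "static int handle_ipv4(struct __sk_buff *ctx)"   # assumed input shape
macro = "TRACEPOINT_PROBE(random, urandom_read)"           # assumed input shape

# The expression from the traceback works when a token precedes the name ...
original_result = normal.split('(')[-2].split(" ")[-2]

# ... but raises IndexError when the text before '(' is a single token.
try:
    macro.split('(')[-2].split(" ")[-2]
    raised = False
except IndexError:
    raised = True

print(original_result, raised)
print(token_before_name(normal), token_before_name(macro))
```

Note that the helper splits on the first parenthesis, whereas the original expression uses the second-to-last piece, so the two differ on headers with nested parentheses. Whether returning None is the right fallback for probe macros depends on how the caller uses the value.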

Merge project specific asset files

Is your feature request related to a problem? Please describe.
Some eBPF projects use libbpf as well as cilium's wrappers or bcc functions. We will miss coverage if we assume a single library type.
Describe the solution you'd like
Check functions in input code across all libraries.
Describe alternatives you've considered
NA
Additional context
NA

Decide on putting #defines before or after includes

Is your feature request related to a problem? Please describe.
For some functions in cilium, some #defines must come before the #includes, while others must come after them (due to redefinition).
Describe the solution you'd like
Pick a scheme that works for katran and most cilium functions.

Add copyright information

Is your feature request related to a problem? Please describe.
Add copyright information in "extracted.c" file to ensure proper attribution.

Describe the solution you'd like
take name of appropriate entity as input and append it to extracted.c file.

Describe alternatives you've considered
NA
Additional context
NA

list index out of range when running function_extractor.py

After I successfully ran extraction_runner.py, I got the following func.out (there is only one function because I picked connection_table_lookup, which occurs only once in Katran):
[screenshot omitted]
Then I tried to run function_extractor.py:
[screenshot omitted]
It shows an error like this:
[screenshot omitted]
Can I get a sense of what is going on here?

ML-Based Equivalence Checking

This feature builds on the parallel attempt to use OpenAI to create documentation. The goal of this feature is to build a lightweight equivalence check based on the text documentation of a function, ideally using techniques such as TF-IDF and word2vec to determine whether two functions could be equivalent based on their documentation.
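
To make the TF-IDF part of the idea concrete, here is a minimal, dependency-free sketch that scores the similarity of function documentation strings with TF-IDF weights and cosine similarity. The three docstrings are made up for illustration, the smoothed idf formula is one common variant rather than a project decision, and any threshold for "could be equivalent" would still have to be chosen empirically.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(docs):
    """Map each document to a {term: tf-idf weight} dict."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf, so terms present in every document keep a non-zero weight.
        vectors.append({t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical documentation strings.
docs = [
    "Looks up a connection in the LRU map and returns the backend.",
    "Returns the backend for a connection by looking it up in the LRU map.",
    "Parses the IPv6 header and drops fragmented packets.",
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])
sim_far = cosine(vecs[0], vecs[2])
print(sim_close > sim_far)
```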

review and increase the capability labels for bpf-specific functions.

Is your feature request related to a problem? Please describe.
We have just looked at map_read, sys_info etc. labels, need to scan and identify more capability labels.

Handle duplicate #defines during extraction

Is your feature request related to a problem? Please describe.
When including #defines from multiple files, there could be duplicate #defines, with potentially different values.
Describe the solution you'd like
Create a duplicate #defines dict and alert user to possible duplicates with divergent values to fix
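
One possible shape for such a dict is sketched below. It is an illustration only: the names collect_defines and divergent_defines and the sample headers are hypothetical, and the regular expression handles only single-line, object-like macros (no function-like macros, no backslash continuations, and trailing comments are treated as part of the value).

```python
import re
from collections import defaultdict

# Matches "#define NAME" or "#define NAME value" on a single line.
DEFINE_RE = re.compile(r"^\s*#\s*define\s+(\w+)(?:[ \t]+(.*?))?\s*$")


def collect_defines(sources):
    """sources: {filename: file text}. Returns {macro: {value: [filenames]}}."""
    seen = defaultdict(lambda: defaultdict(list))
    for fname, text in sources.items():
        for line in text.splitlines():
            m = DEFINE_RE.match(line)
            if m:
                name = m.group(1)
                value = (m.group(2) or "").strip()
                seen[name][value].append(fname)
    return seen


def divergent_defines(seen):
    """Keep only macros that were defined with more than one distinct value."""
    return {name: dict(values) for name, values in seen.items() if len(values) > 1}


# Hypothetical header contents.
sources = {
    "a.h": "#define MAX_ENTRIES 1024\n#define DEBUG\n",
    "b.h": "#define MAX_ENTRIES 65536\n#define DEBUG\n",
}
dups = divergent_defines(collect_defines(sources))
for name, values in dups.items():
    print(f"ATTENTION: {name} has divergent values: {values}")
```

Here DEBUG is defined twice with the same (empty) value, so only MAX_ENTRIES is reported.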

generate e2e artefacts for cilium function extraction

Is your feature request related to a problem? Please describe.
Create folders with extracted cilium functions that load on an interface. For each folder, generate separate makefiles etc. as well.
Put the folders in the op dir.

Describe the solution you'd like
target functions:

  • handle_ipv6
  • tail_handle_ipv6
  • handle_ipv4
  • tail_handle_ipv4
  • bpf_redir_proxy
  • bpf_sock_ops_ipv4
  • bpf_sock_ops_ipv6
  • bpf_sockmap
  • sock4_connect


Run opensource static analyzer on code

Is your feature request related to a problem? Please describe.
Play around with static-analyzers to see the kind of information they provide out-of-box, and incorporate relevant information into our comment_stub/summarizer output.


Improve UX

Is your feature request related to a problem? Please describe.
enhance the UX for users of the tool. Also, document the workflow.

Cilium missing .h files

We are trying to use the newest makefiles for the extraction of cilium. We now face a problem: a missing header called bpf_features.h causes a make error.
We have already changed the path in the makefile to examples/cilium, where the headers lie, but the missing one isn't there.

Should this .h file be provided, or should it be extracted by the tool?
Thanks

[Two screenshots omitted]

using OpenAi to Create Automatic Documentation

OpenAI enables automatic creation of documentation using the davinci model. An example can be found here: https://beta.openai.com/examples/default-explain-code

The goal is to write an API that takes the lines of code of interest and generates documentation.

# linesOfCode: an array of code lines
# ModelParameters: a dict defining characteristics of the model,
#   e.g. {ModelName: "...", Param1: "...", Param2: "...", Param3: "..."}

CreateDocumentation(linesOfCode, ModelParameters)
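
A possible Python shape for this interface is sketched below. To avoid guessing at any particular model API, the actual request is delegated to a caller-supplied function; that complete callable, the prompt wording, and the parameter names in the example are all assumptions made for illustration. Only the ModelName key comes from the description above.

```python
def create_documentation(lines_of_code, model_parameters, complete):
    """Build an explain-this-code prompt and pass it to a completion function.

    `complete(prompt, model=..., **params)` is a hypothetical callable that wraps
    whichever model API is used; no real API call is made in this sketch.
    """
    params = dict(model_parameters)          # do not mutate the caller's dict
    model_name = params.pop("ModelName")
    prompt = "Explain what the following code does:\n\n" + "\n".join(lines_of_code)
    return complete(prompt, model=model_name, **params)


def fake_complete(prompt, model, **params):
    """Stand-in for a real model call; it just echoes what it received."""
    return {"model": model, "params": params, "prompt": prompt}


# Hypothetical inputs.
code = ["int x = 0;", "return x;"]
model_parameters = {"ModelName": "example-model", "temperature": 0}
result = create_documentation(code, model_parameters, fake_complete)
print(result["model"], result["params"])
```

Passing the completion function as an argument keeps the prompt-building logic testable without network access.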

ARM support for dependency setup (TXL and codequery)

Is your feature request related to a problem? Please describe.
The problem is that many dependencies of the OPENED tool aren't supported on the ARM architecture. Specifically, they are gcc-multilib, codequery, and TXL. There is an alternative package to substitute for gcc-multilib. The qt5-default dependency of codequery is also not supported, but there might be an alternative.

There's no alternative for TXL on ARM architectures.

Describe the solution you'd like
A docker container for setting up the dependencies for OPENED, targeted at ARM architectures specifically.

Describe alternatives you've considered
There's an alternative package to use on ARM architecture for gcc-multilib, but no working alternatives have been found for codequery and TXL:

  • gcc-multilib is not supported on ARM architectures. A working solution is to install the gcc-multilib-i686-linux-gnu package instead. However, with this alternative package, the C_INCLUDE_PATH variable needs to be changed to /usr/include/aarch64-linux-gnu/, for example by running export C_INCLUDE_PATH=/usr/include/aarch64-linux-gnu/.
  • The codequery dependency qt5-default is not supported on the ARM architecture. We haven't found an alternative solution to this issue yet.
  • TXL does not support ARM and there's currently no alternative solution.

clean up repo, such that all generated artefacts are in op/ folder.

Is your feature request related to a problem? Please describe.
clean up repo, such that all generated artefacts are in op/ folder.
Describe the solution you'd like

  1. clean present main folder
  2. change run*.sh files to generate output in op folder

Describe alternatives you've considered
None.
Additional context
NA

cqsearch recursion is hanging

Is your feature request related to a problem? Please describe.
The cqsearch recursion hangs for a number of cilium functions. The bottleneck needs to be debugged.
Describe the solution you'd like
The cqsearch should complete function_call_graph generation.

Add BU student comments to asset

Is your feature request related to a problem? Please describe.
As part of course project, EECS 528 BU students have generated docStrings for parts of cilium codebase. We should create a db with the docstrings and modify code_comment.py to read that db for generating comment_stubs.

change input API

Is your feature request related to a problem? Please describe.
As an input to extraction_runner, specify the GitHub repo.
Describe the solution you'd like

  1. read in the repo-id, and run find operation to search the .c and .h ebpf files
  2. save the commit info

Describe alternatives you've considered
None.
Additional context
None.

Integrate Comment stub generation code

Is your feature request related to a problem? Please describe.
The new phase-one of extraction as asked for in #15 should also provide option to generate stub comment.


Split extraction phase 1 code

Is your feature request related to a problem? Please describe.
The steps to generate a) txl-based function and struct identification and b) codequery based function-call-graph and map-graph should be split into 2 steps. This is because step-1 is for entire repo and can be reused for all functions. Whereas step-2 is extraction target specific.

Describe the solution you'd like
Split extraction_runner.py to 2 separate executors and make the 2nd one take first's artefacts as input if necessary.

Describe alternatives you've considered
None.

Additional context
None.

Create on host installer script

Is your feature request related to a problem? Please describe.
We need an installer script for users who do not want to use Docker.

Describe the solution you'd like
A python script which will mimic the Dockerfile operations on host.

Describe alternatives you've considered
None.

Additional context
None.

Run extraction on Mizar


Add loading script for extracted code object file

Is your feature request related to a problem? Please describe.
We need verifier code. As a starting piece, we need a mechanism to attach the code to a virtual interface and to clean up the interface once testing of the code is done.

Describe the solution you'd like
See above.

Describe alternatives you've considered
NA

Additional context
None

cilium helper functions need to be recognized in code_commentor

Is your feature request related to a problem? Please describe.
Cilium uses wrappers for bpf_helper_functions. Our current code_commentor doesn't recognize the wrapper helper functions and hence doesn't document them in the comment stub.
Describe the solution you'd like

  1. update the relevant asset files to include the following helpers. https://github.com/cilium/cilium/blob/5a143f9d2181037c4e721767ec6e4ee72dbd662f/bpf/include/linux/bpf.h#L3835

Describe alternatives you've considered
None.
Additional context
NA

Preserve Directory Structure

Is your feature request related to a problem? Please describe.
Currently all .h files are copied directly into opdir. This destroys the directory structure of the include files.
Describe the solution you'd like
Take in a base folder and maintain directory structure relative to this folder.
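
The idea can be sketched as follows: compute each header's path relative to the base folder and recreate that path under the output directory. The function name copy_preserving_structure and the bpf/lib/common.h layout are hypothetical; the -b BASEDIR option in the Phase III help text appears to serve this purpose, but this sketch is not taken from the tool's implementation.

```python
import os
import shutil
import tempfile


def copy_preserving_structure(src_file, basedir, opdir):
    """Copy src_file into opdir at the same path it has relative to basedir."""
    rel = os.path.relpath(src_file, basedir)
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError(f"{src_file} is not under {basedir}")
    dest = os.path.join(opdir, rel)
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    shutil.copy2(src_file, dest)
    return dest


# Demonstration in a throwaway directory with a hypothetical layout.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "bpf")
    out = os.path.join(tmp, "op")
    header = os.path.join(base, "lib", "common.h")
    os.makedirs(os.path.dirname(header))
    with open(header, "w") as fh:
        fh.write("/* placeholder header */\n")
    dest = copy_preserving_structure(header, base, out)
    copied = os.path.isfile(dest)
    rel_dest = os.path.relpath(dest, out)

print(rel_dest, copied)
```

Rejecting files outside the base folder avoids writing above opdir when a header comes from an unexpected location.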


Add Test cases

Is your feature request related to a problem? Please describe.
Need a set of test cases and scripts to ensure new code does not break functionalities.

Cilium function extraction test (list of functions can't be extracted)

commented_OPENED_cilium_bpf_overlay.c:

  1. handle_ipv6
  2. tail_handle_ipv6
  3. handle_ipv4
  4. tail_handle_ipv4

commented_OPENED_cilium_cilium-probe-kernel-hz.c:

  1. main (we are not sure if this is a problem but since there are many main functions, probably still need a main for a specific file to see how it uses those functions?)

commented_OPENED_cilium_sockops_bpf_redir.c:

  1. bpf_redir_proxy

commented_OPENED_cilium_sockops_bpf_sockops.c:

  1. bpf_sock_ops_ipv4
  2. bpf_sock_ops_ipv6
  3. bpf_sockmap
