
oss-fuzz-gen's Introduction

A Framework for Fuzz Target Generation and Evaluation

This framework generates fuzz targets for real-world C/C++ projects with various Large Language Models (LLMs) and benchmarks them via the OSS-Fuzz platform.

More details are available in the blog post AI-Powered Fuzzing: Breaking the Bug Hunting Barrier.

Currently supported models are:

  • Vertex AI code-bison
  • Vertex AI code-bison-32k
  • Gemini Pro
  • OpenAI GPT-3.5-turbo
  • OpenAI GPT-4

Generated fuzz targets are evaluated with four metrics against the most up-to-date data from the production environment:

  • Compilability
  • Runtime crashes
  • Runtime coverage
  • Runtime line coverage diff against existing human-written fuzz targets in OSS-Fuzz.

Here is a sample experiment result from 31 January 2024. The experiment included 1,300+ benchmarks from 297 open-source projects.


Overall, the framework successfully leverages LLMs to generate valid fuzz targets (i.e., targets that produce a non-zero coverage increase) for 160 C/C++ projects. The maximum line coverage increase is 29% over the existing human-written targets.

Note that these reports are not public as they may contain undisclosed vulnerabilities.

Usage

Check our detailed usage guide for instructions on how to run this framework and generate reports based on the results.

Collaborations

Interested in research or open-source community collaborations? Please feel free to create an issue or email us: [email protected].

Bugs Discovered

So far, we have reported 6 new bugs/vulnerabilities found by fuzz targets automatically generated by this framework:

| Project | Bug | LLM | Prompt template |
|---|---|---|---|
| cJSON | OOB read | Vertex AI | default |
| libplist | OOB read | Vertex AI | default |
| hunspell | OOB read | Vertex AI | default |
| zstd | OOB write | Vertex AI | default |
| Undisclosed | Stack buffer underflow | Vertex AI | default |
| Undisclosed | Use of uninitialised memory | Vertex AI | default |

These bugs could only have been discovered with newly generated targets. They were not reachable with existing OSS-Fuzz targets.

Current top coverage improvements by project

| Project | Coverage increase % * |
|---|---|
| tinyxml2 | 29.84 |
| inih | 29.67 |
| lodepng | 26.21 |
| libarchive | 23.39 |
| cmark | 21.61 |
| fribidi | 18.20 |
| lighttpd | 17.56 |
| libmodbus | 16.59 |
| valijson | 16.21 |
| libiec61850 | 13.53 |
| hiredis | 13.50 |
| cmake | 12.62 |
| pugixml | 12.43 |
| meshoptimizer | 12.23 |
| libusb | 11.12 |
| json | 10.84 |

* Coverage percentages are calculated with the total lines of source code compiled during the OSS-Fuzz build of the entire project as the denominator.

Citing This Work

Please click on the 'Cite this repository' button located on the right-hand side of this GitHub page for citation details.

oss-fuzz-gen's People

Contributors

another-rex, arthurscchan, ch1hyun, cjx10, davidkorczynski, dependabot[bot], donggeliu, eltociear, erfanio, fdt622, inferno-chromium, jonathanmetzman, marklee131, marktefftech, mihaimaruseac, oliverchang, scottbrenner, trashvisor


oss-fuzz-gen's Issues

Capture more corner cases in benchmark name parsing

For example:
output-jsonnet-jsonnet::internal:: in 2024-02-11-64-dg-comparison

The name is incomplete because the namespace contains a `(` (from `(anonymous namespace)`), and our parser mistook it for the `(` that opens the parameter list:

jsonnet::internal::(anonymous namespace)::Interpreter::builtinExtVar(jsonnet::internal::LocationRange const&, std::__1::vector<jsonnet::internal::(anonymous namespace)::Value, std::__1::allocator<jsonnet::internal::(anonymous namespace)::Value> > const&)
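
A minimal sketch of a more robust split, assuming we receive the demangled signature as a single string: scan from the right for the `(` that balances the final `)`, instead of taking the first `(`:

def split_name_and_params(signature: str) -> tuple[str, str]:
  """Split a demangled C++ signature into (qualified name, parameter list).

  Scanning from the right finds the '(' that opens the final, balanced
  parameter list, so '(anonymous namespace)' in the qualified name (or in
  a parameter type) no longer confuses the split. Trailing qualifiers
  such as 'const' after the ')' are not handled in this sketch.
  """
  depth = 0
  for i in range(len(signature) - 1, -1, -1):
    if signature[i] == ')':
      depth += 1
    elif signature[i] == '(':
      depth -= 1
      if depth == 0:
        return signature[:i].rstrip(), signature[i:]
  return signature, ''  # no parameter list found

For the jsonnet signature above, this returns `jsonnet::internal::(anonymous namespace)::Interpreter::builtinExtVar` as the qualified name.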

Remove function name usage.

Currently, both the function signature field and the function name field store the function signature, which is used as the unique function identifier.

We can simplify the code by only keeping the function signature.

Related:
#64 (comment)

Handle VertexAI error response 'Text too long'

The error is caused by a 'Text too long' response. Here is the message:
https://pantheon.corp.google.com/logs/query;aroundTime=2024-02-13T12:58:00.000Z;cursorTimestamp=2024-02-13T13:00:23.372992698Z;duration=PT15M;query=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22oss-fuzz%22%0Aresource.labels.location%3D%22us-central1%22%0Aresource.labels.cluster_name%3D%22llm-experiment%22%0Aresource.labels.namespace_name%3D%22default%22%0Alabels.k8s-pod%2Fbatch_kubernetes_io%2Fcontroller-uid%3D%228f6b14ca-f7a2-4653-85f1-90e542b95e79%22%20severity%3E%3DWARNING%0Atimestamp%3D%222024-02-13T13:00:23.372992698Z%22%0AinsertId%3D%225lnkmvgn576p21g1%22?project=oss-fuzz

The related logs before/after the message:
https://pantheon.corp.google.com/logs/query;query=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22oss-fuzz%22%0Aresource.labels.location%3D%22us-central1%22%0Aresource.labels.cluster_name%3D%22llm-experiment%22%0Aresource.labels.namespace_name%3D%22default%22%0Alabels.k8s-pod%2Fbatch_kubernetes_io%2Fcontroller-uid%3D%228f6b14ca-f7a2-4653-85f1-90e542b95e79%22%20severity%3E%3DWARNING;cursorTimestamp=2024-02-13T13:00:23.372992698Z;aroundTime=2024-02-13T12:58:00.000Z;duration=PT15M?e=-13802955&mods=logs_tg_prod&project=oss-fuzz

Handling this requires two tasks:

  1. Investigate its root cause. This appears to happen in the code-fixing step. We need to understand what the prompt was and why it was overlong (e.g., did the error parser include too much text?).
  2. Capture this error and log it so that the experiment won't break because of it.
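
A minimal sketch of task 2, assuming the Vertex AI client surfaces the failure as an exception whose message contains 'Text too long' (the exact exception type depends on the SDK in use):

import logging

TEXT_TOO_LONG_MARKER = 'Text too long'

def query_model(model, prompt: str):
  # Query the model; log (rather than raise) overlong-prompt failures so
  # the rest of the experiment can continue.
  try:
    return model.query(prompt)  # hypothetical model interface
  except Exception as e:
    if TEXT_TOO_LONG_MARKER in str(e):
      logging.warning('Prompt rejected as too long (%d chars): %s',
                      len(prompt), e)
      return None
    raise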

Distinguish `benchmark` and `project`

We need unambiguous names:
Each benchmark should be a function, and a project may have multiple benchmarks/functions.

For example, our current benchmark.yaml should be project.yaml.
This will involve other renaming/refactoring modifications.

Fix false negative 'function used in fuzz targets'

Our current code automatically fails an LLM-generated fuzz target if it does not contain the function under test.
However, the current pattern matching is naive and has false negatives (i.e., the function is used correctly but not recognized by us), so it rejects valid fuzz targets.

This relates to our function name parsing regex:

names = re.findall(r'.*?\s*([\w:<>+*~]+)\s*\([^\(]*\)', function_signature)
if names:
  # Normalize names.
  return re.sub(r'[^\w:]', '-', names[-1])

A quick temporary solution is to check the function names with special characters stripped.
Later we can make better use of data from FI (Fuzz Introspector).
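
A minimal sketch of that temporary solution, assuming a containment check against the generated source is enough (function_used and its arguments are hypothetical names):

import re

def function_used(function_name: str, target_source: str) -> bool:
  # Compare the bare function name (special characters stripped) against
  # the identifiers that appear in the generated fuzz target.
  bare = re.sub(r'[^\w]', '', function_name.split('::')[-1])
  identifiers = set(re.findall(r'\w+', target_source))
  return bare in identifiers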

Initial `gcloud` authentication failure in GKE experiments

Some initial cloud build requests failed due to gcloud authentication errors (example1, example2). Some related observations and guesses:

  1. This has recurred on multiple GKE experiments, yet I failed to reproduce it in local experiments. Maybe this is because we have to authenticate gcloud manually before local experiments?
  2. The number of initial experiments affected by this seems random: sometimes only the first one, sometimes multiple. This could be due to parallelism in experiments.
  3. This error disappears after a while in each experiment.
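
If the failure really is transient, a minimal mitigation sketch is to retry the first cloud build command with a delay (the retry budget and the auth-error check are assumptions):

import subprocess
import time

def run_with_retry(cmd: list[str], attempts: int = 3, delay: float = 30.0):
  # Retry a command whose stderr suggests a transient auth error.
  for i in range(attempts):
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
      return result
    if 'auth' not in result.stderr.lower() or i == attempts - 1:
      result.check_returncode()  # raises CalledProcessError
    time.sleep(delay)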

Coverage diffs: handle templates properly.

It's likely that we aren't handling template instantiations properly in our textcov diffing, leading to some inflated coverage numbers for certain C++ projects that use templates extensively.
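
A minimal sketch of one possible fix: collapse template argument lists in demangled names before diffing, so all instantiations of a template count as one function (this naive version would mishandle operator< and operator>, which a real fix must special-case):

def normalize_template_name(function_name: str) -> str:
  # Collapse 'parse<int>' and 'parse<double>' into 'parse<>'.
  out = []
  depth = 0
  for ch in function_name:
    if ch == '<':
      depth += 1
      if depth == 1:
        out.append('<>')
    elif ch == '>':
      depth = max(0, depth - 1)
    elif depth == 0:
      out.append(ch)
  return ''.join(out)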

Refactor prompt generation and evaluation loop.

Currently our prompt generation is tied to our template format here: https://github.com/google/oss-fuzz-gen/tree/main/prompts/template_xml

We should make it easier and more flexible for others to test different prompt generation strategies by allowing these custom prompts to be Python modules that look something like the following:

def generate(benchmark: Benchmark) -> str:
  ...

i.e. the module would be expected to define a generate function which produces a full prompt to pass to the LLM.

Similarly, we should also make our generation/evaluation loop more configurable, e.g. extract the logic here:

model.prompt_path = model.prepare_generate_prompt(

into a driver.py that can be similarly replaced:

def evaluate(model: models.LLM, benchmark: Benchmark, prompt_generator: Module):
  prompt = prompt_generator.generate(benchmark)
  targets = generate_targets(model, prompt)
  results = evaluate_targets(targets)
  ...

And tying this all together, the resulting invocations would look something like:

./run_all_experiments --driver /path/to/driver.py --prompt_generator prompts/custom_generator.py
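
A minimal sketch of how run_all_experiments could load those pluggable modules (the load_module helper is hypothetical; importlib does the real work):

import importlib.util

def load_module(path: str, name: str):
  spec = importlib.util.spec_from_file_location(name, path)
  module = importlib.util.module_from_spec(spec)
  spec.loader.exec_module(module)
  return module

prompt_generator = load_module('prompts/custom_generator.py', 'prompt_generator')
driver = load_module('/path/to/driver.py', 'driver')
# driver.evaluate(model, benchmark, prompt_generator)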

Re-select `comparison` benchmark-set from the new benchmarks.

This requires looking into the failures and selecting benchmarks from recent results based on the following guidelines:

  1. < 30 benchmarks in total. This ensures we can finish the experiment quickly, particularly for PR experiments.
  2. <= 2 benchmarks from one project, to include more projects.
  3. <= 1 benchmark with a 100% build rate or > 10% coverage increase from one project, with ~10 such benchmarks in total. Well-performing benchmarks are for regression checks; no point in having too many.
  4. <= 1 benchmark failed with the same error from the same project, to track improvements.
  5. Avoid the same general error from different projects (e.g., `size_t` or other standard-library symbols being undefined).
  6. Interesting projects to include:
  • tinyxml2
  • icu
  • avahi
  • Projects with complex function signatures (e.g., #89, guetzli, abseil-cpp, cppitertools's operator*, etc.)
  7. Some interesting failures to include in the comparison:
  • Incorrect path in #include <...> / undefined function.
  • Missing build failure.
  • Function not called.
  • Incorrect function usage.
  • Needing the definitions of data types used in the function under test.
  • Incorrect usage of the function under test.
  8. <= 1 relatively well-performing benchmark from a project (e.g., 60% build rate, 1% coverage increase).

Given there are many benchmarks to select from, please document all selected benchmarks with justification here for future reference.
Feel free to also document interesting benchmarks that were not selected, or that you are unsure about.

If more benchmarks are needed, use https://llm-exp.oss-fuzz.com/Result-reports/ochang-2024-01-25/sort.html.
However, the names there are not as convenient as the ones above, and some benchmarks might no longer be available.
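
The mechanical parts of guidelines 1-3 could be enforced with a small helper. A rough sketch, where the Benchmark fields (project, build_rate, coverage_diff) are assumptions about the result data, and the judgment calls in guidelines 4-8 still apply on top:

from collections import defaultdict

def select(benchmarks, total_cap=30, per_project_cap=2):
  selected = []
  per_project = defaultdict(int)
  good_per_project = defaultdict(int)
  for b in benchmarks:
    if len(selected) >= total_cap or per_project[b.project] >= per_project_cap:
      continue
    well_performing = b.build_rate == 1.0 or b.coverage_diff > 0.10
    if well_performing and good_per_project[b.project] >= 1:
      continue  # at most one well-performing benchmark per project
    selected.append(b)
    per_project[b.project] += 1
    good_per_project[b.project] += int(well_performing)
  return selected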

API for providing project context

  • Existing fuzz targets (path, binary name)
  • Function implementations
  • Data structure definitions
  • Usages of functions and data structures.
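
A hypothetical sketch of what such an API could return; none of these names exist in the codebase yet:

from dataclasses import dataclass, field

@dataclass
class ProjectContext:
  existing_targets: list[tuple[str, str]] = field(default_factory=list)  # (path, binary name)
  function_implementations: dict[str, str] = field(default_factory=dict)  # signature -> source
  type_definitions: dict[str, str] = field(default_factory=dict)  # type name -> definition
  usages: dict[str, list[str]] = field(default_factory=dict)  # symbol -> sample call sites

def get_context(project: str) -> ProjectContext:
  ...  # to be backed by static analysis of the project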

Convert functions in `benchmark.yaml` into a more structured form.

For example,
from:

functions:
  - int main2(int argc, char **argv)
  - int scanopt_usage(scanopt_t *scanner, FILE *fp, const char *usage)
project: flex
target_path: /src/fuzz-main.c

to:

"functions":
- "name": "main2"
  "param_names":
  - "argc"
  - "argv"
  "param_types":
  - "int"
  - "char **"
  "return_type": "int"
  "signature": "int main2(int argc, char ** argv)"
- "name": "scanopt_usage"
  "param_names":
  - "scanner"
  - "fp"
  - "usage"
  "param_types":
  - "char **"
  - "struct._IO_FILE *"
  - "char *"
  "return_type": "int"
  "signature": "int scanopt_usage(char ** scanner, struct _IO_FILE * fp, char * usage)"

project: flex
target_path: /src/fuzz-main.c
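
A minimal sketch of loading the structured form with PyYAML (FunctionSpec is a hypothetical name):

from dataclasses import dataclass

import yaml

@dataclass
class FunctionSpec:
  name: str
  param_names: list[str]
  param_types: list[str]
  return_type: str
  signature: str

def load_functions(path: str) -> list[FunctionSpec]:
  with open(path) as f:
    data = yaml.safe_load(f)
  return [FunctionSpec(**fn) for fn in data['functions']]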

Instructions for generating targets that read from files

I had some success manually including this in prompts for certain benchmarks to improve the quality of the fuzzing:

IMPORTANT: If the solution needs to load from a file, you need to create a temporary file called "/tmp/input", write data to it from FuzzedDataProvider, and use that as the file input. Example:
<code>
FILE *handle = fopen("/tmp/input", "w");
std::vector<uint8_t> contents = stream.ConsumeRemainingBytes(); // or stream.ConsumeBytes
fwrite(contents.data(), contents.size(), 1, handle);
fclose(handle);

int result = ParseFile("/tmp/input");
</code>

Need to evaluate this against more targets.
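
A minimal sketch of wiring that instruction into prompt generation; the needs_file_input signal is an assumption (it could come from benchmark metadata or a heuristic on the function under test):

FILE_INPUT_INSTRUCTION = '''
IMPORTANT: If the solution needs to load from a file, you need to create a
temporary file called "/tmp/input", write data to it from FuzzedDataProvider,
and use that as the file input.
'''

def finalize_prompt(prompt: str, needs_file_input: bool) -> str:
  if needs_file_input:
    return prompt + FILE_INPUT_INSTRUCTION
  return prompt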

Mitigate null terminator issues with LLVMFuzzerTestOneInput

Some of the currently generated targets incorrectly assume that the data parameter is null-terminated, leading to false-positive overflows when the data is passed to functions expecting null-terminated strings.

We need to better detect these, and experiment with including instructions in our prompts to avoid this.
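
For the detection half, a rough heuristic sketch: flag generated targets that pass the raw data pointer to C string functions without null-terminating it first (the sink list is illustrative, not exhaustive):

import re

C_STRING_SINKS = re.compile(
    r'\b(strlen|strcpy|strcat|strstr|strchr|atoi|sscanf)\s*\([^;]*\bdata\b')

def assumes_null_termination(target_source: str) -> bool:
  return C_STRING_SINKS.search(target_source) is not None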

More informative logging

  1. Add a logging module that stores more data in each log's jsonPayload (e.g., project, function/benchmark, LLM response id (1-10), code-fixing id (0-5), etc.). FuzzBench reference 1, reference 2.
  2. Replace the current logging and print calls with it.

Example view: (screenshot omitted)

This is useful to help us identify error messages to reproduce/fix them.
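
A minimal sketch of item 1, assuming logs are emitted as JSON lines and picked up into jsonPayload by the GKE logging agent:

import json
import logging

class ExperimentLogger(logging.LoggerAdapter):
  # Attach experiment coordinates to every log record.
  def process(self, msg, kwargs):
    return json.dumps({'message': msg, **self.extra}), kwargs

logger = ExperimentLogger(logging.getLogger('experiment'), {
    'project': 'flex',
    'benchmark': 'scanopt_usage',
    'response_id': 3,  # LLM response id (1-10)
    'fix_id': 0,       # code-fixing id (0-5)
})
logger.warning('build failed')  # one JSON line carrying all fields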

Extract project context (function impls, data structure defs, usages) and include them in prompts

Having this context will likely significantly improve the quality of targets generated.

The prompts should include:

  • function source code
  • function xrefs (sample callsites and usages)
  • data structure definitions, including which header path they're defined in.

The header file where a function prototype is declared is hard to extract accurately. To work around this, we may be able to include the function prototype in the target itself without knowing where the header is. The header files that contain the relevant type definitions used in the prototype will still need to be included.
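
A minimal sketch of that workaround: declare the prototype directly at the top of the generated target instead of guessing its header (the extern "C" toggle is an assumption about how C functions would be handled):

def inline_prototype(target_source: str, signature: str, is_c_function: bool) -> str:
  decl = f'extern "C" {signature};' if is_c_function else f'{signature};'
  return decl + '\n\n' + target_source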

Capture gcloud crashes and re-submit experiment

https://pantheon.corp.google.com/logs/query;query=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22oss-fuzz%22%0Aresource.labels.location%3D%22us-central1%22%0Aresource.labels.cluster_name%3D%22llm-experiment%22%0Aresource.labels.namespace_name%3D%22default%22%0Alabels.k8s-pod%2Fbatch_kubernetes_io%2Fcontroller-uid%3D%2266b327fe-6fdb-4a63-9c00-3346f27a5f9b%22%20severity%3E%3DWARNING;cursorTimestamp=2024-02-09T02:57:11.312515248Z;startTime=2024-02-09T02:33:01.547Z?e=-13802955&mods=logs_tg_prod&project=oss-fuzz

This is likely another cause of missing error messages in the code-fixing prompt.

We need to:

  1. Capture this error.
  2. Log the corresponding instance (i.e., project, benchmark, sample id).
  3. Re-submit the experiment.
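
A minimal sketch of the three tasks, with GcloudCrash and run_experiment as hypothetical names for the crash signature and the submission entry point:

import logging

class GcloudCrash(RuntimeError):
  # Hypothetical wrapper for the crash signature in the logs above.
  pass

def submit_with_resubmit(project, benchmark, sample_id, run_experiment,
                         max_resubmits=2):
  for _ in range(max_resubmits + 1):
    try:
      return run_experiment(project, benchmark, sample_id)  # task 3
    except GcloudCrash as e:
      # Tasks 1 and 2: capture the error and log the instance.
      logging.error('gcloud crashed (project=%s, benchmark=%s, sample=%s): %s',
                    project, benchmark, sample_id, e)
  logging.error('Giving up after %d re-submissions.', max_resubmits)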
