
neuzz's People

Contributors

dongdongshe · lylemi · sumanj


neuzz's Issues

Viewing files in an interpretable manner

I asked this question in a closed issue, but would rather just start a new discussion.

So I've successfully run NEUZZ, thanks to your help! Now I've got the crash information, the bitmaps, and the seeds.

How do I view each of them in an interpretable manner? The crash information is in a one dimensional array, while the bitmaps and seeds are in an ELF file.

Some questions about the part of Mutation and Retraining in the paper

Your paper is well written, and the idea is innovative.
I have some questions about the Mutation and Retraining part of Section V (Implementation).

  1. How did you get the 5,120 mutated inputs for a seed input? I couldn't understand why the number is 5,120.
  2. What does it mean that "Next, we randomly choose 100 output neurons representing 100 unexplored edges in the target program and generate 10,240 mutated inputs from two seeds"?
  3. In Algorithm 1, is gen_mutate executed twice for each m? Or do you just check the sign of the gradient for each byte to decide the direction of the mutation, so that it is only executed once?

Thank you.
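To make question 3 concrete, here is a minimal sketch (my own reading, not the authors' code) of gradient-sign byte mutation, assuming grad holds the gradient of a chosen output neuron with respect to the input bytes:

```python
import numpy as np

def gen_mutate_sketch(seed, grad, k):
    # Sort byte positions by gradient magnitude and mutate the top-k.
    # The sign of the gradient decides the direction of each mutation,
    # so a single pass covers both directions without running twice.
    top = np.argsort(-np.abs(grad))[:k]
    up, down = seed.copy(), seed.copy()
    for i in top:
        step = 1 if grad[i] > 0 else -1
        up[i] = np.clip(int(up[i]) + step, 0, 255)    # move along the gradient
        down[i] = np.clip(int(down[i]) - step, 0, 255)  # move against it
    return up, down

seed = np.full(8, 100, dtype=np.uint8)
grad = np.array([0.9, -0.5, 0.1, 0.0, 0.7, -0.2, 0.3, 0.05])
up, down = gen_mutate_sketch(seed, grad, k=3)
```

Under this reading, gen_mutate is executed once per m, producing mutations in both directions from the gradient sign.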

Not able to run new program

I'm trying to run neuzz on a new program and am currently setting it up.

I've compiled it using the given gcc command and currently have a folder named example with example.c and its compiled version example in it. I can't seem to run AFL on the example for this step:

Collect the training data by running AFL on the binary for a while (about an hour), then copy the queue folder to neuzz_in.

[image: training data]

I'm running neuzz in a Linux environment.

How do I generate more training data on the programs (readelf, etc.) and then view it?

Unable to open files

After switching to a GPU-based Azure machine, I've run into another problem I can't fix.

  1. Ubuntu 18.04
  2. Built neuzz using gcc -O3 -funroll-loops ./neuzz.c -o neuzz
  3. Followed the steps given by the README; all steps ran except for this line: echo performance | tee cpu*/cpufreq/scaling_governor

This is what my readelf folder looks like now:

[screenshot]

Whenever I run the two modules separately, this error is returned:

[screenshot]

This is the output in the other terminal:

[screenshot]

Setting up output directories...Spinning up the fork server...

Hello, when I run Neuzz it gets stuck like this (this is the Python module; the neuzz execution module has connected from ('127.0.0.1', 56218)):


num_index 4096 7505 small 2048 medium 4096 large 7505
mutation len: 7506
Checking CPU scaling governor...
You have 8 CPU cores and 13 runnable tasks (utilization: 162%).
System under apparent load, performance may be spotty.
Checking CPU core loadout...
Found a free CPU core, binding to #1.
Setting up output directories...Spinning up the fork server...

Do you have any suggestion?

Not able to understand

This is more a question about understanding the program. After running both

python nn.py ./readelf -a
./neuzz -i neuzz_in -o seeds -l 7506 ./readelf -a @@

The program successfully runs. What does the accuracy presented during and after each epoch concretely represent? I couldn't work it out from your paper.

Implementation Error in neuzz.c line 1726

In line 1726 in function dry_run, the code is

                    else if(fault = FAULT_TMOUT){

which I believe it should be

                    else if(fault == FAULT_TMOUT){

Still, it seems this did not affect the overall execution. I guess that might be because no generated input leads to a timeout?

Crash in nn.py if no new seeds are found

It looks like Neuzz will currently crash if no new edges are uncovered during a particular round, because new_seed_list will be empty.

Backtrace:

Epoch 100/100
1/2 [===========>..................] - ETA: 0s - batch: 0.0000e+00 - size: 8.0000 - loss: 0.2662 - accur_1: 0.78382.8247524899999983e-05
3/2 [====================================] - 0s 2ms/step - batch: 1.0000 - size: 10.6667 - loss: 0.2772 - accur_1: 0.7724
#######debug1
Traceback (most recent call last):
  File "nn.py", line 417, in <module>
    setup_server()
  File "nn.py", line 411, in setup_server
    gen_grad(data)
  File "nn.py", line 392, in gen_grad
    gen_mutate2(model, 500, data[:5] == b"train")
  File "nn.py", line 316, in gen_mutate2
    rand_seed1 = [new_seed_list[i] for i in np.random.choice(len(new_seed_list), edge_num, replace=True)]
  File "mtrand.pyx", line 894, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken
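Until this is fixed upstream, a minimal guard (names hypothetical, mirroring the structure around nn.py line 316) could fall back to the full seed corpus when the last round uncovered no new edges:

```python
import numpy as np

def pick_seeds(new_seed_list, seed_list, edge_num):
    # np.random.choice raises ValueError when the population is empty,
    # so fall back to the full corpus if no new edges were found this round.
    pool = new_seed_list if len(new_seed_list) > 0 else seed_list
    idx = np.random.choice(len(pool), edge_num, replace=True)
    return [pool[i] for i in idx]

# With an empty new_seed_list, seeds are drawn from the full corpus instead.
picked = pick_seeds([], ["seed_a", "seed_b"], 5)
```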

Neuzz and QEMU mode?

I wanted to ask if there is a way to use neuzz on binaries for which you do not have the source code.

Thank you.

Some issues about the function "get_adv2()" in nn.py

Hi!
I have been reading your paper and code recently; they're really good. But I have some difficulty understanding the following code in get_adv2() in nn.py:

    adv_list = []
    loss = layer_list[-2][1].output[:, f]
    grads = K.gradients(loss, model.input)[0]
    iterate = K.function([model.input], [loss, grads])

What does the 'loss' mean here? Does it mean the specific loss of the f-th output neuron?
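My current understanding is that the 'loss' is simply the activation of the f-th output neuron, so its gradient with respect to the input says which bytes most influence that one edge. A toy illustration of that idea with a linear "model" (numpy only, my own construction, not the Keras code itself):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))   # toy layer: 16 input bytes -> 4 edge neurons

def neuron_output(x, f):
    # Activation of the f-th output neuron of the toy linear model.
    return x @ W[:, f]

# For a linear model, the gradient of neuron f w.r.t. the input is W[:, f];
# verify against a finite-difference estimate, one input byte at a time.
x = rng.normal(size=16)
f, eps = 2, 1e-6
fd = np.array([(neuron_output(x + eps * np.eye(16)[i], f)
                - neuron_output(x, f)) / eps for i in range(16)])
```

In the real model the gradient is computed symbolically by K.gradients rather than by finite differences, but the quantity is the same: sensitivity of one edge neuron to each input byte.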

Infinite loop in splice_seed

If fl1 is two bytes or less, splice_seed will loop infinitely, because (l_diff - f_diff) >= 2 will never be true. To demonstrate the issue I pulled out the splice_seed function into its own file (attached) and then ran:

$ dd if=/dev/zero of=file1 bs=1 count=2 # Create a two-byte file
$ for i in `seq 2 100` ; do dd if=/dev/urandom of=file$i bs=1 count=$[ $RANDOM % 521 ] ; done # Create a bunch of other files with random data
$ python3 splice.py file1 file* | head
0 0
0 1
0 1
0 1
[...]

This does actually come up in practice, as I found when trying to reproduce the harfbuzz results:

moyix@isabella:~/git/neuzz/programs/harfbuzz$ ls -Sl seeds/ | tail
-rw------- 1 moyix moyix   41 Oct 17 17:39 id_0_000696
-rw------- 1 moyix moyix   30 Oct 17 17:47 id_0_001100
-rw------- 1 moyix moyix   16 Oct 17 17:44 id_0_000968
-rw------- 1 moyix moyix   15 Oct 17 17:53 id_0_001270
-rw------- 1 moyix moyix    8 Oct 17 18:36 id_1_001848_cov
-rw------- 1 moyix moyix    7 Oct 17 18:06 id_0_001567
-rw------- 1 moyix moyix    6 Oct 17 18:36 id_1_001849
-rw------- 1 moyix moyix    4 Oct 17 19:35 id_1_002991
-rw------- 1 moyix moyix    3 Oct 17 19:38 id_1_003024_cov
-rw------- 1 moyix moyix    2 Oct 17 21:01 id_2_003989
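The failure mode is easy to check in isolation: for a file of length n, the first and last differing positions can be at most n - 1 apart, so for a two-byte file the condition (l_diff - f_diff) >= 2 can never hold and the retry loop never exits. A sketch of the termination condition (variable names mine):

```python
def can_splice(fl1_len):
    # splice_seed keeps retrying until it finds two differing byte
    # positions at least 2 apart; the best case for a file of length n
    # is f_diff = 0 and l_diff = n - 1, a maximum spread of n - 1.
    max_spread = fl1_len - 1
    return max_spread >= 2

assert not can_splice(2)  # a two-byte seed can never satisfy the loop exit
```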

Errors in building

Dear Dongdong She,

I got the following warnings when building. How should I handle them?

$ gcc -O3 -funroll-loops ./neuzz.c -o neuzz
./neuzz.c: In function ‘copy_seeds’:
./neuzz.c:1820:26: warning: ‘%s’ directive writing up to 255 bytes into a region of size 127 [-Wformat-overflow=]
1820 | sprintf(src, "%s/%s", in_dir, de->d_name);
| ^~
In file included from /usr/include/stdio.h:867,
from ./neuzz.c:3:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:36:10: note: ‘__builtin___sprintf_chk’ output 2 or more bytes (assuming 257) into a destination of size 128
36 | return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
37 | __bos (__s), __fmt, __va_arg_pack ());
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./neuzz.c:1821:26: warning: ‘%s’ directive writing up to 255 bytes into a region of size 127 [-Wformat-overflow=]
1821 | sprintf(dst, "%s/%s", out_dir, de->d_name);
| ^~
In file included from /usr/include/stdio.h:867,
from ./neuzz.c:3:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:36:10: note: ‘__builtin___sprintf_chk’ output 2 or more bytes (assuming 257) into a destination of size 128
36 | return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
37 | __bos (__s), __fmt, __va_arg_pack ());
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


My environment is as follows:

  • OS: Ubuntu 20.04
  • Python: conda virtual environment python=2.7 on miniconda 3.7
  • gcc: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Trying to run on new program

I'm trying to run my new program on neuzz.

Ubuntu 18.04. My folders look like this inside the terminal, with neuzz_in containing all the training examples from running afl-fuzz on example.exe (the compiled program).

[screenshot]

When I run
python ./example.exe -a

and

./neuzz -i neuzz_in -o seeds -l 7506 ./example.exe -a @@

I get this error:

[screenshot]

In the other terminal I get this error:

[screenshot]

Issues with the implementation

I tried to launch a fuzzing campaign on tiff2pdf expecting to find vulnerabilities there, but ended up finding vulnerabilities in the fuzzer itself.

Actually, I was not able to fuzz tiff2pdf at all (with an initial corpus of 216 files whose sizes are around 200 bytes), since it crashed with a segmentation fault, as you can see at [6].

The crash we have is at [1], where a negative (hence, interpreted as unsigned, huge) length is passed to memcpy. Why does this happen? At line [2] the input is indeed sanitized by ignoring differences that are smaller than or equal to 2. However, len is of type size_t whereas del_loc is of type int, so len - del_loc is unsigned, and the check fails to reject locations that are higher than len.
The reason the location is even higher than the length is another bug: an uninitialized-memory error. The location table is allocated as int loc[10000]; but left uninitialized. Then the input is parsed at [3]. Unfortunately, at [4], loc is expected to hold 1024 or more valid entries, so it will read garbage if there are too few.
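The signed/unsigned mixup can be reproduced without the fuzzer: in C, subtracting an int from a size_t promotes the int to unsigned, so a "negative" result wraps to a huge value and the sanity check passes. Simulating 64-bit unsigned arithmetic in Python (my own demo, not the neuzz code):

```python
def c_unsigned_sub(a, b, bits=64):
    # Mimic C's size_t arithmetic: results wrap modulo 2**bits.
    return (a - b) % (1 << bits)

length, del_loc = 10, 12           # deletion location past the end of the input
diff = c_unsigned_sub(length, del_loc)
# diff is 2**64 - 2 rather than -2, so a check like `diff <= 2` never
# rejects it, and the huge value flows into memcpy's length argument.
```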

Moreover, trying to launch neuzz on the supplied testset would produce a mess in the directory, as shown at [5]. I haven't analyzed the cause of this yet. Note that I followed exactly the described steps to reproduce the results.

Kindly let me know if you require further information or help.

Best,
Andy Nguyen from ETH Zurich

[1] https://github.com/Dongdongshe/neuzz/blob/master/neuzz.c#L1318
[2] https://github.com/Dongdongshe/neuzz/blob/master/neuzz.c#L1312
[3] https://github.com/Dongdongshe/neuzz/blob/master/neuzz.c#L1871
[4] https://github.com/Dongdongshe/neuzz/blob/master/neuzz.c#L1310
[5] https://imgur.com/a/vvKB7HP

[6] Stack backtrace

Program received signal SIGSEGV, Segmentation fault.
__memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:522
522     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:522
#1  0x00005555555589a9 in gen_mutate ()
#2  0x000055555555ea05 in fuzz_lop.constprop ()
#3  0x000055555555f667 in start_fuzz ()
#4  0x0000555555555f5e in main ()
(gdb) ir
rax            0x5555558198c3   93824995137731
rbx            0x555555795a60   93824994597472
rcx            0x555555761047   93824994381895
rdx            0xfffffffffff4945d       -748451
rsi            0x555555817c0a   93824995130378
rdi            0x5555558198c3   93824995137731
rbp            0x33     0x33
rsp            0x7fffffffdea8   0x7fffffffdea8
r8             0x9      9
r9             0x555555762d00   93824994389248
r10            0x5555558166a4   93824995124900
r11            0x555555818e09   93824995134985
r12            0x55555578be20   93824994557472
r13            0xb2b08e9c       2997915292
r14            0x801    2049
r15            0xa67    2663
rip            0x7ffff7b72e4b   0x7ffff7b72e4b <__memmove_avx_unaligned_erms+891>
eflags         0x10286  [ PF SF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
---Type <return> to continue, or q <return> to quit---
fs             0x0      0
gs             0x0      0
(gdb) 

about the handling of crashes

Hello, I want to ask about the handling of crashes. How did you deal with these crashes? Are there any tools that can be used as a reference? Thank you!

Some details about Neuzz

Hi, Dongdong!

I have been reading your NEUZZ paper recently, and it is really well written. I have some questions about the details in this paper.

  1. "Furthermore, we only consider the edges that have been activated at least once in the training data."

    "Intuitively, in our setting, the goal of gradient-based guidance is to find inputs that will change the output of the final
    layer neurons corresponding to different edges from 0 to 1"

The goal of NEUZZ is to find as many new edges in the target program as possible, but when you build the NN model you only consider the edges that have been activated at least once in the training data, then select some output neurons to compute gradients to guide future mutation, and the final goal is to "change the output of the final layer neurons corresponding to different edges from 0 to 1". Since the final-layer output neurons represent edges that have already been found in the training data, what is the point of trying to change a specific output neuron from 0 to 1? (I mean, the edge represented by this neuron has already been found by some input in the training data, so why does NEUZZ try to find it again?) Why don't we also consider the edges that have not been activated in the training data and try to change their corresponding final-layer neuron outputs from 0 to 1? Wouldn't that mean we have successfully found inputs that trigger new edges not activated by the training data?

  2. "Next, we randomly choose 100 output neurons representing 100 unexplored edges in the target program"

What do the "unexplored edges" mean here? In the source code these edges are randomly chosen at every iteration. How is it ensured that these edges are indeed "unexplored edges"?

Thanks a lot!

Unable to execute

I'm currently trying to run neuzz on the readelf program and keep getting this error: Unable to execute programs/readelf/readelf

  1. Ubuntu 18.04

  2. Installed TensorFlow and Keras using pip

  3. Then I built neuzz using this line:

gcc -O3 -funroll-loops ./neuzz.c -o neuzz

Now, when I open two terminals and run

python nn.py ./readelf -a

in one of them, and

./neuzz -i neuzz_in -o seeds -l 7506 ./readelf -a @@

in another, I get an error that says that the readelf file is not executable. I've checked the properties of it, and it says that it is executable.
