eellak / build-recorder
License: GNU Lesser General Public License v2.1
@zvr So, the approach I am taking so far is basically to read /proc/<pid>/cmdline. The problem is that, according to this thread, /proc/<pid>/cmdline is OS-specific and won't work on all Unixes, for example AIX. So we either stick with it and limit ourselves to Linux, or we ditch the idea and instead read the syscall arguments of execve and execveat.
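The /proc approach mentioned above could be sketched roughly as follows. This is a hypothetical helper, not the project's actual code, and it is Linux-only; the function name and signature are made up for illustration.

```c
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical sketch of the /proc approach: read the NUL-separated
 * argument list of a process from /proc/<pid>/cmdline. Linux-only.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_cmdline(pid_t pid, char *out, size_t size)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%ld/cmdline", (long) pid);
	FILE *f = fopen(path, "r");
	if (f == NULL)
		return -1;
	size_t n = fread(out, 1, size, f);
	fclose(f);
	return (ssize_t) n;
}
```

Note that the arguments in the buffer are separated by NUL bytes, so the caller has to split them before logging.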
reproduce:
./build_recorder gcc test/f2.c test/f5.c -ot
hash.c:116:
return strdup(ZERO_FILE_HASH);
gdb output:
Starting program: /media/fvalasiad/storage/Projects/opensrc/GSoC/gsoc2022--build-recorder/build_recorder gcc test/f2.c test/f5.c -ot
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
[Detaching after fork from child process 9109]
malloc(): invalid size (unsorted)
Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
49 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1 0x00007ffff7a60536 in __GI_abort () at abort.c:79
#2 0x00007ffff7ab84f8 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7bc7b7f "%s\n")
at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007ffff7abff3a in malloc_printerr (str=str@entry=0x7ffff7bca4e0 "malloc(): invalid size (unsorted)")
at malloc.c:5389
#4 0x00007ffff7ac330c in _int_malloc (av=av@entry=0x7ffff7bf9a00 <main_arena>, bytes=bytes@entry=41) at malloc.c:3764
#5 0x00007ffff7ac4974 in __GI___libc_malloc (bytes=bytes@entry=41) at malloc.c:3078
#6 0x00007ffff7ac86eb in __GI___strdup (s=s@entry=0x5555555571c0 "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
at strdup.c:42
#7 0x0000555555555fec in get_file_hash (fname=<optimized out>) at src/hash.c:116
#8 0x0000555555555a12 in handle_syscall (exit=0x7fffffffe4d0, exit=0x7fffffffe4d0, entry=<optimized out>, pid=9109)
at src/tracer.c:210
#9 tracer_main (pid=9109, envp=0x7fffffffe4d0, envp@entry=0x7fffffffe6a8) at src/tracer.c:279
#10 0x0000555555555dcf in trace (envp=0x7fffffffe6a8, pid=<optimized out>) at src/tracer.c:325
#11 0x00005555555552a3 in main (argc=<optimized out>, argv=0x7fffffffe680, envp=0x7fffffffe6a8) at src/main.c:37
Autoconf should check that all required syscalls and parameters / arguments are available.
@zvr As previously discussed, open(2), openat(2), openat2(2), open_tree(2) and a couple of others are currently handled by converting their relative path argument to an absolute one. We've talked about how this is unsafe and that we should instead consider the following solution:
open(2): open a file descriptor in the tracer by opening /proc/<pid>/fd/<fd> and store it in FILE_INFO.
close(2): hash the file pointed to by the tracer's open file descriptor, store the result and close the file descriptor.
This has the advantage that we are safe from any kind of concurrent file access; it is also much more generic, and we won't have to create special handlers for the numerous variants of open(2), some of which are pretty complicated.
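The tracer-side half of this scheme could be sketched like this. The helper name is made up, and the demonstration reopens one of our own descriptors; in the real tracer, pid would be the tracee's. Linux-only, since it relies on the /proc/<pid>/fd magic links.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/* Sketch of the proposed open(2) handling: the tracer obtains its own
 * descriptor for a tracee's file by opening /proc/<pid>/fd/<fd>.
 * Returns the new descriptor, or -1 on error. */
int reopen_fd(pid_t pid, int fd)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%ld/fd/%d", (long) pid, fd);
	return open(path, O_RDONLY);
}
```

The descriptor returned here would be what gets stored in FILE_INFO and later hashed at close(2).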
Starting to look at how we should track rename(2), I can see that this approach would help us there too, because there are also rename(2) variants such as renameat(2) and renameat2(2). We could easily handle them all if we only had an array mapping the tracee's open file descriptors to our corresponding open file descriptors. Thus I suggest we make the switch.
Tracing files opened is not enough.
Typical code structure is:
Something similar can be seen at larger scale when a browser (or maven) downloads a file: it writes to filename.part and at successful completion renames filename.part to filename. Until now we would record the temporary file name, but not the final one.
A segfault is thrown when trying to compile the Linux kernel, at the strcmp(hash, finfo[i].hash) in find_finfo(3). It turns out finfo[i].hash is NULL while hash isn't.
Consider the following scenario:
open f1 = "/some/file" for writing
open f2 = "/some/file" for reading
close f1 (write)
close f2 (read)
The above sequence will throw a segfault. That's because f1->hash is only computed at close(2), yet the open of f2 will search the finfo array and stumble upon f1's entry, whose hash is still NULL.
This (bug $1) was definitely a bug and I've patched it, yet the problem mentioned still remains, so I believe there is another bug (bug $2). I will submit my solution to bug $1 soon so we can work from there.
Seeing how big projects can have numerous processes and files, linearly searching them on each and every system call and/or file encounter is probably a huge performance hit. Running build_recorder on a compilation of the Linux kernel for just 1 minute (having previously measured that a full compile takes 27 minutes on the configuration and hardware I am using, WITHOUT build_recorder wrapping it), we see a total of 1301 processes. That's O(1301) work upon every system call encountered WITHIN ONLY A MINUTE, even if that system call is ignored. I can only imagine the time required for a full compile, given that the number of processes increases as time goes by. We should definitely start testing alternative data structures beyond a plain array.
@zvr We should add a WIFSIGNALED(status) clause so we are notified when a child has been terminated by a signal. How should we handle this scenario? The same way WIFEXITED(status) is handled, or perhaps by introducing a new RDF statement?
Should we use one of the available tools (md5sum, sha1sum, sha256sum) or implement hashing in code?
Let's say I want to compile the Linux kernel. The way I'd probably go about it is:
./configure
make
In order for our program to work, the above would be changed to:
./configure
build_recorder make
Now, the issue I want to talk about is running make with more than one hardware thread. For example, my computer has a two-core CPU with 4 hardware threads, so to compile the Linux kernel I'd instead do:
./configure
make -j4
and equally, for build_recorder to work:
./configure
build_recorder make -j4
The above should work just fine, although I want to stress that it defeats the whole purpose of using multiple threads. I am not entirely sure how wait(2) is implemented, but logging the order in which it picks stopped processes, I noticed a pattern:
wait(&status) continuously yields the same process unless a new process is spawned or the process itself has exited.
In other words, what would be a multi-threaded application is transformed into a single-threaded one by build_recorder. This happens because a previously stopped process, once restarted with ptrace(PTRACE_SYSCALL, ...), stops at the next syscall as instructed before our loop even reaches the next wait(&status); I suspect wait(&status) then just pops off the top of the waiting list, and we end up with a bunch of tracees stuck on that list. This entire scenario really reminds me of the starvation issues scheduling algorithms face.
Proposal: we should spawn a thread running tracer_main(3) every time a new process is spawned. How does this solve our issue? It simply transfers the starvation problem to the kernel, which is already perfectly capable of dealing with that sort of issue.
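The core of the per-process idea is that each tracer thread waits only for its own tracee via waitpid(pid, ...) rather than a global wait(&status); a minimal sketch of that building block (helper name made up):

```c
#include <sys/types.h>
#include <sys/wait.h>

/* Sketch of the per-tracee idea: wait only for one specific child with
 * waitpid(pid, ...), instead of a single global wait(&status) that may
 * keep yielding the same runnable tracee. Returns 0 on success. */
int wait_for(pid_t pid, int *status)
{
	return waitpid(pid, status, 0) < 0 ? -1 : 0;
}
```

One caveat worth checking before committing to this plan: with ptrace, only the thread that attached to a tracee can generally wait on and control it, so each new tracee would have to be attached from its own tracer thread.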
@zvr We are running out of time; this is probably the last plan change for this GSoC, and we need to decide fast in order to implement it. It's probably also simpler than what we are currently doing (trying to handle all the tracees in a single thread). Waiting for your thoughts on this!
Because they are currently initialized inside next_finfo(3) and next_pinfo(3). The problem is that these two are called only when the corresponding arrays are out of memory.
There are two types of file access:
Given a (sub-)process that starts at Tstart and ends at Tend, a file access is between the time it's opened Topen and the time it's closed Tclose.
On a linear time line, we have:
-----Tstart ---------- Topen ---------- Tclose ---------- Tend -----
 t0        t1               t2                t3             t4
Where (in which timestamps, t0 to t4) should the recording of the filenames and file contents happen?
Seeing how each proc also has an environ, should we perhaps store it separately for each process?
Since we will be adding lots and lots of syscall handlers (a couple of rename(2) variants as well as others), we should restructure the code.
At the moment there is a bug where running certain programs (for example gcc test/testread.c -otest) makes build_recorder freeze eternally at src/tracer.c:273. Looking into it.
They are basically a string version of their corresponding pinfo and finfo array index. Do we care about keeping them?
Trying to compile the xbps package manager using make -j4 makes build_recorder freeze eternally at wait(&status).
Zero-length files currently produce a crash.
Upon closing, they are supposed to be hashed, but the hashing function tries to mmap(2)
the file, which fails for empty files.
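The failure mode can be reproduced in isolation: on Linux, mmap(2) rejects a zero-length mapping with EINVAL, so any mmap-based hashing routine has to special-case empty files. The helper below is a standalone demonstration, not the project's hashing code.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Try to mmap a whole file read-only.
 * Returns 0 on success, -errno on failure (empty file => -EINVAL). */
int try_map(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -errno;

	struct stat st;

	if (fstat(fd, &st) < 0) {
		close(fd);
		return -errno;
	}

	void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	int ret = (p == MAP_FAILED) ? -errno : 0;

	if (p != MAP_FAILED)
		munmap(p, st.st_size);
	close(fd);
	return ret;
}
```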
Solutions: either special-case mmap/madvise on zero-length files, or …
Every process and every file should get a unique identifier in the generated records.
For processes, since their pid is unique, I have been using:
sprintf(pbuf, "pid%ld", (long) pid);
But what about files?
Sometimes build tools use symbolic links as executables. For example, alternatives in Linux could link /usr/bin/cc to gcc or clang. Our tracer doesn't handle this at the moment.
@zvr Do we care?
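If we do care, resolving one level of the link before recording the executable would be a one-call affair via readlink(2). A sketch, with a made-up helper name; the /usr/bin/cc -> gcc case is the motivating example, not something this code depends on:

```c
#include <sys/types.h>
#include <unistd.h>

/* Resolve one level of a symlink, e.g. an alternatives-style
 * /usr/bin/cc -> gcc link, before recording the executable.
 * Returns the target's length, or -1 on error. */
ssize_t link_target(const char *path, char *buf, size_t size)
{
	ssize_t n = readlink(path, buf, size - 1);

	if (n >= 0)
		buf[n] = '\0';	/* readlink does not NUL-terminate */
	return n;
}
```

For chains of links, realpath(3) would resolve all levels at once; which behavior we want (record the link, the first target, or the final target) is the real design question.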
Following up on #55, one should be able to specify an output file instead of using the default build-recorder.out. The command-line syntax would be:
build-recorder [-o outputfile] command [command_arguments ...]
vector.h has some rather unique syntax due to preprocessor hacks; indent fails to recognize what's happening and messes up the file.
It seems that every closed fd is hashed. This should happen only for regular files, not everything.
We need to update the doc to include the new set of triplets we defined, such as b:rename.
Presence, location and library invocation of OpenSSL should be made by autoconf, not hard-wired.
Right now the configure script doesn't report when a library, header, function or anything else is missing; that is, the configuration process exits without an error.
So far we can't see the build system calling the compiler, the compiler calling the linker, and so on. ptrace(2) has an option to trace execve(2) calls; we should add that and modify the tracer accordingly.
Although for each (sub-)process created the executable file is correctly recorded, this does not happen for the main process (the one given in the command line).
This can be easily verified, as the first file entry in every output should have been this executable.
Running the first test in the tests directory, it reports processes:
p0 a b:process .
p0 b:cmd "gcc -c f1.c" .
p1 a b:process .
p1 b:cmd "/usr/lib/gcc/x86_64-linux-gnu/11/cc1 -quiet -imultiarch x86_64-linux-gnu f1.c -quiet -dumpbase f1.c -dumpbase-ext .c -mtune=generic -march=x86-64 -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -fcf-protection -o /tmp/ccDu6HTo.s" .
p2 a b:process .
p2 b:cmd "as --64 -o f1.o /tmp/ccDu6HTo.s" .
but no relationship between them (i.e. p0 forked p1)
Code performing the hashing of file contents uses OpenSSL 1.x functions.
Therefore:
Indent is configured to use tabs of size 8, yet it actually uses spaces for "outer scope" statements (that is, statements at indentation depth 1). For example, here is how it indents the following code, taken directly from record.c lines 142 to 147:
void
record_process_env(char *poutname, char **envp)
{
    for (char **ep = envp; *ep != NULL; ep++)
        record_triple(poutname, "b:env", *ep, true);
}
Here for has 4 spaces instead of a tab, yet record_triple actually uses a tab. This causes issues in my vim editor and pretty much every other editor I tried (VSCode, Kate, Vi, Neovim). In my understanding, the correct way to handle this would be:
void
record_process_env(char *poutname, char **envp)
{
        for (char **ep = envp; *ep != NULL; ep++)
                record_triple(poutname, "b:env", *ep, true);
}
That is, a tab in front of for and two tabs in front of record_triple.
@zvr, share your thoughts on this. Is it expected?
It's much easier and cleaner to use the one-line SPDX license tag instead of text that has to be parsed.
Until now, the system records files that have been open(2)-ed, for reading or writing, by the process (or any sub-process). This approach cannot record files that are used but not opened, as the case of rename(2) shows. Furthermore, having per-process file tables completely obfuscates the fact that the same file is being used by more than one process.
As an example, consider the simple case of gcc foo.c, which roughly translates into a series of subprocesses: gcc -S foo.c -o foo.S, as -c foo.S -o foo.o, and ld foo.o -o a.out. The intermediate files foo.S and foo.o are re-used, created by one process and read by the next. However, they are expressed as different file entries in the output, even though they are the same file.
As another example, consider the case of gcc foo.c bar.c where both files #include a common header file (either local, foo.h, or system-wide, /usr/include/stdio.h). Since each file is initially compiled in a separate subprocess, there will be two entries for the same header file in the final result.
In order to overcome this, a single run-wide file map has to be kept. For each process, the list of opened files will still be maintained, but the entries of actual file info (like path and hash) have to be in a common structure that can have element re-use across processes.
The last step needed in order to start testing.
CFLAGS is a user variable and should not be set by automake. If the value is needed for the program compilation, it should be in AM_CFLAGS; if it is a user preference, it should be specified when running configure.
Running the first test in the tests directory, it reports that every system file is also written to:
p0 a b:process .
p0 b:cmd "gcc -c f1.c" .
p0 b:start "2022-09-30T07:58:43Z" .
f0 a file .
f0 b:name "/etc/ld.so.cache" .
f0 b:hash "c8ce857e31ed7dc80099f66cab1bd7c1ea130f70" .
p0 b:reads f0 .
p0 b:writes f0 .
f1 a file .
f1 b:name "/lib/x86_64-linux-gnu/libc.so.6" .
f1 b:hash "db27b09035aef1f4fe19dbc116eda8a654bd1c07" .
p0 b:reads f1 .
p0 b:writes f1 .
This is obviously not correct.
@zvr After some profiling, it turns out a massive performance hit occurs due to our "file entries re-usage". While compiling the xbps package manager, I made the following measurements (all in CPU time):
without build_recorder wrapping it: 6.934s
with build_recorder wrapping it: 14.988s
That's an overhead of 8.054s, of which 5.180s is due to the get_file_hash call at src/tracer.c:255.
This raises the question: do we really need to search using the hash as well? Absolute paths are unique to begin with, and we are already tracking rename(2) calls. This, paired with the fact that the finfo array is sorted in ascending order with respect to time, guarantees that the first match of a backward search is the file we are looking for. Is my logic flawed? Is the hash check important?
./build_recorder gcc test/f2.c test/f5.c -ot produces build-recorder.out. We can then run:
cat build-recorder.out | grep "b:name"
to see the files detected, and we get:
b:name "/etc/ld.so.cache" .
b:name "/usr/lib/libm.so.6" .
b:name "/usr/lib/libc.so.6" .
b:name "/tmp/ccTihtf3.s" .
b:name "/tmp/ccTihtf3.s" .
b:name "/tmp/cctcPs32.o" .
b:name "test/f0.h" .
b:name "/tmp/cctcPs32.o" .
b:name "/tmp/ccTihtf3.s" .
b:name "/tmp/cctcPs32.o" .
b:name "/tmp/cc10vqP1.o" .
b:name "/usr/include/stdc-predef.h" .
b:name "/tmp/cc10vqP1.o" .
b:name "/tmp/ccmz0vI2.res" .
b:name "/tmp/ccTihtf3.s" .
b:name "/tmp/ccmz0vI2.res" .
We can see that f2.c and f5.c are missing; surprisingly enough, though, the header that f2.c includes, f0.h, is detected.
How will we name the output file generated by build_recorder? Will we supply an option used to change the default? Will we even have a default?
I can see the if statement inside main() not using tabs. Same for tracer_main: some if statements have their bodies tabbed while others don't. I'm unaware whether this is a bug or intended behavior.
I suspect it's because there are more than the two ways (the ones which we track) one can "open" a file descriptor. For example, if a program were to call socket(2) to obtain a new socket and later called close(2) to close it, our code would crash because no path was ever associated with the file descriptor being closed.
Checking for NULL is a valid option, but I am afraid it might hide potential bugs in the future (paths that should be tracked actually aren't). Nevertheless, this is what I recommend we do, and we shouldn't worry about this problem, since switching to the model we previously described (opening the files upon finding the open(2) calls) will make it disappear.
When rename(2) is used, its path arguments accept relative paths. To handle this, we once again fetch their absolute paths using the corresponding absolutepath(3) function.
When recording the rename I've so far only recorded the absolute paths; do we care about recording the relative ones too? And what would the format be in that scenario?
So far we have:
p1 b:rename rename1
rename1 b:from <abspath1>
rename1 b:to <abspath2>
Perhaps we could format it like this:
p1 b:rename rename1
rename1 b:from <path1>
rename1 b:to <path2>
rename1 b:from_abspath <abspath1>
rename1 b:to_abspath <abspath2>
Currently openat(2) processing ignores the first argument. Until it is handled correctly, it should be disabled, rather than produce wrong data.
Right now the environment is not printed as RDF.
@zvr We've discussed before what happens if a build process changes its cwd: relative paths would no longer work, since build_recorder doesn't change its own cwd accordingly. This was another bug I found when trying to run build_recorder while compiling the xbps package manager.
My early proposal was to use /proc/<pid>/cwd as a prefix when resolving relative paths, and I still have faith in that solution. However, I have to point out that proposal #90, about changing the way open(2) is traced, actually makes the problem disappear, since build_recorder would cease to interact with relative paths and instead only handle absolute paths.
Thus I give extra credit that #90 should be merged.
This issue was inspired by strace(1); perhaps we should adopt it here too in the future.
Automation needed for formatting the code in a consistent way.
In my understanding:
sync(2): causes all pending modifications to filesystem metadata and cached file data to be written to the underlying filesystems.
syncfs(2): like sync(), but synchronizes just the filesystem containing the file referred to by the open file descriptor fd. Linux-specific, but it has the same guarantee sync(2) does on Linux.
fsync(2): POSIX. Again, unlike POSIX though, on Linux the system call won't return unless the sync actually happens.
fdatasync(2): fsync(2) with lazy evaluation built in.
Obviously what was mentioned above isn't the full picture; for example, it implies that syncfs(2) and fsync(2) are equivalent, which is far from true. But for our purposes they are to be treated mostly the same.
So, given that the above statements are correct, my proposal is:
sync(2): we rehash all open files with pending write operations and store them as new files, almost as if they were close(2)-ed and re-open(2)-ed.
syncfs(2): we only rehash the open file identified by fd. Basically what I described for sync(2), but for a single file.
fsync(2): same as for syncfs(2).
fdatasync(2): this one is quite tricky, since it doesn't flush the data unless they are about to be read. This is a problem for us, since we cannot possibly create the new file with the new hash upon encountering this system call. The hash checker that I proposed to remove due to its very poor performance actually addresses this, though. What's the chance a compiler uses any of this?
It's also worth mentioning that, in general, the hash checker addresses the problems this entire issue tries to solve by tracking the sync family of system calls: if one were to sync a file in any way described above, the hash upon the next read would be updated and a new file would be created as a result. Oh, the price we pay for performance.
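The proposed per-syscall treatment can be summed up in a dispatch sketch. rehash_all_files() and rehash_fd() are stand-ins for the rehash-and-record-as-new-file step described above and exist only for this illustration; the syscall numbers come from <sys/syscall.h> on Linux.

```c
#include <sys/syscall.h>

static int all_rehashed;
static int last_fd_rehashed = -1;

/* Stand-in for: rehash all open files with pending writes. */
static void rehash_all_files(void)
{
	all_rehashed = 1;
}

/* Stand-in for: rehash the single file identified by fd. */
static void rehash_fd(int fd)
{
	last_fd_rehashed = fd;
}

/* Hypothetical dispatcher for the sync family, per the proposal. */
void handle_sync_family(long nr, int fd)
{
	switch (nr) {
	case SYS_sync:
		rehash_all_files();
		break;
	case SYS_syncfs:
	case SYS_fsync:
		rehash_fd(fd);
		break;
	case SYS_fdatasync:
		/* data may stay unflushed until read; no rehash here */
		break;
	}
}
```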