eellak / build-recorder
License: GNU Lesser General Public License v2.1
@zvr So, the approach I am taking so far is basically to read /proc/<pid>/cmdline. The problem is that, according to this thread, /proc/<pid>/cmdline is OS-specific and won't work on all Unixes, for example AIX. So we either stick with it and limit ourselves to Linux, or we ditch the idea and instead read the syscall arguments of execve and execveat.
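The /proc approach mentioned above could be sketched roughly as follows. This is a hypothetical helper, not the project's actual code, and it is Linux-only; the function name and signature are made up for illustration.

```c
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical sketch of the /proc approach: read the NUL-separated
 * argument list of a process from /proc/<pid>/cmdline. Linux-only.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_cmdline(pid_t pid, char *out, size_t size)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%ld/cmdline", (long) pid);
	FILE *f = fopen(path, "r");
	if (f == NULL)
		return -1;
	size_t n = fread(out, 1, size, f);
	fclose(f);
	return (ssize_t) n;
}
```

Note that the arguments in the buffer are separated by NUL bytes, so the caller has to split them before logging.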
reproduce:
./build_recorder gcc test/f2.c test/f5.c -ot
hash.c:116:
return strdup(ZERO_FILE_HASH);
gdb output:
Starting program: /media/fvalasiad/storage/Projects/opensrc/GSoC/gsoc2022--build-recorder/build_recorder gcc test/f2.c test/f5.c -ot
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
[Detaching after fork from child process 9109]
malloc(): invalid size (unsorted)
Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
49 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1 0x00007ffff7a60536 in __GI_abort () at abort.c:79
#2 0x00007ffff7ab84f8 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7bc7b7f "%s\n")
at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007ffff7abff3a in malloc_printerr (str=str@entry=0x7ffff7bca4e0 "malloc(): invalid size (unsorted)")
at malloc.c:5389
#4 0x00007ffff7ac330c in _int_malloc (av=av@entry=0x7ffff7bf9a00 <main_arena>, bytes=bytes@entry=41) at malloc.c:3764
#5 0x00007ffff7ac4974 in __GI___libc_malloc (bytes=bytes@entry=41) at malloc.c:3078
#6 0x00007ffff7ac86eb in __GI___strdup (s=s@entry=0x5555555571c0 "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
at strdup.c:42
#7 0x0000555555555fec in get_file_hash (fname=<optimized out>) at src/hash.c:116
#8 0x0000555555555a12 in handle_syscall (exit=0x7fffffffe4d0, exit=0x7fffffffe4d0, entry=<optimized out>, pid=9109)
at src/tracer.c:210
#9 tracer_main (pid=9109, envp=0x7fffffffe4d0, envp@entry=0x7fffffffe6a8) at src/tracer.c:279
#10 0x0000555555555dcf in trace (envp=0x7fffffffe6a8, pid=<optimized out>) at src/tracer.c:325
#11 0x00005555555552a3 in main (argc=<optimized out>, argv=0x7fffffffe680, envp=0x7fffffffe6a8) at src/main.c:37
Autoconf should check that all required syscalls and parameters / arguments are available.
@zvr As previously discussed, open(2), openat(2), openat2(2), open_tree(2) and a couple of others are currently handled by converting their relative path argument to an absolute one. We've talked about how this is unsafe and that we should instead consider the following solution:
open(2): open a file descriptor in the tracer by opening /proc/<pid>/fd/<fd> and store it in FILE_INFO.
close(2): hash the file pointed to by the tracer's open file descriptor, store the result and close the file descriptor.
This has the advantage that we are safe from any kind of concurrent file access; it is also much more generic, and we won't have to create special handlers for the numerous variants of open(2), some of which are pretty complicated.
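The tracer-side half of this scheme could be sketched like this. The helper name is made up, and the demonstration reopens one of our own descriptors; in the real tracer, pid would be the tracee's. Linux-only, since it relies on the /proc/<pid>/fd magic links.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/* Sketch of the proposed open(2) handling: the tracer obtains its own
 * descriptor for a tracee's file by opening /proc/<pid>/fd/<fd>.
 * Returns the new descriptor, or -1 on error. */
int reopen_fd(pid_t pid, int fd)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%ld/fd/%d", (long) pid, fd);
	return open(path, O_RDONLY);
}
```

The descriptor returned here would be what gets stored in FILE_INFO and later hashed at close(2).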
Starting to look at how we should track rename(2), I can see that this approach would help us there too, because there are also rename(2) variants such as renameat(2) and renameat2(2). We could easily handle them all if we only had an array mapping the tracee's open file descriptors to our corresponding open file descriptors. Thus I suggest we make the switch.
Tracing files opened is not enough.
Typical code structure is:
Something similar can be seen at larger scale when a browser (or maven) downloads a file: it writes to filename.part and at successful completion renames filename.part to filename. Until now we would record the temporary file name, but not the final one.
A segfault is thrown when trying to compile the Linux kernel, at the strcmp(hash, finfo[i].hash) in find_finfo(3). It turns out finfo[i].hash is NULL while hash isn't.
Consider the following scenario:
open f1 = "/some/file" for writing
open f2 = "/some/file" for reading
close f1 (write)
close f2 (read)
The above sequence will throw a segfault. That's because f1->hash is only computed at close(2), yet the open of f2 will search the finfo array and stumble upon f1's entry, whose hash is still NULL.
This (bug $1) was definitely a bug and I've patched it, yet the problem mentioned still remains, so I believe there is another bug (bug $2). I will submit my solution to bug $1 soon so we can work from there.
Seeing how big projects can have numerous processes and files, linearly searching them on each and every system call and/or file encounter is probably a huge performance hit. Running build_recorder on a compilation of the Linux kernel for just 1 minute (having previously measured that a full compile takes 27 minutes on the configuration and hardware I am using, WITHOUT build_recorder wrapping it), we see a total of 1301 processes. That's O(1301) work upon every system call encountered WITHIN ONLY A MINUTE, even if that system call is ignored. I can only imagine the time required for a full compile, given that the number of processes increases as time goes by. We should definitely start testing alternative data structures beyond a plain array.
@zvr We should add a WIFSIGNALED(status) clause so we are notified when a child has been terminated by a signal. How should we handle this scenario? The same way WIFEXITED(status) is handled, or perhaps by introducing a new RDF statement?
Should we use one of the available tools (md5sum, sha1sum, sha256sum) or implement hashing in code?
Let's say I want to compile the Linux kernel. The way I'd probably go about it is:
./configure
make
In order for our program to work, the above would be changed to:
./configure
build_recorder make
Now, the issue I want to talk about is running make with more than one hardware thread. For example, my computer has a two-core CPU with 4 hardware threads, so to compile the Linux kernel I'd instead do:
./configure
make -j4
and equally, for build_recorder to work:
./configure
build_recorder make -j4
The above should work just fine, although I want to stress that it defeats the whole purpose of using multiple threads. I am not entirely sure how wait(2) is implemented, but logging the order in which it picks stopped processes, I noticed a pattern:
wait(&status) continuously yields the same process unless a new process is spawned or the process itself has exited.
In other words, what would be a multi-threaded application is transformed into a single-threaded one by build_recorder. This happens because a previously stopped process, once restarted with ptrace(PTRACE_SYSCALL, ...), stops at the next syscall as instructed before our loop even reaches the next wait(&status); I suspect wait(&status) then just pops off the top of the waiting list, and we end up with a bunch of tracees stuck on that list. This entire scenario really reminds me of the starvation issues scheduling algorithms face.
Proposal: we should spawn a thread running tracer_main(3) every time a new process is spawned. How does this solve our issue? It simply transfers the starvation problem to the kernel, which is already perfectly capable of dealing with that sort of issue.
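The core of the per-process idea is that each tracer thread waits only for its own tracee via waitpid(pid, ...) rather than a global wait(&status); a minimal sketch of that building block (helper name made up):

```c
#include <sys/types.h>
#include <sys/wait.h>

/* Sketch of the per-tracee idea: wait only for one specific child with
 * waitpid(pid, ...), instead of a single global wait(&status) that may
 * keep yielding the same runnable tracee. Returns 0 on success. */
int wait_for(pid_t pid, int *status)
{
	return waitpid(pid, status, 0) < 0 ? -1 : 0;
}
```

One caveat worth checking before committing to this plan: with ptrace, only the thread that attached to a tracee can generally wait on and control it, so each new tracee would have to be attached from its own tracer thread.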
@zvr We are running out of time; this is probably the last plan change for this GSoC, and we need to decide fast in order to implement it. It's probably also simpler than what we are currently doing (trying to handle all the tracees in a single thread). Waiting for your thoughts on this!
Because they are currently initialized inside next_finfo(3) and next_pinfo(3). The problem is that these two are called only when the corresponding arrays are out of memory.
There are two types of file access:
Given a (sub-)process that starts at Tstart and ends at Tend, a file access is between the time it's opened Topen and the time it's closed Tclose.
On a linear time line, we have:
-----Tstart ---------- Topen ---------- Tclose ---------- Tend -----
 t0        t1               t2                t3             t4
Where (in which timestamps, t0 to t4) should the recording of the filenames and file contents happen?
Seeing how each proc also has an environ, should we perhaps store it separately for each process?
Since we will be adding lots and lots of syscall handlers (a couple of rename(2) variants as well as others), we should restructure the code.
At the moment there is a bug where running certain programs (for example gcc test/testread.c -otest) makes build_recorder freeze eternally at src/tracer.c:273. Looking into it.
They are basically a string version of their corresponding pinfo and finfo array index. Do we care about keeping them?
Trying to compile the xbps package manager using make -j4 makes build_recorder freeze eternally at wait(&status).
Zero-length files currently produce a crash.
Upon closing, they are supposed to be hashed, but the hashing function tries to mmap(2)
the file, which fails for empty files.
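The failure mode can be reproduced in isolation: on Linux, mmap(2) rejects a zero-length mapping with EINVAL, so any mmap-based hashing routine has to special-case empty files. The helper below is a standalone demonstration, not the project's hashing code.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Try to mmap a whole file read-only.
 * Returns 0 on success, -errno on failure (empty file => -EINVAL). */
int try_map(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -errno;

	struct stat st;

	if (fstat(fd, &st) < 0) {
		close(fd);
		return -errno;
	}

	void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	int ret = (p == MAP_FAILED) ? -errno : 0;

	if (p != MAP_FAILED)
		munmap(p, st.st_size);
	close(fd);
	return ret;
}
```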
Solutions: either special-case mmap/madvise on zero-length files, or …
Every process and every file should get a unique identifier in the generated records.
For processes, since their pid is unique, I have been using:
sprintf(pbuf, "pid%ld", (long) pid);
But what about files?
Sometimes build tools use symbolic links as executables. For example, alternatives in Linux could link /usr/bin/cc to gcc or clang. Our tracer doesn't handle this at the moment.
@zvr Do we care?
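If we do care, resolving one level of the link before recording the executable would be a one-call affair via readlink(2). A sketch, with a made-up helper name; the /usr/bin/cc -> gcc case is the motivating example, not something this code depends on:

```c
#include <sys/types.h>
#include <unistd.h>

/* Resolve one level of a symlink, e.g. an alternatives-style
 * /usr/bin/cc -> gcc link, before recording the executable.
 * Returns the target's length, or -1 on error. */
ssize_t link_target(const char *path, char *buf, size_t size)
{
	ssize_t n = readlink(path, buf, size - 1);

	if (n >= 0)
		buf[n] = '\0';	/* readlink does not NUL-terminate */
	return n;
}
```

For chains of links, realpath(3) would resolve all levels at once; which behavior we want (record the link, the first target, or the final target) is the real design question.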
Following up on #55, one should be able to specify an output file instead of using the default build-recorder.out. The command-line syntax would be:
build-recorder [-o outputfile] command [command_arguments ...]
vector.h has some rather unique syntax due to preprocessor hacks; indent fails to recognize what's happening and messes up the file.
It seems that every closed fd is hashed. This should happen only for regular files, not everything.
We need to update the doc to include the new set of triplets we defined, such as b:rename.
Presence, location and library invocation of OpenSSL should be made by autoconf, not hard-wired.
Right now the configure script doesn't report when a library, header, function or anything else is missing; that is, the configuration process exits without an error.
So far we can't see the build system calling the compiler, the compiler calling the linker, and so on. ptrace(2) has an option to trace execve(2) calls; we should add that and modify the tracer accordingly.
Although for each (sub-)process created the executable file is correctly recorded, this does not happen for the main process (the one given in the command line).
This can be easily verified, as the first file entry in every output should have been this executable.
Running the first test in the tests directory, it reports processes:
p0 a b:process .
p0 b:cmd "gcc -c f1.c" .
p1 a b:process .
p1 b:cmd "/usr/lib/gcc/x86_64-linux-gnu/11/cc1 -quiet -imultiarch x86_64-linux-gnu f1.c -quiet -dumpbase f1.c -dumpbase-ext .c -mtune=generic -march=x86-64 -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -fcf-protection -o /tmp/ccDu6HTo.s" .
p2 a b:process .
p2 b:cmd "as --64 -o f1.o /tmp/ccDu6HTo.s" .
but no relationship between them (i.e. p0 forked p1)
Code performing the hashing of file contents uses OpenSSL 1.x functions.
Therefore:
Indent is configured to use tabs of size 8, yet it actually uses spaces for "outer scope" statements (that is, statements at indentation depth 1). For example, here is how it indents the following code, taken directly from record.c lines 142 to 147:
void
record_process_env(char *poutname, char **envp)
{
    for (char **ep = envp; *ep != NULL; ep++)
        record_triple(poutname, "b:env", *ep, true);
}
Here for has 4 spaces instead of a tab, yet record_triple actually uses a tab. This causes issues in my vim editor and pretty much every other editor I tried (VSCode, Kate, Vi, Neovim). In my understanding, the correct way to handle this would be:
void
record_process_env(char *poutname, char **envp)
{
        for (char **ep = envp; *ep != NULL; ep++)
                record_triple(poutname, "b:env", *ep, true);
}
That is, a tab in front of for and two tabs in front of record_triple.
@zvr, share your thoughts on this. Is it expected?
It's much easier and cleaner to use the one-line SPDX license tag instead of text that has to be parsed.
Until now, the system records files that have been open(2)-ed, for reading or writing, by the process (or any sub-process). This approach cannot record files that are used but not opened, as the case of rename(2) shows. Furthermore, having per-process file tables completely obfuscates the fact that the same file is being used by more than one process.
As an example, consider the simple case of gcc foo.c, which roughly translates into a series of subprocesses: gcc -S foo.c -o foo.S, as -c foo.S -o foo.o, and ld foo.o -o a.out. The intermediate files foo.S and foo.o are re-used, created by one process and read by the next. However, they are expressed as different file entries in the output, even though they are the same file.
As another example, consider the case of gcc foo.c bar.c where both files #include a common header file (either local, foo.h, or system-wide, /usr/include/stdio.h). Since each file is initially compiled in a separate subprocess, there will be two entries for the same header file in the final result.
In order to overcome this, a single run-wide file map has to be kept. For each process, the list of opened files will still be maintained, but the entries of actual file info (like path and hash) have to be in a common structure that can have element re-use across processes.
The last step needed in order to start testing.
CFLAGS is a user variable and should not be set by automake. If the value is needed for the program compilation, it should be in AM_CFLAGS; if it is a user preference, it should be specified when running configure.
Running the first test in the tests directory, it reports that every system file is also written to:
p0 a b:process .
p0 b:cmd "gcc -c f1.c" .
p0 b:start "2022-09-30T07:58:43Z" .
f0 a file .
f0 b:name "/etc/ld.so.cache" .
f0 b:hash "c8ce857e31ed7dc80099f66cab1bd7c1ea130f70" .
p0 b:reads f0 .
p0 b:writes f0 .
f1 a file .
f1 b:name "/lib/x86_64-linux-gnu/libc.so.6" .
f1 b:hash "db27b09035aef1f4fe19dbc116eda8a654bd1c07" .
p0 b:reads f1 .
p0 b:writes f1 .
This is obviously not correct.
@zvr After some profiling, it turns out a massive performance hit occurs due to our "file entries re-usage". While compiling the xbps package manager, I made the following measurements (all in CPU time):
without build_recorder wrapping it: 6.934s
with build_recorder wrapping it: 14.988s
That's an overhead of 8.054s, of which 5.180s is due to the get_file_hash call at src/tracer.c:255.
This raises the question: do we really need to search using the hash as well? Absolute paths are unique to begin with, and we are already tracking rename(2) calls. This, paired with the fact that the finfo array is sorted in ascending order with respect to time, guarantees that the first match of a backward search is the file we are looking for. Is my logic flawed? Is the hash check important?
./build_recorder gcc test/f2.c test/f5.c -ot produces build-recorder.out. We can then run:
cat build-recorder.out | grep "b:name"
to see the files detected, and we get:
b:name "/etc/ld.so.cache" .
b:name "/usr/lib/libm.so.6" .
b:name "/usr/lib/libc.so.6" .
b:name "/tmp/ccTihtf3.s" .
b:name "/tmp/ccTihtf3.s" .
b:name "/tmp/cctcPs32.o" .
b:name "test/f0.h" .
b:name "/tmp/cctcPs32.o" .
b:name "/tmp/ccTihtf3.s" .
b:name "/tmp/cctcPs32.o" .
b:name "/tmp/cc10vqP1.o" .
b:name "/usr/include/stdc-predef.h" .
b:name "/tmp/cc10vqP1.o" .
b:name "/tmp/ccmz0vI2.res" .
b:name "/tmp/ccTihtf3.s" .
b:name "/tmp/ccmz0vI2.res" .
We can see that f2.c and f5.c are missing; surprisingly enough, though, the header that f2.c includes, f0.h, is detected.
How will we name the output file generated by build_recorder? Will we supply an option used to change the default? Will we even have a default?
I can see the if statement inside main() not using tabs. Same for tracer_main: some if statements have their bodies tabbed while others don't. I'm unaware whether this is a bug or intended behavior.
I suspect it's because there are more than the two ways (the ones which we track) one can "open" a file descriptor. For example, if a program were to call socket(2) to obtain a new socket and later called close(2) to close it, our code would crash because no path was ever associated with the file descriptor being closed.
Checking for NULL is a valid option, but I am afraid it might hide potential bugs in the future (paths that should be tracked actually aren't). Nevertheless, this is what I recommend we do, and we shouldn't worry about this problem, since switching to the model we previously described (opening the files upon finding the open(2) calls) will make it disappear.
When rename(2) is used, its path arguments accept relative paths. To handle this, we once again fetch their absolute paths using the corresponding absolutepath(3) function.
When recording the rename I've so far only recorded the absolute paths; do we care about recording the relative ones too? And what would the format be in that scenario?
So far we have:
p1 b:rename rename1
rename1 b:from <abspath1>
rename1 b:to <abspath2>
Perhaps we could format it like this:
p1 b:rename rename1
rename1 b:from <path1>
rename1 b:to <path2>
rename1 b:from_abspath <abspath1>
rename1 b:to_abspath <abspath2>
Currently openat(2) processing ignores the first argument. Until it is handled correctly, it should be disabled, rather than produce wrong data.
Right now the environment is not printed as RDF.
@zvr We've discussed before what happens if a build process changes its cwd: relative paths would no longer work, since build_recorder doesn't change its own cwd accordingly. This was another bug I found when trying to run build_recorder while compiling the xbps package manager.
My early proposal was to use /proc/<pid>/cwd as a prefix when resolving relative paths, and I still have faith in that solution. However, I have to point out that proposal #90, about changing the way open(2) is traced, actually makes the problem disappear, since build_recorder would cease to interact with relative paths and instead only handle absolute paths.
Thus I give extra credit that #90 should be merged.
This issue was inspired by strace(1); perhaps we should adopt it here too in the future.
Automation needed for formatting the code in a consistent way.
In my understanding:
sync(2): causes all pending modifications to filesystem metadata and cached file data to be written to the underlying filesystems.
syncfs(2): like sync(), but synchronizes just the filesystem containing the file referred to by the open file descriptor fd. Linux-specific, but it has the same guarantee sync(2) does on Linux.
fsync(2): POSIX. Again, unlike POSIX though, on Linux the system call won't return unless the sync actually happens.
fdatasync(2): fsync(2) with lazy evaluation built in.
Obviously what was mentioned above isn't the full picture; for example, it implies that syncfs(2) and fsync(2) are equivalent, which is far from true. But for our purposes they are to be treated mostly the same.
So, given that the above statements are correct, my proposal is:
sync(2): we rehash all open files with pending write operations and store them as new files, almost as if they were close(2)-ed and re-open(2)-ed.
syncfs(2): we only rehash the open file identified by fd. Basically what I described for sync(2), but for a single file.
fsync(2): same as for syncfs(2).
fdatasync(2): this one is quite tricky, since it doesn't flush the data unless they are about to be read. This is a problem for us, since we cannot possibly create the new file with the new hash upon encountering this system call. The hash checker that I proposed to remove due to its very poor performance actually addresses this, though. What's the chance a compiler uses any of this?
It's also worth mentioning that, in general, the hash checker addresses the problems this entire issue tries to solve by tracking the sync family of system calls: if one were to sync a file in any way described above, the hash upon the next read would be updated and a new file would be created as a result. Oh, the price we pay for performance.
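The proposed per-syscall treatment can be summed up in a dispatch sketch. rehash_all_files() and rehash_fd() are stand-ins for the rehash-and-record-as-new-file step described above and exist only for this illustration; the syscall numbers come from <sys/syscall.h> on Linux.

```c
#include <sys/syscall.h>

static int all_rehashed;
static int last_fd_rehashed = -1;

/* Stand-in for: rehash all open files with pending writes. */
static void rehash_all_files(void)
{
	all_rehashed = 1;
}

/* Stand-in for: rehash the single file identified by fd. */
static void rehash_fd(int fd)
{
	last_fd_rehashed = fd;
}

/* Hypothetical dispatcher for the sync family, per the proposal. */
void handle_sync_family(long nr, int fd)
{
	switch (nr) {
	case SYS_sync:
		rehash_all_files();
		break;
	case SYS_syncfs:
	case SYS_fsync:
		rehash_fd(fd);
		break;
	case SYS_fdatasync:
		/* data may stay unflushed until read; no rehash here */
		break;
	}
}
```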