
Halo: Wholly Adaptive LLVM Optimizer

Project Abstract

Low-level languages like C, C++, and Rust are the languages of choice for performance-sensitive applications, and they have major implementations based on the LLVM compiler framework. The heuristics and trade-off decisions used to guide the static optimization of such programs during compilation can be automatically tuned during execution (online) in response to their actual execution environment. The main objective of this project is to explore what is possible in the space of online adaptive optimization for LLVM-based language implementations.

This project differs from the usual application of runtime systems employing JIT compilation in that we are trying to optimize programs even if they already have very little interpretive overhead, e.g., no dynamic types. Thus, our focus is on trying to profitably tune the compiler optimizations applied to the code while the program is running. This is in contrast to traditional offline tuning where a sample workload and hours of time are required to perform the tuning prior to the deployment of the software, and afterwards the tuning remains fixed.

For more details of our rationale, related work, and overall plan, please see our design document.

What's Here?

There are three major components to the Halo project:

  1. halomon, aka the Halo Monitor, which is a library that is linked into your executable to create a Halo-enabled binary. It lives under llvm-project/compiler-rt/lib/halomon and mainly performs profiling (currently Linux only) and live code patching.

  2. haloserver, aka the Halo Optimization Server, to which the halo-enabled binaries connect in order to receive newly-generated code that is (hopefully) better optimized. This component lives under tools/haloserver and include.

  3. A special version of clang that supports the -fhalo flag in order to produce Halo-enabled binaries. This is in the usual place llvm-project/clang.

Building

We offer Docker images with Halo pre-installed, so "building" generally should amount to downloading the image:

$ docker pull registry.gitlab.com/kavon1/halo:latest

Please note that the pre-built Docker image requires Linux kernel version 4.15 or newer, because that is the version on our continuous integration machine (see Issue #5).

If you end up building from source (or building the Docker image locally), then only Linux kernel version 3.4 or newer is required. However, you must also ensure that you have perf properly installed and available for use, ideally without requiring sudo. On Ubuntu this process looks like:

# install perf
$ sudo apt install linux-tools-generic linux-tools-common

# allow perf for regular users (note: a plain `sudo echo ... >> file`
# would not work, because the redirection runs without root privileges)
$ echo "kernel.perf_event_paranoid=1" | sudo tee -a /etc/sysctl.conf

# reload the sysctl.conf settings without rebooting
$ sudo sysctl --system

Docker

By default, when you run the Docker image it will launch an instance of haloserver with the arguments you pass it:

$ docker run registry.gitlab.com/kavon1/halo:latest # <arguments to haloserver>

Pass --help to haloserver to get information about some of its options. Look for the flags starting with --halo-*.

If you want to compile and run a Halo-enabled binary within the Docker container, you'll need to grant extra permissions by passing --cap-add sys_admin to docker run. These permissions are required for that binary to use Linux perf_events. Please note that if you just want to run haloserver in the container, the extra permissions are not required.

If you want to build the Docker image locally, that should just amount to running docker build in the root of the repository.
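
For example, from the root of the repository (the image tag here is arbitrary):

$ docker build -t halo:local .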

Building From Source

Check the Dockerfile for dependencies you may need. Once your system has those satisfied, at the root of the cloned repository you can run:

$ ./fresh-build.sh kavon
$ cd build
$ make clang halomon haloserver

Replace make with ninja if you have it installed, as the script prefers Ninja. Please note that you'll end up with a debug build with logging enabled under build/bin. I do not currently have a release build set up in that script for real usage.

Usage

Please keep in mind that this project is still in an early-development phase. Currently, Halo acts as a simple tiered JIT compilation system with sampling-based profiling and compilation that happens on a server.

To produce a Halo-enabled executable, simply add -fhalo to your invocation of Halo's clang:

$ clang -fhalo program.c -o program

Upon launching the program, a thread for the Halo Monitor is spawned before main runs. If the monitor does not find the Halo Server at 127.0.0.1:29000 (currently over an unencrypted TCP connection), it goes inactive. Thus, you will want to have the Halo Server running ahead of time.

Generally you can run haloserver with no arguments, but see the --halo-* flags under --help for more options.
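
A minimal session (assuming both processes run on the same machine and haloserver is on your PATH) might look like:

# in one terminal
$ haloserver

# in another terminal
$ ./program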


halo's Issues

Separate Compilation Tests

Should at least test the following situations

  1. Standard separate compilation with -fhalo enabled for the object files / static libs and during linking to create the executable. It would be great if clang merged the bitcode in those object files into one module in the final executable.

  2. A shared library compiled with -fhalo and an executable (also -fhalo) that uses it. In theory we should be able to access the bitcode in that module (by getting the path from the proc map and loading the object file) to perform cross-library optimization.

  3. A test case that causes a name clash when linking two object files that each define, but do not export, a global with the same name; HaloPrepare's naive globalization of those symbols breaks linking in that case. This would motivate work on creating new aliases for these globals during HaloPrepare.

These would make great regression tests.

Multi-threaded client programs

Due to the challenges with implementing client-side performance auditing ( #28 ), we're limiting this prototype to single-threaded client programs.

One way to get around this in the future, without thread-local storage holding an entry counter, is to dedicate one thread to randomly modifying a global variable that indicates whether version A or B should be used, with the other threads only reading it on each function entry. The added benefit is that all threads, at roughly the same time, will use the same version during each burst, so the I-cache churn happens once for all of them.
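
A minimal C++ sketch of that idea (all names and the coin-flip policy here are hypothetical):

#include <atomic>
#include <chrono>
#include <cstdlib>
#include <thread>

// Written only by the dedicated coordinator thread; read by every
// worker thread on function entry.
std::atomic<int> g_activeVersion{0}; // 0 = version A, 1 = version B

void coordinator() {
  for (;;) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    // Flip randomly between A and B; all threads observe the change
    // at roughly the same time, so switches happen in bursts.
    g_activeVersion.store(std::rand() & 1, std::memory_order_relaxed);
  }
}

void hotFunction() {
  if (g_activeVersion.load(std::memory_order_relaxed) == 0) {
    // run version A
  } else {
    // run version B
  }
}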

January TODOs

As of 2c27e1a

  • Create a JSON specification that names each knob and its options, so we know what the tuning space is without having all of that hardcoded.
  • Read in the knob file upon launching the server and, for each tuning group, create and populate a KnobSet (a rough sketch follows this list).
  • When running the compilation pipeline, we provide const& access to the KnobSet to read the current settings. Lookups for specific knobs are via a string key that is kept in sync between the JSON file and the code. Thus, we still apply a configuration to a KnobSet first and then hand that set over to the pipeline. This lets us share the KnobSet infrastructure between generating a configuration and using one. A configuration itself could be a protobuf object so it can be serialized easily.
  • Complete all later todos.
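
A rough sketch of the knob-file loading step using LLVM's JSON library (the file name and schema are assumptions, and the KnobSet population is left abstract):

#include "llvm/Support/JSON.h"
#include "llvm/Support/MemoryBuffer.h"

void loadKnobSpec() {
  auto Buf = llvm::MemoryBuffer::getFile("knobs.json");
  if (!Buf)
    return; // file not found, etc.
  llvm::Expected<llvm::json::Value> Root =
      llvm::json::parse((*Buf)->getBuffer());
  if (!Root) {
    llvm::consumeError(Root.takeError());
    return; // malformed JSON
  }
  if (auto *Obj = Root->getAsObject())
    for (auto &KV : *Obj) {
      llvm::StringRef KnobName = KV.first;
      // ... create the knob named KnobName and populate the KnobSet
      // for each tuning group (interface still to be designed).
      (void)KnobName;
    }
}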

Externalize globals with effects in embedded bitcode.

One of the problems that arose while implementing dynamic linking is that many C++ programs use global class objects, and dynamically linking an object file containing their ctors / dtors will register and trigger their execution. This is why we're running into problems with fftbench.cpp regarding __dso_handle.

We need to identify these global ctors (listed in @llvm.global_ctors) and ensure they are not called. Similarly, for any internal global objects, we need to change their definition to an external declaration so that we link against the object already in-process.

This same situation will arise for the simple case of a static variable declared in a function in C, which the new test case basic/function_static.c should help detect. The global @fib.n is declared internal, so it is probably getting its own global storage separate from the current process's @fib.n.
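
A sketch of that rewrite over the embedded bitcode using standard LLVM module APIs (where exactly this runs in the pipeline is still open):

#include "llvm/IR/Module.h"
using namespace llvm;

void externalizeGlobals(Module &M) {
  // Drop the ctor/dtor lists so dynamically linking the JIT'd object
  // does not re-run global constructors in-process.
  if (GlobalVariable *GV = M.getGlobalVariable("llvm.global_ctors"))
    GV->eraseFromParent();
  if (GlobalVariable *GV = M.getGlobalVariable("llvm.global_dtors"))
    GV->eraseFromParent();

  // Turn internal global definitions (like @fib.n) into external
  // declarations so the JIT'd code links against the storage that
  // already exists in the running process.
  for (GlobalVariable &G : M.globals())
    if (G.hasInternalLinkage()) {
      G.setInitializer(nullptr);
      G.setLinkage(GlobalValue::ExternalLinkage);
    }
}

Note that this only helps if those symbols are actually resolvable in the running process, which ties back into HaloPrepare's globalization of internal symbols.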

ORC generates PLT entries for JIT'd code

It might be possible to patch in a PC-relative call to JIT'd code without going through the trampoline by searching the PLT and patching in a call with the offset to the matching entry. These facilities seem to be internal to RuntimeDyld, so some of that functionality would need to be exposed.

Dynamically Check Kernel Version

Currently we have a compile-time check of a minimum kernel version when building halomon for feature compatibility.

The problem is that when distributing halo executables or the docker image*, the new minimum kernel version is equal to the version used by the system that compiled the halomon library. Thus, we need a dynamic check through uname (see man 2 uname) to ensure that halomon's sampling system is still compatible with the currently executing system.

*Docker does not capture properties of the kernel in its image
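
A possible shape for the dynamic check (the minimum version here mirrors the 3.4 build-time requirement):

#include <sys/utsname.h>
#include <cstdio>

// Returns true if the running kernel's version is >= major.minor.
bool kernelAtLeast(int MinMajor, int MinMinor) {
  struct utsname U;
  if (uname(&U) != 0)
    return false;
  int Major = 0, Minor = 0;
  if (std::sscanf(U.release, "%d.%d", &Major, &Minor) != 2)
    return false;
  return Major > MinMajor || (Major == MinMajor && Minor >= MinMinor);
}

// e.g., halomon would disable its sampling system if !kernelAtLeast(3, 4).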

Make the call-graph dynamic instead of static.

The static CallGraph has its own VertexInfo and only knows one name for each function, currently the canonical one. That information is seeded with / derived from the static information in the bitcode.

Dynamically, we might be generating newly-optimized versions of existing functions that have different names. In addition, if the optimization generates a new function (via, say, versioning), then the call-graph will be stale and unaware of how to guide the profiler when it is examining the branch target buffer.

Previously I think we avoided saving the FunctionInfo inside the Vertex because it makes destruction annoying, but if we fix #30 properly, we can make VertexInfo a wrapper around a pointer to FunctionInfo to handle the aliasing problem.

To handle the "new edge" thing for function versioning, we can expose the ability to add new vertices / edges. Then, before sending the code off, we need to analyze the optimized module for fresh call edges.
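
Scanning the optimized module for fresh call edges could look roughly like this (the call-graph update interface is hypothetical):

#include "llvm/IR/InstIterator.h"
#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Module.h"
using namespace llvm;

void collectNewCallEdges(Module &M) {
  for (Function &F : M) {
    if (F.isDeclaration())
      continue;
    for (Instruction &I : instructions(F))
      if (auto *CB = dyn_cast<CallBase>(&I))
        if (Function *Callee = CB->getCalledFunction()) {
          // addEdge(F.getName(), Callee->getName()) on the dynamic
          // call graph; indirect calls (null Callee) would need
          // separate handling via the profiler.
          (void)Callee;
        }
  }
}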

Hot-cold Splitting

A current deficiency of the profile-guided hot-cold splitting (HCS) in LLVM is that it will not move the cold functions away from the hot code to actually benefit from spatial locality due to issues with linkers. This is most likely why it does not yield performance improvements, only code size reductions.

For Halo, it might make sense to perform HCS (preceded by inlining and followed by function merging). One way to benefit from locality is to extract those cold functions into another module, compile it separately, and send two object files to clients. The dynamic linker will then allocate the hot and cold functions away from each other as two dylibs. There may be cyclic dependencies between these two modules, so some care will be needed when linking.
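
One way to sketch the module split is with LLVM's CloneModule and complementary predicates over the set of cold functions (how that set is computed, and the cyclic-dependency handling, are omitted):

#include "llvm/ADT/DenseSet.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include <memory>
#include <utility>
using namespace llvm;

// Definitions excluded by the predicate become external declarations
// in the clone, so cross-module calls resolve at dynamic-link time.
std::pair<std::unique_ptr<Module>, std::unique_ptr<Module>>
splitHotCold(const Module &M, const DenseSet<const Function *> &Cold) {
  ValueToValueMapTy HotMap, ColdMap;
  auto HotMod = CloneModule(M, HotMap, [&](const GlobalValue *GV) {
    auto *F = dyn_cast<Function>(GV);
    return !(F && Cold.count(F)); // keep everything except cold funcs
  });
  auto ColdMod = CloneModule(M, ColdMap, [&](const GlobalValue *GV) {
    auto *F = dyn_cast<Function>(GV);
    return F && Cold.count(F);    // keep only the cold functions
  });
  return {std::move(HotMod), std::move(ColdMod)};
}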

Turn halomon into a shared library

Currently, small 50KiB executables balloon into >9MiB executables when we link in halomon. The reason we had to go with a static library is issue #3.

Some progress was made on this front by not linking LLVM into halomon, but instead linking libLLVM into the executable (see issue #4). This brought down executable sizes by a few MiB.

What we're stuck on here is the need for only a subset of XRay to be statically linked in, due to the limitations of its implementation outlined in #3. The problem is that I don't know how to make a shared library resolve symbols against the executable it is loaded into.

I played very briefly with building halomon as a shared lib, not linking in the XRay object file, and declaring the XRay functions we need as weak symbols, like #pragma weak __xray_init. The problem is that when the library is loaded by the executable, the symbols in the executable do not override our shared library's weak symbols, so those symbols become NULL and we segfault.

TODO: Play around with a tiny test library and see if you can synthesize the right linker flags to make this work, or ask on StackOverflow because this is a tricky problem.
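
For the record, one experiment worth trying: the dynamic linker only resolves a shared library's undefined weak symbols against the executable if the executable exports its symbols (e.g., via -Wl,--export-dynamic). A tiny test, with hypothetical file names and flags that still need verifying:

// lib.cpp -- built as the shared library
extern "C" void xray_init_stub();
#pragma weak xray_init_stub
extern "C" void lib_entry() {
  if (xray_init_stub) // null if nothing in the process defines it
    xray_init_stub();
}

// main.cpp -- the executable provides the "real" definition
extern "C" void xray_init_stub() {}
extern "C" void lib_entry();
int main() { lib_entry(); }

# the --export-dynamic step is the part to verify:
$ clang++ -shared -fPIC lib.cpp -o libtiny.so
$ clang++ main.cpp -L. -ltiny -Wl,--export-dynamic -o main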

`patchable-function` Attribute

There seems to be a somewhat new patchable-function attribute for functions that emits a small nop sled at the start of the function that is suitable for a compare-and-swap to redirect control flow via a short jump so that you can overwrite the body of the existing function. It's only supported for x86-64 right now.

Right now we're using XRay to do this; it seems fine and supports more architectures, so the attribute doesn't seem immediately useful. Nonetheless, it's worth looking at again in the future.

Do not try to compile functions for which no bitcode exists.

The ClientGroup should have a set of functions for which bitcode is available, and possibly in the future a subset of those which are patchable. This set should be consulted by the profiler when determining the hottest function so it doesn't accidentally give us garbage.

Code map for JIT'd code

After dynamic linking, the client needs to send a message back to server indicating where each function was placed in memory. In particular, we care about the position and address of each function in the JITDylib.

Previously there seemed to be no way to get this information, but hopefully we can extend LLVM to provide this information.

Ubuntu 20.04 compatibility

Since I upgraded to 20.04, the following problems have cropped up:

  • Boost.ASIO removed get_io_service from tcp::socket
  • we're getting an EACCES error when accessing perf. [this turned out to be a problem on my system. updated README]

Server should use abstract addresses instead of absolute

It's currently a pain for the server to send a dylib and then immediately follow up with another message telling the client to patch in some of the code inside of it, because we have to wait for the client to send back the code map information just to know where the code is located when crafting the second message.

Since the client maintains dylib names and symbol addresses now, there's no need for server-side address translation for every client when specifying an action to take on a function. Instead, the server can just refer to JIT'd code via its unique library name + label pair (see the sketch after the list below).

Thus, this task requires a few things:

  • Ensure that all dylibs are assigned a unique name by the server. A special name should be reserved for the original version of the code in-process.
  • The server's "deployed code" should track library and function name.
  • Client performs an abstracted look-up on a ModifyFunction operation instead of using raw addresses.
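
A minimal sketch of the abstract handle and the client-side lookup (all names are placeholders):

#include <cstdint>
#include <map>
#include <string>
#include <utility>

// What the server sends instead of a raw address.
struct CodeHandle {
  std::string Lib;    // unique dylib name; one name reserved for the
                      // original in-process code
  std::string Symbol; // function label within that dylib
};

// Client side: the code map it already maintains.
using CodeMap = std::map<std::pair<std::string, std::string>, uint64_t>;

// Resolve a ModifyFunction operation's target; 0 means "unknown".
uint64_t resolve(const CodeMap &Map, const CodeHandle &H) {
  auto It = Map.find({H.Lib, H.Symbol});
  return It == Map.end() ? 0 : It->second;
}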

Leaky FunctionInfo

See CodeRegionInfo::~CodeRegionInfo. We need a half std::shared_ptr solution because ICL doesn't work with them.

Calling convention & patchable functions.

Since we embed the pristine bitcode before optimization, we don't know whether some internal function has had its calling convention changed to, say, fastcc. The HaloPrepare pass should embed information in an additional data section that describes the calling convention of the function.

In addition, since it's probably not profitable to mark all functions as being patchable with XRay as we do now, we may want to use this same information to describe what functions in the final output code are patchable.

Make logging / output thread-safe

The multiple_groups test is failing because of concurrent use of LLVM's raw_ostream. We need to make halo::log() thread-safe with a lock, and for output that we don't want going to the log, provide stdout() / stderr() versions of raw_ostream.
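
A minimal sketch of the locked logger (assuming callers can be rewritten to print through a callback):

#include "llvm/Support/raw_ostream.h"
#include <mutex>

namespace halo {
static std::mutex LogMutex;

// Serialize all log output: the callback runs while holding the lock,
// so concurrent writers cannot interleave within a message.
template <typename Fn> void log(Fn &&Print) {
  std::lock_guard<std::mutex> Lock(LogMutex);
  Print(llvm::errs());
}
} // namespace halo

// usage: halo::log([](llvm::raw_ostream &OS) { OS << "msg\n"; });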

Using XRay Function Patching in a Shared Library

Beyond some linking issues with weak symbols, which I've taken care of, the main issue is that the XRay implementation makes some assumptions about its usage. Currently, to patch a nop sled at function entry points to invoke the handler, XRay patches in the following code, which calls their trampoline:

mov $FuncID, %r10d
call __xray_FunctionEntry

where the call is actually a near call, i.e., CALL rel32 (starting with 0xE8). See here: https://www.felixcloutier.com/x86/call

The problem is that __xray_FunctionEntry is currently linked in as part of the halomon shared library, which is loaded at an address that is too far away from the program's .text section (more than a 32-bit relative offset), so we get errors like this:

; non PIE
==11819==XRay Entry trampoline (0x7eff75c60be0) too far from sled (0x000000400660)
; PIE
==11935==XRay Entry trampoline (0x7f18e1167be0) too far from sled (0x5633fe2d1a70)

Potential Solutions:

  1. (most preferred) Make halomon a static library. The limitation is only for the call to the trampoline, not the handler's address. However, we will face a similar issue in the future when we want to fully redirect calls to JIT'd code instead of invoking a logging handler. In that case, we can write a different trampoline that simply loads the far function pointer from a jump table indexed by the XRay function ID, and readjusts the stack before doing a far-jump.

  2. Change XRay codegen & runtime to use more nops so we can perform the mov immediate, register needed before the call. I'm not sure how well this works on x86-64, though, since moving a 64-bit immediate into a register requires the long movabs encoding, so the sled would have to grow considerably. More thinking would be needed to figure out how to make this work.

Running many Halo-enabled processes at once

We run into some system-wide perf_events buffer limitations when many halo-enabled processes are running at once. See the comment here: halo-project/llvm@e389423

We had to change the noop test to not spawn 50 clients on the same system because of a pretty low default kernel limit on the number of pages that can be used for perf_events buffers. Now that we have to create one buffer per core, it's quite possible to hit this limit without some care.

One way around this problem would be to only gather sampling data from some CPUs and/or reduce the number of pages in the ring buffer. We can also find out how to increase /proc/sys/kernel/perf_event_mlock_kb and tell users to follow those instructions.
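
For reference, raising the limit would look much like the perf_event_paranoid setting earlier in this README (the value here is just an example):

# raise the per-user limit on perf ring-buffer memory (in KiB)
$ echo "kernel.perf_event_mlock_kb=2048" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl --system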

PerfJITEventListener?

See: https://reviews.llvm.org/rL337789

I don't know if this is actually useful for Halo, since this mainly seems to provide source line information for perf samples in JIT'd code. Basically, a human might care about that but not a tool that operates on LLVM IR. If we were able to then use that source-line information to derive the LLVM IR basic blocks belonging to those samples then maybe it would be useful.

`llvm::raw_os_ostream` does not flush properly

I really dislike llvm::raw_ostream's various implementations. Even a newline doesn't cause a flush. I'm not sure what's going on, but I think it's time to switch to an llvm::raw_ostream that dumps to an llvm::SmallString, and then we flush manually to our std::ostream.
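
The buffered scheme could be as simple as this sketch:

#include "llvm/ADT/SmallString.h"
#include "llvm/Support/raw_ostream.h"
#include <iostream>

// Build the whole message in a SmallString via raw_svector_ostream
// (which writes straight into the buffer), then flush it to the
// std::ostream in one explicit step.
void printFlushed(std::ostream &Out) {
  llvm::SmallString<256> Buf;
  llvm::raw_svector_ostream OS(Buf);
  OS << "some message\n";
  Out << Buf.str().str();
  Out.flush();
}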

Creating a README

It's about time we had one of those... I was thinking a simple overview of the project based on the proposal, and then basic instructions on downloading the Docker image / building from source.

Control-height Reduction

One possible use of path profiles is a pass that already exists under llvm/Transforms/Instrumentation/ControlHeightReduction.h:

This pass merges conditional blocks of code and reduces the number of
conditional branches in the hot paths based on profiles.

Running Experiments for ClientGroups

A good starting point: choose a non-empty random % subset of the clients in a group and treat them identically (i.e., all of their profiling data would be treated as being from the same 'client' by the model).

random TCP connection drops in test suite

We should be using keepalives both to detect dead clients and to prevent the connection from being dropped by intermediate carriers.

Currently it seems that we're not responding to keepalives properly. Either this is because I've only set one end (the server) to use keepalives (as of 5aa95bf), or the client / server is too slow to respond because it's busy/sleeping, etc. Investigate.
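
Enabling keepalives is a one-liner per socket in Boost.Asio; both the client and server sides need it (the kernel's probe timing is tuned separately, e.g., via TCP_KEEPIDLE):

#include <boost/asio.hpp>

void enableKeepalive(boost::asio::ip::tcp::socket &Sock) {
  // Ask the OS to periodically probe the idle connection so dead
  // peers are detected and NAT/carrier timeouts are avoided.
  Sock.set_option(boost::asio::socket_base::keep_alive(true));
}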

Tuning Section Selection

With the overhead elimination stuff, fixed_workload's profiling data suggests to the TS selector that driverFn is a better tuning root than workFn, even though driverFn is only called once at the start of the program. The ancestor hotness check needs to look two ancestors back, not one, to see if we can expand the TS region.

This would solve the issue seen in: c583137

Implement a "bakeoff"

The strategy will be based on the one described in "Online Performance Auditing: Using Hot Optimizations Without Getting Burned", by Lau et al. in PLDI'06.

The main implementation challenges are:

  • The burst counter needs to be thread-safe. How will we handle this? In particular, we will need to be able to access a thread-local counter. Is there a way to determine a stable thread-id from pre-written assembly? Alternatives include periodically querying pthread for any new threads spawned or dead and caching the thread-ids somewhere the assembly can access. We're going with single-threaded clients only in this prototype. See #34

  • Implement a basic "bakeoff" mechanism.

Static Code Features To Reduce Space

Let's drop some knobs from the base config based on static code features as determined by the ProgramInfoPass. We'll need to change KnobSet::lookup so that it returns a dummy knob or some optional type when look-up fails.

User-provided optimization hints on functions.

The optforfuzzing option(s) should be dropped completely.

The minsize, optsize, noinline, and alwaysinline options all seem worth tuning for the individual functions in the tuning section. Thus, these attributes, if present, can serve as defaults, and tuning can play with those options.

The inlinehint option primarily controls which threshold value the inliner should use in its cost model. It could be useful to leave the attribute alone where it appears and have the tuner play with the inliner's hinted-threshold value to see what threshold should be used for that subset of the code. Alternatively, we can make this a tunable flag to set on functions ourselves (randomly, or based on some heuristic, etc.) so the tuner can use a different threshold for a subset of funcs. In both cases, this is a smarter / less direct way of tuning than via noinline/alwaysinline.

Online profiling data can allow us to detect cold functions. The cold attribute places the function into another inlining-threshold class too (among other optimizations). It might be fine to just trust our data-driven metric for determining a 'cold' function and place the attribute directly in the code periodically.

Dynamic linker incorrectly resolves dependencies of a symbol.

The JIT'd object file doesn't appear to be linking against the correct definitions?

It seems that for a module where we request symbol fib_left, fib_right resolves to the original code and not the one in the object file. This is based on the addresses here:

Hottest function = fib_left
Symb: FUNC, fib_right, visible = 0
Symb: FUNC, fib_left, visible = 1
Sent code to all clients!
Finished Compile!
{
 "funcs": {
  "fib_right": {
   "label": "fib_right",
   "size": 67,
   "start": "4679600"
  },
  "fib_left": {
   "label": "fib_left",
   "size": 67,
   "start": "139733235445760"
  }
 }
}

The address for fib_right looks like the wrong implementation. Perhaps we need to explicitly say that the search order for dependencies of fib_left should resolve to symbols in the object file first, and then fall back to dlsym if that fails?

Not all functions should be patchable

This was noticed in oopack, oddly only at -O1 or lower; -O2 and -O3 showed no difference between the two binaries (from minibench testing).

$ ../build/bin/clang++ -DSMALL_PROBLEM_SIZE -O1 -fhalo bench/cpp/oopack_v1p8.cpp -o withhalo
$ ../build/bin/clang++ -DSMALL_PROBLEM_SIZE -O1 bench/cpp/oopack_v1p8.cpp -o nohalo
$ perf stat ./nohalo
                         Seconds       Mflops         
Test       Iterations     C    OOP     C    OOP  Ratio
----       ----------  -----------  -----------  -----
Max             15000
Matrix            200
Complex          2000
Iterator        20000

DONE!

 Performance counter stats for './nohalo':

      71821.383795      task-clock (msec)         #    1.000 CPUs utilized          
                56      context-switches          #    0.001 K/sec                  
                 1      cpu-migrations            #    0.000 K/sec                  
             5,978      page-faults               #    0.083 K/sec                  
   279,471,080,208      cycles                    #    3.891 GHz                    
   456,439,320,592      instructions              #    1.63  insn per cycle         
   130,281,135,616      branches                  # 1813.960 M/sec                  
           993,605      branch-misses             #    0.00% of all branches        

      71.834667907 seconds time elapsed
$ perf stat ./withhalo
                         Seconds       Mflops         
Test       Iterations     C    OOP     C    OOP  Ratio
----       ----------  -----------  -----------  -----
Max             15000
Matrix            200
Complex          2000
Iterator        20000

DONE!

 Performance counter stats for './withhalo':

      99067.185263      task-clock (msec)         #    0.999 CPUs utilized          
               204      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             7,283      page-faults               #    0.074 K/sec                  
   385,573,160,135      cycles                    #    3.892 GHz                    
   542,639,830,648      instructions              #    1.41  insn per cycle         
   184,369,036,871      branches                  # 1861.051 M/sec                  
         1,222,494      branch-misses             #    0.00% of all branches        

      99.141422575 seconds time elapsed

Reduce size of executables.

We really should avoid statically linking in LLVM components, as was done in 72bd63b to fix #3.

Right now a no-op program produces a 12MB executable. We should instead set up the build system to properly require and use libLLVM. This was done in a hacky way in d4a96c5, but the dependencies on libLLVM were not set up well enough to work in our test suite.

CCT knots and inconsistencies

We need to be somewhat more robust when handling the perf data while building the calling-context tree, and I think we should rely on the static call graph to help fix up or work around the following type of issue in oggenc:

walking BTB from seed_curve
in the context of ancestors:
id = 0; <root>
id = 1; ???
id = 2; ???
id = 3; main
id = 4; oe_encode
id = 5; vorbis_analysis
id = 6; mapping0_forward
id = 8; _vp_tonemask
id = 81; seed_curve

BTB Entry:	seed_loop => seed_curve; call
BTB Entry:	seed_curve => seed_loop; ret
BTB Entry:	seed_loop => seed_curve; call
BTB Entry:	seed_curve => seed_loop; ret
BTB Entry:	seed_loop => seed_curve; call
BTB Entry:	seed_curve => seed_loop; ret
BTB Entry:	seed_loop => seed_curve; call
BTB Entry:	seed_curve => seed_loop; ret
BTB Entry:	seed_loop => seed_curve; call
BTB Entry:	seed_curve => seed_loop; ret
...

Problem is that _vp_tonemask doesn't directly call seed_curve, but the calling context data from perf claims it did. My best guess as to why this happens is that the sample took place in the middle of a stack adjustment.

According to the program, _vp_tonemask directly calls seed_loop, which then calls seed_curve. So we end up placing the data in the wrong spot (currently in this "knot", which is another issue in itself):

[image: tonemask_knot]

CRI initialization with multiple identical function definitions

With the latest patch to fix output, we're running into a situation where two different __tls_init functions are in the process, and it's triggering an assertion failure for good reason... the CRI is not currently designed to handle this case: https://gitlab.com/kavon1/halo/-/jobs/528979043

The fix in the CRI infrastructure for this situation is related to solving issue #27, so this is a good test case while working on both simultaneously.

From CodeRegionInfo::init in CodeRegionInfo.cpp:

FuncName = __tls_init @ 4408208
...
FuncName = __tls_init @ 4518512

More Tunables

  • Native CPU / CPU Features: http://llvm.org/doxygen/classllvm_1_1orc_1_1JITTargetMachineBuilder.html
  • Simplify inlining tuning to just one knob, threshold, which modifies the default settings derived from the opt flag.
  • Look for non-default passes that are worth adding into the pipeline (like reroll).
  • Check if the cl::opt modifications are actually taking hold. It seems the GVN group might need to be simplified to one setting.
  • Look for interesting cl::opt knobs for passes already in the pipeline for tuning.
  • LoopVersioningLICM. This is a speculative transformation that creates a duplicate of the loop with more aggressive aliasing assumptions that are checked dynamically. It is currently not part of the NewPM pipeline because it has not been updated to use it. We might want to just use the old PM if it's basically a superset of the NewPM.
  • JumpThreading. This is basically merge splitting: it duplicates a block so that predecessors reach successors through their own copies of the block. We can control the threshold for block size. Again, this is not available in the NewPM.
  • LoopDataPrefetch (llvm/include/llvm/Transforms/Scalar/LoopDataPrefetch.h). It's not in any default pipeline but it's got some regression tests. Probably want to put it after loop unrolling?
  • LoopInterchange. We want to set the cl::opt to true and tune the profitability threshold in the pass.
  • UseCFLAA. Use experimental alias analyses (Steensgaard, Andersen, or Both). I wonder what the trade-offs are beyond compile time. If it's just compile time then we could just set it to Both.
  • ExtraVectorizerPasses. A cl::opt that enables a number of clean-up passes after vectorization. Perhaps it's not enabled by default just for compile time?

cl::values instead of relying on string for Strategy

Look in the LLVM source for examples, like:

static cl::opt<CodeModel::Model> CodeModel(
      "code-model", cl::desc("Choose code model"),
      cl::values(clEnumValN(CodeModel::Tiny, "tiny", "Tiny code model"),
                 clEnumValN(CodeModel::Small, "small", "Small code model"),
                 clEnumValN(CodeModel::Kernel, "kernel", "Kernel code model"),
                 clEnumValN(CodeModel::Medium, "medium", "Medium code model"),
                 clEnumValN(CodeModel::Large, "large", "Large code model")));

You need to put the cl::opt in TuningSection.cpp to access the enum. Then, you also don't need to pass it through the JSON config from Main.

Secure communication with `libssh`

Obviously security is a major concern for Halo when operating over the Internet or some untrusted local network. It seems the most straightforward way to have both authentication and encryption is to switch over to communicating via libssh.

The tutorial for C code is here: http://api.libssh.org/master/libssh_tutor_forwarding.html

In particular, there's a pretty easy-to-use C++ wrapper which is summarized here (see the Channel class in particular): http://api.libssh.org/master/group__ssh__cpp.html

There is a way to do non-blocking reads and/or polling. This may end up replacing all of the Boost.Asio code dealing with TCP/IP, since the data will probably have to go through libssh's API.
