This project demonstrates how to collect a static profile (PGO, aka Profile-Guided Optimization) for a simple console app in order to make it faster. The profile describes an app's typical behavior: which parts of methods are hot or cold, the actual types of objects hidden behind abstractions, etc. It can be collected dynamically via tiered compilation, or statically, where we build a special version of the app (aka an "Instrumented Build"), run it, simulate typical workloads, save the resulting profile to a file, and then reuse it in production. Both approaches have pros and cons.
NOTE: The workflow to collect static profiles is not final yet and may be improved/simplified in future daily builds.
- The inliner relies on PGO data and can be very aggressive for hot paths; see dotnet/runtime#52708 and dotnet/runtime#55478.
- Most virtual calls can be devirtualized using PGO data, e.g.:
```csharp
void DisposeMe(IDisposable d)
{
    d.Dispose();
}
```
is optimized into:
```csharp
void DisposeMe(IDisposable d)
{
    if (d is MyType) // E.g. the profile states that Dispose here is mostly called on MyType.
        ((MyType)d).Dispose(); // It can be inlined now (e.g. to a no-op if MyType::Dispose() is empty)
    else
        d.Dispose(); // a cold fallback, just in case
}
```
^ codegen diff for the case where MyType::Dispose is empty
- The JIT re-orders blocks to keep hot ones closer to each other and pushes cold ones to the end of the method, e.g.:
```csharp
void DoWork(int a)
{
    if (a > 0)
        DoWork1();
    else
        DoWork2();
}
```
is transformed into:
```csharp
void DoWork(int a)
{
    // E.g. the profile states that the DoWork1 branch was never (or rarely) taken
    if (a <= 0)
        DoWork2();
    else
        DoWork1();
}
```
- Some optimizations, such as Loop Cloning, Inlined Casts, etc., aren't applied in cold blocks
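Loop cloning is worth a quick illustration. A minimal sketch (the method below is made up for this example, not taken from the sample):

```csharp
// The JIT can clone a hot loop like this into two versions: a "fast" one,
// guarded by an up-front check that count <= arr.Length, where the
// per-iteration bounds checks are removed, and a "slow" fallback that keeps
// the checks. If the profile marks the surrounding block as cold, the JIT
// skips this cloning to save code size.
static int Sum(int[] arr, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += arr[i]; // bounds check removable in the cloned fast path
    return sum;
}
```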
- Guided AOT: we can prejit only the code that was executed during the test run. This should noticeably reduce the binary size of R2R'd images, as cold methods won't be prejitted at all. For that, you need to pass the `--partial` flag to crossgen2 along with the actual MIBC data.
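A hedged sketch of such a crossgen2 invocation (file names are placeholders, and the exact option spelling may differ between builds; check `crossgen2 --help`):

```shell
# Hypothetical invocation: app.dll, pgo.mibc and app.r2r.dll are placeholders.
# --mibc supplies the profile collected from the instrumented run;
# --partial tells crossgen2 to prejit only the methods the profile has seen.
crossgen2 app.dll --mibc pgo.mibc --partial -o app.r2r.dll
```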
As I already mentioned, there are pros and cons.
Dynamic PGO pros:
- Easy to use: you just need to set the following environment variables: `DOTNET_TC_QuickJitForLoops=1`, `DOTNET_TieredPGO=1`, and `DOTNET_ReadyToRun=0`
- Collects the actual profile live - you don't need to worry about "Is my static profile still relevant?" or "Does my static profile cover this specific scenario?"
Dynamic PGO cons:
- Noticeably slower start: mostly because, for better results, we need to turn off all prejitted (AOT) code, plus we emit a lot of additional block counters and class probes in tier0.
- We don't support context-sensitive PGO or de-optimizations yet, so we bake profile data into methods after just 30 calls, and that data will be there forever for all possible callsites; for some of them it might be less relevant.
- `DOTNET_TC_QuickJitForLoops=1` sometimes leads to performance issues known as "cold loop - hot body" and needs OSR (On-Stack Replacement) in the JIT, which is not finished yet.
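The "cold loop - hot body" shape can be sketched like this (a made-up method, not from the sample):

```csharp
// This method may be entered only once, so tiering never gets a chance to
// promote it to tier1, yet almost all of its work happens inside the loop.
// Without OSR (On-Stack Replacement) the loop keeps running in the slow
// tier0 version of the method for its whole lifetime.
static long SumToN(int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) // hot body stuck in a cold (rarely-entered) method
        total += i;
    return total;
}
```

OSR addresses this by letting the runtime switch the running loop over to optimized code mid-execution.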
Static PGO pros:
- Doesn't affect startup time, or even improves it
- Can be used for Guided AOT, where we prejit only the code that was invoked during the test run. It makes AOT images smaller.
- Since we never promote methods to tier1 during the test run, it is able to avoid the "context-sensitive" issue by collecting all possible scenarios for a specific method.
Static PGO cons:
- Difficult to set up: it requires special steps to create an instrumented build and simulate typical workloads
- The profile has to be re-collected once something changes
- Currently it requires Composite R2R mode (with `compilebubblegenerics`) for better results.
Prerequisites:
- The latest daily build of .NET 6.0 (should be at least 7/25/2021)
- The dotnet-pgo tool: `dotnet tool install --global dotnet-pgo --version "6.0.0-rc.1.21375.2"`. See dotnet-pgo.md
How to run the sample:
First, we need to build a special version of our sample and run it in order to collect a profile:
```shell
dotnet publish -c Release -r win-x64 /p:CollectMibc=true # or linux-x64, osx-arm64, etc.
```
The console app has a special MSBuild task to do that job. Basically, it runs a fully instrumented build, collects traces, and converts them to a special format, *.mibc, that we can use to optimize our app. Now we can re-publish the app using the PGO data we collected previously:
```shell
dotnet publish -c Release -r win-x64 /p:PgoData=pgo.mibc
```
Let's compare performance for StaticPGO, DynamicPGO, and Default modes:
- Normal run (`dotnet run -c Release`):
```
Running...
[0/9]: 57 ms.
[1/9]: 56 ms.
[2/9]: 56 ms.
[3/9]: 54 ms.
[4/9]: 54 ms.
[5/9]: 54 ms.
[6/9]: 54 ms.
[7/9]: 54 ms.
[8/9]: 54 ms.
[9/9]: 54 ms.
```
- Run with static pgo (steps from the How to run the sample section above):
```
Running...
[0/9]: 19 ms.
[1/9]: 19 ms.
[2/9]: 19 ms.
[3/9]: 19 ms.
[4/9]: 19 ms.
[5/9]: 19 ms.
[6/9]: 18 ms.
[7/9]: 18 ms.
[8/9]: 18 ms.
[9/9]: 18 ms.
```
- Run with dynamic PGO (the steps from the How to run the sample section aren't needed; just set the following environment variables in your console):
```powershell
$env:DOTNET_ReadyToRun=0          # ignore AOT code
$env:DOTNET_TieredPGO=1           # enable dynamic pgo
$env:DOTNET_TC_QuickJitForLoops=1 # don't bypass tier0 for methods with loops
```
```
Running...
[0/9]: 164 ms.
[1/9]: 175 ms.
[2/9]: 19 ms.
[3/9]: 18 ms.
[4/9]: 18 ms.
[5/9]: 18 ms.
[6/9]: 18 ms.
[7/9]: 18 ms.
[8/9]: 18 ms.
[9/9]: 18 ms.
```
DynamicPGO is easy to use, but you pay for it with a slower start, because we need to disable all the prejitted code and recompile everything in tier0 with instrumentation (edge counters and class probes). E.g., the following ASP.NET benchmark demonstrates the difference between Static and Dynamic PGO:
With the static one you only need to collect it in advance.