This project demonstrates how to collect a static profile (PGO, aka Profile-Guided Optimization) for a simple console app in order to make it faster. The profile describes an app's typical behavior: which parts of methods are hot or cold, the actual types of objects hidden behind abstractions, etc. It can be collected dynamically via tiered compilation, or statically, where we build a special version of the app (aka an "Instrumented Build"), run it, simulate typical workloads, save the resulting profile to a file, and then reuse it in production. Both approaches have pros and cons.
NOTE: The workflow to collect static profiles is not final yet and may be improved/simplified in future daily builds.
- The inliner relies on PGO data and can be very aggressive for hot paths; see dotnet/runtime#52708 and dotnet/runtime#55478.
- Most virtual calls can be devirtualized using PGO data, e.g.:
```csharp
void DisposeMe(IDisposable d)
{
    d.Dispose();
}
```
is optimized into:
```csharp
void DisposeMe(IDisposable d)
{
    if (d is MyType) // E.g. the profile states that Dispose here is mostly called on MyType.
        ((MyType)d).Dispose(); // It can be inlined now (e.g. to a no-op if MyType::Dispose() is empty)
    else
        d.Dispose(); // a cold fallback, just in case
}
```
^ codegen diff for the case where MyType::Dispose is empty
- The JIT re-orders blocks to keep hot ones closer to each other and pushes cold ones to the end of the method, e.g.:
```csharp
void DoWork(int a)
{
    if (a > 0)
        DoWork1();
    else
        DoWork2();
}
```
is transformed into:
```csharp
void DoWork(int a)
{
    // E.g. the profile states that the DoWork1 branch was never (or rarely) taken
    if (a <= 0)
        DoWork2();
    else
        DoWork1();
}
```
- Some optimizations, such as Loop Cloning, Inlined Casts, etc., aren't applied in cold blocks
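Loop cloning is worth a quick illustration. A minimal sketch (the method below is made up for this example, not taken from the sample):

```csharp
// The JIT can clone a hot loop like this into two versions: a "fast" one,
// guarded by an up-front check that count <= arr.Length, where the
// per-iteration bounds checks are removed, and a "slow" fallback that keeps
// the checks. If the profile marks the surrounding block as cold, the JIT
// skips this cloning to save code size.
static int Sum(int[] arr, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += arr[i]; // bounds check removable in the cloned fast path
    return sum;
}
```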
- Guided AOT: we can prejit only the code that was executed during the test run. This should noticeably reduce the binary size of R2R'd images, as cold methods won't be prejitted at all. For that, you need to pass the `--partial` flag to crossgen2 along with the actual MIBC data.
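A hedged sketch of such a crossgen2 invocation (file names are placeholders, and the exact option spelling may differ between builds; check `crossgen2 --help`):

```shell
# Hypothetical invocation: app.dll, pgo.mibc and app.r2r.dll are placeholders.
# --mibc supplies the profile collected from the instrumented run;
# --partial tells crossgen2 to prejit only the methods the profile has seen.
crossgen2 app.dll --mibc pgo.mibc --partial -o app.r2r.dll
```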
As I already mentioned, there are pros and cons.
Dynamic PGO pros:
- Easy to use: you just need to set the following environment variables: `DOTNET_TC_QuickJitForLoops=1`, `DOTNET_TieredPGO=1`, and `DOTNET_ReadyToRun=0`
- Collects the actual profile live - you don't need to worry about "Is my static profile still relevant?" or "Does my static profile cover this specific scenario?"
Dynamic PGO cons:
- Noticeably slower start: mostly because, for better results, we need to turn off all prejitted (AOT) code, plus we emit a lot of additional block counters and class probes in tier0.
- We don't support context-sensitive PGO or de-optimizations yet, so we bake profile data into methods after just 30 calls, and that data will be there forever for all possible callsites; for some of them it might be less relevant.
- `DOTNET_TC_QuickJitForLoops=1` sometimes leads to performance issues known as "cold loop - hot body" and needs OSR (On-Stack Replacement) in the JIT, which is not finished yet.
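The "cold loop - hot body" shape can be sketched like this (a made-up method, not from the sample):

```csharp
// This method may be entered only once, so tiering never gets a chance to
// promote it to tier1, yet almost all of its work happens inside the loop.
// Without OSR (On-Stack Replacement) the loop keeps running in the slow
// tier0 version of the method for its whole lifetime.
static long SumToN(int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) // hot body stuck in a cold (rarely-entered) method
        total += i;
    return total;
}
```

OSR addresses this by letting the runtime switch the running loop over to optimized code mid-execution.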
Static PGO pros:
- Doesn't affect startup time, or even improves it
- Can be used for Guided AOT, where we prejit only the code that was invoked during the test run. It makes AOT images smaller.
- Since we never promote methods to tier1 during the test run, it is able to avoid the "context-sensitive" issue by collecting all possible scenarios for a specific method.
Static PGO cons:
- Difficult to set up: it requires special steps to create an instrumented build and simulate typical workloads
- The profile has to be re-collected once something changes
- Currently it requires Composite R2R mode (with `compilebubblegenerics`) for better results.
Prerequisites:
- The latest daily build of .NET 6.0 (should be at least 7/25/2021)
- The dotnet-pgo tool: `dotnet tool install --global dotnet-pgo --version "6.0.0-rc.1.21375.2"`. See dotnet-pgo.md
How to run the sample:
First, we need to build a special version of our sample and run it in order to collect a profile:
```shell
dotnet publish -c Release -r win-x64 /p:CollectMibc=true # or linux-x64, osx-arm64, etc.
```
The console app has a special MSBuild task to do that job. Basically, it runs a fully instrumented build, collects traces, and converts them to a special format, *.mibc, that we can use to optimize our app. Now we can re-publish the app using the PGO data we collected previously:
```shell
dotnet publish -c Release -r win-x64 /p:PgoData=pgo.mibc
```
Let's compare performance for StaticPGO, DynamicPGO, and Default modes:
- Normal run (`dotnet run -c Release`):
```
Running...
[0/9]: 57 ms.
[1/9]: 56 ms.
[2/9]: 56 ms.
[3/9]: 54 ms.
[4/9]: 54 ms.
[5/9]: 54 ms.
[6/9]: 54 ms.
[7/9]: 54 ms.
[8/9]: 54 ms.
[9/9]: 54 ms.
```
- Run with static pgo (steps from the How to run the sample section above):
```
Running...
[0/9]: 19 ms.
[1/9]: 19 ms.
[2/9]: 19 ms.
[3/9]: 19 ms.
[4/9]: 19 ms.
[5/9]: 19 ms.
[6/9]: 18 ms.
[7/9]: 18 ms.
[8/9]: 18 ms.
[9/9]: 18 ms.
```
- Run with dynamic PGO (the steps from the How to run the sample section aren't needed; just set the following environment variables in your console):
```powershell
$env:DOTNET_ReadyToRun=0          # ignore AOT code
$env:DOTNET_TieredPGO=1           # enable dynamic pgo
$env:DOTNET_TC_QuickJitForLoops=1 # don't bypass tier0 for methods with loops
```
```
Running...
[0/9]: 164 ms.
[1/9]: 175 ms.
[2/9]: 19 ms.
[3/9]: 18 ms.
[4/9]: 18 ms.
[5/9]: 18 ms.
[6/9]: 18 ms.
[7/9]: 18 ms.
[8/9]: 18 ms.
[9/9]: 18 ms.
```
DynamicPGO is easy to use, but you pay for it with a slower start, because we need to disable all the prejitted code and recompile everything in tier0 with instrumentation (edge counters and class probes). E.g., the following ASP.NET benchmark demonstrates the difference between Static and Dynamic PGO:
With the static one you only need to collect it in advance.