xparq / iprof Goto Github PK

View Code? Open in Web Editor NEW

This project forked from seanballais/iprof

0.0 0.0 0.0 66 KB

A pretty simple and performant C++ profiling library

License: MIT License

C++ 100.00%

iprof's People

iprof's Issues

Make hitime.hpp an internal part of iProf, while and also include-able alone!

Follow-up of #8

If included by iprof, client code should not see it (outside of the iProf namespace, that is).
-> Can be done by including it all from within the iProf namespace; but it reqiures the ugly hackery of pre-including windows.h and <chrono> by the iprof header...
If the client code wanted to use it regardless of (the rest of) iprof, it can; only it must be included before iprof.hpp!

Fill in the blanks... ;)
E.g., in iprof.hpp:

namespace iProf
{
    #include "hirestime.hpp" // See comment above...
    // To avoid conflicts between HiResTime vs iProf::HiResTime, depending
    // on whether the user has also included it, we use HRTime internally:
    namespace HRTime = HiResTime;
    using HRClock = HRTime::Clock;

Why exactly is Start/Stop not members of ScopedMeasure?

I guess to allow calling them on their own, without scoping. Also, passing the implicit this param also has a cost.
- But, for no scoping... That's not really feasible anyway: it's specifically meant to work within a scope with balanced Start/Stop pairs, and can't span across one, as Stop(), with no arg, can't know what measurement its corresponding Start() started!...

So... It could just store the measurement index (or ptr), so it won't need to find itself again in the dtor with that wicked loop... OTOH, that loop is only triggered for parent scopes, so the deeper a measurement stack is, the costlier it is; i.e. in most cases it's just a conditional & skip branch -> #19

Or it could even store the corresponding Stats map entry, so... No, wait, the whole point is to avoid the map lookup here! :)

Anyway, storing the index can't be done without turning Start/Stop into methods, with different signatures! Which means duplicating them. Which is fine, though.

Is it also intentionally a memberless class to minimize stack impact? (Note that empty-member optimization allows that, despite sizeof(anything) > 0.) Is that a thing in profiling practice?
BTW, not just the stack size, but manipulating members has some cost, too. So, that loop may not be that much more costly.
Should really compare the assemblies...

So, this:

inline void Start(const char* tag)
{
    currentScopePath.push_back(tag);
    measurements.emplace_back(Measurement{currentScopePath});
//	measurements.emplace_back(currentScopePath); // This seemed consistently slower with CLANG! :-o
}
inline void Stop()
{
    // Find the corresponding measurement
    auto depth = currentScopePath.size();
    auto m = measurements.end();
    do --m; while (m->scopePath.size() != depth);

    m->tStop = HiResTime::now();
    currentScopePath.pop_back();
}

struct ScopedMeasure
{
    ScopedMeasure(const char* tag) { Start(tag); }
    ~ScopedMeasure() { Stop(); }
};

could be turned into this:

struct ScopedMeasure
{
    Measurement* m;

    ScopedMeasure(const char* tag)
    {
            currentScopePath.push_back(tag);
            measurements.emplace_back(Measurement{currentScopePath});
            m = &measurements.back();
    }

    ~ScopedMeasure()
    {
            m->tStop = HiResTime::now();
            currentScopePath.pop_back();
    }
};

Or, in fact, not "turned into" but "extended with"...

Measure, and explicitly align the 16-bit counters in TinyVector, if needed

Also, that 16 may be better off as 32 on 64-bit systems? Or just swallow it, and int, at the cost of one less of the array size -- but who needs 15, or even 14 levels of scope nesting depth?!

Static-assert HighResClock to be steady!

The Windows version just is, but the std one... See "Notes" at https://en.cppreference.com/w/cpp/chrono/high_resolution_clock!

Consolidate the (function) naming convention (e.g. Start/Stop vs. all the camelCase)

Just the (usual) problem... Here exacerbated with ambiguities like iProf::AccumulateLatestMeasurements() looking like a ctor call if PascalCased. (Which is kinda OK, as ctors are functions, too, but not really...)

E.g. use camelCase for function calls of the scheme verbObject() consistently.
But don't do that for variables. Those could better be snake_cased -- but then there's the ambiguity of looking exactly like std names... Mmm... TBD!

Consolidate allThreadStat as just IPROF_STAT in MT-mode

MT mode may not just mean !defined(IPROF_DISABLE_MULTITHREAD), because currently threaded measurements require manual "disaptching", which can't be transparently handled by a stupid preproc. macro.

So, just enabling/disabling MT means nothing: e.g. there may still be only one thread...
Also, if there are multiple threads, measurements may still only be done in one... Even if not, they still may not overlap at all...
So, AFAICU, this is currently a matter of manual control, with not much support from the tool.

Make test.cpp runnable also with IPROF_DISABLE

Requires #6 (twice... :) its reopened version, too!), and also macroizing all the directly called iProf:: stuff.

Indent with Tabs (instead of the current 3 spaces)

What happens if IPROF_SYNC is never called?

Document it! I mean "don't do that, as the measurements list would just grow rapidly/indefinitely", I guess.
I mean, I still need to understand it precisely...
I think it's a trade-off between adding an unprocessed data point to a growing vector, vs. updating the totals at the cost of a map lookup... I'm not sure either one is that much better... See also: #20!
Also, if that's still a viable strategy: IPROF_STATS, i.e. iProf::stats could be a function ::stats(), which could then make sure at least a final big SYNC is called implicitly. Also, if the measurements grow excessively, there could also be warnings... But how, where? iprof doesn't know about cerr! Well, in the STATS output itself!
- But then also document it (at describing the stats format), that it may be an optional part of the output!

Add comment about size < capacity after overflow in TinyVector

Well, I'd need to understand it first...

Possible static init issue?

I had an obscure crash in OON, vaguely related to some initial state, but only when also integrating TracyClient! :-o

Understand why the unprofiled tests (-> #12) actually take longer! :-o

Sort out the various flavors of stats in multi-threaded runs!

The current mess:

#define IPROF_SYNC_THREAD iProf::addThisThreadEntriesToAllThreadStats()
/*!!
#ifdef ...IPROF_DISABLE_MULTITHREAD?? -> #9
# define IPROF_STATS        iProf::stats
#else
# define IPROF_STATS        iProf::allThreadStats
//!! Also, this does not actually exist in the end, IPROF_SYNC_THREAD kinda kills it, AFAICU:
//!!# define IPROF_THREAD_STATS iProf::stats
#endif
!!*/
#define IPROF_STATS       iProf::stats //!! AFAICU, this is not valid in multi-threaded runs: IPROF_SYNC_THREAD kills it!
#define IPROF_ALL_THREAD_STATS iProf::allThreadStats

Note: as stated in #9, not even "mt. mode" is clear!...

Call IPROF_SYNC by IPROF_SYNC_THREAD?

Redundant calls to IPROF_SYNC should be fine anyway.
However, it currently creates a local vector on every call, which means at least one heap alloc, too! (Albeit there's a best-effort reserve there to mitigate it. Perhaps a static size_t largest_size_seen could help even more?)

A rather non-compelling question is, perhaps, if there's a use case where one would want to call IPROF_SYNC explicitly. E.g. when a piece of code is sometimes multi-threaded, and sometimes single-threaded, with IPROF_DISABLE_MULTITHREAD. But even then: who cares if an IPROF_SYNC would be called implicitly, too?!

A perhaps less made-up question: is there a use case, where only IPROF_SYNC is wanted in MT mode?

And if yes, update the "Multithreading" section in the README:

IPROF_SYNC;
IPROF_SYNC_THREAD;
...

...SECS(x) macros -> inline functions

inline auto MICROSEC(const HighResClock::duration& x) { return std::chrono::duration_cast<std::chrono::microseconds>(x).count(); }
inline auto MILLISEC(const HighResClock::duration& x) { return std::chrono::duration_cast<std::chrono::milliseconds>(x).count(); }
inline auto      SEC(const HighResClock::duration& x) { return std::chrono::duration_cast<std::chrono::seconds>(x).count(); }

Move HighResClock into the iProf namespace

This loses its nice generic usability, but it helps with other things, like #5 and #6.
Such a generic hi-res timer should be better off as a part of a generic lib instead.

Actually, with #9, it can be part of the namespace, as well as not part of it... :-o ;)

Duplicate the tests in test.cpp with and without IPROF, and compare the overall times!

Tried a lot of run - recompile - rerun cycles, and the errors were just way too big; it was impossible to see any reliable difference.

Unscoped (unbalanced) start/stop pairs to measure specific code paths

-> #18

So that it can do things like

  X f() {
    IPROF_START(worst_case); // auto worst_case = iProf::start();
  
    if (bad) {
      for (...) {
        ...
        if (really_bad) {
           do_some_more();
           IPROF_STOP(worst_case); // iProf::stop(worst_case);

But since it doesn't really use (or update) currentScopePath, it will confuse the accumulator & others (or even the next scoped measurements, perhaps)...
- I mean... it can't update the scope path, for the push/pop pairs make no sense across scope boundaries... Right?...
IPROF_START(name) does not take a string... it can't take a string... IPROF("name"), however, does!... :-/
Maybe also check for multiple accidental IPROF_STOP(x) calls for x.

Readme: put the include path back

"iprof/iprof.hpp"?

Question mark, because I only kinda accidentally removed it, not 100% accidentally: the repo dir has been flattened after the fork, and now there's no canonical subdir for the iprof stuff any more. (I still detest duplicating path components, especially right next to each other, like those "hello/hello/hello.stuff" monstrosities in the build2 tutorial examples :) )

However, the iprof repo dir itself is, in fact, just that... So I think the "iprof/" path prefix should be fine after all. :)

test.cpp: Crash with w64devkit...

Nothing with MSVC and CLANG on WSL2. Nothing with g++ on WSL2, either!
Only with g++ on the 64-bit version of w64devkit.
Also with IPROF_DISABLE_VECTOR_OPT! (Note don't forget to define it twice then: both in test.cpp and iprof.cpp!... Which is so stupid, worth its own issue...)
Also with -O2 (not just -O3).
Also with the prev. 4 commits (oldest tried: 32d3934)! :-o Something is really off here... I'd swear this didn't happen with any of them earlier! Even with the fresh ones today, it was all right for quite a while.
But: OK with d5e596c!

WTF?! And now with `32d3934`, too, again! AND `4805a35`! ?!?!?! AND WITH THE MOST RECENT ONE, TOO!!! What the hell is going on?!

Wow!... :-o If the tmp. project dir (for testing the old commits) was in a subdir called _old_, it comiled all right. As soon as renaming it to _last_ (or _oldx_), it crashed! :D Fantastic... If GCC puts the full path into the exe for some reason, there should be an off-by-one error somewhere... And this can also explain the mysterious slowdown mentioned in a later issue comment: either some unexpected misalignment, or that off-by-one error did something already without actually crashing... IT'S CRAZY!
- However...:
  - Couldn't find that _out_ in the exe... (Could be UCS-16 or something, but browsed it through, seen other 16-bit strings, but not the path.)
  - It's too robust: -O2 vs. -O3 makes no difference, despite the size differences (but this is not conclusive: can be random, and (more likely) O2 vs O3 may not affect it, only some data (segment/region) size.

UPDATE:

Definitely an alignment issue: couldn't reproduce it with renaming the main prj (work tree) dir at first: iprof -> iprof1 did nothing. Neither did iprof12. But iprof123 did! :-o It "fixed" the crash!... :-o

FTW, the crash:
>gdb a.exe Reading symbols from a.exe...

warning: could not convert 'main' from the host encoding (CP65001) to UTF-32.
This normally should not happen, please file a bug report.
(gdb) r
Starting program: a.exe
[New Thread 23768.0x7530]
sizeof(iProf::TagList): 128 bytes

And the lucky double is: 440.835

The profiler stats so far:
SCOPE: AVG_TIME (TOTAL_TIME / TIMES_EXECUTED)
All times in micro seconds
heavyCalc: 3.33874e+06 μs (3338737 μs / 1)
heavyCalc/bigWave: 539.615 μs (539615 μs / 1000)
heavyCalc/hugePower: 2798.49 μs (2798489 μs / 1000)
heavyCalc/hugePower/Interm.: 2797.3 μs (2797305 μs / 1000)
heavyCalc/hugePower/Interm./FirstPowerLoop: 604.288 μs (604288 μs / 1000)
heavyCalc/hugePower/Interm./SecondPowerLoop: 562.654 μs (562654 μs / 1000)
heavyCalc/hugePower/Interm./BigWavePowerLoop: 1629.42 μs (1629422 μs / 1000)
heavyCalc/hugePower/Interm./BigWavePowerLoop/bigWave: 542.846 μs (1628539 μs / 3000)

Second lucky double is 440.835

The profiler stats after the second run:
heavyCalc: 3.33854e+06 μs (6677079 μs / 2)
heavyCalc/bigWave: 539.271 μs (1078542 μs / 2000)
heavyCalc/hugePower: 2798.61 μs (5597211 μs / 2000)
heavyCalc/hugePower/Interm.: 2797.8 μs (5595600 μs / 2000)
heavyCalc/hugePower/Interm./FirstPowerLoop: 605.265 μs (1210530 μs / 2000)
heavyCalc/hugePower/Interm./SecondPowerLoop: 561.518 μs (1123036 μs / 2000)
heavyCalc/hugePower/Interm./BigWavePowerLoop: 1630.06 μs (3260129 μs / 2000)
heavyCalc/hugePower/Interm./BigWavePowerLoop/bigWave: 543.053 μs (3258320 μs / 6000)

Let's try a multithreaded environment
[New Thread 23768.0x59ac]
[New Thread 23768.0x5ff8]
440.835

Thread 4 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 23768.0x5ff8]
0x00007ff6b6ef7dee in std::map<LossyVector<char const*, 15u>, iProf::Totals, std::less<LossyVector<char const*, 15u> >, std::allocator<std::pair<LossyVector<char const*, 15u> const, iProf::Totals> > >
::~map() ()

Why the loop in Stop? When is it not just the last item?

It often is, but not always:

0000025ABF8A0400 vs 0000025ABF8A0400
0000025ABF8A05B0 vs 0000025ABF8A05B0
0000025ABF8A0640 vs 0000025ABF8A0640
0000025ABF8A0760 vs 0000025ABF8A0760
0000025ABF8A07F0 vs 0000025ABF8A07F0
0000025ABF8A0880 vs 0000025ABF8A0880
0000025ABF8A06D0 vs 0000025ABF8A0880
0000025ABF8A0520 vs 0000025ABF8A0880
0000025ABF8A0490 vs 0000025ABF8A0880
0000025ABF8A0910 vs 0000025ABF8A0910
0000025ABF8A0AC0 vs 0000025ABF8A0AC0
0000025ABF8A0B50 vs 0000025ABF8A0B50
0000025ABF8A0C70 vs 0000025ABF8A0C70
0000025ABF8A0D00 vs 0000025ABF8A0D00
0000025ABF8A0D90 vs 0000025ABF8A0D90
0000025ABF8A0BE0 vs 0000025ABF8A0D90
0000025ABF8A0A30 vs 0000025ABF8A0D90
0000025ABF8A09A0 vs 0000025ABF8A0D90
0000025ABF8A0E20 vs 0000025ABF8A0E20
0000025ABF8A0FD0 vs 0000025ABF8A0FD0
0000025ABF8A1060 vs 0000025ABF8A1060
0000025ABF8A1180 vs 0000025ABF8A1180
0000025ABF8A1210 vs 0000025ABF8A1210
0000025ABF8A12A0 vs 0000025ABF8A12A0
0000025ABF8A10F0 vs 0000025ABF8A12A0
0000025ABF8A0F40 vs 0000025ABF8A12A0
0000025ABF8A0EB0 vs 0000025ABF8A12A0
0000025ABF8A1330 vs 0000025ABF8A1330
0000025ABF8A14E0 vs 0000025ABF8A14E0

Shouldn't just IPROF_SYNC_THREAD recommended in this README section?

After:

If enabled (i.e. by default) iprof can also handle gathering stats across multiple threads, like:

```C++
IPROF_SYNC;
IPROF_THREAD_SYNC;
std::cerr << "The latest profiling stats across all threads:\n"
          << IPROF_ALL_THREAD_STATS << '\n';
```

>>> Here:

In case some threads might be still accumulating the results, use the ```iProf::allThreadStatLock```
mutex to guard `IPROF_ALL_THREAD_STATS`:

```C++
{
    std::lock_guard<std::mutex> bouncer(iProf::allThreadStatLock);
    std::cout << "The latest profiler stats from across all threads:\n"
              << IPROF_ALL_THREAD_STATS << '\n';
}
```

Why is IPROF_THREAD_SYNC not enough, with its own std::lock_guard<std::mutex> bouncer(allThreadStatLock);?

-> OK, maybe he (Paweł) meant somewhere else in the thread, irrespective of the preceding code above (i.e. not right after an IPROF_THREAD_SYNC). But even then, would IPROF_THREAD_SYNC be just fine (but maybe redundant)? (I guess I should really try to understand finally what exactly addThisThreadEntriesToAllThreadStats() really does...)

Extend Accumulate...() with a predictive meas. vector resize

...so that adding new measurement items during the sampling phase won't actually trigger reallocation (in most cases).

The recommended looping pattern of IPROF_SYNC calls lets the first loop cycle "break new ground" and record how many measurements have been made, then (after the actual syncing), that std::swap(measurements, unfinished); could be extended with a preallocation step for the new measurements instance.
Evaluate if that unfinished vector, recreated on every call, could just be thread_local instead!

Fix the bogus use of the COUNTER macro

With MSVC /W4:

declaration of 'iProf____COUNTER__' hides previous local declaration

OTOH, the output seems to be all right. Weird...
Ahh, OK, but only because the test doesn't have nested measurements...
Just add e.g. an extra IPROF("Outer; see issue #1!") to hugePower() to trigger the actual error!

(Same with GCC/CLANG, too!)

xparq / iprof Goto Github PK

iprof's People

iprof's Issues

WTF?! And now with 32d3934, too, again! AND 4805a35! ?!?!?! AND WITH THE MOST RECENT ONE, TOO!!! What the hell is going on?!

UPDATE:

Recommend Projects

Recommend Topics

Recommend Org

WTF?! And now with `32d3934`, too, again! AND `4805a35`! ?!?!?! AND WITH THE MOST RECENT ONE, TOO!!! What the hell is going on?!