xparq / iprof Goto Github PK
View Code? Open in Web Editor NEWThis project forked from seanballais/iprof
A pretty simple and performant C++ profiling library
License: MIT License
This project forked from seanballais/iprof
A pretty simple and performant C++ profiling library
License: MIT License
Follow-up of #8
iProf
namespace, that is).iProf
namespace; but it reqiures the ugly hackery of pre-including windows.h
and <chrono>
by the iprof header...iprof.hpp
!namespace iProf
{
#include "hirestime.hpp" // See comment above...
// To avoid conflicts between HiResTime vs iProf::HiResTime, depending
// on whether the user has also included it, we use HRTime internally:
namespace HRTime = HiResTime;
using HRClock = HRTime::Clock;
this
param also has a cost.
Stop()
, with no arg, can't know what measurement its corresponding Start()
started!...So... It could just store the measurement index (or ptr), so it won't need to find itself again in the dtor with that wicked loop... OTOH, that loop is only triggered for parent scopes, so the deeper a measurement stack is, the costlier it is; i.e. in most cases it's just a conditional & skip branch -> #19
Or it could even store the corresponding Stats
map entry, so... No, wait, the whole point is to avoid the map lookup here! :)
Anyway, storing the index can't be done without turning Start/Stop into methods, with different signatures! Which means duplicating them. Which is fine, though.
Is it also intentionally a memberless class to minimize stack impact? (Note that empty-member optimization allows that, despite sizeof(anything) > 0.) Is that a thing in profiling practice?
BTW, not just the stack size, but manipulating members has some cost, too. So, that loop may not be that much more costly.
Should really compare the assemblies...
So, this:
inline void Start(const char* tag)
{
currentScopePath.push_back(tag);
measurements.emplace_back(Measurement{currentScopePath});
// measurements.emplace_back(currentScopePath); // This seemed consistently slower with CLANG! :-o
}
inline void Stop()
{
// Find the corresponding measurement
auto depth = currentScopePath.size();
auto m = measurements.end();
do --m; while (m->scopePath.size() != depth);
m->tStop = HiResTime::now();
currentScopePath.pop_back();
}
struct ScopedMeasure
{
ScopedMeasure(const char* tag) { Start(tag); }
~ScopedMeasure() { Stop(); }
};
could be turned into this:
struct ScopedMeasure
{
Measurement* m;
ScopedMeasure(const char* tag)
{
currentScopePath.push_back(tag);
measurements.emplace_back(Measurement{currentScopePath});
m = &measurements.back();
}
~ScopedMeasure()
{
m->tStop = HiResTime::now();
currentScopePath.pop_back();
}
};
Or, in fact, not "turned into" but "extended with"...
Also, that 16 may be better off as 32 on 64-bit systems? Or just swallow it, and int
, at the cost of one less of the array size -- but who needs 15, or even 14 levels of scope nesting depth?!
The Windows version just is, but the std one... See "Notes" at https://en.cppreference.com/w/cpp/chrono/high_resolution_clock!
Just the (usual) problem... Here exacerbated with ambiguities like iProf::AccumulateLatestMeasurements()
looking like a ctor call if PascalCased. (Which is kinda OK, as ctors are functions, too, but not really...)
verbObject()
consistently.MT mode may not just mean !defined(IPROF_DISABLE_MULTITHREAD)
, because currently threaded measurements require manual "disaptching", which can't be transparently handled by a stupid preproc. macro.
So, just enabling/disabling MT means nothing: e.g. there may still be only one thread...
Also, if there are multiple threads, measurements may still only be done in one... Even if not, they still may not overlap at all...
So, AFAICU, this is currently a matter of manual control, with not much support from the tool.
Requires #6 (twice... :) its reopened version, too!), and also macroizing all the directly called iProf:: stuff.
IPROF_STATS
, i.e. iProf::stats
could be a function ::stats()
, which could then make sure at least a final big SYNC
is called implicitly. Also, if the measurements grow excessively, there could also be warnings... But how, where? iprof doesn't know about cerr
! Well, in the STATS output itself!
Well, I'd need to understand it first...
I had an obscure crash in OON, vaguely related to some initial state, but only when also integrating TracyClient! :-o
The current mess:
#define IPROF_SYNC_THREAD iProf::addThisThreadEntriesToAllThreadStats()
/*!!
#ifdef ...IPROF_DISABLE_MULTITHREAD?? -> #9
# define IPROF_STATS iProf::stats
#else
# define IPROF_STATS iProf::allThreadStats
//!! Also, this does not actually exist in the end, IPROF_SYNC_THREAD kinda kills it, AFAICU:
//!!# define IPROF_THREAD_STATS iProf::stats
#endif
!!*/
#define IPROF_STATS iProf::stats //!! AFAICU, this is not valid in multi-threaded runs: IPROF_SYNC_THREAD kills it!
#define IPROF_ALL_THREAD_STATS iProf::allThreadStats
Note: as stated in #9, not even "mt. mode" is clear!...
Redundant calls to IPROF_SYNC should be fine anyway.
However, it currently creates a local vector
on every call, which means at least one heap alloc, too! (Albeit there's a best-effort reserve
there to mitigate it. Perhaps a static size_t largest_size_seen
could help even more?)
A rather non-compelling question is, perhaps, if there's a use case where one would want to call IPROF_SYNC explicitly. E.g. when a piece of code is sometimes multi-threaded, and sometimes single-threaded, with IPROF_DISABLE_MULTITHREAD. But even then: who cares if an IPROF_SYNC would be called implicitly, too?!
A perhaps less made-up question: is there a use case, where only IPROF_SYNC is wanted in MT mode?
And if yes, update the "Multithreading" section in the README:
IPROF_SYNC;
IPROF_SYNC_THREAD;
...
inline auto MICROSEC(const HighResClock::duration& x) { return std::chrono::duration_cast<std::chrono::microseconds>(x).count(); }
inline auto MILLISEC(const HighResClock::duration& x) { return std::chrono::duration_cast<std::chrono::milliseconds>(x).count(); }
inline auto SEC(const HighResClock::duration& x) { return std::chrono::duration_cast<std::chrono::seconds>(x).count(); }
Tried a lot of run - recompile - rerun cycles, and the errors were just way too big; it was impossible to see any reliable difference.
-> #18
X f() {
IPROF_START(worst_case); // auto worst_case = iProf::start();
if (bad) {
for (...) {
...
if (really_bad) {
do_some_more();
IPROF_STOP(worst_case); // iProf::stop(worst_case);
currentScopePath
, it will confuse the accumulator & others (or even the next scoped measurements, perhaps)...
IPROF_START(name)
does not take a string... it can't take a string... IPROF("name")
, however, does!... :-/"iprof/iprof.hpp"
?
Question mark, because I only kinda accidentally removed it, not 100% accidentally: the repo dir has been flattened after the fork, and now there's no canonical subdir for the iprof stuff any more. (I still detest duplicating path components, especially right next to each other, like those "hello/hello/hello.stuff" monstrosities in the build2 tutorial examples :) )
However, the iprof
repo dir itself is, in fact, just that... So I think the "iprof/" path prefix should be fine after all. :)
Nothing with MSVC and CLANG on WSL2. Nothing with g++ on WSL2, either!
Only with g++ on the 64-bit version of w64devkit.
Also with IPROF_DISABLE_VECTOR_OPT
! (Note don't forget to define it twice then: both in test.cpp and iprof.cpp!... Which is so stupid, worth its own issue...)
Also with -O2 (not just -O3).
Also with the prev. 4 commits (oldest tried: 32d3934)! :-o Something is really off here... I'd swear this didn't happen with any of them earlier! Even with the fresh ones today, it was all right for quite a while.
But: OK with d5e596c!
Wow!... :-o If the tmp. project dir (for testing the old commits) was in a subdir called _old_
, it comiled all right. As soon as renaming it to _last_
(or _oldx_
), it crashed! :D Fantastic... If GCC puts the full path into the exe for some reason, there should be an off-by-one error somewhere... And this can also explain the mysterious slowdown mentioned in a later issue comment: either some unexpected misalignment, or that off-by-one error did something already without actually crashing... IT'S CRAZY!
_out_
in the exe... (Could be UCS-16 or something, but browsed it through, seen other 16-bit strings, but not the path.)Definitely an alignment issue: couldn't reproduce it with renaming the main prj (work tree) dir at first: iprof -> iprof1 did nothing. Neither did iprof12. But iprof123
did! :-o It "fixed" the crash!... :-o
FTW, the crash:
>gdb a.exe Reading symbols from a.exe...
warning: could not convert 'main' from the host encoding (CP65001) to UTF-32.
This normally should not happen, please file a bug report.
(gdb) r
Starting program: a.exe
[New Thread 23768.0x7530]
sizeof(iProf::TagList): 128 bytes
And the lucky double is: 440.835
The profiler stats so far:
SCOPE: AVG_TIME (TOTAL_TIME / TIMES_EXECUTED)
All times in micro seconds
heavyCalc: 3.33874e+06 μs (3338737 μs / 1)
heavyCalc/bigWave: 539.615 μs (539615 μs / 1000)
heavyCalc/hugePower: 2798.49 μs (2798489 μs / 1000)
heavyCalc/hugePower/Interm.: 2797.3 μs (2797305 μs / 1000)
heavyCalc/hugePower/Interm./FirstPowerLoop: 604.288 μs (604288 μs / 1000)
heavyCalc/hugePower/Interm./SecondPowerLoop: 562.654 μs (562654 μs / 1000)
heavyCalc/hugePower/Interm./BigWavePowerLoop: 1629.42 μs (1629422 μs / 1000)
heavyCalc/hugePower/Interm./BigWavePowerLoop/bigWave: 542.846 μs (1628539 μs / 3000)
Second lucky double is 440.835
The profiler stats after the second run:
heavyCalc: 3.33854e+06 μs (6677079 μs / 2)
heavyCalc/bigWave: 539.271 μs (1078542 μs / 2000)
heavyCalc/hugePower: 2798.61 μs (5597211 μs / 2000)
heavyCalc/hugePower/Interm.: 2797.8 μs (5595600 μs / 2000)
heavyCalc/hugePower/Interm./FirstPowerLoop: 605.265 μs (1210530 μs / 2000)
heavyCalc/hugePower/Interm./SecondPowerLoop: 561.518 μs (1123036 μs / 2000)
heavyCalc/hugePower/Interm./BigWavePowerLoop: 1630.06 μs (3260129 μs / 2000)
heavyCalc/hugePower/Interm./BigWavePowerLoop/bigWave: 543.053 μs (3258320 μs / 6000)
Let's try a multithreaded environment
[New Thread 23768.0x59ac]
[New Thread 23768.0x5ff8]
440.835
Thread 4 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 23768.0x5ff8]
0x00007ff6b6ef7dee in std::map<LossyVector<char const*, 15u>, iProf::Totals, std::less<LossyVector<char const*, 15u> >, std::allocator<std::pair<LossyVector<char const*, 15u> const, iProf::Totals> > >
::~map() ()
It often is, but not always:
0000025ABF8A0400 vs 0000025ABF8A0400
0000025ABF8A05B0 vs 0000025ABF8A05B0
0000025ABF8A0640 vs 0000025ABF8A0640
0000025ABF8A0760 vs 0000025ABF8A0760
0000025ABF8A07F0 vs 0000025ABF8A07F0
0000025ABF8A0880 vs 0000025ABF8A0880
0000025ABF8A06D0 vs 0000025ABF8A0880
0000025ABF8A0520 vs 0000025ABF8A0880
0000025ABF8A0490 vs 0000025ABF8A0880
0000025ABF8A0910 vs 0000025ABF8A0910
0000025ABF8A0AC0 vs 0000025ABF8A0AC0
0000025ABF8A0B50 vs 0000025ABF8A0B50
0000025ABF8A0C70 vs 0000025ABF8A0C70
0000025ABF8A0D00 vs 0000025ABF8A0D00
0000025ABF8A0D90 vs 0000025ABF8A0D90
0000025ABF8A0BE0 vs 0000025ABF8A0D90
0000025ABF8A0A30 vs 0000025ABF8A0D90
0000025ABF8A09A0 vs 0000025ABF8A0D90
0000025ABF8A0E20 vs 0000025ABF8A0E20
0000025ABF8A0FD0 vs 0000025ABF8A0FD0
0000025ABF8A1060 vs 0000025ABF8A1060
0000025ABF8A1180 vs 0000025ABF8A1180
0000025ABF8A1210 vs 0000025ABF8A1210
0000025ABF8A12A0 vs 0000025ABF8A12A0
0000025ABF8A10F0 vs 0000025ABF8A12A0
0000025ABF8A0F40 vs 0000025ABF8A12A0
0000025ABF8A0EB0 vs 0000025ABF8A12A0
0000025ABF8A1330 vs 0000025ABF8A1330
0000025ABF8A14E0 vs 0000025ABF8A14E0
After:
If enabled (i.e. by default) iprof can also handle gathering stats across multiple threads, like:
```C++
IPROF_SYNC;
IPROF_THREAD_SYNC;
std::cerr << "The latest profiling stats across all threads:\n"
<< IPROF_ALL_THREAD_STATS << '\n';
```
>>> Here:
In case some threads might be still accumulating the results, use the ```iProf::allThreadStatLock```
mutex to guard `IPROF_ALL_THREAD_STATS`:
```C++
{
std::lock_guard<std::mutex> bouncer(iProf::allThreadStatLock);
std::cout << "The latest profiler stats from across all threads:\n"
<< IPROF_ALL_THREAD_STATS << '\n';
}
```
Why is IPROF_THREAD_SYNC
not enough, with its own std::lock_guard<std::mutex> bouncer(allThreadStatLock);
?
-> OK, maybe he (Paweł) meant somewhere else in the thread, irrespective of the preceding code above (i.e. not right after an IPROF_THREAD_SYNC). But even then, would IPROF_THREAD_SYNC be just fine (but maybe redundant)? (I guess I should really try to understand finally what exactly addThisThreadEntriesToAllThreadStats()
really does...)
...so that adding new measurement
items during the sampling phase won't actually trigger reallocation (in most cases).
The recommended looping pattern of IPROF_SYNC
calls lets the first loop cycle "break new ground" and record how many measurements have been made, then (after the actual syncing), that std::swap(measurements, unfinished);
could be extended with a preallocation step for the new measurements
instance.
Evaluate if that unfinished
vector, recreated on every call, could just be thread_local instead!
With MSVC /W4:
declaration of 'iProf____COUNTER__' hides previous local declaration
OTOH, the output seems to be all right. Weird...
Ahh, OK, but only because the test doesn't have nested measurements...
Just add e.g. an extra IPROF("Outer; see issue #1!")
to hugePower()
to trigger the actual error!
(Same with GCC/CLANG, too!)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.