hsf / prmon
Standalone monitor for process resource consumption
License: Apache License 2.0
Hi,
I'm creating this issue to follow up on the discussion held at the ATLAS SPOT meeting, where we discussed adding GPU monitoring support to prmon.
Currently, the easiest and most reliable/useful information seems to be provided by the utilities developed by the hardware manufacturers, e.g. nvidia-smi.
For the initial implementation, the agreed-upon idea is to write a new plug-in that invokes nvidia-smi to gather GPU statistics and adds these to the rest. The rationale behind this choice is the wide usage of NVIDIA accelerators where prmon is used, the minimal dependency at compilation/installation time (as opposed to using the C API), etc. The implementation will be done with the understanding that support for other hardware, e.g. Intel/AMD, can be added in the future.
Looking at the relevant documentation (e.g. here), nvidia-smi supports different options, the most relevant being dmon, which provides device monitoring, and pmon, which provides process monitoring statistics. There are some concerns about the overhead of invoking the command-line utility, but these will need to be assessed as the project progresses.
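As a sketch of what the plug-in's parsing might look like, here is a minimal Python example that turns pmon-style tabular output into per-process records. The column layout here is an assumption based on typical nvidia-smi pmon output, not a confirmed specification, and the real implementation would live in prmon's C++ code.

```python
def parse_pmon(text):
    """Parse pmon-style tabular output into per-process dicts.

    Assumed columns (to be verified against real nvidia-smi output):
    gpu, pid, type, sm%, mem%, enc%, dec%, command.
    """
    entries = []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines
        fields = line.split()
        entries.append({
            "gpu": int(fields[0]),
            "pid": int(fields[1]),
            # '-' means no sample for this process in this interval
            "sm": None if fields[3] == "-" else int(fields[3]),
            "mem": None if fields[4] == "-" else int(fields[4]),
            "command": fields[-1],
        })
    return entries

sample = """\
# gpu        pid  type    sm   mem   enc   dec   command
    0      12345     C    45    10     -     -   python
"""
```

This keeps the plug-in decoupled from the NVIDIA C API, at the cost of depending on the textual output format staying stable.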
Please feel free to provide relevant input under this ticket. Many thanks.
Best,
Serhan
prmon's resource consumption seems to be very reasonable, even for 8-core multi-process jobs (0.5% as seen in #50); however, we should support a profiling build option so that we can analyse where we actually consume resources.
In prmon_plot.py, the axis labels are set to [kb] when the field is PSS, VMEM, RSS or SWAP, but in prmon.txt the names are lowercase, so the axis labels do not get printed when e.g. --xvar pss is used.
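A minimal sketch of a case-insensitive lookup that would sidestep the mismatch (the axis_units mapping and axis_label helper are hypothetical names, not the actual prmon_plot.py code):

```python
# Hypothetical unit mapping, keyed by the uppercase field names
axis_units = {"PSS": "[kb]", "VMEM": "[kb]", "RSS": "[kb]", "SWAP": "[kb]"}

def axis_label(var):
    """Return the unit label for a variable, matching case-insensitively
    so that lowercase prmon.txt names like 'pss' still get a unit."""
    return axis_units.get(var.upper(), "")
```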
Running prmon on an ATLAS 8-core simulation job, I notice the following after about an hour's runtime:
aiatlas161:/build/graemes/montest/sim$ ps gux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
graemes 24285 26.0 0.0 15212 1372 pts/2 S+ 12:12 17:10 ../bin/prmon --interval 10 -- Sim_tf.py --inputEVNTFile=EVNT.13043099._000859.pool.root.1 --maxEvents=1000 --postIncl
graemes 24286 0.0 0.3 633780 97532 pts/2 S+ 12:12 0:02 python /cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasOffline/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/share/Sim_tf
graemes 24325 0.0 0.0 9704 1428 pts/2 S+ 12:12 0:00 /bin/sh ./runwrapper.EVNTtoHITS.sh
graemes 24326 1.1 0.0 17012 1132 pts/2 S+ 12:12 0:45 MemoryMonitor --pid 24325 --filename mem.full.EVNTtoHITS --json-summary mem.summary.EVNTtoHITS.json --interval 30
graemes 24327 2.5 6.3 2683568 1890308 pts/2 Sl+ 12:12 1:39 /cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt
graemes 25206 99.9 6.3 2686716 1901632 pts/2 R 12:16 61:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt
graemes 25207 99.9 6.3 2691088 1904744 pts/2 R 12:16 61:19 /cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt
graemes 25208 99.9 6.3 2689316 1904996 pts/2 R 12:16 61:20 /cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt
graemes 25209 99.9 6.3 2687544 1906452 pts/2 R 12:16 61:20 /cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt
graemes 25210 99.9 6.3 2691420 1903956 pts/2 R 12:16 61:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt
graemes 25211 99.9 6.3 2688096 1902832 pts/2 R 12:16 61:20 /cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt
graemes 25215 99.9 6.3 2688372 1903712 pts/2 R 12:16 61:21 /cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt
graemes 25216 99.9 6.3 2688648 1907700 pts/2 R 12:16 61:20 /cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt
This indicates that prmon is using about 1/4 of a CPU: 17 minutes, compared with MemoryMonitor's 45 seconds.
prmon's cadence is 10 seconds, cf. MemoryMonitor at 30s, but even accounting for that this is about a 7x degradation. This is far too high a resource cost for a monitoring program. We have to profile prmon and see what's costing so much.
We should set up Travis to run some CI jobs for prmon.
Start with one platform (like Ubuntu 16) and then add some others once that works.
I think we're at (or close to) a point where prmon is stable enough that we can create the first tag, which we can use as a basis to perhaps replace MemoryMonitor with prmon in the ATLAS workflow. There are a few open issues currently, but I don't think these are showstoppers. What do you think, @graeme-a-stewart, [email protected]?
Would it be possible to change prmon_plot.py in such a way that one can give a list as argument to --yvar, which would produce a stack of plots in a single png file? This would avoid having to manually combine pngs, and it would naturally show the time correlation among different metrics.
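For what it's worth, a minimal matplotlib sketch of the idea (stacked_plot is a hypothetical helper, not part of prmon_plot.py): one subplot per metric, all sharing the time axis, saved into a single file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for batch use
import matplotlib.pyplot as plt

def stacked_plot(time, series, outfile):
    """Draw one subplot per metric, sharing the x (time) axis,
    and save them stacked in a single png."""
    fig, axes = plt.subplots(len(series), 1, sharex=True, squeeze=False)
    for ax, (name, values) in zip(axes[:, 0], series.items()):
        ax.plot(time, values)
        ax.set_ylabel(name)
    axes[-1, 0].set_xlabel("time")
    fig.savefig(outfile)
    plt.close(fig)
```

Sharing the x axis is what makes the time correlation between metrics visible at a glance.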
Thanks,
Andrea
I sometimes see lots of these messages:
rename fails: No such file or directory
prmon.json_tmp prmon.json_snapshot
Any idea why?
As noted in #37, netBurner.py only works in Python 2. This is because of the significant restructuring of the urllib2 library between Python 2 and Python 3.
We should see whether this can be fixed with the help of the six module.
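Whether via six or not, the fix amounts to wrapping the import: urllib2 was split into urllib.request and urllib.error in Python 3, which is exactly what six.moves.urllib papers over. A sketch of the plain try/except alternative:

```python
# Compatibility import for the urllib2 restructuring that breaks netBurner.py.
# six.moves.urllib wraps exactly this pattern, but a try/except works too.
try:
    from urllib.request import urlopen, Request  # Python 3
    from urllib.error import URLError
except ImportError:
    from urllib2 import urlopen, Request, URLError  # Python 2
```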
Strange, but true... at least I occasionally see this on Ubuntu 16 when extreme testing:
root@6af3cf4ea516:/tmp/prmon/package/tests# ../prmon -- ./burner -p 100 -c 0.1 -t 10
Will run for 10s using 100 process(es) and 10 thread(s)
Children will run for 1s
Segmentation fault (core dumped)
root@6af3cf4ea516:/tmp/prmon/package/tests# file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'pstree -A -p 1773'
It's harmless to prmon, but it would be weird for the user to see.
This will disappear when the https://github.com/HSF/prmon/tree/new-child-pid-monitoring branch is merged.
I'm attempting to run prmon (using the following script) on an existing ATLAS payload as an initial test.
#!/bin/bash
(
cd /root/ATLAS/rundir
export VO_ATLAS_SW_DIR=/cvmfs/atlas.cern.ch/repo/sw
export ATHENA_PROC_NUMBER=8;
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase;
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh --quiet;
source $AtlasSetup/scripts/asetup.sh Athena,21.0.31,notest --platform x86_64-slc6-gcc62-opt --makeflags="$MAKEFLAGS";
Sim_tf.py --inputEVNTFile=EVNT.13322104._000284.pool.root.1 --maxEvents=1000 --postInclude default:RecJobTransforms/UseFrontier.py --preExec 'EVNTtoHITS:simFlags.SimBarcodeOffset.set_Value_and_Lock(200000)' 'EVNTtoHITS:simFlags.TRTRangeCut=30.0;
simFlags.TightMuonStepping=True' --preInclude EVNTtoHITS:SimulationJobOptions/preInclude.BeamPipeKill.py --skipEvents=0 --firstEvent=2290001 --outputHITSFile=HITS.13322110._003927.pool.root.1 --physicsList=FTFP_BERT_ATL_VALIDATION --randomSeed=2291 --DBRelease=all:current --conditionsTag default:OFLCOND-MC16-SDR-14 --geometryVersion=default:ATLAS-R2-2016-01-00-01_VALIDATION --runNumber=423210 --AMITag=a875 --DataRunNumber=284500 --simulator=ATLFASTII --truthStrategy=MC15aPlus
) &
MYPID=$$
./prmon --pid ${MYPID}
wait
echo "Done"
All seems to work well until we get to the parallel phase, when the tool starts to produce 0's for all the metrics:
1521110863 29926216 4465862 23351072 0 568325215 24847004 1774543360 22401024 749.94 24.67 13.98 8.53
1521110865 29927240 4469183 23354380 0 568734129 24847004 1774973440 22401024 770.05 24.8 13.98 8.53
1521110868 20148528 3014859 15607184 0 546868043 16035106 1689260544 13406208 532.48 19.72 13.98 8.53
1521110870 0 0 0 0 0 0 0 0 0 0 0 0
1521110872 0 0 0 0 0 0 0 0 0 0 0 0
1521110874 0 0 0 0 0 0 0 0 0 0 0 0
1521110876 0 0 0 0 0 0 0 0 0 0 0 0
1521110878 0 0 0 0 0 0 0 0 0 0 0 0
The prmon tool reports the following errors:
rename fails: No such file or directory
prmon.json_tmp prmon.json_snapshot
If I turn on verbose output by passing "true" to the ReadProcs function I see:
rename fails: No such file or directory
prmon.json_tmp prmon.json_snapshot
MemoryMonitor: unable to open pstree pipe!
So it appears that it stops being able to execute the pstree command, and I'm unsure why. If I restart prmon and point it at the same pid, it starts back up again (although I haven't run it to completion, so the problem may reoccur). I may not have built it correctly or done something else dumb, but I'm really not sure why it can't open the pstree pipe and get a list of pids.
Also, as a suggestion, we could replace pstree -A -p pid | tr ... with pgrep -P pid, which gives a list of child pids (although you do need to call it recursively to get all descendants) without needing the more complex parsing (you only need to look for newlines).
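A sketch of the recursion in Python (the lookup parameter exists only so the traversal can be exercised without live processes; the real tool would shell out to pgrep -P directly):

```python
import subprocess

def child_pids(pid, lookup=None):
    """Recursively collect all descendant PIDs of `pid`.

    By default, asks `pgrep -P` for the direct children of each pid;
    a custom lookup callable can be injected for testing.
    """
    if lookup is None:
        def lookup(p):
            out = subprocess.run(["pgrep", "-P", str(p)],
                                 capture_output=True, text=True)
            return [int(tok) for tok in out.stdout.split()]
    pids = []
    for child in lookup(pid):
        pids.append(child)
        pids.extend(child_pids(child, lookup))  # recurse into grandchildren
    return pids
```

The output is just newline-separated pids, so no parsing beyond splitting is needed, unlike the mangled pstree output.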
System Config:
CentOS Linux release 7.4.1708 (Core)
libgcc-4.8.5-16.el7_4.2.x86_64
gcc-c++-4.8.5-16.el7_4.2.x86_64
gcc-4.8.5-16.el7_4.2.x86_64
rapidjson-devel-1.1.0-2.el7.noarch
I noticed that the version that we just tagged (v1.0.0) and what we have in the main cmake configuration file are out of sync:
Line 7 in 16dc9cc
We should probably fix this for the next tag.
Testing some improvements in "extreme" conditions, I see that prmon is core dumping on occasion:
[root@577ee70de0aa tests]# ../prmon -- ./burner -p 100 -c 0.1 -t 10
Will run for 10s using 100 process(es) and 10 thread(s)
Children will run for 1s
terminate called after throwing an instance of 'std::ios_base::failure'
what(): basic_filebuf::underflow error reading the file
Aborted (core dumped)
[root@577ee70de0aa tests]# gdb ../prmon core
[...]
Core was generated by `../prmon -- ./burner -p 100 -c 0.1 -t 10'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f1d121b51f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7_4.2.x86_64 libgcc-4.8.5-16.el7_4.2.x86_64 libstdc++-4.8.5-16.el7_4.2.x86_64
(gdb) bt
#0 0x00007f1d121b51f7 in raise () from /lib64/libc.so.6
#1 0x00007f1d121b68e8 in abort () from /lib64/libc.so.6
#2 0x00007f1d12abbac5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007f1d12ab9a36 in ?? () from /lib64/libstdc++.so.6
#4 0x00007f1d12ab9a63 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00007f1d12ab9c83 in __cxa_throw () from /lib64/libstdc++.so.6
#6 0x00007f1d12b0ee07 in std::__throw_ios_failure(char const*) () from /lib64/libstdc++.so.6
#7 0x00007f1d12b1125e in std::basic_filebuf<char, std::char_traits<char> >::underflow() () from /lib64/libstdc++.so.6
#8 0x00007f1d12ad9cdd in std::istream::sentry::sentry(std::istream&, bool) () from /lib64/libstdc++.so.6
#9 0x00007f1d12ad1487 in std::basic_istream<char, std::char_traits<char> >& std::operator>><char, std::char_traits<char>, std::allocator<char> >(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&) () from /lib64/libstdc++.so.6
#10 0x0000000000419bf6 in memmon::update_stats (this=0x7fff47f40b30, pids=...) at /mnt/code/prmon/package/src/memmon.cpp:33
#11 0x0000000000408c0b in MemoryMonitor (mpid=mpid@entry=11061, filename="prmon.txt", jsonSummary="prmon.json", interval=interval@entry=1, netdevs=std::vector of length 0, capacity 0)
at /mnt/code/prmon/package/src/prmon.cpp:188
#12 0x0000000000406129 in main (argc=<optimized out>, argv=<optimized out>) at /mnt/code/prmon/package/src/prmon.cpp:358
(for reference, this is at 8ca4872).
This is in memmon.cpp:33, while the smaps buffer is being read. Although ifstreams do not usually throw (unless std::ios::exceptions is set), it seems that they will throw if the whole file disappears from /proc.
So we need to protect against these exceptions in the loop. Although they are being seen for the smaps buffer (which is the largest/slowest status file to parse), I suppose they could in principle happen anywhere, so it's probably best to catch them in the main monitoring loop and skip an iteration if this occurs.
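In Python terms, the protection amounts to something like the sketch below; the real fix is of course in the C++ loop, where the equivalent is a try/catch around the read.

```python
def read_smaps(pid):
    """Read /proc/<pid>/smaps, returning None if the process has
    disappeared in the meantime (the ifstream-throwing case seen here)."""
    try:
        with open("/proc/%d/smaps" % pid) as f:
            return f.read()
    except OSError:
        # FileNotFoundError / ProcessLookupError both derive from OSError:
        # the process exited between building the pid list and reading its
        # files, so just skip this pid for this iteration.
        return None
```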
As was noted in #2, the structure of prmon makes it difficult to easily add or remove parts of the monitoring.
I think this could be refactored along the following lines:
- split each piece of monitoring into its own class (cpumon, netmon, iomon, ...)
- define a common interface that each of these classes exposes to prmon (e.g., print text file headers, get statistics, print JSON entries)
Then prmon can keep a vector of the enabled instances and just loop over them to assemble the requisite monitoring in a flexible way.
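A toy Python sketch of the shape this interface could take (class and method names are illustrative only; the actual refactoring would be in C++):

```python
class Monitor:
    """Common interface each monitor (cpumon, netmon, iomon, ...) would expose."""
    def text_headers(self):        # column names for prmon.txt
        raise NotImplementedError
    def update_stats(self, pids):  # gather one sample for these pids
        raise NotImplementedError
    def json_entries(self):        # key/value pairs for the JSON summary
        raise NotImplementedError

class CountMon(Monitor):
    """Toy monitor that just counts the processes it is asked about."""
    def __init__(self):
        self.nprocs = 0
    def text_headers(self):
        return ["nprocs"]
    def update_stats(self, pids):
        self.nprocs = len(pids)
    def json_entries(self):
        return {"nprocs": self.nprocs}

# The main loop keeps a list of enabled monitors and iterates over them,
# so enabling/disabling a statistic is just adding/removing an instance.
monitors = [CountMon()]
for m in monitors:
    m.update_stats([101, 102, 103])
```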
This is to test prmon against a 'standard candle' on IO loads.
I wanted to make a static binary tarball that could be used to easily distribute prmon without needing to build it. Unfortunately, it seems the plotting script is missing from the files generated by make package:
[root@58bab5ac49c3 prmon]# tar -tvzf prmon_0.1.0_x86_64-centos7-gnu72-opt.tar.gz
drwxr-xr-x root/root 0 2018-06-11 07:48 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/
drwxr-xr-x root/root 0 2018-06-11 07:48 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/cmake/
drwxr-xr-x root/root 0 2018-06-11 07:48 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/cmake/prmon/
-rw-r--r-- root/root 1409 2018-06-11 07:26 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/cmake/prmon/prmonConfig.cmake
-rw-r--r-- root/root 3198 2018-06-11 07:26 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/cmake/prmon/prmonTargets.cmake
-rw-r--r-- root/root 1607 2018-06-11 07:26 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/cmake/prmon/prmonConfigVersion.cmake
-rw-r--r-- root/root 759 2018-06-11 07:26 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/cmake/prmon/prmonTargets-release.cmake
drwxr-xr-x root/root 0 2018-06-11 07:48 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/doc/
drwxr-xr-x root/root 0 2018-06-11 07:48 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/doc/prmon/
-rw-r--r-- root/root 11357 2018-03-05 13:13 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/doc/prmon/LICENSE
-rw-r--r-- root/root 544 2018-03-05 13:13 prmon_0.1.0_x86_64-centos7-gnu72-opt/share/doc/prmon/NOTICE
drwxr-xr-x root/root 0 2018-06-11 07:48 prmon_0.1.0_x86_64-centos7-gnu72-opt/bin/
-rwxr-xr-x root/root 2348928 2018-06-11 07:47 prmon_0.1.0_x86_64-centos7-gnu72-opt/bin/prmon
drwxr-xr-x root/root 0 2018-06-11 07:48 prmon_0.1.0_x86_64-centos7-gnu72-opt/include/
drwxr-xr-x root/root 0 2018-06-11 07:48 prmon_0.1.0_x86_64-centos7-gnu72-opt/include/prmon/
-rw-r--r-- root/root 81 2018-06-11 07:26 prmon_0.1.0_x86_64-centos7-gnu72-opt/include/prmon/prmonVersion.h
drwxr-xr-x root/root 0 2018-06-11 07:48 prmon_0.1.0_x86_64-centos7-gnu72-opt/lib64/
Would it be feasible to add the possibility to monitor the frequency of the cores on which the application is running, or of all the cores on the system (easier and probably more sensible)?
A simple
cat /proc/cpuinfo | grep MHz
shows the actual frequency of each core in real time. This would make it possible to find out whether the CPU was using a turbo frequency or was throttling at any point during the execution of the application.
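A sketch of the parsing this would need, run against a canned /proc/cpuinfo excerpt (the "cpu MHz" line format shown is as on typical x86 Linux; other architectures may not expose it):

```python
def core_frequencies(cpuinfo_text):
    """Extract per-core frequencies (MHz) from /proc/cpuinfo content,
    equivalent to `grep MHz /proc/cpuinfo`, one value per core."""
    freqs = []
    for line in cpuinfo_text.splitlines():
        if line.startswith("cpu MHz"):
            freqs.append(float(line.split(":")[1]))
    return freqs

sample = """\
processor\t: 0
cpu MHz\t\t: 2893.202
processor\t: 1
cpu MHz\t\t: 1200.000
"""
```

Sampling these values alongside the CPU time statistics would reveal turbo boosting or throttling during the job.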
Cheers,
Andrea
Since version 0.24.0, pandas.read_table is deprecated; we need to use pandas.read_csv instead and update prmon_plot.py accordingly.
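The switch is mechanical; a sketch with a whitespace-separated sample standing in for prmon.txt (pandas assumed available):

```python
import io
import pandas as pd

# prmon.txt is whitespace-delimited; with read_csv the equivalent of the
# old read_table call is to pass a whitespace-regex separator.
sample = io.StringIO("Time vmem pss\n100 2000 1500\n110 2100 1600\n")
df = pd.read_csv(sample, sep=r"\s+")
```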
prmon should collect CPU statistics on the monitored process and its children (similar to time, but including the children).
This is more an observation than a problem. When debugging the handling of child processes in prmon, I find that every time we wait on the condition variable, a SIGCHLD is issued. It must be related to an internal thread being spawned to handle this.
It's probably not really a problem, but it might be worth reviewing the way that we do the waiting and handling of the SIGUSR1 condition.
Testing by @sciaba showed that there's a problem with VMEM and PSS. Evidently there's a bug in the re-implementation of the smaps parser in memmon.cpp.
Changes to memmon should be validated against the built-in MemoryMonitor run by ATLAS jobs.
I believe we can make things look better/more consistent by adopting a standard format.
I created a new branch: https://github.com/HSF/prmon/tree/master-formatting where I passed the C++ code through clang-format w/ Google style.
I can put in a PR if others also agree. I'll be happy to hear any comments, @graeme-a-stewart :) ?
When developing the nvidia monitoring in #105 I was constantly adding and removing debug std::cout statements to try to understand/debug the code. Some of these are definitely throwaway, but it really would be useful to be able to switch on logging for prmon. E.g., I can imagine that with GPU monitoring what we implement might not work on all platforms, and we'd need some debug output from a user to understand why.
In the spirit of not reinventing the wheel, I searched for lightweight logging libraries for C++ and there are a few, notably some that are header-only, which I think is attractive.
I notice that spdlog is in EPEL and seems to be more popular, so I'd probably try that one first.
Currently prmon CPU accounting separates user/system time measurements into two parts: utime and stime for running processes, and cutime and cstime for exited child processes. When looking at the accounting for parents whose children exit, this leads to decreases in the u/stime that look strange, e.g.,
[root@aca2c5b831e5 tests]# cat prmon.txt
Time VMEM PSS RSS Swap rchar wchar rbytes wbytes utime stime cutime cstime wtime rx_bytes rx_packets tx_bytes tx_packets
1524239875 92048 2078 9648 0 5280 79 0 0 3.94 0.09 0 0 1 0 0 0 0
1524239877 92048 2078 9648 0 5280 79 0 0 11.93 0.17 0 0 3 0 0 0 0
1524239879 92048 2078 9648 0 5280 79 0 0 19.87 0.38 0 0 5 0 0 0 0
1524239881 92048 2078 9648 0 5280 79 0 0 27.86 0.56 0 0 7 0 0 0 0
1524239883 92048 2078 9648 0 5280 79 0 0 35.92 0.64 0 0 9 0 0 0 0
1524239885 23012 1491 3252 0 5280 79 0 0 20.64 0.44 19.54 0.29 11 0 0 0 0
1524239887 23012 1491 3252 0 5280 79 0 0 22.59 0.48 19.54 0.29 13 0 0 0 0
1524239889 23012 1491 3252 0 5280 79 0 0 24.58 0.5 19.54 0.29 15 0 0 0 0
1524239891 23012 1491 3252 0 5280 79 0 0 26.55 0.54 19.54 0.29 17 0 0 0 0
1524239893 23012 1491 3252 0 5280 79 0 0 28.51 0.59 19.54 0.29 19 0 0 0 0
See that utime and stime drop at 1524239885.
I think a more consistent accounting would be to sum the u and s times for each parent and its exited children together, which should give nice monotonically rising curves for these values.
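A quick check of the proposal on the numbers above (utime and cutime pairs taken from the rows around the drop):

```python
# (utime, cutime) from the prmon.txt excerpt, around the point where the
# child processes exit at t=1524239885
samples = [(35.92, 0.0), (20.64, 19.54), (22.59, 19.54)]

# summed accounting: parent u-time plus exited-children u-time
totals = [u + cu for u, cu in samples]

# the summed series rises monotonically, unlike the raw utime column
monotonic = all(a <= b for a, b in zip(totals, totals[1:]))
```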
Working on #2 made me realise just how unpleasant it is to extend the monitor, because all of the parsing is done with C-style IO (snprintf and sscanf, ugh!) and much use is made of magic numbers in the offsets to position things correctly.
I will implement the improvements in #2 in C++ style, and that can serve as a model for improving the rest of the statistics.
This would also make enabling/disabling sets of statistics a lot easier, if done correctly.
Not hugely clear how to get this per-process, but measuring "global" network activity in a controlled process space might be sufficient for this purpose (docker, cgroups?).
Would it be possible to add two metrics: no. of processes in the process tree and no. of threads?
These numbers may vary over the lifetime of the monitored process, so it is interesting to track them.
Another interesting metric is the CPU efficiency over time. There are two options here:
@graeme-a-stewart: I was thinking about building the pushes to master (in addition to the PRs) in Travis-CI and reporting the results on the main page. While we're at it, perhaps I could also try to set up Coverity checks. What do you think?
As discussed in #37, the JSON files opened in the Python test scripts don't get closed properly. We should put them into a with construct to guarantee clean-up.
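A minimal sketch of the pattern:

```python
import json, os, tempfile

# Opening files inside `with` blocks guarantees the handles are closed
# even if an exception is raised midway through the test.
path = os.path.join(tempfile.mkdtemp(), "prmon.json")
with open(path, "w") as f:
    json.dump({"maxPSS": 2525}, f)
with open(path) as f:
    summary = json.load(f)
```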
There was some discussion on ATLAS lists that concluded that people favoured Niels Lohmann's JSON code for C++. It's a header only library that can actually be made as concise as a single (albeit large) file to be included.
It would be good to evaluate that as an alternative to RapidJSON, especially if it could be provided easily bundled with prmon.
@bencouturier found that the default gcc5 compiler on Ubuntu throws an unused-variable warning for the clock tick variables in utils.h. Combined with -Werror, this makes the build fail.
This looks to me like a buggy compiler, because no other compiler we test against (gcc4.7, gcc6.2, gcc7.3) shows this issue.
I had a whirl at patching this with #pragma GCC diagnostic ignored "-Wunused-variable"; however, it seems that this pragma is quite buggy in g++ (problems observed up to at least gcc6.3, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53431).
As noted, a stupid, stupid compiler...
The use of these clock tick variables is probably only marginally useful, so maybe it's best just to use sysconf(_SC_CLK_TCK) directly where needed.
As discussed in #103, it might be better to make the default x/y-axis units variable-dependent (i.e. MB for memory, SEC for time etc.) instead of the currently hardcoded ones (x=SEC and y=MB).
The way that child processes are found right now is pretty suboptimal, involving C-style text parsing of mangled pstree output. It's opaque and fragile.
A better way would be to interrogate /proc directly from the parent process, finding children from
/proc/PID/task/PID/children
and recursing.
Getting the vector of child PIDs should be a separate stage of the monitoring loop, with the PID vector then being passed to each of the monitoring plugins in #31.
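A Python sketch of the traversal (the proc argument exists only so the walk can be tested against a fake tree; prmon itself would do this in C++):

```python
import os

def descendants(pid, proc="/proc"):
    """Recursively collect child pids by reading
    /proc/<pid>/task/<pid>/children."""
    path = os.path.join(proc, str(pid), "task", str(pid), "children")
    try:
        with open(path) as f:
            children = [int(tok) for tok in f.read().split()]
    except OSError:
        return []  # process vanished, or no task dir
    pids = []
    for child in children:
        pids.append(child)
        pids.extend(descendants(child, proc))  # recurse into grandchildren
    return pids
```

The resulting pid vector is exactly what each monitoring plugin would then receive at every iteration.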
As noted in #37, I find the current use of RapidJSON rather unsatisfying. Creating a c_str and then using the DOM model is fragile (and it core dumps if you get it wrong).
It would (probably) be better to use the SAX model and have a Writer assemble the JSON object structure that we need.
Anyway, at least other ways to do this should be thought about...
Trying to compile prmon on a gentoo linux system (part of the HSF Packaging group's test drive) I got this error:
CMake Error at cmake/prmonCPack.cmake:93 (string):
string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
command.
Call Stack (most recent call first):
cmake/prmonCPack.cmake:136 (hsf_get_platform)
CMakeLists.txt:69 (include)
This seems to come from an empty/undefined value for HSF_OS_VERSION.
So this should be handled better for sure.
(There might well be an upstream issue for the HSF package template creator.)
Hi,
I would humbly suggest adding time derivatives of some metrics, namely the IO and the network metrics. As we know, spikes in IO rates are extremely important when characterising an application, and having prmon_plot.py able to generate plots for them would be very handy.
IMO this can be done just in prmon_plot.py; there is no need to touch prmon itself.
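A sketch of the derivative computation prmon_plot.py would need (rates is a hypothetical helper): divide the change in each cumulative counter by the change in time between successive samples.

```python
def rates(times, cumulative):
    """Turn a cumulative counter (e.g. rchar, rx_bytes) into per-interval
    rates: delta(value) / delta(time) between successive samples."""
    return [(c1 - c0) / (t1 - t0)
            for (t0, c0), (t1, c1) in zip(zip(times, cumulative),
                                          zip(times[1:], cumulative[1:]))]
```

Note the derived series has one fewer point than the input, so the plot should use e.g. the interval midpoints on the x axis.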
Cheers,
Andrea
It would be good to implement a method to reduce the size of the prmon output by discarding "useless" data; for example, if a metric behaves linearly between two points, there is no reason to keep the intermediate values. For metrics with a linear behaviour across large time intervals, this may reduce the data generated by two orders of magnitude or more. The algorithm I'm proposing is demonstrated at
https://cernbox.cern.ch/index.php/s/lKZgVgk2Fhwl3f6
This compression should not be the default; it should be enabled by a command-line option, and it needs a "precision" parameter with a sensible default (I propose 10% of the difference between the max and min of the metric).
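The linked demonstration is not reproduced here, but a minimal sketch of one possible pruning strategy along these lines: drop every sample that linear interpolation between the kept neighbours reproduces to within a tolerance.

```python
def compress(ts, ys, tol):
    """Keep only the points needed so that linear interpolation between
    kept points reproduces every dropped sample to within `tol`."""
    keep = [0]
    i = 0
    while i < len(ts) - 1:
        j = i + 1
        # extend the segment [i, j] while all intermediate points fit the line
        while j + 1 < len(ts):
            j += 1
            ok = True
            for k in range(i + 1, j):
                # value predicted by the straight line from point i to point j
                frac = (ts[k] - ts[i]) / (ts[j] - ts[i])
                pred = ys[i] + frac * (ys[j] - ys[i])
                if abs(pred - ys[k]) > tol:
                    ok = False
                    break
            if not ok:
                j -= 1  # last good endpoint
                break
        keep.append(j)
        i = j
    return [(ts[k], ys[k]) for k in keep]
```

With tol set to a fraction of the metric's max-min range, a long linear ramp collapses to its two endpoints.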
A very nice feature extension for prmon would be the ability to fork off the program to be monitored, as an alternative to passing the PID of an existing process.
Use would be like:
prmon --json-summary myout.json --interval 5 -- ./monitored_program --progarg1 1 --progarg2 2
using the traditional -- to separate prmon's arguments from the child's arguments.
The option parsing in prmon is a bit rubbish and not easy to extend.
Moving to Boost.Program_options would seem to be the natural way to go (it is pretty easy to satisfy this dependency).
I realised our summary file doesn't have a walltime measure; it should.
I don't know if we can get this from /proc, or if we should just guesstimate that the wall time is the wall clock time of prmon itself.
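If /proc is usable for this, one option is the process start time from /proc/<pid>/stat; a hedged Python sketch (field positions per proc(5); the clock tick rate is assumed to be 100, the usual sysconf(_SC_CLK_TCK) value):

```python
def wall_time(stat_line, uptime_seconds, clk_tck=100):
    """Elapsed wall time of a process, from a /proc/<pid>/stat line.

    Field 22 (1-indexed) of stat is the process start time in clock
    ticks since boot; subtracting it (converted via clk_tck) from the
    system uptime (first value of /proc/uptime) gives the wall time.
    """
    # the comm field (2nd) can contain spaces, so split after the ')' instead
    after_comm = stat_line.rsplit(")", 1)[1].split()
    starttime_ticks = int(after_comm[19])  # field 22 overall, 20th after comm
    return uptime_seconds - starttime_ticks / clk_tck
```

This measures the monitored process's own wall time rather than prmon's, so it stays correct even when prmon is attached to an already-running process.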
I have the following prmon output:
Time nprocs nthreads wtime stime utime pss rss swap vmem rchar read_bytes wchar write_bytes rx_bytes rx_packets tx_bytes tx_packets
1576523936 2 0 0 0 0 5666 7500 0 41168 651766 5222400 351 0 119240 1742 94493 1744
1576523967 3 10 31 4 16 645854 1133956 0 32669000 115241254 549240832 3084919 892928 9512993 137047 8696700 140331
and I'd like to see nprocs and nthreads at a reasonable scale. I tried:
prmon_plot.py --input prmon.full.BSRDOtoRAW --xunit 1 --xvar wtime --yvar nthreads,nprocs
It would be good to bundle a small test program with prmon so that it can itself run some meaningful tests.
Such a program could initially be a cpuburner, that can run in multi-threaded or multi-process mode. Network and/or disk reads might be useful to add later.
I did a few tests of prmon against different numbers of threads and child processes. From what I can see, the multi-thread monitoring is in good shape, measuring 76s of CPU time from 20s wall x 4 threads. However, the multi-process monitoring is broken, measuring only one process (it seems).
[root@39bbb7746bc5 prmon]# ./package/tests/burner --time 20 --procs 4 &
[1] 2750
[root@39bbb7746bc5 prmon]# Will run for 20s using 4 process(es) and 1 thread(s)
[root@39bbb7746bc5 prmon]# ./package/prmon --pid $!
[root@39bbb7746bc5 prmon]# cat prmon.json
{"Max":{"maxVMEM":110752,"maxPSS":2525,"maxRSS":10800,"maxSwap":0,"totRCHAR":6944,"totWCHAR":53,"totRBYTES":0,"totWBYTES":0,"totUTIME":19.100000381469728,"totSTIME":0.29999998211860659,"totCUTIME":0.0,"totCSTIME":0.0},"Avg":{"avgVMEM":110752,"avgPSS":2525,"avgRSS":10800,"avgSwap":0,"rateRCHAR":385,"rateWCHAR":2,"rateRBYTES":0,"rateWBYTES":0}}
The CPU time is about 19.4s.
top confirms that burner started 4 processes, as expected, so the issue is on the prmon side.
Would it make sense to add, inside Max in prmon.json, the total wtime, utime and stime?
I know that it's enough to look at the last line of prmon.txt, but I think it would be nicer to have this data as part of the json summary as well.
Clearly, it does not make sense to put them in Avg.
Thanks,
Andrea
I was thinking it might be good to add a script that produces basic 2-D plots out of prmon output, to help users visualize resource usage (CPU, memory etc.) as a function of time. This can be done easily using common libraries like matplotlib.
The current sampling time in prmon is very short, just 1 second:
Line 255 in a273db9
This is great for doing short tests, but it seems rather too short for measuring grid workflows that last O(hours). The old MemoryMonitor default was rather too long though, at 600 seconds (https://gitlab.cern.ch/atlas/athena/blob/master/Control/AthenaMP/src/memory-monitor/MemoryMonitor.cxx#L244). In practice, we used 30 seconds in the production transforms (https://gitlab.cern.ch/atlas/athena/blob/master/Tools/PyJobTransforms/python/trfExe.py#L664).
My gut feeling is that somewhere between 10 and 60 seconds is probably right. It can always be changed with the interval argument, but a better default seems sensible to me.
@amete , @sciaba , [email protected] what do you think?