eeeslab / examon

A highly scalable framework for the performance and energy monitoring of HPC servers

License: Other

Makefile 1.11% C 87.60% CSS 0.20% HTML 0.19% Python 4.78% CMake 0.12% C++ 0.60% NSIS 0.15% JavaScript 0.63% XSLT 0.02% Roff 4.07% Perl 0.05% Shell 0.07% Dockerfile 0.02% TypeScript 0.04% Jupyter Notebook 0.35%
monitoring-daemon performance-monitoring mqtt-protocol hpc-clusters exascale bigdata

examon's Introduction

Examon HPC Monitoring

A highly scalable framework for the performance and energy monitoring of HPC servers

Features

The main executable "pmu_pub" measures:

- Per-core performance counters
  • Instructions retired
  • Un-halted core clock cycles at the current frequency
  • Un-halted core clock cycles at the reference frequency
  • Temperature
  • Time stamp counter
  • Cycles in C3 state
  • Cycles in C6 state
  • Aperf Cycles
  • Mperf Cycles
  • Programmable PMU events
- Per-CPU/Socket
  • Package temperature
  • Package energy
  • DRAM energy
  • Programmable Uncore events

The measured data are sent over the network using the MQTT protocol (TCP/IP).

Dependencies

It requires two libraries to work:

  • Iniparser: used to handle the .conf files
  • Mosquitto: used for the MQTT protocol

(These libraries are provided in the ./lib folder)

To properly build the Mosquitto library you also need

  • libssl
  • libcrypto

They are available in the following distro packages:

  • "libssl-dev" in Ubuntu/Debian
  • "openssl-devel" in CentOS

Enable the RDPMC instruction in kernels 4.x and later

Starting with kernel 4.x, use of the RDPMC instruction from user space must be explicitly enabled. This can be done with the following command:

>$ sudo sh -c "echo '2' > /sys/bus/event_source/devices/cpu/rdpmc"
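
To confirm that the setting took effect, the same sysfs file can be read back. The following is a minimal sketch, not part of the framework: the function names are hypothetical, and the value descriptions follow the Linux perf sysfs interface (0 disables RDPMC for user space, 1 allows it only while a perf event is mapped, 2 allows it unconditionally).

```python
from pathlib import Path

# Default location of the setting, as used by the command above.
RDPMC_PATH = "/sys/bus/event_source/devices/cpu/rdpmc"

# Meaning of each value according to the Linux perf sysfs interface.
RDPMC_MODES = {
    "0": "disabled for user space",
    "1": "allowed only while a perf event is mapped",
    "2": "allowed for all user-space processes",
}


def rdpmc_mode(path=RDPMC_PATH):
    """Return (raw value, description) for the rdpmc sysfs setting."""
    raw = Path(path).read_text().strip()
    return raw, RDPMC_MODES.get(raw, "unknown value")


def rdpmc_ready(path=RDPMC_PATH):
    """True when the file holds the value written by the command above."""
    return rdpmc_mode(path)[0] == "2"
```

Calling rdpmc_ready() with no argument reads the real sysfs file; the path parameter exists only so the check can be tried against an ordinary file.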

Repository organization

The repository is structured as follows:

- Publishers: this folder contains the MQTT publisher plugins.
  • In this framework release, it contains the pmu_pub plugin.
- Parser: this folder contains the software components that run on the front-end side of the framework and process the MQTT data delivered by the publishers.
  • The pmu_pub_sp.py script provides an example of how to calculate additional metrics in real time, starting from the data delivered by the pmu_pub plugin.
- Collector: this folder contains the Collector component. The Collector library can be used in programs to retrieve the pmu_pub monitored data directly from an application running on the monitored nodes. Please refer to the readme in the ./collector folder for more detailed information.
- Lib: this folder contains the external libraries needed by the framework.
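
As an illustration of what "additional metrics" can mean, the sketch below derives two common quantities from the raw counters listed under Features: instructions per cycle (from instructions retired and un-halted core cycles) and effective core frequency (from the Aperf/Mperf ratio). This README does not state which metrics pmu_pub_sp.py actually computes or how the payload is encoded, so the function names, the plain-integer inputs, the nominal frequency argument and the assumed 64-bit counter width are all hypothetical.

```python
def counter_delta(prev, curr, width_bits=64):
    """Difference between two raw readings, tolerating a single wraparound.

    width_bits is an assumption; the real counter width depends on the CPU.
    """
    return (curr - prev) % (1 << width_bits)


def ipc(instr_prev, instr_curr, cycles_prev, cycles_curr):
    """Instructions per cycle over one sampling interval."""
    cycles = counter_delta(cycles_prev, cycles_curr)
    if cycles == 0:
        return 0.0  # core was halted for the whole interval
    return counter_delta(instr_prev, instr_curr) / cycles


def effective_frequency_hz(aperf_prev, aperf_curr,
                           mperf_prev, mperf_curr, nominal_hz):
    """Average frequency while un-halted: nominal * (delta Aperf / delta Mperf)."""
    mperf = counter_delta(mperf_prev, mperf_curr)
    if mperf == 0:
        return 0.0
    return nominal_hz * counter_delta(aperf_prev, aperf_curr) / mperf
```

Both functions work on two consecutive samples of the same core, because the counters are cumulative and only their differences over the sampling interval are meaningful.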

Installation

Build

To build all the libraries and the main executable "pmu_pub", go to the main directory and run:

>$ make

Install

WARNING: To install only the plugin binary (excluding the libraries), DO NOT execute make install in the main directory. Move to the plugin directory first:

>$ cd ./publishers/pmu_pub

Create and edit the configuration files (see the Configuration section for details):

>$ cp example_pmu_pub.conf pmu_pub.conf

>$ cp example_host_whitelist host_whitelist

Then run the install step:

>$ make install

The default install folder is ./bin. To specify a different install location:

>$ make PREFIX=<install-dir> install

The install step will copy the executable, the "pmu_pub.conf" file and the "host_whitelist" file to the <install-dir>.

Configuration

The main executable needs at least the "pmu_pub.conf" file to work. If available, it also uses the "host_whitelist" file to select the hosts on which it runs. The executable searches for the "pmu_pub.conf" file and the "host_whitelist" file in the current working folder first and then, if they are not found, in the "/etc/" folder.
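
The search order described above (current working folder first, then "/etc/") can be sketched as follows. This is only an illustration of the lookup rule, not the code used by pmu_pub; find_config and its search_dirs parameter are hypothetical names.

```python
from pathlib import Path


def find_config(filename, search_dirs=None):
    """Return the first existing path for filename, or None if it is missing.

    By default the current working directory is tried before /etc,
    mirroring the order described in the text.
    """
    if search_dirs is None:
        search_dirs = [Path.cwd(), Path("/etc")]
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```

For example, find_config("pmu_pub.conf") returns the copy in the working folder when one exists there, even if another copy is present in /etc.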

The "pmu_pub.conf" file

The "pmu_pub.conf" file in the ./publishers/pmu_pub directory contains the default parameters needed by the "pmu_pub" executable.

MQTT parameters:

  • brokerHost: IP address of the MQTT broker
  • brokerPort: Port number of the MQTT broker (1883)
  • topic: Base topic under which the data is published (usually built as: org/<organization name>/cluster/<cluster name>)
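
A small sketch of how these topic strings fit together. The base layout comes from the "topic" parameter above; the per-node suffix (node/<hostname>/plugin/pmu_pub/chnl/data) is taken from the example commands in the Usage section. The helper names are hypothetical.

```python
def base_topic(organization, cluster):
    """Base topic in the form org/<organization name>/cluster/<cluster name>."""
    return f"org/{organization}/cluster/{cluster}"


def data_topic(base, hostname, plugin="pmu_pub"):
    """Per-node data topic, following the layout used in the Usage examples."""
    return f"{base}/node/{hostname}/plugin/{plugin}/chnl/data"


def subscribe_all(base):
    """MQTT wildcard filter matching every topic below the base topic."""
    return f"{base}/#"
```

With organization "myorg" and cluster "testcluster", base_topic gives the value suggested for the "topic" parameter in the Usage section, and subscribe_all gives the filter passed to mosquitto_sub there.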

Sampling process parameters:

  • dT: data sampling interval in seconds
  • daemonize: Boolean value to daemonize or not the sampling process
  • pidfiledir: path to the folder where the pidfile will be stored
  • logfiledir: path to the folder where the logfile will be stored

The "pmu_pub.conf" file must be in the working directory of the executable.

However, most of the parameters can be overridden on the command line at run time:

>$ sudo ./pmu_pub -h


usage: pmu_pub [-h] [-b B] [-p P] [-t T] [-q Q] [-s S] [-x X] [-l L] [-e E] 
                    [-c C] [-P P] [-v]
                    {run,start,stop,restart}

positional arguments:
 {run,start,stop,restart}
                       Run mode

optional arguments:
 -h                    Show this help message and exit
 -b B                  IP address of the MQTT broker
 -p P                  Port of the MQTT broker
 -s S                  Sampling interval (seconds)
 -t T                  Output topic
 -q Q                  Message QoS level (0,1,2)
 -x X                  Pid filename dir
 -l L                  Log filename dir
 -c C                  Enable or disable extra counters (Bool)
 -e E                  Perf events list (comma separated)
 -P P                  Enable or disable perf subsystem (Bool)
 -v                    Print version number
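
The precedence rule (values from "pmu_pub.conf" act as defaults, command-line options override them) can be illustrated with the sketch below. It is not the tool's own parser: it uses Python's argparse, covers only four of the options listed above, and assumes that -b, -p, -t and -s correspond to the brokerHost, brokerPort, topic and dT parameters because their descriptions match.

```python
import argparse


def build_parser():
    """Parser for a subset of the options shown in the help text above."""
    parser = argparse.ArgumentParser(prog="pmu_pub")
    parser.add_argument("-b", dest="brokerHost", help="IP address of the MQTT broker")
    parser.add_argument("-p", dest="brokerPort", type=int, help="Port of the MQTT broker")
    parser.add_argument("-t", dest="topic", help="Output topic")
    parser.add_argument("-s", dest="dT", type=float, help="Sampling interval (seconds)")
    parser.add_argument("mode", choices=["run", "start", "stop", "restart"])
    return parser


def effective_settings(file_settings, argv):
    """Start from the config-file values and apply any options given in argv."""
    args = vars(build_parser().parse_args(argv))
    merged = dict(file_settings)
    # Options that were not passed stay at None and must not hide file values.
    merged.update({key: value for key, value in args.items() if value is not None})
    return merged
```

For instance, with a file that sets brokerHost to 127.0.0.1 and dT to 2, passing ["-s", "0.5", "run"] keeps the broker address from the file and changes only the sampling interval.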

The "host_whitelist" file

This file lists the hosts in the cluster that are enabled to execute the plugin, one hostname per row. Optionally, it can include the IP address of the broker to which the hosts that follow will connect. This is useful, for example, for balancing the load and bandwidth across the front-end nodes.

The file has the following format:

[BROKER:] <IP address> <port number>
host0
host1
host2

To disable a host or a group of hosts, use "#" as a general comment marker.

Example of the host_whitelist file:

[BROKER:] 192.168.0.1 1883
node100
node101

[BROKER:] 192.168.0.1 1884
#node102
node103

In this example there are four hosts and two brokers. node100 and node101 connect to the broker at 192.168.0.1:1883, while node102 and node103 are assigned to the broker at 192.168.0.1:1884. Because "node102" is commented out, it is disabled and the plugin will not run on it.
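
The file format described above is simple enough to parse in a few lines. The following sketch is not the parser used by pmu_pub; parse_host_whitelist is a hypothetical name, and treating everything after a "#" on a line as a comment is an assumption.

```python
BROKER_TAG = "[BROKER:]"


def parse_host_whitelist(text, default_broker=None):
    """Map each enabled hostname to the (ip, port) of the broker it should use.

    Hosts listed before any [BROKER:] line get default_broker.
    """
    hosts = {}
    broker = default_broker
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        if line.startswith(BROKER_TAG):
            ip, port = line[len(BROKER_TAG):].split()[:2]
            broker = (ip, int(port))
            continue
        hosts[line] = broker
    return hosts
```

Applied to the example file above, the result contains node100 and node101 mapped to ("192.168.0.1", 1883) and node103 mapped to ("192.168.0.1", 1884); node102 is absent because its line is commented out.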

Usage

The following instructions describe how to build a single-node measurement setup composed of:

  • A broker, used as the endpoint to which the CPU data is sent and from which it is requested.
  • A publisher agent that collects and publishes CPU data to the broker.
  • A subscriber agent that receives the CPU data.

  1. Run the broker process as a daemon:

    >$ ./lib/mosquitto-1.3.5/src/mosquitto -d 
  2. Edit the "pmu_pub.conf" file and set at least the following parameters:
    1. brokerHost: IP address of the node where the broker is running. If it is running on the same machine, set it to 127.0.0.1
    2. topic: set it to: org/myorg/cluster/testcluster
  3. Make sure that the msr driver is loaded:

    >$ sudo modprobe msr
  4. Run the pmu_pub process (publisher) as superuser: cd ./publishers/pmu_pub/ and:

    >$ sudo ./pmu_pub

    At this point the CPU data should be available to the broker at the topic indicated in the .conf file

  5. By subscribing to the topic, the data stream can be redirected to the shell or to a file. An MQTT subscriber client is available in the ./lib/mosquitto-1.3.5/client folder. Assuming the broker is running at IP address 127.0.0.1, the following command prints to standard output the data published by the sampling process "pmu_pub":

    >$ LD_LIBRARY_PATH=../lib/:$LD_LIBRARY_PATH ./mosquitto_sub -h 127.0.0.1 -t "org/myorg/cluster/testcluster/#" -v

    or, to save the data to a file:

    >$ LD_LIBRARY_PATH=../lib/:$LD_LIBRARY_PATH ./mosquitto_sub -h 127.0.0.1 -t "org/myorg/cluster/testcluster/#" -v >> cpudata.log

  6. To calculate additional metrics, see the pmu_pub_sp documentation in the ./parser/pmu_pub_sp folder.

    Example (assuming that "TESTNODE" is the hostname where the pmu_pub service is running):

    >$ python ./pmu_pub_sp.py -b 127.0.0.1 -p 1883 -t org/myorg/cluster/testcluster/node/TESTNODE/plugin/pmu_pub/chnl/data -o org/myorg/cluster/testcluster/node/TESTNODE/plugin/pmu_pub/chnl/data 

    The additional metrics will then be available at:

    >$ LD_LIBRARY_PATH=../lib/:$LD_LIBRARY_PATH ./mosquitto_sub -h 127.0.0.1 -t "org/myorg/cluster/testcluster/node/TESTNODE/plugin/pmu_pub/chnl/data/#" -v
  7. To stop the sampling process, execute the following in the ./publishers/pmu_pub folder:

    >$ sudo ./pmu_pub stop

    To stop the pmu_pub_sp process, execute the following in the ./parser/pmu_pub_sp folder:

    >$ python ./pmu_pub_sp.py stop
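
The log file written in step 5 can be post-processed offline. With the -v option, mosquitto_sub prints each message as the topic, a space, and the payload, so every line can be split at the first space. The sketch below relies on that and on the topic layout used in the commands above; the helper names are hypothetical, the payload is left as an opaque string because its format is not described here, and topics are assumed to contain no spaces.

```python
def parse_sub_line(line):
    """Split one 'mosquitto_sub -v' output line into (topic, payload)."""
    topic, _, payload = line.rstrip("\n").partition(" ")
    return topic, payload


def node_of(topic):
    """Return the hostname that follows the 'node' level of a topic, if any."""
    parts = topic.split("/")
    if "node" in parts[:-1]:
        return parts[parts.index("node") + 1]
    return None


def group_by_node(lines):
    """Collect (topic, payload) pairs per hostname, skipping empty lines."""
    grouped = {}
    for line in lines:
        if not line.strip():
            continue
        topic, payload = parse_sub_line(line)
        grouped.setdefault(node_of(topic), []).append((topic, payload))
    return grouped
```

Passing an open file object for cpudata.log to group_by_node yields a dictionary keyed by hostname, which is a convenient starting point for per-node analysis.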

ACKNOWLEDGMENTS

Development of EXAMON has been supported by the EU FETHPC project ANTAREX (http://www.antarex-project.eu) (g.a. 671623) and the EU ERC project MULTITHERMAN (g.a. 291125).

examon's People

Contributors

abartolini, andreaborghesi, fbeneventi

examon's Issues

Can't build

I followed the instructions provided in the README on Installation/Building.

However this error appears:

make[2]: Entering directory '/home/jitschin/git/examon/lib/mosquitto-1.3.5/client'
cc pub_client.o -o mosquitto_pub  -L../lib ../lib/libmosquitto.so.1
../lib/libmosquitto.so.1: undefined reference to `OPENSSL_sk_num'
../lib/libmosquitto.so.1: undefined reference to `OPENSSL_init_ssl'
../lib/libmosquitto.so.1: undefined reference to `OPENSSL_sk_value'
../lib/libmosquitto.so.1: undefined reference to `OPENSSL_init_crypto'
../lib/libmosquitto.so.1: undefined reference to `SSL_CTX_set_options'
collect2: error: ld returned 1 exit status
Makefile:8: recipe for target 'mosquitto_pub' failed
make[2]: *** [mosquitto_pub] Error 1
make[2]: Leaving directory '/home/jitschin/git/examon/lib/mosquitto-1.3.5/client'
Makefile:17: recipe for target 'mosquitto' failed

It sounded to me like there could be some issue with OpenSSL, so I reinstalled it (sudo apt-get install --reinstall libssl-dev). Nothing changed. So I figured I could fetch a more recent OpenSSL, like version 1.1.1. I built it and installed it. However, building examon still failed with the same error.

I am running Ubuntu 16.04.4 LTS. uname -a gives me the following:
4.13.0-37-generic #42~16.04.1-Ubuntu SMP Wed Mar 7 16:03:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Here is a full listing of what make told me upon invocation: https://pastebin.com/HZxKRdn2

I would be happy to get a response within the next few days. Even a mere hint at a solution would be appreciated.

Difficulties (segmentation fault)

Thanks again for your quick mail-response. Sorry it took me a bit to compile this issue, I had to make sure the error wasn't due to some strange configuration on the test system.

I ran EXAMON on a test system (a Haswell) with the msr module loaded and mosquitto running, both with and without the KPTI patch (i.e. I also tried it with the kernel option nopti). Each time it segfaulted in the same function: rdpmc()

I compiled EXAMON with options -DDEBUG=True and -g and ran it through gdb. It told me a few more details about the segfault, here's what I got: https://pastebin.com/qzwmaLmv

The Linux version I am running according to uname -a is Linux mabus 4.13.0-26-generic #29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Could it be that the CPU (with family 6, model 60) is not supported?

Which steps could I take to get EXAMON running on said CPU?

Should I test EXAMON on a different CPU, e.g. an Ivy Bridge, Westmere EP, Skylake, Sandy Bridge, Sandy Bridge EP or another Haswell?
