system_monitor

Erlang telemetry collector

system_monitor is a BEAM VM monitoring and introspection application that helps in troubleshooting live systems. It collects various information about Erlang processes and applications. Unlike observer, system_monitor does not require connecting to the monitored system via the Erlang distribution protocol, and can be used to monitor systems with very tight access restrictions.

Features

Process top

The process top dashboard presents information about the top N Erlang processes that consume the most resources (such as reductions or memory) or have the longest message queues:

[Screenshot: process top dashboard]

Historical data can be accessed via the standard Grafana time picker. The status panel can display important information about the node state. Pids of the processes on that dashboard are clickable links that lead to the process history dashboard.

Process history

[Screenshot: process history dashboard]

The process history dashboard displays time series data about a particular Erlang process. Note that some data points can be missing if the process didn't consume enough resources to appear in the process top.

Application top

[Screenshot: application top dashboard]

The application top dashboard contains various information aggregated per OTP application.

Usage example

In order to integrate system_monitor into your system, simply add it to the release apps. Add the following lines to rebar.config:

{deps, [..., system_monitor]}.

{relx,
 [ {release, {my_release, "1.0.0"},
    [kernel, sasl, ..., system_monitor]}
 ]}.

To enable export to Postgres:

application:load(system_monitor),
application:set_env(system_monitor, callback_mod, system_monitor_pg).
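The same setting can also be supplied statically through the application environment. A minimal sketch, assuming the release loads a standard sys.config file (the file name is an assumption; the key and value are the ones used in the call above):

```erlang
%% sys.config (assumed file name; use whichever config file your release loads)
[
  {system_monitor,
    [ {callback_mod, system_monitor_pg}  %% same key and value as in the set_env call
    ]}
].
```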

Custom node status

system_monitor can export arbitrary node status information that is deemed important for the operator. This is done by defining a callback function that returns an HTML-formatted string (or iolist):

-module(foo).

-export([node_status/0]).

node_status() ->
  ["my node type<br/>",
   case healthy() of
     true  -> "<font color=#0f0>UP</font><br/>";
     false -> "<mark>DEGRADED</mark><br/>"
   end,
   io_lib:format("very important value=~p", [very_important_value()])
  ].

This callback then needs to be added to the system_monitor application environment:

{system_monitor,
   [ {node_status_fun, {foo, node_status}}
   ...
   ]}

More information about configurable options is found here.

How it all works out

system_monitor spawns several processes, each with its own responsibility:

  • system_monitor_top: collects a certain amount of data from the BEAM for a preconfigured number of processes
  • system_monitor_events: subscribes to certain types of preconfigured BEAM events, such as busy_port, long_gc, long_schedule, etc.
  • system_monitor: runs a set of preconfigured monitors periodically
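
As a rough illustration of what collecting a process top involves, the sketch below ranks live processes by a single process_info/2 item. The module and function names are hypothetical and this is not the system_monitor_top implementation, which is certainly more elaborate; it only uses the standard erlang:processes/0 and erlang:process_info/2 BIFs.

```erlang
-module(proc_top_sketch).
-export([top/2]).

%% Hypothetical helper, not part of system_monitor.
%% Key must be a process_info/2 item with an integer value,
%% e.g. memory, reductions or message_queue_len.
-spec top(atom(), non_neg_integer()) -> [{pid(), integer()}].
top(Key, N) ->
  %% process_info/2 returns undefined for processes that have already
  %% exited; the pattern in the generator silently skips those.
  Samples = [{Pid, Val} || Pid <- erlang:processes(),
                           {_, Val} <- [erlang:process_info(Pid, Key)]],
  %% Sort ascending by value, reverse, and keep the N largest.
  lists:sublist(lists:reverse(lists:keysort(2, Samples)), N).
```

For example, proc_top_sketch:top(memory, 5) returns up to five {Pid, Bytes} pairs, largest first. Scanning every process on each call is acceptable for a sketch but can be expensive on nodes with very many processes.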

What are the preconfigured monitors

  • check_process_count: logs if the process count passes a certain threshold
  • suspect_procs: logs if it detects processes with suspiciously high memory
  • report_full_status: gets the state from system_monitor_top and sends it to a backend module that implements the system_monitor_callback behavior. The backend is selected by binding callback_mod in the system_monitor application environment to that module. If callback_mod is unbound, this monitor is disabled. The preconfigured backend is Postgres, implemented by system_monitor_pg.

system_monitor_pg tolerates Postgres being temporarily down by storing the stats in its own internal buffer. This buffer is a sliding window, which stops the state from growing too big when Postgres is down for too long. On top of this, system_monitor_pg has a built-in load shedding mechanism that protects it once the message queue length grows beyond a certain level.
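
The sliding-window idea can be sketched as a bounded FIFO buffer that drops its oldest entry when full. This is a hypothetical illustration built on the standard queue module, not the actual system_monitor_pg code; the module, record, and function names are all assumptions.

```erlang
-module(sliding_buffer_sketch).
-export([new/1, push/2, to_list/1]).

%% Hypothetical bounded buffer: holds at most `max` items.
-record(buf, { max          :: pos_integer()
             , size = 0     :: non_neg_integer()
             , q = queue:new() :: queue:queue()
             }).

%% Create an empty buffer with the given capacity.
new(Max) when is_integer(Max), Max > 0 ->
  #buf{max = Max}.

%% Append an item. While there is room the buffer simply grows;
%% once full, the oldest item is dropped so the size stays at `max`.
push(Item, #buf{max = Max, size = Size, q = Q} = B) when Size < Max ->
  B#buf{size = Size + 1, q = queue:in(Item, Q)};
push(Item, #buf{q = Q0} = B) ->
  {_Dropped, Q1} = queue:out(Q0),
  B#buf{q = queue:in(Item, Q1)}.

%% Return the buffered items, oldest first.
to_list(#buf{q = Q}) ->
  queue:to_list(Q).
```

The size is tracked in the record because queue:len/1 walks the whole queue. With a capacity of 2, pushing a, b and then c leaves the buffer holding b and c.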

Local development

A Postgres and Grafana cluster can be spun up using make dev-start and stopped using make dev-stop. Start system_monitor by calling rebar3 shell and start the application with application:ensure_all_started(system_monitor).

At this point a Grafana instance with some predefined dashboards will be available on localhost:3000. The default login is "admin" and the default password is "admin".

Production setup

If you choose a system_monitor -> Postgres setup for production, Postgres has to be set up in a similar way to what is done in the Dockerfile for Postgres.

How to contribute

See our guide on contributing.

Release History

See our changelog.

License

Copyright © 2020-2023 Klarna Bank AB

For license details, see the LICENSE file in the root of this project.

system_monitor's Issues

split into producer/consumer

Hi,

Very nice application!

Here is just a thought: it would be nice to split it into a producer that digs up the info, then posts it (e.g. as JSON) via HTTP to an accompanying consumer that then talks to Postgres.

This way, you would restrict the amount of code that needs to run on your production machine.

Hey, I might even implement it myself someday... ;-)

Cheers, Tobbe

Wrong tuple size used in match

In the system_monitor:terminate/2 there is a tuple match done with what is stored in State#state.monitors;
this looks wrong to me, surely it should match against a 5-element tuple ( {Module,Function,RunOnTerminate, TicksReset, TicksDecremented} ) and not as it is done today (a 4-element tuple).

{Monitor, true, _TicksReset, _Ticks} <- State#state.monitors].

Handle duplicate_table errors in system_monitor_pg

The query creating new table partitions in system_monitor_pg may crash with a duplicate_table error if multiple nodes are using the same DB:

[{ok, [], []}, {ok, [], []}] = epgsql:squery(Conn, Query)

This was surprising to me, because I was expecting IF NOT EXISTS to avoid this error, but as I've learned on Stackoverflow:

The IF NOT EXISTS is meant to deal idempotency, not concurrency.

🤷

Consider syncing changes from https://github.com/ieQu1/system_monitor

Hello,

We've made some significant performance optimizations in this application, mostly to support systems with a much larger number of processes.

  • Size of the delta record has been shrunk to reduce memory usage (now it only stores the pid, prev. reductions and memory, everything else is derived from the runtime data).
  • Postgres operations are batched
  • Data collection has been moved to a separate process to avoid blocking the server with requests for data
  • Simplified sampling algorithm
  • Improved postgres schema for application and function (non-BWC change)

Improvements:

  • "Very Important Processes": always collect metrics for a configurable list of registered processes
  • Added automatic tests
