
xprof's Introduction


XProf is a visual tracer and profiler that allows you to track execution of Elixir / Erlang functions in real-time.

Goal

XProf was originally created to help solve performance problems of live, highly concurrent and heavily utilized back-end systems. It's often the case that high latency or high CPU usage is caused by very specific requests that trigger inefficient code. Finding this code is usually pretty difficult.

In this original usage one would first inspect execution time statistics to get an overview of the system, then capture arguments and results (return value or exception) of function calls that lasted longer than a given number of milliseconds.

With the introduction of xprof commands via an extended query syntax in version 2.0, more versatile statistics, filters and other features became available.

What does it look like

Click the image below to watch a short demo investigating the TryMe application with XProf. The function nap sometimes takes way too much time (as you would guess from the name, it sleeps a bit). In the video we:

  • observe call count and duration percentiles
  • capture arguments and return values of a few long calls
  • apply a match spec to filter out "long" calls
  • compare two functions

XProf Demo

How to use it

  1. Add xprof to your build tool config file (and optionally also to the release config file such as reltool.config in order to include it in your release).
  2. Build your project.
  3. Start xprof by executing xprof:start(). in Erlang shell, or :xprof.start in Elixir shell.
  4. Go to http://localhost:7890 (replace localhost with your server’s hostname if you connect to a remote host).
  5. Type in the function that you would like to start tracing.
  6. Start tracing by clicking the green button.

The preferred way is to add the xprof Hex package as a dependency to your rebar3 config or Mix project file:

%% rebar.config (at least version `3.3.3` is required):

{deps, [
       ...
       {xprof, "2.0.0-rc.5"}
]}.
# `mix.exs`:

defp deps do
  [
    ...
    {:xprof, "~> 2.0.0-rc.5"}
  ]
end

You can also fetch from the GitHub repository (not recommended, only for development; requires rebar3 3.14 or newer):

{deps, [
        ...
        {xprof_core, {git_subdir, "https://github.com/Appliscale/xprof.git", {tag, "2.0.0-rc.5"}, "apps/xprof_core"}},
        {xprof_gui, {git_subdir, "https://github.com/Appliscale/xprof.git", {tag, "2.0.0-rc.5"}, "apps/xprof_gui"}},
        {xprof, {git_subdir, "https://github.com/Appliscale/xprof.git", {tag, "2.0.0-rc.5"}, "apps/xprof"}}
]}.
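
If you build a release with rebar3/relx, a minimal sketch of including xprof in the release (as mentioned in step 1 above; the release and app names below are placeholders):

%% rebar.config
{relx, [
    {release, {my_app, "0.1.0"},
     [my_app,
      xprof]}   %% include xprof so it can be started on the target node
]}.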

Supported Versions

XProf currently supports Erlang/OTP 18 - 26. Newer OTP versions (if any) might work but are not tested.

Syntax mode

XProf supports both Erlang and Elixir syntax. If the elixir application is running, XProf uses Elixir syntax, otherwise Erlang syntax, both to read the function to trace and to print captured arguments. It is also possible to manually set the preferred mode.
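
For example, using the mode application environment variable documented in the Configuration section below, the syntax mode can be forced to Elixir (a sketch, not the only way to set it):

%% sys.config
[{xprof_core, [{mode, elixir}]}].

%% or at runtime from the shell:
application:set_env(xprof_core, mode, elixir).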

XProf flavoured match-spec funs

In the function browser, apart from a simple module-function-arity, you can also specify further filters in the form of match-spec funs (similar to recon or redbug), as well as xprof commands with options. For details see the page on Query syntax.
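
As an illustration only (the exact accepted forms are described on the Query syntax page; the module, function and threshold here are placeholders), such a match-spec fun might restrict measurements to calls whose argument exceeds a limit:

%% only measure calls to timer:sleep/1 where the requested time exceeds 100 ms
timer:sleep(N) when N > 100 -> true.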

Recursive functions

By default XProf only measures the outermost call to a recursive function. For example lists:map(fun f/1, [1, 2, 3]). will only register one call to lists:map/2. This is also true for indirectly recursive functions (such as when a calls b and b calls a again). If this behaviour is undesirable, it can be disabled by setting the ignore_recursion application environment variable to false.
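
For example, in sys.config (a sketch based on the Configuration table below):

[{xprof_core, [{ignore_recursion, false}]}].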

Erlang records

Erlang record syntax is supported in queries and works similarly to the Erlang shell. XProf keeps a single global list of loaded record definitions. Record definitions can be loaded at startup from modules listed in the app env load_records, or at runtime by calling xprof_core:rr(Module) (see the documentation of xprof_core for more details). The record definitions are extracted from the debug_info of the beam files belonging to the loaded modules. As the list is global, there can be only one record with a given name loaded at a time and records loaded later might override previously loaded ones.
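
A sketch of both ways of loading record definitions, assuming a module my_records compiled with debug_info (the module name is a placeholder):

%% at startup, via the app env:
[{xprof_core, [{load_records, [my_records]}]}].

%% at runtime, from the shell:
xprof_core:rr(my_records).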

Configuration

You can configure XProf by changing the application environment variables below:

Application | Key | Default | Description
xprof_gui | ip | any | Listen address of the web interface (in tuple format, see inet:ip_address())
xprof_gui | port | 7890 | Port for the web interface
xprof_gui | favourites_enabled | true | Whether saving/loading favourite queries is enabled
xprof_gui | favourites_config | ./favourites.cfg | Path of the file storing favourite queries
xprof_core | max_tracer_queue_len | 1000 | Overflow protection: if the main tracer process accumulates more than this many messages in its queue, tracing is stopped and has to be resumed manually with the trace button. This prevents out-of-memory crashes when the tracer process cannot keep up with incoming traces, which may happen when tracing a very "hot" function.
xprof_core | max_duration | 30000 | The largest duration value in ms. If a call takes even longer, this maximum value is stored instead.
xprof_core | ignore_recursion | true | Whether to measure only the outermost call to a recursive function (true) or every call, including recursive ones (false)
xprof_core | mode | (auto-detected) | Syntax mode (erlang or elixir)
xprof_core | load_records | [] | List of modules from which to load record definitions at startup
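
For example, a sys.config combining a few of these settings might look like this (the values are illustrative):

[
 {xprof_gui,  [{ip, {0,0,0,0}},
               {port, 7891},
               {favourites_config, "/var/lib/xprof/favourites.cfg"}]},
 {xprof_core, [{max_duration, 60000},
               {load_records, [my_records]}]}
].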

Compile-time configuration

XPROF_ERL_HIST - By default XProf uses the hdr_histogram_erl NIF library. If you have compilation problems you can choose to use a pure Erlang histogram implementation instead by defining the OS env var XPROF_ERL_HIST when compiling xprof_core.

COWBOY_VERSION - By default XProf uses Cowboy version 2.x. This version is only supported from Erlang/OTP 19 and is not backwards compatible with older Cowboy versions. If for some reason you would like to use Cowboy version 1.x you can define the OS env var COWBOY_VERSION=1 when compiling xprof_gui.

XPROF_JSON_LIB - By default XProf uses the jsone library. If you would like to use a different json library you can define the OS env var XPROF_JSON_LIB when compiling xprof_gui. It is assumed that the library module exports an encode/1 function that returns the encoded binary. If your preferred json library uses a different name for such a function, you can set it with XPROF_JSON_ENC_FUN.

Examples

export XPROF_ERL_HIST=true
export COWBOY_VERSION=1
export XPROF_JSON_LIB='Elixir.Jason'
export XPROF_JSON_ENC_FUN='encode!'

Web Interface

XProf's web interface supports a lot of small but convenient features such as query auto-completion, recent queries, favourite queries, a list of called functions, collapsing graphs, multiple graphs in a row and so on.

Keyboard shortcuts

  • Ctrl-i: switch between "search" and "favourites" mode of the query box

In "search" mode

  • UP/DOWN arrows (cursor in query box): scroll through recent queries
  • UP/DOWN arrows (cursor in suggestion list below query box): scroll through auto-completion suggestions
  • TAB: if no suggestion is selected yet, auto-complete to the longest common prefix of the dropdown list items; otherwise copy the selected item to the search box and refresh the dropdown list.
  • ENTER: start monitoring either the selected suggestion if there is any or the expression in the search box.

In "favorites" mode

  • Search only starts after typing the second character. (For 0 and 1 chars you see the list of all favorite queries)
  • UP/DOWN arrows: scroll through the list
  • ENTER: start monitoring highlighted query if there is any (also added to recent queries)
  • ESC: reset search (clear the query box and hide the list)
  • TAB: show full list of favourite queries, when the query box is empty and the list isn't visible

Contributing

All improvements, fixes and ideas are very welcome!

The project uses rebar3 for building and testing the Erlang code. The web UI part resides in the xprof_gui app's priv directory and is already precompiled, so there is no need to build the JS sources in order to run xprof.

Running tests

make test

Working with JS sources

The WebUI uses

  • React.js
  • ECMAScript 6 (with elements from 7th version).
  • Bootstrap
  • Webpack

All resources are in apps/xprof_gui/priv directory. The src folder contains the sources and the build folder is a placeholder for final JS generated by webpack and then served by cowboy server (XProf's dependency).

Starting XProf in development mode

To develop xprof in a convenient way the following setup is recommended.

You have to invoke the following command once, if you do not have the JS dependencies yet or need to update them:

$ make bootstrap_front_end

Then follow the normal development flow: in the first terminal window start the Erlang part of xprof. The sync app will be started as well; it automatically reloads Erlang modules that have changed, so you don't need to recompile every time something changes.

$ make dev_back_end

In the second window install all the assets and start webpack in development mode, which will also recompile the JS files into the apps/xprof_gui/priv/build-dev directory whenever they are modified. To achieve that use the following command:

$ make dev_front_end

xprof's People

Contributors

aboroska, afronski, gomoripeti, izoslav, jlampar, kianmeng, marcinkasprowicz, mniec, natif, ppikula


xprof's Issues

Run elixir tests properly on travis

We would like to test xprof with and without elixir.
Because we also would like to test with old Erlang versions (like OTP 16) that are not supported by any Elixir version, we cannot use language: elixir.

A workaround seems to be to keep using language: erlang and manually download a specific Elixir version (try with kiex).

Tab is crashing after some time

There is a memory leak on the client side that is more visible because of certain browser extensions, and it causes the tab to crash after 2-5 minutes. It is related to the charts and graphs library.

Incognito mode masks the problem: memory still grows, but at a smaller rate, so you are able to use xprof much longer.

Display full graph legend

The legend is missing descriptions, for example for the percentile lines/colours. All in all there are more lines and colours on the graph than in the legend.

Introduce websocket for some updates from the server

Some of the REST endpoints (that are currently polled by the front end) could be replaced by websocket notifications from the server:

These definitely, as they usually don't return new info:

  • "api/trace_status"
  • "api/mon_get_all"

maybe these too:

  • "api/capture_data"
  • "api/data"

Support both cowboy 1.x and 2.x version

Why would we support cowboy 1.x

  • first of all because cowboy 2.x requires OTP 19+ and uses maps, and we would like to use xprof even on R16B
  • there might be other projects that use newer Erlang version but their own app still depends on cowboy 1.x

Handling cowboy dependency

  • Mix allows depending on multiple versions: {cowboy, "~> 1.1.2 or ~> 2.0"}
    then we need to also add a mix.exs to xprof
  • rebar3 supports overrides so the app including xprof can set an arbitrary cowboy version
    We need to update the instructions on how to do this (and decide which cowboy version should be the default in xprof, probably 2.0)
  • maybe different rebar3 profiles (how to set profile from user's app?)
  • we can create separate branch for legacy cowboy 1.x support
    we need to keep it up-to-date ( 👎 )
    we can tag separate versions and publish them on hex.pm (xprof 2.0+cowboy1)

Compiling

  • As a minimum we need a macro indicating whether maps are supported, and omit code snippets (with ifdefs) which include maps
    This can be done automatically with platform_define in xprof's rebar.config
  • The user can define env var/macro XPROF_COWBOY_1 to enforce cowboy 1.x usage otherwise cowboy 2.x will be used (at compile time!)
    rebar.config.script can adjust deps based on this (as well as unneeded code can be skipped with ifdef)

Runtime

  • xprof_gui_app can detect at runtime cowboy version and use the appropriate cowboy API and handler module to start cowboy
    This is automatic and cool

To sum up there are probably two options

  1. Explicit
    • user defines XPROF_COWBOY_1 to enforce version at compile time
    • rebar.config.script + ifdefs in code
  2. Implicit/Automatic
    • mix.exs and rebar3 override to adjust the dependency
    • automatic macro if maps are supported (+ifdefs in code)
    • xprof_gui_app detects version at runtime

...and probably a third, simplest one I did not think of.

Sample WIP code for 1. can be seen here: xprof_gui_cowboy_handler.erl, xprof_gui_app.erl

Update query for monitored function

Allow updating the query (in place in the graph title bar) for already monitored functions. This can be done without backend modification with a demonitor/monitor combo.
Some restrictions may apply, such as only allowing to modify the match-spec part and not the module and function.
Some attention needs to be paid to error handling, eg if the new query is invalid keep the old graph unchanged.

UI Bugs - Function Call Capturing

It is related to placeholders in text fields, which are not sent if you do not type them explicitly (there are no defaults). After that a huge stack trace is printed in the console (an error related to converting an empty binary to an integer).

The same applies to values that are not integers, such as floats (e.g. 0.1).

Lager collides with Elixir.Logger

iex(1)> :xprof.start()
13:33:15.967 [error] Supervisor 'Elixir.Logger.Supervisor' had 
child 'Elixir.Logger.ErrorHandler' started with 
'Elixir.Logger.Watcher':watcher(error_logger, 'Elixir.Logger.ErrorHandler',
{true,false,500}) at <0.275.0> exit with reason normal in 
context child_terminated

Conflict because of lager. Nothing critical for now, it works properly - but we need to think about how to handle that in the future.

json encoding crashes on some lists

The below crash was seen when capturing a function with arglist: [164]

{[{reason,badarg},{mfa,{xprof_gui_cowboy1_handler,handle,2}},
{stacktrace,[{jsone_encode,escape_string,[<<"¤\"">>,[{object_members,[{res,<<"ok">>}]},{array_values,[
{[{id,2},{pid,<<"<0.10852.1>">>},{call_time,164129},{args,<<"\"¤\"">>},{res,<<"ok">>}]},
{[{id,3},{pid,<<"<0.10822.1>">>},{call_time,164566},{args,<<"\"¤\"">>},{res,<<"ok">>}]},
{[{id,4},{pid,<<"<0.10904.1>">>},{call_time,164138},{args,<<"\"¤\"">>},{res,<<"ok">>}]},
{[{id,5},{pid,<<"<0.10834.1>">>},{call_time,164467},{args,<<"\"¤\"">>},{res,<<"ok">>}]}]},
{object_members,[{has_more,false}]}],<<"{\"capture_id\":1,\"threshold\":150,\"limit\":5,\"items\":[{\"id\":1,\"pid\":\"<0.10817.1>\",\"call_time\":164566,\"args\":\"\\\"">>,
{encode_opt_v2,false,[{scientific,20}],{iso8601,0},string,0,0}],[{line,222}]},

Using OTP 19.2, XProf release_2.0 branch commit ef1ab9d, erlang syntax mode

Of course there is something wrong with integer-list and unicode handling.
As a side-idea we could pass the list of arguments as a json list (formatting each arg separately), maybe this would give a bit better readability and line-wrapping on the gui.

Change overload protection logic

Right now we check the process queue: if it exceeds 1K messages we stop tracing, and when it goes back to 0 we turn tracing on again. The idea is to disable this auto-on feature because it doesn't work as expected: when the node enters this on-off loop, sooner or later it dies. Moreover, the graphs plotted during that period are not valuable.

Extend query syntax

In order to add more functionality we need to extend the query syntax to allow more options and alternatives. (eg #108 )

Requirements:

  • It must be parsable (at least tokenize, then split, then parse - this is how query processing in erlang-syntax mode already works) and have a familiar syntax in both Erlang and Elixir.
  • (Maybe) It also has to be generated from a list of options (the opposite of parsing, a one-to-one mapping)

1. Action-functions

One option is to extend the match-spec syntax by allowing more action-functions.

  • examples
    mod:fun(_,A,_) -> argdist(A)
    mod:fun(_,A,B) -> argdist(A+B)
    mod:fun(_,A,_) -> argdist(A >= 0, enum)
    mod:fun(_,A,_) -> argdist(A >= 0, [{enum, 2}, {interval, 5}])
    
    mod:fun(_,_,_) when caller(mod2,fun2,_)
  • advantages: compact, allows some small sugars in the match-spec (like caller guard)
  • disadvantage: limited, what if a feature does not work on mfas at all (eg gc profiling)

2. Record/Struct/Keyword list

Other choice is to have something like records or structs:

  • construction:

    • starts with a special character
    • followed by a named-command
    • then key-value pairs of named-arguments
    • last one is the match-spec (to allow special parsing)
    • no enclosing {}
  • examples (multiple lines for readability)
    Erlang:

    #argdist enum = 2,
      interval = "5sec",
      derive: if A > 0 -> positive; true -> neg_or_zero end,
      mfa = mod:fun(_, A, _)
    
    #funlatency caller: xprof_core_lib:detect_mod/0, mfa : lists:keymember/3

    Elixir:

    %Argdist int: -100..100,
        interval: "5sec",
        mfa: Mod.fun(_, a, _)
    %Funlatency caller: String.split/3, mfa : Keyword.get/3
  • advantage: more flexible, it is easier to generate (even on the frontend)

  • disadvantage: more verbose (how will this fit in graph title?)

We can still expose the raw/parsed Erlang API:
xprof_core:monitor_pp(QueryString) would parse the query and convert into a lower level xprof_core:monitor(Command, Options) format.
Current xprof_core:monitor({Mod,Fun,Arity}) would be equal with xprof_core:monitor(funlatency, [{mfa, {Mod,Fun,Arity}}])

3. Options

Alternatively we can keep current query syntax and add option buttons/widgets on the gui. They map to a separate option list in the Erlang API.

JS warnings when demonitoring from shell

When a function is demonitored directly from the command line, the GUI still keeps on polling the server for data, emitting the following in the JS console indefinitely:

Warning: setState(...): Can only update a mounted or mounting component. This usually means you called setState() on an unmounted component. This is a no-op.vendors.js:28918:8
GET XHR http://localhost:7890/api/data [HTTP/1.1 404 Not Found 1ms]

this setState is called here
https://github.com/mniec/xprof/blob/master/priv/app/call_tracer.jsx#L90

Save favourite queries

Keep a list of favourite queries in the backend.
(Advantage is that it can be shared among users, but can only be shared among multiple target nodes in case of a setup with a separate common xprof gui node.)

Backend

  • Responsibility of xprof_gui (as opposed to xprof_core)
  • Have to be persisted to a file to survive xprof_gui application restarts
  • File format should be human readable and editable (eg erlang term format; a hypothetical sketch follows this list)
  • File location configurable with app env (with sensible default)
  • Allow loading multiple files (could be done in second iteration)
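
A hypothetical sketch of what such a favourites file could contain, assuming plain Erlang term format readable with file:consult/1 (the actual format and example queries are not decided, they are placeholders):

%% favourites.cfg - one query string per term
"lists:sort/1".
"timer:sleep(N) when N > 100 -> true.".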

Frontend

  • Each graph (monitored function) should have a button (star) to add it to favourites
  • Find a way to list favourite queries (open to discussion). Would be nice if it would fit in style to the current Query textbox/Autocomplete scheme (eg with some key-binding, star button or menu item list all favourites in the suggestion dropdown list and allow filtering based on text in query text box)

Generate docs for hexpm

Generate documentation that looks somewhat better than the default edoc style (look at elixir docs).
Figure out how to publish (eg. maybe both Erlang API from xprof_core and REST API/GUI guide from xprof_gui should be copied/duplicated under xprof app documentation).
What should be the source format of non-edoc pages/guides (eg markdown)?

Graph layout

  • Allow to fit more than 1 graph per row (1..4)
  • Hide/collapse each graph

Open question: if there are multiple graphs per row, should it collapse all?

Save recent queries

Keep a list of recent queries in the browser storage (or also in the backend?).

With a keybinding (eg tab, up arrow, down arrow in empty query textbox), button or menu item list recent queries in the drop-down suggestion list and allow searching by prefix (or substring?) by typing in the query textbox.
Enter on selected list-item should only copy the query to the textbox allowing for further editing (as opposed to normal query input when enter sends the query) (should it? open to discussion)

Make some ct testcases more robust

Some of the ct testcases fail from time to time, but not deterministically.

This failure is quite common (precision of hdr histogram)

%%% xprof_tracing_SUITE ==> long_call: FAILED
%%% xprof_tracing_SUITE ==> 
Failure/Error: ?assertMatch({ true , _ }, { Min < 21 * 1000 , Min })
  expected: = { true , _ }
       got: {false,21136}      line: 303

Another case that's rare to fail

%%% xprof_http_e2e_SUITE ==> get_overflow_status_after_hitting_overload: FAILED
%%% xprof_http_e2e_SUITE ==> 
Failure/Error: ?assertEqual([{<<115,116,97,116,117,115>>,<<111,118,101,114,102,108,111,119>>}], JSON)
  expected: [{<<"status">>,<<"overflow">>}]
       got: [{<<"status">>,<<"running">>}]      line: 81

Split GUI from core

Several prospective users indicated that it would be useful to have the GUI (webserver+js) split from the tracing core part.
This ticket is to track requirements, use cases and implementation ideas. Please give feedback in the comments.

The suggested way of using xprof currently is to include it as is in the Erlang/Elixir release of the target node.
Some of the motivations for only including the tracer core are

  • target release already contains conflicting version of cowboy
  • target system already has a UI and only xprof tracer core would be used with a different custom UI
  • include the least dependencies in the target release and run xprof GUI in a separate node (possibly even on a separate host)

Some implementation options that are discarded

  • include nothing in target node and send trace messages via port: unfortunately this solution turned out to have unacceptable overhead
  • include nothing in target node and inject code: we want to use hdr_histogram nif in the target node (and only send aggregated data outside) and nifs cannot be injected (in a nice way)

Requirements/implementation highlights so far

  • two components would communicate via erlang distribution
  • the tracer core will have a documented erlang API
  • current way of usage should be still possible (ie include everything in target node)

Open questions so far

  • how exactly to build releases and packages

Make occasionally failing ct testcases more robust

Some of the ct testcases fail from time to time but not deterministically

This failure is quite common and hopeless to fix because of how timers work in Erlang. (A call to timer:sleep(20) is guaranteed to last at least 20 ms but there is no upper limit)

%%% xprof_tracing_SUITE ==> long_call: FAILED
%%% xprof_tracing_SUITE ==> 
Failure/Error: ?assertMatch({ true , _ }, { Min < 22 * 1000 , Min })
  expected: = { true , _ }
       got: {false,24480}      line: 396

Another example when samples are not yet available. (This happens quite often in long_call testcase)

%%% xprof_http_e2e_SUITE ==> capture_data_when_traced_test: FAILED
%%% xprof_http_e2e_SUITE ==>
Failure/Error: ?assertEqual(1, length ( proplists : get_value ( << "items" >> , Data ) ))
expected: 1
got: 0 line: 210

One way to make snapshots more deterministic (without waiting 1-2 seconds) is to add a function to xprof_core_trace_handler to trigger taking one synchronously (and call erlang:trace_delivered before). But as the key of the snapshots is the timestamp in seconds, there can only be one snapshot per second (and some testcases require two).

Or captured data is not yet available:

%%% xprof_http_e2e_SUITE ==> capture_data_with_formatted_exception_test: FAILED
%%% xprof_http_e2e_SUITE ==>
Failure/Error: ?assertMatch([ << "** exception error: no match of right hand side value ok" >> ], [ proplists : get_value ( << "res" >> , Item ) || Item <- proplists : get_value ( << "items" >> , Data ) ])
expected: = [ << "** exception error: no match of right hand side value ok" >> ]
got: [] line: 244

No error information in UI

Currently all erroneous situations are silently ignored without any notification in the UI. We need to provide a mechanism for communicating the following types of errors:

  • Errors in syntax for XProf query.
  • Errors in impossible situations (e.g. tracing Elixir.Enum.member?/2 and afterwards a similar pattern Elixir.Enum.member?(_, 2)).
  • Errors for autocompletion (no such pattern found etc.).

FE micro-opt: avoid rendering if not needed

A setState call will always trigger a re-render of the component although maybe nothing has changed. Rendering a component will also trigger re-rendering of its children. Polling functions such as api/mon_get_all, api/trace_status and api/capture_data usually do not change the state. There are multiple options:

  • only call setState conditionally if any of the fields in state would change
    (eg. like gomoripeti@8345fd2)
  • implement shouldComponentUpdate() to check if state changed
  • use React.PureComponent to automatically detect no change

(https://reactjs.org/docs/react-component.html#setstate)

This optimisation might not give much gain as the most costly to update are the graphs themselves which actually change with every update.

Switch plotting library to c3.js

The Flot library is no longer developed (since 2014) and it lacks a lot of features that come out of the box with c3.js. Additionally, c3.js has React wrappers that can simplify our current codebase.

As a byproduct we can ditch bower, currently it is kept only for the flot dependencies. c3.js is available through npm.

link: http://c3js.org/

ProcDict leak in trace handler

The trace handler process stores the start time and arguments for each function call in its process dictionary so that it can fetch them when the function returns. However it is possible that the function-finished event (return_from or exception_from) never arrives. (Eg. the function does an infinite loop or calls another function that does that. Or the process executing the function exits/gets killed without crashing.)

One idea is to garbage collect such entries from the procdict removing those older than max_duration.
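
A hedged sketch of such a sweep, assuming (hypothetically) that entries are keyed by a call identifier and store {StartTimestampMs, Args} tuples, and that MaxDuration is in milliseconds:

sweep_stale_entries(MaxDuration) ->
    Now = erlang:monotonic_time(millisecond),
    %% get/0 returns the whole process dictionary as {Key, Value} pairs;
    %% entries with a different shape simply don't match the generator pattern
    [erase(CallId) || {CallId, {StartTs, _Args}} <- get(),
                      Now - StartTs > MaxDuration],
    ok.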

Improve match-spec fun syntax

Current implementation is a direct mapping of arguments of dbg:fun2ms.

Take a look at redbug's syntax which supports some handy shortcuts, and aim for intuitive simplicity.

  • don't wrap arguments in a list
  • leave out body and -> if not relevant
  • etc

Allow to order captured long calls by any column

Currently rows are ordered as events arrive latest at the bottom. This makes sense in most cases especially if the rows are limited to a handful.
Some users might want to capture more rows (eg ~100) and then it can be useful to look at the newest events or longest etc on the top.
The selected column for ordering should be saved per function and preserved as long as that function is monitored. (ie. between stopping and restarting long call capturing for that function)

UI Bugs - Auto-complete

When you type an almost full name, but without e.g. the arity, and you hit ENTER, it does not select the first suggestion out of the box.

The same happens if you choose one with the arrows and then hit ENTER: it does not search for that one out of the box; you need a second ENTER hit for those.

Pause time on graph

Add a button to pause time and temporarily stop updating the graphs. This can be useful if some "extraordinary" event happens and we don't want valuable data to slowly slide away.

To meta-trace or not to meta-trace

Normal tracing in Erlang can be controlled in two parts:

  • erlang:trace/3 controls which processes to trace (all/none/new/existing/Pid) and global flags (like timestamp format, arity, tracer module, and which events to trace like call/proc/gc)
  • erlang:trace_pattern/3 controls MFA+match-spec

In XProf the former is controlled by the singleton xprof_core_tracer
gen_server (for example to pause all tracing at once manually or automatically
in case of overload) and the later by xprof_core_trace_handler per each
MFA.

The problem is that we would like to set some of the global flags on a per MFA granularity. Eg use arity when we don't capture long calls. (This is worked around currently with some match-spec trick). With the dawn of our tracing NIF we would also like to set different state for the tracer module per MFA.

Meta-tracing traces all processes and does not care about the process trace
flags set by erlang:trace/3, the trace flags are instead fixed to [call, timestamp]. The match specification function return_trace() works with
meta-trace and sends its trace message to the same tracer.

Meta-tracing would allow to set a different tracer module with a different state
per MFA. (This would be ideal for us). (It does not allow to use the arity
flag, so this is out of question to use in the legacy solution without a tracer
module.)

On the other hand meta-tracing would always trace all the processes. (Currently
we only support tracing all the processes from the GUI anyway but tracing only 1
process from the shell works and this is used by some tests and profiling.) If
we would like to implement something else than all-the-processes we need to
implement it in the tracer module (enable function is a good candidate) and we
cannot use the VM's tracing infrastructure.

More importantly the role of XProf processes would change. The main
xprof_core_tracer would lose its global control of pausing all tracing at
once. (It could do this by calling erlang:trace_pattern/3 for each monitored
MFA.) It would also be possible to send trace messages directly to the per MFA
gen_servers instead of the singleton proxy process which would have main role
remaining to control starting/stopping monitoring MFAs. Overload protection
would be moved to the per MFA servers too. This would be a somewhat big
architectural change (To be investigated if removing the proxy is good or bad
from safety perspective.)

An alternative is to pass a map as the global tracer module state which has one entry per MFA. Unfortunately this needs to be updated whenever there is a change in any of the MFA states. Also needs to be done by the singleton xprof_core_tracer while per MFA state is the responsibility of the per MFA gen_servers.

Another similar tracing is call-counting which is also valid for all processes and controlled only by erlang:trace_pattern/3.

Argument distribution

Instead of plotting latency of a function, plot the distribution of an
argument. Additionally it would be nice to plot a value derived from one or more
arguments. Deriving can happen in the match-spec (limited) or after passing all
the necessary arguments to an arbitrary "derive" fun which could be executed in
the trace handler.

The value

  • int: in the simplest case must be an integer (with a lower and upper bound; default range can be 0..max_duration)
  • enum: or any term with max number of combinations/buckets

Options:

  • aggregation method: histogram (only integer; int) or frequency count (any term, enum)
  • interval: how often to take a snapshot of the collected data (this is hard-coded currently for function latency to 1 second, but could make sense to set larger intervals in this case) (maybe does not need to be implemented)

http://manpages.ubuntu.com/manpages/zesty/man8/argdist-bpfcc.8.html

Query syntax

With the current syntax it would be possible to add extra action function:

mod:fun(_,A,_) -> argdist(A)
mod:fun(_,A,B) -> argdist(A+B)
mod:fun(_,A,_) -> argdist(A >= 0, enum)

Extended syntax:
(multiple lines for readability)

#argdist enum = 2,
    interval = "5sec",
    derive: if A > 0 -> positive; true -> neg_or_zero end,
    mfa = mod:fun(_, A, _)
%Argdist int: -100..100,
    interval: "5sec",
    mfa: Mod.fun(_, a, _)

Prerequisites:

  • histogram/spectrogram visualisation ( #107 )
  • extended query syntax ( #110 )

Consider a separate "GUI mode" for elixir

Right now to trace an Elixir function one needs to put 'Elixir.sth.sth':fun/1 into the search box. I'd say this is Elixir unfriendly. I don't know if it is hard, but maybe we could "fork" our GUI.
Whenever xprof detects that it is running in an Elixir env (I guess it is possible by looking at whether some Elixir apps are loaded), it could serve a GUI which is translated & optimised for Elixir.

Graph gui feature wishlist

The motivation is that sometimes there are extreme outlier samples which cause the max values to be far off from all the other lines, and the other lines to be scaled down to near zero. To get a better idea of the values of the other percentiles I have two suggestions:

  • show latest value for each line next to the labels (this way we see eg the value of mean being 0.04 ms although max is 10 ms)
  • selectively hide certain graphs eg clicking the label toggles show/hide for the given graph (this way eg max can be hidden and the whole graph can be scaled according to 99 %)

Either or both could be implemented on the js gui.

Add automatic GUI tests

Follow js best practices to add some GUI test which can be unit test or e2e test. Not too thorough but just in order to detect any regression.

Cannot autocomplete incomplete MFA for Elixir

If you search for Enum.member? you will receive auto-completion suggestions, however TAB completion does not work; it works only with the full name Elixir.Enum.member? (notice the missing Elixir. prefix in the first case).

New frontend

I will put here my updates and thoughts regarding work on the new frontend.

Explore callees

Add a way to show the list of functions called by a monitored function so the user can select one for monitoring. This supports a "zoom-in" style of profiling without the need to look at the source code.

  • BE: add erlang API: xprof_core:get_called_functions(mfa()) -> [mfa()] (a hedged sketch follows this list)
    • can work based on the documented syntax tree format (runtime dependency on the syntax_tools application); requires debug_info to be present in the beam files
    • can work based on the officially undocumented beam asm format (runtime dependency on the compiler application); always works
  • FE: add drop down list to each monitored function.
    • selecting one element should start monitoring (the easy way)
    • or should copy the selection into the query text box so that some match-spec can be added (more advanced but needs more clicks)

Long input in call tracer will kill the tab

Steps to reproduce:

  1. Start with make dev.
  2. Setup a trace on lists:seq/2.
  3. Invoke in the shell lists:seq(1,100000).
  4. Observe how the tab burns. 🔥

We need to think how to crop longer inputs.

Document REST API

As a starter it can be added to the cowboy handler module as edoc.
This is essential for frontend contributors.

Problematic dependency - hdr_histogram

Currently we're leveraging work done by hdr_histogram regarding statistics accumulation / storage, which brings a set of problems to the table:

  1. It is a NIF - and that causes a lot of trouble when building xprof on various platforms (too old GCC, too new GCC, etc.).
  2. The current hex.pm package is too old and not in sync with upstream - that's why we're using this fork https://github.com/afronski/hdr_histogram_erl and a different name on hex.pm (customized_hdr_histogram).

We need to improve that. Possible solutions:

  1. Getting rid of NIFs - Erlang / Elixir only library.
  2. Contributing build process fixes and ensuring the hex.pm package is updated regularly.

Create Erlang NIF tracer

  • Create extensible NIF tracer in C.
  • Implement filtering of function return values.

Optional:

  • Interface ERL_NIF_* calls with C++
  • Find a way to store start/stop time using low level storage.

Insert graphs in top-last order

When adding a second function for monitoring it is added currently at the bottom of the page. This can be misleading if only one graph fits on the screen of a user, and one might think that nothing happened (before scrolling down).
Modify behaviour to add the new graph as the topmost one.

Alternative histogram visualisation

Add option to show histogram (buckets) instead of percentiles on the graphs to better understand distribution.
This is not only useful for latency but even more for distribution of other values (eg argument).

The ideal visualisation would be something like a heatmap or spectrogram of sysdig.

Time should be on the x axis for consistency with the percentile graph.
Experiment whether it still makes sense to overlay total call count on left-y axis.
Unfortunately c3-js does not support heatmaps out of the box so this needs to be implemented manually based on d3. (or find another lib)

This is only the frontend story assuming that the backend sends data in appropriate json format.

Return-value matching

Allow to filter functions based on their return value (instead of arguments).

Apart from the XProf-flavoured match-spec, the query takes an additional parameter
RetMatchFun which matches the value returned by the function. The call latency
is only measured if there is a match.

RetMatchFun receives the value returned by the traced function as a single
argument and can return:

  • false: no match, don't measure
  • true: match, measure and capture the original return value.
  • {true, NewValue}: match, measure and capture NewValue instead of the original return value.

If RetMatchFun is arity-2 it could match on an exception (class and reason). Should an arity-1 RetMatchFun match on exceptions of the form {Class, Reason} too?
We can think of some shortcuts, eg. as a convenience a function_clause error could mean no match (you don't have to handle all return values, only the positive case, as in the second example below).

Query example (suggestion):

#retmatch matchfun: fun(List) -> lists:member(12, List) end, mfa: ...
#retmatch matchfun: fun(error, badarg) -> true end, mfa: ...

It would be useful if the graph showed both a total count (number of
return_trace messages) and a matched count (only when matchfun matches) to see
how often the return value matches.

How to specify a different port?

I want to run two different apps from the same codebase and here is what I get when I try:

Could not start application xprof: :xprof_app.start(:normal, []) returned an error: :eaddrinuse
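
Based on the configuration table above, a likely workaround is to give one of the apps a different xprof_gui port, for example in its sys.config (the port number is just an illustration):

[{xprof_gui, [{port, 7891}]}].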
