tarantool / metrics
Metric collection library for Tarantool
License: MIT License
I've found some sort implementations written in pure Lua: https://github.com/DarkRoku12/lua_sort (you can also take the tests from there).
Could you check them? Some of them may be a bit faster.
It would also be great if you shared benchmarks showing that the current approach is better than the C-written one.
(Leaving aside that dropping the C part and the gcc requirement is obviously a win in itself.)
Originally posted by @olegrok in #112 (comment)
The HTTP latency collector starts working only after the first processed request; before that it reports null rps, not 0 rps.
metrics/metrics/http_middleware.lua
Line 20 in 23e5674
We need to add an in-depth description of the default metrics, because the link to the deprecated stat repo leads to outdated info with no descriptions.
To answer the question "What do the default metrics consist of?", one currently has to start a Tarantool instance, collect the default metrics (to get their list), and then search the net (for example, https://www.tarantool.io/ru/doc/1.10/reference/reference_lua/fiber/#fiber-info and https://www.tarantool.io/en/doc/1.10/reference/reference_lua/box_slab/) to find out what each metric means.
It would be great to add:
* the number of requests in queues (iproto -> tx, tx -> iproto);
* utilization of the readahead buffer;
* the number of requests executing simultaneously.
Sometimes we need to add a label (e.g. the instance alias) to every metric we collect. It is inconvenient to pass or change it (or several of them) on every create and update call. This problem can be solved by setting up a global_labels table for metrics, which is appended to the labels of collected metrics.
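A minimal sketch of the global_labels idea in plain Lua. The function names set_global_labels and with_global_labels are hypothetical and are not part of the metrics API; the sketch only shows merging a global table into an observation's label_pairs without mutating the original table.

```lua
-- Hypothetical sketch: these names are illustrative, not the metrics API.
local global_labels = {}

local function set_global_labels(labels)
    global_labels = labels or {}
end

-- Build a new table: global labels first, then the observation's own
-- labels, so a per-metric label wins if both define the same key.
local function with_global_labels(label_pairs)
    local merged = {}
    for k, v in pairs(global_labels) do merged[k] = v end
    for k, v in pairs(label_pairs or {}) do merged[k] = v end
    return merged
end

set_global_labels({alias = 'tnt-router'})
local own = {path = '/labels', method = 'GET'}
local labels = with_global_labels(own)
```

Because the merge produces a fresh table, the stored label_pairs stay untouched and the global labels can be changed at any time between collections.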
Possible solutions:
* Add the global labels to the label_pairs field of each metric on its creation/update. It is a driver-independent and straightforward solution, but it requires some extra memory and a table copy on each metric update. It will also be harder to change global labels along the way.
* Append the global labels to label_pairs on output. It is simpler to code, because it doesn't change the inner logic, and it requires fewer memory operations and less storage. It will also be easier to set them along the way. On the other hand, it is driver-dependent.
* Append the global labels to label_pairs on the Shared:collect(...) method call. It has the positive aspects of both previous solutions (driver-independent, easy to set along the way, no extra memory or memory operations, no changes to the inner update logic) and doesn't inherit any significant disadvantages.
I enabled default metrics for Prometheus; however, the documentation for these metrics is poor and incomplete.
I suggest deleting these metrics.
There is no description of what it is or how it works:
Average
Can be used only as a collector for HTTP statistics (described below) and cannot be built explicitly.
metrics == 0.5.0
unix/:/var/run/tarantool/app.control> package.reload()
---
- error: '/.rocks/share/tarantool/metrics/quantile.lua:10: cannot change a protected
metatable'
...
{"label_pairs":{"some_label":"label"},"timestamp":1598451366672309,"metric_name":"name","value":"1605461ULL"}
{"label_pairs":{"name":"space_name","engine":"memtx"},"timestamp":1598462586194806,"metric_name":"tnt_space_total_bsize","value":"0ULL"}
Because of this, we cannot measure time in nanoseconds.
I suppose the reason is here:
metrics/metrics/plugins/json/init.lua
Lines 5 to 17 in f567492
getrusage() allows us to get our own resource usage. It would be nice to add it to metrics.
The average collector resets its count on each collect, so if we have two metric collectors, part of the observation data will go to the first one and the rest to the second one.
It is not clear from the example that, for this to work correctly, you need ONLY ONE collector for the router, preferably the default collector.
Otherwise you get inconsistent metric names and a lot of metrics, like:
{
"label_pairs": {
"path": "/labels",
"method": "GET",
"status": 200,
"alias": "tnt-router"
},
"timestamp": 1600351181350306,
"metric_name": "labels_latency_avg",
"value": 0
},
instead of:
{
"label_pairs": {
"path": "/labels",
"method": "GET",
"status": 200,
"alias": "tnt-router"
},
"timestamp": 1600352083432399,
"metric_name": "http_server_request_latency_avg",
"value": 0
},
So when registering routes for endpoints, everything must use the same collector. This can be done using default collector. Something along the lines of:
local http_middleware = require('cartridge').service_get('metrics').http_middleware
if http_collector == nil then
http_collector = http_middleware.build_default_collector('average')
end
server:route({ path = "/labels", method = "ANY"}, http_middleware.v1(handler, http_collector))
Currently if you do this:
local counter = metrics.counter('foobar')
Then it creates a new counter object every time. This means that users have to store counter objects carefully.
What I propose is to make such calls idempotent: if the parameters (like histogram buckets, etc.) don't change, then just return the existing counter.
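A minimal sketch of the proposed find-or-create behaviour in plain Lua. The registry table and the Counter class here are hypothetical stand-ins, not the library's implementation, and the sketch keys only on the metric name; comparing other parameters such as histogram buckets is left out.

```lua
-- Hypothetical registry keyed by metric name (illustrative only).
local registry = {}

local Counter = {}
Counter.__index = Counter

function Counter:inc(n)
    self.value = self.value + (n or 1)
end

-- Return the already registered counter if one exists, else create it.
local function counter(name, help)
    local existing = registry[name]
    if existing ~= nil then
        return existing
    end
    local obj = setmetatable({name = name, help = help, value = 0}, Counter)
    registry[name] = obj
    return obj
end

local a = counter('foobar')
local b = counter('foobar')
a:inc()
b:inc(2)
```

With this shape, callers no longer need to keep the counter object around: calling counter('foobar') again yields the same object and the same accumulated value.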
Please add metrics for downstream and upstream replication status. Currently there are only lag and lsn, which is not enough to detect that a replica has broken down.
The only types supported by Prometheus are Gauge, Counter, Histogram and Summary (docs):
Using Graphite 1.1.7 (docker image graphiteapp/graphite-statsd), I found that Graphite ignores some default metrics from the module because of the ULL suffix on their values.
Log sample:
dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=transfers;engine=memtx 11511360ULL 1594291594]
dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=kv;engine=memtx 49379ULL 1594291594]
dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=attempts_count;engine=memtx 98850ULL 1594291594]
The same behavior was noticed with the metrics tnt_space_bsize and tnt_cfg_current_time.
We should test different sorting solutions for FFI double arrays to avoid using a dynamic lib with a comparator function. Note that performance must be the highest priority.
Currently the list of files to be installed is hardcoded in both the rockspec and the rpm spec. This should not happen. Instead, the package should be installed with 'make install'.
Please port the build script from Makefile to CMake, and implement packaging for Debian (currently absent).
Currently we have metrics.connect() in the public API, which creates a worker doing periodic exports to metrics.server.
We should move it under the metrics/plugins folder.
It would be nice to add some tests for tarantoolctl rocks install ... and luarocks install ..., and to run the tests against the installed rock, since #101 adds a dynamic lib to the package.
Example:
tnt_space_bsize{name="sequence",engine="memtx"} +Inf
tnt_space_bsize{name="jobs",engine="memtx"} +Inf
tnt_space_bsize{name="repair_queue",engine="memtx"} +Inf
tnt_space_bsize{name="audit_log_repair",engine="memtx"} +Inf
tnt_space_bsize{name="command_list",engine="memtx"} +Inf
tnt_space_bsize{name="test",engine="memtx"} +Inf
But in fact these spaces are empty.
Since metrics now combines the power of the above-mentioned modules, it is safe to add a deprecation note to each of them.
Some values from enable_default_metrics() may have the ULL suffix for long numbers, and Prometheus doesn't recognize it.
Quick solution:
local ret = prometheus.collect_http(req)
ret.body = ret.body:gsub("ULL", "")
I propose adding validation of the metrics output with promtool to CI, to make sure that Prometheus accepts it.
This should be refactored; it is pretty dangerous to use test-only switches in production builds.
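The quick solution above removes every "ULL" substring, including one inside a label value. A slightly narrower sketch in plain Lua is shown below; strip_int64_suffix is a hypothetical helper name, and the pattern assumes the value is a whitespace-separated integer, as in the text exposition format.

```lua
-- Hypothetical helper: strip a ULL/LL suffix only when it directly follows
-- a whitespace-separated integer, leaving names and label values intact.
local function strip_int64_suffix(body)
    -- %f[^%w] is a frontier: the suffix must be followed by a
    -- non-alphanumeric character or by the end of the string.
    return (body:gsub('(%s%-?%d+)U?LL%f[^%w]', '%1'))
end

local sample = 'tnt_space_total_bsize{name="FULL",engine="memtx"} 11511360ULL\n'
local cleaned = strip_int64_suffix(sample)
```

The label value "FULL" is left alone because the pattern requires digits immediately before the suffix.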
Originally posted by @vasiliy-t in #69
Most latency-related functions, documented with ldoc (I suppose), describe their input parameters but lack a description of their return values.
https://github.com/tarantool/metrics/blob/master/metrics/collectors/shared.lua#L93
https://github.com/tarantool/metrics/blob/master/metrics/http_middleware.lua#L63
In the current version the vclock metric appears as info_vclock_1, info_vclock_2, etc. I think that is not convenient. I suggest making the vclock component number a tag.
Default metrics return the metric tnt_cfg_listen with a string value, which is unsupported by Prometheus.
local metrics = require('metrics')
local http_router = require('http.router')
local http_server = require('http.server')
local http_handler = require('metrics.plugins.prometheus').collect_http
box.cfg{
listen = '0.0.0.0:3301',
}
metrics.enable_default_metrics()
local httpd = http_server.new('0.0.0.0', 8088, {log_requests = true})
local router = http_router.new():route({path = '/metrics'}, http_handler)
httpd:set_router(router)
httpd:start()
# HELP tnt_cfg_listen Tarantool port
# TYPE tnt_cfg_listen gauge
tnt_cfg_listen 0.0.0.0:3301
level=warn ts=2020-01-24T14:45:02.109Z caller=scrape.go:930 component="scrape manager" scrape_pool=tarantool target=http://test5.tarantool.e:8088/metrics msg="append failed" err="strconv.ParseFloat: parsing \"0.0.0.0:3301\": invalid syntax"
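One possible guard, sketched in plain Lua, is to drop observations whose value is not numeric before they reach the Prometheus output. The function name numeric_only and the observation shape ({metric_name = ..., value = ...}) are assumptions modelled on the JSON samples in this document, not the plugin's actual code.

```lua
-- Hypothetical filter: keep only observations Prometheus can parse.
-- 'cdata' is accepted because 64-bit integers are cdata under LuaJIT;
-- in plain Lua type() never returns it, so the check is harmless there.
local function numeric_only(observations)
    local result = {}
    for _, obs in ipairs(observations) do
        local t = type(obs.value)
        if t == 'number' or t == 'cdata' then
            table.insert(result, obs)
        end
    end
    return result
end

local filtered = numeric_only({
    {metric_name = 'tnt_cfg_listen', value = '0.0.0.0:3301'},
    {metric_name = 'tnt_info_uptime', value = 42},
})
```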
If I enable this role in my init.lua I agree that my metrics will be enabled on ALL instances. It could be really strange to enable metrics per replicaset.
So I propose:
According to tarantool/doc#1328
I propose the following structure in docs:
User's Guide
Prometheus has some problems with the suffixes '_count' and '_total' on non-summary and non-histogram metrics.
I think it makes sense to move metrics/default_metrics/tarantool/utils.lua to a different package, because it may be used in other packages.
Originally posted by @oleggator in #69
Graphite version: 1.1.7.
The user states that Graphite accepts time only in seconds, but we send microseconds. As a result, the graphs do not work.
User suggestion: replace
ts = tostring(fiber.time64()):sub(1, -4) -- Delete ULL suffix
with
ts = tostring(fiber.time64()):sub(1, -10) -- Delete ULL suffix and 6 digits
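The suggested slice relies on the string always ending in "ULL". A slightly more defensive sketch in plain Lua is shown below; usec_to_sec is a hypothetical helper that takes the already stringified microsecond timestamp, so it makes no assumption about the Tarantool API beyond the string format shown above.

```lua
-- Hypothetical helper: convert a microsecond timestamp string such as
-- "1594291594123456ULL" into a seconds string for Graphite.
local function usec_to_sec(ts)
    -- Accept plain digits with an optional integer suffix.
    local digits = tostring(ts):match('^(%d+)U?L?L?$')
    if digits == nil or #digits <= 6 then
        return nil -- not a microsecond timestamp
    end
    return digits:sub(1, -7) -- drop the last 6 digits (microseconds)
end

local seconds = usec_to_sec('1594291594123456ULL')
```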
Using tarantool: 2.3.2-1-g9be641b and Metrics library: metrics == 0.1.8
curl 10.3.151.235:8081/metrics
Unhandled error: ...e/tarantool/metrics/default_metrics/tarantool/spaces.lua:41: variable 'include_vinyl_count' is not declared
stack traceback:
/opt/tarantool/.rocks/share/tarantool/http/server.lua:743: in function 'process_client'
/opt/tarantool/.rocks/share/tarantool/http/server.lua:1199: in function </opt/tarantool/.rocks/share/tarantool/http/server.lua:1198>
[C]: in function 'pcall'
builtin/socket.lua:1073: in function <builtin/socket.lua:1071>
The metrics endpoint is initialised on the cartridge HTTP server with this code:
local httpd = cartridge.service_get('httpd')
if httpd == nil then
error('failed to get cartridge httpd service for prometheus')
end
metrics.enable_default_metrics()
httpd:route({
path = '/metrics',
method = 'GET',
public = true,
}, prometheus.collect_http)
The current version contains two kinds of metrics: tnt_stats_op_*_total and tnt_stats_op_*_rps, where the asterisk is an operation name. I suggest making it a tag.
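The renaming could be sketched in plain Lua as follows. split_op_metric is a hypothetical helper, and the "operation" label name is an assumption; the issue only asks for the operation to become a tag.

```lua
-- Hypothetical helper: turn "tnt_stats_op_<operation>_<kind>" into a
-- fixed metric name plus a label carrying the operation.
local function split_op_metric(name)
    local op, kind = name:match('^tnt_stats_op_(.+)_(%a+)$')
    if op == nil then
        return name, {} -- not an operation metric: leave it unchanged
    end
    return 'tnt_stats_op_' .. kind, {operation = op}
end

local metric_name, label_pairs = split_op_metric('tnt_stats_op_select_total')
```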
It's necessary to add documentation and examples for summary collector in README and Getting Started docs.
We could collect such metrics as:
* issues_count of type gauge. The value is the number of cluster issues this instance knows about. This should be good enough for basic alerting: a healthy cluster reports 0 issues. Closed in #243 and #144.
* tnt_info_uptime
* tnt_read_only, but it needs design too
metrics/cartridge/roles/metrics.lua
Line 6 in a780a21
According to: tarantool/cartridge#873
proposed configuration format:
metrics:
  export:
    - path: "/metrics/json"
      format: "json"
    - path: "/metrics/prom"
      format: "prometheus"
where
* metrics is a top level section name;
* export is the exporter configuration, e.g. [1] is a way to enable json metrics via the http endpoint /metrics/json.
There are no vshard metrics collected at all.
In the current version the name of an index is part of the metric name. I suggest making it a tag.
Currently it's not clear how to create counter objects and set their values. Please add this to the README.