scylla-monitoring's Introduction

Scylla Monitoring Stack

The ScyllaDB Monitoring Stack is used to monitor ScyllaDB clusters. It is container-based and ships with Grafana dashboards and scripts that set up the containers.

Before using it, make sure Docker is installed and running. Check the documentation for installation and configuration details.

You can ask questions and discuss ScyllaDB Monitoring in the ScyllaDB community forum and on the Slack channel.

The Monitoring lesson on ScyllaDB University is another useful resource, and it includes a hands-on example.

scylla-monitoring's People

Contributors

amnonh, amoskong, annastuchlik, avikivity, benipeled, dependabot[bot], dgarcia360, duarten, fruch, gavinje, guy9, k7krishnar, lauranovich, madhurgames, mrlexor, mykaul, nirmaayan, pdziepak, ricardoborenstein, roydahan, sbhattiprolu, siculars, sneako, tgrabiec, tzach, ultrabug, vladzcloudius, wprzytula, yaronkaikov, zimnx

scylla-monitoring's Issues

Slow response to dead nodes

The move to native Prometheus in Scylla 1.4 (#79 ) made the dead node metric update only 5 min after a node is dead. Here is why:

New dead node expression (post #79)

count(up{job="scylla"}) - count(seastar_memory{metric="free",shard="0",type="total_operations"})

does not work as well as the old one (pre #79):

count(up) - count(collectd_processes_ps_code{processes="scylla"} > 0)

The new expression takes a long time to refresh (about 5 minutes), while the old one reacted immediately.
The reason is that Prometheus assumes a missing metric still has its last seen value, and it takes 5 minutes for it to recognize that the metric is gone. In the old version the metric was always present and only its value changed;
in the new version the metric itself disappears when a node goes down.
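
A sketch of an alternative that might react faster (an assumption, not part of the original issue): count targets whose scrape failed instead of counting missing series. Prometheus sets up to 0 on the first failed scrape, so no staleness window is involved:

count(up{job="scylla"}) - count(up{job="scylla"} == 1)

The result is 0 while all targets answer and rises by 1 per dead node at the next scrape.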

Grafana Capacity pie chart is not clear enough + Need for Capacity indication per node

The pie chart is not very clear, and some information is missing:

  1. What color represents the free/available capacity vs. the used capacity? (There is no legend.)
  2. Why does the tooltip show '{}: [value] (%)'? A more professional string is needed.
  3. As for the actual calculation, a single capacity pie chart is not enough. If it is a sum of the total capacity from all nodes, it does not reflect a situation where data is not balanced and some nodes store too much data; a per-node view would help the customer understand write errors when they occur.
    We need a per-node container view / pie chart to illustrate the capacity used on each node.
  4. From my experience in the storage world, percentages are nice to have, but users like to know the actual capacity numbers (in GB / TB):
  • Total capacity at hand
  • Used capacity
  • Free capacity (I assume it is not simply total minus used, as there are reservations for compaction and other background activities)

Provide Prometheus retention value

Without a retention value, Prometheus will fill up the disk.
Prometheus exposes this value via the -storage.local.retention parameter.
The appropriate value depends on the use case and disk size.
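
For example, a sketch against the Prometheus 1.x container used by start-all.sh (the 30-day value is only an illustration; note that arguments after the image name replace the image's default command, so -config.file has to be repeated):

sudo docker run -d -v $PWD/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:Z -p 9090:9090 --name aprom prom/prometheus:v1.0.0 -config.file=/etc/prometheus/prometheus.yml -storage.local.retention=720h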

Recommendations for production systems

It can be useful for production systems to have basic configuration guidelines before starting the installation of Scylla-Grafana-monitoring.

For example:

  1. In some environments it is impossible to install Docker; what other options are available?
  2. When used in production, define minimum storage requirements for the monitoring data (suggesting 500 GB minimum). For example, default EBS volumes come at 8 GB; a monitoring solution running for 3-4 days can consume the 8 GB and cause monitoring data loss.
  3. Answer whether it is possible to monitor multiple clusters from a single browser, or whether a separate scylla-grafana installation is mandatory for each cluster.

Reported rates may spuriously go down to zero when used with collectd_exporter

collectd_exporter will be polled by prometheus with period P. collectd_exporter will be updated through collectd by Scylla with period C. Assume that P is set to be the same as C.

It can happen that effective C will be slightly larger than P due to hiccups in the push path. If phases are right, the whole P period may be contained in the C period. Since collectd_exporter exports samples with the timestamp at poll time, it will return the same value of a counter but with different timestamps. In such case we will see the rate of 0 for that period, even though the actual rate is not zero.

Another effect at play here is that Scylla's collectd client will start its periodic timer after metrics are sent, which adds some delay, so the phase for C is moving. Periodically it will get close to the phase of P, increasing chances of seeing this effect.

If P is greater than C, rates will not go to zero, but they can still get distorted.

This wouldn't be a problem if collectd_exporter was exporting actual timestamps of samples. Currently values are exported with the timestamp at poll time.

This doesn't happen in a setup which polls through Scylla's built-in prometheus server.

Install section - need to "add IP" example

Under the Install section there is no example of what the IP format should look like:
['<server_ip>:'] — when there is more than one server, use a comma to separate the IPs.
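
A sketch of what such an example could look like in prometheus/prometheus.yml (the addresses are placeholders; 9103 is the collectd_exporter port used elsewhere in this document):

scrape_configs:
  - job_name: scylla
    honor_labels: true
    static_configs:
      - targets: ['192.168.0.1:9103', '192.168.0.2:9103', '192.168.0.3:9103']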

Unable to drive the metrics to the Grafana dashboard

Trying to monitor an on-premise setup of 3 servers.
CentOS 7.2, Kernel 3.18.4.
The servers are configured with dual networks: 12.9.31.x is the high-speed network connecting the Scylla servers, and
10.9.31.x is the slow SSH network, also used for the monitoring traffic.
Setup is using Scylla 1.3; collectd.conf and scylla.conf are attached.
Setting up the exporter looks successful, and reaching the data through the URL works.
However, when looking at the Grafana monitor no info is visible, even though the servers are running loads.
Is there anything missing here?
image

The exporter command on each server is:
./collectd_exporter -collectd.listen-address="0.0.0.0:65534" &

prometheus yml setting:
[root@localhost ~]# cat scylla-grafana-monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'scylla-monitor'

scrape_configs:
  - job_name: scylla
    honor_labels: true
    static_configs:
      - targets: ['10.9.31.182:9103', '10.9.31.183:9103', '10.9.31.184:9103']

Started the server with:
[root@localhost scylla-grafana-monitoring]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[root@localhost scylla-grafana-monitoring]# ./start-all.sh
4d156596f38e536e9ebc24986656f872c50a2541f2668014c69db50afc7b8739
10d3cb213368da71f7793dd962e14960efdb26a2ecc57845d56acb7c7175785d
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Set-Cookie: grafana_sess=2ce4fc7c24a6d04f; Path=/; HttpOnly
Date: Thu, 22 Sep 2016 19:03:41 GMT
Content-Length: 37

{"id":1,"message":"Datasource added"}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=58ce57e2e35dea80; Path=/; HttpOnly
Date: Thu, 22 Sep 2016 19:03:41 GMT
Content-Length: 64

{"slug":"scylla-cluster-metrics","status":"success","version":0}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=90d44045c2d53a01; Path=/; HttpOnly
Date: Thu, 22 Sep 2016 19:03:41 GMT
Content-Length: 67

{"slug":"scylla-per-server-metrics","status":"success","version":0}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=49db89a8e86b16c9; Path=/; HttpOnly
Date: Thu, 22 Sep 2016 19:03:41 GMT
Content-Length: 68

The data connection on Grafana:
image

collectd_conf.txt
scylla_conf.txt

The prometheus container doesn't start due to "permission denied" on /etc/prometheus/prometheus.yml

time="2016-08-30T12:06:28Z" level=info msg="Starting prometheus (version=1.0.0, branch=master, revision=e2bb136)" source="main.go:73" 
time="2016-08-30T12:06:28Z" level=info msg="Build context (go=go1.6.2, user=root@98d6f366491c, date=20160718-15:12:02)" source="main.go:74" 
time="2016-08-30T12:06:28Z" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:206" 
time="2016-08-30T12:06:28Z" level=error msg="Couldn't load configuration (-config.file=/etc/prometheus/prometheus.yml): open /etc/prometheus/prometheus.yml: permission denied" source="main.go:218" 

Use 1 second scrape period by default

The current period of 15 seconds averages out variations in utilization that happen on time scales shorter than 15 seconds. Using a shorter interval will give us more information about the system's behavior, and 1s should still be long enough not to put significant load on the system.
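
A sketch of the corresponding prometheus.yml change, assuming the global setting is the one to lower:

global:
  scrape_interval: 1s   # was 15s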

initialized: undefined failure, Default prometheus container address wrong?

Hi,
A fresh installation of a monitoring solution fails to start.
The error message received on the screen is:
image

Looking into the data sources, the IP address for the Prometheus datasource is set to 127.0.0.1:9090 with direct access.

Looking into another, working monitoring installation, it is 172.17.0.2:9090 with proxy access.

Going back to the freshly installed monitoring and changing the datasource IP to 172.17.0.2:9090 and the access to proxy solves the issue.

Options for resolution:

  1. Document the needed change to the datasource IP address.
  2. Ship the monitoring solution with the datasource IP defined during installation.
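
For option 1, the change can also be scripted against Grafana's datasource API instead of clicking through the UI; a rough sketch (credentials, datasource id and the container IP are assumptions for this example):

curl -XPUT -H 'Content-Type: application/json' http://admin:admin@localhost:3000/api/datasources/1 -d '{"id":1,"name":"prometheus","type":"prometheus","url":"http://172.17.0.2:9090","access":"proxy","isDefault":true}'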

collectd disk metrics are showing distorted rates

Example:

disk_stats

The actual rate is around 400 and doesn't vary as much in reality as it is shown on the graph.

The rates are calculated by taking the difference between consecutive samples of monotonic counters. Collectd emits samples with a 10-second period, while Prometheus scrapes them by default with a 15-second period. So half of the time the period between Prometheus samples will cover 1 collectd sample, and half of the time it will cover 2 samples. The former underestimates the rate and the latter overestimates it.

To avoid this, the sampling period used by Prometheus must be a multiple of the collectd period. We can set the scraping interval to 10s, or modify the collectd plugin to export stats with a 1s period. The latter approach is better because it gives better accuracy and works with any scrape period.
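
The first option is a small change in prometheus.yml, which can also be applied per job (sketch):

scrape_configs:
  - job_name: scylla
    scrape_interval: 10s   # a multiple of the 10s collectd dispatch period
    honor_labels: true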

Total Storage Reporting is not accurate

Using the GCE platform with 3 servers; each server has 3 NVMe drives of 375GB (a total of 1.1TB per server).
Each server looks like the following in terms of drives:
[eyal@sdb1bignvme ~]$ sudo fdisk -l

Disk /dev/sda: 10.7 GB, 10737418240 bytes, 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x00091e3d

Device Boot Start End Blocks Id System
/dev/sda1 * 2048 20971519 10484736 83 Linux

Disk /dev/nvme0n1: 402.7 GB, 402653184000 bytes, 98304000 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/nvme0n2: 402.7 GB, 402653184000 bytes, 98304000 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/nvme0n3: 402.7 GB, 402653184000 bytes, 98304000 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/md0: 1207.6 GB, 1207556898816 bytes, 294813696 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 1572864 bytes

The total reported storage space is shown in yellow, as if there is an issue with the total storage; see image.
image

Changing Query A to:
sum(node_filesystem_avail{mountpoint="/var/lib/scylla"})/1000000000

and Query B to:
(sum(node_filesystem_size{mountpoint="/var/lib/scylla"})-sum(node_filesystem_avail{mountpoint="/var/lib/scylla"}))/1000000000

provided the accurate view of the chart:
image

image
image

Freeze Docker image releases

Avoid new Grafana or Prometheus releases breaking the monitoring by pinning particular Docker image versions that are known to work.
For example:

docker run grafana/grafana:3.0.4
docker run prom/prometheus:0.19.2

All in one Docker image

The current install is complex: it runs two Docker images and uses the REST API to upload dashboards.
A simpler solution would be to have one Docker image with all three:
Prometheus
Grafana
Dashboards

Dynamically update Scylla targets

When adding / removing Scylla nodes from a cluster, one needs to manually update the Prometheus config.
It would be much better to use dynamic information, maybe using one of Prometheus' service discovery mechanisms or a new custom one.
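
One built-in mechanism that could fit is Prometheus' file-based service discovery; a sketch (file name and refresh interval are assumptions):

scrape_configs:
  - job_name: scylla
    honor_labels: true
    file_sd_configs:
      - files:
          - /etc/prometheus/scylla_servers.yml
        refresh_interval: 5m

The referenced file would hold entries like - targets: ['10.0.0.1:9103'] and is re-read without restarting Prometheus, so adding or removing a node becomes an edit to that file.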

Grafana graphs are jiggly with short scraping intervals

When I set scraping interval to 1s I see graphs "jiggle" in a way that historical data points change values. After looking closer at this it turns out that graphs alternate on refreshes between showing samples from only odd and only even timestamp values.

Setting "resolution" parameter for queries from 1/2 to 1/1 fixes the problem. In dashboard JSON the change amounts to:

           "targets": [
             {
               "expr": "avg(collectd_reactor_gauge{type=\"load\"} ) by (instance)",
-              "intervalFactor": 2,
+              "intervalFactor": 1,
               "refId": "A",
-              "step": 2
+              "step": 1
             }
           ],
           "timeFrom": null,

Easy installation and upgrade without Docker

Some users do not want to run Docker.
We should provide instructions for setting up the monitoring stack on top of an existing Grafana and Prometheus installation. It can be as easy as running the API part of the start-all.sh script, which uploads the dashboards.
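
The API part amounts to a couple of curl calls against an existing Grafana; a sketch mirroring what start-all.sh does (credentials, port and the dashboard file name are assumptions, and the JSON file must wrap the dashboard in the {"dashboard": ...} envelope that the Grafana API expects):

# add the Prometheus datasource
curl -XPOST -H 'Content-Type: application/json' http://admin:admin@localhost:3000/api/datasources -d '{"name":"prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy","isDefault":true}'
# upload a dashboard
curl -XPOST -H 'Content-Type: application/json' http://admin:admin@localhost:3000/api/dashboards/db --data-binary @grafana/scylla-dash.json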

per-server dashboard doesn't split the metrics if single collectd_exporter is used

I have a setup with a single collectd sink instance and therefore a single collectd_exporter. In such a configuration all samples fall under the same instance label - the collectd exporter IP. They do, however, have a distinct exported_instance label, which is set from the instance label as recorded by collectd. Perhaps we should modify the dashboard to group by that by default.

There is also something called relabelling in the Prometheus config which could maybe be used to set instance from exported_instance, but I had no success with it.
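
For reference, the relabelling variant would look roughly like this (a sketch, untested as noted above; the target address is a placeholder):

scrape_configs:
  - job_name: scylla
    static_configs:
      - targets: ['<collectd_exporter_ip>:9103']
    metric_relabel_configs:
      - source_labels: [exported_instance]
        target_label: instance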

Make monitoring great again :(

This is an on-prem installation of 3 Scylla servers with a monitoring solution on one of the clients, all connected to the same private network (10.9.31.xx).
I followed the instructions on the git page: installed a new setup on a server, installed Docker and cloned the repository. Setting up the monitoring tool is not foolproof and requires extensive work to get it working; please help ease the use of the system.

  1. Add alerts when one of the Docker containers (Grafana or Prometheus) is not up; that will save a lot of hassle. Also add a manual on how to look for the issue (journalctl -xe?) - see the sketch after this list.
  2. Verify how many servers are up and how many are down in a manner that makes sense; an N/A note on "dead" servers that are actually up (phantoms) is not helpful.
  3. Are there any iptables/firewall requirements to set up the connection between the containers/host and the Scylla servers?
  4. Add an example line on how to add servers to the prometheus.yml file; a single server at 127.0.0.1 does not explain how to add multiple servers to monitor.
  5. The exporter on the servers constantly crashes, and this is not reflected in the monitoring tool: the server is up, the exporter is down, and the server is considered dead or N/A.
  6. If the Docker daemon is not running, either start it from the start script or exit immediately; do not try to start the rest of the tools. Currently a string of continuous dots prints to the screen with no information on what is going on or what the tool is trying to do.
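
A minimal sketch of the kind of check item 1 asks for (container names taken from start-all.sh; everything else is an assumption):

for c in aprom agraf; do
    if ! sudo docker ps --format '{{.Names}}' | grep -qw "$c"; then
        echo "container $c is not running - check 'sudo docker logs $c' and 'journalctl -xe'"
    fi
done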

[root@localhost ~]# git clone https://github.com/scylladb/scylla-grafana-monitoring.git
Cloning into 'scylla-grafana-monitoring'...
remote: Counting objects: 280, done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 280 (delta 21), reused 0 (delta 0), pack-reused 236
Receiving objects: 100% (280/280), 64.42 KiB | 0 bytes/s, done.
Resolving deltas: 100% (140/140), done.
[root@localhost ~]# ls -ltr
total 12
-rw-------. 1 root root 1556 Sep 7 17:40 anaconda-ks.cfg
drwxr-xr-x. 2 root root 23 Sep 23 11:16 cassandra.logdir_IS_UNDEFINED
-rw-r--r--. 1 root root 3416 Sep 28 16:19 loadmlnx.yaml
drwxr-xr-x. 5 root root 4096 Oct 4 01:18 scylla-grafana-monitoring
[root@localhost ~]# service docker start
Redirecting to /bin/systemctl start docker.service
[root@localhost ~]# cd scylla-grafana-monitoring/
[root@localhost scylla-grafana-monitoring]# cd prometheus/
[root@localhost prometheus]# vi prometheus.yml
[root@localhost prometheus]# cd ../
[root@localhost scylla-grafana-monitoring]# ./start-all.sh
Unable to find image 'prom/prometheus:v1.0.0' locally
Trying to pull repository docker.io/prom/prometheus ...
v1.0.0: Pulling from docker.io/prom/prometheus
385e281300cc: Pull complete
a3ed95caeb02: Pull complete
e418e02f5f37: Pull complete
6c2c7730b5ef: Pull complete
bbc184d7f32a: Pull complete
17a6ebba0cea: Pull complete
d1b2d64d311e: Pull complete
356f67417ef1: Pull complete
Digest: sha256:13cca70de2522231af89f19fc246fad6bc594698ede40fc7712a74ce71f1068f
Status: Downloaded newer image for docker.io/prom/prometheus:v1.0.0
0c9ffbb5da10e333e2e702a4f1585c0ded7c0130efb6cf3584475aa8a5a09353
Unable to find image 'grafana/grafana:3.1.0' locally
Trying to pull repository docker.io/grafana/grafana ...
3.1.0: Pulling from docker.io/grafana/grafana
5c90d4a2d1a8: Pull complete
b1a9a0b6158e: Pull complete
acb23b0d58de: Pull complete
Digest: sha256:3476700a51ff136a507f9d09a6626964b6cfbc9352ed23e0063d8785d2b2c30f
Status: Downloaded newer image for docker.io/grafana/grafana:3.1.0
7b452d487663df60df543fe17c9e3a0396e01f8c6118d628d0c83f3025670d25
.HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Set-Cookie: grafana_sess=bea800eac5a7fac0; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 37

{"id":1,"message":"Datasource added"}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=cfe2c9b9168e0059; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 64

{"slug":"scylla-cluster-metrics","status":"success","version":0}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=405697f0dfdd1d35; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 67

{"slug":"scylla-per-server-metrics","status":"success","version":0}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=0220efe08badfc1e; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 68

{"slug":"scylla-per-server-disk-i-o","status":"success","version":0}[root@localhost scylla-grafana-monitoring]#

Added the servers I am trying to read from to the prometheus.yml file:
cat prometheus/prometheus.yml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'scylla-monitor'

scrape_configs:
  - job_name: scylla
    honor_labels: true
    static_configs:
      - targets: ['10.9.31.182:9103','10.9.31.183:9103','10.9.31.184:9103']

Going to the web browser, pointing to 10.9.31.186, where my monitor system is installed, no data appears:
image

Looking into the data sources on the grafana setup, I see:
image

Tried to verify the installation, getting:
image

Tried to point the IP address of the setup(10.9.31.186), got the same error message:
image

Well, it seems that the Prometheus server didn't come up.
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
70175d8a3609 grafana/grafana:3.1.0 "/run.sh" 11 seconds ago Up 9 seconds 0.0.0.0:3000->3000/tcp agraf

For some reason it didn't read the prometheus.yml file from the working directory.
Oct 04 01:37:04 localhost.localdomain avahi-daemon[13905]: Withdrawing workstation service for veth4b6c354.
Oct 04 01:37:04 localhost.localdomain NetworkManager[1388]: (veth4b6c354): failed to disable userspace IPv6LL address handling
Oct 04 01:37:04 localhost.localdomain kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth409be50: link becomes ready
Oct 04 01:37:04 localhost.localdomain kernel: docker0: port 1(veth409be50) entered forwarding state
Oct 04 01:37:04 localhost.localdomain kernel: docker0: port 1(veth409be50) entered forwarding state
Oct 04 01:37:04 localhost.localdomain NetworkManager[1388]: (veth409be50): link connected
Oct 04 01:37:04 localhost.localdomain NetworkManager[1388]: (docker0): link connected
Oct 04 01:37:04 localhost.localdomain sudo[31344]: root : TTY=pts/0 ; PWD=/root/scylla-grafana-monitoring ; USER=root ; COMMAND=/bin/docker run -d
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=info msg="Starting prometheus (version=1.0.0, branch=mas
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=info msg="Build context (go=go1.6.2, user=root@98d6f3664
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=info msg="Loading configuration file /etc/prometheus/pro
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=error msg="Couldn't load configuration (-config.file=/et
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T01:37:04.891507324-07:00" level=info msg="{Action=create, Username=root,
Oct 04 01:37:04 localhost.localdomain systemd[1]: Stopped docker container 20e621214a5c26105dfe5e076a0dd440aa8911ff44ff149920f63a6072a4788b.
-- Subject: Unit docker-20e621214a5c26105dfe5e076a0dd440aa8911ff44ff149920f63a6072a4788b.scope has finished shutting down

Changing the start script to force reading the yml file got the Prometheus container up.
From the start script:
if [ -z $DATA_DIR ]
then
sudo docker run -d -v /root/scylla-grafana-monitoring/prometheus/prometheus.yml -p 9090:9090 --name aprom prom/prometheus:v1.0.0
else
echo "Loading prometheus data from $DATA_DIR"
sudo docker run -d -v $DATA_DIR:/prometheus:Z -v $PWD/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:Z -p 9090:9090 --name aprom prom/prometheus:v1.0.0
fi
Now the source is active:
image

Still, having 3 servers in the yml file list,
cat prometheus/prometheus.yml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'scylla-monitor'

scrape_configs:
  - job_name: scylla
    honor_labels: true
    static_configs:
      - targets: ['10.9.31.182:9103','10.9.31.183:9103','10.9.31.184:9103']

The monitoring shows only one server :(
image

When trying to start the monitoring system again it gets halted.
For example:
e6b803cbcc89d11c20f128808eeea7a18192447a94367615592b4b48d0d1071c
79949fcb5d3f4208106f7c71a78ab168a811682238a2dd780d31f025e979da1e
............................ (and it goes on indefinitely); the system stays in this state for minutes until Ctrl-C.
This is the information from journalctl -xe. What are the containers trying to do?

Oct 04 02:17:00 localhost.localdomain oci-systemd-hook[14799]: systemdhook : Skipping as container command is /run.sh, not init or systemd
Oct 04 02:17:00 localhost.localdomain kernel: docker0: port 2(veth02da683) entered disabled state
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (vethcd2d3e9): failed to find device 17 'vethcd2d3e9' with udev
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (vethcd2d3e9): new Veth device (carrier: OFF, driver: 'veth', ifindex: 17)
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (veth02da683): link disconnected
Oct 04 02:17:00 localhost.localdomain kernel: docker0: port 2(veth02da683) entered disabled state
Oct 04 02:17:00 localhost.localdomain avahi-daemon[1289]: Withdrawing workstation service for vethcd2d3e9.
Oct 04 02:17:00 localhost.localdomain avahi-daemon[1289]: Withdrawing workstation service for veth02da683.
Oct 04 02:17:00 localhost.localdomain kernel: device veth02da683 left promiscuous mode
Oct 04 02:17:00 localhost.localdomain kernel: docker0: port 2(veth02da683) entered disabled state
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (vethcd2d3e9): failed to disable userspace IPv6LL address handling
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (docker0): bridge port veth02da683 was detached
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (veth02da683): released from master docker0
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (veth02da683): failed to disable userspace IPv6LL address handling
Oct 04 02:17:00 localhost.localdomain kernel: XFS (dm-5): Unmounting Filesystem
Oct 04 02:17:14 localhost.localdomain kernel: docker0: port 1(veth6c657e7) entered forwarding state

Need to add docker start command

After the install section,
we need to verify that Docker is running: ps aux | grep docker
If it is not, start Docker with:
$ sudo systemctl restart docker
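
A possible one-liner for the docs, assuming systemd manages Docker:

systemctl is-active --quiet docker || sudo systemctl start docker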

Mount Grafana dashboard directory

The current solution uploads dashboards using the REST API. This is error prone and requires opening the REST API, which might be a security issue.
A better solution would be to mount the Grafana dashboard directory, as is done with the Prometheus yaml file.
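
A sketch of what this could look like, assuming Grafana's dashboards.json feature is available in the Grafana version used (paths and environment variables would need to be verified):

sudo docker run -d -p 3000:3000 -e GF_DASHBOARDS_JSON_ENABLED=true -e GF_DASHBOARDS_JSON_PATH=/var/lib/grafana/dashboards -v $PWD/grafana:/var/lib/grafana/dashboards:Z --name agraf grafana/grafana:3.1.0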

Does Prometheus have a phantom data directory?

The system I use has 3 servers. I see all of them working and active through nodetool, yet the dashboard still reports a dead node, and the information in the metrics is way off.
I can't get rid of phantom data that keeps creeping into the system.
After deleting the Docker volumes, reinstalling Docker and reinstalling the Grafana tool, some phantom data keeps coming back; I also cleared the browser cache and killed the collectd exporters on the servers.
Is there a way to clear all the historical data the tool has accumulated in the past?
./kill-all.sh only removes the currently running Docker containers; it does not clear the information.
image
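
One approach that might clear everything (an assumption based on how start-all.sh mounts data, not a verified procedure): remove the containers together with their anonymous volumes, and wipe the external data directory if one was passed via DATA_DIR:

sudo docker stop aprom agraf
sudo docker rm -v aprom agraf    # -v also removes the containers' anonymous data volumes
# if Prometheus data was kept outside the container:
# sudo rm -rf "$DATA_DIR"/*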
