scylla-monitoring's Introduction

Scylla Monitoring Stack

The ScyllaDB Monitoring Stack is used to monitor ScyllaDB clusters. It is container-based and ships with Grafana dashboards and scripts that set up the containers.

Before using it, make sure Docker is installed and running. Check the documentation for installation and configuration details.

You can ask questions and discuss ScyllaDB Monitoring in the ScyllaDB community forum and on the Slack channel.

The Monitoring lesson on ScyllaDB University is another useful resource, and it includes a hands-on example.

scylla-monitoring's People

Contributors

amnonh, amoskong, annastuchlik, avikivity, benipeled, dependabot[bot], dgarcia360, duarten, fruch, gavinje, guy9, k7krishnar, lauranovich, madhurgames, mrlexor, mykaul, nirmaayan, pdziepak, ricardoborenstein, roydahan, sbhattiprolu, siculars, sneako, tgrabiec, tzach, ultrabug, vladzcloudius, wprzytula, yaronkaikov, zimnx

scylla-monitoring's Issues

Slow response to dead nodes

The move to native Prometheus in Scylla 1.4 (#79 ) made the dead node metric update only 5 min after a node is dead. Here is why:

New dead node expression (post #79)

count(up{job="scylla"}) - count(seastar_memory{metric="free",shard="0",type="total_operations"})

does not work as well as the old one (pre #79):

count(up) - count(collectd_processes_ps_code{processes="scylla"} > 0)

The new expression takes a long time to refresh (about 5 minutes), while the old one reacted immediately.
The reason is that Prometheus assumes a missing metric still has its last seen value, and it takes 5 minutes for it to recognize that the metric is gone. In the old version the metric was always present and only its value changed;
in the new version the metric itself disappears when a node goes down.
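
A sketch of an alternative that might react faster (an assumption, not part of the original issue): count targets whose scrape failed instead of counting missing series. Prometheus sets up to 0 on the first failed scrape, so no staleness window is involved:

count(up{job="scylla"}) - count(up{job="scylla"} == 1)

The result is 0 while all targets answer and rises by 1 per dead node at the next scrape.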

Grafana Capacity pie chart is not clear enough + Need for Capacity indication per node

The pie chart is not very clear, and some information is missing:

  1. What color represents the free/available capacity vs. the used capacity? (There is no legend.)
  2. Why does the tooltip show '{}: [value] (%)'? A more professional string is needed.
  3. As for the actual calculation, a single capacity pie chart is not enough. If it is a sum of the total capacity from all nodes, it does not reflect a situation where data is not balanced and some nodes store too much data; a per-node view would help the customer understand write errors when they occur.
    We need a per-node container view / pie chart to illustrate the capacity used on each node.
  4. From my experience in the storage world, percentages are nice to have, but users like to know the actual capacity numbers (in GB / TB):
  • Total capacity at hand
  • Used capacity
  • Free capacity (I assume it is not simply total minus used, as there are reservations for compaction and other background activities)

Provide Prometheus retention value

Without a retention value, Prometheus will fill up the disk.
Prometheus exposes this value via the -storage.local.retention parameter.
The appropriate value depends on the use case and disk size.
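
For example, a sketch against the Prometheus 1.x container used by start-all.sh (the 30-day value is only an illustration; note that arguments after the image name replace the image's default command, so -config.file has to be repeated):

sudo docker run -d -v $PWD/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:Z -p 9090:9090 --name aprom prom/prometheus:v1.0.0 -config.file=/etc/prometheus/prometheus.yml -storage.local.retention=720h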

Recommendations for production systems

It can be useful for production systems to have basic configuration guidelines before starting the installation of Scylla-Grafana-monitoring.

For example:

  1. In some environments it is impossible to install Docker; what other options are available?
  2. When used in production, define minimum storage requirements for the monitoring data (suggesting 500 GB minimum). For example, default EBS volumes come at 8 GB; a monitoring solution running for 3-4 days can consume the 8 GB and cause monitoring data loss.
  3. Answer whether it is possible to monitor multiple clusters from a single browser, or whether a separate scylla-grafana installation is mandatory for each cluster.

Reported rates may spuriously go down to zero when used with collectd_exporter

collectd_exporter will be polled by prometheus with period P. collectd_exporter will be updated through collectd by Scylla with period C. Assume that P is set to be the same as C.

It can happen that effective C will be slightly larger than P due to hiccups in the push path. If phases are right, the whole P period may be contained in the C period. Since collectd_exporter exports samples with the timestamp at poll time, it will return the same value of a counter but with different timestamps. In such case we will see the rate of 0 for that period, even though the actual rate is not zero.

Another effect at play here is that Scylla's collectd client will start its periodic timer after metrics are sent, which adds some delay, so the phase for C is moving. Periodically it will get close to the phase of P, increasing chances of seeing this effect.

If P is greater than C, rates will not go to zero, but they can still get distorted.

This wouldn't be a problem if collectd_exporter was exporting actual timestamps of samples. Currently values are exported with the timestamp at poll time.

This doesn't happen in a setup which polls through Scylla's built-in prometheus server.

Install section - need to "add IP" example

Under the Install section there is no example of what the IP format should look like:
['<server_ip>:'] — when there is more than one server, use a comma to separate the IPs.
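
A sketch of what such an example could look like in prometheus/prometheus.yml (the addresses are placeholders; 9103 is the collectd_exporter port used elsewhere in this document):

scrape_configs:
  - job_name: scylla
    honor_labels: true
    static_configs:
      - targets: ['192.168.0.1:9103', '192.168.0.2:9103', '192.168.0.3:9103']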

Unable to drive the metrics to the Grafana dashboard

Trying to monitor an on-premise setup of 3 servers.
CentOS 7.2, Kernel 3.18.4.
The servers are configured with dual networks: 12.9.31.x is the high-speed network connecting the Scylla servers, and
10.9.31.x is the slow SSH network, also used for the monitoring traffic.
Setup is using Scylla 1.3; collectd.conf and scylla.conf are attached.
Setting up the exporter looks successful, and reaching the data through the URL works.
However, when looking at the Grafana monitor no info is visible, even though the servers are running loads.
Is there anything missing here?
image

The exporter command on each server is:
./collectd_exporter -collectd.listen-address="0.0.0.0:65534" &

prometheus yml setting:
[root@localhost ~]# cat scylla-grafana-monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'scylla-monitor'

scrape_configs:
  - job_name: scylla
    honor_labels: true
    static_configs:
      - targets: ['10.9.31.182:9103', '10.9.31.183:9103', '10.9.31.184:9103']

Started the server with:
[root@localhost scylla-grafana-monitoring]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[root@localhost scylla-grafana-monitoring]# ./start-all.sh
4d156596f38e536e9ebc24986656f872c50a2541f2668014c69db50afc7b8739
10d3cb213368da71f7793dd962e14960efdb26a2ecc57845d56acb7c7175785d
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Set-Cookie: grafana_sess=2ce4fc7c24a6d04f; Path=/; HttpOnly
Date: Thu, 22 Sep 2016 19:03:41 GMT
Content-Length: 37

{"id":1,"message":"Datasource added"}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=58ce57e2e35dea80; Path=/; HttpOnly
Date: Thu, 22 Sep 2016 19:03:41 GMT
Content-Length: 64

{"slug":"scylla-cluster-metrics","status":"success","version":0}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=90d44045c2d53a01; Path=/; HttpOnly
Date: Thu, 22 Sep 2016 19:03:41 GMT
Content-Length: 67

{"slug":"scylla-per-server-metrics","status":"success","version":0}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=49db89a8e86b16c9; Path=/; HttpOnly
Date: Thu, 22 Sep 2016 19:03:41 GMT
Content-Length: 68

The data connection on Grafana:
image

collectd_conf.txt
scylla_conf.txt

The prometheus container doesn't start due to "permission denied" on /etc/prometheus/prometheus.yml

time="2016-08-30T12:06:28Z" level=info msg="Starting prometheus (version=1.0.0, branch=master, revision=e2bb136)" source="main.go:73" 
time="2016-08-30T12:06:28Z" level=info msg="Build context (go=go1.6.2, user=root@98d6f366491c, date=20160718-15:12:02)" source="main.go:74" 
time="2016-08-30T12:06:28Z" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:206" 
time="2016-08-30T12:06:28Z" level=error msg="Couldn't load configuration (-config.file=/etc/prometheus/prometheus.yml): open /etc/prometheus/prometheus.yml: permission denied" source="main.go:218" 

Use 1 second scrape period by default

The current period of 15 seconds averages out variations in utilization that happen on time scales shorter than 15 seconds. Using a shorter interval will give us more information about the system's behavior, and 1s should still be long enough not to put significant load on the system.
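
A sketch of the corresponding prometheus.yml change, assuming the global setting is the one to lower:

global:
  scrape_interval: 1s   # was 15s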

initialized: undefined failure, Default prometheus container address wrong?

Hi,
A fresh installation of a monitoring solution fails to start.
The error message received on the screen is:
image

Looking into the data sources, the IP address for the Prometheus datasource is set to 127.0.0.1:9090 with direct access.

Looking into another, working monitoring installation, it is 172.17.0.2:9090 with proxy access.

Going back to the freshly installed monitoring and changing the datasource IP to 172.17.0.2:9090 and the access to proxy solves the issue.

Options for resolution:

  1. Document the needed change to the datasource IP address.
  2. Ship the monitoring solution with the datasource IP defined during installation.
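
For option 1, the change can also be scripted against Grafana's datasource API instead of clicking through the UI; a rough sketch (credentials, datasource id and the container IP are assumptions for this example):

curl -XPUT -H 'Content-Type: application/json' http://admin:admin@localhost:3000/api/datasources/1 -d '{"id":1,"name":"prometheus","type":"prometheus","url":"http://172.17.0.2:9090","access":"proxy","isDefault":true}'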

collectd disk metrics are showing distorted rates

Example:

disk_stats

The actual rate is around 400 and doesn't vary as much in reality as it is shown on the graph.

The rates are calculated by taking the difference between consecutive samples of monotonic counters. Collectd emits samples with a 10-second period, while Prometheus scrapes them by default with a 15-second period. So half of the time the period between Prometheus samples will cover 1 collectd sample, and half of the time it will cover 2 samples. The former underestimates the rate and the latter overestimates it.

To avoid this, the sampling period used by Prometheus must be a multiple of the collectd period. We can set the scraping interval to 10s, or modify the collectd plugin to export stats with a 1s period. The latter approach is better because it gives better accuracy and works with any scrape period.
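
The first option is a small change in prometheus.yml, which can also be applied per job (sketch):

scrape_configs:
  - job_name: scylla
    scrape_interval: 10s   # a multiple of the 10s collectd dispatch period
    honor_labels: true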

Total Storage Reporting is not accurate

Using the GCE platform with 3 servers; each server has 3 NVMe drives of 375GB (a total of 1.1TB per server).
Each server looks like the following in terms of drives:
[eyal@sdb1bignvme ~]$ sudo fdisk -l

Disk /dev/sda: 10.7 GB, 10737418240 bytes, 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: dos
Disk identifier: 0x00091e3d

Device Boot Start End Blocks Id System
/dev/sda1 * 2048 20971519 10484736 83 Linux

Disk /dev/nvme0n1: 402.7 GB, 402653184000 bytes, 98304000 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/nvme0n2: 402.7 GB, 402653184000 bytes, 98304000 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/nvme0n3: 402.7 GB, 402653184000 bytes, 98304000 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/md0: 1207.6 GB, 1207556898816 bytes, 294813696 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 1572864 bytes

The total reported storage space is shown in yellow, as if there is an issue with the total storage; see image.
image

Changing Query A to:
sum(node_filesystem_avail{mountpoint="/var/lib/scylla"})/1000000000

and Query B to:
(sum(node_filesystem_size{mountpoint="/var/lib/scylla"})-sum(node_filesystem_avail{mountpoint="/var/lib/scylla"}))/1000000000

provided the accurate view of the chart:
image

image
image

Freeze Docker image releases

Avoid new Grafana or Prometheus releases breaking the monitoring by pinning particular Docker image versions that are known to work.
For example:

docker run grafana/grafana:3.0.4
docker run prom/prometheus:0.19.2

All in one Docker image

The current install is complex: it runs two Docker images and uses the REST API to upload dashboards.
A simpler solution would be to have one Docker image with all three:
Prometheus
Grafana
Dashboards

Dynamically update Scylla targets

When adding / removing Scylla nodes from a cluster, one needs to manually update the Prometheus config.
It would be much better to use dynamic information, maybe using one of Prometheus' service discovery mechanisms or a new custom one.
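
One built-in mechanism that could fit is Prometheus' file-based service discovery; a sketch (file name and refresh interval are assumptions):

scrape_configs:
  - job_name: scylla
    honor_labels: true
    file_sd_configs:
      - files:
          - /etc/prometheus/scylla_servers.yml
        refresh_interval: 5m

The referenced file would hold entries like - targets: ['10.0.0.1:9103'] and is re-read without restarting Prometheus, so adding or removing a node becomes an edit to that file.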

Grafana graphs are jiggly with short scraping intervals

When I set scraping interval to 1s I see graphs "jiggle" in a way that historical data points change values. After looking closer at this it turns out that graphs alternate on refreshes between showing samples from only odd and only even timestamp values.

Setting "resolution" parameter for queries from 1/2 to 1/1 fixes the problem. In dashboard JSON the change amounts to:

           "targets": [
             {
               "expr": "avg(collectd_reactor_gauge{type=\"load\"} ) by (instance)",
-              "intervalFactor": 2,
+              "intervalFactor": 1,
               "refId": "A",
-              "step": 2
+              "step": 1
             }
           ],
           "timeFrom": null,

Easy installation and upgrade without Docker

Some users do not want to run Docker.
We should provide instructions for setting up the monitoring stack on top of an existing Grafana and Prometheus installation. It can be as easy as running the API part of the start-all.sh script, which uploads the dashboards.
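
The API part amounts to a couple of curl calls against an existing Grafana; a sketch mirroring what start-all.sh does (credentials, port and the dashboard file name are assumptions, and the JSON file must wrap the dashboard in the {"dashboard": ...} envelope that the Grafana API expects):

# add the Prometheus datasource
curl -XPOST -H 'Content-Type: application/json' http://admin:admin@localhost:3000/api/datasources -d '{"name":"prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy","isDefault":true}'
# upload a dashboard
curl -XPOST -H 'Content-Type: application/json' http://admin:admin@localhost:3000/api/dashboards/db --data-binary @grafana/scylla-dash.json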

per-server dashboard doesn't split the metrics if single collectd_exporter is used

I have a setup with a single collectd sink instance and therefore a single collectd_exporter. In such a configuration all samples fall under the same instance label - the collectd exporter IP. They do, however, have a distinct exported_instance label, which is set from the instance label as recorded by collectd. Perhaps we should modify the dashboard to group by that by default.

There is also something called relabelling in the Prometheus config which could maybe be used to set instance from exported_instance, but I had no success with it.
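
For reference, the relabelling variant would look roughly like this (a sketch, untested as noted above; the target address is a placeholder):

scrape_configs:
  - job_name: scylla
    static_configs:
      - targets: ['<collectd_exporter_ip>:9103']
    metric_relabel_configs:
      - source_labels: [exported_instance]
        target_label: instance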

Make monitoring great again :(

This is an on-prem installation of 3 Scylla servers with a monitoring solution on one of the clients, all connected to the same private network (10.9.31.xx).
I followed the instructions on the git page: installed a new setup on a server, installed Docker and cloned the repository. Setting up the monitoring tool is not foolproof and requires extensive work to get it working; please help ease the use of the system.

  1. Add alerts when one of the Docker containers (Grafana or Prometheus) is not up; that will save a lot of hassle. Also add a manual on how to look for the issue (journalctl -xe?) - see the sketch after this list.
  2. Verify how many servers are up and how many are down in a manner that makes sense; an N/A note on "dead" servers that are actually up (phantoms) is not helpful.
  3. Are there any iptables/firewall requirements to set up the connection between the containers/host and the Scylla servers?
  4. Add an example line on how to add servers to the prometheus.yml file; a single server at 127.0.0.1 does not explain how to add multiple servers to monitor.
  5. The exporter on the servers constantly crashes, and this is not reflected in the monitoring tool: the server is up, the exporter is down, and the server is considered dead or N/A.
  6. If the Docker daemon is not running, either start it from the start script or exit immediately; do not try to start the rest of the tools. Currently a string of continuous dots prints to the screen with no information on what is going on or what the tool is trying to do.
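
A minimal sketch of the kind of check item 1 asks for (container names taken from start-all.sh; everything else is an assumption):

for c in aprom agraf; do
    if ! sudo docker ps --format '{{.Names}}' | grep -qw "$c"; then
        echo "container $c is not running - check 'sudo docker logs $c' and 'journalctl -xe'"
    fi
done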

[root@localhost ~]# git clone https://github.com/scylladb/scylla-grafana-monitoring.git
Cloning into 'scylla-grafana-monitoring'...
remote: Counting objects: 280, done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 280 (delta 21), reused 0 (delta 0), pack-reused 236
Receiving objects: 100% (280/280), 64.42 KiB | 0 bytes/s, done.
Resolving deltas: 100% (140/140), done.
[root@localhost ~]# ls -ltr
total 12
-rw-------. 1 root root 1556 Sep 7 17:40 anaconda-ks.cfg
drwxr-xr-x. 2 root root 23 Sep 23 11:16 cassandra.logdir_IS_UNDEFINED
-rw-r--r--. 1 root root 3416 Sep 28 16:19 loadmlnx.yaml
drwxr-xr-x. 5 root root 4096 Oct 4 01:18 scylla-grafana-monitoring
[root@localhost ~]# service docker start
Redirecting to /bin/systemctl start docker.service
[root@localhost ~]# cd scylla-grafana-monitoring/
[root@localhost scylla-grafana-monitoring]# cd prometheus/
[root@localhost prometheus]# vi prometheus.yml
[root@localhost prometheus]# cd ../
[root@localhost scylla-grafana-monitoring]# ./start-all.sh
Unable to find image 'prom/prometheus:v1.0.0' locally
Trying to pull repository docker.io/prom/prometheus ...
v1.0.0: Pulling from docker.io/prom/prometheus
385e281300cc: Pull complete
a3ed95caeb02: Pull complete
e418e02f5f37: Pull complete
6c2c7730b5ef: Pull complete
bbc184d7f32a: Pull complete
17a6ebba0cea: Pull complete
d1b2d64d311e: Pull complete
356f67417ef1: Pull complete
Digest: sha256:13cca70de2522231af89f19fc246fad6bc594698ede40fc7712a74ce71f1068f
Status: Downloaded newer image for docker.io/prom/prometheus:v1.0.0
0c9ffbb5da10e333e2e702a4f1585c0ded7c0130efb6cf3584475aa8a5a09353
Unable to find image 'grafana/grafana:3.1.0' locally
Trying to pull repository docker.io/grafana/grafana ...
3.1.0: Pulling from docker.io/grafana/grafana
5c90d4a2d1a8: Pull complete
b1a9a0b6158e: Pull complete
acb23b0d58de: Pull complete
Digest: sha256:3476700a51ff136a507f9d09a6626964b6cfbc9352ed23e0063d8785d2b2c30f
Status: Downloaded newer image for docker.io/grafana/grafana:3.1.0
7b452d487663df60df543fe17c9e3a0396e01f8c6118d628d0c83f3025670d25
.HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Set-Cookie: grafana_sess=bea800eac5a7fac0; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 37

{"id":1,"message":"Datasource added"}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=cfe2c9b9168e0059; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 64

{"slug":"scylla-cluster-metrics","status":"success","version":0}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=405697f0dfdd1d35; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 67

{"slug":"scylla-per-server-metrics","status":"success","version":0}HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=0220efe08badfc1e; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 68

{"slug":"scylla-per-server-disk-i-o","status":"success","version":0}[root@localhost scylla-grafana-monitoring]#

Added the servers I am trying to read from to the prometheus.yml file:
cat prometheus/prometheus.yml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'scylla-monitor'

scrape_configs:
  - job_name: scylla
    honor_labels: true
    static_configs:
      - targets: ['10.9.31.182:9103','10.9.31.183:9103','10.9.31.184:9103']

Going to the web browser, pointing to 10.9.31.186, where my monitor system is installed, no data appears:
image

Looking into the data sources on the grafana setup, I see:
image

Tried to verify the installation, getting:
image

Tried to point the IP address of the setup(10.9.31.186), got the same error message:
image

Well, it seems that the Prometheus server didn't come up.
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
70175d8a3609 grafana/grafana:3.1.0 "/run.sh" 11 seconds ago Up 9 seconds 0.0.0.0:3000->3000/tcp agraf

For some reason it didn't read the prometheus.yml file from the working directory.
Oct 04 01:37:04 localhost.localdomain avahi-daemon[13905]: Withdrawing workstation service for veth4b6c354.
Oct 04 01:37:04 localhost.localdomain NetworkManager[1388]: (veth4b6c354): failed to disable userspace IPv6LL address handling
Oct 04 01:37:04 localhost.localdomain kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth409be50: link becomes ready
Oct 04 01:37:04 localhost.localdomain kernel: docker0: port 1(veth409be50) entered forwarding state
Oct 04 01:37:04 localhost.localdomain kernel: docker0: port 1(veth409be50) entered forwarding state
Oct 04 01:37:04 localhost.localdomain NetworkManager[1388]: (veth409be50): link connected
Oct 04 01:37:04 localhost.localdomain NetworkManager[1388]: (docker0): link connected
Oct 04 01:37:04 localhost.localdomain sudo[31344]: root : TTY=pts/0 ; PWD=/root/scylla-grafana-monitoring ; USER=root ; COMMAND=/bin/docker run -d
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=info msg="Starting prometheus (version=1.0.0, branch=mas
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=info msg="Build context (go=go1.6.2, user=root@98d6f3664
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=info msg="Loading configuration file /etc/prometheus/pro
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=error msg="Couldn't load configuration (-config.file=/et
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T01:37:04.891507324-07:00" level=info msg="{Action=create, Username=root,
Oct 04 01:37:04 localhost.localdomain systemd[1]: Stopped docker container 20e621214a5c26105dfe5e076a0dd440aa8911ff44ff149920f63a6072a4788b.
-- Subject: Unit docker-20e621214a5c26105dfe5e076a0dd440aa8911ff44ff149920f63a6072a4788b.scope has finished shutting down

Changing the start script to force reading the yml file got the Prometheus container up.
From the start script:
if [ -z $DATA_DIR ]
then
sudo docker run -d -v /root/scylla-grafana-monitoring/prometheus/prometheus.yml -p 9090:9090 --name aprom prom/prometheus:v1.0.0
else
echo "Loading prometheus data from $DATA_DIR"
sudo docker run -d -v $DATA_DIR:/prometheus:Z -v $PWD/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:Z -p 9090:9090 --name aprom prom/prometheus:v1.0.0
fi
Now the source is active:
image

Still, having 3 servers in the yml file list,
cat prometheus/prometheus.yml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'scylla-monitor'

scrape_configs:
  - job_name: scylla
    honor_labels: true
    static_configs:
      - targets: ['10.9.31.182:9103','10.9.31.183:9103','10.9.31.184:9103']

The monitoring shows only one server :(
image

When trying to start the monitoring system again it gets halted.
For example:
e6b803cbcc89d11c20f128808eeea7a18192447a94367615592b4b48d0d1071c
79949fcb5d3f4208106f7c71a78ab168a811682238a2dd780d31f025e979da1e
............................ (and it goes on indefinitely); the system stays in this state for minutes until Ctrl-C.
This is the information from journalctl -xe. What are the containers trying to do?

Oct 04 02:17:00 localhost.localdomain oci-systemd-hook[14799]: systemdhook : Skipping as container command is /run.sh, not init or systemd
Oct 04 02:17:00 localhost.localdomain kernel: docker0: port 2(veth02da683) entered disabled state
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (vethcd2d3e9): failed to find device 17 'vethcd2d3e9' with udev
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (vethcd2d3e9): new Veth device (carrier: OFF, driver: 'veth', ifindex: 17)
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (veth02da683): link disconnected
Oct 04 02:17:00 localhost.localdomain kernel: docker0: port 2(veth02da683) entered disabled state
Oct 04 02:17:00 localhost.localdomain avahi-daemon[1289]: Withdrawing workstation service for vethcd2d3e9.
Oct 04 02:17:00 localhost.localdomain avahi-daemon[1289]: Withdrawing workstation service for veth02da683.
Oct 04 02:17:00 localhost.localdomain kernel: device veth02da683 left promiscuous mode
Oct 04 02:17:00 localhost.localdomain kernel: docker0: port 2(veth02da683) entered disabled state
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (vethcd2d3e9): failed to disable userspace IPv6LL address handling
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (docker0): bridge port veth02da683 was detached
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (veth02da683): released from master docker0
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (veth02da683): failed to disable userspace IPv6LL address handling
Oct 04 02:17:00 localhost.localdomain kernel: XFS (dm-5): Unmounting Filesystem
Oct 04 02:17:14 localhost.localdomain kernel: docker0: port 1(veth6c657e7) entered forwarding state

Need to add docker start command

After the install section,
we need to verify that Docker is running: ps aux | grep docker
If it is not, start Docker with:
$ sudo systemctl restart docker
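
A possible one-liner for the docs, assuming systemd manages Docker:

systemctl is-active --quiet docker || sudo systemctl start docker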

Mount Grafana dashboard directory

The current solution uploads dashboards using the REST API. This is error prone and requires opening the REST API, which might be a security issue.
A better solution would be to mount the Grafana dashboard directory, as is done with the Prometheus yaml file.
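
A sketch of what this could look like, assuming Grafana's dashboards.json feature is available in the Grafana version used (paths and environment variables would need to be verified):

sudo docker run -d -p 3000:3000 -e GF_DASHBOARDS_JSON_ENABLED=true -e GF_DASHBOARDS_JSON_PATH=/var/lib/grafana/dashboards -v $PWD/grafana:/var/lib/grafana/dashboards:Z --name agraf grafana/grafana:3.1.0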

Does Prometheus have a phantom data directory?

The system I use has 3 servers. I see all of them working and active through nodetool, yet the dashboard still reports a dead node, and the information in the metrics is way off.
I can't get rid of phantom data that keeps creeping into the system.
After deleting the Docker volumes, reinstalling Docker and reinstalling the Grafana tool, some phantom data keeps coming back; I also cleared the browser cache and killed the collectd exporters on the servers.
Is there a way to clear all the historical data the tool has accumulated in the past?
./kill-all.sh only removes the currently running Docker containers; it does not clear the information.
image
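
One approach that might clear everything (an assumption based on how start-all.sh mounts data, not a verified procedure): remove the containers together with their anonymous volumes, and wipe the external data directory if one was passed via DATA_DIR:

sudo docker stop aprom agraf
sudo docker rm -v aprom agraf    # -v also removes the containers' anonymous data volumes
# if Prometheus data was kept outside the container:
# sudo rm -rf "$DATA_DIR"/*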
