bosh-prometheus / prometheus-boshrelease Goto Github PK

View Code? Open in Web Editor NEW

111.0 46.0 162.0 3.3 MB

bosh release for prometheus ecosystem

License: Apache License 2.0

Shell 92.11% Dockerfile 0.26% HTML 7.63%

bosh-release prometheus cloud-foundry metrics

prometheus-boshrelease's People

Stargazers

Watchers

Forkers

orange-cloudfoundry omearaj janaurka szaouam arthurhlt cppforlife vmware-archive brightzheng100 st3v jmcarp mjsjinsu pillopl jigsheth57 jccarte shinji62 wfernandes ljfranklin jtuchscherer comdaze wjk940 cnelson vchrisr benjaminguttmann-avtq dr8tsh making avarteqgmbh kkitai engineerbetter drnic peterellisjones joaquinrz govau ntdt bhudlemeyer wjwoodson ramonskie 18f sba30 daxterm geofffranks ryanaross robbo10 cwlbraa arghya88 manashcsc maning8 matthewfischer scoobed acho-bacho sifiaicha svrc stephen-reaves ruixiang nwmac riguelbf sapient007 cze-p7s1 irfanhabib rmoorman naveen-goswami springerpe bonzofenix alexvianet currycan keymon gdenn jacoblnewton mylucidreality kinjelom alphagov robplahn daichi703n aryanet taskinfurkan srothwell01 xyloman fbuchmeier jacopen mhtike pdev1989 hae-anwar-bardai pdubey2 jupilhwang denverops infra-red dhiren051 mdcarwile-az bengerman13 scottgai andinod dohq bottkars maxknee gstackio rabbitmq bkez322 dystudio 3v1lw1th1n akhettal tsato1222

prometheus-boshrelease's Issues

Corrupt packages in release v13?

Hi,
I tried to upgrade to Version 13 today and ran into an issue during the compile phase:

 Started compiling packages
  Started compiling packages > grafana/5271e14376bcd8484d934e4379fcfb1e8467adea
  Started compiling packages > cf_exporter/149d9106a695629e0fe86f38edd4501261dad41e
  Started compiling packages > rabbitmq_exporter/85c19dbb43607b7d2f6eb6a35c0ca30fa313ca93
   Failed compiling packages > cf_exporter/149d9106a695629e0fe86f38edd4501261dad41e: Action Failed get_task: Task cede0d5c-cac7-4027-79ea-975a73f6201a result: Compiling package cf_exporter: Fetching package cf_exporter: Fetching package blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get233330094', stdout: 'Error running app - Getting dav blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Get http://172.16.106.4:25250/0c/f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: read tcp 172.16.106.32:34312->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1 (00:07:33)
   Failed compiling packages > rabbitmq_exporter/85c19dbb43607b7d2f6eb6a35c0ca30fa313ca93: Action Failed get_task: Task 03f1847d-d5e1-40c6-7021-fd250418c042 result: Compiling package rabbitmq_exporter: Fetching package rabbitmq_exporter: Fetching package blob 43f0c2f9-6d98-404c-ab10-fde39d50f1b7: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get 43f0c2f9-6d98-404c-ab10-fde39d50f1b7 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get610207793', stdout: 'Error running app - Getting dav blob 43f0c2f9-6d98-404c-ab10-fde39d50f1b7: Get http://172.16.106.4:25250/ce/43f0c2f9-6d98-404c-ab10-fde39d50f1b7: read tcp 172.16.106.31:42296->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1 (00:07:39)
   Failed compiling packages > grafana/5271e14376bcd8484d934e4379fcfb1e8467adea: Action Failed get_task: Task c49dfcb5-680c-4c96-5298-0f406bc7df17 result: Compiling package grafana: Fetching package grafana: Fetching package blob 8e050cc5-6ec7-4b63-80d3-65b957a8179e: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get 8e050cc5-6ec7-4b63-80d3-65b957a8179e /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get218393412', stdout: 'Error running app - Getting dav blob 8e050cc5-6ec7-4b63-80d3-65b957a8179e: Get http://172.16.106.4:25250/7b/8e050cc5-6ec7-4b63-80d3-65b957a8179e: read tcp 172.16.106.30:56982->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1 (00:09:08)
   Failed compiling packages (00:09:08)

Error 450001: Action Failed get_task: Task cede0d5c-cac7-4027-79ea-975a73f6201a result: Compiling package cf_exporter: Fetching package cf_exporter: Fetching package blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Getting blob from inner blobstore: Getting blob from inner blobstore: Shelling out to bosh-blobstore-dav cli: Running command: 'bosh-blobstore-dav -c /var/vcap/bosh/etc/blobstore-dav.json get f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0 /var/vcap/data/tmp/bosh-blobstore-externalBlobstore-Get233330094', stdout: 'Error running app - Getting dav blob f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: Get http://172.16.106.4:25250/0c/f4b8c5c4-4c0c-4009-8b7f-f591c2b816c0: read tcp 172.16.106.32:34312->172.16.106.4:25250: read: connection reset by peer', stderr: '': exit status 1

The other packages compiled smoothly. Might it be the case, that those 3 files are corrupt?

remove v6 final release as it cannot be build in a clean env in bosh.io

seems like it contained invalid spec files

Add smoke-test

Add smoke-tests so we can test that the final deploy succeed (ie alertmanager, prometheus and grafana)

Support arbitrary alerts from deployment manifest

As a user of this release, I want to add miscellaneous custom alerts without writing a whole new bosh release. Instead, it would be useful to configure arbitrary alerts like this:

...
properties:
  prometheus:
    custom_rules:
    - (( file "path/to/some/rules" ))
    - (( file "path/to/different/rules" ))
...

The release would then write the contents of prometheus.custom_rules to a file and add the file path to rule_files. If that makes sense, we'd be happy to send a patch. WDYT @frodenas ?

cc @cnelson

Audit events

Any plans to add auditing events to the cf_exporter? It would be nice to gather those details per app too.

Update Cloud Foundry alerts

The Cloud Foundry alerts needs review & update.

grafana dashboard error

Hello, hit his problem since 15.0.0 (reproduced on 17.0.0 and latest 17.0.2)

Grafana post starts fails with a strange file

Updating dashboard /var/vcap/jobs/cloudfoundry_dashboards/prometheus_cf_exporter.json at Wed Jun  7 10:52:23 UTC 2017
Validating /var/vcap/store/grafana/dashboards//prometheus_cf_exporter.json
Updating dashboard /var/vcap/jobs/cloudfoundry_dashboards/prometheus_firehose_exporter.json at Wed Jun  7 10:52:23 UTC 2017
Validating /var/vcap/store/grafana/dashboards//prometheus_firehose_exporter.json
Updating dashboard agent.cert at Wed Jun  7 10:52:23 UTC 2017
Validating /var/vcap/store/grafana/dashboards//agent.cert
parse error: Invalid numeric literal at line 1, column 11

This file is present in /var/vcap/store, dont know where it comes from.

grafana/47c0244a-3c59-4837-8602-4b619ef29c60:/var/vcap/store/grafana/dashboards# pwd                                                                       
/var/vcap/store/grafana/dashboards                                                                                                                         
grafana/47c0244a-3c59-4837-8602-4b619ef29c60:/var/vcap/store/grafana/dashboards# ls -lrt                                                                   
total 860                                                                                                                                                  
-rw-r--r-- 1 root root  6687 Jun  7 10:52 bosh_deployments.json                                                                                            
-rw-r--r-- 1 root root 35055 Jun  7 10:52 bosh_jobs.json                                                                                                   
-rw-r--r-- 1 root root 35046 Jun  7 10:52 bosh_overview.json                                                                                               
-rw-r--r-- 1 root root 15181 Jun  7 10:52 bosh_processes.json                                                                                              
-rw-r--r-- 1 root root 28081 Jun  7 10:52 prometheus_bosh_exporter.json
-rw-r--r-- 1 root root 11501 Jun  7 10:52 prometheus_bosh_tsdb_exporter.json
-rw-r--r-- 1 root root 16435 Jun  7 10:52 cf_apps_events.json
-rw-r--r-- 1 root root 13365 Jun  7 10:52 cf_apps_latency.json
-rw-r--r-- 1 root root 36548 Jun  7 10:52 cf_apps_requests.json
-rw-r--r-- 1 root root 20817 Jun  7 10:52 cf_apps_system.json
-rw-r--r-- 1 root root 19154 Jun  7 10:52 cf_bbs.json
-rw-r--r-- 1 root root 21344 Jun  7 10:52 cf_cc.json
-rw-r--r-- 1 root root 20338 Jun  7 10:52 cf_cells_capacity.json
-rw-r--r-- 1 root root 25104 Jun  7 10:52 cf_component_metrics.json
-rw-r--r-- 1 root root 13976 Jun  7 10:52 cf_diego_auctions.json
-rw-r--r-- 1 root root 11125 Jun  7 10:52 cf_diego_health.json
-rw-r--r-- 1 root root 21552 Jun  7 10:52 cf_doppler_server.json
-rw-r--r-- 1 root root 26061 Jun  7 10:52 cf_etcd.json
-rw-r--r-- 1 root root 22500 Jun  7 10:52 cf_etcd_operations.json
-rw-r--r-- 1 root root 11880 Jun  7 10:52 cf_garden_linux.json
-rw-r--r-- 1 root root 78176 Jun  7 10:52 cf_kpis.json
-rw-r--r-- 1 root root 16062 Jun  7 10:52 cf_lrps_tasks.json
-rw-r--r-- 1 root root 15427 Jun  7 10:52 cf_metron_agent.json
-rw-r--r-- 1 root root 13902 Jun  7 10:52 cf_metron_agent_doppler.json
-rw-r--r-- 1 root root 10589 Jun  7 10:52 cf_organization_memory_quotas.json
-rw-r--r-- 1 root root 15680 Jun  7 10:52 cf_organization_summary.json
-rw-r--r-- 1 root root 11204 Jun  7 10:52 cf_route_emitter.json
-rw-r--r-- 1 root root 27468 Jun  7 10:52 cf_router.json
-rw-r--r-- 1 root root 22637 Jun  7 10:52 cf_space_summary.json
-rw-r--r-- 1 root root 47125 Jun  7 10:52 cf_summary.json
-rw-r--r-- 1 root root 22052 Jun  7 10:52 cf_uaa.json
-rw-r--r-- 1 root root 65631 Jun  7 10:52 prometheus_cf_exporter.json
-rw-r--r-- 1 root root 53417 Jun  7 10:52 prometheus_firehose_exporter.json
-rw-r--r-- 1 root root  1058 Jun  7 10:52 agent.cert

Create MongoDB dashboards & alerts

The MongoDB dashboards are currently empty and there is no job/package for MongoDB alerts.

Metrics showing for a non-existent CF component after scale down

I scaled my routing tier to include 2 routers and then scaled back down to a single router to validate my dashboard settings and ability to add metrics as the platform changed. I noticed the second router is still showing in the dashboards. Using the prometheus graph tool and put in my query and it shows both routers. What can I look at to validate this behavior?

Alerting on diego low remaining memory seems incorrect

First, thank for the great work with release that's help a lot.

I have just an error with alerting, I have always a "DiegoLowRemainingMemory" alert on the alert manager. But I have 82 GB available in my cells as the Cf cells capacity dashboard says. This value on the dashboard seems actually correct as the query used to build this dashboard.

For me the alerting have a bad expression because as we can see here https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/master/src/cloudfoundry_alerts/diego.alerts#L26 it never does a sum with multiple cells we can have in a deployment.

Am I wrong ? If no, I can propose this solution: sum(avg(firehose_value_metric_rep_capacity_remaining_memory) by(bosh_deployment, bosh_job_name, bosh_job_id)) to replace the actual expression in the alert.

Create cadvisor dashboards and alerts

There aren't any dashboards or alerts for cadvisor metrics.

Mysql Dashboard open source license type

When prometheus BOSH Release v17.0.0 released, All Dashboards file moved from packages to job templates.
But mysql dashboard LICENSE file was removed. d5abc49
What is open source license type of mysql dashboard?(Apache or AGPL-3.0)

Grafana Admin Passwort Update Failing

Hi,

we updated today from Prometheus v 17.0.0 to 17.5.0 and the deployment failed because of Grafana not successfully executing the post-script.

Error: Failed to update user password
 
NAME:
   Grafana cli admin reset-admin-password - reset-admin-password <new password>
 
USAGE:
   Grafana cli admin reset-admin-password [command options] [arguments...]
 
OPTIONS:
   --homepath   path to grafana install/home path, defaults to working directory
   --config     path to config file

But to me the command executed seems correct. Any suggestions?

Create System (node) alerts

There aren't any alert configured for system (node_exporter) metrics.

Alert BOSHJobHighCPULoad does not take number of CPUs into account

The BOSHJobHighCPULoad Alert queries the bosh_job_load_avg01 metric. The problem is that the warning threshold value for this metric is dependend on the number of CPUs. Generally speaking 100% cpu load on 1 core is indicated by a 1. 100% cpu load on a 16 core machine would result in a 16.

Since the query for this alert does not divide the mtric by the number of CPUs it completely useless: On a 1 core machine the default of 5 will mean that you have to fix it immediately. On a 4 core machine a load average of 5 still indicates a slight problem. But on an 8 core machine a load avg of 5 absolutely fine.

So my request: can the load avg please be divided by the number of cpus in the machine before comparing it to the threshold value?

Create MySQL alerts

The MySQL alerts are actually empty.

firehose_value_metric_etcd_is_leader duplicated

We are starting to have a look and test the prometheus boshrelease with 2 firehose exporters, Prometheus started to trigger one alert CFEtcdMoreThanOneLeader (1 active):

ALERT CFEtcdMoreThanOneLeader
  IF count(firehose_value_metric_etcd_is_leader == 1) BY (environment, bosh_deployment) > 1
  FOR 10m
  LABELS {service="cf-etcd", severity="critical"}
  ANNOTATIONS {description="CF etcd cluster at deployment `{{$labels.environment}}/{{$labels.bosh_deployment}}` had more than one leader in the last 10 minutes: {{value}}", summary="CF etcd cluster at deployment `{{$labels.environment}}/{{$labels.bosh_deployment}}` > 1 leader"}

because there are two "similar" metrics. I have checked the status of the cluster and it is healthy, the leader is 10.230.16.79 (no partitions, only one leader). In the next picture you can see the "same" metric was reported by a different Firehose instance at a different time (see instance tag) so Prometheus considers these different metrics.

I do not have enough experience with Prometheus, but, assuming that tagging the metric with the firehose_exporter instance is needed, where do you think this issue should be fixed: in the alert definition or doing some kind of "duplicates" deletion? Any other ideas?

Thanks!

Fix Alert CFAuctioneerFetchStatesDurationTooHigh

Hello,

I'm using prometheus in an pivotal cloudfoundry environment and have an issue with the CFAuctioneerFetchStatesDurationTooHigh alert.

This alert uses the AuctioneerFetchStatesDuration metric and divides this metric by 1.000.000.
But this metric is measured in nanoseconds. Therefor it have to be divided by 1.000.000.000.
https://docs.pivotal.io/pivotalcf/1-11/monitoring/kpi.html#AuctioneerFetchStatesDuration

high cpu usage on director and slow bosh CLI with bosh_exporter

Running against a small PCF 1.9, the bosh_exporter scrape interval at 30sec is really causing bosh task queueing as expected but that is impacting the bosh user experience

in the bosh director, top reports 50% of cpu usage for user
in the bosh cli, this takes close to 2 min to do a "bosh vms" across 5 bosh releases
The FAQ is not so clear about this
Attached queues example

Moving to a scrap interval of 10min changes this fully, but is likely to impact alerting on bosh healthy messages.
I am planning to change and default to use BoshHMforwarder.
On PCF the ECS team has made that easy with a tile - http://www.ecsteam.com/deploying-bosh-health-metrics-forwarder-pivotal-cloud-foundry-tile
I would think defaulting this bosh release to using boshhmforwarder (even without a tile, bringing its own as part of this release) would be a wiser choice

Better stemcell alerts

I was talking to @LinuxBozo about the bosh outdated stemcell alerts that I contributed here, and he pointed out it's easy to miss outdated stemcells. If the prometheus deploy that bumps the expected version fails, or gets canceled, or if an operator pauses the concourse job that deploys it, etc, we won't notice if stemcells are out of date.

I'm wondering if we can do better by adding a tiny exporter that emits the current stemcell version for a particular stemcell series. Two quick proposals:

First approach

The stemcell exporter queries http://bosh.io/api/v1/stemcells/ and emits metrics like this:

bosh_stemcell_info{bosh_stemcell_version="3312.29"} 1
bosh_stemcell_info{bosh_stemcell_version="3312.28"} 0
bosh_stemcell_info{bosh_stemcell_version="3312.27"} 0
...

Then we can write a query like this:

bosh_deployment_stemcell_info * on(bosh_stemcell_version) group_left bosh_stemcell_info

This gives a simple prometheus query, but could potentially miss stemcells that are from the wrong series entirely. I don't know if that's realistic, but we can handle that using a different approach:

Second approach

The stemcell exporter emits a single metric, for the expected stemcell:

bosh_stemcell_info{bosh_stemcell_version="3312.29"} 1

Then we can find outdated stemcells by listing all deployments, then subtracting deployments with the expected version:

bosh_deployment_stemcell_info unless (bosh_deployment_stemcell_info * on(bosh_stemcell_version) group_left bosh_stemcell_info)

I'm still new to prometheus, so maybe there's a simpler approach. WDYT?

cc @cnelson

Split off dashboards and alerts for customization?

@frodenas @mkuratczyk what do you think about splitting off dashboards and alerts so its easier to customize them? Do you have other ideas on how to do it?

Here is a sample of what it could look like: https://github.com/dlapiduz/prometheus-custom-boshrelease/

Bosh Dashboard and "All" deployment failed.

Hi,

In grafana since v11 it seems that all bosh dashboard are failing.
Means when I select "All" jobs for example I got this error

TypeError: Cannot read property 'replace' of undefined

firehose_exporter.metrics.environment values

Hi,
currently trying to upgrade to 15.0.0 and was stopped by the following error:

Error 100: Unable to render instance groups for deployment. Errors are:
   - Unable to render jobs for instance group 'prometheus'. Errors are:
     - Unable to render templates for job 'bosh_exporter'. Errors are:
       - Error filling in template 'bosh_exporter_ctl' (line 76: Can't find property '["bosh_exporter.metrics.environment"]')
     - Unable to render templates for job 'firehose_exporter'. Errors are:
       - Error filling in template 'firehose_exporter_ctl' (line 70: Can't find property '["firehose_exporter.metrics.environment"]')
     - Unable to render templates for job 'cf_exporter'. Errors are:

Looking at the code at:

https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/589eb4e4292c1687910978d7330c0aed09d89444/jobs/firehose_exporter/templates/bin/firehose_exporter_ctl#L70

I see it is mandatory. Is this a mistake or intended behaviour? If so, what would be appropriate values for it?

dup-ed grafana

It would be great if all just have one grafana realease (like https://github.com/vito/grafana-boshrelease). May be @vito can donate it to cf community if we promise to keep it nice and tidy.

Prometheus Benchmark Dashboard

While deploying prometheus bosh release on a large (~200vms) deployment, we met scraping performance issues.
Theses dashboards was incredibly usefull to identify and fix :

https://grafana.com/dashboards/2078
http://demo.robustperception.io:3000/dashboard/db/prometheus-benchmark-1-6-x?refresh=1m&orgId=1

It could be a usefull addition to the built-in dashboard
cc @ArthurHlt

Filter out bosh_job ~ compilation* in alerts

The alert BOSHJobEphemeralDiskWillFillIn2Hours is firing during compilation due to the compilation workers.

spec.address issue with grafana job

When we deploy the prometheus without bosh power dns, the deployment is failing. The jobs' script is using the bosh hostname.
<spec.address> is used for the job scripts but without the powerdns it can't resolve the hostname.
I think if we could choose whether we will use spec.address or spec.ip, it would be better.

Errands / delete deployment as unhealthy job

Hi,
it's seems that errand / or deleted job/release are count as unhealthy and so trigger alert.
This unhealthy job remain forever right now.

Thanks

Wrong Port for Postgres Exporter

According to the specs here: https://github.com/prometheus/prometheus/wiki/Default-port-allocations the Postgres Exporter exposes to 9187. Currently configured in prometheus-boshrelease we have 9113.

Is there any special reason why?

"No data points" when viewing Apps dashboards

Been wanting to kick the tires on Prometheus for a while and this release got it up and going super quickly! The only issue I'm having is the Apps dashboards don't display any info (firehose and bosh dashboards are fine). Here's my config, a cf-deployment ops file: https://github.com/cloudfoundry/capi-ci/blob/master/cf-deployment-operations/add-prometheus.yml.

Looks like the cf_exporter is in charge of generating metrics for the Apps dashboards. Even when changing the debug level for the exporter job, the only log line is level=info msg="Listening on :9193" source="cf_exporter.go:278". The expected metrics, like "cf_total_application_events" also don't appear in the Prometheus metrics explorer.

Appreciate the help and thanks for building this release!

Prometheus data files stored under root partition

Prometheus data files currently stored under root partition /var/vcap/store/prometheus and that can quickly cause root partition running out of disk space. Can we add an attribute to spec to override the attribute to point to data partition /var/vcap/data/prometheus?

It seems that storage.local.path is hardcoded to use /var/vcap/store/prometheus
https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/master/jobs/prometheus/templates/bin/prometheus_ctl#L90

Create documentation on how to monitor BOSH and Cloud Foundry

multiple bosh exporters / exporter authent

Hello, i have complex deployment, incuding multiple bosh director.
I understand i can configure different ports for multiple exporters. However i cant see how to configure multiple exporters in the same vm/instance group.

Do i have to configure separate vm/instance group per bosh exporter ?
In that case, is it possible to secure prometheus => bosh exporter link (basic auth, ssl ?)
Could bosh-links help wiring multiple bosh exporter to prometheus server transparently ?

thx
Pierre

BTW: fantastic job on this bosh release! Really nice out of the box grafana / prometheus / alerting experience out of the box !

Monitoring multiple bosh directors

Is it possible to monitor multiple bosh directors with a single prometheus deployment? We would like to run a single prometheus that collects metrics from multiple directors (staging and production). It looks like the bosh exporter only talks to a single director, so we'd need to run multiple instances of the exporter. And that would mean running each exporter on a separate vm, since bosh doesn't know how to run multiple instances of the same job on the same host. And that would mean prometheus wouldn't be able to read service discovery files generated by bosh exporters, since we can only colocate prometheus with one of the exporters.

It seems like our options for this use case would be:

Teach the bosh exporter to monitor multiple bosh directors
Run multiple vms, each with prometheus and bosh exporter, and use federation to combine metrics

Does the first option make sense to you @frodenas, or do you think the exporter should only monitor a single director? Or am I missing a simpler solution?

cc @cnelson

Enhancement of documentation

Hi,
I am currently trying to install prometheus-boshrelease on OS CF, backed by OpenStack. The deployment went through smoothly after creating a OS deployment manifest, I am able to open Grafana, but I am not able:

to see the dashboards provided
when importing the dashboards manually, any datapoints

So it seems, that I am doing something wrong. Could any give me heads up? I would also participate creating the documentation.

Thanks a lot & BR,
Johannes

Is anyone having issues with `label_values` in v17 Grafana?

Just upgraded to the 17.0.0 release and I am having a quite weird issue with Grafana.

If I keep the label_values in the template as is it doesn't return anything. If I change it to just use the label it works.

I ended up changing the dashboards to label_values(environments).

Has anyone seen something like this?

Add a warning regarding the `cf_exporter` resource consumption

The cf_exporter can slow down the cloud_controller dramatically when the scrape_interval has a low value.

Authentication issue

when I am upgrading prometheus from version 12.3.3 to 14.0.0, and hitting the grafana dashboard I am getting authentication pop up box multiple times asking for username and password and saying your connection is not private.

After upgrade from 12.0.0 to 12.2.0 graphs not showing deployment details

I upgrade to 12.2 and my graphs are not showing bosh deployment details.

Bump elasticsearch_exporter again

Looks like 0.3.2 introduced a regression that was fixed in 0.3.3. Unfortunately, the regression causes the exporter to panic on /metrics, so it's basically unusable for now. Fix at prometheus-community/elasticsearch_exporter#61.

CF Exporter - Filter Collector

Hi,

I am facing currently an issue with the cf_exporter.
When I define any collector mentioned in the specs of the exporter the export isn't able to start and shows errors like :

time="2017-07-21T12:43:56Z" level=error msg="Collector filter [ApplicationEvents] is not supported" source="cf_exporter.go:213"

Any ideas?

Regards,

Benjamin

Some metrics are missing

Hi
I try to monitoring CF with prometheus-boshrelease
currently I set up cf exporter, firehose exporter

Most metrics are collected with this exporter
but some metric were not show in prometheus DB

Ex)
firehose_counter_event_gorouter_bad_gateways_delta
firehose_counter_event_gorouter_bad_gateways_total
firehose_counter_event_gorouter_rejected_requests_delta
firehose_counter_event_gorouter_rejected_requests_total
firehose_counter_event_bbs_* metrics empty, except two metrics "firehose_counter_event_bbs_request_count_delta", "firehose_counter_event_bbs_request_count_total"

ENV)
cf : 238
diego : 0.1476.0

cf_exporter, version 0.4.3 (branch: master, revision: 9e37d9069bbb87d739d2c326981fed917ec016e4)
build user: root@1d21624a3782
build date: 20170216-02:38:17
go version: go1.7.5

firehose_exporter, version 4.1.0 (branch: master, revision: 95333eab4c8295bf727faa564add32422f5d71c6)
build user: root@387dfcd86a81
build date: 20170216-01:18:51
go version: go1.7.5

Interesting behaviour during upgrade

Hi,
I did an extremely smooth upgrade from v239 to v245 of CF Release and Diego (according to specs of the CF Release) today. There were not errors the deployment went through.

I have to findings, which I would like to share and also discuss:

I the picture above you can see, that the capacity decreased in this demo deployment although CF is running fine and we are still able to scale our sample application

The Compilation Workers are visible in the dashboard. Perhaps we should enable filtering on those for the dashboards?

bosh_exporter should not filter.azs by default

Hi,

What is the use case for the 'else' statement here: https://github.com/cloudfoundry-community/prometheus-boshrelease/blob/master/jobs/bosh_exporter/templates/bin/bosh_exporter_ctl#L50-L51 ? We just ran into an issue where we had no job-level information after deploying prometheus-release and we tracked it down to this - we didn't intend to filter at all but by default it was filtering on AZ (and we didn't have any interesting jobs in that AZ).

Thanks,

dropping unnecessary property namespacing

it would be nice to drop job name property namespacing to remove some of the repetitiveness. for example:

prometheus.evaluation_interval would become just evaluation_interval.

wdyt? i would be happy to do a PR that applies this to all the jobs.

Prometheus API can respond but Prometheus WebUI can't do.

Hi.

Although I did setup prometheus-17.6.0 with bosh, I couldn't get data from prometheus webui(no data points).
but Prometheus API is working.

$ curl -g 'http://100.99.50.80:9090/api/v1/series?match[]=cf_organization_info{environment="hoge"}'
{"status":"success","data":[{"__name__":"cf_organization_info","deployment":"cf","environment":"hoge","instance":"localhost:9193","job":"cf","organization_id":"e00539b9-4437-4e88-9a9a-6b807861383c","organization_name":"org","quota_name":"default"},
...
..
.

Why would it happen?

grafana v4.2.0 release

Hi
Do you have a plan update of grafana job?
I think useful feature are merged

https://community.grafana.com/t/release-notes-for-grafana-v4-2-0/167

Scale-out

As numbers of metrics ca be huge, for huge deployment how do we achieve scale-out architecture ?

Thanks

firehose_exporter internal servererror 500 after update to Pivotal RabbitMQ Service 1.8.*

Hello,

I'm using the prometheus bosh release in an Pivotal Cloud Foundry environment.
After updating the rabbitmq service tile to 1.8.* I am experiencing an issue with the firehose_exporter.

The update added an additional service broker for dedicated rabbitmq services.
This additional service broker producing following issue in firehose_exporter:

collected metric firehose_value_metric_p_rabbitmq_log_sender_total_messages_read label:<name:"bosh_deployment" value:"cf-rabbitmq" > label:<name:"bosh_job_id" value:"8daeceac-05cf-4094-a1ab-d9355dac1584" > label:<name:"bosh_job_ip" value:"192.168.17.15" > label:<name:"bosh_job_name" value:"rabbitmq-broker" > label:<name:"environment" value:"P" > label:<name:"origin" value:"p-rabbitmq" > label:<name:"unit" value:"count" > gauge:<value:0 > has help "Cloud Foundry Firehose 'logSenderTotalMessagesRead' value metric from 'p-rabbitmq'." but should have "Cloud Foundry Firehose 'logSenderTotalMessagesRead' value metric from 'p.rabbitmq'."

Turning off the logs from dedicated rabbitmq service broker fix this issue temporary.

Can you please update the firehose_exporter to consume logs from shared rabbitmq service broker and from dedicated rabbitmq service broker?

Thank you,
Martin

Grafana-UAA integration

Create a document explaining how you can integrate Grafana user authentication with Cloud Foundry UAA.

bosh-prometheus / prometheus-boshrelease Goto Github PK

prometheus-boshrelease's People

Stargazers

Watchers

Forkers

prometheus-boshrelease's Issues

First approach

Second approach

Recommend Projects

Recommend Topics

Recommend Org