Comments (7)

frodenas commented on July 17, 2024

Refactoring the exporter is not currently possible. Gathering info about releases and stemcells is easy, but gathering processes is still a challenge.

If you're experiencing high CPU load on your BOSH director, can you please open an issue at the bosh repo? It does not make any sense that the director consumes so much CPU just to gather info from VMs.

from prometheus-boshrelease.

frodenas commented on July 17, 2024

@avasseur-pivotal It'd be interesting to see your BOSH VM configuration (VM CPU, memory, disk), because we never ran into this issue, even when using the default OM director configuration. Bear in mind also that the bosh_exporter FAQ recommends increasing the default scrape interval and the scrape timeout, but again, something is really wrong with that BOSH when it takes so much time to fetch the VM states. It might happen that some VMs' agents are unresponsive, but that should not take longer than 45 seconds. Why so many tasks are queued is also a mystery; I'll need to dig into the bosh logs to see what might be happening.

This release doesn't set any exporter by default; it's up to you to decide which exporter you want to use. Although the boshhmforwarder is an option (outlined in the bosh_exporter FAQ), I'm reluctant to use it on my deployments because you then have a dependency on the CF Firehose. What we have used in some deployments is the graphite_exporter and the BOSH Graphite Health Monitor plugin, but I understand that you cannot use this configuration when using OM (because there is no option to enable that plugin).

Also, I want to point out that I've never been happy with the bosh_exporter querying bosh and creating so many tasks, but it was the only way to fetch the VM IP addresses in order to use the service discovery. Some recent additions (cloudfoundry/bosh@6a70432) to bosh have made this task more feasible, so I plan to refactor the bosh_exporter to use a different mechanism to gather both the VM IP addresses and metrics.

shinji62 commented on July 17, 2024

Hi @frodenas,

The problem that @avasseur-pivotal describes is the same one I explained to you some time ago by email.

When I checked, I saw that one of the BOSH worker processes was stuck and had stopped working on the queue. Tasks were just piling up and the CPU was stuck.
Even with Prometheus stopped, I had to restart all the BOSH processes to recover.

I increased the scrape interval to 5m, even 10m, to work around the issue.
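
As a rough illustration of that workaround, here is a minimal Python sketch that builds a Prometheus scrape job with a longer interval and checks that the timeout does not exceed it (Prometheus rejects a `scrape_timeout` greater than the `scrape_interval`). The helper names, the `2m` timeout, and the `localhost:9190` target are assumptions made for the example, not values taken from this thread.

```python
import json
import re

# Seconds per unit for the simple durations used in this example.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration: str) -> int:
    """Convert a simple Prometheus-style duration such as '5m' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def bosh_scrape_job(interval: str, timeout: str, target: str) -> dict:
    """Build a scrape job dict (hypothetical helper, not part of any release)."""
    if to_seconds(timeout) > to_seconds(interval):
        raise ValueError("scrape_timeout must not exceed scrape_interval")
    return {
        "job_name": "bosh",
        "scrape_interval": interval,
        "scrape_timeout": timeout,
        "static_configs": [{"targets": [target]}],
    }


if __name__ == "__main__":
    # Assumed values: a 5m interval as mentioned above, and a placeholder target.
    job = bosh_scrape_job("5m", "2m", "localhost:9190")
    print(json.dumps({"scrape_configs": [job]}, indent=2))
```

The printed structure uses the same keys that would go under `scrape_configs` in `prometheus.yml`; only the interval and timeout values matter for the workaround described here.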

I am working on the TSDB collector, which I still need to test, but I guess it should be possible to do the same as with the Graphite one.

frodenas commented on July 17, 2024

@avasseur-pivotal @shinji62 Have you guys opened an issue at the BOSH repo? It'd be interesting to know why the worker processes consume so much CPU and finally get stuck.

frodenas commented on July 17, 2024

Just a quick update. The plan here is to use:

  1. the bosh_tsdb_exporter to gather VM metrics. This exporter will receive metrics from the BOSH OpenTSDB Health Monitor plugin and will be compatible with OpsManager (which only allows you to configure this plugin). This exporter doesn't hit the BOSH API (and it will not generate a task), so the problem stated in this issue will be mitigated.

  2. the bosh_exporter, which will be responsible only for gathering administrative info (like the releases, stemcells, ... being used) and the VMs' IPs (but this will require a director version >= 261). To get this info, the exporter will not need to generate a task, so this will also help to mitigate the problem stated in this issue.

I'm still finishing 2) and doing some tests. After that, I'll update this release and the associated dashboards and alerts.
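
To make the first item more concrete, here is a minimal Python sketch of the receiving side, assuming the Health Monitor plugin pushes samples as OpenTSDB telnet-style `put` lines (`put <metric> <timestamp> <value> <tag=value ...>`). The `Sample` type, the `parse_put_line` helper, and the metric and tag names in the usage example are hypothetical and not taken from the bosh_tsdb_exporter code. The point is only that the metrics are pushed to the exporter, so nothing here calls the BOSH API or creates a task.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    """One pushed data point (hypothetical type for this sketch)."""
    metric: str
    timestamp: int
    value: float
    tags: dict


def parse_put_line(line: str) -> Sample:
    """Parse a single OpenTSDB telnet-style 'put' line into a Sample."""
    parts = line.split()
    if len(parts) < 4 or parts[0] != "put":
        raise ValueError(f"not a put line: {line!r}")
    _, metric, timestamp, value, *raw_tags = parts
    # Each remaining token is expected to look like key=value.
    tags = dict(tag.split("=", 1) for tag in raw_tags)
    return Sample(metric, int(timestamp), float(value), tags)


if __name__ == "__main__":
    # Made-up metric name and tags, for illustration only.
    line = "put system.cpu.user 1500000000 12.5 deployment=cf job=router index=0"
    print(parse_put_line(line))
```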

alexanelli commented on July 17, 2024

@avasseur-pivotal just checking in to see if this is still happening; thanks

lancefrench commented on July 17, 2024

I'll throw out there that I have a customer experiencing this issue. I can gather more information if needed. We've gone with the workaround of deploying the TSDB exporter, but they would like to collect the administrative info if the bosh_exporter is refactored.
