Code Monkey home page Code Monkey logo

Comments (12)

phyber avatar phyber commented on June 3, 2024 1
  1. Sounds simple, but I'm not sure I understand additional integration comment.
  2. Seems to still have an issue, in the case where old_value < value, it's going to increment by the whole value each time, instead of the difference between the old and current value.

With Solution 1, we should be able to continue to treat the Gauge as a Counter in things like Grafana, rate should work as before, decreases in the metric would still count as a reset. Counting total CPU time consumed by all jails with a given name should still be a case of sum by(name) (rate(metric[5m])), etc.

The HELP and TYPE comments for the metrics in the scrape would unfortunately be "wrong" with regards to how the metrics should be used, but we could document their special case.

I'll have a proper look at this some time tomorrow :)

from jail_exporter.

phyber avatar phyber commented on June 3, 2024

Could not reproduce this for now.

from jail_exporter.

phyber avatar phyber commented on June 3, 2024

Reopening this as it seems to be a real issue when Poudriere is running in bulk mode (probably others too, but bulk hits the resource counters enough to trigger it).

Debugging information thrown out is:
jail_exporter[22120]: thread 'tokio-runtime-worker-0' panicked at 'assertion failed: v >= P::T::from_i64(0)', /home/user/.cargo/registry/src/github.com-1ecc6299db9ec823/prometheus-0.4.1/src/counter.rs:70:9
Which seems to indicate that a counter somewhere is trying to be decremented.

The likely cause of this is the build jails coming and going, but having the same names each time. This means that their rctl resource usage stats would reset to 0 and once we've done some math on the figures to find out how much we're incrementing by, the result would now be negative.

jail_exporter/src/main.rs

Lines 204 to 208 in fa4a7fc

"cputime" => {
let series = JAIL_CPUTIME_SECONDS.with_label_values(&[&name]);
let inc = *value - series.get();
series.inc_by(inc);
},

A possible solution here is to just take the absolute value after doing that math, and increase the counter by that amount.

from jail_exporter.

phyber avatar phyber commented on June 3, 2024

Currently testing the absolute value approach.

from jail_exporter.

phyber avatar phyber commented on June 3, 2024

Taking the absolute values here seems to have solved the issue.

from jail_exporter.

fabianfreyer avatar fabianfreyer commented on June 3, 2024

I'm not sure the absolute value really makes sense here, as old_value - new_value isn't really the amount of CPU time consumed. This becomes clear when we look at the following example:

  1. Jail A (1) consumed 100 seconds of CPU time: jail_cpu_time{name = "A"} 100
  2. Jail A (1) gets restarted. Let the new Jail be denoted as A (2)
  3. Jail A (2) consumes another 10 seconds of CPU time:
    // *value == 10
    // series.get() == 100
    let inc = *value - series.get(); 
    // Inc.abs() is now 90, but should be 10
    This would result in the following counter value: jail_cpu_time{name = "A"} 190, while in total only 110 seconds were consumed.

I see two solutions here:

  1. Make jail_cpu_time a gauge representing the CPU time consumed between scrapes. This would have the advantage of reflecting a Jail being restarted, but calculating the total CPU time consumed by all jails with a given name would require additional integration.
  2. Handle jail restarts transparently:
    let increment = match series.get() {
       old_value if old_value <= *value => *value - old_value,
      _ => *value,
    };
    This would be a lower bound on the CPU time consumed, as any CPU time consumed by Jail A (1) between the last scrape while it was still running and the restart will be dropped, but at least it would be more accurate than the current solution.

from jail_exporter.

phyber avatar phyber commented on June 3, 2024

Hm, yeah. That's what I get for trying to think about that issue between meetings :)

from jail_exporter.

fabianfreyer avatar fabianfreyer commented on June 3, 2024

You are right, I think I mixed up the match arms. I'll fix it. The basic idea would be:

  • if increment is non-negative (current > old value), just add it
  • otherwise, the jail must have been resarted between scrapes. In this case, add the current time, as it is a lower bound on the CPU time consumed between the scrapes.

With "integration" I meant sum by(name).

from jail_exporter.

phyber avatar phyber commented on June 3, 2024

otherwise, the jail must have been resarted between scrapes. In this case, add the current time, as it is a lower bound on the CPU time consumed between the scrapes.

Ah, I see what you were getting at. This would only work for the very next scrape though. Imagine these numbers were scraped in this order, the jail_cpu_time counter is already at 1000 in the exporter:

  • 1020: new total: 2020
  • jail is reset on the host, cputime value resets to 0 on host, still 2020 in exporter.
  • 10: new total: 2030
  • 50: new total: 2080

In the last scrape, the whole value of 50 was added, since we have no idea what the previous value was and didn't know we had to deduct 10.
I'm thinking I'll have to add a value in the lazy_static! to keep track of the old scrape value.

from jail_exporter.

phyber avatar phyber commented on June 3, 2024

This commit fixes the cputime counter via a book keeping variable. A later commit fixes the wallclock counter in the same way. Tests were added for both of them and the new behaviour passes the tests.

Going to let you check this approach out before (hopefully) closing the bug. :)

from jail_exporter.

fabianfreyer avatar fabianfreyer commented on June 3, 2024

You are absolutely right, you do need a bookkeeping variable 👍. Your approach does seem right to me :)

from jail_exporter.

phyber avatar phyber commented on June 3, 2024

👍

from jail_exporter.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.